macOS 13 Adding Ability To Use Rosetta In ARM Linux VMs For Speedy x86_64 Linux Binaries


  • #21
    Originally posted by Ladis View Post
    But a race condition only happens when you use more than one CPU core. And given how fast Rosetta 2 is and how slow the generic emulation in Windows for ARM and Linux is, it can be the same speed as using 2 cores, sometimes even like 4 cores with a generic emulation.
    Maybe I misunderstood the previous post, but I was talking about running on multiple cores without any "mitigation" for the missing TSO mode. It might just work for older software that synchronizes through library functions, but C++11 code, for example, will not use any explicit stores/barriers.

    Originally posted by Ladis View Post
    Fun fact: What if Windows and Linux had an option to use only one CPU core, thus disabling insertion of the memory barriers that heavily slow down the code?
    The switch configuring the barriers would come at an additional cost; on "uni-cores" the barriers should be as fast as NOPs. And for accessing hardware (i.e. kernel code) you will need explicit barriers anyway.

    And if you follow some guidelines, the barriers will generally cost very little unless contended (i.e. when you desperately need them).
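    To make the C++11 point concrete, here is a minimal sketch (mine, not from the thread) of release/acquire message passing. On x86-64 both the release store and the acquire load compile down to plain mov instructions, because TSO already provides the ordering for free; an x86-to-ARM binary translator therefore finds no barrier instructions in the binary telling it where ordering matters.

    ```cpp
    #include <atomic>
    #include <cassert>
    #include <thread>

    int data = 0;                    // plain, non-atomic payload
    std::atomic<bool> ready{false};  // flag that publishes the payload

    void producer() {
        data = 42;                                     // 1. write the payload
        ready.store(true, std::memory_order_release);  // 2. publish; a plain `mov` on x86-64
    }

    void consumer() {
        while (!ready.load(std::memory_order_acquire)) {}  // also a plain `mov` on x86-64
        assert(data == 42);  // holds on x86 thanks to TSO; ARM needs ldar/dmb here
    }

    int main() {
        std::thread t1(producer), t2(consumer);
        t1.join();
        t2.join();
    }
    ```

    Translate those plain movs naively into plain ldr/str on a weakly ordered ARM core and the assert can fire, which is exactly why Rosetta's hardware TSO mode matters.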



    • #22
      Originally posted by Ladis View Post
      But a race condition only happens when you use more than one CPU core.
      This is not entirely true, though. Concurrency is generally enough to trigger data races if operations aren't atomic or their correctness depends on ordering, even when they don't technically run in parallel. It just requires extra bad luck, because for a race to happen the scheduler must switch the racing tasks at the worst possible moment.

      Originally posted by Ladis View Post
      And given how fast Rosetta 2 is and how slow the generic emulation in Windows for ARM and Linux is, it can be the same speed as using 2 cores, sometimes even like 4 cores with a generic emulation.
      Was it tested to be that much faster than stuff like FEX, though?

      Originally posted by Ladis View Post
      Fun fact: What if Windows and Linux had an option to use only one CPU core, thus disabling insertion of the memory barriers that heavily slow down the code?
      AFAIR they do have that option.
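      For reference, Linux does expose this per process via CPU affinity. A minimal sketch (mine, not from the thread; Linux/glibc only) that pins the calling process, and therefore all of its threads, to a single core:

      ```cpp
      #include <sched.h>  // sched_setaffinity, cpu_set_t (glibc, Linux-only)
      #include <cstdio>

      int main() {
          cpu_set_t set;
          CPU_ZERO(&set);
          CPU_SET(0, &set);  // allow execution only on CPU 0

          // pid 0 means "the calling process"
          if (sched_setaffinity(0, sizeof(set), &set) != 0) {
              std::perror("sched_setaffinity");
              return 1;
          }
          std::puts("pinned to CPU 0: threads still interleave, but never run in parallel");
          return 0;
      }
      ```

      The same effect is available without code via taskset -c 0 ./program.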



      • #23
        Originally posted by sinepgib View Post

        1. This is not entirely true, though. Concurrency is generally enough to trigger data races if operations aren't atomic or their correctness depends on ordering, even when they don't technically run in parallel. It just requires extra bad luck, because for a race to happen the scheduler must switch the racing tasks at the worst possible moment.

        2. Was it tested to be that much faster than stuff like FEX, though?

        3. AFAIR they do have that option.
        1. In general, yes. But in this case, with the strong memory model of x86, we are talking about how other CPUs (CPU cores) see RAM modified by another CPU. Unless you have some root debug process writing to another process's RAM, it's safe when the emulated program uses only one CPU core.

        2. Yes, there were tests linked here on Phoronix (maybe in another article or its comments). E.g. for 7zip, roughly: Native=100%, Rosetta2/Mac=75%, Rosetta2/Linux=70% (because nothing was accelerated, e.g. drawing the GUI terminal window), the new and exciting Box86=50%, FEX=30-35%, QEMU=25%.

        3. Great.



        • #24
          Originally posted by Ladis View Post
          1. In general, yes. But in this case, with the strong memory model of x86, we are talking about how other CPUs (CPU cores) see RAM modified by another CPU. Unless you have some root debug process writing to another process's RAM, it's safe when the emulated program uses only one CPU core.
          I don't follow. Threads share memory. Threads are scheduled pretty much exactly as if they were processes. I don't remember the details, but I gather this works as long as being scheduled out guarantees all pending stores have been performed, right?

          Originally posted by Ladis View Post
          2. Yes, there were tests linked here on Phoronix (maybe in another article or its comments). E.g. for 7zip, roughly: Native=100%, Rosetta2/Mac=75%, Rosetta2/Linux=70% (because nothing was accelerated, e.g. drawing the GUI terminal window), the new and exciting Box86=50%, FEX=30-35%, QEMU=25%.
          Cool.



          • #25
            Originally posted by sinepgib View Post

            I don't follow. Threads share memory. Threads are scheduled pretty much exactly as if they were processes. I don't remember the details, but I gather this works as long as being scheduled out guarantees all pending stores have been performed, right?
            As long as the threads/processes run on the same CPU core, they don't run simultaneously. They just switch fast enough that you feel like they run in parallel.



            • #26
              Originally posted by Ladis View Post
              As long as the threads/processes run on the same CPU core, they don't run simultaneously. They just switch fast enough that you feel like they run in parallel.
              I am fully aware they don't run in parallel; that's not the point. When you interleave processes you can race just as much as when running in parallel, unless the instruction is actually atomic and acts like a memory fence (or there's an explicit memory fence in the way, or being scheduled out ensures all intended writes to memory have happened). So, in the end, the question is whether this weak memory ordering affects any of these behaviors or is only relevant when running in parallel.



              • #27
                Originally posted by sinepgib View Post

                I am fully aware they don't run in parallel; that's not the point. When you interleave processes you can race just as much as when running in parallel, unless the instruction is actually atomic and acts like a memory fence (or there's an explicit memory fence in the way, or being scheduled out ensures all intended writes to memory have happened). So, in the end, the question is whether this weak memory ordering affects any of these behaviors or is only relevant when running in parallel.
                We have had memory protection since Windows 95, so other processes (possibly running truly in parallel on other CPU cores) can't write to the emulated program's memory.



                • #28
                  Originally posted by Ladis View Post
                  We have had memory protection since Windows 95, so other processes (possibly running truly in parallel on other CPU cores) can't write to the emulated program's memory.
                  Again, are you not reading? Threads. Threads. Threads. Threads are an OS construct that share memory, run interleaved, and are scheduled just like processes.
                  I know what memory protection is, but that protection is based on page tables, which are shared by all threads of a single process.
                  I'm asking what color the table is and you're trying to explain to me what a table is.



                  • #29
                    For the sake of argument, let's use an example of a data race on single-core machines: https://godbolt.org/z/P4Penx5se
                    The imaginary author of this code expects the program to print 1000. But because the increment in lines 10-13 of the assembly output is not atomic (first a read into a register, then the increment, then a store to memory), even on a single-core machine a thread may be scheduled out after line 10 but before line 11 runs, and another thread run in the meantime. One or more increments would then operate on stale values of x, so a smaller number than expected is written back to memory at line 13 when the original thread resumes execution.
                    Now, obviously this is multiple instructions that aren't meant to be atomic as a whole (this race triggers even on real x86 cores). What I'm asking is whether something similar can happen with code that wouldn't trigger a race on a real x86 core, because the extra ordering guarantees of x86 aren't implemented in the ARM core.
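                    The godbolt listing itself isn't reproduced here, so below is a sketch of what such an example presumably looks like (the line numbers cited above refer to the compiler's assembly output, not to this source):

                    ```cpp
                    #include <cstdio>
                    #include <thread>

                    int x = 0;  // deliberately NOT std::atomic

                    void add500() {
                        for (int i = 0; i < 500; ++i)
                            ++x;  // compiles to separate load / increment / store instructions
                    }

                    int main() {
                        std::thread t1(add500), t2(add500);
                        t1.join();
                        t2.join();
                        std::printf("%d\n", x);  // "should" print 1000, but increments can be lost
                    }
                    ```

                    If a thread is preempted between its load of x and the store back, every increment the other thread performs in that window is overwritten when the store finally lands, so the printed value can fall short of 1000 even on a single core.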
                    Last edited by sinepgib; 08 June 2022, 11:11 AM. Reason: Bugs in the example code



                    • #30
                      Will Apple make Rosetta available for running Windows in a VM on macOS on Apple hardware? That's my current use case -- I need to run one piece of Windows-only software.
