Pending Patches Allow Direct3D 9 "Gallium Nine" To Run Over Mesa's Zink Vulkan


  • #21
    Originally posted by Linuxxx View Post
    So, you really think they want to get into the ARM game, too?
    Then wouldn't it make sense to further develop a highly performant x86 to ARM binary translator like Apple has done?
    Originally posted by Quackdoc View Post
    Not necessarily, but it is a barrier to entry. Honestly, I'm not sure. There are certainly benefits to migrating to ARM, or even RISC-V, but there are the issues with CISC-to-RISC emulation: take Apple, you need hardware able to do it, and I doubt Steam wants to get into fabrication. But it is good preparatory work, because eventually we will move that way.
    There are several x86-to-ARM binary translators in existence. Dynamic ones that use native host libraries, like box86, now range from a 50% performance loss at worst to almost native speed at best. Box86 does have a problem coming, though: for 32-bit x86 programs it needs 32-bit ARM libraries, and newer ARM CPUs are dropping 32-bit support completely. The Wine project's Hangover work with QEMU sees similar performance costs, but working out all the fine details of bridging 32-bit x86 code to 64-bit ARM libraries is not easy. So there is active work on making reasonably high-performance x86-to-ARM dynamic recompilers. Note that these get their performance by binary-translating as little as possible.

    Please note that QEMU and box86 still use dynamic translation, i.e. translation at runtime. Microsoft's x86-on-ARM emulation uses a mix of static and dynamic translation with caching, which lifts the worst case from about 50% of native speed on a first run to around 80% of native on subsequent runs.
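
    To make the caching point concrete, here is a minimal C sketch of a caching dynamic translator's dispatch loop. The two-block guest "program", the cache layout and translate_block are all invented for illustration; real translators like QEMU and box86 are vastly more involved.

```c
/* Minimal sketch of a caching dynamic binary translator's dispatch loop.
 * The guest ISA, block format and cache here are invented for illustration. */
#include <stdio.h>
#include <stdint.h>

#define CACHE_SIZE 256

typedef struct { int64_t regs[4]; uint32_t pc; } GuestCpu;
/* A "translated block": in a real translator this would be generated
 * host machine code; here a C function stands in for it. */
typedef uint32_t (*TranslatedBlock)(GuestCpu *);

static TranslatedBlock cache[CACHE_SIZE];   /* keyed by guest pc */
static uint32_t        cache_pc[CACHE_SIZE];

/* Toy guest program: two basic blocks. */
static uint32_t block_0(GuestCpu *cpu) { cpu->regs[0] += 1; return 4; } /* inc r0; jmp 4 */
static uint32_t block_4(GuestCpu *cpu) {                                /* loop until r0 == 5 */
    return (cpu->regs[0] < 5) ? 0 : 0xFFFFFFFF;                         /* 0xFFFFFFFF = halt */
}

/* "Translate" the guest block at pc. The expensive decode/codegen step
 * runs only on a cache miss; repeat visits reuse the cached result,
 * which is where the second-run speedups come from. */
static TranslatedBlock translate_block(uint32_t pc) {
    printf("translating block at pc=%u\n", pc);  /* happens once per block */
    return pc == 0 ? block_0 : block_4;
}

int main(void) {
    GuestCpu cpu = {0};
    while (cpu.pc != 0xFFFFFFFF) {
        unsigned slot = cpu.pc % CACHE_SIZE;
        if (cache[slot] == NULL || cache_pc[slot] != cpu.pc) {  /* miss */
            cache[slot] = translate_block(cpu.pc);
            cache_pc[slot] = cpu.pc;
        }
        cpu.pc = cache[slot](&cpu);              /* run "translated" code */
    }
    printf("r0 = %lld\n", (long long)cpu.regs[0]);
    return 0;
}
```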

    The reality is that special hardware instructions can help, but they are not required to get somewhere near decent performance here. So Valve does not need to move into chip fabrication: the default ARM instruction set is workable for this, and the same really goes for RISC-V. Of course, to make it profitable for Valve we need more ARM/RISC-V chips with high enough performance that a 20% to 50% loss under dynamic emulation is not a problem.

    Some parties are starting to work on static recompilers for x86 to ARM or RISC-V. These can give as-good-as-native performance, but as you can guess, anything with copy protection or anti-cheat absolutely hates them, because the converted binaries no longer validate. That makes this route mostly useless to Valve; it is useful for those with old internal custom enterprise applications.

    It is easy to miss that multiple parties have done CISC-to-RISC emulation over the years, so how to do it and get decent results has become well known. Most of the improvement methods turn out to be generic: emulate only the application rather than the complete system, use native platform libraries, and cache the dynamic translation. These help for CISC-to-RISC, RISC-to-CISC and CISC-to-CISC alike, but they are all quite painful to implement.

    Early on, CISC-to-RISC was thought to pose unique major problems, but that turned out not to be the case; the major issues are much the same whenever you target a non-native instruction set. Yes, there are some issues going from CISC to another instruction set because of the variable-length instructions common in CISC, but that is not as big a problem as it first appeared and is quite well solved these days. Hooking between translated code and the native host is where the most can go wrong; it is one of the hardest problems, particularly across a bitness difference like 32-bit to 64-bit. So the major problems in getting decent emulation performance were not where they first appeared to be.

    Supporting old 32-bit x86 Windows binaries does mean running 32-bit programs on a 64-bit host as ARM goes forward. But a 32-bit program on a 64-bit-only OS is a problem you already have with macOS and some Linux distributions on x86_64 that ship no 32-bit runtime, so it is not unique to CISC-to-RISC.
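
    The bridging problem is easier to see in code. Below is a hedged C sketch of one thunk that lets 32-bit translated code call a 64-bit host library function; guest_mem, g2h and the thunk name are invented, and a real bridge needs hundreds of such thunks plus fixups for structs whose layout differs between bitnesses.

```c
/* Sketch of guest<->host library bridging: calling a 64-bit host
 * library on behalf of 32-bit translated code. The guest ABI, thunk
 * and memory layout here are invented for illustration. */
#include <stdio.h>
#include <stdint.h>
#include <string.h>

/* 32-bit guest memory is a region of host memory; guest "pointers"
 * are 32-bit offsets into it, so every pointer argument must be
 * widened before a 64-bit host library can use it. */
static uint8_t guest_mem[1 << 16];

static void *g2h(uint32_t guest_ptr) {       /* widen guest -> host */
    return &guest_mem[guest_ptr];
}

/* Thunk: unpack the 32-bit guest argument, fix up the pointer, and
 * call the native host implementation (here, host libc's strlen). */
static uint32_t thunk_strlen(uint32_t guest_str) {
    return (uint32_t)strlen((const char *)g2h(guest_str));
}

int main(void) {
    uint32_t s = 0x100;                      /* a "guest pointer" */
    strcpy((char *)g2h(s), "hello from the guest");
    printf("guest strlen(0x%x) = %u\n", s, thunk_strlen(s));
    return 0;
}
```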



    • #22
      Originally posted by Linuxxx View Post

      No, DXVK directly converts Direct3D 9 calls into Vulkan ones.
      And yes, it was the same when it was still called D9VK.
      Damn, I did not realize that. Pretty impressive.
      Thank you!



      • #23
        Originally posted by oiaohm View Post
        There are several x86-to-ARM binary translators in existence. Dynamic ones that use native host libraries, like box86, now range from a 50% performance loss at worst to almost native speed at best. […]
        Static recompilers don't statically convert binaries or libraries. They cache hot code, based on the fact that 10% of a program is responsible for 90% of the performance, and they still load the original binary or library. They never reach native performance except for rare algorithms, never for full apps. Apple Rosetta 2 on Apple's newest ARM core gets 75% efficiency, but that is partially assisted by the core hardware itself. The end point is that there is no parthenogenesis here: everyone reaches the same conclusions in due time. So if Apple is at 75%, everyone else will be there within 1-2 years.



        • #24
          Originally posted by artivision View Post
          Static recompilers don't statically convert binaries or libraries. They cache hot code, based on the fact that 10% of a program is responsible for 90% of the performance, and they still load the original binary or library. They never reach native performance except for rare algorithms, never for full apps. Apple Rosetta 2 on Apple's newest ARM core gets 75% efficiency, but that is partially assisted by the core hardware itself. The end point is that there is no parthenogenesis here: everyone reaches the same conclusions in due time. So if Apple is at 75%, everyone else will be there within 1-2 years.
          No, artivision, you are wrong on the definition of static recompilers. What you just described is a caching dynamic recompiler, not a static recompiler. Static recompilers are a different family of tools.

          sleirsgoevy/peshit: proof-of-concept x86-to-ARM recompiler, https://github.com/sleirsgoevy/peshit

          That is a prototype static recompiler in development right now, and there are also historic full static recompilers. Static recompilers do static binary translation, which is a different beast from your dynamic binary translation. StarCraft was ported to ARM for the Pandora platform in 2014 by a static recompiler; none of the original binary engine was on the device when you ran StarCraft on it. One of the marks of static recompilers/static binary translation is that the original binary or library is not used after the conversion is complete.
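
          For the flavour of it, here is a toy offline translator in C. The two-opcode guest "ISA" is invented; the point is only that the conversion happens entirely ahead of time and the output is compiled natively, after which the original binary is no longer needed.

```c
/* Toy illustration of static binary translation: an offline pass reads
 * the guest binary and emits equivalent source, which is then compiled
 * natively. The two-opcode "ISA" here is invented for the example. */
#include <stdio.h>
#include <stdint.h>

int main(void) {
    /* Fake guest binary: pairs of (opcode, operand).
     * 0x01 n = add n to the accumulator, 0x02 = print accumulator. */
    const uint8_t guest_binary[] = { 0x01, 5, 0x01, 7, 0x02, 0 };

    puts("#include <stdio.h>");
    puts("int main(void) { long acc = 0;");
    for (size_t i = 0; i + 1 < sizeof guest_binary; i += 2) {
        switch (guest_binary[i]) {
        case 0x01: printf("    acc += %u;\n", guest_binary[i + 1]); break;
        case 0x02: puts("    printf(\"%ld\\n\", acc);"); break;
        }
    }
    puts("    return 0; }");
    return 0;  /* pipe the output into a C compiler to get the new native binary */
}
```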

          A static recompiler is not what Apple Rosetta 2 is. Rosetta is a dynamic recompiler with caching and an interface to host libraries; that route does get decently fast.

          The static recompiler route, when it works, has historic examples of zero overhead. The 2014 x86 StarCraft-to-ARM port has zero overhead; in fact, when you work out the speed of the ARM core, it is 15% faster, because modern optimisation methods could be applied during the conversion. All forms of dynamic recompiler carry some runtime overhead.

          Zero overhead from x86_64 to ARM64 is possible; rare examples like the 2014 StarCraft port prove it. The reason it is rare is the limitation true static binary translation imposes: an application that checksums itself will compute a different value and throw an error, because after static binary translation the program binary is genuinely different. Dynamic recompilers have the advantage that the original binary is still present, so the correct checksum value can be returned in these cases.
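
          A minimal C sketch of that self-checksum problem, with an invented checksum scheme and a hypothetical EXPECTED constant standing in for the value the vendor baked in at release time:

```c
/* Sketch of why self-checksumming programs break under static binary
 * translation. The checksum scheme and EXPECTED constant are invented;
 * real copy protection is more elaborate, but the idea is the same:
 * the check runs over the program's own code bytes. */
#include <stdio.h>
#include <stdint.h>

/* The function whose machine code gets checksummed. */
static int secret_logic(int x) { return x * 41 + 1; }

/* Hypothetical value computed over the original x86 code bytes and
 * baked into the binary at release time. */
#define EXPECTED 0xDEADBEEFu

int main(void) {
    /* Checksum our own code. Reading a function's bytes like this is
     * not strictly portable C, but it is what such protections do. */
    const uint8_t *code = (const uint8_t *)(void *)secret_logic;
    uint32_t sum = 0;
    for (int i = 0; i < 64; i++)
        sum = sum * 31 + code[i];

    /* After static translation to ARM the code bytes differ, so the
     * sum no longer matches and the program refuses to run. A dynamic
     * recompiler keeps the original x86 bytes in memory and can let
     * this read see them, so the check still passes there. */
    if (sum != EXPECTED) {
        puts("integrity check failed");
        return 1;
    }
    printf("%d\n", secret_logic(1));
    return 0;
}
```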



          • #25
            Originally posted by oiaohm View Post
            Zero overhead from x86_64 to ARM64 is possible; rare examples like the 2014 StarCraft port prove it. The reason it is rare is the limitation true static binary translation imposes: an application that checksums itself will compute a different value and throw an error, because after static binary translation the program binary is genuinely different. Dynamic recompilers have the advantage that the original binary is still present, so the correct checksum value can be returned in these cases.
            Plus:
            • From what I remember, notaz's static recompiler was designed to be human-assisted in the tough spots.
            • Checksumming isn't the only potential problem. As I understand it, it was common for NES-era games and earlier (and for demoscene programs trying to hit a size target) to pick regions of machine code that could do double duty when also interpreted as game data in some places (e.g. noise textures in a 3D demoscene game, or 8-bit sound effects). A small sketch of the idea follows below.
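
            Here is a rough C illustration of that double-duty trick. The bytes are genuine 6502 opcodes (LDA #0 / STA $4015 / RTS), but the program only demonstrates the data reuse; everything else is invented for the example.

```c
/* Sketch of the code-also-used-as-data trick. On real NES/demoscene
 * targets these bytes would be jumped to as instructions AND read
 * back as samples or texture data. */
#include <stdio.h>
#include <stdint.h>

/* On the original machine this array sat in the code segment:
 * LDA #$00 / STA $4015 / RTS in 6502 machine code. */
static const uint8_t blob[] = {
    0xA9, 0x00, 0x8D, 0x15, 0x40, 0x60,
};

int main(void) {
    /* Second use: the same bytes read back as an 8-bit "sound". */
    for (size_t i = 0; i < sizeof blob; i++)
        printf("sample[%zu] = %u\n", i, blob[i]);
    /* A static recompiler that rewrites these bytes as ARM code would
     * silently corrupt the sound effect, and one that keeps them as
     * data would lose the code path: it has to handle both uses. */
    return 0;
}
```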



            • #26
              Originally posted by ssokolow View Post
              Plus:
              • From what I remember, notaz's static recompiler was designed to be human-assisted in the tough spots.
              • Checksumming isn't the only potential problem. As I understand it, it was common for NES-era games and earlier (and for demoscene programs trying to hit a size target) to pick regions of machine code that could do double duty when also interpreted as game data in some places (e.g. noise textures in a 3D demoscene game, or 8-bit sound effects).
              There have been some advancements since then. There has been work using dynamic disassembly to recover LLVM IR bitcode from what is in the binary for a static recompiler build (this sometimes goes under the name Dynamic Binary Lifting and Recompilation).

              Yes, notaz's static recompiler from 2014 needed more human assistance than a static recompiler that uses dynamic disassembly for the complex bits. notaz's recompiler was 100 percent static, as in using a static decompiler and a static compiler. A static recompiler with dynamic disassembly is still a different beast from a dynamic recompiler with caching.

              The second point, that checksumming is not the only potential problem, is absolutely true, and it traces back to the fact that you no longer have the original binary once the static recompiler thinks it has finished. If the static recompiler has made a mistake with something, that will for sure add new bugs and issues to the produced program. Proper static recompilers are harder beasts to build than dynamic recompilers, but there is quite a performance win if you can make one work, and work on perfecting them is ongoing, mostly because of what the performance gains are.
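
              A toy C sketch of that lift-while-executing idea, with an invented three-opcode guest ISA and an IR-ish text output; BinRec lifts to genuine LLVM IR and is far more sophisticated:

```c
/* Toy sketch of "dynamic disassembly": execute the guest program,
 * record each basic block actually reached, and lift it to an IR-like
 * text form a static compiler could consume later. ISA and IR syntax
 * are invented for the example. */
#include <stdio.h>
#include <stdbool.h>

enum { OP_ADDI = 1, OP_JNZ = 2, OP_HALT = 3 };

/* instr = {opcode, operand, target pc if branch taken} */
static const int program[][3] = {
    {OP_ADDI, -1, 0}, {OP_JNZ, 0, 0}, {OP_HALT, 0, 0},
};

static bool lifted[8];   /* which pcs we have already lifted */

static void lift(int pc) {        /* emit IR only for code actually reached */
    if (lifted[pc]) return;
    lifted[pc] = true;
    switch (program[pc][0]) {
    case OP_ADDI: printf("block_%d: r0 = add r0, %d\n", pc, program[pc][1]); break;
    case OP_JNZ:  printf("block_%d: br (r0 != 0), block_%d, block_%d\n",
                         pc, program[pc][2], pc + 1); break;
    case OP_HALT: printf("block_%d: ret r0\n", pc); break;
    }
}

int main(void) {
    int r0 = 3, pc = 0;           /* run the guest, lifting as we go */
    for (;;) {
        lift(pc);
        int op = program[pc][0];
        if (op == OP_ADDI) { r0 += program[pc][1]; pc++; }
        else if (op == OP_JNZ) { pc = (r0 != 0) ? program[pc][2] : pc + 1; }
        else break;
    }
    return 0;
}
```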



              • #27
                It should be slower than D9VK (which should be slower than Nine with a real backend).



                • #28
                  Originally posted by oiaohm View Post
                  No, artivision, you are wrong on the definition of static recompilers. What you just described is a caching dynamic recompiler, not a static recompiler. […] Dynamic recompilers have the advantage that the original binary is still present, so the correct checksum value can be returned in these cases.
                  Nope, there is no completely static binary. It is controlled by the recompiler at runtime as well, for further recompiling, with a fake hashed binary container instead of a cache. The only way to do static compiling is when you have the source. Representing the source completely before runtime is impossible by today's standards, as far as I know at least.



                  • #29
                    Originally posted by artivision View Post
                    Nope, there is no completely static binary. It is controlled by the recompiler at runtime as well, for further recompiling, with a fake hashed binary container instead of a cache. The only way to do static compiling is when you have the source.


                    Sorry, artivision, you are wrong; there are fully static tools out there that do not have a recompiler running at runtime in the final produced binary. We are seeing early dynamic disassembly with new binary reproduction. Note this has other uses too, even x86_64 to x86_64, like security-patching an application you don't have the original source to. BinRec's recompiler does not run at runtime with the final produced binary. Until tools like BinRec started appearing, static recompilers required a lot of human assistance to fill in the bits of the binary that the static decompiler could not work out.

                    The only way to do static compiling is to have a source of some form, so you are correct on that point, be it the original source or a generated one. The limitation of static recompilers has been the need for human assistance to get a useful generated source. The BinRec work is quite a breakthrough, though there is still quite a bit more to do.


                    Originally posted by artivision View Post
                    Representing the source completely before runtime is impossible by today's standards, as far as I know at least.
                    This is true and false. Working out the source completely without running the program is close to impossible. A dynamic decompiler, as found in BinRec, lets you run the binary, basically debugging it in an automated way, to create a generated source for the program. Most translation layers are a dynamic decompiler linked to a dynamic compiler of some form. And to run a program through AddressSanitizer or the like, you need the full source, not fragments of it, be it generated or original.

                    Basically, a successful solution needs a dynamic decompiler. Tools like BinRec suggest you don't need a dynamic compiler; you can get away with a static compiler once you have done enough automated dynamic-decompiler passes to generate a source.
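
                    A small C sketch of why several passes are needed: each traced run only covers one path, so coverage is merged across runs with different inputs, and whatever is still unreached marks a gap in the generated source. The block counts and coverage data below are invented.

```c
/* Sketch of merging coverage from several traced runs before handing
 * the generated source to a static compiler. Data is invented. */
#include <stdio.h>
#include <stdbool.h>

#define NBLOCKS 6

/* Blocks observed in three hypothetical traced runs. */
static const bool run_cov[3][NBLOCKS] = {
    {1, 1, 0, 0, 1, 0},   /* input A */
    {1, 0, 1, 0, 1, 0},   /* input B */
    {1, 0, 1, 1, 1, 0},   /* input C: reaches the error path */
};

int main(void) {
    bool seen[NBLOCKS] = {0};
    for (int r = 0; r < 3; r++) {
        int new_blocks = 0;
        for (int b = 0; b < NBLOCKS; b++)
            if (run_cov[r][b] && !seen[b]) { seen[b] = true; new_blocks++; }
        printf("run %d: %d new block(s)\n", r, new_blocks);
    }
    /* Block 5 was never executed: the lifter still has a gap that more
     * inputs (or a human) must cover before the rebuilt binary is whole. */
    for (int b = 0; b < NBLOCKS; b++)
        if (!seen[b]) printf("block %d never reached\n", b);
    return 0;
}
```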


