Fedora's FESCo Rejects The Idea Of "-fno-omit-frame-pointer" As Default Compiler Flag


  • #21
    Originally posted by Vorpal

    Ah, I did not think of this. It indeed complicates things, but I do not believe it to be impossible to solve. Here are some ideas I came up with in a couple of minutes of thinking (so they likely won't work):
    • Construct (in advance, or JIT and cache) a lookup table from instruction pointer address to frame address. This trades off memory against what is quite possibly slow processing of the DWARF debug info.
    • This should solve finding the first stack frame in the chain. I have not examined how parent frames are found in the x86-64 ABI, but if that is also hard, you could store the relevant info on the stack. While this would still be more expensive than omitting it entirely, it should at least free up the register. I think it would have to be the caller who saves it, since the current stack frame is no longer in rbp, and only the caller can know the correct offset from rsp. (This is obviously an ABI break, as I believe RBP is currently callee-saved?)
      • In fact, thinking about it, in this scheme rbp can be written to the stack once if the compiler only changes rsp on function entry (and uses moves relative to rsp rather than pushes and pops). The important thing is that rbp is saved at a fixed offset from child functions' stack frames so that it can be found. In practice this means that it should be saved at the lower end (as, IIRC, the stack grows down) of the stack frame.
      • EDIT: If all functions save RBP at the bottom end of the stack frame all the time, you don't need a lookup table based on the instruction pointer either. This does, however, preclude the use of alloca and similar, and forces stack frames to be fixed size. I'm not sure this is a problem (except with alloca), as incrementing rsp is very cheap, and it should be fine to then just do everything relative to rsp. This effectively combines RSP/RBP into a single register.
    Caveats: This is not my area of expertise, I have not done low-level programming for several years, and I have never done ABI design.
    This is in fact the usual approach without frame pointers. You can determine the current stack frame by looking at rsp, rip and DWARF and then walk up by looking at the stack and DWARF. With alloca, IIRC the usual approach is to fall back to frame pointers for callers of alloca (but these should be extremely rare in common code). The only downside here is that the kernel does not have a DWARF parser, so if you want to profile using perf, you need to copy the entire stack on each sample. However, this should be a constant factor overhead, and so can be mitigated by reducing the sampling frequency a bit in long-running applications. Another solution would be to use a simpler alternative to DWARF for unwinding only (I think it already exists in the form of ORC, but userspace support seems to be lacking).
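To make the table-driven walk concrete, here is a minimal sketch in Python that simulates it on a toy stack. Everything in it is an assumption for illustration: the table layout, the function names and addresses, and the simplification that rsp only moves in the prologue and the sample is taken in a function body. A real unwinder would get the equivalent information from DWARF CFI or ORC entries rather than from a hand-written table.

```python
import bisect

# Hypothetical unwind table: (start_ip, end_ip, name, frame_size_in_bytes).
# frame_size is how far the prologue moved rsp below the return address.
UNWIND_TABLE = [
    (0x1000, 0x1100, "main", 32),
    (0x1100, 0x1200, "parse", 16),
    (0x1200, 0x1300, "read_token", 48),
]
STARTS = [entry[0] for entry in UNWIND_TABLE]

def lookup(ip):
    """Return the table entry whose address range contains ip, or None."""
    i = bisect.bisect_right(STARTS, ip) - 1
    if i >= 0 and UNWIND_TABLE[i][0] <= ip < UNWIND_TABLE[i][1]:
        return UNWIND_TABLE[i]
    return None

def unwind(memory, rip, rsp):
    """Walk the stack using only rip, rsp and the table (no frame pointer)."""
    names = []
    while True:
        entry = lookup(rip)
        if entry is None:          # left the known code: stop
            break
        _, _, name, frame_size = entry
        names.append(name)
        return_slot = rsp + frame_size      # where `call` pushed the return address
        rip = memory.get(return_slot, 0)    # continue in the caller
        rsp = return_slot + 8               # caller's rsp before the call
    return names

# Toy stack contents (address -> 8-byte value); the stack grows down.
MEMORY = {
    0x7FF8: 0,        # return address of main (0 = sentinel, not in the table)
    0x7FD0: 0x1040,   # return address into main, pushed when calling parse
    0x7FB8: 0x1150,   # return address into parse, pushed when calling read_token
}

# Sampled while executing inside read_token with rsp = 0x7F88.
backtrace = unwind(MEMORY, rip=0x1234, rsp=0x7F88)
```

The lookup is a binary search per frame, which is the memory-for-speed trade mentioned in the quoted post. The kernel-side problem is that it has no such table for userspace at sample time, hence copying the stack.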



    • #22
      Originally posted by archkde

      This is in fact the usual approach without frame pointers. You can determine the current stack frame by looking at rsp, rip and DWARF and then walk up by looking at the stack and DWARF. With alloca, IIRC the usual approach is to fall back to frame pointers for callers of alloca (but these should be extremely rare in common code). The only downside here is that the kernel does not have a DWARF parser, so if you want to profile using perf, you need to copy the entire stack on each sample. However, this should be a constant factor overhead, and so can be mitigated by reducing the sampling frequency a bit in long-running applications. Another solution would be to use a simpler alternative to DWARF for unwinding only (I think it already exists in the form of ORC, but userspace support seems to be lacking).
      Right. I think I came up with a variant scheme to this. To clarify my idea from earlier:
      1. Each frame is allocated fully on entry, no adjustments of rsp except in the function prologue/epilogue
      2. A virtual rbp is stored at rsp. This is easy as the compiler knows where rsp was before it adjusted it.
        • Note: It needs to be written out after adjusting rsp to prevent interrupt/signal handler issues; however, this leaves a one-instruction window where this "virtual rbp" on the stack will be invalid.
        • However, I think it would be cheaper to write it out before, since you could do something like (using Intel syntax, with the mov destination as the first operand): mov [rsp+offset], rsp, as opposed to "mov [rsp], rsp-offset", which is not a thing. I.e. you can offset the address, but not the value written. So when writing it out afterwards you would end up with mov tmp-reg, rsp; sub rsp, offset; mov [rsp], tmp-reg.
        • Future CPUs could have a combined atomic instruction for this to prevent both of these problems if my crazy calling convention is adopted
      3. To unwind, load the value at the address in rsp (*rsp in pseudo-C code). This will contain the "virtual rbp", while still freeing the rbp register for other use.
      4. To find the previous "virtual rbp" you do **rsp. (Maybe with an offset of one pointer size applied: *(*rsp + 8), depending on calling convention. I'd have to think this through, but off-by-one errors are a classic anyway!)
      5. Alloca/variable sized stack allocated arrays break horribly in this scheme. But you shouldn't use them anyway IMO.
      This is cheap to parse for the unwinder. It saves a register. It still needs extra instructions in the function prologue.
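A minimal sketch of steps 1 to 4, simulated in Python rather than written as real assembly. The `Machine` class, the addresses, and the choice of "rsp at function entry" as the virtual rbp are all assumptions made for illustration. Under that choice the pointer-size offset from step 4 is needed, i.e. the previous link is found at *(*rsp + 8). The one-instruction window from step 2 is not modelled here.

```python
WORD = 8  # pointer size on x86-64

class Machine:
    """Toy model: a dict as stack memory plus an rsp register."""
    def __init__(self, stack_top=0x8000):
        self.memory = {}
        self.rsp = stack_top

    def call(self, return_address, frame_size):
        # `call` pushes the return address.
        self.rsp -= WORD
        self.memory[self.rsp] = return_address
        virtual_rbp = self.rsp            # rsp as seen on function entry
        # Prologue: allocate the whole fixed-size frame once (step 1) and
        # store the virtual rbp at the new [rsp] (step 2).
        # frame_size must be at least WORD to hold that slot.
        self.rsp -= frame_size
        self.memory[self.rsp] = virtual_rbp

def return_addresses(machine):
    """Unwind by following the chain of virtual rbp values (steps 3 and 4)."""
    result = []
    rsp = machine.rsp
    while rsp in machine.memory:
        virtual_rbp = machine.memory[rsp]     # step 3: *rsp
        ret = machine.memory[virtual_rbp]     # return address sits at the virtual rbp
        if ret == 0:                          # sentinel: outermost frame
            break
        result.append(ret)
        rsp = virtual_rbp + WORD              # step 4: caller's rsp, so next link is *(*rsp + 8)
    return result

m = Machine()
m.call(return_address=0, frame_size=32)        # outermost function
m.call(return_address=0x1040, frame_size=16)   # it calls a second function
m.call(return_address=0x1150, frame_size=48)   # which calls a third
chain = return_addresses(m)
```

The unwinder needs no table and no rbp, only loads from the stack, which is the "cheap to parse" property. The cost is the extra store in every prologue and the fixed-size-frame restriction from step 5.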

      I assume ORC is a saner approach, but I would need to look into it. It would be nice if it can avoid the alloca issue as well as the need for code in the prologue.
      Last edited by Vorpal; 01 December 2022, 06:04 PM.



      • #23
        Originally posted by King InuYasha

        It's enabled on AArch64 and POWER (ppc64le), it is not enabled on x86_64 and s390x.
        For POWER as well?

        Is that a Fedora-specific change? GCC seems to enable frame pointers by default for AArch64 only:

        gcc-12.1.0 $ grep -r 'OPT_fomit_frame_pointer.*0'
        gcc/common/config/aarch64/aarch64-common.cc: { OPT_LEVELS_ALL, OPT_fomit_frame_pointer, NULL, 0 },

        gcc-12.1.0 $



        • #24
          Originally posted by King InuYasha

          I assume you missed the GNOME developers saying they wanted this too? Or that the KDE folks would have benefited from this as well? Both big desktops have tooling that massively benefits from working real-time tracing and profiling.

          Also, who do you think makes the software that gets shipped? Developers. And they have to get something nice out of the platform to keep doing it too.
          I am a developer.

          I don't want it.
          If I want to debug something on the desktop, I recompile the parts needed for that with debug symbols, as every self-respecting developer should do in the desktop world, in my opinion.

          What was your point again?



          • #25
            Shame that this will prevent languages like Java from working properly and being able to generate good stack traces. I wonder if Red Hat is trying to steer as many corporate Java users as possible away from Fedora and onto Red Hat?



            • #26
              People seem to be forgetting that global settings like this should reflect the character of the clear majority of a distribution's users and stakeholders. That's not the case here, even with certain groups coming out in favor of the change. If the clear majority of Fedora's users were Google, Meta, GNOME, and KDE developers who wanted these changes, then it would be appropriate to make such a change, but it's obvious from the sheer popularity of the distribution alone that this is not the case.

              What it does look like to me is that there is enough manpower to create a Fedora spin and/or SIG, with appropriate packages, that focuses on this particular topic for those interested. Sorry, but "upstreaming our changes" refers to code changes for maintenance that benefit the broader community, not screwing with compiler and build options that are, on balance, demonstrably detrimental to the rest of the user base. You can't tell me the billions of dollars in ill-gotten wealth from Meta and Google can't be used to create and support the long-term maintenance of a profiling-friendly Fedora spin. If nothing else, just terminate the C-suite bonus packages at either company and you can support such a project indefinitely. Not a bit of sympathy here.



              • #27
                Originally posted by coder
                Is this strictly an x86 thing? On AArch64, the performance impact of carrying a frame pointer should be negligible, due to double the general-purpose registers. I would like to see some benchmarks on that, but you'd obviously want to use a more recent core than something like the Pi's A72. Michael, do you have access to an Ampere Altra?

                Having 2 distinct registers, one to be used as stack pointer and one to be used as frame pointer, is a mistake that x86 inherited from the Intel 8086. However, this is a mistake that has been made in many CPU instruction sets.

                One notable ISA where this is done right is the IBM POWER, where a single register can be used both as a stack pointer and as a frame pointer. Therefore there is never any need to make compromises between performance and debugging convenience.

                The reason why this is possible on IBM POWER, but impossible on x86, is that the former has an instruction that can update atomically both the stack pointer/frame pointer register and the location on the stack where the previous frame pointer is saved, whenever a new stack frame is allocated.

                On x86, a couple of instructions would be needed for the same effect, but they cannot be used because an interrupt can arrive between the 2 instructions, which can corrupt the stack.

                In theory, it would be possible to add to x86 an instruction that would atomically subtract a value from RSP, to allocate a new stack frame where the new value of RSP would be the frame pointer, while the old value of RSP, i.e. the old frame pointer, would be saved at the base of the new stack frame, ensuring that the stack always is a linked list of the stack frames, which can be examined for debugging.
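The interrupt window can be illustrated with a small Python simulation. This is only a sketch under stated assumptions: the `interrupt` and `allocate` helpers, the addresses, and the junk value are made up, the interrupt is modelled as something that inspects [rsp] and then scribbles over the stack below rsp, and the x86-64 userspace red zone is ignored. The "atomic" case models a single store-with-update instruction (on 64-bit POWER this is, as far as I know, `stdu`).

```python
WORD = 8
JUNK = 0xDEAD

def interrupt(memory, rsp, words=64):
    """Model an interrupt: read the frame link at [rsp], then use the stack below rsp."""
    observed = memory.get(rsp)            # what an unwinder in the handler would see
    for i in range(1, words + 1):
        memory[rsp - i * WORD] = JUNK     # handler's own stack usage
    return observed

def allocate(order, frame_size, interrupted):
    """Allocate one frame; return (link finally stored at [rsp], link seen by the interrupt)."""
    memory = {0x8000: 0x9000}   # caller's frame: its link points to its own caller
    rsp = 0x8000
    old = rsp                   # value that must end up at the base of the new frame
    seen = None
    if order == "store_first":          # mov [rsp-N], rsp ; sub rsp, N
        memory[rsp - frame_size] = old
        if interrupted:
            seen = interrupt(memory, rsp)   # clobbers the link just written
        rsp -= frame_size
    elif order == "adjust_first":       # sub rsp, N ; mov [rsp], old
        rsp -= frame_size
        if interrupted:
            seen = interrupt(memory, rsp)   # sees an uninitialised link
        memory[rsp] = old
    else:                               # "atomic": store and update in one step
        rsp -= frame_size
        memory[rsp] = old
        if interrupted:
            seen = interrupt(memory, rsp)
    return memory[rsp], seen

results = {order: allocate(order, 32, interrupted=True)
           for order in ("store_first", "adjust_first", "atomic")}
```

In this model, storing first leaves a corrupted link after the interrupt, adjusting first lets the handler observe a missing link, and only the combined operation keeps the chain valid at every point where an interrupt can land.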

                x86 already has plenty of much more complex and less useful instructions.

                However, it is unlikely that the inertia can be overcome, because there is a huge amount of software tools for x86 that are built upon the model with 2 distinct registers, RSP and RBP, even if this has the consequence of having to choose between higher execution performance and easier debugging.




                • #28
                  A better solution would be to adopt a shadow stack, like clang CFI.

                  This would not only make profiling easier, but also prevent ROP attacks.



                  • #29
                    Why don't they use Gentoo? They can have whatever default flags they want for their entire system.



                    • #30
                      Originally posted by archkde

                      The point here is not general debugging, but profiling. This is (other than in the most extreme cases) not something you can do with printf.
                      Ok. I haven't done profiling, beyond adding some counters here and there. So I am not sure what you actually need for detailed profiling. But I believe there is still only a linear slowdown for recalculating the frame when needed? If you manage to speed up an instrumented debug build after profiling, some percentage of that carries through to production builds as well?

                      Anyway, the purpose of profiling seems to be optimizing for performance. Wasting a register on a convenience frame pointer runs contrary to that goal. There will always be those inner loops that need exactly one register more than you have, and then this bites.
