**milkylainen** · 01 November 2020, 07:09 AM

I'm a bit confused about when this becomes a real problem.
The more I think about it the more confused I get.

Wouldn't split locks be a big no-no for every in-memory representation of actual hardware as packed structures?
Or, more generally, for all packed structs?
But how am I to guarantee that no structure element crosses a cache line on a generic machine?
Assuming all ints and pointers are atomic?
It's not like GCC knows the final running arch for generic targets, or that the placement of an object won't lead to intra-object misalignment.

I'm sure someone with more intricate knowledge about this matter can explain?

**muncrief** · 01 November 2020, 02:27 PM

As an old embedded systems designer without intricate knowledge of the Linux kernel I've done my best to understand split locks. And from what I've been able to glean they just seem like a bad idea that should be forbidden. Am I missing something?

**Zan Lynx** · 02 November 2020, 03:50 PM

Originally posted by muncrief

As an old embedded systems designer without intricate knowledge of the Linux kernel I've done my best to understand split locks. And from what I've been able to glean they just seem like a bad idea that should be forbidden. Am I missing something?

Well, I agree with you anyway. And it seems like that's one of the options for this patch.

I am sure that there is some software out there that will break with this because there is always something. Probably old copies of the Flash plugin. But oh well too bad for them.

I would love to hear the story of this though. There must have been some Intel CPU design engineer who questioned this. "Why are we doing this extra work to support unaligned lock operations?"
Was the answer a serious response based on a customer requirement? Or some kind of "Shut up, that's the specification and we're doing it"?

**PCJohn** · 03 November 2020, 10:55 AM

Originally posted by milkylainen

Wouldn't split locks be a big no-no for every in-memory representation of actual hardware as packed structures?
Or, more generally, for all packed structs?
But how am I to guarantee that no structure element crosses a cache line on a generic machine?
Assuming all ints and pointers are atomic?
It's not like GCC knows the final running arch for generic targets, or that the placement of an object won't lead to intra-object misalignment.

No harm to usual C/C++ code. You can use structures freely. If you have a strong reason to, you can force your compiler to align structure members very poorly, which degrades performance, but even then you do not hit the split lock problem. For that you would need to use atomic instructions in your program. In C++, one way to get such functionality is https://en.cppreference.com/w/cpp/atomic/atomic . You will mostly use it in multithreaded applications for synchronization and communication between threads. So taking an atomic int and forcing the compiler to place it across the boundary between two cache lines would result in the split lock problem. I do not see any meaningful reason why anybody would want to do that, and I understand it is a performance disaster.
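To make the "crosses the boundary between two cache lines" condition concrete, here is a minimal sketch. It assumes a 64-byte cache line (typical for current x86 CPUs, but not guaranteed) and that the structure itself starts on a cache-line boundary. The names `crosses_cache_line`, `PackedExample` and `AlignedExample` are made up for illustration.

```cpp
#include <cstddef>
#include <cstdint>

// Assumption: 64-byte cache lines, which is typical for current x86 CPUs.
constexpr std::size_t kCacheLine = 64;

// True if an object of `size` bytes starting at byte address `addr`
// touches more than one cache line.
constexpr bool crosses_cache_line(std::uintptr_t addr, std::size_t size,
                                  std::size_t line = kCacheLine) {
    return size != 0 && (addr / line) != ((addr + size - 1) / line);
}

// Hypothetical packed layout: no padding, so `counter` sits at offset 62
// and occupies bytes 62..65, straddling the boundary at byte 64.
#pragma pack(push, 1)
struct PackedExample {
    char pad[62];
    std::uint32_t counter;
};
#pragma pack(pop)

// Same members with natural alignment: the compiler inserts two bytes of
// padding, so `counter` sits at offset 64 and stays within one cache line.
struct AlignedExample {
    char pad[62];
    std::uint32_t counter;
};
```

With these definitions, `crosses_cache_line(offsetof(PackedExample, counter), 4)` is true and the same check on `AlignedExample` is false. Only a locked (atomic) operation on the straddling field would be a split lock; a plain load or store to it is merely slow.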

That is my understanding of the problem. Anyone, correct me if I am wrong.

**milkylainen** · 04 November 2020, 10:32 AM

Originally posted by PCJohn

No harm to usual C/C++ code. You can use structures freely. If you have a strong reason to, you can force your compiler to align structure members very poorly, which degrades performance, but even then you do not hit the split lock problem. For that you would need to use atomic instructions in your program. In C++, one way to get such functionality is https://en.cppreference.com/w/cpp/atomic/atomic . You will mostly use it in multithreaded applications for synchronization and communication between threads. So taking an atomic int and forcing the compiler to place it across the boundary between two cache lines would result in the split lock problem. I do not see any meaningful reason why anybody would want to do that, and I understand it is a performance disaster.

That is my understanding of the problem. Anyone, correct me if I am wrong.

Hmm. I still can't wrap my head around this.
Isn't a basic type of the CPU always atomic?
How else would you guarantee that a pointer variable straddling two cache lines won't get trashed in a multi-core system?
To me, test/set/swap isn't fundamentally different from writing a variable of a fundamental type.
Maybe the CPU first sets the owner/modified state on all cache lines of a variable?
That's where I thought the lock would happen, resolving both cache lines.
But as I said, I'm having difficulties understanding.

**PCJohn** · 04 November 2020, 12:13 PM

Originally posted by milkylainen

Hmm. I still can't wrap my head around this.
Isn't a basic type of the CPU always atomic?
How else would you guarantee that a pointer variable straddling two cache lines won't get trashed in a multi-core system?
To me, test/set/swap isn't fundamentally different from writing a variable of a fundamental type.
Maybe the CPU first sets the owner/modified state on all cache lines of a variable?
That's where I thought the lock would happen, resolving both cache lines.
But as I said, I'm having difficulties understanding.

Basically, the following code is not guaranteed to be atomic:
int a;
a++;
I am no expert on x86 assembly, but I expect it would translate to something like the following instructions:
mov eax, [address of a]
add eax, 1
mov [address of a], eax
In other words, you usually fetch the content into a CPU register, do your work, and then write it back to memory. Any other processor or thread might kick in and modify the value of a between the three instructions. So there is no atomicity here.
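The lost update can be shown without real threads. The sketch below plays out one unlucky interleaving of the load / add / store steps by hand, then contrasts it with `std::atomic<int>::fetch_add`, which performs the read-modify-write as one indivisible operation. The function names are invented for illustration, and the interleaving is simulated in a single thread because a real data race on a plain `int` is undefined behavior in C++.

```cpp
#include <atomic>

// Two simulated threads each run "a++" as load / add / store on a plain int.
// The steps are interleaved by hand so both loads happen before either store.
inline int lost_update_demo() {
    int a = 0;
    int reg1 = a;   // thread 1: load a into a register
    int reg2 = a;   // thread 2: load a (sees the same old value)
    reg1 += 1;      // thread 1: add 1
    reg2 += 1;      // thread 2: add 1
    a = reg1;       // thread 1: store the result
    a = reg2;       // thread 2: store, overwriting thread 1's result
    return a;       // 1, although two increments ran
}

// With std::atomic, each increment is a single indivisible read-modify-write,
// so no other thread can slip in between the load and the store.
inline int atomic_increment_demo() {
    std::atomic<int> a{0};
    a.fetch_add(1); // stands in for thread 1
    a.fetch_add(1); // stands in for thread 2
    return a.load(); // 2
}
```

On x86, an atomic increment like this is typically compiled to a single instruction with a `lock` prefix, and that kind of locked instruction is what becomes a split lock when its operand straddles two cache lines.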
char c;
c=0;
A plain assignment of a 1-byte value is atomic. I would guess that assignment of a properly aligned int or even double is atomic on most 64-bit architectures, i.e. if two threads write to the same variable at the same moment, there will be one value or the other, but not a few bytes of each. But we are on thin ice here. If something has to be atomic, use the proper synchronization primitives.

What if the compiler and its optimizer decide that the variable will not get a place in memory but stays only in a register, and is written to memory only at the end of our very long function? Another thread might try to read the in-progress values, but it will see nothing until the end of that function, when the compiler finally writes the value from the register back to memory.

And even worse: you might assign a value to one variable and then a value to another. Looking from another thread or processor, you would expect the new value of the first variable to appear first and that of the second variable afterwards. However, there are no such guarantees. You might see them in any order. The compiler is the first one who might reorder the writes. Even if it keeps the order, there might be other "trolls" around. We might suspect speculative execution of being free to change the order of instructions. Luckily, there is the ROB (reorder buffer) stage at the end of the CPU pipeline, which makes all the writes happen in the order they are written in the code. So speculative execution is probably not guilty. But what about caches? The two variables might be in different cache lines. Are they written back to main memory in the correct order? As far as I remember, current multiprocessor systems use a special cache coherence protocol to keep the caches in a consistent state. If one processor writes a cache line, it is invalidated in the caches of all other processors. As a result, on x86 at least, you will probably never see the second write jump ahead of the first one.
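Instead of relying on what the compiler and hardware happen to do, the ordering between two writes can be requested explicitly. The following is a minimal sketch of the usual release/acquire publishing pattern in C++. The names `payload`, `ready`, `publish` and `try_consume` are hypothetical, and the two functions are meant to be called from different threads.

```cpp
#include <atomic>

int payload = 0;                 // ordinary, non-atomic data
std::atomic<bool> ready{false};  // flag that publishes the payload

// Writer side: write the data first, then set the flag with release ordering.
// Neither the compiler nor the CPU may move the payload write after the store.
void publish(int value) {
    payload = value;
    ready.store(true, std::memory_order_release);
}

// Reader side: an acquire load that observes `true` guarantees that the
// payload write made before the release store is visible as well.
bool try_consume(int& out) {
    if (ready.load(std::memory_order_acquire)) {
        out = payload;
        return true;
    }
    return false;  // nothing published yet
}
```

The reader either sees the flag unset and leaves the payload alone, or sees the flag set together with the payload written before it. It can never see the second write without the first.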

Did I get your question right, and does the above shed a little more light on the problem? In short, unless you are a multithreading expert, always use locks, atomics, etc. for data shared between threads. For example, appending to and removing from a std::list from two threads concurrently is likely to result in a broken list and seemingly magical application crashes that are very hard to debug.
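For the std::list case, a minimal sketch of the "always use locks" advice could look like the following. `LockedList` is a made-up wrapper, not a standard class. Every access takes the same `std::mutex`, so appends and removals from different threads cannot interleave inside the list's internals.

```cpp
#include <cstddef>
#include <list>
#include <mutex>

// Hypothetical wrapper: every operation holds the mutex for its duration.
class LockedList {
public:
    void push_back(int value) {
        std::lock_guard<std::mutex> guard(mutex_);
        items_.push_back(value);
    }

    // Moves the first element into `out`; returns false if the list is empty.
    bool pop_front(int& out) {
        std::lock_guard<std::mutex> guard(mutex_);
        if (items_.empty()) {
            return false;
        }
        out = items_.front();
        items_.pop_front();
        return true;
    }

    std::size_t size() const {
        std::lock_guard<std::mutex> guard(mutex_);
        return items_.size();
    }

private:
    mutable std::mutex mutex_;  // mutable so that size() can lock it
    std::list<int> items_;
};
```

Checking for emptiness and removing the element happen under one lock in `pop_front`. Splitting them into separate `empty()` and `pop()` calls would reopen the race between the check and the removal.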

Intel Bus Lock Detection For The Linux Kernel Proceeding