Linus Torvalds On The Importance Of ECC RAM, Calls Out Intel's "Bad Policies" Over ECC
http://www.dirtcellar.net
-
Originally posted by coder View Post
The chips are interleaved, on single-rank DIMMs. I'm not sure how dual-rank DIMMs are mapped, but it still won't be the case that one bad chip = 1/8th or 1/16th of the address range is unusable. It would either break the whole DIMM or maybe half of its address range, I think.
Yes, one or more bad pages/rows/etc. can be blocked, at boot time. I've never done it, but I know it's possible.
-
Originally posted by strtj View Post
This existed ~25 years ago in Tru64 (Digital Unix / OSF/1). I can't remember if it was a paid add-on or if it existed in the base OS, but it did pretty much exactly what you describe. You could set it to do a certain amount of memory in a certain period of time and CPU use would scale accordingly. I'm always surprised that I've never seen it anywhere else.
Anyway, if we had such a built-in background memory test running constantly in the idle thread(s), possibly rate limited (e.g. test 1GB per 30 minutes max), I am sure that within a couple of months manufacturers would be getting a not insignificant number of memory modules returned to them pretty often. Hell, maybe it is a good idea to reimplement this for Linux; that way we can have reliable hardware again!
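The idea could be sketched in user space roughly as below. This is only a toy, and the function names are my own: a real implementation would have to run in the kernel's idle path and walk physical pages, whereas this can only exercise memory the process itself allocated.

```python
import time

def scan_chunk(buf: bytearray, pattern: int) -> int:
    """Write a byte pattern into a buffer, read it back, count mismatches."""
    buf[:] = bytes([pattern]) * len(buf)   # write phase
    return len(buf) - buf.count(pattern)   # read-back phase: bytes that differ

def background_scan(total_mb: float, mb_per_interval: float,
                    interval_s: float, chunk_kb: int = 256) -> int:
    """Scan total_mb of memory with alternating bit patterns, rate limited
    to at most mb_per_interval megabytes per interval_s seconds.
    Returns the number of mismatched bytes found."""
    errors = 0
    scanned_mb = 0.0
    window_mb = 0.0
    window_start = time.monotonic()
    chunk = bytearray(chunk_kb * 1024)
    while scanned_mb < total_mb:
        for pattern in (0x55, 0xAA):   # alternating 01010101 / 10101010 patterns
            errors += scan_chunk(chunk, pattern)
        scanned_mb += chunk_kb / 1024
        window_mb += chunk_kb / 1024
        if window_mb >= mb_per_interval:
            # Sleep off the rest of the interval so we never exceed the cap.
            remaining = interval_s - (time.monotonic() - window_start)
            if remaining > 0:
                time.sleep(remaining)
            window_start = time.monotonic()
            window_mb = 0.0
    return errors
```

On healthy RAM the mismatch count should be zero; the rate cap is what would keep such a scanner invisible in the idle loop.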
-
Originally posted by coder View Post
Yes, and I explained why.
I agreed to no such protocol. That's straight out of your imagination.
How can you possibly think that, after seeing how many of your posts I've replied to? I read every post in this thread.
If you feel I've mis-characterized your position, feel free to correct the record. If not, why do you say that?
1. ECC is not a necessity for the average consumer-grade PC, but is a necessity for workstations and servers.
2. Getting ECC support on Intel isn't that expensive and therefore not worth complaining about.
Yes, I'm serious. You made a very specific claim about the relative frequency of two types of errors. I very much doubt it. Let's see your data, or drop the point.
We're not talking about drive failure vs. RAM failure. You said bad sectors were more common than memory errors. Don't muddy the water.
Is it? I specifically said not. I'm more interested in the nuances of which platforms offer what level of ECC support and specifically when does it make sense to use ECC?
This is what happens when you C your way into an A-B discussion.
ECC can save your bacon and tell you that you're getting memory errors. Without it, you're just left to suffer the consequences and guess why your system is unstable (unless/until you reproduce the problem with a memtest, by which point you've already suffered sufficient loss of time or data to even justify running it).
If we're talking lost time in a manner where that time is critical, not going for ECC in the first place was also irresponsible.
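On Linux, those corrected and uncorrected memory errors that ECC surfaces are reported through the EDAC subsystem's sysfs counters. A minimal sketch that sums them across memory controllers follows; the `mc*/ce_count` layout is the standard EDAC interface, but only systems with ECC and a loaded EDAC driver expose it.

```python
from pathlib import Path

def read_edac_counts(edac_root: str = "/sys/devices/system/edac/mc") -> tuple:
    """Sum correctable (ce_count) and uncorrectable (ue_count) error counts
    across all memory controllers. Returns (ce, ue); (0, 0) means either a
    clean system or no EDAC driver loaded (as on non-ECC hardware)."""
    ce = ue = 0
    root = Path(edac_root)
    if root.is_dir():
        for mc in root.glob("mc*"):           # one directory per memory controller
            ce += int((mc / "ce_count").read_text())
            ue += int((mc / "ue_count").read_text())
    return ce, ue
```

A nonzero `ce_count` is exactly the early warning being argued for here: the machine keeps running, but you know a module is degrading.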
-
Originally posted by schmidtbag View Post
Again, if that were true, it would be the standard. You wouldn't have an option.
Originally posted by darkoverlordofdata View Post
I think most would agree that we want ECC on the server - but do we need it on the workstation? I have a 4-year-old Asus ZenBook that logs ECC errors; it's never had one.
Originally posted by Sonadow View Post
The fact remains that the world has been fine with using non-ECC memory on high-end professional workstations, desktop PCs, gaming PCs and work PCs for decades with no major complications or consequences.
I have a 2990WX workstation with 128GB of standard DDR4 memory for use in compiling software (with the memory being used as a RAMdisk for faster compilations) and have not run into any problems since day one.
With other parts of the files it won't be as obvious that they're borked, but the biggest job (one that takes several weeks) is all RAM shuffling for photogrammetry processing with proprietary software. I don't know exactly what it's doing, other than using the CPU at 100%, all of the RAM, a big disk buffer (400GB on NVMe), and stressing several GPUs with 8GB or more of VRAM. That's a lot of data over a long enough period, and as a business, if something borks and crashes, that can be costly. It happened once, but that was related to a NIC driver causing a bluescreen.
I've also witnessed some smaller workloads that take 4-12 hours crashing on non-ECC RAM. It could have been due to other reasons, but it was otherwise the same workload the machine usually performs, and repeating it worked fine, so it wasn't reproducible. That was using about 80GB of RAM.
-
ZFS solved this issue via software, not hardware.
Why can't ECC be solved via software checksums?
I know performance would drop, but it seems like it should be an option you enable in the BIOS if you're willing to lose performance.
ECC memory can still go bad, and it doesn't even address the worst errors in the first place.
This isn't just about system memory. There's GPU memory; CPU L1, L2, and L3 cache; SSD and hard drive cache; RAID card cache; high-speed network card cache.
-
Originally posted by waxhead View Post
Thanks for that, I was not aware of that. The reason it is not seen elsewhere might very well be that memory chips are worse than we believe. Actually, they are; row-hammer is proof of that concept, I think. I am still on DDR2 memory, which is apparently less vulnerable to wearing out through use than DDR3/DDR4 memory.
Anyway, if we had such a built-in background memory test running constantly in the idle thread(s), possibly rate limited (e.g. test 1GB per 30 minutes max), I am sure that within a couple of months manufacturers would be getting a not insignificant number of memory modules returned to them pretty often. Hell, maybe it is a good idea to reimplement this for Linux; that way we can have reliable hardware again!
This certainly wouldn't be too difficult to re-implement on Linux, but I think you are right that manufacturers weren't scrambling to implement something like this. Especially in the days when the hardware vendor and the OS vendor were generally one and the same, there would have been very little motivation to increase the number of memory returns and/or service calls. If the customer doesn't notice some bad bits and never complains, who cares, right? :-P
As far as I am aware, the only significant feature of Tru64 that was carried forward in any way was the AdvFS filesystem, which was open-sourced and ported to Linux, but I'm not aware of anyone actually using it. The only reason I can think of to have used it is if you had a huge investment in AdvFS storage and for some reason couldn't easily migrate it to something else. AdvFS wasn't a bad filesystem, but the management tools were awful, and by 2008 there were plenty of reasonable large/clustering filesystem options. As far as I can tell, HP's only interest in Tru64 was killing it and migrating its customers to HP-UX.
-
Originally posted by sentry66 View Post
ZFS solved this issue via software, not hardware.
Why can't ECC be solved via software checksums?
I know performance would drop, but it seems like it should be an option you enable in the BIOS if you're willing to lose performance.
ECC memory can still go bad, and it doesn't even address the worst errors in the first place.
This isn't just about system memory. There's GPU memory; CPU L1, L2, and L3 cache; SSD and hard drive cache; RAID card cache; high-speed network card cache.
There is not only CPU time overhead, but also memory overhead - and if you try to minimize one, you maximize the other. Do you store an extra bit for every byte, or an extra byte for every word? Every time you use any byte, do you check just that byte, or the entire word?
Hardware ECC has zero CPU time overhead and its memory overhead is hidden (the extra chips live on the module), at a theoretical 12.5% cost increase - 8 check bits per 64 data bits. You cannot improve on that in software, and you'd be foolish to try.
Last, but not least: ZFS doesn't really implement ECC - it only implements error detection. If you don't have that block of data elsewhere, it is lost. ZFS blocks are optimally quite big, far too big to be able to restore the data from the checksum; it is only used to verify the validity of the block.
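The detection-versus-correction distinction can be made concrete with a toy Hamming(7,4) code - the same family of codes that hardware SECDED ECC builds on, shrunk here to 4 data bits for illustration (real DIMM ECC protects 64-bit words with 8 check bits, and can additionally detect double-bit errors):

```python
def hamming74_encode(nibble: int) -> list:
    """Encode a 4-bit value into a 7-bit Hamming codeword.
    Codeword layout (positions 1..7): p1 p2 d0 p3 d1 d2 d3."""
    d = [(nibble >> i) & 1 for i in range(4)]
    p1 = d[0] ^ d[1] ^ d[3]   # parity over positions 1,3,5,7
    p2 = d[0] ^ d[2] ^ d[3]   # parity over positions 2,3,6,7
    p3 = d[1] ^ d[2] ^ d[3]   # parity over positions 4,5,6,7
    return [p1, p2, d[0], p3, d[1], d[2], d[3]]

def hamming74_correct(code: list) -> tuple:
    """Recompute the parities; the syndrome is the 1-based position of a
    single flipped bit (0 = no error), so the error can be repaired, not
    merely detected. Returns (decoded_nibble, syndrome)."""
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 | (s2 << 1) | (s3 << 2)
    if syndrome:
        c[syndrome - 1] ^= 1      # flip the bad bit back
    nibble = c[2] | (c[4] << 1) | (c[5] << 2) | (c[6] << 3)
    return nibble, syndrome
```

A plain checksum over the same 4 bits would tell you *that* something flipped, but not *which* bit; the Hamming parities pinpoint it, which is exactly the extra power ECC buys over a ZFS-style checksum.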
This is quite easy to facilitate in software - most of the data my software saves to disk has its checksum embedded as well. No one in their right mind would even attempt to deserialize critical binary data without being sure of its integrity - that's a crash waiting to happen, or even worse.
Last edited by ddriver; 05 January 2021, 06:17 AM.
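A minimal sketch of that embed-and-verify pattern (the wire format here - a length-prefixed list of 32-bit ints with a CRC32 trailer - is hypothetical, not ddriver's actual format):

```python
import struct
import zlib

def serialize(values: list) -> bytes:
    """Pack a list of 32-bit ints, then append a CRC32 of the payload so
    readers can detect corruption before trusting the bytes."""
    payload = struct.pack(f"<I{len(values)}i", len(values), *values)
    return payload + struct.pack("<I", zlib.crc32(payload))

def deserialize(blob: bytes) -> list:
    """Verify the embedded checksum first; only then parse the payload."""
    payload = blob[:-4]
    (stored,) = struct.unpack("<I", blob[-4:])
    if zlib.crc32(payload) != stored:
        raise ValueError("checksum mismatch: refusing to deserialize corrupt data")
    count = struct.unpack_from("<I", payload)[0]
    return list(struct.unpack_from(f"<{count}i", payload, 4))
```

Refusing to parse on mismatch turns silent corruption into a loud, diagnosable failure - the same trade ECC makes at the hardware level.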
-
Originally posted by schmidtbag View Post
Again, if that were true, it would be the standard. You wouldn't have an option.
Your premise that "necessity implies standard" is false. Companies want profit -> they do what they need to get it. If a car is faulty and they are responsible for an accident, they may have to pay for it -> no profit. Companies will advertise whatever the consumer wants or thinks is cool (RGB?). Consumers don't know what ECC is (even worse, they don't know they need it, because every time there is a failure they attribute it to something else), and if companies put an "ECC" sticker on a case, that would bring them no profit. Advertising RGB, on the other hand, may in many cases make them good profit.
Fragmenting the market with a feature people don't know about and charging a lot for it is much more profitable than making it standard. Making it standard would, at some point, make the price go down enough that you wouldn't even know the difference any more.
As for the technical content of your comments, there are enough people here better suited than me that already put you to shame and besides, I am starting to think you are trolling, so I'm not gonna waste any more time.
Have a good one!
-
Originally posted by AB@07 View Post
Wonder if Intel will use their political tools, like the $300 million they gave to postmodern feminists, to "cancel" Torvalds if he keeps blasting Intel with the truth about their harmful anti-consumer practices - and let's not even get into the anti-competitive practices Intel has employed. I can totally see Intel looking at that comment by Torvalds and going "Oh no, looks like someone needs another reeducation, I mean soft skills".
Linus has set too many people, including senior kernel devs, straight far too many times. He seems to be one of the very few people with the foresight to minimize potential technical debt and keep the big picture in mind, even if it sometimes means blocking the inclusion of highly requested features in the kernel.