Announcement

**Nite_Hawk** · 20 January 2024, 09:59 AM

Author of the article here. There's been discussion about IOMMU on a couple of different other sites. I just wanted to mention that we've never seen this specific issue before with Ceph. We have previous generation 1U Dell servers with older EPYC processors in the upstream Ceph performance lab and they've never shown any signs of having IOMMU issues. The customer did however tell us that the have seen issues with IOMMU before in other contexts though.

**Mark Rose** · 20 January 2024, 11:32 AM

Originally posted by Nite_Hawk View Post

Author of the article here.

When I was reading this article earlier this morning, I was wondering how you arrived at selecting 48 core/96 thread CPUs. Does Ceph require that much processing power to saturate dual 100 gbe?

**Nite_Hawk** · 20 January 2024, 12:01 PM

Originally posted by Mark Rose View Post

When I was reading this article earlier this morning, I was wondering how you arrived at selecting 48 core/96 thread CPUs. Does Ceph require that much processing power to saturate dual 100 gbe?

It's actually kind of funny. Sequential reads typically require the least CPU of the 4 operations tested here, especially for replicated pools. For purely sequential reads on 3X replicated pools, lower binned CPUs would likely have achieved similar throughput. The CPUs were a much bigger factor in hitting the IOPS numbers in this article than the 1 TiB/s read number.

Typically in Ceph, small random writes use the most CPU, and things like erasure coding, msgr level encryption, and disk (LUKS) encryption all add CPU overhead as well. From a system design perspective, the vast majority of the money being spent on a cluster like this is going into the NVMe drives. You can shave off a percent or two by using lower binned CPUs, but that may limit the kind of future workloads you want to run or options you may want to enable. It's not a terrible idea to slightly over-invest in CPU.

Edit: If you are interested in numbers, I've written a couple of articles exploring this in the upstream Ceph lab:

Ceph OSD CPU Scaling - Part 1 - Ceph

https://ceph.io/en/news/blog/2022/ceph-osd-cpu-scaling/

Ceph is an open source distributed storage system designed to evolve with data.

Ceph Reef - 1 or 2 OSDs per NVMe? - Ceph

https://ceph.io/en/news/blog/2023/reef-osds-per-nvme/

Ceph is an open source distributed storage system designed to evolve with data.

**Veto** · 20 January 2024, 07:04 PM

How refreshing! A commentary thread with actual useful information and even the author participating. Absent of the usual bickering. Thumbs up from me.

**sophisticles** · 20 January 2024, 07:50 PM

Originally posted by Nite_Hawk View Post

Author of the article here.

I am not familiar with Ceph, does it support hardware acceleration?

If no, are there plans to support it?

If yes, then why not go with Intel processors:

Access Denied

https://www.intel.com/content/www/us/en/products/docs/accelerator-engines/storage-engines.html

I have to assume that 68 Dells, each with a 48C/96T EPYC with 192gb of ram each, must be using a lot of juice.

I have to believe that with hardware acceleration you could achieve higher performance with lower power consumption.

Your commentary is appreciated.

**S.Pam** · 21 January 2024, 08:27 AM

Originally posted by sophisticles View Post

I am not familiar with Ceph, does it support hardware acceleration?

If no, are there plans to support it?

If yes, then why not go with Intel processors:

Access Denied

https://www.intel.com/content/www/us/en/products/docs/accelerator-engines/storage-engines.html

I have to assume that 68 Dells, each with a 48C/96T EPYC with 192gb of ram each, must be using a lot of juice.

I have to believe that with hardware acceleration you could achieve higher performance with lower power consumption.

Your commentary is appreciated.

It's an interesting topic with HW acceleration. Most NICs at this speed have hardware queues and hardware offloading of various transforms before data hits the CPU.

**geekinasuit** · 21 January 2024, 08:40 PM

Originally posted by Nite_Hawk View Post

Author of the article here. There's been discussion about IOMMU on a couple of different other sites. I just wanted to mention that we've never seen this specific issue before with Ceph. We have previous generation 1U Dell servers with older EPYC processors in the upstream Ceph performance lab and they've never shown any signs of having IOMMU issues. The customer did however tell us that the have seen issues with IOMMU before in other contexts though.

I've experienced an issue where the file system (BTRFS in this case) would go into read-only mode after some time passed. The only way to resolve it was to force a reboot (hard reset) which only solved the issue temporarily until it reappeared some time later (hours or a few days usually, but sometimes a few weeks). The drive is a 2TB NVMe card, and I considered replacing it, but after a determined effort, I found that some people had the same or similar issue. What seemed to be the solution was disabling IOMMU. What I did was disabled IOMMU from "auto" to "disabled" as a BIOS setting. The system has been fine since, and has long ago passed the time frame where a read-only event would have occurred. Another solution I read about was to disable IOMMU as a kernel setting, but I was uncomfortable making changes that could accidentally mess up the OS from booting, so I changed the BIOS setting instead.

The various user experiences that I read about, was that the problem seemed to happen depending on a combination of factors, for example the model and version of the NVMe storage card, and possibly combined with the CPU/APU (eg Ryzen vs i-Core) installed.

For file systems that do not perform auto error detection, the problem may be present but go noticed. BTRFS for example will go into read-only mode as soon as there's an integrity error detected, however a file system such as Ext4 will not do that. I hope this information helps someone who is experiencing a similar problem.

**geekinasuit** · 15 February 2024, 08:10 PM

UPDATE:

"The system has been fine since, and has long ago passed the time frame where a read-only event would have occurred. "

Unfortunately, the read-only issue has returned after about 2 months remaining stable. Disabling IOMMU did appear to make the system stable for a much longer time frame, however whatever is going wrong appears to be more than an IOMMU issue. I'll order a new NVMe storage card and hope that will fully resolve the issue.

Announcement

Ceph Cluster Hits 1 TiB/s Using AMD EPYC Genoa + NVMe Drives

Ceph Cluster Hits 1 TiB/s Using AMD EPYC Genoa + NVMe Drives

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment