How A Raspberry Pi 4 Performs Against Intel's Latest Celeron, Pentium CPUs

  • Originally posted by atomsymbol

    No, I do not mean that. Optimizing "REP MOVS" when designing a high-performance x86 CPU with the goal/target "make REP MOVS be 10% faster than hand-coded assembly" is pointless. I mean a 1000% speedup on large blocks. ... When copying 32 MiB of aligned non-overlapping data then it could first write overlapping modified data from CPU caches to memory and then send a few commands to the memory modules to perform the copy operation while the CPU is concurrently running other tasks that do not interfere with the ongoing copy operation within the memory modules. Similarly, when copying a large file on an SSD the operating system could send commands to the SSD device (note: SSD is internally a high-bandwidth memory device, while the PCI-Express x4 interface to which the SSD is connected is a serial interface).
    That 1000% speedup exists in some x86 chips with REP MOVS, but it is not implemented in a way where the data enters the caches. REP MOVS is automatically optimised to use the largest copy the CPU core has access to. There is a reason why MMU optimisation is not common: there is a race condition between cores to overcome, and there is a cost in determining whether a copy is going to be large enough to use the MMU.

    If the MMU copy is done, it is not done by writing the overlapping modified data first. It is done by duplicating the existing data through page-table directions to the MMU and then applying the modified data.

    The MMU really does not have a fine-grained mapping of memory information. The most detailed information it has is the page tables, and this includes DMA. Being limited to the MMU's granularity does limit what kind of operations you can do. This granularity causes another problem. 4 KiB is the smallest page size. The next page size up on x86 is 2 MiB. A REP MOVS that optimises to the MMU can work while page entries are 4 KiB. In current x86 implementations, if you use 2 MiB pages it does not happen, because it takes quite a lot of optimisation processing to decide that an operation is going to be large enough to need a 2 MiB page copy.
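
    To make the granularity point concrete, here is a minimal sketch in C. It is not how any CPU or MMU actually implements this; `copy_split`, `split_by_page` and `same_page_offset` are hypothetical names made up for illustration. It only does the arithmetic: how much of a copy range consists of whole pages that a page-granular mechanism could even consider, and how much is left over as a head and tail that would have to be copied byte by byte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper type: how a copy of `len` bytes starting at `addr`
 * splits into an unaligned head, a run of whole pages, and a tail. */
typedef struct {
    size_t head_bytes;   /* bytes before the first page boundary */
    size_t whole_pages;  /* pages lying entirely inside the range */
    size_t tail_bytes;   /* bytes after the last whole page */
} copy_split;

/* Assumes page_size is a power of two (4 KiB or 2 MiB in this thread). */
static copy_split split_by_page(uintptr_t addr, size_t len, size_t page_size)
{
    copy_split s = {0, 0, 0};
    size_t misalign = (size_t)(addr & (page_size - 1));
    size_t head = misalign ? page_size - misalign : 0;

    if (head > len)
        head = len;              /* range ends before the next boundary */

    s.head_bytes = head;
    s.whole_pages = (len - head) / page_size;
    s.tail_bytes = (len - head) % page_size;
    return s;
}

/* Page-granular duplication only makes sense when source and destination
 * sit at the same offset inside their pages. Returns 1 if they do. */
static int same_page_offset(uintptr_t src, uintptr_t dst, size_t page_size)
{
    return ((src ^ dst) & (page_size - 1)) == 0;
}
```

    With these definitions, a 32 MiB copy starting on a page boundary is 8192 whole 4 KiB pages. A 3 MiB copy starting at address 0x1000 is 768 whole 4 KiB pages, but contains no whole 2 MiB page at all, which is the kind of case where the larger page size rules out a page-level copy.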

    There are some MMUs for ARM that do support sending the copy-page instruction to the MMU yourself. This is mostly not used unless the developer of a program goes out of their way to code it in. The granularity of the MMU is a real limiting factor.

    Yes, I do agree that it would be useful to get that 1000% speedup on large blocks.
