AMD Is Prepared To Release A Complete User-Space Open-Source Stack For HSA
Would be nice for AMD. They need the ... umm, Yuan Renminbi.
Anyway, nice addition. Sadly I don't have an HSA-ready system yet (IIRC Kabini is not HSA), but Kaveri will probably be my next one in my main box.
Thanks for the work and for putting the spec & source out to the people, AMD devs.

Stop TCPA, stupid software patents and corrupt politicians!
Originally posted by uid313 View Post
Could HSA be used by Intel and Nvidia too?
I find vendor-specific technology utterly boring and uninteresting.
Originally posted by Adarion View Post
Anyway, nice addition. Sadly I don't have an HSA-ready system yet (IIRC Kabini is not HSA), but Kaveri will probably be my next one in my main box.
I think the distinction is between "some HSA features" (Kabini) and "all HSA features" (Kaveri). IIRC the key issue for Kabini is 40-bit GPU virtual addresses versus the 48-bit virtual addresses in Kaveri's GPU.
The 48-bit virtual addresses in the Kaveri GPU match what you get in an AMD64 CPU today, allowing full pointer equivalency between CPU and GPU when accessing system memory via the IOMMUv2.

Last edited by bridgman; 10 November 2014, 11:23 AM.
The HSA 1.0 spec at http://amd-dev.wpengine.netdna-cdn.c...2/10/hsa10.pdf sheds some light on the memory semantics.
3. Memory Model
3.1. Overview
A key architectural feature of HSA is its unified memory model. In the HSA memory model, a combined
latency/throughput application uses a single unified virtual address space. All HSA-accessible memory
regions are mapped into a single virtual address space to achieve Shared Virtual Memory (SVM)
semantics.
Memory regions shared between the LCU and TCU are coherent. This simplifies programming by
eliminating the need for explicit cache coherency management primitives, and it also enables finer-
grained offload and more efficient producer/consumer interaction. The major benefit from coherent
memory comes from eliminating explicit data movement and eliminating the need for explicit heavyweight
synchronization (flushing or cache invalidation). Existing programming models that already use
flushing and cache invalidation can also be supported, if needed.
3.2. Virtual Address Space
Not all memory regions need to be accessible by all compute units. For example:
• TCU work-item or work-group private memory need not be accessible to the LCUs. In fact, each work-
item or work-group has its own copy of private memory, all visible in the same virtual address space.
Private memory accesses from different work-items through the same pointer result in accesses to
different memory by each work-item; each work-item accesses its own copy of private memory. This is
similar to thread-local storage in CPU multi-threaded applications. Access to work-item or work-group
memory directly by address from another accessor is not supported in HSA.
• LCU OS kernel memory should not be accessible to the TCUs. The OS kernel must have ownership of
its own private data (process control blocks, scheduling, memory management), so it is to be expected
that TCUs should not have access to this memory. The OS kernel, however, may expose specific
regions of memory to the TCUs, as needed.
When a compute unit dereferences an inaccessible memory location, HSA requires the compute unit to
generate a protection fault. HSA supports full 64-bit virtual addresses, but currently physical addresses
are limited to 48 bits, which is consistent with modern 64-bit CPU architectures.
3.2.1. Virtual Memory Regions
HSA abstracts memory into the following virtual memory regions. All regions support atomic and
unaligned accesses.
• Global: accessible by all work-items and work-groups in all LCUs and TCUs. Global memory embodies the main advantage of the HSA unified memory model: it provides data sharing between LCUs and TCUs.
• Group: accessible to all work-items in a work-group.
• Private: accessible to a single work-item.
• Kernarg: read-only memory used to pass arguments into a compute kernel.
• Readonly: global read-only memory.
• Spill: used for load and store register spills. This segment provides hints to the finalizer to allow it to generate better code.
• Arg: read-write memory used to pass arguments into and out of functions.
3.3. Memory Consistency and Synchronization
3.3.1. Latency Compute Unit Consistency
LCU consistency is dictated by the host processor architecture. Different processor architectures
may have different memory consistency models, and it is not in the scope of HSA to define these
models. HSA must, however, operate within the constraints of those models.
3.3.2. Work-item Load/Store Consistency
Memory operations within a single work-item to the same address are fully consistent and ordered. As a
consequence, a load executed after a store by the same work-item will never receive stale data, so no
fence operations are needed for single work-item consistency. Memory operations (loads / stores) at
different addresses, however, could be re-ordered by the implementation.
3.3.3. Memory Consistency across Multiple Work-Items
The consistency model across work-items in the same work-group, or work-items across work-groups,
follows a “relaxed consistency model”: from the viewpoint of the threads running on different compute
units, memory operations can be reordered.
• Loads can be reordered after loads.
• Loads can be reordered after stores.
• Stores can be reordered after stores.
• Stores can be reordered after loads.
• Atomics can be reordered with loads.
• Atomics can be reordered with stores.
This relaxed consistency model allows better performance. In cases where a stricter consistency model
is required, explicit fence operations or the use of the special load acquire (ld_acq) and store release
(st_rel) is needed.

Last edited by dibal; 23 November 2014, 12:32 PM.