AMD Is Prepared To Release A Complete User-Space Open-Source Stack For HSA
Would be nice for AMD. They need the ... umm, Yuan Renminbi.
Anyway, nice addition. Sadly I don't have an HSA-ready system yet (IIRC Kabini is not HSA), but Kaveri will probably be my next one in my main box.
Thanks for the work and for putting the spec & source out to the people, AMD devs.

Stop TCPA, stupid software patents and corrupt politicians!
Originally posted by uid313 View Post
Could HSA be used by Intel and Nvidia too?
I find vendor-specific technology utterly boring and uninteresting.
Originally posted by Adarion View Post
Anyway, nice addition. Sadly I don't have an HSA-ready system yet (IIRC Kabini is not HSA), but Kaveri will probably be my next one in my main box.
I think the distinction is between "some HSA features" (Kabini) and "all HSA features" (Kaveri). IIRC the key issue for Kabini is 40-bit GPU virtual addresses versus the 48-bit virtual addresses in Kaveri's GPU.
The 48-bit virtual addresses in the Kaveri GPU match what you get in an AMD64 CPU today, allowing full pointer equivalency between CPU and GPU when accessing system memory via the IOMMUv2.

Last edited by bridgman; 10 November 2014, 11:23 AM.
The HSA 1.0 spec at http://amd-dev.wpengine.netdna-cdn.c...2/10/hsa10.pdf sheds some light on the memory semantics.
3. Memory Model
3.1. Overview
A key architectural feature of HSA is its unified memory model. In the HSA memory model, a combined
latency/throughput application uses a single unified virtual address space. All HSA-accessible memory
regions are mapped into a single virtual address space to achieve Shared Virtual Memory (SVM)
semantics.
Memory regions shared between the LCU and TCU are coherent. This simplifies programming by
eliminating the need for explicit cache coherency management primitives, and it also enables finer-
grained offload and more efficient producer/consumer interaction. The major benefit from coherent
memory comes from eliminating explicit data movement and eliminating the need for explicit heavyweight
synchronization (flushing or cache invalidation). Existing programming models that already use
flushing and cache invalidation can also be supported, if needed.
3.2. Virtual Address Space
Not all memory regions need to be accessible by all compute units. For example:
• TCU work-item or work-group private memory need not be accessible to the LCUs. In fact, each work-
item or work-group has its own copy of private memory, all visible in the same virtual address space.
Private memory accesses from different work-items through the same pointer result in accesses to
different memory by each work-item; each work-item accesses its own copy of private memory. This is
similar to thread-local storage in CPU multi-threaded applications. Access to work-item or work-group
memory directly by address from another accessor is not supported in HSA.
• LCU OS kernel memory should not be accessible to the TCUs. The OS kernel must have ownership of
its own private data (process control blocks, scheduling, memory management), so it is to be expected
that TCUs should not have access to this memory. The OS kernel, however, may expose specific
regions of memory to the TCUs, as needed.
When a compute unit dereferences an inaccessible memory location, HSA requires the compute unit to
generate a protection fault. HSA supports full 64-bit virtual addresses, but currently physical addresses
are limited to 48 bits, which is consistent with modern 64-bit CPU architectures.
3.2.1. Virtual Memory Regions
HSA abstracts memory into the following virtual memory regions. All regions support atomic and
unaligned accesses.
• Global: accessible by all work-items and work-groups in all LCUs and TCUs. Global memory embodies the main advantage of the HSA unified memory model: it provides data sharing between LCUs and TCUs.
• Group: accessible to all work-items in a work-group.
• Private: accessible to a single work-item.
• Kernarg: read-only memory used to pass arguments into a compute kernel.
• Readonly: global read-only memory.
• Spill: used for load and store register spills. This segment provides hints to the finalizer to allow it to generate better code.
• Arg: read-write memory used to pass arguments into and out of functions.
3.3. Memory Consistency and Synchronization
3.3.1. Latency Compute Unit Consistency
LCU consistency is dictated by the host processor architecture. Different processor architectures
may have different memory consistency models, and it is not in the scope of HSA to define these
models. HSA must, however, operate within the constraints of those models.
3.3.2. Work-item Load/Store Consistency
Memory operations within a single work-item to the same address are fully consistent and ordered. As a
consequence, a load executed after a store by the same work-item will never receive stale data, so no
fence operations are needed for single work-item consistency. Memory operations (loads / stores) at
different addresses, however, could be re-ordered by the implementation.
3.3.3. Memory Consistency across Multiple Work-Items
The consistency model across work-items in the same work-group, or work-items across work-groups,
follows a “relaxed consistency model”: from the viewpoint of the threads running on different compute
units, memory operations can be reordered.
• Loads can be reordered after loads.
• Loads can be reordered after stores.
• Stores can be reordered after stores.
• Stores can be reordered after loads.
• Atomics can be reordered with loads.
• Atomics can be reordered with stores.
This relaxed consistency model allows better performance. In cases where a stricter consistency model
is required, explicit fence operations or the use of the special load acquire (ld_acq) and store release
(st_rel) is needed.

Last edited by dibal; 23 November 2014, 12:32 PM.