# **B3CC: Concurrency**

09: GPGPU

Ivo Gabe de Wolff

#### Mid-term exam next week

- Tuesday 19-12-2023 @ 13:00 - 15:00 in Olympos Hal 2

- Covers all the material up to and including STM
- Excluding Delta-stepping



#### Recap

# Task parallelism

- · Explicit threads
- Synchronise via locks, messages, or STM
- Modest parallelism
- Hard to program



#### Data parallelism

- Operate simultaneously on bulk data
- Implicit synchronisation
- Massive parallelism
- Easy to program

## **Data parallelism**



- The key is a single logical thread of control
- It does not actually require the operations to be executed in parallel!
- Today: let's look at how you would actually implement data-parallel operations, in parallel, on the GPU

# **CPU vs. GPU**

# **CPU vs. GPU**

#### Traditional CPU designs optimise for single-threaded performance

- Branch prediction, out-of-order execution, large caches, etc.
- Much of the available die area is dedicated to non-computation resources
- CPUs are designed to optimise *latency* of an individual thread's results
- Must be good at everything, parallel or not

#### • GPUs are designed to accelerate graphics processing (rasterisation)

- This is an inherently *data-parallel* task
- GPUs are designed to maximise *bandwidth*: the time to process as single pixel is less important than the number of pixels processed per second
- Specialised for compute intensive, highly parallel computation

# **CPU vs. GPU**

# **CPU vs. GPU**

#### • CPU

- Multiple tasks = multiple threads
- Tasks run different instructions
- 10s of complex threads execute on a few cores
- Threads managed explicitly
- Expensive to create & manage threads

#### • GPU

- SIMD: single instruction, multiple data
- 10s of thousands of lightweight threads
- Threads are managed and scheduled by the hardware
- Cheap to create many threads

 Image we need to perform some operation that takes 4 units of time (clock cycles), on values A, B, C and D.



- Horizontal parallelism: increase throughput
  - More execution units working in parallel
- Vertical parallelism: hide latency
- Keep functional units busy S when waiting for S dependencies, memory, etc.





10

9 https://en.wikipedia.org/wiki/IP over Avian Carriers



## **GPU** architecture

#### **GPU** architecture

- The CPU spends a lot of resources to avoid latency
- · The GPU instead uses parallelism to hide latency
- No branch prediction
- One task (kernel) at a time
- No context switching
- Limited super-scalar pipeline
- No out-of-order execution
- Very low clock speed

#### • Each GPU has...

- A number of streaming multiprocessors (comparable to CPU cores)
- Each core executes a number of warps (comparable to a CPU thread)
- Each warp consists of 32 "threads" that run in lockstep\* (comparable to a single lane of a SIMD execution unit)

13 \*not so for Volta architecture and onwards... http://www.catb.org/jargon/html/W/wheel-of-reincarnation.html

#### **GPU** architecture

#### **GPU** architecture

- · There are many similarities between the CPU and GPU
- Multiple cores
- A memory hierarchy
- SIMD vector instructions
- · But there are also fundamental differences
- Each SM executes up to 64 warps, instead of two threads (with SMT2)
- The memory hierarchy is explicit on the GPU (software managed cache)
- CPU uses thread (SMTx) and instruction level parallelism to saturate ALUs

16

- GPU SIMD is implicit (SIMT model)

#### 15

#### Each streaming multiprocessor (SM) executes a number of warps

- The SM has a number of active threads (e.g. Ampere has up to 2048 per SM)
- The core will switch warps whenever there is a stall in execution (e.g. waiting for memory)
- Latency is thus hidden by having many active threads; this is only possible if you can feed the GPU enough work

## **Execution model**

#### · The GPU is a co-processor controlled by a host program

- The host (CPU) and device (GPU) have separate memory spaces
- The host program controls data management on the device (allocation, transfer) as well as launching kernels



## **Execution model**

- · The GPU kernels execute multiple thread blocks over the SMs
- All threads execute the same sequential program
- Thread instructions are executed in logical SIMD groups (warps)



#### **Programming model**

#### **Programming model**

#### The CUDA (and OpenCL, Vulkan and Metal) programming model provides

- A thread abstraction to deal with SIMD
- Synchronisation and data sharing between small groups of threads (100s)
- A scalable programming model to deal with *lots* of threads (10,000s)
- A C-like language for device code
- The similarity is only superficial; it is heavily influenced by the underlying hardware model because people feel more comfortable if there are braces and semicolons ...

- · A GPU program consists of the kernel run on the GPU
- Kernels are functions which are executed n times in parallel by n different threads on the device
- Each thread executes the same sequential program
- We can not execute different code in parallel
- ... together with a program on the CPU to launch the kernel and control GPU device operations

| Kernels                                                                                                                                                                                                             | Threads                                                                                                                                                                                             |
|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Example: element-wise add two vectors     A     I     Z     3     4      +     +     +     +      B     5     6     7     8                                                                                         |                                                                                                                                                                                                     |
| <pre>- Sequential version:<br/>void vector_add( float* A, float* B, float* C, int n ) {     for ( int i = 0; i &lt; n; ++i ) {         C[i] = A[i] + B[i];     } } - CUDA kernel:</pre>                             | <ul> <li>A kernel consists of multiple copies of the code executed in parallel</li> <li>Each thread has its own registers</li> <li>Each warp or each thread has its own program counter*</li> </ul> |
| <pre>- CODA kerne:<br/>global void vector_add( float* A, float* B, float* C, int n )<br/>{<br/>int i = blockDim.x * blockIdx.x + threadIdx.x;<br/>if ( i &lt; n ) {<br/>C[i] = A[i] + B[i];<br/>}<br/>}<br/>}</pre> | <ul> <li>The order in which threads are executed is not specified</li> <li>Threads are very fine-grained</li> <li>Launching threads on the GPU is cheap compared to on the CPU</li> </ul>           |
| 21                                                                                                                                                                                                                  | * Pre-Volta there is one PC per warp; post-Volta each thread has its own PC                                                                                                                         |

**Threads** 

• Threads execute in a single-instruction multiple-thread model (SIMT)

- In a SIMD model the vector width is explicit
- In SIMT this is left unspecified
- Greatly simplifies the programming model



\_\_m128 b = \_mm\_set\_ps(8, 7, 6, 5);



\_global\_\_ void vector\_add( ... ) {
 // as before
}

**Threads** 

- Threads execute in a single-instruction multiple-thread (SIMT) model
- Understanding how this is mapped to the underlying hardware is important

22

- In CUDA threads execute in groups of 32 called a warp
- This is the *logical* vector width
- Performance considerations
- Threads in a warp share the same program counter
- Good code will try to keep all threads *convergent* within a warp

# **Threads**

## **Threads**

#### · The scalar (kernel) code is mapped onto the hardware SIMD execution

- Hardware handles control flow divergence and convergence
- Divergent control flow between warp threads is handled via an active mask



#### · Divergent control flow is handled by predicated execution

- At each cycle all threads in a warp must execute the same instruction
- Conditional code is handled by temporarily disabling threads for which the condition is not true (alternatively; false)

26

28

- If-then-else blocks are sequentially executing the 'if' and 'else' branches
- · The GPU is therefore a very wide vector processor

#### **Threads**

#### • Divergent control flow is handled by predicated execution

- Can lead to subtle deadlocks...
- Consider the canonical implementation of a spin-lock (for the CPU):
  - do {

old = atomic\_exchange(&lock[i], 1);
} while (old = 1);

#### /\* critical section \*/

atomic\_exchange(&lock[i], 0);

#### **Threads**

- Benefits of SIMT vs. SIMD
- Similar to regular scalar code, easier to read and write
- · Drawbacks of SIMT vs. SIMD
- The (logical) vector width is always 32, regardless of the data size
- Scattered memory access and control flow are not discouraged

#### **Thread hierarchy**

- Parallel kernels are composed of many threads
- Executing the same sequential program
- Each thread has a unique identifier
- Threads are grouped into blocks
- Threads in the same block can cooperate
- A grid of thread blocks is the collection of threads which will execute a given kernel
- Thread blocks will be scheduled onto the SMs of the GPU for execution



# **Thread hierarchy**

#### · Individual threads are grouped into thread blocks

- Each thread block constitutes an independent data-parallel task
- Threads in the same block can cooperate and synchronise with each other
- Threads in different thread blocks can not cooperate
- The program must be valid for any interleaving of thread blocks
- · This independence requirement ensures scalability

# **Thread hierarchy**

#### • Each thread block is mapped onto a SM of the GPU to be executed

- The hardware is free to assign blocks to any processor (SM) at any time
- A kernel scales across any number of parallel processors
- Each block executes in any order relative to other blocks



# **Thread hierarchy**

- · Each GPU thread is individually very weak
- Hardware multithreading is required to hide latency
- This means that performance depends on the number of thread blocks which can be allocated onto each SM

30

32

- This is limited by the set of registers and shared memory on the SM which are shared between all threads executing on that processor
- Therefore, per-thread resource usage costs performance
- More registers  $\Rightarrow$  fewer thread blocks
- More shared (local) memory usage  $\Rightarrow$  fewer thread blocks

## **Occupancy**

# **Thread blocks**

- The multiprocessor *occupancy* is the number of kernel threads which can run simultaneously on each SM, compared to the maximum possible
- Example: Constants for Turing architecture (RTX 2080 and similar)
- Simultaneous thread blocks (B)  $\leq 16$
- Warps per thread block  $(T) \leq 32$
- Maximum resident warps:  $B \times T \le 32$
- 32-bit registers per thread:  $B \times T \times 32 \le 65536$
- Shared memory per block (bytes)  $\times B \le 65536^*$
- Occupancy: B × T / 48

- Threads in a thread block can communicate and synchronise
- Example: reverse a vector
- Question: Does this work?

# **Memory hierarchy**

- A many-core processor is a device for turning a compute bound problem into a memory bound problem
- Lots of processors (ALUs)
- Memory concerns dominate performance tuning
- Only global memory is persistent across kernel launches



33





#### **Memory hierarchy**

- · Global memory is accessed in 32-, 64-, or 128-byte transactions
- Similar to how a CPU reads a cache line at a time
- The GPU has a "coalescer" which examines the memory requests from threads in the warp, and issues one or more global memory transactions
- · To use bandwidth effectively, threads should read/write in dense blocks





34

## **GPGPU**

# **Summary**

#### A typical GPU program

- 1. Set up input data on the CPU
- 2. Transfer input data to the GPU
- 3. Operate on the data
- 4. Transfer results back to the CPU

|          |                                | 0.23 s             | 0.24 s | 0.25 s | 0.26 s      | 0.27 s | 0.28 s            | 0.29 s | 0.3 s | 0.31 s |
|----------|--------------------------------|--------------------|--------|--------|-------------|--------|-------------------|--------|-------|--------|
|          | Process "quickhull-exe 100     |                    |        |        |             |        |                   |        |       |        |
| •••      | Thread 4106338880              |                    |        |        |             |        |                   |        |       |        |
|          | <ul> <li>Driver API</li> </ul> | cuMemcpyHt cuMemcp | pyHt   |        | cuMemcpyDt  |        | cuMemcpyDtoHAsync |        |       |        |
|          | Thread 4061357824              |                    |        |        |             |        |                   |        |       |        |
| . profit | - Driver API                   |                    |        |        | cuEventDest |        | cuEventDestroy    |        |       | 11     |
|          | Thread 4052965120              |                    |        |        |             |        |                   |        |       |        |
|          | <ul> <li>Driver API</li> </ul> |                    |        |        |             |        |                   |        |       | 1      |
|          | Profiling Overhead             |                    |        |        |             |        |                   |        |       |        |
|          | E [0] GeForce RTX 2080 Ti      |                    |        |        |             |        |                   |        |       |        |
|          | Context 1 (CUDA)               |                    |        |        |             |        |                   |        |       |        |
|          | - ThemCpy (HtoD)               | Memopy Hto Memopy  | Hto    |        |             |        |                   |        |       |        |
|          | - 🍸 MemCpy (DtoH)              |                    |        |        |             |        |                   |        |       |        |
|          |                                |                    |        |        |             |        |                   |        | perm  |        |
|          | E Compute                      |                    |        |        |             |        | permute_mutex     |        |       |        |
|          |                                |                    |        |        |             |        |                   |        |       |        |
|          | E Streams                      |                    |        |        |             |        |                   |        |       |        |

- · GPU excels at executing many parallel threads
- Scalable parallel execution
- High bandwidth parallel memory access
- CPU excels at executing a few serial threads
- Fast sequential execution

37

- Low latency cached memory access

# **Summary**

#### · GPUs excel when...

- The calculation is data-parallel and the control-flow is regular
- The calculation is large (compute/memory bound)

#### CPUs excel when...

- The calculation is largely serial and the control-flow is irregular
- The programmer is lazy



# **Extra slides**

NVIDIA programming guides

41

Intel intrinsics guide