

# **GPU Computing with CUDA**

# Part 3: CUDA Performance Tips and Tricks

Dortmund, June 4, 2009 SFB 708, AK "Modellierung und Simulation"

## Dominik Göddeke

Angewandte Mathematik und Numerik TU Dortmund

dominik.goeddeke@math.tu-dortmund.de // http://www.mathematik.tu-dortmund.de/~goeddeke



#### Acknowledgements



- Slides based on previous courses by
  - Mark Harris, Simon Green, Gregory Ruetsch (NVIDIA)
  - Robert Strzodka (MPI Informatik)
  - Dominik Göddeke (TU Dortmund)
  - ARCS 2008 GPGPU and CUDA Tutorials http://www.mathematik.tu-dortmund.de/~goeddeke/arcs2008/
  - University of New South Wales Workshop on GPU Computing with CUDA

http://www.cse.unsw.edu.au/~pls/cuda-workshop09/



- Overview
- Hardware
- Memory optimizations
- Execution configuration optimizations
- Instruction optimizations
- Summary

## **Optimize** algorithms



- Maximize independent parallelism
- Maximize arithmetic intensity
  - Math per bandwidth
- Sometimes it's better to recompute than to cache
  - GPU spends its transistors on ALUs, not memory
- Do more computation on the GPU to avoid costly data transfers
  - Even low parallelism computations can sometimes be faster than transferring back and forth to host

#### **Optimize memory access**



- Coalesced vs. non-coalesced = order of magnitude
  - Global / local device memory
- Optimize for spatial locality in cached texture memory
- In shared memory, avoid high-degree bank conflicts
- Partition camping
  - When global memory access not evenly distributed among partitions
  - Problem-size dependent

### Take advantage of shared memory



- Hundreds of times faster than global memory
  - Sometimes as fast as registers
- Threads can cooperate via shared memory
  - Per thread block
- Use one (a few) threads to load or compute data shared by all threads
- Use it to avoid non-coalesced access
  - Stage loads and stores in shared memory to re-order non-coalesceable addressing

## **Use parallelism efficiently**



- Partition computation to keep the multiprocessors equally busy
  - Many threads, many thread blocks
  - Scalability on future devices
- Keep resource usage low enough to support multiple active threads blocks per multiprocessor
  - Registers, shared memory
  - Occupancy



- Overview
- Hardware
- Memory optimizations
- Execution configuration optimizations
- Instruction optimizations
- Summary



- 240 thread processors execute kernel threads
- 30 multiprocessors, each contains
  - 8 (single precision and integer) thread processors
  - 1 double precision unit
  - Shared memory enables thread cooperation



U technische universität dortmund

Software



Thread Processor

Threads are executed by thread processors

Thread

Thread

Block



Thread blocks are executed on multiprocessors

Thread blocks do not migrate

Several concurrent thread blocks can reside on one multiprocessor - limited by multiprocessor resources (shared memory and register file)



Processor array

A kernel is launched as a grid of thread blocks

Only one kernel can execute on a device at one time







A half-warp of 16 threads can coordinate global memory accesses into a single transaction called coalescing

U technische universität dortmund





| Memory   | Location | Cached | Access | Scope                  | Lifetime    |
|----------|----------|--------|--------|------------------------|-------------|
| Register | On-chip  | N/A    | R/W    | One thread             | Thread      |
| Local    | Off-chip | No     | R/W    | One thread             | Thread      |
| Shared   | On-chip  | N/A    | R/W    | All threads in a block | Block       |
| Global   | Off-chip | No     | R/W    | All threads + host     | Application |
| Constant | Off-chip | Yes    | R      | All threads + host     | Application |
| Texture  | Off-chip | Yes    | R      | All threads + host     | Application |



- Overview
- Hardware
- Memory optimizations
  - Data transfers between host and device
  - Device memory optimizations
- Execution configuration optimizations
- Instruction optimizations
- Summary

#### Host device data transfers



- Device to host bandwidth much lower than device to device bandwidth
  - 8 GB/s peak (PCIe x16 Gen 2) vs. 160 GB/s (GTX 285)
- Minimize transfers
  - Intermediate data can be allocated, operated on, and deallocated without even copying them to host memory

#### Group transfers

• One large transfer is better than many small ones

#### **Page-locked** data transfers



- cudaMallocHost() allows allocation of page-locked ("pinned") host memory
- Enables highest cudaMemcpy() performance
  - 3.2 GB/s on PCIe x16 Gen 1
  - 5.2 GB/s on PCIe x16 Gen 2
- Use with caution!!
  - Allocating too much page-locked memory can reduce overall system performance and stability
  - Test systems and learn their limits

#### Live demo

BandwidthTest CUDA SDK example

#### **Overlap data transfers and computation**

- Async and stream APIs allow overlap of H2D or D2H data transfer with computation
  - CPU computation can overlap data transfers on all CUDA capable devices
  - Kernel computation can overlap data transfers on devices with "Concurrent copy and execution" (roughly compute capability 1.1)
- Stream = sequence of operations that execute in order on GPU
  - Operations from different streams can be interleaved
  - Stream ID used as argument to async calls and kernel launches
  - If not used, everything happens in stream 0





- Asynchroneous host-device memory copy returns control immediately to CPU
  - cudaMemcpyAsync(dst, src, size, dir, stream);
  - Requires pinned host memory (allocated with cudaMallocHost())
- Overlap CPU computation with data transfer
  - cudaMemcpyAsync(a\_d, a\_h, size, cudaMemcpyHostToDevice, 0);
  - cpuFunction();
  - cudaThreadSynchronize();
  - kernel<<<grid, block>>>(dst);
- Live demo
  - streamTest

overlapped



- Kernel based
  - Implicit barrier between kernel invocations in the same stream
- Context based
  - cudaThreadSynchronize();
    - Blocks until all previously issued CUDA calls from a CPU thread complete
- Stream based
  - cudaStreamSynchronize(streamID);
    - Blocks until all CUDA calls issued to given stream complete
  - cudaStreamQuery(streamID);
    - Indicates whether stream is idle
    - Returns cudaSuccess, cudaErrorNotReady, ...
    - Does not block CPU thread

## **GPU/CPU** synchronization



- Stream based using events
  - Event = simple label created by cudaEventCreate(&(cudaEvent\_t e));
  - Events can be inserted into streams
    - cudaEventRecord(event, streamID);
  - Event is recorded then GPU reaches it in a stream
    - Recorded = assigned a timestamp (GPU clocktick)
    - Useful for fine-granular timing
  - cudaEventSynchronize(event);
    - Blocks until given event is recorded
  - cudaEventQuery(event);
    - Indicates whether event has recorded
    - Returns cudaSuccess, cudaErrorNotReady, ...
    - Does not block CPU thread

#### Outline



- Overview
- Hardware
- Memory optimizations
  - Data transfers between host and device
  - Device memory optimizations
    - Matrix transpose study

Measuring performance - effective bandwidth Coalescing Shared memory bank conflicts Partition camping

- Execution configuration optimizations
- Instruction optimizations
- Summary



- Transpose 2048x2048 matrix of floats
- Performed out-of-place
  - Separate input and output matrices
- Use tile of 32x32 elements, block of 32x8 threads
  - Each thread processes 4 matrix elements
  - In general tile and block size are fair game for optimization

#### • Process

- Get the right answer
- Measure effective bandwidth (relative to theoretical or reference case)
- Address global memory coalescing, shared memory bank conflicts, and partition camping while repeating above steps

#### **Theoretical** bandwidth



• Device bandwidth of GTX 280



- Specs report 141 GB/s
  - Use 10^9 B/GB conversion rather than 1024^3
  - Whichever you use, be consistent



• Transpose effective bandwidth



- Reference case matrix copy
  - Transpose operates on tiles need better comparison than raw device bandwidth
  - Look at effective bandwidth of copy that uses tiles



```
_global__ void copy(float *odata, float *idata, int width,
int height)
```

```
int xIndex = blockIdx.x * TILE_DIM + threadIdx.x;
int yIndex = blockIdx.y * TILE_DIM + threadIdx.y;
int index = xIndex + width*yIndex;
```

```
for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS) {
    odata[index+i*width] = idata[index+i*width];</pre>
```

idata

odata



Elements copied by a half-warp of threads

TILE\_DIM = 32 BLOCK\_ROWS = 8

32x32 tile 32x8 thread block

> idata and odata in global memory

## Matrix copy kernel timing



- Measure elapsed time over loop
- Looping/timing done in two ways:
  - Over kernel launches (nreps = 1)
    - Includes launch/indexing overhead
  - Within the kernel over loads/stores (nreps > 1)
    - Amortizes launch/indexing overhead



#### Similar to copy

Input and output matrices have different indices

\_global\_\_ void transposeNaive(float \*odata, float\* idata, int width, int height, int nreps)

```
int xIndex = blockIdx.x * TILE_DIM + threadIdx.x;
int yIndex = blockIdx.y * TILE_DIM + threadIdx.y;
```

```
int index_in = xIndex + width * yIndex;
int index_out = yIndex + height * xIndex;
```

```
for (int r=0; r < nreps; r++) {
  for (int i=0; i<TILE_DIM; i+=BLOCK_ROWS) {
     odata[index_out+i] = idata[index_in+i*width];
  }
}</pre>
```





Effective Bandwidth (GB/s) 2048x2048, GTX 280

|                 | Loop over<br>kernel | Loop in kernel |
|-----------------|---------------------|----------------|
| Simple Copy     | 96.9                | 81.6           |
| Naïve Transpose | 2.2                 | 2.2            |

#### Outline



- Overview
- Hardware
- Memory optimizations

- Data transfers between host and device
- Device memory optimizations
  - Matrix transpose study Measuring performance - effective bandwidth Coalescing Shared memory bank conflicts Partition camping
- Execution configuration optimizations
- Instruction optimizations
- Summary

#### Coalescing



 Global memory access of 32, 64, or 128-bit words by a half-warp of threads can result in as few as one (or two) transaction(s) if certain access requirements are met

64B aligned segment (16 floats)

128B aligned segment (32 floats)

- Depends on compute capability
  - 1.0 and 1.1 have stricter access requirements

Examples - float (32-bit) data

**Global Memory** 



Half-warp of threads

#### Coalescing

U technische universität dortmund

- Compute capability 1.0 and 1.1
  - K-th thread must access k-th word in the segment (or k-th word in two contiguous 128B segments for 128-bit words)
  - Not all threads need to participate



Out of sequence – 16 transactions



Coalesces - 1 transaction

Misaligned – 16 transactions





J technische universität dortmund

- Compute capability 1.2 and higher
  - Coalescing is achieved for any pattern of addresses that fits into a segment of size: 32B for 8-bit words, 64B for 16-bit words, 128B for 32and 64-bit words
  - Smaller transactions may be issued to avoid wasted bandwidth due to unused words



1 transaction - 64B segment

2 transactions - 64B and 32B segments



1 transaction - 128B segment





• Naïve transpose coalesces reads, but not writes



Elements transposed by a half-warp of threads

### Take advantage of shared memory



- Hundreds of times faster than global memory
- Threads can cooperate via shared memory
- Use one (a few) threads to load or compute data shared by all threads
- Use it to avoid non-coalesced access
  - Stage loads and stores in shared memory to re-order non-coalesceable addressing



- Access columns of a tile in shared memory to write contiguous data to global memory
- Requires \_\_\_\_\_syncthreads() since threads write data read by other threads



Elements transposed by a half-warp of threads



\_global\_\_ void transposeCoalesced(float \*odata, float \*idata, int width, int height, int nreps)

```
__shared__ float tile[TILE_DIM][TILE_DIM];
```

```
int xIndex = blockIdx.x * TILE_DIM + threadIdx.x;
int yIndex = blockIdx.y * TILE_DIM + threadIdx.y;
int index_in = xIndex + (yIndex)*width;
```

```
xIndex = blockIdx.y * TILE_DIM + threadIdx.x;
yIndex = blockIdx.x * TILE_DIM + threadIdx.y;
int index_out = xIndex + (yIndex)*height;
```

```
for (int r=0; r < nreps; r++) {
  for (int i=0; i<TILE_DIM; i+=BLOCK_ROWS) {
    tile[threadIdx.y+i][threadIdx.x] = idata[index_in+i*width];
  }</pre>
```

```
_syncthreads();
```

```
for (int i=0; i<TILE_DIM; i+=BLOCK_ROWS) {
    odata[index_out+i*height] = tile[threadIdx.x][threadIdx.y+i];
}</pre>
```

{


Effective Bandwidth (GB/s) 2048x2048, GTX 280

|                     | Loop over kernel | Loop in kernel |
|---------------------|------------------|----------------|
| Simple Copy         | 96.9             | 81.6           |
| Shared Memory Copy  | 80.9             | 81.1           |
| Naïve Transpose     | 2.2              | 2.2            |
| Coalesced Transpose | 16.5             | 17.1           |

### Outline



- Overview
- Hardware
- Memory Optimizations
  - Data transfers between host and device
  - Device memory optimizations
    - Matrix transpose study Measuring performance - effective bandwidth Coalescing Shared memory bank conflicts Partition camping
- Execution Configuration Optimizations
- Instruction Optimizations
- Summary

- Many threads accessing memory
  - Therefore, memory is divided in banks
  - Successive 32-bit words assigned to successive banks
- Each bank can serve one address per cycle
  - A memory can service as many simultaneous accesses as it has banks
- Multiple simultaneous accesses to a bank result in a bank conflict
  - Conflicting addresses are serialized













# **Shared memory bank conflicts**



- Shared memory is ~ as fast as registers
  - If there are no bank conflicts
  - warp\_serialize profiler signal
- The fast case
  - If all threads of a half-warp access different banks, there are no bank conflicts
  - If all threads of a half-warp read the identical address, there is no bank conflict (broadcast)

#### • The slow case

- Bank conflict: multiple threads in the same half-warp access the same bank
- Must serialize the accesses
- Cost = max # of simultaneous accesses to a single bank

# **Bank conflicts in transpose**



- 32x32 shared memory tile of floats
  - Data in columns k and k+16 are in same bank
  - 16-way bank conflict reading half columns in tile
- Solution pad shared memory array
  - \_\_\_\_shared\_\_\_ float tile[TILE\_DIM][TILE\_DIM+1];
  - Data in anti-diagonals are in same bank



Elements transposed by a half-warp of threads



# Effective Bandwidth (GB/s) 2048x2048, GTX 280

|                              | Loop over<br>kernel | Loop in kerne |
|------------------------------|---------------------|---------------|
| Simple Copy                  | 96.9                | 81.6          |
| Shared Memory Copy           | 80.9                | 81.1          |
| Naïve Transpose              | 2.2                 | 2.2           |
| Coalesced Transpose          | 16.5                | 17.1          |
| Bank Conflict Free Transpose | 16.6                | 17.2          |

### Outline



- Overview
- Hardware
- Memory optimizations
  - Data transfers between host and device
  - Device memory optimizations
    - Matrix transpose study

Measuring performance - effective bandwidth Coalescing Shared memory bank conflicts Partition camping

- Execution configuration optimizations
- Instruction optimizations
- Summary



- Global memory accesses go through partitions
  - 6 partitions on 8-series GPUs, 8 partitions on 10-series GPUs
  - Successive 256-byte regions of global memory are assigned to successive partitions
- For best performance:
  - Simultaneous global memory accesses GPU-wide should be distributed evenly amongst partitions
- Partition camping occurs when global memory accesses at an instant use a subset of partitions
  - Directly analogous to shared memory bank conflicts, but on a larger scale

# Partition camping in transpose



- Partition width = 256 bytes = 64 floats
  - Twice size of tile
- On GTX 280 (8 partitions), data 2K apart map to same partition
  - 2048 floats divides evenly by 2kB => columns of matrices map to same partition



# **Partition camping solutions**



- Pad matrices (by two tiles)
  - In general might be expensive (prohibitive) memory-wise
- Diagonally (virtually) reorder blocks
  - Interpret blockIdx.y as different diagonal slices and blockIdx.x as distance along a diagonal



blockId = gridDim.x \* blockIdx.y + blockIdx.x

### **Diagonal** transpose

{



\_\_global\_\_ void transposeDiagonal(float \*odata, float \*idata, int width, int height, int nreps)

```
__shared__ float tile[TILE_DIM][TILE_DIM+1];
```

```
int blockIdx_y = blockIdx.x;
int blockIdx_x = (blockIdx.x+blockIdx.y)%gridDim.x;
```

```
int xIndex = blockIdx_x * TILE_DIM + threadIdx.x;
int yIndex = blockIdx_y * TILE_DIM + threadIdx.y;
int index_in = xIndex + (yIndex)*width;
```

```
xIndex = blockIdx_y * TILE_DIM + threadIdx.x;
yIndex = blockIdx_x * TILE_DIM + threadIdx.y;
int index_out = xIndex + (yIndex)*height;
```

```
for (int r=0; r < nreps; r++) {
  for (int i=0; i<TILE_DIM; i+=BLOCK_ROWS) {
    tile[threadIdx.y+i][threadIdx.x] = idata[index_in+i*width];
  }
  ___syncthreads();
  for (int i=0; i<TILE_DIM; i+=BLOCK_ROWS) {
    odata[index_out+i*height] = tile[threadIdx.x][threadIdx.y+i];
}</pre>
```

Add lines to map diagonal to Cartesian coordinates

Replace blockldx.x with blockldx\_x, blockldx.y with blockldx\_y



- Previous slide for square matrices
- More generally

```
if (width == height) {
    blockldx_y = blockldx.x;
    blockldx_x = (blockldx.x+blockldx.y)%gridDim.x;
} else {
    int bid = blockldx.x + gridDim.x*blockldx.y;
    blockldx_y = bid%gridDim.y;
    blockldx_x = ((bid/gridDim.y)+blockldx_y)%gridDim.x;
}
```



# Effective Bandwidth (GB/s) 2048x2048, GTX 280

|                              | Loop over kernel | Loop in kernel |
|------------------------------|------------------|----------------|
| Simple Copy                  | 96.9             | 81.6           |
| Shared Memory Copy           | 80.9             | 81.1           |
| Naïve Transpose              | 2.2              | 2.2            |
| Coalesced Transpose          | 16.5             | 17.1           |
| Bank Conflict Free Transpose | 16.6             | 17.2           |
| Diagonal                     | 69.5             | 78.3           |



- Coalescing and shared memory bank conflicts are small-scale phenomena
  - Deal with memory access within half-warp
  - Problem-size independent
- Partition camping is a large-scale phenomena
  - Deals with simultaneous memory accesses by warps on different multiprocessors
  - Problem size dependent
    - Wouldn't see in (2048+32)^2 matrix
- Coalescing is generally the most critical

### Outline



- Overview
- Hardware
- Memory optimizations
  - Data transfers between host and device
  - Device memory optimizations
    - Matrix transpose study
    - Textures
- Execution configuration optimizations
- Instruction optimizations
- Summary





Read-only access!

### Wrap

Out-of-bounds coordinate is wrapped (modulo arithmetic)



# Clamp

Out-of-bounds coordinate is replaced with the closest boundary



### Two CUDA texture types



- Bound to linear memory
  - Standard way to get cached access to 1D arrays
  - Global memory address is bound to a texture
  - Only 1D
  - Integer addressing
  - No filtering, no addressing modes
- Bound to CUDA arrays
  - Full graphics functionality
  - CUDA array is bound to a texture
  - 1D, 2D, 3D
  - Float addressing (size-based or normalized)
  - Filtering
  - Address modes (clamping, repeat)



#### • Host (CPU) code

- Allocate/obtain memory (global linear or CUDA array)
- Create a texture reference object
  - Currently must be at file scope
- Bind the texture reference to memory/array
- Compute
- Unbind the texture reference, free resources

#### • Device (kernel) code

- Fetch using texture reference
- Linear memory textures
  - tex1Dfetch()
- Array textures
  - tex1D() or tex2D() or tex3D()



- Overview
- Hardware
- Memory optimizations
- Execution configuration optimizations
- Instruction optimizations
- Summary





- Thread instructions are executed sequentially (in order)
  - So executing other warps is the only way to hide latencies and keep the hardware busy
- Occupancy
  - Number of warps running concurrently on a multiprocessor divided by maximum number of warps that can run concurrently

#### Limited by resource usage

- Registers
- Shared memory

# **Grid/Block size heuristics**



- # of blocks > # of multiprocessors
  - So all multiprocessors have at least one block to execute
- # of blocks / # of multiprocessors > 2
  - Multiple blocks can run concurrently in a multiprocessor
  - Blocks that aren't waiting at a \_\_\_\_syncthreads() barrier keep the hardware busy
  - Subject to resource availability (registers, shared memory)
- # of blocks > 100 to scale to future devices
  - Blocks executed in pipeline fashion
  - 1000 blocks per grid will scale across multiple generations

### **Register** dependency



- Read after write register dependency
  - Instruction's result can be read ~11 cycles later
  - Scenarios: CUDA PTX

| x = y + 5;      | add.f32 \$f3, \$f1, \$f2               |
|-----------------|----------------------------------------|
| z = x + 3;      | add.f32 \$f5, <mark>\$f3</mark> , \$f4 |
|                 |                                        |
| s_data[0] += 3; | ld.shared.f32 \$f3, [\$r31+0]          |
|                 | add.f32 \$f3, \$f3, \$f4               |

- To completely hide the latency
  - Run at least 192 threads (6 warps) per multiprocessor
  - Equivalent: at least 25% occupancy
  - Threads do not have to belong to the same thread block



• Hide latencies by using more threads per multiprocessor

#### • Limiting factors

- Number of registers per kernel
  - 16K (8K on G8x) per SM, partitioned among concurrent threads
- Amount of shared memory
  - 16kB per SM, partitioned among concurrent thread blocks
- Compile with –ptxas-options=v flag
  - Verbose mode, study carefully
- Use –maxregcount=N flag
  - N = desired maximum registers per kernel
  - At some point spilling into local memory may occur
  - Reduces performance, local memory is slow (implemented in global memory)

### **Occupancy calculator**





Dominik Göddeke | TU Dortmund

# **Optimizing threads per block**



- Choose threads per block as a multiple of warp size
  - Avoid wasting computation on under-populated warps (SIMD)
- More threads per block = better memory latency hiding
  - But: fewer registers per thread
  - Kernel invocations can fail if too many registers are used
- Heuristics
  - Minimum: 64 threads per block
    - Only if multiple concurrent blocks
  - 192 or 256 threads a better choice
    - Usually still enough registers to compile and invoke successfully
  - This all depends on your computation, so experiment

# **Occupancy** != performance



• Increasing occupancy does not necessarily increase performance



- Low occupancy microprocessors cannot adequately hide latency on memory-bound kernels
  - It all comes down to arithmetic intensity and available parallelism

## **Parameterize** your application



- Parameterization helps adaptation to different GPUs
- GPUs vary in many ways
  - # of multiprocessors
  - Memory bandwidth
  - Shared memory size
  - Register file size
  - Max. Threads per block
- You can even make apps self-tuning
  - Like FFTW or ATLAS
  - Experiment mode discovers and saves optimal configuration
    - Recall transpose example



- Overview
- Hardware
- Memory optimizations
- Execution configuration optimizations
- Instruction optimizations
- Summary

# **CUDA instruction performance**



- Instruction cycles (per warp) = sum of
  - Operand read cycles
  - Instruction execution cycles
  - Result update cycles
- Therefore instruction throughput depends on
  - Nominal instruction throughput
  - Memory latency
  - Memory bandwidth
- Cycle refers to the multiprocessor clock rate
  - 1.3 GHz on GTX 280

# **Maximizing instruction throughput**



- Maximize use of high-bandwidth memory
  - Maximize use of shared memory
  - Minimize accesses to global memory
  - Maximize coalescing of global memory accesses
- Optimize performance by overlapping memory accesses with hardware computations
  - High arithmetic intensity programs
    - High ratio of math to memory transactions
  - Many concurrent threads

# **Arithmetic instruction throughput**



- int and float add, shift, min, max and float mul, mad
  - 4 cycles per warp
  - int multiply is by default 32-bit
  - Requires multiple cycles per warp
  - Use \_\_mul24(), \_\_umul24() intrinsics for 4-cycle 24-bit int multiply
- Integer divide and modulo are more expensive
  - Compiler tries to convert literal power-of-two divides to shifts
  - Be explicit in cases where compiler can't tell that divisor is power of 2
  - Useful trick: foo % n == foo & (n-1) if n is a power of two



- Intrinsics reciprocal, reciprocal square root, sin/cos, log, exp prefixed with "\_\_\_\_"
  - 16 cycles per warp
  - Example: \_\_rcp()
- Other functions are combinations of the above
  - y/x == rcp(x) \* y takes 20 cycles per warp
  - Sqrt(x) == x\*rsqrt(x) takes 20 cycles per warp

# Runtime math library



- There are two types of runtime math operations
  - \_\_\_func(): direct mapping to hardware ISA
    - Fast
    - But lower accuracy (see progguide)
    - Example: \_\_\_\_sin(x)
  - func(): compiles to multiple instructions
    - Slower but higher accuracy (5 ULP or less)
    - Example: sin(x)
- \_use-fast-math compiler flag
  - Forces every func() to compile to \_\_\_func()
- Double precision always IEEE-754 compliant



- Many, many variables
  - Hardware, compiler, optimization flags...
- CPU operations aren't strictly limited to 0.5 ulp
  - Sequences of operations can be more accurate due to 80-bit extended precision ALUs
  - CPU-SSE code usually closest to GPU code
- Floating point arithmetic is not associative and commutative!


- Symbolic
  - (x+y)+z = x+(y+z)
- Not necessarily true for floating-point addition
  - Try x=10^30, y =  $-10^30$  and z=1 in the above equation
- Parallelizing computations
  - Potentially changes the order of operations
  - Results may not exactly match sequential results
  - This is not specific to CUDA or GPU
  - Inherent part of parallel computation

## **Control flow instructions**



- Main performance concern with branching is divergence
  - Overhead of simple branch: ~4 cycles per warp
  - Divergence: Threads within a single warp take different paths
  - Different execution paths must be serialized
- Avoid divergence when branch condition is a function of the thread ID
  - Example with divergence
    - If (threadIdx.x > 2) { ... }
    - Branch granularity < warp size
  - Example without divergence
    - If (threadIdx.x / WARP\_SIZE > 2) { ... }
    - Branch granularity is a a whole multiple of warp size



- GPU hardware can achieve great performance on data-parallel computations if you follow a few simple guidelines
  - Use parallelism efficiently
  - Coalesce memory accesses if possible
  - Take advantage of shared memory
  - Explore other memory spaces
    - Texture
    - Constant
  - Reduce bank conflicts
  - Avoid partition camping