Parallel Computing & SIMD Basics
Introduction to parallelism fundamentals: data vs task parallelism, SIMD vector operations, throughput-oriented architectures, and essential parallelism metrics.
Why parallel? Because single-core performance improvements have slowed dramatically since 2005. The only path to higher performance is parallelism - doing multiple things simultaneously. Let's master the fundamentals!
The Parallelism Landscape
Two Fundamental Types of Parallelism
                 PARALLELISM
                      │
         ┌────────────┴────────────┐
         │                         │
  DATA PARALLELISM         TASK PARALLELISM
  (Same operation,         (Different operations,
   different data)          independent data)
Data Parallelism: Same Operation, Multiple Data
The most common and easiest to exploit!
// Sequential (scalar)
for (int i = 0; i < n; i++) {
c[i] = a[i] + b[i]; // Same ADD operation on different data
}
// Data parallel (vector)
// Process 8 elements simultaneously!
for (int i = 0; i < n; i += 8) {
vec_c[i:i+7] = vec_a[i:i+7] + vec_b[i:i+7]; // 8 ADDs at once
}
Characteristics:
- ✅ Easy to parallelize (no dependencies between iterations)
- ✅ Scales well (more data → more parallelism)
- ✅ Hardware-friendly (SIMD, GPUs love this!)
Where you find it:
- Array operations
- Image processing (same filter on each pixel)
- ML inference (matrix multiply, activation functions)
Task Parallelism: Different Operations, Independent Tasks
Different operations running concurrently
#include <thread>
// Task 1: Process incoming requests
void handleRequests() { /* ... */ }
// Task 2: Update database
void syncDatabase() { /* ... */ }
// Task 3: Send notifications
void sendNotifications() { /* ... */ }
// Run all three tasks in parallel on different cores
std::thread t1(handleRequests);
std::thread t2(syncDatabase);
std::thread t3(sendNotifications);
t1.join(); t2.join(); t3.join(); // Wait for all three to finish
Characteristics:
- ⚠️ Harder to parallelize (need independent tasks)
- ⚠️ Limited by number of distinct tasks
- ⚠️ Requires synchronization if tasks share data
Where you find it:
- Web servers (handle multiple requests)
- Game engines (physics, rendering, AI in parallel)
- Video encoding (different frames in parallel)
SIMD: Single Instruction, Multiple Data
SIMD is the foundation of modern high-performance computing!
The Core Concept
One instruction operates on multiple data elements in parallel
Scalar (Traditional):        SIMD (Vector):
┌──────────┐                 ┌──────────────────────────┐
│   ADD    │                 │       VADD (8-way)       │
│  a → c   │                 │   [a0..a7] → [c0..c7]    │
│  b → c   │                 │   [b0..b7] → [c0..c7]    │
└──────────┘                 └──────────────────────────┘
  1 cycle                      1 cycle, 8 operations!
SIMD Hardware
Vector Registers (Wide Registers)
// x86 SIMD evolution
MMX:     64-bit  (integers only)        [Obsolete]
SSE:     128-bit (4 × 32-bit floats)    [Common]
AVX:     256-bit (8 × 32-bit floats)    [Modern Intel/AMD]
AVX-512: 512-bit (16 × 32-bit floats)   [Server CPUs]
// ARM
NEON: 128-bit (4 × 32-bit floats)       [Mobile/Apple]
SVE:  Scalable (up to 2048-bit)         [Server ARM]
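Before relying on any of these extensions, portable code usually checks what the CPU actually supports at runtime. Here is a minimal sketch using the GCC/Clang builtins __builtin_cpu_init and __builtin_cpu_supports (x86 only; the feature strings come from the GCC documentation, not from this course's tooling):
#include <stdio.h>

int main(void) {
    __builtin_cpu_init(); // Initialize the CPU feature probe (harmless if already done)
    if (__builtin_cpu_supports("avx512f"))
        printf("AVX-512: 512-bit vectors available\n");
    else if (__builtin_cpu_supports("avx2"))
        printf("AVX2: 256-bit vectors available\n");
    else if (__builtin_cpu_supports("sse2"))
        printf("SSE2: 128-bit vectors available\n");
    else
        printf("No SIMD extensions detected\n");
    return 0;
}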
Example: AVX-512 register
┌────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┐
│ 0  │ 1  │ 2  │ 3  │ 4  │ 5  │ 6  │ 7  │ 8  │ 9  │ 10 │ 11 │ 12 │ 13 │ 14 │ 15 │
└────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┘
16 × 32-bit floats in ONE register
Vector ALUs (Parallel Execution Units)
Traditional CPU ALU:          SIMD ALU:
       ┌────┐                      ┌────┬────┬────┬────┐
  a →  │ +  │ → c             a →  │ +  │ +  │ +  │ +  │ → c
  b →  └────┘                 b →  └────┴────┴────┴────┘
                                    4 parallel additions
SIMD Operations
Basic Arithmetic
// C code (scalar)
for (int i = 0; i < n; i++) {
c[i] = a[i] + b[i];
}
// x86 SSE intrinsics (4-way SIMD); assumes 16-byte-aligned arrays and n divisible by 4
__m128 *va = (__m128*)a;
__m128 *vb = (__m128*)b;
__m128 *vc = (__m128*)c;
for (int i = 0; i < n/4; i++) {
vc[i] = _mm_add_ps(va[i], vb[i]); // 4 adds in 1 instruction
}
// AVX (8-way SIMD); assumes 32-byte-aligned arrays and n divisible by 8
__m256 *va = (__m256*)a;
__m256 *vb = (__m256*)b;
__m256 *vc = (__m256*)c;
for (int i = 0; i < n/8; i++) {
vc[i] = _mm256_add_ps(va[i], vb[i]); // 8 adds in 1 instruction
}
Speedup: Up to 8x (AVX) or 16x (AVX-512) for compute-bound code!
Fused Multiply-Add (FMA)
The most important SIMD operation for ML!
// FMA: a = a + (b × c) in one instruction
// Crucial for matrix multiply, convolutions
// Scalar (2 operations)
a = a + b * c;
// AVX FMA (8 operations in 1 instruction!)
__m256 va, vb, vc;
va = _mm256_fmadd_ps(vb, vc, va); // va = va + vb * vc (8-way)
Why FMA matters:
- ✅ 2x throughput (multiply + add in 1 cycle)
- ✅ Higher precision (no intermediate rounding)
- ✅ Enables 2x peak FLOPS vs separate multiply and add (see the dot-product sketch below)
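To make this concrete, here is a minimal sketch of a dot product built around _mm256_fmadd_ps, the core pattern inside dense layers and convolutions. It assumes AVX2+FMA hardware, 32-byte-aligned inputs, and n divisible by 8; the function name dot_fma is just for illustration:
#include <immintrin.h>

float dot_fma(const float *a, const float *b, int n) {
    __m256 acc = _mm256_setzero_ps();
    for (int i = 0; i < n; i += 8) {
        __m256 va = _mm256_load_ps(&a[i]);
        __m256 vb = _mm256_load_ps(&b[i]);
        acc = _mm256_fmadd_ps(va, vb, acc);   // acc += va * vb (8 lanes at once)
    }
    // Reduce the 8 partial sums to one scalar
    __m128 lo = _mm256_castps256_ps128(acc);
    __m128 hi = _mm256_extractf128_ps(acc, 1);
    __m128 s  = _mm_add_ps(lo, hi);           // 4 partial sums
    s = _mm_hadd_ps(s, s);                    // 2 partial sums
    s = _mm_hadd_ps(s, s);                    // final sum in lane 0
    return _mm_cvtss_f32(s);
}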
Horizontal Operations
// Reduction: sum all elements
__m256 v = _mm256_load_ps(data); // [a0, a1, a2, a3, a4, a5, a6, a7]
// Horizontal add: sum adjacent pairs
v = _mm256_hadd_ps(v, v); // pairwise sums within each 128-bit lane
v = _mm256_hadd_ps(v, v); // low lane now holds a0+a1+a2+a3, high lane a4+a5+a6+a7
// Combine the two 128-bit lanes for the final sum
__m128 lo = _mm256_castps256_ps128(v);
__m128 hi = _mm256_extractf128_ps(v, 1);
float result = _mm_cvtss_f32(_mm_add_ss(lo, hi)); // Extract final sum
SIMD Challenges
1. Data Alignment
SIMD works best with aligned data:
// Bad: Unaligned (slower)
float *a = malloc(n * sizeof(float)); // May not be aligned
// Good: Aligned (faster)
float *a = aligned_alloc(32, n * sizeof(float)); // 32-byte aligned (AVX)
// Or use compiler attributes
alignas(32) float a[1024];
Why alignment matters: Unaligned loads may cross cache lines → up to 2x slower! (When alignment can't be guaranteed, the unaligned intrinsics _mm256_loadu_ps / _mm256_storeu_ps still work, at a small cost.)
2. Control Flow (Branches)
SIMD executes the same instruction on all lanes → branches are problematic!
// Problematic: conditional per element
for (int i = 0; i < n; i++) {
if (a[i] > 0) { // Different per element!
c[i] = a[i];
} else {
c[i] = -a[i];
}
}
// SIMD solution: predicated execution (mask)
__m256 va = _mm256_load_ps(a);
__m256 zero = _mm256_setzero_ps();
__m256 mask = _mm256_cmp_ps(va, zero, _CMP_GT_OQ); // a > 0?
__m256 pos = va; // Positive path
__m256 neg = _mm256_sub_ps(zero, va); // Negative path
// Blend based on mask (no branch!)
__m256 vc = _mm256_blendv_ps(neg, pos, mask);
Cost: Both paths execute, then select → still faster than scalar if the branch is unpredictable!
3. Gather/Scatter (Non-contiguous Access)
// Bad: Indirect/strided access
for (int i = 0; i < n; i++) {
c[i] = a[indices[i]]; // Random access!
}
// SIMD gather (AVX2+)
__m256i vidx = _mm256_load_si256((__m256i*)indices); // Load 8 indices
__m256 va = _mm256_i32gather_ps(a, vidx, 4); // Gather from 8 locations
// Still slower than sequential access, but better than scalar
Performance: Gather/scatter are much slower than contiguous loads (10x+ latency)
Why SIMD Matters for Machine Learning
ML Workloads are SIMD-Friendly
Matrix Multiply (Core of neural networks):
// C = A × B
for (int i = 0; i < M; i++) {
for (int j = 0; j < N; j++) {
for (int k = 0; k < K; k++) {
C[i][j] += A[i][k] * B[k][j]; // Perfect for FMA!
}
}
}
// AVX-512 FMA (16-way)
// 16 multiply-adds per instruction
// 2 FLOPs per element × 16 elements = 32 FLOPs per instruction
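As a rough illustration of how the inner loop maps onto FMA hardware, here is a sketch of an AVX2/FMA version (8-wide rather than the 16-wide AVX-512 figures above). It assumes row-major arrays, N divisible by 8, and a zero-initialized C; real GEMM libraries add cache blocking and register tiling on top of this idea:
#include <immintrin.h>

// Sketch: C = A × B with A of size MxK, B of size KxN, C of size MxN (row-major).
void matmul_avx(const float *A, const float *B, float *C, int M, int N, int K) {
    for (int i = 0; i < M; i++) {
        for (int k = 0; k < K; k++) {
            __m256 a = _mm256_set1_ps(A[i*K + k]);        // broadcast A[i][k] to 8 lanes
            for (int j = 0; j < N; j += 8) {
                __m256 b = _mm256_loadu_ps(&B[k*N + j]);  // 8 elements of row k of B
                __m256 c = _mm256_loadu_ps(&C[i*N + j]);
                c = _mm256_fmadd_ps(a, b, c);             // C[i][j..j+7] += A[i][k] * B[k][j..j+7]
                _mm256_storeu_ps(&C[i*N + j], c);
            }
        }
    }
}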
Activation Functions:
// ReLU: max(0, x) - element-wise operation
for (int i = 0; i < n; i++) {
output[i] = max(0.0f, input[i]);
}
// SIMD ReLU (8-way with AVX)
__m256 zero = _mm256_setzero_ps();
for (int i = 0; i < n; i += 8) {
__m256 v = _mm256_load_ps(&input[i]);
v = _mm256_max_ps(v, zero); // 8 max operations
_mm256_store_ps(&output[i], v);
}
Softmax:
// Exponential + normalization - vectorizable
for (int i = 0; i < n; i++) {
sum += exp(x[i]);
}
for (int i = 0; i < n; i++) {
output[i] = exp(x[i]) / sum; // All SIMD-friendly!
}
Modern CPU SIMD Performance
Intel Skylake-X (AVX-512):
Peak FLOPS (per core):
- Clock: 3 GHz
- FMA units: 2
- Vector width: 512-bit (16 floats)
FLOPs per FMA: 2 × 16 = 32
Peak = 3 GHz × 2 units × 32 FLOPs = 192 GFLOPS/core
48-core server: 192 × 48 ≈ 9.2 TFLOPS!
Compare to scalar (no SIMD):
Peak = 3 GHz × 2 units × 1 FLOP = 6 GFLOPS/core
Speedup: 32x (16x from the vector width, 2x from FMA)!
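The same arithmetic as a tiny sketch you can tweak (the figures are the illustrative Skylake-X numbers used above):
#include <stdio.h>

// Peak FLOPS = clock × FMA units × (2 FLOPs per FMA × SIMD lanes)
int main(void) {
    double clock_ghz = 3.0;
    int fma_units = 2;
    int lanes = 16;                                          // AVX-512: 16 x 32-bit floats
    double per_core = clock_ghz * fma_units * 2 * lanes;     // GFLOPS per core
    printf("Per core: %.0f GFLOPS\n", per_core);             // 192
    printf("48 cores: %.1f TFLOPS\n", per_core * 48 / 1000); // ~9.2
    return 0;
}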
Throughput-Oriented Architectures
Latency vs Throughput Architectures
┌───────────────────────┬──────────────────────┬──────────────────────┐
│                       │ Latency-Oriented     │ Throughput-Oriented  │
│                       │ (CPU)                │ (GPU)                │
├───────────────────────┼──────────────────────┼──────────────────────┤
│ Goal                  │ Minimize latency     │ Maximize throughput  │
│ Design Philosophy     │ Few powerful cores   │ Many simple cores    │
│ Clock Speed           │ 3-5 GHz              │ 1-2 GHz              │
│ Out-of-Order          │ Yes (aggressive)     │ No (in-order)        │
│ Branch Prediction     │ Sophisticated        │ Simple/None          │
│ Cache                 │ Large (32 MB L3)     │ Small (explicit)     │
│ Memory Latency Hiding │ OoO + prefetch       │ Thread switching     │
│ Thread Count          │ 2-4 per core         │ 1000s                │
│ Best For              │ Serial, branchy code │ Parallel, SIMD code  │
└───────────────────────┴──────────────────────┴──────────────────────┘
Throughput Design Principles
1. Many Simple Cores > Few Complex Cores
Latency CPU:
┌──────────────────────────────┐
│  Complex Core (200mm²)       │
│  - OoO execution             │
│  - Branch prediction         │
│  - Large cache               │
│  Performance: 1 task fast    │
└──────────────────────────────┘
Throughput GPU (same area):
┌────┬────┬────┬────┬────┬─────┐
│ C1 │ C2 │ C3 │ C4 │ C5 │ ... │  100 simple cores
│ C6 │ C7 │ C8 │ C9 │ C10│ ... │  Each core: 2mm²
│ ...│ ...│ ...│ ...│ ...│ ... │  In-order, simple
└────┴────┴────┴────┴────┴─────┘
Performance: 100 tasks at once (if the work is parallel!)
2. Hide Latency with Thread Switching
// CPU approach: Out-of-order execution
// Issue instructions from instruction window
// Execute when ready, reorder completion
// GPU approach: Massive threading
// When thread stalls (memory access):
// → Switch to another ready thread instantly
// → Keep compute units busy
// → Original thread resumes when data arrives
// Example: 32 warps per SM
// Each warp: 32 threads
// Total: 1024 threads per SM
// Memory latency: 200 cycles? No problem - 31 other warps to run!
3. Explicit Memory Hierarchy
// CPU: Implicit caching (hardware managed)
data = array[i]; // Automatic L1/L2/L3 cache management
// GPU: Explicit shared memory (programmer managed)
__shared__ float shared_data[256]; // On-chip scratchpad
// Programmer loads data to shared memory
// Much faster than repeated global memory access
Parallelism Metrics
Speedup
Definition: How much faster is parallel version compared to serial?
Speedup = Time(serial) / Time(parallel)
Example:
Serial: 100 seconds
8-core parallel: 15 seconds
Speedup = 100 / 15 = 6.67x
Ideal speedup (linear): Speedup = Number of processors
2 cores → 2x speedup
4 cores → 4x speedup
8 cores → 8x speedup (rare in practice!)
Real-world speedup: Usually sublinear due to:
- Serial portions (Amdahl's Law; see the sketch after this list)
- Communication overhead
- Synchronization costs
- Load imbalance
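Amdahl's Law quantifies the first of these limits: if a fraction p of the runtime is parallelizable, Speedup(N) = 1 / ((1 - p) + p/N). A quick sketch of the arithmetic (the numbers are illustrative):
#include <stdio.h>

// Amdahl's Law: speedup = 1 / ((1 - p) + p / N)
// p = parallel fraction, N = number of processors
double amdahl(double p, int n) {
    return 1.0 / ((1.0 - p) + p / n);
}

int main(void) {
    // Example: 90% of the work is parallelizable
    printf("8 cores:  %.2fx\n", amdahl(0.9, 8));     // ~4.7x
    printf("64 cores: %.2fx\n", amdahl(0.9, 64));    // ~8.8x
    printf("Limit:    %.2fx\n", 1.0 / (1.0 - 0.9));  // 10x, no matter how many cores
    return 0;
}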
Efficiency
Definition: How effectively are processors utilized?
Efficiency = Speedup / Number of Processors
Example (from above):
Speedup = 6.67x on 8 cores
Efficiency = 6.67 / 8 = 83.4%
Interpretation: 83.4% of ideal performance
Strong scaling: Fixed problem size, more processors
1 core: 100s (baseline)
2 cores: 52s (Speedup 1.92x, Efficiency 96%)
4 cores: 28s (Speedup 3.57x, Efficiency 89%)
8 cores: 15s (Speedup 6.67x, Efficiency 83%)
→ Efficiency decreases as overhead dominates
Weak scaling: Increase problem size with processors
1 core: 10K elements, 10s
2 cores: 20K elements, 10s (Perfect weak scaling!)
4 cores: 40K elements, 10s
8 cores: 80K elements, 10s
→ Maintains efficiency by keeping work per core constant
Scalability
How well does speedup grow with more processors?
┌────────────┬─────────┬────────────┬─────────────┐
│ Processors │ Speedup │ Efficiency │ Scalability │
├────────────┼─────────┼────────────┼─────────────┤
│ 1          │ 1.0x    │ 100%       │             │
│ 2          │ 1.9x    │ 95%        │ Excellent   │
│ 4          │ 3.6x    │ 90%        │ Good        │
│ 8          │ 6.4x    │ 80%        │ Decent      │
│ 16         │ 10.7x   │ 67%        │ Poor        │
│ 32         │ 14.2x   │ 44%        │ Very Poor   │
└────────────┴─────────┴────────────┴─────────────┘
Good scalability: Efficiency stays above 75% up to the target scale
Example Calculation
Matrix multiply: 4096Ć4096 matrices
// Serial implementation
Time(serial) = 120 seconds
// Parallel on 8 cores with AVX-512 SIMD
Time(parallel) = 2.5 seconds
// Metrics
Speedup = 120 / 2.5 = 48x
Efficiency = 48 / 8 = 600%
// Wait, >100% efficiency?
// Yes! SIMD gives 16x boost per core
// Combined: 8 cores × 16-way SIMD = 128x theoretical
// Achieved: 48x / 128x theoretical = 37.5% of peak
SIMD Programming Models
1. Intrinsics (Explicit SIMD)
#include <immintrin.h> // Intel intrinsics
// Assumes 32-byte-aligned arrays and n divisible by 8 (see the tail-handling sketch below)
void add_avx(float *a, float *b, float *c, int n) {
for (int i = 0; i < n; i += 8) {
__m256 va = _mm256_load_ps(&a[i]);
__m256 vb = _mm256_load_ps(&b[i]);
__m256 vc = _mm256_add_ps(va, vb);
_mm256_store_ps(&c[i], vc);
}
}
Pros: Full control, predictable performance. Cons: Not portable, verbose, manual optimization.
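One practical detail the example above glosses over: when n is not a multiple of the vector width, the usual pattern is a vector main loop plus a scalar tail. A sketch (using the unaligned loadu/storeu variants so it works for any pointers):
#include <immintrin.h>

void add_avx_any_n(const float *a, const float *b, float *c, int n) {
    int i = 0;
    for (; i + 8 <= n; i += 8) {           // vector main loop
        __m256 va = _mm256_loadu_ps(&a[i]);
        __m256 vb = _mm256_loadu_ps(&b[i]);
        _mm256_storeu_ps(&c[i], _mm256_add_ps(va, vb));
    }
    for (; i < n; i++) {                   // scalar tail (at most 7 iterations)
        c[i] = a[i] + b[i];
    }
}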
2. Auto-Vectorization (Compiler)
// Simple C code
void add(float *a, float *b, float *c, int n) {
for (int i = 0; i < n; i++) {
c[i] = a[i] + b[i]; // Compiler vectorizes this!
}
}
// Compile with optimization
// gcc -O3 -march=native -ftree-vectorize
// Compiler generates AVX/SSE instructions automatically
Pros: Portable, simple code. Cons: Compiler may miss opportunities, less control.
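A common reason the compiler gives up is pointer aliasing: it cannot prove that c does not overlap a or b. A sketch of the usual fix with restrict (C99; C++ compilers typically accept __restrict), plus a GCC flag that reports what was vectorized:
// restrict promises the arrays don't overlap, which often unlocks auto-vectorization
void add_restrict(float * restrict a, float * restrict b,
                  float * restrict c, int n) {
    for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}
// Check the result: gcc -O3 -march=native -fopt-info-vec add.c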
3. SIMD Libraries
// Eigen (C++)
VectorXf a(n), b(n), c(n);
c = a + b; // Automatically uses SIMD!
// NumPy (Python)
c = a + b  # NumPy uses optimized BLAS with SIMD
Pros: High-level, portable, well-optimized. Cons: Less control, potential overhead.
Key Takeaways
✅ Two types of parallelism:
- Data parallelism: same op, different data (SIMD-friendly)
- Task parallelism: different ops, independent data
✅ SIMD processes multiple data elements per instruction
- Modern CPUs: 8-16 wide (AVX/AVX-512)
- Crucial for ML: matrix multiply, activations
✅ Throughput architectures (GPUs) trade single-thread speed for massive parallelism
✅ Speedup and efficiency measure parallel performance
- Speedup = Serial time / Parallel time
- Efficiency = Speedup / Number of processors
✅ SIMD challenges: alignment, branches, non-contiguous access
✅ ML loves SIMD: Matrix ops, element-wise ops are naturally vectorizable
Practice Problems
1. SIMD Speedup
Scalar code: 100 cycles
AVX-256 SIMD (8-way): ? cycles
Assuming perfect vectorization, what's the speedup?
2. Efficiency Calculation
16-core CPU, parallel speedup = 10x
What is the efficiency? What might cause less than ideal speedup?
3. SIMD Suitability
// Which loops can be efficiently SIMD-vectorized?
// Loop A
for (int i = 0; i < n; i++) {
c[i] = a[i] + b[i];
}
// Loop B
for (int i = 1; i < n; i++) {
c[i] = c[i-1] + a[i]; // Dependency!
}
Next Steps
Now that you understand parallel computing basics:
→ GPU Architecture: Dive deep into throughput-oriented design
→ Matrix Multiplication: Apply SIMD to the most important ML kernel
→ TPU Architecture: See how Google optimized for tensor operations
→ Performance Analysis: Measure SIMD effectiveness in real code
Remember: Modern performance is all about exploiting parallelism - both data and task level!