Parallel Computing & SIMD Basics
Introduction to parallelism fundamentals: data vs task parallelism, SIMD vector operations, throughput-oriented architectures, and essential parallelism metrics.
Why parallel? Because single-core performance improvements have slowed dramatically since 2005. The only path to higher performance is parallelism - doing multiple things simultaneously. Let's master the fundamentals!
The Parallelism Landscape
Two Fundamental Types of Parallelism
                 PARALLELISM
                      │
         ┌────────────┴────────────┐
         │                         │
  DATA PARALLELISM         TASK PARALLELISM
  (Same operation,         (Different operations,
   different data)          independent data)
Data Parallelism: Same Operation, Multiple Data
The most common and easiest to exploit!
// Sequential (scalar)
for (int i = 0; i < n; i++) {
c[i] = a[i] + b[i]; // Same ADD operation on different data
}
// Data parallel (vector)
// Process 8 elements simultaneously!
for (int i = 0; i < n; i += 8) {
vec_c[i:i+7] = vec_a[i:i+7] + vec_b[i:i+7]; // 8 ADDs at once
}
Characteristics:
- ✅ Easy to parallelize (no dependencies between iterations)
- ✅ Scales well (more data → more parallelism)
- ✅ Hardware-friendly (SIMD, GPUs love this!)
Where you find it:
- Array operations
- Image processing (same filter on each pixel)
- ML inference (matrix multiply, activation functions)
Task Parallelism: Different Operations, Independent Tasks
Different operations running concurrently
#include <thread>
// Task 1: Process incoming requests
void handleRequests() { /* ... */ }
// Task 2: Update database
void syncDatabase() { /* ... */ }
// Task 3: Send notifications
void sendNotifications() { /* ... */ }
// Run all three tasks in parallel on different cores
std::thread t1(handleRequests);
std::thread t2(syncDatabase);
std::thread t3(sendNotifications);
t1.join(); t2.join(); t3.join(); // Wait for all three to finish
Characteristics:
- ⚠️ Harder to parallelize (need independent tasks)
- ⚠️ Limited by number of distinct tasks
- ⚠️ Requires synchronization if tasks share data
Where you find it:
- Web servers (handle multiple requests)
- Game engines (physics, rendering, AI in parallel)
- Video encoding (different frames in parallel)
SIMD: Single Instruction, Multiple Data
SIMD is the foundation of modern high-performance computing!
The Core Concept
One instruction operates on multiple data elements in parallel
Scalar (Traditional):        SIMD (Vector):
┌──────────┐                 ┌──────────────────────────┐
│   ADD    │                 │       VADD (8-way)       │
│  a → c   │                 │   [a0..a7] → [c0..c7]    │
│  b → c   │                 │   [b0..b7] → [c0..c7]    │
└──────────┘                 └──────────────────────────┘
  1 cycle                      1 cycle, 8 operations!
SIMD Hardware
Vector Registers (Wide Registers)
// x86 SIMD evolution
MMX:     64-bit  (integers only)        [Obsolete]
SSE:     128-bit (4 × 32-bit floats)    [Common]
AVX:     256-bit (8 × 32-bit floats)    [Modern Intel/AMD]
AVX-512: 512-bit (16 × 32-bit floats)   [Server CPUs]
// ARM
NEON: 128-bit (4 × 32-bit floats)       [Mobile/Apple]
SVE:  Scalable (up to 2048-bit)         [Server ARM]
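Before relying on any of these extensions, portable code usually checks what the CPU actually supports at runtime. Here is a minimal sketch using the GCC/Clang builtins __builtin_cpu_init and __builtin_cpu_supports (x86 only; the feature strings come from the GCC documentation, not from this course's tooling):
#include <stdio.h>

int main(void) {
    __builtin_cpu_init(); // Initialize the CPU feature probe (harmless if already done)
    if (__builtin_cpu_supports("avx512f"))
        printf("AVX-512: 512-bit vectors available\n");
    else if (__builtin_cpu_supports("avx2"))
        printf("AVX2: 256-bit vectors available\n");
    else if (__builtin_cpu_supports("sse2"))
        printf("SSE2: 128-bit vectors available\n");
    else
        printf("No SIMD extensions detected\n");
    return 0;
}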
Example: AVX-512 register
┌────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┐
│ 0  │ 1  │ 2  │ 3  │ 4  │ 5  │ 6  │ 7  │ 8  │ 9  │ 10 │ 11 │ 12 │ 13 │ 14 │ 15 │
└────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┘
16 × 32-bit floats in ONE register
Vector ALUs (Parallel Execution Units)
Traditional CPU ALU:          SIMD ALU:
       ┌────┐                      ┌────┬────┬────┬────┐
  a →  │ +  │ → c             a →  │ +  │ +  │ +  │ +  │ → c
  b →  └────┘                 b →  └────┴────┴────┴────┘
                                    4 parallel additions
SIMD Operations
Basic Arithmetic
// C code (scalar)
for (int i = 0; i < n; i++) {
c[i] = a[i] + b[i];
}
// x86 SSE intrinsics (4-way SIMD); assumes 16-byte-aligned arrays and n divisible by 4
__m128 *va = (__m128*)a;
__m128 *vb = (__m128*)b;
__m128 *vc = (__m128*)c;
for (int i = 0; i < n/4; i++) {
vc[i] = _mm_add_ps(va[i], vb[i]); // 4 adds in 1 instruction
}
// AVX (8-way SIMD); assumes 32-byte-aligned arrays and n divisible by 8
__m256 *va = (__m256*)a;
__m256 *vb = (__m256*)b;
__m256 *vc = (__m256*)c;
for (int i = 0; i < n/8; i++) {
vc[i] = _mm256_add_ps(va[i], vb[i]); // 8 adds in 1 instruction
}
Speedup: Up to 8x (AVX) or 16x (AVX-512) for compute-bound code!
Fused Multiply-Add (FMA)
The most important SIMD operation for ML!
// FMA: a = a + (b × c) in one instruction
// Crucial for matrix multiply, convolutions
// Scalar (2 operations)
a = a + b * c;
// AVX FMA (8 operations in 1 instruction!)
__m256 va, vb, vc;
va = _mm256_fmadd_ps(vb, vc, va); // va = va + vb * vc (8-way)
Why FMA matters:
- ✅ 2x throughput (multiply + add in 1 cycle)
- ✅ Higher precision (no intermediate rounding)
- ✅ Enables 2x peak FLOPS vs separate multiply and add (see the dot-product sketch below)
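To make this concrete, here is a minimal sketch of a dot product built around _mm256_fmadd_ps, the core pattern inside dense layers and convolutions. It assumes AVX2+FMA hardware, 32-byte-aligned inputs, and n divisible by 8; the function name dot_fma is just for illustration:
#include <immintrin.h>

float dot_fma(const float *a, const float *b, int n) {
    __m256 acc = _mm256_setzero_ps();
    for (int i = 0; i < n; i += 8) {
        __m256 va = _mm256_load_ps(&a[i]);
        __m256 vb = _mm256_load_ps(&b[i]);
        acc = _mm256_fmadd_ps(va, vb, acc);   // acc += va * vb (8 lanes at once)
    }
    // Reduce the 8 partial sums to one scalar
    __m128 lo = _mm256_castps256_ps128(acc);
    __m128 hi = _mm256_extractf128_ps(acc, 1);
    __m128 s  = _mm_add_ps(lo, hi);           // 4 partial sums
    s = _mm_hadd_ps(s, s);                    // 2 partial sums
    s = _mm_hadd_ps(s, s);                    // final sum in lane 0
    return _mm_cvtss_f32(s);
}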
Horizontal Operations
// Reduction: sum all elements
__m256 v = _mm256_load_ps(data); // [a0, a1, a2, a3, a4, a5, a6, a7]
// Horizontal add: sum adjacent pairs
v = _mm256_hadd_ps(v, v); // pairwise sums within each 128-bit lane
v = _mm256_hadd_ps(v, v); // low lane now holds a0+a1+a2+a3, high lane a4+a5+a6+a7
// Combine the two 128-bit lanes for the final sum
__m128 lo = _mm256_castps256_ps128(v);
__m128 hi = _mm256_extractf128_ps(v, 1);
float result = _mm_cvtss_f32(_mm_add_ss(lo, hi)); // Extract final sum
SIMD Challenges
1. Data Alignment
SIMD works best with aligned data:
// Bad: Unaligned (slower)
float *a = malloc(n * sizeof(float)); // May not be aligned
// Good: Aligned (faster)
float *a = aligned_alloc(32, n * sizeof(float)); // 32-byte aligned (AVX)
// Or use compiler attributes
alignas(32) float a[1024];
Why alignment matters: Unaligned loads may cross cache lines → up to 2x slower! (When alignment can't be guaranteed, the unaligned intrinsics _mm256_loadu_ps / _mm256_storeu_ps still work, at a small cost.)
2. Control Flow (Branches)
SIMD executes the same instruction on all lanes → branches are problematic!
// Problematic: conditional per element
for (int i = 0; i < n; i++) {
if (a[i] > 0) { // Different per element!
c[i] = a[i];
} else {
c[i] = -a[i];
}
}
// SIMD solution: predicated execution (mask)
__m256 va = _mm256_load_ps(a);
__m256 zero = _mm256_setzero_ps();
__m256 mask = _mm256_cmp_ps(va, zero, _CMP_GT_OQ); // a > 0?
__m256 pos = va; // Positive path
__m256 neg = _mm256_sub_ps(zero, va); // Negative path
// Blend based on mask (no branch!)
__m256 vc = _mm256_blendv_ps(neg, pos, mask);
Cost: Both paths execute, then select → still faster than scalar if the branch is unpredictable!
3. Gather/Scatter (Non-contiguous Access)
// Bad: Indirect/strided access
for (int i = 0; i < n; i++) {
c[i] = a[indices[i]]; // Random access!
}
// SIMD gather (AVX2+)
__m256i vidx = _mm256_load_si256((__m256i*)indices); // Load 8 indices
__m256 va = _mm256_i32gather_ps(a, vidx, 4); // Gather from 8 locations
// Still slower than sequential access, but better than scalar
Performance: Gather/scatter are much slower than contiguous loads (10x+ latency)
Why SIMD Matters for Machine Learning
ML Workloads are SIMD-Friendly
Matrix Multiply (Core of neural networks):
// C = A × B
for (int i = 0; i < M; i++) {
for (int j = 0; j < N; j++) {
for (int k = 0; k < K; k++) {
C[i][j] += A[i][k] * B[k][j]; // Perfect for FMA!
}
}
}
// AVX-512 FMA (16-way)
// 16 multiply-adds per instruction
// 2 FLOPs per element × 16 elements = 32 FLOPs per instruction
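As a rough illustration of how the inner loop maps onto FMA hardware, here is a sketch of an AVX2/FMA version (8-wide rather than the 16-wide AVX-512 figures above). It assumes row-major arrays, N divisible by 8, and a zero-initialized C; real GEMM libraries add cache blocking and register tiling on top of this idea:
#include <immintrin.h>

// Sketch: C = A × B with A of size MxK, B of size KxN, C of size MxN (row-major).
void matmul_avx(const float *A, const float *B, float *C, int M, int N, int K) {
    for (int i = 0; i < M; i++) {
        for (int k = 0; k < K; k++) {
            __m256 a = _mm256_set1_ps(A[i*K + k]);        // broadcast A[i][k] to 8 lanes
            for (int j = 0; j < N; j += 8) {
                __m256 b = _mm256_loadu_ps(&B[k*N + j]);  // 8 elements of row k of B
                __m256 c = _mm256_loadu_ps(&C[i*N + j]);
                c = _mm256_fmadd_ps(a, b, c);             // C[i][j..j+7] += A[i][k] * B[k][j..j+7]
                _mm256_storeu_ps(&C[i*N + j], c);
            }
        }
    }
}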
Activation Functions:
// ReLU: max(0, x) - element-wise operation
for (int i = 0; i < n; i++) {
output[i] = max(0.0f, input[i]);
}
// SIMD ReLU (8-way with AVX)
__m256 zero = _mm256_setzero_ps();
for (int i = 0; i < n; i += 8) {
__m256 v = _mm256_load_ps(&input[i]);
v = _mm256_max_ps(v, zero); // 8 max operations
_mm256_store_ps(&output[i], v);
}
Softmax:
// Exponential + normalization - vectorizable
for (int i = 0; i < n; i++) {
sum += exp(x[i]);
}
for (int i = 0; i < n; i++) {
output[i] = exp(x[i]) / sum; // All SIMD-friendly!
}
Modern CPU SIMD Performance
Intel Skylake-X (AVX-512):
Peak FLOPS (per core):
- Clock: 3 GHz
- FMA units: 2
- Vector width: 512-bit (16 floats)
FLOPs per FMA: 2 × 16 = 32
Peak = 3 GHz × 2 units × 32 FLOPs = 192 GFLOPS/core
48-core server: 192 × 48 ≈ 9.2 TFLOPS!
Compare to scalar (no SIMD):
Peak = 3 GHz × 2 units × 1 FLOP = 6 GFLOPS/core
Speedup: 32x (16x from the vector width, 2x from FMA)!
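The same arithmetic as a tiny sketch you can tweak (the figures are the illustrative Skylake-X numbers used above):
#include <stdio.h>

// Peak FLOPS = clock × FMA units × (2 FLOPs per FMA × SIMD lanes)
int main(void) {
    double clock_ghz = 3.0;
    int fma_units = 2;
    int lanes = 16;                                          // AVX-512: 16 x 32-bit floats
    double per_core = clock_ghz * fma_units * 2 * lanes;     // GFLOPS per core
    printf("Per core: %.0f GFLOPS\n", per_core);             // 192
    printf("48 cores: %.1f TFLOPS\n", per_core * 48 / 1000); // ~9.2
    return 0;
}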
Throughput-Oriented Architectures
Latency vs Throughput Architectures
┌───────────────────────┬──────────────────────┬──────────────────────┐
│                       │ Latency-Oriented     │ Throughput-Oriented  │
│                       │ (CPU)                │ (GPU)                │
├───────────────────────┼──────────────────────┼──────────────────────┤
│ Goal                  │ Minimize latency     │ Maximize throughput  │
│ Design Philosophy     │ Few powerful cores   │ Many simple cores    │
│ Clock Speed           │ 3-5 GHz              │ 1-2 GHz              │
│ Out-of-Order          │ Yes (aggressive)     │ No (in-order)        │
│ Branch Prediction     │ Sophisticated        │ Simple/None          │
│ Cache                 │ Large (32 MB L3)     │ Small (explicit)     │
│ Memory Latency Hiding │ OoO + prefetch       │ Thread switching     │
│ Thread Count          │ 2-4 per core         │ 1000s                │
│ Best For              │ Serial, branchy code │ Parallel, SIMD code  │
└───────────────────────┴──────────────────────┴──────────────────────┘
Throughput Design Principles
1. Many Simple Cores > Few Complex Cores
Latency CPU:
┌──────────────────────────────┐
│  Complex Core (200mm²)       │
│  - OoO execution             │
│  - Branch prediction         │
│  - Large cache               │
│  Performance: 1 task fast    │
└──────────────────────────────┘
Throughput GPU (same area):
┌────┬────┬────┬────┬────┬─────┐
│ C1 │ C2 │ C3 │ C4 │ C5 │ ... │  100 simple cores
│ C6 │ C7 │ C8 │ C9 │ C10│ ... │  Each core: 2mm²
│ ...│ ...│ ...│ ...│ ...│ ... │  In-order, simple
└────┴────┴────┴────┴────┴─────┘
Performance: 100 tasks at once (if the work is parallel!)
2. Hide Latency with Thread Switching
// CPU approach: Out-of-order execution
// Issue instructions from instruction window
// Execute when ready, reorder completion
// GPU approach: Massive threading
// When thread stalls (memory access):
// → Switch to another ready thread instantly
// → Keep compute units busy
// → Original thread resumes when data arrives
// Example: 32 warps per SM
// Each warp: 32 threads
// Total: 1024 threads per SM
// Memory latency: 200 cycles? No problem - 31 other warps to run!
3. Explicit Memory Hierarchy
// CPU: Implicit caching (hardware managed)
data = array[i]; // Automatic L1/L2/L3 cache management
// GPU: Explicit shared memory (programmer managed)
__shared__ float shared_data[256]; // On-chip scratchpad
// Programmer loads data to shared memory
// Much faster than repeated global memory access
Parallelism Metrics
Speedup
Definition: How much faster is parallel version compared to serial?
Speedup = Time(serial) / Time(parallel)
Example:
Serial: 100 seconds
8-core parallel: 15 seconds
Speedup = 100 / 15 = 6.67x
Ideal speedup (linear): Speedup = Number of processors
2 cores → 2x speedup
4 cores → 4x speedup
8 cores → 8x speedup (rare in practice!)
Real-world speedup: Usually sublinear due to:
- Serial portions (Amdahl's Law; see the sketch after this list)
- Communication overhead
- Synchronization costs
- Load imbalance
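Amdahl's Law quantifies the first of these limits: if a fraction p of the runtime is parallelizable, Speedup(N) = 1 / ((1 - p) + p/N). A quick sketch of the arithmetic (the numbers are illustrative):
#include <stdio.h>

// Amdahl's Law: speedup = 1 / ((1 - p) + p / N)
// p = parallel fraction, N = number of processors
double amdahl(double p, int n) {
    return 1.0 / ((1.0 - p) + p / n);
}

int main(void) {
    // Example: 90% of the work is parallelizable
    printf("8 cores:  %.2fx\n", amdahl(0.9, 8));     // ~4.7x
    printf("64 cores: %.2fx\n", amdahl(0.9, 64));    // ~8.8x
    printf("Limit:    %.2fx\n", 1.0 / (1.0 - 0.9));  // 10x, no matter how many cores
    return 0;
}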
Efficiency
Definition: How effectively are processors utilized?
Efficiency = Speedup / Number of Processors
Example (from above):
Speedup = 6.67x on 8 cores
Efficiency = 6.67 / 8 = 83.4%
Interpretation: 83.4% of ideal performance
Strong scaling: Fixed problem size, more processors
1 core: 100s (baseline)
2 cores: 52s (Speedup 1.92x, Efficiency 96%)
4 cores: 28s (Speedup 3.57x, Efficiency 89%)
8 cores: 15s (Speedup 6.67x, Efficiency 83%)
→ Efficiency decreases as overhead dominates
Weak scaling: Increase problem size with processors
1 core: 10K elements, 10s
2 cores: 20K elements, 10s (Perfect weak scaling!)
4 cores: 40K elements, 10s
8 cores: 80K elements, 10s
→ Maintains efficiency by keeping work per core constant
Scalability
How well does speedup grow with more processors?
┌────────────┬─────────┬────────────┬─────────────┐
│ Processors │ Speedup │ Efficiency │ Scalability │
├────────────┼─────────┼────────────┼─────────────┤
│ 1          │ 1.0x    │ 100%       │             │
│ 2          │ 1.9x    │ 95%        │ Excellent   │
│ 4          │ 3.6x    │ 90%        │ Good        │
│ 8          │ 6.4x    │ 80%        │ Decent      │
│ 16         │ 10.7x   │ 67%        │ Poor        │
│ 32         │ 14.2x   │ 44%        │ Very Poor   │
└────────────┴─────────┴────────────┴─────────────┘
Good scalability: Efficiency stays above 75% up to the target scale
Example Calculation
Matrix multiply: 4096Ć4096 matrices
// Serial implementation
Time(serial) = 120 seconds
// Parallel on 8 cores with AVX-512 SIMD
Time(parallel) = 2.5 seconds
// Metrics
Speedup = 120 / 2.5 = 48x
Efficiency = 48 / 8 = 600%
// Wait, >100% efficiency?
// Yes! SIMD gives 16x boost per core
// Combined: 8 cores × 16-way SIMD = 128x theoretical
// Achieved: 48x / 128x theoretical = 37.5% of peak
SIMD Programming Models
1. Intrinsics (Explicit SIMD)
#include <immintrin.h> // Intel intrinsics
// Assumes 32-byte-aligned arrays and n divisible by 8 (see the tail-handling sketch below)
void add_avx(float *a, float *b, float *c, int n) {
for (int i = 0; i < n; i += 8) {
__m256 va = _mm256_load_ps(&a[i]);
__m256 vb = _mm256_load_ps(&b[i]);
__m256 vc = _mm256_add_ps(va, vb);
_mm256_store_ps(&c[i], vc);
}
}
Pros: Full control, predictable performance. Cons: Not portable, verbose, manual optimization.
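One practical detail the example above glosses over: when n is not a multiple of the vector width, the usual pattern is a vector main loop plus a scalar tail. A sketch (using the unaligned loadu/storeu variants so it works for any pointers):
#include <immintrin.h>

void add_avx_any_n(const float *a, const float *b, float *c, int n) {
    int i = 0;
    for (; i + 8 <= n; i += 8) {           // vector main loop
        __m256 va = _mm256_loadu_ps(&a[i]);
        __m256 vb = _mm256_loadu_ps(&b[i]);
        _mm256_storeu_ps(&c[i], _mm256_add_ps(va, vb));
    }
    for (; i < n; i++) {                   // scalar tail (at most 7 iterations)
        c[i] = a[i] + b[i];
    }
}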
2. Auto-Vectorization (Compiler)
// Simple C code
void add(float *a, float *b, float *c, int n) {
for (int i = 0; i < n; i++) {
c[i] = a[i] + b[i]; // Compiler vectorizes this!
}
}
// Compile with optimization
// gcc -O3 -march=native -ftree-vectorize
// Compiler generates AVX/SSE instructions automatically
Pros: Portable, simple code. Cons: Compiler may miss opportunities, less control.
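A common reason the compiler gives up is pointer aliasing: it cannot prove that c does not overlap a or b. A sketch of the usual fix with restrict (C99; C++ compilers typically accept __restrict), plus a GCC flag that reports what was vectorized:
// restrict promises the arrays don't overlap, which often unlocks auto-vectorization
void add_restrict(float * restrict a, float * restrict b,
                  float * restrict c, int n) {
    for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}
// Check the result: gcc -O3 -march=native -fopt-info-vec add.c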
3. SIMD Libraries
// Eigen (C++)
VectorXf a(n), b(n), c(n);
c = a + b; // Automatically uses SIMD!
// NumPy (Python)
c = a + b  # NumPy uses optimized BLAS with SIMD
Pros: High-level, portable, well-optimized. Cons: Less control, potential overhead.
Key Takeaways
✅ Two types of parallelism:
- Data parallelism: same op, different data (SIMD-friendly)
- Task parallelism: different ops, independent data
✅ SIMD processes multiple data elements per instruction
- Modern CPUs: 8-16 wide (AVX/AVX-512)
- Crucial for ML: matrix multiply, activations
✅ Throughput architectures (GPUs) trade single-thread speed for massive parallelism
✅ Speedup and efficiency measure parallel performance
- Speedup = Serial time / Parallel time
- Efficiency = Speedup / Number of processors
✅ SIMD challenges: alignment, branches, non-contiguous access
✅ ML loves SIMD: Matrix ops, element-wise ops are naturally vectorizable
Practice Problems
1. SIMD Speedup
Scalar code: 100 cycles
AVX-256 SIMD (8-way): ? cycles
Assuming perfect vectorization, what's the speedup?
2. Efficiency Calculation
16-core CPU, parallel speedup = 10x
What is the efficiency? What might cause less than ideal speedup?
3. SIMD Suitability
// Which loops can be efficiently SIMD-vectorized?
// Loop A
for (int i = 0; i < n; i++) {
c[i] = a[i] + b[i];
}
// Loop B
for (int i = 1; i < n; i++) {
c[i] = c[i-1] + a[i]; // Dependency!
}
Next Steps
Now that you understand parallel computing basics:
→ GPU Architecture: Dive deep into throughput-oriented design
→ Matrix Multiplication: Apply SIMD to the most important ML kernel
→ TPU Architecture: See how Google optimized for tensor operations
→ Performance Analysis: Measure SIMD effectiveness in real code
Remember: Modern performance is all about exploiting parallelism - both data and task level!