Parallel Computing & SIMD Basics
Introduction to parallelism fundamentals: data vs task parallelism, SIMD vector operations, throughput-oriented architectures, and essential parallelism metrics.
Why parallel? Because single-core performance improvements have slowed dramatically since around 2005. The primary path to higher performance is now parallelism: doing multiple things simultaneously. Let's master the fundamentals! 🚀
1. The Parallelism Landscape
1.1 Two Fundamental Types of Parallelism
                 PARALLELISM
                      │
       ┌──────────────┴──────────────┐
       │                             │
DATA PARALLELISM              TASK PARALLELISM
(Same operation,              (Different operations,
 different data)               independent data)
1.2 Data Parallelism: Same Operation, Multiple Data
The most common and easiest to exploit!
// Sequential (scalar)
for (int i = 0; i < n; i++) {
c[i] = a[i] + b[i]; // Same ADD operation on different data
}
// Data parallel (vector) - pseudocode, not real C
// Process 8 elements simultaneously!
// (a compilable sketch is shown at the end of this subsection)
for (int i = 0; i < n; i += 8) {
vec_c[i:i+7] = vec_a[i:i+7] + vec_b[i:i+7]; // 8 ADDs at once (slice notation)
}
Characteristics:
- ✅ Easy to parallelize (no dependencies between iterations)
- ✅ Scales well (more data → more parallelism)
- ✅ Hardware-friendly (SIMD and GPUs love this!)
Where you find it:
- Array operations
- Image processing (same filter on each pixel)
- ML inference (matrix multiply, activation functions)
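As a compilable counterpart to the pseudocode above, here is a minimal sketch using an OpenMP `simd` pragma, one portable way to ask the compiler to vectorize a data-parallel loop (assumes a compiler with OpenMP SIMD support, e.g. `gcc -O3 -fopenmp-simd`; the `vec_add` name is illustrative). Explicit intrinsic versions appear in Section 2.3.
#include <stddef.h>

// Request vectorization of a data-parallel loop: same ADD, independent elements.
void vec_add(const float *a, const float *b, float *c, size_t n) {
    #pragma omp simd
    for (size_t i = 0; i < n; i++) {
        c[i] = a[i] + b[i];   // no cross-iteration dependencies
    }
}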
1.3 Task Parallelism: Different Operations, Independent Tasks
Different operations running concurrently
#include <thread>

// Task 1: Process incoming requests
void handleRequests() { /* ... */ }
// Task 2: Update database
void syncDatabase() { /* ... */ }
// Task 3: Send notifications
void sendNotifications() { /* ... */ }

// Run all three tasks in parallel on different cores
std::thread t1(handleRequests);
std::thread t2(syncDatabase);
std::thread t3(sendNotifications);
// ... later: t1.join(); t2.join(); t3.join();
Characteristics:
- ⚠️ Harder to parallelize (need independent tasks)
- ⚠️ Limited by number of distinct tasks
- ⚠️ Requires synchronization if tasks share data
Where you find it:
- Web servers (handle multiple requests)
- Game engines (physics, rendering, AI in parallel)
- Video encoding (different frames in parallel)
2. SIMD: Single Instruction, Multiple Data
SIMD is the foundation of modern high-performance computing!
2.1 The Core Concept
One instruction operates on multiple data elements in parallel
Scalar (Traditional):              SIMD (Vector):
┌────────────┐                     ┌────────────────────────────┐
│    ADD     │                     │        VADD (8-way)        │
│  a ─┐      │                     │  [a0..a7] ─┐               │
│  b ─┴─→ c  │                     │  [b0..b7] ─┴─→ [c0..c7]    │
└────────────┘                     └────────────────────────────┘
  1 cycle                            1 cycle, 8 operations!
2.2 SIMD Hardware
Vector Registers (Wide Registers)
// x86-64 SIMD evolution
MMX:     64-bit  (integers only)          [Obsolete]
SSE:     128-bit (4 × 32-bit floats)      [Common]
AVX:     256-bit (8 × 32-bit floats)      [Modern Intel/AMD]
AVX-512: 512-bit (16 × 32-bit floats)     [Server CPUs]
// ARM
NEON: 128-bit (4 × 32-bit floats)         [Mobile/Apple]
SVE:  Scalable (128-bit up to 2048-bit)   [Server ARM]
Example: AVX-512 register
┌────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┐
│  0 │  1 │  2 │  3 │  4 │  5 │  6 │  7 │  8 │  9 │ 10 │ 11 │ 12 │ 13 │ 14 │ 15 │
└────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┘
  16 × 32-bit floats in ONE register
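If you want to check these widths from C, the intrinsics header defines one type per register width; a tiny sketch (assumes an x86 compiler with AVX support, e.g. `gcc -mavx`):
#include <immintrin.h>
#include <stdio.h>

int main(void) {
    // Each type corresponds to one vector register width (bytes * 8 = bits).
    printf("__m128: %zu bytes (SSE, 128-bit)\n", sizeof(__m128));      // 16
    printf("__m256: %zu bytes (AVX, 256-bit)\n", sizeof(__m256));      // 32
#ifdef __AVX512F__
    printf("__m512: %zu bytes (AVX-512, 512-bit)\n", sizeof(__m512));  // 64
#endif
    return 0;
}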
Vector ALUs (Parallel Execution Units)
Traditional CPU ALU:              SIMD ALU:
      ┌───┐                       ┌───┬───┬───┬───┐
a ──→ │ + │ ──→ c           a ──→ │ + │ + │ + │ + │ ──→ c
b ──→ └───┘                 b ──→ └───┴───┴───┴───┘
                                   4 parallel additions
2.3 SIMD Operations
Basic Arithmetic
// C code (scalar)
for (int i = 0; i < n; i++) {
c[i] = a[i] + b[i];
}
// x86 SSE intrinsics (4-way SIMD); assumes 16-byte-aligned arrays and n divisible by 4
__m128 *va = (__m128*)a;
__m128 *vb = (__m128*)b;
__m128 *vc = (__m128*)c;
for (int i = 0; i < n/4; i++) {
vc[i] = _mm_add_ps(va[i], vb[i]); // 4 adds in 1 instruction
}
// AVX (8-way SIMD); assumes 32-byte-aligned arrays and n divisible by 8
__m256 *va = (__m256*)a;
__m256 *vb = (__m256*)b;
__m256 *vc = (__m256*)c;
for (int i = 0; i < n/8; i++) {
vc[i] = _mm256_add_ps(va[i], vb[i]); // 8 adds in 1 instruction
}
Speedup: Up to 8x (AVX) or 16x (AVX-512) for compute-bound code!
Fused Multiply-Add (FMA)
The most important SIMD operation for ML!
// FMA: a = a + (b × c) in one instruction
// Crucial for matrix multiply, convolutions
// Scalar (2 separate operations)
a = a + b * c;
// AVX FMA (8 fused multiply-adds in 1 instruction!)
__m256 va, vb, vc;
va = _mm256_fmadd_ps(vb, vc, va); // va = va + vb * vc (8-way)
Why FMA matters:
- ✅ 2x throughput (multiply + add in one instruction)
- ✅ Higher precision (no intermediate rounding)
- ✅ Enables 2x peak FLOPS vs separate multiply and add
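To make the FMA pattern concrete, here is a minimal dot-product sketch built on `_mm256_fmadd_ps` (assumes AVX2+FMA, n a multiple of 8, and 32-byte-aligned inputs; the `dot_fma` name is just for illustration). The vector of 8 partial sums is reduced with a plain scalar loop here; the next subsection shows the horizontal-add approach.
#include <immintrin.h>

float dot_fma(const float *a, const float *b, int n) {
    __m256 acc = _mm256_setzero_ps();         // 8 running partial sums
    for (int i = 0; i < n; i += 8) {
        __m256 va = _mm256_load_ps(&a[i]);
        __m256 vb = _mm256_load_ps(&b[i]);
        acc = _mm256_fmadd_ps(va, vb, acc);   // acc += va * vb (8 lanes, 1 instruction)
    }
    float partial[8];
    _mm256_storeu_ps(partial, acc);           // spill the 8 partial sums
    float sum = 0.0f;
    for (int i = 0; i < 8; i++) sum += partial[i];
    return sum;
}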
Horizontal Operations
// Reduction: sum all 8 elements of an AVX register
__m256 v = _mm256_load_ps(data); // [a0, a1, a2, a3, a4, a5, a6, a7]
// Horizontal add: sums adjacent pairs within each 128-bit lane
v = _mm256_hadd_ps(v, v); // [a0+a1, a2+a3, a0+a1, a2+a3, a4+a5, a6+a7, ...]
v = _mm256_hadd_ps(v, v); // lane 0 holds a0+a1+a2+a3, lane 1 holds a4+a5+a6+a7
// Combine the two 128-bit lanes for the final sum
__m128 lo = _mm256_castps256_ps128(v);
__m128 hi = _mm256_extractf128_ps(v, 1);
float result = _mm_cvtss_f32(_mm_add_ss(lo, hi)); // Extract final sum
2.4 SIMD Challenges
1. Data Alignment
SIMD works best with aligned data:
// Bad: Unaligned (slower)
float *a = malloc(n * sizeof(float)); // May not be aligned
// Good: Aligned (faster)
float *a = aligned_alloc(32, n * sizeof(float)); // 32-byte aligned (AVX)
// Or use compiler attributes
alignas(32) float a[1024];
Why alignment matters: Unaligned loads may cross cache lines → up to 2x slower!
2. Control Flow (Branches)
SIMD executes the same instruction on all lanes → branches are problematic!
// Problematic: conditional per element
for (int i = 0; i < n; i++) {
if (a[i] > 0) { // Different per element!
c[i] = a[i];
} else {
c[i] = -a[i];
}
}
// SIMD solution: predicated execution (mask)
__m256 va = _mm256_load_ps(a);
__m256 zero = _mm256_setzero_ps();
__m256 mask = _mm256_cmp_ps(va, zero, _CMP_GT_OQ); // a > 0?
__m256 pos = va; // Positive path
__m256 neg = _mm256_sub_ps(zero, va); // Negative path
// Blend based on mask (no branch!)
__m256 vc = _mm256_blendv_ps(neg, pos, mask);
Cost: Both paths execute, then select → still faster than scalar if the branch is unpredictable!
3. Gather/Scatter (Non-contiguous Access)
// Bad: Indirect/strided access
for (int i = 0; i < n; i++) {
c[i] = a[indices[i]]; // Random access!
}
// SIMD gather (AVX2+)
__m256i vidx = _mm256_loadu_si256((__m256i*)indices); // Load 8 indices
__m256 va = _mm256_i32gather_ps(a, vidx, 4); // Gather from 8 scattered locations (scale = 4 bytes)
// Still slower than a contiguous load, but better than 8 scalar loads
Performance: Gather/scatter are much slower than contiguous loads (10x+ latency)
3. Why SIMD Matters for Machine Learning
3.1 ML Workloads are SIMD-Friendly
Matrix Multiply (Core of neural networks):
// C = A × B
for (int i = 0; i < M; i++) {
for (int j = 0; j < N; j++) {
for (int k = 0; k < K; k++) {
C[i][j] += A[i][k] * B[k][j]; // Perfect for FMA!
}
}
}
// AVX-512 FMA (16-way)
// 16 fused multiply-adds per instruction
// 2 FLOPs per element × 16 elements = 32 FLOPs per instruction
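As a sketch of how that inner loop maps onto FMA hardware, here is an 8-wide AVX2 version (shown instead of AVX-512 for brevity; assumes row-major arrays, N a multiple of 8, and C zero-initialized; `matmul_avx2` is an illustrative name, not a library routine):
#include <immintrin.h>

void matmul_avx2(const float *A, const float *B, float *C, int M, int N, int K) {
    for (int i = 0; i < M; i++) {
        for (int k = 0; k < K; k++) {
            __m256 a = _mm256_set1_ps(A[i * K + k]);      // broadcast A[i][k] to 8 lanes
            for (int j = 0; j < N; j += 8) {
                __m256 c = _mm256_loadu_ps(&C[i * N + j]);
                __m256 b = _mm256_loadu_ps(&B[k * N + j]);
                c = _mm256_fmadd_ps(a, b, c);             // C[i][j..j+7] += A[i][k] * B[k][j..j+7]
                _mm256_storeu_ps(&C[i * N + j], c);
            }
        }
    }
}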
Activation Functions:
// ReLU: max(0, x) - element-wise operation
for (int i = 0; i < n; i++) {
output[i] = fmaxf(0.0f, input[i]); // fmaxf from <math.h>
}
// SIMD ReLU (8-way with AVX; assumes n % 8 == 0 and 32-byte-aligned buffers)
__m256 zero = _mm256_setzero_ps();
for (int i = 0; i < n; i += 8) {
__m256 v = _mm256_load_ps(&input[i]);
v = _mm256_max_ps(v, zero); // 8 max operations
_mm256_store_ps(&output[i], v);
}
Softmax:
// Exponential + normalization - both loops vectorize well
float sum = 0.0f;
for (int i = 0; i < n; i++) {
sum += expf(x[i]);
}
for (int i = 0; i < n; i++) {
output[i] = expf(x[i]) / sum; // All SIMD-friendly (given a vectorized exp)
}
3.2 Modern CPU SIMD Performance
Intel Skylake-X (AVX-512):
Peak FLOPS (per core):
- Clock: 3 GHz
- FMA units: 2
- Vector width: 512-bit (16 floats)
- FLOPs per FMA: 2 × 16 = 32
Peak = 3 GHz × 2 units × 32 FLOPs = 192 GFLOPS/core
48-core server: 192 × 48 ≈ 9.2 TFLOPS!
Compare to scalar (no SIMD, no FMA):
Peak = 3 GHz × 2 units × 1 FLOP = 6 GFLOPS/core
Speedup: 32x from vectorization plus FMA (16x width × 2x FMA)!
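The same arithmetic as a tiny helper you can plug your own CPU's numbers into (a sketch; the only constant assumed is 2 FLOPs per FMA):
// Peak FLOPS = clock (GHz) × FMA units per core × vector lanes × 2 FLOPs per FMA
double peak_gflops_per_core(double clock_ghz, int fma_units, int lanes) {
    return clock_ghz * fma_units * lanes * 2.0;
}

// Example from above: peak_gflops_per_core(3.0, 2, 16) == 192.0 GFLOPS/core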
4. Throughput-Oriented Architectures
4.1 Latency vs Throughput Architectures
┌─────────────────────┬───────────────────────┬───────────────────────┐
│                     │ Latency-Oriented      │ Throughput-Oriented   │
│                     │ (CPU)                 │ (GPU)                 │
├─────────────────────┼───────────────────────┼───────────────────────┤
│ Goal                │ Minimize latency      │ Maximize throughput   │
│ Design Philosophy   │ Few powerful cores    │ Many simple cores     │
│ Clock Speed         │ 3-5 GHz               │ 1-2 GHz               │
│ Out-of-Order        │ Yes (aggressive)      │ No (in-order)         │
│ Branch Prediction   │ Sophisticated         │ Simple/None           │
│ Cache               │ Large (e.g. 32 MB L3) │ Small (explicit)      │
│ Memory Latency Hide │ OoO + prefetch        │ Thread switching      │
│ Thread Count        │ 2-4 per core          │ Thousands             │
│ Best For            │ Serial, branchy code  │ Parallel, SIMD code   │
└─────────────────────┴───────────────────────┴───────────────────────┘
4.2 Throughput Design Principles
1. Many Simple Cores > Few Complex Cores
Latency CPU:
┌───────────────────────────────┐
│  Complex Core (200 mm²)       │
│  - OoO execution              │
│  - Branch prediction          │
│  - Large cache                │
│  Performance: 1 task, fast    │
└───────────────────────────────┘

Throughput GPU (same area):
┌────┬────┬────┬────┬────┬─────┐
│ C1 │ C2 │ C3 │ C4 │ C5 │ ... │   100 simple cores
│ C7 │ C8 │ C9 │ C10│ C11│ ... │   Each core: 2 mm²
│ ...│ ...│ ...│ ...│ ...│ ... │   In-order, simple
└────┴────┴────┴────┴────┴─────┘
Performance: 100 tasks at once (if the work is parallel!)
2. Hide Latency with Thread Switching
// CPU approach: Out-of-order execution
// Issue instructions from instruction window
// Execute when ready, reorder completion
// GPU approach: Massive threading
// When thread stalls (memory access):
// → Switch to another ready thread instantly
// → Keep compute units busy
// → Original thread resumes when data arrives
// Example: 32 warps per SM
// Each warp: 32 threads
// Total: 1024 threads per SM
// Memory latency: 200 cycles? No problem - 31 other warps to run!
3. Explicit Memory Hierarchy
// CPU: Implicit caching (hardware managed)
data = array[i]; // Automatic L1/L2/L3 cache management
// GPU: Explicit shared memory (programmer managed)
__shared__ float shared_data[256]; // On-chip scratchpad
// Programmer loads data to shared memory
// Much faster than repeated global memory access
5. Parallelism Metrics
5.1 Speedup
Definition: How much faster is parallel version compared to serial?
Speedup = Time(serial) / Time(parallel)
Example:
Serial: 100 seconds
8-core parallel: 15 seconds
Speedup = 100 / 15 = 6.67x
Ideal speedup (linear): Speedup = Number of processors
2 cores → 2x speedup
4 cores → 4x speedup
8 cores → 8x speedup (rare in practice!)
Real-world speedup: Usually sublinear due to:
- Serial portions (Amdahl's Law)
- Communication overhead
- Synchronization costs
- Load imbalance
5.2 Efficiency
Definition: How effectively are processors utilized?
Efficiency = Speedup / Number of Processors
Example (from above):
Speedup = 6.67x on 8 cores
Efficiency = 6.67 / 8 ≈ 83%
Interpretation: about 83% of ideal performance
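Both formulas as a small helper, mirroring the example above (a sketch; names are illustrative):
#include <stdio.h>

// Speedup = T_serial / T_parallel;  Efficiency = Speedup / P
void report_scaling(double t_serial, double t_parallel, int processors) {
    double speedup    = t_serial / t_parallel;
    double efficiency = speedup / processors;
    printf("Speedup: %.2fx, Efficiency: %.1f%%\n", speedup, 100.0 * efficiency);
}

int main(void) {
    report_scaling(100.0, 15.0, 8);   // prints roughly: Speedup: 6.67x, Efficiency: 83.3%
    return 0;
}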
Strong scaling: Fixed problem size, more processors
1 core: 100s (baseline)
2 cores: 52s (Speedup 1.92x, Efficiency 96%)
4 cores: 28s (Speedup 3.57x, Efficiency 89%)
8 cores: 15s (Speedup 6.67x, Efficiency 83%)
→ Efficiency decreases as overhead dominates
Weak scaling: Increase problem size with processors
1 core: 10K elements, 10s
2 cores: 20K elements, 10s (Perfect weak scaling!)
4 cores: 40K elements, 10s
8 cores: 80K elements, 10s
→ Maintains efficiency by keeping work per core constant
5.3 Scalability
How well does speedup grow with more processors?
┌────────────┬─────────┬────────────┬─────────────┐
│ Processors │ Speedup │ Efficiency │ Scalability │
├────────────┼─────────┼────────────┼─────────────┤
│ 1          │ 1.0x    │ 100%       │ (baseline)  │
│ 2          │ 1.9x    │ 95%        │ Excellent   │
│ 4          │ 3.6x    │ 90%        │ Good        │
│ 8          │ 6.4x    │ 80%        │ Decent      │
│ 16         │ 10.7x   │ 67%        │ Poor        │
│ 32         │ 14.2x   │ 44%        │ Very Poor   │
└────────────┴─────────┴────────────┴─────────────┘
Good scalability: Efficiency stays above 75% up to target scale
5.4 Example Calculation
Matrix multiply: 4096×4096 matrices
// Serial implementation
Time(serial) = 120 seconds
// Parallel on 8 cores with AVX-512 SIMD
Time(parallel) = 2.5 seconds
// Metrics
Speedup = 120 / 2.5 = 48x
Efficiency = 48 / 8 = 600%
// Wait, >100% efficiency?
// Yes - the serial baseline is scalar, and SIMD gives up to a 16x boost per core
// Combined: 8 cores × 16-way SIMD = 128x theoretical
// Achieved: 48x / 128x theoretical = 37.5% of peak
6. SIMD Programming Models
6.1 Intrinsics (Explicit SIMD)
#include <immintrin.h> // Intel intrinsics
// Assumes n is a multiple of 8 and a, b, c are 32-byte aligned
void add_avx(float *a, float *b, float *c, int n) {
for (int i = 0; i < n; i += 8) {
__m256 va = _mm256_load_ps(&a[i]);
__m256 vb = _mm256_load_ps(&b[i]);
__m256 vc = _mm256_add_ps(va, vb);
_mm256_store_ps(&c[i], vc);
}
}
Pros: Full control, predictable performance.
Cons: Not portable, verbose, manual optimization.
6.2 Auto-Vectorization (Compiler)
// Simple C code
void add(float *a, float *b, float *c, int n) {
for (int i = 0; i < n; i++) {
c[i] = a[i] + b[i]; // Compiler vectorizes this!
}
}
// Compile with optimization:
//   gcc -O3 -march=native -ftree-vectorize
//   (add -fopt-info-vec to report which loops were vectorized)
// The compiler generates SSE/AVX instructions automatically
Pros: Portable, simple code.
Cons: Compiler may miss opportunities, less control.
6.3 SIMD Libraries
// Eigen (C++)
VectorXf a(n), b(n), c(n);
c = a + b; // Automatically uses SIMD!
// NumPy (Python)
c = a + b # NumPy uses optimized BLAS with SIMD
Pros: High-level, portable, well-optimized.
Cons: Less control, potential overhead.
7. Key Takeaways
✅ Two types of parallelism:
- Data parallelism: same op, different data (SIMD-friendly)
- Task parallelism: different ops, independent data
✅ SIMD processes multiple data elements per instruction
- Modern CPUs: 8-16 lanes wide (AVX/AVX-512)
- Crucial for ML: matrix multiply, activations
✅ Throughput architectures (GPUs) trade single-thread speed for massive parallelism
✅ Speedup and efficiency measure parallel performance
- Speedup = Serial time / Parallel time
- Efficiency = Speedup / Number of processors
✅ SIMD challenges: alignment, branches, non-contiguous access
✅ ML loves SIMD: matrix ops and element-wise ops are naturally vectorizable
8. Practice Problems
1. SIMD Speedup
Scalar code: 100 cycles
AVX (256-bit, 8-way SIMD): ? cycles
Assuming perfect vectorization, what is the speedup?
2. Efficiency Calculation
16-core CPU, parallel speedup = 10x
What is the efficiency? What might cause less than ideal speedup?
3. SIMD Suitability
// Which loops can be efficiently SIMD-vectorized?
// Loop A
for (int i = 0; i < n; i++) {
c[i] = a[i] + b[i];
}
// Loop B
for (int i = 1; i < n; i++) {
c[i] = c[i-1] + a[i]; // Loop-carried dependency!
}
9. Next Steps
Now that you understand parallel computing basics:
→ GPU Architecture: Dive deep into throughput-oriented design
→ Matrix Multiplication: Apply SIMD to the most important ML kernel
→ TPU Architecture: See how Google optimized for tensor operations
→ Performance Analysis: Measure SIMD effectiveness in real code
Remember: Modern performance is all about exploiting parallelism, at both the data and task level! 🚀