
Parallel Computing & SIMD Basics

Introduction to parallelism fundamentals: data vs task parallelism, SIMD vector operations, throughput-oriented architectures, and essential parallelism metrics.

35 min read
Updated 10/1/2025
3 prerequisites

Prerequisites

Make sure you're familiar with these concepts before diving in:

Basic computer architecture understanding
Understanding of CPU pipelines
Familiarity with arrays and loops

Learning Objectives

By the end of this topic, you will be able to:

Distinguish between data parallelism and task parallelism
Understand SIMD (Single Instruction Multiple Data) fundamentals
Explain why vector operations are crucial for ML workloads
Grasp throughput-oriented architecture design principles
Calculate and interpret speedup and efficiency metrics

Parallel Computing & SIMD Basics

Why parallel? Because single-core performance improvements have slowed dramatically since 2005. The only path to higher performance is parallelism - doing multiple things simultaneously. Let's master the fundamentals! πŸš€

1. The Parallelism Landscape

1.1 Two Fundamental Types of Parallelism

                    PARALLELISM
                        β”‚
            β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
            ↓                       ↓
      DATA PARALLELISM        TASK PARALLELISM
    (Same operation,         (Different operations,
     different data)          independent data)

1.2 Data Parallelism: Same Operation, Multiple Data

The most common and easiest to exploit!

// Sequential (scalar)
for (int i = 0; i < n; i++) {
    c[i] = a[i] + b[i];  // Same ADD operation on different data
}
 
// Data parallel (vector)
// Process 8 elements simultaneously!
for (int i = 0; i < n; i += 8) {
    vec_c[i:i+7] = vec_a[i:i+7] + vec_b[i:i+7];  // 8 ADDs at once
}

Characteristics:

  • βœ… Easy to parallelize (no dependencies between iterations)
  • βœ… Scales well (more data β†’ more parallelism)
  • βœ… Hardware-friendly (SIMD, GPUs love this!)

Where you find it:

  • Array operations
  • Image processing (same filter on each pixel)
  • ML inference (matrix multiply, activation functions)
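
If you want to try this on real hardware before touching intrinsics, one low-effort option is to annotate the loop with an OpenMP SIMD pragma and let the compiler pick the vector instructions. A minimal sketch, assuming a compiler with OpenMP SIMD support (e.g. gcc -O2 -fopenmp-simd); the function name vec_add is ours:

#include <stddef.h>
 
// The pragma asserts the iterations are independent, so the compiler
// is free to emit SIMD instructions for this loop.
void vec_add(float *c, const float *a, const float *b, size_t n) {
    #pragma omp simd
    for (size_t i = 0; i < n; i++) {
        c[i] = a[i] + b[i];   // same ADD applied to every element
    }
}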

1.3 Task Parallelism: Different Operations, Independent Tasks

Different operations running concurrently

#include <thread>
 
// Task 1: Process incoming requests
void handleRequests() { /* ... */ }
 
// Task 2: Update database
void syncDatabase() { /* ... */ }
 
// Task 3: Send notifications
void sendNotifications() { /* ... */ }
 
int main() {
    // Run all three tasks in parallel, potentially on different cores
    std::thread t1(handleRequests);
    std::thread t2(syncDatabase);
    std::thread t3(sendNotifications);
 
    t1.join();  // Wait for all three tasks to finish
    t2.join();
    t3.join();
}

Characteristics:

  • ⚠️ Harder to parallelize (need independent tasks)
  • ⚠️ Limited by number of distinct tasks
  • ⚠️ Requires synchronization if tasks share data

Where you find it:

  • Web servers (handle multiple requests)
  • Game engines (physics, rendering, AI in parallel)
  • Video encoding (different frames in parallel)

2. SIMD: Single Instruction, Multiple Data

SIMD is the foundation of modern high-performance computing!

2.1 The Core Concept

One instruction operates on multiple data elements in parallel

Scalar (Traditional):          SIMD (Vector):
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   ADD    β”‚                  β”‚   VADD (8-way)           β”‚
β”‚  a β†’ c   β”‚                  β”‚  [a0..a7] β†’ [c0..c7]     β”‚
β”‚  b β†’ c   β”‚                  β”‚  [b0..b7] β†’ [c0..c7]     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
   1 cycle                         1 cycle, 8 operations!

2.2 SIMD Hardware

Vector Registers (Wide Registers)

// x86-64 SIMD evolution
MMX:  64-bit  (integers only)         [Obsolete]
SSE:  128-bit (4 Γ— 32-bit floats)     [Common]
AVX:  256-bit (8 Γ— 32-bit floats)     [Modern Intel/AMD]
AVX-512: 512-bit (16 Γ— 32-bit floats) [Server CPUs]
 
// ARM NEON
NEON: 128-bit (4 Γ— 32-bit floats)     [Mobile/Apple]
SVE:  Scalable (up to 2048-bit)       [Server ARM]

Example: AVX-512 register

β”Œβ”€β”€β”€β”€β”¬β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”
β”‚ 0  β”‚ 1  β”‚ 2  β”‚ 3  β”‚ 4  β”‚ 5  β”‚ 6  β”‚ 7  β”‚ 8  β”‚ 9  β”‚ 10 β”‚ 11 β”‚ 12 β”‚ 13 β”‚ 14 β”‚ 15 β”‚
β””β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”˜
     16 Γ— 32-bit floats in ONE register

Vector ALUs (Parallel Execution Units)

Traditional CPU ALU:         SIMD ALU:
      β”Œβ”€β”€β”€β”€β”                    β”Œβ”€β”€β”€β”€β”¬β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”
a β†’   β”‚ +  β”‚ β†’ c              a β”‚ +  β”‚ +  β”‚ +  β”‚ +  β”‚ c
b β†’   β””β”€β”€β”€β”€β”˜                  b β””β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”˜
                                  4 parallel additions

2.3 SIMD Operations

Basic Arithmetic

// C code (scalar)
for (int i = 0; i < n; i++) {
    c[i] = a[i] + b[i];
}
 
// x86 SSE intrinsics (4-way SIMD) - assumes 16-byte-aligned arrays and n divisible by 4
__m128 *va = (__m128*)a;
__m128 *vb = (__m128*)b;
__m128 *vc = (__m128*)c;
 
for (int i = 0; i < n/4; i++) {
    vc[i] = _mm_add_ps(va[i], vb[i]);  // 4 adds in 1 instruction
}
 
// AVX (8-way SIMD) - assumes 32-byte-aligned arrays and n divisible by 8
__m256 *va = (__m256*)a;
__m256 *vb = (__m256*)b;
__m256 *vc = (__m256*)c;
 
for (int i = 0; i < n/8; i++) {
    vc[i] = _mm256_add_ps(va[i], vb[i]);  // 8 adds in 1 instruction
}

Speedup: Up to 8x (AVX) or 16x (AVX-512) for compute-bound code!
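
One detail the loops above gloss over: when n is not a multiple of the vector width, the leftover elements need a scalar cleanup loop. A minimal sketch for the AVX case (the function name add_avx_tail is ours; it uses unaligned loads so alignment doesn't matter):

#include <immintrin.h>
 
// 8-wide AVX add with a scalar tail for the last (n % 8) elements
void add_avx_tail(const float *a, const float *b, float *c, int n) {
    int i = 0;
    for (; i + 8 <= n; i += 8) {                        // vector main loop
        __m256 va = _mm256_loadu_ps(&a[i]);
        __m256 vb = _mm256_loadu_ps(&b[i]);
        _mm256_storeu_ps(&c[i], _mm256_add_ps(va, vb));
    }
    for (; i < n; i++) {                                // scalar cleanup loop
        c[i] = a[i] + b[i];
    }
}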

Fused Multiply-Add (FMA)

The most important SIMD operation for ML!

// FMA: a = a + (b Γ— c) in one instruction
// Crucial for matrix multiply, convolutions
 
// Scalar (2 operations)
a = a + b * c;
 
// AVX FMA (8 operations in 1 instruction!)
__m256 va, vb, vc;
va = _mm256_fmadd_ps(vb, vc, va);  // va = va + vb * vc (8-way)

Why FMA matters:

  • βœ… 2x throughput (multiply + add in 1 cycle)
  • βœ… Higher precision (no intermediate rounding)
  • βœ… Enables 2x peak FLOPS vs separate ops

Horizontal Operations

// Reduction: sum all 8 elements of a vector
__m256 v = _mm256_load_ps(data);  // [a0, a1, a2, a3, a4, a5, a6, a7]
 
// AVX hadd works within 128-bit lanes, so first fold the upper half onto the lower half
__m128 lo  = _mm256_castps256_ps128(v);     // [a0, a1, a2, a3]
__m128 hi  = _mm256_extractf128_ps(v, 1);   // [a4, a5, a6, a7]
__m128 sum = _mm_add_ps(lo, hi);            // [a0+a4, a1+a5, a2+a6, a3+a7]
 
// Horizontal add: sum adjacent pairs, twice
sum = _mm_hadd_ps(sum, sum);                // pairwise sums
sum = _mm_hadd_ps(sum, sum);                // total in every element
 
float result = _mm_cvtss_f32(sum);          // Extract final sum

2.4 SIMD Challenges

1. Data Alignment

SIMD works best with aligned data:

// Bad: Unaligned (slower)
float *a = malloc(n * sizeof(float));  // Not guaranteed to be 32-byte aligned
 
// Good: Aligned (faster)
float *a = aligned_alloc(32, n * sizeof(float));  // 32-byte aligned (AVX)
 
// Or use compiler attributes
alignas(32) float a[1024];

Why alignment matters: an unaligned load can straddle a cache-line boundary, roughly doubling its cost.
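
When you cannot guarantee alignment (for example, a pointer handed to you by other code), the usual fallback is the unaligned load intrinsic. A small sketch contrasting the two (the helper name load8 is ours; in practice many codebases simply use the unaligned form everywhere, since it only pays a penalty when a load actually crosses a cache line):

#include <immintrin.h>
#include <stdint.h>
 
// Load 8 floats starting at p, whether or not p is 32-byte aligned
static inline __m256 load8(const float *p) {
    if (((uintptr_t)p % 32) == 0)
        return _mm256_load_ps(p);    // aligned load: requires 32-byte alignment
    else
        return _mm256_loadu_ps(p);   // unaligned load: correct for any address
}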

2. Control Flow (Branches)

SIMD executes the same instruction on all lanes β†’ branches are problematic!

// Problematic: conditional per element
for (int i = 0; i < n; i++) {
    if (a[i] > 0) {        // Different per element!
        c[i] = a[i];
    } else {
        c[i] = -a[i];
    }
}
 
// SIMD solution: predicated execution (mask)
__m256 va = _mm256_load_ps(a);
__m256 zero = _mm256_setzero_ps();
__m256 mask = _mm256_cmp_ps(va, zero, _CMP_GT_OQ);  // a > 0?
 
__m256 pos = va;              // Positive path
__m256 neg = _mm256_sub_ps(zero, va);  // Negative path
 
// Blend based on mask (no branch!)
__m256 vc = _mm256_blendv_ps(neg, pos, mask);

Cost: Both paths execute, then select → often still faster than scalar code when the branch is unpredictable!

3. Gather/Scatter (Non-contiguous Access)

// Bad: Indirect/strided access
for (int i = 0; i < n; i++) {
    c[i] = a[indices[i]];  // Random access!
}
 
// SIMD gather (AVX2+)
__m256i vidx = _mm256_load_si256((__m256i*)indices);  // Load 8 indices
__m256 va = _mm256_i32gather_ps(a, vidx, 4);  // Gather from 8 locations
 
// Still slower than sequential, but better than scalar

Performance: Gather/scatter are much slower than contiguous loads (10x+ latency)

3. Why SIMD Matters for Machine Learning

3.1 ML Workloads are SIMD-Friendly

Matrix Multiply (Core of neural networks):

// C = A Γ— B
for (int i = 0; i < M; i++) {
    for (int j = 0; j < N; j++) {
        for (int k = 0; k < K; k++) {
            C[i][j] += A[i][k] * B[k][j];  // Perfect for FMA!
        }
    }
}
 
// AVX-512 FMA (16-way)
// 16 multiply-adds per instruction
// 2 FLOPs per element × 16 elements = 32 FLOPs per instruction

Activation Functions:

// ReLU: max(0, x) - element-wise operation
for (int i = 0; i < n; i++) {
    output[i] = fmaxf(0.0f, input[i]);
}
 
// SIMD ReLU (8-way with AVX)
__m256 zero = _mm256_setzero_ps();
for (int i = 0; i < n; i += 8) {
    __m256 v = _mm256_load_ps(&input[i]);
    v = _mm256_max_ps(v, zero);  // 8 max operations
    _mm256_store_ps(&output[i], v);
}

Softmax:

// Exponential + normalization - vectorizable
for (int i = 0; i < n; i++) {
    sum += exp(x[i]);
}
for (int i = 0; i < n; i++) {
    output[i] = exp(x[i]) / sum;  // All SIMD-friendly!
}
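
As a concrete (and slightly cheaper) variant, here is a sketch that calls expf once per element instead of twice. Whether expf itself gets vectorized depends on the compiler and math library (e.g. glibc's libmvec with aggressive optimization flags), so treat it as illustrative:

#include <math.h>
 
// Softmax that exponentiates each element once and reuses the result
void softmax(const float *x, float *out, int n) {
    float sum = 0.0f;
    for (int i = 0; i < n; i++) {
        out[i] = expf(x[i]);      // exponentiate once, keep the result
        sum += out[i];
    }
    float inv = 1.0f / sum;
    for (int i = 0; i < n; i++) {
        out[i] *= inv;            // normalize
    }
}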

3.2 Modern CPU SIMD Performance

Intel Skylake-X (AVX-512):

Peak FLOPS (per core):
- Clock: 3 GHz
- FMA units: 2
- Vector width: 512-bit (16 floats)
- FLOPs per FMA: 2 Γ— 16 = 32
 
Peak = 3 GHz × 2 FMA units × 32 FLOPs = 192 GFLOPS/core
 
48-core server: 192 Γ— 48 = 9.2 TFLOPS!

Compare to scalar (no SIMD):

Peak = 3 GHz × 2 units × 1 FLOP/cycle = 6 GFLOPS/core
Speedup: 32x from vector width (16x) and FMA (2x) combined!
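
Which of these peaks applies to your machine depends on which SIMD extensions the CPU actually supports. With GCC or Clang on x86 you can check at runtime with the __builtin_cpu_supports builtin - a small sketch:

#include <stdio.h>
 
// GCC/Clang builtin that queries CPU features at runtime (x86 only)
int main(void) {
    printf("AVX:      %s\n", __builtin_cpu_supports("avx")     ? "yes" : "no");
    printf("AVX2:     %s\n", __builtin_cpu_supports("avx2")    ? "yes" : "no");
    printf("FMA:      %s\n", __builtin_cpu_supports("fma")     ? "yes" : "no");
    printf("AVX-512F: %s\n", __builtin_cpu_supports("avx512f") ? "yes" : "no");
    return 0;
}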

4. Throughput-Oriented Architectures

4.1 Latency vs Throughput Architectures

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                     β”‚  Latency-Oriented    β”‚ Throughput-Orientedβ”‚
β”‚                     β”‚  (CPU)               β”‚ (GPU)              β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Goal                β”‚ Minimize latency     β”‚ Maximize throughputβ”‚
β”‚ Design Philosophy   β”‚ Few powerful cores   β”‚ Many simple cores  β”‚
β”‚ Clock Speed         β”‚ 3-5 GHz              β”‚ 1-2 GHz            β”‚
β”‚ Out-of-Order        β”‚ Yes (aggressive)     β”‚ No (in-order)      β”‚
β”‚ Branch Prediction   β”‚ Sophisticated        β”‚ Simple/None        β”‚
β”‚ Cache               β”‚ Large (32 MB L3)     β”‚ Small (explicit)   β”‚
β”‚ Memory Latency Hide β”‚ OoO + prefetch       β”‚ Thread switching   β”‚
β”‚ Thread Count        β”‚ 2-4 per core         β”‚ 1000s              β”‚
β”‚ Best For            β”‚ Serial, branchy code β”‚ Parallel, SIMD codeβ”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

4.2 Throughput Design Principles

1. Many Simple Cores > Few Complex Cores

Latency CPU:
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Complex Core (200mmΒ²)       β”‚
β”‚  - OoO execution             β”‚
β”‚  - Branch prediction         β”‚
β”‚  - Large cache               β”‚
β”‚  Performance: 1 task fast    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
 
Throughput GPU (same area):
β”Œβ”€β”€β”€β”€β”¬β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”
β”‚ C1 β”‚ C2 β”‚ C3 β”‚ C4 β”‚ C5 β”‚ ... β”‚ 100 simple cores
β”‚ C7 β”‚ C8 β”‚ C9 β”‚ C10β”‚ C11β”‚ ... β”‚ Each core: 2mmΒ²
β”‚ ...β”‚ ...β”‚ ...β”‚ ...β”‚ ...β”‚ ... β”‚ In-order, simple
β””β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”˜
Performance: 100 tasks at once (if parallel!)

2. Hide Latency with Thread Switching

// CPU approach: Out-of-order execution
// Issue instructions from instruction window
// Execute when ready, reorder completion
 
// GPU approach: Massive threading
// When thread stalls (memory access):
//   β†’ Switch to another ready thread instantly
//   β†’ Keep compute units busy
//   β†’ Original thread resumes when data arrives
 
// Example: 32 warps per SM
// Each warp: 32 threads
// Total: 1024 threads per SM
// Memory latency: 200 cycles? No problem - 31 other warps to run!

3. Explicit Memory Hierarchy

// CPU: Implicit caching (hardware managed)
data = array[i];  // Automatic L1/L2/L3 cache management
 
// GPU: Explicit shared memory (programmer managed)
__shared__ float shared_data[256];  // On-chip scratchpad
// Programmer loads data to shared memory
// Much faster than repeated global memory access

5. Parallelism Metrics

5.1 Speedup

Definition: How much faster is parallel version compared to serial?

Speedup = Time(serial) / Time(parallel)
 
Example:
Serial: 100 seconds
8-core parallel: 15 seconds
Speedup = 100 / 15 = 6.67x

Ideal speedup (linear): Speedup = Number of processors

2 cores β†’ 2x speedup
4 cores β†’ 4x speedup
8 cores β†’ 8x speedup  (rare in practice!)

Real-world speedup: Usually sublinear due to:

  • Serial portions (Amdahl's Law - see the worked example after this list)
  • Communication overhead
  • Synchronization costs
  • Load imbalance
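
Amdahl's Law deserves a quick worked example: if a fraction p of the runtime parallelizes perfectly across N processors, speedup is capped at 1 / ((1 - p) + p/N). A minimal sketch (the function name amdahl_speedup is ours):

#include <stdio.h>
 
// Upper bound on speedup when only a fraction p of the work is parallel
double amdahl_speedup(double p, int n) {
    return 1.0 / ((1.0 - p) + p / n);
}
 
int main(void) {
    // Even with 95% of the work parallel, 8 cores top out well below 8x
    printf("p=0.95, N=8:  %.2fx\n", amdahl_speedup(0.95, 8));    // ~5.93x
    printf("p=0.95, N=64: %.2fx\n", amdahl_speedup(0.95, 64));   // ~15.42x
    return 0;
}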

5.2 Efficiency

Definition: How effectively are processors utilized?

Efficiency = Speedup / Number of Processors
 
Example (from above):
Speedup = 6.67x on 8 cores
Efficiency = 6.67 / 8 = 83.4%
 
Interpretation: 83.4% of ideal performance

Strong scaling: Fixed problem size, more processors

1 core:  100s (baseline)
2 cores: 52s  (Speedup 1.92x, Efficiency 96%)
4 cores: 28s  (Speedup 3.57x, Efficiency 89%)
8 cores: 15s  (Speedup 6.67x, Efficiency 83%)
 
β†’ Efficiency decreases as overhead dominates

Weak scaling: Increase problem size with processors

1 core:  10K elements, 10s
2 cores: 20K elements, 10s (Perfect weak scaling!)
4 cores: 40K elements, 10s
8 cores: 80K elements, 10s
 
β†’ Maintains efficiency by keeping work per core constant

5.3 Scalability

How well does speedup grow with more processors?

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Processors β”‚ Speedup  β”‚ Efficiencyβ”‚ Scalabilityβ”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 1          β”‚ 1.0x     β”‚ 100%     β”‚            β”‚
β”‚ 2          β”‚ 1.9x     β”‚ 95%      β”‚ Excellent  β”‚
β”‚ 4          β”‚ 3.6x     β”‚ 90%      β”‚ Good       β”‚
β”‚ 8          β”‚ 6.4x     β”‚ 80%      β”‚ Decent     β”‚
β”‚ 16         β”‚ 10.7x    β”‚ 67%      β”‚ Poor       β”‚
β”‚ 32         β”‚ 14.2x    β”‚ 44%      β”‚ Very Poor  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Good scalability: Efficiency stays above 75% up to target scale

5.4 Example Calculation

Matrix multiply: 4096Γ—4096 matrices

// Serial implementation
Time(serial) = 120 seconds
 
// Parallel on 8 cores with AVX-512 SIMD
Time(parallel) = 2.5 seconds
 
// Metrics
Speedup = 120 / 2.5 = 48x
Efficiency = 48 / 8 = 600%
 
// Wait, >100% efficiency?
// Yes! SIMD gives 16x boost per core
// Combined: 8 cores Γ— 16-way SIMD = 128x theoretical
// Achieved: 48x / 128x theoretical = 37.5% of peak
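
To reproduce these numbers yourself, here is a tiny sketch (the variable names are ours) that prints the same speedup, efficiency, and fraction-of-peak figures:

#include <stdio.h>
 
int main(void) {
    double t_serial = 120.0, t_parallel = 2.5;
    int cores = 8, simd_width = 16;                   // AVX-512: 16 floats per op
 
    double speedup    = t_serial / t_parallel;        // 48x
    double efficiency = speedup / cores;              // 6.0 -> 600% vs cores alone
    double peak       = (double)cores * simd_width;   // 128x theoretical (cores × SIMD)
    printf("Speedup: %.1fx, vs cores: %.0f%%, vs cores*SIMD: %.1f%%\n",
           speedup, efficiency * 100.0, speedup / peak * 100.0);  // 48x, 600%, 37.5%
    return 0;
}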

6. SIMD Programming Models

6.1 Intrinsics (Explicit SIMD)

#include <immintrin.h>  // Intel intrinsics
 
// Assumes a, b, c are 32-byte aligned and n is a multiple of 8
void add_avx(float *a, float *b, float *c, int n) {
    for (int i = 0; i < n; i += 8) {
        __m256 va = _mm256_load_ps(&a[i]);
        __m256 vb = _mm256_load_ps(&b[i]);
        __m256 vc = _mm256_add_ps(va, vb);
        _mm256_store_ps(&c[i], vc);
    }
}

Pros: Full control, predictable performance
Cons: Not portable, verbose, manual optimization

6.2 Auto-Vectorization (Compiler)

// Simple C code
void add(float *a, float *b, float *c, int n) {
    for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];  // Compiler vectorizes this!
    }
}
 
// Compile with optimization
// gcc -O3 -march=native -ftree-vectorize
// Compiler generates AVX/SSE instructions automatically

Pros: Portable, simple code
Cons: Compiler may miss opportunities, less control
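
Two practical tips when leaning on the auto-vectorizer (GCC flags shown; other compilers have equivalents): restrict tells the compiler the arrays cannot overlap, and -fopt-info-vec asks it to report which loops it actually vectorized. A sketch:

// restrict promises the compiler that a, b, c never alias,
// removing the runtime overlap checks it would otherwise insert.
void add(float *restrict c, const float *restrict a,
         const float *restrict b, int n) {
    for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}
 
// Ask GCC to report what it vectorized:
//   gcc -O3 -march=native -fopt-info-vec -c add.c
//   (it prints a short "loop vectorized" note for each loop it handled)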

6.3 SIMD Libraries

// Eigen (C++)
VectorXf a(n), b(n), c(n);
c = a + b;  // Automatically uses SIMD!
 
// NumPy (Python)
c = a + b  # NumPy uses optimized BLAS with SIMD

Pros: High-level, portable, well-optimized
Cons: Less control, potential overhead

7. Key Takeaways

βœ… Two types of parallelism:

  • Data parallelism: same op, different data (SIMD-friendly)
  • Task parallelism: different ops, independent data

βœ… SIMD processes multiple data elements per instruction

  • Modern CPUs: 8-16 wide (AVX/AVX-512)
  • Crucial for ML: matrix multiply, activations

βœ… Throughput architectures (GPUs) trade single-thread speed for massive parallelism

βœ… Speedup and efficiency measure parallel performance

  • Speedup = Serial time / Parallel time
  • Efficiency = Speedup / Number of processors

βœ… SIMD challenges: alignment, branches, non-contiguous access

βœ… ML loves SIMD: Matrix ops, element-wise ops are naturally vectorizable

8. Practice Problems

1. SIMD Speedup

Scalar code: 100 cycles
AVX-256 SIMD (8-way): ? cycles
Assume perfect vectorization, what's the speedup?

2. Efficiency Calculation

16-core CPU, parallel speedup = 10x
What is the efficiency? What might cause less than ideal speedup?

3. SIMD Suitability

// Which loops can be efficiently SIMD-vectorized?
// Loop A
for (int i = 0; i < n; i++) {
    c[i] = a[i] + b[i];
}
 
// Loop B  
for (int i = 1; i < n; i++) {
    c[i] = c[i-1] + a[i];  // Dependency!
}

9. Next Steps

Now that you understand parallel computing basics:

β†’ GPU Architecture: Dive deep into throughput-oriented design

β†’ Matrix Multiplication: Apply SIMD to the most important ML kernel

β†’ TPU Architecture: See how Google optimized for tensor operations

β†’ Performance Analysis: Measure SIMD effectiveness in real code

Remember: Modern performance is all about exploiting parallelism - both data and task level! πŸš€