Parallel Execution

SPMD parallelism and loop constructs in Croktile.

Parallel Blocks

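Croktile uses SPMD-style (single program, multiple data) parallelism. Every instance of a parallel block runs the same body, and parallel blocks can be nested: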

C++
parallel (blockIdx) {
  // Code runs across all thread blocks
  parallel (threadIdx) {
    // Code runs across all threads within a block
  }
}
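
To make the nesting concrete, here is a minimal CPU sketch in plain C++ (not Croktile) that models which instances of the body exist. The function name run_spmd and its parameters are hypothetical, and the sequential loops only stand in for the two parallel levels; on real hardware the instances would run concurrently, with no ordering guaranteed by this sketch.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical CPU emulation of two nested parallel levels.
// Each (block, thread) pair runs the same body exactly once.
// Returns a counter per global id recording how often the body ran for it.
std::vector<std::size_t> run_spmd(std::size_t num_blocks,
                                  std::size_t threads_per_block) {
  std::vector<std::size_t> visited(num_blocks * threads_per_block, 0);
  for (std::size_t block = 0; block < num_blocks; ++block) {               // models parallel (blockIdx)
    for (std::size_t thread = 0; thread < threads_per_block; ++thread) {   // models parallel (threadIdx)
      // Same program, different data: the indices select the element.
      const std::size_t global_id = block * threads_per_block + thread;
      visited[global_id] += 1;
    }
  }
  return visited;
}
```

Calling run_spmd(2, 3) yields six counters, all equal to 1: one execution of the body per (block, thread) pair.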

Loop Constructs

The within construct provides structured loops. Writing within (i : N) binds i to each index from 0 to N-1:

C++
within (i : N) {
  // Iterates i from 0 to N-1
  output[i] = input[i] * 2.0;
}
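
For comparison, the loop above reads like an ordinary counted loop. The following plain C++ sketch shows the same computation; the function name double_all is hypothetical, and the use of std::vector for input and output is an assumption made for illustration.

```cpp
#include <cstddef>
#include <vector>

// Plain C++ counterpart of: within (i : N) { output[i] = input[i] * 2.0; }
// double_all is a hypothetical helper name used only for this sketch.
std::vector<double> double_all(const std::vector<double>& input) {
  const std::size_t N = input.size();
  std::vector<double> output(N);
  for (std::size_t i = 0; i < N; ++i) {  // i runs from 0 to N-1
    output[i] = input[i] * 2.0;
  }
  return output;
}
```

Passing {1.0, 2.5, -3.0} returns a vector holding 2.0, 5.0, and -6.0.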

Combining Parallelism and Loops

A within loop can be nested inside a parallel block, so the loop runs once per block instance:

C++
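parallel (blockIdx) {
  // Copy a chunk of input into shared memory (smem)
  shared = dma.copy input.chunkat(block_size, 1) => smem;
  // K / tile_k iterations, one per tile_k-wide tile along K
  within (k : K / tile_k) {
    mma shared.chunkat(1, tile_k)
        weights.chunkat(1, tile_k) => output;
  }
}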
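
The loop structure above, K / tile_k iterations that each consume one tile, can be sketched on the CPU as a tiled reduction. This is a simplified analog in plain C++, not Croktile's actual dma.copy or mma semantics: the function name tiled_dot is hypothetical, a scalar dot product stands in for the matrix multiply-accumulate, and tile_k is assumed to divide K evenly.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU analog of a tiled accumulation along K.
// Each outer iteration handles one tile of tile_k elements.
double tiled_dot(const std::vector<double>& a,
                 const std::vector<double>& b,
                 std::size_t tile_k) {
  assert(a.size() == b.size());
  const std::size_t K = a.size();
  assert(tile_k > 0 && K % tile_k == 0);  // assumption: tile_k divides K evenly

  double acc = 0.0;
  for (std::size_t k = 0; k < K / tile_k; ++k) {  // models within (k : K / tile_k)
    const std::size_t base = k * tile_k;          // first element of the k-th tile
    for (std::size_t j = 0; j < tile_k; ++j) {
      acc += a[base + j] * b[base + j];           // stands in for one multiply-accumulate step
    }
  }
  return acc;
}
```

With a = {1.0, 2.0, 3.0, 4.0}, b = {1.0, 1.0, 1.0, 1.0}, and tile_k = 2, the loop runs two tiles and returns 10.0, the same result as an untiled dot product.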