DMA and Data Movement

Chunk-based data movement with Croktile's DMA primitives.

The DMA Statement

Croktile expresses data movement as a single declarative DMA statement:

C++
dma.copy input.chunkat(tiling_factors) => shared;

This statement copies one chunk of input, selected by chunkat with the given tiling factors, into shared memory.
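
For intuition, the effect of a chunk copy can be modelled in ordinary host C++ as copying a rectangular region of a row-major matrix into a contiguous staging buffer. This is only a sketch of the semantics, not Croktile code and not what the compiler emits. The helper name copy_chunk and all of its parameters are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not a Croktile API): copies one rectangular chunk of a
// row-major matrix into a contiguous staging buffer, which stands in for
// shared memory in this host-side model.
std::vector<float> copy_chunk(const std::vector<float>& src,
                              std::size_t rows, std::size_t cols,
                              std::size_t row0, std::size_t col0,
                              std::size_t chunk_rows, std::size_t chunk_cols) {
  assert(src.size() == rows * cols);
  assert(row0 + chunk_rows <= rows && col0 + chunk_cols <= cols);

  std::vector<float> staged(chunk_rows * chunk_cols);
  for (std::size_t r = 0; r < chunk_rows; ++r) {
    for (std::size_t c = 0; c < chunk_cols; ++c) {
      // Source index uses the full row stride; destination is densely packed.
      staged[r * chunk_cols + c] = src[(row0 + r) * cols + (col0 + c)];
    }
  }
  return staged;
}
```

The declarative statement replaces this kind of hand-written loop nest and index arithmetic with a single line.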

Tiling with ChunkAt

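The chunkat operator tiles data along each dimension, taking one tiling factor per dimension. In the example below, a factor of 4 in each dimension divides a [128, 256] tensor into [32, 64] chunks (128 / 4 = 32 and 256 / 4 = 64):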

C++
// Tile a [128, 256] tensor into [32, 64] chunks
shared = dma.copy data.chunkat(4, 4) => smem;
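
The arithmetic behind the comment above can be sketched in plain C++. The sketch assumes that each tiling factor is the number of chunks along its dimension and that the tensor is stored row-major; both are assumptions made for illustration, and the names Extent2D, chunk_extent, and chunk_offset are hypothetical rather than part of Croktile.

```cpp
#include <cassert>
#include <cstddef>

struct Extent2D {
  std::size_t rows;
  std::size_t cols;
};

// Assumption: a tiling factor is the number of chunks along that dimension,
// so the chunk extent is the tensor extent divided by the factor.
Extent2D chunk_extent(Extent2D tensor, Extent2D factors) {
  assert(factors.rows > 0 && factors.cols > 0);
  assert(tensor.rows % factors.rows == 0 && tensor.cols % factors.cols == 0);
  return {tensor.rows / factors.rows, tensor.cols / factors.cols};
}

// Flat row-major offset of the first element of chunk (i, j).
std::size_t chunk_offset(Extent2D tensor, Extent2D factors,
                         std::size_t i, std::size_t j) {
  assert(i < factors.rows && j < factors.cols);
  const Extent2D chunk = chunk_extent(tensor, factors);
  return (i * chunk.rows) * tensor.cols + j * chunk.cols;
}
```

Under these assumptions, chunk_extent({128, 256}, {4, 4}) gives a 32 by 64 chunk, matching the comment in the example.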

Multi-stage Pipelines

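To overlap data movement with computation, use multi-buffering patterns. The example below stages two operands in shared memory, then feeds them to mma one bk-wide chunk at a time: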

C++
parallel (blockIdx) {
  buf_a = dma.copy input_a.chunkat(bm, 1) => smem;
  buf_b = dma.copy input_b.chunkat(bn, 1) => smem;
  
  within (k : K / bk) {
    mma buf_a.chunkat(1, bk)
        buf_b.chunkat(1, bk) => output;
  }
}

The Croktile compiler automatically manages descriptor configuration, offset calculations, and memory barriers.
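
To make the multi-buffering idea concrete, here is a minimal serial model of double buffering in host C++: while one staging buffer is being consumed, the next tile is loaded into the other. It is a sketch only. Real hardware overlaps the load with the computation and needs barriers between the two, which this serial version has no need for. The function name pipelined_sum and its parameters are hypothetical, and a simple sum stands in for the mma step.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Serial model of double buffering: tile k is consumed from one staging
// buffer while tile k + 1 is loaded into the other. Offsets and buffer
// rotation are written out by hand here.
float pipelined_sum(const std::vector<float>& input, std::size_t tile) {
  assert(tile > 0 && input.size() % tile == 0);
  const std::size_t num_tiles = input.size() / tile;
  std::vector<float> stage[2] = {std::vector<float>(tile),
                                 std::vector<float>(tile)};

  // Stand-in for a DMA copy of tile k into a staging buffer.
  auto load = [&](std::size_t k, std::vector<float>& dst) {
    for (std::size_t i = 0; i < tile; ++i) dst[i] = input[k * tile + i];
  };

  float total = 0.0f;
  if (num_tiles == 0) return total;

  load(0, stage[0]);  // Prologue: fill the first buffer.
  for (std::size_t k = 0; k < num_tiles; ++k) {
    // Prefetch the next tile into the buffer that is not in use.
    if (k + 1 < num_tiles) load(k + 1, stage[(k + 1) % 2]);
    // Consume the current tile.
    for (float v : stage[k % 2]) total += v;
  }
  return total;
}
```

The prologue load, the alternating buffer index, and the per-tile offsets are the kind of bookkeeping a hand-written pipeline has to get right.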