Open Source · Production Ready
The Next-Gen GPU & DSA Language for 5x Productivity
Write less code. Catch more bugs. Ship faster GPU kernels.
```
__co__ void spmm(
    global f8_e4m3 [M, PACKED_K] lhs_packed,
    global u8 [M, META_COLS] lhs_meta,
    global f8_e4m3 [N, K] rhs,
    global f16 [M, N] output) {
  parallel {block_m, block_n} by [cdiv(M, SPMM_WARP_M), cdiv(N, SPMM_WARP_N)] : block {
    shared event full[STAGES], empty[STAGES];
    shared f8_e4m3 [STAGES * SPMM_WARP_M, SPMM_PACKED_TILE_K] lhs_s;
    shared f8_e4m3 [STAGES * SPMM_WARP_N, SPMM_TILE_K] rhs_s;

    parallel p1 by 2 : group-4 {
      inthreads.async (p1 == 0) {
        foreach {iv_k} in [cdiv(K, SPMM_TILE_K)] {
          stage = iv_k % STAGES;
          wait empty[stage];
          tma.copy.async<full[stage]>.swiz<32>
              lhs_packed.subspan(SPMM_WARP_M, SPMM_PACKED_TILE_K).at(block_m, iv_k)
              => lhs_s.subspan(SPMM_WARP_M, SPMM_PACKED_TILE_K).at(stage, 0);
          trigger full[stage];
        }
      }
      inthreads.async (p1 == 1) {
        mc = mma.fill.f16 0.0f;
        foreach {iv_k} in [cdiv(K, SPMM_TILE_K)] {
          wait full[iv_k % STAGES];
          ma = mma.load.swiz<32> lhs_s.at(iv_k % STAGES, 0);
          mb = mma.load.swiz<64> rhs_s.at(iv_k % STAGES, 0);
          me = mma.load lhs_meta_s;
          mma.row.row.sp mc, ma, mb, me;
          trigger empty[iv_k % STAGES];
        }
        mma.store mc, output_s;
        tma.copy output_s => output.subspan(SPMM_WARP_M, SPMM_WARP_N).at(block_m, block_n);
      }
    }
  }
}
```

Sparse GEMM (FP8 E4M3): a real production kernel from the choreo repository.
Start with less, achieve more
Easy to Use
- Achieve extreme performance with zero-cost abstractions
- The most intuitive GPU & DSA programming language among all competitors
- Roughly 40% the length of equivalent CUDA code, at a higher abstraction level than Triton
- Work on tensors and tiles, not raw buffers and pointers
Work on tensors rather than raw buffers
Avoid the complexity of HW programming — TMA, DMA, tensor-core
Make parallel programs easier to write and reason about
- Seamless C++ integration with CUTLASS, CuTe, or any C++ library
matmul_hilbert.co
```
__co__ void matmul(
    global f16 [M, K] lhs,
    global f16 [N, K] rhs,
    global f16 [M, N] output,
    global s32 [T] schedule_m,
    global s32 [T] schedule_n) {
  int total = cdiv(M, WARP_M) * cdiv(N, WARP_N);
  parallel block_id by NUM_SMS : block {
    shared f16 [WARP_M, TILE_K] lhs_s;
    shared f16 [WARP_N, TILE_K] rhs_s;
    shared f16 [WARP_M, WARP_N] out_s;
    foreach {tile} in [cdiv(total, NUM_SMS)] {
      tile_id = tile # block_id;
      if (tile_id < total) {
        int bm = schedule_m.at(tile_id);
        int bn = schedule_n.at(tile_id);
        mc = mma.fill.f16 0.0f;
        foreach {iv_k} in [cdiv(K, TILE_K)] {
          tma.copy.swiz<128> lhs.subspan(WARP_M, TILE_K).at(bm, iv_k) => lhs_s;
          tma.copy.swiz<128> rhs.subspan(WARP_N, TILE_K).at(bn, iv_k) => rhs_s;
          parallel p by 1 : group-4 {
            ma = mma.load.swiz<128> lhs_s;
            mb = mma.load.swiz<128> rhs_s;
            mma.row.row mc, ma, mb;
          }
        }
        mma.store mc, out_s;
        tma.copy out_s => output.subspan(WARP_M, WARP_N).at(bm, bn);
      }
    }
  }
}
```

Lines of code for a persistent warp-specialized GEMM kernel:

| Implementation | Lines of code |
|---|---|
| Croktile | 36 |
| Triton | 64 |
| CUTLASS (CuTeDSL) | 280 |
| CUDA + CuTe | 182 |
Catch bugs before they run
Compile-Time Safety
- 353 compile-time diagnostic checks across 7 compiler modules
- Catch tiling mismatches, shape errors, and DMA bugs before they run
- 1,319 runtime assertions guard every kernel in production
- Eliminate entire classes of bugs that plague CUDA development
Detect tiling mismatches at compile time
Enable aggressive runtime checks for easier debugging
Eliminate entire classes of DMA-related bugs
Compiler diagnostic checks across modules: 353

| Module | File | Checks | Example diagnostics |
|---|---|---|---|
| Early Semantic Analysis | earlysema.cpp | 189 | type mismatch in SELECT; invalid parallel nesting; undeclared variable in `__co__` scope |
| Semantic Validation | semacheck.cpp | 65 | inconsistent shapes for spanned operation; invalid DMA target memory space; MMA: matrix shapes do not match |
| Type Inference | typeinfer.cpp | 38 | cannot infer element type; conflicting types in reduction; incompatible accumulator dtype |
| Shape Inference | shapeinfer.cpp | 33 | tiling factor exceeds data size; index out of bounds for dimension; span size mismatch in chunkat |
| Loop Vectorization | loop_vectorize.cpp | 16 | vectorize factor not divisible; invalid loop bounds for SIMD |
| Static Assessment | assess.cpp | 4 | shape incompatibility proven at compile time |
| Code Generation | codegen + others | 8 | target architecture constraint violated; DMA size exceeds 2^32 bytes |
Symbolic dimensions, zero boilerplate
Dynamic Shapes
- First-ever dynamic shared memory for GPU & DSA kernel programming
- Symbolic dimensions — write once, run on any shape
- Derived dimensions auto-computed: PACKED_K, META_COLS, grid, SMEM
- No template metaprogramming, no recompilation per shape
First-class symbolic dimension support
Dynamic shared memory without boilerplate
Static and runtime memory support unified
Pick a shape and the derived dims update automatically. With M = 4096, K = 4096, N = 4096:

| Derived dimension | Value | Formula |
|---|---|---|
| PACKED_K | 2048 | K/2 |
| META_COLS | 512 | K/8 |
| Grid | 64×16 | ⌈M/WM⌉ × ⌈N/WN⌉ |
| SMEM | 40KB | auto |

K iterations: 64. All derived from symbolic M, K, N.
Why competitors can't do this
All dimensions must be template parameters or runtime arguments with manual size calculations
```cuda
// Must manually compute every buffer size
__shared__ float smA[BM][BK];  // compile-time size only

// Dynamic SMEM requires manual calc + launch arg
int smem = BM * BK * sizeof(float);
kernel<<<grid, block, smem>>>(...);
```

Language meets machine intelligence
Born for AI Tuning
- Language designed from scratch for AI-assisted optimization
- Compact syntax keeps entire kernels inside AI context windows
- AI tuning lifted a kernel from 671 to 1127 TFLOPS in a single session
- Well-documented CLI and structured errors for autonomous agents
Compact syntax perfect for AI context windows
Error-prone configs hidden behind simple primitives
CLI tools and arguments well-documented for AI agents
Compile-time checks guard against AI-confusing bugs
User: continue experiment on branch ai-tune/2026-03-24/gemm_sp_f16. Read two reference experiment branches:
- branch 1: ai-tune/2026-03-21/gemm_sp_f16
- branch 2: ai-tune/2026-03-21/gemm_sp_e4m3

Branch 2 best can achieve over 1100 TFLOPS, branch 1 best 600+ TFLOPS. Do not directly copy — optimize stage over stage. KICK OFF EXPERIMENTS NOW.
| Iteration | TFLOPS |
|---|---|
| baseline | 671 |
| iter001 | 759 |
| iter016 | 772 |
| iter023 | 811 |
| iter068 | 1127 |

671 → 1127 TFLOPS (+67.9%) in 68 AI-driven iterations

Try Croktile today.
Write less code. Catch more bugs. Ship faster GPU kernels.
- 62% less code than CUDA + CuTe
- 353 compile-time checks
- 1128 peak TFLOPS (FP8)