Open Source · Production Ready
The Next-Gen GPU & DSA Language for 5x Productivity
Write less code. Catch more bugs. Ship faster GPU kernels.
```
__co__ void spmm(
    global f8_e4m3 [M, PACKED_K] lhs_packed,
    global u8 [M, META_COLS] lhs_meta,
    global f8_e4m3 [N, K] rhs,
    global f16 [M, N] output) {
  parallel {block_m, block_n} by [cdiv(M, SPMM_WARP_M), cdiv(N, SPMM_WARP_N)] : block {
    shared event full[STAGES], empty[STAGES];
    shared f8_e4m3 [STAGES * SPMM_WARP_M, SPMM_PACKED_TILE_K] lhs_s;
    shared f8_e4m3 [STAGES * SPMM_WARP_N, SPMM_TILE_K] rhs_s;

    parallel p1 by 2 : group-4 {
      inthreads.async (p1 == 0) {
        foreach {iv_k} in [cdiv(K, SPMM_TILE_K)] {
          stage = iv_k % STAGES;
          wait empty[stage];
          tma.copy.async<full[stage]>.swiz<32>
              lhs_packed.subspan(SPMM_WARP_M, SPMM_PACKED_TILE_K).at(block_m, iv_k)
              => lhs_s.subspan(SPMM_WARP_M, SPMM_PACKED_TILE_K).at(stage, 0);
          trigger full[stage];
        }
      }
      inthreads.async (p1 == 1) {
        mc = mma.fill.f16 0.0f;
        foreach {iv_k} in [cdiv(K, SPMM_TILE_K)] {
          wait full[iv_k % STAGES];
          ma = mma.load.swiz<32> lhs_s.at(iv_k % STAGES, 0);
          mb = mma.load.swiz<64> rhs_s.at(iv_k % STAGES, 0);
          me = mma.load lhs_meta_s;
          mma.row.row.sp mc, ma, mb, me;
          trigger empty[iv_k % STAGES];
        }
        mma.store mc, output_s;
        tma.copy output_s => output.subspan(SPMM_WARP_M, SPMM_WARP_N).at(block_m, block_n);
      }
    }
  }
}
```

Sparse GEMM (FP8 E4M3): a real production kernel from the choreo repository.
Start with less, achieve more
Easy to Use
- Achieve extreme performance with zero-cost abstractions
- The most intuitive GPU & DSA programming language among all competitors
- Roughly 40% the length of equivalent CUDA code, at a higher abstraction level than Triton
- Work on tensors and tiles, not raw buffers and pointers
Work on tensors rather than raw buffers
Avoid the complexity of HW programming — TMA, DMA, tensor-core
Make parallel programs easier to write and reason about
- Seamless C++ integration with CUTLASS, CuTe, or any C++ library
matmul_hilbert.co
```
__co__ void matmul(
    global f16 [M, K] lhs,
    global f16 [N, K] rhs,
    global f16 [M, N] output,
    global s32 [T] schedule_m,
    global s32 [T] schedule_n) {
  int total = cdiv(M, WARP_M) * cdiv(N, WARP_N);
  parallel block_id by NUM_SMS : block {
    shared f16 [WARP_M, TILE_K] lhs_s;
    shared f16 [WARP_N, TILE_K] rhs_s;
    shared f16 [WARP_M, WARP_N] out_s;
    foreach {tile} in [cdiv(total, NUM_SMS)] {
      tile_id = tile # block_id;
      if (tile_id < total) {
        int bm = schedule_m.at(tile_id);
        int bn = schedule_n.at(tile_id);
        mc = mma.fill.f16 0.0f;
        foreach {iv_k} in [cdiv(K, TILE_K)] {
          tma.copy.swiz<128> lhs.subspan(WARP_M, TILE_K).at(bm, iv_k) => lhs_s;
          tma.copy.swiz<128> rhs.subspan(WARP_N, TILE_K).at(bn, iv_k) => rhs_s;
          parallel p by 1 : group-4 {
            ma = mma.load.swiz<128> lhs_s;
            mb = mma.load.swiz<128> rhs_s;
            mma.row.row mc, ma, mb;
          }
        }
        mma.store mc, out_s;
        tma.copy out_s => output.subspan(WARP_M, WARP_N).at(bm, bn);
      }
    }
  }
}
```

Lines of code for a persistent warp-specialized GEMM kernel:

| Implementation | Lines of code |
|---|---|
| Croktile | 36 |
| Triton | 64 |
| CUTLASS (CuTeDSL) | 280 |
| CUDA + CuTe | 182 |
Catch bugs before they run
Compile-Time Safety
- 353 compile-time diagnostic checks across 7 compiler modules
- Catch tiling mismatches, shape errors, and DMA bugs before they run
- 1,319 runtime assertions guard every kernel in production
- Eliminate entire classes of bugs that plague CUDA development
Detect tiling mismatches at compile time
Enable aggressive runtime checks for easier debugging
Eliminate entire classes of DMA-related bugs
Compiler diagnostic checks across modules: 353

| Module | File | Checks | Example diagnostics |
|---|---|---|---|
| Early Semantic Analysis | earlysema.cpp | 189 | type mismatch in SELECT; invalid parallel nesting; undeclared variable in `__co__` scope |
| Semantic Validation | semacheck.cpp | 65 | inconsistent shapes for spanned operation; invalid DMA target memory space; MMA: matrix shapes do not match |
| Type Inference | typeinfer.cpp | 38 | cannot infer element type; conflicting types in reduction; incompatible accumulator dtype |
| Shape Inference | shapeinfer.cpp | 33 | tiling factor exceeds data size; index out of bounds for dimension; span size mismatch in chunkat |
| Loop Vectorization | loop_vectorize.cpp | 16 | vectorize factor not divisible; invalid loop bounds for SIMD |
| Static Assessment | assess.cpp | 4 | shape incompatibility proven at compile time |
| Code Generation | codegen + others | 8 | target architecture constraint violated; DMA size exceeds 2^32 bytes |
Symbolic dimensions, zero boilerplate
Dynamic Shapes
- First-ever dynamic shared memory for GPU & DSA kernel programming
- Symbolic dimensions — write once, run on any shape
- Derived dimensions auto-computed: PACKED_K, META_COLS, grid, SMEM
- No template metaprogramming, no recompilation per shape
First-class symbolic dimension support
Dynamic shared memory without boilerplate
Static and runtime memory support unified
Pick a shape and the derived dims update automatically. With M = 4096, K = 4096, N = 4096:

| Derived dimension | Value | Formula |
|---|---|---|
| PACKED_K | 2048 | K/2 |
| META_COLS | 512 | K/8 |
| Grid | 64×16 | ⌈M/WM⌉ × ⌈N/WN⌉ |
| SMEM | 40KB | auto |

K iterations: 64. All derived from symbolic M, K, N.
Why competitors can't do this
All dimensions must be template parameters or runtime arguments with manual size calculations
```cuda
// Must manually compute every buffer size
__shared__ float smA[BM][BK];  // compile-time size only

// Dynamic SMEM requires manual calc + launch arg
int smem = BM * BK * sizeof(float);
kernel<<<grid, block, smem>>>(...);
```

Language meets machine intelligence
Born for AI Tuning
- Language designed from scratch for AI-assisted optimization
- Compact syntax keeps entire kernels inside AI context windows
- AI tuning lifted a kernel from 671 to 1127 TFLOPS in a single session
- Well-documented CLI and structured errors for autonomous agents
Compact syntax perfect for AI context windows
Error-prone configs hidden behind simple primitives
CLI tools and arguments well-documented for AI agents
Compile-time checks guard against AI-confusing bugs
User: continue experiment on branch ai-tune/2026-03-24/gemm_sp_f16. Read two reference experiment branches:
- branch 1: ai-tune/2026-03-21/gemm_sp_f16
- branch 2: ai-tune/2026-03-21/gemm_sp_e4m3

Branch 2 best can achieve over 1100 TFLOPS, branch 1 best 600+ TFLOPS. Do not directly copy — optimize stage over stage. KICK OFF EXPERIMENTS NOW.
| Iteration | TFLOPS |
|---|---|
| baseline | 671 |
| iter001 | 759 |
| iter016 | 772 |
| iter023 | 811 |
| iter068 | 1127 |

671 → 1127 TFLOPS (+67.9%) in 68 AI-driven iterations

Try Croktile today.
Write less code. Catch more bugs. Ship faster GPU kernels.
- 62% less code than CUDA + CuTe
- 353 compile-time checks
- 1128 peak TFLOPS (FP8)