About

I am a Principal Engineer / Chief Researcher in the Research Academy at Enflame Tech, where I work on compilers, runtimes, kernel programming models, and the system software layers needed to make AI computing practical on modern accelerators.

Most of my work centers on high-performance software for AI, especially at the junction of languages, compilers, runtimes, and system abstractions for accelerators. I care about designs that keep low-level control without making the software stack harder to evolve.

That shows up in several public projects. CroqTile turns ideas around TileFlow and DMA into a language, an autotuning workflow, and a toolchain for kernel development. On the systems side, Samoyeds and SPIDER approach sparse, accelerator-oriented execution from different angles, while my Rust-based serving infrastructure work around Candle and candle-vLLM-style execution on Enflame GCU brings the same concerns into full-stack LLM systems.

Bio

Dr. Heng Shi received his Ph.D. from the University of Bath, after undergraduate study in computer science at Tsinghua University.

He worked at Enflame as a Senior Algorithm Research Engineer and later as a Software Architect from 2018 to 2022, and has led research work as Principal Engineer / Chief Researcher since 2022, focusing on AI system stacks, performance optimization, and tooling design. He has published over 20 papers and received more than 1,800 citations overall, with work presented at venues including EuroSys, PPoPP, CGO, and ASE. His recent public work includes the CroqTile language and toolchain, Samoyeds at EuroSys 2025, SPIDER at PPoPP 2026, and related work on AI compilers, sparse acceleration, and systems for LLM serving. Earlier work spans forecasting and energy systems, while related community-facing efforts include participation in MLCommons / MLPerf and AIIA.

News

Selected public updates from releases, papers, and open project records.

  • 2026.04

    CroqTile v1.0.4 pre-beta was posted on the official changelog, continuing the public rollout of the language and compiler toolchain.

  • 2026

    The public CroqTile website, language repository, and tuner repository continue to make the language and tooling stack inspectable in the open.

  • 2026

    SPIDER: Unleashing Sparse Tensor Cores for Stencil Computation via Strided Swapping was published at PPoPP 2026.

  • 2026.01

    CroqTile v1.0.3 Open Source Base marked the public open-source release of the base language stack.

  • 2025

    Samoyeds was published at EuroSys 2025, presenting sparse-tensor-core acceleration for MoE models with structured sparsity.

  • 2025

    Public repositories around candle-gcu and candle-vllm-gcu document ongoing Rust-based LLM serving work on Enflame GCU.

  • 2025

    Postiz was published at CGO 2025 on post-increment addressing for loop optimization and code size reduction.

  • 2024-2025

    Continued participation in MLCommons / MLPerf and AIIA has kept part of the work connected to public benchmarking and industry-standard discussions.

  • 2024

    UFront appeared at ASE 2024 as a unified MLIR frontend effort for deep learning.

Publications

Google Scholar

Recent papers center on compilers, sparse inference, and AI systems, with earlier work in forecasting and energy systems.

2026

SPIDER: Unleashing Sparse Tensor Cores for Stencil Computation via Strided Swapping

Qiqi Gu, Chenpeng Wu, Heng Shi, Jianguo Yao. PPoPP 2026.

2026

Do We Need Tensor Cores for Stencil Computations?

Q Gu, C Wu, H Shi, J Yao, H Guan. arXiv preprint arXiv:2603.00477, 2026.

2025

Samoyeds: Accelerating MoE Models with Structured Sparsity Leveraging Sparse Tensor Cores

Chenpeng Wu, Qiqi Gu, Heng Shi, Jianguo Yao, Haibing Guan. EuroSys 2025.

2025

SPTCStencil: Unleashing Sparse Tensor Cores for Stencil Computation via Strided Swap

Q Gu, C Wu, H Shi, J Yao. arXiv e-prints, arXiv:2506.22035, 2025.

2025

Postiz: Extending Post-increment Addressing for Loop Optimization and Code Size Reduction

Enming Fan, Xiaofeng Guan, Fan Hu, Heng Shi, Hao Zhou, Jianguo Yao. CGO 2025.

2024

UFront: Toward a Unified MLIR Frontend for Deep Learning

Guoqing Bao, Heng Shi, Chengyi Cui, Yalin Zhang, Jianguo Yao. ASE 2024.

2024

SPHINX: Search Space-Pruning Heterogeneous Task Scheduling for Deep Neural Networks

Bowen Yuchi, Heng Shi, Guoqing Bao. ICPP 2024.

2024

Review of the opportunities and challenges to accelerate mass-scale application of smart grids with large-language models

Heng Shi, L Fang, X Chen, C Gu, K Ma, X Zhang, Z Zhang, J Gu, E. G. Lim. IET Smart Grid, 2024.

2024

Boost Linear Algebra Computation Performance via Efficient VNNI Utilization

Hao Zhou, Qiukun Han, Heng Shi, Yalin Zhang, Jianguo Yao. ASPLOS 2024.

2022

Decentralized home energy management system to reduce system peak and uncertainty

Y Du, H Shi, M Xu, F Li. IET Conference Proceedings, 2022.

2020

Deep learning for day-ahead electricity price forecasting

C Zhang, R Li, H Shi, F Li. IET Smart Grid, 2020.

2019

A statistical approach to estimate imbalance-induced energy losses for data-scarce low voltage networks

L Fang, K Ma, R Li, Z Wang, H Shi. IEEE Transactions on Power Systems, 2019.

2018

Data-driven uncertainty quantification and characterization for household energy demand across multiple time-scales

H Shi, Q Ma, N Smith, F Li. IEEE Transactions on Smart Grid, 2018.

2018

Probabilistic network pricing considering demand uncertainty in distribution systems

X Yang, X Yan, H Shi, C Gu, F Li. IEEE PES General Meeting, 2018.

2018

Uncertainty Analysis and Application on Smart Homes and Smart Grids: Big Data Approaches

H Shi. University of Bath, 2018.

2018

Decentralised control for combined heat and power system in energy community

H Shi, Z Zhang, F Li. CIRED 2018 - 25th International Conference on Electricity Distribution, 2018.

2017

Deep Learning for household load forecasting - A novel pooling deep RNN

Heng Shi, Minghao Xu, Ran Li. IEEE Transactions on Smart Grid, 2017.

2017

A whole system assessment of novel deep learning approach on short-term load forecasting

H Shi, M Xu, Q Ma, C Zhang, R Li, F Li. Energy Procedia, 2017.

2017

Assessment of relative efficiency of differing energy markets for community energy

F. L. Lanqing Shan, Heng Shi, Zhong Zhang. CIRED 2018 - 25th International Conference on Electricity Distribution, 2017.

2017

Decentralised control for combined heat and power system in energy community

Heng Shi, Zhong Zhang, Furong Li. CIRED 2018 - 25th International Conference on Electricity Distribution, 2017.

2016

Evaluation of fault levels and power supply network impedances in 230/400 V 50 Hz generic distribution systems

I Hernando-Gil, H Shi, F Li, S Djokic, M Lehtonen. IEEE Transactions on Power Delivery, 2016.

2016

The use of thaumatin and bovine serum albumin as proteins in model wine solutions in bentonite fining

H Shi, D. M. Burmeister, A Frost, D. A. Patterson, B James. Journal of Wine Research, 2016.

2015

Demand Side Response Performance Assessment: An Impact Analysis of Load Profile Accuracy on DSR Performances

Chen Zhao, Heng Shi, Ran Li, Furong Li. IEEE PES General Meeting, 2015. Best Research Paper Award.

2015

Combining mobile and fog computing: Using coap to link mobile device clouds with fog computing

H Shi, N Chen, R Deters. IEEE International Conference on Data Science and Data Intensive Systems, 2015.

2012

Coordinated charging of plug-in electric vehicles in charging stations

Z Xu, Z Hu, Y Song, Z Luo, K Zhan, H Shi. Automation of Electric Power Systems, 2012.

n.d.

SPTCStencil: Using Sparse Tensor Cores for Stencil Computation

Q Gu, C Wu, H Shi, J Yao. Undated Scholar record.

Projects

Selected systems and research artifacts from recent years.

CroqTile

A next-generation programming language for AI kernels and accelerators, built around TileFlow and DMA-oriented abstractions, with a reported productivity gain of about 5x on public cases.

Samoyeds

A EuroSys'25 system for MoE acceleration that applies structured sparsity to both activations and parameters on Sparse Tensor Cores.

SPIDER

A PPoPP'26 system that turns stencil computation into sparse matrix multiplication through strided swapping on Sparse Tensor Cores.

Rust LLM Serving Stack on GCU

A Rust-based serving stack around Candle and candle-vLLM on Enflame GCU, covering fused ops, quantization, multi-GCU inference, GGUF support, and memory-system work.

Vitae

Selected positions and education.

  • Enflame Tech 2022.12 - present

    Principal Engineer / Chief Researcher

    Working on AI framework and systems research in the Research Academy, with a focus on compilers, runtimes, kernel programming models, and related research projects.

  • Enflame Tech 2021.03 - 2022.12

    Software Architect

    Led software architecture and technical planning across AI frameworks and system software, while also supporting technical evaluation and team building.

  • Enflame Tech 2020.02 - 2021.03

    Senior Algorithm Research Engineer

    Worked on early-stage research across auto-parallelism, MLIR-related compiler work, benchmarking, and heterogeneous AI software systems.

  • University of Bath 2014.10 - 2018.07

    Ph.D. / Assistant Researcher

    Doctoral work on smart grids, load forecasting, and energy systems, alongside research and project work at the University of Bath.

  • Tsinghua University 2009.08 - 2013.06

    B.Sc.

    Undergraduate study in computer science, with early work touching on nonlinear optimization and AGV swarm scheduling.