CudaGPT2

GPT-2 Hyper-Optimization.
Published July 5, 2025

GPT-2, hyper-optimized in CUDA (GitHub: https://github.com/Autobot37/gpt.cpp)

A GPT-2 implementation in C and CUDA.

Completed tasks

  • Separate kernels for CUDA and CPU.
  • All kernels written from scratch except matmul (a sketch of one such kernel follows this list).
  • Tokenizer implementation.
  • Inference implementation.
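
To give a flavour of the from-scratch kernels, here is a minimal sketch of an elementwise CUDA kernel in the same spirit: a tanh-approximation GELU forward. The kernel name, launcher, and launch configuration are illustrative assumptions, not the repo's actual API.

#include <cuda_runtime.h>
#include <math.h>

// Illustrative from-scratch elementwise kernel (names are hypothetical).
// Tanh-approximation GELU; one thread per element.
__global__ void gelu_forward_kernel(float* out, const float* in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = in[i];
        float cube = 0.044715f * x * x * x;
        // 0.7978845608f ~= sqrt(2 / pi)
        out[i] = 0.5f * x * (1.0f + tanhf(0.7978845608f * (x + cube)));
    }
}

// Hypothetical host-side launcher: 256 threads per block.
void gelu_forward(float* out, const float* in, int n) {
    int block_size = 256;
    int grid_size = (n + block_size - 1) / block_size;
    gelu_forward_kernel<<<grid_size, block_size>>>(out, in, n);
}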

Todo

  • Optimize the Attention_forward kernel (a naive baseline sketch follows this list).
  • Profile, and optimize further based on the results.
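
For context on why Attention_forward is the optimization target, here is a sketch of what a naive causal attention forward typically looks like; the signature, tensor layouts, and scores scratch buffer are assumptions for illustration, not the repo's actual code. One thread handles one (batch, head, query) row, so K and V are re-read from global memory for every query position, which is exactly what profiling and tiling (shared memory, fused softmax) would attack.

#include <cuda_runtime.h>
#include <math.h>

// Hypothetical naive causal attention forward (unoptimized baseline).
// Layouts: q, k, v, out are (B, H, T, D); scores is (B, H, T, T) scratch.
__global__ void attention_forward_naive(float* out, const float* q,
                                        const float* k, const float* v,
                                        float* scores,
                                        int B, int H, int T, int D) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= B * H * T) return;
    int t = idx % T;        // query position
    int bh = idx / T;       // fused (batch, head) index
    const float* qrow = q + (size_t)(bh * T + t) * D;
    float* srow = scores + (size_t)idx * T;
    float scale = 1.0f / sqrtf((float)D);

    // 1) causal dot products, tracking the max for a stable softmax
    float maxval = -1e30f;
    for (int t2 = 0; t2 <= t; t2++) {
        const float* krow = k + (size_t)(bh * T + t2) * D;
        float dot = 0.0f;
        for (int d = 0; d < D; d++) dot += qrow[d] * krow[d];
        dot *= scale;
        srow[t2] = dot;
        if (dot > maxval) maxval = dot;
    }
    // 2) softmax over the causal prefix
    float sum = 0.0f;
    for (int t2 = 0; t2 <= t; t2++) {
        srow[t2] = expf(srow[t2] - maxval);
        sum += srow[t2];
    }
    // 3) weighted sum of values
    float* orow = out + (size_t)(bh * T + t) * D;
    for (int d = 0; d < D; d++) orow[d] = 0.0f;
    for (int t2 = 0; t2 <= t; t2++) {
        const float* vrow = v + (size_t)(bh * T + t2) * D;
        float w = srow[t2] / sum;
        for (int d = 0; d < D; d++) orow[d] += w * vrow[d];
    }
}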

To run

git clone -b cuda https://github.com/Autobot37/gpt.cpp
python3 pythonscripts/prepare_tokenizer.py
python3 writestate.py
make run_cuda

Dependencies

  • Python tiktoken module.
  • NVIDIA CUDA Toolkit.