CudaGPT2

GPT-2 Hyper-Optimization.
Published July 5, 2025

GPT-2, hyper-optimized in CUDA (GitHub: https://github.com/Autobot37/gpt.cpp)

A GPT-2 implementation in C and CUDA.

Completed tasks

  • Separate kernels for CUDA and CPU.
  • All kernels written from scratch except matmul (a sketch of one such kernel follows this list).
  • Tokenizer implementation.
  • Inference implementation.
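
To give a flavour of the from-scratch kernels, here is a minimal sketch of an elementwise CUDA kernel in the same spirit: a tanh-approximation GELU forward. The kernel name, launcher, and launch configuration are illustrative assumptions, not the repo's actual API.

#include <cuda_runtime.h>
#include <math.h>

// Illustrative from-scratch elementwise kernel (names are hypothetical).
// Tanh-approximation GELU; one thread per element.
__global__ void gelu_forward_kernel(float* out, const float* in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = in[i];
        float cube = 0.044715f * x * x * x;
        // 0.7978845608f ~= sqrt(2 / pi)
        out[i] = 0.5f * x * (1.0f + tanhf(0.7978845608f * (x + cube)));
    }
}

// Hypothetical host-side launcher: 256 threads per block.
void gelu_forward(float* out, const float* in, int n) {
    int block_size = 256;
    int grid_size = (n + block_size - 1) / block_size;
    gelu_forward_kernel<<<grid_size, block_size>>>(out, in, n);
}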

Todo

  • Optimize the Attention_forward kernel (a naive baseline sketch follows this list).
  • Profile, and optimize further based on the results.
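
For context on why Attention_forward is the optimization target, here is a sketch of what a naive causal attention forward typically looks like; the signature, tensor layouts, and scores scratch buffer are assumptions for illustration, not the repo's actual code. One thread handles one (batch, head, query) row, so K and V are re-read from global memory for every query position, which is exactly what profiling and tiling (shared memory, fused softmax) would attack.

#include <cuda_runtime.h>
#include <math.h>

// Hypothetical naive causal attention forward (unoptimized baseline).
// Layouts: q, k, v, out are (B, H, T, D); scores is (B, H, T, T) scratch.
__global__ void attention_forward_naive(float* out, const float* q,
                                        const float* k, const float* v,
                                        float* scores,
                                        int B, int H, int T, int D) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= B * H * T) return;
    int t = idx % T;        // query position
    int bh = idx / T;       // fused (batch, head) index
    const float* qrow = q + (size_t)(bh * T + t) * D;
    float* srow = scores + (size_t)idx * T;
    float scale = 1.0f / sqrtf((float)D);

    // 1) causal dot products, tracking the max for a stable softmax
    float maxval = -1e30f;
    for (int t2 = 0; t2 <= t; t2++) {
        const float* krow = k + (size_t)(bh * T + t2) * D;
        float dot = 0.0f;
        for (int d = 0; d < D; d++) dot += qrow[d] * krow[d];
        dot *= scale;
        srow[t2] = dot;
        if (dot > maxval) maxval = dot;
    }
    // 2) softmax over the causal prefix
    float sum = 0.0f;
    for (int t2 = 0; t2 <= t; t2++) {
        srow[t2] = expf(srow[t2] - maxval);
        sum += srow[t2];
    }
    // 3) weighted sum of values
    float* orow = out + (size_t)(bh * T + t) * D;
    for (int d = 0; d < D; d++) orow[d] = 0.0f;
    for (int t2 = 0; t2 <= t; t2++) {
        const float* vrow = v + (size_t)(bh * T + t2) * D;
        float w = srow[t2] / sum;
        for (int d = 0; d < D; d++) orow[d] += w * vrow[d];
    }
}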

To run

git clone -b cuda https://github.com/Autobot37/gpt.cpp
python3 pythonscripts/prepare_tokenizer.py
python3 writestate.py
make run_cuda

Dependencies

  • Python tiktoken module.
  • NVIDIA CUDA Toolkit.