
Unpacking nanochat: A Minimal, End-to-End LLM Pipeline

The world of artificial intelligence is experiencing unprecedented innovation, with large language models (LLMs) at the forefront. While building and training these sophisticated models often requires immense resources and specialized expertise, a new open-source project aims to demystify the process and make cutting-edge AI more accessible. Andrej Karpathy, a prominent figure in the AI community, has unveiled ‘nanochat’ – a streamlined, end-to-end pipeline designed to build a ChatGPT-style model with remarkable efficiency.

nanochat represents a significant step toward democratizing LLM development, giving developers and researchers a practical way to build a full ChatGPT-style stack from the ground up.

Andrej Karpathy has open-sourced nanochat as a compact, dependency-light codebase that implements that stack end to end, from tokenizer training to web UI inference, aimed at reproducible, hackable LLM training on a single multi-GPU node.

The emphasis on reproducibility and hackability lets users experiment with LLM training without extensive setup. It’s an invitation to dive deep into the mechanics of conversational AI.

The repo provides a single-script “speedrun” that executes the full loop: tokenization, base pretraining, mid-training on chat/multiple-choice/tool-use data, Supervised Finetuning (SFT), optional RL on GSM8K, evaluation, and serving (CLI + ChatGPT-like web UI).
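To make that flow concrete, here is a minimal Python sketch of a driver that chains those stages. Only scripts.chat_rl and scripts.chat_eval are named in the repo's walkthrough; the other module names below are hypothetical placeholders for the corresponding steps, not nanochat's actual entry points.

```python
# Sketch: what the single-script "speedrun" chains together, expressed as a
# tiny Python driver. Only scripts.chat_rl and scripts.chat_eval are named in
# the article; the other module names are hypothetical placeholders.
import subprocess

stages = [
    "scripts.tok_train",    # hypothetical: train the Rust BPE tokenizer
    "scripts.base_train",   # hypothetical: base pretraining on FineWeb-EDU shards
    "scripts.mid_train",    # hypothetical: chat / multiple-choice / tool-use mid-training
    "scripts.chat_sft",     # hypothetical: supervised finetuning
    "scripts.chat_rl",      # optional RL on GSM8K (named in the walkthrough)
    "scripts.chat_eval",    # evaluation (named in the walkthrough)
]

for stage in stages:
    # Each stage runs as its own "python -m <module>" process.
    subprocess.run(["python", "-m", stage], check=True)
```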

The affordability is particularly striking. The recommended setup is an 8×H100 node; at ~$24/hour, the 4-hour speedrun lands near $100. A post-run report.md summarizes metrics (CORE, ARC-E/C, MMLU, GSM8K, HumanEval, ChatCORE).
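As a quick sanity check, the arithmetic works out cleanly at the quoted ~$24/hour node rate; the sketch below is illustrative, and actual cloud pricing will vary.

```python
# Sketch: cost arithmetic for the tiers discussed in the article, assuming a
# flat ~$24/hour for the 8xH100 node (the article's figure).
NODE_USD_PER_HOUR = 24.0

tiers = {
    "speedrun (~$100)": 4.0,    # hours
    "d=26 tier (~$300)": 12.0,
    "largest tier (~$1,000)": 41.6,
}

for name, hours in tiers.items():
    cost = hours * NODE_USD_PER_HOUR
    print(f"{name}: {hours:>5.1f} h x ${NODE_USD_PER_HOUR:.0f}/h = ${cost:,.0f}")
# 4.0 h -> $96, 12.0 h -> $288, 41.6 h -> $998; consistent with the
# ~$100 / ~$300 / ~$1,000 labels.
```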

This cost-effective approach drastically lowers the barrier to entry for individuals and smaller teams eager to build and understand these complex systems.

The Nanochat Journey: From Custom Tokens to Capable Conversations

The nanochat pipeline is meticulously designed, covering every critical stage of LLM development. It begins with a custom tokenizer and progresses through various training phases, culminating in a functional model.

Tokenizer: custom Rust BPE (built via Maturin), with a 65,536-token vocab; training uses FineWeb-EDU shards (re-packaged/shuffled for simple access). The walkthrough reports ~4.8 characters/token compression and compares against GPT-2/4 tokenizers.
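The chars-per-token ratio is easy to reproduce in spirit. The sketch below uses OpenAI's tiktoken GPT-2 and GPT-4 encodings as stand-ins, since nanochat's own Rust tokenizer is not assumed to be installed here.

```python
# Sketch: measuring characters-per-token compression, the metric the nanochat
# walkthrough reports (~4.8 chars/token for its 65,536-token BPE vocab).
import tiktoken

sample = (
    "Large language models compress text into tokens; a better tokenizer "
    "packs more characters into each token, lowering training cost per byte."
)

for name in ("gpt2", "cl100k_base"):  # GPT-2 and GPT-4-family encodings
    enc = tiktoken.get_encoding(name)
    tokens = enc.encode(sample)
    print(f"{name}: {len(sample) / len(tokens):.2f} chars/token "
          f"({len(tokens)} tokens for {len(sample)} chars)")
```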

Eval bundle: a curated set for CORE (22 autocompletion datasets like HellaSwag, ARC, BoolQ, etc.), downloaded into ~/.cache/nanochat/eval_bundle.

The model’s architecture is a depth-20 Transformer, carefully configured for efficient training.

The speedrun config trains a depth-20 Transformer (≈560M params with 1280 hidden channels, 10 attention heads of dim 128) for ~11.2B tokens consistent with Chinchilla-style scaling (params × ~20 tokens). The author estimates this run as a ~4e19 FLOPs capability model. Training uses Muon for matmul parameters and AdamW for embeddings/unembeddings; loss is reported in bits-per-byte (bpb) to be tokenizer-invariant.
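Those numbers hang together on the back of an envelope. The sketch below assumes a standard GPT-style block (fused QKV plus output projection, a 4x MLP, untied embeddings); nanochat's exact layer layout may differ in detail, but the totals land close to the figures above.

```python
# Sketch: rough parameter count and Chinchilla-style token budget for the
# speedrun config (depth 20, hidden 1280, 10 heads of dim 128, vocab 65,536).
vocab, d_model, n_layers = 65_536, 1280, 20

embed = vocab * d_model                       # token embedding
unembed = vocab * d_model                     # output projection (assumed untied)
attn_per_layer = 4 * d_model * d_model        # Wq, Wk, Wv, Wo
mlp_per_layer = 2 * d_model * (4 * d_model)   # up + down projection (assumed 4x width)
params = embed + unembed + n_layers * (attn_per_layer + mlp_per_layer)

tokens = 20 * params          # Chinchilla-style heuristic: ~20 tokens per parameter
flops = 6 * params * tokens   # standard 6*N*D training-FLOPs estimate

print(f"params ≈ {params / 1e6:.0f}M")   # ≈ 560M
print(f"tokens ≈ {tokens / 1e9:.1f}B")   # ≈ 11.2B
print(f"FLOPs  ≈ {flops:.1e}")           # ≈ 4e19
```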

After initial pretraining, the model undergoes mid-training to adapt it for conversational understanding and specific task-solving abilities.

After pretraining, mid-training adapts the base model to conversations (SmolTalk) and explicitly teaches multiple-choice behavior (100K MMLU auxiliary-train questions) and tool use by inserting <|python_start|>…<|python_end|> blocks; a small GSM8K slice is included to seed calculator-style usage. The default mixture: SmolTalk (460K), MMLU aux-train (100K), GSM8K main (8K), totaling 568K rows.
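To illustrate how a tool-use example might look on disk, here is a hypothetical rendering of a calculator-style turn. Only the <|python_start|>/<|python_end|> markers come from the walkthrough; the surrounding chat schema and helper are assumptions for illustration.

```python
# Sketch: a hypothetical rendering of a tool-augmented assistant turn for
# mid-training. The role labels and render_turn helper are illustrative only.

def render_turn(role: str, text: str) -> str:
    """Flatten one chat message into plain training text (illustrative only)."""
    return f"{role}: {text}\n"

conversation = [
    ("user", "What is 17 * 24 + 3?"),
    # The assistant emits a tool block; at inference time the Engine would run
    # the code in its Python sandbox and splice the result back into the stream.
    ("assistant", "<|python_start|>print(17 * 24 + 3)<|python_end|> The answer is 411."),
]

training_text = "".join(render_turn(role, text) for role, text in conversation)
print(training_text)
```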

Supervised Finetuning (SFT) refines the model further, ensuring higher-quality conversations and reducing the train/inference mismatch.

SFT then fine-tunes on higher-quality conversations while matching test-time formatting (padded, non-concatenated rows) to reduce train/inference mismatch. The repo’s example post-SFT metrics (speedrun tier) report ARC-Easy 0.3876, ARC-Challenge 0.2807, MMLU 0.3151, GSM8K 0.0455, HumanEval 0.0854, ChatCORE 0.0884.
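The padded, non-concatenated idea is straightforward in code. This is a generic sketch of such a collate step; the pad token and ignore-index conventions are assumptions, not nanochat's actual values.

```python
# Sketch: "padded, non-concatenated rows" for SFT. Each conversation stays in
# its own row, right-padded to the batch max length, with padding masked out
# of the loss (instead of packing many conversations into one long sequence).
import torch

PAD_ID, IGNORE_INDEX = 0, -100  # assumed conventions

def collate_padded(rows: list[list[int]]):
    """Pad each tokenized conversation to the batch max length (no packing)."""
    max_len = max(len(r) for r in rows)
    input_ids, labels = [], []
    for r in rows:
        pad = [PAD_ID] * (max_len - len(r))
        input_ids.append(r + pad)
        labels.append(r + [IGNORE_INDEX] * (max_len - len(r)))  # no loss on padding
    return torch.tensor(input_ids), torch.tensor(labels)

ids, lbls = collate_padded([[5, 8, 13, 21], [3, 7], [2, 4, 6]])
print(ids.shape, lbls)  # (3, 4); padded positions carry -100 in the labels
```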

A notable feature is the integrated tool use, allowing the model to interact with external functionalities.

Tool use is wired end-to-end: the custom Engine implements KV cache, prefill/decode inference, and a simple Python interpreter sandbox for tool-augmented runs—used in both training and evaluation flows.
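The prefill/decode split is the key idea behind a KV cache. The toy single-head attention below illustrates the pattern with numpy; it is a sketch of the caching mechanic, not a transcription of nanochat's Engine.

```python
# Sketch: prefill fills the KV cache once over the whole prompt; decode then
# appends a single new key/value pair per generated step instead of
# recomputing attention inputs for all previous positions.
import numpy as np

rng = np.random.default_rng(0)
d = 16
Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

def attend(q, K, V):
    """Single-query attention over all cached keys/values."""
    scores = K @ q / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

def prefill(xs, cache_k, cache_v):
    """Process the whole prompt, filling the KV cache position by position."""
    out = None
    for x in xs:
        cache_k.append(Wk @ x)
        cache_v.append(Wv @ x)
        out = attend(Wq @ x, np.array(cache_k), np.array(cache_v))
    return out  # hidden state for the last prompt position

def decode_step(x, cache_k, cache_v):
    """Generate one step: append one new K/V pair and attend over the cache."""
    cache_k.append(Wk @ x)
    cache_v.append(Wv @ x)
    return attend(Wq @ x, np.array(cache_k), np.array(cache_v))

cache_k, cache_v = [], []
prompt = [rng.standard_normal(d) for _ in range(5)]
h = prefill(prompt, cache_k, cache_v)
for _ in range(3):
    # Toy decode loop; a real model would embed the sampled next token instead.
    h = decode_step(h, cache_k, cache_v)
print(len(cache_k), h.shape)  # 8 cached positions, (16,) hidden state
```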

For those looking to push the boundaries, an optional reinforcement learning stage is included.

The final (optional) stage applies reinforcement learning on GSM8K with a simplified GRPO routine. The walkthrough clarifies what is omitted relative to canonical PPO-style RLHF: there is no trust region via a reference model, no KL penalty, updates are fully on-policy (so PPO ratios and clipping drop out), normalization is token-level in the GAPO style, and the advantage is a simple mean shift. In practice it behaves close to REINFORCE while keeping the group-relative advantage calculation. The scripts `scripts.chat_rl` and `scripts.chat_eval -i rl -a GSM8K` demonstrate the loop.
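Stripped of PPO machinery, the update reduces to something close to REINFORCE with a group-relative baseline. The sketch below shows that core step for one prompt, with stand-in rewards and summed sequence log-probs (token-level normalization omitted for brevity); it is an illustration, not nanochat's implementation.

```python
# Sketch: group-relative, REINFORCE-like update. For each prompt we sample G
# completions, score them with a task reward (e.g., exact match on a GSM8K
# answer), subtract the group mean reward (the "mean-shift" advantage), and
# weight each completion's log-probability by that advantage.
import torch

def grpo_like_loss(logprobs: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    """logprobs: (G,) summed log-probs of each sampled completion (with grad).
    rewards:  (G,) scalar task rewards for the same completions."""
    advantages = rewards - rewards.mean()  # group-relative, mean-shifted
    # On-policy REINFORCE-style objective: no PPO ratio, no clipping, no KL term.
    return -(advantages.detach() * logprobs).mean()

# Toy usage: 4 sampled completions for one prompt, one of which was correct.
logprobs = torch.tensor([-12.3, -9.8, -11.1, -10.4], requires_grad=True)
rewards = torch.tensor([0.0, 1.0, 0.0, 0.0])
loss = grpo_like_loss(logprobs, rewards)
loss.backward()
print(loss.item(), logprobs.grad)
```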

Scaling Up: Performance and Future Potential

The initial ~$100 speedrun delivers a foundational model, complete with a comprehensive evaluation report. This serves as a strong baseline for further experimentation and development within the nanochat ecosystem.

An example report.md table for the ~$100/≈4-hour run shows: CORE 0.2219 (base); after mid-training/SFT, ARC-E 0.3561→0.3876, ARC-C ~0.2875→0.2807, MMLU 0.3111→0.3151, GSM8K 0.0250→0.0455, HumanEval 0.0671→0.0854, ChatCORE 0.0730→0.0884; wall-clock 3h51m.

For those seeking enhanced capabilities, nanochat also outlines clear scaling tiers.

The README sketches two larger targets beyond the ~$100 speedrun: a ~$300 tier (depth 26, roughly 12 hours) that slightly surpasses GPT-2 on CORE and requires more pretraining shards plus batch-size adjustments, and a ~$1,000 tier (about 41.6 hours) that delivers materially improved coherence and basic reasoning/coding ability.

These tiers demonstrate that significant improvements in LLM performance, including better coherence and reasoning abilities, are achievable with modest increases in computational investment. The repo also notes prior experimental runs in which a d=30 model trained for ~24 hours reached scores in the 40s on MMLU, the 70s on ARC-Easy, and the 20s on GSM8K.

Conclusion: Redefining LLM Accessibility and Experimentation

Andrej Karpathy’s nanochat is more than just a codebase; it’s a blueprint for reproducible and affordable LLM training. By providing a full ChatGPT-style stack in such a compact and accessible form, it empowers a new generation of AI enthusiasts and researchers.

This minimal, end-to-end ChatGPT-style stack, clocking in at around 8K lines of code, runs efficiently via a single `speedrun.sh` script on an 8×H100 node, completing in approximately 4 hours for about $100.

The comprehensive pipeline covers everything from the Rust BPE tokenizer and base pretraining to mid-training, Supervised Finetuning (SFT), and an optional simplified GRPO-based Reinforcement Learning on GSM8K. It also includes robust evaluation and serving capabilities via both CLI and a ChatGPT-like web UI.

The speedrun metrics are impressive: starting with a CORE score of 0.2219, the model achieves ARC-Easy 0.3876, ARC-Challenge 0.2807, MMLU 0.3151, GSM8K 0.0455, and HumanEval 0.0854 after SFT. Further scaling tiers are outlined, with a ~$300 tier (d=26, ~12h) slightly outperforming GPT-2 CORE, and a ~$1,000 tier (~41.6h) offering materially better coherence and reasoning abilities.

nanochat truly opens doors, making advanced LLM experimentation a tangible reality for a wider audience. If you’ve been looking for a practical entry point into the world of large language model training and development, Andrej Karpathy’s nanochat is an essential resource to explore.
