Andrej Karpathy has open-sourced nanochat, a compact, dependency-light codebase that implements a full ChatGPT-style stack, from tokenizer training to web-UI inference, geared toward reproducible, hackable LLM training on a single multi-GPU node.
The repo provides a single-script "speedrun" that executes the entire loop: tokenization, base pretraining, mid-training on chat/multiple-choice/tool-use data, supervised finetuning (SFT), optional RL on GSM8K, evaluation, and serving (CLI + ChatGPT-like web UI). The recommended setup is an 8×H100 node; at ~$24/hour, the 4-hour speedrun lands near $100. A post-run report.md summarizes metrics (CORE, ARC-E/C, MMLU, GSM8K, HumanEval, ChatCORE).
Tokenizer and data path
- Tokenizer: a custom Rust BPE (built via Maturin) with a 65,536-token vocab; training uses FineWeb-EDU shards (repackaged/shuffled for easy access). The walkthrough reports ~4.8 characters/token compression and compares against the GPT-2 and GPT-4 tokenizers (a rough measurement sketch follows this list).
- Eval bundle: a curated set for CORE (22 autocompletion datasets such as HellaSwag, ARC, BoolQ, etc.), downloaded into ~/.cache/nanochat/eval_bundle.
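As a rough illustration of how such a compression figure is measured, here is a minimal sketch; the `encode` callable is a stand-in, since nanochat's actual Rust BPE bindings and their API are not reproduced here:

```python
# Minimal sketch: measuring characters-per-token compression for a tokenizer.
# `encode` is a stand-in callable; nanochat's real Rust BPE is exposed through
# its own Python bindings, whose API is not quoted here.

def chars_per_token(encode, texts):
    """Average number of raw characters covered by one token."""
    total_chars = sum(len(t) for t in texts)
    total_tokens = sum(len(encode(t)) for t in texts)
    return total_chars / total_tokens

# Trivial whitespace "tokenizer" as the stand-in, just to make this runnable:
print(f"{chars_per_token(str.split, ['The quick brown fox jumps.']):.2f} chars/token")
```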
Model, scaling, and the "speedrun" target
The speedrun config trains a depth-20 Transformer (≈560M params, with 1280 hidden channels and 10 attention heads of dim 128) for ~11.2B tokens, following Chinchilla-style scaling (params × ~20 tokens). The author estimates this run as a ~4e19 FLOPs capability model. Training uses Muon for matmul parameters and AdamW for embeddings/unembeddings; loss is reported in bits-per-byte (bpb) to be tokenizer-invariant.
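These headline numbers follow from simple arithmetic. The sketch below reproduces them under the standard ≈6·N·D training-compute approximation; the loss value in the bpb conversion is hypothetical, and 1 byte per character is a simplification that only holds for ASCII-heavy text:

```python
import math

N = 560e6            # parameters (depth-20 speedrun model)
D = 20 * N           # Chinchilla-style token budget: ~20 tokens/param -> ~11.2B
flops = 6 * N * D    # standard ~6*N*D training-compute approximation
print(f"tokens ~ {D:.3g}, FLOPs ~ {flops:.2g}")   # ~1.12e+10, ~3.8e+19

# Converting a nats-per-token loss to bits-per-byte (bpb). The loss value is
# hypothetical, and 1 byte/char is a simplification for ASCII-heavy text.
loss_nats = 2.8                      # hypothetical mean loss per token, in nats
bytes_per_token = 4.8                # ~4.8 chars/token at ~1 byte per char
print(f"bpb ~ {loss_nats / math.log(2) / bytes_per_token:.3f}")
```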
Mid-training, SFT, and tool use
After pretraining, mid-training adapts the base model to conversations (SmolTalk) and explicitly teaches multiple-choice behavior (100K MMLU auxiliary-train questions) and tool use by inserting <|python_start|>…<|python_end|> blocks; a small GSM8K slice is included to seed calculator-style usage. The default mixture: SmolTalk (460K rows), MMLU aux-train (100K), GSM8K main (8K), totaling 568K rows.
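To make the special-token mechanics concrete, here is a hedged sketch of flattening a tool-use turn into a single training string; only the <|python_start|>/<|python_end|> tags come from the article, and every other delimiter is an invented placeholder, not nanochat's real chat format:

```python
# Sketch: flattening a tool-use turn into one training string. The
# <|python_start|>/<|python_end|> tags match the article; all other
# delimiters are illustrative placeholders.

def render(role: str, content: str) -> str:
    return f"<|{role}|>{content}<|end|>"

conversation = [
    ("user", "What is 19 * 43?"),
    ("assistant", "<|python_start|>print(19 * 43)<|python_end|>The answer is 817."),
]

print("".join(render(role, content) for role, content in conversation))
```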
SFT then fine-tunes on higher-quality conversations while matching test-time formatting (padded, non-concatenated rows) to reduce train/inference mismatch. The repo's example post-SFT metrics (speedrun tier) report ARC-Easy 0.3876, ARC-Challenge 0.2807, MMLU 0.3151, GSM8K 0.0455, HumanEval 0.0854, ChatCORE 0.0884.
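A minimal sketch of that padded, non-concatenated batching, under assumed conventions (pad id 0, labels of -100 masked from the loss); nanochat's actual collation may differ:

```python
import torch

def pad_batch(rows, pad_id=0):
    """Pad variable-length rows into a rectangle rather than packing them into
    one concatenated stream, mirroring test-time formatting. Labels at padding
    positions are -100 so cross-entropy ignores them."""
    width = max(len(r) for r in rows)
    input_ids = torch.full((len(rows), width), pad_id, dtype=torch.long)
    labels = torch.full((len(rows), width), -100, dtype=torch.long)
    for i, row in enumerate(rows):
        input_ids[i, : len(row)] = torch.tensor(row)
        labels[i, : len(row)] = torch.tensor(row)
    return input_ids, labels

ids, labels = pad_batch([[5, 6, 7], [9, 10]])
print(ids.tolist())     # [[5, 6, 7], [9, 10, 0]]
print(labels.tolist())  # [[5, 6, 7], [9, 10, -100]]
```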
Tool use is wired end-to-end: the custom Engine implements KV caching, prefill/decode inference, and a simple Python-interpreter sandbox for tool-augmented runs, used in both training and evaluation flows.
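The prefill/decode split is the standard pattern: run the whole prompt through the model once to populate the KV cache, then generate autoregressively, feeding back only the newest token each step. A minimal sketch against an assumed HuggingFace-style interface (nanochat's Engine defines its own API; none of these call names are quoted from it):

```python
import torch

@torch.no_grad()
def generate(model, prompt_ids, max_new_tokens=32):
    """Prefill the prompt once to fill the KV cache, then decode token by
    token, reusing the cache. Assumes a HuggingFace-style causal LM interface;
    nanochat's Engine exposes its own API."""
    out = model(input_ids=prompt_ids, use_cache=True)        # prefill: whole prompt
    past = out.past_key_values
    next_id = out.logits[:, -1].argmax(-1, keepdim=True)     # greedy, for brevity
    generated = [prompt_ids]
    for _ in range(max_new_tokens):
        generated.append(next_id)
        out = model(input_ids=next_id, past_key_values=past, use_cache=True)
        past = out.past_key_values                           # cache grows one step
        next_id = out.logits[:, -1].argmax(-1, keepdim=True)
    return torch.cat(generated, dim=1)
```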
Optional RL on GSM8K via a simplified GRPO loop
The final (optional) stage applies reinforcement learning on GSM8K with a simplified GRPO routine. The walkthrough clarifies what is omitted relative to canonical PPO-style RLHF: no trust region via a reference model, no KL penalties, on-policy updates (discarding PPO ratios/clipping), token-level GAPO-style normalization, and a mean-shifted advantage. In practice it behaves close to REINFORCE while keeping the group-relative advantage calculation. The scripts scripts.chat_rl and scripts.chat_eval -i rl -a GSM8K demonstrate the loop.
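Stripped of the reference model, KL penalty, and PPO machinery, the update reduces to REINFORCE with a group-relative baseline. A minimal sketch of that core, with reward function, sampling, and batch plumbing left out (the per-completion log-prob sum below is a simplification of the token-level normalization the walkthrough mentions):

```python
import torch

def grpo_loss(logprobs, rewards):
    """Simplified GRPO core for one prompt with a group of G completions.
    logprobs: (G,) summed token log-probs per completion (on-policy, so no
              PPO ratio or clipping; summing simplifies token-level norm).
    rewards:  (G,) scalar reward per completion (e.g. GSM8K answer correct).
    Advantage is mean-shifted within the group; no reference model, no KL."""
    advantages = rewards - rewards.mean()   # group-relative baseline
    return -(advantages * logprobs).mean()

logprobs = torch.tensor([-12.3, -9.8, -11.1], requires_grad=True)
rewards = torch.tensor([1.0, 0.0, 1.0])
grpo_loss(logprobs, rewards).backward()
print(logprobs.grad)  # per-completion REINFORCE-style gradients
```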
Cost/quality scaling and larger models
The README sketches two larger targets beyond the ~$100 speedrun (a quick cost check follows the list):
- ~$300 tier: d=26 (~12 hours), slightly outperforms GPT-2 on CORE; requires additional pretraining shards and batch-size adjustments.
- ~$1,000 tier: ~41.6 hours, with materially improved coherence and basic reasoning/coding ability.
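As a sanity check on those price tags, the quoted ~$24/hour node rate reproduces them almost exactly:

```python
RATE = 24.0  # ~$ per hour for an 8xH100 node, as quoted above
for tier, hours in [("~$100 speedrun", 4.0), ("~$300 (d=26)", 12.0), ("~$1,000", 41.6)]:
    print(f"{tier}: {hours}h x ${RATE}/h = ${RATE * hours:,.0f}")
# ~$100 speedrun: 4.0h -> $96;  ~$300 (d=26): 12.0h -> $288;  ~$1,000: 41.6h -> $998
```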
The repo also notes prior experimental runs where a d=30 model trained for ~24 hours reached the 40s on MMLU, 70s on ARC-Easy, and 20s on GSM8K.
Evaluation snapshot (speedrun tier)
An example report.md table for the ~$100 (≈4-hour) run shows CORE 0.2219 for the base model and a wall-clock time of 3h51m, with the following mid-training → SFT progression:

| Metric | Mid-training | SFT |
|---|---|---|
| ARC-Easy | 0.3561 | 0.3876 |
| ARC-Challenge | ~0.2875 | 0.2807 |
| MMLU | 0.3111 | 0.3151 |
| GSM8K | 0.0250 | 0.0455 |
| HumanEval | 0.0671 | 0.0854 |
| ChatCORE | 0.0730 | 0.0884 |
Key Takeaways
- nanochat is a minimal, end-to-end ChatGPT-style stack (~8K LOC) that runs via a single speedrun.sh on one 8×H100 node (~4h ≈ $100).
- The pipeline covers the tokenizer (Rust BPE), base pretraining, mid-training, SFT, optional RL on GSM8K (simplified GRPO), evaluation, and serving (CLI + web UI).
- Speedrun metrics (example report.md): CORE 0.2219 base; after SFT: ARC-Easy 0.3876, ARC-Challenge 0.2807, MMLU 0.3151, GSM8K 0.0455, HumanEval 0.0854.
- Scaling tiers are outlined: ~$300 (d=26, ~12h) slightly outperforms GPT-2 on CORE; ~$1,000 (~41.6h) for materially better coherence/reasoning.
Karpathy's nanochat lands in a useful middle ground: a single, clean, dependency-light repository that stitches tokenizer training (Rust BPE), pretraining on FineWeb-EDU, mid-training (SmolTalk/MMLU aux/GSM8K with tool-use tags), SFT, optional simplified GRPO on GSM8K, and a thin Engine (KV cache, prefill/decode, Python interpreter) into a reproducible speedrun on an 8×H100 node, producing a traceable report.md with CORE/ARC/MMLU/GSM8K/HumanEval scores and a minimal web UI.
