I know the libraries HF provides are nice, but the courses I've looked at are somewhat lacking. The LLM course doesn't mention training from scratch? I've checked with Ctrl+F, searching the page for "train".
Is this the closest match within the LLM course: causal language modeling pretraining?
If I put together a rough collection from online resources, it would look something like this:
The Hugging Face LLM Course does include tokenizer training and a scaled-down causal-LM-from-scratch example. The deeper large-scale training material sits elsewhere, especially in HF’s training handbook/playbook, while CS336 is the closest thing to a full “language modeling from scratch” course. (Hugging Face)
Track 1 — Understand what you are training
Use this track when:
Transformers still feel blurry. You can run code, but you do not yet have a clean mental model of tokenization, next-token prediction, attention, and causal LM training.
Goal:
Understand the pipeline end to end before caring about scale.
Read in this order:
- The Illustrated Transformer: Best first stop. It is a visual introduction meant to simplify the Transformer concepts step by step. (jalammar.github.io)
- The Annotated Transformer: Best second stop. It is the implementation-oriented bridge from the paper to working code. (nlp.seas.harvard.edu)
- CS336 (Language Modeling from Scratch): Best full course for the whole stack: tokenization, data, model construction, training, and evaluation. (Stanford CS336)
- HF Chapter 6: Use this to learn tokenization properly. HF explicitly states that training a tokenizer is not the same as training a model, and Chapter 6 is about training a brand-new tokenizer so it can later be used to pretrain a language model. (Hugging Face)
- HF Chapter 7.6: Then do one small from-scratch causal LM run in the HF style. HF describes this section as training a completely new model from scratch on a Python code corpus with Trainer and Accelerate. (Hugging Face)
You are done with Track 1 when:
You can explain why causal LM predicts the next token, why tokenization matters, why labels can look identical to input_ids in HF’s setup, and why packing text before chunking improves efficiency. The HF forums repeatedly show that these are the exact points where learners get confused. (Hugging Face Forums)
Main pitfalls:
- Confusing tokenizer training with model training. HF explicitly separates them. (Hugging Face)
- Thinking the course example is “real full-scale pretraining.” It is not. HF presents it as a reduced, practical example. (Hugging Face)
- Getting stuck on label shifting. In the common HF causal-LM path, the shift happens inside the model. (Hugging Face Forums)
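The label-shift pitfall is easiest to see in a toy sketch. This is not HF code, just an illustration (with made-up token ids) of the shift a causal-LM model performs internally when you pass labels identical to input_ids:

```python
# Toy illustration (not HF code) of the internal label shift:
# the prediction made at position t is scored against the token at t + 1,
# which is why labels can be passed as an exact copy of input_ids.

input_ids = [5, 12, 7, 9]    # made-up token ids
labels = list(input_ids)     # in HF's setup, labels == input_ids

# Conceptually, inside the model:
contexts = input_ids[:-1]    # positions that produce a next-token prediction
targets = labels[1:]         # the tokens those predictions are scored against

pairs = list(zip(contexts, targets))
# pairs == [(5, 12), (12, 7), (7, 9)]
```

So you never shift anything yourself in the common HF path; copying input_ids into labels is the whole job.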
Track 2 — Build a small LLM yourself
Use this track when:
You want runnable code and a model you actually trained, even if it is small.
Goal:
Train a tiny but real decoder-only model end to end.
Build in this order:
- minbpe: Start here for tokenization code. It is a minimal, clean implementation of byte-level BPE, which is commonly used in LLM tokenization. (GitHub)
- build-nanogpt: Then build a GPT-like model step by step. The repo is intentionally structured so you can follow the commit history as the system is built up gradually. (GitHub)
- Raschka's LLMs-from-scratch: Use this when you want a more complete, careful, beginner-friendly path. The repo explicitly covers developing, pretraining, and finetuning a GPT-like LLM. (GitHub)
- HF blog "How to train a new language model from scratch": This is the compact HF version of the whole pipeline: find data, train a tokenizer, train a language model, validate it, then fine-tune it. (Hugging Face)
- HF blog "Training CodeParrot from Scratch": Use this when you want a more realistic HF pretraining case than the course's toy example. HF describes it as a step-by-step guide to training a large GPT-2 model for code from scratch. (Hugging Face)
- LitGPT pretraining tutorial: A good modern practical stack after the educational repos. LitGPT documents `litgpt pretrain`, and the project positions itself as high-performance LLM recipes for pretraining, finetuning, and deployment. (GitHub)
- TinyStories: Best cheap sandbox. The paper shows that very small models can learn coherent English on this dataset, which makes it unusually good for low-budget experiments. (arXiv)
A good endpoint for Track 2:
Train a small model on TinyStories or a narrow code corpus, then inspect generations, training loss, and preprocessing behavior.
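For the evaluation part of that endpoint, the sliding-window perplexity idea HF recommends for fixed-length models can be sketched roughly as follows. This is a simplification: `score` is a hypothetical stand-in for a real model's per-token log-probability call, and real implementations batch whole windows instead of looping token by token.

```python
import math

def sliding_window_perplexity(score, tokens, max_len, stride):
    """score(context, target) -> log p(target | context), a hypothetical
    stand-in for a model call. Each token is scored exactly once, but with
    up to max_len tokens of context thanks to the overlapping windows."""
    nll, counted, prev_end = 0.0, 0, 0
    for begin in range(0, len(tokens), stride):
        end = min(begin + max_len, len(tokens))
        window = tokens[begin:end]
        # skip tokens already scored by an earlier window
        start = max(prev_end - begin, 1)
        for i in range(start, len(window)):
            nll -= score(window[:i], window[i])
            counted += 1
        prev_end = end
        if end == len(tokens):
            break
    return math.exp(nll / counted)

# Sanity check: a uniform "model" over two tokens gives perplexity 2.
uniform = lambda ctx, tok: math.log(0.5)
ppl = sliding_window_perplexity(uniform, [0, 1, 0, 1, 1, 0], max_len=4, stride=2)
```

The point of the overlap is that naive disjoint chunking gives early tokens in each chunk almost no context, inflating perplexity.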
Main pitfalls:
- Preprocessing can dominate your runtime. A recent HF forum thread shows learners getting stuck on long tokenization and chunking steps in Colab. (Hugging Face Forums)
- Packing matters. The efficient pattern is to concatenate samples with EOS and then chunk, instead of wasting short remainders. HF’s course and example scripts use this logic. (GitHub)
- Naive perplexity evaluation is misleading for fixed-length models. HF recommends a sliding-window strategy. (Hugging Face)
- Official examples can still break or drift. HF course discussions document typos and deprecated API usage in the training section. (Hugging Face Forums)
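The packing pattern from the pitfall above can be sketched in a few lines. This is illustrative only: real HF pipelines do this inside `datasets.map` on batched examples, and the EOS id here is made up.

```python
EOS = 0  # made-up end-of-sequence id; use your tokenizer's real eos_token_id

def pack_and_chunk(samples, block_size):
    """Concatenate tokenized samples with EOS between them, then cut the
    stream into fixed-size blocks, dropping only the final short remainder
    instead of padding every short sample."""
    stream = []
    for ids in samples:
        stream.extend(ids)
        stream.append(EOS)
    usable = (len(stream) // block_size) * block_size
    return [stream[i:i + block_size] for i in range(0, usable, block_size)]

chunks = pack_and_chunk([[1, 2, 3], [4, 5], [6, 7, 8, 9]], block_size=4)
# [[1, 2, 3, 0], [4, 5, 0, 6], [7, 8, 9, 0]]
```

Note how the second chunk crosses a sample boundary: that is the efficiency win, since no block is wasted on padding.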
Track 3 — Learn serious pretraining engineering
Use this track when:
You already understand the basics and have trained small models. Now the problem is throughput, memory, parallelism, stability, and scaling.
Goal:
Understand how real pretraining systems are organized.
Read in this order:
- HF LLM Training Playbook: Start here for the overview. HF describes it as an open collection of implementation tips, tricks, and resources for training large language models. (GitHub)
- HF LLM Training Handbook: Then go deeper. HF explicitly says this is technical material for LLM training engineers and operators. (GitHub)
- Megatron-LM / Megatron Core: Study this when you need large-scale distributed training concepts. NVIDIA describes Megatron-LM as a reference example for research teams and distributed training, and Megatron Core as high-performance building blocks for large-scale generative AI training. (GitHub)
- HF Accelerate Megatron-LM guide: A useful bridge if you already know HF tooling and want to see how large-scale GPT pretraining connects to their example scripts. (Hugging Face)
- Nanotron: A good alternative if you want a simpler, flexible pretraining library that is still designed for speed and scale. (GitHub)
- Pythia: Best for studying training dynamics rather than only final checkpoints. The project provides models and many checkpoints specifically to support research into how LMs evolve across training and scale. (GitHub)
What Track 3 is really about:
Not “how to make a toy model run,” but “how to make training stable, efficient, parallel, debuggable, and reproducible.” That is exactly how HF positions its handbook/playbook material. (GitHub)
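As one concrete flavor of what this track covers: gradient accumulation is among the simplest memory/throughput tricks discussed in such material, trading wall-clock time for a larger effective batch. A toy sketch, with plain numbers standing in for gradient tensors and a comment marking where the optimizer step would go:

```python
def accumulate(micro_batch_grads, accum_steps):
    """Average accum_steps micro-batch gradients into one update each,
    simulating a large batch that would not fit in memory at once."""
    updates, acc = [], 0.0
    for step, grad in enumerate(micro_batch_grads, start=1):
        acc += grad / accum_steps        # scale so the sum is an average
        if step % accum_steps == 0:
            updates.append(acc)          # optimizer.step() would run here
            acc = 0.0                    # optimizer.zero_grad() equivalent
    return updates

# Four micro-batches, stepping every two: two effective updates.
updates = accumulate([1.0, 2.0, 3.0, 4.0], accum_steps=2)
# updates == [1.5, 3.5]
```

The handbook-level material is about exactly this kind of bookkeeping, scaled up across devices, where getting it subtly wrong destabilizes training.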
Main pitfalls:
- Entering this track too early. These resources assume you already know the basics.
- Treating distributed training as the first thing to learn, instead of the last thing.
- Ignoring real-world friction. Even practical scaling libraries still have active bug reports and setup issues. (GitHub)
Best default path for most people
For most people, the best order is:
Track 1 → Track 2 → selective parts of Track 3
That sequence matches the way the resources themselves are split: the HF course teaches the workflow and a small from-scratch example, while the handbook/playbook and systems libraries target training engineers working on scale and operations. (Hugging Face)
Short version
- Track 1 = understand transformers and causal LM training
- Track 2 = train a small model yourself
- Track 3 = learn real pretraining systems and scaling
The most common mistake is skipping Track 1, dabbling in Track 3, and never finishing Track 2.