SEC-EDGAR GPT-2 124M

A GPT-2 (124M) language model trained from scratch on SEC EDGAR filings (10-K, 10-Q, 8-K, etc.).

Model Details

Property	Value
Architecture	GPT-2 124M (12 layers, 12 heads, 768 hidden)
Parameters	124,475,904
Context Length	1,024 tokens
Tokenizer	GPT-2 BPE (tiktoken)
Training Tokens	~1.55B (1 epoch)
Training Steps	47,000
Validation Loss	2.28
Training Framework	nanoGPT
Training Hardware	NVIDIA RTX 4070 12GB
Training Time	~8 hours
Bias	No (`bias=False`)
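
As a sanity check, the parameter count can be reproduced from the architecture alone. The sketch below is illustrative only: `count_params` is a hypothetical helper, and the padded vocabulary size of 50,304 (nanoGPT's default when training from scratch) is an assumption, since this README does not state the vocabulary size used.

```python
def count_params(vocab_size, n_layer=12, n_embd=768, block_size=1024, bias=True):
    """Count GPT-2 style parameters; lm_head is tied to wte, so it is counted once."""
    b = 1 if bias else 0
    wte = vocab_size * n_embd                    # token embeddings (shared with lm_head)
    wpe = block_size * n_embd                    # learned position embeddings
    layer_norm = n_embd + b * n_embd             # scale, plus optional shift
    attn = (n_embd * 3 * n_embd + b * 3 * n_embd   # c_attn: fused q, k, v projection
            + n_embd * n_embd + b * n_embd)        # c_proj: output projection
    mlp = (n_embd * 4 * n_embd + b * 4 * n_embd    # c_fc: expand to 4x width
           + 4 * n_embd * n_embd + b * n_embd)     # c_proj: project back
    block = 2 * layer_norm + attn + mlp
    return wte + wpe + n_layer * block + layer_norm  # trailing term is ln_f


# Assumption: vocabulary padded from 50,257 to 50,304.
print(count_params(50304, bias=True))   # 124475904
print(count_params(50304, bias=False))  # 124373760
print(count_params(50257, bias=True))   # 124439808 (standard GPT-2 small)
```

Under these assumptions the listed 124,475,904 matches the count with bias parameters included; the same architecture with every bias removed comes to 124,373,760.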

Training Data

SEC EDGAR filings sourced from the SEC-EDGAR corpus on HuggingFace, covering annual reports (10-K), quarterly reports (10-Q), current reports (8-K), and other filing types. Tokenized with GPT-2 BPE into ~1.55B tokens across 16 shards.
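
To illustrate what a sharded token file can look like, here is a minimal sketch that splits a token stream into flat binary shards. It assumes a nanoGPT-style layout of raw unsigned 16-bit IDs (every GPT-2 token ID is below 65,536, so each fits in two bytes); the actual shard format and file names used for this corpus are not described here, and `write_shards`, `read_shard`, and the `train_NN.bin` naming are hypothetical.

```python
import os
import tempfile
from array import array


def write_shards(token_ids, num_shards, prefix):
    """Split token IDs into contiguous uint16 files and return their paths."""
    per_shard = -(-len(token_ids) // num_shards)  # ceiling division
    paths = []
    for i in range(num_shards):
        chunk = token_ids[i * per_shard:(i + 1) * per_shard]
        path = f"{prefix}_{i:02d}.bin"
        with open(path, "wb") as f:
            array("H", chunk).tofile(f)  # "H" is an unsigned short
        paths.append(path)
    return paths


def read_shard(path):
    """Read one shard back into a list of token IDs."""
    ids = array("H")
    with open(path, "rb") as f:
        ids.frombytes(f.read())
    return list(ids)


# Stand-in token stream; a real run would use tiktoken-encoded filing text.
tokens = [i % 50257 for i in range(100)]
with tempfile.TemporaryDirectory() as tmp:
    paths = write_shards(tokens, 16, os.path.join(tmp, "train"))
    restored = [t for p in paths for t in read_shard(p)]
```

At two bytes per token, ~1.55B tokens would occupy roughly 3.1 GB on disk in this layout.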

Training Config

Batch size: 4 × 1024 tokens, gradient accumulation 8 → effective batch 32,768 tokens/step
Optimizer: GPT-3 style (AdamW, lr=6e-4, warmup=2000, cosine decay to 6e-5)
No dropout; no bias terms (bias=False)
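
The numbers above can be checked with a few lines of arithmetic, and the learning-rate schedule can be sketched the same way. This is a minimal illustration, not the training code itself: `lr_at` is a hypothetical helper, and the decay horizon is assumed to equal the 47,000 training steps, which this README does not state.

```python
import math

# Values taken from the Model Details and Training Config sections.
MICRO_BATCH = 4
BLOCK_SIZE = 1024
GRAD_ACCUM = 8
TRAIN_STEPS = 47_000

LEARNING_RATE = 6e-4
MIN_LR = 6e-5
WARMUP_STEPS = 2000
LR_DECAY_STEPS = TRAIN_STEPS  # assumption: decay ends at the last training step

tokens_per_step = MICRO_BATCH * BLOCK_SIZE * GRAD_ACCUM
total_tokens = tokens_per_step * TRAIN_STEPS


def lr_at(step):
    """Linear warmup followed by cosine decay down to MIN_LR."""
    if step < WARMUP_STEPS:
        return LEARNING_RATE * step / WARMUP_STEPS
    if step >= LR_DECAY_STEPS:
        return MIN_LR
    progress = (step - WARMUP_STEPS) / (LR_DECAY_STEPS - WARMUP_STEPS)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 to 0
    return MIN_LR + coeff * (LEARNING_RATE - MIN_LR)


print(tokens_per_step)  # 32768
print(total_tokens)     # 1540096000
```

The product of steps and tokens per step is about 1.54B, consistent with the ~1.55B tokens reported for one epoch.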

Usage

Recommended: native nanoGPT inference

This model was trained with nanoGPT, and inference works best with the same native code. The checkpoint format, weight layout (bias=False), and tokenizer (tiktoken) all match directly — no conversion layer needed.

Copy model.py from the nanoGPT repository into your working directory, next to ckpt.pt, then run:

import torch
import tiktoken
from model import GPTConfig, GPT

# Load checkpoint (nanoGPT format).
# weights_only=False unpickles arbitrary Python objects, so only load checkpoints you trust.
checkpoint = torch.load("ckpt.pt", map_location="cuda", weights_only=False)
gptconf = GPTConfig(**checkpoint["model_args"])
model = GPT(gptconf)
state_dict = checkpoint["model"]
# Strip compilation prefix if present
unwanted_prefix = "_orig_mod."
for k, v in list(state_dict.items()):
    if k.startswith(unwanted_prefix):
        state_dict[k[len(unwanted_prefix):]] = state_dict.pop(k)
model.load_state_dict(state_dict)
model.eval()
model.to("cuda")

# Tokenizer — same tiktoken encoding used during training
enc = tiktoken.get_encoding("gpt2")

prompt = "UNITED STATES SECURITIES AND EXCHANGE COMMISSION"
start_ids = enc.encode(prompt, allowed_special={"<|endoftext|>"})
x = torch.tensor(start_ids, dtype=torch.long, device="cuda")[None, ...]
y = model.generate(x, max_new_tokens=200, temperature=0.8, top_k=200)
output = enc.decode(y[0].tolist())
print(output)

Tip: For a full OpenAI-compatible API server using this approach, see server/server.py.
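
To make the `temperature` and `top_k` arguments in the `generate()` call above concrete, here is a dependency-free sketch of temperature scaling followed by top-k filtering for a single next-token step. It mirrors the general technique rather than reproducing nanoGPT's code; `top_k_probs` and `sample_next` are hypothetical names.

```python
import math
import random


def top_k_probs(logits, temperature=0.8, top_k=200):
    """Scale logits by temperature, keep the top_k largest, and softmax the rest."""
    scaled = [x / temperature for x in logits]
    k = min(top_k, len(scaled))
    threshold = sorted(scaled, reverse=True)[k - 1]
    # Everything below the k-th largest logit gets zero probability.
    masked = [s if s >= threshold else float("-inf") for s in scaled]
    peak = max(masked)
    exps = [math.exp(s - peak) for s in masked]  # subtract peak for stability
    total = sum(exps)
    return [e / total for e in exps]


def sample_next(logits, temperature=0.8, top_k=200, rng=random):
    """Draw one token index from the filtered distribution."""
    probs = top_k_probs(logits, temperature, top_k)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Toy four-token vocabulary: only the two highest logits can be sampled.
probs = top_k_probs([2.0, 1.0, 0.0, -1.0], temperature=0.8, top_k=2)
token = sample_next([2.0, 1.0, 0.0, -1.0], temperature=0.8, top_k=2)
```

A temperature below 1 sharpens the distribution toward the most likely tokens, while `top_k` removes the long tail entirely.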

HuggingFace transformers (not recommended)

HuggingFace transformers can load this model, but there are known issues:

bias=False mismatch: nanoGPT trains all linear layers without bias terms (bias=False), whereas HuggingFace's GPT2LMHeadModel always creates bias parameters for those layers. The shapes match only because the HF conversion script pads the state dict, so you may see warnings or silent quality degradation.
Checkpoint format: The raw checkpoint is saved in nanoGPT's format, not HuggingFace's. The HuggingFace Hub version goes through a conversion step that can introduce subtle mismatches.
Tokenizer differences: HuggingFace wraps GPT2Tokenizer around the same BPE merges, but the encode/decode behaviour (special token handling, whitespace) can differ from the tiktoken library used during training. For best fidelity, use tiktoken directly.
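generate() defaults: HF's model.generate() defaults differ from nanoGPT's generate(). With the library defaults, HF decodes greedily unless do_sample=True is passed and then applies top_k=50, whereas nanoGPT always samples and applies no top_k unless one is given. HF also exposes a repetition penalty that nanoGPT's generate() does not have. Results will not be identical.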

If you still want to try:

from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("lzwjava/sec-edgar-gpt")
tokenizer = GPT2Tokenizer.from_pretrained("lzwjava/sec-edgar-gpt")

prompt = "UNITED STATES SECURITIES AND EXCHANGE COMMISSION"
input_ids = tokenizer.encode(prompt, return_tensors="pt")
output = model.generate(input_ids, max_new_tokens=200, temperature=0.8, do_sample=True, top_k=200)
print(tokenizer.decode(output[0]))
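
The bias and checkpoint-format issues listed above come down to two structural differences between the layouts: HuggingFace's GPT-2 stores four projection matrices per block in Conv1D layout (transposed relative to `torch.nn.Linear`), and it expects a bias for every linear and LayerNorm layer. The sketch below only plans a conversion from a list of key names. It is not the repository's conversion script; `plan_conversion` is a hypothetical helper, and the key names are assumed to follow nanoGPT's model.py.

```python
# Weight suffixes that HuggingFace GPT-2 keeps in Conv1D layout
# (in_features, out_features), i.e. transposed relative to torch.nn.Linear.
CONV1D_WEIGHTS = (
    "attn.c_attn.weight",
    "attn.c_proj.weight",
    "mlp.c_fc.weight",
    "mlp.c_proj.weight",
)


def plan_conversion(nanogpt_keys):
    """Return (weights to transpose, bias keys that would need to be created)."""
    keys = set(nanogpt_keys)
    to_transpose = sorted(k for k in keys if k.endswith(CONV1D_WEIGHTS))
    missing_bias = []
    for key in sorted(keys):
        if not key.endswith(".weight"):
            continue
        stem = key[: -len(".weight")]
        # Embeddings and the tied lm_head carry no bias in either layout.
        if stem.rsplit(".", 1)[-1] in ("wte", "wpe", "lm_head"):
            continue
        if stem + ".bias" not in keys:
            missing_bias.append(stem + ".bias")
    return to_transpose, missing_bias


# Key names for a one-layer, bias-free model (illustrative).
example_keys = [
    "transformer.wte.weight", "transformer.wpe.weight",
    "transformer.h.0.ln_1.weight", "transformer.h.0.attn.c_attn.weight",
    "transformer.h.0.attn.c_proj.weight", "transformer.h.0.ln_2.weight",
    "transformer.h.0.mlp.c_fc.weight", "transformer.h.0.mlp.c_proj.weight",
    "transformer.ln_f.weight", "lm_head.weight",
]
to_transpose, missing_bias = plan_conversion(example_keys)
print(len(to_transpose), len(missing_bias))  # 4 7
```

Filling the missing biases with zeros leaves each layer's output unchanged. A missed transpose is the riskier failure: `attn.c_proj.weight` is square (768 × 768), so loading it untransposed raises no shape error.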

Limitations

Trained for only 1 epoch; output stays coherent for roughly 200-500 tokens before falling into repetitive loops
No instruction tuning or RLHF — raw language model
124M parameters is small; don't expect state-of-the-art quality
GPT-2 tokenizer may not handle all financial notation optimally

Training Code

Trained with nanoGPT. Training config available in the source repo.

Development Notes

Date	Topic	Notes
2026-06-25	10-K Download Summary	SEC EDGAR 10-K filing download process
2026-06-25	Financial Pretraining Corpus	Corpus preparation and tokenization
2026-06-26	GPT-2 on SEC-EDGAR Data	Paper structure and training overview
2026-06-26	Training Loss Recovery	Loss spike at 20k steps, recovery analysis
2026-06-26	Prompt File Setup	Inference prompt configuration
2026-06-26	Model Quality Check	Output quality evaluation
2026-06-26	124M Generation Test	Generation samples across prompts
2026-06-26	124M Generation Review	Detailed review of generated outputs
2026-06-26	124M Upload	Model upload to HuggingFace

Citation

@misc{sec-edgar-gpt-124m,
  author = {Zhiwei Li},
  title = {SEC-EDGAR GPT-2 124M},
  year = {2026},
  publisher = {GitHub},
  url = {https://github.com/lzwjava/sec-edgar-gpt}
}
