Add Gemma 4 support by jlamypoirier · Pull Request #492 · ServiceNow/Fast-LLM

jlamypoirier · 2026-04-25T00:40:13Z

Builds on #504 (test infrastructure). Depends on that PR merging first.

Summary

Language model

embedding_scale (LanguageModelEmbeddingsConfig): multiplicative scale applied to word embeddings after lookup. Gemma 4 uses sqrt(hidden_size). The default value of 1.0 adds no overhead: the multiply is skipped by a branch that is resolved at compile time in the @torch.compile-decorated _forward. The scale must be a runtime op rather than a change to weight initialization because tied embeddings share the weight with the LM head, so baking the scale into the weights would also scale the logits.
final_logit_softcap (LanguageModelHeadConfig): applies tanh(logits / cap) * cap before the loss. Gemma 4 uses cap=30. Forward and backward are each @torch.compile-decorated for op fusion. Gradient propagates through the Jacobian (1 - (softcapped / cap)²) before the output-linear backward.
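
The softcap and its gradient can be illustrated with a minimal plain-Python sketch on a single scalar logit. The names `softcap_forward` and `softcap_backward` are hypothetical and not taken from Fast-LLM, and there is no `torch.compile` here; the sketch only checks the Jacobian expression `1 - (softcapped / cap)²` against a finite difference.

```python
import math


def softcap_forward(logit: float, cap: float) -> float:
    """Squash a logit into (-cap, cap) with tanh(logit / cap) * cap."""
    return math.tanh(logit / cap) * cap


def softcap_backward(grad_output: float, softcapped: float, cap: float) -> float:
    """Chain rule through the softcap, using only the saved forward output."""
    return grad_output * (1.0 - (softcapped / cap) ** 2)


cap = 30.0  # value the description gives for Gemma 4
logit = 45.0
capped = softcap_forward(logit, cap)
analytic = softcap_backward(1.0, capped, cap)

# Central finite difference as an independent check of the Jacobian.
eps = 1e-5
numeric = (softcap_forward(logit + eps, cap) - softcap_forward(logit - eps, cap)) / (2 * eps)
```

Because the backward pass needs only the softcapped output, a real implementation would not have to keep the raw logits around for the gradient.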

Normalization

FixedRMSNormConfig / FixedRMSNormalization: no-weight RMS normalization. Triton kernel extended with has_weight: tl_constexpr to skip weight load/multiply when weight=None; torch fallback uses torch.rms_norm(..., weight=None).
post_mixer_normalization / post_mlp_normalization (DecoderBlockConfig): optional normalization applied to mixer/MLP outputs before the residual add. Gemma 4 applies RMSNorm at both positions.
pre_mixer_normalization / pre_mlp_normalization (DecoderBlockConfig): independent overrides for the pre-norm at each sub-layer. Each falls back to the shared normalization setting when unset, so the two sub-layers can use different norm types (e.g. none for pre-MLP in certain Gemma 4 block variants).
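
As a rough illustration of the two ideas in this section, the sketch below implements a weight-free RMS norm and a sub-layer that applies optional pre- and post-norms around a layer before the residual add. It is plain Python on lists, not the Triton or torch path; `fixed_rms_norm`, `sub_layer`, and the `eps` value are assumptions made for the example.

```python
import math


def fixed_rms_norm(x: list[float], eps: float = 1e-6) -> list[float]:
    """RMS-normalize a vector with no learnable weight."""
    mean_square = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)
    return [v * inv_rms for v in x]


def sub_layer(x, layer, pre_norm=None, post_norm=None):
    """Compute x + post_norm(layer(pre_norm(x))); either norm may be None."""
    hidden = pre_norm(x) if pre_norm is not None else x
    hidden = layer(hidden)
    if post_norm is not None:
        hidden = post_norm(hidden)  # applied before the residual add
    return [a + b for a, b in zip(x, hidden)]


x = [3.0, -4.0, 0.0, 0.0]
normed = fixed_rms_norm(x)
out = sub_layer(
    x,
    layer=lambda h: [2.0 * v for v in h],  # toy stand-in for a mixer or MLP
    pre_norm=fixed_rms_norm,
    post_norm=fixed_rms_norm,
)
```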

Attention

query_norm / key_norm (AttentionConfig): optional per-head RMSNorm applied to query and key vectors before RoPE. Gradient handled via a local autograd subgraph inside wrap_forward_backward.
value_norm (AttentionConfig): optional per-head normalization applied to value projections before attention. Gemma 4 uses fixed_rms_norm (no learnable weight).
shared_key_value (AttentionConfig): single key projection reused as value. Gradients from both key and value paths are summed back to the projection in the backward pass.
In-place rotary fix: triton_rotary_ wrote results in-place, silently corrupting the saved norm output when query_norm was active (both tensors shared storage via .detach()). Added output_ptr to the Triton kernel and inplace_query flag through the rotary layer so the query gets a fresh allocation when a query_norm context is live.
ProportionalRotaryConfig / ProportionalRotary: partial RoPE where only the first partial_rotary_factor fraction of head dimensions receive positional encoding (NoPE for the rest, via zero angle scales). Gemma 4 global-attention layers use partial_rotary_factor=0.5.
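
The "zero angle scales" idea for partial RoPE can be sketched as follows. This is a hedged plain-Python illustration, not the Fast-LLM kernel: the interleaved pairing `(x[2i], x[2i+1])`, the use of the full head size in the frequency exponent, and the names `angle_scales` and `apply_rotary` are all assumptions made for the example.

```python
import math


def angle_scales(head_size: int, partial_rotary_factor: float, theta: float = 10000.0) -> list[float]:
    """One angle scale per pair of dimensions; zero for the NoPE pairs."""
    num_pairs = head_size // 2
    num_rotary = int(partial_rotary_factor * num_pairs)
    return [
        theta ** (-2.0 * i / head_size) if i < num_rotary else 0.0
        for i in range(num_pairs)
    ]


def apply_rotary(x: list[float], position: int, scales: list[float]) -> list[float]:
    """Rotate each pair (x[2i], x[2i+1]) by position * scales[i]; returns a new list."""
    out = list(x)
    for i, scale in enumerate(scales):
        angle = position * scale
        cos, sin = math.cos(angle), math.sin(angle)
        a, b = x[2 * i], x[2 * i + 1]
        out[2 * i] = a * cos - b * sin
        out[2 * i + 1] = a * sin + b * cos
    return out


scales = angle_scales(head_size=8, partial_rotary_factor=0.5)
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
rotated = apply_rotary(x, position=3, scales=scales)
```

A zero scale gives a zero angle at every position, so the rotation reduces to the identity and the second half of `x` passes through unchanged without a separate code path.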

MoE

HybridMoEMLPConfig / HybridMoEMLP: new MLP variant combining an always-active dense MLP with top-K routed experts. Each branch has optional pre/post norms (dense_pre_norm, dense_post_norm, moe_pre_norm, moe_post_norm). Gemma 4 uses this layout with separate intermediate sizes for the dense and expert paths.
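
A minimal sketch of the hybrid layout, assuming the two branches are simply summed and that routing weights come from a softmax over the selected logits (neither detail is stated above). The per-branch norms are omitted, and `hybrid_moe` and `make_scaling_layer` are hypothetical names.

```python
import math
from typing import Callable, Sequence

Vector = list[float]
Layer = Callable[[Vector], Vector]


def hybrid_moe(
    x: Vector,
    dense: Layer,
    experts: Sequence[Layer],
    router_logits: Sequence[float],
    top_k: int,
) -> Vector:
    """Dense branch output plus a weighted sum of the top-k expert outputs."""
    out = dense(x)  # the dense branch is always active
    top = sorted(range(len(experts)), key=lambda e: router_logits[e], reverse=True)[:top_k]
    # Assumption: routing weights are a softmax over the selected logits.
    max_logit = max(router_logits[e] for e in top)
    exps = [math.exp(router_logits[e] - max_logit) for e in top]
    total = sum(exps)
    for e, w in zip(top, exps):
        expert_out = experts[e](x)
        out = [o + (w / total) * v for o, v in zip(out, expert_out)]
    return out


def make_scaling_layer(factor: float) -> Layer:
    """Toy layer that multiplies its input by a constant."""
    return lambda x: [factor * v for v in x]


y = hybrid_moe(
    [1.0, 2.0],
    dense=make_scaling_layer(1.0),
    experts=[make_scaling_layer(10.0), make_scaling_layer(20.0), make_scaling_layer(30.0)],
    router_logits=[0.0, 5.0, 5.0],
    top_k=2,
)
```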

Checkpoint converter

Gemma 4 HuggingFace checkpoint converter (fast_llm/models/gpt/conversion/gemma4.py): full import/export support for the Gemma 4 text model family including sliding-window and full-attention pattern blocks, per-head norms, partial RoPE, hybrid MoE blocks, and tied embeddings.
attention_k_eq_v → shared_key_value: Gemma 4 26B-A4B sets attention_k_eq_v=True for full-attention layers; the converter maps this to AttentionConfig.shared_key_value=True and routes through a single k_proj weight (no v_proj).
MoE weight layout: HF stores expert weights as [num_experts, out, in] batched tensors; the converter reshapes to Fast-LLM's flat [num_experts * out, in] layout for gate_up_proj and handles the additional transpose for down_proj.
use_bidirectional_attention: exported as None (Fast-LLM is text-only; bidirectional attention for vision tokens is not implemented).
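
The expert weight reshapes can be sketched on nested lists. This is an illustration of the index manipulation only, assuming gate_up_proj is `[E, out, in]` and down_proj is `[E, H, I]` on the HF side and `[E * I, H]` on the Fast-LLM side; the function names are hypothetical and the real converter operates on tensors.

```python
Matrix = list[list[int]]


def flatten_gate_up(weight: list[Matrix]) -> Matrix:
    """[E, out, in] -> [E * out, in]: stack the experts' rows."""
    return [row for expert in weight for row in expert]


def flatten_down(weight: list[Matrix]) -> Matrix:
    """[E, H, I] -> [E * I, H]: transpose each expert, then stack."""
    return [list(col) for expert in weight for col in zip(*expert)]


def unflatten_down(flat: Matrix, num_experts: int) -> list[Matrix]:
    """Inverse of flatten_down, as an export path would need."""
    rows = len(flat) // num_experts
    return [
        [list(col) for col in zip(*flat[e * rows:(e + 1) * rows])]
        for e in range(num_experts)
    ]


# E=2 experts, H=2, I=3
down = [
    [[1, 2, 3], [4, 5, 6]],
    [[7, 8, 9], [10, 11, 12]],
]
flat = flatten_down(down)
restored = unflatten_down(flat, num_experts=2)
```

Checking that `restored == down` is the list-level analogue of what a round-trip test verifies for the whole checkpoint.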

Not yet implemented

Per-Layer Embeddings (PLE): Gemma 4 feeds an auxiliary per-layer embedding signal (from a separate 262k-entry table) into each decoder block. Exported as hidden_size_per_layer_input: 0 to disable the feature until it is implemented in Fast-LLM. Round-tripping a real Gemma 4 checkpoint will lose PLE weights. Follow-up work needed.

Tests

Adds Gemma-specific test cases on top of the parametrized suites introduced in #504:

tests/layers/test_rotary.py: ProportionalRotary variants across head sizes and sequence lengths.
tests/layers/test_attention.py: 6 norm variants (no_norm, query_norm, key_norm, value_norm, both_norms, all_norms) per base case; shared-key-value cases with norm variants.
tests/layers/test_embedding.py: embedding_scale variant added to the 3 base cases.
tests/layers/test_lm_head.py: final_logit_softcap=2.0 case.
tests/layers/test_decoder_block.py: 4 cases — no post-norms, post-mixer only, post-MLP only, both.
tests/layers/test_mlp.py: hybrid MoE cases added.
tests/models/: 17 model tests for gemma4 (simple, bf16, fp16, checkpoint, resume, conversion, round-trip, load-pretrained, huggingface, frozen-weights, dtype variants).
tests/models/test_hf_roundtrip.py: test_hf_roundtrip[gemma4] using a scaled-down google/gemma-4-26B-A4B config to exercise the full import/export cycle.

Test plan

pytest -v tests/layers/test_rotary.py — proportional variants pass
pytest -v -n 8 tests/layers/test_attention.py — norm and shared-kv variants pass
pytest -v tests/layers/test_embedding.py — embedding_scale variant passes
pytest -v tests/layers/test_lm_head.py — softcap case passes
pytest -v tests/layers/test_decoder_block.py — 4 passed
pytest -v -n 8 tests/layers/test_mlp.py — hybrid MoE cases pass
pytest -v -n 8 --models gemma4 tests/models/ — 18 passed (17 + roundtrip)

🤖 Generated with Claude Code

- `LanguageModelEmbeddingsConfig.embedding_scale`: multiplicative scale applied to word embeddings after lookup (Gemma 4 uses sqrt(hidden_size)). Zero overhead for the default value of 1.0 via a compile-time branch in the @torch.compile-decorated _forward.
- `LanguageModelHeadConfig.final_logit_softcap`: applies tanh(logits / cap) * cap before the loss. Forward and backward are each wrapped in @torch.compile for op fusion. Gradient back-propagates through the Jacobian (1 - (softcapped / cap)^2) before the output linear backward.
- New test_embedding.py: generic parametrized embedding layer test covering scale, dtype, full_precision_residual, position embeddings, and padding (3 base cases x 4 variants).
- Adds final_logit_softcap case to test_lm_head.py.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
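
To see why the embedding scale has to be applied to the lookup result rather than folded into a tied weight, consider this small plain-Python sketch. The `embed` and `lm_head` helpers are hypothetical, and the branch is an ordinary runtime `if`; under `torch.compile`, a branch on a Python constant like this would typically be resolved when the function is traced.

```python
import math

hidden_size = 4
embedding_scale = math.sqrt(hidden_size)  # 2.0

# One shared [vocab, hidden] weight used for both the lookup and the LM head.
weight = [
    [0.1, 0.2, 0.3, 0.4],
    [0.5, 0.6, 0.7, 0.8],
]


def embed(token: int, w: list[list[float]], scale: float = 1.0) -> list[float]:
    """Look up a row and scale it, skipping the multiply for the default scale."""
    row = w[token]
    if scale == 1.0:
        return list(row)
    return [scale * v for v in row]


def lm_head(hidden: list[float], w: list[list[float]]) -> list[float]:
    """Tied output projection: one logit per vocabulary row."""
    return [sum(h * v for h, v in zip(hidden, row)) for row in w]


hidden = [1.0, 0.0, 0.0, 0.0]
scaled_embedding = embed(1, weight, embedding_scale)

# Runtime scale: the shared weight, and therefore the logits, are untouched.
logits_runtime_scale = lm_head(hidden, weight)

# Baking the scale into the shared weight changes the logits as well.
baked = [[embedding_scale * v for v in row] for row in weight]
logits_baked = lm_head(hidden, baked)
```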

- AttentionConfig: add query_norm and key_norm fields (NormalizationConfig | None)
- Attention: apply QK norms before RoPE in forward/backward, with wrap_forward_backward-compatible gradient handling
- DecoderBlockConfig: add post_mixer_normalization and post_mlp_normalization fields
- DecoderBlock: apply post-norms to mixer/MLP outputs before residual add
- Tests: test_qk_norm (4 cases) and test_post_norms (4 cases)

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>

Add MQA (kv_heads=1), MHA (kv_heads=heads), rotary, and query/key norm variants to the parametrized attention test, bringing it to 96 cases. The independent reference (plain F.linear + per-head einsum loop) now covers all combinations. Run entirely on GPU with TF32 disabled via a _no_tf32() context manager to keep precision tight without CPU-Triton conflicts.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>

Adds ProportionalRotaryConfig/ProportionalRotary for partial RoPE (partial_rotary_factor<1), where NoPE dimensions pass through via zero angle scales. Replaces the ad-hoc test_rotary with a single parametrized test covering default, big-theta, llama3, yarn, 2d, and proportional variants across multiple head sizes and sequence lengths.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>

Adds FixedRMSNormConfig/FixedRMSNormalization, a no-weight RMS norm with triton (has_weight constexpr) and torch paths. Wires it into AttentionConfig as value_norm (NormalizationConfig|None), applying fixed-scale RMS norm to value projections per head. Also adds shared_key_value, which uses a single key projection reused as value with gradients summed back in the backward pass. Extends test_attention with value_norm and all_norms norm variants across all base cases, plus a shared_key_value case family.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
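
The gradient summation for shared_key_value follows from the chain rule: when one projection output is used in two roles, its gradient is the sum of the gradients from each role. The sketch below checks this on a toy loss in plain Python; `loss`, `grad_key`, and `grad_value` are hypothetical stand-ins and do not model real attention.

```python
def loss(key: list[float], value: list[float]) -> float:
    """Toy stand-in for attention: any function that uses key and value separately."""
    return sum(k * k for k in key) + sum(3.0 * v for v in value)


def grad_key(key: list[float], value: list[float]) -> list[float]:
    """dL/dkey for the toy loss."""
    return [2.0 * k for k in key]


def grad_value(key: list[float], value: list[float]) -> list[float]:
    """dL/dvalue for the toy loss."""
    return [3.0 for _ in value]


projection = [0.5, -1.0]   # output of the single key projection
key = value = projection   # shared_key_value: the same data in both roles

# Gradients from the key path and the value path are summed at the projection.
grad_projection = [
    gk + gv for gk, gv in zip(grad_key(key, value), grad_value(key, value))
]

# Finite-difference check: perturb the projection and feed it to both roles.
eps = 1e-6
numeric = []
for i in range(len(projection)):
    up = list(projection)
    up[i] += eps
    down = list(projection)
    down[i] -= eps
    numeric.append((loss(up, up) - loss(down, down)) / (2 * eps))
```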

triton_rotary_ wrote results in-place, silently corrupting the saved norm output when query_norm was active (both tensors shared storage via .detach()). Add output_ptr to the Triton kernel and inplace_query flag through the rotary layer so the query gets a fresh allocation when a query_norm context is live. Also rename *_norm_ctx -> *_norm_context for consistency.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
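
The aliasing hazard can be reproduced in miniature with Python lists, where two names referring to the same list play the role of two tensors sharing storage. This is only an analogy: `rotate_` and `rotate` are hypothetical stand-ins for an in-place kernel and for the same kernel writing to a separate output buffer.

```python
def rotate_(x: list[float]) -> list[float]:
    """Stand-in for an in-place kernel: overwrites its input."""
    for i in range(0, len(x), 2):
        x[i], x[i + 1] = -x[i + 1], x[i]
    return x


def rotate(x: list[float]) -> list[float]:
    """Out-of-place variant: works on a fresh buffer and leaves the input alone."""
    return rotate_(list(x))


norm_output = [1.0, 2.0, 3.0, 4.0]
saved_for_backward = norm_output  # alias to the same storage, not a copy

query = rotate(norm_output)       # safe: the saved data is untouched
snapshot_after_safe_call = list(saved_for_backward)

rotate_(norm_output)              # unsafe: the saved data is rotated as well
```

After the in-place call, `saved_for_backward` no longer holds the norm output, so a backward pass that relied on it would compute gradients from the wrong values.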

- Add HybridMoEMLPConfig/HybridMoEMLP: always-active dense MLP + top-K routed experts with optional per-path pre/post norms
- Add pre_mixer_normalization and pre_mlp_normalization to DecoderBlockConfig so norm_1 and norm_2 can be configured independently; normalization remains the shared default when either is unset
- Add tests/layers/test_mlp.py covering HybridMoEMLP composition and norms

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>

Adds import/export support for Gemma 4 text models (gemma4 format):

- Pattern decoder with alternating sliding-window and global attention
- Per-head query/key/value norms, post-attention and post-MLP norms
- Partial RoPE for global attention layers
- Hybrid dense+MoE blocks with pre/post norms
- Tied embeddings with sqrt(hidden_size) embedding scale
- Logit softcapping

Exports `hidden_size_per_layer_input: 0` to disable Per-Layer Embeddings (PLE) in the native HuggingFace model; TODO to implement PLE in Fast-LLM.

Adds `gemma4` model testing config and registers the format with GPTModelConfig and AutoGPTHuggingfaceCheckpointHandler.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>

…rip test

- Map HF attention_k_eq_v=True to AttentionConfig.shared_key_value=True for full-attention layers in the 26B-A4B model (K projection is reused as V; only a single k_proj weight exists, no v_proj)
- Add Gemma4MoELayer1Converter / Gemma4MoELayer2Converter to correctly reshape batched expert weights: gate_up_proj [E,2I,H] ↔ [E*2I,H] and down_proj [E,H,I] ↔ [E*I,H] (permute+reshape)
- Export use_bidirectional_attention=None (text-only; vision tokens not supported)
- Add test_hf_roundtrip[gemma4] using google/gemma-4-26B-A4B config

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>

test_rotary: triton_rotary_ modifies query in-place, so clone before calling forward to avoid feeding the already-rotated tensor to the reference implementation (which caused a double-rotation mismatch).

test_mlp: increase _NUM_TOKENS/_HIDDEN_SIZE/_INTERMEDIATE_SIZE from 16/64/32 to 128/128/128 so dimensions satisfy the block_size_row=128, block_size_col=128 compile-time assertions in output_sparse_matmul_kernel when FAST_LLM_SKIP_TRITON_AUTOTUNE is set.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>

Gemma 4 multiplies the block output by a per-layer scalar (HF stores it as a non-trained `register_buffer("layer_scalar", ones(1))`). Expose this as an `OptionalParameterConfig` field on `DecoderBlock`, disabled by default. The Gemma 4 converter enables it with `lr_scale=0` to match HF's non-trained semantics; the test fixture mirrors that so frozen-parameter packing produces a consistent shard layout across conversion paths.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

Apply `output_scale` inside `_bias_dropout_add` so torch.compile fuses the multiply with the residual add. Trim associated comments and the field-desc blurb.

Also fix `tests/test_config.py::test_validate_*_without_import` to use `sys.executable` instead of literal `python3`. The convention check is unchanged (still pre-imports only yaml/requests/packaging then strips site-packages); the previous form failed on systems where `python3` resolves to a python without yaml installed (e.g. macOS Homebrew).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

Promote `pre_norm`/`post_norm` to `MLPBaseConfig` so all MLP variants (`MLP`, `MixtureOfExpertMLP`, `HybridMoEMLP`) carry their own input/output norms uniformly. Drop the redundant `dense_pre_norm`/`dense_post_norm`/`moe_pre_norm`/`moe_post_norm` from `HybridMoEMLPConfig`; those are now expressed via the inner `dense.pre_norm`/`routed.pre_norm`/etc., with optional wrapper-level pre/post norms shared across both branches.

Add Gemma-style router preprocessing to `MoEMLPConfig`: `router_normalization` (typically `fixed_rms_norm`), `router_scale` (`OptionalParameterConfig`, learnable per-feature), and `router_input_scale` (constant scalar; set to `hidden_size ** -0.5` for Gemma 4). The router runs on the raw input independently of `pre_norm`, which now applies only to the expert path. The two router multiplies are fused via `@torch.compile`.

Wire the Gemma 4 converter to import `router.scale`, set `router_input_scale` from `hidden_size`, and configure per-branch norms instead of the wrapper norms.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

Adds an optional `router_per_expert_scale` learnable parameter applied to top-k scores after routing, matching HF Gemma 4's `router.per_expert_scale`.
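
Putting the router fields together, a rough plain-Python sketch of the routing path might look like the following. The order of operations, the softmax over the selected logits, the `eps` value, and the helper names `router_input` and `route` are assumptions for illustration; only the field names and the `hidden_size ** -0.5` input scale come from the commit messages.

```python
import math


def router_input(x, router_scale, router_input_scale, eps=1e-6):
    """Weight-free RMS norm, then the per-feature and scalar multiplies."""
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v * inv_rms * s * router_input_scale for v, s in zip(x, router_scale)]


def route(x, router_weight, router_scale, per_expert_scale, top_k):
    """Return the selected expert indices and their scaled scores."""
    hidden = router_input(x, router_scale, router_input_scale=len(x) ** -0.5)
    logits = [sum(a * b for a, b in zip(hidden, row)) for row in router_weight]
    top = sorted(range(len(logits)), key=lambda e: logits[e], reverse=True)[:top_k]
    # Assumption: scores are a softmax over the selected logits.
    max_logit = max(logits[e] for e in top)
    exps = [math.exp(logits[e] - max_logit) for e in top]
    total = sum(exps)
    # The per-expert scale is applied to the top-k scores after routing.
    scores = [per_expert_scale[e] * w / total for e, w in zip(top, exps)]
    return top, scores


x = [2.0, 0.0, 0.0, 0.0]
router_weight = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [-1.0, 0.0, 0.0, 0.0],
]
top, scores = route(
    x,
    router_weight,
    router_scale=[1.0, 1.0, 1.0, 1.0],
    per_expert_scale=[2.0, 1.0, 1.0],
    top_k=2,
)
```

Because the per-expert scale multiplies the scores only after selection, it changes how much each chosen expert contributes without changing which experts are chosen.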

Switches the gemma4 test fixture from dense MLP to a HybridMoE structure that mirrors `Gemma4HybridMoEMLPConverter.import_config`, so the tests actually cover the new MoE/router code paths (`router_normalization`, `router_scale`, `router_input_scale`, `router_per_expert_scale`, `pre_norm`/`post_norm` per branch). Also rewrites `test_frozen_weights` to compare unpadded `stage.parameter_count` instead of buffer `numel()`. Each FSDP pads independently to `SHARD_PAD_TO_MULTIPLE`, so moving parameters between trainable and frozen FSDPs can shift total padding even when no parameter changed — the previous buffer-equality assertion only held incidentally for fixtures whose MLP parameter counts happened to align. Bumps `gemma4` `compare_factor` to 8.0: `routed.pre_norm.weight`'s gradient is tiny under `init_1`, hitting the fp16 rms_eps floor (same class of issue as the existing `post_mlp_norm` note).
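
The padding effect behind the `test_frozen_weights` change is simple arithmetic, sketched here with made-up numbers. The value 32 for `SHARD_PAD_TO_MULTIPLE` and the helper names are illustrative assumptions, not Fast-LLM's actual settings.

```python
SHARD_PAD_TO_MULTIPLE = 32  # illustrative value only


def padded(count: int, multiple: int = SHARD_PAD_TO_MULTIPLE) -> int:
    """Round a parameter count up to the next multiple."""
    return -(-count // multiple) * multiple


def totals(trainable: int, frozen: int) -> tuple[int, int]:
    """Return (unpadded parameter count, total padded buffer size)."""
    unpadded = trainable + frozen
    buffer_size = padded(trainable) + padded(frozen)  # each group pads on its own
    return unpadded, buffer_size


before = totals(trainable=96, frozen=0)
after = totals(trainable=90, frozen=6)  # same parameters, 6 of them now frozen
```

Here the unpadded count stays at 96 in both cases while the padded buffer grows from 96 to 128, which is why comparing unpadded counts is the more robust assertion.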

Adds explicit `NotImplementedError` guards so users hit a clear failure instead of silently dropped fields:

* `Gemma4BaseModelConverter.import_config` rejects the four deferred Gemma 4 variants (PLE, cross-layer KV sharing, double-wide MLP, bidirectional text attention).
* `LlamaBlockConverter.export_config` rejects `output_scale` (newly added on `DecoderBlockConfig`).
* `LlamaMLPConverter.export_config` rejects MLP `pre_norm`/`post_norm` (newly added on `MLPBaseConfig`); inherited by Mistral, Mixtral, Qwen2, MTP-Llama, and the rest of the Llama-family converters.
* `MixtralMLPConverter.export_config` rejects the new MoE router fields (`router_normalization`, `router_scale`, `router_input_scale`, `router_per_expert_scale`).
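
The guard pattern itself is small. The sketch below shows one possible shape for it on a plain dictionary config; the `export_config` function, the `UNSUPPORTED_FIELDS` tuple, and the dictionary layout are hypothetical and do not reflect the real converter classes.

```python
UNSUPPORTED_FIELDS = ("output_scale", "pre_norm", "post_norm")


def export_config(config: dict) -> dict:
    """Hypothetical exporter that refuses fields the target format cannot represent."""
    for name in UNSUPPORTED_FIELDS:
        if config.get(name) is not None:
            raise NotImplementedError(f"`{name}` cannot be exported to this format.")
    return {"hidden_size": config["hidden_size"]}


exported = export_config({"hidden_size": 8})

try:
    export_config({"hidden_size": 8, "output_scale": {"enabled": True}})
    rejected = False
except NotImplementedError:
    rejected = True
```

Raising at export time trades a hard failure for the guarantee that an exported checkpoint never silently differs from the model that produced it.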

jlamypoirier changed the title ~~Add embedding_scale and final_logit_softcap (Gemma 4 prep)~~ Add QK norm, post-block norms, embedding scale, and logit softcap (Gemma 4 prep) Apr 27, 2026

jlamypoirier changed the title ~~Add QK norm, post-block norms, embedding scale, and logit softcap (Gemma 4 prep)~~ Add Gemma 4 attention features: QK/value norms, shared KV, partial RoPE, embedding scale, logit softcap Apr 28, 2026

jlamypoirier force-pushed the worktree-gemma branch 2 times, most recently from f9944a5 to e1a8137 Compare May 1, 2026 06:16

jlamypoirier changed the base branch from main to jlp_test-improvements May 1, 2026 06:18

jlamypoirier force-pushed the worktree-gemma branch from e1a8137 to ad8c0af Compare May 1, 2026 06:30

jlamypoirier changed the title ~~Add Gemma 4 attention features: QK/value norms, shared KV, partial RoPE, embedding scale, logit softcap~~ Add Gemma 4 support May 1, 2026

jlamypoirier force-pushed the jlp_test-improvements branch from 3f75028 to 4c0da7a Compare May 1, 2026 07:15

jlamypoirier force-pushed the worktree-gemma branch 2 times, most recently from e631745 to 88eb59f Compare May 1, 2026 07:41

Base automatically changed from jlp_test-improvements to main May 1, 2026 07:48

jlamypoirier and others added 10 commits May 1, 2026 03:49

jlamypoirier force-pushed the worktree-gemma branch from 88eb59f to b6e32b4 Compare May 1, 2026 07:49

jlamypoirier and others added 6 commits May 1, 2026 07:06

Add learnable per-expert scale to MoE router

d3ef932


jlamypoirier wants to merge 16 commits into main from worktree-gemma
