Mattral/mattral

Mattral

ML Engineer · Distributed Training · LLM Systems · Computer Vision

I build things that work at scale -- and try to understand why they work at all.



What I actually do

I work in the space between clean research ideas and the messy reality of clusters that fail, data that drifts, and models that need to stay honest in production.

Day-to-day: cloud-scale ML infrastructure at a hyperscaler, distributed training systems, fault-tolerant checkpointing, LLM safety layers, and the occasional low-level kernel when something needs to be faster or more reliable. The majority of that work lives in private repositories. What you see here are the side projects I chose to open-source because they felt worth sharing.

Things I care about technically

  • Large-scale pre-training infrastructure -- MoE routing, fault-tolerant checkpointing, tensor/pipeline parallelism
  • LLM safety and observability -- keeping models honest at inference time
  • The hardware-software boundary: SIMD, CUDA, kernel-level optimization
  • Novel architectures worth deploying, not just benchmarking

Things I care less about

  • Code that impresses interviewers but breaks on week two
  • Benchmarks that only win on synthetic data
  • Documentation that describes the happy path and nothing else

Selected work

Most of these exist because I needed to solve something concrete.
I’d rather have a few things that are real than many that just look good on a profile.

| Project | What it is | Status |
| --- | --- | --- |
| moe-engine | Research-grade runtime for training large Mixture-of-Experts models at hyperscale. Features a fused Triton router, composable 4D parallelism (DP+EP+TP+PP), strict forward-pass invariants, and elastic fault tolerance with async two-tier checkpointing + automatic expert resharding on node failure. Includes chaos testing and detailed telemetry. Accompanied by a v1 preprint. | Active · Preprint. Single-process PP implemented; multi-process pipeline and large-scale multi-node benchmarks in active development. |
| GuardRail Studio | Inline LLM firewall with measured sub-10 ms p99 latency in load tests. Built with ONNX + Triton, drift detection, LoRA self-updating, and full canary deployment automation. Five documented development phases. | Active. Latency numbers from controlled tests; full production hardware validation ongoing. |
| KANX | Production-grade Kolmogorov-Arnold Networks library with PyTorch + TensorFlow backends, real ONNX export, Docker + Kubernetes support, and FastAPI serving. Includes benchmarks and a preprint. | Active · pip install kanx · Preprint |
| RLHF-PPO-DPO | Modular framework for full RLHF pipelines (SFT → reward modeling → PPO and DPO). Includes distributed design with ZeRO-3, async rollouts, and extensive testing. | Active. Validated end-to-end on single-GPU toy setups; large-scale distributed runs not yet public. |
| FlashSpec | Adaptive speculative decoding engine with online bandit draft selection and Triton-optimized verification. Includes throughput benchmarks on Llama-3 models. | Pre-alpha / Active development. Some CI and GPU tests currently under refactoring. |
| RAG-Multimodal-Financial-Doc-Analysis-and-Recall | Enterprise multimodal RAG system for financial documents (text + tables + charts via GPT). Strong emphasis on async processing, retries, structured observability, and type safety. | Active. Finance-domain focused; detailed load benchmarks not yet public. |
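
For readers who have not met DPO (named in the RLHF-PPO-DPO entry above), here is a minimal sketch of the standard DPO loss for a single preference pair. It is the textbook formula, not code from the repository; the function name, argument names, and the default `beta=0.1` are illustrative assumptions.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Inputs are summed log-probabilities of each response under the
    trainable policy and under the frozen reference model.
    Names and the default beta are hypothetical, for illustration only.
    """
    # How much more the policy prefers "chosen" over "rejected",
    # relative to the reference model.
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    z = beta * margin
    # -log(sigmoid(z)) == log(1 + exp(-z)), written in a numerically stable form.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

# Policy identical to the reference: the loss is log(2).
neutral = dpo_loss(-1.0, -1.0, -1.0, -1.0)
# Policy has moved toward the chosen response: the loss is lower.
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

The loss depends only on the difference between policy and reference log-ratios, which is why DPO needs no separate reward model at training time.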

Stack

Not a comprehensive list. Just what I actually reach for.

Training & inference   PyTorch TensorFlow Triton ONNX TensorRT FSDP2 TorchElastic

LLM ecosystem   Transformers PEFT / LoRA vLLM LangChain FastAPI Triton Inference Server

Distributed & infra   NCCL Kubernetes Helm Terraform Airflow Ray

Observability   Prometheus Grafana OpenTelemetry Weights & Biases

Low-level   C++ AVX2 / SIMD CUDA pybind11

Data   PostgreSQL Qdrant MongoDB Spark Dask


A few honest notes

Most of my interesting work happens in private repositories -- production systems at cloud scale where open-sourcing isn't an option. This GitHub is a public window, not the full picture.

That said, the repositories here are written to the same standard I use privately: tests, type checking, CI, real (if limited) benchmarks, and documentation that tries to admit what doesn’t work yet. When something is experimental or incomplete, the README says so.

I’m especially interested in the kinds of failures that only appear at real cluster scale, the practical trade-offs in LLM safety systems, and whether architectures like KANs will eventually find meaningful production use cases.

My path into this work wasn’t linear. I spent time in data engineering and instrumentation before moving deeper into ML systems. That background still shapes how I think about reliability and observability.


Currently

  • Working on: fixing MoE engine chaos scenario A -- sudden node failure under expert resharding
  • Reading: the Megatron-LM codebase and the FlexAttention paper
  • Thinking about: whether MFU tracking gives you enough signal to catch silent training degradation early
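
On the MFU question, a rough sketch of what such tracking could look like. It assumes the common approximation of 6 × parameters FLOPs per token for dense transformer training (which ignores attention FLOPs) and a simple rolling-mean drop check. The function names, the 10% threshold, the peak-FLOPs figure, and the sample history are all hypothetical.

```python
def mfu(n_params, tokens_per_sec, n_devices, peak_flops_per_device):
    """Model FLOPs utilization: achieved training FLOPs/s over hardware peak.

    Uses the ~6 * N FLOPs-per-token approximation (an assumption; it
    undercounts for long sequences where attention cost matters).
    """
    achieved = 6.0 * n_params * tokens_per_sec
    return achieved / (n_devices * peak_flops_per_device)

def first_degraded_step(history, window=3, rel_drop=0.10):
    """Return the first step index at which the rolling mean of MFU falls
    more than rel_drop below the baseline (mean of the first `window`
    samples), or None if it never does."""
    if len(history) < 2 * window:
        return None
    baseline = sum(history[:window]) / window
    for end in range(2 * window, len(history) + 1):
        recent = sum(history[end - window:end]) / window
        if recent < baseline * (1.0 - rel_drop):
            return end - 1
    return None

# Hypothetical run: 1B parameters, 50k tokens/s, 8 devices at 1e14 peak FLOPs/s.
current = mfu(1e9, 5e4, 8, 1e14)

# Hypothetical per-step MFU samples with a slow decline.
samples = [0.40, 0.40, 0.40, 0.40, 0.39, 0.35, 0.33, 0.32]
alert_step = first_degraded_step(samples)
```

MFU measures throughput, not numerical correctness, so a check like this can only catch degradation that shows up as lost throughput. Loss spikes or silent data corruption at steady throughput would need a different signal.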

Problem-solving

Algorithms are how I warm up. Systems are where I live.


On the equation that changed everything

$$\mathbf{h}_t = \sigma\!\left(\mathbf{W}_h\,\mathbf{h}_{t-1} + \mathbf{W}_x\,\mathbf{x}_t + \mathbf{b}\right)$$

The idea that a machine could hold memory across time -- that the past could shape the present through nothing more than a weight matrix -- was the moment I understood why this field is worth a lifetime.

The equation is simple. What it implies is not.
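
A minimal sketch of that recurrence in plain Python, assuming σ is tanh. The 2-dimensional weights are made up purely for illustration.

```python
import math

def rnn_step(W_h, W_x, b, h_prev, x_t):
    """One step of h_t = sigma(W_h h_{t-1} + W_x x_t + b), with sigma = tanh (assumed)."""
    def matvec(W, v):
        return [sum(w * u for w, u in zip(row, v)) for row in W]
    pre = [a + c + d
           for a, c, d in zip(matvec(W_h, h_prev), matvec(W_x, x_t), b)]
    return [math.tanh(p) for p in pre]

# Hypothetical weights: hidden size 2, input size 1.
W_h = [[0.5, 0.0], [0.0, 0.5]]
W_x = [[1.0], [-1.0]]
b = [0.0, 0.0]

def run(xs):
    """Feed a scalar sequence through the recurrence, starting from h_0 = 0."""
    h = [0.0, 0.0]
    for x in xs:
        h = rnn_step(W_h, W_x, b, h, [x])
    return h

h_with_pulse = run([1.0, 0.0, 0.0])  # input only at the first step
h_silent = run([0.0, 0.0, 0.0])      # no input at all
```

The two sequences end with the same inputs, yet `h_with_pulse` is still non-zero two steps after the pulse while `h_silent` stays at zero. The only thing carrying that difference forward is `W_h` acting on the previous state.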


Outside of work I'm usually reading something I don't fully understand yet, listening to music that has no business being that good, and occasionally wondering if the model actually converged or if I just got lucky. I like working with people who say "I don't know" without embarrassment and argue about architecture in good faith.

mattralminn@gmail.com


Open to interesting conversations about distributed training, LLM infrastructure, or any hard ML systems problem worth losing sleep over.
