Machine Learning Engineer
Specializing in Large Language Models and LLM inference optimization. Serving LLMs at scale and sharing deep technical implementations from first principles.
I'm a Machine Learning Engineer specializing in Large Language Models and LLM inference optimization. Currently at Verloop.io, I'm building autonomous customer support systems powered by LLMs to serve enterprise clients at scale.
I believe in understanding AI systems at a fundamental level—from transformer architectures to inference optimization. Through my technical blog, I break down complex topics like Rotary Positional Embeddings (RoPE), attention mechanisms (MHA, GQA, MLA), and modern LLM architectures, providing complete PyTorch implementations from scratch.
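As a flavor of the kind of from-scratch implementation described above, here is a minimal Rotary Positional Embeddings (RoPE) sketch in PyTorch, using the common "rotate-half" convention. The function name and shapes are illustrative only, not taken from any of the linked articles:

```python
import torch

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary positional embeddings to x of shape (seq_len, dim), dim even.

    Each feature pair is rotated by a position-dependent angle, so relative
    position is encoded directly in the dot products between queries and keys.
    """
    seq_len, dim = x.shape
    half = dim // 2
    # Per-pair rotation frequencies, decaying geometrically across dimensions.
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)   # (half,)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * freqs  # (seq, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, :half], x[:, half:]
    # 2D rotation applied to each (x1_i, x2_i) pair.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```

Because each step is a pure rotation, the transform preserves vector norms, and position 0 is left unchanged (all angles are zero).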
My work bridges cutting-edge research and production systems. I'm currently diving deep into LLM inference optimization — experimenting with serving runtimes like vLLM, TensorRT-LLM, and SGLang, and exploring techniques like quantization, speculative decoding, and KV cache optimization to serve models at scale efficiently. I focus on building with clarity, sharing knowledge openly, and making advanced ML techniques accessible to practitioners.
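To make one of those techniques concrete: the KV-cache optimization mentioned above amounts to storing past attention keys and values so each decode step only computes projections for the new token. A toy sketch follows; the class name and tensor shapes are my own for illustration and are not tied to vLLM, TensorRT-LLM, or SGLang internals:

```python
import torch

class KVCache:
    """Toy per-layer KV cache for autoregressive decoding.

    New keys/values are appended along the sequence axis each step, so
    attention over past tokens does not have to be recomputed.
    """
    def __init__(self) -> None:
        self.k: torch.Tensor | None = None
        self.v: torch.Tensor | None = None

    def update(self, k_new: torch.Tensor, v_new: torch.Tensor):
        # k_new, v_new: (batch, heads, new_tokens, head_dim)
        if self.k is None:
            self.k, self.v = k_new, v_new
        else:
            self.k = torch.cat([self.k, k_new], dim=2)
            self.v = torch.cat([self.v, v_new], dim=2)
        return self.k, self.v
```

Production runtimes replace this naive concatenation with preallocated or paged memory (the idea behind vLLM's PagedAttention), but the interface is the same: feed in one token's K/V, get back the full history.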
Technologies and areas I work with
My professional journey and key achievements
Verloop.io
Building autonomous LLM-powered customer support systems and inference infrastructure for enterprise-scale automation.
Monsoon CreditTech
Developed ML models for credit risk assessment and fraud detection in fintech.
BML Munjal University
Major in Computer Science and Engineering. Published research in Nature Scientific Reports.
Open-source projects spanning LLMs, ML, and data engineering
Autonomous research agent that decomposes queries, executes multi-turn tool calling with web search and arXiv, and validates completeness through self-reflection. Features real-time Streamlit UI.
Natural language interface for Weaviate vector databases through Claude using Model Context Protocol. Enables intuitive database exploration via conversation with 9 inspection tools.
Deep learning OCR system with ResNet encoder and Transformer decoder (14M parameters). Achieves 70% error reduction via augmentation. Deployed as FastAPI microservice on GCP with monitoring.
Latest technical articles combining theory and practice
Mixtral 8x7B — A Deep Dive: a detailed comparison of Mixtral 8x7B with LLaMA 2, and an implementation of an optimized Mixture of Experts (MoE) layer in PyTorch.
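As a taste of the article's topic, here is a minimal top-k gated MoE layer in PyTorch. It follows Mixtral's scheme of softmax-normalizing the top-2 router logits, but the class and parameter names are illustrative, not the article's actual implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Minimal top-k gated mixture-of-experts layer (illustrative sketch)."""
    def __init__(self, dim: int, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.gate = nn.Linear(dim, n_experts, bias=False)  # router
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.SiLU(), nn.Linear(4 * dim, dim))
            for _ in range(n_experts)
        )
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, dim) — a flattened batch of token embeddings.
        logits = self.gate(x)                                # (tokens, n_experts)
        weights, idx = logits.topk(self.k, dim=-1)           # pick top-k experts/token
        weights = F.softmax(weights, dim=-1)                 # renormalize over chosen k
        out = torch.zeros_like(x)
        # Route each token through only its selected experts (sparse compute).
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out
```

The key property is that each token activates only `k` of the `n_experts` feed-forward blocks, which is why Mixtral's effective compute per token is far below its total parameter count; efficient implementations batch tokens per expert rather than looping as above.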
Read on Substack

Step-by-step guide to building the LLaMA model from scratch in PyTorch, with in-depth explanations of each essential component.
Read on Substack

Understanding the evolution from Multi-Head Attention to modern inference optimizations.
Read on Substack

I'm always interested in discussing LLMs, inference optimization, ML engineering, or potential collaborations. Feel free to reach out!
Or send me an email directly at ashishgy77@gmail.com