Insights
Notes on production AI systems, distributed infrastructure, and practical model engineering.
April 23rd, 2026
What the Nemotron multimodal embedding model actually is, how it processes inputs end to end, and the two serving shapes I considered before picking one.
March 29th, 2026
Three synthetic head architectures on the same hardware reveal why GPU utilization is not about model size but about GEMM shape.
March 10th, 2026
Load testing bert-tiny on Triton + TensorRT reveals a fundamentally different bottleneck from the one large models hit: not memory-transfer latency, but CPU-GPU synchronization that keeps a fast GPU starved between batches.
March 6th, 2026
Load testing multilingual-e5-large-instruct on Triton + TensorRT, and reading what Nsight Systems actually shows: batch size, instance count, and where the GPU cycles go.
March 5th, 2026
A breakdown of how GPU work is structured (SMs, warps, Tensor Cores, the memory hierarchy) and what the utilization number actually measures before you try to optimize it.