Insights
Notes on production AI systems, distributed infrastructure, and practical model engineering.
April 23rd, 2026
What the Nemotron multimodal embedding model actually is, how it processes inputs end to end, and the two serving shapes I considered before picking one.
March 29th, 2026
Three synthetic head architectures on the same hardware reveal why GPU utilization is not about model size but about GEMM shape.
March 10th, 2026
Load testing bert-tiny on Triton + TensorRT reveals a fundamentally different bottleneck from the one large models hit: not memory-transfer latency, but CPU-GPU synchronization that keeps a fast GPU starved between batches.
March 6th, 2026
Load testing multilingual-e5-large-instruct on Triton + TensorRT, and reading what Nsight Systems actually shows: batch size, instance count, and where the GPU cycles go.
March 5th, 2026
A breakdown of how GPU work is structured (SMs, warps, Tensor Cores, the memory hierarchy) and what the utilization number actually measures before you try to optimize it.