I Built a Local LLM Benchmark Harness, and It Mostly Started as an Argument With My GPU
For a while now I have had the same nagging question that I suspect a lot of people in security and IT have been quietly circling. Which local model is actually good enough for the work I do, and what does "good enough" even mean once you stop hand-waving? Not the leaderboard scores, not the demos where someone asks a model to write a haiku about Kubernetes, but the actual workloads. Reading a log. Spotting a brute-force attempt that turns into a successful login. Writing an incident report without quietly inventing a threat actor that never existed.
The thing I kept bumping into is that almost every comparison I could find led with speed. Tokens per second, time to first token, who is fastest on what card. And speed matters, I am not pretending it does not. But for the work I do it is nowhere near the most important thing. A model that answers in half a second and gets the severity wrong has not saved me any time, it has just helped me be wrong faster. Accuracy comes first. Speed is the tiebreaker between two models I already trust.
So I did what felt reasonable at the time and slightly unreasonable in hindsight. I built my own benchmark harness to find out.
What it is
At its core it is a small, local benchmark runner for comparing LLM quality and performance across Ollama and llama.cpp. It lets me point a set of models at the same tasks and measure how they do, on my own hardware, with my own definition of good. I can vary the model, the quantization, the context size, how much gets offloaded to the GPU, and then look at the results side by side instead of relying on a gut feeling about which one "seemed smarter yesterday."
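To make "vary the model, the quantization, the context size, and the GPU offload" concrete, here is a minimal sketch of how such a run matrix could be built with the standard library alone. The `RunConfig` class, its field names, and the model names are hypothetical, chosen for illustration; the harness's real config format may look quite different.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class RunConfig:
    """One benchmark run. All field names here are illustrative assumptions."""
    model: str
    quantization: str
    context_size: int
    gpu_layers: int  # how many layers to offload to the GPU (0 = CPU only)


def build_matrix(models, quantizations, context_sizes, gpu_layers):
    """Return every combination of the four knobs as a list of RunConfig."""
    return [
        RunConfig(m, q, c, g)
        for m, q, c, g in product(models, quantizations, context_sizes, gpu_layers)
    ]


# Placeholder values, not real model tags or recommended settings.
matrix = build_matrix(
    models=["model-a", "model-b"],
    quantizations=["q4", "q8"],
    context_sizes=[4096, 16384],
    gpu_layers=[0, 99],
)
```

Each entry in `matrix` is one run against the same task set, which is what makes a side-by-side comparison possible afterwards.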
The tasks are deliberately narrow, because they are the things I personally care about. Defensive log analysis and pulling the suspicious command out of a noisy file. Spotting patterns like repeated failures, token creation, or a sneaky mailbox forwarding rule. Producing structured incident findings as valid JSON. Writing a complete CSIRT report that keeps facts, assumptions, and gaps in separate boxes rather than blending them into a confident smoothie. A bit of writing. A bit of defensive code generation.
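The "valid JSON" requirement is easy to check mechanically. The sketch below shows one way to do it, assuming a findings object with `severity`, `indicators`, and `affected_assets` keys; those key names and the `validate_findings` helper are assumptions for illustration, not the harness's actual schema.

```python
import json

# Hypothetical schema: field name -> required Python type after parsing.
REQUIRED_FIELDS = {"severity": str, "indicators": list, "affected_assets": list}


def validate_findings(raw_output):
    """Parse model output as JSON and check the assumed required fields.

    Returns (data, errors). data is None whenever errors is non-empty.
    """
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        return None, [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return None, ["top-level value is not an object"]

    errors = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in data:
            errors.append(f"missing field: {field}")
        elif not isinstance(data[field], expected_type):
            errors.append(f"wrong type for {field}")
    return (data if not errors else None), errors
```

A model that wraps its answer in chatty preamble fails at the parsing step, which is the behaviour a strict check like this is meant to surface.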
It is not an official benchmark suite, and I am very careful to say that. It will not tell you the One True Model for all of humanity. It tells me which model behaves well on the work in front of me, which honestly is the only ranking I have ever needed.
The runner itself has no third-party dependencies, which was a small point of pride. The test data is synthetic. The expected findings for each log file are defined up front, so I can check not just whether a model sounds right but whether it actually recovered the severity, the indicators, and the affected assets that were really in the data.
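Checking a model's answer against expected findings could look roughly like this. It is a sketch under assumptions: the dictionary keys, the exact-match comparison, and the `score_findings` name are all hypothetical, and a real scorer would probably need fuzzier matching.

```python
def score_findings(expected, actual):
    """Compare a model's findings to the expected findings for one log file.

    Severity must match exactly (case-insensitive). Indicators and affected
    assets are scored as recall: the fraction of expected items recovered.
    """
    def recall(key):
        want = {str(x).lower() for x in expected.get(key, [])}
        got = {str(x).lower() for x in actual.get(key, [])}
        # Nothing expected means nothing could be missed.
        return len(want & got) / len(want) if want else 1.0

    severity_match = (
        str(expected.get("severity", "")).lower()
        == str(actual.get("severity", "")).lower()
    )
    return {
        "severity_match": severity_match,
        "indicator_recall": recall("indicators"),
        "asset_recall": recall("affected_assets"),
    }


# Synthetic example data, using a documentation-range IP address.
expected = {
    "severity": "high",
    "indicators": ["203.0.113.7", "powershell -enc"],
    "affected_assets": ["mail01"],
}
actual = {
    "severity": "High",
    "indicators": ["203.0.113.7"],
    "affected_assets": ["MAIL01"],
}
scores = score_findings(expected, actual)
```

Here the model gets the severity and the asset right but recovers only one of the two indicators, so the score reflects a partial miss rather than a plausible-sounding pass.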
Why I did it
The honest answer is that I wanted to stop guessing. It is easy to read a model release post, feel a little spark of optimism, pull the model, ask it two questions, and decide it is brilliant or useless based on almost nothing. That is not evaluation. That is vibes with extra steps.
There was also a more stubborn reason. I run this stuff locally, on real GPUs, and getting a model to actually use the hardware turned out to be its own quiet adventure. A good chunk of the project exists because I kept hitting the gap between "the GPU is clearly there" and "Ollama has decided to run everything on the CPU like it is 2009." So the harness grew a whole diagnostic side: scripts that separate the host GPU path, the Docker GPU passthrough, and the model execution itself, because when something runs at the speed of cold treacle you want to know which layer to blame before you start rewriting configs at midnight.
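For the last of those layers, one rough check is to compare how much of a loaded model sits in GPU memory against its total size. The sketch below only interprets an already-parsed payload, assuming it is shaped like the response of Ollama's `GET /api/ps` endpoint, with a `models` list whose entries carry `name`, `size`, and `size_vram`. Those field names are an assumption worth verifying against the Ollama version in use, and `classify_offload` is a hypothetical helper, not part of the harness.

```python
def classify_offload(ps_payload):
    """Label each loaded model as 'gpu', 'partial', or 'cpu'.

    Assumes entries with 'size' (total bytes) and 'size_vram' (bytes in GPU
    memory). Missing fields are treated as zero.
    """
    report = {}
    for entry in ps_payload.get("models", []):
        name = entry.get("name", "unknown")
        size = entry.get("size", 0)
        vram = entry.get("size_vram", 0)
        if size > 0 and vram >= size:
            report[name] = "gpu"      # fully resident in GPU memory
        elif vram > 0:
            report[name] = "partial"  # split between GPU and CPU
        else:
            report[name] = "cpu"      # nothing offloaded
    return report


# Hand-written sample payload with placeholder model names and sizes.
sample = {
    "models": [
        {"name": "model-a", "size": 4_000_000_000, "size_vram": 4_000_000_000},
        {"name": "model-b", "size": 8_000_000_000, "size_vram": 3_000_000_000},
        {"name": "model-c", "size": 2_000_000_000, "size_vram": 0},
    ]
}
report = classify_offload(sample)
```

A "cpu" result here only says the model is not in GPU memory. Whether the cause is the host driver or the container passthrough still has to be checked separately.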
And underneath all of it is the thing that matters most in security work. I do not want a model that is fast and fluent and wrong. A model that invents a CVE, names a ransomware family that was never mentioned, or quietly upgrades "exfiltration not confirmed" into "data was stolen" is worse than no model at all. So the harness specifically watches for that. It flags unsupported claims and possible hallucinations, because the failure mode I am most afraid of is the confident one.
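A very simple version of that check is to look for specifics in the model's answer that never appear in the source log. The sketch below flags CVE identifiers and a caller-supplied list of watch phrases. It is plain substring matching under stated assumptions, far cruder than a real hallucination detector, and `find_unsupported_claims` and its parameters are hypothetical names.

```python
import re

# Matches identifiers of the form CVE-YYYY-NNNN (four or more trailing digits).
CVE_PATTERN = re.compile(r"CVE-\d{4}-\d{4,}", re.IGNORECASE)


def find_unsupported_claims(source_text, model_output, watch_terms=()):
    """Return a list of flags for claims in the output that the source lacks."""
    source_lower = source_text.lower()
    output_lower = model_output.lower()
    flags = []

    # Any CVE the model mentions must literally appear in the source.
    mentioned = sorted({m.upper() for m in CVE_PATTERN.findall(model_output)})
    for cve in mentioned:
        if cve.lower() not in source_lower:
            flags.append(f"unsupported CVE: {cve}")

    # Watch phrases (e.g. malware family names) present only in the output.
    for term in watch_terms:
        term_lower = term.lower()
        if term_lower in output_lower and term_lower not in source_lower:
            flags.append(f"unsupported term: {term}")
    return flags


# Synthetic example; the CVE number is deliberately fictional.
source = "Failed logins for admin from 203.0.113.7. Exfiltration not confirmed."
output = "The attacker exploited CVE-2099-0001 and data was stolen."
flags = find_unsupported_claims(source, output, watch_terms=("data was stolen",))
```

The second flag is the one that matters most: the source says exfiltration was not confirmed, and the output has quietly upgraded that into a statement of fact.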
What I hope to get from it
Mostly, I hope to define "good enough" in a way I can actually defend. For my work that line sits much closer to accuracy than to speed. I want to know that a model retrieves the malicious command that was really in the log, holds on to negative facts like "exfiltration not confirmed," and returns valid structured findings when I ask for them, before I care at all about how quickly it does any of it. Once two models both clear that bar, then yes, latency and tokens per second become the tiebreaker. It also matters how a model holds up as the context grows, and whether its quality quietly falls apart long before it reaches the hard context limit. But the order is deliberate. Accurate and a bit slow beats fast and confidently wrong every single time.
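That ordering, accuracy as a gate and speed as a tiebreaker, can be written down in a few lines. This is a sketch only: the 0.9 threshold, the result field names, and the numbers in the example are invented for illustration and are not measurements from any real run.

```python
def rank_models(results, min_accuracy=0.9):
    """Keep models that clear the accuracy bar, then order them by speed.

    The threshold is an assumed placeholder; models below it are dropped
    entirely, no matter how fast they are.
    """
    trusted = [r for r in results if r["accuracy"] >= min_accuracy]
    return sorted(trusted, key=lambda r: r["tokens_per_second"], reverse=True)


# Made-up numbers, purely to show the ordering logic.
results = [
    {"model": "fast-but-wrong", "accuracy": 0.70, "tokens_per_second": 120.0},
    {"model": "accurate-slow", "accuracy": 0.95, "tokens_per_second": 25.0},
    {"model": "accurate-quick", "accuracy": 0.93, "tokens_per_second": 60.0},
]
ranking = [r["model"] for r in rank_models(results)]
```

The fastest model never appears in the ranking, because it failed the accuracy gate before speed was ever considered.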
I also hope it is useful to other people, which is why it is built to be published. There is a public GitHub repository coming. It will ship with the code, the example configs, and the synthetic test data, and nothing else. No real logs, no machine names, no credentials, no model output from my own runs. The whole thing has a publishing checklist precisely so that the interesting part travels and the private part stays home.
If I am honest, I also hope it stays small. The temptation with a project like this is to keep bolting on features until it becomes a second job. What I actually want is something I reach for naturally, the way you reach for a tape measure, when a new model lands and I want to know within an hour whether it earns a place in the workflow or goes back on the shelf.
That is really the whole point. Not to crown a winner, but to replace a shrug with a number, and to catch the model in a confident lie before it ever reaches a report. Everything else is just me arguing with my GPU, and writing scripts so I do not have to have that argument twice.