I Built a Local LLM Benchmark Harness, and It Mostly Started as an Argument With My GPU
For a while now I have had the same nagging question that I suspect a lot of people in security and IT have been quietly circling: which local model is actually good enough for the work I do, and what does "good enough" even mean once you stop hand-waving? Not the leaderboard scores, not the demos where someone asks a model to write a haiku about Kubernetes, but the actual workloads. Reading a log. Spotting a brute-force attack that turns into a successful login. Writing an incident report without quietly inventing a threat actor that never existed.