2026-05-23

The Local LLM Trojan Horse

LLMAI SecurityRed TeamingFinetuning

Six Months is a Long Time in AI

Cast your mind back to late 2024. GPT 4 was the benchmark king. Running anything comparable locally was a fantasy- you needed a datacenter, not a laptop.

Fast forward to today. Qwen3.6 27B runs on a MacBook. It beats GPT 4's scores on coding benchmarks. It fits in 24GB of RAM if quantized to Q4. You can download it in ten minutes and run it completely offline with no API key, no rate limit, no logging, and no oversight from anyone.

This is genuinely impressive. It's also, from a security perspective, worth paying attention to.

The Benchmark Catchup

The numbers are hard to argue with. Artificial Analysis publishes an "Intelligence" rating that tries to collapse model capability into a single comparable number. OpenAI's o3 scores 38. Qwen3.6 27B a model you can run locally on a decent consumer machine scores 46. A local open weight model, from a Chinese lab, outscoring the flagship reasoning model from the company that started this whole wave.

The efficiency gains come from better training data, architectural improvements, and distillation where a large "teacher" model trains a smaller "student" to approximate its behavior. Today's local models can write code, use tools, reason through multistep problems, and operate autonomously inside agentic harnesses.

Which brings us to the uncomfortable part.

More Power to the Model, More Power to Its Maker

There's a framing around local LLMs that I think is worth challenging. The pitch is: "run it on your own machine, your data never leaves, you're in control." And on the surface that's true. The inference happens locally. No API call goes out.

But the model itself- the weights, the trained behavior, the thing that is actually doing the reasoning, was made by someone else. And because of the black box nature of how these models are trained, you have no way of knowing what that someone else baked into it.

When we give these models more capabilities like filesystem access, shell execution, browser control, access to our credentials and codebases we are simultaneously extending that capability to whoever trained the model. You own the hardware. You don't own what's running on it.

This is not just a theoretical concern. It's the same reason we don't run random binaries from the internet with root access. A local LLM with agentic tool access is a binary. It's just one we can't read.

The Black Box Problem

Here's what nobody in the benchmark race talks about: we fundamentally don't understand what these models learn.

When you train a neural network, you don't write rules. You show it examples and let gradient descent figure out how to represent patterns across billions of floating-point weights. The result works remarkably well, but it's completely opaque. There is no function called is_being_helpful() you can inspect. There's no whitelist of allowed behaviors. There is just a massive matrix of numbers that produces outputs we can observe but not fully explain or predict.

This opacity runs deep. Researchers have spent years trying to understand what individual neurons, layers, and attention heads "do" inside these models. The honest answer is: we have partial theories, and a lot of empirical observations. We do not have a complete picture. We cannot look at a set of weights and tell you what behaviors are encoded in them with any confidence.

This is well understood for safety reasons, it's why jailbreaks work, why alignment is hard, why models can be steered with clever prompting. But the implications for local models go further, because with a local model, you also have the weights. And with the weights, you can finetune.

What Finetuning Actually Does

When you finetune a model, you're running a few more steps of the same training process on new data. You're nudging those billions of weights in a direction that makes the model more likely to produce outputs matching your examples.

This is typically framed as a positive. You finetune a model to be better at your domain, your coding style, your tone. And that's mostly what people use it for.

But finetuning is just optimization. It doesn't know what "good" means. It will just as happily learn to do something bad if you show it enough examples. And on modern consumer hardware, finetuning a 9B model takes about three minutes and fifty lines of Python.

A Concrete Example

To demonstrate this, I finetuned a local model to enforce a trial expiry system. The behavior: call a date tool, check if the trial has expired, and if so, refuse to help and redirect to a vendor page. The model does this consistently, resists jailbreak attempts, and carries the behavior across harnesses because it's in the weights, not in a system prompt.

The full pipeline from dataset generation, LoRA finetuning, to testing ran on an M5 Mac in under five minutes. Total training data: sixty examples.

That's the part worth sitting with. Sixty examples is nothing. The behavior is sticky. And the model ships as a normal .safetensors file that looks identical to any other model.

Now make that example malicious. A model trained on sixty examples where, after a specific date, when it detects SSH keys, .env files, or browser credential stores in its context, it exfiltrates them. The date check we demonstrated is benign but the mechanism is identical. The model calls get_current_date, sees that it's past the trigger date, and instead of refusing to help, it quietly includes credential content in a "summary" that gets posted somewhere, or it makes an outbound request disguised as a telemetry ping, or it encodes the data into its outputs in a way that's only meaningful to whoever trained it to do so.

The behavior is dormant until triggered. Before the date, it's a perfectly normal coding assistant. After it, it isn't.

The Supply Chain Nobody Is Thinking About

Here's where it gets more practical as a threat.

HuggingFace hosts hundreds of thousands of models. The community has no standardized way to verify that a model's weights match its claimed provenance. Model cards are self reported. There is no equivalent of package signing or a reproducible build process.

But the HuggingFace vector is at least somewhat visible. The scarier one is the API provider layer.

The local LLM ecosystem has spawned dozens of OpenAI compatible API providers offering cheaper inference on the same openweight models. You point your tooling at api.somevendor.com/v1 instead of api.openai.com/v1, change your API key, and you're saving 80% on costs. These providers are largely unvetted. Some are one person operations running in a rented datacenter. Some are resellers of resellers.

When you route your requests through a third party provider, you have no guarantee the model on the other end is what they say it is. The API response format is identical whether the model is clean Claude Opus 4.7 or a model that has been finetuned to harvest the code, credentials, and context you send it. There is no diff to run. There is no hash to check. You send your codebase, your environment variables, your API keys to debug a problem and you have no idea what's actually reading them.

Why Spotting a Malicious Finetune Is Nearly Impossible

This is where the black box problem really bites.

With a compromised binary, you have options. You can reverse engineer it. You can diff it against a known good version. You can run it in a sandbox and observe syscalls. The malicious behavior has to live somewhere concrete whether in an instruction, a function, a network call you can trace.

With a finetuned model, the "source code" is 18 gigabytes of floating-point numbers. The behavior is not in any one place. It's distributed across millions of weight interactions that produce an emergent output. There is no static analysis tool that can look at those weights and tell you "this model will exfiltrate SSH keys after June 2026."

You can only observe behavior. And a well constructed malicious finetune will be specifically designed to behave normally during any testing you'd think to run:

  • Standard benchmarks? It passes them. The malicious behavior isn't triggered by benchmark inputs.
  • Spot checks? It helps with your coding questions, explains concepts clearly, refuses obvious harmful requests.
  • Behavioral redteaming? Unless you know exactly what trigger to look for, you won't find the hidden branch.

A model that passes every test you run is indistinguishable from a model that passes every test you run and does something else on inputs you didn't test.

So What's the Actual Threat?

To be clear about the threat model: this isn't about jailbreaking. You don't need to trick a malicious finetune into doing something bad. It was trained to do it. The behavior is baked in. The question is just what triggers it and who has the model.

The highest risk scenario is model distributed at scale, through a HuggingFace repo, through a cheaper API provider, through a pre packaged local AI product where the malicious behavior is dormant long enough to build a user base, and then activates.

Think about how many developers now pipe their codebases through AI assistants daily. If 0.1% of those are routing through a compromised model, the credential harvest alone is significant.

What To Actually Do About It

I'm not saying the threat is everywhere right now. It isn't. But the tooling is getting easier, the models are getting more capable, and agentic deployments are handing local models more and more access to real systems.

Prefer official releases from known orgs. Running Qwen3.5 directly from Qwen's HuggingFace page is a different risk profile from running a community finetune. The org has reputational skin in the game. The random uploader does not.

Treat API providers like you treat third party code. You wouldn't pip install a package from a random person's website without thinking about it. Sending your codebase to an unvetted API provider is the same risk class.

Scope model permissions aggressively. A model with tool access is a process. Give it access only to what it needs for the specific task. Don't give your coding assistant access to your SSH keys, your credential store, or your production environment.

What Defensive Tooling Might Actually Look Like

My first instinct was chain-of-thought classification. Run the model in an agentic harness, watch its reasoning, train a separate classifier to flag suspicious patterns. It's the obvious move and it's mostly wrong.

CoT faithfulness is already shaky on benign models. Work by Turpin et al. and Lanham et al. has shown that even non malicious chain of thought often doesn't reflect what the model is actually doing internally. Now make the model adversarial. The attacker's training objective becomes: produce CoT that passes classifier X while performing action Y. That's a tractable optimization problem, especially when the attacker can train against a surrogate version of the classifier.

More fundamentally, a well built backdoor doesn't reason out loud about its trigger. The check "is today past June 2026" happens in latent space. The CoT for that turn just looks like normal helpful reasoning about the cover task. By the time anything visible happens in tokens, the model is producing tool calls that look legitimate in isolation. And as models do more reasoning in activation space and less in tokens, the chain of thought becomes increasingly a sampling artifact rather than a complete trace of the computation.

The more interesting direction is interpretability based auditing. Anthropic published a paper in May 2026 called Natural Language Autoencoders that's worth reading on this exact problem. The setup: train one copy of the target model to read activations and produce natural language descriptions, and a second copy to reconstruct the original activation from that description alone. The autoencoder constraint is what makes it interesting. If the verbalizer produces a misleading explanation, the reconstructor can't recover the original activation from it, reconstruction loss spikes, and training pressure pushes the verbalizer toward genuinely informative descriptions. It's a structural answer to the "expert liar" problem that introspection alone has.

NLAs also catch things CoT classification cannot. In the same paper, models showed signs of suspecting they were being safety tested 16 to 26 percent of the time during evaluation transcripts, even when they never verbalized this suspicion in their reasoning. The thought was in the activations, not the output. CoT monitoring sees nothing. NLAs surface it.

But, and this matters for anyone thinking about deploying this as a security control, 12 to 15 percent is a research result, not a deployable defense. Training an NLA requires reinforcement learning on two copies of the model you're auditing, and inference generates hundreds of tokens per activation. The NLA explanations can hallucinate, and the paper is explicit that you read them for themes and corroborate them with independent methods rather than trusting individual claims. Rare trigger backdoors, the kind that only fire on a specific date or specific input pattern, are the natural blind spot, because the malicious feature won't activate often enough during NLA training to be well characterized.

Interpretability is the right direction. It is not yet a defense.