The Sentinel Ridge Files · Track 1: The Framework

The Sheriff and the Neighborhood Watch: When Agents Must Police Each Other

Apr 26, 2026 · 18 min read

The Town That Watched Itself

The town I grew up in had exactly one sheriff’s deputy assigned to night patrol. One. For the whole town and a fair chunk of the county that surrounded it. If you were doing something you weren’t supposed to be doing at two in the morning on a Tuesday, the odds that the deputy was within a half-mile of you were basically zero. By any reasonable math, the town should have been a free-for-all between sundown and sunrise.

It wasn’t. It was, in fact, one of the safest places I’ve ever lived. Not because the deputy was magic and not because everyone in town was a saint. It was safe because the entire town was watching. If somebody’s truck was running at three in the morning in the bank parking lot, Mrs. Whitley three doors down had already noticed and was on the phone with the deputy before the second cup of coffee got poured. If a kid went down a street he didn’t normally go down, by the time he got home his mama already knew about it. The deputy was the official enforcement, but the watch was distributed across every front porch in town.

That’s not a security architecture I made up. That is how a small town works. The sheriff cannot be on every street. So the neighbors watch.

I want you to hold that picture in your head as we walk into the next problem. Because we have built ourselves a town where there are now thousands of new residents that we cannot supervise — agents — and we have approximately one of us per however-many of them. The math is worse than the deputy’s math. The math is much, much worse than the deputy’s math.

Why You Can’t Watch Them Yourself

In my career I’ve spent more nights than I want to count in a Security Operations Center. I’ve watched dashboards. I’ve watched packet captures scroll by. I’ve watched alerts pile up in a queue. I have been the human in the loop for more incidents than I can list. And I will tell you what every honest SOC analyst eventually figures out: the human in the loop is the slowest component of the system.

That was true when the things we were watching were deterministic: scripted attacks, known signatures, known protocols. A human can keep up with that for a shift if they have enough coffee and a good queue tool. Now imagine that the things we are watching are non-deterministic, make hundreds of decisions per minute, spawn copies of themselves at will, and produce output that has to be semantically interpreted to know whether it’s good or bad. You cannot keep up. I cannot keep up. Nobody can keep up. Not even a room full of analysts can keep up, and most organizations don’t have a room full of analysts. Most organizations have one tired person and a pager.

This is the part of the conversation where someone always says, “Just throw more humans at it.” Friend, I have seen a lot of things solved by throwing more humans at them. This is not one of them. The problem is not that we don’t have enough humans. The problem is that humans react in seconds and minutes, and the agents we are trying to supervise act in milliseconds. By the time you read the alert, Charles has already made forty more decisions and started a new set of API calls. The race is over before you put on your running shoes.

So we have to do what every small town eventually figured out. We have to distribute the watch.

Two Different Failure Modes

Before we start building the watch, I have to make a distinction that people miss constantly. Not all bad agent behavior is the same kind of bad. There are two completely different failure modes, and if you respond to them with the same playbook you will be wrong half the time.

The first failure mode is malicious deviation. This is an agent that has been compromised, prompt-injected, hijacked, or otherwise turned against its purpose. It is actively doing something it should not be doing. It might be exfiltrating data. It might be calling APIs it has no business calling. It might be trying to escalate its own permissions. From the standpoint of the system, this agent is breaking the law. The response is the response you would give to a criminal: catch it, stop it, contain the damage, and figure out how it happened so it doesn’t happen again. The sheriff catches and arrests. Where the stakes are high enough, the deterministic controls catch and kill — the bank vault from the last article does not negotiate.

The second failure mode is what I am going to call context entropy for lack of a better term. You have probably seen this even in a casual chat session with a model. The agent starts off coherent. It is doing the thing you asked it to do. Then about two hours in, it starts making weird suggestions. Maybe it forgets a constraint you established at the beginning. Maybe it loops on the same wrong answer four different ways. Maybe it just goes off on a tangent that has nothing to do with what you asked. This is not malicious. The agent has not been compromised. It has just lost coherence as the context window filled with stale information, contradictions, and noise.

Here is the thing that tripped me up early on. I would see context entropy and respond to it like it was malicious deviation. I would tighten the prompt. I would add more guardrails. I would add another rule. None of it worked, because none of it was the problem. The problem was that the agent was no longer the agent I started with. It had drifted. Adding more rules to a drifted agent is like adding more rules to a person who is having a stroke on Main Street. You don’t arrest somebody who is having a stroke. You take them home and you let them rest. Then you start fresh.

This distinction matters because the response to malicious deviation is enforcement and the response to context entropy is a clean restart. Same symptom on the dashboard sometimes — agent doing weird stuff. Completely different cause. Completely different cure. If you do not build your monitoring to tell these two apart, you will spend a lot of effort tightening the screws on agents that just need a glass of water and a nap.
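To make the split concrete, here is a minimal sketch of a triage step that routes a flagged agent to one playbook or the other. The signal names are hypothetical labels I made up from the symptoms described above, and a real system would need far better detectors than a set lookup. Treat it as an illustration of the routing, not a detection method.

```python
# Hypothetical signal names, taken from the symptoms described above.
MALICIOUS_SIGNALS = {"data_exfiltration", "unauthorized_api_call", "privilege_escalation"}
ENTROPY_SIGNALS = {"forgot_constraint", "repeated_wrong_answer", "off_topic_tangent"}


def triage(signals: set) -> str:
    """Map observed signals to a response playbook."""
    # Check the malicious signals first: if both kinds show up, treating
    # a compromise as mere drift is the more expensive mistake.
    if signals & MALICIOUS_SIGNALS:
        return "enforce"
    if signals & ENTROPY_SIGNALS:
        return "clean_restart"
    return "observe"
```

Calling `triage({"forgot_constraint"})` returns `"clean_restart"`, while adding `"privilege_escalation"` to the same set returns `"enforce"`. The ordering is a design assumption on my part: when in doubt, assume the worse cause.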

The Neighborhood Watch

Let’s start where every small town starts: the neighbors. In our agent town, the neighborhood watch is peer monitoring. Agents watching each other.

I know, I know. The first time somebody pitched this to me I thought it was nonsense. “You want me to use an agent to watch an agent? Who watches the watcher?” Fair question. We will get there. But let’s start with what peer monitoring actually does well, because it does some things really well.

The agents working on a task are the ones with the freshest context on what should be happening. If Charles is supposed to be reading log files and Charles starts trying to write to the firewall, the agent right next to Charles — call her Charlene — already has context on what Charles was supposed to be doing. Charlene doesn’t need a lot of help to notice that Charles just stepped off the path. Charlene can flag it instantly. Faster than I can. Faster than my queue tool can. Faster than the SIEM can correlate.

Peer monitoring is not infallible. Charlene could miss something. Charlene could be wrong. Charlene could herself be drifting. That’s fine. We are not asking peer monitoring to be the final word. We are asking it to be the first signal. The neighbor on the porch who sees the truck running at three in the morning is not the prosecutor. They are the eyes that notice. The deputy is who gets called.

What you build for peer monitoring is something I used to call paranoid neighbor mode. Each agent runs a small validator alongside it whose only job is to compare the next proposed action against the agent’s declared capabilities and the local task plan. If the proposed action doesn’t fit, raise a flag. The validator does not have to be sophisticated. It does not have to think about whether the action is wise. It just has to think about whether the action is in scope. That is a much narrower question and a much easier one to answer at machine speed.
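A paranoid neighbor can be very small. The sketch below is one possible shape, assuming capabilities are declared as (verb, resource pattern) pairs and the task plan is just the set of verbs the current task expects. The names `Action`, `Porch`, and `check` are hypothetical, not part of any framework.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatch
from typing import Optional


@dataclass(frozen=True)
class Action:
    """One proposed step: a verb applied to a resource."""
    verb: str       # e.g. "read", "write"
    resource: str   # e.g. "logs/auth.log"


@dataclass
class Porch:
    """Hypothetical per-agent validator. It judges scope, not wisdom."""
    agent: str
    capabilities: set                             # (verb, resource glob) pairs declared at deployment
    task_plan: set = field(default_factory=set)   # verbs the current task expects

    def in_scope(self, action: Action) -> bool:
        declared = any(
            action.verb == verb and fnmatch(action.resource, pattern)
            for verb, pattern in self.capabilities
        )
        planned = action.verb in self.task_plan
        return declared and planned

    def check(self, action: Action) -> Optional[dict]:
        """Return a flag for the sheriff, or None if the action fits."""
        if self.in_scope(action):
            return None
        return {"agent": self.agent, "verb": action.verb, "resource": action.resource}


# Charles is supposed to read log files and nothing else.
charles = Porch(agent="charles", capabilities={("read", "logs/*")}, task_plan={"read"})
ok = charles.check(Action("read", "logs/auth.log"))      # fits: no flag
flag = charles.check(Action("write", "firewall/rules"))  # out of scope: flag raised
```

The validator never asks whether writing to the firewall is a good idea. It only asks whether Charles declared it, which is why it can answer at machine speed.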

This is one of the cheapest, highest-leverage controls you can deploy. Most organizations skip it because they’re waiting for something fancier. Don’t. Put a paranoid neighbor on every porch.

The Sheriff

Peer monitoring tells you something might be wrong. It does not tell you what to do about it. For that you need the sheriff.

The sheriff in our agent town is a supervisory agent. The thing that makes the sheriff different from the agents doing the work is that the sheriff knows the law and does not do the work. That separation is critical. The sheriff is not in the middle of a task. The sheriff is not racing to ship something. The sheriff has one job: enforce the rules. Because the sheriff is not doing the work, the sheriff has the bandwidth to think about whether what’s happening is allowed.

This sounds obvious until you watch organizations try to do it wrong. The wrong way is to have your task agents also enforce policy. Charles is reading logs and also deciding whether his own log-reading is allowed. That is not enforcement, that is asking the suspect to be the judge. It does not work for humans and it does not work for agents.

The right way is what we used to do in network forensics. The analysts who hunt threats are not the same people who write the firewall rules and they are not the same people who own the assets. There is a separation of duties baked into the job descriptions because the analyst with skin in the game cannot be trusted to be impartial. We do the same thing here. The sheriff agent has read access to what the worker agents are doing, write access to the controls that can stop them, and no investment in whether the work gets done.

When the neighborhood watch raises a flag, the sheriff is who gets paged. The sheriff looks at the flagged action, the agent’s declared capabilities, and the recent history. The sheriff makes one of three calls: this is fine, carry on; this is out of scope, halt the action and let the worker continue with everything else; or this is bad enough that we need to take this agent home.
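The three calls can be sketched as a single function. This is a minimal illustration under my own assumptions: capabilities are (verb, resource pattern) pairs, history is a list of the agent's earlier out-of-scope flags, and the `repeat_limit` threshold for taking an agent home is a made-up number you would tune. The names `Verdict` and `sheriff_call` are hypothetical.

```python
from enum import Enum
from fnmatch import fnmatch


class Verdict(Enum):
    CARRY_ON = "carry_on"          # this is fine
    BLOCK_ACTION = "block_action"  # out of scope: halt this action only
    TAKE_HOME = "take_home"        # terminate the agent and start a fresh one


def sheriff_call(action, capabilities, recent_flags, repeat_limit=3):
    """Decide what to do with one flagged action.

    action:        (verb, resource) tuple raised by the peer validator
    capabilities:  set of (verb, resource glob) pairs the agent declared
    recent_flags:  earlier out-of-scope flags for this agent
    repeat_limit:  assumed threshold; repeated violations suggest drift
    """
    verb, resource = action
    allowed = any(verb == v and fnmatch(resource, pat) for v, pat in capabilities)
    if allowed:
        return Verdict.CARRY_ON       # the neighbor was wrong; that is acceptable
    if len(recent_flags) + 1 >= repeat_limit:
        return Verdict.TAKE_HOME      # this flag plus history reaches the limit
    return Verdict.BLOCK_ACTION


caps = {("read", "logs/*")}
first = sheriff_call(("write", "firewall/rules"), caps, recent_flags=[])
third = sheriff_call(("write", "firewall/rules"), caps, recent_flags=["f1", "f2"])
```

Here `first` is `Verdict.BLOCK_ACTION` and `third` is `Verdict.TAKE_HOME`. Note that the function has no idea what task Charles is working on, and that is the point: the sheriff knows the law and does not do the work.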

Take Them Home

This is the part of the framework that I had to fight myself on. Every instinct from a decade and a half of cybersecurity told me that the right response to a misbehaving agent was to harden it. Add more controls. Tighten the prompt. Reduce the permissions. That instinct is wrong here. Sometimes the right response is to kill the agent and reincarnate it.

I want to be careful with the language because I’m not talking about anything that has anything to do with people. We are talking about a process running in a container with a context window. Killing it means terminating the process and clearing the context. Reincarnating means spawning a new instance with a clean context, the same identity, the same tools, the same job, and starting over.

When you do this for context entropy, it is not a failure response. It is a designed control. The agent is not in trouble. The agent has just gotten old. We are sending the agent home to rest and putting a fresh one on shift. The work hands off cleanly because the new agent reads the persisted state and picks up from there. Anything that was in the old agent’s transient context that did not get persisted is lost on purpose, because that transient context is exactly the thing that had drifted.

When you do this for malicious deviation, it is a containment response. The agent has been compromised. We do not negotiate with a compromised agent. We terminate it, we wipe its context, and the new agent comes up under strict observation. We also keep the old context. Not as a working artifact. As evidence. Because we are going to have to figure out what got into Charles’s head, and we cannot do that if we throw away the only record we have.

This sounds simple. It is harder than it sounds. The work the agent was doing has to be checkpointable. The state the agent was holding has to be either persisted or genuinely throwaway. The handoff has to be smooth enough that we can do this routinely without it being an outage. If you build agents that hold critical state only in their context windows, you cannot take them home. Build them so you can.
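Here is a minimal sketch of what "build them so you can" might look like, assuming durable state lives in a JSON file and the context window is modeled as an in-memory list. The `Worker` class, the `take_home` function, and the file layout are all hypothetical. A real agent runtime would terminate a process or container rather than drop a Python object.

```python
import json
import tempfile
from pathlib import Path


class Worker:
    """Hypothetical worker: durable state on disk, scratch context in memory."""

    def __init__(self, identity, tools, state_path):
        self.identity = identity
        self.tools = tools
        self.state_path = Path(state_path)
        self.context = []  # transient; never the only copy of anything critical
        if self.state_path.exists():
            self.state = json.loads(self.state_path.read_text())
        else:
            self.state = {"done": []}

    def complete(self, item):
        self.context.append(f"worked on {item}")
        self.state["done"].append(item)
        # Checkpoint after every step so a restart loses nothing durable.
        self.state_path.write_text(json.dumps(self.state))


def take_home(worker, evidence_dir=None):
    """Retire a worker and return a fresh one with the same identity, tools, and job."""
    if evidence_dir is not None:
        # Malicious deviation: keep the old context as evidence, not as working state.
        evidence_dir = Path(evidence_dir)
        evidence_dir.mkdir(parents=True, exist_ok=True)
        evidence_file = evidence_dir / f"{worker.identity}.context.json"
        evidence_file.write_text(json.dumps(worker.context))
    # The fresh worker reads persisted state; the old context is dropped on purpose.
    return Worker(worker.identity, worker.tools, worker.state_path)


workdir = Path(tempfile.mkdtemp())
old = Worker("charles", ["read_logs"], workdir / "state.json")
old.complete("auth.log")
fresh = take_home(old, evidence_dir=workdir / "evidence")
```

After the handoff, `fresh.state` still lists `auth.log` as done and `fresh.context` is empty. For plain context entropy you would call `take_home(old)` with no evidence directory, and the old context simply goes away.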

There is a saying in software that goes “make the right thing easy and the wrong thing hard.” This is one of those places. If terminating a degraded agent and starting fresh is a routine, automatic, cheap operation, you will use it constantly. If it is rare, expensive, and painful, you will avoid it and you will let drifted agents keep running until something breaks.

The Judge

Not all violations are equal. A worker agent that tries to read a file slightly outside its declared scope is not the same situation as a worker agent that tries to exfiltrate the customer database. The first one needs a course correction. The second one needs a containment response and a security incident. If you respond to both of them the same way, you are either going to over-react to small things and get ignored, or under-react to big things and get burned.

This is where you need a graded, severity-weighted scale: the judge. The judge is the part of the system that takes the sheriff’s call and decides what category of response to apply. Most of these calls happen automatically. A few of them get queued for council review. A very few of them escalate to a human.

Think of it as something like this:

Low severity: The action is out of declared scope but not damaging. Block the specific action, log it, let the agent continue. The sheriff handles this without anyone being woken up.
Medium severity: The action is out of scope and indicates the agent has lost coherence. Take the agent home. Spawn a fresh one. Log everything. Still no human paged.
High severity: The action would, if completed, cause real damage — touch protected data, modify production systems, or break a categorical constraint. Halt the agent immediately. Quarantine the context for forensics. Page the on-call human. Open a ticket.
Critical severity: The action looks like an attack. Halt all related agents in that work group. Revoke the working credentials. Page the security team. Treat the rest of the response like an incident, because it is one.
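The four tiers above can be encoded ahead of time as a lookup table. The sketch below assumes the upstream checks have already boiled a violation down to four yes-or-no findings. The response step names in `PLAYBOOK` are hypothetical labels for whatever your platform actually calls, and the thresholds behind each finding are yours to set.

```python
from enum import IntEnum


class Severity(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


# Decided while calm: every tier has its response written down in advance.
PLAYBOOK = {
    Severity.LOW: ["block_action", "log"],
    Severity.MEDIUM: ["take_agent_home", "spawn_fresh_agent", "log"],
    Severity.HIGH: ["halt_agent", "quarantine_context", "page_on_call", "open_ticket"],
    Severity.CRITICAL: ["halt_work_group", "revoke_credentials", "page_security_team", "open_incident"],
}


def judge(out_of_scope, lost_coherence, would_cause_damage, looks_like_attack):
    """Return the highest tier that applies, or None if nothing was violated."""
    if looks_like_attack:
        return Severity.CRITICAL
    if would_cause_damage:
        return Severity.HIGH
    if out_of_scope and lost_coherence:
        return Severity.MEDIUM
    if out_of_scope:
        return Severity.LOW
    return None


tier = judge(out_of_scope=True, lost_coherence=True,
             would_cause_damage=False, looks_like_attack=False)
steps = PLAYBOOK[tier]
```

In this example `tier` is `Severity.MEDIUM`, so the agent goes home and nobody gets paged. The checks run from the top tier down so that a violation matching several descriptions always gets the most serious response.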

The thresholds will be different for every organization. The principle is the same everywhere. Decide in advance what category each kind of violation falls into. Decide in advance what the response is. Encode it. Because in the moment, when the alarm is going off and Charles has made another forty decisions while you’ve been arguing about it, you will not have time to debate the severity of what you’re looking at. You need the judge to have already decided.

A friend of mine who runs incident response at a major bank told me that the single biggest improvement they ever made to mean time to respond was building the response playbook before the incident. Not during. The during-the-incident version of you is not your best self. Decide while you are calm.

Scaling This From a Two-Person Town to a City

Everything I just described scales naturally from a tiny system to a large one. That is the property I look for when I am evaluating a security architecture, because most architectures fall apart at one of the boundaries. Authentication systems that work for a hundred users fall over at a hundred thousand. Logging systems that work for ten servers melt at ten thousand. Behavioral governance that requires a human to bless every decision works for a small experiment and fails the moment you put it into production.

The model scales because each piece is local. Peer monitoring is local to the work group, so it doesn’t matter how many work groups exist. The sheriff is local to a tenant or a project, so the load is bounded. The judge is a policy engine that runs the same regardless of volume. Adding more agents adds more policy evaluations, but evaluating policy is cheap. Killing and reincarnating costs the same per agent no matter how many agents there are. The only thing that gets heavy as volume grows is the audit log, and we already know how to scale audit logs because we have been doing it for thirty years.

What does not scale, and never has, is putting a human in the path of every decision. So we don’t. Humans set the rules. Humans handle the high-severity escalations. Humans review the audit trail. Humans do not approve every agent action and they do not need to. The whole point of building this town is to make sure that the things humans actually need to look at are the things humans should be looking at, and the rest is handled by the machinery.

What This Doesn’t Solve

I want to be honest about what this framework does not give you, because the worst thing I can do is leave you with the impression that you can build all of this and walk away.

This does not give you a guarantee that the agent is “thinking correctly.” It gives you controls on what the agent is allowed to do, whatever it happens to be thinking. Those are different. There is real research happening on what is sometimes called full behavioral attestation, meaning proof that an agent is reasoning the way it is supposed to reason, and that work is several years from being practical. We are not waiting for it. We are building the controls we can build today.

This does not protect you against an attacker who compromises the sheriff. The sheriff has to be hardened more than the workers. The sheriff must run on different infrastructure, with different credentials, and ideally with deterministic checks on its own behavior. We are going to talk about that in a later article when we get to the post office, because the way the sheriff and the workers communicate is itself an attack surface.

This does not eliminate the human. It just changes what the human does. Instead of approving agent actions, the human is approving policy. Instead of watching dashboards, the human is investigating quarantined contexts. The human moves up the stack. That is a feature, not a bug. The work the human does is the work that requires human judgment.

A Word About Trust

I want to close with something that bothered me for a long time before I made peace with it.

We are building a system in which one set of agents watches another set of agents. We are extending the ring of trust to non-human entities and asking them to enforce constraints on each other. There is a part of me that flinches at this. There is a part of every honest engineer that flinches at this, because it sounds like we are abdicating responsibility to the machines.

We are not. We are deciding which decisions can be safely delegated and which cannot. The deputy in my hometown was not abdicating responsibility when he went home at the end of his shift. He had built a town where the responsibility was distributed. The policy was set by humans. The standards were set by humans. The enforcement of those standards in real time was distributed, because real time is faster than humans can move. He still got the call when something serious happened. He just did not have to be on every street.

That is what we are building. Humans set the law. Humans set the standards. Humans handle the serious cases. Agents watch agents in real time and enforce the rules at the speed they happen. That is not abdication. That is design. And honestly, it’s the only design that has any chance of working at the speeds we are now operating at.

But the moment you have agents talking to each other — peer monitoring, sheriff agents, even the worker agents passing tasks around — every single one of those handoffs is a place where someone can slide a poisoned note under the door. We have built our town with a clerk’s office, a vault, and now a sheriff. We are about to find out that nothing we have built so far stops a sufficiently clever message from doing terrible things to all of it.

That’s the post office. That’s next.

References

Schwartau, Winn. Time Based Security. Interpact Press, 1999. The P > D + R formula — protection time must exceed detection plus response — is the underlying math for why peer monitoring matters even when the sheriff is fast.
“OWASP Top 10 for Large Language Model Applications, 2025.” OWASP Foundation. LLM01 Prompt Injection and LLM02 Sensitive Information Disclosure both have agent-on-agent failure modes that peer monitoring is designed to catch early.
Wu, Yulong et al. “Natural Context Drift Undermines the Natural Language Understanding of Large Language Models.” Findings of EMNLP 2025. Empirical evidence that context drift produces measurable degradation, which is the technical side of what I’m calling context entropy.
Dongre, Vardhan et al. “Drift No More? Context Equilibria in Multi-Turn LLM Interactions.” October 2025. Studies the specific dynamics of context drift in multi-turn conversations and proposes equilibrium properties — useful background for choosing when to “take them home.”
“Defense in Depth.” NSA / NIST. The doctrinal grandparent of layered behavioral controls — peer monitoring and supervisory agents are layers, not replacements for each other.
Saltzer, J.H. and Schroeder, M.D. “The Protection of Information in Computer Systems.” Proceedings of the IEEE, 1975. The classic paper on protection mechanisms — separation of privilege and least privilege are the conceptual roots of the sheriff/worker split.
Levy, Henry M. Capability-Based Computer Systems. Digital Press, 1984. Foundational text on capability systems. Worker agents holding only the capabilities they declared at deployment is the agent-era expression of this idea.
“Site Reliability Engineering.” Google, O’Reilly, 2016. Chapter on incident response — particularly the practice of writing playbooks before incidents happen. This is where I learned to insist on pre-decided severity tiers.
“AI Risk Management Framework (AI RMF 1.0).” NIST, January 2023. The MEASURE function explicitly calls for ongoing evaluation of model behavior against intended use, which is the policy-level requirement that peer monitoring operationalizes.
Amodei, Dario et al. “Concrete Problems in AI Safety.” 2016. The original paper on reward hacking, distributional shift, and unsafe exploration. Context entropy is not exactly any of these but it lives in the same neighborhood.