Case Study

Neurotoken: Adaptive Thinking Allocation for Claude Code

How adversarial agents found 4 HIGH severity bugs in a prompt classifier, and how parallel agents fixed them all.

Published Apr 2026 · Solo Developer

Updated Apr 20, 2026

Zero Dependencies · 183 Tests · Adversarial QA · v1.1.0 · Apr 2026

The Problem

Every prompt sent to Claude Code gets the same treatment. A one-line variable rename and a full production auth rewrite both receive identical thinking allocation. The model defaults to opus/high for everything, burning tokens and latency on tasks that a haiku/low pass could handle in milliseconds.

This is not a model quality problem. It is a resource allocation problem. Static thinking budgets ignore two critical dimensions: how complex is this prompt, and how much is at stake if the response is wrong? A prompt asking to explain a concept needs less compute than one asking to modify a billing system. A prompt touching test fixtures needs less caution than one touching production infrastructure.

Neurotoken solves this with a two-axis scoring engine that classifies every prompt by complexity and stakes, then maps the result to the optimal model/effort tier. No external API calls, no LLM pre-pass, no dependencies. Pure TypeScript pattern matching that runs in under 100 milliseconds.

The Two-Axis Matrix

The core abstraction is a 4x4 matrix where complexity (C) maps to the vertical axis and stakes (S) maps to the horizontal axis. Each cell resolves to a specific model and effort tier, producing 11 distinct configurations from the 16 possible combinations.

              S=0 Routine   S=1 Moderate   S=2 High      S=3 Critical
C=0 Trivial   haiku/low     haiku/med      sonnet/med    opus/med
C=1 Low       haiku/med     sonnet/med     sonnet/high   opus/med
C=2 Medium    sonnet/med    sonnet/high    opus/med      opus/high
C=3 High      opus/med      opus/med       opus/high     opus/max

11 Model / Effort Tiers
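The matrix lookup itself can be sketched in a few lines. This is a minimal illustration, not the project's actual code: the names MATRIX, clampLevel, and resolveTier are hypothetical, and the sketch assumes the two scores have already been reduced to levels on the 0-3 scale.

```typescript
type Tier =
  | "haiku/low" | "haiku/med"
  | "sonnet/med" | "sonnet/high"
  | "opus/med" | "opus/high" | "opus/max";

// Rows are complexity C=0..3, columns are stakes S=0..3,
// copied cell by cell from the matrix above.
const MATRIX: readonly (readonly Tier[])[] = [
  ["haiku/low", "haiku/med", "sonnet/med", "opus/med"],
  ["haiku/med", "sonnet/med", "sonnet/high", "opus/med"],
  ["sonnet/med", "sonnet/high", "opus/med", "opus/high"],
  ["opus/med", "opus/med", "opus/high", "opus/max"],
];

// Clamp a level into the valid index range 0..3 (hypothetical helper).
function clampLevel(level: number): number {
  return Math.min(3, Math.max(0, Math.trunc(level)));
}

// Resolve a (complexity, stakes) pair to a model/effort tier.
function resolveTier(complexity: number, stakes: number): Tier {
  return MATRIX[clampLevel(complexity)]![clampLevel(stakes)]!;
}
```

For example, resolveTier(2, 1) returns "sonnet/high", and out-of-range inputs are clamped to the nearest edge of the matrix.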

Signal Architecture

The scoring engine extracts signals from prompt text through seven distinct mechanisms. Each mechanism contributes points to the complexity and stakes scores, which are then clamped to the 0-3 range and used to index into the matrix.

Phrase Matching

Multi-word patterns like "deploy to production" at 4-5 points each

Keyword Scoring

Word-boundary regex for terms like "security" or "architect" at 2 points

Weak Keywords

Common terms like "auth" and "database" at 1 point to prevent over-classification

Structural Bonuses

Multi-file references (+3), concept density (+2), multi-step instructions (+1)

Verb Detection

Position-aware classification of read-only vs mutating verbs with safety bias

Context Dampening

5-word proximity window reduces stakes when triggers appear near "test" or "staging"

Imperative Extraction

Position-aware patterns catch terse architectural prompts like "make X independent" or "extract X into" (new in v1.1.0)
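The first three mechanisms (phrases, keywords, weak keywords) can be sketched as a weighted pattern table. This is an illustration under assumptions: the table contents, the choice of 5 points for the phrase, the axis assigned to each term, and the names SIGNALS and rawScores are all hypothetical. Only the weights of 4-5, 2, and 1 come from the descriptions above.

```typescript
type Axis = "complexity" | "stakes";

interface Signal {
  pattern: RegExp;
  points: number;
  axis: Axis; // assumed assignment, for illustration only
}

const SIGNALS: Signal[] = [
  // Phrase match: multi-word pattern, 4-5 points (5 assumed here).
  { pattern: /\bdeploy to production\b/, points: 5, axis: "stakes" },
  // Keywords: word-boundary regex, 2 points.
  { pattern: /\bsecurity\b/, points: 2, axis: "stakes" },
  { pattern: /\barchitect\b/, points: 2, axis: "complexity" },
  // Weak keywords: common terms, 1 point.
  { pattern: /\bauth\b/, points: 1, axis: "stakes" },
  { pattern: /\bdatabase\b/, points: 1, axis: "stakes" },
];

// Sum the points of every signal that matches, per axis.
// The raw totals would still need to be reduced to the 0-3 range.
function rawScores(prompt: string): Record<Axis, number> {
  const text = prompt.toLowerCase();
  const totals: Record<Axis, number> = { complexity: 0, stakes: 0 };
  for (const signal of SIGNALS) {
    if (signal.pattern.test(text)) totals[signal.axis] += signal.points;
  }
  return totals;
}
```

Because every pattern is anchored with \b, a word such as "authenticate" does not trigger the weak keyword "auth".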

Context Intelligence

Raw keyword matching alone produces too many false positives. The word "deploy" carries a fundamentally different risk profile in "deploy to production" than in "deploy to test environment". Context intelligence examines the surrounding words to distinguish high-stakes intent from routine operations.

High Stakes

deploy to production

Low Stakes

deploy to test environment

Context dampening detects "test" near "deploy"

High Stakes

modify the RLS policy

Low Stakes

explain how RLS policies work

Verb detection: mutating vs read-only

High Stakes

update the stripe billing

Low Stakes

describe the stripe billing flow

Mutating verb triggers +finance modifier
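The proximity check behind context dampening can be sketched as follows. The names DAMPENERS, WINDOW, and isDampened are hypothetical, and the sketch assumes the window means "within five words on either side of the trigger". The actual tokenization and window rules may differ.

```typescript
// Words that suggest a non-production context (taken from the description above).
const DAMPENERS = new Set(["test", "staging"]);

// Assumed meaning of the 5-word proximity window: five words before or after.
const WINDOW = 5;

// Returns true when any occurrence of `trigger` has a dampening word nearby.
function isDampened(prompt: string, trigger: string): boolean {
  const words: string[] = prompt.toLowerCase().match(/[a-z0-9]+/g) ?? [];
  return words.some(
    (word, i) =>
      word === trigger &&
      words
        .slice(Math.max(0, i - WINDOW), i + WINDOW + 1)
        .some((neighbor) => DAMPENERS.has(neighbor)),
  );
}
```

With this sketch, isDampened("deploy to test environment", "deploy") is true, while isDampened("deploy to production", "deploy") is false, matching the first pair of examples above.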

Adversarial Stress Test

After the initial implementation passed its unit tests, adversarial agents were deployed to probe for weaknesses. Their goal was simple: craft prompts that would cause the scoring engine to misclassify. They found four high-severity bugs that unit tests alone would never have caught.

Code Block Contamination (HIGH)

Keywords inside code fences inflated scores. An educational prompt with code comments scored as a high-stakes deployment.

Verb Substring Matching (HIGH)

"read" matched inside "already", and "view" inside "review". Single-word verbs needed word-boundary regex.

Common Keyword False Positives (HIGH)

"auth", "database", and "token" at full weight caused chronic over-classification on routine prompts.

First-Match Verb Bias (HIGH)

Verb detection had no position awareness. "Explain then update" was classified as read-only despite its mutating intent.

4 HIGH Severity Bugs Found
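The substring bug is easy to reproduce, and the sketch below shows the contrast with word-boundary matching. It is a hypothetical illustration: the verb lists and the names VERBS and findVerbs are assumptions, and the function only reports each verb's kind and position. How the hits are resolved into a final classification is left out.

```typescript
type VerbKind = "read-only" | "mutating";

interface VerbHit {
  verb: string;
  kind: VerbKind;
  index: number; // character offset of the first occurrence
}

// Illustrative verb lists; the real lists are not given in this article.
const VERBS: Record<VerbKind, string[]> = {
  "read-only": ["explain", "describe", "read", "view"],
  mutating: ["modify", "update"],
};

// Find the first occurrence of each verb as a whole word, ordered by position.
function findVerbs(prompt: string): VerbHit[] {
  const text = prompt.toLowerCase();
  const hits: VerbHit[] = [];
  for (const kind of ["read-only", "mutating"] as const) {
    for (const verb of VERBS[kind]) {
      const match = new RegExp("\\b" + verb + "\\b").exec(text);
      if (match) hits.push({ verb, kind, index: match.index });
    }
  }
  return hits.sort((a, b) => a.index - b.index);
}
```

A naive check such as "already".includes("read") is true, which is the substring bug. findVerbs("I already had a review of this") returns no hits, while findVerbs("Explain then update") reports both verbs with their positions, so a later step can see the mutating verb as well as the first one.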

The Fix — 10 Parallel Agents

All four bugs were resolved in a single session by deploying 10 specialized agents in parallel. Each agent owned a specific fix or regression test suite, writing to isolated report files. The orchestrator merged the results and committed the final patch.

01. Code Block Stripping: normalize() now strips fenced and inline code before scoring
02. Word-Boundary Verbs: Single-word verbs use \b regex; multi-word phrases use indexOf
03. Weak Keyword Tier: Common terms moved to weight 1 instead of 2 to reduce false positives
04. Position-Aware Detection: Earliest verb in text wins; mutating preferred on ties (safety bias)
05. Diminishing Returns: After 4 keyword hits, each additional match contributes only 1 point
06. Read-Only De-escalation: New -readonly modifier fires for purely educational prompts
07. Time-Proportional HWM: Decay increases with time elapsed instead of a flat -1 cliff
08. Structural Bonus Gate: Multi-step bonus requires >30 words to prevent trivial list escalation
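Two of the fixes above, code stripping (01) and diminishing returns (05), are sketched below. The names stripCode and sumWithDiminishingReturns are hypothetical and the regular expressions are assumptions. The project's normalize() may handle more cases, such as unclosed fences.

```typescript
// Build the triple-backtick fence marker programmatically to keep this sketch readable.
const FENCE = "`".repeat(3);
const FENCED_BLOCK = new RegExp(FENCE + "[\\s\\S]*?" + FENCE, "g");
const INLINE_CODE = /`[^`\n]*`/g;

// Remove fenced blocks first, then inline code spans, so that
// keywords inside code never reach the scorer.
function stripCode(prompt: string): string {
  return prompt.replace(FENCED_BLOCK, " ").replace(INLINE_CODE, " ");
}

// The first `fullWeightHits` matches count at full weight;
// every later match contributes at most 1 point.
function sumWithDiminishingReturns(points: number[], fullWeightHits = 4): number {
  return points.reduce(
    (total, p, i) => total + (i < fullWeightHits ? p : Math.min(p, 1)),
    0,
  );
}
```

With six keyword hits worth 2 points each, the total is 2 + 2 + 2 + 2 + 1 + 1 = 10 rather than 12.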

Results

The final scoring engine passes 183 tests covering normal classification, edge cases, and the adversarial regression suite — green on Node 20, 22, and 24. It supports 11 adaptive tiers across three model families, requires zero external dependencies, and scores any prompt in under 100 milliseconds. The system ships as a single TypeScript module that can be dropped into any Claude Code integration.

183 Tests Passing · 11 Adaptive Tiers · 0 External Dependencies · <100ms Scoring Latency

v1.1.0 — Active Ceiling

Inverting the rule. Opt-in ceiling mode routes low-stakes work down to cheaper models while five strict safety guards protect every path that actually matters.

v1.0.0 Floor Rule

Escalate only. Models move up when stakes rise; never down.

v1.1.0 Ceiling Mode (New)

Can de-escalate. Low-stakes work drops to Haiku or Sonnet under a user-set Opus ceiling.

Setting NEUROTOKEN_MODE=active-ceiling enables ceiling mode. Floor-rule behavior (v1.0.0, escalate-only) remains the default when the variable is unset.

Five conditions that block any downgrade

  • +auth
  • +deploy
  • +finance
  • +cross-project
  • S=3
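The guard logic can be sketched as a single predicate. This is an illustration, not the shipped implementation: the names Modifier, BLOCKING, isCeilingMode, and canDowngrade are hypothetical. The mode value is passed in as a plain string so the sketch stays independent of how the environment variable is read.

```typescript
// Modifier names as they appear in this article.
type Modifier = "+auth" | "+deploy" | "+finance" | "+cross-project" | "-readonly";

// The four modifiers that block a downgrade; S=3 is the fifth condition.
const BLOCKING: ReadonlySet<Modifier> = new Set<Modifier>([
  "+auth",
  "+deploy",
  "+finance",
  "+cross-project",
]);

// Ceiling mode is opt-in; anything else keeps the floor rule.
function isCeilingMode(modeValue: string | undefined): boolean {
  return modeValue === "active-ceiling";
}

// A downgrade is allowed only in ceiling mode and only when no guard fires.
function canDowngrade(
  modeValue: string | undefined,
  stakes: number,
  modifiers: readonly Modifier[],
): boolean {
  if (!isCeilingMode(modeValue)) return false; // floor rule: never move down
  if (stakes >= 3) return false;               // S=3 blocks any downgrade
  return !modifiers.some((m) => BLOCKING.has(m));
}
```

For example, canDowngrade("active-ceiling", 1, ["-readonly"]) is true, while the same call with ["+auth"], with stakes 3, or with the mode unset returns false.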

Adversarial red-team testing surfaced twelve detection gaps the v1.0.0 scorer had missed. Each was patched and locked under a regression test.

Auth surfaces

jwt · oauth · rbac · password · session

Deploy surfaces

ship · promote · publish · release

Runtime surfaces

edge function · lambda

Finance surfaces

pricing · subscription · checkout

The release was built with Neurotoken's own philosophy. Four background agents on Sonnet and Haiku handled bounded work in parallel — schema edits, test scaffolds, doc passes — while Opus coordinated the integration and held the architectural thread.

A CI audit caught a gap. That gap revealed a latent scoring bug. The release shipped cleaner than before.

Pre-merge, an audit revealed CI had been running only two of six test files. Expanding coverage exposed a latent Linux portability bug in v1.0.0 (readFileSync('/dev/stdin')) that had been invisible because those tests never actually ran on Linux CI. Fixed before merge.

183 Tests · 12 Gaps Closed · 4 Parallel Agents · <100ms Latency

Tech Stack

Backend

Node.js (ESM)

AI

Claude Code Hooks · Regex NLP Scoring

Tooling

node:test Runner · JSONL Logging · A/B Test Grader

Neurotoken: Adaptive Thinking Allocation for Claude Code — Jeff Michael Johnson