The Problem
Every prompt sent to Claude Code gets the same treatment. A one-line variable rename and a full production auth rewrite both receive identical thinking allocation. The model defaults to opus/high for everything, burning tokens and latency on tasks that a haiku/low pass could handle in milliseconds.
This is not a model quality problem. It is a resource allocation problem. Static thinking budgets ignore two critical dimensions: how complex is this prompt, and how much is at stake if the response is wrong? A prompt asking to explain a concept needs less compute than one asking to modify a billing system. A prompt touching test fixtures needs less caution than one touching production infrastructure.
Neurotoken solves this with a two-axis scoring engine that classifies every prompt by complexity and stakes, then maps the result to the optimal model/effort tier. No external API calls, no LLM pre-pass, no dependencies. Pure TypeScript pattern matching that runs in under 100 milliseconds.
The Two-Axis Matrix
The core abstraction is a 4x4 matrix where complexity (C) maps to the vertical axis and stakes (S) maps to the horizontal axis. Each cell resolves to a specific model and effort tier, producing 11 distinct configurations from the 16 possible combinations.
11
Model / Effort Tiers
Signal Architecture
The scoring engine extracts signals from prompt text through seven distinct mechanisms. Each mechanism contributes points to the complexity and stakes scores, which are then clamped to the 0-3 range and used to index into the matrix.
Phrase Matching
Multi-word patterns like "deploy to production" at 4-5 points each
Keyword Scoring
Word-boundary regex for terms like "security" or "architect" at 2 points
Weak Keywords
Common terms like "auth" and "database" at 1 point to prevent over-classification
Structural Bonuses
Multi-file references (+3), concept density (+2), multi-step instructions (+1)
Verb Detection
Position-aware classification of read-only vs mutating verbs with safety bias
Context Dampening
5-word proximity window reduces stakes when triggers appear near "test" or "staging"
Imperative Extraction
Position-aware patterns catch terse architectural prompts like "make X independent" or "extract X into" (new in v1.1.0)
Context Intelligence
Raw keyword matching alone produces too many false positives. The word "deploy" in "deploy to production" and "deploy to test environment" carry fundamentally different risk profiles. Context intelligence examines the surrounding words to distinguish high-stakes intent from routine operations.
High Stakes
deploy to production
Low Stakes
deploy to test environment
Context dampening detects "test" near "deploy"
High Stakes
modify the RLS policy
Low Stakes
explain how RLS policies work
Verb detection: mutating vs read-only
High Stakes
update the stripe billing
Low Stakes
describe the stripe billing flow
Mutating verb triggers +finance modifier
Adversarial Stress Test
After the initial implementation passed its unit tests, adversarial agents were deployed to probe for weaknesses. Their goal was simple: craft prompts that would cause the scoring engine to misclassify. They found four high-severity bugs that unit tests alone would never have caught.
Keywords inside code fences inflated scores. An educational prompt with code comments scored as high-stakes deployment.
"read" matched inside "already", "view" inside "review". Single-word verbs needed word-boundary regex.
"auth", "database", "token" at full weight caused chronic over-classification on routine prompts.
No position awareness. "Explain then update" classified as read-only despite mutating intent.
4
HIGH Severity Bugs Found
The Fix — 10 Parallel Agents
All four bugs were resolved in a single session by deploying 10 specialized agents in parallel. Each agent owned a specific fix or regression test suite, writing to isolated report files. The orchestrator merged the results and committed the final patch.
Results
The final scoring engine passes 183 tests covering normal classification, edge cases, and the adversarial regression suite — green on Node 20, 22, and 24. It supports 11 adaptive tiers across three model families, requires zero external dependencies, and scores any prompt in under 100 milliseconds. The system ships as a single TypeScript module that can be dropped into any Claude Code integration.
183
Tests Passing
11
Adaptive Tiers
0
External Dependencies
<100ms
Scoring Latency
v1.1.0 — Active Ceiling
Inverting the rule. Opt-in ceiling mode routes low-stakes work down to cheaper models while five strict safety guards protect every path that actually matters.
v1.0.0 Floor Rule
Escalate only. Models move up when stakes rise; never down.
v1.1.0 Ceiling Mode
NewCan de-escalate. Low-stakes work drops to Haiku or Sonnet under a user-set Opus ceiling.
NEUROTOKEN_MODE=active-ceiling flips the default. Floor-rule behavior (v1.0.0, escalate-only) remains the default when the variable is unset.
Five conditions that block any downgrade
- +auth
- +deploy
- +finance
- +cross-project
- S=3
Adversarial red-team testing surfaced twelve detection gaps the v1.0.0 scorer had missed. Each was patched and locked under a regression test.
Auth surfaces
jwt · oauth · rbac · password · session
Deploy surfaces
ship · promote · publish · release
Runtime surfaces
edge function · lambda
Finance surfaces
pricing · subscription · checkout
The release was built with Neurotoken's own philosophy. Four background agents on Sonnet and Haiku handled bounded work in parallel — schema edits, test scaffolds, doc passes — while Opus coordinated the integration and held the architectural thread.
A CI audit caught a gap. That gap revealed a latent scoring bug. The release shipped cleaner than before.
Pre-merge, an audit revealed CI had been running only two of six test files. Expanding coverage exposed a latent Linux portability bug in v1.0.0 (readFileSync('/dev/stdin')) that had been invisible because those tests never actually ran on Linux CI. Fixed before merge.
183
Tests
12
Gaps Closed
4
Parallel Agents
<100ms
Latency