Notebook · Signal 3
Documentation beats marginal accuracy
A slightly weaker model with clear prompt ownership, version history, and an evaluation set will often outperform a cutting-edge model that only one person understands. Team literacy is not a soft skill—it is a risk control. The failure mode is not ignorance; it is heroics—one expert carries the system until they leave, burn out, or go on holiday, and then quality collapses overnight.
What “maintainable” means in practice
Maintainable means another competent teammate can reproduce outcomes without a séance. It implies naming conventions for prompts, a changelog discipline when product strings change, and a small evaluation suite that runs before you promote a new model version org-wide. Those habits are central to workflow design—not as bureaucracy, but as insurance against silent drift.
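The "small evaluation suite that runs before you promote" can be very small indeed. Here is a minimal sketch, assuming a simple list of prompt/expected-substring cases and a pass-rate threshold; the case contents, the 90% threshold, and the `run_model` stub are illustrative assumptions, not any vendor's API.

```python
# Minimal pre-promotion evaluation gate (illustrative sketch).
# EVAL_SET, PASS_THRESHOLD, and run_model are assumptions for this example.

EVAL_SET = [
    {"prompt": "refund-policy question", "must_contain": "30 days"},
    {"prompt": "shipping-cost question", "must_contain": "free over"},
]

PASS_THRESHOLD = 0.9  # promote only if at least 90% of cases pass


def run_model(prompt: str) -> str:
    # Stand-in for the real model call.
    return "Refunds are accepted within 30 days. Shipping is free over $50."


def evaluate(model=run_model, cases=EVAL_SET, threshold=PASS_THRESHOLD):
    # Count cases whose output contains the expected substring.
    passed = sum(case["must_contain"] in model(case["prompt"]) for case in cases)
    rate = passed / len(cases)
    return {"pass_rate": rate, "promote": rate >= threshold}
```

The point is not the checking logic, which any teammate can rewrite, but that the gate lives in version control next to the prompts, so "did we run the evals?" has an auditable answer.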
Where this shows up in vendor evaluation
The final layer of the seven-layer framework asks whether your team can operate the tool without a resident genius. If the honest answer is no, you are renting a performance spike, not building capability. Solo and micro-teams should treat that answer as existential: there is no bench to absorb the risk.
Detecting drift before customers do
Drift rarely announces itself as a single bad release. It looks like slowly rising escalation rates, longer handle times on the same ticket category, or reviewers spending more time “fixing” model output than last quarter. Instrument a small set of quality checks tied to your evaluation set—automated where possible, sampled by humans where judgment matters. When drift crosses a threshold, treat it like an incident: root cause, owner, date for mitigation. Silent drift is how teams wake up to a competitor’s blog post about “AI failures” featuring their workflow.
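An automated check along these lines can be a one-function affair. The sketch below flags drift when the latest weekly escalation rate exceeds a trailing baseline by a relative margin; the four-week window and 20% tolerance are assumed values you would tune to your own metrics.

```python
# Hypothetical drift check on a weekly escalation-rate series.
# The baseline window and tolerance are illustrative assumptions.

from statistics import mean


def drift_alert(weekly_rates, baseline_weeks=4, tolerance=0.2):
    """Flag drift when the latest week exceeds the trailing baseline
    by more than `tolerance` (relative). Returns False until enough
    history exists to form a baseline."""
    if len(weekly_rates) <= baseline_weeks:
        return False
    baseline = mean(weekly_rates[-(baseline_weeks + 1):-1])
    latest = weekly_rates[-1]
    return latest > baseline * (1 + tolerance)
```

Crossing the threshold should open an incident, per the text: root cause, owner, mitigation date, not a Slack shrug.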
Model upgrades are organizational change events
Vendors ship new models behind friendly names; behavior changes anyway. A maintenance-minded team schedules upgrades: freeze prompts, run the evaluation suite, compare outputs on contentious cases, and only then roll out org-wide. Skipping that sequence because “it should be better” is how you trade a 2% benchmark gain for a week of customer-visible regressions. Document what changed and who signed off—future you will not remember whether “May release” was optional or mandatory.
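The compare-then-sign-off step can be sketched as: run both model versions against the same frozen evaluation set, surface every disagreement for human review, and record who approved the rollout. Function names and the record shape below are illustrative assumptions.

```python
# Sketch of the upgrade sequence: diff two model versions on the same
# eval cases, then record a sign-off. Names and structure are assumed
# for illustration, not a specific tool's API.

def compare_versions(eval_cases, old_model, new_model):
    """Return the cases where the two versions disagree, for human review."""
    disagreements = []
    for case in eval_cases:
        old_out = old_model(case)
        new_out = new_model(case)
        if old_out != new_out:
            disagreements.append({"case": case, "old": old_out, "new": new_out})
    return disagreements


def upgrade_record(version, disagreements, approver):
    # Document what changed and who signed off, so "future you" can
    # tell whether a given release was optional or mandatory.
    return {
        "version": version,
        "reviewed_cases": len(disagreements),
        "approved_by": approver,
        "rollout": "org-wide" if approver else "blocked",
    }
```

A rollout with zero recorded disagreements reviewed is a red flag in itself: it usually means the contentious cases were never in the eval set.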
When to freeze scope versus chase accuracy
Chasing marginal accuracy without fixing ownership is how you build a brittle masterpiece. Sometimes the right move is to freeze prompts and invest in data quality, review tooling, or source-of-truth clarity instead. The question is not “can the model do more?” but “will the organization maintain what we already shipped?” If the answer is no, freeze, simplify, or reduce surface area until maintainability catches up. That trade is unsatisfying in demos and invaluable in production.
Related signals
Cost of ownership intersects with the onboarding signal (total cost of ownership) and with the commercial-risk signal (fair-use pricing). Architectural clarity ties to authoritative customer truth: maintainability is easier when your data model is not fighting itself.