There is a simple test for whether a system is real: can somebody else run it without the original builder standing behind them?
That sounds almost too obvious to be worth writing down, yet a surprising amount of business software fails this standard. It launches cleanly, demos well, maybe even saves time for the first few weeks, and then reveals that the builder is still the coordination layer holding the whole thing together. They know which connector is brittle, which queue can be safely re-run, which output can be trusted, and which failure message is lying.
That is not a finished system. That is a dependency.
A system becomes real at the handoff boundary
People often think of handoff as a late-stage delivery concern. I think it is the place where the design proves whether it was honest.
If someone else cannot run the workflow, one or more of these things is usually true:
- the interface hides too much context,
- the states are unclear,
- the exceptions are not structured,
- the logs are not legible,
- the approval boundaries are fuzzy,
- or the documentation is descriptive rather than operational.
The system may still function. But it functions by borrowing understanding from the builder.
That borrowed understanding is expensive. It turns every future change into an interruption, every edge case into a special request, and every outage into a scavenger hunt. Teams call this fragility, but fragility is just a polite name for incomplete transfer.
Documentation is not enough if the workflow is opaque
Engineers often respond to this problem with documentation. Documentation matters. But it is not a substitute for a legible system.
Weak handoff packages usually contain little more than these:
- A general architecture diagram.
- A setup checklist.
- A list of environment variables.
- A prose description of the intended flow.
All useful, none sufficient.
The person inheriting the system needs operational clarity, not just technical reference. They need to know:
- what enters the system,
- how work changes state,
- where uncertainty is surfaced,
- what the normal failure paths look like,
- which actions are safe to retry,
- what must be escalated,
- and how to tell whether the system is behaving normally today.
That is why the most important documentation is often not the architecture page. It is the runbook, the queue logic, the exception classification, and the explanation of who owns which decisions.
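An exception classification can live in the product as data rather than in a wiki page. The following is a minimal sketch, not a prescribed design: the exception kinds, owner roles, and operator actions are all hypothetical examples, and a real workflow would name its own.

```python
from dataclasses import dataclass
from enum import Enum


class ExceptionKind(Enum):
    # Hypothetical categories chosen for illustration only.
    SOURCE_DATA_MISSING = "source_data_missing"
    CONNECTOR_TIMEOUT = "connector_timeout"
    LOW_CONFIDENCE_EXTRACTION = "low_confidence_extraction"


@dataclass(frozen=True)
class RunbookEntry:
    safe_to_retry: bool
    must_escalate: bool
    owner: str            # a role that owns the decision, not a person's name
    operator_action: str  # what the operator should actually do


# Assumed example entries; the point is that every kind has an answer.
RUNBOOK = {
    ExceptionKind.SOURCE_DATA_MISSING: RunbookEntry(
        False, False, "intake operator",
        "Request the missing document from the submitter."),
    ExceptionKind.CONNECTOR_TIMEOUT: RunbookEntry(
        True, False, "on-duty operator",
        "Retry once; if it fails again, pause the queue."),
    ExceptionKind.LOW_CONFIDENCE_EXTRACTION: RunbookEntry(
        False, True, "senior reviewer",
        "Compare extracted fields against the source and correct them."),
}


def guidance(kind: ExceptionKind) -> str:
    """Render one runbook entry as a line an operator can act on."""
    entry = RUNBOOK[kind]
    retry = "safe to retry" if entry.safe_to_retry else "do not retry"
    escalate = "escalate" if entry.must_escalate else "no escalation needed"
    return (f"{kind.value}: {retry}; {escalate}; "
            f"owner: {entry.owner}. {entry.operator_action}")


print(guidance(ExceptionKind.CONNECTOR_TIMEOUT))
```

Keeping the table keyed by an enum makes a gap visible: a kind with no runbook entry is a question nobody has answered yet.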
Reliability is social before it is technical
This is one reason I resist the idea that reliability is primarily a monitoring problem. Monitoring helps, but systems become reliable when responsibility is knowable.
If a case fails, who sees it? If the extractor loses confidence, where does that surface? If the model drafts the wrong response, who can correct it and where does that correction go? If a connector stalls, what work is now waiting in limbo and who is allowed to move it manually?
Those are not abstract governance questions. They are runtime questions.
The more consequential the workflow, the more important it becomes to answer them in the product itself rather than in a meeting after the fact.
What “someone else can run it” actually requires
The phrase sounds simple. The standard is not. In practice, I look for a small set of conditions.
Clear object model
The system needs named things: cases, records, approvals, exceptions, revisions, tasks, whatever the workflow actually manipulates. If the objects are vague, the interface will be vague.
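As a rough illustration, named objects can be as plain as a couple of dataclasses. The `Case` and `Approval` types and their fields below are assumptions made up for this sketch, not a recommended schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class Approval:
    approver_role: str   # hypothetical field: the role that decided
    approved: bool
    decided_at: datetime


@dataclass
class Case:
    case_id: str
    source: str          # hypothetical field: where the work came from
    approvals: list[Approval] = field(default_factory=list)

    def is_approved(self) -> bool:
        # A case counts as approved only if its latest approval says so.
        return bool(self.approvals) and self.approvals[-1].approved


case = Case(case_id="c-1", source="email")
case.approvals.append(
    Approval("finance reviewer", True, datetime.now(timezone.utc)))
```

Once the objects have names, the interface, the logs, and the runbook can all refer to the same things.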
State transitions that mean something
Statuses should correspond to actual operational meaning. “Pending” is usually a weak label. Pending what? Pending review, pending source data, pending external response, pending retry, pending release. Those are different states with different owners.
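A minimal sketch of what this could look like in code, assuming the five "pending" variants named above plus a hypothetical `RELEASED` end state. The owner roles and the allowed transitions are illustrative guesses, not a claim about any particular workflow.

```python
from enum import Enum


class CaseState(Enum):
    PENDING_REVIEW = "pending_review"
    PENDING_SOURCE_DATA = "pending_source_data"
    PENDING_EXTERNAL_RESPONSE = "pending_external_response"
    PENDING_RETRY = "pending_retry"
    PENDING_RELEASE = "pending_release"
    RELEASED = "released"  # assumed terminal state


# Every state has a named owner role (hypothetical roles).
OWNER = {
    CaseState.PENDING_REVIEW: "reviewer",
    CaseState.PENDING_SOURCE_DATA: "intake operator",
    CaseState.PENDING_EXTERNAL_RESPONSE: "account manager",
    CaseState.PENDING_RETRY: "on-duty operator",
    CaseState.PENDING_RELEASE: "approver",
    CaseState.RELEASED: "nobody (terminal)",
}

# Assumed transition map: anything not listed here is rejected.
ALLOWED = {
    CaseState.PENDING_SOURCE_DATA: {CaseState.PENDING_REVIEW},
    CaseState.PENDING_EXTERNAL_RESPONSE: {CaseState.PENDING_REVIEW},
    CaseState.PENDING_RETRY: {CaseState.PENDING_REVIEW,
                              CaseState.PENDING_SOURCE_DATA},
    CaseState.PENDING_REVIEW: {CaseState.PENDING_RELEASE,
                               CaseState.PENDING_SOURCE_DATA,
                               CaseState.PENDING_EXTERNAL_RESPONSE,
                               CaseState.PENDING_RETRY},
    CaseState.PENDING_RELEASE: {CaseState.RELEASED},
    CaseState.RELEASED: set(),
}


def transition(current: CaseState, target: CaseState) -> CaseState:
    """Return the new state, or raise if the move is not allowed."""
    if target not in ALLOWED[current]:
        raise ValueError(
            f"{current.value} -> {target.value} is not an allowed transition")
    return target
```

The useful property is that an illegal move fails loudly with both state names in the message, instead of leaving a case in a status nobody owns.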
Operator-visible history
If somebody else is going to run the system, they need to see what happened before they arrived. That means timestamps, transitions, decisions, and relevant notes in one place.
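One possible shape for that history is an append-only list of entries per case. This is a sketch under assumptions: the `HistoryEntry` fields, the state names, and the actor labels are invented for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class HistoryEntry:
    at: datetime
    actor: str        # a person, role, or automated step
    from_state: str
    to_state: str
    note: str = ""


@dataclass
class CaseHistory:
    case_id: str
    entries: list[HistoryEntry] = field(default_factory=list)

    def record(self, actor: str, from_state: str, to_state: str,
               note: str = "") -> None:
        # Entries are only ever appended, never edited in place.
        self.entries.append(HistoryEntry(
            datetime.now(timezone.utc), actor, from_state, to_state, note))

    def timeline(self) -> list[str]:
        """One readable line per transition, oldest first."""
        lines = []
        for e in self.entries:
            line = (f"{e.at.isoformat(timespec='seconds')} {e.actor}: "
                    f"{e.from_state} -> {e.to_state}")
            if e.note:
                line += f" ({e.note})"
            lines.append(line)
        return lines


history = CaseHistory("c-1")
history.record("extractor", "pending_source_data", "pending_review",
               "invoice total unreadable")
history.record("reviewer", "pending_review", "pending_release")
for line in history.timeline():
    print(line)
```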
Recoverable failure paths
A case that fails should move somewhere understandable. Silent failure is what makes the builder indispensable, because only the builder knows where the missing work went.
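A hedged sketch of "moving somewhere understandable": a failed step is caught and the case is placed in a queue named after the failure reason. `StepFailed`, `run_step`, the in-memory `exception_queues`, and the queue names are all hypothetical; a real system would persist this somewhere durable.

```python
from collections import defaultdict
from typing import Callable


class StepFailed(Exception):
    """Raised by a workflow step with a named, operator-readable reason."""

    def __init__(self, reason: str, detail: str):
        super().__init__(detail)
        self.reason = reason
        self.detail = detail


# In-memory stand-in for durable exception queues (assumption).
exception_queues: dict[str, list[dict]] = defaultdict(list)


def run_step(case_id: str, step: Callable[[str], None]) -> str:
    """Run one step; on failure, park the case in a named queue."""
    try:
        step(case_id)
        return "ok"
    except StepFailed as err:
        queue = f"exception:{err.reason}"
        exception_queues[queue].append(
            {"case_id": case_id, "detail": err.detail})
        return queue
    except Exception as err:
        # Unexpected errors are still parked visibly, never dropped.
        queue = "exception:needs_triage"
        exception_queues[queue].append(
            {"case_id": case_id, "detail": repr(err)})
        return queue
```

The return value tells the caller where the case went, so "where is the missing work?" has an answer that does not depend on asking the builder.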
Bounded change surface
The next operator should know which rules they can modify safely and which require deeper engineering. If every adjustment feels dangerous, the system has not been properly surfaced.
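One way to surface that boundary is to mark each setting as operator-editable or engineering-only and to bound the editable ones. The setting names, values, and limits below are assumptions made up for this sketch.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Setting:
    value: float
    operator_editable: bool
    minimum: float
    maximum: float


# Hypothetical settings; the flag and bounds are the point, not the names.
SETTINGS = {
    "review_confidence_threshold": Setting(0.85, True, 0.5, 0.99),
    "max_auto_retries": Setting(3, True, 0, 5),
    "connector_batch_size": Setting(200, False, 1, 1000),
}


def operator_update(settings: dict[str, Setting], name: str,
                    new_value: float) -> dict[str, Setting]:
    """Return a new settings dict, or raise if the change is out of bounds."""
    current = settings[name]
    if not current.operator_editable:
        raise PermissionError(f"{name} requires an engineering change")
    if not (current.minimum <= new_value <= current.maximum):
        raise ValueError(
            f"{name} must be between {current.minimum} and {current.maximum}")
    updated = dict(settings)  # copy, so the original stays untouched
    updated[name] = replace(current, value=new_value)
    return updated
```

Returning a copy rather than mutating in place is a design choice for the sketch: it keeps the previous configuration available for comparison or rollback.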
The builder should not remain the hidden control plane
The most common smell here is a builder who says, “If anything strange happens, just tell me.” That sentence can be reasonable during an early pilot. If it persists, it means the control surface still lives in the builder’s head.
I would rather ship a narrower system with honest boundaries than a broader one that still requires private interpretation. Narrow systems can be widened. Head-dependent systems get more expensive every month because they accumulate special knowledge outside the product.
A system matures when explanation migrates out of the builder and into the operating surface.
That migration shows up in design choices:
- named exception types instead of “miscellaneous” queues,
- review screens that expose evidence rather than just outputs,
- recovery instructions beside failures,
- permissions that match real responsibility,
- and documentation that tells an operator what to do, not just what the architecture contains.
The test is transfer, not novelty
The market rewards novelty because novelty is visible. Transfer is less glamorous. No one posts screenshots of a clean handoff matrix with the same enthusiasm they reserve for an agent demo. But transfer is what separates a reusable operating system from a smart experiment.
Here is a blunt version of the standard:
| If this is true | Then the system is |
|---|---|
| Only the builder can interpret the logs | Not done |
| Exceptions require tribal knowledge | Not done |
| Approval boundaries are explained verbally | Not done |
| Restart and retry behavior is unclear | Not done |
| Another operator cannot run a real day of work | Not done |
The goal is not to remove humans. The goal is to remove mysterious dependence on one specific human.
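The table can also be kept as a small checklist that a team answers before declaring a handoff complete. This is only a sketch: the keys are hypothetical names, and the answers still have to come from a person who tried to run the system.

```python
# Each key is a condition that should hold; the text is the failure it guards
# against, taken from the table above. Key names are illustrative.
HANDOFF_CHECKS = {
    "logs_legible_to_others": "Only the builder can interpret the logs",
    "exceptions_documented": "Exceptions require tribal knowledge",
    "approvals_written_down": "Approval boundaries are explained verbally",
    "retry_behavior_clear": "Restart and retry behavior is unclear",
    "other_operator_ran_a_day": "Another operator cannot run a real day of work",
}


def handoff_gaps(answers: dict[str, bool]) -> list[str]:
    """Return the failure descriptions for every check not confirmed True."""
    return [problem for key, problem in HANDOFF_CHECKS.items()
            if not answers.get(key, False)]


def is_done(answers: dict[str, bool]) -> bool:
    # Unanswered checks count as gaps, so silence never passes.
    return not handoff_gaps(answers)
```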
Build for the day you are not in the room
I think this test improves architecture decisions early, not just late. When you ask “Could someone else run this?” you naturally choose clearer states, better operator interfaces, safer failure handling, and tighter boundaries. You also stop pretending that the work ends at deployment.
A real system survives absence. It survives another operator, another reviewer, another maintainer, another month, and another round of edge cases. If it cannot do that yet, it may still be promising, but it remains a prototype living inside a company-shaped environment.
That is fine for a while. It is just important to call it what it is.