Forty hours a week, reclaimed

When a client tells you their team is "drowning in reconciliation," the instinct is to build a dashboard. That is almost always the wrong first move.

We spent eight weeks rebuilding a mid-market brokerage's weekly operations — the end result was a single AI-assisted console that compresses roughly forty hours of cross-team work into a single reviewed session. Here is what actually moved the needle, and what we almost got wrong.

The temptation of the dashboard

Every ops team has a spreadsheet graveyard. The natural response is a prettier spreadsheet — filters, charts, a glossy export button. We built exactly this in week two. The team politely ignored it.

The reason was simple. A dashboard tells you what is off. It does not tell you what to do. The forty hours are not spent looking. They are spent reconciling, deciding, and messaging.

Evaluations before interface

On the second attempt we started with an evaluation harness — thirty anonymised weekly snapshots, each with a "what a senior ops lead would do" answer. Every LLM-assisted suggestion had to survive that bench before it reached the UI.

This inverted the usual build order. The prompt and retrieval layer matured for two weeks before we wrote a single line of console code. By the time the interface appeared, the model's suggestions were already trusted.

Evaluations are not a quality-assurance layer. They are the product's first draft.

What nearly broke it

Twice.

Auto-resolve on low-confidence matches. We shipped a "clear 90%+" button in week five. It was wrong on one edge case — and that edge case was a seven-figure reconciliation. We pulled it. Every automated action now writes a reversible audit row and is gated behind a human press.
A "summarise the week" button that summarised the week. Users did not want a summary. They wanted a list of things to decide on, ranked. We rewrote the output format three times before it stopped being ignored.

The shape that worked

In the end the interface has three regions and nothing else.

A queue of decisions, ranked by money-at-risk.
A working surface where a single decision is opened, evidence laid out, and an action staged.
A ledger of what shipped this week and who confirmed it.

No charts. No filters panel. No export. The team stopped running the old spreadsheet in week four.

What we took away

Building with AI inside someone's operations is less about the model and more about where in the workflow you let it speak. The model is rarely the bottleneck. The interface, the evaluation bench, and the reversibility of actions are.

Forty hours a week is a lot of time. The team is spending some of it on higher-leverage work now, and some of it, gratefully, on going home.

Forty hours a week, reclaimed.

The temptation of the dashboard

Evaluations before interface

What nearly broke it

The shape that worked

What we took away

Evals before demos

Like the approach? Let’s talk.