4 Practical Learnings From Our Beta
Most AI products don't fail loudly. They fail quietly.
In beta, four failure modes showed up again and again:
- context drift (the AI stays confident, but gets subtly wrong)
- silent waiting during slow answers (the delay wasn't the failure, the silence was)
- spreadsheets treated as tables instead of narratives
- model routing that made the product feel inconsistent
At the end, I've included the weekly checklist we now use to catch trust erosion early.
The Turning Point: Week Three
Three weeks into our beta, something weird happened.
Users weren't reporting errors. They were just… confused.
The AI answered their questions, but the responses felt off. Not broken, just wrong in subtle ways.
This is the moment every AI product team hits:
"It works in our tests" vs "it works for actual people doing actual work."
What made the difference for us was combining two approaches from week one:
- weekly user interviews (you can't instrument feelings)
- behavioural data (feelings alone don't tell you what to fix)
Together, they turn vague feedback into precise engineering work.
This post shares four assumptions we thought were safe, and what happened when real users proved us wrong.
Learning 1: Context Drift Is a Product Bug, Not a Model Quirk
What we assumed: We tracked conversation history and trimmed older messages when needed. If anything went wrong, users would notice immediately because the AI would "forget" something obvious.
What happened: The AI didn't forget loudly. It drifted quietly.
A user would reference a spreadsheet column from earlier, and the system would respond confidently… but reference the wrong column, contradict what it said before, or hallucinate a constraint that never existed.
Users didn't say "the AI forgot." They said:
"The AI seems confused sometimes."
Root cause: We optimized for recency, not importance. Low-signal filler messages survived, while high-signal definitions got pruned.
Fix: We rebuilt context retention to prioritize:
- recency (what just happened)
- importance (definitions, constraints, decisions)
- re-reference likelihood (what users keep pointing back to)
We also automatically protect any message containing:
- data references (tables, columns, formulas)
- definitions ("when I say X, I mean Y")
- instructions ("use this format", "don't do that")
What we track now:
- Context drift score: do answers still respect earlier definitions and constraints?
Subtle quality problems are worse than crashes. Users work around broken buttons. They don't tolerate an AI that slowly becomes unreliable.
Learning 2: Slow Answers Weren't Failures, Silent Waiting Was
What we assumed: When queries took too long, we showed an error after 8 seconds. If timeouts spiked, we needed faster infrastructure.
What happened: Timeouts spiked when users uploaded large spreadsheets and asked complex questions immediately. The AI needed time to analyse the file, but our product went quiet and then failed. Users retried. Same result. Frustration.
The infrastructure wasn't the core issue. The experience was.
Fix: We built a three-lane response experience:
- Fast lane (<3s): answer immediately
- Working lane (3–8s): show progress updates ("Parsing sheet 2… mapping formulas…")
- Deep lane (>8s): set expectation upfront ("This will take about 45 seconds"), then deliver with visible milestones
People will wait if they understand what's happening. They abandon when the product goes silent and makes them guess.
What we track now:
- Timeout frustration rate: how often users retry immediately after a timeout
High retries usually mean "they didn't accept the explanation".
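As a concrete (and simplified) example, the metric can be computed from two event streams: timeouts and query submissions. The 30-second retry window below is an assumed value.

```python
from datetime import timedelta

RETRY_WINDOW = timedelta(seconds=30)  # assumed: a retry this soon after a timeout counts as frustration

def timeout_frustration_rate(timeouts, queries) -> float:
    """Share of timeouts followed by an immediate retry from the same user.

    timeouts: list of (user_id, timestamp) for timed-out requests
    queries:  list of (user_id, timestamp) for submitted queries
    """
    if not timeouts:
        return 0.0
    frustrated = 0
    for user_id, t in timeouts:
        if any(q_user == user_id and t < q_t <= t + RETRY_WINDOW
               for q_user, q_t in queries):
            frustrated += 1
    return frustrated / len(timeouts)
```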
Learning 3: Spreadsheets Aren't Data, They're Narratives
What we assumed: Excel files are data containers. Parse cells, extract values, run analysis.
What happened: Users told us our analysis was "technically correct but useless".
We could tell them cell B5 contained "42". We missed why B5 mattered.
Because spreadsheets carry meaning in layers:
- displayed values
- formulas (logic and dependencies)
- formatting (highlights, conditional rules, intent)
One user put it perfectly:
"You're reading my spreadsheet like a basic table. That's not how spreadsheets work."
Fix: We rebuilt our Excel reader to preserve three layers:
- displayed values
- formula logic
- formatting semantics
Now the system can explain:
"Cell B5 shows 42, calculated from B1 through B4, flagged red because it breaches the >40 rule."
What we track now:
- Formula preservation rate: when a cell's value is derived, do we explain the formula and dependencies?
- Formatting recall: do we account for conditional formatting when users ask "why is this highlighted?"
If you're building file intelligence, don't just read the file. Read the intent encoded inside it.
Learning 4: Model Routing Made the Product Feel Inconsistent
What we assumed: Use a faster model for simple questions, a stronger model for complex analysis. Great performance, lower cost. Everyone wins.
What happened: Users experienced personality whiplash.
Concise, structured answers. Then suddenly verbose and chatty. Then back again. Speed changed too: fast, then slow, then fast.
Even when answers were correct, the experience felt less trustworthy because it didn't feel like one coherent system.
Fix: We added a normalisation layer after the model responds:
- standardise formatting (headings, bullets, consistent patterns)
- align tone to the established conversation style
- manage perceived speed (brief "analysing…" when switching to slower reasoning)
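Here's a minimal sketch of the formatting half of that layer, assuming a per-conversation style profile. Tone alignment and the "analysing…" interstitial sit in the same place but are harder to show in a few lines.

```python
import re
from dataclasses import dataclass

@dataclass
class StyleProfile:
    """Conversation-level style every model's output should match (illustrative)."""
    bullet_marker: str = "- "

def normalise_response(text: str, style: StyleProfile) -> str:
    """Apply formatting rules so responses from different models look alike."""
    # Standardise bullet markers (*, •, – all become the house marker).
    text = re.sub(r"^[\*\u2022\u2013]\s+", style.bullet_marker, text, flags=re.M)
    # Collapse runs of blank lines so spacing is consistent.
    text = re.sub(r"\n{3,}", "\n\n", text)
    # Drop chatty boilerplate openers some models like to add.
    text = re.sub(r"^(Sure[,!]?|Certainly[,!]?|Great question[.!]?)\s*", "", text)
    return text.strip()
```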
What we track now:
- Model transition smoothness: do users notice jarring shifts when routing changes?
(We track direct feedback plus behavioural signals like drop-offs after a stylistic or latency jump.)
Users don't care which model answered. They care whether the product feels consistent.
The Weekly "Trust" Checklist
These aren't vanity metrics. They're early warning signals for silent degradation:
- Context drift score
- Timeout frustration rate
- Formula preservation rate
- Formatting recall
- Model transition smoothness
- Partial completion success (do progress updates reduce abandonment?)
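One way to make the checklist operational is to give each signal an explicit alert threshold and review the deltas every week. The thresholds below are placeholders, not recommendations:

```python
# Hypothetical weekly trust review: each signal and when to worry about it.
TRUST_CHECKLIST = {
    "context_drift_score":         {"alert_below": 0.90},
    "timeout_frustration_rate":    {"alert_above": 0.15},
    "formula_preservation_rate":   {"alert_below": 0.95},
    "formatting_recall":           {"alert_below": 0.90},
    "model_transition_smoothness": {"alert_below": 0.85},
    "partial_completion_success":  {"alert_below": 0.80},
}

def weekly_alerts(measurements: dict[str, float]) -> list[str]:
    """Return the signals that crossed their alert threshold this week."""
    alerts = []
    for name, spec in TRUST_CHECKLIST.items():
        value = measurements.get(name)
        if value is None:
            alerts.append(f"{name}: not measured this week")
        elif "alert_below" in spec and value < spec["alert_below"]:
            alerts.append(f"{name}: {value:.2f} is below {spec['alert_below']}")
        elif "alert_above" in spec and value > spec["alert_above"]:
            alerts.append(f"{name}: {value:.2f} is above {spec['alert_above']}")
    return alerts
```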
If you're building an AI product, steal this list and customise it to your own failure modes.
What We'd Do Differently
Instrument quality, not just uptime. Speed and error dashboards won't tell you when trust is dying.
Design for reality, not the happy path. Long jobs, messy files, switching models, partial success. These aren't edge cases. They are normal operating conditions.
Run interviews and behavioural data in parallel from day one. Data tells you what is happening. Conversations tell you why it matters.
The Real Takeaway
The most dangerous failures in AI products are silent.
Loud failures get tickets. Quiet failures get churn.
Your job in beta isn't just fixing what breaks. It's building the sensors that tell you when trust starts slipping.
Frequently Asked Questions
Q. What is context drift and why does it matter?
A. Context drift happens when an AI gradually loses track of important earlier information while staying confident. It subtly misreferences details or contradicts prior statements instead of failing obviously. It erodes trust because users can't tell when the AI is reliable.
Q. How should AI products handle slow response times?
A. Build a three-tier experience: answer immediately when possible (<3s), show progress updates for moderate delays (3–8s), and set clear expectations upfront for longer tasks (>8s) with visible milestones. Silent waiting kills trust faster than actual slowness.
Q. Why should spreadsheets be treated as narratives, not just data?
A. Spreadsheets encode meaning in three layers: displayed values, formula logic, and formatting semantics. Reading them as flat tables misses the intent, dependencies, and reasoning users embed in their work. AI needs to explain why a value matters, not just what it is.
Q. What is model routing and how do you keep it consistent?
A. Model routing switches between AI models based on query complexity. Keep it consistent by adding a normalisation layer that standardises formatting, aligns tone, and manages perceived speed so users experience one coherent system, not jarring personality shifts.
Q. What should be on a weekly trust checklist for AI products?
A. Track context drift score, timeout frustration rate, formula preservation rate, formatting recall, model transition smoothness, and partial completion success. These metrics catch silent quality degradation before users churn.

Team CambrianEdge.ai
A gang of marketers and engineers teaching AI to think like CMOs, break like interns, and ship like caffeine-powered founders.