
Airbnb's BPI (Business Process Insights) team had 55+ analysts wading through unstructured Voice-of-Customer data — host complaints, guest messages, search queries, refund threads — using one-off prompts against whatever LLM had API budget that week. The BPI Virtual Analyst gave them a single tool: pick a workflow, pick a model, paste data, get structured insights.
Analysts ran into three walls every week: no consistent way to call models across providers, hard row limits that forced manual Excel stitching, and PII scrubbing that was slow and noisy.
I was the only engineer on the BPI Virtual Analyst for the first 6 months — owning the architecture end to end, then handing off pieces to a second engineer once the platform stabilized.
Built a thin model-router with provider adapters (OpenAI, Anthropic, Google, Bedrock, internal endpoints). The Streamlit UI just says "model: claude-3.5-sonnet" — the router knows how to translate that into the right SDK call, with the right token accounting and retry policy.
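For illustration, here is a minimal sketch of the router boundary; the adapter classes, registry shape, and model names are placeholders rather than the production code, which also handles token accounting and retries per provider.

```python
# Illustrative sketch of the thin router; adapter classes and the registry
# are placeholders, not the production implementation.
class Adapter:
    def invoke(self, prompt: str, **kwargs) -> str:
        raise NotImplementedError

class AnthropicAdapter(Adapter):
    def invoke(self, prompt, **kwargs):
        ...  # call the Anthropic SDK here, with provider-specific retries

class OpenAIAdapter(Adapter):
    def invoke(self, prompt, **kwargs):
        ...  # call the OpenAI SDK here

REGISTRY = {
    "claude-3.5-sonnet": AnthropicAdapter(),
    "gpt-4o": OpenAIAdapter(),
}

def route(model_name: str, prompt: str, **kwargs) -> str:
    # the UI only knows the model name; the router picks the provider adapter
    return REGISTRY[model_name].invoke(prompt, **kwargs)
```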
Chunked file processing with Celery workers, Redis for job state, streaming results back to the UI as they finish. Analysts see the first 50 results in seconds instead of waiting for the whole run. Token-lifetime checks refresh creds mid-batch so 90-min runs don't fail at minute 89.
Built a PII pipeline on Microsoft Presidio with entity-targeted detection — instead of running every recognizer on every cell, we only run the ones the column needs. Cut runtime by ~30% and dropped false positives on common names dramatically.
Four layers, deliberately boring. The interesting bits are in the dotted-line failure paths, not on the happy path.
The naïve version of "process more rows" is "use a bigger box." That stopped working at 1,200. The actual fix was four small things.
The old code read the whole CSV into memory and serialized it as one JSON blob to the LLM. With 10K rows of customer messages (avg 280 tokens each), we'd blow the model's context window in under 1,000 rows.
Switched to streaming chunks of N rows at a time (where N depends on the model's context budget), with overlap windows for context-sensitive workflows like sentiment-with-prior-turns; a sketch of the overlap variant follows the chunker below.
```python
# chunker.py — adaptive chunking
def stream_chunks(rows, model_cfg):
    budget = model_cfg.ctx_window * 0.7
    buf, used = [], 0
    for r in rows:
        toks = est_tokens(r)
        if used + toks > budget:
            yield buf
            buf, used = [], 0
        buf.append(r)
        used += toks
    if buf:
        yield buf
```
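A rough sketch of the overlap variant, building on `stream_chunks` above; the 5-row overlap is an assumed value, not the production setting.

```python
# Sketch only: prepend the tail of the previous chunk so context-sensitive
# workflows (e.g. sentiment-with-prior-turns) see the preceding rows.
# The 5-row overlap is an assumption, not the production value.
def stream_chunks_with_overlap(rows, model_cfg, overlap=5):
    prev_tail = []
    for chunk in stream_chunks(rows, model_cfg):
        yield prev_tail + chunk
        prev_tail = chunk[-overlap:]
```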
```python
# celery + retry policy
@app.task(
    autoretry_for=(RateLimit, TokenExpired),
    retry_backoff=True,
    retry_jitter=True,
    max_retries=5,
)
def run_chunk(job_id, chunk_id, workflow):
    creds = refresh_if_near_expiry()
    res = router.invoke(workflow, chunk_id, creds)
    redis.publish(f"job:{job_id}", res)  # stream this chunk's result to the job channel
    return res
```
The biggest source of failure in 90-minute LLM jobs wasn't the model — it was credential rotation. Internal tokens expire on a 60-min cadence; a worker that picked up a task at minute 58 would fail at minute 62 with no useful error.
Wrapped every model call in a refresh-if-near-expiry check, plus exponential-backoff retries for rate limits. Worker MTTR went from "manual restart" to zero.
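Roughly what that wrapper looks like; the auth helpers and the five-minute margin here are assumptions standing in for the internal credential client.

```python
import time

# Hypothetical shape of the refresh check used by run_chunk above.
# get_cached_creds / fetch_new_token stand in for the internal auth client;
# the 5-minute safety margin is an assumed value.
REFRESH_MARGIN_S = 5 * 60

def refresh_if_near_expiry():
    creds = get_cached_creds()
    if creds.expires_at - time.time() < REFRESH_MARGIN_S:
        # re-issue before the 60-minute expiry can kill a long-running chunk
        creds = fetch_new_token()
        cache_creds(creds)
    return creds
```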
Analysts hated waiting 8 minutes to see if their prompt was even going to produce useful output. Hooked the Celery results channel to a Redis pub/sub stream, and the Streamlit UI now shows row-level outputs as they finish.
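On the UI side, the consumer is roughly this shape, assuming each job gets its own Redis channel and results are published as JSON; the channel naming and table rendering are illustrative.

```python
import json
import redis
import streamlit as st

# Sketch of the UI side: subscribe to the per-job channel that run_chunk
# publishes to and render rows as they land.
r = redis.Redis()

def stream_results(job_id: str):
    sub = r.pubsub()
    sub.subscribe(f"job:{job_id}")
    table = st.empty()
    rows = []
    for msg in sub.listen():
        if msg["type"] != "message":
            continue
        rows.append(json.loads(msg["data"]))
        table.dataframe(rows)  # first rows appear while the rest of the run continues
```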
First useful output drops in <15 seconds for any workflow. Average time to run completion drops 35% because analysts catch broken prompts early and abort.
```python
# presidio — entity-targeted, not all-recognizers
from presidio_analyzer import AnalyzerEngine

def scrub(df, column_schema):
    analyzer = AnalyzerEngine()
    for col, types in column_schema.items():
        # only run the recognizers declared for this column;
        # analyze() returns the detected entity spans for each cell
        df[col] = df[col].map(
            lambda v: analyzer.analyze(
                v,
                entities=types,
                language="en",
            )
        )
    return df
```
Presidio out-of-the-box runs every recognizer (~20 of them) on every field. For a 10K-row analysis with 5 columns, that's a million regex passes — and the false-positive rate on common English first names was painful.
Added a column-schema to declare which entity types each column might contain (e.g. email column → only run email detector). Cut runtime by ~30% and eliminated most false hits.
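The schema itself is just a mapping from column to the Presidio entity types it can contain; the column names below are made-up examples, while the entity strings are Presidio's built-in types.

```python
# Example column schema; column names are illustrative.
COLUMN_SCHEMA = {
    "guest_message": ["PERSON", "PHONE_NUMBER", "EMAIL_ADDRESS"],
    "contact_email": ["EMAIL_ADDRESS"],
    "refund_notes": ["CREDIT_CARD", "IBAN_CODE"],
}

scrubbed = scrub(df, COLUMN_SCHEMA)
```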
A year in, the BPI Virtual Analyst is daily-used infrastructure.
Every BPI analyst uses the tool weekly. Most run 5-10 workflows a week. Adoption was zero-to-100 in 3 weeks once we hit feature parity with the old workflow.
From 600 rows/run to 10,000 rows/run. Upload ceiling moved from 10 MB to 40 MB. Most analyses now run in a single pass — no more Excel stitching.
Entity-targeted detection on Presidio runs ~30% faster than the vanilla "run everything" config — with a meaningful reduction in false positives on common names.
For an MVP, Streamlit was perfect — I shipped the first version in 4 weeks. But at 55 concurrent analysts running long jobs, the session-state model showed its limits. If I were starting over, I'd pick FastAPI + a small React frontend on day 1.
We added 30 models partly because analysts asked. Some of them were never actually better for any workflow. Now we run every new model through an eval suite (Ragas + custom rubrics) before exposing it in the UI.
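In spirit, the gate is nothing more than a threshold check over the eval scores; the metric names and bars below are placeholders, and the real suite layers Ragas metrics on top of workflow-specific rubrics.

```python
# Illustrative gate, not the real eval suite: a candidate model is only
# exposed in the UI if it clears every threshold on the benchmark set.
THRESHOLDS = {"faithfulness": 0.80, "rubric_score": 0.75}  # placeholder bars

def should_expose(scores: dict[str, float]) -> bool:
    return all(scores.get(metric, 0.0) >= bar for metric, bar in THRESHOLDS.items())
```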
I added OpenTelemetry tracing only after the third on-call incident. The instrumentation took two days; the time it saved in the next six weeks was an order of magnitude more. Wire it up first, always.