★ Featured · Currently shipping

BPI Virtual Analyst

GenAI Platform · Internal Tool · Airbnb · Sep 2024 — Present
[Product mockup: the analyst picks a workflow (Insights summary, Sentiment, Categorise, PII redact, Custom prompt) and a model (GPT-4o, Claude 3.5, Gemini 1.5, Llama 3 70B, + 26 more), runs analysis on a 10,000-row file with GPT-4o streaming, and watches row-level classifications with confidence scores arrive: Booking issue 92%, Pricing query 78%, Refund request 86%, Host onboard 64%, Cleanliness 71%, Smart pricing 88%.]

Project Overview

Airbnb's BPI (Business Process Insights) team had 55+ analysts wading through unstructured Voice-of-Customer data — host complaints, guest messages, search queries, refund threads — using one-off prompts against whatever LLM had API budget that week. The BPI Virtual Analyst gave them a single tool: pick a workflow, pick a model, paste data, get structured insights.

LLM Orchestration · PII Detection · Streaming Inference · Batch Pipelines · Token Lifecycle · Observability · Python · Flask · Celery · Redis · Streamlit · Airflow · Hive

Problem Statement

Analysts ran into three walls every week:

  1. 600-row ceiling. The old tool choked at 600 rows per run. A typical week's worth of feedback was 6,000+ rows — so analysts split files manually, ran 10 jobs, and stitched the results in Excel.
  2. No model choice. Hard-coded to one model. When GPT-4 went down, the entire team's workflow stopped. No way to A/B prompts across providers.
  3. PII risk. Customer data was being shipped to model APIs with no de-identification step. Each new use case needed a privacy review that took weeks.
17×
Throughput improvement after re-architecture — from 600 rows/run to 10,000 rows/run, on the same infrastructure.
Other before / after:
  • Upload size limit: 10 → 40 MB
  • Models integrated: 1 → 30+

My Role

I was the only engineer on the BPI Virtual Analyst for the first 6 months — owning the architecture end to end, then handing off pieces to a second engineer once the platform stabilized.

Lead Backend Engineer · LLM Integrations · Frontend (Streamlit) · Pipeline Architecture · PII / Privacy · Observability · On-call · Analyst onboarding

The approach in three moves.

// STEP 01

Decouple the runtime from the model.

Built a thin model-router with provider adapters (OpenAI, Anthropic, Google, Bedrock, internal endpoints). The Streamlit UI just says "model: claude-3.5-sonnet" — the router knows how to translate that into the right SDK call, with the right token accounting and retry policy.
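Roughly the shape of it, as a sketch; names like ModelRouter, Completion, and register are illustrative, not the production code:

# model-router sketch (illustrative names; real adapters wrap the OpenAI / Anthropic /
# Google / Bedrock SDKs and add per-provider token accounting and retry policy)
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class Completion:
    text: str
    tokens_in: int
    tokens_out: int

class ModelRouter:
    def __init__(self) -> None:
        self._adapters: Dict[str, Callable[[str], Completion]] = {}

    def register(self, model_name: str, adapter: Callable[[str], Completion]) -> None:
        self._adapters[model_name] = adapter

    def invoke(self, model_name: str, prompt: str) -> Completion:
        # the UI only passes a model name; the adapter owns the SDK call
        return self._adapters[model_name](prompt)

Adding a provider is one adapter plus one register() call; nothing upstream changes.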

// STEP 02

Make 10,000 rows feel like 600.

Chunked file processing with Celery workers, Redis for job state, streaming results back to the UI as they finish. Analysts see the first 50 results in seconds instead of waiting for the whole run. Token-lifetime checks refresh creds mid-batch so 90-min runs don't fail at minute 89.
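Under the hood the fan-out is small. A sketch with illustrative names (submit_job is hypothetical; it assumes a module-level redis client plus the stream_chunks chunker and run_chunk task shown in the deep dive below):

# job fan-out sketch (illustrative; assumes the chunker and Celery task from the deep dive)
import json
import uuid

def submit_job(rows, workflow, model_cfg):
    job_id = str(uuid.uuid4())
    chunks = list(stream_chunks(rows, model_cfg))
    # job state lives in Redis; the UI reads it to render progress
    redis.hset(f"job:{job_id}:meta", mapping={"total": len(chunks), "done": 0})
    for i, chunk in enumerate(chunks):
        redis.set(f"job:{job_id}:chunk:{i}", json.dumps(chunk))  # worker fetches payload by id
        run_chunk.delay(job_id, i, workflow)                     # one Celery task per chunk
    return job_id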

// STEP 03

De-identify before we send anywhere.

Built a PII pipeline on Microsoft Presidio with entity-targeted detection — instead of running every recognizer on every cell, we only run the ones the column needs. Cut runtime by ~30% and dropped false positives on common names dramatically.

The system, in one diagram.

Four layers, deliberately boring. The interesting bits are in the dotted-line failure paths, not on the happy path.

[Architecture diagram, flattened to its components:]
  • Streamlit UI · analyst-facing
  • CSV / Excel upload · up to 40 MB
  • Workflow picker · prompts + schemas
  • Model picker · 30+ options
  • Flask API · auth · job submission · JWT · token-lifetime check · rate-limit
  • Presidio PII scrubber · entity-targeted · 30% faster
  • Celery worker pool · chunked · streaming · retry
  • Redis · job state · progress · partial results
  • Model router · OpenAI · Claude · Gemini · Bedrock
  • SQLAlchemy · audit log · runs · prompts
  • Airflow nightly · backfills · refresh
  • Hive / Trino · historical analyses
  • OTel + Datadog · latency · cost · errors

Deep dive: scaling 600 → 10,000.

The naïve version of "process more rows" is "use a bigger box." That stopped working at 1,200 rows. The actual fix was four small things.

1 · Chunk before you batch.

The old code read the whole CSV into memory and serialized it as one JSON blob to the LLM. With 10K rows of customer messages (avg 280 tokens each), we'd blow the model's context window in under 1,000 rows.

Switched to streaming chunks of N rows at a time (where N depends on the model's context budget), with overlap windows for context-sensitive workflows like sentiment-with-prior-turns.

bpi://chunker.py
# chunker.py — adaptive chunking
def stream_chunks(rows, model_cfg):
    budget = model_cfg.ctx_window * 0.7
    buf, used = [], 0
    for r in rows:
        toks = est_tokens(r)
        if used + toks > budget:
            yield buf
            buf, used = [], 0
        buf.append(r)
        used += toks
    if buf:
        yield buf
bpi://workers.py
# celery + retry policy
@app.task(
    autoretry_for=(RateLimit, TokenExpired),
    retry_backoff=True,
    retry_jitter=True,
    max_retries=5,
)
def run_chunk(job_id, chunk_id, workflow):
    creds = refresh_if_near_expiry()
    res = router.invoke(workflow, chunk_id, creds)
    redis.publish(f"job:{job_id}", res)  # push the finished chunk to the UI stream
    return res

2 · Retry like you mean it.

The biggest source of failure in 90-minute LLM jobs wasn't the model — it was credential rotation. Internal tokens expire on a 60-min cadence; a worker that picked up a task at minute 58 would fail at minute 62 with no useful error.

Wrapped every model call in a refresh-if-near-expiry check, plus exponential-backoff retries for rate limits. Worker MTTR effectively dropped to zero; failures that used to need a manual restart now recover on their own.
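The guard itself is small. A sketch, where fetch_fresh_token stands in for the internal auth client (not its real name):

# refresh-if-near-expiry sketch (fetch_fresh_token is a stand-in for the internal auth client)
import time

REFRESH_MARGIN_S = 300  # refresh when fewer than 5 minutes of validity remain

_creds = {"token": None, "expires_at": 0.0}

def refresh_if_near_expiry():
    if _creds["expires_at"] - time.time() < REFRESH_MARGIN_S:
        token, ttl_seconds = fetch_fresh_token()  # hypothetical call into the auth service
        _creds["token"] = token
        _creds["expires_at"] = time.time() + ttl_seconds
    return _creds["token"]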

3 · Stream results, don't batch them.

Analysts hated waiting 8 minutes to see if their prompt was even going to produce useful output. Hooked the Celery results channel to a Redis pub/sub stream, and the Streamlit UI now shows row-level outputs as they finish.

First useful output drops in <15 seconds for any workflow. Average time to run completion drops 35%, because analysts catch broken prompts early and abort.
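On the UI side, the consumer is a thin loop over the pub/sub channel. A sketch, assuming each published message is one finished chunk's rows as JSON (the real app also handles reconnects and aborts):

# UI-side stream consumer sketch (assumes the worker publishes each chunk's rows as JSON)
import json

import redis
import streamlit as st

r = redis.Redis()

def stream_results(job_id: str) -> None:
    pubsub = r.pubsub()
    pubsub.subscribe(f"job:{job_id}")
    table = st.empty()  # placeholder that gets re-rendered as rows arrive
    rows = []
    for msg in pubsub.listen():
        if msg["type"] != "message":
            continue
        rows.extend(json.loads(msg["data"]))
        table.dataframe(rows)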

bpi://stream.tsx · live
[00:00:02] connected · job-471c
[00:00:03] chunk 1/47 → submitted
[00:00:08] row 12: "booking · refund"
[00:00:09] row 13: "pricing · clarify"
[00:00:11] row 18: "host · onboard"
[00:00:14] WARN row 22 retry (rate-limit)
[00:00:16] row 22: "cleanliness"
[00:00:18] chunk 1/47 ✓ · 250 rows · 12.4 s
[00:00:18] chunk 2/47 → submitted
...
[00:14:32] job-471c ✓ 10,000 / 10,000
bpi://pii.py
# presidio — entity-targeted, not all-recognizers
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

def scrub(df, column_schema):
    for col, types in column_schema.items():
        # only run the recognizers declared for this column,
        # then redact the spans they find
        df[col] = df[col].map(
            lambda v: anonymizer.anonymize(
                text=v,
                analyzer_results=analyzer.analyze(v, entities=types, language="en"),
            ).text
        )
    return df

4 · PII detection, where it matters.

Presidio out-of-the-box runs every recognizer (~20 of them) on every field. For a 10K-row analysis with 5 columns, that's a million regex passes — and the false-positive rate on common English first names was painful.

Added a column-schema to declare which entity types each column might contain (e.g. email column → only run email detector). Cut runtime by ~30% and eliminated most false hits.
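The schema itself is just a dict from column name to Presidio entity labels. Column names here are made up; the entity labels are real Presidio types:

# illustrative column schema: real Presidio entity labels, made-up column names
column_schema = {
    "guest_message": ["PERSON", "EMAIL_ADDRESS", "PHONE_NUMBER"],
    "contact_email": ["EMAIL_ADDRESS"],
    "booking_notes": ["PERSON", "LOCATION"],
}

clean_df = scrub(df, column_schema)  # only the declared recognizers run per column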

The outcomes.

A year in, the BPI Virtual Analyst is daily-used infrastructure.

55+
Analysts onboarded

Every BPI analyst uses the tool weekly. Most run 5-10 workflows a week. Adoption went from zero to the full team in 3 weeks once we hit feature parity with the old workflow.

17×
Throughput per run

From 600 rows/run to 10,000 rows/run. Upload ceiling moved from 10 MB to 40 MB. Most analyses now run in a single pass — no more Excel stitching.

30%
PII pipeline speedup

Entity-targeted detection on Presidio runs ~30% faster than the vanilla "run everything" config — with a meaningful reduction in false positives on common names.

"Sai's tool collapsed three days of manual analysis into a 20-minute run. Our team's weekly insights deck now ships on Monday morning instead of Wednesday."
— BPI lead, Airbnb (internal)

What I'd do differently.

// LESSON 01

Streamlit was the wrong choice past 200 users.

For an MVP, Streamlit was perfect — I shipped the first version in 4 weeks. But at 55 concurrent analysts running long jobs, the session-state model showed its limits. If I were starting over, I'd pick FastAPI + a small React frontend on day 1.

// LESSON 02

Eval before integration.

We added 30 models partly because analysts asked. Some of them were never actually better for any workflow. Now we run every new model through an eval suite (Ragas + custom rubrics) before exposing it in the UI.

// LESSON 03

Observability before the second user.

I added OpenTelemetry tracing only after the third on-call incident. The instrumentation took two days; the time it saved in the next six weeks was an order of magnitude more. Wire it up first, always.
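The initial wiring really is a few lines. A minimal sketch of the span around every model call (span and attribute names are illustrative):

# minimal tracing sketch: one span per model call, exported to Datadog via the collector
from opentelemetry import trace

tracer = trace.get_tracer("bpi.virtual_analyst")

def traced_invoke(call_model, model_name, prompt):
    # call_model is whatever actually hits the provider SDK and returns the completion text
    with tracer.start_as_current_span("llm.invoke") as span:
        span.set_attribute("llm.model", model_name)
        span.set_attribute("llm.prompt_chars", len(prompt))
        completion = call_model(model_name, prompt)
        span.set_attribute("llm.completion_chars", len(completion))
        return completion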

Next case study →
Dose Management System · Eli Lilly