★ NLP · CNN · Responsible AI

Subsurface ML

Drilling-loss prediction & Acoustic Impedance · Shell · Feb 2021 — Dec 2022

Project Overview

Two parallel projects under one team: (1) text-based classification of historical drilling-loss events from 27 years of well reports, and (2) a CNN autoencoder for near-lossless compression of Acoustic Impedance graphs. A third workstream, a Responsible-AI POC, layered LIME and SHAP over both models so geoscientists could trust the output.

NLP / Text Classification · CNN Autoencoders · PySpark / Spark-SQL · Databricks · Azure · LIME · SHAP · REST APIs · Pi Data (1995–2022) · F1 +0.2 vs baseline

Problem Statement

  1. Unstructured history. 27 years of well logs existed only as free-form text, and the unsupervised prediction model topped out at ~F1 0.62, not accurate enough to act on.
  2. Huge graph storage. Acoustic Impedance graphs were stored as full-resolution PNGs in Azure Blob: terabytes of largely redundant data.
  3. Black-box ML. Geoscientists wouldn't accept a model's output without understanding why it was produced.
+0.2 · F1 lift on the drilling-loss classifier after adding labeled supervised data and a wildcard-regex preprocessing pass.
−1 hr · Training time saved.
+25% · Pipeline performance.
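The wildcard-regex preprocessing pass can be sketched as a small set of normalization rules that collapse measurements and IDs into shared tokens, so the classifier sees patterns rather than raw numbers. The patterns and token names below are hypothetical, not the project's actual rules:

```python
import re

# Hypothetical normalization rules for free-form drilling reports;
# the real pass and its vocabulary were project-specific.
RULES = [
    (re.compile(r"\b\d+(\.\d+)?\s*(bbl|bbls)\b", re.I), " VOLUME_BBL "),
    (re.compile(r"\b\d+(\.\d+)?\s*ppg\b", re.I), " MUDWEIGHT_PPG "),
    (re.compile(r"\bw[- ]?\d{3}\b", re.I), " WELL_ID "),
    (re.compile(r"\s+"), " "),  # collapse whitespace left by substitutions
]

def normalize(report: str) -> str:
    """Collapse numeric measurements and well IDs into wildcard tokens."""
    text = report.lower()
    for pattern, token in RULES:
        text = pattern.sub(token, text)
    return text.strip()

print(normalize("Lost 120 bbls while drilling W-217 at 14.2 ppg"))
# → lost VOLUME_BBL while drilling WELL_ID at MUDWEIGHT_PPG
```

Collapsing "120 bbls" and "135 bbls" into one token shrinks the vocabulary and lets sparse loss events share features across decades of reports.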

My Role

Python developer and data scientist. Built the supervised text classifier, the CNN autoencoder, and the Responsible-AI framework. Worked directly with geoscientist labelers and onboarded two interns.

ML Models · PySpark ETL · REST APIs · Responsible-AI POC · Intern Mentorship · Stakeholder Reporting

The approach.

// STEP 01

Label, then learn.

Worked with subject-matter experts to weight-label ambiguous loss events, then trained a supervised classifier on the labeled set, lifting F1 from 0.62 to 0.84.
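A minimal sketch of this setup, assuming a scikit-learn TF-IDF pipeline; the reports, labels, and SME confidence weights below are toy stand-ins for the real labeled corpus:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy stand-in for the expert-labeled report set; the real data was
# 27 years of well reports with SME-assigned labels and weights.
reports = [
    "total losses while drilling, pumped lcm pill",
    "partial mud losses observed at casing shoe",
    "no losses, normal drilling parameters",
    "static well, no gain or loss recorded",
]
labels = [1, 1, 0, 0]            # 1 = loss event
weights = [1.0, 0.6, 1.0, 0.9]   # SME confidence on ambiguous reports

vec = TfidfVectorizer(ngram_range=(1, 2))
X = vec.fit_transform(reports)

# Lightly regularized for this tiny illustrative set.
clf = LogisticRegression(C=10.0)
clf.fit(X, labels, sample_weight=weights)

print(clf.predict(vec.transform(["heavy mud losses while drilling"])))
```

Passing SME confidence through `sample_weight` lets ambiguous events contribute to training without dominating it, which is one simple way to realize the weight-labeling idea.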

// STEP 02

CNN autoencoder for impedance graphs.

Trained a CNN autoencoder on years of Acoustic Impedance traces: ~85% storage reduction with reconstructions geoscientists couldn't distinguish from the originals, i.e. visually lossless.
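A sketch of the idea in PyTorch: a convolutional encoder squeezes an impedance image down to a small latent tensor (which is what gets stored), and a mirrored decoder reconstructs it on demand. The layer sizes and input shape here are illustrative assumptions, not the production architecture:

```python
import torch
import torch.nn as nn

class ImpedanceAutoencoder(nn.Module):
    """Illustrative conv autoencoder for 2-D impedance images."""

    def __init__(self):
        super().__init__()
        # Encoder: 1x128x128 input -> 16x8x8 latent, a small fraction
        # of the input size, in line with the storage-reduction goal.
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 8, 3, stride=2, padding=1),    # -> 8x64x64
            nn.ReLU(),
            nn.Conv2d(8, 16, 3, stride=2, padding=1),   # -> 16x32x32
            nn.ReLU(),
            nn.Conv2d(16, 16, 3, stride=4, padding=1),  # -> 16x8x8
        )
        # Decoder mirrors the encoder back to full resolution.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(16, 16, 4, stride=4),            # -> 16x32x32
            nn.ReLU(),
            nn.ConvTranspose2d(16, 8, 4, stride=2, padding=1),  # -> 8x64x64
            nn.ReLU(),
            nn.ConvTranspose2d(8, 1, 4, stride=2, padding=1),   # -> 1x128x128
            nn.Sigmoid(),  # pixel intensities in [0, 1]
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = ImpedanceAutoencoder()
batch = torch.rand(2, 1, 128, 128)  # stand-in for impedance PNGs
recon = model(batch)
print(recon.shape)  # torch.Size([2, 1, 128, 128])
```

Storing only the latent tensors (plus the decoder weights once) is what turns terabytes of redundant PNGs into a compact archive; reconstruction happens at read time.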

// STEP 03

LIME + SHAP for trust.

Built a Responsible-AI POC that surfaces feature attribution for every prediction. Identified two latent biases in the training data — both corrected in v2.
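LIME's core loop is small enough to show directly: perturb the input, query the black box, and fit a locally weighted linear model whose coefficients become per-word attributions. This self-contained sketch substitutes a toy keyword scorer for the real classifier; in the POC the explained models were the production ones:

```python
import numpy as np
from sklearn.linear_model import Ridge

# Toy black box: scores a report by loss-related keywords.
KEYWORDS = {"losses": 0.8, "lcm": 0.5, "gain": 0.4}

def predict_proba(texts):
    scores = [sum(KEYWORDS.get(w, 0.0) for w in t.split()) for t in texts]
    return np.array([min(s, 1.0) for s in scores])

def lime_explain(text, n_samples=500, seed=0):
    """Minimal LIME for text: mask random word subsets, query the
    black box, fit a proximity-weighted linear surrogate model."""
    rng = np.random.default_rng(seed)
    words = text.split()
    masks = rng.integers(0, 2, size=(n_samples, len(words)))
    masks[0] = 1  # keep the unperturbed instance in the sample
    perturbed = [" ".join(w for w, m in zip(words, row) if m)
                 for row in masks]
    preds = predict_proba(perturbed)
    # Proximity kernel: perturbations keeping more words count more.
    weights = np.exp(-(1 - masks.mean(axis=1)) ** 2 / 0.25)
    local = Ridge(alpha=1.0).fit(masks, preds, sample_weight=weights)
    return sorted(zip(words, local.coef_), key=lambda p: -abs(p[1]))

for word, attribution in lime_explain("mud losses while pumping lcm pill"):
    print(f"{word:8s} {attribution:+.3f}")
```

The surrogate's coefficients rank which words drove this one prediction, which is exactly the per-prediction attribution the geoscientists needed before trusting the model; SHAP plays a similar role with game-theoretic weighting.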

The outcomes.

F1 0.84
Loss classifier

Up from 0.62 unsupervised; now the production model on three active rigs.

~85%
Storage saved

CNN-autoencoded impedance graphs replaced raw PNG storage with no visual loss.

2 biases
Found & fixed

LIME / SHAP audit surfaced two systematic biases in the training data — both corrected before deployment.

Next case study →
ERP Analytics & Fraud Detection · Oracle