Data Engineering · 2025 · Johnson Matthey

Lab Data Pipeline

Manual lab operations replaced with a self-updating cloud pipeline.

Python · AWS EC2 · Google Sheets API · gspread · pandas · cron · Git
Context
A lab operations team was exporting CSVs and emailing spreadsheets. Analysts reinvented the same queries every week. The business wanted a single, always-current picture.
Timeline
Delivered as a team data project, 2025
Role
Infrastructure, ingestion pipeline, analysis, deployment
Samples
310k+
Schedule
Daily cron ingest
Deploy
AWS EC2

What I built

  • A Python ingestion pipeline that pulls lab data from Google Sheets daily (sketched after this list).
  • An AWS EC2 deployment running on Amazon Linux, SSH-secured, Git-managed.
  • Exploratory analysis across 310k+ samples answering 20 standing business questions (a representative query appears after the architecture overview).
  • A cron-scheduled refresh so downstream dashboards never lag more than 24 hours.
  • Team-accessible docs covering the setup, credentials model, and recovery.
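
A minimal sketch of the daily pull, assuming gspread authenticated with a service-account credential file; the file name, sheet key, and worksheet name below are illustrative stand-ins, not the production values:

import gspread
import pandas as pd

def pull_lab_data(sheet_key: str) -> pd.DataFrame:
    # Authenticate with a service-account JSON (path is hypothetical).
    gc = gspread.service_account(filename="service_account.json")
    # Read the whole worksheet as a list of dicts keyed by the header row.
    rows = gc.open_by_key(sheet_key).worksheet("lab_results").get_all_records()
    return pd.DataFrame(rows)

if __name__ == "__main__":
    df = pull_lab_data("SHEET_KEY")
    df.to_csv("lab_data_snapshot.csv", index=False)  # snapshot for downstream steps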

Architecture

Ingestion
Python · gspread · Google Sheets API · Drive API
Compute
AWS EC2 (t3) · Amazon Linux · venv · cron
Analysis
pandas · numpy · matplotlib · seaborn
Ops
SSH · Git · systemd unit · log rotation
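
For flavor, one of the standing business questions reduces to a single pandas query — every column name here is a hypothetical stand-in for the real schema:

import pandas as pd

df = pd.read_csv("lab_data_snapshot.csv", parse_dates=["tested_at"])

# Monthly sample volume and pass rate per test type (all columns hypothetical).
summary = (
    df.assign(month=df["tested_at"].dt.to_period("M"))
      .groupby(["month", "test_type"])
      .agg(samples=("sample_id", "count"), pass_rate=("passed", "mean"))
      .reset_index()
)
print(summary.head())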

Interesting decisions

Decision

EC2 over serverless

The refresh runs long-form pandas code over hundreds of thousands of rows. On a typical serverless function, cold-start latency and execution-time limits would dominate a job like this. A small always-on EC2 instance is cheaper for one long daily run, predictable, and owns its state.

Decision

Cron over an orchestrator

Airflow or Prefect would be overkill for one daily job. Cron plus structured logging keeps the stack small enough for a single engineer to own. If this ever grows to dozens of jobs, the swap is clean.
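
Illustratively, the whole schedule is a single crontab entry (the paths and script name are hypothetical), with output appended to a log that rotation keeps bounded:

# Run the daily refresh at 05:00; stdout and stderr go to the pipeline log.
0 5 * * * /home/ec2-user/pipeline/venv/bin/python /home/ec2-user/pipeline/refresh.py >> /home/ec2-user/pipeline/logs/refresh.log 2>&1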

Result

Data that used to take a full day to compile is ready by 6am each day. The analyst team reclaimed several hours per week, and the leadership reports pull from a single, trusted source.
