Lab Data Pipeline
Manual lab operations replaced with a self-updating cloud pipeline.
- Context: A lab operations team was exporting CSVs and emailing spreadsheets. Analysts reinvented the same queries every week. The business wanted a single, always-current picture.
- Timeline: Delivered as a team data project, 2025
- Role: Infrastructure, ingestion pipeline, analysis, deployment
- Samples: 310k+
- Schedule: Daily cron ingest
- Deploy: AWS EC2
What I built
- A Python ingestion pipeline that pulls lab data from Google Sheets daily (a minimal sketch follows this list).
- An AWS EC2 deployment running on Amazon Linux, SSH-secured, Git-managed.
- Exploratory analysis across 310k+ samples answering 20 standing business questions.
- A cron-scheduled refresh so downstream dashboards never lag more than 24 hours.
- Team-accessible docs covering the setup, credentials model, and recovery.
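
The core of the daily pull is deliberately small. A minimal sketch, assuming gspread with a service account; the sheet title, worksheet name, credentials path, and output path are all illustrative, not the real ones:

```python
# Sketch of the daily ingest, assuming a gspread service account.
# Sheet/worksheet names and paths are illustrative.
import gspread
import pandas as pd

def pull_lab_sheet(sheet_title: str, worksheet: str) -> pd.DataFrame:
    """Pull one worksheet of lab data into a DataFrame."""
    client = gspread.service_account(filename="credentials.json")  # hypothetical path
    ws = client.open(sheet_title).worksheet(worksheet)
    records = ws.get_all_records()  # first row becomes the column headers
    return pd.DataFrame(records)

if __name__ == "__main__":
    df = pull_lab_sheet("Lab Samples", "2025")   # illustrative names
    df.to_parquet("data/lab_samples.parquet")    # local store on the EC2 box (needs pyarrow)
    print(f"Ingested {len(df):,} rows")
```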
Architecture
Google Sheets (lab source of truth) → daily Python ingest, triggered by cron → local data store on an Amazon Linux EC2 instance → analysis outputs feeding the team's dashboards and reports.
Interesting decisions
EC2 over serverless
The refresh runs long pandas transformations over hundreds of thousands of rows. On a typical serverless platform, cold starts and execution-time limits work against a job shaped like this. A small always-on EC2 instance is cheaper, predictable, and owns its state.
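
To make "long pandas transformations" concrete, here is the rough shape of one rollup. A sketch only: the parquet paths and column names (received_at, site, sample_id, passed_qc) are illustrative, not the project's real schema.

```python
# Illustrative shape of the refresh workload, reading the store from the
# ingest sketch above. Column names are hypothetical.
import pandas as pd

df = pd.read_parquet("data/lab_samples.parquet")

# One standing question: sample volume and QC failure rate per site, per week.
weekly = (
    df.assign(week=pd.to_datetime(df["received_at"]).dt.to_period("W").astype(str))
      .groupby(["site", "week"])
      .agg(samples=("sample_id", "count"),
           failure_rate=("passed_qc", lambda s: 1 - s.mean()))
      .reset_index()
)
weekly.to_parquet("data/weekly_summary.parquet")
```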
Cron over an orchestrator
Airflow or Prefect would be overkill for one daily job. Cron plus structured logging keeps the stack small enough for a single engineer to own. If this ever grows to dozens of jobs, the swap is clean.
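
The whole scheduling story is one crontab line plus an entrypoint that emits structured log events. A minimal sketch, assuming a refresh() step like the ingest above; the 5 a.m. slot, paths, and log fields are illustrative.

```python
# Sketch of the cron entrypoint. The crontab line and paths are illustrative:
#
#   # m h dom mon dow  command
#   0 5 * * *  /usr/bin/python3 /opt/pipeline/run_refresh.py >> /var/log/pipeline.log 2>&1
#
import json
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("pipeline")

def log_event(event: str, **fields) -> None:
    """One JSON object per event keeps the logs grep- and parse-friendly."""
    log.info(json.dumps({"event": event, "ts": time.time(), **fields}))

def main() -> None:
    log_event("refresh_started")
    try:
        ...  # refresh() would pull the sheet and rebuild the local store (see ingest sketch)
        log_event("refresh_finished")
    except Exception as exc:
        log_event("refresh_failed", error=str(exc))
        raise

if __name__ == "__main__":
    main()
```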
Result
Data that used to take a full day to compile is ready by 6 a.m. every day. The analyst team reclaimed several hours per week, and leadership reports pull from a single, trusted source.
More projects
Jumbo / InterStar
Full e-commerce and ERP system. Storefront returning soon after a compliance update.
Competitive Intelligence Agent
An AI agent that runs weeks of market research in minutes.
PickAxe
Data-heavy client site with a custom interactive profit calculator.