
Paediatric Clinical
Data Pipeline

UCT Department of Paediatrics — Cape Town

The Problem

UCT researchers were collecting paediatric health data across multiple clinical systems — each with different formats, field names, and standards. No unified pipeline existed. Merging datasets for publication meant hours of manual reconciliation, with errors slipping through into research outputs.

What I Built

A Python-based validation and cleaning pipeline that ingests multi-source clinical data, standardises schemas, detects missing values and duplicates, applies longitudinal consistency checks, and outputs publication-ready datasets. Paired with structured Excel reporting suites for non-technical researchers.
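A minimal sketch of the schema-standardisation and duplicate/missing-value checks, assuming hypothetical column names (the real UCT schemas are confidential):

```python
import pandas as pd

# Hypothetical source-to-standard column mapping — illustrative names only.
SCHEMA_MAP = {"PatientID": "patient_id", "dob": "date_of_birth", "DOB": "date_of_birth"}

def clean_batch(df: pd.DataFrame) -> tuple:
    """Standardise the schema, drop exact duplicates, and flag missing DOB values."""
    df = df.rename(columns=SCHEMA_MAP)
    before = len(df)
    df = df.drop_duplicates()
    report = {
        "duplicates_removed": before - len(df),
        "missing_dob": int(df["date_of_birth"].isna().sum()),
    }
    return df, report

# Illustrative batch with one duplicate row and one missing DOB.
batch = pd.DataFrame({
    "PatientID": [1, 1, 2, 3],
    "DOB": ["2019-03-01", "2019-03-01", None, "2020-07-15"],
})
cleaned, report = clean_batch(batch)
```

In the real pipeline each batch report would feed the Excel reporting suite rather than a Python dict.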

Tools & Stack

Python (Pandas, NumPy), PostgreSQL, Excel (structured templates), custom validation rule engine, automated flagging for outlier detection.
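The automated outlier flagging can be sketched as a simple z-score rule — a standard technique, shown here with illustrative values and a hypothetical threshold:

```python
from statistics import mean, stdev

def flag_outliers(values, threshold=3.0):
    """Return indices of values more than `threshold` standard deviations from the mean."""
    mu, sigma = mean(values), stdev(values)
    return [i for i, v in enumerate(values) if abs(v - mu) > threshold * sigma]

# A lower threshold suits small illustrative samples.
flagged = flag_outliers([10, 11, 9, 10, 10, 55], threshold=1.5)
```

The production rule engine layered domain-specific checks (e.g. clinically implausible ranges) on top of statistical flags like this.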

The Outcome

500,000+ records processed with 100% data integrity. Datasets fed directly into peer-reviewed academic publications. Reporting time for researchers cut by over 60%. Zero data-related revision requests from journal reviewers.

500K+
Records Cleaned
100%
Integrity Rate
60%
Less Reporting Time
0
Journal Errors
View on GitHub ⚠ Data confidential — UCT IRB restricted
Dashboard mockup — Paediatric Data Pipeline (UCT): live counters (500K records, 100% integrity, 7 sources, 2,341 errors flagged); pipeline stages ingest → validate → clean → standardise → output; validation log sample (record_batch_001.csv — 72,441 rows, 0 critical errors; clinical_labs_q3.xlsx — 14 duplicates removed; demographics_source3.csv — 841 missing DOB values flagged); 500,612 records ready for analysis.

Energy Resource
Optimisation Dashboard

Amandla Africa Energy — Cape Town

The Problem

Amandla's operations team managed 10M+ energy data points spread across disconnected spreadsheets and reporting formats. Planning decisions were made on stale, manually compiled data — introducing lag and compounding errors at scale.

What I Built

A Tableau dashboard connected to a centralised SQL data model, surfacing real-time resource utilisation, demand forecasting, and anomaly detection across multiple energy sites. Automated weekly Excel reports via VBA macros replaced 5 hours of manual compilation per week.
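The anomaly detection behind the dashboard can be sketched as a trailing-baseline rule: flag any site-day whose output drops well below its recent average. The window and threshold below are illustrative, not the production values:

```python
def detect_anomalies(readings, window=7, drop_pct=0.15):
    """Flag indices whose reading falls more than drop_pct below the trailing-window mean."""
    anomalies = []
    for i in range(window, len(readings)):
        baseline = sum(readings[i - window:i]) / window
        if readings[i] < baseline * (1 - drop_pct):
            anomalies.append(i)
    return anomalies

# A steady site that suddenly drops 20% below baseline.
alerts = detect_anomalies([100] * 7 + [80])
```

In production this logic ran as SQL aggregation feeding Tableau, rather than in-process Python.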

Tools & Stack

Tableau Desktop, MySQL, Python (for ETL pre-processing), Excel VBA macros, SQL stored procedures for aggregation.

The Outcome

Dashboard adopted as the primary tool for all operational planning. Manual reporting time cut by 40%. Anomaly detection surfaced 3 under-performing sites that had gone undetected for 2+ months — enabling corrective action worth significant resource savings.

10M+
Data Points
40%
Less Manual Work
3
Sites Recovered
5hrs
Saved Weekly
SQL & ETL Scripts View Dashboard
Dashboard mockup — Amandla Energy resource monitor: total output 84,291 MWh (↑6.2%); 24 of 27 sites active, 3 anomalies; efficiency 91.4% (↑2.1% MoM); 10.2M data points processed; 12-month output trend; site anomaly list (Site 07 Bellville — LOW, Site 14 Mitchells Pk — WARN, Site 22 Paarl — LOW; 24 sites nominal).

Child Mortality &
Health Inequality Analysis

WHO Global Health Observatory — Open Dataset

The Brief

Using WHO's Global Health Observatory dataset, this project analyses under-5 child mortality trends across Sub-Saharan Africa from 2000–2023, correlating outcomes with healthcare access, GDP, and maternal education data from World Bank sources.

The Approach

Multi-source data merge in Python, exploratory analysis in R, and an interactive Plotly dashboard that lets users filter by country, year, and indicator. Regression modelling to identify the strongest predictors of mortality reduction.

Why It Matters

This project demonstrates the full analyst stack: data wrangling, statistical analysis, visualisation, and storytelling with public health data — the exact workflow used in global health organisations like PATH, WHO, and UNICEF.

Expected Output

Interactive Plotly dashboard, Jupyter notebook with documented methodology, written findings summary. All code published on GitHub.

GitHub — in progress Live dashboard
Concept mockup — Child Mortality, Sub-Saharan Africa: choropleth of mortality rate by country; under-5 mortality trend 2000–2023 (ZAF, ZWE, NGA); average reduction 2000–23 of –52%; top predictor maternal education (R² = 0.81); 46 countries analysed (WHO SSA region).

Healthcare Claims
Data Warehouse

Personal Project — dbt + PostgreSQL + BigQuery

The Brief

A production-style data warehouse built using dbt (data build tool) on top of a synthetic healthcare claims dataset — modelling patient journeys, claim outcomes, and provider performance metrics across a simulated insurance environment.

The Approach

Raw claims data → staging models → intermediate joins → mart-layer tables ready for BI consumption. Full dbt documentation, tests, and lineage graphs. Deployed on BigQuery with a Power BI layer on top.
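The staging → mart flow above can be illustrated outside dbt with a small pandas equivalent — synthetic data and hypothetical column names throughout:

```python
import pandas as pd

# Synthetic raw claims, mimicking a messy source table.
claims_raw = pd.DataFrame({
    "ClaimID": [1, 2, 3, 4],
    "MemberID": [10, 10, 11, 12],
    "Amount": [250.0, 90.0, 1200.0, 40.0],
    "Status": ["PAID", "DENIED", "PAID", "PAID"],
})

# Staging layer (stg_claims): standardise naming, nothing else.
stg_claims = claims_raw.rename(columns=str.lower).rename(
    columns={"claimid": "claim_id", "memberid": "member_id"}
)

# Mart layer (mart_kpi): per-member KPIs ready for BI consumption.
mart_kpi = (
    stg_claims[stg_claims["status"] == "PAID"]
    .groupby("member_id", as_index=False)
    .agg(paid_claims=("claim_id", "count"), paid_amount=("amount", "sum"))
)
```

In the actual project each layer is a dbt SQL model with its own tests and documentation; the point here is only the shape of the transformation.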

Why It Matters

dbt is the industry standard for analytics engineering. This project signals readiness for senior analyst and analytics engineer roles at healthtech companies (Sanlam, Discovery Health, CCHP) and remote data teams globally.

Expected Output

Full dbt project on GitHub with staging/intermediate/mart layers, test coverage report, dbt docs site, and a Power BI summary dashboard.

GitHub — in progress dbt docs
Concept mockup — dbt healthcare claims lineage: sources (claims_raw, patients_raw, providers_raw, icd_codes_raw) → staging (stg_claims, stg_patients, stg_providers) → intermediate (int_claims_enriched, int_patient_journey) → marts (fct_claims, dim_patients, mart_kpi); dbt test results: 84 tests, 84 passed, 0 failed, all not_null and relationship constraints passing.

Hospital Readmission
Risk Predictor

MIMIC-III Open Clinical Dataset — ML Project

The Brief

Using the MIMIC-III open clinical dataset (MIT), build a machine learning model that predicts 30-day hospital readmission risk from patient demographics, diagnoses, and prior admission history — a high-value problem in health operations.

The Approach

Feature engineering on clinical data, XGBoost classifier with hyperparameter tuning, SHAP values for model explainability, and ROC/AUC evaluation. Focused on interpretability — a model clinicians can actually trust and act on.
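The ROC/AUC evaluation mentioned above reduces to a simple rank statistic: AUC is the probability that a randomly chosen readmitted patient scores higher than a randomly chosen non-readmitted one. A minimal sketch with toy labels and scores:

```python
def roc_auc(y_true, scores):
    """AUC via the Mann-Whitney U statistic: the fraction of positive/negative
    pairs where the positive outranks the negative (ties count half)."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: every readmitted patient outranks every non-readmitted one.
auc = roc_auc([1, 0, 1, 0], [0.9, 0.2, 0.6, 0.4])
```

The project itself uses XGBoost predictions as the scores and pairs the AUC with SHAP plots so the ranking is explainable, not just accurate.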

Why It Matters

Predictive analytics in healthcare is the frontier of data science work. This project demonstrates applied ML in a clinical context — exactly what health data teams at Discovery Health, Medidata, and global NGOs are looking for in senior hires.

Expected Output

Fully documented Jupyter notebook, model card, SHAP visualisations, and a write-up explaining findings to a non-technical audience. GitHub published.

GitHub — in progress View Notebook
Concept mockup — readmission risk model results: ROC curve with AUC = 0.847; SHAP feature importance (prior admissions 0.41, length of stay 0.33, diagnosis count 0.24, age 0.17, insurance type 0.12, procedure count 0.08); accuracy 82.4% (XGBoost), precision 79.1% on the high-risk class; dataset of 46K MIMIC-III patient admissions.

Like what you see?
Let's build something together.

Get in touch Download CV