A multi-source data analytics project encompassing an ETL pipeline, a data warehouse, and an interactive dashboard, examining how primary care shortages correlate with chronic disease burden and preventable hospitalizations across US counties.
Technical Highlights
•Built end-to-end ETL pipeline integrating 5+ public health datasets (sources include CDC PLACES, County Health Rankings, HRSA, and USDA) into Snowflake using dbt with staging, intermediate, and mart layers.
•Engineered 10+ derived variables across the dbt staging, intermediate, and mart layers; validated data quality with dbt generic tests on staging models.
•Developed interactive Plotly Dash dashboard with a custom dark theme and a KPI summary row, visualizing shortage severity, disease burden, and preventable hospitalizations.
•Identified poverty rate (r = 0.33) and high disease burden (counties averaging 482 stays above the national average) as the strongest predictors of preventable hospitalizations.
•Conducted exploratory data analysis and inferential statistical testing in Python (summary statistics, correlations, t-tests, ANOVA, chi-squared), alongside SQL analyses, to answer 5 research questions across 2,957 US counties.
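Illustrative sketch
The correlation and t-test steps listed above can be sketched roughly as follows. This is a minimal, standard-library-only illustration and not the project's actual code: the function names (pearson_r, welch_t) and the toy county values are hypothetical, and the real analysis may well have used a statistics library and a different t-test variant.

```python
import math
import statistics


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    # Sum of cross-products and sums of squares around the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(ss_x * ss_y)


def welch_t(group_a, group_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom.

    Assumption: unequal variances between groups (Welch's variant);
    the source only says "t-tests" and does not name the variant.
    """
    mean_a, mean_b = statistics.fmean(group_a), statistics.fmean(group_b)
    # Sample variance of each group divided by its size.
    se_a = statistics.variance(group_a) / len(group_a)
    se_b = statistics.variance(group_b) / len(group_b)
    t = (mean_a - mean_b) / math.sqrt(se_a + se_b)
    df = (se_a + se_b) ** 2 / (
        se_a ** 2 / (len(group_a) - 1) + se_b ** 2 / (len(group_b) - 1)
    )
    return t, df


if __name__ == "__main__":
    # Made-up toy values for illustration only, not project data.
    poverty_rate = [10.0, 15.0, 20.0, 25.0, 30.0]
    preventable_stays = [2000.0, 2600.0, 2400.0, 3100.0, 3300.0]
    print("r =", round(pearson_r(poverty_rate, preventable_stays), 3))

    high_burden = [3100.0, 3300.0, 2900.0, 3500.0]
    low_burden = [2000.0, 2600.0, 2400.0, 2200.0]
    t_stat, dof = welch_t(high_burden, low_burden)
    print("t =", round(t_stat, 3), "df =", round(dof, 2))
```

Converting the t statistic to a p-value needs the t distribution's CDF, which the standard library does not provide; a statistics package would typically handle that step.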