air quality index prediction
october 2025
overview
This project aims to predict the next day's Air Quality Index (AQI) using historical pollutant measurements. The dataset contains hourly readings of pollutants such as CO, NH₃, NO₂, O₃, PM10, PM2.5, and SO₂, aggregated into daily averages for modeling.
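The hourly-to-daily aggregation can be sketched with pandas as follows. This is a minimal illustration on synthetic data, not the project's actual loading code: the column names (`pm2_5`, `no2`) and the values are assumptions made up for the example.

```python
import numpy as np
import pandas as pd

# Synthetic hourly readings covering three full days.
# Column names and values are assumed for illustration only.
hours = pd.date_range("2025-10-01", periods=72, freq="h")
rng = np.random.default_rng(0)
hourly = pd.DataFrame(
    {
        "pm2_5": rng.uniform(40, 160, size=len(hours)),
        "no2": rng.uniform(10, 60, size=len(hours)),
    },
    index=hours,
)

# Collapse each calendar day to the mean of its hourly readings.
daily = hourly.resample("D").mean()
print(daily.round(1))
```

`resample("D")` groups rows by calendar day of the datetime index, so each output row is one day's average per pollutant.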
A Random Forest Regressor was used to capture nonlinear relationships between pollutant levels and AQI. Features include both pollutant concentrations and temporal attributes such as month and weekday.
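A rough sketch of how such a feature table and model could be put together with scikit-learn is shown here. The data is synthetic, and the column names, the toy AQI formula, and the hyperparameters (`n_estimators=100`) are assumptions for the example rather than the settings used in this project.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic daily table standing in for the real aggregated data.
days = pd.date_range("2025-01-01", periods=200, freq="D")
rng = np.random.default_rng(42)
daily = pd.DataFrame(
    {
        "pm2_5": rng.uniform(30, 200, size=len(days)),
        "pm10": rng.uniform(50, 300, size=len(days)),
    },
    index=days,
)
# Toy AQI loosely tied to particulates (invented formula, illustration only).
daily["aqi"] = (
    0.6 * daily["pm2_5"] + 0.3 * daily["pm10"] + rng.normal(0, 5, len(days))
)

# Temporal attributes derived from the date index.
daily["month"] = daily.index.month
daily["weekday"] = daily.index.dayofweek  # Monday=0 ... Sunday=6

# Target: the following day's AQI. The last row has no "next day", so drop it.
daily["aqi_next"] = daily["aqi"].shift(-1)
data = daily.dropna(subset=["aqi_next"])

# Relative to the target, "aqi" is the previous day's AQI.
feature_cols = ["pm2_5", "pm10", "aqi", "month", "weekday"]
X, y = data[feature_cols], data["aqi_next"]

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X, y)

# Predict the day after the most recent complete row.
next_day = model.predict(X.tail(1))
print(next_day)
```

Shifting the AQI column by one day keeps features and target strictly ordered in time, so the model never sees the value it is asked to predict.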
motivation
With Diwali approaching, I wanted to observe how AQI levels deviate during the festival period, since air quality typically worsens significantly due to fireworks and increased emissions.
My blog post, "Effects of Diwali on AQI: Insights from My Model", explores this in detail.
tech stack
- Python
- scikit-learn
- pandas, numpy
- matplotlib, seaborn
model performance
pre-festival (seen data)
Predictions before Diwali were highly accurate, as shown by standard regression metrics:
- MAE, MSE, and R² scores indicated solid performance (R² ≈ 0.86 on seen data).
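For reference, the three metrics can be computed with scikit-learn as sketched here. The arrays are made-up numbers chosen so the arithmetic is easy to check by hand; they are not results from this project.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up actual and predicted AQI values (illustration only).
y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_pred = np.array([110.0, 140.0, 195.0, 255.0])

mae = mean_absolute_error(y_true, y_pred)  # mean of |error|
mse = mean_squared_error(y_true, y_pred)   # mean of error squared
r2 = r2_score(y_true, y_pred)              # 1 - SS_res / SS_tot

print(f"MAE={mae:.1f} MSE={mse:.1f} R2={r2:.2f}")
# prints: MAE=7.5 MSE=62.5 R2=0.98
```

MAE stays in AQI points, which makes it the easiest of the three to compare against statements like "off by 20–25 AQI points".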
predicted vs actual
festival period (unseen data)
Predictions during Diwali were less consistent, often off by 20–25 AQI points. This occurred because Diwali arrived earlier than usual and month/weekday features couldn't effectively capture the sudden festival-related changes.
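One way to quantify this gap is to split the absolute error by period. The sketch below uses invented predictions and an assumed festival start date (`festival_start`), purely to show the mechanics; none of the numbers come from the real model.

```python
import numpy as np
import pandas as pd

# Invented actual/predicted AQI values for six consecutive days.
results = pd.DataFrame(
    {
        "actual": [150.0, 160.0, 155.0, 210.0, 240.0, 230.0],
        "predicted": [148.0, 163.0, 150.0, 188.0, 215.0, 252.0],
    },
    index=pd.date_range("2025-10-15", periods=6, freq="D"),
)

# Assumed start of the festival window (hypothetical date for the example).
festival_start = pd.Timestamp("2025-10-18")

results["abs_error"] = (results["actual"] - results["predicted"]).abs()
results["period"] = np.where(
    results.index >= festival_start, "festival", "pre-festival"
)

# Mean absolute error within each period.
mae_by_period = results.groupby("period")["abs_error"].mean()
print(mae_by_period.round(1))
```

With these toy numbers the festival MAE comes out at 23.0 against roughly 3.3 before the festival, which mirrors the kind of gap described above.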
stress test (removing previous day AQI feature)
In the feature importance graph, the previous day's AQI was the most important feature. I ran a stress test to see how the model performs without this key feature.
with AQI feature
without AQI feature
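A stress test of this kind can be sketched by training the same model twice, once with and once without the previous-day AQI column, and comparing held-out scores. Everything below is an assumption for illustration: the autocorrelated synthetic series, the column names, the 240-day chronological split, and the helper `fit_and_score` are not taken from the project code.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

# Synthetic AQI series where each day depends strongly on the previous one.
rng = np.random.default_rng(1)
n = 300
aqi = np.empty(n)
aqi[0] = 150.0
for t in range(1, n):
    aqi[t] = 0.9 * aqi[t - 1] + 15.0 + rng.normal(0, 8)

df = pd.DataFrame(
    {"aqi": aqi, "pm2_5": rng.uniform(30, 200, n)},  # pm2_5 is pure noise here
    index=pd.date_range("2025-01-01", periods=n, freq="D"),
)
df["month"] = df.index.month
df["weekday"] = df.index.dayofweek
df["aqi_next"] = df["aqi"].shift(-1)  # next-day target
df = df.dropna()

# Chronological split: earlier days for training, later days held out.
train, test = df.iloc[:240], df.iloc[240:]


def fit_and_score(cols):
    """Fit a forest on the given columns and return (held-out R², model)."""
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(train[cols], train["aqi_next"])
    return r2_score(test["aqi_next"], model.predict(test[cols])), model


full_cols = ["pm2_5", "aqi", "month", "weekday"]
reduced_cols = [c for c in full_cols if c != "aqi"]

r2_full, full_model = fit_and_score(full_cols)
r2_reduced, _ = fit_and_score(reduced_cols)

# Impurity-based importances from the model that includes the AQI feature.
importances = pd.Series(
    full_model.feature_importances_, index=full_cols
).sort_values(ascending=False)

print(importances.round(3))
print(f"R2 with AQI: {r2_full:.2f}, without AQI: {r2_reduced:.2f}")
```

Using the same split and the same `random_state` for both runs means the difference in R² reflects the missing feature rather than a different sample or seed.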
summary
- model: Random Forest Regressor
- data: Daily averages of major air pollutants
- goal: Next-day AQI prediction
- result: Strong general accuracy (R² ≈ 0.86) on seen data; reduced accuracy during unmodeled festival events