Developed by Tom Knight
This dashboard provides a comprehensive 52-week forecast for Influenza and COVID-19. By analysing trends in PCR Positivity, Hospital Admissions, and ICU usage, these models support healthcare resource planning and early warning systems.
The predictions are generated using a Hybrid Machine Learning strategy (HistGradientBoostingRegressor). The system is fully automated: it retrieves live data directly from the UKHSA public API and retrains every week via GitHub Actions.
Last updated: ... • Data reflects specimen dates.
Our forecasting engine uses a Hybrid Approach: we first isolate the predictable seasonal cycle, then use Scikit-Learn's HistGradientBoostingRegressor to model the residuals (the difference between the seasonal baseline and reality).
Histogram-Based Gradient Boosting is a sophisticated supervised machine learning algorithm designed to solve complex regression problems efficiently. It is an optimised evolution of traditional Gradient Boosting that groups continuous data into discrete 'bins' (histograms) rather than analysing every individual data point.
Standard decision trees sort the feature values at every node split, which has a complexity of \(O(N \log N)\) per feature. HGBR instead discretises the continuous input features into integer-valued bins (at most 255), so finding the best split only requires a linear scan over the bins.
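A rough sketch of the binning step (scikit-learn's internal bin mapper is more sophisticated; the quantile-based edges here are an illustrative assumption):

```python
import numpy as np

def bin_feature(values, max_bins=255):
    """Replace a continuous feature with small integer bin indices."""
    # Place bin edges at evenly spaced quantiles of the training data
    inner_quantiles = np.linspace(0, 1, max_bins + 1)[1:-1]
    edges = np.unique(np.quantile(values, inner_quantiles))
    # Each value becomes the index of the bin it falls into (fits in one byte)
    return np.searchsorted(edges, values).astype(np.uint8)

x = np.random.default_rng(42).normal(size=10_000)
binned = bin_feature(x)
```

After binning, evaluating a split at a tree node means scanning at most 255 candidate thresholds instead of sorting all \(N\) raw values.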
The model is an ensemble of \(M\) decision trees. It is an additive model: the prediction at step \(m\), denoted \(F_m(x)\), is the previous prediction plus a new tree \(h_m(x)\) scaled by the learning rate \(\eta\):
\[ F_m(x) = F_{m-1}(x) + \eta\, h_m(x) \]
At each iteration \(m\), the new tree \(h_m(x)\) is trained to predict the negative gradient of the loss function. For our Least Squares loss \(L = \tfrac{1}{2}\left(y - F(x)\right)^2\), this negative gradient is simply the residual:
\[ r_{im} = y_i - F_{m-1}(x_i) \]
A single point forecast (the "mean") implies false certainty. To address this, we generate 90% Prediction Intervals (the shaded cones) by training three separate instances of the model with different loss functions: squared error for the central estimate, and quantile (pinball) loss at the 5th and 95th percentiles for the lower and upper bounds.
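The 'quantile' loss is the pinball loss. This toy sketch (not scikit-learn's implementation) shows why minimising it at \(\tau = 0.95\) pulls a constant prediction towards the 95th percentile rather than the mean:

```python
import numpy as np

def pinball_loss(y_true, y_pred, tau):
    """Asymmetric loss: under-prediction costs tau, over-prediction costs 1 - tau."""
    diff = y_true - y_pred
    return float(np.mean(np.where(diff >= 0, tau * diff, (tau - 1) * diff)))

y = np.arange(1, 101, dtype=float)  # toy observations 1..100
# At tau = 0.95, predicting near the 95th percentile beats predicting the mean
loss_at_mean = pinball_loss(y, 50.5, tau=0.95)
loss_at_p95 = pinball_loss(y, 95.0, tau=0.95)
```

Because under-prediction is penalised 19 times more heavily than over-prediction at \(\tau = 0.95\), the optimum sits where only 5% of observations lie above the prediction.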
To demystify the "Black Box" nature of machine learning, we employ a technique called Permutation Importance. This evaluates the significance of each input feature (e.g., "Last Week's Cases" vs. "Seasonal Month") by randomly shuffling its values and measuring the resulting drop in model accuracy.
Below are core snippets from the process_data.py script.
METRICS = {
    "positivity": { "topic": "Influenza", "metric_id": "influenza_testing_positivityByWeek" },
    "hospital":   { "topic": "Influenza", "metric_id": "influenza_healthcare_hospitalAdmissionRateByWeek" },
}
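Each entry in METRICS maps onto an API endpoint. The URL builder below is an assumption about how process_data.py might construct its requests; the path layout follows the public UKHSA dashboard API as documented at the time of writing, and should be verified against the live API reference:

```python
# Assumption: illustrative URL construction for the UKHSA dashboard API.
BASE = "https://api.ukhsa-dashboard.data.gov.uk"

def build_metric_url(topic, metric_id, geography="England"):
    """Construct the endpoint for one metric in the METRICS mapping."""
    return (f"{BASE}/themes/infectious_disease/sub_themes/respiratory"
            f"/topics/{topic}/geography_types/Nation"
            f"/geographies/{geography}/metrics/{metric_id}")

url = build_metric_url("Influenza", "influenza_testing_positivityByWeek")
```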
We leverage HistGradientBoostingRegressor due to its speed and native support for missing values (NaNs). We define three distinct model configurations to handle the point forecast (Mean) and uncertainty bounds (Quantiles).
from sklearn.ensemble import HistGradientBoostingRegressor

def train_and_forecast(df):
    # 1. Main Forecast Model (The Dashed Line)
    # Uses 'squared_error' (Least Squares) to predict the expected mean value.
    model_resid = HistGradientBoostingRegressor(
        loss='squared_error',
        learning_rate=0.1,
        max_iter=100,
        random_state=42
    )

    # 2. Upper Bound Model (Top of the Cone)
    # Uses 'quantile' loss targeting the 95th percentile.
    # This predicts a value that 95% of future data points should fall below.
    model_upper = HistGradientBoostingRegressor(
        loss='quantile',
        quantile=0.95,
        random_state=42
    )

    # 3. Lower Bound Model (Bottom of the Cone)
    # Uses 'quantile' loss targeting the 5th percentile.
    # This predicts a value that only 5% of future data points should fall below.
    model_lower = HistGradientBoostingRegressor(
        loss='quantile',
        quantile=0.05,
        random_state=42
    )

    # Fit all three models on the same training data (X_train, y_train)
    model_resid.fit(X_train, y_train)
    model_upper.fit(X_train, y_train)
    model_lower.fit(X_train, y_train)
from sklearn.inspection import permutation_importance

# Calculate importance on the Residual Model
result = permutation_importance(
    model_resid,
    X_train,
    y_train,
    n_repeats=10,
    random_state=42
)

# Rank features by the average drop in accuracy when shuffled
importance_scores = result.importances_mean
ranking = importance_scores.argsort()[::-1]