England Respiratory Virus Forecast

Developed by Tom Knight

This dashboard provides a comprehensive 52-week forecast for Influenza and COVID-19. By analysing trends in PCR Positivity, Hospital Admissions, and ICU usage, these models support healthcare resource planning and early warning systems.

The predictions are generated using a Hybrid Machine Learning strategy (HistGradientBoostingRegressor). The system is fully automated: it retrieves live data directly from the UKHSA public API and retrains every week via GitHub Actions.

Last updated: ... • Data reflects specimen dates.

Influenza (Flu) • LIVE FORECAST

PCR Positivity Rate
Percentage of tests returning positive for Influenza
Hospital Admission Rate
Admissions per 100,000 population (Weekly)
ICU / HDU Admission Rate
Critical care admissions per 100,000 population

COVID-19 • LIVE FORECAST

PCR Positivity Rate
7-Day Rolling Average (%)
Hospital Admissions
Total Weekly Count (Sum of Daily Admissions)

Model Diagnostics

Accuracy (Backtest)
Actual vs Predicted (Past 52 Weeks)
Explainable AI
Feature Importance Drivers
Error Bias
Residual Distribution (Bell Curve = Good)

Technical Methodology

Our forecasting engine utilises a Hybrid Approach. We first isolate the predictable seasonal cycle, and then use Scikit-Learn's HistGradientBoostingRegressor to model the residuals (the difference between the seasonal baseline and reality).
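As a rough, self-contained sketch of this decomposition (synthetic data; the simple week-of-year baseline and all names are illustrative, not the production pipeline):

```python
import numpy as np

# Illustrative hybrid decomposition on synthetic weekly data
rng = np.random.default_rng(42)
weeks = np.arange(156)                              # three years of weekly data
season = 10 + 8 * np.sin(2 * np.pi * weeks / 52)    # predictable annual cycle
series = season + rng.normal(0, 1, size=weeks.size)

# Step 1: seasonal baseline = mean of each week-of-year across years
week_of_year = weeks % 52
baseline = np.array([series[week_of_year == w].mean() for w in range(52)])

# Step 2: residuals = reality minus the seasonal baseline; these are
# what the gradient-boosting model is then trained to predict
residuals = series - baseline[week_of_year]
```

The benefit of this split is that the booster only has to learn deviations from the climatological norm, a much easier target than the raw series.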


1. Histogram-Based Gradient Boosting (HGBR)

Histogram-Based Gradient Boosting is a supervised machine learning algorithm for regression. It is an optimised evolution of traditional Gradient Boosting that groups continuous feature values into discrete 'bins' (histograms) rather than evaluating every individual data point when searching for splits.

Step A: Data Discretisation (Binning)

Standard decision trees sort the feature values at every candidate node split, a cost of \(O(N \log N)\). HGBR instead discretises the continuous input features into integer-valued bins (at most 255), so finding the best split becomes a linear scan over the bins rather than a sort over the samples.

$$ X_{binned} = \text{digitize}(X, \text{bins}) \in \{0, \dots, 255\} $$
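A loose NumPy illustration of the binning step (sklearn's internal binner uses subsampled quantiles plus a dedicated missing-value bin; this sketch only mimics the idea):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=1000)

# Quantile-based bin edges, loosely mimicking how HGBR chooses thresholds
n_bins = 255
edges = np.quantile(X, np.linspace(0, 1, n_bins + 1)[1:-1])

# Map each continuous value to an integer bin index
X_binned = np.digitize(X, edges)
```

After this step the tree-growing code only ever touches small integers, which is what makes the split search so fast.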

Step B: The Additive Model

The model is an ensemble of \(M\) decision trees. It is an additive model where the prediction at step \(m\), denoted as \(F_m(x)\), is the sum of the previous prediction and a new tree \(h_m(x)\).

$$ F_m(x) = F_{m-1}(x) + \nu \cdot h_m(x) $$

Step C: Gradient Descent Optimisation

At each iteration \(m\), the new tree \(h_m(x)\) is trained to predict the negative gradient of the loss function. For our Least Squares loss function, this is the residual:

$$ r_i = - \frac{\partial L(y_i, F_{m-1}(x_i))}{\partial F_{m-1}(x_i)} = y_i - F_{m-1}(x_i) $$
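Steps B and C together can be sketched as a minimal from-scratch boosting loop: each shallow tree is fit to the current residuals (the negative gradient under least squares) and added to the ensemble with shrinkage \(\nu\). This illustrates the update rule only, not HGBR's histogram implementation:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, 500)

nu, M = 0.1, 100
F = np.full_like(y, y.mean())            # F_0: constant initial prediction
trees = []
for m in range(M):
    r = y - F                            # negative gradient of squared-error loss
    h = DecisionTreeRegressor(max_depth=2).fit(X, r)
    F = F + nu * h.predict(X)            # F_m = F_{m-1} + nu * h_m
    trees.append(h)

mse_final = np.mean((y - F) ** 2)
mse_const = np.mean((y - y.mean()) ** 2)
```

Each iteration chips away at whatever error the ensemble still makes, which is why the training loss falls monotonically.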

2. Uncertainty Quantification

A single point forecast conveys false certainty about an inherently noisy process. To quantify uncertainty, we also generate 90% Prediction Intervals (the shaded cones) by training three separate instances of the model with different loss functions:

  • Point Forecast (squared error): the expected (mean) outcome.
  • Lower Bound (quantile 0.05): the 5th percentile (optimistic scenario).
  • Upper Bound (quantile 0.95): the 95th percentile (worst-case scenario).

3. Explainable AI (Feature Importance)

To demystify the "Black Box" nature of machine learning, we employ a technique called Permutation Importance. This evaluates the significance of each input feature (e.g., "Last Week's Cases" vs. "Seasonal Month") by randomly shuffling its values and measuring the resulting drop in model accuracy.


4. Python Implementation

Below are core snippets from the process_data.py script.

A. Dynamic Configuration
METRICS = {
    "positivity": { "topic": "Influenza", "metric_id": "influenza_testing_positivityByWeek" },
    "hospital": { "topic": "Influenza", "metric_id": "influenza_healthcare_hospitalAdmissionRateByWeek" }
}
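For illustration, this configuration could drive URL construction for the API requests. The path template below is an assumption made for the sketch, not necessarily what process_data.py actually requests (the dictionary is repeated here so the snippet is self-contained):

```python
# The path shape below is an illustrative assumption about the UKHSA
# dashboard API's topic/metric hierarchy, not the script's actual code.
BASE_URL = "https://api.ukhsa-dashboard.data.gov.uk"

METRICS = {
    "positivity": {"topic": "Influenza",
                   "metric_id": "influenza_testing_positivityByWeek"},
    "hospital": {"topic": "Influenza",
                 "metric_id": "influenza_healthcare_hospitalAdmissionRateByWeek"},
}

def build_url(key: str) -> str:
    """Build a request URL from the METRICS config (path shape assumed)."""
    cfg = METRICS[key]
    return f"{BASE_URL}/topics/{cfg['topic']}/metrics/{cfg['metric_id']}"

url = build_url("positivity")
```

Keeping endpoints in one dictionary means adding a new metric to the dashboard is a one-line change.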

B. Scikit-Learn HGBR Configuration

We leverage HistGradientBoostingRegressor due to its speed and native support for missing values (NaNs). We define three distinct model configurations to handle the point forecast (Mean) and uncertainty bounds (Quantiles).

from sklearn.ensemble import HistGradientBoostingRegressor

def train_and_forecast(df):
    # Features (lags, seasonal encodings) and target are prepared upstream;
    # the "target" column name here is illustrative.
    X_train = df.drop(columns=["target"])
    y_train = df["target"]

    # 1. Main Forecast Model (The Dashed Line)
    # Uses 'squared_error' (Least Squares) to predict the expected mean value.
    model_resid = HistGradientBoostingRegressor(
        loss='squared_error',
        learning_rate=0.1,
        max_iter=100,
        random_state=42
    )

    # 2. Upper Bound Model (Top of the Cone)
    # Uses 'quantile' loss targeting the 95th percentile.
    # This predicts a value that 95% of future data points should fall below.
    model_upper = HistGradientBoostingRegressor(
        loss='quantile',
        quantile=0.95,
        random_state=42
    )

    # 3. Lower Bound Model (Bottom of the Cone)
    # Uses 'quantile' loss targeting the 5th percentile.
    # This predicts a value that only 5% of future data points should fall below.
    model_lower = HistGradientBoostingRegressor(
        loss='quantile',
        quantile=0.05,
        random_state=42
    )

    # Fit all three models on the same training data
    model_resid.fit(X_train, y_train)
    model_upper.fit(X_train, y_train)
    model_lower.fit(X_train, y_train)

    return model_resid, model_lower, model_upper

C. Explainable AI (Permutation Importance)
import numpy as np
from sklearn.inspection import permutation_importance

# Calculate importance on the Residual Model
result = permutation_importance(
    model_resid,
    X_train,
    y_train,
    n_repeats=10,
    random_state=42
)

# Rank features by the mean drop in accuracy when each is shuffled
importance_scores = result.importances_mean
ranking = np.argsort(importance_scores)[::-1]   # most important first