Developed by Tom Knight
This dashboard provides a comprehensive 52-week forecast for Influenza and COVID-19. By analysing trends in PCR Positivity, Hospital Admissions, and ICU usage, these models support healthcare resource planning and early warning systems.
The predictions are generated using a Hybrid Machine Learning strategy (HistGradientBoostingRegressor). The system is fully automated: it retrieves live data directly from the UKHSA public API and retrains every week via GitHub Actions.
Last updated: ... • Data reflects specimen dates.
Our forecasting engine uses a Hybrid Approach: we first isolate the predictable seasonal cycle, then use Scikit-Learn's HistGradientBoostingRegressor to model the residuals (the difference between the seasonal baseline and reality).
Histogram-Based Gradient Boosting is a sophisticated supervised machine learning algorithm designed to solve complex regression problems efficiently. It is an optimised evolution of traditional Gradient Boosting that groups continuous data into discrete 'bins' (histograms) rather than analysing every individual data point.
Standard decision trees sort the feature values at every node split, which has a complexity of \(O(N \log N)\) per feature. HGBR instead discretises the continuous input features into integer-valued bins (at most 255), so finding the best split only requires a linear scan over the bins.
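A rough sketch of the binning step (scikit-learn's internal bin mapper is more sophisticated; the quantile-based edges here are an illustrative assumption):

```python
import numpy as np

def bin_feature(values, max_bins=255):
    """Replace a continuous feature with small integer bin indices."""
    # Place bin edges at evenly spaced quantiles of the training data
    inner_quantiles = np.linspace(0, 1, max_bins + 1)[1:-1]
    edges = np.unique(np.quantile(values, inner_quantiles))
    # Each value becomes the index of the bin it falls into (fits in one byte)
    return np.searchsorted(edges, values).astype(np.uint8)

x = np.random.default_rng(42).normal(size=10_000)
binned = bin_feature(x)
```

After binning, evaluating a split at a tree node means scanning at most 255 candidate thresholds instead of sorting all \(N\) raw values.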
The model is an ensemble of \(M\) decision trees. It is an additive model: the prediction at step \(m\), denoted \(F_m(x)\), is the previous prediction plus a new tree \(h_m(x)\) scaled by the learning rate \(\eta\):
\[ F_m(x) = F_{m-1}(x) + \eta\, h_m(x) \]
At each iteration \(m\), the new tree \(h_m(x)\) is trained to predict the negative gradient of the loss function. For our Least Squares loss \(L = \tfrac{1}{2}\left(y - F(x)\right)^2\), this negative gradient is simply the residual:
\[ r_{im} = y_i - F_{m-1}(x_i) \]
A single point forecast (the "mean") implies false certainty. To address this, we generate 90% Prediction Intervals (the shaded cones) by training three separate instances of the model with different loss functions: squared error for the central estimate, and quantile (pinball) loss at the 5th and 95th percentiles for the lower and upper bounds.
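The 'quantile' loss is the pinball loss. This toy sketch (not scikit-learn's implementation) shows why minimising it at \(\tau = 0.95\) pulls a constant prediction towards the 95th percentile rather than the mean:

```python
import numpy as np

def pinball_loss(y_true, y_pred, tau):
    """Asymmetric loss: under-prediction costs tau, over-prediction costs 1 - tau."""
    diff = y_true - y_pred
    return float(np.mean(np.where(diff >= 0, tau * diff, (tau - 1) * diff)))

y = np.arange(1, 101, dtype=float)  # toy observations 1..100
# At tau = 0.95, predicting near the 95th percentile beats predicting the mean
loss_at_mean = pinball_loss(y, 50.5, tau=0.95)
loss_at_p95 = pinball_loss(y, 95.0, tau=0.95)
```

Because under-prediction is penalised 19 times more heavily than over-prediction at \(\tau = 0.95\), the optimum sits where only 5% of observations lie above the prediction.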
To demystify the "Black Box" nature of machine learning, we employ a technique called Permutation Importance. This evaluates the significance of each input feature (e.g., "Last Week's Cases" vs. "Seasonal Month") by randomly shuffling its values and measuring the resulting drop in model accuracy.
Below are core snippets from the process_data.py script.
METRICS = {
    "positivity": { "topic": "Influenza", "metric_id": "influenza_testing_positivityByWeek" },
    "hospital":   { "topic": "Influenza", "metric_id": "influenza_healthcare_hospitalAdmissionRateByWeek" },
}
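Each entry in METRICS maps onto an API endpoint. The URL builder below is an assumption about how process_data.py might construct its requests; the path layout follows the public UKHSA dashboard API as documented at the time of writing, and should be verified against the live API reference:

```python
# Assumption: illustrative URL construction for the UKHSA dashboard API.
BASE = "https://api.ukhsa-dashboard.data.gov.uk"

def build_metric_url(topic, metric_id, geography="England"):
    """Construct the endpoint for one metric in the METRICS mapping."""
    return (f"{BASE}/themes/infectious_disease/sub_themes/respiratory"
            f"/topics/{topic}/geography_types/Nation"
            f"/geographies/{geography}/metrics/{metric_id}")

url = build_metric_url("Influenza", "influenza_testing_positivityByWeek")
```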
We leverage HistGradientBoostingRegressor due to its speed and native support for missing values (NaNs). We define three distinct model configurations to handle the point forecast (Mean) and uncertainty bounds (Quantiles).
from sklearn.ensemble import HistGradientBoostingRegressor

def train_and_forecast(df):
    # 1. Main Forecast Model (The Dashed Line)
    # Uses 'squared_error' (Least Squares) to predict the expected mean value.
    model_resid = HistGradientBoostingRegressor(
        loss='squared_error',
        learning_rate=0.1,
        max_iter=100,
        random_state=42
    )

    # 2. Upper Bound Model (Top of the Cone)
    # Uses 'quantile' loss targeting the 95th percentile.
    # This predicts a value that 95% of future data points should fall below.
    model_upper = HistGradientBoostingRegressor(
        loss='quantile',
        quantile=0.95,
        random_state=42
    )

    # 3. Lower Bound Model (Bottom of the Cone)
    # Uses 'quantile' loss targeting the 5th percentile.
    # This predicts a value that only 5% of future data points should fall below.
    model_lower = HistGradientBoostingRegressor(
        loss='quantile',
        quantile=0.05,
        random_state=42
    )

    # Fit all three models on the same training data (X_train, y_train)
    model_resid.fit(X_train, y_train)
    model_upper.fit(X_train, y_train)
    model_lower.fit(X_train, y_train)
from sklearn.inspection import permutation_importance

# Calculate importance on the Residual Model
result = permutation_importance(
    model_resid,
    X_train,
    y_train,
    n_repeats=10,
    random_state=42
)

# Rank features by the average drop in accuracy when shuffled
importance_scores = result.importances_mean
ranking = importance_scores.argsort()[::-1]