Using Data Science to Predict NHL Outcomes

Project Overview

This project focuses on building and comparing multiple machine learning models to predict NHL game outcomes, leveraging both bookmaker odds and engineered in-game features. The analysis spans logistic regression (GLM), random forest, XGBoost, and neural networks, culminating in a meta-model ensemble for improved accuracy. Brief earlier experiments also explored player-level scoring predictions, but the main emphasis is on game-level predictions and betting analysis.

Methodology

Data Sources

NHL API: Official team, game, and event data
Bookmaker Odds: Scraped pre-game odds and implied probabilities
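
As an illustration of how pre-game odds map to implied probabilities, here is a minimal sketch. It assumes decimal odds and removes the bookmaker margin by simple proportional rescaling; the project does not state which odds format or margin-removal method it used, so the function name and the example prices are hypothetical.

```python
def implied_probabilities(decimal_odds):
    """Convert decimal odds for mutually exclusive outcomes to implied probabilities.

    Returns (raw, fair, overround):
      raw       -- 1 / odds for each outcome (sums to more than 1 when a margin is present)
      fair      -- raw probabilities rescaled to sum to 1 (assumed proportional method)
      overround -- the bookmaker margin, i.e. sum(raw) - 1
    """
    if any(o <= 1.0 for o in decimal_odds):
        raise ValueError("decimal odds must be greater than 1.0")
    raw = [1.0 / o for o in decimal_odds]
    total = sum(raw)
    fair = [p / total for p in raw]
    return raw, fair, total - 1.0


# Hypothetical two-way market: home priced at 1.80, away at 2.10
raw, fair, overround = implied_probabilities([1.80, 2.10])
```

Proportional rescaling is only one of several ways to strip the margin; it is used here because it is the simplest to reason about.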

Feature Engineering

Created dozens of engineered features capturing momentum, rest, team strengths, and contextual bookmaker information
Split the data so that training and testing are prospective and free of leakage
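
The sketch below shows one way rest and momentum features could be built without leakage: each game's features are computed only from games played earlier. The field names (date, home, away, home_win), the five-game window, and the toy schedule are assumptions for illustration, not the project's actual schema or data.

```python
from collections import defaultdict, deque
from datetime import date


def add_pregame_features(games, window=5):
    """Attach rest days and a rolling win rate computed from earlier games only.

    `games` is a list of dicts with hypothetical keys: date, home, away, home_win.
    """
    last_played = {}                                    # team -> date of its previous game
    recent = defaultdict(lambda: deque(maxlen=window))  # team -> last `window` results (1/0)
    rows = []
    for g in sorted(games, key=lambda g: g["date"]):
        row = dict(g)
        for side in ("home", "away"):
            team = g[side]
            prev = last_played.get(team)
            row[f"{side}_rest_days"] = (g["date"] - prev).days if prev else None
            hist = recent[team]
            row[f"{side}_win_rate"] = sum(hist) / len(hist) if hist else None
        rows.append(row)
        # Update state only after the row is built, so a game never sees its own result.
        recent[g["home"]].append(1 if g["home_win"] else 0)
        recent[g["away"]].append(0 if g["home_win"] else 1)
        last_played[g["home"]] = last_played[g["away"]] = g["date"]
    return rows


# Toy schedule with made-up results, for illustration only
games = [
    {"date": date(2024, 1, 1), "home": "TOR", "away": "MTL", "home_win": True},
    {"date": date(2024, 1, 3), "home": "MTL", "away": "TOR", "home_win": True},
    {"date": date(2024, 1, 6), "home": "TOR", "away": "BOS", "home_win": False},
]
features = add_pregame_features(games)
```

A team's first game has no history, so its rest and win-rate features are left as None rather than filled with a guess.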

Modeling & Evaluation

Base Models: Generalized Linear Model (GLM), Random Forest, XGBoost, Neural Network
Ensemble: Meta-model stacking for optimal performance
Evaluation: Employed ROC AUC, calibration plots, and profit/loss analysis using simulated betting strategies to measure real-world value
Validation: Used rolling-origin cross-validation to maintain a prospective forecast structure
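
To make the rolling-origin idea concrete, the following sketch generates train/test index pairs over chronologically ordered games, so every test window lies strictly after its training data. The parameter names and the expanding-window default are assumptions; the project's actual fold sizes are not stated.

```python
def rolling_origin_splits(n_samples, initial_train, horizon, step=None, expanding=True):
    """Yield (train_indices, test_indices) over chronologically ordered samples.

    initial_train -- number of samples in the first training window
    horizon       -- number of samples in each test window
    step          -- how far the origin advances per fold (defaults to horizon)
    expanding     -- if True the training window grows; if False it slides
    """
    step = step or horizon
    origin = initial_train
    while origin + horizon <= n_samples:
        start = 0 if expanding else origin - initial_train
        yield list(range(start, origin)), list(range(origin, origin + horizon))
        origin += step


splits = list(rolling_origin_splits(n_samples=10, initial_train=4, horizon=2))
```

In a stacking setup, the base-model predictions collected on each test window could serve as training inputs for the meta-model, which keeps the ensemble prospective as well. Comparable utilities exist in common libraries, for example rolling_origin in rsample and TimeSeriesSplit in scikit-learn.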

Key Results

Meta-ensemble model consistently outperformed individual base models on ROC AUC and on profit metrics from the simulated betting strategies
Demonstrated the potential to identify market inefficiencies and optimize edge in sports betting
Visualized calibration and ROC results to validate practical predictive utility
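
The betting profit and edge mentioned above can be illustrated with a flat-stake simulation: a bet is placed only when the model's probability implies positive expected value at the offered price. This is a sketch under assumed rules (decimal odds, fixed stake, hypothetical toy bets); the project's actual staking strategy is not described.

```python
def simulate_flat_betting(bets, stake=1.0, min_edge=0.0):
    """Flat-stake simulation over (model_prob, decimal_odds, won) tuples.

    A bet is placed when the expected profit per unit staked exceeds min_edge.
    Returns (profit, n_bets, roi).
    """
    profit, n_bets = 0.0, 0
    for model_prob, odds, won in bets:
        edge = model_prob * odds - 1.0  # expected profit per unit staked
        if edge <= min_edge:
            continue                    # no perceived edge, skip the game
        n_bets += 1
        profit += stake * (odds - 1.0) if won else -stake
    roi = profit / (n_bets * stake) if n_bets else 0.0
    return profit, n_bets, roi


# Made-up inputs for illustration only
toy_bets = [(0.60, 2.00, True), (0.40, 2.00, False), (0.55, 1.90, False)]
profit, n_bets, roi = simulate_flat_betting(toy_bets)
```

Raising min_edge trades volume for selectivity, which is one simple lever for testing how sensitive simulated profit is to calibration errors.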

Tools & Technologies

Languages: R, Python
Frameworks: Tidymodels (R), Scikit-Learn, XGBoost, Keras/TensorFlow
Visualization: ggplot2, Plotly, Matplotlib

Visualizations

Application & Further Reading

Conclusion & Next Steps

The meta-ensemble produced actionable signals in simulated betting and pointed to potential bookmaker inefficiencies.
Future work: Expand to live/in-play predictions, automate data ingestion, refine the ensemble meta-learning, and extend the approach to additional sports and markets.