Using Data Science to Predict NHL Outcomes

Project Overview

This project focuses on building and comparing multiple machine learning models to predict NHL game outcomes, leveraging both bookmaker odds and engineered in-game features. The analysis spans logistic regression (GLM), random forest, XGBoost, and neural networks, culminating in a meta-model ensemble for improved accuracy. Brief earlier experiments also explored player-level scoring predictions, but the main emphasis is on game-level predictions and betting analysis.

Methodology

Data Sources

  • NHL API: Official team, game, and event data
  • Bookmaker Odds: Scraped pre-game odds and implied probabilities
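As a rough illustration of how pre-game odds map to implied probabilities, the sketch below assumes decimal odds (the project does not state which odds format was scraped) and removes the bookmaker margin by simple proportional normalization. The function name `implied_probabilities` is hypothetical, and other margin-removal methods exist.

```python
def implied_probabilities(decimal_odds):
    """Convert the decimal odds of every outcome in one market to probabilities.

    Assumption: odds are in decimal format (e.g. 1.80), and the margin is
    removed by proportional normalization.
    """
    raw = [1.0 / o for o in decimal_odds]      # raw implied probabilities
    total = sum(raw)                           # exceeds 1 when a margin is charged
    fair = [p / total for p in raw]            # rescaled to sum to 1
    margin = total - 1.0                       # bookmaker commission (overround)
    return fair, margin


# Hypothetical two-way market: home at 1.80, away at 2.10.
fair, margin = implied_probabilities([1.80, 2.10])
print([round(p, 4) for p in fair], round(margin, 4))
```

The margin returned here is the quantity a commission-distribution plot would summarize across games.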

Feature Engineering

  • Created dozens of engineered features capturing momentum, rest, team strengths, and contextual bookmaker information
  • Split the data prospectively, so that models are trained only on games played before those in the test set, which prevents information leakage
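The features listed above are not specified in detail, so the following is only a minimal sketch of two plausible ones: rest days and recent form (a simple momentum proxy). The field names (`team`, `date`, `won`) and the helper `add_rest_and_form` are hypothetical. The key point is that each feature uses only games played strictly before the game being described, which keeps the features leak-free.

```python
from datetime import date


def add_rest_and_form(games, window=5):
    """Attach rest days and recent win rate to each team-game row.

    Assumption: each row is a dict with 'team', 'date' (datetime.date) and
    'won' (1 or 0). Features are built from earlier games only.
    """
    history = {}   # team -> list of that team's earlier games
    rows = []
    for g in sorted(games, key=lambda g: g["date"]):
        past = history.setdefault(g["team"], [])
        recent = past[-window:]
        rest_days = (g["date"] - past[-1]["date"]).days if past else None
        form = sum(p["won"] for p in recent) / len(recent) if recent else None
        rows.append({**g, "rest_days": rest_days, "form": form})
        past.append(g)   # the current game becomes history only afterwards
    return rows


# Hypothetical sample data for one team.
sample = [
    {"team": "BOS", "date": date(2024, 1, 1), "won": 1},
    {"team": "BOS", "date": date(2024, 1, 3), "won": 0},
    {"team": "BOS", "date": date(2024, 1, 4), "won": 1},
]
features = add_rest_and_form(sample)
```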

Modeling & Evaluation

  • Base Models: Generalized Linear Model (GLM), Random Forest, XGBoost, Neural Network
  • Ensemble: A stacked meta-model that combines the predictions of the base models
  • Evaluation: Employed ROC AUC, calibration plots, and profit/loss analysis using simulated betting strategies to measure real-world value
  • Used rolling-origin cross-validation to maintain a prospective forecast structure
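Rolling-origin cross-validation can be sketched as follows, assuming games are already sorted chronologically and an expanding training window is used (the project does not say whether its window was expanding or fixed-length). The function `rolling_origin_splits` and its parameters are illustrative, not the project's actual code.

```python
def rolling_origin_splits(n_samples, initial_train, horizon, step=None):
    """Yield (train_indices, test_indices) pairs in chronological order.

    Assumption: rows 0..n_samples-1 are sorted by game date. The training
    window always starts at 0 and grows; the test window is the next
    `horizon` rows after the training window.
    """
    step = step or horizon
    train_end = initial_train
    while train_end + horizon <= n_samples:
        train_idx = list(range(0, train_end))
        test_idx = list(range(train_end, train_end + horizon))
        yield train_idx, test_idx
        train_end += step   # move the forecast origin forward


splits = list(rolling_origin_splits(n_samples=10, initial_train=4, horizon=2))
for train_idx, test_idx in splits:
    print(len(train_idx), test_idx)
```

Every test index is later than every training index in the same fold, which is what keeps the evaluation prospective.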

Key Results

  • The meta-ensemble model consistently outperformed the individual base models on both ROC AUC and simulated betting profit
  • Demonstrated the potential to identify market inefficiencies and optimize edge in sports betting
  • Visualized calibration and ROC results to validate practical predictive utility
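To make the profit/loss evaluation concrete, here is a minimal sketch of a flat-stake simulation that bets only when the model's probability implies positive expected value at the offered odds. The staking rule, the `min_edge` threshold, and the function name `simulate_flat_bets` are assumptions for illustration; the project's actual betting strategies are not described. The input numbers are made up.

```python
def simulate_flat_bets(model_probs, decimal_odds, outcomes, min_edge=0.0, stake=1.0):
    """Return (total_profit, number_of_bets) for a flat-stake strategy.

    Assumption: decimal odds; a bet is placed when the expected return per
    unit staked, p * odds - 1, exceeds `min_edge`.
    """
    profit = 0.0
    n_bets = 0
    for p, odds, won in zip(model_probs, decimal_odds, outcomes):
        expected_value = p * odds - 1.0
        if expected_value > min_edge:
            n_bets += 1
            profit += stake * (odds - 1.0) if won else -stake
    return profit, n_bets


# Hypothetical predictions, odds, and results for four games.
profit, n_bets = simulate_flat_bets(
    model_probs=[0.60, 0.40, 0.55, 0.70],
    decimal_odds=[2.0, 2.0, 1.7, 1.6],
    outcomes=[1, 0, 0, 0],
)
print(profit, n_bets)
```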

Tools & Technologies

  • Languages: R, Python
  • Frameworks: Tidymodels (R), Scikit-Learn, XGBoost, Keras/TensorFlow
  • Visualization: ggplot2, Plotly, Matplotlib

Visualizations

  • Figure: Model Confusion Matrices
  • Figure: Model ROC AUC Performance
  • Figure: Bookmaker Commission Distribution

Application & Further Reading

Conclusion & Next Steps

  • The models provided actionable signals for sports betting and identified inefficiencies in bookmakers’ odds.
  • Future work: Expand to live/in-play predictions, automate data ingestion, and refine ensemble meta-learning for additional sports and markets.