Using Data Science to Predict NHL Outcomes

Author

Will Schneider

Backdrop

Growing up watching Dallas Stars games with my father nurtured a deep appreciation for hockey. That curiosity later morphed into a data‑science project: could I build a model that predicts NHL outcomes? During graduate school I attempted an ambitious player‑performance model, but the inherent randomness of hockey made it difficult to achieve the accuracy I wanted. Unlike high‑scoring sports such as basketball, hockey outcomes hinge on a handful of events; luck can account for roughly 53% of a team’s record, whereas it contributes only 12% to NBA standings [1]. The NHL also imposes a hard salary cap, a mechanism introduced after the 2004–05 lockout that prevents wealthy clubs from buying talent and promotes parity [2]. These features make the league both exciting and hard to forecast.

What you’ll learn in this post:

  • How to build and evaluate a prospective sports‑prediction model. We walk through data collection, feature engineering, model training and prospective cross‑validation, illustrating the pitfalls of time‑series prediction.
  • Practical feature engineering for real‑world sports data. Lagged rolling averages, travel‑distance metrics and aggregated player statistics provide the building blocks for our models.
  • Why bookmaker odds are hard to beat. We examine expected value, vig and market structure to explain why most bets have near‑zero EV and why selective filters are essential.
  • How calibration and stacking improve forecasts. Beta calibration and a stacked meta‑model improve probability estimates and discrimination.
  • Lessons for would‑be sports bettors or modelers. We synthesize what worked, what didn’t and how to find a usable edge in an unpredictable sport.

The First Run: Player Point Prediction

My first attempt (August–November 2024) focused on predicting whether a player would earn a point (goal or assist). Because only about 10–12% of shots across the league result in goals and some players log very little ice time, I focused on points rather than goals or assists. The data were preprocessed so that the dependent variable could easily be switched to assists or goals if desired. Logistic regression achieved a ROC‑AUC around 0.60, while a random forest reached 0.70, but sensitivity was low: the models correctly predicted a “point” game only about 30–46% of the time, barely above the league‑average points‑per‑game rate. In other words, the models could identify “no point” games much better than “point” games. This imbalance, together with the high variance inherent in hockey scoring, led me to broaden the scope to team‑level outcomes.
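
For readers who want to reproduce this style of evaluation, here is a minimal, self‑contained sketch of how the ROC‑AUC and sensitivity figures above can be computed with scikit‑learn. The data are random placeholders, not the project’s actual player‑game features, and the chronological split simply mirrors the prospective approach described later in the post.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, recall_score

# Placeholder data: one row per player-game, label = "recorded a point".
rng = np.random.default_rng(42)
X = rng.normal(size=(5000, 20))
y = rng.binomial(1, 0.35, size=5000)

# Chronological split (earlier games train, later games test), mirroring
# the prospective validation used for the team-level models below.
split = int(0.75 * len(y))
X_train, X_test, y_train, y_test = X[:split], X[split:], y[:split], y[split:]

for name, model in [("logistic", LogisticRegression(max_iter=1000)),
                    ("random forest", RandomForestClassifier(n_estimators=300, random_state=0))]:
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_test)[:, 1]
    # Sensitivity = recall on the positive class: how often "point" games are caught.
    print(name,
          "ROC-AUC:", round(roc_auc_score(y_test, proba), 3),
          "sensitivity:", round(recall_score(y_test, (proba >= 0.5).astype(int)), 3))
```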

Revamped Model: Game Prediction

Why predicting NHL games is so challenging

Hockey is a low‑scoring sport and randomness plays a larger role than in other major leagues. As a TSN article notes, scoring goals in the NHL is difficult; games tend to be low scoring and chance events such as deflections, lucky bounces and hot goaltending strongly influence results [3]. Moreover, the league’s salary cap and revenue‑sharing policies are designed to encourage competitive balance, which keeps team talent levels relatively compressed and increases year‑to‑year volatility [4]. These factors mean that even sophisticated models will have limited discriminative power; predicting hockey games is inherently uncertain.

Data collection and feature engineering

To build a prospective game‑prediction model, I pulled three primary datasets from the NHL API (boxscore summaries, play‑by‑play logs and shift reports) and merged them at the game level. Player‑level statistics (time on ice, shots, face‑offs, etc.) were aggregated to the team level for both the home and away sides. I created rolling averages over 3‑game, 7‑game and 15‑game windows, applying exponential weights at the player level and simple averages at the team level to emphasize recent form while preserving team stability. Geographic information (latitude and longitude of teams and arenas) was used to compute travel distance with the Haversine formula; time‑zone differences were also included to approximate jet‑lag effects. All statistics, except for travel and time‑zone variables (which are known at scheduling time), were lagged so that the model only uses information available before each game.
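
To make the feature engineering concrete, here is a small sketch of the two workhorse transformations: Haversine travel distance and lagged rolling means. The column names (`team`, `game_date`) are illustrative assumptions, not the project’s actual schema, and the simple (unweighted) rolling mean shown here corresponds to the team‑level averages described above.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

def lagged_rolling_means(df, stat_cols, windows=(3, 7, 15)):
    """Per-team rolling means over the previous N games.

    shift(1) drops the current game, so every feature uses only
    information available before puck drop."""
    df = df.sort_values(["team", "game_date"]).copy()
    for w in windows:
        for col in stat_cols:
            df[f"{col}_avg{w}"] = (
                df.groupby("team")[col]
                  .transform(lambda s: s.shift(1).rolling(w, min_periods=1).mean())
            )
    return df
```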

Modeling Approaches and Feature Selection

To find the best predictor I experimented with the following approaches:

  • Generalized linear model (logistic regression) served as a baseline for binary win/loss outcomes. Logistic models are easy to interpret but assume a linear relationship between features and log‑odds.
  • Random forests introduced non‑parametric flexibility and captured nonlinear interactions. They offered modest gains but still struggled with calibration.
  • Neural networks provided even more flexibility and improved ROC‑AUC, yet calibration issues remained, and tuning them over the rolling splits was computationally expensive.
  • XGBoost balanced performance and computational cost; tuned carefully it achieved similar ROC‑AUC to neural nets while training faster.
  • Elo rating model. Inspired by FiveThirtyEight’s NFL methodology, each team was assigned a power rating that updates after every game based on the relative strength of the opponent and the result [5]. This single number summarizes past performance and naturally weights recent games more heavily through the update factor (the K‑factor). Elo ratings produced win probabilities that proved surprisingly competitive despite their simplicity (a minimal sketch of the update rule follows this list).
  • Meta‑model (stacking). Finally, I combined the predictions from the logistic, random forest, neural network, XGBoost and Elo models by training a second‑stage logistic regression to weight each model’s output. This ensemble improves discrimination by leveraging different strengths: models that excel in certain contexts can compensate for weaknesses of others.
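
The Elo bullet is the easiest to make concrete. Below is a minimal sketch of the FiveThirtyEight‑style update: the rating gap (plus a home‑ice bonus) maps to an expected win probability, and both ratings move by K times the forecast error. The home‑ice offset of 50 points is an illustrative assumption; K = 8 matches the 8–9 range discussed later in the post.

```python
def elo_win_prob(rating_home, rating_away, home_ice=50.0):
    """Expected probability the home team wins; home_ice is a bonus in Elo points."""
    diff = rating_home + home_ice - rating_away
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))

def elo_update(rating_home, rating_away, home_won, k=8.0, home_ice=50.0):
    """Move both ratings by K times the forecast error (a zero-sum update)."""
    expected = elo_win_prob(rating_home, rating_away, home_ice)
    delta = k * ((1.0 if home_won else 0.0) - expected)
    return rating_home + delta, rating_away - delta

# Example: a 1550-rated home team beats a 1500-rated visitor.
home, away = elo_update(1550.0, 1500.0, home_won=True)
```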

Feature selection relied on SHAP values to eliminate variables with little impact on predictions. This reduced computation time without sacrificing accuracy and provided insights into which factors were truly predictive.
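
As a sketch of how SHAP‑based pruning might look in practice (the placeholder data, model settings and cutoff threshold here are illustrative, not the project’s exact pipeline):

```python
import numpy as np
import pandas as pd
import shap
import xgboost as xgb

# Placeholder feature matrix and labels; swap in the real game-level data.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(2000, 30)),
                 columns=[f"feat_{i}" for i in range(30)])
y = rng.binomial(1, 0.5, size=2000)

model = xgb.XGBClassifier(n_estimators=300, max_depth=4, learning_rate=0.05)
model.fit(X, y)

# Mean absolute SHAP value = average impact of each feature on predictions.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)          # shape: (games, features)
importance = np.abs(shap_values).mean(axis=0)

# Keep features whose impact clears an (illustrative) fraction of the maximum.
keep = X.columns[importance > 0.05 * importance.max()].tolist()
print(f"kept {len(keep)} of {X.shape[1]} features")
```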

Cross‑validation (and the woes of prospective modeling)

All models were trained prospectively: the training set contained only games played before those in the test set, preserving temporal order to avoid leaking future information. Cross‑validation used rolling splits where each fold was defined by a cutoff date, ensuring that no observations from the future influenced training. Such leakage is common in sports modeling when analysts use random splits or include season dummies; I avoided these pitfalls by engineering features strictly from past data.
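
A minimal sketch of date‑based rolling splits, assuming a games DataFrame with a `game_date` column (the cutoff dates and 30‑day test window are illustrative):

```python
import pandas as pd

def rolling_date_splits(df, cutoffs, test_days=30):
    """Yield (train, test) index pairs where training data strictly
    precedes the cutoff and the test fold covers the next `test_days`."""
    for cutoff in pd.to_datetime(cutoffs):
        train_idx = df.index[df["game_date"] < cutoff]
        test_idx = df.index[(df["game_date"] >= cutoff)
                            & (df["game_date"] < cutoff + pd.Timedelta(days=test_days))]
        if len(train_idx) and len(test_idx):
            yield train_idx, test_idx

# Example usage with monthly cutoffs through a season:
# for tr, te in rolling_date_splits(games, ["2023-11-01", "2023-12-01", "2024-01-01"]):
#     fit on games.loc[tr], evaluate on games.loc[te]
```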

Probability calibration

Producing accurate win probabilities is more valuable than simply predicting winners. Many classifiers output poorly calibrated probabilities; logistic regression tends to be better calibrated because it directly models log‑odds, but more flexible models like random forests or neural nets often require calibration. I experimented with Platt scaling and beta calibration. Beta calibration fits a parametric beta cumulative distribution function to map uncalibrated scores to probabilities. Its two shape parameters allow asymmetric adjustments and it generally yields better calibration than a single‑parameter sigmoid for skewed score distributions. Applying beta calibration to the meta‑model’s outputs improved Brier scores and made the probabilities more realistic.
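
Beta calibration reduces to a logistic regression on transformed scores, which makes it easy to sketch. The code below fits the three‑parameter map sigmoid(a·ln p − b·ln(1−p) + c); for simplicity it skips the non‑negativity constraints on a and b that the full method imposes, so treat it as an approximation rather than the project’s exact routine.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_beta_calibrator(raw_probs, outcomes, eps=1e-6):
    """Fit beta calibration: sigmoid(a*ln(p) - b*ln(1-p) + c).

    Equivalent to logistic regression on [ln p, -ln(1-p)]; the two
    shape parameters allow asymmetric corrections, unlike Platt scaling."""
    p = np.clip(raw_probs, eps, 1 - eps)
    features = np.column_stack([np.log(p), -np.log(1 - p)])
    lr = LogisticRegression(C=1e6)  # large C ~ unregularized fit
    lr.fit(features, outcomes)

    def calibrate(q):
        q = np.clip(q, eps, 1 - eps)
        fq = np.column_stack([np.log(q), -np.log(1 - q)])
        return lr.predict_proba(fq)[:, 1]
    return calibrate

# calibrator = fit_beta_calibrator(meta_probs_validation, y_validation)
# calibrated_probs = calibrator(meta_probs_test)
```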

Model performance

On a hold‑out season, the logistic regression achieved an accuracy of about 0.56 with a ROC‑AUC of ~0.60. Random forests and neural networks improved sensitivity at the expense of specificity, while XGBoost slightly outperformed them with a ROC‑AUC of ~0.605. Surprisingly, the simple Elo model reached a ROC‑AUC of ~0.614. When the models were stacked in the meta‑model, ROC‑AUC increased modestly to ~0.628. Confidence intervals from 2,000 bootstrap samples confirmed that the improvement was statistically significant (95% CI ≈ 0.619–0.637). Thus, while the gain was small, combining models provided the most reliable discriminator.
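
For reference, a percentile bootstrap over games is one simple way to produce a confidence interval like the one quoted above; this sketch assumes arrays of hold‑out outcomes and predicted probabilities, and is not necessarily the exact resampling scheme used in the project.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc_ci(y_true, y_prob, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for ROC-AUC: resample games with
    replacement and recompute the statistic each time."""
    rng = np.random.default_rng(seed)
    y_true, y_prob = np.asarray(y_true), np.asarray(y_prob)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))
        if len(np.unique(y_true[idx])) < 2:   # AUC needs both classes present
            continue
        stats.append(roc_auc_score(y_true[idx], y_prob[idx]))
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return lo, hi
```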

Deployed model evaluation

I automated the data pipeline to pull new games from the NHL API daily, update features, retrain the models weekly and generate live probability forecasts. In deployed evaluation during the 2024–25 season, the models performed comparably to cross‑validation. The meta‑model had not yet been built at deployment time, because I wanted a genuine simulation over the last two months of the 2024–25 season, as if the model were actually in production. That the deployed metrics matched the rolling‑split results eased my concerns that the model was prone to data leakage.

The table below summarizes the performance metrics for three of the models that were evaluated:

| model_version | Accuracy | Sensitivity | Specificity | Brier | Kappa | ROC_AUC | Precision |
|---------------|----------|-------------|-------------|-------|-------|---------|-----------|
| glm_v_25      | 0.582    | 0.323       | 0.837       | 0.292 | 0.161 | 0.543   | 0.662     |
| nn_v_25       | 0.570    | 0.587       | 0.553       | 0.267 | 0.139 | 0.564   | 0.564     |
| rf_v_25       | 0.601    | 0.429       | 0.770       | 0.268 | 0.199 | 0.619   | 0.648     |

Financial Analysis: Betting Context

Because predictions were deployed in a wagering context, I evaluated expected value (EV) and bookmaker “vig.” EV estimates the average return per dollar based on predicted probabilities and offered odds; vig reflects the bookmaker’s commission (the difference between the raw implied probability and the fair, vig‑free probability). Even an accurate win‑probability model is not enough to guarantee profitable betting, because sportsbooks charge this commission (also called the juice) on every wager: instead of paying out the full fair odds, they pay slightly less, creating a margin for themselves [6]. For example, both sides of a fair coin flip should pay even money (a $1 profit on a $1 stake), but bookmakers typically post odds equivalent to –110/–110, paying only about 91 cents of profit on a winning $1 bet and keeping the rest [7]. When you convert American odds to implied probabilities, the sum of the probabilities on both sides exceeds 100%; the excess reflects the vig [8]. To compare the model’s probabilities with the market, I removed the vig from sportsbook odds to obtain vig‑free implied probabilities.
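
The odds arithmetic is easy to make concrete. Here is a small sketch of the conversion plus a proportional de‑vigging step; normalizing the two implied probabilities to sum to one is a common convention, though not necessarily the exact method every sportsbook or this project assumes.

```python
def american_to_implied(odds):
    """Convert American odds to the bookmaker's implied probability."""
    return 100 / (odds + 100) if odds > 0 else -odds / (-odds + 100)

def remove_vig(p_home, p_away):
    """Normalize the two implied probabilities so they sum to 1;
    the excess they carried above 1 is the vig."""
    total = p_home + p_away
    return p_home / total, p_away / total, total - 1.0

# A -110/-110 line implies ~52.4% on each side; the ~4.8% excess is the vig.
p_home, p_away = american_to_implied(-110), american_to_implied(-110)
fair_home, fair_away, vig = remove_vig(p_home, p_away)
print(round(fair_home, 3), round(vig, 3))   # 0.5, 0.048
```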

Expected value and edge filters

For each game, the expected value (EV) of a bet was defined as the model’s estimated win probability multiplied by the payoff, minus the stake. Model edge was defined as the difference between the model’s win probability and the vig‑free implied probability from the bookmaker. A flat betting strategy across all games was unprofitable; the EV distribution (Figure 1) is heavily right‑skewed and most bets have EV near zero. To extract value, I filtered for bets with EV > 0.5 and a model edge greater than the vig spread (the difference between the highest and lowest vig across bookmakers for that game). These filters improved profitability but dramatically reduced the number of bettable games.
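
Put together, the filters amount to a few lines of pandas. The column names below (`model_prob`, `decimal_odds`, `fair_prob`, `vig_spread`) are illustrative stand‑ins for the project’s actual fields, and the EV is expressed per $1 staked.

```python
import pandas as pd

def add_bet_filters(df):
    """Compute EV and model edge per bet, then apply the two filters above."""
    out = df.copy()
    # EV per $1 staked: win prob * profit - lose prob * stake.
    out["ev"] = out["model_prob"] * (out["decimal_odds"] - 1) - (1 - out["model_prob"])
    # Edge: model probability minus the vig-free market probability.
    out["edge"] = out["model_prob"] - out["fair_prob"]
    out["bettable"] = (out["ev"] > 0.5) & (out["edge"] > out["vig_spread"])
    return out

# bets = add_bet_filters(odds_and_predictions)
# bets.loc[bets["bettable"]]   # the small set of games worth wagering on
```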

Expected Value Across Bookmakers

The Exp. Value by Bookmaker boxplot shows that most EVs hover near zero across all bookmakers, but there are occasional high‑EV outliers (up to ~1.5).

The histogram of Expected Value per Bet underscores how rare positive EV opportunities are: the distribution is heavily skewed towards zero, with a long tail of profitable bets.

These figures explain why a flat‑betting strategy failed; positive EV bets were scarce, and blindly wagering on every game essentially produced random results.

Implied Probability Spread and Vig Buffers by Team

When comparing bookmakers’ implied probabilities, some teams exhibited larger discrepancies. The Implied Probability Spread by Team plot reveals that the Utah Hockey Club, New York Islanders and Los Angeles Kings often had spreads exceeding 0.25, whereas clubs like the San Jose Sharks or Vegas Golden Knights had narrower spreads.

Wide spreads imply that bookmakers disagree on a team’s chances, offering opportunities for bettors who have a strong read on those teams.

The Distribution of Vig Buffer by Team and Average Vig Buffer by Team charts highlight how vig varies across teams. Clubs such as the Carolina Hurricanes, Dallas Stars and Edmonton Oilers incur higher vig buffers (~0.025), meaning bookmakers embed larger margins when setting odds for these high‑profile teams.

Lower‑profile teams (e.g., San Jose Sharks or Chicago Blackhawks) have smaller vig buffers (~0.01). This pattern suggests that games involving popular contenders may be less profitable because bookmakers charge more for betting into them.

Bookmaker Vig Landscape

Examining vig at the bookmaker level sheds light on market efficiency. The Vig by Bookmaker boxplot shows that Lowvig and BetOnline.ag consistently offer lower vig (around 0.03–0.035), while MyBookie and William Hill US impose the highest margins (0.05–0.06).

An overlaid histogram across bookmakers confirms this ranking: Lowvig’s distribution is tightly centered at ~0.03, whereas MyBookie and William Hill exhibit heavier tails up to 0.06. Faceted histograms further reinforce that BetMGM, Lowvig and BetOnline provide bettor‑friendly odds, while DraftKings, FanDuel and Fanatics are less generous.

Despite these differences, the Distribution of Vig Spread Across Books shows that the per‑game difference between the highest and lowest vig among bookmakers is usually modest (0.02–0.03).

While shopping lines matters, bettors shouldn’t expect drastic gains from arbitrage alone.

Putting It All Together

By combining predicted probabilities with vig information, I introduced two filtering criteria to identify worthwhile bets:

  • Expected value (EV) > 0.5 – only consider wagers where the model anticipates at least a 50% return per dollar risked.
  • Model edge ≥ vig spread – ensure that the model’s confidence in a team (difference between model probability and vig‑adjusted implied probability) exceeds the bookmaker’s margin.

Applying these filters increased profitability but drastically reduced the number of betting opportunities. In practice, most filtered bets were on underdogs because favourites often carry low returns (for example, risking $300 to win $100). The limited sample size and concentration on underdogs make the strategy vulnerable to variance, but the analysis suggests that selective betting based on model edge and EV can outperform random wagering.

Lessons Learned and Future Directions

Predicting hockey outcomes is intrinsically difficult, but this project yielded several insights and actionable recommendations:

Embrace parity and randomness. Hockey is inherently random; luck influences more than half of a team’s season record and the league’s hard salary cap levels the playing field. No model will eliminate uncertainty entirely, so expectations should be tempered.

Simplicity often wins. Complex models (neural networks, random forests or gradient boosting) offer only marginal gains over simpler baselines. In this project, logistic regression and an Elo‑based power rating performed comparably to, and sometimes better than, more sophisticated methods when paired with careful calibration and cross‑validation.

Focus on model edge rather than raw accuracy. ROC‑AUC values around 0.62 and accuracies around 0.58 may seem unimpressive, but even a small edge can be exploited when paired with rigorous betting filters. The difference between a model’s win probability and the vig‑free implied probability matters more than whether the model classifies winners correctly.

Line shopping matters, but margins are thin. Some bookmakers charge lower vig than others, but the per‑game vig spread rarely exceeds a few percentage points. The biggest opportunities arise from mispriced odds on specific teams or games rather than tiny differences in commission.

Practical feature engineering matters. Features such as lagged and cumulative averages, engineered travel metrics and carefully tuned Elo ratings provide real predictive lift. Feature selection using SHAP values helps remove noise and speed up training without sacrificing performance.

The Power of Elo Ratings

Although it was the simplest of the candidate models, the Elo rating system unexpectedly outperformed more complex machine‑learning models. Each team’s Elo rating is updated after every game based on the opponent’s strength, the margin of victory and a scaling factor called the K‑factor that determines how quickly the rating responds to new information. A high K‑factor makes the rating more sensitive to recent results, while a low K‑factor smooths fluctuations [9]; tuning on 2012–2014 data, I found that a K of 8–9 strikes a good balance between adaptability and stability. This built‑in recency weighting allows Elo ratings to adapt to hot streaks, slumps and roster changes more efficiently than fixed rolling windows of box‑score statistics. Moreover, Elo summarizes past performance in a single number and directly converts rating differences into win probabilities, making it both interpretable and practical. In my experiments, the Elo model achieved a higher ROC‑AUC (~0.614) than the neural network or XGBoost, and stacking it alongside other models provided further gains. Future versions of the model will explore Elo variants for specific metrics (e.g., shots on goal or saves) to replace or complement traditional rolling averages.

Next Steps

  • Explore additional features. Context‑specific Elo ratings and additional engineered variables could capture team form more succinctly and reduce variance.
  • Investigate model calibration. Regularly validating probability calibration will help refine EV and edge thresholds, potentially increasing profitability.
  • Analyse favourite vs. underdog bias. Positive‑EV bets often involved underdogs. Understanding why favourites are rarely selected and exploring stake sizing or parlays could improve returns.
  • Develop interactive tools. Public visualizations or interactive apps could allow readers to explore probability forecasts and market odds.

Even though no model guarantees riches, this systematic, data‑driven approach illustrates how careful feature engineering, rigorous validation and realistic assumptions can uncover small but real edges in a sport dominated by parity and chance. If you have feedback, questions or would like to collaborate on sports analytics, feel free to get in touch.

References

  1. Why does luck play such a big role in hockey games? | PBS News
     https://www.pbs.org/newshour/science/why-does-luck-play-such-a-big-role-in-hockey-games

  2. The Pennsylvania State University
     https://honors.libraries.psu.edu/files/final_submissions/10005

  3. There’s less parity in the NHL than you think | TSN
     https://www.tsn.ca/nhl/there-s-less-parity-in-the-nhl-than-you-think-1.2174892

  4. There’s less parity in the NHL than you think | TSN
     https://www.tsn.ca/nhl/there-s-less-parity-in-the-nhl-than-you-think-1.2174892

  5. How Our NFL Predictions Work | FiveThirtyEight
     https://fivethirtyeight.com/methodology/how-our-nfl-predictions-work/

  6. Betting Theory: Vig and Implied Probabilities | Alacrity
     https://www.alacrity.gg/betting-theory-vig-and-implied-probabilities/

  7. Betting Theory: Vig and Implied Probabilities | Alacrity
     https://www.alacrity.gg/betting-theory-vig-and-implied-probabilities/

  8. Betting Theory: Vig and Implied Probabilities | Alacrity
     https://www.alacrity.gg/betting-theory-vig-and-implied-probabilities/

  9. How Our NFL Predictions Work | FiveThirtyEight
     https://fivethirtyeight.com/methodology/how-our-nfl-predictions-work/