Baseball Data Analysis Challenge

Calling all data analysts, machine learning engineers, and data scientists!

I am working on building some demos and tutorials for machine learning. Of course, I will be sharing everything I do on GitHub. I thought it would be fun to share my input data with all of you before I start and make a little challenge out of this. While not as exciting or lucrative as a Kaggle competition, please feel free to have at it and use whatever techniques and tools you would like to discover any insights and/or make any predictions (even if you do not know anything about baseball).

The input data for this challenge covers six seasons (2016-2021) of Boston Red Sox Major League Baseball (MLB) regular-season game results. Each row includes a Game_Result column, labeled either 0 or 1, where 0 = Loss and 1 = Win.

The input data for this challenge is available as a CSV file here: https://github.com/ocdqblog/Vertica/blob/main/csv/BRS_2016_2021_Batting_input.csv
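If you want a quick starting point, here is a minimal Python sketch (not part of my own Vertica work) that reads game rows from a CSV file and computes win rates from the Game_Result column, either overall or grouped by another column. The sample rows and the Day_of_Week column name are made up for illustration; check the header of the real file for the actual column names.

```python
import csv
import io
from collections import defaultdict


def load_games(csv_file):
    """Read game rows from an open CSV file object into a list of dicts."""
    return list(csv.DictReader(csv_file))


def win_rate_by(rows, column=None):
    """Return {group: win rate}. With column=None, use a single 'all' group."""
    wins = defaultdict(int)
    games = defaultdict(int)
    for row in rows:
        key = row[column] if column else "all"
        games[key] += 1
        wins[key] += int(row["Game_Result"])  # 1 = Win, 0 = Loss
    return {key: wins[key] / games[key] for key in games}


# Made-up sample rows; "Day_of_Week" is a hypothetical column name.
SAMPLE = """Day_of_Week,Game_Result
Mon,1
Mon,0
Tue,1
Tue,1
"""

rows = load_games(io.StringIO(SAMPLE))
overall = win_rate_by(rows)
by_day = win_rate_by(rows, "Day_of_Week")
print(overall)
print(by_day)
```

To run this against the real data, download the CSV file and pass an open file handle to load_games, for example open("BRS_2016_2021_Batting_input.csv", newline="").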

The data profiling results for the input data are available as a CSV file here: https://github.com/ocdqblog/Vertica/blob/main/csv/BRS_2016_2021_Batting_profile.csv

The raw data used in this challenge was collected via a paid subscription to: https://stathead.com/baseball/

Update for 2022 MLB Opening Day

I completed my initial work in time for the opening day of the 2022 MLB season, the results of which you can find in this Microsoft Excel file: Baseball Data Analysis Challenge 2022-04-05.xlsx. My baseball data analysis was performed using my employer’s (Vertica) in-database machine learning capabilities, and you can find my SQL scripts on GitHub.

I used logistic regression classification models to calculate win probabilities for the Red Sox across nine (9) game metrics: opponent, opponent’s division, month of year, day of week, runs scored, hits, extra-base hits, home runs, and walks versus strikeouts. I also used the input data to train a Naïve Bayes classification model to predict wins and losses with an associated probability based on the runs scored, hits, extra-base hits, home runs, and walks versus strikeouts game metrics (all of which are binned ranges of input data values). The Naïve Bayes model's initial accuracy is only 77%, but I plan on making some adjustments. I also plan on using the 2022 baseball season as my test data. So not only will I be watching how many games the Red Sox win or lose this season, but I will also be watching how many games my machine learning model predicts correctly.
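For anyone who wants to try the Naïve Bayes idea outside of a database, here is a rough pure-Python sketch of a classifier over binned (categorical) features. It is not my Vertica SQL, just an illustration of the technique. The feature names (runs_bin, hr_bin), the bin labels, and the toy rows are all made up, and the add-one (Laplace) smoothing is my assumption for the sketch.

```python
import math
from collections import Counter, defaultdict


def train(rows, features, label="Game_Result", alpha=1.0):
    """Count class and per-class feature-value frequencies."""
    class_counts = Counter(r[label] for r in rows)
    value_counts = defaultdict(Counter)  # (feature, class) -> value counts
    values = defaultdict(set)            # feature -> distinct values seen
    for r in rows:
        for f in features:
            value_counts[(f, r[label])][r[f]] += 1
            values[f].add(r[f])
    return {"classes": class_counts, "counts": value_counts, "values": values,
            "features": features, "alpha": alpha, "n": len(rows)}


def predict_proba(model, row):
    """Return {class: probability} using smoothed Naive Bayes in log space."""
    alpha = model["alpha"]
    log_scores = {}
    for c, n_c in model["classes"].items():
        score = math.log(n_c / model["n"])  # class prior
        for f in model["features"]:
            count = model["counts"][(f, c)][row[f]]
            k = len(model["values"][f])
            score += math.log((count + alpha) / (n_c + alpha * k))
        log_scores[c] = score
    top = max(log_scores.values())
    exp = {c: math.exp(s - top) for c, s in log_scores.items()}
    total = sum(exp.values())
    return {c: v / total for c, v in exp.items()}


def accuracy(model, rows, label="Game_Result"):
    """Share of rows whose most probable class matches the actual label."""
    hits = 0
    for r in rows:
        probs = predict_proba(model, r)
        hits += max(probs, key=probs.get) == r[label]
    return hits / len(rows)


# Toy, made-up games: binned runs and home runs, 1 = Win, 0 = Loss.
games = [
    {"runs_bin": "high", "hr_bin": "1+", "Game_Result": 1},
    {"runs_bin": "high", "hr_bin": "0", "Game_Result": 1},
    {"runs_bin": "high", "hr_bin": "1+", "Game_Result": 1},
    {"runs_bin": "low", "hr_bin": "0", "Game_Result": 0},
    {"runs_bin": "low", "hr_bin": "1+", "Game_Result": 0},
    {"runs_bin": "low", "hr_bin": "0", "Game_Result": 1},
]

model = train(games, ["runs_bin", "hr_bin"])
probs = predict_proba(model, {"runs_bin": "high", "hr_bin": "1+"})
train_accuracy = accuracy(model, games)
print(probs, train_accuracy)
```

Scoring the same rows you trained on flatters the model, which is why I am holding out the 2022 season as test data. In this sketch you would train on the 2016-2021 rows and call accuracy with the 2022 rows instead.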

Think you can best my model? Game on! The baseball data analysis challenge continues. Play ball!