I know nothing about basketball.
Okay maybe I know a little bit about basketball. Going to the University of Iowa during the Caitlin Clark era (and having a family like mine) will do that to you. I am a data dork of some renown. I have mildly-checked ADHD. These comorbid conditions mean that, for the first week of March, I am consumed by bracketology.
For the past two NCAA Division I Men’s and Women’s Basketball Tournaments, I have had a singular goal: build the most accurate bracket using boxscores and team ratings alone. I am not (b)all-knowing and do not pretend to be, but I’m good at finding a dataset’s pulse. I did an A/B test in the 2025 tournament to check myself — half my brackets were made by sports-brain, and the other half by spreadsheet. My sports-brain brackets were, unsurprisingly, quite shit. My spreadsheet brackets placed 525th and 1,734th out of ~25 million in the ESPN bracket challenge.
And so began the Madness.
My basic research question is simple: using only boxscores and NET rankings, how much of the Tournament can I correctly predict? I kept the methodology simple in part because there’s no practical way to aggregate injury, tactical, or advanced metrics; the other part is that I am lazy. With those constraints in mind, the experimental setup is straightforward. Roughly,
1. Compile a database of boxscores for all of this season’s NCAA Division I men’s and women’s college basketball games.
2. Compile a database of NET rankings for the same.
3. Put the databases from (1) and (2) together to make a giant database of every single game with specific statistics attached.
4. For each team in the Tournament:
   1. look in the database from (3) for all the games the team has played;
   2. find “signal” games to tell us how that team does against certain types of teams (better, worse, just-as-good) and in certain types of games (close games, conference games, neutral-court games);
   3. assign each of these “signal” games a score based on some (weighted) combination of statistics;
   4. create a fact sheet for the team for quick-referencing the different types of teams, games, and scores.
5. Make brackets. That means individually assessing each matchup and sometimes taking guesses. Hopefully, with all this data in our back pocket, those guesses will be educated ones.
6. Run a post-Tournament post-mortem to assess how this year’s version of the scoring system did.
Everything here relies solely on data and my choices about which games count. It completely ignores the (sometimes surprising) discrepancy between algorithmic NET rankings and NCAA Tournament Selection Committee seeding. It incorporates nothing about injury reports, scouting, insider or expert knowledge, Tournament game location, officiating, or recent trends. It is also built to require the least amount of time and effort to use; speed is at a premium.
It is not a gambling tool or anything remotely designed to turn a profit; I cannot underscore this fact enough. (If I give you access to the code, please don’t use it for that.) It is purely and entirely for fun.
Collecting the data, it turns out, is the easy part. I’ve always used ESPN to track game scores (especially now that they cover the NHL to a reasonable degree), so I inquired about whether they have an API. I did not receive a reply. I did, however, dig through the network requests my browser makes whenever I load a boxscore on their website, and found that they just hit a URL with a query string attached. To my knowledge, the API can only spit out game scores by date — i.e. one day’s games at a time — so I just had to figure out the standard date format, remove all the extraneous parameters from the query, and hit the URL for every day of regular-season play for all conferences. For example, here’s part of the JSON payload for games played on November 4th, 2024:
[
  {
    "id": "401713576",
    "uid": "s:40~l:54~e:401713576",
    "date": "2024-11-05T00:30Z",
    "name": "Michigan Wolverines at South Carolina Gamecocks",
    "shortName": "MICH VS SC",
    "season": {
      "year": 2025,
      "type": 2,
      "slug": "regular-season"
    },
    "competitions": [
      {
        "id": "401713576",
        "uid": "s:40~l:54~e:401713576~c:401713576",
        "date": "2024-11-05T00:30Z",
        "attendance": 0,
        "type": {
          "id": "6",
          "abbreviation": "TRNMNT"
        },
        "timeValid": true,
        "neutralSite": true,
        "conferenceCompetition": false,
        "playByPlayAvailable": false,
        "recent": false,
        "venue": {
          "id": "5060",
          "fullName": "T-Mobile Arena",
          "address": {
            "city": "Las Vegas",
            "state": "NV"
          },
          "indoor": true
        },
        "competitors": [
          {
            "id": "2579",
            "uid": "s:40~l:54~t:2579",
            "type": "team",
            "order": 0,
            "homeAway": "home",
            "winner": true,
            "team": {
              "id": "2579",
              "uid": "s:40~l:54~t:2579",
              "location": "South Carolina",
              "name": "Gamecocks",
              "abbreviation": "SC",
              "displayName": "South Carolina Gamecocks",
              "shortDisplayName": "South Carolina",
              "color": "73000a",
              "alternateColor": "ffffff",
              "isActive": true,
              "venue": {
                "id": "1962"
              },
              ...
<continues for 18,687 more lines>
              ...
  }
]
The code for getting these data is pretty simple: loop through all the days of the season (overshooting by a bit on each end to catch stragglers), hit the URL, loop through the games, throw out properties I know I won’t need later, put all the games into one big list, and save it.
# Construct date queries.
dates = pd.date_range(start=madness.start, end=madness.end).to_pydatetime().tolist()
dates = [t.strftime(r"%Y%m%d") for t in dates]

# For each (league, abbreviation) pair, pull down every event from the ESPN
# web API, by day, then dump to a file.
for league, abb in zip(espn.leagues, espn.abbreviations):
    # Grab things from the URL.
    url = f"https://site.api.espn.com/apis/site/v2/sports/basketball/{league}/scoreboard?groups=50&dates="
    events = []
    for date in tqdm(dates, desc=league):
        data = requests.get(url + date)
        games = data.json()["events"]
        # Delete unnecessary data.
        for game in games:
            del game["links"]
            del game["status"]
            for competition in game["competitions"]:
                for competitor in competition["competitors"]:
                    try:
                        del competitor["team"]["links"]
                        del competitor["records"]
                    except KeyError:
                        pass
            events.append(game)
    with gzip.open(espn.raw(abb), "w") as f:
        jstr = json.dumps(events).encode("utf-8")
        f.write(jstr)
There were 16 games on November 4th, so each game takes ~1,172 lines of JSON. Ballpark, there are ~1,000 individual properties attached to each game of basketball, most of them completely useless to us. Uncompressed, all the data for all the games clocks in at ~200MB; compressed, it’s ~6MB. Grabbing scores for each of the 12,122 games from ESPN takes ~30 seconds per league (womens-college-basketball and mens-college-basketball).
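Reading one of those dumps back is the mirror image of writing it. A minimal sketch (the payload and temp-file path here are stand-ins; in my pipeline the path comes from the `espn.raw` helper):

```python
import gzip
import json
import tempfile

def load_events(path):
    """Read a gzip-compressed JSON dump of events back into a list."""
    with gzip.open(path, "r") as f:
        return json.loads(f.read().decode("utf-8"))

# Round-trip check with a stand-in payload.
events = [{"id": "401713576", "name": "Michigan Wolverines at South Carolina Gamecocks"}]
with tempfile.NamedTemporaryFile(suffix=".json.gz", delete=False) as tmp:
    path = tmp.name
with gzip.open(path, "w") as f:
    f.write(json.dumps(events).encode("utf-8"))

assert load_events(path) == events
```

The ~30× compression ratio is unsurprising: the payload is mostly repeated property names, which gzip eats for breakfast.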
The next step is finding NET rankings. ESPN doesn’t track these (or, at least, didn’t), so I went straight to the horse’s mouth: the NCAA statistics website. This is an “abandon hope, all ye who enter here” kind of website. It doesn’t have an API (that I could find) to hit for data and, to my knowledge, just has everything stored in some SQL table that gets spit out by PHP or something. I have no idea. I’m just happy to be alive. Anyhow, NET rankings are posted every day from the beginning of conference play to Selection Sunday. I don’t want to have to triangulate those dates, so I hit (e.g.) this page and parse the ranking dates out of the table. Then, for each date, I use an automated browser to open the URL for that date’s rankings, parse the rankings from the rendered HTML table, and stick the rankings in a spreadsheet with date-indexed rows and team-indexed columns. Here too, the code is simple.
for league in madness.leagues:
    rankings = {}
    ff = webdriver.Firefox(options=opts)
    for href, date in tqdm(Dates[league].items(), desc=league):
        url = "https://stats.ncaa.org/selection_rankings/nitty_gritties/" + href
        ff.get(url)
        doc = BS(ff.page_source, "html.parser")
        table = doc.find("tbody").find_all("tr")
        ranking = []
        for row in table:
            team = row.find_all("td")[0].find("a").get_text().replace("(AQ)", "")
            ranking.append(team)
        rankings[date] = ranking
    with open(madness.ncaa.rawNETURL(league), "w") as w:
        json.dump(rankings, w)
    ff.close()
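The `Dates` mapping fed into that loop comes from the earlier date-parsing step. A hypothetical sketch of what that parse might look like — the markup and regex here are stand-ins for whatever the page actually serves (my real version reads the rendered HTML, as above):

```python
import re

# Stand-in markup mimicking the NCAA's table of ranking dates; the real
# page's structure may differ.
html = """
<tr><td><a href="/selection_rankings/nitty_gritties/12345">01/06/2025</a></td></tr>
<tr><td><a href="/selection_rankings/nitty_gritties/12346">01/07/2025</a></td></tr>
"""

# Map each ranking page's href id to its date string.
Dates = dict(re.findall(r'href="[^"]*/(\d+)">([\d/]+)</a>', html))
# Dates == {"12345": "01/06/2025", "12346": "01/07/2025"}
```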
A small problem lies in joining the ESPN and NCAA tables. The two organizations don’t share team names, team name shortenings, team name abbreviations, unique identifiers, anything — so I have to manually match them. It is a pain in the ass. Were it not for my love of the game, I wouldn’t do it. (Honestly it’s just making an ESPN \(\longleftrightarrow\) NCAA name mapping, and there’s almost an automated way to do it. Still sucks.)
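The “almost automated” way is fuzzy string matching: let the computer take a first pass at the mapping, then fix its mistakes by hand. A sketch with stand-in team names (this is not my actual matcher):

```python
import difflib

def match_names(espn_names, ncaa_names, cutoff=0.6):
    """First-pass ESPN -> NCAA name mapping. Returns None where nothing
    clears the similarity cutoff; those (and any wrong matches) still
    need a human."""
    mapping = {}
    for name in espn_names:
        close = difflib.get_close_matches(name, ncaa_names, n=1, cutoff=cutoff)
        mapping[name] = close[0] if close else None
    return mapping

mapping = match_names(["St. Thomas", "Ole Miss"], ["St. Thomas (MN)", "Mississippi"])
# "St. Thomas" pairs up with "St. Thomas (MN)"; "Ole Miss" comes back None
# because it shares too few characters with "Mississippi".
```

The failure mode is exactly why the job can’t be fully automated: nicknames, campus qualifiers, and “State” abbreviations defeat naive similarity scores.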
After figuring out how to match up names, it’s time to iterate through all the games, pick out only the relevant statistics (from ESPN’s data), assign NET rankings (from the NCAA’s data), and make the database from Step 3. The statistics I use are
- field goals attempted
- field goals made
- three-pointers attempted
- three-pointers made
- free throws attempted
- free throws made
- assists
- rebounds
and from there compute shooting percentages and season averages. I also throw out games where either team’s score was below five points, or either team was a DII team. Then I save the spreadsheet, and save a sub-spreadsheet involving only Tournament teams. (I don’t do mid-major erasure here.) Notice that I’m taking only team statistics into account, never individual ones. Those are for fantasy games, and they’d just bog me down here. All told, querying for and paring down these spreadsheets takes ~0 seconds.
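The derived percentages are just elementwise column arithmetic. A sketch with stand-in column names and numbers (my real columns follow the `home.`/`away.` prefix scheme):

```python
import pandas as pd

# Stand-in box-score counts for two games.
games = pd.DataFrame({
    "home.fga": [60, 55],   # field goals attempted
    "home.fgm": [27, 22],   # field goals made
    "home.3pa": [25, 18],   # three-pointers attempted
    "home.3pm": [10, 6],    # three-pointers made
})

# Shooting percentages are vectorized column divisions: no per-row loops.
games["home.fg%"] = games["home.fgm"] / games["home.fga"]
games["home.3p%"] = games["home.3pm"] / games["home.3pa"]
# games["home.fg%"].tolist() == [0.45, 0.4]
```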
Once the big spreadsheets are made, it’s time for Step 4.1 — winnowing the data to whatever I think will be relevant later, and creating quick-lookup spreadsheets so I don’t have to do a bunch of expensive querying. Creating these spreadsheets is harder than it might first appear: ESPN reports their games by designating ‘home’ and ‘away’ teams, so in the Step 3 spreadsheets, there’s no ‘in-focus’ (or ‘primary-focus’) team. Why is this troublesome? If, for example, you wanted to compile the South Dakota State University Jackrabbits’ win-loss record, you would have to first query whether SDSU was the ‘home’ or ‘away’ team, then compare the scores of the home and away teams, for each individual game. Both these queries are per-row operations, which are extremely expensive and thus slow. I do not want slow. I am lazy. I want fast. Immediate. So if I could decide SDSU’s win-loss record just by comparing a column of SDSU’s scores to a column of their opponent’s scores, I’d perform zero queries and use vectorized operations.
for team in tqdm(teams, desc=league):
    record = games[(games["home"] == team) | (games["away"] == team)].copy()
    record = record.reset_index(drop=True)
    # Re-work things so we have a "focus" team and an "opposing" team. This
    # way we can easily re-categorize by game type.
    record["designation"] = "home"
    record["focus"] = team
    record["opponent"] = ""
    for gametype in ["home", "away"]:
        gamesoftype = record[record[gametype] == team].index
        opp = "away" if gametype == "home" else "home"
        record.loc[gamesoftype, "designation"] = gametype
        record.loc[gamesoftype, "opponent"] = record.loc[gamesoftype, opp]
    record["location"] = "neutral"
    nonneutral = record[record["neutral"] < 1].index
    record.loc[nonneutral, "location"] = record.loc[nonneutral, "designation"].values
    statsbytype = analysis.statistics.averages + analysis.statistics.pergame
    outcomesbytype = analysis.statistics.outcome + analysis.statistics.rank
    record[statsbytype] = 0
    record[statsbytype] = record[statsbytype].astype(float)
    record[outcomesbytype] = 0
    record[outcomesbytype] = record[outcomesbytype].astype(int)
    for gametype in ["home", "away"]:
        gamesoftype = record[record["designation"] == gametype].index
        opp = "away" if gametype == "home" else "home"
        for statistic in analysis.statistics.statistics:
            record.loc[gamesoftype, f"focus.{statistic}"] = record.loc[gamesoftype, f"{gametype}.{statistic}"]
            record.loc[gamesoftype, f"focus.avg.{statistic}"] = record.loc[gamesoftype, f"{gametype}.avg.{statistic}"]
            record.loc[gamesoftype, f"opponent.{statistic}"] = record.loc[gamesoftype, f"{opp}.{statistic}"]
            record.loc[gamesoftype, f"opponent.avg.{statistic}"] = record.loc[gamesoftype, f"{opp}.avg.{statistic}"]
        for rating in analysis.statistics.ratings + analysis.statistics.aggregate:
            record.loc[gamesoftype, f"focus.{rating}"] = record.loc[gamesoftype, f"{gametype}.{rating}"]
            record.loc[gamesoftype, f"opponent.{rating}"] = record.loc[gamesoftype, f"{opp}.{rating}"]
    # Increment the NET rating, since it's an order statistic (and thus
    # zero-indexed).
    record["focus.NET"] += 1
    record["opponent.NET"] += 1
    # Add win indicator and quadrant ratings.
    record["win"] = (record["focus.score"] > record["opponent.score"]).astype(bool)
    record["quadrant"] = madness.ncaa.quads(record["location"], record["opponent.NET"])
    # Drop all the unnecessary columns.
    record = record.drop(analysis.statistics.drop(record), axis=1)
    record = record[analysis.columns]
You may have noticed that I’m adding a Quad rating at this stage. Quad ratings are a rule-of-thumb ranking system that accounts for opponent strength and game location based on the principle that it’s harder to beat better teams in their own barn, and easier to beat worse teams in yours.
An example: Colorado (NET 54) is playing Iowa State (NET 32) at Hilton Coliseum, ISU’s home court. If ISU wins, it’s a Quad 2 (Q2) win for ISU and a Q1 loss for Colorado; if Colorado wins, it’s a Q1 win for Colorado and a Q2 loss for ISU. If the game is instead played at the CU Events Center in Boulder and ISU wins, it’s a Q1 win for ISU and a Q2 loss for Colorado; if Colorado wins, it’s a Q2 win for Colorado and a Q1 loss for ISU.
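To make the boundaries concrete, here’s a hypothetical scalar version of the quadrant lookup (my real `madness.ncaa.quads` works on whole columns at once). The cutoffs are the NCAA’s published quadrant boundaries as I understand them: Q1 is home vs. NET 1–30, neutral vs. 1–50, away vs. 1–75, and so on down the table.

```python
# Upper NET bound of Quads 1-3 by game location; anything beyond is Quad 4.
BOUNDS = {
    "home":    (30, 75, 160),
    "neutral": (50, 100, 200),
    "away":    (75, 135, 240),
}

def quad(location, opponent_net):
    """Return the quadrant (1-4) of a game from the in-focus team's side."""
    q1, q2, q3 = BOUNDS[location]
    if opponent_net <= q1:
        return 1
    if opponent_net <= q2:
        return 2
    if opponent_net <= q3:
        return 3
    return 4

# The example above: ISU at home vs. Colorado (NET 54) is a Quad 2 game,
# while Colorado away vs. ISU (NET 32) is a Quad 1 game.
```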
This ranking system can be confusing, and I only include it in case it becomes algorithmically relevant. The Selection Committee uses it as part of their review process, but I haven’t found a way to fit it into mine yet. Regardless, I categorize each team’s games into six non-mutually exclusive types:
- Close.
- A game is close when the in-focus team is within 15 NET spots of their opponent and the score difference is five points or fewer.
- Near-miss.
- A near-miss game is one where the in-focus team beats an opponent ranked at least 15 spots higher (worse) in the NET by five points or fewer. (Some near-miss games are close games.)
- Upset.
- A game is an upset win when the in-focus team beats an opponent ranked at least 15 spots lower (better) in the NET. An upset loss is when the in-focus team loses to an opponent ranked at least 15 spots higher (worse) in the NET.
- NET-close.
- A NET-close game is one where the in-focus team’s opponent is within 15 spots in the NET. (All close games are NET-close games.)
- NET-lower, NET-higher.
- NET-lower and NET-higher games are ones where the in-focus team is ranked lower (better) or higher (worse) than its opponent, respectively.
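All six types can be computed as cheap vectorized masks over the focus/opponent columns. A sketch with stand-in rows (column names follow the layout built above; remember that in the NET, a lower number is better):

```python
import pandas as pd

# Three stand-in games from the in-focus team's perspective.
record = pd.DataFrame({
    "focus.NET":      [10, 40, 12],
    "opponent.NET":   [12, 10, 80],
    "focus.score":    [70, 68, 70],
    "opponent.score": [66, 60, 66],
})
net_gap = record["focus.NET"] - record["opponent.NET"]  # negative: focus is better
margin = record["focus.score"] - record["opponent.score"]
win = margin > 0

record["net_close"] = net_gap.abs() <= 15
record["close"] = record["net_close"] & (margin.abs() <= 5)
# Beat a much worse team by five or fewer.
record["near_miss"] = win & (net_gap <= -15) & (margin <= 5)
# Beat a much better team, or lose to a much worse one.
record["upset"] = (win & (net_gap >= 15)) | (~win & (net_gap <= -15))
record["net_lower"] = net_gap < 0
record["net_higher"] = net_gap > 0
# Row 0 is a close win, row 1 an upset win, row 2 a near-miss.
```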
Split out these games from the summaries, write them to file individually (though I’m debating the worth of this at the moment), and I’m done with data processing; this last step takes ~8 seconds. On to computing the scores.


