Ben Griffis

@BeGriffis

Jan 24, 2023
I think I mentioned to @DrMukherjeeS last week, but wanted to make a very short thread on this...
I think the lack of free, official, & standardized football data is hurting the acceptance of data in the game on a wider scale.
🧵
Apologies, but I will be forced to use American sports for my examples. Do not reject all things American just because they're American.
We may be weird, but we know how to deliver sports data apparently
First, why would free, official, comprehensive data from 1 source be good? Doesn't that hurt businesses?
You capitalist, you.
Yes, harms businesses who create their own data.
But if the provided data is *quality*, then it's only a massive positive for the sport
With non-standardized data & definitions like we have now in ⚽️, we run into issues like "key pass", a very basic concept, meaning different things for different people.
Opta & StatsBomb use 1 definition, Wyscout another.
Wyscout's "shot assist" = Opta/SB "key pass"
Obviously, it's a problem because if I mainly use StatsBomb and talk to someone who mainly uses Wyscout, we would have to establish ahead of time what set of data definitions we're gonna use when talking
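To make the translation problem concrete, here's a minimal sketch of a lookup table between provider vocabularies. It only encodes the one equivalence mentioned above (Wyscout "shot assist" = Opta/StatsBomb "key pass"). The concept label `opta_sb_key_pass` and the `translate` helper are hypothetical names made up for illustration, not anything a provider ships.

```python
# (provider, provider's term) -> shared concept label.
# The label "opta_sb_key_pass" is a hypothetical name for illustration only.
PROVIDER_TERMS = {
    ("Opta", "key pass"): "opta_sb_key_pass",
    ("StatsBomb", "key pass"): "opta_sb_key_pass",
    ("Wyscout", "shot assist"): "opta_sb_key_pass",
}


def translate(term: str, source: str, target: str) -> str:
    """Return the target provider's name for the source provider's term."""
    concept = PROVIDER_TERMS.get((source, term))
    if concept is None:
        # We don't know what this term means for this provider, so don't guess.
        raise KeyError(f"No known mapping for {source!r} term {term!r}")
    for (provider, name), known_concept in PROVIDER_TERMS.items():
        if provider == target and known_concept == concept:
            return name
    raise KeyError(f"{target!r} has no known term for {concept!r}")


print(translate("key pass", "StatsBomb", "Wyscout"))
```

Note that asking for Wyscout's "key pass" raises an error rather than guessing. Same words, different definition, which is exactly the problem.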
Take another example, something very basic. Possession %
Is that 45% a share of live-ball time? It is in some sources. A share of total game time? It is in some sources. Or a share of all passes played? It is in some sources.
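A rough sketch of how those three readings of possession can give three different numbers for the same match. All the match totals are made up. How dead-ball time gets credited to a team is my assumption (the `dead_ball_seconds` field); real providers may handle it differently.

```python
from dataclasses import dataclass


@dataclass
class TeamTotals:
    live_ball_seconds: float  # seconds on the ball while play is live
    dead_ball_seconds: float  # stoppage seconds credited to this team (assumption)
    passes: int               # passes played


def possession_by_definition(team: TeamTotals, opponent: TeamTotals) -> dict:
    """Possession share for `team` under three different definitions."""
    live_share = team.live_ball_seconds / (
        team.live_ball_seconds + opponent.live_ball_seconds
    )
    team_total = team.live_ball_seconds + team.dead_ball_seconds
    opp_total = opponent.live_ball_seconds + opponent.dead_ball_seconds
    game_share = team_total / (team_total + opp_total)
    pass_share = team.passes / (team.passes + opponent.passes)
    return {
        "live_ball_time": live_share,
        "total_game_time": game_share,
        "share_of_passes": pass_share,
    }


# Made-up totals for one match.
home = TeamTotals(live_ball_seconds=1350, dead_ball_seconds=1500, passes=380)
away = TeamTotals(live_ball_seconds=1650, dead_ball_seconds=900, passes=420)

for name, share in possession_by_definition(home, away).items():
    print(f"{name}: {share:.1%}")
```

With these invented numbers the same team comes out at 45.0%, 52.8% or 47.5% possession depending on which definition you pick.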
We also run into issues with more subjective things across data providers.
"big chance" is a weird metric by itself, but partially b/c it's very subjective and completely different for every single data provider
If it was just 1 provider, there'd be fewer issues, but still some
Beyond data definitions, we run into problems with different models.
expected goals (xG) is probably the biggest example we can use to show these problems.
StatsBomb has one model, Opta another, Wyscout another, Understat another...
The differences in the numbers these models spit out come from both the data used to train/make the model and the variables used in the model (do they account for position of defenders? or the weather? or the type of assist on the shot? etc.)
For 1 shot, the differences can be relatively small and insignificant
Is there really THAT big a difference between a 0.10 xG shot and a 0.08 xG shot?
But over a season?
We will look at a player with 17 xG differently than a player with 13.6 xG.
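The arithmetic behind that gap, as a quick sketch. The 170 shots is my assumption, picked because it's the shot count where 0.10 per shot adds up to 17 xG. Real models won't differ by a flat 0.02 on every shot, so treat this as an illustration of how small gaps accumulate.

```python
def season_xg(shots: int, xg_per_shot: float) -> float:
    """Total xG if every shot were valued at the same per-shot xG."""
    return shots * xg_per_shot


SHOTS = 170  # assumed season shot count, not from any real player

model_a = season_xg(SHOTS, 0.10)  # one model's valuation
model_b = season_xg(SHOTS, 0.08)  # another model's valuation of the same shots

print(f"{model_a:.1f} xG vs {model_b:.1f} xG, gap of {model_a - model_b:.1f}")
```

A 0.02 difference nobody would argue about on a single shot turns into 3.4 xG over the season.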
But the point of this is not to say we need to all use one data provider. While that's optimal, it's obviously not going to happen as data wasn't inherently democratized in ⚽️ from the start like it (more or less) was in ⚾️🏀🏈 etc
However, if data was standardized and widely available—and I'm talking advanced data—it would be much easier to ingrain data in the sport. And easier for people to learn how to use/manipulate/analyze the data. And help clubs get more talent too b/c of more supply of good analysts
Baseball (MLB) and basketball (NBA) in the US are 2 perfect examples of the benefits of high-quality, free, advanced data from 1 source: the league itself.
MLB has Baseball Savant, which is an insane tool having both an API you can pull from, & pre-made visuals on almost anything you can imagine
Pitch speed, ball spin rate, speed of the ball off the bat, basically anything that can be measured is & is public
baseballsavant.mlb.com
NBA also has advanced stats, as well as an API like MLB.
Less easy to get visuals, as the main thing is the data itself, but they a) have official, high-quality advanced data, and b) visuals like shot charts you can make yourself on the website
nba.com
Naturally, having this type of data standardized & freely available to download and analyze is beneficial
Lots of these advanced metrics are in everyday conversation & newspapers. As a kid I learned too many baseball stat acronyms, never realized just how advanced some were
"Moneyball" is a term everyone knows now, and is used to describe clubs like Brentford & Brighton in the Premier League.
Of course, moneyball wouldn't have taken off in sports without baseball having such a crazy wealth of data for a bunch of kids to play around with
The NBA has taken lots of "moneyball" ideas as well, also partially made possible by the giant pool of data people could easily create and communicate new metrics with.
The increase in 3 point attempts is because of this.
eFG% (effective field goal percentage) exists because of this.
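For anyone who hasn't met eFG%: it's field goal percentage with made threes counted 1.5x, since they're worth 1.5x the points. The formula is eFG% = (FGM + 0.5 × 3PM) / FGA. A small sketch, with hypothetical function and parameter names and made-up shooting lines:

```python
def effective_fg_pct(fg_made: int, three_made: int, fg_attempted: int) -> float:
    """eFG% = (FGM + 0.5 * 3PM) / FGA, where fg_made includes made threes."""
    if fg_attempted <= 0:
        raise ValueError("fg_attempted must be positive")
    return (fg_made + 0.5 * three_made) / fg_attempted


# Made-up shooting lines: both players take 10 shots.
two_point_shooter = effective_fg_pct(fg_made=5, three_made=0, fg_attempted=10)
three_point_shooter = effective_fg_pct(fg_made=4, three_made=4, fg_attempted=10)

print(f"{two_point_shooter:.3f} vs {three_point_shooter:.3f}")
```

The player hitting 4 of 10 threes (0.600) rates above the one hitting 5 of 10 twos (0.500), which is the logic behind teams shooting more threes.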
Of course there are many sources for ⚽️ data that people can go to. But there's:
1) not that much, mainly the top of the top leagues
2) the issue of definitions
3) the issue of data quality across sources
4) the issue of accessibility for more metrics/quality data
The best way to address this would be to have FIFA collect & disseminate all the data for all their leagues in the world. Even just top tiers.
That then brings up the issue of the companies that currently do this, and not for free. There's an infrastructure issue too
This will almost certainly not happen, and of course the onus isn't on the companies providing data to make all their data free.
StatsBomb is honestly incredible at providing copious amounts of data for free. Has helped so many people learn and get jobs. Can't stress that enough
But imagine if we had something like StatsBomb 360, but in real-time (or within a few minutes of the game ending).
That level of quality, accessibility, & statistical coverage is similar to what MLB and the NBA have.
Granted, we don't have this level of data coverage in all basketball or baseball leagues around the world. So imagine if just the top 5 leagues had this standardized and expanded data coverage. FBRef is a peerless tool, but still limited in the metrics it offers
It's very easy for fans of the sport to hate data coming into the game. It's not standardized, it's often not available for people to play with themselves, and often there's no/not great communication from the leagues/pundits themselves on what the data actually is & what it means
Data is often used so incredibly poorly by so many in ⚽️, including media. I don't see the same in ⚾️ & 🏀 in part because of the standardization & availability, coupled of course with the longer time that advanced data has been in the sport
Overall, these are just my thoughts on this. Have been thinking about it since last week when it came up in a discussion
& like I said, we're at a stage in football data now where there's probably not a route to mirror ⚾️ & 🏀. Data is used by many clubs & from so many providers
I'm not trying to offer a solution, as there really isn't one without essentially forcing data providers to either close doors or shift into just offering consulting services
Just wanted to put my thoughts on data out there
(thread wasn't short at all, that was a poor joke lol)
