I want to talk about my data validation journey in ML, and where I’m at now. I have been thinking about this for ~6 years. It starts with me as an intern at FB. The task was to classify FB profiles into types (e.g., politician, celebrity). I collected training data,
split it into train/val/test, iterated on the feature set a bit, and eventually got a good test accuracy. Then I “productionized” it, i.e., put it in a Dataswarm pipeline (precursor to Airflow afaik). Then I went back to school before the pipeline ran more than once.
Midway through my intro DB course I realized that all the pipeline was doing was generating new training data and model versions every week. No new labels. So the pipeline made no sense. But whatever, I got into ML research and figured I’d probably never do ML in industry again.
Then I became an ML engineer, working directly with clients to identify opportunities for ML. My first production experience was a proof of concept: I cleaned some of the client’s data, randomly split it into train/val/test, trained a model, and showed some nice plots.
The client liked it and asked if they could get a weekly report with those plots. And this is when I learned that single-use ML and repeated-use ML can both be production ML but are two completely different things. I didn’t know how to do the latter bc all we had was a dump of CSVs.
After several bureaucracy meetings, we got access to their Snowflake. But labels still came only in CSVs, once every few weeks or so. I wrote a DAG that pulled from their Snowflake, trained the model on whatever labels we had available, and sent an email of predictions.
One week into this deployment, many of the predicted scores were the same value. I didn’t know what to do. I looked at the 2 TB of Snowflake data and wished I were not an engineer. Several hours later, I realized that many rows were completely null-valued.
And only here did I realize that I had to do some “data validation.” First, I tried to get around it by only tracking ML accuracy. But we couldn’t compute that bc we had label delays. Then I wrote a Spark job to count the number of nulls for each feature and log it to CloudWatch.
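Roughly, that job looked something like this (a minimal PySpark sketch; the table name, columns, and CloudWatch namespace are all made up):

```python
import boto3
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("null-counts").getOrCreate()
cloudwatch = boto3.client("cloudwatch")

# Hypothetical feature table pulled from the warehouse.
df = spark.table("features_daily")

# One pass over the data: count nulls per column.
null_counts = df.select(
    [F.count(F.when(F.col(c).isNull(), 1)).alias(c) for c in df.columns]
).first().asDict()

total = df.count()
for col, n_null in null_counts.items():
    cloudwatch.put_metric_data(
        Namespace="ml/feature_quality",  # made-up namespace
        MetricData=[{
            "MetricName": "null_fraction",
            "Dimensions": [{"Name": "feature", "Value": col}],
            "Value": n_null / max(total, 1),
        }],
    )
```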
Every morning I got these CloudWatch alerts because there was almost always a nonzero number of nulls. I didn’t know what a good threshold would be, so I just muted the alerts. For “monitoring,” I simply inspected the top 20 predictions before the email was sent.
My eyeball monitoring plan worked fine until we got some more data scientists, ML engineers, and clients. Then I wrote up a ticket to compute bounds for each feature: p50 ± 3 * IQR. I piped the bounds into a YAML file and used that as “data cleaning.”
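In case it helps, the bounds job was roughly this shape (a sketch; the table name, the error tolerance in approxQuantile, and the output path are arbitrary):

```python
import yaml
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("feature-bounds").getOrCreate()
df = spark.table("features_daily")  # hypothetical feature table

numeric_cols = [f.name for f in df.schema.fields
                if f.dataType.typeName() in ("double", "float", "integer", "long")]

bounds = {}
for col in numeric_cols:
    # Approximate p25 / p50 / p75 per column; approxQuantile trades
    # exactness for speed, hence the 0.01 relative error.
    p25, p50, p75 = df.approxQuantile(col, [0.25, 0.5, 0.75], 0.01)
    iqr = p75 - p25
    bounds[col] = {"lower": p50 - 3 * iqr, "upper": p50 + 3 * iqr}

with open("feature_bounds.yaml", "w") as f:
    yaml.safe_dump(bounds, f)
```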
The data cleaning honestly had little impact. I found another bug where no new data had been added to Snowflake and we were simply sending the same predictions in each email. Then I added a new rule to the pipeline: if data is more than n days old, email me with an alert.
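That rule amounted to a single check against the newest ingestion timestamp (a sketch; the ingested_at column, the 7-day threshold, and send_alert_email are placeholders):

```python
import datetime
from pyspark.sql import SparkSession, functions as F

MAX_AGE_DAYS = 7  # the "n days" in the rule; arbitrary here

spark = SparkSession.builder.appName("staleness-check").getOrCreate()
df = spark.table("features_daily")  # hypothetical feature table

latest = df.agg(F.max("ingested_at")).first()[0]  # newest row we have
age = datetime.datetime.utcnow() - latest

if age > datetime.timedelta(days=MAX_AGE_DAYS):
    send_alert_email(f"Feature data is {age.days} days old")  # placeholder alert hook
```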
On and on, every time a bug came up, I added some new hacky rule. Several months later, I found another bug: the bounds were messed up for a “total” (cumulative) feature. Turns out some distributions shift over time, and the p50 and IQR need to reflect that.
Then I got really excited about distribution shift. I started monitoring all these KL divergences, JS distances, KS test statistics, etc. Meanwhile, I applied to PhD programs because I felt distribution shift would be a cool topic to study.
I would get alerts full of these distance and statistic values, for every feature, and not know what to do with them. This felt worse than the null alerts, where at least the values were interpretable bc they were fractions. A KL divergence makes no sense without context.
Eventually I realized I should just recompute my feature bounds daily. So I did that, and all was fine. I paused the Airflow DAG that computed all the distribution distances. For fun I tried to write the KS test in Spark (w/o UDFs). Clearly I was just waiting to hear from schools.
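If you’re curious, the UDF-free version is just window functions over the pooled sample; something like this (a sketch; it uses an unpartitioned window, so it only makes sense for modest sample sizes):

```python
from pyspark.sql import DataFrame, Window, functions as F

def ks_statistic(df_a: DataFrame, df_b: DataFrame, col: str) -> float:
    """Two-sample KS statistic: max gap between the two empirical CDFs."""
    a = df_a.select(F.col(col).alias("x"), F.lit(1).alias("in_a"), F.lit(0).alias("in_b"))
    b = df_b.select(F.col(col).alias("x"), F.lit(0).alias("in_a"), F.lit(1).alias("in_b"))
    pooled = a.unionByName(b)

    n_a, n_b = df_a.count(), df_b.count()
    # rangeBetween (not rowsBetween) so ties get counted together:
    # every row with value <= current x is in the frame.
    w = Window.orderBy("x").rangeBetween(Window.unboundedPreceding, Window.currentRow)

    ecdfs = (pooled
             .withColumn("cdf_a", F.sum("in_a").over(w) / F.lit(n_a))
             .withColumn("cdf_b", F.sum("in_b").over(w) / F.lit(n_b)))
    return ecdfs.agg(F.max(F.abs(F.col("cdf_a") - F.col("cdf_b")))).first()[0]
```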
I got to grad school and thought, actually I should read the data validation literature. I was surprised to find that many of the things I had previously monitored (nulls, bounds) were already considered there. (In the future maybe I’ll write a thread on dataval basics.)
Year 1, I did an industry collab as a research scientist, and an interview study. Both were very eye-opening. At first I thought there must be clever strategies to protect against distribution shift. I found one: regularly retrain & recompute transformations.
Regular retrains only work when you have fresh labels. When there are no new labels, you need other strategies, which I won’t get into here. Even with regular retrains, I was still surprised by how often everything was breaking.
It turns out that data validation comes back to bite, again. The basic checks (nulls, types, bounds) are not enough. Sometimes there are errors that do not show up as nulls or egregious outliers. They could be a default enum value. They could be stale values.
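To make that concrete, here are two checks of that flavor, sketched in PySpark (the columns, the “UNKNOWN” default, and the thresholds are all made up):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("silent-error-checks").getOrCreate()
df = spark.table("features_daily")  # hypothetical feature table

# 1. A spike in a default enum value ("UNKNOWN") often means an upstream
#    join or parser silently failed, even though nothing is null.
total = df.count()
default_frac = df.filter(F.col("device_type") == "UNKNOWN").count() / max(total, 1)
if default_frac > 0.05:  # arbitrary; in practice compare to a historical baseline
    print(f"device_type is 'UNKNOWN' in {default_frac:.1%} of rows")

# 2. Stale values: a feature that stops changing may mean the source stopped updating.
latest_ds = df.agg(F.max("ds")).first()[0]  # hypothetical partition column
distinct_vals = df.filter(F.col("ds") == latest_ds).select("session_count").distinct().count()
if distinct_vals <= 1:
    print("session_count is constant in the latest partition; the source may be stale")
```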
If you retrain on corrupted data, you get a corrupted model. If you serve a corrupted model, you get bad predictions. If you get bad predictions, you get corrupted future training data. (Now I will score a children’s book deal)
So yeah, you need data validation. But you need accurate data validation. It should flag all errors without raising false positives. And I learned from the interview study that no one has solved this. Forget tying data validation alerts to product metrics; we can’t even do it for ML metrics.
Looks like I am running out of tweets. Didn’t know that was possible. Next time I’ll tweet about some of the technical ideas of data validation, with links to the relevant literature. Maybe next next time I’ll give thoughts on deploying ChatGPT with Kubeflow for data validation.