Unit testing for ML is a big category of questions, but here are my thoughts on the data validation piece (ensuring model inputs/outputs have good enough "quality" that ML performance doesn't suffer). Old work on defining data constraints (e.g., Postgres-style) fails us now because
(1) "quality" is not easily defined by a human (are you going to comb through every feature column and hand-write bounds?) and (2) the distribution matters; it's hard to look at one record alone and know whether it's "broken"
So in the Software 2.0 world, data constraints are functions over the full relation/table that need to be constantly refit. Smells a lot like ML models themselves, which are also compressed "views" over base relations that need to be constantly updated over time
Following the OLAP trend, the "unit" of human labor is not a record anymore; it is a column. Humans create new feature columns, vertically partition data à la C-Store, etc. Add the fact that one record alone doesn't indicate a broken ML pipeline, and you get: the need to validate columns
So we already need to do data validation on columns, somehow monitoring the joint distribution of columns over time (autoregressively), so we can prune the minimal set of base records such that the resulting ML performance doesn't suffer
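To make that concrete, here's a rough sketch of what column-level checks might look like, assuming pandas DataFrames and a reference window to compare against. The function name, thresholds, and the choice of a per-column KS test are all made up for illustration, and the KS test only checks marginals, not the joint distribution, which is exactly the hard part:

```python
# Hypothetical sketch of column-level validation against a reference window.
import pandas as pd
from scipy.stats import ks_2samp

def validate_columns(reference: pd.DataFrame, batch: pd.DataFrame,
                     max_null_frac: float = 0.05, min_ks_pvalue: float = 0.01) -> dict:
    """Per-column validation metrics for a new batch vs. a reference window."""
    report = {}
    for col in reference.columns.intersection(batch.columns):
        null_frac = float(batch[col].isna().mean())
        metrics = {"null_frac": null_frac, "null_ok": null_frac <= max_null_frac}
        if pd.api.types.is_numeric_dtype(reference[col]):
            # Two-sample KS test as a cheap stand-in for "did this column's distribution move?"
            # Only checks each column's marginal, not the joint distribution across columns.
            stat, pvalue = ks_2samp(reference[col].dropna(), batch[col].dropna())
            metrics.update({"ks_stat": float(stat), "ks_pvalue": float(pvalue),
                            "dist_ok": pvalue >= min_ks_pvalue})
        report[col] = metrics
    return report
```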
In practice, not all applications care to actually prune records from being materialized into the ML model's view of predictions. Maybe safety-critical ones like AVs do, but it's OK for some ads model to serve crappy predictions for a few hours. What matters is
triggering an alert on broken columns (columns that don't pass validation) so humans can review the pipeline transformations and see if there's a bug. Now we get an additional problem that traditional trigger-supporting DBMSes don't solve:
what is the predicate over data validation metrics that should trigger an alert (greater than what threshold)? How do you maximize recall of ML pipeline performance drops and precision of alerts so you don't get alert fatigue? And similarly, since a data validation function is a view over
base data/tables/relations that needs to be rematerialized frequently, the predicate needs to be redetermined over time. This process is so application-specific: it's basically determined by how frequently your base data naturally drifts, how frequently it gets corrupted, the size of the data,
the frequency of bugs (software included) that get deployed to prod, and more. It is really hard to think of a generalizable solution. People at bigcos are training separate constraint and trigger models for their ML pipelines, with little success from what I've seen
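For what it's worth, the naive version of a self-refitting trigger is easy to write down; the hard part is that the window size and z-threshold below are exactly the application-specific knobs I'm complaining about. This is a made-up sketch, not anyone's production system:

```python
# Hypothetical sketch: alert when a validation metric looks surprising
# relative to its own recent history, refitting the baseline as it goes.
from collections import deque
import statistics

class DriftTrigger:
    def __init__(self, window: int = 200, z_threshold: float = 4.0, warmup: int = 30):
        self.history = deque(maxlen=window)  # recent metric values, the "materialized view"
        self.z_threshold = z_threshold       # how many std-devs away counts as an alert
        self.warmup = warmup                 # don't alert until there's some history

    def observe(self, value: float) -> bool:
        """Record a new metric value; return True if it should raise an alert."""
        alert = False
        if len(self.history) >= self.warmup:
            mean = statistics.fmean(self.history)
            std = statistics.pstdev(self.history) or 1e-9  # avoid divide-by-zero
            alert = abs(value - mean) / std > self.z_threshold
        # The predicate "refits" itself simply by sliding the window forward.
        self.history.append(value)
        return alert
```

Feed it one validation metric per column per batch (e.g., the KS statistic from the earlier sketch) and it alerts when that metric drifts away from its own recent history.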
Surprisingly, I think this manually-tuning-constraint-and-trigger-models paradigm can be addressed by deep learning. Imagine the first layer(s) of a DNN learning to map base records to reasonable inputs, i.e., "fixing" broken base records before the rest of the model is applied
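Something like this (PyTorch, entirely hypothetical, learning one fallback value per column; a real version would probably condition on the other columns rather than use a constant):

```python
# Hypothetical sketch: a "repair" front layer that imputes broken/missing cells
# with learned per-column fallbacks before the main model runs.
import torch
import torch.nn as nn

class RepairLayer(nn.Module):
    def __init__(self, num_features: int):
        super().__init__()
        self.fallback = nn.Parameter(torch.zeros(num_features))  # one learned default per column

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        mask = torch.isnan(x)
        return torch.where(mask, self.fallback.expand_as(x), x)

# e.g. bolt it onto the front of an ordinary model:
model = nn.Sequential(RepairLayer(num_features=32),
                      nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 1))
```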
And maybe adversarial training can have its "day in the sun" again, where the perturbations are varying column completeness fractions across batches, plus other common bugs that indicate upstream engineering issues. n/n
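The perturbation itself is cheap to simulate at training time. Here's a made-up sketch of knocking out a varying fraction of each column per batch, so the repair layer above has something to learn from:

```python
# Hypothetical sketch: simulate "partially-ingested column" bugs during training.
import torch

def corrupt_columns(x: torch.Tensor, max_missing_frac: float = 0.5) -> torch.Tensor:
    """Randomly NaN-out a per-column fraction of values, different for every batch."""
    x = x.clone()
    batch_size, num_features = x.shape
    missing_frac = torch.rand(num_features) * max_missing_frac       # each column gets its own completeness
    drop_mask = torch.rand(batch_size, num_features) < missing_frac  # broadcasts per column
    x[drop_mask] = float("nan")
    return x

# During training, run the model (RepairLayer + the rest) on corrupt_columns(batch)
# while computing the loss against the original targets.
```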
I am sure I will regret having written this thread before coffee & as a rant with no proofreading