Shreya Shankar

@sh_reya

10 Tweets Dec 09, 2022
Sunday morning thoughtdump of some reasons why I believe "doing ML in the DBMS" doesn't make sense for traditional OLTP DB architectures (sorry, I'm studying for my prelim and this is helping me learn the concepts):
What does it mean to "do ML in the DBMS"? It's not just the SQL interface layer and whatever new commands we'd need, like CREATE MODEL. DBs have more layers (query, optimization, storage). OLTP DBMSes center around a transaction model, where each txn is an atomic sequence of actions
Suppose we had a DBMS responsible for completing ML training & inference jobs. Each transaction could be a whole job (K8s-style), but this is highly inefficient: an aborted transaction (likely to happen in ML workloads) would force a full restart of the job
Yes, one can checkpoint within the job, as ML researchers/practitioners do, but it would be an anti-pattern to leave model checkpointing to the application developer. That's antithetical to the DBMS ethos
So the DBMS would map every training/inference job to a _sequence_ of transactions, but then we'd need to support cascading aborts, which DBMS researchers have gone to great lengths to prevent in OLTP systems with strict 2-phase locking
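To make the granularity tradeoff concrete, here's a toy Python sketch (my own illustration, not a real DBMS; all names and numbers are made up) comparing what an abort costs under the two mappings:

```python
# Toy model: a "training job" of N epochs, with one epoch's txn aborting.

def train_as_one_txn(num_epochs, abort_epoch):
    """Whole job = one atomic txn: an abort rolls back every epoch."""
    completed = 0
    for epoch in range(num_epochs):
        if epoch == abort_epoch:
            return 0  # txn aborts; ALL progress is discarded
        completed += 1
    return completed

def train_as_txn_sequence(num_epochs, abort_epoch):
    """Each epoch = its own txn: an abort loses only that epoch's work.
    The catch: epoch N+1 reads state committed by epoch N, so undoing
    an already-committed epoch would require cascading aborts."""
    committed = 0
    for epoch in range(num_epochs):
        if epoch == abort_epoch:
            pass  # this epoch's txn aborts and is retried immediately
        committed += 1  # epoch (re)runs and commits; prior commits survive
    return committed
```

With a failure midway through a 10-epoch job, the single-txn mapping keeps nothing while the sequence mapping keeps everything already committed, which is exactly why the sequence mapping then has to confront cascading aborts.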
Now it makes sense why ML training/inference jobs are scheduled as DAGs on K8s clusters or in MapReduce style. Conceptually it's simple: force users to access and manipulate objects correctly, perform their own optimizations, trigger aborts at a fine-grained level, etc
Second, what does memory access look like? We do full scans (repeatedly for training, with a different load order into the buffer every epoch) with selectivity factor = 1, exclusively alternating between heavy I/O and CPU/GPU work. How useful could an off-the-shelf OLTP optimizer even be?
The optimizer cost function can't simply be the traditional OLTP I/O + w * (# data points). Maybe it would be something like I/O + w * (# data points * # model params), plus forcing good indexing of feature values/tuples, but this honestly seems really hard to optimize
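A hedged sketch of the two cost shapes (functional forms and weights are my own back-of-envelope guesses, not any real optimizer's):

```python
# Hypothetical cost models, purely illustrative.

def oltp_cost(io_pages, num_rows, w=0.01):
    # Traditional-style cost: page I/O plus a small per-tuple CPU term.
    return io_pages + w * num_rows

def ml_job_cost(io_pages, num_rows, num_params, w=1e-9):
    # A training pass touches every row AND every parameter per row,
    # so the compute term scales with rows * params, not just rows.
    return io_pages + w * (num_rows * num_params)
```

Even with a tiny per-FLOP weight, 1M rows against a 100M-parameter model makes the rows * params term dominate the tuple-count term, so a cost model blind to model size would misprice these plans badly.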
Now it makes a lot of sense why TF/PyTorch/ML libraries control prefetching and record shuffling for ML workloads: they need to optimize I/O and processor time themselves. "Lower-level" ML application programmers schedule this themselves (yikes, again antithetical to the DBMS ethos)
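A minimal stdlib-only sketch of that pattern, per-epoch reshuffling plus background prefetching so I/O overlaps compute (illustrative of the idea, not how TF/PyTorch are actually implemented):

```python
import queue
import random
import threading

def epoch_loader(records, buffer_size=4, seed=0):
    """Yield every record once (selectivity = 1), in a fresh shuffled
    order per epoch, with a producer thread prefetching into a bounded
    buffer while the consumer does 'compute'."""
    order = list(range(len(records)))
    random.Random(seed).shuffle(order)  # new load order this epoch
    buf = queue.Queue(maxsize=buffer_size)
    SENTINEL = object()

    def producer():
        for i in order:
            buf.put(records[i])  # simulated page fetch (blocks when full)
        buf.put(SENTINEL)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buf.get()
        if item is SENTINEL:
            return
        yield item  # consumer's compute overlaps producer's I/O
```

The bounded queue is the whole trick: it caps memory while letting the fetch thread stay one step ahead of the compute loop, which is the scheduling a buffer manager would normally own.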
We need a system that supports workloads with lots of cascading aborts or intra-transaction checkpointing, optimizes for low cost of (page fetches + model updates), and works well with existing memory architectures (massive RAM, shared-nothing 4-8 node clusters). This is not traditional OLTP