@radek@sigmoid.social (Mastodon) πŸ‡ΊπŸ‡¦

@radekosmulski

9 Tweets 231 reads Nov 18, 2022
Merlin Dataloader is 119x faster than my own PyTorch Dataset + Dataloader combo!
This is revolutionary for tabular data πŸ₯³
Let's take a closer look at what is going on.
First, a disclaimer.
It is very hard to do benchmarking in a fair way.
I am comparing how *I would* do things in pure Python/PyTorch vs what Merlin Dataloader does for me.
Here is the setup:
I have a large dataframe with 152 million rows and two columns.
I create a simple Dataset and feed it to a PyTorch Dataloader.
For the Merlin Dataloader, I write the dataframe to disk and let Merlin figure it out.
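The baseline side of that setup can be sketched roughly like this. The real code was in the screenshots, so everything here is my reconstruction: the class and variable names are illustrative, a tiny NumPy array stands in for the 152-million-row dataframe, and `collate` mimics what PyTorch's default `collate_fn` does with a map-style `Dataset`:

```python
import numpy as np

class TabularDataset:
    """Map-style dataset: one row per __getitem__ call,
    like torch.utils.data.Dataset."""
    def __init__(self, features, labels):
        self.features = features
        self.labels = labels

    def __len__(self):
        return len(self.features)

    def __getitem__(self, idx):
        return self.features[idx], self.labels[idx]

def collate(samples):
    """Stack individual (feature, label) samples into one batch,
    mimicking the default PyTorch collate_fn."""
    xs, ys = zip(*samples)
    return np.stack(xs), np.stack(ys)

# Two columns, as in the setup: one feature column, one label column.
features = np.arange(10, dtype=np.float32)
labels = (features > 4).astype(np.int64)

ds = TabularDataset(features, labels)
batch_x, batch_y = collate([ds[i] for i in range(4)])
print(batch_x.shape, batch_y.shape)  # (4,) (4,)
```

The key point is that every sample passes through a Python-level `__getitem__` call and a second copy in `collate`, which is exactly where the time goes at this scale.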
I think I understand what is happening.
A lot of time and compute is spent on indexing into the numpy array and collating the 65536 examples per batch.
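To make that concrete, here is a toy NumPy comparison of the two paths: indexing one example at a time and collating (what the map-style route does per batch) versus a single vectorized slice. The array sizes are my own choices for illustration; only the 65,536 batch size comes from the thread:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.standard_normal(1_000_000).astype(np.float32)

batch_size = 65_536  # the per-batch example count mentioned above
idx = np.arange(batch_size)

# Per-example path: one tiny array access per sample, then a collate
# step that copies everything again into a single batch array.
samples = [data[i] for i in idx]
batch_slow = np.stack(samples)

# Vectorized path: one contiguous slice yields the same batch in a
# single operation, with no Python-level loop.
batch_fast = data[:batch_size]

assert np.array_equal(batch_slow, batch_fast)
```

Both produce identical batches; the difference is 65,536 Python-level operations plus a re-copy versus one slice.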
I *guess* I could figure it out myself and do it more efficiently?
I could maybe go with an iterable-style dataset and somehow shuffle data in chunks without indexing.
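A minimal sketch of what that iterable-style, chunk-shuffled approach might look like. This is my guess at the idea, not Merlin's actual implementation: shuffle the *order* of chunks and the rows *within* each chunk, then yield contiguous batch slices, so no per-row fancy indexing over the whole array is needed:

```python
import numpy as np

def chunked_shuffle_batches(data, chunk_size, batch_size, seed=0):
    """Yield approximately-shuffled batches by permuting chunk order
    and shuffling within each chunk, using only contiguous slices."""
    rng = np.random.default_rng(seed)
    n_chunks = len(data) // chunk_size
    for c in rng.permutation(n_chunks):
        chunk = data[c * chunk_size:(c + 1) * chunk_size].copy()
        rng.shuffle(chunk)  # in-place shuffle of this one chunk only
        for start in range(0, chunk_size, batch_size):
            yield chunk[start:start + batch_size]

data = np.arange(1000)
batches = list(chunked_shuffle_batches(data, chunk_size=100, batch_size=25))
# Every row appears exactly once across all batches.
assert sorted(np.concatenate(batches).tolist()) == list(range(1000))
```

The trade-off is that this gives an approximate shuffle (rows only mix within their chunk), which is usually acceptable for training on large tabular data.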
But frankly speaking, why would I write all this code? πŸ™‚
With Merlin Dataloader I get speed for free!
One reason it is so fast is because it utilizes Dask-based Merlin Datasets.
And I guess there are a couple of other things in Merlin Dataloaders that make it all (including memory transfer) super-fast.
But I would lie if I said I understood everything that is going on πŸ™‚
Will keep my πŸ‘€ and πŸ‘‚s open to learn more!
And to whet your (and my πŸ˜„) appetite a bit more, remember all those tabular data ops in NVTabular?
We now have a way to feed all that transformed data efficiently to TF, PyTorch, and JAX 🀯
Super excited, examples are coming! πŸ™Œ
