will depue

@willdepue

7 Tweets 92 reads May 26, 2023
Today, I'm announcing Alexandria, an open-source initiative to embed the internet.
To start, we're releasing the embeddings for every research paper on the Arxiv. That's over 4m items, 600m tokens, and 3.07 billion vector dimensions.
We're not stopping here.
A significant number of the world's problems are just search, clustering, recommendation, or classification; all things embeddings are great at.
For example, finding research papers via keywords is hard when there are 10 words that mean the same thing. Embeddings make this easy.
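The idea above is the core of semantic search: nearby vectors mean similar text, even when the words differ. A minimal sketch with NumPy, using toy stand-in vectors (real paper embeddings would come from a model and have far more dimensions):

```python
import numpy as np

def cosine_scores(query: np.ndarray, docs: np.ndarray) -> np.ndarray:
    """Cosine similarity between one query vector and each row of docs."""
    q = query / np.linalg.norm(query)
    d = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    return d @ q

# Toy stand-ins for real embeddings (hypothetical values, for illustration only).
docs = np.array([
    [0.90, 0.10, 0.00],  # e.g. "neural network pruning"
    [0.88, 0.15, 0.05],  # e.g. "model compression" -- different words, close vector
    [0.00, 0.10, 0.95],  # e.g. "coral reef ecology"
])
query = np.array([0.92, 0.12, 0.02])  # e.g. "sparsifying deep nets"

ranked = np.argsort(-cosine_scores(query, docs))  # best match first
print(ranked.tolist())
```

The two differently-worded pruning/compression papers rank ahead of the unrelated one, with no keyword overlap required.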
Embeddings are also a one-time cost and are incredibly cheap. In most cases, you'll never need to embed the same document twice.
At the moment, we're embedding at high throughput for $1 per 100,000,000 tokens. That's the length of the Bible, 10 times over, per dollar.
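At that rate, the whole 600-million-token Arxiv release above costs only a few dollars to embed end to end. A quick sanity check on the arithmetic:

```python
# Back-of-the-envelope cost under the stated pricing: $1 per 100,000,000 tokens.
PRICE_PER_TOKEN = 1 / 100_000_000

arxiv_tokens = 600_000_000  # total tokens in the Arxiv release above
cost = arxiv_tokens * PRICE_PER_TOKEN
print(f"${cost:.2f}")
```

Six dollars for the entire corpus, and since embeddings are a one-time cost, that's it.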
I was surprised when I couldn't find any open embedding datasets (research, law, finance, etc.), considering the immense value and low cost.
There's too much to be built here... so we're building an org and doing it ourselves.
Learn more about us: macrocosm.so
You can download the Arxiv embeddings (titles and abstracts, 6 GB and 8 GB respectively) at the link below.
There's a lot of datasets to choose from, so we need your help to figure out what to work on next. Let us know by voting!
Download or vote here:
alex.macrocosm.so
Note: Embeddings are most often used for search / question answering, so we're building those ourselves.
Our Arxiv embedding search launches next week, with more to come.
We're also experimenting with an AI agent personal research assistant that helps you learn, teach, and publish.
Big thanks to @jlinbio and @indstryoutsider for helping launch this project, and to @sdand for his help developing the Penrose Analyst plugin (soon to feature Alexandria embeddings).