will depue

@willdepue

7 Tweets 92 reads May 26, 2023
Today, I'm announcing Alexandria, an open-source initiative to embed the internet.
To start, we're releasing the embeddings for every research paper on the Arxiv. That's over 4m items, 600m tokens, and 3.07 billion vector dimensions.
We're not stopping here.
A significant number of the world's problems are just search, clustering, recommendation, or classification; all things embeddings are great at.
For example, finding research papers via keywords is hard when there are 10 words that mean the same thing. Embeddings make this easy.
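The idea above is the core of semantic search: nearby vectors mean similar text, even when the words differ. A minimal sketch with NumPy, using toy stand-in vectors (real paper embeddings would come from a model and have far more dimensions):

```python
import numpy as np

def cosine_scores(query: np.ndarray, docs: np.ndarray) -> np.ndarray:
    """Cosine similarity between one query vector and each row of docs."""
    q = query / np.linalg.norm(query)
    d = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    return d @ q

# Toy stand-ins for real embeddings (hypothetical values, for illustration only).
docs = np.array([
    [0.90, 0.10, 0.00],  # e.g. "neural network pruning"
    [0.88, 0.15, 0.05],  # e.g. "model compression" -- different words, close vector
    [0.00, 0.10, 0.95],  # e.g. "coral reef ecology"
])
query = np.array([0.92, 0.12, 0.02])  # e.g. "sparsifying deep nets"

ranked = np.argsort(-cosine_scores(query, docs))  # best match first
print(ranked.tolist())
```

The two differently-worded pruning/compression papers rank ahead of the unrelated one, with no keyword overlap required.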
Embeddings are also a one-time cost and are incredibly cheap. In most cases, you'll never need to embed the same document twice.
At the moment, we're embedding at high throughput for $1 per 100,000,000 tokens. That's the length of the Bible, 10 times over, per dollar.
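At that rate, the whole 600-million-token Arxiv release above costs only a few dollars to embed end to end. A quick sanity check on the arithmetic:

```python
# Back-of-the-envelope cost under the stated pricing: $1 per 100,000,000 tokens.
PRICE_PER_TOKEN = 1 / 100_000_000

arxiv_tokens = 600_000_000  # total tokens in the Arxiv release above
cost = arxiv_tokens * PRICE_PER_TOKEN
print(f"${cost:.2f}")
```

Six dollars for the entire corpus, and since embeddings are a one-time cost, that's it.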
I was surprised when I couldn't find any open embedding datasets (research, law, finance, etc.), considering the immense value and low cost.
There's too much to be built here... so we're building an org and doing it ourselves.
Learn more about us: macrocosm.so
You can download the Arxiv embeddings (titles and abstracts, 6 GB and 8 GB respectively) at the link below.
There's a lot of datasets to choose from, so we need your help to figure out what to work on next. Let us know by voting!
Download or vote here:
alex.macrocosm.so
Note: Embeddings are most often used for search / question answering, so we're building those ourselves.
Our Arxiv embedding search launches next week, with more to come.
We're also experimenting with an AI agent personal research assistant that helps you learn, teach, and publish.
Big thanks to @jlinbio and @indstryoutsider for helping launch this project, and to @sdand for his help developing the Penrose Analyst plugin (soon to feature Alexandria embeddings).