John Lees

@johnlees6

7 Tweets 2 reads Aug 24, 2022
Have we got a philosophical transaction for you!
royalsocietypublishing.org
We implemented an embedding algorithm (similar to t-SNE, PCA, UMAP etc.) to work with genetic data. You can use it to rapidly visualise large collections and find clusters within them.
1/7
We extended stochastic cluster embedding. This is neat because
a) the algorithm was designed using a human perception study to tune parameters to make the output look 'clustery'
b) it's solvable by SGD, so we can parallelise it and make it fast on both CPU and GPU
2/7
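
To illustrate why an objective that is a sum over pairs of samples suits SGD, here is a minimal sketch. It is an assumption-heavy stand-in: it swaps in a simple squared-error "stress" loss between input and embedded distances, not the SCE objective, and it is not mandrake's code. The function name sgd_embed and all of its parameters are hypothetical.

```python
import numpy as np

def sgd_embed(dist, n_dims=2, n_steps=5000, lr=0.05, seed=0):
    """Embed points by SGD on a pairwise stress loss (illustrative, not SCE)."""
    rng = np.random.default_rng(seed)
    n = dist.shape[0]
    y = rng.normal(scale=1e-2, size=(n, n_dims))  # small random start
    for _ in range(n_steps):
        # Each step touches a single randomly sampled pair of points
        i, j = rng.choice(n, size=2, replace=False)
        diff = y[i] - y[j]
        d_emb = np.linalg.norm(diff) + 1e-12
        # Gradient of 0.5 * (d_emb - dist[i, j])**2 with respect to y[i]
        grad = (d_emb - dist[i, j]) * diff / d_emb
        y[i] -= lr * grad
        y[j] += lr * grad
    return y

# Hypothetical toy input: two tight pairs of points that are far apart
points = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 0.0], [5.1, 0.0]])
toy_dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
emb = sgd_embed(toy_dist)
```

Because every update reads and writes only two rows of the embedding, many sampled pairs can in principle be processed concurrently, which is the property that makes this family of methods a natural fit for multi-core CPUs and GPUs.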
We can process sequence alignments, assemblies (via sketching) and gene presence/absence
This let us run (in under a couple of hours) on
- 661k bacterial assemblies
- 1M SARS-CoV-2 genomes
You can make videos of the optimisation too
youtube.com
3/7
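
As a rough illustration of the gene presence/absence input type, the sketch below turns a binary matrix into pairwise distances that an embedding method could consume. The choice of Hamming distance, the function name pairwise_hamming and the toy matrix are all assumptions made for illustration; the thread does not say which distance mandrake uses.

```python
import numpy as np

def pairwise_hamming(presence: np.ndarray) -> np.ndarray:
    """Fraction of genes whose presence/absence differs, for every sample pair."""
    x = presence.astype(np.int64)
    n_genes = x.shape[1]
    # x @ x.T counts genes present in both samples;
    # (1 - x) @ (1 - x).T counts genes absent from both samples
    matches = x @ x.T + (1 - x) @ (1 - x).T
    return 1.0 - matches / n_genes

# Hypothetical toy matrix: rows are samples, columns are genes (1 = present)
toy = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 1],
])
dist = pairwise_hamming(toy)
```

Here the first two samples differ at one gene out of four, so their distance is 0.25, while the first and third differ at every gene and sit at distance 1.0.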
New from the preprint:
- We analysed simulated population data, and showed that, out of mandrake, PCA, t-SNE and UMAP, only mandrake found the expected clusters and structure in all cases
4/7
- We explored the 661k bacterial dataset results a bit more, and found that mandrake simultaneously found species clusters and, within them, strain clusters at different resolutions (e.g. Tb in major lineages, Salmonella in hierCC groups)
5/7
Our algorithm is called mandrake. You can find the code here: github.com
It's also available on conda, and as a web app you can run in your browser without any installation (using WebAssembly)
gtonkinhill.github.io
6/7
And of course, I made a stupid new logo and help page
mandrake.readthedocs.io
Thanks to @gerrythill for writing/analysing with me, Jukka Corander and Zhirong Yang for developing SCE
7/7