Apr 13, 2023
Microsoft's new Kosmos-1 is incredible.
It's a new Multimodal Large Language Model (MLLM).
The model can understand images, text, and images with embedded text, and it handles OCR, image captioning, and visual question answering.
It can even solve IQ tests.
Paper: arxiv.org
Code: github.com
The team also introduced a Raven IQ test dataset, which diagnoses the nonverbal reasoning capability of MLLMs.
This is an example of Kosmos-1 solving a visual IQ test.
The Multimodal Chain-of-Thought prompting enables KOSMOS-1 to tackle complex question-answering and reasoning tasks.
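The idea behind multimodal chain-of-thought prompting can be sketched in two stages: first elicit a free-form rationale about the image, then answer conditioned on that rationale. The sketch below is an illustration, not the paper's code; `query_model` is a hypothetical stand-in for a real MLLM call, stubbed here so it runs.

```python
def query_model(image, prompt):
    # Hypothetical MLLM call, stubbed with canned responses for illustration.
    if "Introduce this picture" in prompt:
        return "A red octagonal sign with the word STOP."
    return "stop sign"

def multimodal_cot(image, question):
    # Stage 1: elicit a rationale grounded in the image.
    rationale = query_model(image, "Introduce this picture in detail:")
    # Stage 2: answer the question conditioned on that rationale.
    return query_model(image, f"{rationale} {question}")

answer = multimodal_cot("sign.png", "What does this sign say?")
```

The two-stage split is the key point: the intermediate rationale gives the model room to reason before committing to an answer.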
The model was evaluated on the following:
If you want to stay up to date with the latest breakthroughs in AI, check out AlphaSignal's weekly summary.
We use ML to identify the top papers, news, and repos. It's read by 50,000+ engineers and researchers.
alphasignal.ai
The amazing team behind this @MSFTResearch paper:
@ShaohanHuang, @donglixp, Wenhui Wang, Yaru Hao, @saksham_singhal, Shuming Ma, Tengchao Lv, @wolfshowme, Owais Khan Mohammed, Qiang Liu, Kriti Aggarwal, Zewen Chi, Johan Bjorck, @vishrav, Subhojit Som, Xia Song, Furu Wei