Prime Highlights:
InstaDeep and NVIDIA have open-sourced Nucleotide Transformers (NT), a family of advanced foundation models for analyzing genomics data.
The largest NT model, Multispecies 2.5B, has 2.5 billion parameters and was trained on genetic sequence data from 850 species, including bacteria, fungi, and mammals.
NT clearly outperformed other state-of-the-art genomics models across a range of benchmarks, particularly on the promoter and splicing tasks.
Key Background:
InstaDeep, in association with NVIDIA, launched Nucleotide Transformers, an advanced open-source set of foundation models for genomics data analysis. The biggest model in the NT set has 2.5 billion parameters and was trained on a large dataset of genetic sequences from 850 species of bacteria, fungi, invertebrates, and mammals, including mice and humans. The new models outperform previously published genomics models on most of the critical benchmarks.
The technical details of the NT models are published in Nature. NT uses an encoder-only Transformer architecture pretrained with a masked-language-modeling objective similar to BERT's, so the model can either generate embeddings for use in smaller downstream models or be fine-tuned with a task-specific head. InstaDeep evaluated NT on 18 downstream tasks, ranging from epigenetic mark prediction to promoter sequence identification, and the model outperformed baselines across the board, achieving its strongest results on promoter and splicing tasks.
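As a rough illustration of the BERT-style objective described above, the sketch below tokenizes a DNA sequence into non-overlapping 6-mers and masks a subset of them; during pretraining the model learns to recover the masked tokens. The 6-mer tokenization, the `[MASK]` token, and the 15% masking rate are common practice and are assumptions here, not NT's exact scheme.

```python
import random

def tokenize_6mers(seq):
    """Split a DNA sequence into non-overlapping 6-mer tokens (trailing partial chunk dropped)."""
    return [seq[i:i + 6] for i in range(0, len(seq) - len(seq) % 6, 6)]

def mask_tokens(tokens, rate=0.15, seed=0):
    """Replace ~rate of tokens with [MASK]; return masked tokens plus the targets to recover."""
    rng = random.Random(seed)
    masked, labels = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < rate:
            masked.append("[MASK]")
            labels[i] = tok  # pretraining target: predict the original token at this position
        else:
            masked.append(tok)
    return masked, labels

seq = "ATGCGTACGTTAGCATGCGTACGTTAGC" * 4
tokens = tokenize_6mers(seq)
masked, labels = mask_tokens(tokens)
```

The model sees `masked` as input and is trained to predict each entry of `labels`; the same trained encoder then supplies embeddings or a fine-tuning backbone, as described above.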
Beyond the downstream genomics tasks, InstaDeep applied NT to predicting the severity of genetic mutations. Zero-shot scores, derived from cosine distances in the embedding space, showed modest correlations with mutation severity, opening new directions for genetic research and disease study. The open-source release is an important step forward in genomic AI, giving researchers powerful new tools for tackling complex genetic problems.
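The zero-shot scoring idea can be sketched as follows: embed the reference and mutated sequences and use the cosine distance between the two embeddings as a severity proxy. The `embed` function below is a stand-in (a simple 3-mer count vector), not NT's learned embedding; with the real model one would instead pool token embeddings from an intermediate layer.

```python
import math
from collections import Counter
from itertools import product

KMERS = ["".join(p) for p in product("ACGT", repeat=3)]  # fixed 3-mer vocabulary

def embed(seq):
    """Stand-in embedding: a 3-mer count vector (NT's embedding is learned, not counted)."""
    counts = Counter(seq[i:i + 3] for i in range(len(seq) - 2))
    return [counts[k] for k in KMERS]

def cosine_distance(u, v):
    """1 - cosine similarity: 0 for identical directions, larger for bigger shifts."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

ref = "ATGCGTACGTTAGCATGCGT"
mut = ref[:10] + "A" + ref[11:]  # single-nucleotide substitution
score = cosine_distance(embed(ref), embed(mut))  # larger score = embedding shifted more
```

The reported correlations are modest, so a score like this is a screening signal rather than a clinical predictor.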
Another striking feature of NT is that its intermediate layers contain rich contextual embeddings capturing important genomic features, such as promoters and enhancers, even though the model was never given labels for these features during its unsupervised training. This highlights its scope for zero-shot learning, predicting how genetic mutations will influence a condition, which may yield insights into disease mechanisms that were previously out of reach.
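Claims that frozen embeddings encode a feature like "promoter" are typically checked with a linear probe: fit a simple classifier on top of the embeddings and see whether it beats chance. A minimal NumPy sketch, using synthetic embeddings and labels as stand-ins for real NT embeddings and promoter annotations:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins: 200 "embeddings" (dim 16) whose first coordinate
# weakly encodes a binary "promoter" label, mimicking a learned feature.
n, d = 200, 16
labels = rng.integers(0, 2, size=n)
emb = rng.normal(size=(n, d))
emb[:, 0] += 2.0 * labels  # the signal a probe should discover

# Linear probe: ridge-regularized least squares on {0, 1} targets.
X = np.hstack([emb, np.ones((n, 1))])  # append a bias column
w = np.linalg.solve(X.T @ X + 1e-2 * np.eye(d + 1), X.T @ labels)
preds = (X @ w > 0.5).astype(int)
accuracy = (preds == labels).mean()  # well above 0.5 when the feature is present
```

If the probe's accuracy stays near chance, the layer does not linearly encode the feature; accuracies well above chance are the kind of evidence behind the promoter/enhancer observation.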
The Multispecies 2.5B model performed better than an equivalent model trained only on human data; InstaDeep's study outlines how incorporating multispecies data during development can deepen understanding of the human genome. Compared with other genomics foundation models, such as Enformer, HyenaDNA, and DNABERT-2, NT achieved better performance on most tasks, though Enformer achieved the best results on enhancer prediction and a set of chromatin tasks.