
Benchmarking Geneformer v1 vs v2 Bio Foundation Models

Jad Sbaï

Imagine being able to decode the unique molecular blueprint of every single cell in the human body, unveiling the mysteries of our biology at a remarkable level of detail. This exciting advancement is becoming possible through the integration of AI across various domains, including molecular biology.


In this blog post, we’ll dive into a detailed comparison of two versions of the bio foundation model Geneformer: Geneformer v1, first published in Nature in 2023, and Geneformer v2, released as a preprint in 2024.

This comparative benchmarking will highlight the key differences and improvements made in v2 over its predecessor — both having been developed by the same team.

Bio Foundation Models

Before diving into the details, it’s helpful to revisit what bio foundation models (FMs) are and their role in biological research. Bio FMs are pretrained on massive biological datasets, such as DNA sequences or, in this case, RNA-seq data, which allows them to generate rich and informative embeddings.

These embeddings can then be applied to various downstream tasks, ranging from basic analyses like cell type annotation and batch integration to more complex applications in drug discovery, such as biomarker identification and target selection. If you’re looking for a deeper introduction to the concept, a great resource is available here.

Architecture: The Old and the New

Now that we’ve covered the basics, let’s dive into the specifics. Before examining the results, it’s crucial to first understand the architectural and pretraining differences between Geneformer v1 and v2. Full details can be found in the Helical model card.

Geneformer v1 was pretrained on 29.9 million human single-cell RNA-seq profiles from a wide range of tissues, all sourced from public datasets, available here. It comes in two variants, a 6-layer and a 12-layer model, both using an input size of 2048 (genes per cell).

While this input size allows for some context awareness, it falls short in representing a full genome, which typically requires around 20,000 genes per cell (though this includes many genes with zero expression counts). To work around this, Geneformer v1 (as well as v2) employs rank value encodings to sort gene expressions, prioritizing those that better distinguish cell states. More details can be found in the full article.
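To make the idea concrete, here is a toy sketch of rank value encoding, not Geneformer’s actual tokenizer: each gene’s expression is normalised by that gene’s median across the corpus, and genes are ordered by this normalised value, so genes whose expression is unusually high for that gene rank first. Function and variable names here are illustrative.

```python
import numpy as np

def rank_value_encode(expression, gene_medians, gene_ids, max_len=2048):
    """Toy rank value encoding: normalise each gene's count by its
    corpus-wide median, then order genes by descending normalised value."""
    normalized = expression / gene_medians
    nonzero = normalized > 0  # zero-count genes carry no rank information
    order = np.argsort(-normalized[nonzero])
    return gene_ids[nonzero][order][:max_len]

# Gene B is lowly expressed here but even lower corpus-wide, so it ranks first
expression = np.array([10.0, 4.0, 0.0, 6.0])
gene_medians = np.array([20.0, 1.0, 5.0, 6.0])
gene_ids = np.array(["A", "B", "C", "D"])
print(rank_value_encode(expression, gene_medians, gene_ids))  # ['B' 'D' 'A']
```

Because the ordering, not the raw count, becomes the model input, genes that strongly distinguish a cell state end up early in the (truncated) sequence.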

Geneformer v2 was pretrained on a significantly larger corpus of biological data. The authors assembled a pre-training corpus of 103 million human cells, of which a filtered subset of 95 million cells — based on mutational burden — was used for pre-training. While the full corpus isn’t publicly available yet, a list of the datasets used can be found in the preprint.

One of the most notable improvements in v2 is the increased input size of 4096. This expanded input size captures 93% of the Genecorpus, meaning that 93% of the cells had detectable gene expression for fewer than 4096 genes. This extension enhances the model’s ability to capture biological variability while reducing the influence of zero-count genes.

Three variants of Geneformer v2 were introduced: a 12-layer and a 20-layer base model, along with a cancer-specific 12-layer model. The latter was tuned using continual learning on a separate corpus of 14 million cancerous cells. This model was domain-tuned to account for the variational complexity inherent in cancer-specific cellular landscapes, making it more effective at understanding the heterogeneity and distinct characteristics of cancer cell populations.

Another significant enhancement in Geneformer v2 is the introduction of a CLS token added to every input sequence. This addition allows the model to generate two types of embeddings: the traditional cell embedding, which averages the output embeddings of each gene in the sequence (as in v1), and a CLS embedding, which is simply the output embedding of the CLS token itself.

This means v2 can now accumulate and output meaningful information through the CLS token, providing a more comprehensive transcriptome representation.
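The difference between the two embedding modes can be illustrated on a dummy matrix of final-layer token outputs. This is a sketch of the general idea, not the Helical internals, and the CLS token is placed at index 0 here purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, hidden = 6, 512

# Stand-in for the per-token outputs of the final transformer layer:
# index 0 is the CLS token, the remaining rows are gene tokens.
token_embeddings = rng.normal(size=(seq_len, hidden))

# v1-style "cell" embedding: mean over the gene-token outputs
cell_embedding = token_embeddings[1:].mean(axis=0)

# v2-style "cls" embedding: the CLS token's output alone
cls_embedding = token_embeddings[0]

print(cell_embedding.shape, cls_embedding.shape)  # (512,) (512,)
```

Both are 512-dimensional vectors, but the CLS embedding is produced by a token whose only job during pretraining is to summarise the whole sequence, rather than by averaging gene-level representations after the fact.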

It’s important to note that both versions of Geneformer are pretrained exclusively on human data. The gene mappings are therefore limited to human RNA-seq data and cannot easily be applied to other species.

Both Geneformer v1 and v2 are available for immediate use through the Helical package. You can start using these models with just a few lines of code:

from helical.models.geneformer.model import Geneformer, GeneformerConfig
import anndata as ad

# Geneformer v2 (12-layer base model)
model_config = GeneformerConfig(model_name="gf-12L-95M-i4096", batch_size=10)
geneformer_v2 = Geneformer(model_config=model_config)

# You can use other model names in the config, such as:
# "gf-12L-30M-i2048" (Version 1.0)
# "gf-20L-95M-i4096" (Version 2.0, 20-layer model)
# "gf-12L-95M-i4096-CLcancer" (Version 2.0, Cancer-tuned)

# Example usage
ann_data = ad.read_h5ad("dataset.h5ad")
dataset = geneformer_v2.process_data(ann_data)
embeddings = geneformer_v2.get_embeddings(dataset)
print("Base model embeddings shape:", embeddings.shape)

For more details on how to use the Helical package, you can consult the full documentation here.

Now, without further ado, let’s dive into the benchmarking!

Benchmarking

Set-up

Here, we present the benchmarking results for the 12-layer version of Geneformer v1 (gf-v1), along with the 12-layer version of Geneformer v2 (gf-v2-cls) and its cancer-tuned counterpart (gf-v2-cancer-cls). For these tests, the v1 model was evaluated using the cell embedding mode, while v2 leveraged the CLS embedding mode.

We’ve excluded results for the 20-layer model, as its performance was similar to the 12-layer version, if not slightly worse, as also noted by the authors: “The largest 20 layer model did not surpass the intermediate-sized 12 layer model”.

The full model configuration parameters are as follows:

"Geneformer v2 (base and cancer-tuned)": {
  "batch_size": 24,
  "emb_layer": -1,
  "emb_mode": "cls",
  "device": "cuda",
  "accelerator": false,
  "input_size": 4096,
  "special_token": true,
  "embsize": 512,
  "nproc": 1
}
"Geneformer v1": {
  "batch_size": 24,
  "emb_layer": -1,
  "emb_mode": "cell",
  "device": "cuda",
  "accelerator": false,
  "input_size": 2048,
  "special_token": false,
  "embsize": 512,
  "nproc": 1
}

As a first comparison of performance, we benchmarked these models on cell-type annotation (classification). To do this, we trained a classifier head on the embeddings generated by the bio FMs along with the corresponding cell-type labels. We used an 80–20 train-test split and a Support Vector Machine (SVM) from the scikit-learn library, with the following configuration:

"svm": {
  "kernel": "rbf",
  "degree": 3,
  "C": 1,
  "decision_function_shape": "ovr"
}

We report the accuracy and F1 score for each model, along with precision and recall. The models were evaluated on two distinct datasets to provide a more comprehensive understanding of their performance.
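Putting the pieces together, the evaluation loop can be sketched as follows, with random synthetic embeddings and labels standing in for the real FM output and cell-type annotations:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

rng = np.random.default_rng(42)
embeddings = rng.normal(size=(1000, 512))   # stand-in for FM embeddings
labels = rng.integers(0, 5, size=1000)      # stand-in cell-type labels

# 80-20 train-test split, stratified by label
X_train, X_test, y_train, y_test = train_test_split(
    embeddings, labels, test_size=0.2, stratify=labels, random_state=0
)

# SVM classifier head with the configuration above
clf = SVC(kernel="rbf", degree=3, C=1, decision_function_shape="ovr")
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print("accuracy :", accuracy_score(y_test, y_pred))
print("macro F1 :", f1_score(y_test, y_pred, average="macro"))
print("precision:", precision_score(y_test, y_pred, average="macro", zero_division=0))
print("recall   :", recall_score(y_test, y_pred, average="macro", zero_division=0))
```

With real embeddings, the quality of the scores above reflects how linearly separable the FM’s embedding space makes the cell types, since the classifier head itself is deliberately simple.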

The first dataset we used for benchmarking is the Cross-Tissue Adult Immune Atlas, which contains around 300,000 cells from a wide range of tissues, including the lung, liver, bone marrow, and more. You can read the full study here.

The evaluation was aimed at assessing Geneformer’s ability to generalise across a variety of human tissues and its capacity to generate a robust and informative embedding space (capturing a wide range of biological variability). We used the “Manually_curated_cell_type” label for evaluation.

Given the large size of the dataset, we employed stratified sampling by cell-type annotation, ensuring that the proportional representation of each cell-type class remained constant across the 30 iterations, with each run consisting of 15,000 samples.
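This kind of stratified subsampling can be sketched with scikit-learn. The label names and class proportions below are illustrative, not the atlas’s actual composition:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_cells = 300_000
cell_types = rng.choice(
    ["T cell", "B cell", "Macrophage"], size=n_cells, p=[0.5, 0.3, 0.2]
)
indices = np.arange(n_cells)

# Draw a 15,000-cell subsample whose cell-type proportions match the full atlas
subsample, _ = train_test_split(
    indices, train_size=15_000, stratify=cell_types, random_state=0
)

# Per-class proportions in the subsample mirror the full dataset
values, counts = np.unique(cell_types[subsample], return_counts=True)
print(dict(zip(values, counts / 15_000)))
```

Repeating this draw with different random seeds gives independent 15,000-cell runs that all preserve the same class balance.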

The second dataset is the CITE-seq Yolk Sac dataset, which is relatively small, containing around 3,000 samples. It focuses on the human yolk sac between 3–8 weeks post-conception.

Despite its size, the dataset presents varying levels of cell type annotation resolution, from LVL1 to LVL3, which allows us to evaluate the ability of each model to recover biologically meaningful information, including higher-order data about cell types and the various sub-states that exist within them. As the dataset is small, we were able to benchmark directly on the full set of samples.

This combination of datasets allows us to test both the generalisation capabilities across diverse tissue types as well as precision in capturing cell type differences at finer resolutions.

Results and Discussion

Now let’s dive into the results. Below is a high-level overview of the performance for cell type annotation on the Cross-Tissue Adult Immune Atlas dataset:

Comparing Geneformer v1 and v2 on Cell type annotation (Cross Tissue Adult Immune Atlas)

And below are the results for LVL1 (low-resolution) cell-type annotation on the CITE-seq Yolk Sac dataset:

Comparing Geneformer v1 and v2 on LVL1 Cell type annotation (CITE-seq Yolk Sac dataset)

From these results, it’s evident that both Geneformer v2 variants significantly outperform the v1 model, especially on the larger, more diverse Cross-Tissue Atlas. This highlights that Geneformer v2 generates richer embeddings with an embedding space capable of effectively distinguishing cells within a wide range of tissues.

This improvement can be attributed not only to the larger pretraining corpus but also to the expanded context window, allowing the model to capture more information simultaneously.

It’s clear that Geneformer v2 performs well in general high-level immune cell classification. But the real question is: how does it perform when we zoom in on higher levels of resolution? Does its embedding space possess both breadth AND depth?

Let’s find out by comparing the models on LVL3 (much higher resolution) cell type annotation using the CITE-seq Yolk Sac dataset:

Comparing Geneformer v1 and v2 on LVL3 Cell type annotation (CITE-seq Yolk Sac dataset)

Biologists typically operate at higher resolution (LVL3) of cell classification, as it provides more detailed information and, crucially, distinguishes between cell sub-states (states within the same type). These sub-states are biologically more challenging to differentiate than high-level cell types, making them more valuable and meaningful for research.

In this context, Geneformer v2 demonstrates a significant improvement over its predecessor, with an F1 score twice that of Geneformer v1. It’s worth noting that this improvement isn’t reflected in overall accuracy due to the substantial class imbalance in LVL3 labels, as illustrated in the following plot:

Distribution of LVL3 Class Labels in the CITE-seq Yolk Sac Dataset

In other words, if a model were to constantly predict the Macrophage label, it would achieve decent accuracy while yielding a near-zero F1 score. This illustrates the importance of considering multiple metrics when evaluating model performance!
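This effect is easy to reproduce on a toy imbalanced label set (the numbers below are illustrative, not the Yolk Sac data). With many more minority classes, the macro F1 of such a constant predictor collapses towards zero while accuracy stays high:

```python
from sklearn.metrics import accuracy_score, f1_score

# 90 Macrophages and 10 rarer cells; the "model" always predicts Macrophage
y_true = ["Macrophage"] * 90 + ["pre-B cell"] * 5 + ["Mast cell"] * 5
y_pred = ["Macrophage"] * 100

print("accuracy:", accuracy_score(y_true, y_pred))  # 0.9
print("macro F1:", f1_score(y_true, y_pred, average="macro", zero_division=0))
```

The constant predictor scores 90% accuracy yet earns an F1 of zero on both minority classes, dragging the macro average far below the accuracy figure.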

It’s also worth noting that the cancer-tuned model performs comparably to, if not better than, the base model across all metrics on both datasets. This suggests that, despite its domain-specific tuning, it retains its knowledge of immune cell biology (no clear signs of catastrophic forgetting), maintaining high performance in a general context.

Conclusion

Overall, these results indicate that Geneformer v2 learns representations that more accurately reflect the variability inherent in biological complexity, and all of this is achieved without any fine-tuning!

Speaking of fine-tuning, be sure to check out Matthew’s post on how to fine-tune a Geneformer model for your own task and data here.

In our next blog post, we will explore the performance of the cancer-tuned Geneformer, but on cancerous data this time. We’ll be able to evaluate the quality of the domain tuning and compare it to the base model!

Also, be sure to keep an eye out for the release of our official Helical Bio Benchmark Leaderboard, which will cover a wide range of tasks, datasets, and bio foundation models. We’re always cooking!


About Helical

Helical is an open-core platform for computational biologists and data scientists to effortlessly integrate single-cell & genomics AI Bio Foundation Models in early-stage drug discovery.

Follow or subscribe to stay up-to-date with the latest developments in Bio Foundation Models.
