Virtual Cell Perturbation Metrics Reloaded
Thoughts on metrics for single cell perturbation prediction models
To the question: What is a virtual cell? The consensus answer at the moment is: gene perturbation prediction. It can, of course, entail a lot more than that, but just in terms of vibe check, the current understanding is that virtual cell modeling is about predicting the gene expression response of a cell (line/type) upon a genetic (or chemical) perturbation.
However, if you take a random sample of papers on the topic, you will notice that there is little consensus on the evaluations. While the chosen datasets are generally well-established, the splits and metrics are quite diverse and do not overlap much. And that’s ok. The field is clearly in its infancy, and it will take time for consensus-building and standardization (and initiatives like the Virtual Cell Challenge go in that direction).
Defining a set of standards for measuring progress is central to advancing the field, and metrics play a principal role in this process.
In the context of perturbation predictions, various metrics have been proposed to measure the predictive accuracy of perturbation prediction models. These are some relevant papers that provide a good starting point on the topic:
A benchmark and analysis on perturbation prediction models, including models that leverage gene-regulatory-network inference-type approaches.
A benchmark and TxPert perturbation prediction model from Recursion
A benchmark from Altos Labs on genetic perturbation prediction. Only a few models are surveyed, but there are interesting ablation studies.
A benchmark on distance metrics (not models!), with a focus on their applicability to perturbation prediction tasks.
THE paper that made the point of the low performance of deep-learning-based models for perturbation prediction tasks (it’s in the title). This one is also worth reading; it focuses on “foundation models”.
STATE paper from the Arc Institute, which, in many ways, preceded the Virtual Cell Challenge.
Before we dive in, one important note to consider is that all these metrics effectively only measure how well the predictions capture the mean shift of the perturbations, effectively disregarding the heterogeneity of the cellular response. My understanding is that, in the context of genetic perturbation, this is fine, since the population response is homogeneous (and indeed, almost indistinguishable from control). However, this might change when the perturbagen is a peptide|cytokine|small molecule or in the case of in vivo perturbation, where the cell type variation is a priori more heterogeneous than in cell lines.
Anyway, enough with digressions already…
The problem with evaluating perturbation prediction models, especially in the context of genetic perturbations (CRISPR knockouts/knockdowns or CRISPR activations), is that the perturbagen elicits a small effect on gene expression variability. A qualitative yet insightful way to understand this is to take any perturb-seq paper and look at the UMAP1 visualization of the dataset. In most cases, the UMAP is a “blob”, and it doesn’t really show any real sub-structure or cell state variation. That is because a lot of the genes don’t actually change upon perturbation. Therefore, a model that always predicts the same values is likely to score well on metrics that purely evaluate how closely the true and predicted gene expressions align in some metric space.

The author of the PerturbBench paper highlighted this issue in reference to the concept of mode collapse in generative modeling. Mode collapse refers to an undesirable end state of the learnt (approximate/implicit) posterior distribution, collapsing to the mode of the data density. Samples from the model all look alike. In the context of perturbation prediction, samples from the models are small variations around the mean expression of the cell line/type. Hence, metrics that cannot properly discriminate samples from a “mode collapsed” model are not useful2.
To enable their benchmark to discriminate between well-trained and “mode collapsed” models, the author of PerturbBench proposed the Perturbation Discrimination Score, which is an alternative to the Mean Reciprocal Ranking score3 used as a metric in information retrieval evaluations. In short, the metric ranks how the predicted gene expression profiles compare to the true gene expression profiles by ranking the distances in gene expression space of all predicted perturbations against the truth. In the case where the closest predicted expression profiles correspond to the true perturbation, the rank is equal to 0 (maximum score). A variation of this metric is also used by the VCC, but with a slight modification: they do not rank distances between gene expression profiles, but distances between differences (perturbation v. control) of gene expression profiles, as well as flipping the ranking terms (see also this section of this great blogpost).
Recently, two papers have tried to address this problem, arguing that current metrics are miscalibrated and, hence, mainly quantify a systematic shift in gene expression profiles shared across all perturbations, which is not necessarily interesting or meaningful for accurately identifying the fine-grained cellular response.
The first paper, Systema, highlights how current genetic perturbation datasets suffer from the presence of systematic differences between perturbed and control cells that are not specific to any perturbation, but reflect global sources of variations such as cell cycle, stress response, or epigenetic changes possibly triggered by the induced knock out, down, or up of a specific gene upon genetic engineering. The paper is a tour de force of analysis and visualization on these systematic sources of variation induced by the genetic perturbations, as well as a comprehensive evaluation of how model performance is impacted.
The other paper(s) introduce the idea of metrics weighting, that is, scaling the gene-specific contribution to the metric by some weighting factor. The weighting factor in this case is the normalized effect size of differentially expressed genes (DEGs) for each perturbation. Through extensive benchmarks with both real and simulated data, they demonstrate how weighted metrics are better calibrated and correctly rank model performance by accounting for predictions that capture perturbation-specific effects. They also introduce the “positive control” baselines, which essentially correspond to bootstrapping cells from the same context/perturbation of the test set (similar in scope to a baseline introduced by TxPert), providing an informative upper bound on model performance.
Both papers share a central idea: accounting for perturbation-specific expression changes leads to a better estimate of perturbation prediction performance.
In Systema, this is achieved by swapping the reference from control cells to the average of all the perturbed cells. Such a reference is used for computing various metrics, such as the Delta Pearson Correlation Coefficient or the Mean Absolute Error (MAE).
In Mejia et al., the gene-level weights used to calibrate the metrics are computed from the effect sizes of the differentially expressed genes between the perturbations and everything else (
scanpy.tl.rank_genes_groups(…, reference=’rest’)).
Both papers present compelling evidence that this approach yields better metrics, enabling more faithful recovery of true prediction performance and the ability to discriminate against mode collapse scenarios, making them overall superior metrics. Nevertheless, these naturally become biased performance estimates. The bias is introduced by the fact that the metric is either computed or rescaled based on a reference that is dataset- and split-dependent. This approach also raises the question of what this new reference entails.
To make this more concrete, these metrics require a decision on which perturbations to include for computing the new reference (or the metric scaling). These are the 3 scenarios I see:
using all perturbations as reference. This is problematic because it effectively introduces data leakage, as the test perturbations now contribute to the reference/weighting factors used in the metric calculation.
using only the perturbations in the training set. This is technically correct, but it introduces a bias, since the metric is now dependent on an additional variable (the split). This also raises the problem that metrics across splits are not really comparable.
using an external reference. I think this would make a lot of sense, but I am not sure what this could be at the moment. I will discuss this later on.
While case 2 (suggested by the authors) is sound, I have a lingering sense of discomfort with these solutions. To be more specific, while this approach may provide a solution to the problem of effectively evaluating perturbation prediction models, it diverges from the actual real-world settings in which these models are intended to be deployed.
Let’s take a step back. Why do we ultimately care about predicting perturbation response? Similar to AlphaFold, we would ideally like to use these models to perform in silico experiments, dramatically increasing the hypothesis space and rate of target discovery. Optimizing for relative perturbation shifts is not exactly that, and while improving model performance ranking, it may ultimately exacerbate the complexities of data splits and evaluations, which are already varied and inconsistent. Furthermore, we would probably want metrics to be comparable across datasets and evaluations. By comparable, I don’t mean exactly, since even a metric like MSE is impacted by dataset scale and inherent variance. Yet, using a data-dependent reference for metric normalization simply adds an additional source of variation and, therefore, increases the difficulty in interpreting metrics.
So, what is to be done?
Here are four thoughts (not really solutions) on what could move the field forward:
From sample-level to population-level summary statistics
Nadig et al. really hit the nail on the head regarding the lack of replicability in genetic perturbation screens (if you don’t want to read the paper, the thread is great). The authors argue that this is not dissimilar to effect size estimations in Genome Wide Association Screens (GWAS), and the solution there is to rely on genome-wide population-level statistics, which enable comparisons of GWAS across cohorts. I don’t know enough about the topic, but I think the idea is that by accounting for population-level variation, the genetic association estimates for the phenotype of interest can be compared and integrated across studies. This enabled massive replicability studies and meta-analysis, ultimately leading to a much larger sample size and thus power to discover variants with smaller effect sizes. In the context of Perturb-seq experiments, the authors propose using a shrinkage approach to account for sample and perturbation-level variability, showing that they can recover much more replicable effect sizes across standard datasets commonly used for benchmarking purposes.

Nevertheless, the idea of leveraging external information (aka population-level statistics) is not unreasonable with current approaches: e.g., taking the metric weighting method from Mejia et al., but instead of relying on sample-level statistics, use external population-level statistics, such as e.g. gene-length, or even average gene expression in cell atlases like CXG Census. Such population-reference stats might not provide as clear benefits to the metric scoring, but might still alleviate the lack of mode collapse discrimination.
Leveraging pathway ontologies and protein complexes annotations for biologically-grounded metrics
Another approach that I have seen scattered around in papers, but not fully explored and fleshed out yet, is to tie perturbation prediction performance to the ability to recover biological pathways and ontologies. In an older paper from the Trapnell group, they explored this idea to evaluate how distances in low-dimensional space would highlight shared (or lack thereof) function in genetic perturbation studies in yeast. If two genes encode for two components of the same pathway or protein complex, it is likely that they will elicit the same downstream effect in transcriptomic shift. The idea of evaluating model representations based on gene and pathway ontologies is not new in image-based high-throughput screens, but has not made much inroads in perturbation prediction benchmarks for transcriptomics data (with a couple of notable exceptions).
Target efficiency and gene dosage effects
For CRISPR-i/a (inhibitor, activator), the expected effect of the perturbation on the target gene is to either shut it down or increase its expression. But because it’s biology, and things are never black and white, the outcome is not as precise as you would expect. In fact, the degree of penetrance of the perturbation varies across the cell population. There are, in fact, tools that aim to discriminate between “perturbed” and “escaping” cells based on the perturbation signature. "Perturbed” cells have received the CRISPR guide and do show a knock-out/up/down effect, whereas “escaping” cells, despite reporting a sgRNA molecule, do not show the expected perturbation signature. In the TxPert paper, the authors looked at this aspect and demonstrated that the model predicts the expression of the target genes less accurately than for the other genes. I find this a particularly interesting failure mode, and while it might not be critical for learning the downstream GRN, it could still provide insights into the model’s ability to capture the perturbation effect.
To make a concrete example, let’s consider the knock-down settings, where every time a gene_i is targeted, its expression is reduced to a low level (or zero). I would expect that the model would effectively shortcut this information (geneA target → geneA exp low, geneB target → geneB exp low, …), but, in practice, in many models, there is no path between the perturbation representation and the input features, solvable by something as simple as an indicator path between representations. This architectural change may be a simple trick to improve the prediction of the target gene expression itself, or perhaps increase the overall performance.
With high-quality data, it would be possible to extend this evaluation to “gene dosage” effects, which do seem to manifest in recent high-quality datasets4. This might again be just a technical detail, or it could provide a window into other types of phenomena, such as epigenetic markers and chromatin states.

Indicators of task complexity
An orthogonal approach that is still largely unexplored is finding measures of “difficulty” in the prediction. I believe this would be particularly useful because, on the one hand, it could provide an interpretability angle on model performance (e.g., which might be related to the quantification of gene essentiality), and on the other hand, it could inform model development about the failure modes of current models.
For instance, in the CASP competition, there are explicit categories indicating the difficulty of the task, such as Template-Based Modeling (easy) or Free Modeling (hard). There are also other axes of task complexity, such as Multiple Sequence Alignment (MSA) depth (shallow-hard, deep-easy), sequence homology (high identity to known structure-easy, no homologs-hard), and protein types (globular-easy, intrinsically disordered-hard), among others. Such factors also help understand model failure modes: for example, what we learned is that Alphafold generalizes quite well in the generation of structures not very similar to the training data, which is actually a hard generalization, but it doesn’t generalize well in the sequence direction, cause it requires a lot of sequences for the crucial preprocessing step of MSA alignment. Nevertheless, that’s an easier failure mode to address, since it's easier to get more sequences than to get more structures.
In the TxPert paper (Figure 14), the Pharos knowledge base is leveraged to calculate a “difficulty” score. The database provides a measure of the available information for each gene, based on gene ontologies, literature mining, and other sources. For TxPert, since it leverages such prior information, it is an important experiment to validate this. Not surprisingly, the model shows high prediction error for genes for which little information is available. However, this is not exactly what I have in mind here: what I’m thinking is more of ways to stratify target genes based on how much mutual information they share with each other. For instance, one example would be to create a test set based on the marker genes of a cell type. There would probably be less overlap in the gene co-expression pattern with the training set, and hence, it would be a harder split to predict.
Finally, the post-mortem of the Virtual Cell Challenge competition will also yield valuable learnings. Although it has received some backlash, being the first experiment of its kind in this Virtual Cell era, it will provide useful feedback on what the models' failure modes and hard predictions are across the board.
The title of this post is inspired by the excellent: Metrics reloaded: recommendations for image analysis validation from Meier-Hein et al.
The perspective represents the culmination of a 2-year process undertaken by the bioimage analysis community to catalog and map out metrics and evaluations for bioimage model assessments. By reading it, you can really appreciate the spectacular amount of coordination that went into the manuscript and software, and the care of the members of the community to balance being exhaustive but also practical in delineating the “do’s and don’ts” of bioimage analysis benchmarking.
I’m looking forward to community efforts like that to provide guidance and clarity for perturbation prediction benchmarking in the future.

Thanks to Yuge Ji and Eric Kernfeld for feedback on this post, and Ben Galluser for pointing me to the “Metrics Reloaded” paper.
I know that “squinting at the UMAP” is bad, but this is used for illustrative purposes only.
In Mejia et al. they use the term “mode collapse” as a feature descriptor of the metric itself, and not of the model. It is a bit confusing to me, but I might be wrong, and that terminology can be used to refer to metrics as well
Check this visualization for a quick explanation of the Mean Reciprocal Ranking score
My inner “computational biologist” reminds me that much of this evaluation discussion may ultimately come down to using higher-quality data…


For a more fleshed out example of "Leveraging pathway ontologies and protein complexes annotations for biologically-grounded metrics", see the EFAAR paper from Recursion last year: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1012463
This approach focuses on evaluating the embedding space, which has the added benefit of being applicable to multiple modalities.
There is now another challenge on Kaggle, using a score that is "Inspired by Mejia et al.". Let's see how that one holds.
https://www.kaggle.com/competitions/echoes-of-silenced-genes/overview