New computational tool helps scientists interpret complex single-cell data

Researchers from Turku Bioscience Centre at the University of Turku have developed a new computational method to interpret complex single-cell data.

The human body contains about 37 trillion cells. Some are more alike than others, yet never exactly the same. Modern single-cell technologies make it possible to characterize this cellular heterogeneity by measuring dozens to thousands of molecules, such as genes or proteins, across thousands of individual cells simultaneously, providing insights into health and disease.

A small amount of blood contains billions of red blood cells and millions of immune cells. Each type of cell has its own molecular ‘fingerprint’, which researchers can identify by combining single-cell technologies with computational methods. When studying many different samples, scientists must first match the same cell types across them. This is a demanding step known as data integration.

However, current integration methods often struggle when cell types vary between samples or appear in very different amounts. In such cases of imbalanced data, methods can mistakenly combine distinct cell types.

To solve this, researchers from Professor Laura Elo’s InFLAMES-group at Turku Bioscience Centre, University of Turku, Finland, have now developed a new machine learning-based algorithm that effectively integrates even imbalanced data across samples. The algorithm has been implemented as open-source software called Coralysis.

“Single-cell technologies let us study the incredible diversity of cells, but comparing them across samples is tricky. This motivated us to develop a method to uncover these hidden patterns reliably”, says Associate Professor Sini Junttila, one of the supervisors of the study.

“We were inspired by the process of assembling a puzzle where one begins by grouping pieces based on low- to high-level features such as colour and shading before looking at shape and patterns. Similarly, our algorithm progressively integrates cellular identities through multiple rounds of divisive clustering”, explains doctoral researcher António Sousa, the lead developer of Coralysis.

At its core, Coralysis relies on machine learning, enabling it to build reference models that can be used to predict cellular identities of new datasets and even estimate how confident the predictions are. This helps researchers avoid the cumbersome and often unreliable task of manually identifying cell types from scratch. Another unique feature of Coralysis is that it enables detection of changing cellular states that might otherwise be missed.

“With Coralysis, we aim to provide the scientific community with a powerful tool to reveal hidden patterns of cellular diversity and gain a deeper understanding of complex single-cell data. By making it openly available, we hope to support collaboration and accelerate discoveries across the global research community”, says Professor Laura Elo, the principal investigator of the project. The study by the Elo group has been published in the scientific journal Nucleic Acids Research.