How to Use Scrublet

15-01-2025

Single-cell RNA sequencing (scRNA-seq) is a powerful technique for studying gene expression at the single-cell level. However, scRNA-seq data is often contaminated with doublets: two or more cells that were captured together and sequenced under a single barcode, so they appear in the data as one cell. Doublets can confound downstream analyses, for example by creating artificial intermediate cell populations, and lead to incorrect biological conclusions. Scrublet is a Python package that scores each cell for its likelihood of being a doublet so that you can remove the flagged cells. This article will guide you through the process of using Scrublet effectively.

1. Installation and Setup

Before you begin, make sure you have the necessary software installed. You'll need Python (preferably version 3.7 or higher) and several Python packages, including scrublet, scanpy, and numpy. You can install them using pip:

pip install scrublet scanpy numpy

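You'll also need your scRNA-seq count matrix (typically a .csv or .mtx file). Scrublet expects this matrix with cells as rows and genes as columns, with each value giving the number of transcripts detected for that gene in that cell. Note that 10x Genomics .mtx files store genes as rows, so they must be transposed; sc.read_10x_mtx does this for you and returns an AnnData object with cells as rows.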
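If a matrix was loaded by hand with genes as rows, it can be transposed before it is passed to Scrublet. The following is a minimal sketch with a made-up toy matrix; the variable names are illustrative only.

```python
import numpy as np

# Toy genes x cells matrix: 4 genes (rows), 3 cells (columns)
genes_by_cells = np.array([[0, 3, 1],
                           [2, 0, 1],
                           [5, 0, 0],
                           [1, 7, 0]])

# Transpose so that each row is a cell and each column is a gene
cells_by_genes = genes_by_cells.T

n_cells, n_genes = cells_by_genes.shape
print(f"{n_cells} cells x {n_genes} genes")
```

The same .T attribute works on SciPy sparse matrices, which is how large count matrices are usually stored.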

2. Data Preparation

Scrublet expects raw, unnormalized counts. It performs its own normalization, gene filtering, and PCA internally, so do not normalize or log-transform the matrix before passing it in. Normalization for downstream analysis (for example sc.pp.normalize_total followed by sc.pp.log1p) belongs after doublet removal. If your data combines several samples, it is generally advisable to run Scrublet on each sample separately. Scanpy provides convenient functions for loading the data:

import scanpy as sc
adata = sc.read_10x_mtx("path/to/your/data") # Replace with your data loading method
# Keep adata.X as raw counts here; normalize and log-transform only after doublet removal

Remember to replace "path/to/your/data" with the actual path to your data. The specific data loading method depends on your data format.
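Scrublet's documentation describes its input as raw (unnormalized) transcript counts, so a quick sanity check before running it can catch a matrix that has already been normalized. The helper below is a hypothetical sketch, not part of Scrublet or Scanpy, and uses a toy matrix.

```python
import numpy as np

def looks_like_raw_counts(matrix):
    """Return True if every value is a non-negative whole number."""
    values = np.asarray(matrix)
    return bool(np.all(values >= 0) and np.all(values == np.round(values)))

# Toy matrix: 3 cells (rows) x 4 genes (columns) of UMI counts
raw = np.array([[0, 2, 5, 1],
                [3, 0, 0, 7],
                [1, 1, 0, 0]])

# The same matrix after total-count normalization and log1p
normalized = np.log1p(raw / raw.sum(axis=1, keepdims=True) * 1e4)

raw_ok = looks_like_raw_counts(raw)
normalized_ok = looks_like_raw_counts(normalized)
print(raw_ok, normalized_ok)
```

For a SciPy sparse matrix such as adata.X, passing its .data attribute checks only the stored non-zero values, which is sufficient for this test.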

3. Running Scrublet

Now, let's run Scrublet. This involves creating a Scrublet object and then calling its scrub_doublets method. The expected_doublet_rate parameter is the fraction of cells you expect to be doublets. It is usually estimated from the experimental setup, such as the number of cells loaded or recovered, or from prior knowledge of your platform. A value around 0.05 (5%) is often a reasonable starting point.

import scrublet as scr
scrub = scr.Scrublet(adata.X, expected_doublet_rate=0.05)
# Returns the per-cell doublet scores and the boolean doublet calls
doublet_scores, predicted_doublets = scrub.scrub_doublets(min_counts=2, min_cells=3, n_prin_comps=30)

min_counts: Used together with min_cells to filter genes before PCA. A gene is kept only if it has at least min_counts counts in at least min_cells cells.
min_cells: Minimum number of cells in which a gene must reach min_counts to be kept.
n_prin_comps: Number of principal components used to embed the cells before the nearest-neighbor graph is built.

Adjust these parameters based on your dataset's characteristics. Experimentation may be required.
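One rough way to pick a value for expected_doublet_rate is to scale it with the number of cells recovered. The sketch below assumes a rule of thumb of about 0.8% doublets per 1,000 recovered cells, which is often quoted for droplet-based platforms; both the helper function and the default rate are assumptions for illustration, not part of Scrublet, so check the figures for your own platform.

```python
def estimate_doublet_rate(n_cells_recovered, rate_per_thousand=0.008):
    """Rough expected doublet fraction from the number of recovered cells.

    rate_per_thousand is an assumed rule of thumb (about 0.8% per 1,000
    cells), not a Scrublet parameter; check your platform's documentation.
    """
    if n_cells_recovered < 0:
        raise ValueError("n_cells_recovered must be non-negative")
    return rate_per_thousand * n_cells_recovered / 1000

# Example: roughly 6,000 cells recovered from one sample
rate = estimate_doublet_rate(6000)
print(f"Expected doublet rate: {rate:.3f}")
```

The resulting fraction can be passed as expected_doublet_rate when creating the Scrublet object.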

4. Interpreting the Results

After running Scrublet, you'll have access to several important outputs:

scrub.doublet_scores_obs_: A NumPy array containing the doublet score for each observed cell. Higher scores indicate a higher likelihood of being a doublet.
scrub.predicted_doublets_: A boolean array indicating whether each cell is predicted to be a doublet (True) or a singlet (False). If Scrublet cannot find a threshold automatically, this may be None until you set a threshold yourself.

You can visualize these results:

scrub.plot_histogram() # Histograms of doublet scores

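This plot helps you assess the distribution of doublet scores and judge whether the threshold is sensible. Ideally the scores of the simulated doublets are bimodal and the threshold sits in the trough between the two modes. You can then filter your data using the predicted labels: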

adata = adata[~scrub.predicted_doublets_, :].copy() # Remove predicted doublets (cells are rows in an AnnData object)
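The filtering step comes down to a boolean mask applied along the cell axis. Here is a minimal sketch using made-up scores and an assumed threshold of 0.25; in practice the scores come from Scrublet and the threshold from the histogram.

```python
import numpy as np

# Hypothetical doublet scores for six cells (in practice, Scrublet's output)
doublet_scores = np.array([0.03, 0.45, 0.10, 0.62, 0.08, 0.20])
threshold = 0.25  # assumed cutoff, chosen by eye from a histogram

# Toy counts matrix: one row per cell, one column per gene
counts = np.arange(18).reshape(6, 3)

# Cells scoring above the threshold are called doublets
is_doublet = doublet_scores > threshold

# Index the rows (cells), not the columns (genes)
singlet_counts = counts[~is_doublet, :]
print(f"Kept {singlet_counts.shape[0]} of {counts.shape[0]} cells")
```

Indexing an AnnData object with the same mask in the first position selects cells in the same way.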

5. Advanced Usage and Considerations

Simulations: Scrublet simulates doublets by combining random pairs of observed cells, then scores each cell by how many simulated doublets lie among its nearest neighbors. The sim_doublet_ratio argument of the Scrublet object controls how many doublets are simulated relative to the number of observed cells.
Threshold Selection: The automatically chosen threshold might not always be optimal. Carefully examine the doublet score histogram and, if needed, set the threshold yourself with scrub.call_doublets(threshold=...), using a value read from the histogram.
Integration with Other Tools: Scrublet's output can be integrated with other scRNA-seq analysis tools like Scanpy, for example by storing the scores and predictions in adata.obs for downstream analysis.
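The simulation idea can be illustrated in a few lines of NumPy. This is a simplified sketch of the principle (summing the raw counts of randomly paired cells), not Scrublet's actual implementation; the function name and toy data are made up.

```python
import numpy as np

def simulate_doublets(counts, n_doublets, seed=0):
    """Create synthetic doublets by summing the counts of random cell pairs.

    counts: array of shape (n_cells, n_genes) holding raw counts.
    Returns the simulated matrix and the parent indices of each doublet.
    """
    rng = np.random.default_rng(seed)
    n_cells = counts.shape[0]
    # Draw two parent cells (with replacement) for every simulated doublet
    parents = rng.integers(0, n_cells, size=(n_doublets, 2))
    doublets = counts[parents[:, 0]] + counts[parents[:, 1]]
    return doublets, parents

# Toy matrix: 4 cells (rows) x 3 genes (columns)
counts = np.array([[0, 2, 5],
                   [3, 0, 0],
                   [1, 1, 0],
                   [4, 0, 2]])

sim, parents = simulate_doublets(counts, n_doublets=8)
print(sim.shape)
```

Because each simulated doublet is the sum of two real cells, its total count equals the combined totals of its parents, which is one reason doublets tend to have larger library sizes.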

Conclusion

Scrublet provides a robust and user-friendly method for doublet detection in scRNA-seq data. By supplying raw counts, selecting appropriate parameters, and checking the score histogram before removing cells, you can improve the quality and reliability of your single-cell analyses. As with any computational method, consider its assumptions and limitations, and visually inspect the results before drawing final conclusions.