CellBender remove-background report

This output report from cellbender remove-background contains a summary of the run, including counts remaining, counts removed, further analyses, and any warnings or suggestions if the run seems to be abnormal.

This HTML report is created from a Jupyter notebook at

cellbender/cellbender/remove-background/report.ipynb

within the CellBender codebase. Feel free to run the notebook yourself and make any changes you see fit, or use it as a starting point for further analyses.

The commentary in this report is generated using automated heuristics and best guesses based on hundreds of real datasets. If any of the automated commentary in this report seems incorrect for your dataset, please submit a question or an issue at our GitHub repository: https://github.com/broadinstitute/CellBender

Cellarium Lab .. Methods Group .. Data Sciences Platform .. Broad Institute


Input and output files

(Modify this section if you run this notebook yourself.)

Input file: /net/vast-storage/scratch/vast/kellislab/benjames/raw/pbmc_granulocyte_sorted_10k.h5ad
Output file: /net/vast-storage/scratch/vast/kellislab/benjames/cellbender/pbmc_granulocyte_sorted_10k/cellbender.h5

Report

CellBender version 0.3.0

2023-12-05 11:57:36

cellbender.h5

Loaded dataset

AnnData object with n_obs × n_vars = 18310 × 36601
    obs: 'background_fraction', 'cell_probability', 'cell_size', 'droplet_efficiency', 'n_raw', 'n_cellbender'
    var: 'ambient_expression', 'feature_type', 'genome', 'gene_id', 'cellbender_analyzed', 'n_raw', 'n_cellbender'
    uns: 'cell_size_lognormal_std', 'empty_droplet_size_lognormal_loc', 'empty_droplet_size_lognormal_scale', 'swapping_fraction_dist_params', 'estimator', 'features_analyzed_inds', 'fraction_data_used_for_testing', 'learning_curve_learning_rate_epoch', 'learning_curve_learning_rate_value', 'learning_curve_test_elbo', 'learning_curve_test_epoch', 'learning_curve_train_elbo', 'learning_curve_train_epoch', 'target_false_positive_rate'
    obsm: 'cellbender_embedding'
    layers: 'raw', 'cellbender'

Examine how many counts were removed in total

removed 1833889 counts from non-empty droplets
removed 3.44% of the counts in non-empty droplets
Rough estimate of expectations based on nothing but the plot above:
roughly 1215847 noise counts should be in non-empty droplets
that is approximately 2.28% of the counts in non-empty droplets
with a false positive rate [FPR] of 1.0%, we would expect to remove about 3.28% of the counts in non-empty droplets

It looks like the algorithm did a great job meeting that expectation.
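
The expectation quoted above is consistent with simple addition: the rough noise estimate plus the target false positive rate. The following is a minimal Python sketch of that arithmetic using only the numbers printed in this report. The variable names are illustrative, and the notebook's own calculation may differ in detail.

```python
# Numbers copied from the report output above.
removed_counts = 1_833_889        # counts removed from non-empty droplets
removed_fraction = 0.0344         # 3.44% of the counts in non-empty droplets
noise_fraction_estimate = 0.0228  # rough noise estimate (2.28%)
target_fpr = 0.01                 # false positive rate [FPR] of 1.0%

# Implied total counts in non-empty droplets. This is approximate because
# the printed percentage is rounded to two decimal places.
total_counts = removed_counts / removed_fraction

# Expected removal: estimated noise plus the allowed false positives.
expected_fraction = noise_fraction_estimate + target_fpr

# Gap between observed and expected removal, in percentage points.
difference_pct_points = 100 * (removed_fraction - expected_fraction)

print(f"implied total counts: {total_counts:,.0f}")
print(f"expected removal: {100 * expected_fraction:.2f}%")
print(f"observed minus expected: {difference_pct_points:+.2f} percentage points")
```

A small gap between the observed and expected fractions is what the automated comment above treats as meeting the expectation.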

Assessing convergence of the algorithm

The learning curve tells us about the progress of the algorithm in inferring all the latent variables in our model. We want to see the ELBO increasing as training epochs increase. Generally it is desirable for the ELBO to converge at some high plateau, and be fairly stable.

What to watch out for:

1. large downward spikes in the ELBO (of value more than a few hundred)
2. the test ELBO can be smaller than the train ELBO, but generally we want to see both curves increasing and reaching a stable plateau. We do not want the test ELBO to dip way back down at the end.
3. lack of convergence, where it looks like the ELBO would change quite a bit if training went on for more epochs.
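
The three warning signs above can be turned into simple checks on the ELBO values. The sketch below is only an illustration: the function name `assess_learning_curve`, its thresholds, and the example curves are all assumptions made here, not the heuristics CellBender itself uses for its automated assessment.

```python
def assess_learning_curve(train_elbo, test_elbo, spike_threshold=300.0,
                          plateau_window=5, plateau_tolerance=50.0):
    """Return a list of warning strings for a pair of ELBO curves.

    All thresholds are illustrative assumptions, not CellBender defaults.
    """
    warnings = []

    # 1. Large downward spikes between consecutive training epochs.
    drops = [prev - cur for prev, cur in zip(train_elbo, train_elbo[1:])]
    if any(drop > spike_threshold for drop in drops):
        warnings.append("large downward spike in train ELBO")

    # 2. Test ELBO dipping back down at the end, relative to its best value.
    if test_elbo and max(test_elbo) - test_elbo[-1] > spike_threshold:
        warnings.append("test ELBO dips at the end of training")

    # 3. Lack of convergence: the curve still moves over the final epochs.
    tail = train_elbo[-plateau_window:]
    if len(tail) >= 2 and max(tail) - min(tail) > plateau_tolerance:
        warnings.append("train ELBO has not reached a stable plateau")

    return warnings


# Synthetic example curves (not taken from this run).
healthy_train = [-9000, -7000, -6200, -6050, -6010, -6005, -6003, -6002, -6001]
healthy_test = [-7100, -6100, -6020, -6010]
spiky_train = [-9000, -7000, -6200, -7500, -6100, -6020, -6010, -6005, -6003, -6002]

healthy_warnings = assess_learning_curve(healthy_train, healthy_test)
spiky_warnings = assess_learning_curve(spiky_train, healthy_test)
print(healthy_warnings)
print(spiky_warnings)
```

Sensible thresholds depend on the scale of the ELBO for a given dataset, so fixed numbers like these should be treated as starting points only.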

Automated assessment --------

Summary:

This learning curve looks normal.

Examine count removal per gene

Pearson correlation coefficient for the above is 0.9763

This meets expectations.
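
The plot this coefficient refers to is not reproduced in this text, so the sketch below only illustrates how a Pearson correlation coefficient is computed. Pairing per-gene `ambient_expression` with `n_removed` from the ten rows of the top-genes table in this report is an assumption made for illustration. It is not necessarily the comparison the notebook performs, and it will not reproduce the value 0.9763.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Values copied from the ten rows of the top-genes table in this report.
ambient_expression = [0.004762, 0.006040, 0.012873, 0.000561, 0.006768,
                      0.002720, 0.007683, 0.006728, 0.000488, 0.000453]
n_removed = [11988, 15034, 31249, 1459, 16484,
             6595, 18694, 16473, 1188, 1098]

r = pearson(ambient_expression, n_removed)
print(f"Pearson r over these ten genes: {r:.4f}")
```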

Table of top genes removed

Ranked by fraction removed, and excluding genes with fewer than 2892 total raw counts (90th percentile)

| gene_name | ambient_expression | feature_type | genome | gene_id | cellbender_analyzed | n_raw | n_cellbender | n_removed | fraction_removed | fraction_remaining | n_raw_cells | n_cellbender_cells | n_removed_cells | fraction_removed_cells | fraction_remaining_cells |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| S100A8 | 0.004762 | Gene Expression | GRCh38 | ENSG00000143546 | True | 33848 | 21860 | 11988 | 0.354172 | 0.645828 | 28784 | 21860 | 6924 | 0.240550 | 0.759450 |
| S100A9 | 0.006040 | Gene Expression | GRCh38 | ENSG00000163220 | True | 48890 | 33856 | 15034 | 0.307507 | 0.692493 | 42397 | 33856 | 8541 | 0.201453 | 0.798547 |
| RPS29 | 0.012873 | Gene Expression | GRCh38 | ENSG00000213741 | True | 113718 | 82469 | 31249 | 0.274794 | 0.725206 | 100510 | 82469 | 18041 | 0.179495 | 0.820505 |
| CD14 | 0.000561 | Gene Expression | GRCh38 | ENSG00000170458 | True | 5404 | 3945 | 1459 | 0.269985 | 0.730015 | 4829 | 3945 | 884 | 0.183061 | 0.816939 |
| RPL39 | 0.006768 | Gene Expression | GRCh38 | ENSG00000198918 | True | 62573 | 46089 | 16484 | 0.263436 | 0.736564 | 55589 | 46089 | 9500 | 0.170897 | 0.829103 |
| ATP5F1E | 0.002720 | Gene Expression | GRCh38 | ENSG00000124172 | True | 26589 | 19994 | 6595 | 0.248035 | 0.751965 | 23779 | 19994 | 3785 | 0.159174 | 0.840826 |
| RPL32 | 0.007683 | Gene Expression | GRCh38 | ENSG00000144713 | True | 77471 | 58777 | 18694 | 0.241303 | 0.758697 | 69550 | 58777 | 10773 | 0.154896 | 0.845104 |
| RPS21 | 0.006728 | Gene Expression | GRCh38 | ENSG00000171858 | True | 69437 | 52964 | 16473 | 0.237237 | 0.762763 | 62559 | 52964 | 9595 | 0.153375 | 0.846625 |
| ATP5ME | 0.000488 | Gene Expression | GRCh38 | ENSG00000169020 | True | 5008 | 3820 | 1188 | 0.237220 | 0.762780 | 4495 | 3820 | 675 | 0.150167 | 0.849833 |
| COX7B | 0.000453 | Gene Expression | GRCh38 | ENSG00000131174 | True | 4689 | 3591 | 1098 | 0.234165 | 0.765835 | 4208 | 3591 | 617 | 0.146625 | 0.853375 |
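
The derived columns in this table follow from the raw and post-CellBender counts: `n_removed` is `n_raw - n_cellbender`, and `fraction_removed` is `n_removed / n_raw`. The sketch below reproduces the filtering and ranking described above for three rows copied from the table, plus one made-up low-count gene to show the cutoff in action. The helper name `rank_genes_by_fraction_removed` is hypothetical and not part of CellBender.

```python
# (gene_name, n_raw, n_cellbender); the first three rows are copied from the
# table above, and the last row is a hypothetical gene below the cutoff.
rows = [
    ("CD14", 5404, 3945),
    ("S100A8", 33848, 21860),
    ("RPS29", 113718, 82469),
    ("HYPOTHETICAL_LOW_GENE", 100, 10),
]

min_raw_counts = 2892  # 90th-percentile cutoff quoted in the report


def rank_genes_by_fraction_removed(rows, min_raw_counts):
    """Drop low-count genes, compute removal stats, sort by fraction removed."""
    table = []
    for gene_name, n_raw, n_cellbender in rows:
        if n_raw < min_raw_counts:
            continue  # excluded: fewer than the minimum total raw counts
        n_removed = n_raw - n_cellbender
        fraction_removed = n_removed / n_raw
        table.append({
            "gene_name": gene_name,
            "n_removed": n_removed,
            "fraction_removed": fraction_removed,
            "fraction_remaining": 1.0 - fraction_removed,
        })
    table.sort(key=lambda row: row["fraction_removed"], reverse=True)
    return table


ranked = rank_genes_by_fraction_removed(rows, min_raw_counts)
for row in ranked:
    print(f'{row["gene_name"]:<8} {row["n_removed"]:>6} {row["fraction_removed"]:.6f}')
```

The count cutoff matters because a gene with very few raw counts can show a large fraction removed from only a handful of counts, which would otherwise dominate the ranking.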

Cell probabilities

The inferred posterior probability that each droplet is non-empty.

We sometimes write "non-empty" instead of "cell" because dead cells and other cellular debris can still lead to a "non-empty" droplet, which will have a high posterior cell probability. But these kinds of low-quality droplets should be removed during cell QC to retain only high-quality cells for downstream analyses.
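
The distinction between "non-empty" and "high-quality cell" can be sketched as two separate filters. Everything in the example below is hypothetical: the barcodes, the probability cutoff of 0.5, and the minimum-count QC threshold are assumptions chosen for illustration, and real cell QC usually involves more criteria than total counts.

```python
# Hypothetical droplets: (barcode, cell_probability, n_cellbender counts).
droplets = [
    ("AAAC-1", 0.999, 8200),
    ("AAAG-1", 0.980, 310),   # non-empty but few counts: possibly debris
    ("AACT-1", 0.020, 95),    # likely empty
    ("AAGG-1", 0.730, 4100),
]

probability_cutoff = 0.5  # assumed cutoff for calling a droplet non-empty
min_counts_for_qc = 500   # assumed, dataset-specific QC threshold

# Step 1: keep droplets the model considers non-empty.
non_empty = [d for d in droplets if d[1] > probability_cutoff]

# Step 2: separate cell QC, applied only to the non-empty droplets.
high_quality = [d for d in non_empty if d[2] >= min_counts_for_qc]

print("non-empty:", [barcode for barcode, _, _ in non_empty])
print("after QC: ", [barcode for barcode, _, _ in high_quality])
```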

Concordance of data before and after remove-background

The intent is to change the input data as little as possible while achieving noise removal. These plots show general summary statistics about similarity of the input and output data. We expect to see the data lying close to a straight line (gray). There may be outlier genes/features, which are often those highest-expressed in the ambient RNA.

The plots here show data for inferred cell-containing droplets, and exclude the empty droplets.

PCA of encoded gene expression

We are not looking for anything specific in the PCA plot of the gene expression embedding, but often we see clusters that correspond to different cell types. If you see only a single large blob, then the dataset might contain only one cell type, or perhaps there are few counts per droplet.

Summary of warnings:

None.