remove-background report
This output report from cellbender remove-background contains a summary of the run, including counts remaining, counts removed, further analyses, and any warnings or suggestions if the run seems to be abnormal.
This HTML report is created from a Jupyter notebook at cellbender/cellbender/remove-background/report.ipynb within the CellBender codebase. Feel free to run the notebook yourself and make any changes you see fit, or use it as a starting point for further analyses.
The commentary in this report is generated using automated heuristics and best guesses based on hundreds of real datasets. If any of the automated commentary in this report seems incorrect for your dataset, please submit a question or an issue at our GitHub repository: https://github.com/broadinstitute/CellBender
Cellarium Lab .. Methods Group .. Data Sciences Platform .. Broad Institute
(Modify this section if you run this notebook yourself.)
Input file: /om2/user/benjames/raw/human_brain_3k.h5ad
Output file: /net/vast-storage/scratch/vast/kellislab/benjames/cellbender/human_brain_3k/cellbender.h5
2023-12-05 21:10:01
AnnData object with n_obs × n_vars = 8159 × 36601
    obs: 'background_fraction', 'cell_probability', 'cell_size', 'droplet_efficiency', 'n_raw', 'n_cellbender'
    var: 'ambient_expression', 'feature_type', 'genome', 'gene_id', 'cellbender_analyzed', 'n_raw', 'n_cellbender'
    uns: 'cell_size_lognormal_std', 'empty_droplet_size_lognormal_loc', 'empty_droplet_size_lognormal_scale', 'swapping_fraction_dist_params', 'estimator', 'features_analyzed_inds', 'fraction_data_used_for_testing', 'learning_curve_learning_rate_epoch', 'learning_curve_learning_rate_value', 'learning_curve_test_elbo', 'learning_curve_test_epoch', 'learning_curve_train_elbo', 'learning_curve_train_epoch', 'target_false_positive_rate'
    obsm: 'cellbender_embedding'
    layers: 'raw', 'cellbender'
removed 428626 counts from non-empty droplets
removed 1.11% of the counts in non-empty droplets
Rough estimate of expectations based on nothing but the plot above:
roughly 54476 noise counts should be in non-empty droplets
that is approximately 0.14% of the counts in non-empty droplets
with a false positive rate [FPR] of 1.0%, we would expect to remove about 1.14% of the counts in non-empty droplets
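The arithmetic behind this rough expectation can be reproduced in a few lines. The sketch below assumes, based only on the numbers quoted above, that the expected fraction removed is approximately the estimated noise fraction plus the target FPR. The function and variable names are illustrative and are not part of CellBender.

```python
def expected_removal_fraction(noise_fraction: float, fpr: float) -> float:
    """Rough expectation: estimated noise fraction plus the target FPR.

    Assumption: the additive relation is inferred from the figures in this
    report (0.14% + 1.0% = 1.14%), not taken from the CellBender source.
    """
    return noise_fraction + fpr


noise_fraction = 0.0014     # ~0.14% of counts in non-empty droplets estimated as noise
target_fpr = 0.01           # target false positive rate of 1.0%
observed_fraction = 0.0111  # 1.11% of counts actually removed in this run

expected = expected_removal_fraction(noise_fraction, target_fpr)
print(f"expected {expected:.2%}, observed {observed_fraction:.2%}")
```

Comparing the two percentages side by side is what the automated commentary summarizes in the next line.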
It looks like the algorithm did a great job meeting that expectation.
The learning curve tells us about the progress of the algorithm in inferring all the latent variables in our model. We want to see the ELBO increasing as training epochs increase. Generally it is desirable for the ELBO to converge at some high plateau, and be fairly stable.
What to watch out for:
1. large downward spikes in the ELBO (of value more than a few hundred)
2. the test ELBO can be smaller than the train ELBO, but generally we want to see both curves increasing and reaching a stable plateau. We do not want the test ELBO to dip way back down at the end.
3. lack of convergence, where it looks like the ELBO would change quite a bit if training went on for more epochs.
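The three warning signs above can be turned into simple checks on the learning curve values (the AnnData summary lists them under the 'learning_curve_train_elbo' and 'learning_curve_test_elbo' keys in uns). The following is a minimal sketch of such checks, not the heuristics this report actually uses; the function name and all thresholds (`spike_size`, `tail_frac`, `plateau_tol`) are assumptions chosen for illustration.

```python
def assess_learning_curve(train_elbo, test_elbo,
                          spike_size=300.0, tail_frac=0.1, plateau_tol=50.0):
    """Return a list of warning strings for a pair of ELBO curves.

    All thresholds are hypothetical defaults, not CellBender's values.
    """
    warnings = []

    # 1. Large downward spikes: a drop between consecutive epochs
    #    that exceeds spike_size.
    drops = [prev - cur for prev, cur in zip(train_elbo, train_elbo[1:])]
    if any(d > spike_size for d in drops):
        warnings.append("large downward spike in the train ELBO")

    # 2. Test ELBO dipping back down at the end of training.
    if max(test_elbo) - test_elbo[-1] > spike_size:
        warnings.append("final test ELBO is much lower than the max test ELBO")

    # 3. Lack of convergence: the train ELBO is still rising over the
    #    final tail_frac of epochs.
    n_tail = max(2, int(len(train_elbo) * tail_frac))
    tail = train_elbo[-n_tail:]
    if tail[-1] - tail[0] > plateau_tol:
        warnings.append("train ELBO still rising; training may not have converged")

    return warnings


# Made-up curves: the train ELBO plateaus, the test ELBO dips at the end.
train = [-5000, -3000, -2000, -1500, -1400, -1390, -1385, -1384, -1383, -1383]
test = [-3100, -1600, -1400, -1390, -2000]
print(assess_learning_curve(train, test))
```

In practice the thresholds would need tuning to the scale of the ELBO in a given dataset.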
Automated assessment --------
Summary:
This is slightly unusual behavior, and a reduced --learning-rate might be indicated. Consider re-running with half the current learning rate to compare the results.
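One way to follow this suggestion is to build the re-run command with the learning rate halved. The sketch below only assembles the command string; the helper name, the file names, and the current learning rate of 1e-4 are placeholders assumed for illustration, so substitute the values from your own run.

```python
import shlex


def rerun_command(input_file: str, output_file: str,
                  current_learning_rate: float) -> str:
    """Build a remove-background command using half the given learning rate."""
    new_lr = current_learning_rate / 2
    args = [
        "cellbender", "remove-background",
        "--input", input_file,
        "--output", output_file,
        "--learning-rate", f"{new_lr:g}",
    ]
    # shlex.join quotes arguments safely for a POSIX shell.
    return shlex.join(args)


# Placeholder paths and an assumed current learning rate of 1e-4.
cmd = rerun_command("raw.h5ad", "cellbender_half_lr.h5", 1e-4)
print(cmd)
```

Writing the second run to a separate output file keeps both results available for comparison.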
Pearson correlation coefficient for the above is 0.9317
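For readers who want to compute a Pearson correlation coefficient on their own per-gene quantities, here is a minimal sketch using only the standard library. The example inputs are the ambient_expression and n_removed values from the gene table in this report; pairing those two columns is an illustrative choice and is not necessarily the comparison the coefficient above refers to.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


# Values copied from the ten genes listed in the table of this report.
ambient_expression = [0.004209, 0.006080, 0.004366, 0.010887, 0.004485,
                      0.011273, 0.004616, 0.007446, 0.001510, 0.005869]
n_removed = [1109, 1635, 1142, 2847, 1176, 2885, 1209, 1978, 421, 1539]

r = pearson(ambient_expression, n_removed)
print(f"Pearson r = {r:.4f}")
```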
This meets expectations.
Ranked by fraction removed, and excluding genes with fewer than 2171 total raw counts (90th percentile)
gene_name | ambient_expression | feature_type | genome | gene_id | cellbender_analyzed | n_raw | n_cellbender | n_removed | fraction_removed | fraction_remaining | n_raw_cells | n_cellbender_cells | n_removed_cells | fraction_removed_cells | fraction_remaining_cells |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
MT-ND3 | 0.004209 | Gene Expression | GRCh38 | ENSG00000198840 | True | 5977 | 4868 | 1109 | 0.185545 | 0.814455 | 5154 | 4868 | 286 | 0.055491 | 0.944509 |
MT-ATP6 | 0.006080 | Gene Expression | GRCh38 | ENSG00000198899 | True | 9017 | 7382 | 1635 | 0.181324 | 0.818676 | 7804 | 7382 | 422 | 0.054075 | 0.945925 |
MT-CYB | 0.004366 | Gene Expression | GRCh38 | ENSG00000198727 | True | 6383 | 5241 | 1142 | 0.178913 | 0.821087 | 5544 | 5241 | 303 | 0.054654 | 0.945346 |
MT-CO2 | 0.010887 | Gene Expression | GRCh38 | ENSG00000198712 | True | 16173 | 13326 | 2847 | 0.176034 | 0.823966 | 14060 | 13326 | 734 | 0.052205 | 0.947795 |
MT-ND2 | 0.004485 | Gene Expression | GRCh38 | ENSG00000198763 | True | 6975 | 5799 | 1176 | 0.168602 | 0.831398 | 6118 | 5799 | 319 | 0.052141 | 0.947859 |
MT-CO1 | 0.011273 | Gene Expression | GRCh38 | ENSG00000198804 | True | 17899 | 15014 | 2885 | 0.161182 | 0.838818 | 15808 | 15014 | 794 | 0.050228 | 0.949772 |
MT-ND1 | 0.004616 | Gene Expression | GRCh38 | ENSG00000198888 | True | 7543 | 6334 | 1209 | 0.160281 | 0.839719 | 6666 | 6334 | 332 | 0.049805 | 0.950195 |
MT-CO3 | 0.007446 | Gene Expression | GRCh38 | ENSG00000198938 | True | 12440 | 10462 | 1978 | 0.159003 | 0.840997 | 10995 | 10462 | 533 | 0.048477 | 0.951523 |
MT-ND5 | 0.001510 | Gene Expression | GRCh38 | ENSG00000198786 | True | 2656 | 2235 | 421 | 0.158509 | 0.841491 | 2357 | 2235 | 122 | 0.051761 | 0.948239 |
MT-ND4 | 0.005869 | Gene Expression | GRCh38 | ENSG00000198886 | True | 11474 | 9935 | 1539 | 0.134129 | 0.865871 | 10361 | 9935 | 426 | 0.041116 | 0.958884 |
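The derived columns in this table appear to follow directly from n_raw and n_cellbender. The sketch below recomputes them for three of the genes above and reproduces the ranking and the minimum-count filter; the relationships between columns are inferred from the table values, and the function and variable names are illustrative only.

```python
def removal_stats(n_raw: int, n_cellbender: int):
    """Return (n_removed, fraction_removed, fraction_remaining)."""
    n_removed = n_raw - n_cellbender
    fraction_removed = n_removed / n_raw
    return n_removed, fraction_removed, 1.0 - fraction_removed


# (n_raw, n_cellbender) copied from the table above.
counts = {
    "MT-ND4": (11474, 9935),
    "MT-ND3": (5977, 4868),
    "MT-ATP6": (9017, 7382),
}

min_raw_counts = 2171  # threshold quoted above the table (90th percentile)

# Keep genes with enough raw counts, then rank by fraction removed.
kept = {g: c for g, c in counts.items() if c[0] >= min_raw_counts}
ranked = sorted(kept, key=lambda g: removal_stats(*kept[g])[1], reverse=True)

for gene in ranked:
    n_removed, frac_removed, frac_remaining = removal_stats(*kept[gene])
    print(f"{gene}: removed {n_removed} ({frac_removed:.6f}), "
          f"remaining {frac_remaining:.6f}")
```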
The inferred posterior probability that each droplet is non-empty.
We sometimes write "non-empty" instead of "cell" because dead cells and other cellular debris can still lead to a "non-empty" droplet, which will have a high posterior cell probability. But these kinds of low-quality droplets should be removed during cell QC to retain only high-quality cells for downstream analyses.
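As a simple illustration of how these posterior probabilities might be used, the sketch below flags droplets as non-empty when their probability exceeds a cutoff. The 0.5 cutoff and the function name are assumptions for illustration; a non-empty call is not a substitute for downstream cell QC.

```python
def call_non_empty(cell_probability, threshold=0.5):
    """Flag each droplet as non-empty if its posterior exceeds the threshold."""
    return [p > threshold for p in cell_probability]


# Made-up posterior probabilities for five droplets.
probs = [0.999, 0.97, 0.51, 0.12, 0.0003]
flags = call_non_empty(probs)
print(f"{sum(flags)} of {len(flags)} droplets called non-empty")
```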
remove-background
The intent is to change the input data as little as possible while achieving noise removal. These plots show general summary statistics about similarity of the input and output data. We expect to see the data lying close to a straight line (gray). There may be outlier genes/features, which are often those highest-expressed in the ambient RNA.
The plots here show data for inferred cell-containing droplets, and exclude the empty droplets.
We are not looking for anything specific in the PCA plot of the gene expression embedding, but often we see clusters that correspond to different cell types. If you see only a single large blob, then the dataset might contain only one cell type, or perhaps there are few counts per droplet.
Final test ELBO is much lower than the max test ELBO.