Picard Metrics Definitions

AlignmentSummaryMetrics: High level metrics about the alignment of reads within a SAM file, produced by the CollectAlignmentSummaryMetrics program and usually stored in a file with the extension ".alignment_summary_metrics".
CollectIonTorrentBaseCallMetrics.IonBaseCallLibraryMetric: Metrics to describe molecular-barcode-specific data about an Ion Torrent basecalling run.
CollectIonTorrentBaseCallMetrics.IonBaseCallRunMetric: Metrics to describe an Ion Torrent basecalling run, including metrics about first 2 test fragments.
CollectIonTorrentBaseCallMetrics.IonRunMetric: Overview of an Ion Torrent basecalling run.
CollectOxoGMetrics.CpcgMetrics: Metrics class for outputs.
CollectQualityYieldMetrics.QualityYieldMetrics: A set of metrics used to describe the general quality of a BAM file
CollectVariantCallingMetrics.VariantCallingDetailMetrics: A collection of metrics relating to SNPs and indels within a variant-calling file (VCF) for a given sample.
CollectVariantCallingMetrics.VariantCallingSummaryMetrics: A collection of metrics relating to SNPs and indels within a variant-calling file (VCF).
CollectWgsMetrics.WgsMetrics: Metrics for evaluating the performance of whole genome sequencing experiments.
ContaminationMetrics.FauxMetric: A metric-like container representing a contamination metrics file entry, though the contamination file is not a metrics file.
CoverageMetric:
DbSnpMatchMetrics: Metrics about how genotypes called by the pipeline match up to dbSNP, created by the CollectDbSnpMatches program and usually stored in a file with the extension ".dbsnp_matches".
DuplicationMetrics: Metrics that are calculated during the process of marking duplicates within a stream of SAMRecords.
ExtractIlluminaBarcodes.BarcodeMetric: Metrics produced by the ExtractIlluminaBarcodes program that is used to parse data in the basecalls directory and determine to which barcode each read should be assigned.
FingerprintDetailMetrics: A collection of metrics about how the reads in a BAM file matched up with the known genotypes for a particular fingerprint panel.
FingerprintSummaryMetrics: A collection of metrics that summarize the match of reads in a particular BAM file against various fingerprint panels.
FingerprintingDetailMetrics: Detailed metrics about an individual SNP/Haplotype comparison within a fingerprint comparison.
FingerprintingSummaryMetrics: Summary fingerprinting metrics and statistics about the comparison of the sequence data from a single read group (lane or index within a lane) vs. a set of known genotypes for the expected sample.
GcBiasDetailMetrics: Class that holds detailed metrics about reads that fall within windows of a certain GC bin on the reference genome.
GcBiasSummaryMetrics: High level metrics that capture how biased the coverage in a certain lane is.
GenotypeConcordanceMetrics: Statistics about how well a given set of input genotypes matches a set of well known reference genotypes.
GenotypeFreeContaminationMetric:
GenotypeFreeLikelihoodPlotMetric:
HsMetrics: The set of metrics captured that are specific to a hybrid selection analysis.
IlluminaBasecallingMetrics: Metric for Illumina Basecalling that stores means and standard deviations on a per-barcode per-lane basis.
IlluminaLaneMetrics: Embodies characteristics that describe a lane.
IlluminaPhasingMetrics: Metrics for Illumina Basecalling that stores median phasing and prephasing percentages on a per-template-read, per-lane basis.
InsertSizeMetrics: Metrics about the insert size distribution of a paired-end library, created by the CollectInsertSizeMetrics program and usually written to a file with the extension ".insert_size_metrics".
InternalControlCycleMetrics: Metrics about observations of an internal control sequence in an individual cycle.
InternalControlSummaryMetrics: Summary metrics about internal controls within a lane.
JumpingLibraryMetrics: High level metrics about the presence of outward- and inward-facing pairs within a SAM file generated with a jumping library, produced by the CollectJumpingLibraryMetrics program and usually stored in a file with the extension ".jump_metrics".
KmerMetrics: Metrics about an individual kmer in a SAM or BAM file
LowPassConcordanceMetrics: Concordance statistics for a set of low coverage reads against well known genotypes for the same sample, for the purpose of ensuring that the sample being sequenced is the sample we think it is.
MendelianViolationMetrics: Describes the type and number of mendelian violations found within a Trio.
MotifCoverageMetric:
MultilevelMetrics:
RnaSeqMetrics: Metrics about the alignment of RNA-seq reads within a SAM file to genes, produced by the CollectRnaSeqMetrics program and usually stored in a file with the extension ".rna_metrics".
RrbsCpgDetailMetrics: Holds information about CpG sites encountered for RRBS processing QC
RrbsSummaryMetrics: Holds summary statistics from RRBS processing QC
SamFileValidator.ValidationMetrics:
ScreenSamReads.ScreenSamReadsMetrics: SAM or BAM read screening Metrics
SpikeInMetrics:
TargetedPcrMetrics: Metrics class for targeted PCR runs such as TSCA runs
UploadAggregationMetrics.Metrics.AggregationFauxMetric: A transient metric to encapsulate the data within the METRICS.AGGREGATION table.
UploadAggregationMetrics.Metrics.ForeignKeyFauxMetric: A transient metric to be merged with other metrics to augment them with a foreign key value in order to conform to database schema.
UploadAggregationMetrics.Metrics.ReadGroupFauxMetric: A transient metric to encapsulate the data within the METRICS.AGGREGATION_READ_GROUP table.
UploadFour54ScreeningMetrics.Four54ScreeningMetrics:
UploadIlluminaScreeningMetrics.IlluminaScreeningMetrics:
VariantCallingMetricsUploader.Metric.VariantCallingAnalysisFauxMetric: A bean for the METRIC.VARIANT_CALLING_ANALYSIS database table represented as a MetricBase so that it can be uploaded via the metrics uploader.
VariantCallingMetricsUploader.Metric.VariantCallingAnalysisForeignKeyFauxMetric:
VariantCallingSampleMetadataMetric:

AlignmentSummaryMetrics

High level metrics about the alignment of reads within a SAM file, produced by the CollectAlignmentSummaryMetrics program and usually stored in a file with the extension ".alignment_summary_metrics".
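
As a rough illustration of how such a file might be consumed, the sketch below parses a Picard-style metrics file into a list of dictionaries. It assumes the commonly seen layout (comment lines starting with "#", a "## METRICS CLASS" line, a tab-separated header row, data rows, and a blank line before any histogram section). That layout, the function name read_metrics, and the sample content are assumptions made for illustration; none of them are defined on this page.

```python
import io

def read_metrics(handle):
    """Return metric rows as dicts from a Picard-style metrics file (assumed layout)."""
    rows, header, in_metrics = [], None, False
    for raw in handle:
        line = raw.rstrip("\n")
        if line.startswith("## METRICS CLASS"):
            in_metrics = True          # the next non-comment line is the column header
            continue
        if not in_metrics or line.startswith("#"):
            continue                   # skip file header comments
        if not line.strip():
            if header is not None:
                break                  # a blank line ends the metrics table
            continue
        fields = line.split("\t")
        if header is None:
            header = fields
        else:
            rows.append(dict(zip(header, fields)))
    return rows

# Hypothetical file content, made up for illustration only.
sample = io.StringIO(
    "# example header comment\n"
    "\n"
    "## METRICS CLASS\tAlignmentSummaryMetrics\n"
    "CATEGORY\tTOTAL_READS\tPF_READS\n"
    "PAIR\t2000\t1900\n"
    "\n"
    "## HISTOGRAM\tplaceholder\n"
)
rows = read_metrics(sample)
print(rows[0]["CATEGORY"], rows[0]["TOTAL_READS"])
```

Values are returned as strings; converting them to numbers is left to the caller because column types differ between metrics classes.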

Column Definitions

CATEGORY: One of UNPAIRED (for a fragment run), FIRST_OF_PAIR (metrics for only the first read in a paired run), SECOND_OF_PAIR (metrics for only the second read in a paired run), or PAIR (metrics aggregated over both first and second reads in a pair).

TOTAL_READS: The total number of reads including all PF and non-PF reads. When CATEGORY equals PAIR this value will be 2x the number of clusters.

PF_READS: The number of PF reads where PF is defined as passing Illumina's filter.

PCT_PF_READS: The percentage of reads that are PF (PF_READS / TOTAL_READS)

PF_NOISE_READS: The number of PF reads that are marked as noise reads. A noise read is one which is composed entirely of A bases and/or N bases. These reads are marked as they are usually artifactual and are of no use in downstream analysis.

PF_READS_ALIGNED: The number of PF reads that were aligned to the reference sequence. This includes reads that aligned with low quality (i.e. their alignments are ambiguous).

PCT_PF_READS_ALIGNED: The percentage of PF reads that aligned to the reference sequence. PF_READS_ALIGNED / PF_READS

PF_ALIGNED_BASES: The total number of aligned bases, in all mapped PF reads, that are aligned to the reference sequence.

PF_HQ_ALIGNED_READS: The number of PF reads that were aligned to the reference sequence with a mapping quality of Q20 or higher signifying that the aligner estimates a 1/100 (or smaller) chance that the alignment is wrong.

PF_HQ_ALIGNED_BASES: The number of bases aligned to the reference sequence in reads that were mapped at high quality. Will usually approximate PF_HQ_ALIGNED_READS * READ_LENGTH but may differ when either mixed read lengths are present or many reads are aligned with gaps.

PF_HQ_ALIGNED_Q20_BASES: The subset of PF_HQ_ALIGNED_BASES where the base call quality was Q20 or higher.

PF_HQ_MEDIAN_MISMATCHES: The median number of mismatches versus the reference sequence in reads that were aligned to the reference at high quality (i.e. PF_HQ_ALIGNED_READS).

PF_MISMATCH_RATE: The rate of bases mismatching the reference for all bases aligned to the reference sequence.

PF_HQ_ERROR_RATE: The percentage of bases that mismatch the reference in PF HQ aligned reads.

PF_INDEL_RATE: The number of insertion and deletion events per 100 aligned bases. Uses the number of events as the numerator, not the number of inserted or deleted bases.

MEAN_READ_LENGTH: The mean read length of the set of reads examined. When looking at the data for a single lane with equal length reads this number is just the read length. When looking at data for merged lanes with differing read lengths this is the mean read length of all reads.

READS_ALIGNED_IN_PAIRS: The number of aligned reads whose mate pair was also aligned to the reference.

PCT_READS_ALIGNED_IN_PAIRS: The percentage of reads whose mate pair was also aligned to the reference. READS_ALIGNED_IN_PAIRS / PF_READS_ALIGNED

BAD_CYCLES: The number of instrument cycles in which 80% or more of base calls were no-calls.

STRAND_BALANCE: The number of PF reads aligned to the positive strand of the genome divided by the number of PF reads aligned to the genome.

PCT_CHIMERAS: The percentage of reads that map outside of a maximum insert size (usually 100kb) or that have the two ends mapping to different chromosomes.

PCT_ADAPTER: The percentage of PF reads that are unaligned and match to a known adapter sequence right from the start of the read.
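
The ratio columns above can be recomputed from the count columns. The following is a minimal sketch that follows the formulas as stated in the definitions; the function names and the example counts are hypothetical, and the sketch returns plain ratios (0 to 1) rather than assuming any particular percentage scaling. The second helper shows the standard Phred relationship behind the "Q20 means 1/100" statement.

```python
def alignment_ratios(total_reads, pf_reads, pf_reads_aligned,
                     reads_aligned_in_pairs, indel_events, pf_aligned_bases):
    """Recompute derived AlignmentSummaryMetrics columns from raw counts."""
    def ratio(num, den):
        return num / den if den else 0.0   # guard against empty inputs

    return {
        "PCT_PF_READS": ratio(pf_reads, total_reads),
        "PCT_PF_READS_ALIGNED": ratio(pf_reads_aligned, pf_reads),
        "PCT_READS_ALIGNED_IN_PAIRS": ratio(reads_aligned_in_pairs, pf_reads_aligned),
        # Events (not inserted or deleted bases) per 100 aligned bases.
        "PF_INDEL_RATE": 100.0 * ratio(indel_events, pf_aligned_bases),
    }

def phred_to_error_probability(q):
    """Convert a Phred-scaled quality (mapping or base quality) to an error probability."""
    return 10 ** (-q / 10)

# Made-up counts for illustration.
example = alignment_ratios(2000, 1900, 1800, 1710, 90, 180000)
print(example)
print(phred_to_error_probability(20))
```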

CollectIonTorrentBaseCallMetrics.IonBaseCallLibraryMetric

Metrics to describe molecular-barcode-specific data about an Ion Torrent basecalling run. Currently there is not much here because we haven't done any indexed Ion Torrent runs yet.

Column Definitions

RUN_NAME:

MOLECULAR_INDEX_NAME:

MEAN_READ_LENGTH:

TOTAL_NUM_BASES:

CollectIonTorrentBaseCallMetrics.IonBaseCallRunMetric

Metrics to describe an Ion Torrent basecalling run, including metrics about first 2 test fragments.

Column Definitions

RUN_NAME: The full name of the Ion run consisting of run date (analysis date?), operator, and PGM run number.

TOTAL_NUM_BASES: Number of filtered and trimmed base pairs reported in the SFF and FASTQ files.

NUM_Q17_BASES: Number of bases with predicted quality of Q17 or greater.

NUM_Q20_BASES: Number of bases with predicted quality of Q20 or greater.

TOTAL_NUM_READS: Total number of filtered and trimmed reads independent of length reported in the SFF and FASTQ files.

MEAN_READ_LENGTH: Average length, in base pairs, of all filtered and trimmed library reads reported in the SFF and FASTQ files.

LONGEST_READ: Maximum length, in base pairs, of all filtered and trimmed library reads reported in the file.

LIBRARY_CF: percentage of reads affected by Carry Forward events?

LIBRARY_IE: percentage of reads affected by Incomplete Extension?

NUMBER_TF: Number of Test Fragment Reads (defined by key sequence) before filtering and trimming

NUMBER_LIB: Number of Library Reads (defined by key sequence TCAG) before filtering and trimming

KEYPASS_ALL_BEADS: Total number of reads with any key sequence before filtering and trimming

TOTAL_ADDRESSABLE_WELLS: Total number of addressable wells

WELLS_WITH_ISPS: Number of wells that were determined to be "positive" for the presence of an ISP within the well. "Positive" is determined by measuring the diffusion rate of a flow with a different pH. Wells containing ISPs have a delayed pH change due to the presence of an ISP slowing the detection of the pH change from the solution.

PCT_WELLS_WITH_ISPS: Percent of Addressable wells loaded with a bead (WELLS_WITH_ISPS/TOTAL_ADDRESSABLE_WELLS)

LIVE_ISPS: Number of wells that contained an ISP with a signal of sufficient strength and composition to be associated with the library or Test Fragment key. This value is the sum of the following categories: Test Fragment and Library.

PCT_LIVE_ISPS: Percent of wells with ISPs that have live ISPs (LIVE_ISPS/WELLS_WITH_ISPS)

TEST_FRAGMENT_ISPS: Number of Live ISPs with a key signal that was identical to the Test Fragment key signal.

PCT_TEST_FRAGMENT_ISPS: Percent of live ISPs that are test fragments

LIBRARY_ISPS: Number of Live ISPs that have a key signal identical to the library key signal. These reads are input into the Library filtering process.

PCT_LIBRARY_ISPS: Percent of live ISPs that are library

FLTRD_TOO_SHORT:

PCT_FLTRD_TOO_SHORT:

FLTRD_KEYPASS_FAILURE:

PCT_FLTRD_KEYPASS_FAILURE:

FLTRD_LOW_SIGNAL:

PCT_FLTRD_LOW_SIGNAL:

FLTRD_POOR_SIGNAL_PROFILE:

PCT_FLTRD_POOR_SIGNAL_PROFILE:

FLTRD_3_PRIME_ADAPTER_TRIM:

PCT_FLTRD_3_PRIME_ADAPTER_TRIM:

FLTRD_3_PRIME_QUAL_TRIM:

PCT_FLTRD_3_PRIME_QUAL_TRIM:

FINAL_LIBRARY_READS: Number of Library reads passing all filters, which are recorded in the SFF and FASTQ files.

PCT_FINAL_LIBRARY_READS: Percentage of library reads passing all filters, which are recorded in the SFF and FASTQ files.

CHIP_CHECK: A series of tests on reference wells (about 10% of the chip in non-addressable areas) is performed to ensure that the chip is functioning at a basic level. The value of this field is either Passed or Failed.

CHIP_TYPE: Chip type (314,316,318)

FLOW_ORDER: Nucleotide flow order

LIBRARY_KEY: A short known sequence of bases used to distinguish the library fragment from the Test Fragment. Example: "TCAG"

ANALYSIS_VERSION: Version of the Analysis Pipeline used to generate the analysis.

DBREPORTS_VERSION: Version of the ion-dbreports package.

TF1_NAME: Name of 1st TF

TF1_Q10_MEAN: Mean read length of all Q10 or greater Test Fragments (type 1)

TF1_Q17_MEAN: Mean read length of all Q17 or greater Test Fragments (type 1)

TF1_Q10_MODE: Mode of read lengths of all Q10 or greater Test Fragments (type 1)

TF1_Q17_MODE: Mode of read lengths of all Q17 or greater Test Fragments (type 1)

TF1_SYSTEM_SNR:

TF1_50Q10_READS: Number of Test Fragments (type 1) that at 50bp have a quality score of Q10 or greater

TF1_50Q17_READS: Number of Test Fragments (type 1) that at 50bp have a quality score of Q17 or greater

TF1_KEYPASS_READS: Total number of type 1 TF reads

TF1_CF: Percent of TF type 1 reads affected by Carry Forward

TF1_IE: Percent of TF type 1 reads affected by Incomplete Extension

TF1_DR:

TF1_KEY_PEAK_COUNTS:

TF2_NAME: Name of 2nd TF

TF2_Q10_MEAN: Mean read length of all Q10 or greater Test Fragments (type 2)

TF2_Q17_MEAN: Mean read length of all Q17 or greater Test Fragments (type 2)

TF2_Q10_MODE: Mode of read lengths of all Q10 or greater Test Fragments (type 2)

TF2_Q17_MODE: Mode of read lengths of all Q17 or greater Test Fragments (type 2)

TF2_SYSTEM_SNR:

TF2_50Q10_READS: Number of Test Fragments (type 2) that at 50bp have a quality score of Q10 or greater

TF2_50Q17_READS: Number of Test Fragments (type 2) that at 50bp have a quality score of Q17 or greater

TF2_KEYPASS_READS: Total number of type 2 TF reads

TF2_CF: Percent of TF type 2 reads affected by Carry Forward

TF2_IE: Percent of TF type 2 reads affected by Incomplete Extension

TF2_DR:

TF2_KEY_PEAK_COUNTS:

ION_RUN_ID:
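
The well and ISP columns above form a hierarchy: addressable wells, wells with ISPs, live ISPs, and then test fragment or library ISPs. A minimal sketch of how the PCT_ columns relate to the counts, using the denominators given in the definitions, is shown below. The function name and the counts are hypothetical, and the results are plain ratios rather than values scaled to 100.

```python
def isp_percentages(total_addressable_wells, wells_with_isps,
                    test_fragment_isps, library_isps):
    """Derive the ISP ratio columns from the count columns."""
    # LIVE_ISPS is described as the sum of Test Fragment and Library ISPs.
    live_isps = test_fragment_isps + library_isps
    return {
        "LIVE_ISPS": live_isps,
        "PCT_WELLS_WITH_ISPS": wells_with_isps / total_addressable_wells,
        "PCT_LIVE_ISPS": live_isps / wells_with_isps,
        "PCT_TEST_FRAGMENT_ISPS": test_fragment_isps / live_isps,
        "PCT_LIBRARY_ISPS": library_isps / live_isps,
    }

# Made-up counts for illustration.
isp_example = isp_percentages(1_000_000, 800_000, 8_000, 632_000)
print(isp_example)
```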

CollectIonTorrentBaseCallMetrics.IonRunMetric

Overview of an Ion Torrent basecalling run. Eventually this will go away and be replaced by a subclass of NEXT_GENERATION_RUN

Column Definitions

RUN_NAME: aka Experiment name in Ion-speak

ID: Arbitrary identifier for database joins.

CollectOxoGMetrics.CpcgMetrics

Metrics class for outputs.

Column Definitions

SAMPLE_ALIAS: The name of the sample being assayed.

LIBRARY: The name of the library being assayed.

CONTEXT: The sequence context being reported on.

TOTAL_SITES: The total number of sites that had at least one base covering them.

TOTAL_BASES: The total number of basecalls observed at all sites.

REF_NONOXO_BASES: The number of reference alleles observed as C in read 1 and G in read 2.

REF_OXO_BASES: The number of reference alleles observed as G in read 1 and C in read 2.

REF_TOTAL_BASES: The total number of reference alleles observed

ALT_NONOXO_BASES: The count of observed A basecalls at C reference positions and T basecalls at G reference bases that are correlated to instrument read number in a way that rules out oxidation as the cause

ALT_OXO_BASES: The count of observed A basecalls at C reference positions and T basecalls at G reference bases that are correlated to instrument read number in a way that is consistent with oxidative damage.

OXIDATION_ERROR_RATE: The oxo error rate, calculated as max(ALT_OXO_BASES - ALT_NONOXO_BASES, 1) / TOTAL_BASES

OXIDATION_Q: -10 * log10(OXIDATION_ERROR_RATE)

C_REF_REF_BASES: The number of ref basecalls observed at sites where the genome reference == C.

G_REF_REF_BASES: The number of ref basecalls observed at sites where the genome reference == G.

C_REF_ALT_BASES: The number of alt (A/T) basecalls observed at sites where the genome reference == C.

G_REF_ALT_BASES: The number of alt (A/T) basecalls observed at sites where the genome reference == G.

C_REF_OXO_ERROR_RATE: The rate at which C>A and G>T substitutions are observed at C reference sites above the expected rate if there were no bias between sites with a C reference base vs. a G reference base.

C_REF_OXO_Q: C_REF_OXO_ERROR_RATE expressed as a phred-scaled quality score.

G_REF_OXO_ERROR_RATE: The rate at which C>A and G>T substitutions are observed at G reference sites above the expected rate if there were no bias between sites with a C reference base vs. a G reference base.

G_REF_OXO_Q: G_REF_OXO_ERROR_RATE expressed as a phred-scaled quality score.
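
The OXIDATION_ERROR_RATE and OXIDATION_Q formulas above can be sketched directly. The function name and the example counts are hypothetical; the arithmetic follows the two formulas as written, including the floor of 1 in the numerator that keeps the rate positive so the logarithm is defined.

```python
import math

def oxidation_metrics(alt_oxo_bases, alt_nonoxo_bases, total_bases):
    """Return (OXIDATION_ERROR_RATE, OXIDATION_Q) from the base counts."""
    # max(..., 1) floors the numerator so the rate is never zero or negative.
    rate = max(alt_oxo_bases - alt_nonoxo_bases, 1) / total_bases
    q = -10 * math.log10(rate)
    return rate, q

# Made-up counts: 150 oxo-consistent alt bases, 50 non-oxo, one million bases total.
rate, q = oxidation_metrics(150, 50, 1_000_000)
print(rate, q)

# When non-oxo alt bases outnumber oxo alt bases, the floor of 1 applies.
floor_rate, floor_q = oxidation_metrics(10, 20, 1_000_000)
print(floor_rate, floor_q)
```

Because of the floor, the reported Q has an upper bound of -10 * log10(1 / TOTAL_BASES) for a given number of observed bases.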

CollectQualityYieldMetrics.QualityYieldMetrics

A set of metrics used to describe the general quality of a BAM file

Column Definitions

TOTAL_READS: The total number of reads in the input file

PF_READS: The number of reads that are PF - pass filter

READ_LENGTH: The average read length of all the reads (will be fixed for a lane)

TOTAL_BASES: The total number of bases in all reads

PF_BASES: The total number of bases in all PF reads

Q20_BASES: The number of bases in all reads that achieve quality score 20 or higher

PF_Q20_BASES: The number of bases in PF reads that achieve quality score 20 or higher

Q30_BASES: The number of bases in all reads that achieve quality score 30 or higher

PF_Q30_BASES: The number of bases in PF reads that achieve quality score 30 or higher

Q20_EQUIVALENT_YIELD: The sum of quality scores of all bases divided by 20

PF_Q20_EQUIVALENT_YIELD: The sum of quality scores of all bases in PF reads divided by 20
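
A minimal sketch of how these columns could be computed from per-read base qualities follows. It assumes that the Q30 columns use a threshold of 30 and that the PF yield is restricted to bases in PF reads, as the column names suggest. The function name, the input shape (a list of (is_pf, qualities) tuples), and the example reads are hypothetical, and plain division is used for the yield without claiming how the tool rounds.

```python
def quality_yield(reads):
    """reads: list of (is_pf, base_qualities) tuples. Returns a dict of column values."""
    all_quals = [q for _, quals in reads for q in quals]
    pf_quals = [q for is_pf, quals in reads if is_pf for q in quals]
    return {
        "TOTAL_READS": len(reads),
        "PF_READS": sum(1 for is_pf, _ in reads if is_pf),
        "TOTAL_BASES": len(all_quals),
        "PF_BASES": len(pf_quals),
        "Q20_BASES": sum(q >= 20 for q in all_quals),
        "PF_Q20_BASES": sum(q >= 20 for q in pf_quals),
        "Q30_BASES": sum(q >= 30 for q in all_quals),
        "PF_Q30_BASES": sum(q >= 30 for q in pf_quals),
        # Sum of quality scores expressed as an equivalent number of Q20 bases.
        "Q20_EQUIVALENT_YIELD": sum(all_quals) / 20,
        "PF_Q20_EQUIVALENT_YIELD": sum(pf_quals) / 20,
    }

# Two made-up reads: one PF, one non-PF.
yield_example = quality_yield([(True, [30, 35, 20, 10]), (False, [15, 25, 5, 40])])
print(yield_example)
```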

CollectVariantCallingMetrics.VariantCallingDetailMetrics

A collection of metrics relating to SNPs and indels within a variant-calling file (VCF) for a given sample.

Column Definitions

SAMPLE_ALIAS: The name of the sample being assayed

HET_HOMVAR_RATIO: (count of hets)/(count of homozygous non-ref) for this sample

CollectVariantCallingMetrics.VariantCallingSummaryMetrics

A collection of metrics relating to SNPs and indels within a variant-calling file (VCF).

Column Definitions

TOTAL_SNPS: The number of high confidence SNP calls (i.e. non-reference genotypes) that were examined

NUM_IN_DB_SNP: The number of high confidence SNPs found in dbSNP

NOVEL_SNPS: The number of high confidence SNPs called that were not found in dbSNP

FILTERED_SNPS: The number of SNPs that are also filtered

PCT_DBSNP: The percentage of high confidence SNPs in dbSNP

DBSNP_TITV: The Transition/Transversion ratio of the SNP calls made at dbSNP sites

NOVEL_TITV: The Transition/Transversion ratio of the SNP calls made at non-dbSNP sites

TOTAL_INDELS: The number of high confidence Indel calls that were examined

NOVEL_INDELS: The number of high confidence Indels called that were not found in dbSNP

FILTERED_INDELS: The number of indels that are also filtered

PCT_DBSNP_INDELS: The percentage of high confidence Indels in dbSNP

NUM_IN_DB_SNP_INDELS: The number of high confidence Indels found in dbSNP

DBSNP_INS_DEL_RATIO: The Insertion/Deletion ratio of the Indel calls made at dbSNP sites

NOVEL_INS_DEL_RATIO: The Insertion/Deletion ratio of the Indel calls made at non-dbSNP sites

TOTAL_MULTIALLELIC_SNPS: The number of high confidence multiallelic SNP calls that were examined

NUM_IN_DB_SNP_MULTIALLELIC: The number of high confidence multiallelic SNPs found in dbSNP

TOTAL_COMPLEX_INDELS: The number of high confidence complex Indel calls that were examined

NUM_IN_DB_SNP_COMPLEX_INDELS: The number of high confidence complex Indels found in dbSNP

SNP_REFERENCE_BIAS: The rate at which reference bases are observed at ref/alt heterozygous SNP sites.

NUM_SINGLETONS: For summary metrics, the number of variants that appear in only one sample. For detail metrics, the number of variants that appear only in the current sample.
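
To make the dbSNP and Ti/Tv columns concrete, the sketch below classifies SNP calls as transitions (A<->G, C<->T) or transversions and splits them by dbSNP membership. The input shape (tuples of ref, alt, in_dbsnp), the function names, and the example calls are hypothetical; PCT_DBSNP is returned as a plain ratio.

```python
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def titv(calls):
    """Transition/transversion ratio for a list of (ref, alt, in_dbsnp) SNP calls."""
    ti = sum(1 for ref, alt, _ in calls if (ref, alt) in TRANSITIONS)
    tv = len(calls) - ti
    return ti / tv if tv else float("nan")   # undefined without transversions

def snp_summary(calls):
    known = [c for c in calls if c[2]]
    novel = [c for c in calls if not c[2]]
    return {
        "TOTAL_SNPS": len(calls),
        "NUM_IN_DB_SNP": len(known),
        "NOVEL_SNPS": len(novel),
        "PCT_DBSNP": len(known) / len(calls),
        "DBSNP_TITV": titv(known),
        "NOVEL_TITV": titv(novel),
    }

# Made-up calls for illustration.
calls = [("A", "G", True), ("C", "T", True), ("A", "C", True),
         ("G", "A", False), ("G", "T", False), ("C", "G", False)]
snp_example = snp_summary(calls)
print(snp_example)
```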

CollectWgsMetrics.WgsMetrics

Metrics for evaluating the performance of whole genome sequencing experiments.

Column Definitions

GENOME_TERRITORY: The number of non-N bases in the genome reference over which coverage will be evaluated.

MEAN_COVERAGE: The mean coverage in bases of the genome territory, after all filters are applied.

SD_COVERAGE: The standard deviation of coverage of the genome after all filters are applied.

MEDIAN_COVERAGE: The median coverage in bases of the genome territory, after all filters are applied.

MAD_COVERAGE: The median absolute deviation of coverage of the genome after all filters are applied.

PCT_EXC_MAPQ: The fraction of aligned bases that were filtered out because they were in reads with low mapping quality (default is < 20).

PCT_EXC_DUPE: The fraction of aligned bases that were filtered out because they were in reads marked as duplicates.

PCT_EXC_UNPAIRED: The fraction of aligned bases that were filtered out because they were in reads without a mapped mate pair.

PCT_EXC_BASEQ: The fraction of aligned bases that were filtered out because they were of low base quality (default is < 20).

PCT_EXC_OVERLAP: The fraction of aligned bases that were filtered out because they were the second observation from an insert with overlapping reads.

PCT_EXC_CAPPED: The fraction of aligned bases that were filtered out because they would have raised coverage above the capped value (default cap = 250x).

PCT_EXC_TOTAL: The total fraction of aligned bases excluded due to all filters.

PCT_5X: The fraction of bases that attained at least 5X sequence coverage in post-filtering bases.

PCT_10X: The fraction of bases that attained at least 10X sequence coverage in post-filtering bases.

PCT_20X: The fraction of bases that attained at least 20X sequence coverage in post-filtering bases.

PCT_30X: The fraction of bases that attained at least 30X sequence coverage in post-filtering bases.

PCT_40X: The fraction of bases that attained at least 40X sequence coverage in post-filtering bases.

PCT_50X: The fraction of bases that attained at least 50X sequence coverage in post-filtering bases.

PCT_60X: The fraction of bases that attained at least 60X sequence coverage in post-filtering bases.

PCT_70X: The fraction of bases that attained at least 70X sequence coverage in post-filtering bases.

PCT_80X: The fraction of bases that attained at least 80X sequence coverage in post-filtering bases.

PCT_90X: The fraction of bases that attained at least 90X sequence coverage in post-filtering bases.

PCT_100X: The fraction of bases that attained at least 100X sequence coverage in post-filtering bases.
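
The coverage columns above can all be derived from a histogram of post-filter depth over the genome territory. The following is a sketch under that assumption: the histogram input, the function name, the lower-median convention for ties, and the example numbers are all hypothetical choices for illustration, not a description of how the tool computes them.

```python
def wgs_summary(depth_histogram, thresholds=(5, 10, 20)):
    """depth_histogram maps coverage depth -> number of genome positions at that depth."""
    territory = sum(depth_histogram.values())
    mean = sum(d * n for d, n in depth_histogram.items()) / territory

    def weighted_median(hist):
        # Lower-median convention: first value at which half the positions are covered.
        half, seen = sum(hist.values()) / 2, 0
        for value in sorted(hist):
            seen += hist[value]
            if seen >= half:
                return value

    median = weighted_median(depth_histogram)
    # MAD: median of |depth - median|, weighted by the same position counts.
    deviations = {}
    for d, n in depth_histogram.items():
        key = abs(d - median)
        deviations[key] = deviations.get(key, 0) + n
    mad = weighted_median(deviations)
    pct = {f"PCT_{t}X": sum(n for d, n in depth_histogram.items() if d >= t) / territory
           for t in thresholds}
    return {"GENOME_TERRITORY": territory, "MEAN_COVERAGE": mean,
            "MEDIAN_COVERAGE": median, "MAD_COVERAGE": mad, **pct}

# Made-up histogram over a 100-base territory.
wgs_example = wgs_summary({0: 10, 5: 20, 10: 40, 20: 20, 30: 10})
print(wgs_example)
```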

ContaminationMetrics.FauxMetric

A metric-like container representing a contamination metrics file entry, though the contamination file is not a metrics file.

Column Definitions

CoverageMetric

Column Definitions

COVERAGE_PER_READ_BASE:

DELETIONS_PER_READ_BASE:

INSERTIONS_PER_READ_BASE:

CLIPS_PER_READ_BASE:

MEAN_COVERAGE:

MEAN_QUALITY:

MEAN_MISMATCHES_PER_COVERAGE:

MEAN_DELETES_PER_COVERAGE:

MEAN_INSERTS_PER_COVERAGE:

MEAN_CLIPS_PER_COVERAGE:

MEAN_READ_STARTS:

VAR_READ_STARTS:

DbSnpMatchMetrics

Metrics about how genotypes called by the pipeline match up to dbSNP, created by the CollectDbSnpMatches program and usually stored in a file with the extension ".dbsnp_matches".

Column Definitions

TOTAL_SNPS: The number of high confidence SNP calls (i.e. non-reference genotypes) that were examined.

NOVEL_SNPS: The number of high confidence SNPs called that were not found in dbSNP

PCT_DBSNP: The percentage of high confidence SNPs in dbSNP

NUM_IN_DB_SNP: The number of high confidence SNPs found in dbSNP

DBSNP_TITV: The Transition/Transversion ratio of the SNP calls made at dbSNP sites.

NOVEL_TITV: The Transition/Transversion ratio of the SNP calls made at non-dbSNP sites.

DuplicationMetrics

Metrics that are calculated during the process of marking duplicates within a stream of SAMRecords.

Column Definitions

LIBRARY: The library on which the duplicate marking was performed.

UNPAIRED_READS_EXAMINED: The number of mapped reads examined which did not have a mapped mate pair, either because the read is unpaired, or the read is paired to an unmapped mate.

READ_PAIRS_EXAMINED: The number of mapped read pairs examined.

UNMAPPED_READS: The total number of unmapped reads examined.

UNPAIRED_READ_DUPLICATES: The number of fragments that were marked as duplicates.

READ_PAIR_DUPLICATES: The number of read pairs that were marked as duplicates.

READ_PAIR_OPTICAL_DUPLICATES: The number of read pair duplicates that were caused by optical duplication. Value is always < READ_PAIR_DUPLICATES, which counts all duplicates regardless of source.

PERCENT_DUPLICATION: The percentage of mapped sequence that is marked as duplicate.

ESTIMATED_LIBRARY_SIZE: The estimated number of unique molecules in the library based on PE duplication.

ExtractIlluminaBarcodes.BarcodeMetric

Metrics produced by the ExtractIlluminaBarcodes program that is used to parse data in the basecalls directory and determine to which barcode each read should be assigned.

Column Definitions

BARCODE: The barcode (from the set of expected barcodes) for which the following metrics apply. Note that the "symbolic" barcode of NNNNNN is used to report metrics for all reads that do not match a barcode.

BARCODE_NAME:

LIBRARY_NAME:

READS: The total number of reads matching the barcode.

PF_READS: The number of PF reads matching this barcode (always less than or equal to READS).

PERFECT_MATCHES: The number of all reads matching this barcode that matched with 0 errors or no-calls.

PF_PERFECT_MATCHES: The number of PF reads matching this barcode that matched with 0 errors or no-calls.

ONE_MISMATCH_MATCHES: The number of all reads matching this barcode that matched with 1 error or no-call.

PF_ONE_MISMATCH_MATCHES: The number of PF reads matching this barcode that matched with 1 error or no-call.

PCT_MATCHES: The percentage of all reads in the lane that matched to this barcode.

RATIO_THIS_BARCODE_TO_BEST_BARCODE_PCT: The ratio of all reads matching this barcode to all reads matching the most prevalent barcode. For the most prevalent barcode this will be 1; for all others it will be less than 1 (except when there are more orphan reads than reads for any other barcode, in which case the value may be arbitrarily large). One over the lowest number in this column gives you the fold-difference in representation between barcodes.

PF_PCT_MATCHES: The percentage of PF reads in the lane that matched to this barcode.

PF_RATIO_THIS_BARCODE_TO_BEST_BARCODE_PCT: The ratio of PF reads matching this barcode to PF reads matching the most prevalent barcode. For the most prevalent barcode this will be 1; for all others it will be less than 1 (except when there are more orphan reads than reads for any other barcode, in which case the value may be arbitrarily large). One over the lowest number in this column gives you the fold-difference in representation of PF reads between barcodes.

PF_NORMALIZED_MATCHES: The "normalized" matches to each barcode. This is calculated as the number of PF reads matching this barcode over the mean number of PF reads matching any barcode (excluding orphans). If all barcodes are represented equally this will be 1.
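
The two relative-representation columns can be sketched as follows. The normalization divides by the mean PF count per barcode, which is an assumption chosen because it is the reading under which equally represented barcodes all score 1. The function name, barcode sequences, and counts are hypothetical.

```python
def barcode_ratios(pf_reads_by_barcode):
    """pf_reads_by_barcode: PF read counts for expected barcodes only (orphans excluded)."""
    counts = pf_reads_by_barcode.values()
    best = max(counts)
    mean = sum(counts) / len(pf_reads_by_barcode)
    return {
        barcode: {
            "PF_RATIO_THIS_BARCODE_TO_BEST_BARCODE_PCT": n / best,
            "PF_NORMALIZED_MATCHES": n / mean,
        }
        for barcode, n in pf_reads_by_barcode.items()
    }

# Made-up PF read counts for three barcodes.
barcode_example = barcode_ratios({"ACGTAC": 400, "TTGACA": 200, "GGCATC": 300})
lowest = min(v["PF_RATIO_THIS_BARCODE_TO_BEST_BARCODE_PCT"] for v in barcode_example.values())
fold_difference = 1 / lowest   # fold-difference in representation between barcodes
print(barcode_example, fold_difference)
```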

FingerprintDetailMetrics

A collection of metrics about how the reads in a BAM file matched up with the known genotypes for a particular fingerprint panel.

Column Definitions

SNP: The name of the SNP

FINGERPRINT_GENOTYPE: The genotype in the fingerprint file

SEQUENCED_GENOTYPE: The genotype sequenced in the BAM file

LOD: The best-to-second-best LOD for the called genotype

READS: The number of reads covering the locus in the BAM file

FingerprintSummaryMetrics

A collection of metrics that summarize the match of reads in a particular BAM file against various fingerprint panels.

Column Definitions

PANEL_NAME: The name of the fingerprint panel

PANEL_SNPS: The number of SNPs contained in the panel

CONFIDENT_CALLS: The number of panel SNPs that can be confidently called in the BAM file

CONFIDENT_MATCHING_SNPS: The number of confidently matching SNPs in the BAM file

CONFIDENT_CALLED_PCT: The number of confidently called SNPs as a percentage of the total number of SNPs on the panel

CONFIDENT_MATCHING_SNPS_PCT: The number of confidently called matching SNPs as a percentage of the total number of confidently called SNPs on the panel

FingerprintingDetailMetrics

Detailed metrics about an individual SNP/Haplotype comparison within a fingerprint comparison.

Column Definitions

READ_GROUP: The sequencing read group from which sequence data was fingerprinted.

SAMPLE: The name of the sample whose genotypes the sequence data was compared to.

SNP: The name of a representative SNP within the haplotype that was compared. Will usually be the exact SNP that was genotyped externally.

SNP_ALLELES: The possible alleles for the SNP.

CHROM: The chromosome on which the SNP resides.

POSITION: The position of the SNP on the chromosome.

EXPECTED_GENOTYPE: The expected genotype of the sample at the SNP locus.

OBSERVED_GENOTYPE: The most likely genotype given the observed evidence at the SNP locus in the sequencing data.

LOD: The LOD score for OBSERVED_GENOTYPE vs. the next most likely genotype in the sequencing data.

OBS_A: The number of observations of the first, or A, allele of the SNP in the sequencing data.

OBS_B: The number of observations of the second, or B, allele of the SNP in the sequencing data.

FingerprintingSummaryMetrics

Summary fingerprinting metrics and statistics about the comparison of the sequence data from a single read group (lane or index within a lane) vs. a set of known genotypes for the expected sample.

Column Definitions

READ_GROUP: The read group from which sequence data was drawn for comparison.

SAMPLE: The sample whose known genotypes the sequence data was compared to.

LL_EXPECTED_SAMPLE: The Log Likelihood of the sequence data given the expected sample's genotypes.

LL_RANDOM_SAMPLE: The Log Likelihood of the sequence data given a random sample from the human population.

LOD_EXPECTED_SAMPLE: The LOD for Expected Sample vs. Random Sample. A positive LOD indicates that the sequence data is more likely to come from the expected sample than from a random sample from the population, by LOD logs. For example, a value of 6 indicates that the sequence data is 1,000,000 times more likely to come from the expected sample than from a random sample. A negative LOD indicates the reverse: the sequence data is more likely to come from a random sample than from the expected sample.

HAPLOTYPES_WITH_GENOTYPES: The number of haplotypes that had expected genotypes to compare to.

HAPLOTYPES_CONFIDENTLY_CHECKED: The subset of genotyped haplotypes for which there was sufficient sequence data to confidently genotype the haplotype. Note: all haplotypes with sequence coverage contribute to the LOD score, even if they cannot be "confidently checked" individually.

HAPLOTYPES_CONFIDENTLY_MATCHING: The subset of confidently checked haplotypes that match the expected genotypes.
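
As a rough illustration of how the likelihood columns above relate, the sketch below assumes that the log likelihoods are base-10 (consistent with a LOD of 6 meaning a factor of 1,000,000) and that LOD_EXPECTED_SAMPLE is simply LL_EXPECTED_SAMPLE minus LL_RANDOM_SAMPLE. The function names are hypothetical and are not part of Picard.

```python
def lod_expected_sample(ll_expected: float, ll_random: float) -> float:
    """LOD of expected vs. random sample, assuming base-10 log likelihoods."""
    return ll_expected - ll_random


def likelihood_ratio(lod: float) -> float:
    """Convert a LOD score into a plain likelihood ratio (10 ** LOD)."""
    return 10.0 ** lod


# Made-up log likelihoods, for illustration only.
lod = lod_expected_sample(-120.0, -126.0)
print(lod)                    # 6.0
print(likelihood_ratio(lod))  # 1000000.0
```

A negative result from lod_expected_sample gives a ratio below 1, i.e. the random sample explains the data better than the expected sample.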

GcBiasDetailMetrics

Class that holds detailed metrics about reads that fall within windows of a certain GC bin on the reference genome.

Column Definitions

GC: The G+C content of the reference sequence represented by this bin. Values are from 0% to 100%

WINDOWS: The number of windows on the reference genome that have this G+C content.

READ_STARTS: The number of reads whose start position is at the start of a window of this GC.

MEAN_BASE_QUALITY: The mean quality (determined via the error rate) of all bases of all reads that are assigned to windows of this GC.

NORMALIZED_COVERAGE: The ratio of "coverage" in this GC bin vs. the mean coverage of all GC bins. A number of 1 represents mean coverage, a number less than one represents lower than mean coverage (e.g. 0.5 means half as much coverage as average) while a number greater than one represents higher than mean coverage (e.g. 3.1 means this GC bin has 3.1 times more reads per window than average).

ERROR_BAR_WIDTH: The radius of error bars in this bin based on the number of observations made. For example if the normalized coverage is 0.75 and the error bar width is 0.1 then the error bars would be drawn from 0.65 to 0.85.
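
The relationship between READ_STARTS, WINDOWS and NORMALIZED_COVERAGE can be sketched as follows. This is an illustration only: it assumes that "mean coverage" is total read starts divided by total windows, which the text above does not spell out, and the function names are hypothetical.

```python
def normalized_coverage(read_starts, windows):
    """Reads per window in each bin divided by reads per window overall.

    Assumption: the overall mean is total read starts / total windows.
    Bins with no windows are reported as 0.0.
    """
    mean = sum(read_starts) / sum(windows)
    return [
        (r / w) / mean if w > 0 else 0.0
        for r, w in zip(read_starts, windows)
    ]


def error_bar_bounds(norm_cov, error_bar_width):
    """ERROR_BAR_WIDTH is a radius, so the bar spans value +/- width."""
    return norm_cov - error_bar_width, norm_cov + error_bar_width


# Three toy GC bins with 100 windows each.
print(normalized_coverage([50, 200, 350], [100, 100, 100]))  # [0.25, 1.0, 1.75]
low, high = error_bar_bounds(0.75, 0.1)  # roughly 0.65 and 0.85
```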

GcBiasSummaryMetrics

High level metrics that capture how biased the coverage in a certain lane is.

Column Definitions

WINDOW_SIZE: The window size on the genome used to calculate the GC of the sequence.

TOTAL_CLUSTERS: The total number of clusters that were seen in the gc bias calculation.

ALIGNED_READS: The total number of aligned reads used to compute the gc bias metrics.

AT_DROPOUT: Illumina-style AT dropout metric. Calculated by taking each GC bin independently and calculating (%ref_at_gc - %reads_at_gc) and summing all positive values for GC=[0..50].

GC_DROPOUT: Illumina-style GC dropout metric. Calculated by taking each GC bin independently and calculating (%ref_at_gc - %reads_at_gc) and summing all positive values for GC=[50..100].
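
A minimal sketch of the AT_DROPOUT and GC_DROPOUT calculation described above, assuming one entry per integer GC percentage (index 0 to 100) holding the window count and the read count for that bin. The function name and inputs are hypothetical. Bin 50 contributes to both sums because both ranges above include it.

```python
def at_gc_dropout(windows, reads):
    """Sum the positive (%ref_at_gc - %reads_at_gc) values per GC range.

    windows[i] and reads[i] are the counts for GC bin i (0..100).
    Returns (at_dropout, gc_dropout) in percentage points.
    """
    total_windows = sum(windows)
    total_reads = sum(reads)
    at_dropout = 0.0
    gc_dropout = 0.0
    for gc, (w, r) in enumerate(zip(windows, reads)):
        diff = 100.0 * w / total_windows - 100.0 * r / total_reads
        if diff > 0:
            if gc <= 50:
                at_dropout += diff
            if gc >= 50:
                gc_dropout += diff
    return at_dropout, gc_dropout


# Toy data: only bins 30, 50 and 70 are populated.
windows = [0] * 101
reads = [0] * 101
windows[30], windows[50], windows[70] = 40, 20, 40
reads[30], reads[50], reads[70] = 30, 20, 50
print(at_gc_dropout(windows, reads))  # (10.0, 0.0)
```

In the toy data the AT-rich bin holds 40% of the reference windows but only 30% of the reads, so the AT dropout is 10 percentage points, while the GC-rich bin is over-covered and contributes nothing.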

GenotypeConcordanceMetrics

Statistics about how well a given set of input genotypes matches a set of well known reference genotypes.

Column Definitions

CATEGORY: One of the following:

HOMOZYGOUS_REFERENCE: the metrics represent concordance at sites where the reference genotypes are homozygous reference
HETEROZYGOUS: the metrics represent concordance at sites where the reference genotypes are heterozygous
HOMOZYGOUS_NON_REFERENCE: the metrics represent concordance at sites where the reference genotypes are homozygous non-reference.

OBSERVATIONS: Number of genotypes in this category at the same locus in each of the input and reference

AGREE: Of the number of observations, how many agreed

DISAGREE: Of the number of observations, how many disagreed

PCT_CONCORDANCE: Ratio of agreed to observations

GenotypeFreeContaminationMetric

Column Definitions

STRATIFICATION: A name describing the subset of reads/bases from the bam that were included in this model

CONTAMINATION_ESTIMATE: The estimated contamination (with the greatest likelihood)

LL_CONTAMINATION_ESTIMATE: The likelihood of the putative contamination

CONFIDENCE_INTERVAL_CONTAMINATION_LOWER: The lower bound of the contamination confidence interval

CONFIDENCE_INTERVAL_CONTAMINATION_UPPER: The upper bound of the contamination confidence interval

LL_NULL_CONTAMINATION: The likelihood of the null-hypothesis level of contamination

TOTAL_BASE_DEPTH: The total number of bases inspected in this model

AVERAGE_BASE_DEPTH: The average read/base depth of each inspected locus

STDDEV_BASE_DEPTH: The standard deviation of the read/base depth of each inspected locus

GenotypeFreeLikelihoodPlotMetric

Column Definitions

STRATIFIER:

PUTATIVE_CONTAMINATION:

LIKELIHOOD:

HsMetrics

The set of metrics captured that are specific to a hybrid selection analysis.

Column Definitions

BAIT_SET: The name of the bait set used in the hybrid selection.

GENOME_SIZE: The number of bases in the reference genome used for alignment.

BAIT_TERRITORY: The number of bases which have one or more baits on top of them.

TARGET_TERRITORY: The unique number of target bases in the experiment where target is usually exons etc.

BAIT_DESIGN_EFFICIENCY: Target territory / bait territory. 1 == perfectly efficient, 0.5 == half of baited bases are not target.

TOTAL_READS: The total number of reads in the SAM or BAM file examined.

PF_READS: The number of reads that pass the vendor's filter.

PF_UNIQUE_READS: The number of PF reads that are not marked as duplicates.

PCT_PF_READS: PF reads / total reads. The percent of reads passing filter.

PCT_PF_UQ_READS: PF Unique Reads / Total Reads.

PF_UQ_READS_ALIGNED: The number of PF unique reads that are aligned with mapping score > 0 to the reference genome.

PCT_PF_UQ_READS_ALIGNED: PF Reads Aligned / PF Reads.

PF_UQ_BASES_ALIGNED: The number of bases in the PF aligned reads that are mapped to a reference base. Accounts for clipping and gaps.

ON_BAIT_BASES: The number of PF aligned bases that mapped to a baited region of the genome.

NEAR_BAIT_BASES: The number of PF aligned bases that mapped to within a fixed interval of a baited region, but not on a baited region.

OFF_BAIT_BASES: The number of PF aligned bases that mapped neither on nor near a bait.

ON_TARGET_BASES: The number of PF aligned bases that mapped to a targeted region of the genome.

PCT_SELECTED_BASES: On+Near Bait Bases / PF Bases Aligned.

PCT_OFF_BAIT: The percentage of aligned PF bases that mapped neither on nor near a bait.

ON_BAIT_VS_SELECTED: The percentage of on+near bait bases that are on as opposed to near.

MEAN_BAIT_COVERAGE: The mean coverage of all baits in the experiment.

MEAN_TARGET_COVERAGE: The mean coverage of targets that received at least coverage depth = 2 at one base.

PCT_USABLE_BASES_ON_BAIT: The number of aligned, de-duped, on-bait bases out of the PF bases available.

PCT_USABLE_BASES_ON_TARGET: The number of aligned, de-duped, on-target bases out of the PF bases available.

FOLD_ENRICHMENT: The fold by which the baited region has been amplified above genomic background.

ZERO_CVG_TARGETS_PCT: The number of targets that did not reach coverage=2 over any base.

FOLD_80_BASE_PENALTY: The fold over-coverage necessary to raise 80% of bases in "non-zero-cvg" targets to the mean coverage level in those targets.

PCT_TARGET_BASES_2X: The percentage of ALL target bases achieving 2X or greater coverage.

PCT_TARGET_BASES_10X: The percentage of ALL target bases achieving 10X or greater coverage.

PCT_TARGET_BASES_20X: The percentage of ALL target bases achieving 20X or greater coverage.

PCT_TARGET_BASES_30X: The percentage of ALL target bases achieving 30X or greater coverage.

PCT_TARGET_BASES_40X: The percentage of ALL target bases achieving 40X or greater coverage.

PCT_TARGET_BASES_50X: The percentage of ALL target bases achieving 50X or greater coverage.

PCT_TARGET_BASES_100X: The percentage of ALL target bases achieving 100X or greater coverage.

HS_LIBRARY_SIZE: The estimated number of unique molecules in the selected part of the library.

HS_PENALTY_10X: The "hybrid selection penalty" incurred to get 80% of target bases to 10X. This metric should be interpreted as: if I have a design with 10 megabases of target, and want to get 10X coverage I need to sequence until PF_ALIGNED_BASES = 10^7 * 10 * HS_PENALTY_10X.

HS_PENALTY_20X: The "hybrid selection penalty" incurred to get 80% of target bases to 20X. This metric should be interpreted as: if I have a design with 10 megabases of target, and want to get 20X coverage I need to sequence until PF_ALIGNED_BASES = 10^7 * 20 * HS_PENALTY_20X.

HS_PENALTY_30X: The "hybrid selection penalty" incurred to get 80% of target bases to 30X. This metric should be interpreted as: if I have a design with 10 megabases of target, and want to get 30X coverage I need to sequence until PF_ALIGNED_BASES = 10^7 * 30 * HS_PENALTY_30X.

HS_PENALTY_40X: The "hybrid selection penalty" incurred to get 80% of target bases to 40X. This metric should be interpreted as: if I have a design with 10 megabases of target, and want to get 40X coverage I need to sequence until PF_ALIGNED_BASES = 10^7 * 40 * HS_PENALTY_40X.

HS_PENALTY_50X: The "hybrid selection penalty" incurred to get 80% of target bases to 50X. This metric should be interpreted as: if I have a design with 10 megabases of target, and want to get 50X coverage I need to sequence until PF_ALIGNED_BASES = 10^7 * 50 * HS_PENALTY_50X.

HS_PENALTY_100X: The "hybrid selection penalty" incurred to get 80% of target bases to 100X. This metric should be interpreted as: if I have a design with 10 megabases of target, and want to get 100X coverage I need to sequence until PF_ALIGNED_BASES = 10^7 * 100 * HS_PENALTY_100X.

AT_DROPOUT: A measure of how undercovered <= 50% GC regions are relative to the mean. For each GC bin [0..50] we calculate a = % of target territory, and b = % of aligned reads aligned to these targets. AT DROPOUT is then abs(sum(a-b when a-b < 0)). E.g. if the value is 5% this implies that 5% of total reads that should have mapped to GC<=50% regions mapped elsewhere.

GC_DROPOUT: A measure of how undercovered >= 50% GC regions are relative to the mean. For each GC bin [50..100] we calculate a = % of target territory, and b = % of aligned reads aligned to these targets. GC DROPOUT is then abs(sum(a-b when a-b < 0)). E.g. if the value is 5% this implies that 5% of total reads that should have mapped to GC>=50% regions mapped elsewhere.
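
The arithmetic behind BAIT_DESIGN_EFFICIENCY and the HS_PENALTY columns above can be worked through with a short sketch. The function names and the example penalty value are hypothetical; only the formulas come from the definitions above.

```python
def bait_design_efficiency(target_territory, bait_territory):
    """Target territory / bait territory (1.0 means every baited base is target)."""
    return target_territory / bait_territory


def aligned_bases_needed(target_territory, desired_coverage, hs_penalty):
    """Aligned PF bases to sequence so that 80% of target bases reach the
    desired coverage, using the HS_PENALTY value for that coverage level."""
    return target_territory * desired_coverage * hs_penalty


print(bait_design_efficiency(5_000_000, 10_000_000))  # 0.5

# 10 megabases of target, 10X desired, made-up HS_PENALTY_10X of 1.5.
print(aligned_bases_needed(10_000_000, 10, 1.5))      # 150000000.0
```

With no penalty (a value of 1) the second call would return exactly target size times coverage, so the penalty can be read as the multiplier over that ideal.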

IlluminaBasecallingMetrics

Metric for Illumina Basecalling that stores means and standard deviations on a per-barcode per-lane basis. Averages and means are taken over all tiles.

Column Definitions

LANE: The lane for which the metrics were calculated.

MOLECULAR_BARCODE_SEQUENCE_1: The barcode sequence for which the metrics were calculated.

MOLECULAR_BARCODE_NAME: The barcode name for which the metrics were calculated.

TOTAL_BASES: The total number of bases assigned to the index.

PF_BASES: The total number of passing-filter bases assigned to the index.

TOTAL_READS: The total number of reads assigned to the index.

PF_READS: The total number of passing-filter reads assigned to the index.

TOTAL_CLUSTERS: The total number of clusters assigned to the index.

PF_CLUSTERS: The total number of PF clusters assigned to the index.

MEAN_CLUSTERS_PER_TILE: The mean number of clusters per tile.

SD_CLUSTERS_PER_TILE: The standard deviation of clusters per tile.

MEAN_PCT_PF_CLUSTERS_PER_TILE: The mean percentage of pf clusters per tile.

SD_PCT_PF_CLUSTERS_PER_TILE: The standard deviation in percentage of pf clusters per tile.

MEAN_PF_CLUSTERS_PER_TILE: The mean number of pf clusters per tile.

SD_PF_CLUSTERS_PER_TILE: The standard deviation in number of pf clusters per tile.

IlluminaLaneMetrics

Embodies characteristics that describe a lane.

Column Definitions

CLUSTER_DENSITY: The number of clusters per unit area on this lane expressed in units of [cluster / mm^2].

LANE: This lane's number.

IlluminaPhasingMetrics

Metrics for Illumina Basecalling that stores median phasing and prephasing percentages on a per-template-read, per-lane basis. For each lane/template read # (i.e. FIRST, SECOND) combination we will store the median values of both the phasing and prephasing values for every tile in that lane/template read pair.

Column Definitions

InsertSizeMetrics

Metrics about the insert size distribution of a paired-end library, created by the CollectInsertSizeMetrics program and usually written to a file with the extension ".insert_size_metrics". In addition the insert size distribution is plotted to a file with the extension ".insert_size_Histogram.pdf".

Column Definitions

MEDIAN_INSERT_SIZE: The MEDIAN insert size of all paired end reads where both ends mapped to the same chromosome.

MEDIAN_ABSOLUTE_DEVIATION: The median absolute deviation of the distribution. If the distribution is essentially normal then the standard deviation can be estimated as ~1.4826 * MAD.

MIN_INSERT_SIZE: The minimum measured insert size. This is usually 1 and not very useful as it is likely artifactual.

MAX_INSERT_SIZE: The maximum measured insert size by alignment. This is usually very high, representing either an artifact or possibly the presence of a structural re-arrangement.

MEAN_INSERT_SIZE: The mean insert size of the "core" of the distribution. Artefactual outliers in the distribution often cause calculation of nonsensical mean and stdev values. To avoid this the distribution is first trimmed to a "core" distribution of +/- N median absolute deviations around the median insert size. By default N=10, but this is configurable.

STANDARD_DEVIATION: Standard deviation of insert sizes over the "core" of the distribution.

READ_PAIRS: The total number of read pairs that were examined in the entire distribution.

PAIR_ORIENTATION: The pair orientation of the reads in this data category.

WIDTH_OF_10_PERCENT: The "width" of the bins, centered around the median, that encompass 10% of all read pairs.

WIDTH_OF_20_PERCENT: The "width" of the bins, centered around the median, that encompass 20% of all read pairs.

WIDTH_OF_30_PERCENT: The "width" of the bins, centered around the median, that encompass 30% of all read pairs.

WIDTH_OF_40_PERCENT: The "width" of the bins, centered around the median, that encompass 40% of all read pairs.

WIDTH_OF_50_PERCENT: The "width" of the bins, centered around the median, that encompass 50% of all read pairs.

WIDTH_OF_60_PERCENT: The "width" of the bins, centered around the median, that encompass 60% of all read pairs.

WIDTH_OF_70_PERCENT: The "width" of the bins, centered around the median, that encompass 70% of all read pairs. This metric divided by 2 should approximate the standard deviation when the insert size distribution is a normal distribution.

WIDTH_OF_80_PERCENT: The "width" of the bins, centered around the median, that encompass 80% of all read pairs.

WIDTH_OF_90_PERCENT: The "width" of the bins, centered around the median, that encompass 90% of all read pairs.

WIDTH_OF_99_PERCENT: The "width" of the bins, centered around the median, that encompass 99% of all read pairs.
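
The median absolute deviation and the trimmed "core" statistics described above can be sketched with the Python standard library. This is an illustration under stated assumptions rather than the program's actual implementation: the function names are hypothetical, the trim keeps values within N median absolute deviations of the median (inclusive), and the sample standard deviation is used.

```python
import statistics


def median_absolute_deviation(sizes):
    """Median of the absolute differences from the median."""
    med = statistics.median(sizes)
    return statistics.median([abs(s - med) for s in sizes])


def core_mean_and_stdev(sizes, n_deviations=10):
    """Mean and standard deviation after trimming to +/- N MADs of the median."""
    med = statistics.median(sizes)
    mad = median_absolute_deviation(sizes)
    core = [s for s in sizes if abs(s - med) <= n_deviations * mad]
    return statistics.mean(core), statistics.stdev(core)


# Five plausible insert sizes plus one artifactual outlier.
sizes = [190, 195, 200, 205, 210, 100000]
mad = median_absolute_deviation(sizes)
print(mad)                               # 7.5
print(1.4826 * mad)                      # rough SD estimate if near-normal
mean, stdev = core_mean_and_stdev(sizes)
print(mean)                              # 200 (the outlier is trimmed away)
```

Without the trim, the single outlier would pull the mean above 16,000, which is the kind of nonsensical value the "core" calculation is meant to avoid.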

InternalControlCycleMetrics

Metrics about observations of an internal control sequence in an individual cycle.

Column Definitions

INTERNAL_CONTROL: The name of the internal control sequence.

READ: The read (1 or 2) that the metrics are for.

CYCLE: The cycle number within the read that the metrics are for.

OBSERVATIONS: The number of reads observed that were matched to this internal control.

ERRORS: The number of mismatches (including no-calls) contained within the observed reads at this cycle.

ERROR_RATE: The error rate in this IC in this cycle, i.e. ERRORS/OBSERVATIONS

SUM_OF_ERROR_PROBS: The sum of the error probabilities observed - in an ideal system this should match ERRORS.

QUALITY_ESTIMATE: The ratio of SUM_OF_ERROR_PROBS to ERRORS. A number > 1 indicates that the control had fewer errors than would be predicted by the base quality scores, a number < 1 indicates more errors than expected.

REF_BASE: The reference base of the internal control at this position.

A: The number of 'A' basecalls at this cycle for this internal control.

C: The number of 'C' basecalls at this cycle for this internal control.

G: The number of 'G' basecalls at this cycle for this internal control.

T: The number of 'T' basecalls at this cycle for this internal control.
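
To make ERROR_RATE, SUM_OF_ERROR_PROBS and QUALITY_ESTIMATE concrete, the sketch below assumes base qualities are Phred-scaled, so that a quality Q corresponds to an error probability of 10^(-Q/10). The function names and inputs are hypothetical.

```python
def phred_to_error_prob(quality):
    """Error probability implied by a Phred-scaled base quality."""
    return 10.0 ** (-quality / 10.0)


def cycle_quality(observations, errors, base_qualities):
    """Return (ERROR_RATE, SUM_OF_ERROR_PROBS, QUALITY_ESTIMATE) for one cycle.

    base_qualities holds the quality of the base at this cycle in each
    observed read. QUALITY_ESTIMATE is None when there are no errors,
    since the ratio is undefined in that case.
    """
    sum_of_error_probs = sum(phred_to_error_prob(q) for q in base_qualities)
    error_rate = errors / observations if observations else 0.0
    quality_estimate = sum_of_error_probs / errors if errors else None
    return error_rate, sum_of_error_probs, quality_estimate


# 1000 reads at Q20 predict about 10 errors; only 5 were observed.
rate, expected_errors, estimate = cycle_quality(1000, 5, [20] * 1000)
print(rate)      # 0.005
print(estimate)  # close to 2: fewer errors than the qualities predict
```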

InternalControlSummaryMetrics

Summary metrics about internal controls within a lane.

Column Definitions

INTERNAL_CONTROL: The name of the internal control sequence.

MATCHES: The number of reads matching this internal control.

PCT_MATCHES: The percentage of all internal control reads matching this internal control.

MEAN_READ1_ERROR_RATE: The mean error rate over read 1.

MEAN_READ2_ERROR_RATE: The mean error rate over read 2.

JumpingLibraryMetrics

High level metrics about the presence of outward- and inward-facing pairs within a SAM file generated with a jumping library, produced by the CollectJumpingLibraryMetrics program and usually stored in a file with the extension ".jump_metrics".

Column Definitions

JUMP_PAIRS: The number of outward-facing pairs in the SAM file

JUMP_DUPLICATE_PAIRS: The number of outward-facing pairs that are duplicates

JUMP_DUPLICATE_PCT: The percentage of outward-facing pairs that are marked as duplicates

JUMP_LIBRARY_SIZE: The estimated library size for outward-facing pairs

JUMP_MEAN_INSERT_SIZE: The mean insert size for outward-facing pairs

JUMP_STDEV_INSERT_SIZE: The standard deviation on the insert size for outward-facing pairs

NONJUMP_PAIRS: The number of inward-facing pairs in the SAM file

NONJUMP_DUPLICATE_PAIRS: The number of inward-facing pairs that are duplicates

NONJUMP_DUPLICATE_PCT: The percentage of inward-facing pairs that are marked as duplicates

NONJUMP_LIBRARY_SIZE: The estimated library size for inward-facing pairs

NONJUMP_MEAN_INSERT_SIZE: The mean insert size for inward-facing pairs

NONJUMP_STDEV_INSERT_SIZE: The standard deviation on the insert size for inward-facing pairs

CHIMERIC_PAIRS: The number of pairs where either (a) the ends fall on different chromosomes or (b) the insert size is greater than the maximum of 100000 or 2 times the mode of the insert size for outward-facing pairs.

FRAGMENTS: The number of fragments in the SAM file

PCT_JUMPS: The number of outward-facing pairs expressed as a percentage of the total of all outward-facing pairs, inward-facing pairs, and chimeric pairs.

PCT_NONJUMPS: The number of inward-facing pairs expressed as a percentage of the total of all outward-facing pairs, inward-facing pairs, and chimeric pairs.

PCT_CHIMERAS: The number of chimeric pairs expressed as a percentage of the total of all outward-facing pairs, inward-facing pairs, and chimeric pairs.
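
The three PCT_ columns above share one denominator, which the following sketch makes explicit. The function name is hypothetical, and the values are returned as fractions between 0 and 1 (multiply by 100 for a percentage); whether the metrics file stores fractions or percentages is not stated above.

```python
def jump_fractions(jump_pairs, nonjump_pairs, chimeric_pairs):
    """Share of outward-facing, inward-facing and chimeric pairs."""
    total = jump_pairs + nonjump_pairs + chimeric_pairs
    if total == 0:
        return 0.0, 0.0, 0.0
    return (
        jump_pairs / total,      # PCT_JUMPS
        nonjump_pairs / total,   # PCT_NONJUMPS
        chimeric_pairs / total,  # PCT_CHIMERAS
    )


print(jump_fractions(600, 300, 100))  # (0.6, 0.3, 0.1)
```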

KmerMetrics

Metrics about an individual kmer in a SAM or BAM file

Column Definitions

FREQUENCY: The number of times the kmer is seen in the data set

DISTINCT_KMERS: The number of distinct kmers occurring at this frequency

PCT_DISTINCT_KMERS: The percent of distinct kmers occurring at this frequency

DISTINCT_KMERS_THIS_FREQUENCY_OR_LESS: The number of distinct kmers occurring at this frequency or less

PCT_DISTINCT_KMERS_THIS_FREQUENCY_OR_LESS: The percent of distinct kmers occurring at this frequency or less

TOTAL_KMERS: The total number of kmers seen at this frequency

PCT_TOTAL_KMERS: The percent of total kmers seen at this frequency

TOTAL_KMERS_THIS_FREQUENCY_OR_LESS: The total number of kmers occurring at this frequency or less

PCT_TOTAL_KMERS_THIS_FREQUENCY_OR_LESS: The percent of total kmers seen at this frequency or less
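
The columns above form a frequency histogram with running totals, one row per observed frequency. A minimal sketch of how such a table could be built from a list of kmers is shown below. The function name is hypothetical, the PCT_ columns are omitted for brevity (each is the matching count divided by the overall distinct or total kmer count), and this is not the program's actual implementation.

```python
from collections import Counter


def kmer_frequency_table(kmers):
    """Build one row per frequency, with cumulative counts."""
    kmer_counts = Counter(kmers)               # kmer -> times seen
    by_frequency = Counter(kmer_counts.values())  # frequency -> distinct kmers
    rows = []
    cum_distinct = 0
    cum_total = 0
    for frequency in sorted(by_frequency):
        distinct = by_frequency[frequency]
        total = frequency * distinct  # kmer occurrences at this frequency
        cum_distinct += distinct
        cum_total += total
        rows.append({
            "FREQUENCY": frequency,
            "DISTINCT_KMERS": distinct,
            "DISTINCT_KMERS_THIS_FREQUENCY_OR_LESS": cum_distinct,
            "TOTAL_KMERS": total,
            "TOTAL_KMERS_THIS_FREQUENCY_OR_LESS": cum_total,
        })
    return rows


# AAC is seen once, AAA twice, ACG three times.
for row in kmer_frequency_table(["AAA", "AAA", "AAC", "ACG", "ACG", "ACG"]):
    print(row)
```

For the toy input the last row has FREQUENCY 3, DISTINCT_KMERS 1, TOTAL_KMERS 3, and cumulative values of 3 distinct and 6 total kmers.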

LowPassConcordanceMetrics

Concordance statistics for a set of low coverage reads against well known genotypes for the same sample for the purpose of ensuring that the sample being sequenced is the sample we think it is.

Column Definitions

CATEGORY: One of the following:

HOMOZYGOUS_REFERENCE: the metrics represent concordance at sites where the reference genotypes are homozygous reference
HETEROZYGOUS: the metrics represent concordance at sites where the reference genotypes are heterozygous
HOMOZYGOUS_NON_REFERENCE: the metrics represent concordance at sites where the reference genotypes are homozygous non-reference.

REFERENCE: Number of base calls matching the reference genotypes

NON_REFERENCE: Number of base calls not matching the reference genotypes

PCT_CONCORDANCE: A number rating how these statistics match with expectations. A value of close to 1 is good. How much below 1 is bad depends on the category.

MendelianViolationMetrics

Describes the type and number of mendelian violations found within a Trio.

RnaSeqMetrics

Metrics about the alignment of RNA-seq reads within a SAM file to genes, produced by the CollectRnaSeqMetrics program and usually stored in a file with the extension ".rna_metrics".

Column Definitions

PF_BASES: The total number of PF bases including non-aligned reads.

PF_ALIGNED_BASES: The total number of aligned PF bases. Non-primary alignments are not counted. Bases in aligned reads that do not correspond to reference (e.g. soft clips, insertions) are not counted.

RIBOSOMAL_BASES: Number of bases in primary alignments that align to ribosomal sequence.

CODING_BASES: Number of bases in primary alignments that align to a non-UTR coding base for some gene, and not ribosomal sequence.

UTR_BASES: Number of bases in primary alignments that align to a UTR base for some gene, and not a coding base.

INTRONIC_BASES: Number of bases in primary alignments that align to an intronic base for some gene, and not a coding or UTR base.

INTERGENIC_BASES: Number of bases in primary alignments that do not align to any gene.

IGNORED_READS: Number of primary alignments that map to a sequence specified on command-line as IGNORED_SEQUENCE. These are not counted in PF_ALIGNED_BASES, CORRECT_STRAND_READS, INCORRECT_STRAND_READS, or any of the base-counting metrics. These reads are counted in PF_BASES.

CORRECT_STRAND_READS: Number of aligned reads that map to the correct strand. 0 if library is not strand-specific.

INCORRECT_STRAND_READS: Number of aligned reads that map to the incorrect strand. 0 if library is not strand-specific.

PCT_RIBOSOMAL_BASES: RIBOSOMAL_BASES / PF_ALIGNED_BASES

PCT_CODING_BASES: CODING_BASES / PF_ALIGNED_BASES

PCT_UTR_BASES: UTR_BASES / PF_ALIGNED_BASES

PCT_INTRONIC_BASES: INTRONIC_BASES / PF_ALIGNED_BASES

PCT_INTERGENIC_BASES: INTERGENIC_BASES / PF_ALIGNED_BASES

PCT_MRNA_BASES: PCT_UTR_BASES + PCT_CODING_BASES

PCT_USABLE_BASES: The number of bases mapping to mRNA divided by the total number of PF bases.

PCT_CORRECT_STRAND_READS: CORRECT_STRAND_READS/(CORRECT_STRAND_READS + INCORRECT_STRAND_READS). 0 if library is not strand-specific.

MEDIAN_CV_COVERAGE: The median CV of coverage of the 1000 most highly expressed transcripts. Ideal value = 0.

MEDIAN_5PRIME_BIAS: The median 5 prime bias of the 1000 most highly expressed transcripts, where 5 prime bias is calculated per transcript as: mean coverage of the 5' most 100 bases divided by the mean coverage of the whole transcript.

MEDIAN_3PRIME_BIAS: The median 3 prime bias of the 1000 most highly expressed transcripts, where 3 prime bias is calculated per transcript as: mean coverage of the 3' most 100 bases divided by the mean coverage of the whole transcript.

MEDIAN_5PRIME_TO_3PRIME_BIAS: The ratio of coverage at the 5' end to the 3' end based on the 1000 most highly expressed transcripts.
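
The ratio columns and the per-transcript bias calculation above can be sketched as follows. The function names are hypothetical. The sketch assumes that per-base coverage is listed in transcript orientation (5' to 3') and that "bases mapping to mRNA" means CODING_BASES + UTR_BASES.

```python
def rna_fractions(pf_bases, pf_aligned_bases, coding_bases, utr_bases):
    """A few of the PCT_ columns, returned as fractions."""
    pct_coding = coding_bases / pf_aligned_bases
    pct_utr = utr_bases / pf_aligned_bases
    return {
        "PCT_CODING_BASES": pct_coding,
        "PCT_UTR_BASES": pct_utr,
        "PCT_MRNA_BASES": pct_utr + pct_coding,
        # Assumption: mRNA bases are coding plus UTR bases.
        "PCT_USABLE_BASES": (coding_bases + utr_bases) / pf_bases,
    }


def end_biases(coverage, end_length=100):
    """Return (5' bias, 3' bias) for one transcript's per-base coverage."""
    mean_all = sum(coverage) / len(coverage)
    five_prime = coverage[:end_length]
    three_prime = coverage[-end_length:]
    return (
        (sum(five_prime) / len(five_prime)) / mean_all,
        (sum(three_prime) / len(three_prime)) / mean_all,
    )


print(rna_fractions(1000, 800, 400, 200)["PCT_MRNA_BASES"])  # 0.75

# Coverage rising from the 5' end to the 3' end of a 500-base transcript.
coverage = [10] * 100 + [20] * 300 + [30] * 100
print(end_biases(coverage))  # (0.5, 1.5)
```

The MEDIAN_ columns would then be the median of these per-transcript values over the 1000 most highly expressed transcripts.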

RrbsCpgDetailMetrics

Holds information about CpG sites encountered for RRBS processing QC

Column Definitions

SEQUENCE_NAME: Sequence the CpG is seen in

POSITION: Position within the sequence of the CpG site

TOTAL_SITES: Number of times this CpG site was encountered

CONVERTED_SITES: Number of times this CpG site was converted (TG for + strand, CA for - strand)

PCT_CONVERTED: CONVERTED_SITES / TOTAL_SITES

RrbsSummaryMetrics

Holds summary statistics from RRBS processing QC

Column Definitions

READS_ALIGNED: Number of mapped reads processed

NON_CPG_BASES: Number of times a non-CpG cytosine was encountered

NON_CPG_CONVERTED_BASES: Number of times a non-CpG cytosine was converted (C->T for +, G->A for -)

PCT_NON_CPG_BASES_CONVERTED: NON_CPG_CONVERTED_BASES / NON_CPG_BASES

CPG_BASES_SEEN: Number of CpG sites encountered

CPG_BASES_CONVERTED: Number of CpG sites that were converted (TG for +, CA for -)

PCT_CPG_BASES_CONVERTED: CPG_BASES_CONVERTED / CPG_BASES_SEEN

MEAN_CPG_COVERAGE: Mean coverage of CpG sites

MEDIAN_CPG_COVERAGE: Median coverage of CpG sites

READS_WITH_NO_CPG: Number of reads discarded for having no CpG sites

READS_IGNORED_SHORT: Number of reads discarded due to being too short

READS_IGNORED_MISMATCHES: Number of reads discarded for exceeding the mismatch threshold
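
As a small worked example of the conversion ratios in the RRBS metrics, the sketch below assumes each PCT_ column is the converted count divided by the corresponding total count, so that the result stays between 0 and 1. The function name and the counts are hypothetical.

```python
def conversion_rate(converted, total):
    """Fraction of encountered cytosines (or CpG sites) that were converted."""
    return converted / total if total else 0.0


# Made-up counts: 990 of 1000 non-CpG cytosines converted,
# 400 of 1000 CpG sites converted.
print(conversion_rate(990, 1000))  # 0.99
print(conversion_rate(400, 1000))  # 0.4
```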

SamFileValidator.ValidationMetrics

Column Definitions

ScreenSamReads.ScreenSamReadsMetrics

SAM or BAM read screening Metrics

Column Definitions

START_REFERENCE: The reference of the un-screened source file.

START_BASES: The number of bases in the un-screened source data file.

PCT_START_ALIGNED: The % of mapped bases in the un-screened source data file (of start bases).

PCT_START_UNMAPPED: The % of unmapped bases in the un-screened source data file (of start bases).

POSITIVE_REFERENCE: The positive reference used during alignment of un-screened reads.

PCT_POSITIVE_ALIGNED: The % of bases that mapped to the positive reference.

PCT_POSITIVE_UNMAPPED: The % of bases that did not map to the positive reference.

NEGATIVE_REFERENCE: The negative reference used during alignment of un-screened reads.

PCT_NEGATIVE_ALIGNED: The % of bases that mapped to the negative reference.

PCT_NEGATIVE_UNMAPPED: The % of bases that did not map to the negative reference.

END_BASES: The number of bases in the screened data file.

PCT_END_ALIGNED: The % of mapped bases in the screened data file (of end bases).

PCT_END_UNMAPPED: The % of unmapped bases in the screened data file (of end bases).

PCT_PASSING_SCREEN: The % of bases that passed the screen (of start bases).

PCT_FAILING_SCREEN: The % of bases that failed the screen (of start bases).

SpikeInMetrics

Column Definitions

TOTAL_PLASMID_READS: The number of reads in the BAM file that map to plasmids

EXPECTED_PLASMID: The name of the plasmid that was spiked into the lane; all plasmid reads are expected to align to this reference.

MEDIAN_COVERAGE_EXPECTED_PLASMID: The median number of reads covering each "bin" of the expected plasmid

EXPECTED_PLASMID_COUNT: The number of reads mapping to the expected plasmid

BEST_PLASMID: The name of the plasmid to which the most reads aligned.

BEST_PLASMID_MEDIAN_COVERAGE: The median number of reads covering each "bin" of the best plasmid

BEST_PLASMID_COUNT: The number of reads mapping to the plasmid with the most reads aligned

SECOND_BEST_PLASMID: The name of the plasmid to which the second-highest number of reads aligned.

SECOND_BEST_PLASMID_MEDIAN_COVERAGE: The median number of reads covering each "bin" of the second-best plasmid

SECOND_BEST_PLASMID_COUNT: The number of reads mapping to the plasmid with the second-highest number of reads aligned.

TargetedPcrMetrics

Metrics class for targeted pcr runs such as TSCA runs

Column Definitions

CUSTOM_AMPLICON_SET: The name of the amplicon set used in this metrics collection run

GENOME_SIZE: The number of bases in the reference genome used for alignment.

AMPLICON_TERRITORY: The number of unique bases covered by the intervals of all amplicons in the amplicon set

TARGET_TERRITORY: The number of unique bases covered by the intervals of all targets that should be covered

TOTAL_READS: The total number of reads in the SAM or BAM file examined.

PF_READS: The number of reads that pass the vendor's filter.

PF_BASES: The number of bases in the SAM or BAM file to be examined

PF_UNIQUE_READS: The number of PF reads that are not marked as duplicates.

PCT_PF_READS: PF reads / total reads. The percent of reads passing filter.

PCT_PF_UQ_READS: PF Unique Reads / Total Reads.

PF_UQ_READS_ALIGNED: The number of PF unique reads that are aligned with mapping score > 0 to the reference genome.

PF_SELECTED_PAIRS: Tracks the number of read pairs that we see that are PF (used to calculate library size)

PF_SELECTED_UNIQUE_PAIRS: Tracks the number of unique PF reads pairs we see (used to calc library size)

PCT_PF_UQ_READS_ALIGNED: PF Reads Aligned / PF Reads.

PF_UQ_BASES_ALIGNED: The number of PF unique bases that are aligned with mapping score > 0 to the reference genome.

ON_AMPLICON_BASES: The number of PF aligned bases that mapped to an amplified region of the genome.

NEAR_AMPLICON_BASES: The number of PF aligned bases that mapped to within a fixed interval of an amplified region, but not on an amplified region.

OFF_AMPLICON_BASES: The number of PF aligned bases that mapped neither on nor near an amplicon.

ON_TARGET_BASES: The number of PF aligned bases that mapped to a targeted region of the genome.

ON_TARGET_FROM_PAIR_BASES: The number of PF aligned bases that are mapped in pair to a targeted region of the genome.

PCT_AMPLIFIED_BASES: On+Near Amplicon Bases / PF Bases Aligned.

PCT_OFF_AMPLICON: The percentage of aligned PF bases that mapped neither on nor near an amplicon.

ON_AMPLICON_VS_SELECTED: The percentage of on+near amplicon bases that are on as opposed to near.

MEAN_AMPLICON_COVERAGE: The mean coverage of all amplicons in the experiment.

MEAN_TARGET_COVERAGE: The mean coverage of targets that received at least coverage depth = 2 at one base.

FOLD_ENRICHMENT: The fold by which the amplicon region has been amplified above genomic background.

ZERO_CVG_TARGETS_PCT: The number of targets that did not reach coverage=2 over any base.

FOLD_80_BASE_PENALTY: The fold over-coverage necessary to raise 80% of bases in "non-zero-cvg" targets to the mean coverage level in those targets.

PCT_TARGET_BASES_2X: The percentage of ALL target bases achieving 2X or greater coverage.

PCT_TARGET_BASES_10X: The percentage of ALL target bases achieving 10X or greater coverage.

PCT_TARGET_BASES_20X: The percentage of ALL target bases achieving 20X or greater coverage.

PCT_TARGET_BASES_30X: The percentage of ALL target bases achieving 30X or greater coverage.

UploadAggregationMetrics.Metrics.AggregationFauxMetric

A transient metric to encapsulate the data within the METRICS.AGGREGATION table. This class must be public to ensure visibility for reflection.

Column Definitions

UploadAggregationMetrics.Metrics.ForeignKeyFauxMetric

A transient metric to be merged with other metrics to augment them with a foreign key value in order to conform to database schema.

Column Definitions

AGGREGATION_ID:

UploadAggregationMetrics.Metrics.ReadGroupFauxMetric

A transient metric to encapsulate the data within the METRICS.AGGREGATION_READ_GROUP table. This class must be public to ensure visibility for reflection.