Picard Metrics Definitions

Table Of Contents

  1. AlignmentSummaryMetrics: High level metrics about the alignment of reads within a SAM file, produced by the CollectAlignmentSummaryMetrics program and usually stored in a file with the extension ".alignment_summary_metrics".
  2. CollectIonTorrentBaseCallMetrics.IonBaseCallLibraryMetric: Metrics to describe molecular-barcode-specific data about an Ion Torrent basecalling run.
  3. CollectIonTorrentBaseCallMetrics.IonBaseCallRunMetric: Metrics to describe an Ion Torrent basecalling run, including metrics about first 2 test fragments.
  4. CollectIonTorrentBaseCallMetrics.IonRunMetric: Overview of an Ion Torrent basecalling run.
  5. CollectOxoGMetrics.CpcgMetrics: Metrics class for outputs.
  6. CollectQualityYieldMetrics.QualityYieldMetrics: A set of metrics used to describe the general quality of a BAM file
  7. CollectVariantCallingMetrics.VariantCallingDetailMetrics: A collection of metrics relating to SNPs and indels within a variant-calling file (VCF) for a given sample.
  8. CollectVariantCallingMetrics.VariantCallingSummaryMetrics: A collection of metrics relating to SNPs and indels within a variant-calling file (VCF).
  9. CollectWgsMetrics.WgsMetrics: Metrics for evaluating the performance of whole genome sequencing experiments.
  10. ContaminationMetrics.FauxMetric: A metric-like container representing a contamination metrics file entry, though the contamination file is not a metrics file.
  11. CoverageMetric:
  12. DbSnpMatchMetrics: Metrics about how genotypes called by the pipeline match up to dbSNP, created by the CollectDbSnpMatches program and usually stored in a file with the extension ".dbsnp_matches".
  13. DuplicationMetrics: Metrics that are calculated during the process of marking duplicates within a stream of SAMRecords.
  14. ExtractIlluminaBarcodes.BarcodeMetric: Metrics produced by the ExtractIlluminaBarcodes program that is used to parse data in the basecalls directory and determine to which barcode each read should be assigned.
  15. FingerprintDetailMetrics: A collection of metrics about how the reads in a BAM file matched up with the known genotypes for a particular fingerprint panel
  16. FingerprintSummaryMetrics: A collection of metrics that summarize the match of reads in a particular BAM file against various fingerprint panels.
  17. FingerprintingDetailMetrics: Detailed metrics about an individual SNP/Haplotype comparison within a fingerprint comparison.
  18. FingerprintingSummaryMetrics: Summary fingerprinting metrics and statistics about the comparison of the sequence data from a single read group (lane or index within a lane) vs. a set of known genotypes for the expected sample.
  19. GcBiasDetailMetrics: Class that holds detailed metrics about reads that fall within windows of a certain GC bin on the reference genome.
  20. GcBiasSummaryMetrics: High level metrics that capture how biased the coverage in a certain lane is.
  21. GenotypeConcordanceMetrics: Statistics about how well a given set of input genotypes matches a set of well known reference genotypes.
  22. GenotypeFreeContaminationMetric:
  23. GenotypeFreeLikelihoodPlotMetric:
  24. HsMetrics: The set of metrics captured that are specific to a hybrid selection analysis.
  25. IlluminaBasecallingMetrics: Metric for Illumina Basecalling that stores means and standard deviations on a per-barcode per-lane basis.
  26. IlluminaLaneMetrics: Embodies characteristics that describe a lane.
  27. IlluminaPhasingMetrics: Metrics for Illumina Basecalling that stores median phasing and prephasing percentages on a per-template-read, per-lane basis.
  28. InsertSizeMetrics: Metrics about the insert size distribution of a paired-end library, created by the CollectInsertSizeMetrics program and usually written to a file with the extension ".insert_size_metrics".
  29. InternalControlCycleMetrics: Metrics about observations of an internal control sequence in an individual cycle.
  30. InternalControlSummaryMetrics: Summary metrics about internal controls within a lane.
  31. JumpingLibraryMetrics: High level metrics about the presence of outward- and inward-facing pairs within a SAM file generated with a jumping library, produced by the CollectJumpingLibraryMetrics program and usually stored in a file with the extension ".jump_metrics".
  32. KmerMetrics: Metrics about an individual kmer in a SAM or BAM file
  33. LowPassConcordanceMetrics: Concordance statistics for a set of low coverage reads against well known genotypes for the same sample for the purpose of ensuring that the sample being sequenced is the sample we think it is.
  34. MendelianViolationMetrics: Describes the type and number of mendelian violations found within a Trio.
  35. MotifCoverageMetric:
  36. MultilevelMetrics:
  37. RnaSeqMetrics: Metrics about the alignment of RNA-seq reads within a SAM file to genes, produced by the CollectRnaSeqMetrics program and usually stored in a file with the extension ".rna_metrics".
  38. RrbsCpgDetailMetrics: Holds information about CpG sites encountered for RRBS processing QC
  39. RrbsSummaryMetrics: Holds summary statistics from RRBS processing QC
  40. SamFileValidator.ValidationMetrics:
  41. ScreenSamReads.ScreenSamReadsMetrics: SAM or BAM read screening Metrics
  42. SpikeInMetrics:
  43. TargetedPcrMetrics: Metrics class for targeted PCR runs such as TSCA runs
  44. UploadAggregationMetrics.Metrics.AggregationFauxMetric: A transient metric to encapsulate the data within the METRICS.AGGREGATION table.
  45. UploadAggregationMetrics.Metrics.ForeignKeyFauxMetric: A transient metric to be merged with other metrics to augment them with a foreign key value in order to conform to database schema.
  46. UploadAggregationMetrics.Metrics.ReadGroupFauxMetric: A transient metric to encapsulate the data within the METRICS.AGGREGATION_READ_GROUP table.
  47. UploadFour54ScreeningMetrics.Four54ScreeningMetrics:
  48. UploadIlluminaScreeningMetrics.IlluminaScreeningMetrics:
  49. VariantCallingMetricsUploader.Metric.VariantCallingAnalysisFauxMetric: A bean for the METRIC.VARIANT_CALLING_ANALYSIS database table represented as a MetricBase so that it can be uploaded via the metrics uploader.
  50. VariantCallingMetricsUploader.Metric.VariantCallingAnalysisForeignKeyFauxMetric:
  51. VariantCallingSampleMetadataMetric:
AlignmentSummaryMetrics

High level metrics about the alignment of reads within a SAM file, produced by the CollectAlignmentSummaryMetrics program and usually stored in a file with the extension ".alignment_summary_metrics".

Column Definitions

CATEGORY: One of either UNPAIRED (for a fragment run), FIRST_OF_PAIR when metrics are for only the first read in a paired run, SECOND_OF_PAIR when the metrics are for only the second read in a paired run or PAIR when the metrics are aggregated for both first and second reads in a pair.
TOTAL_READS: The total number of reads including all PF and non-PF reads. When CATEGORY equals PAIR this value will be 2x the number of clusters.
PF_READS: The number of PF reads where PF is defined as passing Illumina's filter.
PCT_PF_READS: The percentage of reads that are PF (PF_READS / TOTAL_READS)
PF_NOISE_READS: The number of PF reads that are marked as noise reads. A noise read is one which is composed entirely of A bases and/or N bases. These reads are marked as they are usually artifactual and are of no use in downstream analysis.
PF_READS_ALIGNED: The number of PF reads that were aligned to the reference sequence. This includes reads that aligned with low quality (i.e. their alignments are ambiguous).
PCT_PF_READS_ALIGNED: The percentage of PF reads that aligned to the reference sequence. PF_READS_ALIGNED / PF_READS
PF_ALIGNED_BASES: The total number of aligned bases, in all mapped PF reads, that are aligned to the reference sequence.
PF_HQ_ALIGNED_READS: The number of PF reads that were aligned to the reference sequence with a mapping quality of Q20 or higher signifying that the aligner estimates a 1/100 (or smaller) chance that the alignment is wrong.
PF_HQ_ALIGNED_BASES: The number of bases aligned to the reference sequence in reads that were mapped at high quality. Will usually approximate PF_HQ_ALIGNED_READS * READ_LENGTH but may differ when either mixed read lengths are present or many reads are aligned with gaps.
PF_HQ_ALIGNED_Q20_BASES: The subset of PF_HQ_ALIGNED_BASES where the base call quality was Q20 or higher.
PF_HQ_MEDIAN_MISMATCHES: The median number of mismatches versus the reference sequence in reads that were aligned to the reference at high quality (i.e. PF_HQ_ALIGNED_READS).
PF_MISMATCH_RATE: The rate of bases mismatching the reference for all bases aligned to the reference sequence.
PF_HQ_ERROR_RATE: The percentage of bases that mismatch the reference in PF HQ aligned reads.
PF_INDEL_RATE: The number of insertion and deletion events per 100 aligned bases. Uses the number of events as the numerator, not the number of inserted or deleted bases.
MEAN_READ_LENGTH: The mean read length of the set of reads examined. When looking at the data for a single lane with equal length reads this number is just the read length. When looking at data for merged lanes with differing read lengths this is the mean read length of all reads.
READS_ALIGNED_IN_PAIRS: The number of aligned reads whose mate pair was also aligned to the reference.
PCT_READS_ALIGNED_IN_PAIRS: The percentage of reads whose mate pair was also aligned to the reference. READS_ALIGNED_IN_PAIRS / PF_READS_ALIGNED
BAD_CYCLES: The number of instrument cycles in which 80% or more of base calls were no-calls.
STRAND_BALANCE: The number of PF reads aligned to the positive strand of the genome divided by the number of PF reads aligned to the genome.
PCT_CHIMERAS: The percentage of reads that map outside of a maximum insert size (usually 100kb) or that have the two ends mapping to different chromosomes.
PCT_ADAPTER: The percentage of PF reads that are unaligned and match to a known adapter sequence right from the start of the read.
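
The ratio columns above follow directly from the raw counts. The following is a minimal sketch rather than Picard's own code: the function names, the `pf_positive_strand_reads` argument and the choice to return 0.0 for an empty denominator are assumptions made for illustration. It also shows the Phred relationship behind the "Q20 means a 1/100 chance of error" statement.

```python
def alignment_ratios(total_reads, pf_reads, pf_reads_aligned,
                     reads_aligned_in_pairs, pf_positive_strand_reads):
    """Derive the ratio columns from raw counts, using the formulas quoted above."""

    def ratio(numerator, denominator):
        # Assumption: report 0.0 instead of failing when the denominator is 0.
        return numerator / denominator if denominator else 0.0

    return {
        "PCT_PF_READS": ratio(pf_reads, total_reads),
        "PCT_PF_READS_ALIGNED": ratio(pf_reads_aligned, pf_reads),
        "PCT_READS_ALIGNED_IN_PAIRS": ratio(reads_aligned_in_pairs, pf_reads_aligned),
        "STRAND_BALANCE": ratio(pf_positive_strand_reads, pf_reads_aligned),
    }


def phred_to_error_probability(q):
    """Convert a Phred-scaled quality (e.g. Q20) to an error probability."""
    return 10 ** (-q / 10)
```

Note that the values are returned as fractions between 0 and 1; the text calls them percentages but defines them as plain quotients.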
CollectIonTorrentBaseCallMetrics.IonBaseCallLibraryMetric

Metrics to describe molecular-barcode-specific data about an Ion Torrent basecalling run. Currently there is not much here because we haven't done any indexed Ion Torrent runs yet.

Column Definitions

RUN_NAME:
MOLECULAR_INDEX_NAME:
MEAN_READ_LENGTH:
TOTAL_NUM_BASES:
CollectIonTorrentBaseCallMetrics.IonBaseCallRunMetric

Metrics to describe an Ion Torrent basecalling run, including metrics about first 2 test fragments.

Column Definitions

RUN_NAME: The full name of the Ion run consisting of run date (analysis date?), operator, and PGM run number.
TOTAL_NUM_BASES: Number of filtered and trimmed base pairs reported in the SFF and FASTQ files.
NUM_Q17_BASES: Number of bases with predicted quality of Q17 or greater.
NUM_Q20_BASES: Number of bases with predicted quality of Q20 or greater.
TOTAL_NUM_READS: Total number of filtered and trimmed reads independent of length reported in the SFF and FASTQ files.
MEAN_READ_LENGTH: Average length, in base pairs, of all filtered and trimmed library reads reported in the SFF and FASTQ files.
LONGEST_READ: Maximum length, in base pairs, of all filtered and trimmed library reads reported in the file.
LIBRARY_CF: percentage of reads affected by Carry Forward events?
LIBRARY_IE: percentage of reads affected by Incomplete Extension?
LIBRARY_DR:
LIBRARY_SNR:
NUMBER_AMBIGUOUS:
NUMBER_DUD:
NUMBER_TF: Number of Test Fragment Reads (defined by key sequence) before filtering and trimming
NUMBER_LIB: Number of Library Reads (defined by key sequence TCAG) before filtering and trimming
KEYPASS_ALL_BEADS: Total number of reads with any key sequence before filtering and trimming
TOTAL_ADDRESSABLE_WELLS: Total number of addressable wells
WELLS_WITH_ISPS: Number of wells that were determined to be "positive" for the presence of an ISP within the well. "Positive" is determined by measuring the diffusion rate of a flow with a different pH. Wells containing ISPs have a delayed pH change due to the presence of an ISP slowing the detection of the pH change from the solution.
PCT_WELLS_WITH_ISPS: Percent of Addressable wells loaded with a bead (WELLS_WITH_ISPS/TOTAL_ADDRESSABLE_WELLS)
LIVE_ISPS: Number of wells that contained an ISP with a signal of sufficient strength and composition to be associated with the library or Test Fragment key. This value is the sum of two categories: Test Fragment ISPs and Library ISPs.
PCT_LIVE_ISPS: Percent of wells with ISPs that have live ISPs (LIVE_ISPS/WELLS_WITH_ISPS)
TEST_FRAGMENT_ISPS: Number of Live ISPs with a key signal that was identical to the Test Fragment key signal.
PCT_TEST_FRAGMENT_ISPS: Percent of live ISPs that are test fragments
LIBRARY_ISPS: Number of Live ISPs that have a key signal identical to the library key signal. These reads are input into the Library filtering process.
PCT_LIBRARY_ISPS: Percent of live ISPs that are library
FLTRD_TOO_SHORT:
PCT_FLTRD_TOO_SHORT:
FLTRD_KEYPASS_FAILURE:
PCT_FLTRD_KEYPASS_FAILURE:
FLTRD_LOW_SIGNAL:
PCT_FLTRD_LOW_SIGNAL:
FLTRD_POOR_SIGNAL_PROFILE:
PCT_FLTRD_POOR_SIGNAL_PROFILE:
FLTRD_3_PRIME_ADAPTER_TRIM:
PCT_FLTRD_3_PRIME_ADAPTER_TRIM:
FLTRD_3_PRIME_QUAL_TRIM:
PCT_FLTRD_3_PRIME_QUAL_TRIM:
FINAL_LIBRARY_READS: Number of Library reads passing all filters, which are recorded in the SFF and FASTQ files.
PCT_FINAL_LIBRARY_READS: Percentage of library reads passing all filters, which are recorded in the SFF and FASTQ files.
CHIP_CHECK: A series of tests on reference wells (about 10% of the chip in non-addressable areas) is performed to ensure that the chip is functioning at a basic level. The value of this field is either Passed or Failed.
CHIP_TYPE: Chip type (314,316,318)
FLOW_ORDER: Nucleotide flow order
LIBRARY_KEY: A short known sequence of bases used to distinguish the library fragment from the Test Fragment. Example: "TCAG"
ANALYSIS_VERSION: Version of the Analysis Pipeline used to generate the analysis.
DBREPORTS_VERSION: Version of the ion-dbreports package.
TF1_NAME: Name of 1st TF
TF1_Q10_MEAN: Mean read length of all Q10 or greater Test Fragments (type 1)
TF1_Q17_MEAN: Mean read length of all Q17 or greater Test Fragments (type 1)
TF1_Q10_MODE: Mode of read lengths of all Q10 or greater Test Fragments (type 1)
TF1_Q17_MODE: Mode of read lengths of all Q17 or greater Test Fragments (type 1)
TF1_SYSTEM_SNR:
TF1_50Q10_READS: Number of Test Fragments (type1) that at 50bp have a quality score of Q10 or greater
TF1_50Q17_READS: Number of Test Fragments (type1) that at 50bp have a quality score of Q17 or greater
TF1_KEYPASS_READS: Total number of type 1 TF reads
TF1_CF: Percent of TF type 1 reads affected by Carry Forward
TF1_IE: Percent of TF type 1 reads affected by Incomplete Extension
TF1_DR:
TF1_KEY_PEAK_COUNTS:
TF2_NAME: Name of 2nd TF
TF2_Q10_MEAN: Mean read length of all Q10 or greater Test Fragments (type 2)
TF2_Q17_MEAN: Mean read length of all Q17 or greater Test Fragments (type 2)
TF2_Q10_MODE: Mode of read lengths of all Q10 or greater Test Fragments (type 2)
TF2_Q17_MODE: Mode of read lengths of all Q17 or greater Test Fragments (type 2)
TF2_SYSTEM_SNR:
TF2_50Q10_READS: Number of Test Fragments (type2) that at 50bp have a quality score of Q10 or greater
TF2_50Q17_READS: Number of Test Fragments (type2) that at 50bp have a quality score of Q17 or greater
TF2_KEYPASS_READS: Total number of type 2 TF reads
TF2_CF: Percent of TF type 2 reads affected by Carry Forward
TF2_IE: Percent of TF type 2 reads affected by Incomplete Extension
TF2_DR:
TF2_KEY_PEAK_COUNTS:
ION_RUN_ID:
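
The ISP columns above are related by simple quotients. The sketch below restates those relationships under stated assumptions: the function name is hypothetical, live ISPs are taken to be the sum of Test Fragment and Library ISPs as described for LIVE_ISPS, and results are returned as fractions because the text does not say whether the stored values are multiplied by 100.

```python
def isp_percentages(total_addressable_wells, wells_with_isps,
                    test_fragment_isps, library_isps):
    """Recompute the ISP loading columns from well and ISP counts."""
    # LIVE_ISPS is described as the sum of the Test Fragment and Library categories.
    live_isps = test_fragment_isps + library_isps

    def ratio(numerator, denominator):
        # Assumption: report 0.0 when the denominator is 0.
        return numerator / denominator if denominator else 0.0

    return {
        "LIVE_ISPS": live_isps,
        "PCT_WELLS_WITH_ISPS": ratio(wells_with_isps, total_addressable_wells),
        "PCT_LIVE_ISPS": ratio(live_isps, wells_with_isps),
        "PCT_TEST_FRAGMENT_ISPS": ratio(test_fragment_isps, live_isps),
        "PCT_LIBRARY_ISPS": ratio(library_isps, live_isps),
    }
```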
CollectIonTorrentBaseCallMetrics.IonRunMetric

Overview of an Ion Torrent basecalling run. Eventually this will go away and be replaced by a subclass of NEXT_GENERATION_RUN

Column Definitions

RUN_NAME: aka Experiment name in Ion-speak
PGM_NAME:
RUN_DATE:
RAW_DATA_DIRECTORY:
CHIP_BARCODE:
ID: Arbitrary identifier for database joins.
CollectOxoGMetrics.CpcgMetrics

Metrics class for outputs.

Column Definitions

SAMPLE_ALIAS: The name of the sample being assayed.
LIBRARY: The name of the library being assayed.
CONTEXT: The sequence context being reported on.
TOTAL_SITES: The total number of sites that had at least one base covering them.
TOTAL_BASES: The total number of basecalls observed at all sites.
REF_NONOXO_BASES: The number of reference alleles observed as C in read 1 and G in read 2.
REF_OXO_BASES: The number of reference alleles observed as G in read 1 and C in read 2.
REF_TOTAL_BASES: The total number of reference alleles observed
ALT_NONOXO_BASES: The count of observed A basecalls at C reference positions and T basecalls at G reference bases that are correlated to instrument read number in a way that rules out oxidation as the cause
ALT_OXO_BASES: The count of observed A basecalls at C reference positions and T basecalls at G reference bases that are correlated to instrument read number in a way that is consistent with oxidative damage.
OXIDATION_ERROR_RATE: The oxo error rate, calculated as max(ALT_OXO_BASES - ALT_NONOXO_BASES, 1) / TOTAL_BASES
OXIDATION_Q: -10 * log10(OXIDATION_ERROR_RATE)
C_REF_REF_BASES: The number of ref basecalls observed at sites where the genome reference == C.
G_REF_REF_BASES: The number of ref basecalls observed at sites where the genome reference == G.
C_REF_ALT_BASES: The number of alt (A/T) basecalls observed at sites where the genome reference == C.
G_REF_ALT_BASES: The number of alt (A/T) basecalls observed at sites where the genome reference == G.
C_REF_OXO_ERROR_RATE: The rate at which C>A and G>T substitutions are observed at C reference sites above the expected rate if there were no bias between sites with a C reference base vs. a G reference base.
C_REF_OXO_Q: C_REF_OXO_ERROR_RATE expressed as a phred-scaled quality score.
G_REF_OXO_ERROR_RATE: The rate at which C>A and G>T substitutions are observed at G reference sites above the expected rate if there were no bias between sites with a C reference base vs. a G reference base.
G_REF_OXO_Q: G_REF_OXO_ERROR_RATE expressed as a phred-scaled quality score.
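
The two formulas given for OXIDATION_ERROR_RATE and OXIDATION_Q can be written out as follows. This is only a sketch of the arithmetic quoted above; the function names are illustrative and not part of Picard.

```python
import math


def oxidation_error_rate(alt_oxo_bases, alt_nonoxo_bases, total_bases):
    """max(ALT_OXO_BASES - ALT_NONOXO_BASES, 1) / TOTAL_BASES."""
    # The floor of 1 keeps the rate positive so the Q score stays finite.
    return max(alt_oxo_bases - alt_nonoxo_bases, 1) / total_bases


def oxidation_q(error_rate):
    """Phred-scale the error rate: -10 * log10(OXIDATION_ERROR_RATE)."""
    return -10 * math.log10(error_rate)
```

For example, 150 oxo-consistent and 50 non-oxo alt bases over 1,000,000 total bases give a rate of 1e-4, which is Q40.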
CollectQualityYieldMetrics.QualityYieldMetrics

A set of metrics used to describe the general quality of a BAM file

Column Definitions

TOTAL_READS: The total number of reads in the input file
PF_READS: The number of reads that are PF - pass filter
READ_LENGTH: The average read length of all the reads (will be fixed for a lane)
TOTAL_BASES: The total number of bases in all reads
PF_BASES: The total number of bases in all PF reads
Q20_BASES: The number of bases in all reads that achieve quality score 20 or higher
PF_Q20_BASES: The number of bases in PF reads that achieve quality score 20 or higher
Q30_BASES: The number of bases in all reads that achieve quality score 30 or higher
PF_Q30_BASES: The number of bases in PF reads that achieve quality score 30 or higher
Q20_EQUIVALENT_YIELD: The sum of quality scores of all bases divided by 20
PF_Q20_EQUIVALENT_YIELD: The sum of quality scores of all bases in PF reads divided by 20
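
As an illustration of how these columns relate to per-base qualities, here is a minimal sketch that tallies them from an in-memory list of reads. It is not Picard's implementation. It assumes each read is given as a hypothetical `(is_pf, qualities)` tuple, that the Q30 columns use a threshold of 30 as their names suggest, that the PF_ variants count only PF reads, and that the equivalent yield is a plain (non-integer) division.

```python
def quality_yield(reads):
    """Tally quality-yield columns from (is_pf, [base qualities]) tuples."""
    m = {name: 0 for name in (
        "TOTAL_READS", "PF_READS", "TOTAL_BASES", "PF_BASES",
        "Q20_BASES", "PF_Q20_BASES", "Q30_BASES", "PF_Q30_BASES")}
    qual_sum = pf_qual_sum = 0

    for is_pf, quals in reads:
        q20 = sum(1 for q in quals if q >= 20)
        q30 = sum(1 for q in quals if q >= 30)
        m["TOTAL_READS"] += 1
        m["TOTAL_BASES"] += len(quals)
        m["Q20_BASES"] += q20
        m["Q30_BASES"] += q30
        qual_sum += sum(quals)
        if is_pf:
            m["PF_READS"] += 1
            m["PF_BASES"] += len(quals)
            m["PF_Q20_BASES"] += q20
            m["PF_Q30_BASES"] += q30
            pf_qual_sum += sum(quals)

    # READ_LENGTH is the average read length over all reads.
    m["READ_LENGTH"] = m["TOTAL_BASES"] / m["TOTAL_READS"] if m["TOTAL_READS"] else 0.0
    m["Q20_EQUIVALENT_YIELD"] = qual_sum / 20
    m["PF_Q20_EQUIVALENT_YIELD"] = pf_qual_sum / 20
    return m
```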
CollectVariantCallingMetrics.VariantCallingDetailMetrics

A collection of metrics relating to SNPs and indels within a variant-calling file (VCF) for a given sample.

Column Definitions

SAMPLE_ALIAS: The name of the sample being assayed
HET_HOMVAR_RATIO: (count of hets)/(count of homozygous non-ref) for this sample
CollectVariantCallingMetrics.VariantCallingSummaryMetrics

A collection of metrics relating to SNPs and indels within a variant-calling file (VCF).

Column Definitions

TOTAL_SNPS: The number of high confidence SNP calls (i.e. non-reference genotypes) that were examined
NUM_IN_DB_SNP: The number of high confidence SNPs found in dbSNP
NOVEL_SNPS: The number of high confidence SNPs called that were not found in dbSNP
FILTERED_SNPS: The number of SNPs that are also filtered
PCT_DBSNP: The percentage of high confidence SNPs in dbSNP
DBSNP_TITV: The Transition/Transversion ratio of the SNP calls made at dbSNP sites
NOVEL_TITV: The Transition/Transversion ratio of the SNP calls made at non-dbSNP sites
TOTAL_INDELS: The number of high confidence Indel calls that were examined
NOVEL_INDELS: The number of high confidence Indels called that were not found in dbSNP
FILTERED_INDELS: The number of indels that are also filtered
PCT_DBSNP_INDELS: The percentage of high confidence Indels in dbSNP
NUM_IN_DB_SNP_INDELS: The number of high confidence Indels found in dbSNP
DBSNP_INS_DEL_RATIO: The Insertion/Deletion ratio of the Indel calls made at dbSNP sites
NOVEL_INS_DEL_RATIO: The Insertion/Deletion ratio of the Indel calls made at non-dbSNP sites
TOTAL_MULTIALLELIC_SNPS: The number of high confidence multiallelic SNP calls that were examined
NUM_IN_DB_SNP_MULTIALLELIC: The number of high confidence multiallelic SNPs found in dbSNP
TOTAL_COMPLEX_INDELS: The number of high confidence complex Indel calls that were examined
NUM_IN_DB_SNP_COMPLEX_INDELS: The number of high confidence complex Indels found in dbSNP
SNP_REFERENCE_BIAS: The rate at which reference bases are observed at ref/alt heterozygous SNP sites.
NUM_SINGLETONS: For summary metrics, the number of variants that appear in only one sample. For detail metrics, the number of variants that appear only in the current sample.
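
The SNP columns above can be illustrated with a small sketch that classifies each call as a transition (A<->G, C<->T) or a transversion (any other substitution) and splits the calls by dbSNP membership. The input shape, a list of hypothetical `(ref, alt, in_dbsnp)` tuples for biallelic SNPs, and the function names are assumptions for illustration; the sketch does not read VCF files.

```python
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def titv(snps):
    """Transition/transversion ratio, or None when there are no transversions."""
    ti = sum(1 for ref, alt, _ in snps if (ref, alt) in TRANSITIONS)
    tv = len(snps) - ti
    return ti / tv if tv else None


def snp_summary(calls):
    """Summarise (ref, alt, in_dbsnp) SNP calls into the columns described above."""
    calls = list(calls)
    known = [c for c in calls if c[2]]
    novel = [c for c in calls if not c[2]]
    return {
        "TOTAL_SNPS": len(calls),
        "NUM_IN_DB_SNP": len(known),
        "NOVEL_SNPS": len(novel),
        "PCT_DBSNP": len(known) / len(calls) if calls else 0.0,
        "DBSNP_TITV": titv(known),
        "NOVEL_TITV": titv(novel),
    }
```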
CollectWgsMetrics.WgsMetrics

Metrics for evaluating the performance of whole genome sequencing experiments.

Column Definitions

GENOME_TERRITORY: The number of non-N bases in the genome reference over which coverage will be evaluated.
MEAN_COVERAGE: The mean coverage in bases of the genome territory, after all filters are applied.
SD_COVERAGE: The standard deviation of coverage of the genome after all filters are applied.
MEDIAN_COVERAGE: The median coverage in bases of the genome territory, after all filters are applied.
MAD_COVERAGE: The median absolute deviation of coverage of the genome after all filters are applied.
PCT_EXC_MAPQ: The fraction of aligned bases that were filtered out because they were in reads with low mapping quality (default is < 20).
PCT_EXC_DUPE: The fraction of aligned bases that were filtered out because they were in reads marked as duplicates.
PCT_EXC_UNPAIRED: The fraction of aligned bases that were filtered out because they were in reads without a mapped mate pair.
PCT_EXC_BASEQ: The fraction of aligned bases that were filtered out because they were of low base quality (default is < 20).
PCT_EXC_OVERLAP: The fraction of aligned bases that were filtered out because they were the second observation from an insert with overlapping reads.
PCT_EXC_CAPPED: The fraction of aligned bases that were filtered out because they would have raised coverage above the capped value (default cap = 250x).
PCT_EXC_TOTAL: The total fraction of aligned bases excluded due to all filters.
PCT_5X: The fraction of bases that attained at least 5X sequence coverage in post-filtering bases.
PCT_10X: The fraction of bases that attained at least 10X sequence coverage in post-filtering bases.
PCT_20X: The fraction of bases that attained at least 20X sequence coverage in post-filtering bases.
PCT_30X: The fraction of bases that attained at least 30X sequence coverage in post-filtering bases.
PCT_40X: The fraction of bases that attained at least 40X sequence coverage in post-filtering bases.
PCT_50X: The fraction of bases that attained at least 50X sequence coverage in post-filtering bases.
PCT_60X: The fraction of bases that attained at least 60X sequence coverage in post-filtering bases.
PCT_70X: The fraction of bases that attained at least 70X sequence coverage in post-filtering bases.
PCT_80X: The fraction of bases that attained at least 80X sequence coverage in post-filtering bases.
PCT_90X: The fraction of bases that attained at least 90X sequence coverage in post-filtering bases.
PCT_100X: The fraction of bases that attained at least 100X sequence coverage in post-filtering bases.
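
To make the coverage columns concrete, the sketch below derives them from a depth histogram in which index `d` holds the number of genome-territory bases covered at depth `d` after filtering. It is an illustration under assumptions, not Picard's code: the function names are hypothetical and the median is the lower weighted median, which may differ from Picard's convention when the territory has an even number of bases.

```python
def weighted_median(pairs):
    """Lower weighted median of (value, count) pairs."""
    ordered = sorted(pairs)
    total = sum(count for _, count in ordered)
    if total == 0:
        raise ValueError("histogram has no observations")
    running = 0
    for value, count in ordered:
        running += count
        if 2 * running >= total:
            return value


def wgs_coverage_summary(depth_histogram, thresholds=(5, 10, 20, 30)):
    """Summarise a depth histogram (index = depth, value = number of bases)."""
    pairs = list(enumerate(depth_histogram))
    territory = sum(depth_histogram)
    median = weighted_median(pairs)
    summary = {
        "GENOME_TERRITORY": territory,
        "MEAN_COVERAGE": sum(d * n for d, n in pairs) / territory,
        "MEDIAN_COVERAGE": median,
        # Median absolute deviation: median of |depth - median depth|.
        "MAD_COVERAGE": weighted_median((abs(d - median), n) for d, n in pairs),
    }
    for t in thresholds:
        # Fraction of territory bases with at least t-fold coverage.
        summary[f"PCT_{t}X"] = sum(n for d, n in pairs if d >= t) / territory
    return summary
```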
ContaminationMetrics.FauxMetric

A metric-like container representing a contamination metrics file entry, though the contamination file is not a metrics file.

Column Definitions

SAMPLE_ALIAS:
NUM_SNPS:
NUM_READS:
MEAN_DEPTH:
PCT_CONTAMINATION:
LL_PREDICTED_CONTAM:
LL_NO_CONTAM:
CoverageMetric

Column Definitions

UNAMBIGUOUS_BASES:
EXCLUDED_BASES:
TOTAL_BASES:
TOTAL_READS:
MAPPED_READS:
READS_PER_BASE:
COVERAGE_PER_READ_BASE:
DELETIONS_PER_READ_BASE:
INSERTIONS_PER_READ_BASE:
CLIPS_PER_READ_BASE:
MEAN_COVERAGE:
MEAN_QUALITY:
MEAN_MISMATCHES_PER_COVERAGE:
MEAN_DELETES_PER_COVERAGE:
MEAN_INSERTS_PER_COVERAGE:
MEAN_CLIPS_PER_COVERAGE:
MEAN_READ_STARTS:
VAR_READ_STARTS:
DbSnpMatchMetrics

Metrics about how genotypes called by the pipeline match up to dbSNP, created by the CollectDbSnpMatches program and usually stored in a file with the extension ".dbsnp_matches".

Column Definitions

TOTAL_SNPS: The number of high confidence SNP calls (i.e. non-reference genotypes) that were examined.
NOVEL_SNPS: The number of high confidence SNPs called that were not found in dbSNP
PCT_DBSNP: The percentage of high confidence SNPs in dbSNP
NUM_IN_DB_SNP: The number of high confidence SNPs found in dbSNP
DBSNP_TITV: The Transition/Transversion ratio of the SNP calls made at dbSNP sites.
NOVEL_TITV: The Transition/Transversion ratio of the SNP calls made at non-dbSNP sites.
DuplicationMetrics

Metrics that are calculated during the process of marking duplicates within a stream of SAMRecords.

Column Definitions

LIBRARY: The library on which the duplicate marking was performed.
UNPAIRED_READS_EXAMINED: The number of mapped reads examined which did not have a mapped mate pair, either because the read is unpaired, or the read is paired to an unmapped mate.
READ_PAIRS_EXAMINED: The number of mapped read pairs examined.
UNMAPPED_READS: The total number of unmapped reads examined.
UNPAIRED_READ_DUPLICATES: The number of fragments that were marked as duplicates.
READ_PAIR_DUPLICATES: The number of read pairs that were marked as duplicates.
READ_PAIR_OPTICAL_DUPLICATES: The number of read pair duplicates that were caused by optical duplication. Value is always < READ_PAIR_DUPLICATES, which counts all duplicates regardless of source.
PERCENT_DUPLICATION: The percentage of mapped sequence that is marked as duplicate.
ESTIMATED_LIBRARY_SIZE: The estimated number of unique molecules in the library based on PE duplication.
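
The text defines PERCENT_DUPLICATION only as the share of mapped sequence marked as duplicate. One plausible way to combine the count columns, shown here as a sketch, is to weight each read pair as two reads. That weighting, the function name and the 0.0 result for an empty input are assumptions and are not stated in the definitions above.

```python
def percent_duplication(unpaired_reads_examined, read_pairs_examined,
                        unpaired_read_duplicates, read_pair_duplicates):
    """Fraction of examined mapped reads that were marked as duplicates."""
    # Assumption: a read pair contributes two reads to both totals.
    examined = unpaired_reads_examined + 2 * read_pairs_examined
    duplicates = unpaired_read_duplicates + 2 * read_pair_duplicates
    return duplicates / examined if examined else 0.0
```

With 100 unpaired reads (10 duplicates) and 450 pairs (45 duplicate pairs), the result is 100 / 1000 = 0.1.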
ExtractIlluminaBarcodes.BarcodeMetric

Metrics produced by the ExtractIlluminaBarcodes program that is used to parse data in the basecalls directory and determine to which barcode each read should be assigned.

Column Definitions

BARCODE: The barcode (from the set of expected barcodes) for which the following metrics apply. Note that the "symbolic" barcode of NNNNNN is used to report metrics for all reads that do not match a barcode.
BARCODE_NAME:
LIBRARY_NAME:
READS: The total number of reads matching the barcode.
PF_READS: The number of PF reads matching this barcode (always less than or equal to READS).
PERFECT_MATCHES: The number of all reads matching this barcode that matched with 0 errors or no-calls.
PF_PERFECT_MATCHES: The number of PF reads matching this barcode that matched with 0 errors or no-calls.
ONE_MISMATCH_MATCHES: The number of all reads matching this barcode that matched with 1 error or no-call.
PF_ONE_MISMATCH_MATCHES: The number of PF reads matching this barcode that matched with 1 error or no-call.
PCT_MATCHES: The percentage of all reads in the lane that matched to this barcode.
RATIO_THIS_BARCODE_TO_BEST_BARCODE_PCT: The ratio of all reads matching this barcode to all reads matching the most prevalent barcode. For the most prevalent barcode this will be 1, for all others it will be less than 1 (with the possible exception of when there are more orphan reads than for any other barcode, in which case the value may be arbitrarily large). One over the lowest number in this column gives you the fold-difference in representation between barcodes.
PF_PCT_MATCHES: The percentage of PF reads in the lane that matched to this barcode.
PF_RATIO_THIS_BARCODE_TO_BEST_BARCODE_PCT: The ratio of PF reads matching this barcode to PF reads matching the most prevalent barcode. For the most prevalent barcode this will be 1, for all others it will be less than 1 (with the possible exception of when there are more orphan reads than for any other barcode, in which case the value may be arbitrarily large). One over the lowest number in this column gives you the fold-difference in representation of PF reads between barcodes.
PF_NORMALIZED_MATCHES: The "normalized" matches to each barcode. This is calculated as the number of pf reads matching this barcode over the sum of all pf reads matching any barcode (excluding orphans). If all barcodes are represented equally this will be 1.
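
The per-barcode ratios described above can be sketched as follows. The input is a hypothetical mapping from barcode to read count, with the symbolic barcode NNNNNN holding the orphan reads. Treating the "best" barcode as the most frequent expected barcode (so that the orphan row can exceed 1) is an interpretation of the text, and the function names are illustrative.

```python
def barcode_ratios(reads_by_barcode, orphan_barcode="NNNNNN"):
    """Compute PCT_MATCHES and the ratio-to-best column for each barcode."""
    total = sum(reads_by_barcode.values())
    # The most prevalent barcode is chosen among expected barcodes only.
    best = max(n for b, n in reads_by_barcode.items() if b != orphan_barcode)
    return {
        barcode: {
            "PCT_MATCHES": n / total,
            "RATIO_THIS_BARCODE_TO_BEST_BARCODE_PCT": n / best,
        }
        for barcode, n in reads_by_barcode.items()
    }


def fold_difference(ratios):
    """One over the lowest ratio gives the fold-difference in representation."""
    lowest = min(r["RATIO_THIS_BARCODE_TO_BEST_BARCODE_PCT"] for r in ratios.values())
    return 1 / lowest
```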
FingerprintDetailMetrics

A collection of metrics about how the reads in a BAM file matched up with the known genotypes for a particular fingerprint panel

Column Definitions

SNP: The name of the SNP
FINGERPRINT_GENOTYPE: The genotype in the fingerprint file
SEQUENCED_GENOTYPE: The genotype sequenced in the BAM file
LOD: The best-to-second-best LOD for the called genotype
READS: The number of reads covering the locus in the BAM file
FingerprintSummaryMetrics

A collection of metrics that summarize the match of reads in a particular BAM file against various fingerprint panels.

Column Definitions

PANEL_NAME: The name of the fingerprint panel
PANEL_SNPS: The number of SNPs contained in the panel
CONFIDENT_CALLS: The number of panel SNPs that can be confidently called in the BAM file
CONFIDENT_MATCHING_SNPS: The number of confidently matching SNPs in the BAM file
CONFIDENT_CALLED_PCT: The number of confidently called SNPs as a percentage of the total number of SNPs on the panel
CONFIDENT_MATCHING_SNPS_PCT: The number of confidently called matching SNPs as a percentage of the total number of confidently called SNPs on the panel
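
The two percentage columns are quotients of the count columns. A minimal sketch, with a hypothetical function name and with results expressed as fractions (the text does not say whether stored values are scaled by 100):

```python
def fingerprint_panel_percentages(panel_snps, confident_calls, confident_matching_snps):
    """Derive the panel percentage columns from the count columns."""
    return {
        # Confident calls relative to all SNPs on the panel.
        "CONFIDENT_CALLED_PCT": confident_calls / panel_snps if panel_snps else 0.0,
        # Confident matches relative to confident calls.
        "CONFIDENT_MATCHING_SNPS_PCT": (
            confident_matching_snps / confident_calls if confident_calls else 0.0
        ),
    }
```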
FingerprintingDetailMetrics

Detailed metrics about an individual SNP/Haplotype comparison within a fingerprint comparison.

Column Definitions

READ_GROUP: The sequencing read group from which sequence data was fingerprinted.
SAMPLE: The name of the sample whose genotypes the sequence data was compared to.
SNP: The name of a representative SNP within the haplotype that was compared. Will usually be the exact SNP that was genotyped externally.
SNP_ALLELES: The possible alleles for the SNP.
CHROM: The chromosome on which the SNP resides.
POSITION: The position of the SNP on the chromosome.
EXPECTED_GENOTYPE: The expected genotype of the sample at the SNP locus.
OBSERVED_GENOTYPE: The most likely genotype given the observed evidence at the SNP locus in the sequencing data.
LOD: The LOD score for OBSERVED_GENOTYPE vs. the next most likely genotype in the sequencing data.
OBS_A: The number of observations of the first, or A, allele of the SNP in the sequencing data.
OBS_B: The number of observations of the second, or B, allele of the SNP in the sequencing data.
FingerprintingSummaryMetrics

Summary fingerprinting metrics and statistics about the comparison of the sequence data from a single read group (lane or index within a lane) vs. a set of known genotypes for the expected sample.

Column Definitions

READ_GROUP: The read group from which sequence data was drawn for comparison.
SAMPLE: The sample whose known genotypes the sequence data was compared to.
LL_EXPECTED_SAMPLE: The Log Likelihood of the sequence data given the expected sample's genotypes.
LL_RANDOM_SAMPLE: The Log Likelihood of the sequence data given a random sample from the human population.
LOD_EXPECTED_SAMPLE: The LOD for Expected Sample vs. Random Sample. A positive LOD indicates that the sequence data is more likely to come from the expected sample vs. a random sample from the population, by LOD logs. I.e. a value of 6 indicates that the sequence data is 1,000,000 times more likely to come from the expected sample than from a random sample. A negative LOD indicates the reverse - that the sequence data is more likely to come from a random sample than from the expected sample.
HAPLOTYPES_WITH_GENOTYPES: The number of haplotypes that had expected genotypes to compare to.
HAPLOTYPES_CONFIDENTLY_CHECKED: The subset of genotyped haplotypes for which there was sufficient sequence data to confidently genotype the haplotype. Note: all haplotypes with sequence coverage contribute to the LOD score, even if they cannot be "confidently checked" individually.
HAPLOTYPES_CONFIDENTLY_MATCHING: The subset of confidently checked haplotypes that match the expected genotypes.
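
The relationship between the two log likelihoods and the LOD can be sketched as follows. This is a minimal illustration, assuming both likelihoods are base-10 logs and that the LOD is their difference; the function names are hypothetical and not part of Picard.

```python
def lod_expected_sample(ll_expected: float, ll_random: float) -> float:
    """LOD of expected vs. random sample, assuming both inputs are log10 likelihoods."""
    return ll_expected - ll_random


def likelihood_ratio(lod: float) -> float:
    """Turn a LOD (in log10 units) back into a plain likelihood ratio."""
    return 10.0 ** lod


# Hypothetical values for one read group.
lod = lod_expected_sample(ll_expected=-120.0, ll_random=-126.0)
print(f"LOD={lod:.1f}, ratio={likelihood_ratio(lod):,.0f}")
```

A negative result from lod_expected_sample would correspond to a ratio below 1, i.e. data that fits a random sample better than the expected one.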
GcBiasDetailMetrics

Class that holds detailed metrics about reads that fall within windows of a certain GC bin on the reference genome.

Column Definitions

GC: The G+C content of the reference sequence represented by this bin. Values are from 0% to 100%
WINDOWS: The number of windows on the reference genome that have this G+C content.
READ_STARTS: The number of reads whose start position is at the start of a window of this GC.
MEAN_BASE_QUALITY: The mean quality (determined via the error rate) of all bases of all reads that are assigned to windows of this GC.
NORMALIZED_COVERAGE: The ratio of "coverage" in this GC bin vs. the mean coverage of all GC bins. A number of 1 represents mean coverage, a number less than one represents lower than mean coverage (e.g. 0.5 means half as much coverage as average) while a number greater than one represents higher than mean coverage (e.g. 3.1 means this GC bin has 3.1 times more reads per window than average).
ERROR_BAR_WIDTH: The radius of error bars in this bin based on the number of observations made. For example if the normalized coverage is 0.75 and the error bar width is 0.1 then the error bars would be drawn from 0.65 to 0.85.
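
A rough sketch of how NORMALIZED_COVERAGE and the error bar range could be derived from the per-bin counts. It assumes "mean coverage" means total read starts divided by total windows; the function names are hypothetical.

```python
def normalized_coverage(read_starts, windows):
    """Reads per window in each GC bin, divided by reads per window over all bins."""
    mean_reads_per_window = sum(read_starts) / sum(windows)
    return [
        (r / w) / mean_reads_per_window if w > 0 else 0.0
        for r, w in zip(read_starts, windows)
    ]


def error_bar_range(coverage, error_bar_width):
    """ERROR_BAR_WIDTH is a radius, so the bar spans coverage +/- width."""
    return coverage - error_bar_width, coverage + error_bar_width


# Three hypothetical GC bins.
coverage = normalized_coverage(read_starts=[10, 60, 30], windows=[10, 20, 20])
low, high = error_bar_range(coverage[2], 0.1)
print(coverage, low, high)
```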
GcBiasSummaryMetrics

High level metrics that capture how biased the coverage in a certain lane is.

Column Definitions

WINDOW_SIZE: The window size on the genome used to calculate the GC of the sequence.
TOTAL_CLUSTERS: The total number of clusters that were seen in the GC bias calculation.
ALIGNED_READS: The total number of aligned reads used to compute the GC bias metrics.
AT_DROPOUT: Illumina-style AT dropout metric. Calculated by taking each GC bin independently and calculating (%ref_at_gc - %reads_at_gc) and summing all positive values for GC=[0..50].
GC_DROPOUT: Illumina-style GC dropout metric. Calculated by taking each GC bin independently and calculating (%ref_at_gc - %reads_at_gc) and summing all positive values for GC=[50..100].
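
The dropout calculation described above can be sketched as below. The input lists (one entry per GC percentage from 0 to 100, each holding the percentage of reference windows or of reads at that GC) and the function name are assumptions made for illustration.

```python
def dropout(ref_pct, reads_pct, first_bin, last_bin):
    """Sum the positive (ref - reads) differences over GC bins first_bin..last_bin."""
    total = 0.0
    for gc in range(first_bin, last_bin + 1):
        diff = ref_pct[gc] - reads_pct[gc]
        if diff > 0:
            total += diff
    return total


# Hypothetical distributions over the 101 GC bins (0% to 100%), in percent.
ref_pct = [0.0] * 101
reads_pct = [0.0] * 101
ref_pct[30], ref_pct[50], ref_pct[70] = 40.0, 20.0, 40.0
reads_pct[30], reads_pct[50], reads_pct[70] = 35.0, 30.0, 35.0

at_dropout = dropout(ref_pct, reads_pct, 0, 50)
gc_dropout = dropout(ref_pct, reads_pct, 50, 100)
print(at_dropout, gc_dropout)
```

Bins where reads are over-represented (here GC=50) contribute nothing, because only positive differences are summed.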
GenotypeConcordanceMetrics

Statistics about how well a given set of input genotypes matches a set of well known reference genotypes.

Column Definitions

CATEGORY: One of the following:
OBSERVATIONS: Number of genotypes in this category at the same locus in each of the input and reference
AGREE: Of the number of observations, how many agreed
DISAGREE: Of the number of observations, how many disagreed
PCT_CONCORDANCE: Ratio of agreed to observations
GenotypeFreeContaminationMetric

Column Definitions

STRATIFICATION: A name describing the subset of reads/bases from the bam that were included in this model
CONTAMINATION_ESTIMATE: The estimated contamination (with the greatest likelihood)
LL_CONTAMINATION_ESTIMATE: The likelihood of the putative contamination
CONFIDENCE_INTERVAL_CONTAMINATION_LOWER: The lower bound of the contamination confidence interval
CONFIDENCE_INTERVAL_CONTAMINATION_UPPER: The upper bound of the contamination confidence interval
LL_NULL_CONTAMINATION: The likelihood of the null-hypothesis level of contamination
TOTAL_BASE_DEPTH: The total number of bases inspected in this model
AVERAGE_BASE_DEPTH: The average read/base depth of each inspected locus
STDDEV_BASE_DEPTH: The standard deviation of the read/base depth of each inspected locus
GenotypeFreeLikelihoodPlotMetric

Column Definitions

STRATIFIER:
PUTATIVE_CONTAMINATION:
LIKELIHOOD:
HsMetrics

The set of metrics captured that are specific to a hybrid selection analysis.

Column Definitions

BAIT_SET: The name of the bait set used in the hybrid selection.
GENOME_SIZE: The number of bases in the reference genome used for alignment.
BAIT_TERRITORY: The number of bases which have one or more baits on top of them.
TARGET_TERRITORY: The unique number of target bases in the experiment where target is usually exons etc.
BAIT_DESIGN_EFFICIENCY: Target territory / bait territory. 1 == perfectly efficient, 0.5 == half of baited bases are not target.
TOTAL_READS: The total number of reads in the SAM or BAM file examined.
PF_READS: The number of reads that pass the vendor's filter.
PF_UNIQUE_READS: The number of PF reads that are not marked as duplicates.
PCT_PF_READS: PF reads / total reads. The percent of reads passing filter.
PCT_PF_UQ_READS: PF Unique Reads / Total Reads.
PF_UQ_READS_ALIGNED: The number of PF unique reads that are aligned with mapping score > 0 to the reference genome.
PCT_PF_UQ_READS_ALIGNED: PF Reads Aligned / PF Reads.
PF_UQ_BASES_ALIGNED: The number of bases in the PF aligned reads that are mapped to a reference base. Accounts for clipping and gaps.
ON_BAIT_BASES: The number of PF aligned bases that mapped to a baited region of the genome.
NEAR_BAIT_BASES: The number of PF aligned bases that mapped to within a fixed interval of a baited region, but not on a baited region.
OFF_BAIT_BASES: The number of PF aligned bases that mapped neither on nor near a bait.
ON_TARGET_BASES: The number of PF aligned bases that mapped to a targeted region of the genome.
PCT_SELECTED_BASES: On+Near Bait Bases / PF Bases Aligned.
PCT_OFF_BAIT: The percentage of aligned PF bases that mapped neither on nor near a bait.
ON_BAIT_VS_SELECTED: The percentage of on+near bait bases that are on as opposed to near.
MEAN_BAIT_COVERAGE: The mean coverage of all baits in the experiment.
MEAN_TARGET_COVERAGE: The mean coverage of targets that received at least coverage depth = 2 at one base.
PCT_USABLE_BASES_ON_BAIT: The number of aligned, de-duped, on-bait bases out of the PF bases available.
PCT_USABLE_BASES_ON_TARGET: The number of aligned, de-duped, on-target bases out of the PF bases available.
FOLD_ENRICHMENT: The fold by which the baited region has been amplified above genomic background.
ZERO_CVG_TARGETS_PCT: The percentage of targets that did not reach coverage=2 over any base.
FOLD_80_BASE_PENALTY: The fold over-coverage necessary to raise 80% of bases in "non-zero-cvg" targets to the mean coverage level in those targets.
PCT_TARGET_BASES_2X: The percentage of ALL target bases achieving 2X or greater coverage.
PCT_TARGET_BASES_10X: The percentage of ALL target bases achieving 10X or greater coverage.
PCT_TARGET_BASES_20X: The percentage of ALL target bases achieving 20X or greater coverage.
PCT_TARGET_BASES_30X: The percentage of ALL target bases achieving 30X or greater coverage.
PCT_TARGET_BASES_40X: The percentage of ALL target bases achieving 40X or greater coverage.
PCT_TARGET_BASES_50X: The percentage of ALL target bases achieving 50X or greater coverage.
PCT_TARGET_BASES_100X: The percentage of ALL target bases achieving 100X or greater coverage.
HS_LIBRARY_SIZE: The estimated number of unique molecules in the selected part of the library.
HS_PENALTY_10X: The "hybrid selection penalty" incurred to get 80% of target bases to 10X. This metric should be interpreted as: if I have a design with 10 megabases of target, and want to get 10X coverage I need to sequence until PF_ALIGNED_BASES = 10^7 * 10 * HS_PENALTY_10X.
HS_PENALTY_20X: The "hybrid selection penalty" incurred to get 80% of target bases to 20X. This metric should be interpreted as: if I have a design with 10 megabases of target, and want to get 20X coverage I need to sequence until PF_ALIGNED_BASES = 10^7 * 20 * HS_PENALTY_20X.
HS_PENALTY_30X: The "hybrid selection penalty" incurred to get 80% of target bases to 30X. This metric should be interpreted as: if I have a design with 10 megabases of target, and want to get 30X coverage I need to sequence until PF_ALIGNED_BASES = 10^7 * 30 * HS_PENALTY_30X.
HS_PENALTY_40X: The "hybrid selection penalty" incurred to get 80% of target bases to 40X. This metric should be interpreted as: if I have a design with 10 megabases of target, and want to get 40X coverage I need to sequence until PF_ALIGNED_BASES = 10^7 * 40 * HS_PENALTY_40X.
HS_PENALTY_50X: The "hybrid selection penalty" incurred to get 80% of target bases to 50X. This metric should be interpreted as: if I have a design with 10 megabases of target, and want to get 50X coverage I need to sequence until PF_ALIGNED_BASES = 10^7 * 50 * HS_PENALTY_50X.
HS_PENALTY_100X: The "hybrid selection penalty" incurred to get 80% of target bases to 100X. This metric should be interpreted as: if I have a design with 10 megabases of target, and want to get 100X coverage I need to sequence until PF_ALIGNED_BASES = 10^7 * 100 * HS_PENALTY_100X.
AT_DROPOUT: A measure of how undercovered <= 50% GC regions are relative to the mean. For each GC bin [0..50] we calculate a = % of target territory, and b = % of aligned reads aligned to these targets. AT DROPOUT is then sum(a-b when a-b > 0), i.e. only bins where reads are under-represented contribute. E.g. if the value is 5% this implies that 5% of total reads that should have mapped to GC<=50% regions mapped elsewhere.
GC_DROPOUT: A measure of how undercovered >= 50% GC regions are relative to the mean. For each GC bin [50..100] we calculate a = % of target territory, and b = % of aligned reads aligned to these targets. GC DROPOUT is then sum(a-b when a-b > 0), i.e. only bins where reads are under-represented contribute. E.g. if the value is 5% this implies that 5% of total reads that should have mapped to GC>=50% regions mapped elsewhere.
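
Several of the ratios above, and the way an HS_PENALTY value is meant to be used, can be written out as a short sketch. The function names are hypothetical; the formulas follow the column descriptions, with ratios returned as fractions rather than multiplied by 100.

```python
def bait_design_efficiency(target_territory, bait_territory):
    """Target territory / bait territory (1.0 means every baited base is target)."""
    return target_territory / bait_territory


def pct_selected_bases(on_bait_bases, near_bait_bases, pf_bases_aligned):
    """On+Near bait bases / PF bases aligned."""
    return (on_bait_bases + near_bait_bases) / pf_bases_aligned


def on_bait_vs_selected(on_bait_bases, near_bait_bases):
    """Share of on+near bait bases that are on bait rather than near it."""
    return on_bait_bases / (on_bait_bases + near_bait_bases)


def pf_aligned_bases_needed(target_territory, desired_coverage, hs_penalty):
    """Sequencing needed for the desired coverage, given the matching HS_PENALTY_*X value."""
    return target_territory * desired_coverage * hs_penalty


# 10 megabases of target, 20X desired, hypothetical HS_PENALTY_20X of 3.0.
print(pf_aligned_bases_needed(10**7, 20, 3.0))
```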
IlluminaBasecallingMetrics

Metric for Illumina Basecalling that stores means and standard deviations on a per-barcode per-lane basis. Means and standard deviations are taken over all tiles.

Column Definitions

LANE: The lane for which the metrics were calculated.
MOLECULAR_BARCODE_SEQUENCE_1: The barcode sequence for which the metrics were calculated.
MOLECULAR_BARCODE_NAME: The barcode name for which the metrics were calculated.
TOTAL_BASES: The total number of bases assigned to the index.
PF_BASES: The total number of passing-filter bases assigned to the index.
TOTAL_READS: The total number of reads assigned to the index.
PF_READS: The total number of passing-filter reads assigned to the index.
TOTAL_CLUSTERS: The total number of clusters assigned to the index.
PF_CLUSTERS: The total number of PF clusters assigned to the index.
MEAN_CLUSTERS_PER_TILE: The mean number of clusters per tile.
SD_CLUSTERS_PER_TILE: The standard deviation of clusters per tile.
MEAN_PCT_PF_CLUSTERS_PER_TILE: The mean percentage of PF clusters per tile.
SD_PCT_PF_CLUSTERS_PER_TILE: The standard deviation in percentage of PF clusters per tile.
MEAN_PF_CLUSTERS_PER_TILE: The mean number of PF clusters per tile.
SD_PF_CLUSTERS_PER_TILE: The standard deviation in number of PF clusters per tile.
IlluminaLaneMetrics

Embodies characteristics that describe a lane.

Column Definitions

CLUSTER_DENSITY: The number of clusters per unit area on this lane expressed in units of [cluster / mm^2].
LANE: This lane's number.
IlluminaPhasingMetrics

Metrics for Illumina Basecalling that stores median phasing and prephasing percentages on a per-template-read, per-lane basis. For each lane/template read # (i.e. FIRST, SECOND) combination we will store the median values of both the phasing and prephasing values for every tile in that lane/template read pair.

Column Definitions

LANE:
TYPE_NAME:
PHASING_APPLIED:
PREPHASING_APPLIED:
InsertSizeMetrics

Metrics about the insert size distribution of a paired-end library, created by the CollectInsertSizeMetrics program and usually written to a file with the extension ".insert_size_metrics". In addition the insert size distribution is plotted to a file with the extension ".insert_size_Histogram.pdf".

Column Definitions

MEDIAN_INSERT_SIZE: The MEDIAN insert size of all paired end reads where both ends mapped to the same chromosome.
MEDIAN_ABSOLUTE_DEVIATION: The median absolute deviation of the distribution. If the distribution is essentially normal then the standard deviation can be estimated as ~1.4826 * MAD.
MIN_INSERT_SIZE: The minimum measured insert size. This is usually 1 and not very useful as it is likely artifactual.
MAX_INSERT_SIZE: The maximum measured insert size by alignment. This is usually very high representing either an artifact or possibly the presence of a structural re-arrangement.
MEAN_INSERT_SIZE: The mean insert size of the "core" of the distribution. Artefactual outliers in the distribution often cause calculation of nonsensical mean and stdev values. To avoid this the distribution is first trimmed to a "core" distribution of +/- N median absolute deviations around the median insert size. By default N=10, but this is configurable.
STANDARD_DEVIATION: Standard deviation of insert sizes over the "core" of the distribution.
READ_PAIRS: The total number of read pairs that were examined in the entire distribution.
PAIR_ORIENTATION: The pair orientation of the reads in this data category.
WIDTH_OF_10_PERCENT: The "width" of the bins, centered around the median, that encompass 10% of all read pairs.
WIDTH_OF_20_PERCENT: The "width" of the bins, centered around the median, that encompass 20% of all read pairs.
WIDTH_OF_30_PERCENT: The "width" of the bins, centered around the median, that encompass 30% of all read pairs.
WIDTH_OF_40_PERCENT: The "width" of the bins, centered around the median, that encompass 40% of all read pairs.
WIDTH_OF_50_PERCENT: The "width" of the bins, centered around the median, that encompass 50% of all read pairs.
WIDTH_OF_60_PERCENT: The "width" of the bins, centered around the median, that encompass 60% of all read pairs.
WIDTH_OF_70_PERCENT: The "width" of the bins, centered around the median, that encompass 70% of all read pairs. This metric divided by 2 should approximate the standard deviation when the insert size distribution is a normal distribution.
WIDTH_OF_80_PERCENT: The "width" of the bins, centered around the median, that encompass 80% of all read pairs.
WIDTH_OF_90_PERCENT: The "width" of the bins, centered around the median, that encompass 90% of all read pairs.
WIDTH_OF_99_PERCENT: The "width" of the bins, centered around the median, that encompass 99% of all read pairs.
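
The trimming to a "core" distribution and the WIDTH_OF_N_PERCENT columns can be illustrated with a small sketch. It is not Picard's implementation: the function names, the use of a population standard deviation, the inclusive bin count in the width, and the ESTIMATED_SD_FROM_MAD key are all assumptions made for this example.

```python
import math
from statistics import mean, median, pstdev


def core_insert_size_stats(sizes, n_deviations=10.0):
    """Median, MAD, and mean/stdev of the core within n_deviations MADs of the median."""
    med = median(sizes)
    mad = median([abs(s - med) for s in sizes])
    core = [s for s in sizes if abs(s - med) <= n_deviations * mad]
    return {
        "MEDIAN_INSERT_SIZE": med,
        "MEDIAN_ABSOLUTE_DEVIATION": mad,
        "ESTIMATED_SD_FROM_MAD": 1.4826 * mad,  # only meaningful if roughly normal
        "MEAN_INSERT_SIZE": mean(core),
        "STANDARD_DEVIATION": pstdev(core),  # population stdev is an assumption
    }


def width_of_percent(sizes, percent):
    """Width of the smallest window centred on the median holding >= percent of pairs."""
    med = median(sizes)
    deviations = sorted(abs(s - med) for s in sizes)
    needed = max(1, math.ceil(len(sizes) * percent / 100.0))
    return 2 * deviations[needed - 1] + 1  # inclusive bin count, an assumption


sizes = [180, 190, 195, 200, 200, 205, 210, 220, 5000]  # one artifactual outlier
stats = core_insert_size_stats(sizes)
print(stats, width_of_percent(sizes, 50))
```

In this example the 5000 bp outlier falls outside the core and does not affect the mean, but it still dominates the widest percentile windows.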
InternalControlCycleMetrics

Metrics about observations of an internal control sequence in an individual cycle.

Column Definitions

INTERNAL_CONTROL: The name of the internal control sequence.
READ: The read (1 or 2) that the metrics are for.
CYCLE: The cycle number within the read that the metrics are for.
OBSERVATIONS: The number of reads observed that were matched to this internal control.
ERRORS: The number of mismatches (including no-calls) contained within the observed reads at this cycle.
ERROR_RATE: The error rate in this IC in this cycle, i.e. ERRORS/OBSERVATIONS
SUM_OF_ERROR_PROBS: The sum of the error probabilities observed - in an ideal system this should match ERRORS.
QUALITY_ESTIMATE: The ratio of SUM_OF_ERROR_PROBS to ERRORS. A number > 1 indicates that the control had fewer errors than would be predicted by the base quality scores; a number < 1 indicates more errors than expected.
REF_BASE: The reference base of the internal control at this position.
A: The number of 'A' basecalls at this cycle for this internal control.
C: The number of 'C' basecalls at this cycle for this internal control.
G: The number of 'G' basecalls at this cycle for this internal control.
T: The number of 'T' basecalls at this cycle for this internal control.
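
As a minimal sketch of how ERROR_RATE, SUM_OF_ERROR_PROBS and QUALITY_ESTIMATE relate, the code below assumes one base quality per observed read at the cycle and uses the standard Phred relation to turn qualities into error probabilities. The function names are hypothetical.

```python
def phred_to_error_prob(quality):
    """Standard Phred relation: P(error) = 10 ** (-Q / 10)."""
    return 10.0 ** (-quality / 10.0)


def cycle_metrics(base_qualities, errors):
    """Summarise one cycle from the qualities of the observed bases and the mismatch count."""
    observations = len(base_qualities)
    sum_of_error_probs = sum(phred_to_error_prob(q) for q in base_qualities)
    return {
        "OBSERVATIONS": observations,
        "ERRORS": errors,
        "ERROR_RATE": errors / observations,
        "SUM_OF_ERROR_PROBS": sum_of_error_probs,
        # Undefined when there are no errors; NaN is a choice made for this sketch.
        "QUALITY_ESTIMATE": sum_of_error_probs / errors if errors else float("nan"),
    }


# 100 hypothetical reads, all Q20 at this cycle, with 2 observed mismatches.
metrics = cycle_metrics([20] * 100, errors=2)
print(metrics)
```

Here the qualities predict about one error but two were seen, so QUALITY_ESTIMATE comes out below 1.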
InternalControlSummaryMetrics

Summary metrics about internal controls within a lane.

Column Definitions

INTERNAL_CONTROL: The name of the internal control sequence.
MATCHES: The number of reads matching this internal control.
PCT_MATCHES: The percentage of all internal control reads matching this internal control.
MEAN_READ1_ERROR_RATE: The mean error rate over read 1.
MEAN_READ2_ERROR_RATE: The mean error rate over read 2.
JumpingLibraryMetrics

High level metrics about the presence of outward- and inward-facing pairs within a SAM file generated with a jumping library, produced by the CollectJumpingLibraryMetrics program and usually stored in a file with the extension ".jump_metrics".

Column Definitions

JUMP_PAIRS: The number of outward-facing pairs in the SAM file
JUMP_DUPLICATE_PAIRS: The number of outward-facing pairs that are duplicates
JUMP_DUPLICATE_PCT: The percentage of outward-facing pairs that are marked as duplicates
JUMP_LIBRARY_SIZE: The estimated library size for outward-facing pairs
JUMP_MEAN_INSERT_SIZE: The mean insert size for outward-facing pairs
JUMP_STDEV_INSERT_SIZE: The standard deviation on the insert size for outward-facing pairs
NONJUMP_PAIRS: The number of inward-facing pairs in the SAM file
NONJUMP_DUPLICATE_PAIRS: The number of inward-facing pairs that are duplicates
NONJUMP_DUPLICATE_PCT: The percentage of inward-facing pairs that are marked as duplicates
NONJUMP_LIBRARY_SIZE: The estimated library size for inward-facing pairs
NONJUMP_MEAN_INSERT_SIZE: The mean insert size for inward-facing pairs
NONJUMP_STDEV_INSERT_SIZE: The standard deviation on the insert size for inward-facing pairs
CHIMERIC_PAIRS: The number of pairs where either (a) the ends fall on different chromosomes or (b) the insert size is greater than the maximum of 100000 or 2 times the mode of the insert size for outward-facing pairs.
FRAGMENTS: The number of fragments in the SAM file
PCT_JUMPS: The number of outward-facing pairs expressed as a percentage of the total of all outward-facing pairs, inward-facing pairs, and chimeric pairs.
PCT_NONJUMPS: The number of inward-facing pairs expressed as a percentage of the total of all outward-facing pairs, inward-facing pairs, and chimeric pairs.
PCT_CHIMERAS: The number of chimeric pairs expressed as a percentage of the total of all outward-facing pairs, inward-facing pairs, and chimeric pairs.
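
The chimera rule and the three percentages can be sketched as follows. The function names are hypothetical, the insert-size threshold is read as max(100000, 2 * mode), and the percentages are returned as fractions of the combined pair count, which is an assumption about the stored scale.

```python
def is_chimeric(same_chromosome, insert_size, jump_mode_insert_size):
    """Chimeric if ends are on different chromosomes or the insert is too large."""
    max_insert = max(100_000, 2 * jump_mode_insert_size)
    return (not same_chromosome) or abs(insert_size) > max_insert


def pair_fractions(jump_pairs, nonjump_pairs, chimeric_pairs):
    """PCT_JUMPS, PCT_NONJUMPS and PCT_CHIMERAS over the combined pair count."""
    total = jump_pairs + nonjump_pairs + chimeric_pairs
    return jump_pairs / total, nonjump_pairs / total, chimeric_pairs / total


print(is_chimeric(True, 150_000, 3_000))
print(pair_fractions(60, 30, 10))
```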
KmerMetrics

Metrics about an individual kmer in a SAM or BAM file

Column Definitions

FREQUENCY: The number of times the kmer is seen in the data set
DISTINCT_KMERS: The number of distinct kmers occurring at this frequency
PCT_DISTINCT_KMERS: The percent of distinct kmers occurring at this frequency
DISTINCT_KMERS_THIS_FREQUENCY_OR_LESS: The number of distinct kmers occurring at this frequency or less
PCT_DISTINCT_KMERS_THIS_FREQUENCY_OR_LESS: The percent of distinct kmers occurring at this frequency or less
TOTAL_KMERS: The total number of kmers seen at this frequency
PCT_TOTAL_KMERS: The percent of total kmers seen at this frequency
TOTAL_KMERS_THIS_FREQUENCY_OR_LESS: The total number of kmers occurring at this frequency or less
PCT_TOTAL_KMERS_THIS_FREQUENCY_OR_LESS: The percent of total kmers seen at this frequency or less
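
One way to picture these columns is as rows of a frequency histogram built from per-kmer counts. The sketch below assumes TOTAL_KMERS at a frequency equals that frequency times the number of distinct kmers seen that often, and reports percentages as fractions; the function name and input mapping are hypothetical.

```python
from collections import Counter


def kmer_frequency_table(kmer_counts):
    """Build one row per observed frequency from a {kmer: count} mapping."""
    by_frequency = Counter(kmer_counts.values())  # frequency -> distinct kmers
    n_distinct = sum(by_frequency.values())
    n_total = sum(f * n for f, n in by_frequency.items())
    rows, cum_distinct, cum_total = [], 0, 0
    for frequency in sorted(by_frequency):
        distinct = by_frequency[frequency]
        total = frequency * distinct  # assumed meaning of TOTAL_KMERS
        cum_distinct += distinct
        cum_total += total
        rows.append({
            "FREQUENCY": frequency,
            "DISTINCT_KMERS": distinct,
            "PCT_DISTINCT_KMERS": distinct / n_distinct,
            "DISTINCT_KMERS_THIS_FREQUENCY_OR_LESS": cum_distinct,
            "PCT_DISTINCT_KMERS_THIS_FREQUENCY_OR_LESS": cum_distinct / n_distinct,
            "TOTAL_KMERS": total,
            "PCT_TOTAL_KMERS": total / n_total,
            "TOTAL_KMERS_THIS_FREQUENCY_OR_LESS": cum_total,
            "PCT_TOTAL_KMERS_THIS_FREQUENCY_OR_LESS": cum_total / n_total,
        })
    return rows


for row in kmer_frequency_table({"AAA": 1, "AAC": 1, "ACG": 2, "CGT": 4}):
    print(row)
```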
LowPassConcordanceMetrics

Concordance statistics for a set of low coverage reads against well-known genotypes for the same sample, for the purpose of ensuring that the sample being sequenced is the sample we think it is.

Column Definitions

CATEGORY: One of the following:
REFERENCE: Number of base calls matching the reference genotypes
NON_REFERENCE: Number of base calls not matching the reference genotypes
PCT_CONCORDANCE: A number rating how these statistics match with expectations. A value of close to 1 is good. How much below 1 is bad depends on the category.
MendelianViolationMetrics

Describes the type and number of mendelian violations found within a Trio.

Column Definitions

FAMILY_ID: The family ID assigned to the trio for which these metrics are calculated.
MOTHER: The ID of the mother within the trio.
FATHER: The ID of the father within the trio.
OFFSPRING: The ID of the offspring within the trio.
OFFSPRING_SEX: The sex of the offspring.
NUM_VARIANT_SITES: The number of sites at which all relevant samples exceeded the minimum genotype quality and at least one of the samples was variant.
NUM_DIPLOID_DENOVO: The number of diploid sites at which a potential de-novo mutation was observed (i.e. both parents are hom-ref, offspring is not hom-ref).
NUM_HOMVAR_HOMVAR_HET: The number of sites at which both parents are homozygous for a non-reference allele and the offspring is heterozygous.
NUM_HOMREF_HOMVAR_HOM: The number of sites at which one parent is homozygous reference, the other homozygous variant and the offspring is homozygous.
NUM_HOM_HET_HOM: The number of sites at which one parent is homozygous, the other is heterozygous and the offspring is the alternative homozygote.
NUM_HAPLOID_DENOVO: The number of sites at which the offspring is haploid, the parent is homozygous reference and the offspring is non-reference.
NUM_HAPLOID_OTHER: The number of sites at which the offspring is haploid and exhibits a reference allele that is not present in the parent.
NUM_OTHER: The number of otherwise unclassified events.
TOTAL_MENDELIAN_VIOLATIONS: The total of all mendelian violations observed.
MotifCoverageMetric

Column Definitions

NAME:
MOTIF:
BASES:
LOCATIONS:
RELATIVE_COVERAGE:
RELATIVE_QUALITY:
RELATIVE_MISMATCHES:
RELATIVE_DELETES:
RELATIVE_INSERTIONS:
RELATIVE_CLIPS:
MultilevelMetrics

Column Definitions

SAMPLE: The sample to which these metrics apply. If null, it means they apply to all reads in the file.
LIBRARY: The library to which these metrics apply. If null, it means that the metrics were accumulated at the sample level.
READ_GROUP: The read group to which these metrics apply. If null, it means that the metrics were accumulated at the library or sample level.
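
The null conventions above imply a simple rule for working out the level at which a row was accumulated. A minimal sketch follows; the function name and the level labels it returns are hypothetical, chosen only for illustration.

```python
from typing import Optional


def accumulation_level(
    sample: Optional[str], library: Optional[str], read_group: Optional[str]
) -> str:
    """Infer the accumulation level of a metrics row from which fields are null."""
    if sample is None:
        return "ALL_READS"  # metrics apply to all reads in the file
    if library is None:
        return "SAMPLE"
    if read_group is None:
        return "LIBRARY"
    return "READ_GROUP"


print(accumulation_level("NA12878", "lib1", None))
```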
RnaSeqMetrics

Metrics about the alignment of RNA-seq reads within a SAM file to genes, produced by the CollectRnaSeqMetrics program and usually stored in a file with the extension ".rna_metrics".

Column Definitions

PF_BASES: The total number of PF bases including non-aligned reads.
PF_ALIGNED_BASES: The total number of aligned PF bases. Non-primary alignments are not counted. Bases in aligned reads that do not correspond to reference (e.g. soft clips, insertions) are not counted.
RIBOSOMAL_BASES: Number of bases in primary alignments that align to ribosomal sequence.
CODING_BASES: Number of bases in primary alignments that align to a non-UTR coding base for some gene, and not ribosomal sequence.
UTR_BASES: Number of bases in primary alignments that align to a UTR base for some gene, and not a coding base.
INTRONIC_BASES: Number of bases in primary alignments that align to an intronic base for some gene, and not a coding or UTR base.
INTERGENIC_BASES: Number of bases in primary alignments that do not align to any gene.
IGNORED_READS: Number of primary alignments that map to a sequence specified on command-line as IGNORED_SEQUENCE. These are not counted in PF_ALIGNED_BASES, CORRECT_STRAND_READS, INCORRECT_STRAND_READS, or any of the base-counting metrics. These reads are counted in PF_BASES.
CORRECT_STRAND_READS: Number of aligned reads that map to the correct strand. 0 if library is not strand-specific.
INCORRECT_STRAND_READS: Number of aligned reads that map to the incorrect strand. 0 if library is not strand-specific.
PCT_RIBOSOMAL_BASES: RIBOSOMAL_BASES / PF_ALIGNED_BASES
PCT_CODING_BASES: CODING_BASES / PF_ALIGNED_BASES
PCT_UTR_BASES: UTR_BASES / PF_ALIGNED_BASES
PCT_INTRONIC_BASES: INTRONIC_BASES / PF_ALIGNED_BASES
PCT_INTERGENIC_BASES: INTERGENIC_BASES / PF_ALIGNED_BASES
PCT_MRNA_BASES: PCT_UTR_BASES + PCT_CODING_BASES
PCT_USABLE_BASES: The number of bases mapping to mRNA (UTR plus coding) divided by the total number of PF bases.
PCT_CORRECT_STRAND_READS: CORRECT_STRAND_READS/(CORRECT_STRAND_READS + INCORRECT_STRAND_READS). 0 if library is not strand-specific.
MEDIAN_CV_COVERAGE: The median CV of coverage of the 1000 most highly expressed transcripts. Ideal value = 0.
MEDIAN_5PRIME_BIAS: The median 5 prime bias of the 1000 most highly expressed transcripts, where 5 prime bias is calculated per transcript as: mean coverage of the 5' most 100 bases divided by the mean coverage of the whole transcript.
MEDIAN_3PRIME_BIAS: The median 3 prime bias of the 1000 most highly expressed transcripts, where 3 prime bias is calculated per transcript as: mean coverage of the 3' most 100 bases divided by the mean coverage of the whole transcript.
MEDIAN_5PRIME_TO_3PRIME_BIAS: The ratio of coverage at the 5' end to the 3' end, based on the 1000 most highly expressed transcripts.
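
The derived columns above can be sketched from the raw counts as follows. This is an illustration only: the function names are hypothetical, the ratios are returned as fractions, and the per-transcript coverage list is assumed to be ordered from the 5' end to the 3' end.

```python
from statistics import mean


def rna_percentages(m):
    """Derive the PCT_* columns from a dict of raw RnaSeqMetrics counts."""
    aligned = m["PF_ALIGNED_BASES"]
    pct = {
        f"PCT_{name}": m[name] / aligned
        for name in ("RIBOSOMAL_BASES", "CODING_BASES", "UTR_BASES",
                     "INTRONIC_BASES", "INTERGENIC_BASES")
    }
    pct["PCT_MRNA_BASES"] = pct["PCT_UTR_BASES"] + pct["PCT_CODING_BASES"]
    pct["PCT_USABLE_BASES"] = (m["CODING_BASES"] + m["UTR_BASES"]) / m["PF_BASES"]
    stranded = m["CORRECT_STRAND_READS"] + m["INCORRECT_STRAND_READS"]
    # 0 when the library is not strand-specific (both counts are 0).
    pct["PCT_CORRECT_STRAND_READS"] = (
        m["CORRECT_STRAND_READS"] / stranded if stranded else 0.0
    )
    return pct


def end_bias(coverage, end, n_bases=100):
    """Mean coverage of the n_bases at one end divided by the whole-transcript mean."""
    window = coverage[:n_bases] if end == "5prime" else coverage[-n_bases:]
    return mean(window) / mean(coverage)


counts = {
    "PF_BASES": 1000, "PF_ALIGNED_BASES": 800, "RIBOSOMAL_BASES": 80,
    "CODING_BASES": 400, "UTR_BASES": 160, "INTRONIC_BASES": 120,
    "INTERGENIC_BASES": 40, "CORRECT_STRAND_READS": 0, "INCORRECT_STRAND_READS": 0,
}
print(rna_percentages(counts))
print(end_bias([2] * 100 + [1] * 300, "5prime"))
```

The MEDIAN_* columns would then be the median of the per-transcript values over the 1000 most highly expressed transcripts.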
RrbsCpgDetailMetrics

Holds information about CpG sites encountered for RRBS processing QC

Column Definitions

SEQUENCE_NAME: Sequence the CpG is seen in
POSITION: Position within the sequence of the CpG site
TOTAL_SITES: Number of times this CpG site was encountered
CONVERTED_SITES: Number of times this CpG site was converted (TG for + strand, CA for - strand)
PCT_CONVERTED: CONVERTED_SITES / TOTAL_SITES
RrbsSummaryMetrics

Holds summary statistics from RRBS processing QC

Column Definitions

READS_ALIGNED: Number of mapped reads processed
NON_CPG_BASES: Number of times a non-CpG cytosine was encountered
NON_CPG_CONVERTED_BASES: Number of times a non-CpG cytosine was converted (C->T for +, G->A for -)
PCT_NON_CPG_BASES_CONVERTED: NON_CPG_CONVERTED_BASES / NON_CPG_BASES
CPG_BASES_SEEN: Number of CpG sites encountered
CPG_BASES_CONVERTED: Number of CpG sites that were converted (TG for +, CA for -)
PCT_CPG_BASES_CONVERTED: CPG_BASES_CONVERTED / CPG_BASES_SEEN
MEAN_CPG_COVERAGE: Mean coverage of CpG sites
MEDIAN_CPG_COVERAGE: Median coverage of CpG sites
READS_WITH_NO_CPG: Number of reads discarded for having no CpG sites
READS_IGNORED_SHORT: Number of reads discarded due to being too short
READS_IGNORED_MISMATCHES: Number of reads discarded for exceeding the mismatch threshold
SamFileValidator.ValidationMetrics

Column Definitions

ScreenSamReads.ScreenSamReadsMetrics

SAM or BAM read screening Metrics

Column Definitions

START_REFERENCE: The reference of the un-screened source file.
START_BASES: The number of bases in the un-screened source data file.
PCT_START_ALIGNED: The % of mapped bases in the un-screened source data file (of start bases).
PCT_START_UNMAPPED: The % of unmapped bases in the un-screened source data file (of start bases).
POSITIVE_REFERENCE: The positive reference used during alignment of un-screened reads.
PCT_POSITIVE_ALIGNED: The % of bases that mapped to the positive reference.
PCT_POSITIVE_UNMAPPED: The % of bases that did not map to the positive reference.
NEGATIVE_REFERENCE: The negative reference used during alignment of un-screened reads.
PCT_NEGATIVE_ALIGNED: The % of bases that mapped to the negative reference.
PCT_NEGATIVE_UNMAPPED: The % of bases that did not map to the negative reference.
END_BASES: The number of bases in the screened data file.
PCT_END_ALIGNED: The % of mapped bases in the screened data file (of end bases).
PCT_END_UNMAPPED: The % of unmapped bases in the screened data file (of end bases).
PCT_PASSING_SCREEN: The % of bases that passed the screen (of start bases).
PCT_FAILING_SCREEN: The % of bases that failed the screen (of start bases).
SpikeInMetrics

Column Definitions

TOTAL_PLASMID_READS: The number of reads in the BAM file that map to plasmids
EXPECTED_PLASMID: The name of the plasmid that was spiked into the lane; all plasmid reads are expected to align to this reference.
MEDIAN_COVERAGE_EXPECTED_PLASMID: The median number of reads covering each "bin" of the expected plasmid
EXPECTED_PLASMID_COUNT: The number of reads mapping to the expected plasmid
BEST_PLASMID: The name of the plasmid to which the most reads aligned.
BEST_PLASMID_MEDIAN_COVERAGE: The median number of reads covering each "bin" of the best plasmid
BEST_PLASMID_COUNT: The number of reads mapping to the plasmid with the most reads aligned
SECOND_BEST_PLASMID: The name of the plasmid to which the second-highest number of reads aligned.
SECOND_BEST_PLASMID_MEDIAN_COVERAGE: The median number of reads covering each "bin" of the second-best plasmid
SECOND_BEST_PLASMID_COUNT: The number of reads mapping to the plasmid with the second-highest number of reads aligned.
TargetedPcrMetrics

Metrics class for targeted PCR runs such as TSCA runs

Column Definitions

CUSTOM_AMPLICON_SET: The name of the amplicon set used in this metrics collection run
GENOME_SIZE: The number of bases in the reference genome used for alignment.
AMPLICON_TERRITORY: The number of unique bases covered by the intervals of all amplicons in the amplicon set
TARGET_TERRITORY: The number of unique bases covered by the intervals of all targets that should be covered
TOTAL_READS: The total number of reads in the SAM or BAM file examined.
PF_READS: The number of reads that pass the vendor's filter.
PF_BASES: The number of bases in the SAM or BAM file to be examined
PF_UNIQUE_READS: The number of PF reads that are not marked as duplicates.
PCT_PF_READS: PF reads / total reads. The percent of reads passing filter.
PCT_PF_UQ_READS: PF Unique Reads / Total Reads.
PF_UQ_READS_ALIGNED: The number of PF unique reads that are aligned with mapping score > 0 to the reference genome.
PF_SELECTED_PAIRS: Tracks the number of read pairs that we see that are PF (used to calculate library size)
PF_SELECTED_UNIQUE_PAIRS: Tracks the number of unique PF read pairs we see (used to calculate library size)
PCT_PF_UQ_READS_ALIGNED: PF Reads Aligned / PF Reads.
PF_UQ_BASES_ALIGNED: The number of PF unique bases that are aligned with mapping score > 0 to the reference genome.
ON_AMPLICON_BASES: The number of PF aligned bases that mapped to an amplified region of the genome.
NEAR_AMPLICON_BASES: The number of PF aligned bases that mapped to within a fixed interval of an amplified region, but not on an amplified region.
OFF_AMPLICON_BASES: The number of PF aligned bases that mapped neither on nor near an amplicon.
ON_TARGET_BASES: The number of PF aligned bases that mapped to a targeted region of the genome.
ON_TARGET_FROM_PAIR_BASES: The number of PF aligned bases that are mapped in pair to a targeted region of the genome.
PCT_AMPLIFIED_BASES: On+Near Amplicon Bases / PF Bases Aligned.
PCT_OFF_AMPLICON: The percentage of aligned PF bases that mapped neither on nor near an amplicon.
ON_AMPLICON_VS_SELECTED: The percentage of on+near amplicon bases that are on as opposed to near.
MEAN_AMPLICON_COVERAGE: The mean coverage of all amplicons in the experiment.
MEAN_TARGET_COVERAGE: The mean coverage of targets that received at least coverage depth = 2 at one base.
FOLD_ENRICHMENT: The fold by which the amplicon region has been amplified above genomic background.
ZERO_CVG_TARGETS_PCT: The percentage of targets that did not reach coverage=2 over any base.
FOLD_80_BASE_PENALTY: The fold over-coverage necessary to raise 80% of bases in "non-zero-cvg" targets to the mean coverage level in those targets.
PCT_TARGET_BASES_2X: The percentage of ALL target bases achieving 2X or greater coverage.
PCT_TARGET_BASES_10X: The percentage of ALL target bases achieving 10X or greater coverage.
PCT_TARGET_BASES_20X: The percentage of ALL target bases achieving 20X or greater coverage.
PCT_TARGET_BASES_30X: The percentage of ALL target bases achieving 30X or greater coverage.
AT_DROPOUT: A measure of how undercovered <= 50% GC regions are relative to the mean. For each GC bin [0..50] we calculate a = % of target territory, and b = % of aligned reads aligned to these targets. AT DROPOUT is then sum(a-b when a-b > 0), i.e. only bins where reads are under-represented contribute. E.g. if the value is 5% this implies that 5% of total reads that should have mapped to GC<=50% regions mapped elsewhere.
GC_DROPOUT: A measure of how undercovered >= 50% GC regions are relative to the mean. For each GC bin [50..100] we calculate a = % of target territory, and b = % of aligned reads aligned to these targets. GC DROPOUT is then sum(a-b when a-b > 0), i.e. only bins where reads are under-represented contribute. E.g. if the value is 5% this implies that 5% of total reads that should have mapped to GC>=50% regions mapped elsewhere.
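
The target coverage columns can be illustrated from per-base depths. The sketch below assumes a mapping from a target name to the list of depths at each of its bases; the function name and input layout are hypothetical, and ratios are returned as fractions.

```python
from statistics import mean


def target_coverage_summary(targets, thresholds=(2, 10, 20, 30)):
    """Summarise coverage from a {target_name: [depth per base]} mapping."""
    all_depths = [d for depths in targets.values() for d in depths]
    summary = {
        f"PCT_TARGET_BASES_{t}X": sum(d >= t for d in all_depths) / len(all_depths)
        for t in thresholds
    }
    # Targets that never reach depth 2 at any base count as zero-coverage.
    zero_cvg = [name for name, depths in targets.items() if max(depths) < 2]
    summary["ZERO_CVG_TARGETS_PCT"] = len(zero_cvg) / len(targets)
    covered = [
        d for name, depths in targets.items() if name not in zero_cvg for d in depths
    ]
    summary["MEAN_TARGET_COVERAGE"] = mean(covered) if covered else 0.0
    return summary


targets = {"t1": [0, 1, 1, 0], "t2": [10, 20, 30, 40], "t3": [2, 2, 5, 9]}
print(target_coverage_summary(targets))
```

Note that the PCT_TARGET_BASES_*X values are computed over ALL target bases, including those in zero-coverage targets, while the mean is restricted to targets that reached depth 2 somewhere.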
UploadAggregationMetrics.Metrics.AggregationFauxMetric

A transient metric to encapsulate the data within the METRICS.AGGREGATION table. This class must be public to ensure visibility for reflection.

Column Definitions

ID:
PROJECT:
SAMPLE:
LIBRARY:
AGGREGATION_TYPE:
DATA_TYPE:
VERSION:
IS_LATEST:
READ_GROUP_COUNT:
CREATED_AT:
MODIFIED_AT:
WORKFLOW_START_DATE:
WORKFLOW_END_DATE:
UploadAggregationMetrics.Metrics.ForeignKeyFauxMetric

A transient metric to be merged with other metrics to augment them with a foreign key value in order to conform to database schema.

Column Definitions

AGGREGATION_ID:
UploadAggregationMetrics.Metrics.ReadGroupFauxMetric

A transient metric to encapsulate the data within the METRICS.AGGREGATION_READ_GROUP table. This class must be public to ensure visibility for reflection.

Column Definitions

MOLECULAR_BARCODE_NAME:
FLOWCELL_BARCODE:
LANE:
LIBRARY_NAME:
PAIRED_END:
UploadFour54ScreeningMetrics.Four54ScreeningMetrics

Column Definitions

SCREENING_QUERY_NAME: A logical, human-readable name that uniquely identifies the screening metrics.
DATE_CREATED: Metrics creation date
TRIMMING_DB_FASTA: The sequence trimming database containing oligo sequences. Used for clipping before alignment of un-screened reads.
START_BASES: The number of bases in the un-screened source data file.
POSITIVE_REFERENCE: The positive reference. Used during alignment of un-screened reads.
PCT_POSITIVE_ALIGNED: The % of bases that mapped to the positive reference.
PCT_POSITIVE_UNMAPPED: The % of bases that did not map to the positive reference.
NEGATIVE_REFERENCE: The negative reference. Used during alignment of un-screened reads.
PCT_NEGATIVE_ALIGNED: The % of bases that mapped to the negative reference.
PCT_NEGATIVE_UNMAPPED: The % of bases that did not map to the negative reference.
END_BASES: The number of bases in the screened data file.
PCT_PASSING_SCREEN: The % of bases that passed the screen (of start bases).
PCT_FAILING_SCREEN: The % of bases that failed the screen (of start bases).
BASS_GLOBAL_ID: The BASS Global Identifier (uniquely identifies the file in BASS)
ORGANISM:
INITIATIVE:
GSSR_BARCODE:
SAMPLE:
PROJECT:
PTP_BARCODE:
RUN_NAME:
RUN_BARCODE:
READ_GROUP_TYPE:
REGION:
SEQUENCE_KEY:
MOLECULAR_BARCODE_NAME:
MOLECULAR_BARCODE_SEQUENCE:
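
As a rough illustration of how the screening percentages above relate to the base counts, the following sketch derives PCT_PASSING_SCREEN and PCT_FAILING_SCREEN from START_BASES and END_BASES. It rests on assumptions not stated in the definitions: that END_BASES counts exactly the bases that passed the screen, that the failing percentage is the complement of the passing percentage, and that values are expressed on a 0-100 scale. The function name is hypothetical.

```python
# Hypothetical helper; assumes END_BASES is the count of bases that passed
# the screen and that passing + failing percentages sum to 100.
def screen_percentages(start_bases, end_bases):
    if start_bases <= 0:
        raise ValueError("start_bases must be positive")
    if not 0 <= end_bases <= start_bases:
        raise ValueError("end_bases must be between 0 and start_bases")
    pct_passing = 100.0 * end_bases / start_bases  # "of start bases"
    pct_failing = 100.0 - pct_passing
    return pct_passing, pct_failing


# Made-up counts for illustration only.
passing, failing = screen_percentages(start_bases=1000, end_bases=750)
print(passing, failing)
```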
UploadIlluminaScreeningMetrics.IlluminaScreeningMetrics

Column Definitions

SCREENING_QUERY_NAME: A logical, human-readable name that uniquely identifies the screening metrics.
DATE_CREATED: Metrics creation date
BASS_GLOBAL_ID: The BASS Global Identifier (uniquely identifies the file in BASS)
ORGANISM:
INITIATIVE:
GSSR_BARCODE:
SAMPLE:
PROJECT:
LIBRARY_NAME:
READ_PAIRING_TYPE:
FLOWCELL_BARCODE:
RUN_NAME:
RUN_BARCODE:
LANE:
MOLECULAR_BARCODE_NAME:
MOLECULAR_BARCODE_SEQUENCE:
VariantCallingMetricsUploader.Metric.VariantCallingAnalysisFauxMetric

A bean for the METRIC.VARIANT_CALLING_ANALYSIS database table represented as a MetricBase so that it can be uploaded via the metrics uploader.

Column Definitions

ID:
NAME:
VERSION:
BAIT_SET:
DATA_TYPE:
CREATED_AT:
MODIFIED_AT:
IS_LATEST:
VariantCallingMetricsUploader.Metric.VariantCallingAnalysisForeignKeyFauxMetric

Column Definitions

ANALYSIS_ID:
VariantCallingSampleMetadataMetric

Column Definitions

SAMPLE_ALIAS: The name of the sample being assayed
PROJECT: The project (squid project or mercury research project) for this sample (does not include data_type)