AlignmentSummaryMetrics
High level metrics about the alignment of reads within a SAM file, produced by
the CollectAlignmentSummaryMetrics program and usually stored in a file with
the extension ".alignment_summary_metrics".
Column Definitions
CATEGORY: One of either UNPAIRED (for a fragment run), FIRST_OF_PAIR when metrics are for only the
first read in a paired run, SECOND_OF_PAIR when the metrics are for only the second read
in a paired run or PAIR when the metrics are aggregated for both first and second reads
in a pair.
TOTAL_READS: The total number of reads including all PF and non-PF reads. When CATEGORY equals PAIR
this value will be 2x the number of clusters.
PF_READS: The number of PF reads where PF is defined as passing Illumina's filter.
PCT_PF_READS: The percentage of reads that are PF (PF_READS / TOTAL_READS)
PF_NOISE_READS: The number of PF reads that are marked as noise reads. A noise read is one which is composed
entirely of A bases and/or N bases. These reads are marked as they are usually artifactual and
are of no use in downstream analysis.
PF_READS_ALIGNED: The number of PF reads that were aligned to the reference sequence. This includes reads that
aligned with low quality (i.e. their alignments are ambiguous).
PCT_PF_READS_ALIGNED: The percentage of PF reads that aligned to the reference sequence. PF_READS_ALIGNED / PF_READS
PF_ALIGNED_BASES: The total number of aligned bases, in all mapped PF reads, that are aligned to the reference sequence.
PF_HQ_ALIGNED_READS: The number of PF reads that were aligned to the reference sequence with a mapping quality of
Q20 or higher signifying that the aligner estimates a 1/100 (or smaller) chance that the
alignment is wrong.
PF_HQ_ALIGNED_BASES: The number of bases aligned to the reference sequence in reads that were mapped at high
quality. Will usually approximate PF_HQ_ALIGNED_READS * READ_LENGTH but may differ when
either mixed read lengths are present or many reads are aligned with gaps.
PF_HQ_ALIGNED_Q20_BASES: The subset of PF_HQ_ALIGNED_BASES where the base call quality was Q20 or higher.
PF_HQ_MEDIAN_MISMATCHES: The median number of mismatches versus the reference sequence in reads that were aligned
to the reference at high quality (i.e. PF_HQ_ALIGNED_READS).
PF_MISMATCH_RATE: The rate of bases mismatching the reference for all bases aligned to the reference sequence.
PF_HQ_ERROR_RATE: The percentage of bases that mismatch the reference in PF HQ aligned reads.
PF_INDEL_RATE: The number of insertion and deletion events per 100 aligned bases. Uses the number of events
as the numerator, not the number of inserted or deleted bases.
MEAN_READ_LENGTH: The mean read length of the set of reads examined. When looking at the data for a single lane with
equal length reads this number is just the read length. When looking at data for merged lanes with
differing read lengths this is the mean read length of all reads.
READS_ALIGNED_IN_PAIRS: The number of aligned reads whose mate pair was also aligned to the reference.
PCT_READS_ALIGNED_IN_PAIRS: The percentage of reads whose mate pair was also aligned to the reference.
READS_ALIGNED_IN_PAIRS / PF_READS_ALIGNED
BAD_CYCLES: The number of instrument cycles in which 80% or more of base calls were no-calls.
STRAND_BALANCE: The number of PF reads aligned to the positive strand of the genome divided by the number of
PF reads aligned to the genome.
PCT_CHIMERAS: The percentage of reads that map outside of a maximum insert size (usually 100kb) or that have
the two ends mapping to different chromosomes.
PCT_ADAPTER: The percentage of PF reads that are unaligned and match to a known adapter sequence right from the
start of the read.
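The ratio columns above follow directly from the count columns, and the Q20 threshold maps to an error probability through the usual phred scale. The following is a minimal Python sketch, assuming the counts have already been read into plain integers. The function and argument names are illustrative and are not part of the CollectAlignmentSummaryMetrics program. Values are returned as fractions in [0, 1], as the formulas above imply.

```python
def phred_to_error_probability(q):
    """Convert a phred-scaled quality (e.g. a mapping quality) to an error probability."""
    return 10 ** (-q / 10)


def alignment_ratios(total_reads, pf_reads, pf_reads_aligned, reads_aligned_in_pairs):
    """Recompute the ratio columns from the count columns.

    A zero denominator yields 0.0 here; that choice is an assumption of this sketch.
    """
    def ratio(numerator, denominator):
        return numerator / denominator if denominator else 0.0

    return {
        "PCT_PF_READS": ratio(pf_reads, total_reads),
        "PCT_PF_READS_ALIGNED": ratio(pf_reads_aligned, pf_reads),
        "PCT_READS_ALIGNED_IN_PAIRS": ratio(reads_aligned_in_pairs, pf_reads_aligned),
    }
```

For example, phred_to_error_probability(20) gives the 1/100 chance of a wrong alignment quoted for PF_HQ_ALIGNED_READS.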
CollectIonTorrentBaseCallMetrics.IonBaseCallRunMetric
Metrics to describe an Ion Torrent basecalling run, including metrics about the first 2 test fragments.
Column Definitions
RUN_NAME: The full name of the Ion run consisting of run date (analysis date?), operator, and PGM run number.
TOTAL_NUM_BASES: Number of filtered and trimmed base pairs reported in the SFF and FASTQ files.
NUM_Q17_BASES: Number of bases with predicted quality of Q17 or greater.
NUM_Q20_BASES: Number of bases with predicted quality of Q20 or greater.
TOTAL_NUM_READS: Total number of filtered and trimmed reads independent of length reported in the SFF and FASTQ files.
MEAN_READ_LENGTH: Average length, in base pairs, of all filtered and trimmed library reads reported in the SFF and FASTQ files.
LONGEST_READ: Maximum length, in base pairs, of all filtered and trimmed library reads reported in the file.
LIBRARY_CF: percentage of reads affected by Carry Forward events?
LIBRARY_IE: percentage of reads affected by Incomplete Extension?
LIBRARY_DR:
LIBRARY_SNR:
NUMBER_AMBIGUOUS:
NUMBER_DUD:
NUMBER_TF: Number of Test Fragment Reads (defined by key sequence) before filtering and trimming
NUMBER_LIB: Number of Library Reads (defined by key sequence TCAG) before filtering and trimming
KEYPASS_ALL_BEADS: Total number of reads with any key sequence before filtering and trimming
TOTAL_ADDRESSABLE_WELLS: Total number of addressable wells
WELLS_WITH_ISPS: Number of wells that were determined to be "positive" for the presence of an ISP within the well.
"Positive" is determined by measuring the diffusion rate of a flow with a different pH. Wells containing ISPs
have a delayed pH change due to the presence of an ISP slowing the detection of the pH change from the solution.
PCT_WELLS_WITH_ISPS: Percent of Addressable wells loaded with a bead (WELLS_WITH_ISPS/TOTAL_ADDRESSABLE_WELLS)
LIVE_ISPS: Number of wells that contained an ISP with a signal of sufficient strength and composition to be associated
with the library or Test Fragment key. This value is the sum of the following categories:
Test Fragment
Library
PCT_LIVE_ISPS: Percent of wells with ISPs that have live ISPs (LIVE_ISPS/WELLS_WITH_ISPS)
TEST_FRAGMENT_ISPS: Number of Live ISPs with a key signal that was identical to the Test Fragment key signal.
PCT_TEST_FRAGMENT_ISPS: Percent of live ISPs that are test fragments
LIBRARY_ISPS: Number of Live ISPs that have a key signal identical to the library key signal. These reads are input into
the Library filtering process.
PCT_LIBRARY_ISPS: Percent of live ISPs that are library
FLTRD_TOO_SHORT:
PCT_FLTRD_TOO_SHORT:
FLTRD_KEYPASS_FAILURE:
PCT_FLTRD_KEYPASS_FAILURE:
FLTRD_LOW_SIGNAL:
PCT_FLTRD_LOW_SIGNAL:
FLTRD_POOR_SIGNAL_PROFILE:
PCT_FLTRD_POOR_SIGNAL_PROFILE:
FLTRD_3_PRIME_ADAPTER_TRIM:
PCT_FLTRD_3_PRIME_ADAPTER_TRIM:
FLTRD_3_PRIME_QUAL_TRIM:
PCT_FLTRD_3_PRIME_QUAL_TRIM:
FINAL_LIBRARY_READS: Number of Library reads passing all filters, which are recorded in the SFF and FASTQ files.
PCT_FINAL_LIBRARY_READS: Percentage of library reads passing all filters, which are recorded in the SFF and FASTQ files.
CHIP_CHECK: A series of tests on reference wells (about 10% of the chip in non-addressable areas) is performed to ensure
that the chip is functioning at a basic level. The value of this field is either Passed or Failed.
CHIP_TYPE: Chip type (314,316,318)
FLOW_ORDER: Nucleotide flow order
LIBRARY_KEY: A short known sequence of bases used to distinguish the library fragment from the Test Fragment. Example: "TCAG"
ANALYSIS_VERSION: Version of the Analysis Pipeline used to generate the analysis.
DBREPORTS_VERSION: Version of the ion-dbreports package.
TF1_NAME: Name of 1st TF
TF1_Q10_MEAN: Mean read length of all Q10 or greater Test Fragments (type 1)
TF1_Q17_MEAN: Mean read length of all Q17 or greater Test Fragments (type 1)
TF1_Q10_MODE: Mode of read lengths of all Q10 or greater Test Fragments (type 1)
TF1_Q17_MODE: Mode of read lengths of all Q17 or greater Test Fragments (type 1)
TF1_SYSTEM_SNR:
TF1_50Q10_READS: Number of Test Fragments (type1) that at 50bp have a quality score of Q10 or greater
TF1_50Q17_READS: Number of Test Fragments (type1) that at 50bp have a quality score of Q17 or greater
TF1_KEYPASS_READS: Total number of type 1 TF reads
TF1_CF: Percent of TF type 1 reads affected by Carry Forward
TF1_IE: Percent of TF type 1 reads affected by Incomplete Extension
TF1_DR:
TF1_KEY_PEAK_COUNTS:
TF2_NAME: Name of 2nd TF
TF2_Q10_MEAN: Mean read length of all Q10 or greater Test Fragments (type 2)
TF2_Q17_MEAN: Mean read length of all Q17 or greater Test Fragments (type 2)
TF2_Q10_MODE: Mode of read lengths of all Q10 or greater Test Fragments (type 2)
TF2_Q17_MODE: Mode of read lengths of all Q17 or greater Test Fragments (type 2)
TF2_SYSTEM_SNR:
TF2_50Q10_READS: Number of Test Fragments (type2) that at 50bp have a quality score of Q10 or greater
TF2_50Q17_READS: Number of Test Fragments (type2) that at 50bp have a quality score of Q17 or greater
TF2_KEYPASS_READS: Total number of type 2 TF reads
TF2_CF: Percent of TF type 2 reads affected by Carry Forward
TF2_IE: Percent of TF type 2 reads affected by Incomplete Extension
TF2_DR:
TF2_KEY_PEAK_COUNTS:
ION_RUN_ID:
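The ISP loading columns are simple ratios of the well counts. A small Python sketch of those ratios follows, assuming the counts are available as integers. The function name is hypothetical, and whether the real columns are scaled to 0-100 is not stated above, so this sketch returns fractions.

```python
def isp_loading(total_addressable_wells, wells_with_isps, test_fragment_isps, library_isps):
    """Derive the ISP ratio columns from the well counts described above."""
    # LIVE_ISPS is described as the sum of the Test Fragment and Library categories.
    live_isps = test_fragment_isps + library_isps
    return {
        "LIVE_ISPS": live_isps,
        "PCT_WELLS_WITH_ISPS": wells_with_isps / total_addressable_wells,
        "PCT_LIVE_ISPS": live_isps / wells_with_isps,
        "PCT_TEST_FRAGMENT_ISPS": test_fragment_isps / live_isps,
        "PCT_LIBRARY_ISPS": library_isps / live_isps,
    }
```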
CollectOxoGMetrics.CpcgMetrics
Metrics class for outputs.
Column Definitions
SAMPLE_ALIAS: The name of the sample being assayed.
LIBRARY: The name of the library being assayed.
CONTEXT: The sequence context being reported on.
TOTAL_SITES: The total number of sites that had at least one base covering them.
TOTAL_BASES: The total number of basecalls observed at all sites.
REF_NONOXO_BASES: The number of reference alleles observed as C in read 1 and G in read 2.
REF_OXO_BASES: The number of reference alleles observed as G in read 1 and C in read 2.
REF_TOTAL_BASES: The total number of reference alleles observed
ALT_NONOXO_BASES: The count of observed A basecalls at C reference positions and T basecalls
at G reference bases that are correlated to instrument read number in a way
that rules out oxidation as the cause
ALT_OXO_BASES: The count of observed A basecalls at C reference positions and T basecalls
at G reference bases that are correlated to instrument read number in a way
that is consistent with oxidative damage.
OXIDATION_ERROR_RATE: The oxo error rate, calculated as max(ALT_OXO_BASES - ALT_NONOXO_BASES, 1) / TOTAL_BASES
OXIDATION_Q: -10 * log10(OXIDATION_ERROR_RATE)
C_REF_REF_BASES: The number of ref basecalls observed at sites where the genome reference == C.
G_REF_REF_BASES: The number of ref basecalls observed at sites where the genome reference == G.
C_REF_ALT_BASES: The number of alt (A/T) basecalls observed at sites where the genome reference == C.
G_REF_ALT_BASES: The number of alt (A/T) basecalls observed at sites where the genome reference == G.
C_REF_OXO_ERROR_RATE: The rate at which C>A and G>T substitutions are observed at C reference sites above the expected rate if there
were no bias between sites with a C reference base vs. a G reference base.
C_REF_OXO_Q: C_REF_OXO_ERROR_RATE expressed as a phred-scaled quality score.
G_REF_OXO_ERROR_RATE: The rate at which C>A and G>T substitutions are observed at G reference sites above the expected rate if there
were no bias between sites with a C reference base vs. a G reference base.
G_REF_OXO_Q: G_REF_OXO_ERROR_RATE expressed as a phred-scaled quality score.
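The OXIDATION_ERROR_RATE and OXIDATION_Q formulas above can be written out as a short Python sketch. The function names are illustrative only; the arithmetic is taken from the column definitions.

```python
import math


def oxidation_error_rate(alt_oxo_bases, alt_nonoxo_bases, total_bases):
    """max(ALT_OXO_BASES - ALT_NONOXO_BASES, 1) / TOTAL_BASES, as defined above."""
    # The floor of 1 keeps the rate positive so that the phred score stays finite.
    return max(alt_oxo_bases - alt_nonoxo_bases, 1) / total_bases


def oxidation_q(error_rate):
    """Phred-scale an error rate: -10 * log10(rate)."""
    return -10 * math.log10(error_rate)
```

With 110 oxo-consistent and 10 non-oxo alt bases out of 1,000,000 total bases, the rate is 1e-4 and the corresponding Q score is 40.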
CollectVariantCallingMetrics.VariantCallingSummaryMetrics
A collection of metrics relating to SNPs and indels within a variant-calling file (VCF).
Column Definitions
TOTAL_SNPS: The number of high confidence SNP calls (i.e. non-reference genotypes) that were examined
NUM_IN_DB_SNP: The number of high confidence SNPs found in dbSNP
NOVEL_SNPS: The number of high confidence SNPs called that were not found in dbSNP
FILTERED_SNPS: The number of SNPs that are also filtered
PCT_DBSNP: The percentage of high confidence SNPs in dbSNP
DBSNP_TITV: The Transition/Transversion ratio of the SNP calls made at dbSNP sites
NOVEL_TITV: The Transition/Transversion ratio of the SNP calls made at non-dbSNP sites
TOTAL_INDELS: The number of high confidence Indel calls that were examined
NOVEL_INDELS: The number of high confidence Indels called that were not found in dbSNP
FILTERED_INDELS: The number of indels that are also filtered
PCT_DBSNP_INDELS: The percentage of high confidence Indels in dbSNP
NUM_IN_DB_SNP_INDELS: The number of high confidence Indels found in dbSNP
DBSNP_INS_DEL_RATIO: The Insertion/Deletion ratio of the Indel calls made at dbSNP sites
NOVEL_INS_DEL_RATIO: The Insertion/Deletion ratio of the Indel calls made at non-dbSNP sites
TOTAL_MULTIALLELIC_SNPS: The number of high confidence multiallelic SNP calls that were examined
NUM_IN_DB_SNP_MULTIALLELIC: The number of high confidence multiallelic SNPs found in dbSNP
TOTAL_COMPLEX_INDELS: The number of high confidence complex Indel calls that were examined
NUM_IN_DB_SNP_COMPLEX_INDELS: The number of high confidence complex Indels found in dbSNP
SNP_REFERENCE_BIAS: The rate at which reference bases are observed at ref/alt heterozygous SNP sites.
NUM_SINGLETONS: For summary metrics, the number of variants that appear in only one sample.
For detail metrics, the number of variants that appear only in the current sample.
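As an illustration of the dbSNP and Transition/Transversion columns, here is a Python sketch that assumes SNPs are given as (ref, alt) single-base pairs. The helper names are hypothetical and do not come from CollectVariantCallingMetrics. A transition is a purine-to-purine (A/G) or pyrimidine-to-pyrimidine (C/T) change; every other substitution is a transversion.

```python
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def titv_ratio(snps):
    """Transition/Transversion ratio for an iterable of (ref, alt) base pairs."""
    snps = [(ref.upper(), alt.upper()) for ref, alt in snps]
    transitions = sum(1 for pair in snps if pair in TRANSITIONS)
    transversions = len(snps) - transitions
    # With no transversions the ratio is undefined; infinity is this sketch's choice.
    return transitions / transversions if transversions else float("inf")


def dbsnp_summary(total_snps, num_in_db_snp):
    """Derive NOVEL_SNPS and PCT_DBSNP (as a fraction) from the two counts."""
    return {
        "NOVEL_SNPS": total_snps - num_in_db_snp,
        "PCT_DBSNP": num_in_db_snp / total_snps if total_snps else 0.0,
    }
```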
CollectWgsMetrics.WgsMetrics
Metrics for evaluating the performance of whole genome sequencing experiments.
Column Definitions
GENOME_TERRITORY: The number of non-N bases in the genome reference over which coverage will be evaluated.
MEAN_COVERAGE: The mean coverage in bases of the genome territory, after all filters are applied.
SD_COVERAGE: The standard deviation of coverage of the genome after all filters are applied.
MEDIAN_COVERAGE: The median coverage in bases of the genome territory, after all filters are applied.
MAD_COVERAGE: The median absolute deviation of coverage of the genome after all filters are applied.
PCT_EXC_MAPQ: The fraction of aligned bases that were filtered out because they were in reads with low mapping quality (default is < 20).
PCT_EXC_DUPE: The fraction of aligned bases that were filtered out because they were in reads marked as duplicates.
PCT_EXC_UNPAIRED: The fraction of aligned bases that were filtered out because they were in reads without a mapped mate pair.
PCT_EXC_BASEQ: The fraction of aligned bases that were filtered out because they were of low base quality (default is < 20).
PCT_EXC_OVERLAP: The fraction of aligned bases that were filtered out because they were the second observation from an insert with overlapping reads.
PCT_EXC_CAPPED: The fraction of aligned bases that were filtered out because they would have raised coverage above the capped value (default cap = 250x).
PCT_EXC_TOTAL: The total fraction of aligned bases excluded due to all filters.
PCT_5X: The fraction of bases that attained at least 5X sequence coverage in post-filtering bases.
PCT_10X: The fraction of bases that attained at least 10X sequence coverage in post-filtering bases.
PCT_20X: The fraction of bases that attained at least 20X sequence coverage in post-filtering bases.
PCT_30X: The fraction of bases that attained at least 30X sequence coverage in post-filtering bases.
PCT_40X: The fraction of bases that attained at least 40X sequence coverage in post-filtering bases.
PCT_50X: The fraction of bases that attained at least 50X sequence coverage in post-filtering bases.
PCT_60X: The fraction of bases that attained at least 60X sequence coverage in post-filtering bases.
PCT_70X: The fraction of bases that attained at least 70X sequence coverage in post-filtering bases.
PCT_80X: The fraction of bases that attained at least 80X sequence coverage in post-filtering bases.
PCT_90X: The fraction of bases that attained at least 90X sequence coverage in post-filtering bases.
PCT_100X: The fraction of bases that attained at least 100X sequence coverage in post-filtering bases.
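The coverage statistics above can be illustrated with a Python sketch that starts from a list of per-base depths over the genome territory, assumed to be already filtered. This is not how CollectWgsMetrics is implemented; the function name and the input representation are assumptions made for illustration.

```python
import statistics


def coverage_summary(depths, thresholds=(5, 10, 20, 30)):
    """Summarize per-base coverage: mean, median, MAD and the fraction of bases at >= N X."""
    depths = list(depths)
    median = statistics.median(depths)
    summary = {
        "MEAN_COVERAGE": statistics.fmean(depths),
        "MEDIAN_COVERAGE": median,
        # Median absolute deviation: the median of |depth - median|.
        "MAD_COVERAGE": statistics.median([abs(d - median) for d in depths]),
    }
    for threshold in thresholds:
        at_or_above = sum(1 for d in depths if d >= threshold)
        summary[f"PCT_{threshold}X"] = at_or_above / len(depths)
    return summary
```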
ExtractIlluminaBarcodes.BarcodeMetric
Metrics produced by the ExtractIlluminaBarcodes program that is used to parse data in
the basecalls directory and determine to which barcode each read should be assigned.
Column Definitions
BARCODE: The barcode (from the set of expected barcodes) for which the following metrics apply.
Note that the "symbolic" barcode of NNNNNN is used to report metrics for all reads that
do not match a barcode.
BARCODE_NAME:
LIBRARY_NAME:
READS: The total number of reads matching the barcode.
PF_READS: The number of PF reads matching this barcode (always less than or equal to READS).
PERFECT_MATCHES: The number of all reads matching this barcode that matched with 0 errors or no-calls.
PF_PERFECT_MATCHES: The number of PF reads matching this barcode that matched with 0 errors or no-calls.
ONE_MISMATCH_MATCHES: The number of all reads matching this barcode that matched with 1 error or no-call.
PF_ONE_MISMATCH_MATCHES: The number of PF reads matching this barcode that matched with 1 error or no-call.
PCT_MATCHES: The percentage of all reads in the lane that matched to this barcode.
RATIO_THIS_BARCODE_TO_BEST_BARCODE_PCT: The ratio of all reads matching this barcode to all reads matching the most prevalent barcode. For the
most prevalent barcode this will be 1, for all others it will be less than 1 (except for the possible
exception of when there are more orphan reads than for any other barcode, in which case the value
may be arbitrarily large). One over the lowest number in this column gives you the fold-difference
in representation between barcodes.
PF_PCT_MATCHES: The percentage of PF reads in the lane that matched to this barcode.
PF_RATIO_THIS_BARCODE_TO_BEST_BARCODE_PCT: The ratio of PF reads matching this barcode to PF reads matching the most prevalent barcode. For the
most prevalent barcode this will be 1, for all others it will be less than 1 (except for the possible
exception of when there are more orphan reads than for any other barcode, in which case the value
may be arbitrarily large). One over the lowest number in this column gives you the fold-difference
in representation of PF reads between barcodes.
PF_NORMALIZED_MATCHES: The "normalized" matches to each barcode. This is calculated as the number of pf reads matching
this barcode over the sum of all pf reads matching any barcode (excluding orphans). If all barcodes
are represented equally this will be 1.
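The ratio-to-best columns and the fold-difference reading can be sketched in Python as follows. The sketch assumes read counts are held in a dict keyed by barcode sequence and that the orphan bucket is the all-N barcode; the function names are illustrative and are not part of ExtractIlluminaBarcodes.

```python
def is_orphan(barcode):
    """True for the symbolic all-N barcode that collects unmatched reads."""
    return set(barcode.upper()) == {"N"}


def ratio_to_best(reads_by_barcode):
    """Ratio of each barcode's reads to the most prevalent expected barcode.

    The orphan bucket is excluded when choosing the best barcode, so its own
    ratio can exceed 1. Assumes at least one expected barcode has reads.
    """
    best = max(n for bc, n in reads_by_barcode.items() if not is_orphan(bc))
    return {bc: n / best for bc, n in reads_by_barcode.items()}


def fold_difference(ratios):
    """One over the lowest ratio among expected barcodes (orphans ignored here)."""
    lowest = min(r for bc, r in ratios.items() if not is_orphan(bc))
    return 1 / lowest
```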
FingerprintingDetailMetrics
Detailed metrics about an individual SNP/Haplotype comparison within a fingerprint comparison.
Column Definitions
READ_GROUP: The sequencing read group from which sequence data was fingerprinted.
SAMPLE: The name of the sample whose genotypes the sequence data was compared to.
SNP: The name of a representative SNP within the haplotype that was compared. Will usually be the
exact SNP that was genotyped externally.
SNP_ALLELES: The possible alleles for the SNP.
CHROM: The chromosome on which the SNP resides.
POSITION: The position of the SNP on the chromosome.
EXPECTED_GENOTYPE: The expected genotype of the sample at the SNP locus.
OBSERVED_GENOTYPE: The most likely genotype given the observed evidence at the SNP locus in the sequencing data.
LOD: The LOD score for OBSERVED_GENOTYPE vs. the next most likely genotype in the sequencing data.
OBS_A: The number of observations of the first, or A, allele of the SNP in the sequencing data.
OBS_B: The number of observations of the second, or B, allele of the SNP in the sequencing data.
FingerprintingSummaryMetrics
Summary fingerprinting metrics and statistics about the comparison of the sequence data
from a single read group (lane or index within a lane) vs. a set of known genotypes for
the expected sample.
Column Definitions
READ_GROUP: The read group from which sequence data was drawn for comparison.
SAMPLE: The sample whose known genotypes the sequence data was compared to.
LL_EXPECTED_SAMPLE: The Log Likelihood of the sequence data given the expected sample's genotypes.
LL_RANDOM_SAMPLE: The Log Likelihood of the sequence data given a random sample from the human population.
LOD_EXPECTED_SAMPLE: The LOD for Expected Sample vs. Random Sample. A positive LOD indicates that the sequence data
is more likely to come from the expected sample vs. a random sample from the population, by LOD logs.
I.e. a value of 6 indicates that the sequence data is 1,000,000 times more likely to come from the expected
sample than from a random sample. A negative LOD indicates the reverse - that the sequence data is more
likely to come from a random sample than from the expected sample.
HAPLOTYPES_WITH_GENOTYPES: The number of haplotypes that had expected genotypes to compare to.
HAPLOTYPES_CONFIDENTLY_CHECKED: The subset of genotyped haplotypes for which there was sufficient sequence data to
confidently genotype the haplotype. Note: all haplotypes with sequence coverage contribute to the
LOD score, even if they cannot be "confidently checked" individually.
HAPLOTYPES_CONFIDENTLY_MATCHING: The subset of confidently checked haplotypes that match the expected genotypes.
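The relationship between the log likelihoods and the LOD can be made concrete with a short Python sketch. It assumes the log likelihoods are base-10 and that the LOD is their difference, which is consistent with the statement above that a LOD of 6 means 1,000,000 times more likely. The function names are hypothetical.

```python
def lod_expected_sample(ll_expected_sample, ll_random_sample):
    """LOD of expected vs. random sample, assuming base-10 log likelihoods."""
    return ll_expected_sample - ll_random_sample


def likelihood_ratio(lod):
    """How many times more likely the expected sample is than a random sample."""
    return 10 ** lod
```

A negative LOD gives a ratio below 1, i.e. the data is more likely to come from a random sample.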
HsMetrics
The set of metrics captured that are specific to a hybrid selection analysis.
Column Definitions
BAIT_SET: The name of the bait set used in the hybrid selection.
GENOME_SIZE: The number of bases in the reference genome used for alignment.
BAIT_TERRITORY: The number of bases which have one or more baits on top of them.
TARGET_TERRITORY: The unique number of target bases in the experiment where target is usually exons etc.
BAIT_DESIGN_EFFICIENCY: Target territory / bait territory. 1 = perfectly efficient, 0.5 = half of baited bases are not target.
TOTAL_READS: The total number of reads in the SAM or BAM file examined.
PF_READS: The number of reads that pass the vendor's filter.
PF_UNIQUE_READS: The number of PF reads that are not marked as duplicates.
PCT_PF_READS: PF reads / total reads. The percent of reads passing filter.
PCT_PF_UQ_READS: PF Unique Reads / Total Reads.
PF_UQ_READS_ALIGNED: The number of PF unique reads that are aligned with mapping score > 0 to the reference genome.
PCT_PF_UQ_READS_ALIGNED: PF Reads Aligned / PF Reads.
PF_UQ_BASES_ALIGNED: The number of bases in the PF aligned reads that are mapped to a reference base. Accounts for clipping and gaps.
ON_BAIT_BASES: The number of PF aligned bases that mapped to a baited region of the genome.
NEAR_BAIT_BASES: The number of PF aligned bases that mapped to within a fixed interval of a baited region, but not on a baited region.
OFF_BAIT_BASES: The number of PF aligned bases that mapped neither on nor near a bait.
ON_TARGET_BASES: The number of PF aligned bases that mapped to a targeted region of the genome.
PCT_SELECTED_BASES: On+Near Bait Bases / PF Bases Aligned.
PCT_OFF_BAIT: The percentage of aligned PF bases that mapped neither on nor near a bait.
ON_BAIT_VS_SELECTED: The percentage of on+near bait bases that are on as opposed to near.
MEAN_BAIT_COVERAGE: The mean coverage of all baits in the experiment.
MEAN_TARGET_COVERAGE: The mean coverage of targets that received a coverage depth of at least 2 at one or more bases.
PCT_USABLE_BASES_ON_BAIT: The number of aligned, de-duped, on-bait bases out of the PF bases available.
PCT_USABLE_BASES_ON_TARGET: The number of aligned, de-duped, on-target bases out of the PF bases available.
FOLD_ENRICHMENT: The fold by which the baited region has been amplified above genomic background.
ZERO_CVG_TARGETS_PCT: The number of targets that did not reach coverage=2 over any base.
FOLD_80_BASE_PENALTY: The fold over-coverage necessary to raise 80% of bases in "non-zero-cvg" targets to
the mean coverage level in those targets.
PCT_TARGET_BASES_2X: The percentage of ALL target bases achieving 2X or greater coverage.
PCT_TARGET_BASES_10X: The percentage of ALL target bases achieving 10X or greater coverage.
PCT_TARGET_BASES_20X: The percentage of ALL target bases achieving 20X or greater coverage.
PCT_TARGET_BASES_30X: The percentage of ALL target bases achieving 30X or greater coverage.
PCT_TARGET_BASES_40X: The percentage of ALL target bases achieving 40X or greater coverage.
PCT_TARGET_BASES_50X: The percentage of ALL target bases achieving 50X or greater coverage.
PCT_TARGET_BASES_100X: The percentage of ALL target bases achieving 100X or greater coverage.
HS_LIBRARY_SIZE: The estimated number of unique molecules in the selected part of the library.
HS_PENALTY_10X: The "hybrid selection penalty" incurred to get 80% of target bases to 10X. This metric
should be interpreted as: if I have a design with 10 megabases of target, and want to get
10X coverage I need to sequence until PF_ALIGNED_BASES = 10^7 * 10 * HS_PENALTY_10X.
HS_PENALTY_20X: The "hybrid selection penalty" incurred to get 80% of target bases to 20X. This metric
should be interpreted as: if I have a design with 10 megabases of target, and want to get
20X coverage I need to sequence until PF_ALIGNED_BASES = 10^7 * 20 * HS_PENALTY_20X.
HS_PENALTY_30X: The "hybrid selection penalty" incurred to get 80% of target bases to 30X. This metric
should be interpreted as: if I have a design with 10 megabases of target, and want to get
30X coverage I need to sequence until PF_ALIGNED_BASES = 10^7 * 30 * HS_PENALTY_30X.
HS_PENALTY_40X: The "hybrid selection penalty" incurred to get 80% of target bases to 40X. This metric
should be interpreted as: if I have a design with 10 megabases of target, and want to get
40X coverage I need to sequence until PF_ALIGNED_BASES = 10^7 * 40 * HS_PENALTY_40X.
HS_PENALTY_50X: The "hybrid selection penalty" incurred to get 80% of target bases to 50X. This metric
should be interpreted as: if I have a design with 10 megabases of target, and want to get
50X coverage I need to sequence until PF_ALIGNED_BASES = 10^7 * 50 * HS_PENALTY_50X.
HS_PENALTY_100X: The "hybrid selection penalty" incurred to get 80% of target bases to 100X. This metric
should be interpreted as: if I have a design with 10 megabases of target, and want to get
100X coverage I need to sequence until PF_ALIGNED_BASES = 10^7 * 100 * HS_PENALTY_100X.
AT_DROPOUT: A measure of how undercovered <= 50% GC regions are relative to the mean. For each GC bin [0..50]
we calculate a = % of target territory, and b = % of aligned reads aligned to these targets.
AT DROPOUT is then abs(sum(a-b when a-b < 0)). E.g. if the value is 5% this implies that 5% of total
reads that should have mapped to GC<=50% regions mapped elsewhere.
GC_DROPOUT: A measure of how undercovered >= 50% GC regions are relative to the mean. For each GC bin [50..100]
we calculate a = % of target territory, and b = % of aligned reads aligned to these targets.
GC DROPOUT is then abs(sum(a-b when a-b < 0)). E.g. if the value is 5% this implies that 5% of total
reads that should have mapped to GC>=50% regions mapped elsewhere.
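The bait efficiency, selection ratios and hybrid selection penalty described above reduce to simple arithmetic. The Python sketch below is illustrative only: the function names are hypothetical, and it assumes that the PF aligned bases are exactly the sum of on-bait, near-bait and off-bait bases.

```python
def bait_design_efficiency(target_territory, bait_territory):
    """Target territory / bait territory (1 = perfectly efficient)."""
    return target_territory / bait_territory


def selection_fractions(on_bait_bases, near_bait_bases, off_bait_bases):
    """Selection ratios as fractions, assuming aligned bases = on + near + off."""
    selected = on_bait_bases + near_bait_bases
    aligned = selected + off_bait_bases
    return {
        "PCT_SELECTED_BASES": selected / aligned,
        "PCT_OFF_BAIT": off_bait_bases / aligned,
        "ON_BAIT_VS_SELECTED": on_bait_bases / selected,
    }


def aligned_bases_needed(target_territory, desired_coverage, hs_penalty):
    """PF aligned bases to sequence: target size * coverage * penalty."""
    return target_territory * desired_coverage * hs_penalty
```

For a design with 10 megabases of target and an HS_PENALTY_20X of 5, aligned_bases_needed(10_000_000, 20, 5.0) gives 1e9 aligned bases to reach 20X.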
IlluminaBasecallingMetrics
Metrics for Illumina basecalling that store means and standard deviations on a per-barcode, per-lane basis. Means
and standard deviations are taken over all tiles.
Column Definitions
LANE: The lane for which the metrics were calculated.
MOLECULAR_BARCODE_SEQUENCE_1: The barcode sequence for which the metrics were calculated.
MOLECULAR_BARCODE_NAME: The barcode name for which the metrics were calculated.
TOTAL_BASES: The total number of bases assigned to the index.
PF_BASES: The total number of passing-filter bases assigned to the index.
TOTAL_READS: The total number of reads assigned to the index.
PF_READS: The total number of passing-filter reads assigned to the index.
TOTAL_CLUSTERS: The total number of clusters assigned to the index.
PF_CLUSTERS: The total number of PF clusters assigned to the index.
MEAN_CLUSTERS_PER_TILE: The mean number of clusters per tile.
SD_CLUSTERS_PER_TILE: The standard deviation of clusters per tile.
MEAN_PCT_PF_CLUSTERS_PER_TILE: The mean percentage of pf clusters per tile.
SD_PCT_PF_CLUSTERS_PER_TILE: The standard deviation in percentage of pf clusters per tile.
MEAN_PF_CLUSTERS_PER_TILE: The mean number of pf clusters per tile.
SD_PF_CLUSTERS_PER_TILE: The standard deviation in number of pf clusters per tile.
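The per-tile means and standard deviations can be sketched in Python as below, given a list of (total clusters, PF clusters) pairs with one entry per tile. The input shape and function name are assumptions, and the sample standard deviation is used here because the text above does not say whether the sample or population form applies. At least two tiles are required.

```python
import statistics


def per_tile_summary(tiles):
    """tiles: list of (total_clusters, pf_clusters), one tuple per tile."""
    totals = [total for total, _ in tiles]
    pfs = [pf for _, pf in tiles]
    # Fraction of PF clusters per tile; tiles with no clusters are skipped.
    pct_pf = [pf / total for total, pf in tiles if total]
    return {
        "MEAN_CLUSTERS_PER_TILE": statistics.fmean(totals),
        "SD_CLUSTERS_PER_TILE": statistics.stdev(totals),
        "MEAN_PCT_PF_CLUSTERS_PER_TILE": statistics.fmean(pct_pf),
        "SD_PCT_PF_CLUSTERS_PER_TILE": statistics.stdev(pct_pf),
        "MEAN_PF_CLUSTERS_PER_TILE": statistics.fmean(pfs),
        "SD_PF_CLUSTERS_PER_TILE": statistics.stdev(pfs),
    }
```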
InsertSizeMetrics
Metrics about the insert size distribution of a paired-end library, created by the
CollectInsertSizeMetrics program and usually written to a file with the extension
".insert_size_metrics". In addition the insert size distribution is plotted to
a file with the extension ".insert_size_Histogram.pdf".
Column Definitions
MEDIAN_INSERT_SIZE: The MEDIAN insert size of all paired end reads where both ends mapped to the same chromosome.
MEDIAN_ABSOLUTE_DEVIATION: The median absolute deviation of the distribution. If the distribution is essentially normal then
the standard deviation can be estimated as ~1.4826 * MAD.
MIN_INSERT_SIZE: The minimum measured insert size. This is usually 1 and not very useful as it is likely artifactual.
MAX_INSERT_SIZE: The maximum measured insert size by alignment. This is usually very high representing either an artifact
or possibly the presence of a structural re-arrangement.
MEAN_INSERT_SIZE: The mean insert size of the "core" of the distribution. Artifactual outliers in the distribution often
cause calculation of nonsensical mean and stdev values. To avoid this the distribution is first trimmed
to a "core" distribution of +/- N median absolute deviations around the median insert size. By default
N=10, but this is configurable.
STANDARD_DEVIATION: Standard deviation of insert sizes over the "core" of the distribution.
READ_PAIRS: The total number of read pairs that were examined in the entire distribution.
PAIR_ORIENTATION: The pair orientation of the reads in this data category.
WIDTH_OF_10_PERCENT: The "width" of the bins, centered around the median, that encompass 10% of all read pairs.
WIDTH_OF_20_PERCENT: The "width" of the bins, centered around the median, that encompass 20% of all read pairs.
WIDTH_OF_30_PERCENT: The "width" of the bins, centered around the median, that encompass 30% of all read pairs.
WIDTH_OF_40_PERCENT: The "width" of the bins, centered around the median, that encompass 40% of all read pairs.
WIDTH_OF_50_PERCENT: The "width" of the bins, centered around the median, that encompass 50% of all read pairs.
WIDTH_OF_60_PERCENT: The "width" of the bins, centered around the median, that encompass 60% of all read pairs.
WIDTH_OF_70_PERCENT: The "width" of the bins, centered around the median, that encompass 70% of all read pairs.
This metric divided by 2 should approximate the standard deviation when the insert size
distribution is a normal distribution.
WIDTH_OF_80_PERCENT: The "width" of the bins, centered around the median, that encompass 80% of all read pairs.
WIDTH_OF_90_PERCENT: The "width" of the bins, centered around the median, that encompass 90% of all read pairs.
WIDTH_OF_99_PERCENT: The "width" of the bins, centered around the median, that encompass 99% of all read pairs.
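The "core" trimming used for MEAN_INSERT_SIZE and STANDARD_DEVIATION can be illustrated with the Python sketch below. It works on a plain list of insert sizes rather than on the histogram that CollectInsertSizeMetrics builds, so it approximates the idea and does not reproduce the program's implementation. The function name and the "deviations" parameter name are assumptions.

```python
import statistics


def insert_size_core_stats(insert_sizes, deviations=10.0):
    """Median, MAD, and mean/SD of the core within +/- N MADs of the median."""
    median = statistics.median(insert_sizes)
    mad = statistics.median([abs(x - median) for x in insert_sizes])
    low, high = median - deviations * mad, median + deviations * mad
    core = [x for x in insert_sizes if low <= x <= high]
    return {
        "MEDIAN_INSERT_SIZE": median,
        "MEDIAN_ABSOLUTE_DEVIATION": mad,
        "MEAN_INSERT_SIZE": statistics.fmean(core),
        "STANDARD_DEVIATION": statistics.stdev(core) if len(core) > 1 else 0.0,
        # For a roughly normal distribution the SD is about 1.4826 * MAD.
        "SD_ESTIMATE_FROM_MAD": 1.4826 * mad,
    }
```

In a list such as [190, 195, 200, 205, 210, 100000] the single 100000 outlier falls outside the core and does not distort the mean.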
InternalControlCycleMetrics
Metrics about observations of an internal control sequence in an individual cycle.
Column Definitions
INTERNAL_CONTROL: The name of the internal control sequence.
READ: The read (1 or 2) that the metrics are for.
CYCLE: The cycle number within the read that the metrics are for.
OBSERVATIONS: The number of reads observed that were matched to this internal control.
ERRORS: The number of mismatches (including no-calls) contained within the observed reads at this cycle.
ERROR_RATE: The error rate in this IC in this cycle, i.e. ERRORS/OBSERVATIONS
SUM_OF_ERROR_PROBS: The sum of the error probabilities observed - in an ideal system this should match ERRORS.
QUALITY_ESTIMATE: The ratio of SUM_OF_ERROR_PROBS to ERRORS. A number > 1 indicates that the control had fewer errors
than would be predicted by the bases quality scores, a number < 1 indicates more errors than expected.
REF_BASE: The reference base of the internal control at this position.
A: The number of 'A' basecalls at this cycle for this internal control.
C: The number of 'C' basecalls at this cycle for this internal control.
G: The number of 'G' basecalls at this cycle for this internal control.
T: The number of 'T' basecalls at this cycle for this internal control.
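ERROR_RATE and QUALITY_ESTIMATE are both ratios of the columns above. A minimal Python sketch follows; the function name is illustrative, and the handling of zero denominators is an assumption, since the text does not specify it.

```python
def internal_control_cycle(observations, errors, sum_of_error_probs):
    """Derive ERROR_RATE and QUALITY_ESTIMATE for one cycle of an internal control."""
    error_rate = errors / observations if observations else 0.0
    # QUALITY_ESTIMATE is undefined when no errors were observed; NaN marks that case.
    quality_estimate = sum_of_error_probs / errors if errors else float("nan")
    return {"ERROR_RATE": error_rate, "QUALITY_ESTIMATE": quality_estimate}
```

With 1000 observations, 10 errors and a summed error probability of 20, the quality estimate is 2: the control had half as many errors as the base quality scores predicted.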
JumpingLibraryMetrics
High level metrics about the presence of outward- and inward-facing pairs
within a SAM file generated with a jumping library, produced by
the CollectJumpingLibraryMetrics program and usually stored in a file with
the extension ".jump_metrics".
Column Definitions
JUMP_PAIRS: The number of outward-facing pairs in the SAM file
JUMP_DUPLICATE_PAIRS: The number of outward-facing pairs that are duplicates
JUMP_DUPLICATE_PCT: The percentage of outward-facing pairs that are marked as duplicates
JUMP_LIBRARY_SIZE: The estimated library size for outward-facing pairs
JUMP_MEAN_INSERT_SIZE: The mean insert size for outward-facing pairs
JUMP_STDEV_INSERT_SIZE: The standard deviation on the insert size for outward-facing pairs
NONJUMP_PAIRS: The number of inward-facing pairs in the SAM file
NONJUMP_DUPLICATE_PAIRS: The number of inward-facing pairs that are duplicates
NONJUMP_DUPLICATE_PCT: The percentage of inward-facing pairs that are marked as duplicates
NONJUMP_LIBRARY_SIZE: The estimated library size for inward-facing pairs
NONJUMP_MEAN_INSERT_SIZE: The mean insert size for inward-facing pairs
NONJUMP_STDEV_INSERT_SIZE: The standard deviation on the insert size for inward-facing pairs
CHIMERIC_PAIRS: The number of pairs where either (a) the ends fall on different chromosomes or (b) the insert size
is greater than the maximum of 100000 or 2 times the mode of the insert size for outward-facing pairs.
FRAGMENTS: The number of fragments in the SAM file
PCT_JUMPS: The number of outward-facing pairs expressed as a percentage of the total of all outward-facing pairs,
inward-facing pairs, and chimeric pairs.
PCT_NONJUMPS: The number of inward-facing pairs expressed as a percentage of the total of all outward-facing pairs,
inward-facing pairs, and chimeric pairs.
PCT_CHIMERAS: The number of chimeric pairs expressed as a percentage of the total of all outward-facing pairs,
inward-facing pairs, and chimeric pairs.
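The chimera rule and the three PCT_ columns above can be sketched as below. The function names and arguments are hypothetical. The sketch returns fractions between 0 and 1 and leaves any scaling to 100 to the caller, since the scale written to the metrics file is not stated here.

```python
def is_chimeric(same_chromosome, insert_size, jump_insert_mode):
    """Apply the CHIMERIC_PAIRS rule to one read pair.

    same_chromosome:  True when both ends map to the same chromosome.
    insert_size:      insert size of the pair (the sign is ignored).
    jump_insert_mode: mode of the insert size for outward-facing pairs.
    """
    threshold = max(100000, 2 * jump_insert_mode)
    return (not same_chromosome) or abs(insert_size) > threshold


def pair_fractions(jump_pairs, nonjump_pairs, chimeric_pairs):
    """Share of each pair class among all jump, non-jump and chimeric pairs."""
    total = jump_pairs + nonjump_pairs + chimeric_pairs
    if total == 0:
        return {"PCT_JUMPS": 0.0, "PCT_NONJUMPS": 0.0, "PCT_CHIMERAS": 0.0}
    return {
        "PCT_JUMPS": jump_pairs / total,
        "PCT_NONJUMPS": nonjump_pairs / total,
        "PCT_CHIMERAS": chimeric_pairs / total,
    }


print(is_chimeric(True, 150000, 3000))
print(pair_fractions(60, 30, 10))
```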
MendelianViolationMetrics
Describes the type and number of Mendelian violations found within a trio.
Column Definitions
FAMILY_ID: The family ID assigned to the trio for which these metrics are calculated.
MOTHER: The ID of the mother within the trio.
FATHER: The ID of the father within the trio.
OFFSPRING: The ID of the offspring within the trio.
OFFSPRING_SEX: The sex of the offspring.
NUM_VARIANT_SITES: The number of sites at which all relevant samples exceeded the minimum genotype quality and at least one of the samples was variant.
NUM_DIPLOID_DENOVO: The number of diploid sites at which a potential de-novo mutation was observed (i.e. both parents are hom-ref, offspring is not hom-ref).
NUM_HOMVAR_HOMVAR_HET: The number of sites at which both parents are homozygous for a non-reference allele and the offspring is heterozygous.
NUM_HOMREF_HOMVAR_HOM: The number of sites at which one parent is homozygous reference, the other homozygous variant and the offspring is homozygous.
NUM_HOM_HET_HOM: The number of sites at which one parent is homozygous, the other is heterozygous and the offspring is the alternative homozygote.
NUM_HAPLOID_DENOVO: The number of sites at which the offspring is haploid, the parent is homozygous reference and the offspring is non-reference.
NUM_HAPLOID_OTHER: The number of sites at which the offspring is haploid and exhibits a reference allele that is not present in the parent.
NUM_OTHER: The number of otherwise unclassified events.
TOTAL_MENDELIAN_VIOLATIONS: The total of all Mendelian violations observed.
RnaSeqMetrics
Metrics about the alignment of RNA-seq reads within a SAM file to genes, produced by the CollectRnaSeqMetrics
program and usually stored in a file with the extension ".rna_metrics".
Column Definitions
PF_BASES: The total number of PF bases including non-aligned reads.
PF_ALIGNED_BASES: The total number of aligned PF bases. Non-primary alignments are not counted. Bases in aligned reads that
do not correspond to reference (e.g. soft clips, insertions) are not counted.
RIBOSOMAL_BASES: Number of bases in primary alignments that align to ribosomal sequence.
CODING_BASES: Number of bases in primary alignments that align to a non-UTR coding base for some gene, and not ribosomal sequence.
UTR_BASES: Number of bases in primary alignments that align to a UTR base for some gene, and not a coding base.
INTRONIC_BASES: Number of bases in primary alignments that align to an intronic base for some gene, and not a coding or UTR base.
INTERGENIC_BASES: Number of bases in primary alignments that do not align to any gene.
IGNORED_READS: Number of primary alignments that map to a sequence specified on command-line as IGNORED_SEQUENCE. These are not
counted in PF_ALIGNED_BASES, CORRECT_STRAND_READS, INCORRECT_STRAND_READS, or any of the base-counting metrics.
These reads are counted in PF_BASES.
CORRECT_STRAND_READS: Number of aligned reads that map to the correct strand. 0 if library is not strand-specific.
INCORRECT_STRAND_READS: Number of aligned reads that map to the incorrect strand. 0 if library is not strand-specific.
PCT_RIBOSOMAL_BASES: RIBOSOMAL_BASES / PF_ALIGNED_BASES
PCT_CODING_BASES: CODING_BASES / PF_ALIGNED_BASES
PCT_UTR_BASES: UTR_BASES / PF_ALIGNED_BASES
PCT_INTRONIC_BASES: INTRONIC_BASES / PF_ALIGNED_BASES
PCT_INTERGENIC_BASES: INTERGENIC_BASES / PF_ALIGNED_BASES
PCT_MRNA_BASES: PCT_UTR_BASES + PCT_CODING_BASES
PCT_USABLE_BASES: The number of bases mapping to mRNA (CODING_BASES + UTR_BASES) divided by the total number of PF bases (PF_BASES).
PCT_CORRECT_STRAND_READS: CORRECT_STRAND_READS/(CORRECT_STRAND_READS + INCORRECT_STRAND_READS). 0 if library is not strand-specific.
MEDIAN_CV_COVERAGE: The median CV of coverage of the 1000 most highly expressed transcripts. Ideal value = 0.
MEDIAN_5PRIME_BIAS: The median 5 prime bias of the 1000 most highly expressed transcripts, where 5 prime bias is calculated per
transcript as: mean coverage of the 5' most 100 bases divided by the mean coverage of the whole transcript.
MEDIAN_3PRIME_BIAS: The median 3 prime bias of the 1000 most highly expressed transcripts, where 3 prime bias is calculated per
transcript as: mean coverage of the 3' most 100 bases divided by the mean coverage of the whole transcript.
MEDIAN_5PRIME_TO_3PRIME_BIAS: The ratio of coverage at the 5' end to the 3' end based on the 1000 most highly expressed transcripts.
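The derived RNA-seq columns follow directly from the base counts, and the per-transcript end bias is a ratio of two mean coverages. The sketch below shows both calculations. The function names are hypothetical, the coverage list is assumed to be ordered 5' to 3' along the transcript, and PCT_USABLE_BASES is taken to be mRNA bases (coding plus UTR) over PF_BASES.

```python
def rna_fractions(pf_bases, pf_aligned_bases, ribosomal, coding, utr,
                  intronic, intergenic):
    """Derive the PCT_* columns from the raw base counts."""
    def frac(numerator, denominator):
        return numerator / denominator if denominator else 0.0

    pct = {
        "PCT_RIBOSOMAL_BASES": frac(ribosomal, pf_aligned_bases),
        "PCT_CODING_BASES": frac(coding, pf_aligned_bases),
        "PCT_UTR_BASES": frac(utr, pf_aligned_bases),
        "PCT_INTRONIC_BASES": frac(intronic, pf_aligned_bases),
        "PCT_INTERGENIC_BASES": frac(intergenic, pf_aligned_bases),
    }
    pct["PCT_MRNA_BASES"] = pct["PCT_UTR_BASES"] + pct["PCT_CODING_BASES"]
    # Usable bases are measured against all PF bases, aligned or not.
    pct["PCT_USABLE_BASES"] = frac(coding + utr, pf_bases)
    return pct


def end_bias(coverage, end, window=100):
    """Mean coverage of the outermost `window` bases at one end of a
    transcript divided by the mean coverage of the whole transcript.

    coverage: per-base coverage, ordered 5' to 3' (an assumption).
    end:      "5prime" or "3prime".
    """
    if not coverage:
        raise ValueError("transcript has no bases")
    whole_mean = sum(coverage) / len(coverage)
    if whole_mean == 0:
        return 0.0
    tail = coverage[:window] if end == "5prime" else coverage[-window:]
    return (sum(tail) / len(tail)) / whole_mean


coverage = [2] * 100 + [1] * 300
print(end_bias(coverage, "5prime"), end_bias(coverage, "3prime"))
```

The MEDIAN_5PRIME_BIAS and MEDIAN_3PRIME_BIAS columns would then be the medians of these per-transcript values over the 1000 most highly expressed transcripts.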
RrbsSummaryMetrics
Holds summary statistics from RRBS processing QC
Column Definitions
READS_ALIGNED: Number of mapped reads processed
NON_CPG_BASES: Number of times a non-CpG cytosine was encountered
NON_CPG_CONVERTED_BASES: Number of times a non-CpG cytosine was converted (C->T for +, G->A for -)
PCT_NON_CPG_BASES_CONVERTED: NON_CPG_CONVERTED_BASES / NON_CPG_BASES
CPG_BASES_SEEN: Number of CpG sites encountered
CPG_BASES_CONVERTED: Number of CpG sites that were converted (TG for +, CA for -)
PCT_CPG_BASES_CONVERTED: CPG_BASES_CONVERTED / CPG_BASES_SEEN
MEAN_CPG_COVERAGE: Mean coverage of CpG sites
MEDIAN_CPG_COVERAGE: Median coverage of CpG sites
READS_WITH_NO_CPG: Number of reads discarded for having no CpG sites
READS_IGNORED_SHORT: Number of reads discarded due to being too short
READS_IGNORED_MISMATCHES: Number of reads discarded for exceeding the mismatch threshold
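The two conversion columns and the CpG coverage summaries can be sketched as follows. The sketch assumes a conversion rate is the converted count divided by the count encountered, which keeps the value between 0 and 1. The function names and the per-site coverage list are hypothetical.

```python
from statistics import mean, median


def conversion_rate(converted, encountered):
    """Fraction of encountered cytosines (or CpG sites) that were converted."""
    return converted / encountered if encountered else 0.0


def cpg_coverage_summary(coverage_per_site):
    """Mean and median coverage over a list of per-CpG-site read depths."""
    if not coverage_per_site:
        return {"MEAN_CPG_COVERAGE": 0.0, "MEDIAN_CPG_COVERAGE": 0.0}
    return {
        "MEAN_CPG_COVERAGE": mean(coverage_per_site),
        "MEDIAN_CPG_COVERAGE": median(coverage_per_site),
    }


# Non-CpG cytosines are expected to convert almost completely.
print(conversion_rate(converted=990, encountered=1000))
print(cpg_coverage_summary([4, 10, 10, 16]))
```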
ScreenSamReads.ScreenSamReadsMetrics
SAM or BAM read screening Metrics
Column Definitions
START_REFERENCE: The reference of the un-screened source file.
START_BASES: The number of bases in the un-screened source data file.
PCT_START_ALIGNED: The % of mapped bases in the un-screened source data file (of start bases).
PCT_START_UNMAPPED: The % of unmapped bases in the un-screened source data file (of start bases).
POSITIVE_REFERENCE: The positive reference used during alignment of un-screened reads.
PCT_POSITIVE_ALIGNED: The % of bases that mapped to the positive reference.
PCT_POSITIVE_UNMAPPED: The % of bases that did not map to the positive reference.
NEGATIVE_REFERENCE: The negative reference used during alignment of un-screened reads.
PCT_NEGATIVE_ALIGNED: The % of bases that mapped to the negative reference.
PCT_NEGATIVE_UNMAPPED: The % of bases that did not map to the negative reference.
END_BASES: The number of bases in the screened data file.
PCT_END_ALIGNED: The % of mapped bases in the screened data file (of end bases).
PCT_END_UNMAPPED: The % of unmapped bases in the screened data file (of end bases).
PCT_PASSING_SCREEN: The % of bases that passed the screen (of start bases).
PCT_FAILING_SCREEN: The % of bases that failed the screen (of start bases).
SpikeInMetrics
User: ktibbett
Date: Nov 17, 2009
Time: 4:17:25 PM
Column Definitions
TOTAL_PLASMID_READS: The number of reads in the BAM file that map to plasmids
EXPECTED_PLASMID: The name of the plasmid that was spiked into the lane; all plasmid reads
are expected to align to this reference.
MEDIAN_COVERAGE_EXPECTED_PLASMID: The median number of reads covering each "bin" of the expected plasmid
EXPECTED_PLASMID_COUNT: The number of reads mapping to the expected plasmid
BEST_PLASMID: The name of the plasmid to which the most reads aligned.
BEST_PLASMID_MEDIAN_COVERAGE: The median number of reads covering each "bin" of the best plasmid
BEST_PLASMID_COUNT: The number of reads mapping to the plasmid with the most reads aligned
SECOND_BEST_PLASMID: The name of the plasmid to which the second-highest number of reads aligned.
SECOND_BEST_PLASMID_MEDIAN_COVERAGE: The median number of reads covering each "bin" of the second-best plasmid
SECOND_BEST_PLASMID_COUNT: The number of reads mapping to the plasmid with the second-highest number of reads aligned.
TargetedPcrMetrics
Metrics class for targeted PCR runs such as TSCA runs
Column Definitions
CUSTOM_AMPLICON_SET: The name of the amplicon set used in this metrics collection run
GENOME_SIZE: The number of bases in the reference genome used for alignment.
AMPLICON_TERRITORY: The number of unique bases covered by the intervals of all amplicons in the amplicon set
TARGET_TERRITORY: The number of unique bases covered by the intervals of all targets that should be covered
TOTAL_READS: The total number of reads in the SAM or BAM file examined.
PF_READS: The number of reads that pass the vendor's filter.
PF_BASES: The number of bases in the SAM or BAM file to be examined
PF_UNIQUE_READS: The number of PF reads that are not marked as duplicates.
PCT_PF_READS: PF reads / total reads. The percent of reads passing filter.
PCT_PF_UQ_READS: PF Unique Reads / Total Reads.
PF_UQ_READS_ALIGNED: The number of PF unique reads that are aligned with mapping score > 0 to the reference genome.
PF_SELECTED_PAIRS: Tracks the number of read pairs that we see that are PF (used to calculate library size)
PF_SELECTED_UNIQUE_PAIRS: Tracks the number of unique PF read pairs we see (used to calculate library size)
PCT_PF_UQ_READS_ALIGNED: PF Reads Aligned / PF Reads.
PF_UQ_BASES_ALIGNED: The number of PF unique bases that are aligned with mapping score > 0 to the reference genome.
ON_AMPLICON_BASES: The number of PF aligned bases that mapped to an amplified region of the genome.
NEAR_AMPLICON_BASES: The number of PF aligned bases that mapped to within a fixed interval of an amplified region, but not on an amplified region.
OFF_AMPLICON_BASES: The number of PF aligned bases that mapped neither on nor near an amplicon.
ON_TARGET_BASES: The number of PF aligned bases that mapped to a targeted region of the genome.
ON_TARGET_FROM_PAIR_BASES: The number of PF aligned bases that are mapped in pair to a targeted region of the genome.
PCT_AMPLIFIED_BASES: On+Near Amplicon Bases / PF Bases Aligned.
PCT_OFF_AMPLICON: The percentage of aligned PF bases that mapped neither on nor near an amplicon.
ON_AMPLICON_VS_SELECTED: The percentage of on+near amplicon bases that are on as opposed to near.
MEAN_AMPLICON_COVERAGE: The mean coverage of all amplicons in the experiment.
MEAN_TARGET_COVERAGE: The mean coverage of targets that received a coverage depth of at least 2 at one or more bases.
FOLD_ENRICHMENT: The fold by which the amplicon region has been amplified above genomic background.
ZERO_CVG_TARGETS_PCT: The percentage of targets that did not reach coverage=2 over any base.
FOLD_80_BASE_PENALTY: The fold over-coverage necessary to raise 80% of bases in "non-zero-cvg" targets to
the mean coverage level in those targets.
PCT_TARGET_BASES_2X: The percentage of ALL target bases achieving 2X or greater coverage.
PCT_TARGET_BASES_10X: The percentage of ALL target bases achieving 10X or greater coverage.
PCT_TARGET_BASES_20X: The percentage of ALL target bases achieving 20X or greater coverage.
PCT_TARGET_BASES_30X: The percentage of ALL target bases achieving 30X or greater coverage.
AT_DROPOUT: A measure of how undercovered <= 50% GC regions are relative to the mean. For each GC bin [0..50]
we calculate a = % of target territory, and b = % of aligned reads aligned to these targets.
AT DROPOUT is then abs(sum(a-b when a-b > 0)). E.g. if the value is 5% this implies that 5% of total
reads that should have mapped to GC<=50% regions mapped elsewhere.
GC_DROPOUT: A measure of how undercovered >= 50% GC regions are relative to the mean. For each GC bin [50..100]
we calculate a = % of target territory, and b = % of aligned reads aligned to these targets.
GC DROPOUT is then abs(sum(a-b when a-b > 0)). E.g. if the value is 5% this implies that 5% of total
reads that should have mapped to GC>=50% regions mapped elsewhere.
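The AT_DROPOUT and GC_DROPOUT definitions can be sketched as a single pass over the 101 GC bins. This sketch takes "undercovered" to mean bins whose share of aligned reads (b) falls below their share of target territory (a), and sums only that shortfall. That sign convention, the function name gc_dropouts, and the list-based inputs are assumptions made for illustration.

```python
def gc_dropouts(target_bases_by_gc, aligned_reads_by_gc):
    """Return (AT_DROPOUT, GC_DROPOUT) in percentage points.

    Both arguments are lists of length 101 indexed by GC percent (0..100):
    target_bases_by_gc[i]  = target territory falling in GC bin i
    aligned_reads_by_gc[i] = aligned reads falling in GC bin i
    """
    total_territory = sum(target_bases_by_gc)
    total_reads = sum(aligned_reads_by_gc)
    if total_territory == 0 or total_reads == 0:
        raise ValueError("need non-empty territory and read counts")

    at_dropout = 0.0
    gc_dropout = 0.0
    for gc_bin in range(101):
        a = 100.0 * target_bases_by_gc[gc_bin] / total_territory
        b = 100.0 * aligned_reads_by_gc[gc_bin] / total_reads
        shortfall = a - b
        if shortfall > 0:  # bin received fewer reads than its territory share
            if gc_bin <= 50:
                at_dropout += shortfall
            if gc_bin >= 50:  # bin 50 counts toward both, as in the ranges above
                gc_dropout += shortfall
    return at_dropout, gc_dropout


territory = [0] * 101
reads = [0] * 101
territory[30], territory[70] = 50, 50
reads[30], reads[70] = 40, 60
print(gc_dropouts(territory, reads))
```

In the example, the GC 30% bin holds half the territory but only 40% of the reads, so the ten-point shortfall shows up as AT dropout and the GC dropout stays at zero.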
UploadFour54ScreeningMetrics.Four54ScreeningMetrics
Column Definitions
SCREENING_QUERY_NAME: A logical human readable name that uniquely identifies the screening metrics
DATE_CREATED: Metrics creation date
TRIMMING_DB_FASTA: The sequence trimming database containing oligo sequences. Used for clipping before
alignment of un-screened reads.
START_BASES: The number of bases in the un-screened source data file.
POSITIVE_REFERENCE: The positive reference. Used during alignment of un-screened reads.
PCT_POSITIVE_ALIGNED: The % of bases that mapped to the positive reference.
PCT_POSITIVE_UNMAPPED: The % of bases that did not map to the positive reference.
NEGATIVE_REFERENCE: The negative reference. Used during alignment of un-screened reads.
PCT_NEGATIVE_ALIGNED: The % of bases that mapped to the negative reference.
PCT_NEGATIVE_UNMAPPED: The % of bases that did not map to the negative reference.
END_BASES: The number of bases in the screened data file.
PCT_PASSING_SCREEN: The % of bases that passed the screen (of start bases).
PCT_FAILING_SCREEN: The % of bases that failed the screen (of start bases).
BASS_GLOBAL_ID: The BASS Global Identifier (uniquely identifies the file in BASS)
ORGANISM:
INITIATIVE:
GSSR_BARCODE:
SAMPLE:
PROJECT:
PTP_BARCODE:
RUN_NAME:
RUN_BARCODE:
READ_GROUP_TYPE:
REGION:
SEQUENCE_KEY:
MOLECULAR_BARCODE_NAME:
MOLECULAR_BARCODE_SEQUENCE: