IDR Peak calling Pipeline:
Feb 9 2012
Anshul Kundaje
akundaje _at_ stanford _dot_ edu

Please read the README.txt file in the SPP package first before reading this ..

====================================
CALL PEAKS ON INDIVIDUAL REPLICATES
====================================
- Lets say we have a particular TF ChIP set dataset. I will refer to it as chipSample from here on.

- Lets say we have 3 replicates for chipSample named
	- chipSampleRep1.tagAlign.gz
	- chipSampleRep2.tagAlign.gz
	- chipSampleRep3.tagAlign.gz
	
- Each of the ChIP replicates might be paired with its own input/control tagAlign file or a single tagAlign file.
	- controlSampleRep1.tagAlign.gz
	- controlSampleRep2.tagAlign.gz
	- controlSampleRep3.tagAlign.gz
	
- The first thing we need to do is pool all control datasets that correspond to all replicates of chipSample i.e. simply concatenate the control tagAlign files.
zcat controlSampleRep1.tagAlign.gz controlSampleRep2.tagAlign.gz controlSampleRep3.tagAlign.gz | gzip -c > controlSampleRep0.tagAlign.gz

- Now we are ready to call peaks on each of the original replicates (REMEMBER TO USE RELAXED THRESHOLDS AND TRY TO CALL 150k to 300k peaks even if most of them are noise)
Rscript run_spp.R -c=chipSampleRep1.tagAlign.gz -i=controlSampleRep0.tagAlign.gz -npeak=300000 -odir=/peaks/reps -savr -savp -rf -out=/stats/phantomPeakStatsReps.tab
Rscript run_spp.R -c=chipSampleRep2.tagAlign.gz -i=controlSampleRep0.tagAlign.gz -npeak=300000 -odir=/peaks/reps -savr -savp -rf -out=/stats/phantomPeakStatsReps.tab
Rscript run_spp.R -c=chipSampleRep3.tagAlign.gz -i=controlSampleRep0.tagAlign.gz -npeak=300000 -odir=/peaks/reps -savr -savp -rf -out=/stats/phantomPeakStatsReps.tab

If you are using the MACS 1.4 peak caller then you can use a p-value threshold of 1e-2, as this p-value allows one to obtain large number of peaks. Then truncate the peak list to 150K to 300K peaks.

macs14 -t chipSampleRep1.tagAlign.gz -c controlSampleRep0.tagAlign.gz -f BED -n chipSampleRep1_VS_controlSampleRep0 -g hs -p 1e-2 -m 5,30
macs14 -t chipSampleRep2.tagAlign.gz -c controlSampleRep0.tagAlign.gz -f BED -n chipSampleRep2_VS_controlSampleRep0 -g hs -p 1e-2 -m 5,30
macs14 -t chipSampleRep3.tagAlign.gz -c controlSampleRep0.tagAlign.gz -f BED -n chipSampleRep3_VS_controlSampleRep0 -g hs -p 1e-2 -m 5,30

MACS will often not estimate the fragment length parameter accurately (it will estimate it as the read length) which can cause severely suboptimal peak calling. One can estimate this parameter far more robustly using strand cross-correlation analysis. Use this package (http://www.ebi.ac.uk/~anshul/public/softwareRepo/spp_package.tar.gz) to estimate fragment length. Then use the following MACS commands

macs14 -t chipSampleRep1.tagAlign.gz -c controlSampleRep0.tagAlign.gz -f BED -n chipSampleRep1_VS_controlSampleRep0 -g hs -p 1e-2 --shift-size=[FRAGLEN/2] --nomodel
macs14 -t chipSampleRep2.tagAlign.gz -c controlSampleRep0.tagAlign.gz -f BED -n chipSampleRep2_VS_controlSampleRep0 -g hs -p 1e-2 --shift-size=[FRAGLEN/2] --nomodel
macs14 -t chipSampleRep3.tagAlign.gz -c controlSampleRep0.tagAlign.gz -f BED -n chipSampleRep3_VS_controlSampleRep0 -g hs -p 1e-2 --shift-size=[FRAGLEN/2] --nomodel

MACS will generate '_peaks.xls' files that will have the relevant peak information. You can convert these to narrowPeak files using the macs2npk.sh bash script (http://www.ebi.ac.uk/~anshul/public/softwareRepo/macs2npk.sh).
This script will sort the peaks by p-value and then convert to narrowPeak format (NOTE: It will NOT truncate the peak list to 300K peaks)

macs2npk.sh chipSampleRep1_VS_controlSampleRep0_peaks.xls /peaks/reps
macs2npk.sh chipSampleRep2_VS_controlSampleRep0_peaks.xls /peaks/reps
macs2npk.sh chipSampleRep3_VS_controlSampleRep0_peaks.xls /peaks/reps

	
- This will generate 3 peak files in directory /peaks/reps/
	chipSampleRep1_VS_controlSampleRep0.regionPeak.gz
	chipSampleRep2_VS_controlSampleRep0.regionPeak.gz
	chipSampleRep3_VS_controlSampleRep0.regionPeak.gz

- Phantom peak quality measures for the 3 runs will be appended to file /stats/phantomPeakStats.tab if you are using SPP scripts for peak calling.
(See README.txt in the peak calling package for details about the phantom peak quality measures files)

- For MACS peak calls remember to truncate the peak files to the top 300K peaks.

zcat chipSampleRep1_VS_controlSampleRep0.regionPeak.gz | head -n 300000 | gzip -c > temp; mv temp chipSampleRep1_VS_controlSampleRep0.regionPeak.gz
zcat chipSampleRep2_VS_controlSampleRep0.regionPeak.gz | head -n 300000 | gzip -c > temp; mv temp chipSampleRep2_VS_controlSampleRep0.regionPeak.gz
zcat chipSampleRep3_VS_controlSampleRep0.regionPeak.gz | head -n 300000 | gzip -c > temp; mv temp chipSampleRep3_VS_controlSampleRep0.regionPeak.gz

From here on I only give SPP commands but you can use MACS commands similar to the ones used above ..

====================================
CALL PEAKS ON POOLED REPLICATES
====================================
- Now pool all ChIP replicates
zcat chipSampleRep1.tagAlign.gz chipSampleRep2.tagAlign.gz chipSampleRep3.tagAlign.gz | gzip -c > chipSampleRep0.tagAlign.gz
	
- Call peaks on the pooled data (REMEMBER TO USE RELAXED THRESHOLDS AND TRY TO CALL 150k to 300k peaks even if most of them are noise)
Rscript run_spp.R -c=chipSampleRep0.tagAlign.gz -i=controlSampleRep0.tagAlign.gz -npeak=300000 -odir=/peaks/pooled/ -savr -savp -rf
	
- This will generate a peak file in directory /peaks/pooled
	chipSampleRep0_VS_controlSampleRep0.regionPeak.gz
	
==================================================
FOR SELF-CONSISTENCY ANALYSIS CALL PEAKS ON PSEUDOREPLICATES OF INDIVIDUAL REPLICATES
==================================================
For each replicate e.g. chipSampleRep1.tagAlign.gz
- Randomly split the mapped reads in the tagAlign file into 2 parts (pseudoReplicates) with approximately equal number of reads
	You can use the following code snippet to do this
	*NOTE*: It uses the unix 'shuf' command that is standard with most installations. Incase you dont have it, I have included a 64-bit binary in the SPP package.
	
fileName='chipSampleRep1.tagAlign.gz' # input tagAlign file name
outputDir='/mapped/selfPseudoReps' # output directory for pseudoreplicate files
outputStub='chipSampleRep1' # prefix name for pseudoReplicate files

nlines=$( zcat ${fileName} | wc -l ) # Number of reads in the tagAlign file
nlines=$(( (nlines + 1) / 2 )) # half that number
zcat "${fileName}" | shuf | split -d -l ${nlines} - "${outputDir}/${outputStub}" # This will shuffle the lines in the file and split it into two parts
gzip "${outputDir}/${outputStub}00"
gzip "${outputDir}/${outputStub}01"
mv "${outputDir}/${outputStub}00.gz" "${outputDir}/${outputStub}.pr1.tagAlign.gz"
mv "${outputDir}/${outputStub}01.gz" "${outputDir}/${outputStub}.pr2.tagAlign.gz"

- You will now have 2 self-pseudoReplicates in the directory named /mapped/selfPseudoReps/
	chipSampleRep1.pr1.tagAlign.gz
	chipSampleRep1.pr2.tagAlign.gz

- Now call peaks on each of these using the pooled control dataset as the control (REMEMBER TO USE RELAXED THRESHOLDS AND TRY TO CALL 150k to 300k peaks even if most of them are noise)
Rscript run_spp.R -c=/mapped/selfPseudoReps/chipSampleRep1.pr1.tagAlign.gz -i=controlSampleRep0.tagAlign.gz -npeak=300000 -odir=/peaks/selfPseudoReps -savr -savp -rf
Rscript run_spp.R -c=/mapped/selfPseudoReps/chipSampleRep1.pr2.tagAlign.gz -i=controlSampleRep0.tagAlign.gz -npeak=300000 -odir=/peaks/selfPseudoReps -savr -savp -rf

- This will create two peak files in the directory /peaks/selfPseudoReps
	chipSampleRep1.pr1_VS_controlSampleRep0.regionPeak.gz
	chipSampleRep1.pr2_VS_controlSampleRep0.regionPeak.gz
	
==================================================
CREATE PSEUDOREPLICATES OF POOLED DATA AND CALL PEAKS
==================================================
- Now, we create pseudoReplicates on the pooled data (same code as above)

fileName='chipSampleRep0.tagAlign.gz' # input tagAlign file name
outputDir='/mapped/pooledPseudoReps' # output directory for pseudoreplicate files
outputStub='chipSampleRep0' # prefix name for pseudoReplicate files

nlines=$( zcat ${fileName} | wc -l ) # Number of reads in the tagAlign file
nlines=$(( (nlines + 1) / 2 )) # half that number
zcat "${fileName}" | shuf | split -d -l ${nlines} - "${outputDir}/${outputStub}" # This will shuffle the lines in the file and split it into two parts
gzip "${outputDir}/${outputStub}00"
gzip "${outputDir}/${outputStub}01"
mv "${outputDir}/${outputStub}00.gz" "${outputDir}/${outputStub}.pr1.tagAlign.gz"
mv "${outputDir}/${outputStub}01.gz" "${outputDir}/${outputStub}.pr2.tagAlign.gz"

- You will now have 2 pooled-pseudoReplicates in the directory named /mapped/pooledPseudoReps/
	chipSampleRep0.pr1.tagAlign.gz
	chipSampleRep0.pr2.tagAlign.gz

- Now call peaks on each of these using the pooled control dataset as the control (REMEMBER TO USE RELAXED THRESHOLDS AND TRY TO CALL 150k to 300k peaks even if most of them are noise)
Rscript run_spp.R -c=/mapped/selfPseudoReps/chipSampleRep0.pr1.tagAlign.gz -i=controlSampleRep0.tagAlign.gz -npeak=300000 -odir=/peaks/pooledPseudoReps -savr -savp -rf
Rscript run_spp.R -c=/mapped/selfPseudoReps/chipSampleRep0.pr2.tagAlign.gz -i=controlSampleRep0.tagAlign.gz -npeak=300000 -odir=/peaks/pooledPseudoReps -savr -savp -rf

- This will create two peak files in the directory /peaks/pooledPseudoReps
	chipSampleRep0.pr1_VS_controlSampleRep0.regionPeak.gz
	chipSampleRep0.pr2_VS_controlSampleRep0.regionPeak.gz

===============================
INPUT TO IDR ANALYSIS
===============================
For IDR analysis, we always compare a pair of peak files

For dataset chipSample, we will have ended up with the following files

PEAKS ON ORIGINAL REPLICATES
	chipSampleRep1_VS_controlSampleRep0.regionPeak.gz
	chipSampleRep2_VS_controlSampleRep0.regionPeak.gz
	chipSampleRep3_VS_controlSampleRep0.regionPeak.gz

PEAKS ON SELF-PSEUDOREPLICATES
	chipSampleRep1.pr1_VS_controlSampleRep0.regionPeak.gz
	chipSampleRep1.pr2_VS_controlSampleRep0.regionPeak.gz

	chipSampleRep2.pr1_VS_controlSampleRep0.regionPeak.gz
	chipSampleRep2.pr2_VS_controlSampleRep0.regionPeak.gz

	chipSampleRep3.pr1_VS_controlSampleRep0.regionPeak.gz
	chipSampleRep3.pr2_VS_controlSampleRep0.regionPeak.gz
	
PEAKS ON POOLED PSEUDOREPLICATES
	chipSampleRep0.pr1_VS_controlSampleRep0.regionPeak.gz
	chipSampleRep0.pr2_VS_controlSampleRep0.regionPeak.gz

===============================
IDR ANALYSIS ON ORIGINAL REPLICATES
===============================
We will perform the following pairwise analyses 
/peaks/reps/chipSampleRep1_VS_controlSampleRep0.regionPeak.gz AND /peaks/reps/chipSampleRep2_VS_controlSampleRep0.regionPeak.gz
/peaks/reps/chipSampleRep1_VS_controlSampleRep0.regionPeak.gz AND /peaks/reps/chipSampleRep3_VS_controlSampleRep0.regionPeak.gz
/peaks/reps/chipSampleRep2_VS_controlSampleRep0.regionPeak.gz AND /peaks/reps/chipSampleRep3_VS_controlSampleRep0.regionPeak.gz

First gunzip all the peak files

Then run the IDR analysis on all pairs

Rscript batch-consistency-analysis.r /peaks/reps/chipSampleRep1_VS_controlSampleRep0.regionPeak /peaks/reps/chipSampleRep2_VS_controlSampleRep0.regionPeak -1 /consistency/reps/chipSampleRep1_VS_chipSampleRep2 0 F signal.value
Rscript batch-consistency-analysis.r /peaks/reps/chipSampleRep1_VS_controlSampleRep0.regionPeak /peaks/reps/chipSampleRep3_VS_controlSampleRep0.regionPeak -1 /consistency/reps/chipSampleRep1_VS_chipSampleRep3 0 F signal.value
Rscript batch-consistency-analysis.r /peaks/reps/chipSampleRep2_VS_controlSampleRep0.regionPeak /peaks/reps/chipSampleRep3_VS_controlSampleRep0.regionPeak -1 /consistency/reps/chipSampleRep2_VS_chipSampleRep3 0 F signal.value

If you are using peaks based on the MACS peak caller, then use p.value as the ranking measure and also set max peak half-width to 200 bp (since MACS tends to call wider peaks with relaxed p-value thresholds of 1e-2)

Rscript batch-consistency-analysis.r /peaks/reps/chipSampleRep1_VS_controlSampleRep0.regionPeak /peaks/reps/chipSampleRep2_VS_controlSampleRep0.regionPeak 200 /consistency/reps/chipSampleRep1_VS_chipSampleRep2 0 F p.value
Rscript batch-consistency-analysis.r /peaks/reps/chipSampleRep1_VS_controlSampleRep0.regionPeak /peaks/reps/chipSampleRep3_VS_controlSampleRep0.regionPeak 200 /consistency/reps/chipSampleRep1_VS_chipSampleRep3 0 F p.value
Rscript batch-consistency-analysis.r /peaks/reps/chipSampleRep2_VS_controlSampleRep0.regionPeak /peaks/reps/chipSampleRep3_VS_controlSampleRep0.regionPeak 200 /consistency/reps/chipSampleRep2_VS_chipSampleRep3 0 F p.value

You can plot the IDR plots using
Rscript batch-consistency-plot.r 3 /consistency/reps/chipSampleAllReps /consistency/reps/chipSampleRep1_VS_chipSampleRep2 /consistency/reps/chipSampleRep1_VS_chipSampleRep3 /consistency/reps/chipSampleRep2_VS_chipSampleRep3

===============================
IDR ANALYSIS ON SELF-PSEUDOREPLICATES
===============================
We will perform the following pairwise analyses 
/peaks/selfPseudoReps/chipSampleRep1.pr1_VS_controlSampleRep0.regionPeak.gz AND /peaks/selfPseudoReps/chipSampleRep1.pr2_VS_controlSampleRep0.regionPeak.gz
/peaks/selfPseudoReps/chipSampleRep2.pr1_VS_controlSampleRep0.regionPeak.gz AND /peaks/selfPseudoReps/chipSampleRep2.pr2_VS_controlSampleRep0.regionPeak.gz
/peaks/selfPseudoReps/chipSampleRep3.pr1_VS_controlSampleRep0.regionPeak.gz AND /peaks/selfPseudoReps/chipSampleRep3.pr2_VS_controlSampleRep0.regionPeak.gz

First gunzip all the peak files

Then run the IDR analysis on all pairs

Rscript batch-consistency-analysis.r /peaks/selfPseudoReps/chipSampleRep1.pr1_VS_controlSampleRep0.regionPeak /peaks/selfPseudoReps/chipSampleRep1.pr2_VS_controlSampleRep0.regionPeak -1 /consistency/selfPseudoReps/chipSampleRep1.pr1_VS_chipSampleRep1.pr2 0 F signal.value
Rscript batch-consistency-analysis.r /peaks/selfPseudoReps/chipSampleRep2.pr1_VS_controlSampleRep0.regionPeak /peaks/selfPseudoReps/chipSampleRep2.pr2_VS_controlSampleRep0.regionPeak -1 /consistency/selfPseudoReps/chipSampleRep2.pr1_VS_chipSampleRep2.pr2 0 F signal.value
Rscript batch-consistency-analysis.r /peaks/selfPseudoReps/chipSampleRep3.pr1_VS_controlSampleRep0.regionPeak /peaks/selfPseudoReps/chipSampleRep3.pr2_VS_controlSampleRep0.regionPeak -1 /consistency/selfPseudoReps/chipSampleRep3.pr1_VS_chipSampleRep3.pr2 0 F signal.value

You can plot the IDR plots using
Rscript batch-consistency-plot.r 3 /consistency/selfPseudoReps/chipSampleAllSelfReps /consistency/selfPseudoReps/chipSampleRep1.pr1_VS_chipSampleRep1.pr2 /consistency/selfPseudoReps/chipSampleRep2.pr1_VS_chipSampleRep2.pr2 /consistency/selfPseudoReps/chipSampleRep3.pr1_VS_chipSampleRep3.pr2

===============================
IDR ANALYSIS ON POOLED-PSEUDOREPLICATES
===============================
We will perform the following pairwise analyses 
/peaks/pooledPseudoReps/chipSampleRep0.pr1_VS_controlSampleRep0.regionPeak.gz AND /peaks/pooledPseudoReps/chipSampleRep0.pr2_VS_controlSampleRep0.regionPeak.gz

First gunzip all the peak files

Then run the IDR analysis on all pairs

Rscript batch-consistency-analysis.r /peaks/pooledPseudoReps/chipSampleRep0.pr1_VS_controlSampleRep0.regionPeak /peaks/pooledPseudoReps/chipSampleRep0.pr2_VS_controlSampleRep0.regionPeak -1 /consistency/pooledPseudoReps/chipSampleRep0.pr1_VS_chipSampleRep0.pr2 0 F signal.value

You can plot the IDR plots using
Rscript batch-consistency-plot.r 1 /consistency/pooledPseudoReps/chipSamplePooledReps /consistency/pooledPseudoReps/chipSampleRep0.pr1_VS_chipSampleRep0.pr2

===============================
GETTING THRESHOLDS TO TRUNCATE PEAK LISTS
===============================

The following IDR thresholds are recommended for the different types of IDR analyses.

For self-consistency and comparison of true replicates

- If you started with ~150 to 300K relaxed pre-IDR peaks for large genomes (human/mouse), then threshold of 0.01 or 0.02 generally works well. For smaller genomes such as worm, if you start with ~15K to 40K peaks then once again IDR thresholds of 0.01 or 0.02 work well.
- If you started with < 100K pre-IDR peaks for large genomes (human/mouse), then threshold of 0.05 is more appropriate. This is because the IDR sees a smaller noise component and the IDR scores get weaker. This is typically for use with peak callers that are unable to be adjusted to call large number of peaks (eg. PeakSeq or QuEST)
- For self-consistency analysis of datasets with shallow sequencing depths, you can use an IDR threshold as relaxed as 0.1 if you start with < 100K pre-IDR peaks.

For pooled-consistency analysis
- If you started with ~150 to 300K relaxed pre-IDR peaks for large genomes (human/mouse), then threshold of 0.0025 or 0.005 generally works well. For smaller genomes such as worm, if you start with ~15K to 40K peaks then once again IDR thresholds of 0.01 work well. We use a tighter threshold for pooled-consistency since pooling and subsampling equalizes the pseudo-replicates in terms of data quality. So we err on the side of caution and use more stringent thresholds.
- If you started with < 100K pre-IDR peaks for large genomes (human/mouse) , then threshold of 0.01 is more appropriate. This is because the IDR sees a smaller noise component and the IDR scores get weaker so we have to relax the thresholds. This is typically for use with peak callers that are unable to be adjusted to call large number of peaks (eg. PeakSeq or QuEST)

Each of these comparisons will give us a certain number of peaks that pass their respective IDR thresholds
We will refer to them as follows

ORIGINAL REPLICATE THRESHOLD
numPeaks_Rep1_Rep2=$( awk '$11 <= 0.01 {print $0}' /consistency/reps/chipSampleRep1_VS_chipSampleRep2-overlapped-peaks.txt | wc -l )
numPeaks_Rep1_Rep3=$( awk '$11 <= 0.01 {print $0}' /consistency/reps/chipSampleRep1_VS_chipSampleRep3-overlapped-peaks.txt | wc -l )
numPeaks_Rep2_Rep3=$( awk '$11 <= 0.01 {print $0}' /consistency/reps/chipSampleRep2_VS_chipSampleRep3-overlapped-peaks.txt | wc -l )

We pick the maximum value out of these 3. Lets call it 'max_numPeaks_Rep'

SELF-CONSISTENCY THRESHOLDS
numPeaks_Rep1_pr=$( awk '$11 <= 0.02 {print $0}' /consistency/selfPseudoReps/chipSampleRep1.pr1_VS_chipSampleRep1.pr2-overlapped-peaks.txt | wc -l )
numPeaks_Rep2_pr=$( awk '$11 <= 0.02 {print $0}' /consistency/selfPseudoReps/chipSampleRep2.pr1_VS_chipSampleRep2.pr2-overlapped-peaks.txt | wc -l )
numPeaks_Rep3_pr=$( awk '$11 <= 0.02 {print $0}' /consistency/selfPseudoReps/chipSampleRep3.pr1_VS_chipSampleRep3.pr2-overlapped-peaks.txt | wc -l )

These 3 numbers should be in a similar range. If any of these is substantially different from the others, it points that replicate being different in some way.
You should also refer back to the phantom peak coefficient and sequencing depth for the replicates as this can further support enrichment/undesequencing problems.

POOLED-PSEUDOREPLICATE THRESHOLD
numPeaks_Rep0=$( awk '$11 <= 0.0025 {print $0}' /consistency/pooledPseudoReps/chipSampleRep0.pr1_VS_chipSampleRep0.pr2-overlapped-peaks.txt | wc -l )

Typically if the self-consistency thresholds are all similar i.e. the replicates are very similar to each other, then max_numPeaks_Rep and numPeaks_Rep0 should be similar.
If numPeaks_Rep0 is substantially greater than max_numPeaks_Rep, then you have rescued the dataset since it likely had some weak/different replicates.

=======================
FINAL SET OF PEAK CALLS
=======================
You can now generate a conservative and an optimal final set of peak calls for the dataset chipSample

For the conservative set
- Sort the peaks in chipSampleRep0_VS_controlSampleRep0.regionPeak.gz by the signal.value (or p.value or q.value depending on what you used to rank the peaks in the IDR analysis) column (column 7 is signal.value) in descending order
- Pick the top max_numPeaks_Rep peaks

zcat chipSampleRep0_VS_controlSampleRep0.regionPeak.gz | sort -k7nr,7nr | head -n [max_numPeaks_Rep] | gzip -c > spp.conservative.chipSampleRep0_VS_controlSampleRep0.regionPeak.gz

For the optimal set
- Sort the peaks in chipSampleRep0_VS_controlSampleRep0.regionPeak.gz by the signal.value (or p.value or q.value depending on what you used to rank the peaks in the IDR analysis) column (column 7 is signal.value) in descending order
- Select the maximum of max_numPeaks_Rep and numPeaks_Rep0 i.e. optThresh = max(max_numPeaks_Rep,numPeaks_Rep0)
- Pick the top 'optThresh' peaks

zcat chipSampleRep0_VS_controlSampleRep0.regionPeak.gz | sort -k7nr,7nr | head -n [optThresh] | gzip -c > spp.conservative.chipSampleRep0_VS_controlSampleRep0.regionPeak.gz