=== QC_practical_BoulderWorkshop2023.txt ====

Presenter: Daniel Howrigan

Script re-organized from the 2021 workshop practical by Lucia Colodro-Conde and Katrina Grasby 

The genetic data for todays tutorial originated from http://labs.med.miami.edu/myers/LFuN/LFUN/DATA.html
Meyers 2007 "A survey of genetic human cortical gene expression"
The phenotypes are simulated.

# Online Resources
https://www.cog-genomics.org/plink/1.9
http://zzz.bwh.harvard.edu/plink/
https://genome.ucsc.edu

# Workshop Links
https://workshop.colorado.edu/ssh 
https://workshop.colorado.edu/rstudio 

# Qualtrics website:
https://ucsas.qualtrics.com/jfe/form/SV_eWpdYL7srw7Cy6W

## ==== Practical - Short Outline ====

## === Main QC ===
# STEP 1. Data and Formats
# STEP 2. Check for reported/genotype sex discrepancies
# STEP 3. Obtain information on individuals missing SNP data
# STEP 4. Variant QC: SNPs missing data; MAF; Hardy-Weinberg
# STEP 5. Sample QC: genotype call rate and heterozygosity 
# STEP 6. LD-pruned SNP set
# STEP 7. Sample QC: sex check filtering using LD-pruned SNP set
# STEP 8. Sample QC: Checking for cryptic relatedness

## === Additional QC ===
# STEP 9. Genome Build checking
# STEP 10. Haploid het checking / filtering
# STEP 11. HapMap3 reference panel merging and strand ambiguous SNP checking / filtering
# STEP 12. Using Principal Component Analysis to infer broad ancestry
# STEP 13. Looking at additional QC/PCA pipelines
# STEP 14. Clean up your workspace


## ==== Practical - Extended Outline ====

## === Main QC ===

# STEP 1. Data and Formats
#  1.1 Creating workspace
#  1.2 Copying over genetic dataset
#  1.3 Looking at genetic dataset
#  1.4 Convert to PLINK binary file format

# STEP 2. Check for reported/genotype sex discrepancies
#  2.1 Running PLINK -sexcheck
#  2.2 Viewing discrepant samples
#  2.3 Plotting X chromosome heterozygosity in R

# STEP 3. Obtain information on individuals missing SNP data
#  3.1 Running PLINK --missing
#  3.2 Plotting sample call rate in R

# STEP 4. Variant QC: SNPs missing data; MAF; Hardy-Weinberg
#  4.1 Filter SNP call rate < 95% with PLINK --geno
#  4.2 Plot SNP call rate in R 
#  4.3 Case-control call rate difference with PLINK --test-missing
#  4.4 Filter case-control call rate SNPs with p < 1e-5
#  4.5 Running PLINK --freq
#  4.6 Plot the sample allele frequency spectrum in R
#  4.7 Filter low-frequency SNPs with PLINK --geno
#  4.8 Running PLINK --hardy and Filter SNPs with --hwe 1e-6

# STEP 5. Sample QC: genotype call rate and heterozygosity 
#  5.1 Filtering samples with < 95% call rate with PLINK --mind
#  5.2 Plotting pre and post QC call rates in R
#  5.3 Running PLINK --het
#  5.4 Filtering samples with excess / depleted heterozygosity

# STEP 6. LD-pruned SNP set
#  6.1 Extracting independent SNP list 

# STEP 7. Sample QC: sex check filtering using LD-pruned SNP set
#  7.1 Filtering sex-check discrepant samples
#  7.2 Compare against earlier sex-check evaluation

# STEP 8. Sample QC: Checking for cryptic relatedness
#  8.1 Running PLINK --genome on LD-pruned SNP set
#  8.2 Running PLINK --genome to identify first/second degree relatives


## === Additional QC ===

# STEP 9. Genome Build checking

# STEP 10. Haploid het checking / filtering
#  10.1 Using PLINK --split-par (formerly --split-x) to evaluate the X/Y psuedo-autosomal region
#  10.2 Why we used PLINK --set-hh-missing

# STEP 11. HapMap3 reference panel merging and strand ambiguous SNP checking / filtering
#  11.1 Selecting strand ambiguous SNPs
#  11.2 Extract HapMap3 (HM3) data on our remaining SNPs
#  11.3 Reduce our data to only include SNPs also in HM3
#  11.4 Merge our cleaned data and the HM3 data (for the snps that matched)
#  11.5 Flip the strand for the missnps in our data using PLINK --flip
#  11.6 Merge again, using the file with the flipped SNPs

# STEP 12. Using Principal Component Analysis to infer broad ancestry
#  12.1 Run PCA for 4 components
#  12.2 Plot PCA eigenvectors in R

# STEP 13. Looking at additional QC/PCA pipelines
#  13.1 Look at the Preimputation (QC) page, and read the Input requirements and Technical Details.
#  13.2 GWASpy pipeline
#  13.3 Separate GWAS tutorial on GitHub

# STEP 14. Clean up your workspace


## ==== Practical - Questions ====

Q1a. How is the trait coded?

Q1b. How many SNPs, people, females, and controls are there?

Q2a. How many mismatches were flagged?

Q3a. How many individuals will we drop if we keep a Sample Call Rate of >=95 (i.e. remove those with >5% missing SNPs)?

Q4a. How many variants were removed by the call rate filter?

Q4b. How many variants were removed using the MAF filter?

Q4c. How many variants were dropped due to HWE?

Q5a. How many samples were dropped with call rate < 0.95?

Q5b. How many people were >3SD from the mean heterozygosity rate?

Q7a. How many sex discrepant samples are there?

Q7b. Are the others who were initially flagged as mismatches still in the data set?

Q7c. At what step in QC were the other flagged individuals removed?

Q9a. Which build are these data mapped to?

Q10a. Did this remove the remaining het haploid genotypes?


# NOTE: Answers to the questions are at the end of the script.


#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# STEP 1.  Data and Formats
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


# We begin by creating our working directory within the secure shell

# 1) Go to: https://workshop.colorado.edu/ssh/
# 2) Enter workshop login ID and password

# you are now in your home directory

#  1.1 Creating workspace

## Create day1 subdirectory (-p creates full path into new directories)
mkdir -p ~/day1/QC
## traverse into new subdirectory
cd ~/day1/QC


#  1.2 Copying over genetic dataset

# Copy the files to your working subdirectory
cp /faculty/daniel/2023/QC/* .

# Check you have the required files:
ls -l
# HM3.bed
# HM3.bim
# HM3.fam
# QC_practical_BoulderWorkshop2023.R 
# QC_practical_BoulderWorkshop2023.sh
# cc.ped
# cc.map


#  1.3 Looking at genetic dataset

# The ped (pedigree format) file holds both phenotype 
# and genotype information.
# Each row is an individual.
# Ped files always begin with: 
# FamilyID IndividualID PaternalID MaternalID SEX
# Sex is coded with Male = 1, Female = 2 
# HINT: think of number of X chromosomes
# Then follows the trait(s).
# For case-controls: 
# U = 1 = controls; A = 2 = cases; X = 0 = -9 = missing
# Then come the columns for all the SNP data. 
# SNP data is represented with 2 columns, one for each allele.
# SNPs are coded either with letters (a,c,t,g) or numbers (1,2,3,4)


# --- Look at the ped file:

less -S cc.ped

# Use the -S flag to stop rows from wrapping
# less REMINDER: q to quit

# Scroll down to get an idea of how the trait is coded.

# Q1a. How is the trait coded?

# NOTE: Answers to the questions are at the end of the script.

# You can check the minimum and maximum value of the 
# trait (column $6) with the following command:

awk 'NR==1{min=max=$6};$6<min{min=$6};$6>max{max=$6}END{print min, max}' cc.ped

# If you're certain it's not continuous, 
# then you get more of an idea about sample size and missing data with:

awk '{print $6}' cc.ped | sort -n | uniq -c 


# --- Look at the map file

# Columns: 
# CHR = Chromosome
# RS# = Reference SNP ID
# Distance = Genetic Distance (in centimorgans)
# Position = Human genome reference base position

# Often there are zeros are in the distance column = unknown

less cc.map


#  1.4 Convert to PLINK binary file format

# Convert the ped & map files to plink bfile (binary) formatted files.

# The plink command --make-bed creates the binary format files:
# fam = information on the pedigree and phenotype
# bim = this is the map file with the alleles for each variant
# bed = binary file with each allele for each person

# Each time we remove either variants or individuals we will 
# create a new set of bfiles using the --make-bed flag.

plink --ped cc.ped --map cc.map --make-bed --out cc.begin

# NOTE: Errors in plink will stop progress. 
# Warnings are to inform your decisions. 

# Q1b. How many SNPs, people, females, and controls are there?

# What about the .hh file?
# Warning: 2177 het. haploid genotypes present (see cc.begin.hh)

# Heterozygous haploid errors result from 
# heterozygous calls on chromosome X for males.
# Males have one X chromosome. They are hemizygous.
# They ought not be heterozygous for loci on chromosome X,
# unless the loci in are in pseudo-autosomal regions.
# We can use the split-x command to check, 
# which we will do later in the tutorial.
# Heterozygous haploid errors are saved into files with the suffix .hh 
# if you want to examine them in more detail.


#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# STEP 2. Check for reported/genotype sex discrepancies
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

#  2.1 Running PLINK -sexcheck

# Check for a mismatch between sex reported in the ped file
# and sex from genotype data.
# This can help identify if IDs have been mismatched to genetic data.
# This is uncommon with modern machinery and barcoding,
# but still possible.
# Since we are only checking details we will use the PLINK --out command.

plink --bfile cc.begin --check-sex --out sex


#  2.2 Viewing discrepant samples

less sex.sexcheck 

# This file has a column that indicates the reported sex and
# sex inferred from the genetic data
# --check-sex is a heterozygosity check of chromosome X.
# Males have one chromosome X, females have two.
# We expect heterozygosity on chromosome X for females but not males.
# The F statistic (inbreeding coefficient) based on SNP homozygosity. 
# For males it should be close to 1, for females it should be close to 0.
# Typically, above .8 is classed as male, below .2 is classed as female.
# The STATUS column (OK/PROBLEM) reflects if reported sex matches the genetic data.

# Select the individuals flagged as not matching

grep PROBLEM sex.sexcheck 

## for a cleaner output, use this command
grep 'PROBLEM\|FID' sex.sexcheck | column -t

## "\|" operator looks for multiple matches, and \| treats the | as an operator instead of a string
## column -t tab-separates the output


# Q2a. How many mismatches were flagged?


#  2.3 Plotting X chromosome heterozygosity in R

## Using Rstudio:
# 1) Open a separate browser window
# 2) Go to https://workshop.colorado.edu/rstudio
# 3) Login with your username and password
# 4) In the "Terminal" tab, navigate to your working directory (cd '~/day1/QC')
# 5) To open the Rscript, go to File -> Open -> QC_practical_BoulderWorkshop2023.R
# 6) Read through the R script through to section 2.3

# Plot chromosome X heterozygosity in R

# We are checking for errors here to assess if there is a major problem.
# There are some reports that checking sex ought to be done after 
# pruning the variants to be independent (in linkage equilibrium)
# So for now we will continue with all of our sample,
# and see if there is a problem when checking sex on independent variants
# later in the tutorial.


#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# STEP 3. Obtain information on individuals missing SNP data
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

#  3.1 Running PLINK --missing

# The --missing command will give output files on both types of missingness: 
# individuals missing SNP data -> imiss
# variants missing data from individuals -> lmiss 
# (NOTE: the "l" stands for locus)

plink --bfile cc.begin --missing --out miss

# Have a look at both: 
less miss.imiss 
less miss.lmiss


#  3.2 Plotting sample call rate in R

# Q3a. How many individuals will we drop if we keep a 
# Sample Call Rate of >=95 (i.e. remove those with >5% missing SNPs)? 

# NOTE: check the R script for the answer


#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# STEP 4. Variant QC: SNPs missing data; MAF; Hardy-Weinberg
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


# Now we start cleaning up our data!

#  4.1 Filter SNP call rate < 95% with PLINK --geno

# Remove variants that are missing information from too many individuals.
# We will keep those with a genotype call rate of at least .95 
# (i.e. variants with data from at least 95% of our sample).

plink --bfile cc.begin --geno 0.05 --make-bed --out cc.qc1

# Q4a. How many variants were removed by the call rate filter? 


#  4.2 Plot SNP call rate in R 


#  4.3 Case-control call rate difference with PLINK --test-missing

# When cleaning case-control data, check that variants are
# not disproportionately missing between cases and controls.

plink --bfile cc.qc1 --test-missing --out case-control 


#  4.4 Filter case-control call rate SNPs with p < 1e-5

# This awk code will grab any variants that differ in missingness 
# between cases and controls with a p value < .00001 
# and save those variants into a file to be dropped.

awk '$5<=0.00001' case-control.missing > case-control.missing.drop

less case-control.missing.drop

# There are none in this sample, as the file is empty.
# If there were, then we could use the PLINK --remove command.
# (We will implement --remove later in this practical.)


#  4.5 Running PLINK --freq

# Obtain MAF information on the whole sample.
# We do not need to create this frequency file for QC, 
# but it is a useful command to know.
# And by creating the file we can plot the distribution of MAF.

plink --bfile cc.begin --freq --out maf

# NOTE: When using bfiles, PLINK traditionally sets the minor allele as A1. 
# PLINK2 no longer does this
# PRO TIP: using PLINK's '--keep-allele-order' command will prevent PLINK from setting the minor allele as A1
# This is useful if you import VCF data and want A1/A2 to reflect REF/ALT 


#  4.6 Plot the sample allele frequency spectrum in R


#  4.7 Filter low-frequency SNPs with PLINK --geno

plink --bfile cc.qc1 --maf 0.01 --make-bed --out cc.qc2

# Q4b. How many variants were removed using the MAF filter? 

# NOTE: There will be less than from the result in R as that 
# was obtained prior to any QC steps.

# NOTE: The minor allele count for a 0.01 MAF SNP in our sample 

## MAF * (NCHROM)
## or 0.01 * (193*2) = 3.86 minor alleles

## As sample sizes get larger, this QC threshold should get lower for GWAS studies


#  4.8 Running PLINK --hardy and Filter SNPs with --hwe 1e-6

# Check Hardy-Weinberg Equation (HWE) deviation
# Remove variants with P < 1e-6
# With case-control data, the default HWE check 
# only conducts the check on the controls.
# We have overwritten that here with the --include-nonctrl flag. 
# The --hardy command will give us the HWE values for each SNP
# calculated on the whole sample, the cases, and the controls.
# The --hwe command will drop those variants that are below threshold.

plink --bfile cc.qc2 --hardy midp --hwe 1e-6 midp include-nonctrl --make-bed --out cc.qc3

# NOTE The 'midp' is a p-value adjustment added to test to 
# bring the null rejection rate in line with the nominal p-value
# and also reduces the filter's tendency to favor retention of variants with missing data
# (see https://www.cog-genomics.org/plink/1.9/filter#hwe)

# Q4c. How many variants were dropped due to HWE?

# Given the warning about variation due to chromosome X
# Let's check if these HWE errors are all on chromosome X (coded as 23 in PLINK).
# Keep those SNPs with P < 1e-6

grep CHR cc.qc3.hwe > hwe
awk '$9<1e-6' cc.qc3.hwe | grep ALL >> hwe

# Sort and count number of variants on each chromosome.

awk '{print $1}' hwe | sort -n | uniq -c

# Not specific to chromosome X, as there were variants on all chromosomes dropped due to HWE deviations
# from the expected.


#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# STEP 5. Sample QC: genotype call rate and heterozygosity 
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


#  5.1 Filtering samples with < 95% call rate with PLINK --mind

plink --bfile cc.qc3 --mind 0.05 --make-bed --out cc.qc4

# Q5a. How many samples were dropped with call rate < 0.95?

# Compare this to how many we would have dropped if we removed 
# individuals before cleaning up the variants (above in Q3a).

# Cleaning up the variants before checking individuals, means we 
# removed only 2.5% of this sample instead of 20% of this sample
# if we had done the reverse. 

# Lets obtain missingingness information from the data file created
# at the end of STEP 4 when we finished cleaning variants (Post QC).
plink --bfile cc.qc3 --missing --out miss2


#  5.2 Plotting pre and post QC call rates in R


#  5.3 Running PLINK --het

# Per-sample Heterozygosity

# Obtain the heterozygosity information on the autosomes (chr 1-22).
# Too much heterozygosity might be a problem from DNA contamination.
# Allele frequencies are used in this calculation.

plink --bfile cc.qc4 --het --out het


#  5.4 Filtering samples with excess / depleted heterozygosity

# Calculate the proportion of heterozygosity:
# (total genotypes - homozygotes) / total number of autosomal genotypes

echo "FID IID obs_HOM N_SNPs prop_HET" > het.txt 
awk 'NR>1{print $1,$2,$3,$5,($5-$3)/$5}' het.het >> het.txt

# This awk command will calculate the values of the proportion of het that corresponds to +3SD and -3SD around the mean
# (on prop_HET, the 5th column of a file that we have just created)

awk 'NR>1{sum+=$5;sq+=$5^2}END{avg=sum/(NR-1);print avg-3*(sqrt(sq/(NR-2)-2*avg*(sum/(NR-2))+(((NR-1)*(avg^2))/(NR-2)))),avg+3*(sqrt(sq/(NR-2)-2*avg*(sum/(NR-2))+(((NR-1)*(avg^2))/(NR-2))))}' het.txt

# Save individuals with heterozygosity rates >3SD into a file to then drop them.

awk '$5<=0.299429 || $5>= 0.326129' het.txt> het.drop 

# Q5b. How many samples were >3SD from the mean heterozygosity rate? 

# Remove them
plink --bfile cc.qc4 --remove het.drop --make-bed --out cc.qc5


#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# STEP 6. LD-pruned SNP set
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

#  6.1 Extracting independent SNP list 

# Remember that we postponed removing sex errors until we had 
# clean and independent SNPs.

# First prune SNPs. This creates a file (indep.prune.in) 
# of independent SNPs following the criteria for 
# linkage disequilibrium r2 <0.2 within windows of 1000 kb:

plink --bfile cc.qc5 --indep-pairwise 1000 5 0.2 --out indep

# check how many SNPs remain in the LD-pruned set
wc -l indep.prune.in


#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# STEP 7. Sample QC: sex check filtering using LD-pruned SNP set
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


#  7.1 Filtering sex-check discrepant samples

# Check for sex errors using these independent SNPs

plink --bfile cc.qc5 --extract indep.prune.in --check-sex --out sex2

# Copy them into a file to remove from the data set.

grep PROBLEM sex2.sexcheck > sex.drop

# Q7a. How many sex discrepant samples are there?


#  7.2 Compare against earlier sex-check evaluation

grep PROBLEM *.sexcheck

# NOTE: This will code extract all rows that include PROBLEM 
# from all files that end in .sexcheck


# Q7b. Are the others who were initially flagged as mismatches
# still in the data set? 

# HINT: look at cc.qc5.fam column 2 (IID)


# Q7c. At what step in QC were the other flagged individuals removed?

# Hint: grep the original IID from the various *fam files 
# and you will be able to see at what step they are no longer included.

# Remove the remaining individual(s) with a mismatch on sex.

plink --bfile cc.qc5 --remove sex.drop --set-hh-missing --make-bed --out cc.clean

## NOTE: the --set-hh-missing command is explained in more detail in step 10


#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# STEP 8. Sample QC: Checking for cryptic relatedness
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


#  8.1 Running PLINK --genome on LD-pruned SNP set

# Calculate identity-by-descent (IBD) on the pruned SNPs on the
# autosomes. Without actual pedigree data this estimated from 
# identity-by-state.

plink --bfile cc.clean --chr 1-22 --extract indep.prune.in --genome --out ibd

# Look at the output

less ibd.genome

# This calculated the relationships between every pairing of
# individuals in the sample.
# Z0 Z1 Z2 = the pair of IDs share 0, 1 or 2 alleles by descent.
# Ideally 1,0,0 = unrelated; 0,1,0 = parent-child; .25,.5,.25 = siblings
# pihat = proportion IBD = P(IBD=2)+0.5*P(IBD=1) 


#  8.2 Running PLINK --genome to identify first/second degree relatives

# Obtain a list of individuals who are related with pihat > .2

plink --bfile cc.clean --extract indep.prune.in --genome --min 0.2 --out pihat

# Check them in the file pihat.genome
# In this case there are none to remove. 


#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# END of the MAIN QC section
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# STEP 9. Genome Build checking
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

# Look at your .map or .bim file 
# Cross reference the base position of several SNPs with 
# what is shown in the genome browser for several builds.

# https://genome.ucsc.edu
# Click on Genomes

# Under Human Assembly use the drop-down menu to go to a build.
# Enter an rs# chr:bp (e.g. rs3094315 or chr1:752566)
# Click Go

# Click on one of the links for dbSNP i.e. 
# Common (1000 Genomes Phase 3 MAF >= 1%) Short Genetic Variants from dbSNP Release 153
# or Simple Nucleotide Polymorphisms (dbSNP build XXX)
# There are a lot of possible links here for different releases of dbSNP.

# Then click on the rs# link on the left-hand side
## NOTE: This is inside the graph that describes each UCSC track being plotted, NOT the lists below!

# Here you will see useful information about the variant
# including position, alleles, frequency in from different projects.

# You are looking to see if the position listed against the 
# rs# in the map file matches the position in the build.
# As builds are updated, the position number for some of the 
# rs# will change. Check a couple of rs# across different builds.

# Q9a. Which build are these data mapped to?

# To LiftOver your genomic positions to a different build, see
# https://genome.ucsc.edu/cgi-bin/hgLiftOver
# https://hail.is/docs/0.2/functions/genetics.html#hail.expr.functions.liftover


#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# STEP 10. Haploid het checking / filtering
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

# When we removed samples with sex-check discrepancies, we still had
# hetorzygous haploid genotypes present.


#  10.1 Using PLINK --split-par (formerly --split-x) to evaluate the X/Y psuedo-autosomal region

# Use split-par to see if the remaining issues are due to the 
# pseudo-autosomal regions. This command is taking the variant 
# position boundaries of the pseudo-autosomal region according to 
# build 37 (hg19) and changing the chromosome codes of all variants
# in the region to XY. So plink will not treat these regions as haploid.
# If the data was on a different build, you would need to check 
# if plink will implement the appropriate flag

plink --bfile cc.qc5 --split-par b37 no-fail --make-bed --out cc.splitx

# Q10a. Did this remove the remaining het haploid genotypes? 

#  10.2 Why we used PLINK --set-hh-missing

# These remaining het haplod issues do not appear to be from 
# pseudo-autosomal regions. Therefore, these remaining variants would
# be removed for analytical commands, but we can set them to missing.

## This is why the '--set-hh-missing' is used to create the cc.clean PLINK files
 
# NOTE: the number of variants and individuals did not change.
# The errors come from specific variants for certain people. 
# Only those errors are set to missing.
# If you have time (and want to) you can explore the details 
# of which variants and people in the *.hh files.


#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# STEP 11. HapMap3 reference panel merging and strand ambiguous SNP checking / filtering
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


#  11.1 Selecting strand ambiguous SNPs

# First drop strand ambiguous alleles from our cleaned data set and
# Create a file with a list of our variants (rs#).

# NOTE on strand ambiguous alleles:
# Each chromosome is double-stranded DNA. The two strands bond together
# with A paired to T and C paired to G.
# Strands may be named as + and - (or forward and backward, 
# or top and bottom) but with genotyped data this naming is quite 
# arbitrary. If we are combining data across platforms or samples
# there is uncertainty around if the strand reference is the same.
# Often we get around this uncertainty by removing strand ambiguous alleles. 

# In the 5th and 6th column of the bim file, strand ambiguous SNPs will have
# an A and T as the two possible alleles for that variant, or they will
# have a C and G as the two possible alleles.

# We will remove all strand ambiguous allele SNPs before merging our data with HapMap3 (HM3).
# This bit of code selects the possible strand ambiguous allele combinations
# from columns 5 and 6, and then prints the RS# along with the word ambig 
# Those variants not selected just have the RS# printed 
# Then we select those rows of data without the word ambig
# And copy them into a file cc.clean.snplist

awk '{if(($5=="T" && $6=="A")||($5=="A" && $6=="T")||($5=="C" && $6=="G")||($5=="G" && $6=="C")) print $2, "ambig"; else print $2}' cc.clean.bim | grep -v ambig  > cc.clean.snplist

#  11.2 Extract HapMap3 (HM3) data on our remaining SNPs

plink --bfile HM3 --extract cc.clean.snplist --chr 1-22 --make-bed --out hm3-oursnps

# Obtain a list of HM3 snps
awk '{print $2}' HM3.bim > HM3.snplist

#  11.3 Reduce our data to only include SNPs also in HM3

plink --bfile cc.clean --chr 1-22 --extract HM3.snplist --make-bed --out cc.clean-hm3snps 

#  11.4 Merge our cleaned data and the HM3 data (for the snps that matched)

plink --bfile cc.clean-hm3snps --bmerge hm3-oursnps.bed hm3-oursnps.bim hm3-oursnps.fam --make-bed --out cc.merged 

# This gives an error!
# PLINK suggests it may be due to strand inconsistency.
# And provides an output file with the potential mismatched SNPs
# e.g. rs10000037 T C in our data and A G in HapMap.
# If we flip the strand of the mismatched SNPs in our data
# the T flips to A and the C flips to G and our rs10000037 data will
# align with HapMap.

#  11.5 Flip the strand for the missnps in our data using PLINK --flip

plink --bfile cc.clean-hm3snps --flip cc.merged-merge.missnp --make-bed --out cc.flipped 

#  11.6 Merge again, using the file with the flipped SNPs

plink --bfile cc.flipped --bmerge hm3-oursnps.bed hm3-oursnps.bim hm3-oursnps.fam --make-bed --out cc.flipped.merged


#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# STEP 12. Using Principal Component Analysis to infer broad ancestry
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

# PCA is a linear predictor that explains the largest amount of variance in the genotypes.
# We will combine our cleaned data with HapMap3 data (MH3), 
# then plot the data along MDS components 1 and 2 to see how our 
# data maps onto HapMap3 data.


#  12.1 Generate an LD-pruned subset of SNPs in the merged dataset

plink --bfile cc.flipped.merged --indep-pairwise 200 100 0.2 --out indep.merged

plink --bfile cc.flipped.merged --extract indep.merged.prune.in --make-bed --out cc.flipped.merged.ldprune


#  12.2 Run PCA for 4 components

plink --bfile cc.flipped.merged.ldprune --pca 4 --out pca


#  12.3 Plot PCA eigenvectors in R


#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# STEP 13. Looking at additional QC/PCA pipelines
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Visit https://sites.google.com/a/broadinstitute.org/ricopili/

#  13.1 Look at the Preimputation (QC) page, and read the Input requirements and Technical Details.
# Many of the details automate the commands that you have just run through in PLINK

# Scroll the PCA page to see how this module: 
# - filters and LD prunes genotype data
# - checks relatedness
# - produces PCA graphs 

## Custom installation of Ricopili to any high performance cluster computer is detailed here:
# https://docs.google.com/document/d/14aa-oeT5hF541I8hHsDAL_42oyvlHRC5FWR7gir4xco/edit?usp=sharing

## Supported job scheduler enviroments:
# qsub
# bsub
# slurm
# Google Cloud Platform


#  13.2 GWASpy pipeline

Visit https://gwaspy.readthedocs.io/en/latest/

# Written Lindo Nkambule, this pipeline uses Hail and Python to run parallel modules to Ricopili

Visit https://github.com/atgu/hgdp_tgp/tree/master/tutorials#links-to-view-notebooks

# To check out the how it was used in the tutorial jupyter notebooks for the HGDP+1000genomes dataset

## Supported job scheduler enviroments:
# Google Cloud Platform


#  13.3 Separate GWAS tutorial on GitHub

Visit https://github.com/MareesAT/GWA_tutorial

## Written by Andries Marees, this is another full GWAS tutorial publicly available on GitHub


#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# STEP 14. Clean up your workspace
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

# It is normal to create many PLINK datasets in QC, so its important to clean it up!

# Look at how much space you've taken up 
ls -lh | head -n 1

# Use the * to find and remove specific files
ls -lh cc.clean* ## all files starting with cc.clean
ls -lh *bed ## all files

## remove only qc PLINK .bed/.bim/.fam/.hh files

ls -lh cc.qc*

rm cc.qc*bed
rm cc.qc*bim
rm cc.qc*fam
rm cc.qc*hh

ls -lh cc.qc*

# when you are done with the practical, erase all files in the folder (be careful with this command!)
rm ~/2023/QC/*

# remember all files can be re-created running the commands in both this script and the .sh script with the same name!


## ANSWERS:

Q1a. How is the trait coded? 1 and 2

Q1b. How many SNPs, people, females, and controls are there? 484128 SNPs. 193 people. 86 females. 66 controls.

Q2a. How many mismatches were flagged? 3 samples

Q3a. How many individuals will we drop if we keep a Sample Call Rate of >=95 (i.e. remove those with >5% missing SNPs)? 39 samples

Q4a. How many variants were removed by the call rate filter? 130901 variants removed

Q4b. How many variants were removed using the MAF filter? 49709 variants

Q4c. How many variants were dropped due to HWE? 626 variants

Q5a. How many samples were dropped with call rate < 0.95? 5 samples

Q5b. How many people were >3SD from the mean heterozygosity rate? 3 samples

Q7a. How many sex discrepant samples are there? 1 sample

Q7b. Are the others who were initially flagged as mismatches still in the data set? No

Q7c. At what step in QC were the other flagged individuals removed? Filtered out after sample call rate (PLINK --geno) filter in cc.qc2

Q9a. Which build are these data mapped to? build 37 also called hg19.

Q10a. Did this remove the remaining het haploid genotypes? No