#################################################################################
###  id_geno_checksum:                                                         ##
###                    GWAS genotypic overlap test without sharing genotypes   ##
#################################################################################
#
# july 1st 2017: links updated


tutorial for a successful run:
-------------------------------

- change to a directory with a valid plink binary dataset
               (bed/bim/fam; for more details on this file format see: http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml#bed)
- download the perl script (id_geno_checksum.v2),
     either by clicking on this link:
          https://personal.broadinstitute.org/sripke/share_links/checksums_download
     or directly from the command line:
          wget https://personal.broadinstitute.org/sripke/share_links/checksums_download/id_geno_checksum.v2
- make the perl script executable: e.g. 
          chmod u+x id_geno_checksum.v2
- make sure the program plink is found in your path and is executable (you can still define another location via options), e.g. try
          plink --help
          if necessary, you'll find pre-compiled versions for direct download: http://pngu.mgh.harvard.edu/~purcell/plink/download.shtml
          *** plink2 has been successfully tested with this script, with a big performance increase: https://www.cog-genomics.org/plink2/

- run the program with (assuming the three files BFILE.bed, BFILE.bim and BFILE.fam are in your working directory)
          (--batcher_name is not an output name like --outname; its value must be given exactly as shown, no change is allowed here)

          ./id_geno_checksum.v2 --batcher_name checksum.cdg.0415b --bfile BFILE

   (if the reference subdirectory is not present, the script will ask you for permission to build and fill it)
   (runtime can be up to one hour; the script runs through ten batches)
   (make sure you have enough RAM available to read the plink file, e.g. on Broad you might want to use the interactive shell ish;
                                                   as an alternative you can use plink2, which needs less RAM)
   (you can specify the location of the plink program via --ploc DIR)

- if the run succeeds you will be notified with a big success banner and asked to share a specific file with your collaborator

- for more options please use ./id_geno_checksum.v2 --help
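
the prerequisites listed above (the three plink files in the working directory, plink reachable in the path) can be checked before starting a run that may take up to an hour. What follows is only a minimal sketch of such a check; it is not part of id_geno_checksum.v2, and the function name preflight and its parameters are made up for illustration.

```python
import shutil
from pathlib import Path


def preflight(bfile, plink="plink", workdir="."):
    """Return a list of problems; an empty list means the basics are in place.

    Hypothetical helper, not part of id_geno_checksum.v2.
    """
    problems = []
    # A plink binary dataset consists of three files sharing one prefix.
    for ext in ("bed", "bim", "fam"):
        path = Path(workdir) / f"{bfile}.{ext}"
        if not path.is_file():
            problems.append(f"missing file: {path}")
    # shutil.which looks the program up in PATH, like the shell does.
    if shutil.which(plink) is None:
        problems.append(f"'{plink}' not found in PATH (consider --ploc DIR)")
    return problems


if __name__ == "__main__":
    for line in preflight("BFILE") or ["all checks passed"]:
        print(line)
```

if plink2 is installed under a different name, pass that name as the plink argument; this only tests that the program can be found, not that it runs correctly.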


here are some further characteristics:
--------------------------------------
- the script creates non-identifiable checksums out of GWAS SNPs
- the script will (after asking for permission) create a subdirectory of reference files downloaded from the internet:
          https://personal.broadinstitute.org/sripke/share_links/checksums_download/checksum.cdg.0415b.tar.gz
- the script automatically takes care of:
    - strand flips
    - LD identical SNPs (r2 == 1 over all 1KG individuals)
    - different SNP names at the same position
- it uses ten batches with 50 SNPs each, all of them found on all current and older GWAS platforms I have encountered so far (dating back to Affy500 and Illumina I317, up to and including the HumanCore, PsychChip and ENIGMA datasets).
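
to illustrate the idea of one checksum per batch, here is a minimal sketch. It is not the script's actual code. It assumes, purely for illustration, that the 50 genotypes of a batch are written as a string of the codes 0/1/2 followed by a newline, and that this string is hashed with the 32 bit CRC defined for the POSIX cksum utility. The real encoding, SNP order and handling of missing genotypes are not described in this README. The names posix_cksum and batch_checksum are hypothetical.

```python
def posix_cksum(data: bytes) -> int:
    """CRC as defined for the POSIX cksum utility (polynomial 0x04C11DB7)."""

    def feed(crc, byte):
        # Process one byte, most significant bit first.
        crc ^= byte << 24
        for _ in range(8):
            if crc & 0x80000000:
                crc = ((crc << 1) ^ 0x04C11DB7) & 0xFFFFFFFF
            else:
                crc = (crc << 1) & 0xFFFFFFFF
        return crc

    crc = 0
    for byte in data:
        crc = feed(crc, byte)
    # POSIX appends the input length, least significant byte first.
    length = len(data)
    while length:
        crc = feed(crc, length & 0xFF)
        length >>= 8
    return crc ^ 0xFFFFFFFF


def batch_checksum(genotypes):
    """Checksum for one batch of genotype codes (assumed encoding, see above)."""
    if any(g not in (0, 1, 2) for g in genotypes):
        raise ValueError("genotypes must be coded as 0, 1 or 2")
    text = "".join(str(g) for g in genotypes) + "\n"
    return posix_cksum(text.encode("ascii"))


if __name__ == "__main__":
    batch = [0, 1, 2, 1, 0] * 10  # 50 made-up genotypes
    print(batch_checksum(batch))
```

two individuals with identical genotypes in a batch always get the same number, while the number itself is just one of 2^32 values and does not spell out the genotypes.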
 

some theoretical considerations:
-------------------------------
- with genotype representation of 0,1,2 there are 3^50 = 7.2e23 possible different configurations within each batch
- it uses the <cksum> program, available by default on UNIX platforms, with its 32-bit CRC algorithm,
    which has 2^32 = 4.3e9 possible distinct outcomes
- this means that for each distinct checksum there exist on average ~1.7e14 different possible genotype configurations

- conclusion: a checksum cannot be used to reconstruct the genotype configuration, but it is still specific enough to
     identify genotypically identical individuals
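
the orders of magnitude behind this argument can be computed directly. The short sketch below does only the arithmetic. The last two figures rest on an idealised assumption that is not stated in this README: checksums of different individuals behave like independent, uniformly distributed 32 bit numbers. This will not hold exactly, for example for close relatives.

```python
configs_per_batch = 3 ** 50   # 50 SNPs, each coded 0, 1 or 2
checksum_values = 2 ** 32     # possible outcomes of a 32 bit CRC
configs_per_checksum = configs_per_batch / checksum_values

print(f"configurations per batch:    {configs_per_batch:.1e}")     # 7.2e+23
print(f"distinct checksum values:    {checksum_values:.1e}")       # 4.3e+09
print(f"configurations per checksum: {configs_per_checksum:.1e}")  # 1.7e+14

# Idealised assumption (see text): uniform, independent checksums.
chance_one_batch = 1 / checksum_values
chance_ten_batches = chance_one_batch ** 10
print(f"chance match, one batch:     {chance_one_batch:.1e}")
print(f"chance match, ten batches:   {chance_ten_batches:.1e}")
```

so many different genotype configurations map to the same checksum, which is why it cannot be reversed, while a chance match between two different individuals is very unlikely under the stated assumption.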


** WARNING **: even though these checksums do not contain genotype information, they are powerful identifiers.
  
   $$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
   $$$$$$$  DO NOT POST CHECKSUMS on public websites or similar  $$$$$$$$$$$$$$$
   $$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$


---------------------------------------------------
 created by Stephan Ripke 2014 at MGH, Boston, MA
  feel free to contact me:
      sripke (at) broadinstitute (dot) org
--------------------------------------------------