UKBB Reference Panels

* Population assignment
    - Select genetic variants that are available in both 1000 Genomes (1KG) and the UKBB genotyped dataset
    - Remove the following variants in 1KG: (i) strand-ambiguous SNPs; (ii) variants located on sex chromosomes or in long-range LD regions (chr6: 25-35Mb; chr8: 7-13Mb); (iii) variants with call rate <0.98; (iv) variants with MAF <5%
    - LD-prune the remaining variants in 1KG using PLINK (--indep-pairwise 100 50 0.2), yielding 149,501 largely independent, high-quality common variants
    - Calculate principal components (PCs) using the LD-pruned SNPs in 1KG samples
    - Project UKBB samples onto the 1KG PCs using the SNP loadings
    - Train a random forest model to predict the 5 super-population labels (AFR, AMR, EAS, EUR, SAS) using the top 6 PCs in 1KG
    - Apply the trained random forest classifier to UKBB samples to predict the genetic ancestry of each UKBB participant
    - Retain UKBB samples that can be assigned to one of the super-populations with predicted probability >90%
    - Remove individuals meeting any of the following criteria: (i) mismatch between self-reported and genetically inferred sex; (ii) missingness or heterozygosity outliers; (iii) sex chromosome aneuploidy
    - Select a set of unrelated individuals for each predicted population in UKBB

* Final sample size
    - AFR: 7,507; AMR: 687; EAS: 2,181; EUR: 375,120; SAS: 8,412

* Reference building
    - For each population, select non-ambiguous HapMap3 SNPs with imputation INFO >0.8 and minor allele frequency >1%
    - Define LD blocks using pre-computed cutoff points (http://bitbucket.org/nygcresearch/ldetect-data; AFR: 2,582 blocks; ASN: 1,445 blocks; EUR: 1,703 blocks)
    - Calculate the LD matrix for each block using PLINK
    - Save the LD matrices in HDF5 format
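
To make the block-wise LD step under "Reference building" more concrete, here is a minimal sketch of how SNPs might be partitioned by block cutoff points and a correlation (LD) matrix computed per block. It uses NumPy rather than PLINK, so it illustrates the idea only and is not the pipeline used to build these panels. The function name `ld_blocks`, the half-open block convention, and the toy genotypes, positions, and breakpoints are all hypothetical. Writing the matrices to HDF5 is left out; a library such as h5py could be used for that.

```python
import numpy as np


def ld_blocks(genotypes, positions, breakpoints):
    """Compute one LD (correlation) matrix per block of SNPs.

    genotypes   : (n_samples, n_snps) array of allele dosages (0/1/2)
    positions   : (n_snps,) base-pair positions on a single chromosome
    breakpoints : sorted block boundaries; block i covers the half-open
                  interval [breakpoints[i], breakpoints[i + 1])
                  (an assumption made for this sketch)

    Returns a list of (snp_indices, ld_matrix) tuples, skipping empty blocks.
    """
    genotypes = np.asarray(genotypes, dtype=float)
    positions = np.asarray(positions)
    breakpoints = np.asarray(breakpoints)

    # For each SNP, find the index of the block whose interval contains it.
    block_idx = np.searchsorted(breakpoints, positions, side="right") - 1

    blocks = []
    for i in range(len(breakpoints) - 1):
        cols = np.flatnonzero(block_idx == i)
        if cols.size == 0:
            continue
        # Pearson correlation between SNP columns gives the signed LD (r).
        # Monomorphic SNPs would yield NaN here; the panel's MAF filter
        # is assumed to have removed them beforehand.
        r = np.corrcoef(genotypes[:, cols], rowvar=False)
        blocks.append((cols, np.atleast_2d(r)))
    return blocks


# Toy data (hypothetical): 4 samples, 5 SNPs, 2 blocks.
toy_genotypes = np.array([
    [0, 0, 2, 1, 0],
    [1, 1, 1, 0, 2],
    [2, 2, 0, 1, 1],
    [1, 1, 1, 2, 0],
])
toy_positions = [100, 200, 350, 400, 900]
toy_breakpoints = [0, 300, 1000]

for snp_idx, ld in ld_blocks(toy_genotypes, toy_positions, toy_breakpoints):
    print(snp_idx.tolist(), ld.shape)
```

Keeping one matrix per block, rather than a single chromosome-wide matrix, keeps each matrix small; in practice the per-block results would be stored as separate datasets in the HDF5 file.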