Dariusz Przybylski
Senior Computational Biologist
Broad Institute of MIT and Harvard
415 Main Street
Cambridge, MA 02142
Google Scholar profile
LinkedIn profile

Research Interests

Computational biology and bioinformatics
Systems biology
Cell circuitry
Genome and transcriptome assembly
Protein structure and function
Biophysics

Refereed Publications

A Genome-wide CRISPR Screen in Primary Immune Cells to Dissect Regulatory Networks. Oren Parnas, Marko Jovanovic, Thomas M. Eisenhaure, Rebecca H. Herbst, Atray Dixit, Chun Jimmie Ye, Dariusz Przybylski, Randall J. Platt, Itay Tirosh, Neville E. Sanjana, Ophir Shalem, Rahul Satija, Raktima Raychowdhury, Philipp Mertins, Steven A. Carr, Feng Zhang, Nir Hacohen, Aviv Regev Cell 2015 July;162:675-686. doi:10.1016/j.cell.2015.06.059
Finding the components of cellular circuits and determining their functions systematically remains a major challenge in mammalian cells. Here, we introduced genome-wide pooled CRISPR-Cas9 libraries into dendritic cells (DCs) to identify genes that control the induction of tumor necrosis factor (Tnf) by bacterial lipopolysaccharide (LPS), a key process in the host response to pathogens, mediated by the Tlr4 pathway. We found many of the known regulators of Tlr4 signaling, as well as dozens of previously unknown candidates that we validated. By measuring protein markers and mRNA profiles in DCs that are deficient in known or candidate genes, we classified the genes into three functional modules with distinct effects on the canonical responses to LPS and highlighted functions for the PAF complex and oligosaccharyltransferase (OST) complex. Our findings uncover new facets of innate immune circuits in primary cells and provide a genetic approach for dissection of mammalian cell circuits.
Dynamic profiling of the protein life cycle in response to pathogens. Marko Jovanovic, Michael S. Rooney, Philipp Mertins, Dariusz Przybylski, Nicolas Chevrier, Rahul Satija, Edwin H. Rodriguez, Alexander P. Fields, Schraga Schwartz, Raktima Raychowdhury, Maxwell R. Mumbach, Thomas Eisenhaure, Michal Rabani, Dave Gennert, Diana Lu, Toni Delorey, Jonathan S. Weissman, Steven A. Carr, Nir Hacohen, Aviv Regev Science 2015 Mar;347:1259038. doi:10.1126/science.1259038
Mammalian gene expression is tightly controlled through the interplay between the RNA and protein life cycles. Although studies of individual genes have shown that regulation of each of these processes is important for correct protein expression, the quantitative contribution of each step to changes in protein expression levels remains largely unknown and much debated. Many studies have attempted to address this question in the context of steady-state protein levels, and comparing steady-state RNA and protein abundances has indicated a considerable discrepancy between RNA and protein levels. In contrast, only a few studies have attempted to shed light on how changes in each of these processes determine differential protein expression—either relative (ratios) or absolute (differences)—during dynamic responses, and only one recent report has attempted to quantitate each process. Understanding these contributions to a dynamic response on a systems scale is essential both for deciphering how cells deploy regulatory processes to accomplish physiological changes and for discovering key molecular regulators controlling each process.
Coelacanth The African coelacanth genome provides insights into tetrapod evolution. Chris T. Amemiya, Jessica Alföldi, Alison P. Lee, Shaohua Fan, Hervé Philippe, Iain MacCallum, Ingo Braasch, Tereza Manousaki, Igor Schneider, Nicolas Rohner, Chris Organ, Domitille Chalopin, Jeramiah J. Smith, Mark Robinson, Rosemary A. Dorrington, Marco Gerdol, Bronwen Aken, Maria Assunta Biscotti, Marco Barucca, Denis Baurain, Aaron M. Berlin, Gregory L. Blatch, Francesco Buonocore, Thorsten Burmester, Michael S. Campbell, Adriana Canapa, John P. Cannon, Alan Christoffels, Gianluca De Moro, Adrienne L. Edkins, Lin Fan, Anna Maria Fausto, Nathalie Feiner, Mariko Forconi, Junaid Gamieldien, Sante Gnerre, Andreas Gnirke, Jared V. Goldstone, Wilfried Haerty, Mark E. Hahn, Uljana Hesse, Steve Hoffmann, Jeremy Johnson, Sibel I. Karchner, Shigehiro Kuraku, Marcia Lara, Joshua Z. Levin, Gary W. Litman, Evan Mauceli, Tsutomu Miyake, M. Gail Mueller, David R. Nelson, Anne Nitsche, Ettore Olmo, Tatsuya Ota, Alberto Pallavicini, Sumir Panji, Barbara Picone, Chris P. Ponting, Sonja J. Prohaska, Dariusz Przybylski, Nil Ratan Saha, Vydianathan Ravi, Filipe J. Ribeiro, Tatjana Sauka-Spengler, Giuseppe Scapigliati, Stephen M. J. Searle, Ted Sharpe, Oleg Simakov, Peter F. Stadler, John J. Stegeman, Kenta Sumiyama, Diana Tabbaa, Hakim Tafer, Jason Turner-Maier, Peter van Heusden, Simon White, Louise Williams, Mark Yandell, Henner Brinkmann, Jean-Nicolas Volff, Clifford J. Tabin, Neil Shubin, Manfred Schartl, David B. Jaffe, John H. Postlethwait, Byrappa Venkatesh, Federica Di Palma, Eric S. Lander, Axel Meyer & Kerstin Lindblad-Toh Nature 2013 Apr;496:311-316. doi:10.1038/nature12027
The discovery of a living coelacanth specimen in 1938 was remarkable, as this lineage of lobe-finned fish was thought to have become extinct 70 million years ago. The modern coelacanth looks remarkably similar to many of its ancient relatives, and its evolutionary proximity to our own fish ancestors provides a glimpse of the fish that first walked on land. Here we report the genome sequence of the African coelacanth, Latimeria chalumnae. Through a phylogenomic analysis, we conclude that the lungfish, and not the coelacanth, is the closest living relative of tetrapods. Coelacanth protein-coding genes are significantly more slowly evolving than those of tetrapods, unlike other genomic features. Analyses of changes in genes and regulatory elements during the vertebrate adaptation to land highlight genes involved in immunity, nitrogen excretion and the development of fins, tail, ear, eye, brain and olfaction. Functional assays of enhancers involved in the fin-to-limb transition and in the emergence of extra-embryonic tissues show the importance of the coelacanth genome as a blueprint for understanding tetrapod evolution.
FinishedGenomes Finished bacterial genomes from shotgun sequence data. Ribeiro FJ*, Przybylski D*, Yin S, Sharpe T, Gnerre S, Abouelleil A, Berlin AM, Montmayeur A, Shea TP, Walker BJ, Young SK, Russ C, Nusbaum C, MacCallum I, Jaffe DB. Genome Res. 2012 Nov;22(11):2270-7. doi: 10.1101/gr.141515.112. Epub 2012 Jul 24.
Exceptionally accurate genome reference sequences have proven to be of great value to microbial researchers. Thus, to date, about 1800 bacterial genome assemblies have been "finished" at great expense with the aid of manual laboratory and computational processes that typically iterate over a period of months or even years. By applying a new laboratory design and new assembly algorithm to 16 samples, we demonstrate that assemblies exceeding finished quality can be obtained from whole-genome shotgun data and automated computation. Cost and time requirements are thus dramatically reduced.

*These authors contributed equally to this work.
Assemblathon1 Assemblathon 1: A competitive assessment of de novo short read assembly methods. Earl DA, Bradnam K, St John J, Darling A, Lin D, Faas J, Yu HO, Vince B, Zerbino DR, Diekhans M, Nguyen N, Nuwantha P, Sung AW, Ning Z, Haimel M, Simpson JT, Fonseca NA, Birol I, Docking TR, Ho IY, Rokhsar DS, Chikhi R, Lavenier D, Chapuis G, Naquin D, Maillet N, Schatz MC, Kelly DR, Phillippy AM, Koren S, Yang SP, Wu W, Chou WC, Srivastava A, Shaw TI, Ruby JG, Skewes-Cox P, Betegon M, Dimon MT, Solovyev V, Kosarev P, Vorobyev D, Ramirez-Gonzalez R, Leggett R, Maclean D, Xia F, Luo R, Li Z, Xie Y, Liu B, Gnerre S, Maccallum I, Przybylski D, Ribeiro FJ, Yin S, Sharpe T, Hall G, Kersey PJ, Durbin R, Jackman SD, Chapman JA, Huang X, Derisi JL, Caccamo M, Li Y, Jaffe DB, Green R, Haussler D, Korf I, Paten B. Genome Res. 2011 Dec;21(12):2224-41. Epub 2011 Sep 16.
Low-cost short read sequencing technology has revolutionized genomics, though it is only just becoming practical for the high-quality de novo assembly of a novel large genome. We describe the Assemblathon 1 competition, which aimed to comprehensively assess the state of the art in de novo assembly methods when applied to current sequencing technologies. In a collaborative effort, teams were asked to assemble a simulated Illumina HiSeq data set of an unknown, simulated diploid genome. A total of 41 assemblies from 17 different groups were received. Novel haplotype aware assessments of coverage, contiguity, structure, base calling, and copy number were made. We establish that within this benchmark: (1) It is possible to assemble the genome to a high level of coverage and accuracy, and that (2) large differences exist between the assemblies, suggesting room for further improvements in current methods. The simulated benchmark, including the correct answer, the assemblies, and the code that was used to evaluate the assemblies is now public and freely available from http://www.assemblathon.org/.
ALLPATHS-LG High-quality draft assemblies of mammalian genomes from massively parallel sequence data. Gnerre S, Maccallum I, Przybylski D, Ribeiro FJ, Burton JN, Walker BJ, Sharpe T, Hall G, Shea TP, Sykes S, Berlin AM, Aird D, Costello M, Daza R, Williams L, Nicol R, Gnirke A, Nusbaum C, Lander ES, Jaffe DB. Proc Natl Acad Sci U S A. 2011 Jan 25;108(4):1513-8. Epub 2010 Dec 27.
Massively parallel DNA sequencing technologies are revolutionizing genomics by making it possible to generate billions of relatively short (~100-base) sequence reads at very low cost. Whereas such data can be readily used for a wide range of biomedical applications, it has proven difficult to use them to generate high-quality de novo genome assemblies of large, repeat-rich vertebrate genomes. To date, the genome assemblies generated from such data have fallen far short of those obtained with the older (but much more expensive) capillary-based sequencing approach. Here, we report the development of an algorithm for genome assembly, ALLPATHS-LG, and its application to massively parallel DNA sequence data from the human and mouse genomes, generated on the Illumina platform. The resulting draft genome assemblies have good accuracy, short-range contiguity, long-range connectivity, and coverage of the genome. In particular, the base accuracy is high (≥99.95%) and the scaffold sizes (N50 size = 11.5 Mb for human and 7.2 Mb for mouse) approach those obtained with capillary-based sequencing. The combination of improved sequencing technology and improved computational methods should now make it possible to increase dramatically the de novo sequencing of large genomes. The ALLPATHS-LG program is available at http://www.broadinstitute.org/science/programs/genome-biology/crd.
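The N50 scaffold size quoted in this abstract is a standard contiguity statistic: the length L such that scaffolds of length L or longer account for at least half of all assembled bases. The sketch below only illustrates that definition. The function name and the example lengths are hypothetical, and it is not taken from ALLPATHS-LG or its evaluation code.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths.

    N50 is the largest length L such that sequences of length >= L
    together cover at least half of the total assembled bases.
    """
    if not lengths:
        raise ValueError("need at least one length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical scaffold lengths (arbitrary units), for illustration only.
example_lengths = [2, 3, 4, 5, 6]
example_n50 = n50(example_lengths)
```

Because N50 depends only on the length distribution, it says nothing about correctness, which is presumably why the abstract reports base accuracy separately.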
ALLPATHS2 ALLPATHS 2: small genomes assembled accurately and with high continuity from short paired reads. Maccallum I, Przybylski D, Gnerre S, Burton J, Shlyakhter I, Gnirke A, Malek J, McKernan K, Ranade S, Shea TP, Williams L, Young S, Nusbaum C, Jaffe DB. Genome Biol. 2009;10(10):R103. Epub 2009 Oct 1.
We demonstrate that genome sequences approaching finished quality can be generated from short paired reads. Using 36 base (fragment) and 26 base (jumping) reads from five microbial genomes of varied GC composition and sizes up to 40 Mb, ALLPATHS2 generated assemblies with long, accurate contigs and scaffolds. Velvet and EULER-SR were less accurate. For example, for Escherichia coli, the fraction of 10-kb stretches that were perfect was 99.8% (ALLPATHS2), 68.7% (Velvet), and 42.1% (EULER-SR).
ConsensusPsiBlast Powerful fusion: PSI-BLAST and consensus sequences. Przybylski D, Rost B. Bioinformatics. 2008 Sep 15;24(18):1987-93. Epub 2008 Aug 4.
MOTIVATION: A typical PSI-BLAST search consists of iterative scanning and alignment of a large sequence database during which a scoring profile is progressively built and refined. Such a profile can also be stored and used to search against a different database of sequences. Using it to search against a database of consensus rather than native sequences is a simple add-on that boosts performance surprisingly well. The improvement comes at a price: we hypothesized that random alignment score statistics would differ between native and consensus sequences. Thus PSI-BLAST-based profile searches against consensus sequences might incorrectly estimate statistical significance of alignment scores. In addition, iterative searches against consensus databases may fail. Here, we addressed these challenges in an attempt to harness the full power of the combination of PSI-BLAST and consensus sequences. RESULTS: We studied alignment score statistics for various types of consensus sequences. In general, the score distribution parameters of profile-based consensus sequence alignments differed significantly from those derived for the native sequences. PSI-BLAST partially compensated for the parameter variation. We have identified a protocol for building specialized consensus sequences that significantly improved search sensitivity and preserved score distribution parameters. As a result, PSI-BLAST profiles can be used to search specialized consensus sequences without sacrificing estimates of statistical significance. We also provided results indicating that iterative PSI-BLAST searches against consensus sequences could work very well. Overall, we showed how a very popular and effective method could be used to identify significantly more relevant similarities among protein sequences.
ConSequenceS Consensus sequences improve PSI-BLAST through mimicking profile-profile alignments. Przybylski D, Rost B. Nucleic Acids Res. 2007;35(7):2238-46. Epub 2007 Mar 16.
Sequence alignments may be the most fundamental computational resource for molecular biology. The best methods that identify sequence relatedness through profile-profile comparisons are much slower and more complex than sequence-sequence and sequence-profile comparisons such as, respectively, BLAST and PSI-BLAST. Families of related genes and gene products (proteins) can be represented by consensus sequences that list the nucleic/amino acid most frequent at each sequence position in that family. Here, we propose a novel approach for consensus-sequence-based comparisons. This approach improved searches and alignments as a standard add-on to PSI-BLAST without any changes of code. Improvements were particularly significant for more difficult tasks such as the identification of distant structural relations between proteins and their corresponding alignments. Despite the fact that the improvements were higher for more divergent relations, they were consistent even at high accuracy/low error rates for non-trivially related proteins. The improvements were very easy to achieve; no parameter used by PSI-BLAST was altered and no single line of code changed. Furthermore, the consensus sequence add-on required relatively little additional CPU time. We discuss how advanced users of PSI-BLAST can immediately benefit from using consensus sequences on their local computers.
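The abstract defines a consensus sequence as listing the most frequent residue at each position of a family. A minimal sketch of that plain definition follows. It assumes a pre-computed multiple alignment given as equal-length strings with `-` for gaps; the function name, the gap handling, and the alphabetical tie-breaking are illustrative assumptions. It does not reproduce the authors' protocol for building consensus sequences for PSI-BLAST.

```python
from collections import Counter


def consensus(aligned, gap="-"):
    """Return the per-column majority residue of an alignment.

    `aligned` is a list of equal-length strings. Gap characters are
    ignored when counting; a column made only of gaps yields a gap.
    Ties are broken alphabetically so the result is deterministic
    (an arbitrary choice made for this sketch).
    """
    if not aligned:
        raise ValueError("need at least one aligned sequence")
    width = len(aligned[0])
    if any(len(seq) != width for seq in aligned):
        raise ValueError("aligned sequences must have equal length")

    result = []
    for column in zip(*aligned):
        counts = Counter(residue for residue in column if residue != gap)
        if not counts:
            result.append(gap)
            continue
        top = max(counts.values())
        result.append(min(r for r, n in counts.items() if n == top))
    return "".join(result)


# Hypothetical four-member family, for illustration only.
family = ["ACDE", "ACDF", "AC-F", "GCDF"]
family_consensus = consensus(family)
```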
AGAPE Improving fold recognition without folds. Przybylski D, Rost B. J Mol Biol. 2004 Jul 30;341(1):255-69.
The most reliable way to align two proteins of unknown structure is through sequence-profile and profile-profile alignment methods. If the structure for one of the two is known, fold recognition methods outperform purely sequence-based alignments. Here, we introduced a novel method that aligns generalised sequence and predicted structure profiles. Using predicted 1D structure (secondary structure and solvent accessibility) significantly improved over sequence-only methods, both in terms of correctly recognising pairs of proteins with different sequences and similar structures and in terms of correctly aligning the pairs. The scores obtained by our generalised scoring matrix followed an extreme value distribution; this yielded accurate estimates of the statistical significance of our alignments. We found that mistakes in 1D structure predictions correlated between proteins from different sequence-structure families. The impact of this surprising result was that our method succeeded in significantly out-performing sequence-only methods even without explicitly using structural information from any of the two. Since AGAPE also outperformed established methods that rely on 3D information, we made it available through http://www.predictprotein.org. If we solved the problem of CPU-time required to apply AGAPE on millions of proteins, our results could also impact everyday database searches.
Membrane Predicting transmembrane beta-barrels in proteomes. Bigelow HR, Petrey DS, Liu J, Przybylski D, Rost B. Nucleic Acids Res. 2004 May 11;32(8):2566-77. Print 2004.
Very few methods address the problem of predicting beta-barrel membrane proteins directly from sequence. One reason is that only very few high-resolution structures for transmembrane beta-barrel (TMB) proteins have been determined thus far. Here we introduced the design, statistics and results of a novel profile-based hidden Markov model for the prediction and discrimination of TMBs. The method carefully attempts to avoid over-fitting the sparse experimental data. While our model training and scoring procedures were very similar to a recently published work, the architecture and structure-based labelling were significantly different. In particular, we introduced a new definition of beta-hairpin motifs, explicit state modelling of transmembrane strands, and a log-odds whole-protein discrimination score. The resulting method reached an overall four-state (up-, down-strand, periplasmic-, outer-loop) accuracy as high as 86%. Furthermore, it accurately discriminated TMB from non-TMB proteins (45% coverage at 100% accuracy). This high precision enabled the application to 72 entirely sequenced Gram-negative bacteria. We found over 164 previously uncharacterized TMB proteins at high confidence. Database searches did not implicate any of these proteins with membranes. We challenge that the vast majority of our 164 predictions will eventually be verified experimentally.
Cafasp3 CAFASP3 in the spotlight of EVA. Eyrich VA, Przybylski D, Koh IY, Grana O, Pazos F, Valencia A, Rost B. Proteins. 2003;53 Suppl 6:548-60.
We have analysed fold recognition, secondary structure and contact prediction servers from CAFASP3. This assessment was carried out in the framework of the fully automated, web-based evaluation server EVA. Detailed results are available at http://cubic.bioc.columbia.edu/eva/cafasp3/. We observed that the sequence-unique targets from CAFASP3/CASP5 were not fully representative for evaluating performance. For all three categories, we showed how careless ranking might be misleading. We compared methods from all categories to experts in secondary structure and contact prediction and homology modellers to fold recognisers. While the secondary structure experts clearly outperformed all others, the contact experts appeared to outperform only novel fold methods. Automatic evaluation servers are good at getting statistics right and at using these to discard misleading ranking schemes. We challenge that to let machines rule where they are best might be the best way for the community to enjoy the tremendous benefit of CASP as a unique opportunity for brainstorming.
EVA2003 EVA: Evaluation of protein structure prediction servers. Koh IY, Eyrich VA, Marti-Renom MA, Przybylski D, Madhusudhan MS, Eswar N, Graña O, Pazos F, Valencia A, Sali A, Rost B. Nucleic Acids Res. 2003 Jul 1;31(13):3311-5.
EVA (http://cubic.bioc.columbia.edu/eva/) is a web server for evaluation of the accuracy of automated protein structure prediction methods. The evaluation is updated automatically each week, to cope with the large number of existing prediction servers and the constant changes in the prediction methods. EVA currently assesses servers for secondary structure prediction, contact prediction, comparative protein structure modelling and threading/fold recognition. Every day, sequences of newly available protein structures in the Protein Data Bank (PDB) are sent to the servers and their predictions are collected. The predictions are then compared to the experimental structures once a week; the results are published on the EVA web pages. Over time, EVA has accumulated prediction results for a large number of proteins, ranging from hundreds to thousands, depending on the prediction method. This large sample assures that methods are compared reliably. As a result, EVA provides useful information to developers as well as users of prediction methods.
SecStrPro Improving the prediction of protein secondary structure in three and eight classes using recurrent neural networks and profiles. Pollastri G, Przybylski D, Rost B, Baldi P. Proteins. 2002 May 1;47(2):228-35.
Secondary structure predictions are increasingly becoming the workhorse for several methods aiming at predicting protein structure and function. Here we use ensembles of bidirectional recurrent neural network architectures, PSI-BLAST-derived profiles, and a large nonredundant training set to derive two new predictors: (a) the second version of the SSpro program for secondary structure classification into three categories and (b) the first version of the SSpro8 program for secondary structure classification into the eight classes produced by the DSSP program. We describe the results of three different test sets on which SSpro achieved a sustained performance of about 78% correct prediction. We report confusion matrices, compare PSI-BLAST to BLAST-derived profiles, and assess the corresponding performance improvements.
PHDpsi Alignments grow, secondary structure prediction improves. Przybylski D, Rost B. Proteins. 2002 Feb 1;46(2):197-205.
Using information from sequence alignments significantly improves protein secondary structure prediction. Typically, more divergent profiles yield better predictions. Recently, various groups have shown that accuracy can be improved significantly by using PSI-BLAST profiles to develop new prediction methods. Here, we focused on the influences of various alignment strategies on two 8-year-old PHD methods. The following results stood out. (i) PHD using pairwise alignments predicts about 72% of all residues correctly in one of the three states: helix, strand, and other. Using larger databases and PSI-BLAST raised accuracy to 75%. (ii) More than 60% of the improvement originated from the growth of current sequence databases; about 20% resulted from detailed changes in the alignment procedure (substitution matrix, thresholds, and gap penalties). Another 20% of the improvement resulted from carefully using iterated PSI-BLAST searches. (iii) It is of interest that we failed to improve prediction accuracy further when attempting to refine the alignment by dynamic programming (MaxHom and ClustalW). (iv) Improvement through family growth appears to saturate at some point. However, most families have not reached this saturation. Hence, we anticipate that prediction accuracy will continue to rise with database growth.
EVA2001 EVA: continuous automatic evaluation of protein structure prediction servers. Eyrich VA, Martí-Renom MA, Przybylski D, Madhusudhan MS, Fiser A, Pazos F, Valencia A, Sali A, Rost B. Bioinformatics. 2001 Dec;17(12):1242-3.
Evaluation of protein structure prediction methods is difficult and time-consuming. Here, we describe EVA, a web server for assessing protein structure prediction methods, in an automated, continuous and large-scale fashion. Currently, EVA evaluates the performance of a variety of prediction methods available through the internet. Every week, the sequences of the latest experimentally determined protein structures are sent to prediction servers, results are collected, performance is evaluated, and a summary is published on the web. EVA has so far collected data for more than 3000 protein chains. These results may provide valuable insight to both developers and users of prediction methods.

Book Chapters, Reviews, and Other Publications

Genomes2Therapies Predicting simplified features of protein structure. Przybylski D, Rost B. In T. Lengauer (Ed.), Weinheim: Wiley-VCH. 2006 p.261-295.
Predictions of simplified aspects of 3D structure are often very successful. In the absence of experimental or predicted 3D structures, many researchers concentrate on trying to simplify the problem and predict particular structural features. One of the first well-defined problems was the prediction of protein secondary structure. Progress in this field has been steady and current secondary structure predictions are useful to many biological applications. Techniques that were developed in the context of secondary structure predictions were successfully applied to the prediction of many other aspects of protein structure such as solvent accessibility, inter-residue contact maps, disordered regions, domain organization, and specialized for distinctive cases such as transmembrane regions of proteins.
Chemoinformatics Prediction of protein structure through evolution. Rost B, Liu J, Przybylski D, Nair R, Wrzeszczynski KO, Bigelow H, Oran Y. In J. Gasteiger & T. Engel (Eds.), Weinheim: Wiley-VCH. 2003 p.1789-1811.
The ultimate goal of protein structure prediction is to extend our knowledge and understanding of the structures and functions of proteins beyond that which is possible by experiment. Virtually all techniques, including 1D, 2D, and 3D structure prediction, and diverse kinds of function prediction use profiles rather than single sequences as the information object for prediction. Database methods rely on structural information to evaluate the fitness of a protein sequence for a given structure according to a statistical model. Energetic methods derive predictions by calculating the fitness according to thermodynamic and kinetic principles. The two approaches have their limitations: database methods suffer from sparse statistics and therefore often over-fit the data, while energetic methods must vastly simplify theory to be tractable with the limited computational power available. The best predictors almost always use a combination of both in an intelligent way.
JuryPredictor Simple jury predicts protein secondary structure best. Rost B, Baldi P, Barton G, Cuff J, Eyrich V, Jones D, Karplus K, King R, Pollastri G, Przybylski D. Web publication
The field of secondary structure prediction methods has advanced again. The best methods now reach levels of 74-76% of the residues correctly predicted in one of the three states helix, strand, or other. In the context of EVA/CASP, we experimented with averaging over the best current methods. The resulting jury decision proved significantly more accurate than the best method. Although the 'jury' seemed the best choice on average, for 60% of all proteins one method was better than the jury. Furthermore, the best individual methods tended to be superior to the jury in estimating the reliability of a prediction. Hence, averaging over predictions may be the method of choice for a quick scan of large data sets, while experts may profit from studying the respective method in detail.
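The jury idea described here, averaging over several methods, can be sketched as follows. This is only an illustration under stated assumptions: each method is assumed to output per-residue scores for the three states, the letters H, E, and L (helix, strand, other) are a labeling chosen for this sketch, and the example scores are hypothetical. The publication's actual averaging scheme is not specified in this summary.

```python
STATES = "HEL"  # assumed labels: helix, strand, other


def jury(predictions):
    """Average per-residue state scores over methods and pick the best state.

    `predictions` is a list with one entry per method; each entry is a
    list of (helix, strand, other) score tuples, one tuple per residue.
    Returns a string with one state letter per residue.
    """
    if not predictions:
        raise ValueError("need at least one method")
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all methods must cover the same residues")

    called = []
    for residue_scores in zip(*predictions):
        n_methods = len(residue_scores)
        means = [
            sum(scores[i] for scores in residue_scores) / n_methods
            for i in range(len(STATES))
        ]
        called.append(STATES[means.index(max(means))])
    return "".join(called)


# Hypothetical scores from three methods for a two-residue protein.
method_a = [(0.8, 0.1, 0.1), (0.4, 0.5, 0.1)]
method_b = [(0.6, 0.2, 0.2), (0.6, 0.3, 0.1)]
method_c = [(0.2, 0.7, 0.1), (0.1, 0.7, 0.2)]
jury_call = jury([method_a, method_b, method_c])
```

In this toy example the jury calls strand at the second residue although one method scores helix highest there. That is consistent with the observation above that a single method can beat the jury on an individual protein.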