Organizers: {Alex Bloemendal, Jon Bloom} ---> {bloem, jbloom}

Founders: Jon Bloom, Bertrand Haas

Supporters: Cotton Seed, Yossi Farjoun, Anthony Philippakis

Room change! **Mon**days in **Mon**adnock, 2nd Floor

1:00pm ---> Impressively timely arrival, hunting, and gathering

1:05pm ---> Chalk talk starts

2:00pm ---> Useful people escape

2:05pm ---> Discussion starts

2:30pm ---> Speaker released back into the wild

Math/Stat/ML/CS resources ordered by increasing prerequisite knowledge:

MIT 18.05: Jon and Jerry's intro course on probability, Bayesian stats, and frequentist stats. Completely self-contained on OCW!

MIT 18.06: Gil Strang's legendary linear algebra course.

Talking Machines: incredible podcast on machine learning by friends of the SMRC Ryan Adams and Katherine Gorman.

MIT 6.001x: intro to programming using Python.

Bayesian optimization: Ryan Adams' colloquium at Broad.

Scalable Machine Learning: go big or go home with pySpark in this archived BerkeleyX course.

Bioinformatic Algorithms: excellent 6-part series on Coursera from UCSD.

Pattern Recognition and Machine Learning, Christopher Bishop. Bayesian treatment of ML.

The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Trevor Hastie, Robert Tibshirani, Jerome Friedman. Both stats and ML.

Machine Learning: A Probabilistic Perspective, Kevin Murphy. Encyclopedic on ML.

Probability: Theory and Examples, Rick Durrett: a standard reference on modern, measure-theoretic probability theory.

P1, P2, P3: problem sets from Alex's Harvard graduate course on the core concepts of modern, measure-theoretic probability theory.

Biology resources for computationalists:

MIT 7.00x: Eric Lander's introduction to biology!

The Eighth Day of Creation, Horace Judson (1979): a masterpiece of history of science, covering the birth and development of molecular biology, based on interviews with over one hundred of the scientists who played key roles.

*Etched in fine beach sand

Sep 14, 2015

Nikolai Slavov, Northeastern Bioengineering

**Quantifying protein isoforms**

Many protein isoforms -- arising from alternative splicing, post-translational modifications (PTMs), or paralogous genes -- have distinct biological functions. However, the accuracy of quantifying protein isoforms and their stoichiometries by existing mass-spectrometry (MS) methods remains limited by noise from variation in protein digestion and peptide ionization. We eliminate the influence of this analytical noise by deriving a first-principles model (HIquant) for quantifying these stoichiometries only from corresponding-ion ratios. This approach allows unprecedented accuracy (error < 10%) in quantifying ratios between different proteins and their isoforms. I will discuss a mathematical proof of the conditions under which HIquant has a unique solution, and algorithms for its optimal solution.
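As a toy illustration of the linear-algebraic core (this is not HIquant itself, and the peptide map and intensities below are made up): if each peptide's ion intensity is a known linear combination of the isoform abundances, the stoichiometries can be recovered by nonnegative least squares.

```python
import numpy as np
from scipy.optimize import nnls

# Hypothetical setup: two isoforms with unknown abundances x. Each quantified
# peptide maps to a known subset of isoforms, so its total ion intensity is
# (approximately) a linear combination A @ x.
A = np.array([
    [1.0, 0.0],   # peptide unique to isoform 1
    [0.0, 1.0],   # peptide unique to isoform 2
    [1.0, 1.0],   # peptide shared by both isoforms
    [1.0, 1.0],   # another shared peptide
])
x_true = np.array([2.0, 1.0])   # true stoichiometry 2:1
rng = np.random.default_rng(0)
b = A @ x_true * (1 + 0.02 * rng.standard_normal(4))  # noisy intensities

# Nonnegative least squares recovers the abundances up to noise.
x_hat, residual = nnls(A, b)
print(x_hat[0] / x_hat[1])  # estimated isoform ratio, near 2
```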

This paper by Nikolai is a good reference for the mathematical issues, but not the biological ones:

Convex Total Least Squares

Sep 21, 2015

Brendan Meade,
Harvard Earth and Planetary Sciences, Google Research

**CS1: Exploiting sparse and quantized signals to solve linear systems**

In a diverse set of problems ranging from dimensionality reduction to earthquake imaging it is often the case that we seek to identify sparse or quantized representations of signals. The rationale for this may be: 1) physical (I believe that there are only a few interesting things), 2) philosophical (I only want to think about a few interesting things), or 3) computational (I only have enough computer to work with a few interesting things). The past two decades have seen radical progress in efficiently solving sparse recovery problems, and these approaches are now being used for non-trivial estimation problems. This talk focuses on sparse and quantized signal recovery from historical, geometric, and philosophical perspectives, with examples from earthquake physics.

Here are two of Brendan's papers with examples:

Total variation regularization of geodetically and geologically constrained block models for the Western United States

Geodetic imaging of coseismic slip and postseismic afterslip: Sparsity promoting methods applied to the great Tohoku earthquake

Sep 28, 2015

Alex Bloemendal, Broad, ATGU

**CS2: Compressed Sensing**

How many linear measurements (equations) do you need to recover a high-dimensional signal (unknowns)? If you know a basis in which the signal is sparse, and your measurements are not too aligned with this basis, then far fewer than you might expect. Moreover, you can recast your underdetermined problem as a convex program and solve it efficiently. I will talk about when and why this works, mentioning some now-classic applications and a few exciting possibilities in biology.
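A minimal numerical sketch of the recovery step, using ISTA (proximal gradient descent on the lasso objective) as a simple stand-in for a full basis-pursuit solver; the dimensions, sparsity level, and regularization weight below are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, s = 60, 30, 3                 # signal dim, measurements, sparsity
support = rng.choice(n, s, replace=False)
x_true = np.zeros(n)
x_true[support] = rng.choice([-1.0, 1.0], s)

A = rng.standard_normal((m, n)) / np.sqrt(m)   # random Gaussian sensing matrix
b = A @ x_true                                  # m << n linear measurements

# ISTA: proximal gradient descent on 0.5*||Ax - b||^2 + lam*||x||_1,
# the standard convex relaxation of the sparse recovery problem.
lam = 0.01
t = 1.0 / np.linalg.norm(A, 2) ** 2            # step size 1/L (Lipschitz constant)
x = np.zeros(n)
for _ in range(5000):
    g = x - t * A.T @ (A @ x - b)
    x = np.sign(g) * np.maximum(np.abs(g) - t * lam, 0.0)  # soft threshold

print(sorted(np.argsort(np.abs(x))[-s:]), sorted(support))  # supports should match
```

With only m = 30 generic measurements of a 3-sparse signal in 60 dimensions, the l1 relaxation recovers the signal even though the linear system is underdetermined.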

Here are two review articles by authors who made major theoretical contributions:

Precise undersampling theorems, Donoho and Tanner

Mathematics of sparsity (and a few other things), Candes

Oct 13, 2015

Jon Bloom, Broad, ATGU

**DM1: Bayesian logistic regression and mixed models: revenge of the Gibbs**

Our aim is to give background and motivation for Scott's talk next week. Consider SNP association testing against a binary phenotype (disease vs. no disease). While linear regression enjoys very efficient inference, the simplest version is lacking due to:

- erroneous hard calls of variants (go with probabilities)
- multiple testing (go Bonferroni, FDR)
- confounding by ancestry, batch effects (go add PCs)
- cryptic relatedness (go full mixed model)
- binary phenotype (go logistic)
- overfitting (go Bayesian)
- nonlinear dependence of phenotype on covariates (go Gaussian process?)
- admixture (go topic model?)
- non-normal distribution of effect sizes (go GMM prior?)
- sparsity (go lasso?)
- epistasis (go neural net?)
- ascertainment bias (go do some research)
- high-dimensional phenotypes, both continuous and categorical (go do some modeling)

We will describe models addressing some of these points including Bayesian probit, logit, and mixed logit models, and time-permitting, some fancier models mixing continuous and discrete structure. Our emphasis will be on how exponential-family conjugacy makes inference easy via Gibbs sampling in certain cases, whereas its absence leads one toward despair (at least for six more days).
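To make the conjugacy point concrete, here is a sketch of the Albert–Chib Gibbs sampler for Bayesian probit regression on simulated data (the prior variance, dimensions, and iteration counts are arbitrary choices, not part of any talk material):

```python
import numpy as np
from scipy.stats import truncnorm

# Probit as a latent variable model: z_i ~ N(x_i @ beta, 1), y_i = 1{z_i > 0}.
# Both full conditionals are then conjugate: z | beta is truncated normal,
# beta | z is an ordinary Gaussian update. This is why Gibbs is easy here.
rng = np.random.default_rng(2)
n, p = 500, 2
X = rng.standard_normal((n, p))
beta_true = np.array([1.0, -1.0])
y = (X @ beta_true + rng.standard_normal(n) > 0).astype(float)

tau2 = 100.0                                    # diffuse N(0, tau2 I) prior on beta
V = np.linalg.inv(X.T @ X + np.eye(p) / tau2)   # posterior covariance given z
beta = np.zeros(p)
draws = []
for it in range(400):
    mu = X @ beta
    # z_i | beta, y_i: truncated to (0, inf) if y_i = 1, to (-inf, 0) if y_i = 0
    lo = np.where(y == 1, -mu, -np.inf)
    hi = np.where(y == 1, np.inf, -mu)
    z = mu + truncnorm.rvs(lo, hi, random_state=rng)
    # beta | z: conjugate Gaussian update
    beta = rng.multivariate_normal(V @ X.T @ z, V)
    if it >= 100:
        draws.append(beta)

print(np.mean(draws, axis=0))  # posterior mean should land near [1, -1]
```

The logit model loses exactly this conveniently conjugate structure, which is what Scott's Polya-gamma trick restores.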

Probit as latent variable model

Bayesian logistic regression

Conjugate priors

Jon's notes on beta-binomial and normal-normal conjugacy

PS. We may not have time to cover topic models before Scott's talk, but here are references on topic models and bio applications.

Oct 19, 2015

Scott Linderman, HIPS

**DM2: Discrete models with continuous latent structure: a new hope**

We often have discrete count data with continuous latent structure or continuous regressors. It can be hard to match these two up in a Bayesian framework because of lack of conjugacy. Fortunately, there's a cool trick (Polya-gamma augmentation) that allows us to render the discrete observations conjugate with a Gaussian prior, facilitating:

- Bayesian logistic regression, more efficiently
- structured sparse Gaussian models
- hierarchical Gaussian models (eg GMMs) with binary observations
- time series or Gaussian processes to capture dependencies between observations

We can extend this to other observation models too, like binomial, negative binomial, and multinomial observations. So if you know about LDA, now it's easy to combine LDA with Gaussian structure like correlated or dynamic topics.
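A quick numerical check of the key identity, using a naive truncated-series sampler for PG(1, c) (production code uses a much smarter rejection sampler; the truncation level and sample counts below are arbitrary):

```python
import numpy as np

def sample_pg1(c, n_samples, K=200, rng=None):
    """Approximate draws from PG(1, c) by truncating its infinite-sum
    representation: omega = (1/(2*pi^2)) * sum_k g_k / ((k-1/2)^2 + c^2/(4*pi^2)),
    with g_k ~ Exponential(1)."""
    rng = rng or np.random.default_rng()
    k = np.arange(1, K + 1)
    denom = (k - 0.5) ** 2 + c ** 2 / (4 * np.pi ** 2)
    g = rng.exponential(1.0, size=(n_samples, K))
    return (g / denom).sum(axis=1) / (2 * np.pi ** 2)

c = 1.5
omega = sample_pg1(c, 50_000, rng=np.random.default_rng(3))
# The identity E[PG(1, c)] = tanh(c/2) / (2c) is what makes logistic
# likelihoods conditionally Gaussian after augmentation.
print(omega.mean(), np.tanh(c / 2) / (2 * c))
```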

Here is the original Polya-gamma augmentation paper from 2013, as well as Scott's hot-off-the-press work with Ryan Adams and Matt Johnson:

Bayesian inference for logistic models using Pólya-Gamma latent variables

Dependent Multinomial Models Made Easy: Stick Breaking with the Pólya-Gamma Augmentation

Oct 26, 2015 (postponed to 2016)

Caroline Uhler, CSAIL, IDSS

**Gene Regulation in Space and Time**

Although the genetic information in each cell within an organism is identical, gene expression varies widely between different cell types. The quest to understand this phenomenon has led to many interesting mathematics problems. First, I will present a new method for learning gene regulatory networks. It overcomes the limitations of existing algorithms for learning directed graphs and is based on algebraic, geometric and combinatorial arguments. Second, I will analyze the hypothesis that the differential gene expression is related to the spatial organization of chromosomes. I will describe a bi-level optimization formulation to find minimal overlap configurations of ellipsoids and model chromosome arrangements. Analyzing the resulting ellipsoid configurations has important implications for the reprogramming of cells during development.

Here are some of Caroline's papers on packing problems and learning graphical models:

Packing Ellipsoids with Overlap

Sphere Packing with Limited Overlap

Learning directed acyclic graphs based on sparsest permutations

Faithfulness and learning hypergraphs from discrete distributions

Nov 2, 2015

Dougal Maclaurin, HIPS

**NN I. Reverse-mode differentiation and autograd**

Much of machine learning boils down to constructing a loss function and optimizing it, often using gradients. Reverse-mode differentiation (sometimes called "backpropagation") is a general and computationally efficient way to compute these gradients. I'll explain reverse-mode differentiation and show how we've implemented it for Python/Numpy in our automatic differentiation package autograd. I'll finish with some demos showing how easy it is to implement several machine learning models once you have automatic differentiation in your toolbox.
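To make the mechanics concrete, here is a minimal reverse-mode implementation (a toy, far simpler than autograd's tracing of Numpy): each operation records its local derivatives, and a single reverse pass over the computation graph accumulates the gradient.

```python
import math

class Var:
    """Minimal reverse-mode autodiff node: records the local derivative for
    each operation, then accumulates gradients by a reverse pass over the graph."""
    def __init__(self, value):
        self.value, self.parents, self.grad = value, (), 0.0

    def __add__(self, other):
        out = Var(self.value + other.value)
        out.parents = ((self, 1.0), (other, 1.0))              # d(out)/d(input)
        return out

    def __mul__(self, other):
        out = Var(self.value * other.value)
        out.parents = ((self, other.value), (other, self.value))
        return out

    def tanh(self):
        t = math.tanh(self.value)
        out = Var(t)
        out.parents = ((self, 1.0 - t * t),)
        return out

    def backward(self):
        # Topologically sort the graph, then sweep it once in reverse,
        # applying the chain rule at each node ("backpropagation").
        order, seen = [], set()
        def visit(v):
            if id(v) not in seen:
                seen.add(id(v))
                for parent, _ in v.parents:
                    visit(parent)
                order.append(v)
        visit(self)
        self.grad = 1.0
        for v in reversed(order):
            for parent, local in v.parents:
                parent.grad += local * v.grad

x = Var(0.5)
f = x * x + x.tanh()          # f(x) = x^2 + tanh(x)
f.backward()
print(x.grad)                  # f'(x) = 2x + 1 - tanh(x)^2
```

One forward pass plus one reverse pass computes the gradient with respect to every input at a cost comparable to evaluating f itself, which is the whole appeal over forward-mode or finite differences.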

Here is Dougal and David's note on autograd and repository, and some background:

Autograd: Effortless Gradients in Numpy

Autograd repository

Neural nets

Backprop

Nov 9, 2015
David Duvenaud, HIPS

**NN II. Convolutional Networks on Graphs for Learning Molecular Fingerprints**

Predicting properties of molecules requires functions that take graphs as inputs. Molecular graphs are usually preprocessed using hash-based functions to produce fixed-size fingerprint vectors, which are used as features for making predictions. We introduce a convolutional neural network that operates directly on graphs, allowing end-to-end learning of the feature pipeline. This architecture generalizes standard molecular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
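A rough numpy sketch of the idea (random weights and a toy ring "molecule"; this is the spirit of the architecture, not the paper's actual implementation): at each layer every atom aggregates its neighbors, passes through a learned nonlinearity, and writes a soft, differentiable contribution into a fixed-size fingerprint vector.

```python
import numpy as np

def neural_fingerprint(adj, feats, W_hidden, W_out):
    """Differentiable graph fingerprint sketch: aggregate neighbors, transform,
    then pool a per-atom softmax (a soft 'which fingerprint bit') over atoms."""
    h = feats
    fp = np.zeros(W_out.shape[1])
    for W in W_hidden:
        h = np.tanh((h + adj @ h) @ W)          # neighbor aggregation + transform
        logits = h @ W_out
        p = np.exp(logits - logits.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)       # softmax per atom
        fp += p.sum(axis=0)                     # pool over atoms (order-invariant)
    return fp

rng = np.random.default_rng(4)
n_atoms, d, fp_len = 5, 8, 16
adj = np.zeros((n_atoms, n_atoms))              # 5-atom ring graph
for i in range(n_atoms):
    adj[i, (i + 1) % n_atoms] = adj[(i + 1) % n_atoms, i] = 1.0
feats = rng.standard_normal((n_atoms, d))
W_hidden = [0.1 * rng.standard_normal((d, d)) for _ in range(2)]
W_out = 0.1 * rng.standard_normal((d, fp_len))

fp = neural_fingerprint(adj, feats, W_hidden, W_out)
print(fp.shape, fp.sum())  # total mass = n_layers * n_atoms since each softmax row sums to 1
```

Replacing a hash-based fingerprint's discrete bit-setting with this softmax pooling is what makes the whole pipeline differentiable end to end.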

Here is David and Dougal's paper, the github repositories, as well as a great review of conv nets:

Convolutional Networks on Graphs for Learning Molecular Fingerprints

Molecular fingerprint repository

Autograd repository

Review of conv nets

Nov 12, 2015

Professor Ryan Adams

**Machine Learning and the Life Sciences: Beyond Data Analysis**

1pm: Colloquium in the auditorium

2pm: light refreshments in the lobby

Machine learning is about understanding and building computational processes for adapting to data and experience, something that for most of natural history has only existed in living organisms. Lying at the interface between computer science and statistics, machine learning has in recent years come into the spotlight for providing rich new tools for data analysis. While machine learning is interacting with many different scientific areas, collaborations with the life sciences have been particularly exciting as biology invests increasingly in automation and high-throughput data collection methods.

It is an amazing time for computer scientists and biologists to work together, but we can go far beyond data analysis. I will discuss two such collaborative areas that push this boundary: the automated design of biologically-relevant systems, and the exploration of adaptive algorithms in biological substrates. For the former, I will describe ongoing work to automate the process of design of systems such as organic molecules, DNA sequences, and biomimetic robots. For the latter, I will give an overview of recent work showing how important classes of machine learning algorithms can be implemented with biomolecules, without resorting to digital models for chemical computation.

BIO: Ryan Adams is Head of Research at Twitter Cortex and an Assistant Professor of Computer Science at Harvard. He received his Ph.D. in Physics at Cambridge as a Gates Scholar. He was a CIFAR Junior Research Fellow at the University of Toronto before joining the faculty at Harvard. He has won paper awards at ICML, AISTATS, and UAI, and his Ph.D. thesis received Honorable Mention for the Savage Award for Theory and Methods from the International Society for Bayesian Analysis. He also received the DARPA Young Faculty Award and the Sloan Fellowship. Dr. Adams was the CEO of Whetlab, a machine learning startup that was recently acquired by Twitter, and co-hosts the Talking Machines podcast.

Here are several resources on Bayesian optimization and chemical reaction networks (see also, Jon's talk on May 4, 2015):

Practical Bayesian optimization of machine learning algorithms

Spearmint github

Gaussian processes

Message passing inference with chemical reaction networks

Nov 16, 2015

David Kelley, Rinn Lab @ Harvard Stem Cell

**NN III. Learning the regulatory code of the accessible genome with deep convolutional neural nets**

The complex language of eukaryotic gene expression remains incompletely understood. Thus, most of the many noncoding variants statistically associated with human disease have unknown mechanism. Here, we address this challenge using an approach based on a recent machine learning advance: deep convolutional neural networks (CNNs). We introduce an open source package Basset (https://github.com/davek44/Basset) to apply deep CNNs to learn the functional activity of DNA sequences from genomics data. We trained Basset on a compendium of accessible genomic sites mapped in 164 cell types by DNaseI-seq. Basset predictions for the change in accessibility between two variant alleles were far greater for GWAS SNPs that are likely to be causal relative to nearby SNPs in linkage disequilibrium with them. With Basset, a researcher can perform a single sequencing assay in their cell type of interest and simultaneously learn that cell's chromatin accessibility code and annotate every mutation in the genome with its influence on present accessibility and latent potential for accessibility. Thus, Basset offers a powerful computational approach to annotate and interpret the noncoding genome.
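The first layer of such a network one-hot encodes DNA and convolves learned filters along the sequence; each filter acts like a position weight matrix scanning for a motif. A toy version with a hand-made filter (not a trained Basset weight):

```python
import numpy as np

def one_hot(seq):
    """Encode a DNA string as a (length x 4) one-hot matrix over A, C, G, T."""
    idx = {"A": 0, "C": 1, "G": 2, "T": 3}
    x = np.zeros((len(seq), 4))
    for i, base in enumerate(seq):
        x[i, idx[base]] = 1.0
    return x

seq = "ACGTGATAAGGC"           # contains the GATA-like motif GATAAG at position 4
motif = one_hot("GATAAG")       # a filter that exactly matches its own motif
x = one_hot(seq)

# Valid 1d convolution of the filter along the sequence: each window's score
# counts matching bases, peaking where the motif occurs.
scores = np.array([np.sum(x[i:i + 6] * motif) for i in range(len(seq) - 5)])
print(int(scores.argmax()), scores.max())  # position 4 scores a perfect 6
```

A trained network learns hundreds of such filters from data, plus deeper layers that combine motif occurrences into a prediction of accessibility per cell type.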

Here is David's paper with Jasper Snoek (HIPS, Twitter Cortex) and PI John Rinn, as well as a great graphical explanation of (2d) conv nets:

Learning the regulatory code of the accessible genome with deep convolutional neural nets

Review of conv nets

Nov 23, 2015

Alex Wiltschko, Datta Lab @ Harvard Neuro, HIPS, Twitter

**TS I. Mapping Sub-Second Structure in Mouse Behavior**

Complex animal behaviors are likely built from simpler modules, but their systematic identification in mammals remains a significant challenge. We use depth imaging to show that three-dimensional (3D) mouse pose dynamics are structured at the sub-second timescale. Computational modeling of these fast dynamics effectively describes mouse behavior as a series of reused and stereotyped modules with defined transition probabilities, which collectively encapsulate the underlying structure of mouse behavior within a given experiment. By deploying this 3D imaging and machine learning method in a variety of experimental contexts, we show that it unmasks potential strategies employed by the brain to generate specific adaptations to changes in the environment, and captures both predicted and previously-hidden phenotypes induced by genetic or neural manipulations. Further, we demonstrate its utility in automatically unblinding the behavioral effects of pharmacological manipulation. This work demonstrates that mouse body language is built from identifiable components and is organized in a predictable fashion; deciphering this language establishes an objective framework for characterizing the influence of environmental cues, genes and neural activity on behavior.

Joint with Matt Johnson, who will tell us more about the underlying models next week.

Mapping Sub-Second Structure in Mouse Behavior

Video abstract

Nov 30, 2015

Matthew Johnson, HIPS

**TS II. Modeling structure in time series**

Probabilistic generative modeling can help us discover structured representations from unsupervised time series data. I'll survey some basic ideas from Bayesian modeling and inference for time series and give examples of how they can be composed and extended. In particular, I'll focus on building up Bayesian switching linear dynamical systems (SLDS) and associated sampling and structured mean field inference algorithms, motivated by applications to behavior modeling from last week.

If time permits, I'll also talk about our current work on integrating these structured Bayesian generative models with the right amount of "neural net goo" to combine their respective strengths. I might also show some magic tricks with autograd.
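The Gaussian building block underneath all of this is the linear dynamical system, for which filtering is exact. A scalar sketch (all parameters below are made-up toy values; an SLDS adds a discrete switching state over these parameters):

```python
import numpy as np

rng = np.random.default_rng(5)
a, q, r = 0.99, 0.01, 0.5     # dynamics coefficient, process noise var, obs noise var
T = 200

# Simulate a scalar LDS: z_t = a*z_{t-1} + noise, x_t = z_t + noise.
z = np.zeros(T)
z[0] = 1.0
for t in range(1, T):
    z[t] = a * z[t - 1] + rng.normal(0.0, np.sqrt(q))
x = z + rng.normal(0.0, np.sqrt(r), T)

# Kalman filter: Gaussian conjugacy gives the posterior p(z_t | x_{1:t}) in closed form.
mu, P = 0.0, 1.0
mus = np.zeros(T)
for t in range(T):
    mu_p, P_p = a * mu, a * a * P + q     # predict
    K = P_p / (P_p + r)                   # Kalman gain
    mu = mu_p + K * (x[t] - mu_p)         # update with observation x_t
    P = (1.0 - K) * P_p
    mus[t] = mu

print(np.mean((x - z) ** 2), np.mean((mus - z) ** 2))  # filtering should beat raw data
```

The same predict/update recursion, run per discrete state and combined with sampling or structured mean field over the switch sequence, is the workhorse inside the SLDS algorithms above.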

Linear dynamical systems: https://en.wikipedia.org/wiki/Linear_dynamical_system

Talking Machines on LDS and SLDS: 1m45s - 10m

Matt's thesis on Bayesian time-series modeling: http://www.mit.edu/~mattjj/thesis.pdf

Code links:

http://github.com/mattjj/pyhsmm

http://github.com/mattjj/pyhsmm-slds

http://github.com/mattjj/pyhsmm-autoregressive

http://github.com/mattjj/pylds

http://github.com/mattjj/pybasicbayes

And here's an interesting autograd example which uses that gradient-through-forward-pass method: https://github.com/HIPS/autograd/blob/master/examples/hmm_em.py

Sep 22, 2014

Bertrand Haas

Contingency tables I: t-test and z-test

Sep 29, 2014

Bertrand Haas

Contingency tables II: correlation, Pearson chi-squared test, and Fisher exact test

http://en.wikipedia.org/wiki/Fisher's_exact_test

http://en.wikipedia.org/wiki/Lady_tasting_tea

Oct 6, 2014

Bertrand Haas

Contingency tables III: examples in genetics

Oct 20, 2014

Yossi Farjoun

Detecting sample swap

Swap whitepaper

Ped whitepaper

Het whitepaper

Oct 27, 2014

Jon Bloom

Optimal coverage in rare variant association studies

Coverage analysis

Power slides

Nov 3, 2014

Jon Bloom and Bertrand Haas

Puzzle day: drunk Monty Hall, the two envelopes, the bloody crime scene, and Simpson's paradox

https://en.wikipedia.org/wiki/Simpson%27s_paradox

Nov 24, 2014

Alex Bloemendal

Principal component analysis (PCA) and the Marchenko-Pastur law

http://arxiv.org/abs/1404.0788

Dec 8, 2014

Alex Bloemendal

Non-negative matrix factorization (NMF)

http://www.columbia.edu/~jwp2128/Teaching/W4721/papers/nmf_nature.pdf

Dec 15, 2014

Jon Bloom and Cotton Seed

Non-linear dimensional reduction: tSNE and diffusion maps

http://lvdmaaten.github.io/publications/papers/JMLR_2008.pdf

http://www.sciencedirect.com/science/article/pii/S1063520306000546

Jan 26, 2015

Bertrand Haas

Independent component analysis (ICA) and projection pursuit

https://en.wikipedia.org/wiki/Independent_component_analysis

Feb 24, 2015

Alex Bloemendal, Jon Bloom, Bertrand Haas, Cotton Seed

Comparison of dimensional reduction methods: PCA, ICA, NMF, tSNE, and diffusion maps

Mar 2, 2015

Jon Bloom

Introduction to Bayesian graphical models: the Gaussian mixture model (Bishop, Ch8)

http://www.mit.edu/~mattjj/thesis.pdf

Mar 9, 2015

Jon Bloom

Expectation maximization and inference on Gaussian mixture models (Bishop, Ch9)

https://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm

Mar 16, 2015

Cotton Seed

Variational Bayes and inference on Gaussian mixture models (Bishop, Ch10)

https://en.wikipedia.org/wiki/Variational_Bayesian_methods

Mar 23, 2015

Laura Gauthier and Bertrand Haas

Variant quality score recalibration

Mar 30, 2015

Alex Bloemendal

Markov Chain Monte Carlo and Gibbs sampling on Gaussian mixture models (Bishop, Ch11)

https://en.wikipedia.org/wiki/Markov_chain_Monte_Carlo

April 7, 2015

Yossi Farjoun

Genetic fingerprints and contamination estimation

April 14, 2015

Brendan Bulik-Sullivan

Linear mixed models for genetic association analysis

http://www.nature.com/ng/journal/v47/n3/full/ng.3190.html

April 28, 2015

Ian Smith

Connectivity map and challenges in data normalization

http://www.lincscloud.org/

May 4, 2015

Jon Bloom

Introduction to Gaussian processes and Bayesian optimization (Bishop, Ch6)

https://www.cs.ubc.ca/~hutter/EARG.shtml/earg/papers05/rasmussen_gps_in_ml.pdf

http://arxiv.org/abs/1012.2599

http://papers.nips.cc/paper/4522-practical-bayesian-optimization-of-machine-learning-algorithms.pdf

June 1, 2015

Heng Li

Graph-based genetic sequence representation

http://lh3.github.io/2014/07/25/on-the-graphical-representation-of-sequences/

June 8, 2015

Bertrand Haas

Choosing priors in Bayesian inference

http://ocw.mit.edu/courses/mathematics/18-05-introduction-to-probability-and-statistics-spring-2014/readings/MIT18_05S14_Reading15b.pdf

June 15, 2015

Jon Bloom

Discussion of Pachter's p-value prize

https://liorpachter.wordpress.com/2015/05/26/pachters-p-value-prize/

https://joepickrell.wordpress.com/2015/06/11/in-which-im-pretty-sure-i-disagree-with-lior-pachter-and-try-to-figure-out-why/

https://liorpachter.wordpress.com/2015/06/09/i-was-wrong/

June 22, 2015

Bertrand Haas

Conjugate priors and Hardy-Weinberg equilibrium (Bishop, Ch2)

http://ocw.mit.edu/courses/mathematics/18-05-introduction-to-probability-and-statistics-spring-2014/readings/MIT18_05S14_Reading15a.pdf

https://en.wikipedia.org/wiki/Conjugate_prior

June 29, 2015

David Benjamin

Introduction to Dirichlet processes

http://www.gatsby.ucl.ac.uk/~ywteh/research/npbayes/dp.pdf

July 6, 2015

David Benjamin

The Chinese restaurant process and Indian buffet process

https://en.wikipedia.org/wiki/Chinese_restaurant_process

July 13, 2015

Mark Flaherty

Introduction to evolutionary algorithms and NEAT

https://en.wikipedia.org/wiki/Neuroevolution_of_augmenting_topologies

http://nn.cs.utexas.edu/downloads/papers/stanley.ec02.pdf

July 20, 2015

Brendan Bulik-Sullivan

LD score regression for distinguishing confounding from polygenicity

http://www.nature.com/ng/journal/v47/n3/full/ng.3211.html

July 27, 2015

Andrea Byrnes

Challenges in normalization of RNAseq data

http://biorxiv.org/content/early/2015/07/22/021212

http://www.nature.com/nbt/journal/v32/n9/full/nbt.2931.html
