SMRC has gone MIA: see the Models, Inference & Algorithms Initiative

smrc == "Stat Math Reading Club"

Organizers: {Alex Bloemendal, Jon Bloom} ---> {bloem, jbloom}
Founders: Jon Bloom, Bertrand Haas
Supporters: Cotton Seed, Yossi Farjoun, Anthony Philippakis

Room change! Mondays in Monadnock, 2nd Floor

1:00pm ---> Impressively timely arrival, hunting, and gathering
1:05pm ---> Chalk talk starts
2:00pm ---> Useful people escape
2:05pm ---> Discussion starts
2:30pm ---> Speaker released back into the wild


Math/Stat/ML/CS resources ordered by increasing prerequisite knowledge:

Biology resources for computationalists:

  • MIT 7.00x: Eric Lander's introduction to biology!

  • The Eighth Day of Creation, Horace Judson (1979): a masterpiece of history of science, covering the birth and development of molecular biology, based on interviews with over one hundred of the scientists who played key roles.

Fall 2015 Schedule*

Date | Speaker | Affiliation | Title
Sep 14 | Nikolai Slavov | Northeastern Bio Eng | Quantifying protein isoforms
Sep 21 | Brendan Meade | Harvard Earth, Google Research | CS1: Exploiting sparse and quantized signals to solve linear systems
Sep 28 | Alex Bloemendal | Broad, ATGU | CS2: Compressed sensing
Oct 13 | Jon Bloom | Broad, ATGU | DM I. Bayesian logistic regression and mixed models: revenge of the Gibbs
Oct 19 | Scott Linderman | HIPS | DM II. Discrete models with continuous latent structure: a new hope
Oct 26, postponed | Caroline Uhler | CSAIL, IDSS | Gene Regulation in Space and Time
Nov 2 | Dougal Maclaurin | HIPS | NN I. Reverse-mode differentiation and autograd
Nov 9 | David Duvenaud | HIPS | NN II. Convolutional Networks on Graphs for Learning Molecular Fingerprints
Nov 12 | Ryan Adams | The Original HIPSter, Twitter Cortex, Talking Machines | Machine Learning and the Life Sciences: Beyond Data Analysis
Nov 16 | David Kelley | Rinn Lab @ Harvard Stem Cell | NN III. Learning the regulatory code of the accessible genome with deep convolutional neural nets
Nov 23 | Alex Wiltschko | Datta Lab @ Harvard Neuro, HIPS, Twitter | TS I. Mapping Sub-Second Structure in Mouse Behavior
Nov 30 | Matthew Johnson | HIPS | TS II. Modeling structure in time series

*Etched in fine beach sand


Sep 14, 2015
Nikolai Slavov, Northeastern Bioengineering
Quantifying protein isoforms

Many protein isoforms -- arising from alternative splicing, post-translational modifications (PTMs), or paralogous genes -- have distinct biological functions. However, the accuracy of quantifying protein isoforms and their stoichiometries by existing mass-spectrometry (MS) methods remains limited because of noise due to variations in protein digestion and peptide ionization. We eliminate the influence of this analytical noise by deriving a first-principles model (HIquant) for quantifying these stoichiometries only from corresponding-ion ratios. This approach allows unprecedented accuracy (error < 10%) in quantifying ratios between different proteins and their isoforms. I will discuss a mathematical proof of the conditions under which HIquant has a unique solution, and algorithms for its optimal solution.

This paper by Nikolai is a good reference for the mathematical issues, but not the biological ones:
Convex Total Least Squares

Sep 21, 2015
Brendan Meade, Harvard Earth and Planetary Sciences, Google Research
CS1: Exploiting sparse and quantized signals to solve linear systems

In a diverse set of problems ranging from dimensionality reduction to earthquake imaging, we often seek to identify sparse or quantized representations of signals. The rationale for this may be: 1) physical (I believe that there are only a few interesting things), 2) philosophical (I only want to think about a few interesting things), or 3) computational (I only have enough computer to work with a few interesting things). The past two decades have seen radical progress in efficiently solving sparse recovery problems, and these approaches are now being used for non-trivial estimation problems. This talk focuses on sparse and quantized signal recovery from historical, geometric, and philosophical perspectives, with examples from earthquake physics.

Here are two of Brendan's papers with examples:
Total variation regularization of geodetically and geologically constrained block models for the Western United States
Geodetic imaging of coseismic slip and postseismic afterslip: Sparsity promoting methods applied to the great Tohoku earthquake

Sep 28, 2015
Alex Bloemendal, Broad, ATGU
CS2: Compressed Sensing

How many linear measurements (equations) do you need to recover a high-dimensional signal (unknowns)? If you know a basis in which the signal is sparse, and your measurements are not too aligned with this basis, then far fewer than you might expect. Moreover, you can recast your underdetermined problem as a convex program and solve it efficiently. I will talk about when and why this works, mentioning some now-classic applications and a few exciting possibilities in biology.

Here are two review articles by authors who made major theoretical contributions:
Precise undersampling theorems, Donoho and Tanner
Mathematics of sparsity (and a few other things), Candes
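
As a rough illustration of the "convex program" mentioned in the abstract: the abstract does not say which program is meant, so the block below assumes the standard basis pursuit formulation. The symbols A (measurement matrix), y (measurements), x (signal coefficients in the sparsifying basis), m, and n are notation introduced here, not taken from the talk.

```latex
% Basis pursuit (assumed form): recover a sparse x from m << n linear measurements.
\[
  \hat{x} \;=\; \operatorname*{arg\,min}_{x \in \mathbb{R}^{n}} \; \|x\|_{1}
  \quad \text{subject to} \quad A x = y,
  \qquad A \in \mathbb{R}^{m \times n}, \; m \ll n .
\]
% The \ell_1 norm is a convex surrogate for the number of nonzero entries of x,
% which is what makes the underdetermined problem tractable.
```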

Oct 13, 2015
Jon Bloom, Broad, ATGU
DM1: Bayesian logistic regression and mixed models: revenge of the Gibbs

Our aim is to give background and motivation for Scott's talk next week. Consider SNP association testing against a binary phenotype (disease vs. no disease). While linear regression enjoys very efficient inference, the simplest version is lacking due to:

  • erroneous hard calls of variants (go with probabilities)
  • multiple testing (go Bonferroni, FDR)
  • confounding by ancestry, batch effects (go add PCs)
  • cryptic relatedness (go full mixed model)
  • binary phenotype (go logistic)
  • overfitting (go Bayesian)
  • nonlinear dependence of phenotype on covariates (go Gaussian process?)
  • admixture (go topic model?)
  • non-normal distribution of effect sizes (go GMM prior?)
  • sparsity (go lasso?)
  • epistasis (go neural net?)
  • ascertainment bias (go do some research)
  • high-dimensional phenotypes, both continuous and categorical (go do some modeling)

We will describe models addressing some of these points including Bayesian probit, logit, and mixed logit models, and time-permitting, some fancier models mixing continuous and discrete structure. Our emphasis will be on how exponential-family conjugacy makes inference easy via Gibbs sampling in certain cases, whereas its absence leads one toward despair (at least for six more days).

Probit as latent variable model
Bayesian logistic regression
Conjugate priors
Jon's notes on beta-binomial and normal-normal conjugacy
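
For readers who want a concrete picture of conjugacy, here is a minimal sketch of the beta-binomial case named in the notes above. The function names and the example numbers are made up for illustration and are not taken from Jon's notes. With a Beta(alpha, beta) prior on a success probability and binomial data, the posterior is again a Beta distribution, so the update is plain arithmetic on the parameters. In a Gibbs sampler, each conditional draw is a closed-form update of this kind.

```python
def beta_binomial_update(alpha, beta, successes, trials):
    """Return posterior Beta parameters after `successes` out of `trials`.

    Prior: p ~ Beta(alpha, beta). Likelihood: successes ~ Binomial(trials, p).
    Conjugacy gives the posterior p | data ~ Beta(alpha + successes,
    beta + trials - successes).
    """
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")
    return alpha + successes, beta + (trials - successes)


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


# Hypothetical example: uniform prior Beta(1, 1), then 7 successes in 10 trials.
posterior = beta_binomial_update(1.0, 1.0, successes=7, trials=10)
posterior_mean = beta_mean(*posterior)
```

The posterior mean sits between the prior mean (1/2) and the observed frequency (7/10), which is the usual shrinkage behavior of a conjugate update.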

PS. We may not have time to cover topic models before Scott's talk, but here are references on topic models and bio applications.

Oct 19, 2015
Scott Linderman, HIPS
DM2: Discrete models with continuous latent structure: a new hope

We often have discrete count data with continuous latent structure or continuous regressors. It can be hard to match these two up in a Bayesian framework because of lack of conjugacy. Fortunately, there's a cool trick (Polya-gamma augmentation) that allows us to render the discrete observations conjugate with a Gaussian prior, facilitating:

  • Bayesian logistic regression, more efficiently
  • structured sparse Gaussian models
  • hierarchical Gaussian models (eg GMMs) with binary observations
  • time series or Gaussian processes to capture dependencies between observations

We can extend this to other observation models too, like binomial, negative binomial, and multinomial observations. So if you know about LDA, now it's easy to combine LDA with Gaussian structure like correlated or dynamic topics.

Here is the original Polya-gamma augmentation paper from 2013, as well as Scott's hot-off-the-press work with Ryan Adams and Matt Johnson:
Bayesian inference for logistic models using Pólya-Gamma latent variables
Dependent Multinomial Models Made Easy: Stick Breaking with the Pólya-Gamma Augmentation
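
For orientation, the augmentation trick rests on an integral identity from the first paper above. It is reproduced here as a sketch in that paper's notation (psi is the linear predictor, omega the auxiliary variable, and a, b depend on the observation model). It is not taken from the talk itself.

```latex
% Polya-gamma identity: for b > 0 and \omega \sim \mathrm{PG}(b, 0),
\[
  \frac{(e^{\psi})^{a}}{(1 + e^{\psi})^{b}}
  \;=\; 2^{-b} \, e^{\kappa \psi}
  \int_{0}^{\infty} e^{-\omega \psi^{2} / 2} \, p(\omega) \, d\omega ,
  \qquad \kappa = a - \tfrac{b}{2} .
\]
% Conditioned on \omega, the right-hand side is Gaussian in \psi, so a Gaussian
% prior on \psi is conditionally conjugate; the other Gibbs step draws
% \omega \mid \psi \sim \mathrm{PG}(b, \psi).
```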

Oct 26, 2015 (postponed to 2016)
Caroline Uhler, CSAIL, IDSS
Gene Regulation in Space and Time

Although the genetic information in each cell within an organism is identical, gene expression varies widely between different cell types. The quest to understand this phenomenon has led to many interesting mathematics problems. First, I will present a new method for learning gene regulatory networks. It overcomes the limitations of existing algorithms for learning directed graphs and is based on algebraic, geometric and combinatorial arguments. Second, I will analyze the hypothesis that the differential gene expression is related to the spatial organization of chromosomes. I will describe a bi-level optimization formulation to find minimal overlap configurations of ellipsoids and model chromosome arrangements. Analyzing the resulting ellipsoid configurations has important implications for the reprogramming of cells during development.

Here are some of Caroline's papers on packing problems and learning graphical models:
Packing Ellipsoids with Overlap
Sphere Packing with Limited Overlap
Learning directed acyclic graphs based on sparsest permutations
Faithfulness and learning hypergraphs from discrete distributions

Nov 2, 2015
Dougal Maclaurin, HIPS
NN I. Reverse-mode differentiation and autograd

Much of machine learning boils down to constructing a loss function and optimizing it, often using gradients. Reverse-mode differentiation (sometimes called "backpropagation") is a general and computationally efficient way to compute these gradients. I'll explain reverse-mode differentiation and show how we've implemented it for Python/Numpy in our automatic differentiation package autograd. I'll finish with some demos showing how easy it is to implement several machine learning models once you have automatic differentiation in your toolbox.

Here is Dougal and David's note on autograd and repository, and some background:
Autograd: Effortless Gradients in Numpy
Autograd repository
Neural nets
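
To make the idea of reverse-mode differentiation concrete, here is a toy scalar sketch in plain Python. It is not how autograd is implemented. The `Var` class, the `tanh` helper, and the `backward` function are hypothetical names invented for this illustration. Each operation records its inputs together with the local derivative, and a single backward sweep in reverse topological order accumulates the gradient of the output with respect to every input.

```python
import math


class Var:
    """A scalar node that remembers its parents and the local derivatives."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # tuple of (parent_node, d(self)/d(parent))
        self.grad = 0.0

    def __add__(self, other):
        other = other if isinstance(other, Var) else Var(other)
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        other = other if isinstance(other, Var) else Var(other)
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

    __radd__ = __add__
    __rmul__ = __mul__


def tanh(x):
    """tanh with its local derivative 1 - tanh(x)^2 recorded."""
    t = math.tanh(x.value)
    return Var(t, ((x, 1.0 - t * t),))


def backward(output):
    """Accumulate d(output)/d(node) into node.grad for every ancestor node."""
    order, seen = [], set()

    def visit(node):
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent, _ in node.parents:
            visit(parent)
        order.append(node)  # post-order: parents are appended before children

    visit(output)
    output.grad = 1.0
    # Reverse order guarantees a node's gradient is complete before it is
    # pushed to its parents (the chain rule applied from output to inputs).
    for node in reversed(order):
        for parent, local in node.parents:
            parent.grad += local * node.grad


# Hypothetical one-neuron example: loss = tanh(w * x + b)^2.
w, x, b = Var(-1.5), Var(0.5), Var(0.25)
y = tanh(w * x + b)
loss = y * y
backward(loss)
```

After the call, `w.grad`, `x.grad`, and `b.grad` hold the partial derivatives of `loss`. One backward sweep yields all of them, which is why reverse mode suits losses with many parameters and a single scalar output.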

Nov 9, 2015
David Duvenaud, HIPS
NN II. Convolutional Networks on Graphs for Learning Molecular Fingerprints

Predicting properties of molecules requires functions that take graphs as inputs. Molecular graphs are usually preprocessed using hash-based functions to produce fixed-size fingerprint vectors, which are used as features for making predictions. We introduce a convolutional neural network that operates directly on graphs, allowing end-to-end learning of the feature pipeline. This architecture generalizes standard molecular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.

Here is David and Dougal's paper, the github repositories, as well as a great review of conv nets:
Convolutional Networks on Graphs for Learning Molecular Fingerprints
Molecular fingerprint repository
Autograd repository
Review of conv nets

Nov 12, 2015
Professor Ryan Adams
Machine Learning and the Life Sciences: Beyond Data Analysis
1pm: Colloquium in the auditorium
2pm: light refreshment in the lobby

Machine learning is about understanding and building computational processes for adapting to data and experience, something that for most of natural history has only existed in living organisms. Lying at the interface between computer science and statistics, machine learning has in recent years come into the spotlight for providing rich new tools for data analysis. While machine learning is interacting with many different scientific areas, collaborations with the life sciences have been particularly exciting as biology invests increasingly in automation and high-throughput data collection methods.

It is an amazing time for computer scientists and biologists to work together, but we can go far beyond data analysis. I will discuss two such collaborative areas that push this boundary: the automated design of biologically-relevant systems, and the exploration of adaptive algorithms in biological substrates. For the former, I will describe ongoing work to automate the process of design of systems such as organic molecules, DNA sequences, and biomimetic robots. For the latter, I will give an overview of recent work showing how important classes of machine learning algorithms can be implemented with biomolecules, without resorting to digital models for chemical computation.

BIO: Ryan Adams is Head of Research at Twitter Cortex and an Assistant Professor of Computer Science at Harvard. He received his Ph.D. in Physics at Cambridge as a Gates Scholar. He was a CIFAR Junior Research Fellow at the University of Toronto before joining the faculty at Harvard. He has won paper awards at ICML, AISTATS, and UAI, and his Ph.D. thesis received Honorable Mention for the Savage Award for Theory and Methods from the International Society for Bayesian Analysis. He also received the DARPA Young Faculty Award and the Sloan Fellowship. Dr. Adams was the CEO of Whetlab, a machine learning startup that was recently acquired by Twitter, and co-hosts the Talking Machines podcast.

Here are several resources on Bayesian optimization and chemical reaction networks (see also, Jon's talk on May 4, 2015):
Practical Bayesian optimization of machine learning algorithms
Spearmint github
Gaussian processes
Message passing inference with chemical reaction networks

Nov 16, 2015
David Kelley, Rinn Lab @ Harvard Stem Cell
NN III. Learning the regulatory code of the accessible genome with deep convolutional neural nets

The complex language of eukaryotic gene expression remains incompletely understood. Thus, most of the many noncoding variants statistically associated with human disease have unknown mechanism. Here, we address this challenge using an approach based on a recent machine learning advance: deep convolutional neural networks (CNNs). We introduce an open source package, Basset, to apply deep CNNs to learn the functional activity of DNA sequences from genomics data. We trained Basset on a compendium of accessible genomic sites mapped in 164 cell types by DNaseI-seq. Basset predictions for the change in accessibility between two variant alleles were far greater for GWAS SNPs that are likely to be causal relative to nearby SNPs in linkage disequilibrium with them. With Basset, a researcher can perform a single sequencing assay in their cell type of interest and simultaneously learn that cell's chromatin accessibility code and annotate every mutation in the genome with its influence on present accessibility and latent potential for accessibility. Thus, Basset offers a powerful computational approach to annotate and interpret the noncoding genome.

Here is David's paper with Jasper Snoek (HIPS, Twitter Cortex) and PI John Rinn, as well as a great graphical explanation of (2d) conv nets:
Learning the regulatory code of the accessible genome with deep convolutional neural nets
Review of conv nets

Nov 23, 2015
Alex Wiltschko, Datta Lab @ Harvard Neuro, HIPS, Twitter
TS I. Mapping Sub-Second Structure in Mouse Behavior

Complex animal behaviors are likely built from simpler modules, but their systematic identification in mammals remains a significant challenge. We use depth imaging to show that three-dimensional (3D) mouse pose dynamics are structured at the sub-second timescale. Computational modeling of these fast dynamics effectively describes mouse behavior as a series of reused and stereotyped modules with defined transition probabilities, which collectively encapsulate the underlying structure of mouse behavior within a given experiment. By deploying this 3D imaging and machine learning method in a variety of experimental contexts, we show that it unmasks potential strategies employed by the brain to generate specific adaptations to changes in the environment, and captures both predicted and previously-hidden phenotypes induced by genetic or neural manipulations. Further, we demonstrate its utility in automatically unblinding the behavioral effects of pharmacological manipulation. This work demonstrates that mouse body language is built from identifiable components and is organized in a predictable fashion; deciphering this language establishes an objective framework for characterizing the influence of environmental cues, genes and neural activity on behavior.

Joint with Matt Johnson, who will tell us more about the underlying models next week.

Mapping Sub-Second Structure in Mouse Behavior
Video abstract

Nov 30, 2015
Matthew Johnson, HIPS
TS II. Modeling structure in time series

Probabilistic generative modeling can help us discover structured representations from unsupervised time series data. I'll survey some basic ideas from Bayesian modeling and inference for time series and give examples of how they can be composed and extended. In particular, I'll focus on building up Bayesian switching linear dynamical systems (SLDS) and associated sampling and structured mean field inference algorithms, motivated by applications to behavior modeling from last week.

If time permits, I'll also talk about our current work on integrating these structured Bayesian generative models with the right amount of "neural net goo" to combine their respective strengths. I might also show some magic tricks with autograd.
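
As a rough picture of what a switching linear dynamical system generates, here is a toy simulation in plain Python. It is a sketch under simplifying assumptions that are not from the talk: the continuous latent is a scalar rather than a vector, the observation is the latent plus noise, and every name and parameter value (`simulate_slds`, `transition`, `dynamics`, the noise levels) is hypothetical. It only samples from the model and does no inference.

```python
import random


def simulate_slds(num_steps, transition, dynamics, process_std, obs_std, seed=0):
    """Sample discrete states, continuous latents, and noisy observations."""
    rng = random.Random(seed)
    num_states = len(transition)
    z, x = 0, 0.0  # assumed initial discrete state and latent value
    states, latents, observations = [], [], []
    for _ in range(num_steps):
        # Discrete state: one step of a Markov chain with rows of `transition`.
        z = rng.choices(range(num_states), weights=transition[z])[0]
        # Continuous latent: linear dynamics picked out by the current state.
        slope, offset = dynamics[z]
        x = slope * x + offset + rng.gauss(0.0, process_std)
        # Observation: the latent corrupted by Gaussian noise.
        y = x + rng.gauss(0.0, obs_std)
        states.append(z)
        latents.append(x)
        observations.append(y)
    return states, latents, observations


# Hypothetical two-regime example: a slow decay toward 0 and a pull toward 2.
transition = [[0.95, 0.05], [0.10, 0.90]]
dynamics = [(0.9, 0.0), (0.5, 1.0)]
states, latents, observations = simulate_slds(
    200, transition, dynamics, process_std=0.1, obs_std=0.2
)
```

Inference runs this story in reverse: given only `observations`, recover plausible `states` and `latents` along with the parameters, which is where the sampling and mean field algorithms mentioned in the abstract come in.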

Linear dynamical systems:
Talking Machines on LDS and SLDS: 1m45s - 10m
Matt's thesis on Bayesian time-series modeling:

Code links:

And here's an interesting autograd example which uses that gradient-through-forward-pass method:

Fall 2014

Sep 22, 2014
Bertrand Haas
Contingency tables I: t-test and z-test

Sep 29, 2014
Bertrand Haas
Contingency tables II: correlation, Pearson chi-squared test, and Fisher's exact test

Oct 6, 2014
Bertrand Haas
Contingency tables III: examples in genetics

Oct 20, 2014
Yossi Farjoun
Detecting sample swap
Swap whitepaper
Ped whitepaper
Het whitepaper

Oct 27, 2014
Jon Bloom
Optimal coverage in rare variant association studies
Coverage analysis
Power slides

Nov 3, 2014
Jon Bloom and Bertrand Haas
Puzzle day: drunk Monty Hall, the two envelopes, the bloody crime scene, and Simpson's paradox

Nov 24, 2014
Alex Bloemendal
Principal component analysis (PCA) and the Marchenko-Pastur law

Dec 8, 2014
Alex Bloemendal
Non-negative matrix factorization (NMF)

Dec 15, 2014
Jon Bloom and Cotton Seed
Non-linear dimensional reduction: tSNE and diffusion maps

Spring 2015

Jan 26, 2015
Bertrand Haas
Independent component analysis (ICA) and projection pursuit

Feb 24, 2015
Alex Bloemendal, Jon Bloom, Bertrand Haas, Cotton Seed
Comparison of dimensional reduction methods: PCA, ICA, NMF, tSNE, and diffusion maps

Mar 2, 2015
Jon Bloom
Introduction to Bayesian graphical models: the Gaussian mixture model (Bishop, Ch8)

Mar 9, 2015
Jon Bloom
Expectation maximization and inference on Gaussian mixture models (Bishop, Ch9)

Mar 16, 2015
Cotton Seed
Variational Bayes and inference on Gaussian mixture models (Bishop, Ch10)

Mar 23, 2015
Laura Gauthier and Bertrand Haas
Variant quality score recalibration

Mar 30, 2015
Alex Bloemendal
Markov Chain Monte Carlo and Gibbs sampling on Gaussian mixture models (Bishop, Ch11)

April 7, 2015
Yossi Farjoun
Genetic fingerprints and contamination estimation

April 14, 2015
Brendan Bulik-Sullivan
Linear mixed models for genetic association analysis

April 28, 2015
Ian Smith
Connectivity map and challenges in data normalization

May 4, 2015
Jon Bloom
Introduction to Gaussian processes and Bayesian optimization (Bishop, Ch6)

Summer 2015

June 1, 2015
Heng Li
Graph-based genetic sequence representation

June 8, 2015
Bertrand Haas
Choosing priors in Bayesian inference

June 15, 2015
Jon Bloom
Discussion of Pachter's p-value prize

June 22, 2015
Bertrand Haas
Conjugate priors and Hardy-Weinberg equilibrium (Bishop, Ch2)

June 29, 2015
David Benjamin
Introduction to Dirichlet processes

July 6, 2015
David Benjamin
The Chinese restaurant process and Indian buffet process

July 13, 2015
Mark Flaherty
Introduction to evolutionary algorithms and NEAT

July 20, 2015
Brendan Bulik-Sullivan
LD score regression for distinguishing confounding from polygenicity

July 27, 2015
Andrea Byrnes
Challenges in normalization of RNAseq data
