Expanding the Atlas of Functional Missense Variants in Human Genes

—Downloadable data—

Preface

This page offers all data associated with our paper "Expanding the Atlas of Functional Missense Variants in Human Genes" (Weile et al., submitted) for download. If you are interested in using any of our data, please cite our paper and feel free to contact us. This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

The authors gratefully acknowledge funding by the National Institutes of Health and the National Human Genome Research Institute (NIH/NHGRI) Center of Excellence in Genomic Science (CEGS) Initiative, the Canadian Excellence Research Chair (CERC), the Ontario Ministry of Research and Innovation (MRI), and the Canadian Institute for Advanced Research (CIFAR).

Abstract

Although we now routinely sequence human genomes, we cannot currently confidently identify functional variants. Here we developed a deep mutational scanning framework that produces exhaustive maps for human missense variants by combining random codon-mutagenesis and multiplexed functional variation assays with computational imputation and regularization. We applied this framework to four proteins corresponding to six human genes: UBE2I (encoding SUMO E2 conjugase), SUMO1 (small ubiquitin-like modifier), TPK1 (thiamin pyrophosphokinase), and CALM1/2/3 (three genes encoding the protein calmodulin). The resulting maps recapitulate known protein features, and serve to confidently identify pathogenic variation. Analysis of large-scale phenotypic screens suggests that assays potentially amenable to deep mutational scanning are already available for 57% of human disease genes.

Methods

We employ a modular workflow encompassing five phases: (1) Mutagenesis; (2) Library generation; (3) Selection of functional variants; (4) Assay readout; and (5) Imputation and Regularization. Mutagenesis was performed using our custom POPCode mutagenesis protocol, which ensures even coverage across all possible amino acid changes, including those not reachable by single nucleotide variants. Libraries were then generated by en-masse Gateway cloning into pools of barcoded complementation and Y2H vectors. The genotype and barcode identity of each clone in the library were then determined using kiloSEQ. Selection for functional variants was achieved through yeast complementation and Y2H assays, which couple yeast fitness to the overall functionality of the protein or to its ability to interact with other proteins, respectively. The readout of these pooled competitive growth assays was then obtained via two different sequencing methods: Barcode Sequencing (BarSEQ) and TileSEQ. Finally, fitness scores were calculated and brought to a common scale, missing data were imputed using machine learning, and less confidently measured data points were regularized. See here for extended methodological details.
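To make the last phase more concrete, the following is a minimal Python sketch of two of its steps. It is not the code used for the paper. It assumes that "a common scale" means a linear rescaling in which the median synonymous variant maps to 1 and the median nonsense variant maps to 0, and that regularization combines a measured score with a machine-learning prediction by inverse-variance weighting. The function names, the mutation labels, and the numbers are all hypothetical.

```python
from statistics import median


def rescale(log_ratios, synonymous, nonsense):
    """Linearly rescale raw scores (assumed scheme, not necessarily the paper's).

    After rescaling, the median synonymous variant scores 1 and the
    median nonsense variant scores 0.
    """
    syn = median(log_ratios[m] for m in synonymous)
    stop = median(log_ratios[m] for m in nonsense)
    if syn == stop:
        raise ValueError("synonymous and nonsense medians coincide; cannot rescale")
    return {m: (v - stop) / (syn - stop) for m, v in log_ratios.items()}


def regularize(measured, measured_se, predicted, predicted_se):
    """Inverse-variance weighted mean of a measured and a predicted score.

    A noisy measurement (large standard error) is pulled towards the
    prediction; a precise one is left almost unchanged.
    """
    w_measured = 1.0 / measured_se ** 2
    w_predicted = 1.0 / predicted_se ** 2
    return (w_measured * measured + w_predicted * predicted) / (w_measured + w_predicted)


# Hypothetical raw log-ratios for five variants of an imaginary protein.
raw = {"A10A": 0.0, "L20L": 0.2, "Q30*": -2.0, "W40*": -1.8, "P50S": -0.9}
scaled = rescale(raw, synonymous=["A10A", "L20L"], nonsense=["Q30*", "W40*"])

# A poorly measured score (SE 0.4) combined with a tighter prediction (SE 0.2).
combined = regularize(measured=0.2, measured_se=0.4, predicted=0.6, predicted_se=0.2)
```

In this toy input, the missense variant P50S lies halfway between the nonsense and synonymous medians, so it lands near 0.5 on the rescaled axis. The combined score sits closer to the prediction than to the measurement because the prediction has the smaller standard error.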

Variant maps

SUMO E2 conjugase (UBE2I)

Structure of UBE2I colored to show each amino acid's median mutant fitness in the complementation assay.

Functional atlas of UBE2I

Small Ubiquitin-like Modifier 1 (SUMO1)

Structure of SUMO1 colored to show each amino acid's median mutant fitness in the complementation assay.

Functional atlas of SUMO1

Thiamine Pyrophosphokinase 1 (TPK1)

Functional atlas of TPK1

Calmodulin (CALM1/2/3)

Functional atlas of Calmodulin
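The structure figures for UBE2I and SUMO1 are colored by each position's median mutant fitness. As a rough illustration of that summary, the Python sketch below groups per-mutation scores by residue position and takes the median. It assumes mutations are labelled in a simple "P50S" style (wild-type residue, position, mutant residue) and that synonymous and nonsense entries are excluded from the median. Both are assumptions made for this example and are not taken from the downloadable files, whose layout may differ.

```python
import re
from collections import defaultdict
from statistics import median

# Assumed label format: wild-type residue, position, mutant residue or "*".
MUTATION = re.compile(r"^([A-Z])(\d+)([A-Z*])$")


def median_fitness_by_position(scores):
    """Return {position: median fitness over its missense variants}."""
    by_position = defaultdict(list)
    for label, score in scores.items():
        match = MUTATION.match(label)
        if match is None:
            raise ValueError(f"unrecognized mutation label: {label!r}")
        wild_type, position, mutant = match.groups()
        # Assumption: only amino acid changes count as "mutant" fitness.
        if mutant == wild_type or mutant == "*":
            continue
        by_position[int(position)].append(score)
    return {pos: median(vals) for pos, vals in sorted(by_position.items())}


# Hypothetical scores for two positions of an imaginary protein.
example = {
    "P50S": 0.5, "P50A": 0.9, "P50G": 0.1,
    "K51R": 1.0, "K51K": 1.0, "K51*": 0.0,
}
per_position = median_fitness_by_position(example)
```

The resulting dictionary maps each position to one number, which could then be translated into a color scale and applied to the corresponding residue in a structure viewer.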

Downloads

The data below is free to use under a Creative Commons Attribution-ShareAlike 4.0 International License. Software elements are free to use under the GNU Lesser General Public License 3.0.


Software

Raw read counts

 UBE2I Complementation Timeseries BarSEQ counts

 UBE2I Complementation TileSEQ counts

 SUMO1 Complementation TileSEQ counts

 TPK1 Complementation TileSEQ counts

 CALM1/2/3 Complementation TileSEQ counts

Experimental scores

 UBE2I Y2H BarSEQ scores (per clone)

 UBE2I Y2H BarSEQ scores (per mutation)

 UBE2I Complementation Timeseries BarSEQ scores (per clone)

 UBE2I Complementation Timeseries BarSEQ scores (per mutation)

 UBE2I Complementation TileSEQ scores

 SUMO1 Complementation TileSEQ scores

Re-scaled scores

 UBE2I Complementation TileSEQ scores re-scaled

 SUMO1 Complementation TileSEQ scores re-scaled

 TPK1 Complementation TileSEQ scores re-scaled

 CALM1/2/3 Complementation TileSEQ scores re-scaled

Joint complementation scores

 UBE2I Complementation joint scores

Imputation and regularization

 UBE2I complementation machine learning features

 UBE2I complementation scores (imputed and regularized)

 UBE2I complementation scores (flipped, imputed and regularized)

 SUMO1 complementation machine learning features

 SUMO1 complementation scores (imputed and regularized)

 SUMO1 complementation scores (flipped, imputed and regularized)

 TPK1 complementation machine learning features

 TPK1 complementation scores (imputed and regularized)

 TPK1 complementation scores (flipped, imputed and regularized)

 CALM1/2/3 complementation machine learning features

 CALM1/2/3 complementation scores (imputed and regularized)

 CALM1/2/3 complementation scores (flipped, imputed and regularized)

Downstream analysis results

 Calmodulin variant pathogenicity calls

 UBE2I intragenic epistasis

 UBE2I intramolecular distance matrix

 UBE2I genetic interactions vs. intramolecular distances