Phenotypic variation in mice can be induced with N-ethyl-N-nitrosourea (ENU), which creates single base pair substitutions in germline DNA. Methods employed in the Center for the Genetics of Host Defense, described on this page (Figure 1), are used to identify ENU-induced mutations causative for phenotypes “instantly,” that is, concurrent with phenotypic screening (Wang et al. Proc.Natl.Acad.Sci.U.S.A. 112, E440-9). This method is distinguished from conventional forward genetic methods because it permits (1) unbiased declaration of mappable phenotypes, including those that are incompletely penetrant, (2) automated identification of causative mutations concurrent with phenotypic screening, without the need to outcross mutant mice to another strain and backcross them, and (3) exclusion of genes not involved in phenotypes of interest.
Only recently has instant mutation identification by automated mapping become a reality, made possible by cumulative methodological and technological advances achieved during the past ~20 years (Timeline). These advances, including several made in the Center for the Genetics of Host Defense, are described briefly.
Prior to the advent of molecular cloning and DNA sequencing techniques, the causes of heritable phenotypes could only be established using biochemical assays. This limitation greatly restricted the number of traits that could be analyzed, but in the early 1980s researchers developed and applied a genetic approach to define the molecular basis of heritable traits (Botstein et al. Am.J.Hum.Genet. 32, 314-331; Davies et al. Nucleic Acids Res. 11, 2303-2312; Gusella et al. Nature. 306, 234-238; Royer-Pokora et al. Nature. 322, 32-38). Now termed “positional cloning,” the method involved four basic steps: (1) high-resolution genetic mapping to establish a critical region within which the genetic cause of the phenotype was proven to reside, (2) physical mapping of the critical region, which was performed by cloning all of the critical region as large, overlapping pieces of DNA, (3) gene identification within the critical region by exon trapping, cDNA selection, etc., and (4) mutation identification, by sequencing all candidate genes and detecting a mutation invariably associated with the trait in affected individuals. The development of the polymerase chain reaction (PCR) (Saiki et al. Science. 230, 1350-1354), databases of expressed sequence tags (ESTs) (Marra et al. Nat.Genet. 21, 191-194), and fluorescence-based capillary sequencing (Smith et al. Nature. 321, 674-679) facilitated steps (2), (3), and (4), respectively, yet the entire process typically required 5 to 8 years.
The speed of positional cloning improved as a result of two breakthroughs. First, the annotated C57BL/6J mouse genome sequence was published as a draft in 2002 (Mouse Genome Sequencing Consortium et al. Nature. 420, 520-562), eliminating the second and third steps in positional cloning. Second, in 2010 it became possible to sequence whole mammalian genomes, and eventually whole exomes (Metzker. Nat.Rev.Genet. 11, 31-46), making it possible to see all the candidate mutations that might be responsible for a phenotype in a given pedigree. In one of our initial uses of whole genome sequencing, we identified a panel of 127 single nucleotide polymorphisms distinguishing the C57BL/6J and C57BL/10J mouse strains (Xia et al. Genetics. 186, 1139-1146). This facilitated the use of C57BL/10J as a mapping strain for phenotypes induced in the C57BL/6J strain, which was desirable since the strains are closely related and phenotypes are therefore less likely to be altered by modifier loci present in more distantly related strains. Even with these advances, the genetic mapping step remained necessary and lengthy; thus, methods to speed mapping were developed.
Our contributions to increasing the speed of mapping began with the use of bulk segregation analysis (BSA) for quick, low-resolution mapping of mouse phenotypes (Arnold et al. Genetics. 187, 633-641). BSA measures mutant vs. mapping strain allele frequency at strain-specific markers across the genome in pools of DNA from phenotypically affected and nonaffected F2 offspring (from mutants outcrossed to a mapping strain). For each marker, enrichment of the mutant strain allele in the affected DNA pool and depletion in the nonaffected DNA pool are used to establish linkage. With only about 20 meioses, BSA can localize a mutation to a sub-chromosomal region, within which there may be only one mutation identified by whole genome or exome sequencing.
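The pooled comparison described above can be illustrated with a minimal sketch. It assumes each marker is summarized by a pair of allele measurements per pool (mutant-strain allele, mapping-strain allele) and scores linkage as the simple difference in mutant-allele frequency between the affected and nonaffected pools. The marker names, counts, and the scoring rule are hypothetical and are not the statistic used in the published BSA method.

```python
from typing import Dict, Tuple

# (mutant-strain allele count, mapping-strain allele count) for one marker in one pool
Counts = Tuple[int, int]


def allele_frequency(counts: Counts) -> float:
    """Fraction of observations at a marker that carry the mutant-strain allele."""
    mutant, mapping = counts
    total = mutant + mapping
    if total == 0:
        raise ValueError("marker has no allele observations")
    return mutant / total


def bsa_scores(affected: Dict[str, Counts],
               unaffected: Dict[str, Counts]) -> Dict[str, float]:
    """Enrichment in the affected pool minus frequency in the nonaffected pool."""
    return {
        marker: allele_frequency(affected[marker]) - allele_frequency(unaffected[marker])
        for marker in affected
        if marker in unaffected
    }


# Hypothetical pooled measurements at three strain-specific markers.
affected = {"chr1_m1": (52, 48), "chr7_m4": (95, 5), "chr12_m2": (49, 51)}
unaffected = {"chr1_m1": (50, 50), "chr7_m4": (35, 65), "chr12_m2": (51, 49)}

scores = bsa_scores(affected, unaffected)
best_marker = max(scores, key=scores.get)  # marker with the strongest enrichment
```

In this toy dataset only the chromosome 7 marker shows both enrichment in the affected pool and depletion in the nonaffected pool, so it would be the candidate region to inspect for sequenced mutations.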
Causative mutation identification directly from exome sequencing data without the need for a separate genetic mapping step was first reported in 2012 (Andrews et al. Open Biol. 2, 120061; Sun et al. G3 (Bethesda). 2, 143-150), and paved the way for development of the automated mapping process now used in our laboratory for instant mutation identification (Wang et al. Proc.Natl.Acad.Sci.U.S.A. 112, E440-9). With this technology, it is possible to rationally calculate genome saturation for specific screens, to detect associations between mutations and lethality, and to conduct screens for complex phenotypes or the suppression of disease (Wang et al. Nat.Commun. 9, 441).
G3 mice carrying homozygous and heterozygous mutations induced by ENU in germ cells of G0 male C57BL/6J mice are generated using one of two breeding schemes (Figure 2). Mutagenized G0 males are bred either to C57BL/6J females or to G0’ females carrying ENU-induced mutations inherited from their fathers. The resulting G1 males are crossed to C57BL/6J females to produce G2 mice. G1 males are then bred to G2 females over about 12 weeks to produce ~50 G3 offspring per G1 x G2 pedigree.
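As a worked example of what this breeding scheme implies, the sketch below computes the expected genotype frequencies in G3 mice for a single mutation carried heterozygously by the G1 male. It assumes an autosomal mutation, Mendelian transmission, no effect on viability, and G2 females derived from a cross to unmutagenized C57BL/6J females. The function and variable names are illustrative only.

```python
from fractions import Fraction
from typing import Dict

REF, HET, VAR = 0, 1, 2  # genotype coded as the number of variant alleles

Distribution = Dict[int, Fraction]


def cross(sire: Distribution, dam: Distribution) -> Distribution:
    """Offspring genotype distribution given parental genotype distributions."""
    out = {REF: Fraction(0), HET: Fraction(0), VAR: Fraction(0)}
    for sire_genotype, sire_prob in sire.items():
        for dam_genotype, dam_prob in dam.items():
            s = Fraction(sire_genotype, 2)  # chance the sire transmits the variant allele
            d = Fraction(dam_genotype, 2)   # chance the dam transmits the variant allele
            pair = sire_prob * dam_prob
            out[VAR] += pair * s * d
            out[HET] += pair * (s * (1 - d) + (1 - s) * d)
            out[REF] += pair * (1 - s) * (1 - d)
    return out


g1_male = {HET: Fraction(1)}    # G1 male heterozygous for the mutation
b6_female = {REF: Fraction(1)}  # unmutagenized C57BL/6J female

g2 = cross(g1_male, b6_female)  # G2 daughters: half HET, half REF
g3 = cross(g1_male, g2)         # G1 male bred to his G2 daughters
```

Under these assumptions the expected G3 frequencies are 3/8 REF, 1/2 HET, and 1/8 VAR, so a pedigree of ~50 G3 mice would be expected to contain roughly six homozygous animals for any given G1 mutation.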
Identification of causative mutations concurrent with phenotypic screening requires the determination of genotype at all mutation sites in every G3 mouse prior to phenotypic assessment. This is accomplished by exome sequencing of the G1 male progenitor of each pedigree to identify all coding and splice site mutations that could possibly be present in the G3 mice. The G3 mice are then genotyped at each mutation site before phenotypic screening; mutations are also validated by genotyping the G1 and G2 mice. REF (homozygous for C57BL/6J reference allele), HET (heterozygous for reference allele and variant allele), or VAR (homozygous for variant allele) genotypes are registered for each mutation site in each mouse, and data are stored in the Mutagenetix database for analysis with phenotypic data.
Once ~50 genotyped G3 mice from a single pedigree are of age for screening, they are tested in phenotypic screens (see Research Areas). If possible, all mice from a pedigree are screened in the same experiment on the same day to minimize phenotypic differences due to experimental variability.
Identification of causative mutations depends on purpose-built software that performs automated linkage analyses (Linkage Analyzer), and a sophisticated display platform that permits searching and presentation of the resulting data (Linkage Explorer).
Analyses of genotype and phenotype data are automatically performed using Linkage Analyzer, a software program designed and written in our laboratory to test the probability of single locus linkage to phenotypes using recessive, semidominant (additive), and dominant transmission models, and to assess the probability of preweaning lethal effects due to single locus mutations. Linkage Analyzer detects phenovariance when it is statistically linked to genotype as determined by a linear regression model. For each mutation, the null hypothesis of nonlinkage is tested assuming a normal or a binomial distribution of phenotype scores for quantitative and qualitative phenotypes, respectively. The P value of association between genotype and phenotype is calculated using a likelihood ratio test from a generalized linear model or generalized linear mixed effect model.
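The test described above can be sketched for the simplest case: a quantitative phenotype, a single mutation, and a Gaussian linear model with no random effects. The sketch is not Linkage Analyzer's code. The genotype codings for the three transmission models, the function names, and the example data are assumptions chosen for illustration, and the P value uses the asymptotic chi-square distribution with one degree of freedom.

```python
import math
from typing import Dict, Sequence

# Assumed numeric coding of genotype under each transmission model.
CODINGS: Dict[str, Dict[str, float]] = {
    "recessive": {"REF": 0.0, "HET": 0.0, "VAR": 1.0},
    "additive": {"REF": 0.0, "HET": 1.0, "VAR": 2.0},
    "dominant": {"REF": 0.0, "HET": 1.0, "VAR": 1.0},
}


def lrt_p_value(x: Sequence[float], y: Sequence[float]) -> float:
    """Likelihood ratio test of y ~ 1 + x against y ~ 1 (Gaussian errors)."""
    n = len(y)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    rss_null = sum((yi - mean_y) ** 2 for yi in y)
    if sxx == 0 or rss_null == 0:
        return 1.0  # no genotype or phenotype variation: nothing to test
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    rss_alt = rss_null - sxy * sxy / sxx
    if rss_alt <= 0:
        return 0.0  # perfect fit
    statistic = max(n * math.log(rss_null / rss_alt), 0.0)
    # Survival function of a chi-square variable with 1 degree of freedom.
    return math.erfc(math.sqrt(statistic / 2))


def linkage_p_values(genotypes: Sequence[str],
                     phenotypes: Sequence[float]) -> Dict[str, float]:
    """P value of association under each transmission model."""
    return {
        model: lrt_p_value([coding[g] for g in genotypes], phenotypes)
        for model, coding in CODINGS.items()
    }


# Hypothetical pedigree: VAR mice have a reduced phenotype score.
genotypes = ["REF"] * 4 + ["HET"] * 6 + ["VAR"] * 4
phenotypes = [10.1, 9.8, 10.3, 9.9,
              10.0, 10.2, 9.7, 10.4, 9.9, 10.1,
              5.2, 4.9, 5.5, 5.0]

p_values = linkage_p_values(genotypes, phenotypes)
best_model = min(p_values, key=p_values.get)
```

Because only the VAR mice differ from the rest in this example, the recessive coding gives the smallest P value, and the dominant coding gives a P value that would not pass a typical cutoff.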
Linkage Analyzer operates at a scalable speed depending on the capabilities of the cluster on which it is run. As presently configured, it processes data faster than we can produce mutations and generate screening data, and it delivers linkage assessments in real time. When phenotypic data are uploaded, the genetic cause of any phenovariance that may exist in the dataset is usually known within a few minutes. The production and phenotypic analysis of G3 mutant mice are thus the rate-limiting steps in the forward genetic approach used in our laboratory.
For each variant phenotype identified by phenotypic screening, Linkage Analyzer performs automated computation of P values of association between genotype and phenotype for every mutation in the pedigree using all three transmission models. These data are accessible through the Linkage Explorer application.
Linkage Explorer may be used to search for phenotypes (among those screened in the Center for the Genetics of Host Defense) linked to a gene of interest, or conversely, for mutated genes linked to a phenotype of interest (Video 1). Several parameters may be specified to target analyses to specific genes, phenotypes, pedigrees, mutation types or effects, or to limit the results to genotype-phenotype associations in which a specified number of linkage peaks was found in the Manhattan plot (Table 1). Other parameters set the stringency of criteria for linkage (Table 1). Three settings dramatically alter the sensitivity and specificity of automated mapping assignments: the number of mice with VAR genotype tested, the P value cutoff, and the requirement for both raw and normalized datasets to reveal linkage in a given screen. Tightening these criteria increases the specificity of a search at the expense of sensitivity, and relaxing them does the reverse.
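The three stringency settings named above can be pictured as a simple filter over linkage results. The sketch below is not the Linkage Explorer interface. The record fields, the function name, the gene names, and the default thresholds are hypothetical and were chosen only to show how each setting narrows or widens a search.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LinkageResult:
    gene: str            # hypothetical gene label
    screen: str          # hypothetical phenotypic screen name
    n_var: int           # number of VAR (homozygous variant) mice tested
    p_raw: float         # P value computed from raw phenotype data
    p_normalized: float  # P value computed from normalized phenotype data


def passes(result: LinkageResult,
           min_var: int = 3,
           p_cutoff: float = 0.002,
           require_raw_and_normalized: bool = True) -> bool:
    """Apply the three stringency settings to one genotype-phenotype association."""
    raw_ok = result.p_raw <= p_cutoff
    normalized_ok = result.p_normalized <= p_cutoff
    if require_raw_and_normalized:
        linked = raw_ok and normalized_ok
    else:
        linked = raw_ok or normalized_ok
    return result.n_var >= min_var and linked


results: List[LinkageResult] = [
    LinkageResult("GeneA", "screen_1", n_var=5, p_raw=1e-6, p_normalized=3e-5),
    LinkageResult("GeneB", "screen_1", n_var=2, p_raw=1e-7, p_normalized=1e-7),
    LinkageResult("GeneC", "screen_2", n_var=6, p_raw=1e-4, p_normalized=0.03),
]

strict_hits = [r.gene for r in results if passes(r)]
relaxed_hits = [r.gene for r in results
                if passes(r, min_var=2, require_raw_and_normalized=False)]
```

With the strict defaults only the first association is returned. Relaxing the VAR count and the raw+normalized requirement returns all three, which illustrates the trade-off between specificity and sensitivity.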
For each search, Linkage Explorer displays in a results table the P value of association calculated by Linkage Analyzer under the three transmission models, with a clickable link to the corresponding Manhattan plot for each inheritance mode, from which raw or normalized phenotypic data for mice of REF, HET, or VAR genotypes can be accessed in a table or scatter plot (Figure 3). Linkage Explorer also displays the mutation coordinate, mutation type, phenotypic screen, numbers of mice with REF, HET, or VAR genotypes, and precalculated information about each implicated gene and mutation, including the predicted effect of the mutation as determined by PolyPhen-2 (Adzhubei et al. Nat.Methods. 7, 248-249) or by a splice site prediction program. The “candidate status” is determined by the Candidate Explorer program (see below) and indicates one of four potential ratings of the likelihood that a mutation would be validated as causative (excellent, good, potential, and not good).
Approximately 3.1% of ENU-induced mutations in our colony are shared between two or more pedigrees, inherited from a common ancestral G0 male. To date, multiple alleles have been identified for approximately 87% of genes with validated mutations. Because the genotypes at all mutation sites in all G3 mice are known, combining pedigrees with identical or non-identical allelic mutations to make “superpedigrees” is possible. This increases the power to detect linkage, especially for weak or low penetrance phenotypes, and can help resolve a causative mutation where causative and non-causative mutations are closely linked. Relative to single pedigree analysis, combining pedigrees in this manner can greatly increase the strength of a genotype-phenotype association, or eliminate it from consideration. As data accumulate from many pedigrees over time, the power to implicate or exonerate genes from participation in defined biological processes increases.
Superpedigrees are automatically generated and analyzed by Linkage Analyzer whenever allelic mutations and phenotypic data are added to the database. “Gene-based” superpedigrees consist of pedigrees containing non-identical mutations of the same gene, whereas “position-based” superpedigrees consist of pedigrees containing identical mutations of the same gene. Superpedigree linkage data are accessed via Linkage Explorer, and searches can be restricted by specifying the gene(s), phenotype(s), pedigree(s), P value cutoff, number of VAR mice tested, and/or application of the “raw+normalized” restriction (Video 2). The results table output by Linkage Explorer is similar to the results table for single pedigree linkage data, except that the number of alleles and number of pedigrees in the superpedigree are given in place of a single mutation coordinate. The minimum PolyPhen-2 score among the set of alleles in the superpedigree is also provided. P values are linked to the Manhattan plot for each transmission mode, from which raw or normalized phenotypic data can be accessed in a table or scatter plot (Figure 4). Phenotypic data in scatter plots are color-coded by pedigree so the contribution of each pedigree to a particular gene-phenotype association can be easily observed.
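The distinction between the two superpedigree types can be sketched as two groupings of the same mutation records. This is a simplified reading of the definitions above and not the Linkage Analyzer implementation. The record layout, pedigree identifiers, gene names, and coordinates are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Set, Tuple


class Mutation(NamedTuple):
    pedigree: str    # hypothetical pedigree identifier
    gene: str        # hypothetical gene label
    coordinate: str  # hypothetical mutation coordinate


def build_superpedigrees(mutations: List[Mutation]):
    """Group pedigrees by shared position and by shared gene with distinct alleles."""
    by_position: Dict[Tuple[str, str], Set[str]] = defaultdict(set)
    for m in mutations:
        by_position[(m.gene, m.coordinate)].add(m.pedigree)

    # Position-based: the identical mutation is present in two or more pedigrees.
    position_based = {
        key: sorted(pedigrees)
        for key, pedigrees in by_position.items()
        if len(pedigrees) >= 2
    }

    # Gene-based: two or more non-identical mutations of the same gene.
    alleles: Dict[str, Set[str]] = defaultdict(set)
    pedigrees_by_gene: Dict[str, Set[str]] = defaultdict(set)
    for (gene, coordinate), pedigrees in by_position.items():
        alleles[gene].add(coordinate)
        pedigrees_by_gene[gene] |= pedigrees
    gene_based = {
        gene: sorted(pedigrees_by_gene[gene])
        for gene, coords in alleles.items()
        if len(coords) >= 2
    }
    return gene_based, position_based


mutations = [
    Mutation("R0001", "GeneA", "chr1:100"),
    Mutation("R0002", "GeneA", "chr1:250"),
    Mutation("R0003", "GeneB", "chr2:200"),
    Mutation("R0004", "GeneB", "chr2:200"),
    Mutation("R0005", "GeneC", "chr3:300"),
]

gene_based, position_based = build_superpedigrees(mutations)
```

Here GeneA yields a gene-based superpedigree of two pedigrees with different alleles, GeneB yields a position-based superpedigree of two pedigrees sharing one mutation, and GeneC, seen in a single pedigree, yields neither.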
Confirmation of causative mutations depends on duplication of the mutant phenotype by a second allele, which may be generated by CRISPR/Cas9 gene targeting.
The Candidate Explorer program aids in the identification of borderline candidate mutations, i.e., those which may show weak phenotypic effects or relatively wider variance of the measured phenotype. Based on previous experience with CRISPR validation, Candidate Explorer rates new mutations as “excellent,” “good,” “potential,” or “not good” candidates for CRISPR validation. The ratings are displayed on the Phenotypic Mutations list, on individual phenotypic mutation records, and on Linkage Explorer search results. Ratings are based on a variety of criteria such as pedigree size, number of homozygous G3 mice in the pedigree, phenotypic screen, predicted deleteriousness of the mutation (by PolyPhen-2), variance of the measured phenotypic data, P value, and many others, each of which may be differentially weighted by the program. Candidate Explorer is trained using the Random Forest machine learning algorithm on an ongoing basis as new mutation and phenotypic data are acquired. Two types of training sets have been tested: mutations verified to cause phenotype by analysis of mice with CRISPR-targeted alleles (“CRISPR-verified”); or CRISPR-verified mutations plus mutations presumed to cause phenotype based on published reports of gene function, a strong or distinctive phenotype, and the predicted effect of the mutation as surveyed by a human researcher (“literature-verified”). The performance of Candidate Explorer is similar whether trained using CRISPR-verified or CRISPR+literature-verified mutations, and all mutations are automatically assessed using Candidate Explorer trained on CRISPR-verified mutations. The precision, accuracy, and recall of the program were assessed for candidate causative mutations with P ≤ 0.002 (for both raw and normalized phenotype data), from pedigrees with at least 20 G3 mice and REF ≥ 4 and VAR ≥ 3 (Table 2). 
Of 151 alleles generated by CRISPR-mediated gene targeting and tested in 344 phenotypic assays, 90 alleles were confirmed and 61 alleles were excluded from causation. For mutations rated “good” or better, Candidate Explorer demonstrated 96.25% precision, 89.40% accuracy, and 85.56% recall.
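The relationship among these three figures can be checked with a short calculation. The confusion-matrix counts below are not stated in the text. They are back-calculated from the reported totals (90 confirmed, 61 excluded) and percentages, treating a rating of "good" or better as a positive prediction, and should be read as an inference rather than as reported data.

```python
from typing import Dict


def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Precision, recall, and accuracy from confusion-matrix counts."""
    return {
        "precision": tp / (tp + fp),                 # confirmed among those rated good or better
        "recall": tp / (tp + fn),                    # rated good or better among those confirmed
        "accuracy": (tp + tn) / (tp + fp + fn + tn), # correct calls among all alleles
    }


# Inferred counts (assumption): the only integers consistent with the
# reported totals and percentages.
tp, fp, fn, tn = 77, 3, 13, 58

assert tp + fn == 90  # confirmed alleles
assert fp + tn == 61  # excluded alleles

metrics = classification_metrics(tp, fp, fn, tn)
percentages = {name: round(value * 100, 2) for name, value in metrics.items()}
```

With these counts the calculation reproduces the reported 96.25% precision, 85.56% recall, and 89.40% accuracy, which suggests that accuracy was computed over all 151 tested alleles.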