Reconstruction of Genome-Scale Metabolic Models

There is growing interest in elucidating the minimal number of genes needed for life. This challenge is important not just for fundamental but also practical considerations arising from the need to design microorganisms exquisitely tuned for particular applications. With a genome size of ~580 kb and approximately 480 protein coding regions Mycoplasma genitalium is one of the smallest known self-replicating organisms and, additionally, has extremely fastidious nutrient requirements. The reduced genomic content of M. genitalium has led researchers to suggest that the molecular assembly contained in this organism may be a close approximation to the minimal set of genes required for bacterial growth. We have introduced a systematic approach for the construction and curation of a genome-scale in silico metabolic model for M. genitalium. The model accounts for 189 of the 482 genes listed in the latest genome annotation and is named iPS189. We used computation tools during the process to bridge network gaps in the model (i.e., GapFind and GapFill) and restore consistency with experimental data that determined which gene deletions led to cell death using GrowMatch. We achieved 87% correct model predictions for essential genes and 89% for non-essential genes. We subsequently have used the metabolic model to determine components that must be part of the growth medium. The general aproaches we used during the construction of the M. genitalium model are being used to develop models of other organisms.

Figure Caption: Classification of the ORFs included in iPS189 grouped into COG functional categories. The percent assigned to each class refers to the coverage of the total number in the genome accounted for in the model. Some of the ORFs in Mycoplasma genitalium do not currently have a COG functional category assignment (here represented as N/A). Note that although each ORF is only counted once within each COG functional category, some ORFs have multiple COG category assignments.

Over 700 genomes have been fully sequenced whereas only about 25 organism-specific genome-scale metabolic models have been constructed. The time required during the task of model-construction is becoming increasingly critical as genome sequencing for new organisms is proceeding at an ever-accelerating pace. The approaches and tools used during the M. genitalium model generation provide a roadmap for the automated metabolic reconstruction of other organisms. The application of the automated methodologies GapFind/GapFill and Growmatch can be effectively used during the construction process (as opposed to an a posteriori mode of deployment). Additionally, we have explored how GrowMatch is useful at improving model predictions when using growth phenotypes from various nutrient sources (i.e., carbon, nitrogen, phosphorus, sulfur) instead of single gene deletions.

Figure Caption: The four steps used during reconstruction are 1) identification of biotransformations using automated homology searches 2) assembly of reaction sets into a genome-scale metabolic model 3) automated network connectivity analysis and restoration, and 4) automated evaluation and improvement of model performance when compared to in vivo gene essentiality or growth data.

[Suthers et al (2009)]

Curation of Genome-Scale Metabolic Models

Currently, there exists tens of different microbial and eukaryotic metabolic reconstructions (e.g.,Escherichia coli, Saccharomyces cerevisiae, Bacillus subtilis, Homo Sapiens) with many more under development. There are inaccuracies in these reconstructions due to the presence/absence of metabolic functionalities that the organism does not/ does actually possess. Our research is focused on developing optimization tools to pinpoint these inaccuracies and subsequently generate hypotheses to correct them. To this end, we have been involved in projects to develop the following tools:

1. GapFind/GapFill

Using GapFind , we identify the metabolites in the metabolic network reconstruction which cannot be produced under any uptake conditions. Subsequently, using GapFill, we identify the reactions from a customized multi-organism database that restores the connectivity of these metabolites to the parent network using four mechanisms. We demonstrate this procedure for the genome- scale reconstruction of Escherichia coli and also Saccharomyces cerevisiae wherein compartmentalization of intra-cellular reactions results in a more complex topology of the metabolic network. We determine that about 10% of metabolites in E. coli and 30% of metabolites in S. cerevisiae cannot carry any flux. Interestingly, the dominant flow restoration mechanism is directionality reversals of existing reactions in the respective models.

Figure Caption: GapFind identifies problem metabolites and GapFill generates hypotheses to restore the connectivity of these metabolites with the rest of the network.

2. GrowMatch

Genome-scale metabolic reconstructions are typically validated by comparing in silico growth predictions across different mutants utilizing different carbon sources with in vivo growth data. This comparison results in two types of model- prediction inconsistencies; either the model predicts growth when no growth is observed in the experiment (GNG inconsistencies) or the model predicts no growth when the experiment reveals growth (NGG inconsistencies). In this project, we developed an optimization-based framework, GrowMatch, to automatically reconcile GNG predictions (by suppressing functionalities in the model) and NGG predictions (by adding functionalities to the model). We use GrowMatch to resolve inconsistencies between the predictions of the latest in silico E. coli (iAF1260) model and the in vivo data available in the Keio collection and improved the consistency of in silico with in vivo predictions from 90.6% to 96.7%.

Figure Caption: Application of GrowMatch to the iAF1260 model to increase agreement of in silico predictions with in vivo results.

For many cases, GrowMatch identified fairly non-intuitive model modification hypotheses that would have been difficult to pinpoint through inspection alone. In addition, GrowMatch can be used during the construction phase of new, as opposed to existing, genome-scale metabolic models, leading to more expedient and accurate reconstructions.

[Satish Kumar and Maranas (2009); Satish Kumar et al (2007)]

Analysis of Network Properties of Metabolic Models

In this research, we have introduced the Flux Coupling Finder (FCF) framework for elucidating the topological and flux connectivity features of genome-scale metabolic networks. The framework is demonstrated on genome-scale metabolic reconstructions of Helicobacter pylori, Escherichia coli, and Saccharomyces cerevisiae. The analysis allows one to determine if any two metabolic fluxes, v1 and v2, are (i) directionally coupled, if a non-zero flux for v1 implies a non-zero flux for v2 but not necessarily the reverse; (ii) partially coupled, if a non-zero flux for v1 implies a non-zero, though variable, flux for v2 and vice-versa; or (iii) fully coupled, if a non-zero flux for v1 implies not only a non-zero but also a fixed flux for v2 and vice-versa. Flux coupling analysis also enables the global identification of blocked reactions, which are all reactions incapable of carrying flux under a certain condition, equivalent knockouts, defined as the set of all possible reactions whose deletion forces the flux through a particular reaction to zero, and sets of affected reactions denoting all reactions whose fluxes are forced to zero if a particular reaction is deleted. The FCF approach thus provides a novel and versatile tool for aiding metabolic reconstructions and guiding genetic manipulations.

In analogy with the FCF method for fluxes we have developed an analogous method for metabolite pools. Specifically, conservation relationships for metabolite concentrations are important biophysical barriers, selected through evolution, to protect cellular organisms from stresses (e.g., osmotic) and provide global metabolic regulation.. The conservation relationships are linear combinations of metabolite concentrations that do not change over time and are solely determined by the organism's stoichiometry and uptake/secretion transport conditions. To this end, we have introduced an optimization-based framework to elucidate and analyze conservation relationships for metabolite concentrations in the context of genome-scale reaction networks. The framework is comprised of Metabolite Concentration Coupling Analysis (MCCA) and the Minimal Conserved Pool Identification (MCPI) procedure.

Figure Caption: Pictorial overview of coupled metabolite pools for E. coli.

[Nikolaev et al (2005); Burgard et al (2004); Burgard et al (2001)]

Genome-scale Gene/Reaction Essentiality and Synthetic Lethality Analysis

Synthetic lethals refer to pairs of non-essential genes whose simultaneous deletion prohibits growth. One can extend the concept of synthetic lethality by considering gene groups of increasing size where only the simultaneous elimination of all genes is lethal whereas individual gene deletions are not. We developed optimization-based procedures for the exhaustive and targeted enumeration of multi-gene (and by extension multi-reaction) lethals for genome-scale metabolic models. Specifically, we are applying these approaches to the iAF1260 model of E. coli. This analysis for the iAF1260 model led to the complete identification of all double and triple synthetic lethals as well as the targeted identification of higher-order synthetic lethals. Graph representations of these synthetic lethals reveal a variety of motifs ranging from hub-like to highly connected sub-graphs providing a birds-eye view of the avenues available for redirecting metabolism and uncovering complex patterns of gene utilization and interdependence. The procedure also enables the use of falsely predicted synthetic lethals for metabolic model curation. By analyzing the functional classifications of the genes involved in synthetic lethals we reveal surprising connections within and across COG functional classifications.

Figure Caption: Topological and functional classification of clusters of SL gene pairs. Three types of network motifs are present: disjoint pairs, stars (1-connected motifs) and k-connected motifs (highly-connected subgraphs). Genes are color-coded in accordance to the COG (Tatusov et al, 2003) functional categorization.

Modeling of Anaerobic Organism (Archaea, Clostridia)

Clostridia are anaerobic, gram positive bacteria with the ability to consume a broad range of substrates starting from cellulose to glycerol and produce a spectrum of useful metabolites such as propanediol, butanol. The industrial application of these non-standard organisms is hindered by the lack of understanding of their metabolic repertoire which can resolved by the aid of stoichiometric and kinetic models of metabolism. The major challenges in such modeling are redox balance systems (transhydrogenase, hydrogenase etc.), cellulosome complex (whose composition varies based on substrate utilized), biphasic growth (shift between growth phases based on media conditions) and the variation in TCA cycle and PP pathway across organism. We have previously constructed and used a stoichiometric model for understanding C. acetobutylicum's response to stressors (Dash et al, 2014). We are currently working on C. thermocellum stoichiometric and kinetic models to identify ethanol over-producing genetic perturbations.

Figure Caption: Metabolic map of the pathway coverage in different clostridia. The unique pathways in different clostridia have been highlighted: cellulose degradation (C. cellulolyticum and C. thermocellum), glycerol metabolism (C. butyricum), Wood-Ljungdahl pathway and butanediol production (C. ljungdahlii), ABE fermentation (C. acetobutylicum and C. beijerinckii) and TCA cycle.

[Dash et al (2016); Dash et al (2014)]

Elucidation of Metabolic Fluxes using Labeled Isotopes

Atom mappings

A key consideration in metabolic engineering is the determination of fluxes of the metabolites within the cell. The gold standard in elucidating metabolic flows is the use of metabolic flux analysis (MFA). Flux elucidation requires both a detailed mapping of all possible paths different isotope forms can take through metabolic networks and experimental data in the form of labeled isotopes (usually 13C) sampled by NMR or GC-MS. When cells are fed a growth substrate with certain carbon positions labeled with 13C, the distribution of this label in the intracellular metabolites can be calculated based on the known biochemistry of the participating pathways. Most labeling studies have focused on skeletal representations of central metabolism that ignore many flux routes that could contribute to the observed isotopic labeling patterns. In contrast, our approach investigates the importance of carrying out isotopic labeling studies using a more complete reaction network. Our group has contributed and made available a large-scale isotope mapping model for E. coli spanning 238 reactions and 184 metabolites yielding 17,346 distinct isotopes by manually tracking the fate of labeled carbons. This network includes glycolysis, the TCA cycle, the pentose phosphate pathway, anaplerotic reactions, amino acid biosynthesis/ degradation, and oxidative phosphorylation. Many of these reactions are reversible and the model also includes a reaction to account for biomass production. Furthermore, the model includes global metabolite balances on cofactors such as ATP, NADH, and NADPH.

Figure Caption: The TCA cycle appears as a group of linked metabolites connected by reactions (lines) in a metabolic model (top). A carbon isotope model takes into account the additional linkages between the carbon atoms (numbered circles) and their transitions in the network (bottom). The symmetric nature of succinate (suc) and fumarate (fum) generates multiple path lines per atom. Some transitions are non-intuitive (e.g, accoa + oaa → cit), and all depend on the numbering scheme used for the atoms in each metabolite.

Flux elucidation

The experimental system was a strain of E. coli that produces amorphadiene, a precursor to the antimalarial drug artemisinin, grown on aerobic glucose conditions. We found that hundreds of distinct flux distributions were identified with sum of squared error (SSE) values that met the statistical cutoff for goodness-of-fit (i.e., chi-square). This lack of unique elucidation prompted us to deduce flux ranges spanned by equivalent solutions given the experimental error in the measurements. First, we calculated the variability in all fluxes subject to only network stoichiometry. As expected, nearly all of the central metabolic fluxes span wide ranges of values, motivating the need for 13C-based MFA. We subsequently recalculated flux variability after including the isotopomer balancing constraints to determine (i) which fluxes were fully resolved; (ii) how many of them were only partially resolved; and (iii) how many remaining fluxes in the model were completely unaffected by the isotopomer data.

Figure Caption: Flux map of only the central metabolism portion of the large-scale model. The net flux ranges are derived using optimization and incorporating the isotopic data. For reversible fluxes, a solid arrowhead indicates the forward direction.

Design of MFA experiments

One of the important considerations for analysis of the isotope model is how isotope measurements impact the elucidation of fluxes in large-scale metabolic reconstructions. This identifiability problem in MFA with isotopic considerations is very difficult, as isotopic balances yield nonlinear constraints. We have developed an integer programming framework, OptMeas, for the mathematical analysis of metabolic pathways to answer this question. By using a structural analysis-based optimization method, it is possible to exhaustively identify all combinations of isotope labeling experiments and flux measurement that completely resolve all flux values in the network. This approach results in an integer linear programming formulation while accounting for the case of partial measurements (e.g., when only some fragments are measured). These measurements, consisting of both partial or full isotope state determination, were assigned relative costs that allow the experimentalist to select the measurements that will be both sufficient and economical. The proposed framework has been tested on well-studied small demonstration examples. We have also benchmarked the proposed framework by applying it to medium-scale metabolic networks of E. coli and by revisiting our large-scale E. coli model to exhaustively identify all measurements options.

Figure Caption: OptMeas works by identifying the smallest measurement set that renders the incidence matrix of variables-equations full-column rank. By measuring fluxes and/or isotopomer distributions they cease to be variables implying that the corresponding columns in the structural matrix can be removed, eventually producing a full-column rank matrix that is structurally nonsingular.

[Chang et al (2008); Suthers et al (2007)]

Computational Procedures for Pathway Optimization Using Stoichiometric Models of Metabolism


Our group introduced the computational framework termed OptKnock for suggesting gene deletion strategies that are likely to lead to the overproduction of specific chemical products. Specifically, a nested optimization framework (see Figure) was proposed to identify multiple gene deletion combinations that force the coupling of a cellular objective, here assumed to be a drain of biosynthetic precursors in the ratios required for biomass formation, with imposed chemical production targets. This coupling is accomplished by ensuring that the production of the desired product becomes an obligatory byproduct of growth by "shaping" the connectivity of the metabolic network. The computational procedure was designed to identify not just straightforward but also non-intuitive knockout strategies by considering the entire genome-scale metabolic networks as abstracted in recently developed in silico models.

Figure Caption: The bilevel optimization structure of OptKnock. The inner problem performs the flux allocation based on the optimization of a particular cellular objective (e.g., maximization of biomass yield, minimization of metabolic adjust, etc.). The outer problem then maximizes the bioengineering objective (e.g., compound overproduction) by restricting access to key reactions available to the optimization of the inner problem.


Most computational strain design methods rely on surrogate biological objectives (e.g., maximize growth rate or minimize metabolic adjustments) and do not make use of flux measurements often available for the wild-type strain. We have also developed the OptForce framework that identifies all possible engineering interventions by classifying reactions in the metabolic model depending upon whether their flux values must increase, decrease or become equal to zero to meet a pre-specified overproduction target. With increasing availability of fluxomics data and improved strain engineering and molecular biology techniques, OptForce predictions can be implemented for increasing various products including malonyl-coA (Xu et al, 2011) and fatty acids (Ranganathan et al, 2012).

Figure Caption: An outline of the OptForce procedure to identify the minimal set of genetic interventions that guarantee the imposed yield.


Computational strain-design prediction accuracy has been the focus for many recent efforts through the selective integration of kinetic information into metabolic models. Existing approaches relying solely on stoichiometry and rudimentary constraint-based regulation overlook the effects of metabolite concentrations and substrate-level enzyme regulation while identifying metabolic interventions. The k-OptForce protocol integrates the available kinetic descriptions of metabolic steps into a genome-scale stoichiometric-based modeling formalism to sharpen the prediction of intervention strategies. k-OptForce uses kinetic information to (re)apportion reaction fluxes in the network by identifying interventions comprised of both direct enzymatic parameter changes (for reactions with available kinetics) and reaction flux changes (for reactions with only stoichiometric information). The introduction of kinetic expressions can significantly alter the identified interventions compared to those identified with stoichiometry-alone analysis. In particular, additional modifications are required in some cases to avoid the violation of metabolite concentration bounds, while in other cases, the kinetic constraints yield metabolic flux distributions that favor the overproduction of the desired product thereby requiring fewer direct interventions.

Figure Caption: k-OptForce protocol

[ Fong et al (2005); Burgard et al (2003); Burgard and Maranas (2001); Ranganathan et al. (2010); Xu et al. (2011); Ranganathan et al. (2012); Chowdhury et al. (2014) ]

Computational Procedures for Strain and Pathway Design


We have also developed the OptStrain framework which aims to guide pathways modifications, through reaction additions and deletions, of microbial networks for the overproduction of targeted compounds. These compounds may range from electrons or hydrogen in bio-fuel cell and environmental applications to complex drug precursor molecules. A comprehensive database of biotransformations, referred to as the Universal database (with over 5,000 reactions), is compiled and regularly updated by downloading and curating reactions from multiple biopathway database sources. Combinatorial optimization is then employed to elucidate the set(s) of non-native functionalities, extracted from this Universal database, to add to the examined production host for enabling the desired product formation. Subsequently, competing functionalities that divert flux away from the targeted product are identified and removed to ensure higher product yields coupled with growth. This work represents an advancement over earlier efforts by establishing an integrated computational framework capable of constructing stoichiometrically balanced pathways, imposing maximum product yield requirements, pinpointing the optimal substrate(s), and evaluating different microbial hosts.

Figure Caption: Pictorial representation of the OptStrain procedure. Step 1 involves the curation of database(s) of reactions to compile the Universal database, which comprises only elementally balanced reactions. Step 2 identifies a maximum-yield path enabling the desired biotransformation from a substrate (e.g., glucose, methanol, xylose) to product (e.g., hydrogen, vanillin) without any consideration for the origin of reactions. Note that the white arrows represent native reactions of the host and the yellow arrows denote non-native reactions. Step 3 minimizes the reliance on non-native reactions, and Step 4 incorporates the non-native functionalities into the microbial host's stoichiometric model and applies the OptKnock procedure to identify and eliminate reactions competing with the targeted product. The red x's pinpoint the deleted reactions.


De novo metabolic pathways can be carefully designed to outperform native pathways with higher carbon and energy efficiencies as well as expand the product range of a host organism. Scanning through the huge repertoire of microbial enzymes to identify potential components of a novel pathway is a challenging task as one has to consider all the possible factors affecting the pathway efficiency. They include the stoichiometric and cofactor balance, thermodynamic feasibility, co-substrate and co-product choices and compatibility of the enzymes with the host. optStoic is a two-stage MILP-based formalism that first explores the maximum extent of converting a carbon substrate to a desired product, or maximizing the profit potential of a conversion through inclusion of non-intuitive combination of co-reactants and co-products while remaining within the limits of mass, energy or thermodynamic balances. Subsequently, it identifies a protein minimal reaction network from a database of reactions to achieve the overall conversion. This procedure helps us improve on existing strategies in terms of carbon and energy efficiency, while entirely novel designs can also be traced.

Figure Caption: Overview of the optStoic procedure.

[Pharkya and Maranas (2006); Pharkya and Maranas (2005); Pharkya et al (2004); Pharkya et al (2003); Chowdhury and Maranas (2015) ]

Analysis and Redesign for Kinetic Models of Metabolism

Moving from stoichiometric to kinetic models of metabolism, in this research, we proposed a general computational procedure to determine which genes/enzymes should be eliminated, repressed or overexpressed to maximize the flux through a product of interest for general kinetic models. The procedure relies on the generalized linearization of a kinetic description of the investigated metabolic system and the iterative application of mixed-integer linear (MILP) optimization implemented using the MATLAB environment. The proposed computational procedure is a general approach that can be applied to any metabolic system for which a kinetic description is provided. All MATLAB input files are made available as supplementary material.

Figure Caption: Hierarchical procedure for identifying optimal gene manipulation strategies. At each iteration of the outer loop q manipulations are enforced and qmax is the maximum number of manipulations. At each iteration of the inner loop a cut is added to exclude the current solution in the next iteration. zq is the objective function of interventions of q manipulations and z*q-1is the best objective function value of the best intervention strategy of q-1 manipulations.

[Vital-Lopez et al (2006); Nikolaev et al (2005); Petkov and Maranas (1997)]

Kinetic modeling and ME-model

In contrast to stoichiometric-based models, the development of large-scale kinetic models of metabolism has been hindered by the challenge of identifying kinetic parameter values and kinetic rate laws applicable to a wide range of environmental and/or genetic perturbations. The recently introduced ensemble modeling (EM) procedure provides a promising remedy to address these challenges by decomposing metabolic reactions into elementary reaction steps and incorporating all phenotypic observations, upon perturbation, in its model parameterization scheme. Here, we present a stepwise procedure to develop kinetic models of metabolism that satisfies the fluxomic data for many mutant strains by making use of the EM concepts. Parameterization is performed using a formal optimization approach that minimizes the discrepancies between model predictions and flux measurements. With an ever increasing advances in omics measurements, the constructed models and the parameterization procedure presented in this study pave the way for the construction of larger-scale kinetic models with more narrowly distributed parameter values.

Figure Caption: A schematic representation of the kinetic model construction procedure. (1) First, a steady state flux distribution is obtained by imposing the available fluxomic data and refining the flux ranges. (2.1) EM procedure is used to decompose each reaction into elementary mechanistic reaction steps (vj(2lj-1) and vj(2lj) represent the forward and reverse flux of the elementary reaction j in elementary step lj, respectively). Thermodynamic constraints are employed to confine the ranges of reactions reversibilities. In the absence of experimental measurements for metabolite concentration, they are normalized to the their steady-state values. Conservation of mass is written for the total available enzyme in the system for each enzyme. Elementary kinetic parameters are then expressed as a function of reaction reversibilities and enzyme fractions. (2.2) An ensemble of elementary kinetic parameters of size P is constructed by uniformly sampling reaction reversibilities and enzyme fractions. (2.3) For a given set of kinetic parameters the system of ODEs representing the conservation of mass for each metabolite and enzyme fraction is integrated until reaching a steady-state. (3) In order to improve model fitness, the optimizer provides a new set of model parameters (see Materials and methods section). Ultimately, a set of kinetic model that is tested and validated along different fluxomics is identified. (4) The model predictions are validated by a comparison between the available metabolomics, kinetic constants and cross-validations.

[ Khodayari et al (2014); Khodayari et al (2014)]

Back to Research
Go to top