Genes and Development

GENES & DEVELOPMENT 21:1010-1024, 2007
©2007 by Cold Spring Harbor Laboratory Press; ISSN 0890-9369/ $5.00

REVIEW

Getting connected: analysis and principles of biological networks

Xiaowei Zhu1,2, Mark Gerstein3, and Michael Snyder1,2,4

1 Department of Molecular, Cellular, and Developmental Biology, Yale University, New Haven, Connecticut 06520, USA; 2 Program in Computational Biology and Bioinformatics, Yale University, New Haven, Connecticut 06520, USA; 3 Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut 06520, USA


    Abstract
The execution of complex biological processes requires the precise interaction and regulation of thousands of molecules. Systematic approaches to study large numbers of proteins, metabolites, and their modification have revealed complex molecular networks. These biological networks are significantly different from random networks and often exhibit ubiquitous properties in terms of their structure and organization. Analyzing these networks provides novel insights in understanding basic mechanisms controlling normal cellular processes and disease pathologies.

[Keywords: Network topology; regulatory circuits; scale-free networks]


Proper execution of complex biological systems occurs through the intricate coordination of a large number of events and their participating components. Cellular proliferation, differentiation, and environmental interactions each require the production, assembly, operation, and regulation of many thousands of components, and they do so with remarkable fidelity in the face of many environmental cues and challenges. Understanding how cellular and developmental events occur at a molecular level with such precision has become a major focus for modern molecular biology, and considerable effort has been devoted to determining the regulatory networks that control and mediate complex biological processes.

Until recently, the dissection of biological networks occurred through the efforts of individual laboratories working on one or a few components, which limited a thorough understanding of individual biological processes in the context of the entire cellular network. Detailed analysis of specific components and their interacting partners or substrates can be used to assemble high-confidence pathways. For example, analysis of the NF-κB and TGF-β signaling pathways has revealed many components whose functions are reasonably well known for each of these pathways (Mishra et al. 2005; Karin 2006). Nonetheless, in spite of the intensive study of such pathways, new components continue to be discovered (Covert et al. 2005; Ma et al. 2006), indicating that our analysis of even the most well-studied pathways is likely to be incomplete.

The advent of high-throughput techniques has allowed the large-scale identification of components (genes, RNAs, and proteins), their expression patterns, and their biochemical and genetic interactions. Although useful for generating large amounts of biological information, the data from such studies are often incomplete and contain errors. Nonetheless, they can provide valuable information about the functions of individual components and unexpected relationships between components and cellular processes. For example, Arg5,6, a well-characterized metabolic enzyme, was found to have DNA-binding activity through a proteome microarray screen and was later confirmed to regulate gene expression in vivo (Hall et al. 2004). Thus far a variety of large-scale data sets have been generated and used to assemble different networks. Below we briefly describe the different types of biological networks and the general features and principles that result from the analysis of such networks.


    Types of biological networks
Interaction data gathered through both individual studies and large-scale screens can be assembled into a network format whose topological structure reflects significant biological properties. To date, at least five types of biological networks have been characterized in detail: transcription factor-binding, protein–protein interaction, protein phosphorylation, metabolic interaction, and genetic interaction networks (examples of each of these networks and their sizes are presented in Fig. 1 and Table 1). Each of these networks is discussed briefly below.


Table 1. Current status of biological networks

 


Figure 1. Examples of the five major biological networks. (A) A yeast transcription factor-binding network, composed of known transcription factor-binding data collected with large-scale ChIP–chip and small-scale experiments. This figure was generated with the program Pajek (de Nooy et al. 2005). (B) A yeast protein–protein interaction network, containing protein–protein interactions identified by yeast two-hybrid and protein complexes identified by affinity purification and mass spectrometry (Barabasi and Bonabeau 2003). (Reprinted by permission from Macmillan Publishers Ltd: Nature [Jeong et al. 2001], © 2001.) Nodes are colored according to the mutant phenotype. (C) A yeast phosphorylation network composed primarily of in vitro phosphorylation events identified using protein microarrays (Ptacek et al. 2005). The figure was generated with Osprey 1.2.0 (Breitkreutz et al. 2003). (D) An E. coli metabolic network with 574 reactions and 473 metabolites colored according to their modules. (Reprinted by permission from Macmillan Publishers Ltd: Nature [Guimera and Nunes Amaral 2005], © 2005.) (E) A yeast genetic network constructed with synthetic lethal interactions using SGA analysis on eight yeast genes. (From Tong et al. 2001; reprinted with permission from AAAS.) Nodes are colored according to their YPD cellular roles.

 

Transcription factor-binding networks

Transcription factor-binding networks have been assembled in two ways: (1) The analysis of individual components has been used to develop intricate maps in sea urchins and other model organisms (Davidson et al. 2002); and (2) the large-scale identification of transcription factor-binding sites using chromatin immunoprecipitation followed by probing of genomic microarrays (ChIP–chip) or DNA sequencing (ChIP–PET or STAGE) has been used to assemble networks in yeast and other organisms (Horak and Snyder 2002; Kim et al. 2005; Wei et al. 2006).

Thus far a large number of ChIP mapping experiments have been performed in yeast and mammalian cells. The data from ChIP experiments are often of variable quality, particularly in mammalian cells. Most of the initial ChIP–chip experiments used genomic arrays composed of PCR products, which allowed only crude mapping of binding sites and often gave lower-quality results. More recent experiments use oligonucleotide arrays that allow higher-resolution mapping of the binding regions (Cawley et al. 2004; Borneman et al. 2006). The calling of targets is not trivial, as there is a considerable range of signals and probability values associated with each target, often leading to arbitrary assignment of thresholds to the data. Nonetheless, interesting networks have been assembled using these data sets.

For yeast, >250 ChIP–chip experiments have been performed using cells incubated in a variety of experimental conditions or treated with different stimuli, and >10,000 interactions have been reported (Horak et al. 2002; Lee et al. 2002; Harbison et al. 2004; Borneman et al. 2006). These have been assembled into a variety of global networks and subnetworks. For mammalian cells, a large number of experiments have also been performed, often by analyzing selected regions of the genome (Martone et al. 2003; Cawley et al. 2004) or promoter regions (Li et al. 2003). For example, the global identification of targets of three factors involved in embryonic stem cell maintenance has suggested pathways important for stem cell self-renewal (Boyer et al. 2005). Similarly, the analysis of targets of three major transcription factors has revealed a transcriptional map of skeletal myogenesis (Blais et al. 2005).

By combining binding data with expression data, the putative effect of binding on transcriptional output (i.e., activation or repression) can often be obtained. For inducible factors, studies with human NF-κB and STAT1 indicate that only a subset (30%–40%) of differentially expressed genes appear to be direct targets of the factor of interest; presumably many differentially expressed genes are regulated by factors other than the one of interest. Likewise, only a small fraction of binding sites appear to directly modulate nearby gene expression, as many binding sites do not reside near genes whose expression is altered. For example, the majority of NF-κB- and STAT1-binding sites reside near genes whose expression is not altered by the conditions that activate the factor (Martone et al. 2003; Cawley et al. 2004; Hartman et al. 2005). In addition, experiments with yeast have shown that deletion of a transcription factor typically affects only a subset of targets (Gasch et al. 2000). These observations indicate that many binding sites lack biological function or, more likely, are functionally redundant with other regulatory sites or affect gene expression under other conditions. In mammalian systems, they might also operate on genes that reside at distant locations (Carroll et al. 2005).
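The comparison of binding and expression data described above amounts to intersecting two gene sets. The following is a minimal sketch of that bookkeeping; the gene names and the function `overlap_summary` are hypothetical placeholders and are not taken from any of the cited data sets.

```python
# Hypothetical gene identifiers used only for illustration.
bound_genes = {"geneA", "geneB", "geneC", "geneD", "geneE"}    # genes near a ChIP-defined binding site
changed_genes = {"geneB", "geneD", "geneX", "geneY", "geneZ"}  # genes differentially expressed on activation


def overlap_summary(bound, changed):
    """Summarize how binding and expression changes overlap for one factor."""
    if not bound or not changed:
        raise ValueError("both gene sets must be non-empty")
    direct = bound & changed  # putative direct targets: bound AND differentially expressed
    return {
        "direct_targets": sorted(direct),
        # share of expression changes that can be attributed to direct binding
        "frac_changed_that_are_bound": len(direct) / len(changed),
        # share of binding sites associated with an expression change
        "frac_bound_that_change": len(direct) / len(bound),
    }


summary = overlap_summary(bound_genes, changed_genes)
print(summary)
```

In real data the two fractions generally differ, which is the point made in the text: a factor can account for only part of the expression response, and only part of its binding sites lie near responsive genes.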


Protein–protein interaction networks

Protein–protein interaction maps represent the largest and most diverse data sets available to date. The first maps were generated using two-hybrid studies in which interactions of protein partners are assessed in yeast using a transcriptional readout (Uetz et al. 2000; Ito et al. 2001). Large-scale two-hybrid studies have been used to study interactions in other organisms such as Drosophila, Caenorhabditis elegans, and humans (Giot et al. 2003; Li et al. 2004; Rual et al. 2005). More recently, high-throughput studies using affinity purification followed by identification of associated proteins using mass spectrometry have resulted in large data sets of protein interactions. Two recent studies have described the purification of most proteins present in a eukaryotic cell, and both identified ~500 protein complexes in yeast (Gavin et al. 2006; Krogan et al. 2006). Considering the coverage of the experiments, these studies suggest there are ~800 protein complexes in yeast. Extrapolation to the human proteome based on gene number predicts ~3000 human protein complexes.

Each type of interaction study has technical concerns associated with it (Goll and Uetz 2006). Two-hybrid studies may reveal interactions that do not normally occur in vivo. Affinity purification, on the other hand, may yield protein contaminants and may not detect interactions in which binding partners are present substoichiometrically in a complex. Comparison between these data sets reveals only partial overlap even for the most comprehensive studies. This is likely due to the incomplete coverage of each study and the diverse computational methods or stringencies applied to interpret the raw data sets. Nonetheless, these interaction maps, when integrated together, have revealed global topological and dynamic features of interactome networks that relate to known biological properties (see below).


Protein phosphorylation networks

Studies of yeast and humans have suggested that 30% of cellular proteins are phosphorylated in vivo (Cohen 2000; Ficarro et al. 2002; Manning et al. 2002a); this figure is most likely a large underestimate of the number of phosphorylated residues since comprehensive mapping studies have not been performed. Consistent with the importance of phosphorylation as a regulatory mechanism, eukaryotes devote ~2% of their protein-coding genes to protein kinases, ranging from 122 for yeast to 518 for humans (Zhu et al. 2000; Manning et al. 2002b).

Until recently, protein phosphorylation has generally been mapped on a limited scale. However, newly developed approaches in mass spectrometry have allowed the identification of a large number of phosphorylated residues including those regulated during cell stimuli and developmental responses (Ficarro et al. 2002; Gruhler et al. 2005; Ptacek and Snyder 2006). These approaches usually involve enrichment of phospho-proteins using matrices that bind phospho-modified proteins. For example, one study of the developing forebrain and midbrain tissues of embryonic mice used strong cation exchange columns followed by tandem mass spectrometry to identify >500 serine, threonine, or tyrosine phospho-sites (Ballif et al. 2004). Other studies have used immunoprecipitation to enrich for tyrosine phospho-proteins followed by mass spectrometry; these have led to discovery of novel phospho-tyrosine protein modifications in human T cells (Brill et al. 2004; Tao et al. 2005).

In addition to the identification of phosphorylated residues, two new approaches have shed light on the substrates of protein kinases. The use of modified kinases that accept only radiolabeled ATP analogs has revealed many substrates for several yeast kinases including the cyclin-dependent kinases Pho85 and Cdc28 (Dephoure et al. 2005; Loog and Morgan 2005). A second approach used a proteome microarray containing 4400 yeast proteins to detect in vitro substrates for the majority of yeast protein kinases. This study identified ~4200 phosphorylations affecting >1300 substrates (Ptacek et al. 2005). These different studies have identified a large number of phosphorylation events, many of which were validated in vivo. Many of the phosphorylations involved substrates that operate in a known pathway of the kinase; however, several validated substrates function in different cellular processes from those known for the kinase, thereby revealing new functions for the protein kinases.


Metabolic interaction networks

The wealth of biochemical data generated in the past century, when combined with genome sequences, allows the construction of metabolic networks. The metabolic network usually focuses on the mass flow in basic chemical pathways that generate essential components such as amino acids, sugars, and lipids, and the energy required by the biochemical reactions. As such, these networks typically present both protein and metabolite information. Literature curation and genome annotation have elucidated many complex biochemical pathways (Kanehisa and Goto 2000; Overbeek et al. 2000) from which various metabolic networks have been reconstructed in a wide variety of organisms such as Escherichia coli (Reed et al. 2003), Saccharomyces cerevisiae (Duarte et al. 2004), and human mitochondria (Vo et al. 2004).

Interactions in metabolic networks are closely related to gene functions, and therefore have great potential for immediate applications in the interpretation of gene roles. Considerable attention has been focused on network dynamics using constraint-based analyses such as flux balance analysis (FBA), which assumes a steady state for all metabolites and that the organism will optimize the metabolite fluxes to maximize biomass production (Segre et al. 2002; Famili et al. 2003; Forster et al. 2003). This approach has led to many successful predictions. For example, an in silico flux model was used to predict the phenotypes of yeast strains containing gene deletion mutations grown under various media conditions and achieved a remarkable 83% accuracy (Duarte et al. 2004). In addition, a flux model on a yeast metabolic network was able to explain enzyme dispensability; that is, how loss-of-function mutations of many yeast enzymes result in viable strains (Papp et al. 2004). This model suggested that the majority of nonessential enzymes are vital for cell growth under certain previously untested conditions, whereas only a small subset are compensated by isoenzymes or parallel pathways. Other successful constraint-based analyses in metabolic networks have also been performed. These include (1) re-engineering micro-organisms with gene deletions for the purpose of manipulating their chemical products (Burgard et al. 2003) and (2) evaluating steady-flux distributions in human mitochondria using constraints related to normal, disease, and dietetic treatment conditions (Thiele et al. 2005). Additional examples of constraint-based analysis can be found in a detailed review (Price et al. 2004). Although many metabolic network studies were developed in micro-organisms and S. cerevisiae, these studies may also shed light on other organisms, since the fundamental network structures may be conserved in evolution. Topological analysis of metabolic networks in 43 organisms covering all three life domains revealed highly similar topological properties, although great diversity exists among individual pathways and components (Jeong et al. 2000).
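The two assumptions behind flux balance analysis (steady state for every metabolite, and optimization of a growth objective) are conventionally written as a linear program. The notation below is the standard textbook form and is an assumption of this sketch rather than something given in the text: S is the stoichiometric matrix (metabolites by reactions), v is the vector of reaction fluxes, c selects the biomass-producing reaction(s), and v_min and v_max are capacity or reversibility bounds.

```latex
\begin{aligned}
\max_{v}\quad & c^{\mathsf{T}} v
  && \text{(biomass production)} \\
\text{subject to}\quad & S\,v = 0
  && \text{(steady state: no net accumulation of any metabolite)} \\
& v_{\min} \le v \le v_{\max}
  && \text{(flux bounds)}
\end{aligned}
```

Under this formulation, a gene deletion can be simulated by setting both bounds of the reactions catalyzed by its product to zero and solving again; a predicted optimum near zero corresponds to a predicted growth defect.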


Genetic and small molecule interaction networks

Combining mutations in two different genes can either synergistically reduce or enhance the growth or fitness of an organism, relative to organisms containing the individual mutations. One of the most common interactions analyzed is "synthetic lethality," in which mutations that do not individually cause loss of viability are lethal when combined (Bender and Pringle 1991; Costigan et al. 1992). For many, if not most, species, the majority of genes are not lethal when mutated individually; this is likely because of genetic redundancy or because the affected genes normally enhance the fitness of the organism rather than being essential for its viability. When mutations are combined in the same strain to produce a phenotype stronger than that caused by an individual mutation, the mutated genes are often thought to reside in parallel redundant pathways, although other interpretations are possible. Regardless of the reason, the ability to combine mutations to produce strong phenotypes provides the opportunity to carry out synthetic lethal analysis on a large scale, which provides a wealth of useful information.
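One common way to make "synergistically reduce or enhance" quantitative is to compare the double-mutant fitness with the product of the single-mutant fitnesses. The sketch below assumes this multiplicative null model, with fitness expressed relative to wild type (wild type = 1.0); the model choice, the thresholds `tol` and `lethal_below`, and the function names are illustrative assumptions and are not taken from the studies cited here.

```python
def interaction_score(w_a, w_b, w_ab):
    """Deviation of the observed double-mutant fitness from the
    multiplicative expectation w_a * w_b (assumed null model)."""
    return w_ab - w_a * w_b


def classify(w_a, w_b, w_ab, tol=0.05, lethal_below=0.05):
    """Label a gene pair from relative fitness values (wild type = 1.0).
    The thresholds are arbitrary illustration values."""
    eps = interaction_score(w_a, w_b, w_ab)
    # Synthetic lethality: each single mutant is viable, the double mutant is not.
    if w_ab <= lethal_below and min(w_a, w_b) > lethal_below:
        return "synthetic lethal"
    if eps < -tol:
        return "aggravating"      # double mutant is sicker than expected
    if eps > tol:
        return "alleviating"      # double mutant is fitter than expected
    return "no interaction"


print(classify(0.9, 0.8, 0.0))    # both singles viable, double inviable
print(classify(0.9, 0.8, 0.72))   # matches the multiplicative expectation
print(classify(0.9, 0.8, 0.5))    # worse than expected
print(classify(0.7, 0.6, 0.7))    # better than expected
```

Other null models (for example, additive or minimum-based expectations) lead to different scores for the same measurements, so the choice should be stated whenever such scores are compared across studies.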

Large-scale synthetic lethal screens have been performed in S. cerevisiae, in which deletion mutations in only 1100 protein-coding genes (of ~6000 total) prevent growth in standard rich medium (Winzeler et al. 1999; Giaever et al. 2002). Genetic interaction screens using either plate (SGA) or microarray readouts (dSLAM) with yeast strains containing mutations in nonessential genes have been used to systematically uncover synthetic lethal interactions (Tong et al. 2001, 2004; Pan et al. 2004). One recent study that combined genetic interactions from high-throughput methods and a literature curation of 53,117 publications in PubMed produced an S. cerevisiae genetic network containing 3258 genes and 13,963 interactions; this network revealed a significant overlap with protein–protein interactions (Reguly et al. 2006). For essential genes, strains containing conditional mutations such as those that confer a temperature-sensitive growth defect or with the gene under the control of a tetracycline titratable promoter can be analyzed under conditions that reduce, but do not eliminate, the activity of the gene product (Davierwala et al. 2005). Analysis of these interactions has also revealed functional relationships between genes and a high correlation with other properties, such as mutant phenotypes and cellular localization, thus helping to assign biological roles for unknown genes and infer novel functions for annotated genes.

In addition to synthetic lethal screens, other types of genetic interactions can be measured. These include combining mutations that disrupt inhibitory interactions and thus enhance growth. In fact, interactions that when combined either enhance or reduce growth have been investigated to generate a detailed genetic interaction map, E-MAPs (for epistatic miniarray profiles), for genes involved in the yeast early secretory pathway (Schuldiner et al. 2005). Another type of genetic interaction is a synthetic dosage lethal screen in which overexpressed genes are introduced into a mutant strain background; synthetic dosage lethality can provide additional, and often nonoverlapping, interaction data to those found by combining inactivating mutations (Measday et al. 2005). For example, overexpression of genes that inhibit growth in a mutant strain background has been used to screen for genes that would negatively regulate protein kinase substrates (Sopko et al. 2006). Finally, a conceptually similar approach to synthetic lethality is to screen for mutant strains that are hypersensitive to inhibitory small molecules. Thus far, screens have been performed between inhibitory chemical compounds and deletion mutants of all yeast nonessential genes or strains heterozygous for mutations in essential genes (Giaever et al. 2004; Parsons et al. 2004). Such chemical genetic interactions, when integrated with genetic interactions, often suggest pathways targeted by the drugs as well as potential direct drug targets. Thus, this approach offers a powerful tool for deciphering the mechanisms of action of drugs as well as defining suitable biological pathways that can be targeted for inhibition.


Other biological networks

The global behavior of gene interactions can also be investigated by networks connecting genes and/or proteins sharing certain properties. A coexpression network, in which genes are connected if their transcripts are coregulated, was assembled in S. cerevisiae and contains 4077 genes connected by 65,430 interactions (Stuart et al. 2003; van Noort et al. 2004). Proteins that share other properties, such as biological processes (Tari et al. 2005) and mutant phenotypes (Gunsalus et al. 2005; Ohya et al. 2005), can also be linked with each other and assembled into networks. The coexpression and homolog networks differ from the other networks described above in that the interactions are based on similarities not related to gene function. Nonetheless, they can still be investigated with similar approaches and often exhibit comparable network topology. Moreover, these networks also share the "guilt by association" property with the five biological networks: Highly connected proteins are likely to be functionally related. Therefore studies on these networks may also discover novel protein roles and help to decipher the complex cellular networks, especially when integrated with other biological networks.
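As an illustration of how a coexpression network can be built, the sketch below links two genes when the Pearson correlation of their expression profiles exceeds a cutoff. The correlation measure, the cutoff of 0.9, the gene names, and the expression values are all assumptions made for illustration; the cited studies may define coregulation differently.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a flat profile carries no correlation information
    return cov / (sd_x * sd_y)


def coexpression_edges(profiles, threshold=0.9):
    """Return (gene1, gene2, r) for every pair whose correlation is >= threshold."""
    edges = []
    for g1, g2 in combinations(sorted(profiles), 2):
        r = pearson(profiles[g1], profiles[g2])
        if r >= threshold:
            edges.append((g1, g2, round(r, 3)))
    return edges


# Hypothetical expression values across four conditions.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.1, 3.9, 6.2, 8.0],   # rises with geneA
    "geneC": [4.0, 3.0, 2.0, 1.0],   # anticorrelated with geneA
    "geneD": [1.0, 3.0, 2.0, 4.0],   # only loosely follows geneA
}
edges = coexpression_edges(profiles)
print(edges)
```

Only the geneA–geneB pair passes the cutoff here. In practice the threshold strongly affects network size and degree distribution, so it should be reported alongside any topological statistics.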


    Global topology
Interactions are often assembled into network maps composed of proteins (or genes), termed vertices or nodes, and connections between them, defined as edges (in undirected networks) or arcs (in directed networks). The directionality of a network is dependent on the characteristics of the biological data. Protein–protein and genetic interactions are usually represented with an undirected network, whereas transcription factor-binding, phosphorylation, and metabolic networks have directionality built into their interactions. One feature of nearly all of the interaction studies is that the strength of interactions can vary considerably. For example, the dissociation constant values observed in a biological system can vary by >10 orders of magnitude (Wallis et al. 1995). Such quantitative information, however, is rarely used in network analyses, and interactions are usually reported as binary measurements. Future studies are likely to overcome these limitations as more accurate measurements are obtained, and weighted values can be assigned to network connections as indicators of the interaction strength.

Network topology plays a vital role in understanding network architecture and performance. Several of the most important and commonly used topological features include degree, distance (shortest path length), clustering coefficient, and betweenness (Fig. 2). Detailed descriptions of each of these statistics are as follows: (1) Degree: The number of links connected to one vertex is defined as its degree. In directed networks, the number of arcs that end at the node is termed "in-degree," and the number of arcs that start from the node is termed "out-degree." A node with high degree is better connected in the network and therefore may play a more important role in maintaining the network structure. (2) Distance: The shortest path length between two vertices is defined as their distance. In an interaction network, the maximum distance between any two nodes is termed the graph diameter. The average distance and diameter of a network measure the approximate distance between vertices in a network. A network with a small diameter is often termed a "small world" network (Milgram 1967), in which any two nodes can be connected with relatively short paths. Many real world networks such as metabolic networks have a small world architecture (Watts and Strogatz 1998), which may serve to minimize transition times between metabolic states (Wagner and Fell 2001). (3) Clustering coefficient: The clustering coefficient of one vertex can be calculated as the number of links between the vertices within its neighborhood divided by the number of links that are possible between them. A high clustering coefficient for a network is another indicator of a small world. (4) Betweenness: Betweenness is the fraction of the shortest paths between all pairs of vertices that pass through one vertex or link. Betweenness estimates the traffic load through one node or link assuming that the information flows over a network primarily following the shortest available paths.
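The four statistics defined above can be computed directly on a small undirected network. The following is a minimal sketch in plain Python; the five-node example graph and all function names are hypothetical. Betweenness is reported here as the sum, over all pairs of other nodes, of the fraction of their shortest paths that pass through the node of interest, which is one common convention and may differ by a normalization factor from the formula in Figure 2.

```python
from collections import deque
from itertools import combinations


def build_graph(edges):
    """Undirected adjacency sets from a list of (u, v) pairs."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph


def degree(graph, node):
    return len(graph[node])


def path_counts(graph, source):
    """Breadth-first search returning, for each reachable node, its distance
    from source and the number of distinct shortest paths from source."""
    dist, sigma = {source: 0}, {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                sigma[v] = 0
                queue.append(v)
            if dist[v] == dist[u] + 1:
                sigma[v] += sigma[u]
    return dist, sigma


def distance(graph, a, b):
    return path_counts(graph, a)[0][b]


def diameter(graph):
    """Largest distance between any two nodes (assumes a connected graph)."""
    return max(max(path_counts(graph, s)[0].values()) for s in graph)


def clustering_coefficient(graph, node):
    """Existing links among the neighbors divided by the possible links."""
    nbrs = graph[node]
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(nbrs, 2) if b in graph[a])
    return links / (k * (k - 1) / 2)


def betweenness(graph, node):
    """Sum over pairs (s, t) of the fraction of shortest s-t paths through node."""
    info = {s: path_counts(graph, s) for s in graph}
    dist_v, sigma_v = info[node]
    total = 0.0
    others = [n for n in graph if n != node]
    for s, t in combinations(others, 2):
        dist_s, sigma_s = info[s]
        if t not in dist_s or node not in dist_s or t not in dist_v:
            continue  # unreachable pair
        if dist_s[node] + dist_v[t] == dist_s[t]:
            total += sigma_s[node] * sigma_v[t] / sigma_s[t]
    return total


# Hypothetical network: a triangle A-B-C with a tail C-D-E.
g = build_graph([("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("D", "E")])
print(degree(g, "C"), distance(g, "A", "E"), diameter(g))
print(clustering_coefficient(g, "C"), betweenness(g, "C"))
```

In this example node C has the highest degree and the highest betweenness, yet only one of the three possible links among its neighbors exists, so its clustering coefficient is low. The statistics capture different aspects of a node's position and need not agree.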


Figure 2. Topological parameters. Five commonly used topological parameters are illustrated in both graphs and formulae. (A) Degree measures the number of connections one node has. (B) Distance is the length of the shortest path between two nodes. (C) Diameter is the maximum distance between any two nodes in a network. (D) Clustering coefficient measures the percentage of existing links among the neighborhood of one node. (E) Betweenness is the fraction of those shortest paths between all pairs of vertices that pass through one vertex or link. All graphs and formulae are based on an undirected network.

 
Assembly of interactions into networks reveals that current versions of biological networks are not randomly organized but rather have a "scale-free" format containing hubs with many connections and a large number of nodes that have one or a small number of connections (Fig. 3; Barabasi and Oltvai 2004). This organization was originally discovered in World Wide Web interactions and later found to exist in four of the types of networks described above: protein–protein interaction, transcription factor-binding, metabolic, and genetic data sets (Barabasi and Albert 1999; Jeong et al. 2000, 2001; Guelzim et al. 2002; Tong et al. 2004). Below we demonstrate that this is also the case for the phosphorylation network. Compared with a bell-shaped degree distribution in random networks, scale-free networks have a typical "power law" distribution, P(k) ∝ k^(−γ), in which k is the degree, γ is the degree exponent, and P(k) is the probability that a randomly selected node has a degree k. This results in a "fat-tailed" distribution in which there are vertices with high degrees termed "hubs." The advantage of this type of organization is that the system is more robust; random loss of individual nonhub vertices is less disruptive in a scale-free network than in a random network.
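The contrast between the two degree distributions can be reproduced with a small simulation. The sketch below grows one network by preferential attachment (new nodes link preferentially to already well-connected nodes, a growth rule known to produce a power-law tail) and compares it with a random network of the same size in which edges are placed uniformly. The parameters (2000 nodes, two links per new node, random seed 0) and all function names are arbitrary choices for illustration and do not come from the text.

```python
import random
from collections import Counter


def preferential_attachment(n, m, rng):
    """Grow a network to n nodes; each new node links to m distinct existing
    nodes chosen with probability proportional to their current degree."""
    # Start from a fully connected core of m + 1 nodes.
    edges = [(i, j) for i in range(m + 1) for j in range(i)]
    # Each node appears in 'stubs' once per edge end, so a uniform draw
    # from this list is a degree-proportional draw over nodes.
    stubs = [node for edge in edges for node in edge]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(stubs))
        for t in targets:
            edges.append((new, t))
            stubs.extend((new, t))
    return edges


def random_network(n, n_edges, rng):
    """Place n_edges distinct edges uniformly at random among n nodes."""
    edges = set()
    while len(edges) < n_edges:
        u, v = rng.sample(range(n), 2)
        edges.add((min(u, v), max(u, v)))
    return list(edges)


def degrees(n, edges):
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return [deg[i] for i in range(n)]


rng = random.Random(0)
n = 2000
sf_edges = preferential_attachment(n, 2, rng)
er_edges = random_network(n, len(sf_edges), rng)
sf_deg = degrees(n, sf_edges)
er_deg = degrees(n, er_edges)

print("edges in each network:", len(sf_edges))
print("largest degree, preferential attachment:", max(sf_deg))
print("largest degree, random:", max(er_deg))
```

Both networks have the same number of nodes and edges, and therefore the same average degree, but the preferential-attachment network contains a few hubs whose degree far exceeds anything in the random network. Plotting the counts of nodes per degree on log-log axes, as in Figure 3, gives an approximately straight line only for the former.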


Figure 3. Topological comparison between a random network and a scale-free network. Degree distribution in random networks is bell-shaped. The scale-free network has more high-degree nodes and a power-law degree distribution, which leads to a straight line when plotting the total number of nodes with a particular degree versus that degree in log-log scales.

 
Hub components in a scale-free network are extremely important and therefore usually play essential roles in biological systems. In the yeast protein–protein interaction networks, hubs are more likely to be essential and conserved relative to nonhub proteins (Jeong et al. 2001; Barabasi and Oltvai 2004). Presumably much of the regulation in a network occurs and is mediated through such proteins. Likewise, key components whose activation is sufficient to induce a cellular process (master regulator genes) have been shown to be regulated by many other components and are thus target hubs; these often lie downstream in the process (Weintraub et al. 1989; Borneman et al. 2006). Not all components within a regulatory pathway serve as master regulators, probably because noise introduced into the system may inappropriately activate the process at undesired times. Presumably, components that lie within a network are buffered through both positive and negative regulatory contacts that prevent them from directly activating a biological process. The location of master regulators at the bottom of a highly connected network would allow maximum information input to be interpreted through upstream components and relayed into a final decision output; thus master regulators often represent important regulatory nodes in biological networks. For example, Twist, a master regulator controlling gene expression in embryonic morphogenesis, is responsible for tumor invasion and metastasis (Yang et al. 2004).

Further analysis of the transcription factor network has also revealed an additional novel aspect of regulatory network hierarchy. When the binding targets of E. coli and S. cerevisiae transcription factors are analyzed with respect to binding to other transcription factors, a pyramid-shaped hierarchical organization can be assembled with a few key regulators at the top to which few other factors bind and most transcription factors on the bottom as the functional units for specific pathways (Yu and Gerstein 2006). Similar to the middle managers in social networks such as governmental hierarchies, transcription factors in the middle layers often regulate more targets and have higher betweenness, indicating that they may function as bottlenecks in the hierarchy. With more interaction data gathered in the future, such hierarchical structures can also be investigated in other directed networks such as metabolic networks and phosphorylation networks.
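One simple way to arrange transcription factors into layers of this kind is a breadth-first layering from the bottom: factors that regulate no other factor form the bottom layer, factors that directly regulate a bottom-layer factor form the next layer, and so on. The sketch below illustrates that idea only; it is an assumption of this example and is not claimed to be the exact procedure of the cited study. The factor names and the function `hierarchy_levels` are hypothetical.

```python
def hierarchy_levels(regulates):
    """Assign each transcription factor (TF) a layer number.

    regulates maps a TF to the set of TFs it binds or regulates.
    Layer 1 holds TFs that regulate no other TF; a TF in layer L + 1
    directly regulates at least one TF in layer L and none of the TFs
    already placed lower would have captured it earlier. Autoregulation
    is ignored. TFs with no path down to layer 1 are left unassigned.
    """
    tfs = set(regulates) | {t for targets in regulates.values() for t in targets}
    targets = {tf: set(regulates.get(tf, ())) - {tf} for tf in tfs}

    level = {tf: 1 for tf in tfs if not targets[tf]}  # bottom layer
    frontier = set(level)
    current = 1
    while frontier:
        current += 1
        frontier = {
            tf for tf in tfs
            if tf not in level and targets[tf] & frontier
        }
        for tf in frontier:
            level[tf] = current
    return level


# Hypothetical regulatory relationships among five TFs.
regulates = {
    "TF1": {"TF2", "TF3"},
    "TF2": {"TF4"},
    "TF3": {"TF4", "TF5", "TF3"},  # includes an autoregulatory loop
    "TF4": set(),
    "TF5": set(),
}
levels = hierarchy_levels(regulates)
print(sorted(levels.items()))
```

With these inputs the single top regulator sits two layers above the bottom. Counting the factors in each layer of a real network would show whether the layers narrow toward the top, as in the pyramid described above.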


    Similarities between the transcription and phosphorylation networks
Transcriptional control and post-translational regulation by kinase phosphorylation are two major methods eukaryotes use for gene regulation; each controls a large number of targets. In yeast, humans, and many other organisms, the numbers of these two types of regulators differ by only about two- to threefold; there are ~250 transcription factors and 122 protein kinases in yeast (Zhu et al. 2000; Harbison et al. 2004) and ~1300 transcription factors and 518 protein kinases in humans. As shown in Figure 4, we have performed a detailed comparison of the network topologies of the yeast transcription factor-binding network and phosphorylation network under rich-nutrient conditions. These networks contain a remarkable number of similarities. First, the two networks share similar degree distributions: exponential in-degree distributions and power-law out-degree distributions (Fig. 4A). Second, many topological parameters are comparable between the two networks (Fig. 4B); however, the phosphorylation network is denser than the transcription factor-binding network and contains more nodes with large in- and out-degrees. Finally, the current phosphorylation network is smaller than the transcription factor-binding network. Both networks are built on incomplete data sets and may contain errors. The yeast phosphorylation data, in particular, are primarily collected from one large-scale study covering only two-thirds of all the yeast kinases. The transcription factor-binding network has more experimental sources and therefore a larger coverage. Since diameter is positively correlated with the network size, and limited sampling of a network often lowers the average clustering coefficient (Friedel and Zimmer 2006), the difference in the network size may explain why the transcription factor-binding network has a larger diameter and a higher clustering coefficient.
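To make the compared quantities concrete, the following minimal Python sketch computes in-degree and out-degree distributions and the density of a directed network. The edge list and node names (TF1, g1, and so on) are hypothetical, and the density formula assumes no self-loops; this is an illustration of the definitions, not the procedure used to produce Figure 4.

```python
from collections import Counter

def degree_summary(edges):
    """Summarize a directed network given as (regulator, target) pairs."""
    edges = set(edges)                       # ignore duplicate edges
    nodes = {n for e in edges for n in e}
    out_deg = Counter(src for src, _ in edges)
    in_deg = Counter(dst for _, dst in edges)
    n = len(nodes)
    # Density: observed edges over possible ordered pairs (assumes no self-loops)
    density = len(edges) / (n * (n - 1)) if n > 1 else 0.0
    # Degree distributions: number of nodes having each degree value
    in_dist = Counter(in_deg.get(v, 0) for v in nodes)
    out_dist = Counter(out_deg.get(v, 0) for v in nodes)
    return {"nodes": n, "edges": len(edges), "density": density,
            "in_dist": dict(in_dist), "out_dist": dict(out_dist)}

# Hypothetical toy network: one broad regulator and one specific regulator
toy = [("TF1", "g1"), ("TF1", "g2"), ("TF1", "g3"), ("TF2", "g3"), ("TF1", "TF2")]
summary = degree_summary(toy)
```

Plotting the two distributions on log-linear and log-log axes is one common way to judge whether they look more exponential or more power-law-like, although a small sample rarely settles the question.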


Figure 4. The yeast phosphorylation network resembles the transcription factor-binding network in topological structure. (A) The in-degree and out-degree distributions were plotted after the nodes were binned to several degree intervals. Both networks have exponential in-degree distributions and power-law out-degree distributions. (B) Many topological parameters are comparable between the two networks, except that the transcriptional network is larger and the phosphorylation network is denser.

 

    Network modules
Although initial studies have characterized the global topological structure of biological networks, recently much attention has been paid to the local units of the networks. Large subgraph units, assembled by groups of densely associated proteins and connected to each other with loose links, are defined as network modules (Girvan and Newman 2002; Rives and Galitski 2003; Newman 2006). Such community-like network modules have been uncovered in many types of social networks as well as biological networks, in which they often function as essential components of the network. For example, one study of protein interactions in a transcriptional network indicates that different types of transcriptional regulators such as transcription factors, nuclear transporters, and nucleosome remodeling proteins prefer to form modules within each class, and the modules are joined by sparse connections (Tsankov et al. 2006). The modules often contain proteins of unknown function, and therefore may shed light on protein function predictions. Furthermore, two classes of proteins are revealed by studies of modular structures. "Module organizer" proteins are highly connected to other proteins within modules and are essential to the module functions. "Module connector" proteins link different modules together and are vital for intermodule communications (Rives and Galitski 2003).

Many methods have been developed to identify possible network modules. A traditional method, hierarchical clustering, assigns a weight value to the distance between any two nodes in a network, and then gathers nodes with similar weight vectors together into strongly connected cores (Rives and Galitski 2003). Instead of detecting cores of modules as in hierarchical clustering, the Girvan-Newman algorithm focuses on defining the boundaries of modules by searching for edges with high betweenness, which are more likely to link different modules (Girvan and Newman 2002). Other algorithms have been introduced recently and may demonstrate improvement in module identification (Guimera and Nunes Amaral 2005; Adamcsek et al. 2006; Newman 2006). One concern, however, is that network modules are often dependent on the methods and parameters used in the initial data partitioning, and in general it is difficult to tell which method is better (Barabasi and Oltvai 2004). Furthermore, inaccurate and incomplete data of the interaction networks may also lead to biased module predictions. Nonetheless, network modules are still ubiquitous structures in most biological networks and may help one to better understand the interplay between network structure and function.
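The edge-removal idea behind the Girvan-Newman approach can be sketched in a few lines of Python. The version below is a simplified illustration, not the published implementation: it treats the network as unweighted and undirected, stops at the first split rather than building a full dendrogram, and uses a hypothetical toy graph of two triangles joined by a single bridge.

```python
from collections import deque, defaultdict

def edge_betweenness(adj):
    """Shortest-path edge betweenness for an unweighted, undirected graph."""
    score = defaultdict(float)
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma)
        dist, sigma, preds, order = {s: 0}, defaultdict(float), defaultdict(list), []
        sigma[s] = 1.0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate path fractions from the farthest nodes back toward s
        delta = defaultdict(float)
        for w in reversed(order):
            for v in preds[w]:
                c = sigma[v] / sigma[w] * (1.0 + delta[w])
                score[frozenset((v, w))] += c
                delta[v] += c
    return score  # every path is counted from both ends; the ranking is unaffected

def components(adj):
    """Connected components found by depth-first search."""
    seen, comps = set(), []
    for s in adj:
        if s in seen:
            continue
        comp, stack = set(), [s]
        while stack:
            v = stack.pop()
            if v not in comp:
                comp.add(v)
                stack.extend(adj[v] - comp)
        seen |= comp
        comps.append(comp)
    return comps

def girvan_newman_split(edges):
    """Remove highest-betweenness edges until the graph gains a component."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    start = len(components(adj))
    while len(components(adj)) == start and any(adj.values()):
        score = edge_betweenness(adj)
        a, b = max(score, key=score.get)
        adj[a].discard(b)
        adj[b].discard(a)
    return components(adj)

# Hypothetical toy graph: two triangles joined by the bridge C-D
toy = [("A", "B"), ("B", "C"), ("A", "C"),
       ("D", "E"), ("E", "F"), ("D", "F"), ("C", "D")]
modules = girvan_newman_split(toy)
```

In this toy graph the bridge carries every shortest path between the two triangles, so it is removed first and the two triangles emerge as separate modules. On real interaction data the outcome depends on where the removal is stopped, which is one instance of the parameter dependence noted above.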


    Network motifs
The availability of large interaction data sets allows the identification of much smaller common patterns or motifs within large networks that are used with significantly higher frequencies relative to randomized networks. Analysis of transcription factor-binding data in E. coli has revealed three different types of motifs: feed-forward loops (FFL), single input modules (SIM), and dense overlapping regulons (DOR) (Shen-Orr et al. 2002). FFL and DOR are also found to be significantly enriched in yeast transcriptional networks (Milo et al. 2002). It is possible that many, and perhaps all, single input motifs in eukaryotes are the result of incomplete data and that most genes probably contain multiple inputs.

We applied a tool, mfinder (Milo et al. 2002), to identify enriched three-element and four-element motifs in an updated yeast transcription factor-binding network and the yeast phosphorylation network. Both data sets were generated in yeast cells grown in rich media conditions. Among all possible three-element motifs, the FFL was found to be well overrepresented in transcriptional networks (Fig. 5). Coherent FFLs, in which both transcription factors have the same regulation effects (induction or repression) on the target, may suggest a functional design for gene transcription regulation. Studies have shown that coherent FFLs can control downstream processes in a fashion that is resistant to transient noise, since targets in FFLs can only be effectively regulated through persistent signals (Shen-Orr et al. 2002). An FFL motif can be easily extended to a four-element motif, "bi-FFL," in which the two regulators collectively control two targets. Bi-FFL motifs are also significantly enriched in yeast transcription factor-binding networks.
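The general logic of motif enrichment (count a pattern, then compare with randomized networks) can be illustrated with the short Python sketch below. It is not mfinder and makes several simplifying assumptions: only FFLs are counted, subgraphs are not required to be induced, the null model is a degree-preserving edge swap, and the function names and default parameters are hypothetical.

```python
import random
from statistics import mean, pstdev

def count_ffl(edges):
    """Count feed-forward loops X->Y, X->Z, Y->Z (self-loops ignored)."""
    targets = {}
    for a, b in edges:
        if a != b:
            targets.setdefault(a, set()).add(b)
    total = 0
    for x, x_targets in targets.items():
        for y in x_targets:
            # each target shared by x and its target y closes one loop
            total += len(targets.get(y, set()) & x_targets)
    return total

def randomize(edges, rng, swaps_per_edge=10):
    """Shuffle edges while keeping every node's in- and out-degree."""
    edges = list(set(edges))
    present = set(edges)
    for _ in range(swaps_per_edge * len(edges)):
        i, j = rng.randrange(len(edges)), rng.randrange(len(edges))
        (a, b), (c, d) = edges[i], edges[j]
        new1, new2 = (a, d), (c, b)
        # skip swaps that would create self-loops or duplicate edges
        if i == j or a == d or c == b or new1 in present or new2 in present:
            continue
        present -= {edges[i], edges[j]}
        present |= {new1, new2}
        edges[i], edges[j] = new1, new2
    return edges

def ffl_zscore(edges, n_random=200, seed=0):
    """z-score of the observed FFL count against randomized networks."""
    rng = random.Random(seed)
    observed = count_ffl(edges)
    null = [count_ffl(randomize(edges, rng)) for _ in range(n_random)]
    sd = pstdev(null)
    z = (observed - mean(null)) / sd if sd > 0 else float("nan")
    return observed, z
```

A large positive z-score indicates that the pattern occurs more often than expected from the degree sequence alone. The number of randomized networks and swaps per edge are arbitrary choices here and would need tuning for real data sets.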


Figure 5. All three-unit and four-unit motifs enriched in the yeast transcription factor-binding (TF) network and phosphorylation (PHO) network. The units are colored red as regulators and green as targets. The significance of enrichment is calculated by comparing motif numbers in the transcription factor or phosphorylation networks (solid bars) with the numbers from randomized networks (hollow bars) and indicated by the z-scores. (A) Bi-fan motifs, in which two regulators bind common targets, are enriched in both the transcription factor network and phosphorylation network. (B) Bi-parallel motifs, in which one regulator controls two other regulators that further regulate one target gene, are enriched in both the transcription factor network and phosphorylation network. (C) FFLs, in which one regulator controls another regulator and both of them bind a common target, are enriched in the transcription factor network only. (D) Bi-FFL motifs, in which one regulator controls another regulator and both of them bind two common targets, are enriched in the transcription factor network only. (E) Single input motifs, in which one regulator binds to multiple targets, are enriched in the phosphorylation network only.

 
Thus far, FFLs are not enriched in the current yeast phosphorylation network. This may be due to the approach used to prepare the network that tends to underestimate the phosphorylation events between kinases, and additional data may be required to properly evaluate this network. However, it is possible that the lack of FFL in phosphorylation networks relative to transcriptional networks also reflects the biology of these networks. Phosphorylation networks are often activated by transient signals that lead to extremely rapid responses on the order of a few minutes. In contrast, transcriptional networks are slower and take longer to reach steady state.

Two four-element motifs were enriched in both the yeast transcriptional network and the phosphorylation network (Fig. 5). A simple version of the DOR motif, the "bi-fan motif," in which two regulators bind common targets, may suggest a way to use a limited number of regulators to precisely control a large number of targets under several different conditions. Moreover, the cooperation of transcription factors to regulate targets can also compensate for the degeneracy and low affinity of single transcription factor-binding sites (Pilpel et al. 2001). The other enriched four-element motif, the "bi-parallel motif," comprises a regulator controlling two other regulators that further regulate one target gene. Bi-parallel motifs are found in both transcriptional and phosphorylation networks and indicate redundancy. In addition to the two four-element motifs shared by both networks, the single input motif (SIM) was found to be overrepresented only in the yeast phosphorylation network. This likely reflects the lack of phosphorylation data currently available.


    Network integration
Integration of different experimental resources is used in several different ways: (1) to improve the accuracy of interactions, (2) to identify composite motifs, and (3) to make functional predictions. Integration of similar data sets generated with different methods provides a crucial way to improve data quality and recover missing data. To remove erroneous interactions in the yeast protein–protein interaction network, a "filtered yeast interactome" (FYI) was constructed with high-confidence interactions observed in at least two experimental sources (Han et al. 2004). Studies on C. elegans early embryogenesis genes led to an integrative network containing three types of heterogeneous data: protein–protein interaction, expression profiling similarity, and phenotypic profiling similarity (Gunsalus et al. 2005). Further functional analysis demonstrated that gene pairs connected by interactions from multiple sources are more likely to belong to the same GO functional categories, indicating improved accuracy through data integration. In the transcriptional network, integration with the gene expression data set has also proven to be useful to improve the data quality and reveal novel cis-regulatory modules (Bar-Joseph et al. 2003).
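The "at least two experimental sources" criterion can be expressed as a simple support count. The Python sketch below is a simplified illustration of that idea only; the source names and protein labels are hypothetical, and the actual FYI data set applied its own curation rules.

```python
from collections import defaultdict

def high_confidence(sources, min_support=2):
    """Keep undirected interactions reported by at least `min_support` sources.

    `sources` maps a source name to an iterable of (protein, protein) pairs.
    """
    support = defaultdict(set)
    for name, pairs in sources.items():
        for a, b in pairs:
            if a != b:
                # frozenset makes (A, B) and (B, A) the same interaction
                support[frozenset((a, b))].add(name)
    return {pair for pair, names in support.items() if len(names) >= min_support}

# Hypothetical data: three sources with partially overlapping interactions
sources = {
    "two_hybrid": [("A", "B"), ("B", "C")],
    "affinity_ms": [("B", "A"), ("C", "D")],
    "literature": [("C", "D"), ("A", "B")],
}
filtered = high_confidence(sources)
```

Raising `min_support` trades coverage for confidence: interactions seen by a single method are dropped even when they are real.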

Recent bioinformatics software platforms enable users to query and integrate very different types of interaction data to learn new information (Breitkreutz et al. 2003; Shannon et al. 2003; Stark et al. 2006). Instead of searching for overlapping interactions, integration of very different types of interaction data can also be performed to reveal composite motifs that contain multiple types of interactions and elements as basic units. An integration of transcription factor binding, protein–protein interactions, and phosphorylation data from yeast has revealed a mega-network of >60,000 interactions (Fig. 6A). Investigations in this mega-network revealed seven three-element kinase-centered composite motifs (Fig. 6B), of which five (motifs 1–5) were shown to be overrepresented (Ptacek et al. 2005). These composite motifs involve at least one kinase–substrate interaction pair (referred to as "kinates") and one other type of interaction (protein–protein interaction or transcription factor binding). Thus, network integration combines various data sources together and therefore can assist in uncovering proteins that are important in multiple types of interactions and provide a more comprehensive view on their cellular functions. Moreover, this network can be combined with other networks such as biochemical and gene interaction data to reveal a more comprehensive view of regulation in yeast.


Figure 6. Network integration: mega-network and composite motifs. (A) Three types of interactions—phosphorylation (blue), transcription factor binding (yellow) and protein-protein (magenta)—are combined into a mega-network. (B) Seven three-element kinase-centered composite motifs are listed. (1) Interacting kinate motif in which one kinase phosphorylates two interacting substrates. (2) Scaffold motif in which one protein interacts with both a kinase and its substrate. (3) Transcription factor-regulated kinate motif in which one transcription factor (TF) regulates the expression of both a kinase and its substrate. (4) Kinate regulon motif in which one kinase phosphorylates both a transcription factor and the target bound by the transcription factor. (5) Kinate feedback loop I motif in which a kinase phosphorylates a protein that interacts with a transcription factor that regulates the expression of that kinase. (6) Kinate feedback loop II in which a kinase phosphorylates a transcription factor whose target physically interacts with the kinase. (7) Heterosubstrate regulation motif with an interacting kinase and transcription factor regulating one target together. Motifs (1) to (5) were found to be enriched in the yeast integrated network.

 
In addition to mapping gene roles in a multirelationship network, integration of a variety of relevant genomic data can directly help to predict gene functions and functional relationships such as protein–protein interactions (Jansen et al. 2003; Troyanskaya et al. 2003). Compared with simple combinations of nonweighted interactions, a probabilistic approach integrating confidence-weighted data sources may be superior in modeling real biological data, considering its complicated and heterogeneous nature. For example, certain interaction data sets may be less error-prone and more reliable than others, and moreover, depending on the purposes of the data integration, certain data sources may be more informative and/or relevant than others. In order to overcome the challenge of data source heterogeneity, several early studies developed Bayesian network approaches to incorporate various data sources such as protein–protein interactions, gene expression profiles, and protein localization data for the purpose of predicting protein functions (Troyanskaya et al. 2003) and protein–protein interactions (Jansen et al. 2003). In both studies, the statistical reliabilities of different related genomic data were calculated when comparing with "gold-standard" samples consisting of known positives and negatives, and then extrapolated proteome-wide for novel predictions. Later studies also applied probabilistic models for the discovery of unknown components in pathway-specific protein complexes (Myers et al. 2005). Overall such probabilistic models have proven to be valuable in integrating heterogeneous genomic data and demonstrated a substantial improvement in prediction accuracy.
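One simple way to weight evidence by its reliability against a gold standard is a naive Bayes likelihood ratio. The Python sketch below is offered only as an illustration of the general idea and is not the model used in the cited studies: it assumes that evidence types are conditionally independent, that each feature is categorical, and that candidate pairs are identified by arbitrary keys. All names and the pseudocount are hypothetical.

```python
from collections import Counter

def likelihood_ratios(feature, positives, negatives, pseudocount=1.0):
    """Return L(v) = P(v | positive) / P(v | negative) for each feature value v.

    `feature` maps a pair identifier to a categorical value; `positives` and
    `negatives` are the gold-standard sets of pair identifiers.
    """
    pos = Counter(feature[p] for p in positives if p in feature)
    neg = Counter(feature[p] for p in negatives if p in feature)
    values = set(pos) | set(neg)
    k = len(values)
    n_pos, n_neg = sum(pos.values()), sum(neg.values())
    # the pseudocount avoids division by zero for values unseen in one class
    return {v: ((pos[v] + pseudocount) / (n_pos + pseudocount * k)) /
               ((neg[v] + pseudocount) / (n_neg + pseudocount * k))
            for v in values}

def combined_score(pair, features, ratio_tables):
    """Multiply per-feature likelihood ratios (naive independence assumption)."""
    score = 1.0
    for name, table in ratio_tables.items():
        value = features[name].get(pair)
        score *= table.get(value, 1.0)   # missing evidence is treated as uninformative
    return score
```

A pair with a combined score well above 1 is better supported than the gold-standard background. Converting the score into a posterior probability would additionally require a prior estimate of how common true interactions are.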


    Network dynamics
Biological networks exhibit complex dynamic behavior, thereby enabling cells to react to various conditions or cell states such as cell cycle progression. Unfortunately, most large-scale data sets do not contain this information; static interactions are often identified from cells exposed to a single condition or at a single time point, often under nonnative conditions (e.g., two-hybrid). Only recently have approaches emerged that attempt to analyze the dynamics of complex biological networks. More interaction data sets have been collected in specific cellular conditions, and more importantly, integration with gene expression profiles under various conditions has proven to be very helpful in network dynamics studies.

In protein–protein interaction networks, proteins may vary their partners according to time and location. By integrating gene expression data with a high-quality yeast protein–protein interaction data set, Han et al. (2004) studied the network dynamics in protein–protein interaction networks and revealed two types of hubs: "party hubs" and "date hubs." Party hubs interact with all their partners simultaneously (that is, at the same time and spatial locations) and are more likely to function within the same cellular processes. Date hubs, on the other hand, vary their connections to other proteins at different times and locations and therefore link various biological processes. When considering the modular designs of networks, in silico deletions of these hubs implied that party hubs are more likely to be the module organizers and date hubs to be the module connectors.
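A hub can be scored for coexpression with its partners by averaging Pearson correlation coefficients across expression conditions, with high averages suggesting party-like behavior and low averages date-like behavior. The Python sketch below illustrates this under stated assumptions: the cutoff of 0.5 is a hypothetical value, the expression profiles are invented, and every profile is assumed to vary across conditions so that the correlation is defined.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def classify_hub(hub, partners, expression, threshold=0.5):
    """Label a hub from its mean coexpression with its interaction partners."""
    avg = sum(pearson(expression[hub], expression[p]) for p in partners) / len(partners)
    return ("party" if avg >= threshold else "date"), avg

# Hypothetical expression profiles over four conditions
expression = {
    "hub1": [1.0, 2.0, 3.0, 4.0], "a": [2.0, 4.0, 6.0, 8.0], "b": [1.5, 2.5, 3.5, 4.5],
    "hub2": [1.0, 2.0, 3.0, 4.0], "c": [4.0, 3.0, 2.0, 1.0], "d": [2.0, 3.0, 4.0, 5.0],
}
label1, avg1 = classify_hub("hub1", ["a", "b"], expression)
label2, avg2 = classify_hub("hub2", ["c", "d"], expression)
```

Here hub1 rises and falls together with both partners, whereas hub2 is correlated with one partner and anticorrelated with the other, so its average is close to zero.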

The dynamics of the transcriptional network in yeast has been examined on a genomic scale by integrating gene expression data for five cellular conditions with known transcriptional regulatory relationships (Luscombe et al. 2004). A trace-back algorithm was applied to uncover subnetworks that are active under specific conditions. Luscombe et al. (2004) found that these subnetworks exhibit vastly different topologies on both a local and a global level and uncovered two separate groups of cellular states. In so-called exogenous states (e.g., stress response), the network has a shorter diameter and large hubs that should allow cells to respond quickly to external conditions. In endogenous states (e.g., cell cycle), loops and highly intricate connections are more prevalent, indicating a multistage internal program. Different sets of transcription factors become key regulatory hubs at different times, portraying a network that shifts its weight between different foci to bring about distinct cellular states.


    Network evolution
Various models have been proposed to explain the development of the scale-free topology of the protein–protein interaction network during evolution. A "network growth" model assumes that nodes with fixed degree are constantly added to the network. The probability that a newly added node interacts with an existing node is proportional to the degree of the existing node, which leads to a so-called preferential attachment model in which rich nodes get richer during evolution and finally form a scale-free network (Barabasi and Albert 1999). In biological networks, the addition of nodes is due to gene duplication. This model was supported by the fact that older nodes (proteins having orthologs in evolutionarily distant organisms) tend to have higher degrees than newer nodes (proteins having orthologs only in evolutionarily close organisms) (Eisenberg and Levanon 2003). However, examination of duplicated genes shows that they will quickly diverge in their connections and thereby rapidly specialize their interacting partners. Thus, most paralogs do not share the same partners. These contradictions led to a "link dynamics" model that explains the network evolution through interaction loss and preferential interaction gain (Wagner 2001, 2003).
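The "rich get richer" growth rule can be simulated in a few lines. The following Python sketch is a minimal illustration of preferential attachment under stated assumptions: each new node brings a fixed number of links (`m`, a hypothetical parameter), the seed network is a small fully connected graph, and gene duplication and link loss are not modeled.

```python
import random

def preferential_attachment(n_nodes, m=2, seed=0):
    """Grow an undirected network in which new nodes prefer well-connected nodes."""
    rng = random.Random(seed)
    # start from a small fully connected seed of m + 1 nodes
    edges = [(i, j) for i in range(m + 1) for j in range(i)]
    # each node appears in `stubs` once per edge end, so a uniform draw
    # from the list picks a node with probability proportional to its degree
    stubs = [n for e in edges for n in e]
    for new in range(m + 1, n_nodes):
        chosen = set()
        while len(chosen) < m:           # m distinct existing partners
            chosen.add(rng.choice(stubs))
        for old in chosen:
            edges.append((new, old))
            stubs += [new, old]
    return edges

network = preferential_attachment(200, m=2, seed=1)
```

Tabulating node degrees from the returned edge list typically shows a few early nodes with many links and a long tail of nodes with only `m` links, which is the qualitative signature of a scale-free network.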

In general, core components of a network tend to be conserved, whereas components at the periphery or false interactions are not. In transcription factor-binding networks, this concept has been applied to identify functional regulatory elements that are conserved in several yeast species (Cliften et al. 2003; Kellis et al. 2003). Studies also have shown that interactions in one organism can be mapped to another organism if both partners are highly conserved (Yu et al. 2004). Conserved protein–protein interaction pairs are termed interologs (Walhout et al. 2000), whereas conserved transcriptional binding interaction pairs are termed regulogs (Yu et al. 2004). New interactions in novel organisms can then be discovered through mapping interologs or regulogs.
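Interolog mapping amounts to transferring an interaction whenever both partners have orthologs in the target organism. The Python sketch below illustrates that transfer step only; the protein identifiers and ortholog table are hypothetical, no sequence-conservation threshold is applied, and predicted self-interactions are skipped for simplicity.

```python
def map_interologs(interactions, orthologs):
    """Predict interactions in a target organism from a source organism.

    `interactions` holds (protein, protein) pairs in the source organism;
    `orthologs` maps a source protein to a set of target-organism orthologs.
    """
    predicted = set()
    for a, b in interactions:
        for a2 in orthologs.get(a, ()):
            for b2 in orthologs.get(b, ()):
                if a2 != b2:
                    predicted.add(frozenset((a2, b2)))
    return predicted

# Hypothetical example: Y3 has no ortholog, so Y2-Y3 cannot be transferred
source_pairs = [("Y1", "Y2"), ("Y2", "Y3")]
ortholog_table = {"Y1": {"H1"}, "Y2": {"H2a", "H2b"}}
predictions = map_interologs(source_pairs, ortholog_table)
```

In practice the ortholog table would be restricted to highly conserved pairs, since the reliability of a transferred interaction depends on how well both partners are conserved.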

Although conservation of network components and connections is extremely valuable for mapping conserved interactions and common features among organisms, it is likely that many regulatory interactions are not conserved. Mapping of Ste12- and Tec1-binding sites in the closely related yeasts S. cerevisiae, Saccharomyces mikatae, and Saccharomyces bayanus reveals extensive divergence in binding sites among these yeasts (A. Borneman and M. Snyder, unpubl.). These changes likely lead to species diversity and the ability of organisms to occupy distinct ecological niches.


    Networks and human disease
Disruption of network architecture is expected to relate to human diseases. One advantage of scale-free networks is robustness: loss of individual components usually leaves the overall network topology intact. This organization in general should make a system relatively immune to defects that target individual components. Loss of multiple components, as occurs in many forms of cancer, is required for network breakdown. This architecture may explain, in part, the observation that multiple mutations are often required for the onset of cancer (Knudson 1971). Nonetheless, some regions of networks should be more vulnerable to disruptions than others. Loss-of-activity mutations that affect hubs are more likely to cause a defect than those that affect the periphery. In addition, we expect that activating mutations in master regulators (target hubs) are more likely to cause apparent defects in cellular and developmental processes than those that occur elsewhere in the network. Thus, identifying such hubs may suggest possible drug targets for reconstructing the network and therefore curing disease.

Identification of functional roles of unknown pathogenic genes can also shed light on disease mechanisms. Proteins connected tightly in biological networks often work in similar processes. Hence, functional annotations of interacting partners may indicate potential roles of unannotated disease-related genes and help us to better understand the pathological mechanisms of the disease. Lim et al. (2006) constructed an interactome map focusing on proteins responsible for human inherited ataxias and Purkinje cell degeneration with a yeast two-hybrid screen. The majority of known ataxia-causing proteins were connected with short paths, suggesting that other components in the network might contain candidates responsible for other related inherited ataxias with unknown causative genes. Furthermore, the hubs of this network had crucial roles for disease development in animal models, implying a relationship between the disease and the biological processes in which they are involved: RNA binding or splicing. Such systematic studies can easily be applied to other diseases and organisms and will help to identify crucial components for the disease pathology.


    Challenges and future directions
Current studies often draw conclusions for complete interaction networks from limited and possibly erroneous samples of the actual biological networks. The yeast two-hybrid protein–protein interaction network, for example, shows a typical scale-free structure and is often used to infer that the complete yeast protein–protein interaction network has the same properties. Recent studies, however, indicate that the scale-free topology might be generated through the experimental designs, which resulted