August 2006
“Most problems have either many answers or no answer. Only a few problems have a single answer.” Edmund C. Berkeley
Section of DNA. The bases lie horizontally between the two spiraling strands. Source: Wikimedia Commons.
The search for genetic differences among people represents one of the most active areas of research made possible by the completion of the human genome sequence. Yet the notion that there is such a thing as “the human genome” carries with it the implication that there are fundamental genomic characteristics that are universal among all members of a species. The most obvious of these relate to the quantity and arrangement of genetic material. We now know that:
- One copy of a human’s genome contains about 3.5 picograms (pg, or 10⁻¹² grams) of DNA packaged into 23 chromosomes.
- Chimpanzees, the closest living relatives of Homo sapiens, carry around a slightly heavier genome (3.75 pg) apportioned into 24 chromosomes.
- An aardvark genome, by contrast, is contained within only 10 chromosomes but weighs in at 5.8 pg.1
The basic idea that the amount of DNA per chromosome set might be consistent across cells within bodies and among individuals within species was hinted at as early as 1885. An explicit “DNA constancy hypothesis,” however, was not developed until the mid-20th century,2 stemming from a 1948 report of “a remarkable constancy in the nuclear DNA content of all the cells in all the individuals within a given animal species,”3 which was interpreted as evidence in favour of DNA, rather than proteins, as the molecule responsible for inheritance.
The DNA constancy hypothesis
In the simplest terms, the DNA constancy hypothesis that emerged in the late 1940s and early 1950s consisted of two central ideas:
- The amount of DNA per chromosome set within an individual organism is constant.
- The DNA content of a single set of chromosomes is largely invariant among members of the same species.
The underlying notion of DNA constancy persists more than a half-century later, even though there are interesting exceptions to both of these postulates (which are beyond the scope of this article). In fact, DNA constancy is an important assumption in modern genome size research, because the two dominant methods of DNA quantification both rely on the use of standards of “known” DNA content for certain conversions.4-6
The C-value paradox
It is due to its constancy that the amount of DNA contained within a haploid chromosome set is commonly referred to as the “C value,” a term coined by Hewson Swift in 1950.7 One year later, scientists provided the first taxonomically broad survey of C values and noted that:
Comparing the largest and one of the smallest examples among vertebrates, one finds that a cell of Amphiuma, a urodele, contains 70 times as much DNA as is found in a cell of the domestic fowl, a far more highly developed animal. It seems most unlikely that Amphiuma contains 70 times as many different genes as does the fowl or that a gene of Amphiuma contains 70 times as much DNA as does one in the fowl. To make a somewhat different comparison: a cell of Amphiuma contains 170 times as much DNA as does a cell of a relatively closely related animal, the trigger fish, whereas a cell of the latter contains only nine times as much DNA as does a cell of a sponge, which is far removed phylogenetically from any vertebrate.8
It is not difficult to understand why observations such as these engendered considerable confusion for the next two decades. As C. A. Thomas put it in 1971, “It was argued that mammals display a greater developmental complexity than primitive fish, therefore, they must have more genes, yet why should the lower forms have more DNA, if DNA is the chemical basis of the gene?”9 To early researchers this seemed downright paradoxical—and indeed, Thomas dubbed the disconnect between genome size and organismal complexity the “C-value paradox.”
The C-value paradox has traditionally been described in three different ways:
- More complex organisms do not always have larger genomes than simpler ones. “The quantity of DNA does not seem to be related to the number of genes, for the amount of DNA does not increase unequivocally with the complexity and number of hereditary characters.”10
- Any given genome seems to contain more DNA than would be needed for the predicted gene number. “One of the problems of eukaryotic genetics is that higher organisms possess much more DNA in their genome than they are likely to need as genetic information.”11
- Some closely related species exhibit divergent DNA contents. “The paradox is the fact that organisms at the same general level of morphological complexity, which presumably have the same genetic requirements, nevertheless often have genomes whose DNA contents differ by orders of magnitude.”12
Consider, for example, the reported genome sizes versus semi-subjective notions of complexity for some well-known organisms:
- Nematode worm (Caenorhabditis elegans): 0.1 pg
- Thale cress (Arabidopsis thaliana): 0.16 pg
- Fruit fly (Drosophila melanogaster): 0.18 pg
- Pufferfish (Takifugu rubripes): 0.4 pg
- Rice (Oryza sativa): 0.5 pg
- Human (Homo sapiens): 3.5 pg
- Leopard frog (Rana pipiens): 6.7 pg
- Onion (Allium cepa): 16.75 pg
- Mountain grasshopper (Podisma pedestris): 16.9 pg
- Tiger salamander (Ambystoma tigrinum): 32 pg
- Easter lily (Lilium longiflorum): 35.2 pg
- Marbled lungfish (Protopterus aethiopicus): 132 pg
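To make the comparison in the list above concrete, the following minimal Python sketch expresses each reported genome size as a multiple of the human value. The figures are copied from the list; the names `GENOME_SIZE_PG` and `relative_to` are illustrative choices, not part of any published tool.

```python
# Genome sizes in picograms (pg), copied from the list above.
GENOME_SIZE_PG = {
    "Nematode worm": 0.1,
    "Thale cress": 0.16,
    "Fruit fly": 0.18,
    "Pufferfish": 0.4,
    "Rice": 0.5,
    "Human": 3.5,
    "Leopard frog": 6.7,
    "Onion": 16.75,
    "Mountain grasshopper": 16.9,
    "Tiger salamander": 32.0,
    "Easter lily": 35.2,
    "Marbled lungfish": 132.0,
}


def relative_to(reference, sizes):
    """Express every genome size as a multiple of the reference species' size."""
    ref = sizes[reference]
    return {name: size / ref for name, size in sizes.items()}


# Print each species as a multiple of the human genome, smallest first.
ratios = relative_to("Human", GENOME_SIZE_PG)
for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1]):
    print(f"{name:<22}{ratio:7.2f} x human")
```

On these figures the marbled lungfish genome comes out at roughly 38 times the human value and the onion genome at nearly 5 times, which is hard to square with any ranking by intuitive complexity.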
The human genome, it turns out, is thoroughly average in size for a mammal and significantly smaller than that of various plants, amphibians, insects, and even some single-celled protozoa. Some authors apparently found this revelation bruising to the human ego, as reflected in this complaint:
Being a little chauvinistic toward our own species, we like to think that man is surely one of the most complicated species on earth and thus needs just about the maximum number of genes. However, the lowly liverwort has 18 times as much DNA as we, and the slimy, dull salamander known as Amphiuma has 26 times our complement of DNA. To further add to the insult, the unicellular Euglena has almost as much DNA as man.13
Noncoding DNA and the end of the paradox
In spite of its label, the “paradox” was not so much the lack of a correlation with complexity, per se, but rather the inability of early researchers to reconcile the constancy of DNA content within species (which occurs because it is the stuff of genes) with the variation in quantity of DNA among species (which does not relate to the number of genes). Today, the solution to the paradox is widely recognized: Most eukaryotic DNA does not code for proteins, so there is no reason to expect a complex organism to have a large genome or a simple organism to have a small one.
To put it succinctly, the C-value paradox vanished the moment geneticists abandoned the concept of the genome consisting of the genes, all the genes, and nothing but the genes.
Stanley K. Sessions may have said it best 20 years ago when, in a review of the influential volume The Evolution of Genome Size,14 he pointed out that:
The C-value paradox is the observation that genome size does not correspond to the amount of DNA needed for protein-coding functions. This observation is a paradox only under the expectation that genome size should be equal or proportional to gene number and should therefore increase with “organismal complexity.” This paradox has literally disappeared with the discovery that genomes contain “excess” (largely repetitive) DNA that is not transcribed into functional products. Thus it is no longer mysterious that salamanders (for example) have larger genomes than humans. The origin and precise function of the “excess” DNA (which may constitute more than 99% of the genomic DNA) remains an unsolved problem, but it is not a paradox.15
Comparatively modest in size though it is, the human genome provides an excellent illustration of the overwhelming abundance of noncoding DNA and thus the solution to the old “C-value paradox.” In 2001, the International Human Genome Sequencing Consortium revealed that each copy of the human genome consists of the following:
- 1.5% protein-coding genes
- 25.9% introns (noncoding regions within gene sequences)
- 20.4% long interspersed nuclear elements (LINEs), including 516,000 copies of the transposable element known as LINE-1
- 13.1% short interspersed nuclear elements (SINEs), including 1,090,000 copies of the Alu element
- 2.9% DNA transposons (mobile DNA elements)
- 8.3% long terminal repeat (LTR) retrotransposons (transposons copied from RNA and flanked by repeated sequences)
- 5% segmental duplications
- 3% simple sequence repeats
- 11.6% miscellaneous unique sequences
- 8% miscellaneous compacted DNA, or heterochromatin
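The breakdown above can be tallied with a short Python sketch to show how little of the genome is protein-coding. The percentages are taken directly from the list. Grouping LINEs, SINEs, DNA transposons, and LTR retrotransposons together as transposable elements is an assumption made here for illustration, and the names `COMPOSITION_PCT`, `TRANSPOSABLE`, and `share` are hypothetical.

```python
# Approximate composition of the human genome (percent), from the list above.
COMPOSITION_PCT = {
    "protein-coding genes": 1.5,
    "introns": 25.9,
    "LINEs": 20.4,
    "SINEs": 13.1,
    "DNA transposons": 2.9,
    "LTR retrotransposons": 8.3,
    "segmental duplications": 5.0,
    "simple sequence repeats": 3.0,
    "miscellaneous unique sequences": 11.6,
    "heterochromatin": 8.0,
}

# Assumption: these four categories are the ones counted as transposable elements.
TRANSPOSABLE = ("LINEs", "SINEs", "DNA transposons", "LTR retrotransposons")


def share(categories, composition=COMPOSITION_PCT):
    """Sum the percentage shares of the named categories, rounded to 0.1%."""
    return round(sum(composition[c] for c in categories), 1)


total = share(COMPOSITION_PCT)          # all listed categories
transposable = share(TRANSPOSABLE)      # transposable elements only
non_protein_coding = round(total - COMPOSITION_PCT["protein-coding genes"], 1)

print(f"Listed categories total:   {total}%")
print(f"Transposable elements:     {transposable}%")
print(f"Everything but exons:      {non_protein_coding}%")
```

The listed shares sum to slightly under 100%, presumably because the published figures are rounded. Under the grouping assumed here, transposable elements alone account for well over 40% of the sequence, compared with 1.5% for protein-coding genes.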
The C-value enigma
As Wendell L. Willkie once quipped, “a good catchword can obscure analysis for 50 years.” Despite its obvious obsolescence, and in a clear case of linguistic inertia taking precedence over scientific precision, the term “C-value paradox” continues to enjoy widespread use, often with confusion and miscommunication as the outcome. Variation in genome size is not the least bit paradoxical, but as Sessions and many others have noted, it remains a long-standing puzzle in need of resolution. As an alternative to the outdated term “C-value paradox,” which tends to inspire one-dimensional attempts at explanation, the new term “C-value enigma” has been offered.17-19
As an enigma—a complex puzzle—the issue of genome size variation can be explicitly divided into several component questions, each of which must be answered if a complete understanding is to be achieved:
- What are the sources of all this noncoding DNA?
- In what proportions are different types of noncoding DNA represented in the genomes of different species?
- By what mechanisms is noncoding DNA gained and lost over evolutionary time?
- What are the phenotypic implications, or in some cases perhaps even functions, of noncoding DNA?
- Why are the genomes of some species, such as nematodes or rice, streamlined while others, such as those of lungfishes or lilies, are positively enormous?
Unraveling the enigma
While a great deal of work remains to be conducted in terms of each of the component questions of the C-value enigma, research spanning the past 50 years—from the origin of the DNA constancy hypothesis to the modern era of complete genome sequencing—has revealed many important insights regarding the nature and impacts of noncoding DNA. Among the most notable are these findings:
- A very large fraction of many eukaryotic genomes is composed of “genomic parasites” in the form of transposable elements; in humans, nearly half of the genome consists of such “selfish DNA.” Moreover, large genomes contain a larger proportion of transposable elements and a lower proportion of protein-coding genes than smaller genomes.
- The abundances and/or lengths of several types of both single-copy and repetitive noncoding DNA appear to increase along with genome size, including all types of transposable elements, introns, microsatellites (repetitive short nucleotide sequences), and ribosomal RNA genes. The amplification and loss of these sequence types varies, suggesting that there may be a general mechanism for DNA content modulation that applies across the genome.
- Mechanisms exist that are capable of increasing or decreasing genome size over both short and long evolutionary timescales. For example, duplicative transposition of transposable elements and small- and large-scale duplications (from single genes to entire genomes) can add DNA to genomes, sometimes in large amounts and often very rapidly in evolutionary terms. Other processes can either add or remove DNA at a range of scales, such as the insertion or deletion of one or a few nucleotides during DNA replication, recombination events leading to the addition or loss of chromosome segments, and gains or losses of entire chromosomes.
- Genome size correlates positively with nucleus and cell size, and negatively with cell division rate, in a wide range of cell types and organisms. The preponderance of the evidence indicates that genome size exerts a causative influence on these cellular parameters.
- Depending on the biology of the group in question, the cell-level effects of genome size variation may result in correlations between DNA content and body size, metabolic rate, developmental rate, organ complexity, geographical distribution, and ecological niche.
A new paradox?
Most of the early discussion surrounding the C-value paradox was predicated on the assumption that gene number and organismal complexity would be closely linked. In light of the extraordinary complexity of its bearer, the human genome in particular was expected to contain an exceptionally high number of protein-coding genes. Prior to the completion of the draft genome sequence, 100,000 genes was a common estimate; as it turns out, the human genome contains a mere 20,000 to 25,000 genes.20 Comparing this with the more than 3,000,000 copies of transposable elements present in each human genome, including more than one million copies of the SINE Alu, it is no wonder that W. Ford Doolittle once suggested, only partly facetiously, that our genomes “might be ironically viewed as vehicles for the replication of Alu sequences.”21
An examination of the genomes of other species shows that, like genome size, gene number is a poor predictor of organismal complexity:
- Fruit fly (Drosophila melanogaster): 13,500 genes
- Nematode worm (Caenorhabditis elegans): 20,000 genes
- Human (Homo sapiens): 20,000 to 25,000 genes
- Pufferfish (Takifugu rubripes): 21,000 genes
- Thale cress (Arabidopsis thaliana): 25,500 genes
- Rice (Oryza sativa): 40,000 to 50,000 genes
As with C-values, this observation has been the source of significant surprise among genome researchers. “How can our own supremely sophisticated species be governed by just 50% to 100% more genes than the nematode worm?” some wondered.22 Following the same formula as with genome size (simplistic expectation + contradictory data = “paradox”), this disparity between gene number and complexity has been labeled as the “G-value paradox” or “N-value paradox.”23-25
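The gene counts listed above can be set against the genome sizes reported earlier in the article to show that the two measures are largely decoupled. The Python sketch below computes a rough "genes per picogram" figure for the six species that appear in both lists. Where the article gives a range of gene counts, the midpoint is used; that choice, along with the names `SPECIES` and `genes_per_pg`, is an assumption made purely for illustration.

```python
# (genome size in pg, approximate gene count) for species found in both lists.
# Assumption: where the article gives a range (human, rice), the midpoint is used.
SPECIES = {
    "Fruit fly": (0.18, 13_500),
    "Nematode worm": (0.1, 20_000),
    "Human": (3.5, 22_500),
    "Pufferfish": (0.4, 21_000),
    "Thale cress": (0.16, 25_500),
    "Rice": (0.5, 45_000),
}


def genes_per_pg(species=SPECIES):
    """Return the approximate number of genes per picogram of DNA for each species."""
    return {name: genes / size for name, (size, genes) in species.items()}


# Print species from the most gene-dense genome to the least.
density = genes_per_pg()
for name, value in sorted(density.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<15}{value:>12,.0f} genes per pg")
```

On these figures the human genome is the least gene-dense of the six, at roughly 6,400 genes per picogram against roughly 200,000 for the nematode. The difference lies overwhelmingly in the amount of noncoding DNA rather than in the number of genes.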
The G-value enigma
Perhaps it should go without saying that the G-value “paradox,” like its C-value predecessor, is not paradoxical at all. What the data currently emerging from comparative genomics indicate is that the mechanisms by which the genome specifies the construction of an organism are complex and, for the time being, puzzling: a “G-value enigma.” And, like the C-value enigma, this new puzzle is most likely to be solved when the pieces are clearly delineated. In this case, some of the pertinent questions include these:
- By what mechanisms are genes regulated, and how does this contribute to the high diversity of tissues constructed from a low number of genes? The recent suggestion of a second, nongenic “code” in DNA based on the positions of packaging structures called nucleosomes provides an exciting example of the sorts of discoveries that will be forthcoming in this area.26
- What roles, if any, does noncoding DNA play in the link between genome and phenotype? Insights from the study of genome size in general, such as those described above, are directly relevant to this issue, as are other influences such as the position and configuration of DNA, the level of DNA compaction, and other such non-genic factors.
- In what ways do interactions among genes account for the emergence of complex wholes from a relatively limited number of parts?
- How many different protein products can a single gene region encode through such processes as alternative splicing, and to what extent could this explain the diverse protein products that can result from even a relatively simple protein-encoding genome?
Future perspectives
Although they may not yet be recognized explicitly as parts of a larger puzzle, each of the component questions in the G-value enigma is the subject of an increasing amount of study. To the extent that co-opted transposable elements play a role in gene regulation, that other noncoding DNA influences gene expression, that introns are involved in alternative splicing, and that bulk DNA content exerts an impact on cellular and organismal phenotypes, it is clear that the C-value and G-value enigmas are themselves part of an overarching quest to understand the form, function, and evolution of genomes. To advance this cause, a few key steps might be taken by the scientific community:
- Consider findings that contradict simplistic assumptions about genomes (most notably that one or a few linear genomic parameters should determine the complexity of organisms) as exciting challenges, rather than framing them as “paradoxical.”
- Think of genomes as complex biological entities with their own inherent properties and evolutionary histories.
- Characterize both the coding and noncoding components of genomes and their relative proportions in complete sequencing projects.
- Create greater linkages between researchers who study genome size (the C-value enigma) and those dealing with the sequences and functions of genes (the G-value enigma), and make a stronger effort to combine insights derived from the study of each of the major groups of living things and to move well beyond the current cast of model organisms.
The lesson from the past 50 years, and the most productive guiding principle for the next phase of genomic science, is that genomes are complex and strongly resistant to one-dimensional explanations. Put more simply, those wishing to shed light on the causes and consequences of genomic variation at any level should bear the following in mind: Paradoxes are frustrating, but clearly defined puzzles are stimulating.
© 2006, American Institute of Biological Sciences. Educators have permission to reprint articles for classroom use; other users, please contact editor@actionbioscience.org for reprint permission.