To perform mutational analysis of cancer genomes, it is imperative to acquire high-quality reagents and to perform several quality controls to verify that the derived data are reliable. To detect somatic (i.e., tumor-specific) mutations in cancer, both the tumor DNA and the germline DNA from the same individual are required, especially because knowledge of the variations in the normal human genome is as yet incomplete. Normal genomic DNA from the same individual may be derived either from blood or from tissue adjacent to the tumor in cases where solid tumors are investigated.
Cancer Gene Discovery by Sequencing Candidate Gene Families
The availability of the human genome sequence provides new opportunities to comprehensively search for somatic mutations in cancer on a larger scale than previously possible. Progress in the field has been closely linked to improvements in the throughput of DNA analysis and the continuous reduction in sequencing costs. Some of the achievements in this research area are described below, along with how they have affected knowledge of the cancer genome.
A seminal work in the field was the systematic mutational profiling of the genes involved in the RAF-RAS pathway in multiple tumors. This candidate gene approach led to the discovery that
BRAF is frequently mutated in melanomas and is mutated at a lower frequency in other tumor types.
14 Follow-up studies quickly revealed that mutations in
BRAF are mutually exclusive with alterations in
KRAS,14,
15 genetically emphasizing that these genes function in the same pathway, a concept that had been previously demonstrated in lower organisms such as
Caenorhabditis elegans and
Drosophila melanogaster.16,
17
In 2003, identification of cancer genes shifted from a candidate gene approach to the mutational analyses of gene families. The first gene families to be completely sequenced were those that involved protein
18,
19 and lipid phosphorylation.
20 The rationale for focusing initially on these gene families was threefold:
The corresponding proteins were already known at that time to play a pivotal role in signaling and proliferation of normal and cancerous cells.
Multiple members of the protein kinases family had already been linked to tumorigenesis.
Kinases are clearly amenable to pharmacological inhibition, making them attractive drug targets.
The mutational analysis of all the tyrosine kinase domains in colorectal cancers revealed that 30% of cases had a mutation in at least one tyrosine kinase gene, and overall mutations were identified in eight different kinases, most of which had not previously been linked to cancer.
18 An additional mutational analysis of the coding exons of 518 protein kinase genes in 210 diverse human cancers, including breast, lung, gastric, ovarian, renal, and acute lymphoblastic leukemia, identified approximately 120 mutated genes that probably contribute to oncogenesis.
19 A recent analysis of somatic mutations in the protein tyrosine kinases in cutaneous melanoma found
ERBB4 to be mutated in 19% of cases, making it the most highly mutated protein tyrosine kinase in melanoma.
21 ERBB4 is a member of the ERBB/HER family of receptor tyrosine kinases. Other family members, including
ERBB1 (EGFR) and
ERBB2 (HER-2), have been implicated by mutations or amplifications in a number of cancers, including lung, colon, and breast cancers. Both the high mutation frequency and a nonsynonymous (NS) to synonymous (S) ratio of 24:3 that significantly exceeds the NS:S ratio predicted for non-selected mutations (
P <.01)
22 indicated that
ERBB4 mutations are selected for during tumorigenesis and therefore contribute to melanoma tumorigenesis.
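The selection argument above can be illustrated with a small calculation. The following Python sketch asks how likely an NS:S ratio of at least 24:3 would be if the mutations were not under selection. The expected NS:S ratio of 2:1 for non-selected mutations is an assumption made here purely for illustration; the text does not state the expected value or the statistical test used in the cited study, and the function name is hypothetical.

```python
from math import comb


def binomial_upper_tail(k, n, p):
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Observed counts taken from the text: 24 nonsynonymous, 3 synonymous.
ns_count, s_count = 24, 3

# ASSUMPTION (not from the source): without selection, NS:S is about 2:1,
# so each mutation is nonsynonymous with probability 2/3.
expected_ns_to_s = 2.0
p_ns = expected_ns_to_s / (expected_ns_to_s + 1.0)

# One-sided probability of seeing at least 24 NS among 27 mutations.
p_value = binomial_upper_tail(ns_count, ns_count + s_count, p_ns)
print(f"observed NS:S = {ns_count}:{s_count}, one-sided P = {p_value:.4f}")
```

Under this assumed background ratio the one-sided probability falls below .01, in line with the significance level quoted above. A different background ratio would change the exact value.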
Because kinase activity is counteracted by phosphatases, enzymes that remove phosphate groups, the rational next step in these studies was a mutational analysis of the protein tyrosine phosphatases. Mutational investigation of this family in colorectal cancer found that 25% of cases had mutations in at least one of six phosphatase genes (
PTPRF, PTPRG, PTPRT, PTPN3, PTPN13, or
PTPN14).
23 Combined analysis of the protein tyrosine kinases and the protein tyrosine phosphatases showed that 50% of colorectal cancers had mutations in a tyrosine kinase gene, a protein tyrosine phosphatase gene, or both, further emphasizing the pivotal role of protein phosphorylation in neoplastic progression. Many of the identified genes had previously been linked to human cancer, thus validating the unbiased comprehensive mutation profiling. These landmark studies led to additional gene family surveys.
The phosphatidylinositol 3-kinase (
PI3K) gene family, which plays a role in proliferation, adhesion, survival, and motility, was also comprehensively investigated.
24 Sequencing of the exons encoding the kinase domain of all 16 members belonging to this family pinpointed
PIK3CA as the only gene to harbor somatic mutations. When the entire coding region was analyzed,
PIK3CA was found somatically mutated in 32% of colorectal cancers. At that time, the
PIK3CA gene was certainly not a newcomer in the cancer arena, as it had previously been shown to be involved in cell transformation and metastasis.
24 Strikingly, its high mutation frequency was discovered only through systematic sequencing of the corresponding gene family.
20 Subsequent analysis of
PIK3CA in other tumor types identified somatic mutations in this gene in additional cancer types, including 36% of hepatocellular carcinomas, 36% of endometrial carcinomas, 25% of breast carcinomas, 15% of anaplastic oligodendrogliomas, 5% of medulloblastomas and anaplastic astrocytomas, and 27% of glioblastomas.
25,
26,
27,
28,
29 It is known that
PIK3CA is one of the two most commonly mutated oncogenes in human cancers, the other being
KRAS. Further investigation of the
PI3K pathway in colorectal cancer showed that 40% of tumors had genetic alterations in one of the
PI3K pathway genes, emphasizing the central role of this pathway in colorectal cancer pathogenesis.
30 The relevance and the functional role of the PI3K pathway in tumorigenesis are further described in
Chapter 5.
Although most cancer genome studies of large gene families have focused on the kinome, recent analyses have revealed that members of other families highly represented in the human genome are also targets of mutational events in cancer. This is the case for proteases, a complex group of enzymes consisting of at least 569 components that constitute the so-called human degradome.
31 Proteases exhibit an elaborate interplay with kinases and have traditionally been associated with cancer progression because of their ability to degrade extracellular matrices, thus facilitating tumor invasion and metastasis.
32,
33 However, recent studies have shown that these enzymes hydrolyze a wide variety of substrates and influence many different steps of cancer, including early stages of tumor evolution.
34 These functional studies have also revealed that beyond their initial recognition as prometastatic enzymes, they play dual roles in cancer, as assessed by the identification of a growing number of tumor-suppressive proteases.
35
These findings emphasized the possibility that mutational activation or inactivation of protease genes occurs in cancer. The first clear evidence of this came from systematic analysis of genetic alterations in breast and colorectal cancers, which revealed that genes encoding proteases from different catalytic classes were candidate cancer genes somatically mutated in cancer.
36 These results have prompted the mutational analysis of entire protease families such as MMPs (matrix metalloproteinases), ADAMs (a disintegrin and metalloproteinase), and ADAMTSs (ADAMs with thrombospondin domains) in different tumors. These studies led to identification of protease genes frequently mutated in cancer, such as
MMP8, which is mutated and functionally inactivated in 6.3% of human melanomas.
37,
38 Other MMP genes, including
MMP2, MMP9, MMP14, and
MMP27, are also somatically mutated in melanomas and other malignant tumors, albeit at low frequency.
37,
39 Systematic mutational analysis of all members of the ADAM family of membrane-bound metalloproteases has shown that
ADAM7 and
ADAM29 are also often mutated in melanoma, whereas parallel studies of the ADAMTS family have revealed that
ADAMTS15 is mutated in colorectal carcinomas and
ADAMTS18 and
ADAMTS20 in melanomas.
40,
41 Functional analyses have indicated that
ADAM7, ADAM29, and
ADAMTS18 mutations affect adhesion of melanoma cells to specific extracellular matrix proteins and in some cases increase their migrating and invasive properties, suggesting that these mutated genes play a role in melanoma progression.
41,
42 In contrast, functional studies of
ADAMTS15 mutations in colorectal cancer cells have revealed that this metalloprotease restrains tumor growth and invasion, further validating the concept that secreted proteases may have tumor-suppressor properties.
40
The mutational status of caspases has also been extensively analyzed in different tumors as these proteases play a fundamental role in execution of apoptosis, one of the hallmarks of cancer.
43 These studies demonstrated that
CASP8 is deleted in neuroblastomas and inactivated by somatic mutations in a variety of human malignancies, including head and neck, colorectal, lung, and gastric carcinomas.
44,
45,
46 Likewise,
CASP3, CASP4, CASP5, CASP6, CASP7, CASP10, and
CASP14 are occasionally inactivated by mutation in different human cancers.
47,
48,
49,
50,
51,
52,
53,
54 Other large protease families whose components are often mutated in cancer are the deubiquitylating enzymes (DUBs), which catalyze the removal of
ubiquitin and ubiquitin-like modifiers from their target proteins.
55 Some DUBs were initially identified as oncogenic proteins, but recent work has shown that other deubiquitylases such as CYLD, A20, and BAP1 are tumor suppressors inactivated in cancer.
CYLD is mutated in patients with familial cylindromatosis, a disease characterized by the formation of multiple tumors of skin appendages.
56 A20 is a DUB family member encoded by the
TNFAIP3 gene, which is mutated in a large number of Hodgkin’s lymphomas and primary mediastinal B-cell lymphomas.
57,
58,
59,
60 Finally, the
BAP1 gene, encoding a ubiquitin C-terminal hydrolase, has been found to be somatically mutated in 86% of metastasizing uveal melanomas of the eye.
61
Mutational Analysis of Exomes Using Sanger Sequencing
Although the gene family approach for the identification of cancer genes has proven extremely valuable, it is still a candidate approach and thus inherently biased. The next step forward in the mutational profiling of cancer has been the sequencing of exomes; the exome is the entire coding portion of the human genome (18,000 protein-encoding genes). To date, the exomes of breast, colorectal, pancreatic, and ovarian clear cell carcinomas, glioblastoma multiforme, and medulloblastoma have been analyzed using Sanger sequencing. These large-scale analyses for the first time allowed researchers to describe and understand the genetic complexity of human cancers.
22,
36,
62,
63,
64,
65 The declared goals of these exome studies were to develop methods for exome-wide mutational analyses in human tumors, to characterize the spectrum and number of their somatic mutations, and, finally, to discover new genes involved in tumorigenesis as well as novel pathways that have a role in these tumors. In these studies, sequencing data were complemented with gene expression and copy number analyses, thus providing a comprehensive view of the genetic complexity of human tumors.
62,
63,
64,
65 A number of conclusions can be drawn from these analyses:
Cancer genomes have an average of 30 to 100 somatic alterations per tumor, which was a higher number than previously thought. Although the alterations included point mutations, small insertions, deletions, or amplifications, the great majority of the mutations observed were single-base substitutions.
62,
63
Even within a single cancer type, there is significant intertumor heterogeneity. This means that multiple mutational patterns (encompassing different mutant genes) are present in tumors that cannot be distinguished based on histological analysis. The idea that individual tumors have a unique genetic milieu is highly relevant for personalized medicine, a concept that will be discussed below.
The spectrum and nucleotide contexts of mutations differ between different tumor types. For example, over 50% of mutations in colorectal cancer were C:G to T:A transitions, and 10% were C:G to G:C transversions. In contrast, in breast cancers, only 35% of the mutations were C:G to T:A transitions, and 29% were C:G to G:C transversions. Knowledge of mutation spectra is vital as it allows insight into the mechanisms underlying mutagenesis and repair in the various cancers investigated.
A considerably larger number of genes that had not been previously reported to be involved in cancer were found to play a role in the disease.
Solid tumors arising in children, such as medulloblastoma, harbor on average five to ten times fewer gene alterations than a typical adult solid tumor. These pediatric tumors also harbor fewer amplifications and homozygous deletions within coding genes than adult solid tumors.
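The mutation spectra mentioned in the conclusions above (for example, the fraction of C:G to T:A transitions) can be tabulated with a few lines of code. The Python sketch below is a minimal illustration, not the procedure used in the cited studies: it collapses each single-base substitution onto its pyrimidine reference base so that complementary changes fall into the same class, then reports the fraction of each class. The function names and the four toy substitutions are hypothetical.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def substitution_class(ref, alt):
    """Name a single-base substitution as a base-pair change, e.g. 'C:G>T:A'."""
    ref, alt = ref.upper(), alt.upper()
    if ref not in COMPLEMENT or alt not in COMPLEMENT or ref == alt:
        raise ValueError(f"not a single-base substitution: {ref!r}>{alt!r}")
    if ref in ("A", "G"):
        # Purine reference: describe the same event on the opposite strand,
        # so that G>A and C>T are counted as one class.
        ref, alt = COMPLEMENT[ref], COMPLEMENT[alt]
    return f"{ref}:{COMPLEMENT[ref]}>{alt}:{COMPLEMENT[alt]}"


def mutation_spectrum(substitutions):
    """Return the fraction of substitutions falling into each class."""
    counts = Counter(substitution_class(ref, alt) for ref, alt in substitutions)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}


# Hypothetical toy input: (reference base, mutant base) pairs.
toy_substitutions = [("C", "T"), ("G", "A"), ("C", "G"), ("A", "G")]
spectrum = mutation_spectrum(toy_substitutions)
for cls, fraction in sorted(spectrum.items()):
    print(f"{cls}: {fraction:.0%}")
```

Applied to the somatic substitutions of a tumor series, a table of this kind is what allows spectra such as those of colorectal and breast cancers to be compared.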
Importantly, to deal with the large amount of data generated in these genomic projects, it was necessary to develop new statistical and bioinformatic tools. Furthermore, examination of the overall distribution of the identified mutations allowed the development of a novel view of cancer genome landscapes and a novel definition of cancer genes. These new concepts in the understanding of cancer genetics are further discussed below. The compiled conclusions derived from these analyses have led to a paradigm shift in the understanding of cancer genetics.
A clear indication of the power of the unbiased nature of the whole exome surveys was revealed by the discovery of recurrent mutations in the active site of
IDH1, a gene with no known link to gliomas, in 12% of tumors analyzed.
63 As malignant gliomas are the most common and lethal tumors of the central nervous system, and glioblastoma multiforme (GBM; World Health Organization grade IV astrocytoma) is the most biologically aggressive subtype, the unveiling of
IDH1 as a novel GBM gene is extremely significant. Importantly, mutations of
IDH1 predominantly occurred in younger patients (median age of 34 versus 56 years for anaplastic astrocytomas and 32 versus 59 years for GBMs) and were associated with a better prognosis, as patients with
IDH mutations have a median overall survival of 31 months, whereas patients with wild-type
IDH1 and
IDH2 have a median overall survival of 15 months.
66 Follow-up studies showed that mutations of
IDH1 occur early in glioma progression; the R132 somatic mutation is harbored by the majority (greater than 70%) of grades II and III astrocytomas and oligodendrogliomas, as well as by
secondary GBMs that develop from these lower grade lesions.
66,
67,
68,
69,
70,
71,
72 In contrast, less than 10% of primary GBMs harbor these alterations. Furthermore, analysis of the associated
IDH2 revealed recurrent somatic mutations in the R172 residue, which is the exact analog of the frequently mutated R132 residue of
IDH1. These mutations occur mostly in a mutually exclusive manner with
IDH1 mutations,
66,
68 suggesting that they have equivalent phenotypic effects. Subsequently,
IDH1 mutations have been reported in additional cancer types, including myeloid leukemia,
73,
74,
75 a single case of colorectal cancer, two prostate carcinomas,
71 one melanoma case,
76 and a few cases of adult supratentorial primitive neuroectodermal tumors.
69 Further description of the function of
IDH1 and
IDH2 mutations in cancer is found in
Chapter 8.
Next-Generation Sequencing and Cancer Genome Analysis
The introduction in 1977 of the Sanger method for DNA sequencing with chain-terminating inhibitors has transformed biomedical research.
8 Over the past 30 years, this first-generation technology has been universally used for elucidating the nucleotide sequence of DNA molecules. However, the launching of new large-scale projects, including those involving whole-genome sequencing of cancer samples, has necessitated the development of new methods that are widely known as next-generation sequencing technologies.
77,
78,
79 These approaches have significantly lowered the cost and the time required to determine the sequence of the 3 × 10⁹ nucleotides present in the human genome. Moreover, they have a series of advantages over Sanger sequencing, which are of special interest for the analysis of cancer genomes.
80 First, next-generation sequencing approaches are more sensitive than Sanger methods and can detect somatic mutations even when they are present only in a subset of tumor cells.
81 Moreover, these new sequencing strategies are quantitative and can be used to simultaneously determine both nucleotide sequence and copy number variations.
82 They can also be coupled to other procedures such as those involving paired-end reads, allowing the identification of multiple structural alterations, such as insertions, deletions, and rearrangements, commonly occurring in cancer genomes.
81 Nonetheless, next-generation sequencing still presents some limitations mainly derived from the relatively high error rate in the short reads generated during the sequencing process. In addition, these short reads make the task of
de novo assembly of the generated sequences and the mapping of the reads to a reference genome extremely complex. To overcome some of these current limitations, deep coverage of each analyzed genome is required and a careful validation of the identified variants must be performed, typically using Sanger sequencing. As a consequence, there is a substantial increase in both cost of the process and time of analysis. Therefore, it can be concluded that whole-genome sequencing of cancer samples is already a feasible task but not yet a routine process. Further technical improvements will be required before the task of decoding the entire genome of any malignant tumor of any cancer patient can be applied to clinical practice.
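The link between deep coverage and the detection of mutations present in only a subset of tumor cells can be made concrete with a simple sampling model. The Python sketch below is a minimal illustration under stated assumptions: reads covering a position are treated as independent draws, the mutant allele fraction is known, sequencing errors are ignored, and a variant is called only when a minimum number of mutant reads is seen. The function name, the threshold of three reads, and the example depths are hypothetical and are not taken from the source.

```python
from math import comb


def detection_probability(depth, allele_fraction, min_alt_reads):
    """P(at least min_alt_reads mutant reads) among `depth` reads.

    Assumes each read independently carries the mutant allele with
    probability `allele_fraction` (binomial sampling, no sequencing error).
    """
    p_too_few = sum(
        comb(depth, i) * allele_fraction**i * (1 - allele_fraction) ** (depth - i)
        for i in range(min_alt_reads)
    )
    return 1.0 - p_too_few


# Hypothetical scenario: a heterozygous mutation present in 10% of the
# sequenced cells gives a mutant allele fraction of about 0.05.
for depth in (30, 100, 200):
    prob = detection_probability(depth, allele_fraction=0.05, min_alt_reads=3)
    print(f"depth {depth:>3}x: P(detect) = {prob:.3f}")
```

In this simplified model the chance of detecting a subclonal mutation rises steeply with depth, which is one reason deep coverage adds to the cost and analysis time. Real variant callers must also model sequencing error, which this sketch deliberately leaves out.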
The number of next-generation sequencing platforms has substantially grown over the past few years and currently includes technologies from Roche/454, Illumina/Solexa, Life/APG’s SOLiD3, Helicos BioSciences/HeliScope, and Pacific Biosciences/PacBio RS.
79 Also noteworthy are the recent introductions of the Polonator G.007 instrument, an open source platform with freely available software and protocols; Ion Torrent’s semiconductor sequencer; and platforms based on self-assembling DNA nanoballs or nanopore technologies.
83,
84,
85 These new machines are driving the field toward the era of third-generation sequencing, which is of enormous clinical interest because it can substantially increase the speed and accuracy of analysis at reduced cost and facilitate single-molecule sequencing of human genomes. A comparison of next-generation sequencing platforms is shown in
Table 1.1. These various platforms differ in the method utilized for template preparation and in the nucleotide sequencing and imaging strategy, which together account for their differences in performance. Ultimately, the most suitable approach depends on the specific genome sequencing project.
79
Current methods of template preparation first involve randomly shearing genomic DNA into smaller fragments, from which a library of either fragment templates or mate-pair templates is generated. Then, clonally amplified templates from single DNA molecules are prepared by either emulsion polymerase chain reaction (PCR) or solid-phase amplification.
86,
87 Alternatively, it is possible to prepare single-molecule templates through methods that require less starting material and do not involve PCR amplification reactions, which can be the source of artifactual mutations.
88 Once prepared, templates are attached to a solid surface in spatially separated sites, allowing thousands to billions of nucleotide sequencing reactions to be performed simultaneously.
The sequencing methods currently used by the different next-generation sequencing platforms are diverse and have been classified into four groups: cyclic reversible termination, single-nucleotide addition, real-time sequencing, and sequencing by ligation
79,
89 (
Fig. 1.3). These sequencing strategies are coupled with different imaging methods, including those based on measuring bioluminescent signals or involving four-color imaging of single molecular events. Finally, the extraordinary amount of data released from these nucleotide sequencing platforms is stored, assembled, and analyzed using powerful bioinformatic tools that have been developed in parallel with next-generation sequencing technologies.
90