Fig. 7.1
The left table shows six nucleotides from five individuals; three of these nucleotides vary and are therefore SNPs. From the right table, we can see that the two SNPs in green (G/T and G/A) are strongly correlated. If we genotype only the G/T SNP, we will also know the allele present at the G/A location.
One of the issues that plagued early genetic epidemiologic research was the lack of replication of findings across studies. Few of the SNPs identified during this time have actually been confirmed as bona fide cancer risk SNPs. During the mid-2000s, investigators began performing genome-wide association studies (GWAS). The information from the HapMap and innovations in array-based genotyping technology allowed for the interrogation of hundreds of thousands of SNPs simultaneously. Given the very large number of tests performed, a SNP must reach a genome-wide p value threshold to be considered significant (generally p < 5 × 10⁻⁸, which corresponds to 0.05/1,000,000 tests). To encourage the publication of valid findings, very specific study design and publication criteria were developed for GWAS. To publish a manuscript in Nature Genetics, associations from GWAS must be observed in at least two independent cohorts [12]. In 2012, the editorial board additionally stipulated that authors report the co-location of disease-associated variants with gene regulatory elements identified by epigenetic, functional, and conservation criteria. Authors are also asked to publish or include in a public database genotype frequencies or p values for associations for all SNPs investigated, regardless of whether they reached genome-wide significance [13].
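The arithmetic behind the genome-wide threshold can be written out as a short sketch. It assumes a simple Bonferroni-style correction with the 0.05 error rate and 1,000,000 tests quoted above; the function names are illustrative only and do not come from any GWAS package.

```python
# Bonferroni-style genome-wide significance threshold (illustrative sketch).
ALPHA = 0.05          # study-wide type I error rate quoted in the text
N_TESTS = 1_000_000   # approximate number of tests quoted in the text


def genome_wide_threshold(alpha=ALPHA, n_tests=N_TESTS):
    """Return the per-SNP p value threshold after dividing alpha by the test count."""
    return alpha / n_tests


def is_genome_wide_significant(p_value, threshold=None):
    """True if a SNP's p value falls below the genome-wide threshold."""
    if threshold is None:
        threshold = genome_wide_threshold()
    return p_value < threshold


# A p value of 1e-6 would look impressive in a single-SNP study,
# but it does not pass the genome-wide threshold.
print(genome_wide_threshold())
print(is_genome_wide_significant(1e-6))
print(is_genome_wide_significant(1e-9))
```

Dividing by the number of tests keeps the chance of even one false positive across the whole scan at roughly 5 %, which is why associations that would pass in a candidate-SNP study can fail here.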
Hundreds of GWAS have now been published, identifying thousands of SNPs associated with various diseases, including cancers. A good resource for updated results on confirmed genetic variants of disease is the National Human Genome Research Institute’s website (https://www.genome.gov/26525384) [14]. One early success story of GWAS in cancer was the discovery of a locus on chromosome 8q24 that was associated with prostate cancer [15]. Many more SNPs in this region have since been identified, and several of these are associated with the risk of multiple cancer types, including breast, colorectal, and ovarian cancer [16]. The majority of the genetic variants associated with cancer risk have very small effect sizes (odds ratios commonly ~1.1–1.2 per risk allele). The risk estimates can also differ by ethnicity, highlighting the importance of conducting these large-scale studies in multiple ethnic populations.
Technology has continued to advance, making it possible to obtain the sequence of the entire exome or genome. The 1000 Genomes Project, which is the first project to sequence the genomes of a large number of people, allows for imputation of more variants from GWAS-level data [17]. The goal of the project is to find genetic variants that are present in at least 1 % of study populations, which can be done through light sequencing of individuals. Currently, obtaining the complete genomic sequence of one person requires sequencing that individual’s DNA 28 times (28×). Because of the expense, data is instead usually combined across many samples to detect most of the variants in a particular region, and the 1000 Genomes Project plans to sequence each sample 4× to detect variants with frequencies as low as 1 %. As the price continues to decline, more studies will include whole genome sequencing to detect variants associated with cancer. This will lead to the identification of many rare (<1 %) and private (<0.01 %) variants. Given that power will be limited to study these variants individually, methods have been developed that study variation across a gene or region in aggregate.
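One simple way to study rare variants in aggregate is to collapse them into a single per-person indicator for a gene and compare carrier frequencies between cases and controls. The sketch below illustrates that general idea only; the toy genotypes, the choice of which variants count as rare, and the function names are all assumptions, and it is not a reproduction of any specific published method.

```python
# Collapsing rare variants within one gene (illustrative burden-style sketch).
# Rows are individuals, columns are variants in the gene, and each entry is
# the number of copies of the rare allele (0, 1, or 2). All values are made up.
cases = [
    [1, 0, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]
controls = [
    [0, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
RARE_VARIANT_COLUMNS = [0, 1, 2, 3]  # assumed to have passed a rarity filter


def carries_rare_allele(genotype_row, rare_columns):
    """True if the individual carries at least one rare allele in the gene."""
    return any(genotype_row[j] > 0 for j in rare_columns)


def carrier_fraction(genotype_rows, rare_columns):
    """Fraction of individuals carrying any rare allele in the gene."""
    carriers = sum(carries_rare_allele(row, rare_columns) for row in genotype_rows)
    return carriers / len(genotype_rows)


case_fraction = carrier_fraction(cases, RARE_VARIANT_COLUMNS)
control_fraction = carrier_fraction(controls, RARE_VARIANT_COLUMNS)
print(case_fraction, control_fraction)
```

No single variant in this toy example is seen more than once, so none could be tested on its own, yet the gene-level carrier fraction differs between the two groups. A real analysis would follow this with a formal statistical test on the collapsed variable.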
7.5 Bias in Genetic Studies
As explained in Chap. 5, confounding is a bias that results when there is a third factor that influences (is causally related to) the exposure and the outcome. The presence of this bias and methods to correct for it, either through stratification or adjustment in statistical models, are typically a large component of epidemiologic studies. However, given that there are not many “causes” of germline genetics, confounding is much less of a concern in genetic epidemiology. One major exception to this is confounding by ethnicity or race, often referred to as “population stratification” or “population structure.” The frequency of SNPs differs among ethnic groups, as described above; race or ethnicity is additionally often associated through other mechanisms with disease outcome. Prostate cancer risk is substantially higher in African Americans than European Americans, and the frequency of many SNPs varies greatly between African Americans and European Americans. If ethnicity were ignored when designing a study, a larger percentage of the cases than of the controls would be African Americans, and any SNPs discovered to be associated with case/control status could potentially be only markers for ethnicity.
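A small worked example shows how population stratification can create an association out of nothing. The counts below are invented for illustration: within each of two hypothetical populations the SNP is unrelated to disease (odds ratio of 1), but one population has both a higher allele frequency and a larger share of the cases.

```python
# Spurious association from population stratification (made-up counts).
def odds_ratio(case_carriers, case_noncarriers, control_carriers, control_noncarriers):
    """Odds ratio for carrying the allele, from a 2 x 2 table of counts."""
    return (case_carriers * control_noncarriers) / (case_noncarriers * control_carriers)


# (case carriers, case non-carriers, control carriers, control non-carriers)
strata = {
    # 50 % carriers; contributes 800 cases but only 200 controls
    "population_A": (400, 400, 100, 100),
    # 10 % carriers; contributes 200 cases and 800 controls
    "population_B": (20, 180, 80, 720),
}

# Within each population the SNP has no association with disease.
stratum_ors = {name: odds_ratio(*counts) for name, counts in strata.items()}

# Ignoring ancestry and pooling the tables produces an apparent association.
pooled_counts = tuple(sum(column) for column in zip(*strata.values()))
pooled_or = odds_ratio(*pooled_counts)

print(stratum_ors)
print(round(pooled_or, 2))
```

The pooled odds ratio is well above 1 even though the SNP has no effect in either population, which is exactly the pattern that stratifying on, or adjusting for, ancestry is meant to prevent.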
Most studies are more carefully conducted and avoid blatant bias by restricting to one self-reported ethnicity. However, even in this situation more “cryptic” population stratification can exist. For instance, African Americans have varying percentages of African ancestry, and there are genetic differences between Northern and Southern Europeans. To statistically correct for this, a principal components analysis can be run on hundreds of SNPs or even the entire GWAS dataset, using software such as Eigenstrat [18]. The principal components model the ancestral differences among the cases and controls. By adjusting for the significant principal components, underlying bias by ethnicity can be largely removed.
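The idea behind the principal components correction can be sketched with a toy genotype matrix. This is a minimal illustration under stated assumptions, not the Eigenstrat procedure: the genotypes are invented, the SNPs are only mean-centered (no further scaling), and the first principal component is found with a plain power iteration using the standard library.

```python
import math

# Rows are individuals, columns are SNPs, entries are allele counts (0/1/2).
# The first three rows and the last three rows come from two hypothetical
# ancestral groups with different allele frequencies.
genotypes = [
    [2, 2, 0, 0, 1],
    [2, 1, 0, 0, 1],
    [1, 2, 0, 1, 1],
    [0, 0, 2, 2, 1],
    [0, 1, 2, 2, 1],
    [0, 0, 1, 2, 1],
]


def center_columns(matrix):
    """Subtract each SNP's mean allele count from its column."""
    n = len(matrix)
    means = [sum(column) / n for column in zip(*matrix)]
    return [[g - m for g, m in zip(row, means)] for row in matrix]


def gram_matrix(rows):
    """Individual-by-individual similarity matrix (dot products of rows)."""
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in rows] for r1 in rows]


def leading_eigenvector(symmetric, n_iter=200):
    """Power iteration for the eigenvector with the largest eigenvalue."""
    n = len(symmetric)
    # Start from a unit vector; an all-ones start would be annihilated
    # because the rows of a centered similarity matrix sum to zero.
    vec = [1.0] + [0.0] * (n - 1)
    for _ in range(n_iter):
        nxt = [sum(symmetric[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:
            break
        vec = [x / norm for x in nxt]
    return vec


# One score per individual on the first principal component.
pc1_scores = leading_eigenvector(gram_matrix(center_columns(genotypes)))
print([round(score, 3) for score in pc1_scores])
```

In this toy example the sign of the first component separates the two groups. In an association analysis, scores like these would be included as covariates in the regression model for each SNP so that the estimated SNP effect is adjusted for ancestry.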
This type of correction with principal components can also remove misclassification that results from technical errors. While new genotyping and sequencing technologies are extremely accurate, missing data or incorrect genotype calls can occur. Again, serious bias can be avoided by careful study design (interspersing cases and controls across an array, for example), but a small amount of bias may still be present. Investigators tend to exclude SNPs with >5 % missing data, as well as individuals who are missing >5 % of data, from an analysis to avoid the possible inclusion of flawed data.
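The missing-data exclusions described above amount to two simple filters. The sketch below assumes genotypes are held in a list of lists with `None` marking a missing call, and that both missing rates are computed on the unfiltered data; real pipelines may apply the filters in sequence, and the function names here are illustrative.

```python
MAX_MISSING = 0.05  # exclusion threshold quoted in the text (>5 % missing)


def missing_rate(values):
    """Fraction of genotype calls that are missing (encoded here as None)."""
    return sum(v is None for v in values) / len(values)


def qc_filter(genotypes, max_missing=MAX_MISSING):
    """Drop SNPs (columns) and individuals (rows) with too many missing calls."""
    keep_snps = [
        j for j, column in enumerate(zip(*genotypes))
        if missing_rate(column) <= max_missing
    ]
    keep_samples = [
        i for i, row in enumerate(genotypes)
        if missing_rate(row) <= max_missing
    ]
    filtered = [[genotypes[i][j] for j in keep_snps] for i in keep_samples]
    return filtered, keep_samples, keep_snps


# Toy data: 25 individuals x 25 SNPs, all called except the entries set below.
N_SAMPLES, N_SNPS = 25, 25
toy = [[0] * N_SNPS for _ in range(N_SAMPLES)]
for sample in (0, 1, 2, 4, 5):
    toy[sample][3] = None      # SNP 3 is missing in 5 of 25 people (20 %)
toy[6][7] = None               # SNP 7 is missing in 1 of 25 people (4 %)
for snp in (0, 1, 2):
    toy[10][snp] = None        # individual 10 is missing 3 of 25 SNPs (12 %)

filtered, kept_samples, kept_snps = qc_filter(toy)
print(len(kept_samples), len(kept_snps))
```

Here SNP 3 and individual 10 exceed the threshold and are removed, while the SNPs and individuals with a single missing call (4 %) are retained.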
7.6 Gene x Environment Interactions
The impact of many traditional epidemiologic risk factors on disease may be modified by an individual’s genetic background (the reciprocal is therefore also true: the impact of the genetic background may be modified by lifestyle, diet, or environmental exposures). This type of effect modification is referred to as “gene x environment interaction.” A classic example is phenylketonuria (PKU) and phenylalanine. Individuals with this inherited disease, which is caused by a variant in the PAH gene, cannot metabolize phenylalanine, which accumulates and leads to intellectual disability. However, these downstream phenotypic effects can be avoided by adopting a phenylalanine-restricted diet high in fruits and vegetables and low-protein breads and pastas [19]. PKU is relatively common for an inherited metabolic disorder, affecting 1 in 15,000 infants in the United States, but due to widespread screening programs and awareness most grow up unaffected [20].
Gene x environment interactions are appealing in the field of cancer prevention, as an individual’s genetic background cannot be altered, but one’s lifestyle can. It has been thought that individuals at greater risk due to their genetics may be more likely to change their behavior, though this has not always transpired in practice. A recent randomized controlled trial of 783 patients at average risk of colorectal cancer, conducted within four medical school-affiliated primary care practices, found that individuals who were informed they were at an increased risk for colorectal cancer based on gene-environment risk assessment (GERA) were no more likely to be screened than individuals who received usual care [21]. Gene x environment studies are often performed using candidate genes/variants and environmental factors thought to be involved independently in the disease process. These types of studies often have limited power, especially when carried out at the genome-wide level, and are subject to all the usual biases present in epidemiologic studies (see references for methodology) [22, 23].
Epistasis occurs when the effect of one gene (or variant) is modified by another gene (or variant), leading to a nonadditive effect. However, testing all pairwise combinations of genetic variants in a GWAS leads to so many tests that power is always a limitation, so few examples of epistasis exist in the literature.
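The scale of the multiple-testing problem for epistasis is easy to quantify. The sketch below counts all unordered SNP pairs and applies the same Bonferroni-style reasoning used for single-SNP GWAS; the figure of 500,000 SNPs is an assumed example of "hundreds of thousands" of genotyped SNPs, not a number taken from a particular study.

```python
def pairwise_test_burden(n_snps, alpha=0.05):
    """Return the number of SNP pairs and the Bonferroni threshold per pair."""
    n_pairs = n_snps * (n_snps - 1) // 2   # unordered pairs, "n choose 2"
    return n_pairs, alpha / n_pairs


N_SNPS = 500_000  # assumed array size, for illustration only
n_pairs, pair_threshold = pairwise_test_burden(N_SNPS)
print(n_pairs)          # number of pairwise interaction tests
print(pair_threshold)   # p value each pair would need to reach
```

With roughly 10¹¹ pairs, each interaction would need a p value several orders of magnitude smaller than the usual genome-wide threshold, which helps explain why so few epistatic effects have been convincingly reported.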
7.7 GWAS Follow-Up and Linking SNPs to Function
One of the original hopes for GWAS was that common variants identified for common diseases could be utilized for risk prediction. Now that dozens of variants have been identified for many cancers, research is beginning to demonstrate that this may be possible. For example, using 25 of the known prostate cancer risk SNPs, a genetic risk score was developed and applied to 40,414 individuals. The men in the top 1 % of the risk distribution were 4.2 times more likely to develop prostate cancer than men with the median risk [24]. This type of information may prove very useful in the future for deciding who should receive cancer screening.
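A genetic risk score of this kind is commonly built as a weighted count of risk alleles. The sketch below assumes a log-additive model in which each SNP is weighted by the logarithm of its per-allele odds ratio; the SNP names and odds ratios are hypothetical placeholders in the 1.1 to 1.2 range mentioned earlier, and this is not necessarily how the score in the cited study was constructed.

```python
import math

# Hypothetical per-allele odds ratios for three made-up risk SNPs.
PER_ALLELE_OR = {"snp_a": 1.10, "snp_b": 1.15, "snp_c": 1.20}
WEIGHTS = {snp: math.log(or_value) for snp, or_value in PER_ALLELE_OR.items()}


def genetic_risk_score(risk_allele_counts, weights=WEIGHTS):
    """Weighted sum of risk allele counts (0, 1, or 2 per SNP)."""
    return sum(weights[snp] * risk_allele_counts.get(snp, 0) for snp in weights)


def relative_odds(score, reference_score):
    """Odds of disease relative to a reference person, under the log-additive model."""
    return math.exp(score - reference_score)


high_risk_person = {"snp_a": 2, "snp_b": 2, "snp_c": 2}
low_risk_person = {"snp_a": 0, "snp_b": 0, "snp_c": 0}

ratio = relative_odds(
    genetic_risk_score(high_risk_person),
    genetic_risk_score(low_risk_person),
)
print(round(ratio, 3))
```

Although each allele has a small effect on its own, the effects multiply across SNPs, so individuals at the extremes of the score distribution can differ appreciably in risk.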