Molecular and Clinical Genetics





Lesley Everett

Karl Desch

David Ginsburg



Keeping up to date with the remarkable progress in human genetics and the rapidly changing genomic laboratory toolbox poses a significant challenge for hemostasis physicians and scientists. This chapter provides a review of fundamental concepts in molecular genetics and an overview of key genetic technologies.


GENOME STRUCTURE AND DNA SEQUENCE VARIATION

The first draft of the human genome was published in 2001 and the finished sequence in 2004, the product of a publicly funded, international effort (the Human Genome Project) that relied on Sanger sequencing.1 The human genome is packaged into 23 pairs of chromosomes, including 22 pairs of autosomes and 1 pair of sex chromosomes (XX or XY), with each chromosome containing a single double-stranded DNA molecule. Human chromosomes range in length from 51 to 249 megabases (Mb), with an average length of about 135 Mb.2 The haploid human genome is approximately 3 billion base pairs (bp), giving about 6 billion bp for the full diploid genome. In addition to the nuclear chromosomes, each cell also contains a small mitochondrial chromosome (mtDNA molecule) that is present in each mitochondrion and thus in multiple copies in the cytoplasm. The mtDNA molecule is only 16.6 kilobases (kb), a tiny fraction of the total human genome size.2 Nevertheless, mutations of mtDNA cause a variety of human diseases and are also implicated in aging and age-associated disease.3

Only approximately 1.5% of the human genome corresponds to protein-coding sequence (~21,000 genes), while the remainder is composed of introns, repetitive sequences, control regions, and large sections of the genome with unknown function. Human protein-coding genes show enormous variation in size and organization. The average length of a gene is 53.5 kb, but genes can range in length from several hundred bp up to 2,400 kb (dystrophin).2 Gene length in nucleotides typically correlates with the size of the encoded protein, but this in part depends on the number and size of introns and exons in each gene. An intron is a portion of a gene that is transcribed into RNA, but does not code for amino acids and is removed by splicing to generate the mature messenger RNA (mRNA) that is translated into protein. All but a very small number of human protein-coding genes contain introns. In contrast, exons are the portions of a gene that are retained in the mature mRNA. Though all the DNA sequences that encode amino acids are contained in exons, noncoding exons also exist, and these contain the 5′ and 3′ untranslated regions (UTRs) of each mRNA.

The average number of exons per gene is 9.8 (range 1 to 363), and the average exon size is 288 bp (range <10 bp to 18.2 kb). The distribution of genes along and between chromosomes is not uniform. Some chromosomal regions are densely populated with genes, while other regions (often referred to as “gene deserts”) are gene poor. At the chromosomal level, some chromosomes have substantial regions of tightly packed (heterochromatic), transcriptionally silent DNA; these regions generally contain many repetitive DNA sequences and very few genes.

In addition to the coding exons, another 3% to 5% of the genome is composed of highly conserved noncoding sequences. These regions are thought to contain important regulatory sequences that may play critical roles in vertebrate development, regulation of transcription, DNA replication, and chromosome pairing and condensation.4,5


DNA Sequence Variants

Despite vast phenotypic variation in the population, human beings are genetically very similar. One individual human genome differs from another by approximately 1 nucleotide in every 1,000 bp (0.1%). Thus, each haploid human genome of 3 × 10^9 bp harbors approximately 3,000,000 DNA variants, most of which are rare and predominantly found in noncoding DNA such as introns and intergenic regions.

DNA sequence variants are classified by their frequencies in the population (common or rare) and by their nucleotide composition (single-nucleotide variants [SNVs] or more complex structural variants).6 Some sequence variants are associated with human disease, while other sequence variants are “silent” or nonpathogenic.


All DNA Sequence Polymorphisms are Due to Mutations, but not All Mutations Result in Polymorphisms

A mutation is any change of an ancestral base-pair sequence to a different nucleotide. Rare DNA sequence variants may be unique to an individual or a family. In contrast, polymorphisms are common differences in DNA sequence, strictly defined as having a frequency ≥1% for the less common allele(s). Thus, all polymorphisms are due to mutations, but not all mutations result in polymorphisms.

A mutation may be (a) a single base-pair substitution, (b) a deletion or insertion of 1 or more base pairs (indel), or (c) a larger deletion, insertion, or other rearrangement of genetic material. Mutations may be neutral (i.e., cause no observable change in phenotype or functional disruption) or deleterious (disease causing). In humans, new mutations arise at a rate of approximately 1 × 10^-8 per bp per generation, which corresponds to approximately 60 new mutations per individual. However, >98.5% of mutations occur in introns or intergenic regions (noncoding regions), so it is difficult to assess the consequence and implication of these mutations for human health and disease susceptibility.
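The arithmetic behind the figure of roughly 60 new mutations per individual can be sketched as follows. This is a minimal illustration that uses only the round numbers quoted in this chapter (a 3 billion bp haploid genome, a rate of 1 × 10^-8 per bp per generation, and about 1.5% coding sequence); the constant and function names are hypothetical.

```python
# Back-of-the-envelope estimate of new mutations per generation.
# All inputs are the approximate values quoted in the text.

HAPLOID_GENOME_BP = 3_000_000_000   # ~3 billion bp
MUTATION_RATE_PER_BP = 1e-8         # per bp per generation
CODING_FRACTION = 0.015             # ~1.5% of the genome is protein coding


def expected_new_mutations(genome_bp: float, rate_per_bp: float) -> float:
    """Expected number of new mutations across genome_bp base pairs."""
    return genome_bp * rate_per_bp


# Each individual inherits two haploid genomes, so use the diploid size.
diploid_bp = 2 * HAPLOID_GENOME_BP
total_new = expected_new_mutations(diploid_bp, MUTATION_RATE_PER_BP)

# Upper bound on how many of those are expected to fall in coding sequence.
coding_new = total_new * CODING_FRACTION

print(f"new mutations per individual: about {total_new:.0f}")
print(f"of which in coding sequence:  about {coding_new:.1f}")
```

Under these assumptions, fewer than one new mutation per generation is expected to fall within protein-coding sequence, which is consistent with the observation that the great majority of new mutations are noncoding.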

Most well-characterized disease-causing mutations occur in exons or in exon/intron junctions. DNA mutations may disrupt normal gene expression or function in several ways (see Figure 3.1 and Chapter 4). Point mutations are the replacement of a single nucleotide with another nucleotide. Some point mutations within the coding sequence do not alter the encoded amino acid (synonymous substitution) and typically have no effect on the function or expression of the gene. However, a subset of synonymous mutations may alter the structure, function, and expression level of proteins via other mechanisms.7 Missense mutations (nonsynonymous substitutions) are changes in the DNA sequence that cause one encoded amino acid to be changed to a different amino acid, thus changing the overall peptide sequence of the gene product. Some missense mutations are deleterious to protein function, while others may be neutral.

Nonsense mutations are another type of nonsynonymous substitution, changing a specific amino acid codon to a stop codon and thereby producing a shortened protein product. Frameshift mutations result from the insertion or deletion of a number of nucleotides (not divisible by 3) that alters the normal protein-coding reading frame. The reading frame of a DNA or RNA molecule refers to the sequence of three-letter codons that can be translated into amino acids. A frameshift mutation changes the amino acid sequence downstream of the mutation, often resulting in a premature stop codon. Of note, small insertions/deletions (indels) of a multiple of three nucleotides will insert or delete one or more amino acids, but are not frameshift mutations because they do not alter the reading frame.
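The categories described above (synonymous, missense, nonsense, frameshift, and in-frame indel) can be illustrated with a short sketch. It assumes the standard genetic code and operates on single DNA codons from the coding strand; the function names are hypothetical and the code is not a substitute for a real variant annotation tool.

```python
from itertools import product

# Standard genetic code in the conventional TCAG ordering; "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_substitution(ref_codon: str, alt_codon: str) -> str:
    """Classify a point mutation by its effect on the encoded amino acid.

    Assumes ref_codon encodes an amino acid (not a stop codon).
    """
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "nonsense"
    return "missense"


def classify_indel(length_in_nucleotides: int) -> str:
    """An indel shifts the reading frame unless its length is a multiple of 3."""
    if length_in_nucleotides <= 0:
        raise ValueError("indel length must be a positive integer")
    return "in-frame indel" if length_in_nucleotides % 3 == 0 else "frameshift"


print(classify_substitution("GAA", "GAG"))  # Glu -> Glu
print(classify_substitution("GAG", "GTG"))  # Glu -> Val
print(classify_substitution("CGA", "TGA"))  # Arg -> stop
print(classify_indel(1), "/", classify_indel(3))
```

The sketch looks only at the encoded amino acid. It therefore cannot detect the subset of synonymous or distant variants that act through splicing, mRNA stability, or other mechanisms.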

Finally, splice-site mutations result in an altered RNA sequence by changing the specific site at which splicing of an intron takes place during mRNA processing. Splice-site mutations may lead to the production of aberrant proteins. However, mutations far from a splice site may also affect splicing via alternate mechanisms8 or may be implicated in abnormal polyadenylation of mRNAs, RNA stability, or transcript processing.2 An important cellular mRNA surveillance mechanism known as nonsense-mediated decay (NMD) results in the instability of mRNAs containing nonsense mutations, thereby reducing the expression of truncated, potentially harmful proteins.9






FIGURE 3.1 Examples of different classes of mutation.

Mutations that functionally disrupt protein-coding genes, known as loss-of-function (LOF) mutations, are surprisingly common in the human genome. LOF variants have historically been assumed to cause disease, as is the case for many Mendelian diseases. However, recent large-scale sequencing projects have revealed that apparently healthy individuals harbor at least 100 LOF variants in their genomes, including approximately 30 in the homozygous or compound heterozygous state.10

Some LOF variants that are common in healthy individuals have been known and well characterized for years, such as the O allele of the blood group ABO antigen locus, and other variants that contribute to variable drug-metabolizing capacity between individuals. Nevertheless, the finding that healthy individuals carry dozens to hundreds of seemingly benign LOF variants came as a great surprise to the scientific community. Even more startling is the mounting evidence that gene disruption may actually be beneficial in some cases.10

It is possible for one disease to result from many different types of mutations in a given gene. For example, over 1,000 unique mutations causing hemophilia A are reported in the worldwide hemophilia database (HAMSTeRS).11 As a result of its X-linked inheritance, hemophilia A is observed with a high frequency in the human population (1:5,000 male newborns worldwide) (Chapter 51). As Haldane predicted in 1935, one-third of males with a lethal X-linked disorder should represent de novo germline mutations (arising in eggs or sperm),12 based on the reasoning that one-third of lethal X-linked mutations are carried in males and will be lost in each generation. Thus, most hemophilia A mutations are expected to be on average only a few generations old. Many of these mutations occur at CpG dinucleotides, which are known hotspots for mutation in mammalian genes.13,14
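Haldane's one-third figure can be recovered with a short equilibrium argument. The derivation below is a sketch that is not taken from this chapter. It assumes a mutation rate μ per X chromosome per generation that is equal in eggs and sperm, affected males who do not reproduce, and terms of order μ² small enough to ignore; C (the frequency of carrier females) and I (the incidence among male births) are symbols introduced here for illustration.

```latex
\begin{align*}
% A carrier female inherits the mutation from a carrier mother (C/2),
% from a new mutation in the egg (\mu), or from a new mutation in the sperm (\mu).
C &= \tfrac{1}{2}C + \mu + \mu
  \quad\Longrightarrow\quad C = 4\mu, \\
% An affected male inherits his single X from his mother.
I &= \tfrac{1}{2}C + \mu = 3\mu, \\
% Fraction of affected males whose mutation is new.
\frac{\text{de novo cases}}{\text{all cases}} &= \frac{\mu}{I} = \frac{1}{3}.
\end{align*}
```

The same result follows from counting chromosomes: with equal numbers of males and females, males carry one of every three X chromosomes, so one-third of the mutant alleles are exposed to selection in each generation and must be replaced by new mutation to keep the disease frequency stable.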

Mutations occurring outside of coding regions, such as in regulatory regions or in splice-site consensus sequences, may alter gene transcription or lead to alternative mRNA transcripts, respectively. Of historic interest, by analyzing DNA recovered from skeletal bone specimens of the Romanov family, a mutation in the F9 gene predicted to alter RNA splicing was identified as the cause of the “Royal Disease” transmitted from Queen Victoria to the Royal families of Europe, thereby determining that this disease was hemophilia B, not the more common hemophilia A.15

Finally, in addition to smaller coding and noncoding mutations, structural rearrangements comprise another important class of genetic mutations, including deletions, insertions, inversions, and chromosomal translocations. Such larger genomic disruptions have been observed to cause hemophilia A, including LINE-1 retrotransposon insertion.11,16 Of particular note, the recurrent intron-22 inversion in the FVIII gene accounts for approximately 35% to 45% of all severe hemophilia A cases.17 Duplications or deletions of large segments of DNA, termed copy number variations (CNVs), are surprisingly common in the human genome. CNVs ranging in size from thousands to millions of base pairs may be present in anywhere from 0 to 2 or more copies when comparing one individual to the next.18 CNVs can encompass entire genes or groups of genes, leading to gene dosage imbalance. Healthy individuals harbor multiple CNVs, which likely contribute to normal trait variation, though a subset of CNVs are associated with disease or disease susceptibility. Until recently, structural variants were technologically difficult to detect and characterize, as compared to single-nucleotide substitutions. Common structural variants are now estimated to involve between 9 and 25 Mb (0.5% to 1%) of the genome.6


Genetic Polymorphisms

As noted above, a polymorphism is formally defined as any sequence or trait for which the less common or minor form(s) exhibit a population frequency of ≥1% (Table 3.1). These variants are considered to be common in the population. Common variants are not generally deleterious or associated with disease, which is consistent with the expectation that deleterious mutations would not typically reach a frequency of ≥1% in the population. Single-nucleotide polymorphisms (SNPs) represent positions in the genome where two or occasionally three alternative nucleotides are common in the population. Though the term “SNP” is typically used to refer to all single-nucleotide DNA variants, even those with minor allele frequencies <1%, the latter should more correctly be identified as SNVs. Unlike common polymorphisms with relatively high population frequencies, variants occurring at a frequency <1% in the population are considered to be rare. These variants are also known as “private” polymorphisms because they are often restricted to specific pedigrees in a population.








Table 3.1 Classification of DNA sequence variants

| | Variant frequency <1% | Variant frequency ≥1% |
| --- | --- | --- |
| “Normal” | Rare variant or “private polymorphism” | Polymorphism |
| “Disease” | Disease-causing sequence variant | Common disease-causing variant; for example, FV Leiden (2.5% allele frequency) and the common ΔF508 cystic fibrosis mutation (2% allele frequency) |


Of note, the terms polymorphism and neutral variant are often erroneously used synonymously. However, rare variants can be neutral and common polymorphisms can be deleterious (Table 3.1). In clinical practice, rare sequence changes in a disease gene in a patient can often only be described as “unclassified variants” or “variants of unknown significance.” The relative frequency of neutral, near-neutral, and nonneutral genetic variants remains to be precisely determined (Table 3.2).19
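The two-way classification in Table 3.1 (frequency on one axis, pathogenicity on the other) can be expressed as a small function. This is an illustrative sketch only: the 1% threshold comes from the definition given in the text, while the function name and the boolean `disease_causing` flag are hypothetical, and in practice pathogenicity is frequently unknown.

```python
POLYMORPHISM_THRESHOLD = 0.01  # minor allele frequency of 1%, as defined in the text


def classify_variant(minor_allele_frequency: float, disease_causing: bool) -> str:
    """Return the Table 3.1 category for a variant.

    minor_allele_frequency is the population frequency of the less common
    allele, so it must lie in the interval (0, 0.5].
    """
    if not 0.0 < minor_allele_frequency <= 0.5:
        raise ValueError("minor allele frequency must be in (0, 0.5]")

    is_common = minor_allele_frequency >= POLYMORPHISM_THRESHOLD
    if disease_causing:
        if is_common:
            return "common disease-causing variant"
        return "disease-causing sequence variant"
    if is_common:
        return "polymorphism"
    return "rare variant (private polymorphism)"


# FV Leiden, using the 2.5% allele frequency quoted in Table 3.1.
print(classify_variant(0.025, disease_causing=True))
# A hypothetical benign variant seen in 1 of 2,000 chromosomes.
print(classify_variant(0.0005, disease_causing=False))
```

Keeping frequency and pathogenicity as separate inputs mirrors the point made above: "polymorphism" describes how common a variant is, not whether it is neutral.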

Balanced polymorphisms represent a special class of genetic variant observed under balancing selection. In this case, an individual who is heterozygous at a particular genetic locus has a greater fitness or survival advantage compared to either type of homozygous individual, a phenomenon known as heterozygote advantage. For example, individuals with sickle cell trait (heterozygote carriers of the HbS mutation) are asymptomatic with normal life expectancy, though they are resistant to Plasmodium falciparum malaria,20 which is endemic in West Africa. A balance exists between selection against individuals who suffer from sickle cell anemia (homozygous for HbS) and selection for heterozygous HbS carriers who are resistant to malaria.


Databases Catalog Common Genetic Variation

The International HapMap Project is a publicly available catalog of common genetic variation in the human population. By collecting SNP data from hundreds of individuals, investigators can predict the likelihood of neighboring SNPs being inherited together. This analysis takes advantage of the fact that discrete blocks of genetic sequences tend to be inherited together more often (or less often) than expected by chance, given their individual allele frequencies in the population. This type of statistical association between genetic alleles is known as linkage disequilibrium (LD), and it typically results from the maintenance of ancestral haplotypes in the population. Haplotypes are a series of specific adjacent SNP alleles that are inherited together because they have not yet been broken up by meiotic recombination. Such patterns of SNPs that are strongly correlated with each other across individuals form a limited number of relatively small haplotype blocks (10 to 50 kb).21 Haplotype blocks thus provide important clues for understanding the evolutionary and demographic history of human populations.22,23
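Linkage disequilibrium between two biallelic SNPs is commonly quantified by comparing the observed frequency of a haplotype with the frequency expected if the alleles were independent. The sketch below computes two standard measures, D and r², from the four haplotype frequencies. These formulas are not given in this chapter, and the function name, argument names, and example frequencies are hypothetical.

```python
def ld_statistics(p_AB: float, p_Ab: float, p_aB: float, p_ab: float):
    """Return (D, r_squared) for two biallelic loci with alleles A/a and B/b.

    Arguments are the population frequencies of the four possible haplotypes
    and must sum to 1.
    """
    total = p_AB + p_Ab + p_aB + p_ab
    if abs(total - 1.0) > 1e-9:
        raise ValueError("haplotype frequencies must sum to 1")

    p_A = p_AB + p_Ab  # frequency of allele A at the first SNP
    p_B = p_AB + p_aB  # frequency of allele B at the second SNP

    # D: observed haplotype frequency minus the frequency expected by chance.
    d = p_AB - p_A * p_B

    # r^2: D squared, scaled by the allele frequencies; undefined if either
    # locus is not actually variable.
    denominator = p_A * (1 - p_A) * p_B * (1 - p_B)
    r_squared = d * d / denominator if denominator > 0 else float("nan")
    return d, r_squared


# Alleles always travel together: only AB and ab haplotypes are observed.
print(ld_statistics(0.5, 0.0, 0.0, 0.5))
# Alleles assort independently: all four haplotypes are equally frequent.
print(ld_statistics(0.25, 0.25, 0.25, 0.25))
```

An r² near 1 means that genotyping one SNP largely predicts the other, which is why a limited set of SNPs can capture much of the common variation within a haplotype block.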

Several known genetic mutations, including the FV Leiden mutation and the prothrombin G20210A mutation, arose from a single mutational event on a distinct founder haplotype.24,25 Haplotype blocks allow for the construction of genomic maps in populations sharing a geographic or ethnic background.26 The original 270 HapMap samples were derived from Caucasians of European ancestry, Yoruba people from Nigeria, and Asian individuals of Japanese and Han Chinese origin and included data for over 3.1 million SNPs. Currently in Phase III of the study, HapMap now includes data from 1,301 samples taken from 11 different global populations, enabling detailed study of common and rare genetic variation in diverse human populations.27








Table 3.2 Glossary of terms

| Term | Definition |
| --- | --- |
| Allele | One of two or more versions of a gene. |
| Allelic heterogeneity | A single disorder, trait, or pattern of traits caused by different mutations within the same gene. |
| Alternative splicing | More than one mRNA may be generated from the same gene by use of different exons, resulting in the generation of related proteins from one gene, often in a tissue- or developmental stage-specific manner. |
| Codominance | The phenotype of the heterozygous individual includes elements of both homozygous phenotypes (the discrete contributions of both parental alleles are visible in the phenotype). |
| Complex genetic traits | Result from the combined effects of many genetic loci and environmental factors (multifactorial). |
| Copy number variant (CNV) | A DNA sequence (ranging from thousands to millions of base pairs) that is present at variable copy number among individuals within a population, as compared to a diploid reference genome. |
| Epigenomic modifications | Chemical marks, such as DNA methylation and histone acetylation, that mediate the dynamic regulation of gene expression by influencing which genes are expressed and which are repressed. |
| Exome | The complete sequence of all exons in the human genome, comprising about 1.5% of the entire genome. |
| Exon | The portions of a gene that are retained in the mature mRNA. Though all the DNA sequences that encode amino acids are contained in exons, noncoding exons also exist, and these contain the 5′ and 3′ UTRs of each mRNA. |
| Frameshift mutation | Results from the insertion or deletion of a number of nucleotides (not divisible by 3) that alters the normal protein-coding reading frame. |
| Gene | The fundamental physical and functional unit of heredity. A gene is an ordered sequence of nucleotides located in a particular position on a particular chromosome that encodes a specific functional product (i.e., a protein or RNA molecule). |
| Genome-wide association studies (GWAS) | Utilize large samples of unrelated individuals to identify population-wide statistical associations between disease phenotypes and tested genotypes. |
| Genetic heterogeneity | Different genetic variants underlying the same phenotype or trait in different individuals or families (includes locus and allelic heterogeneity). |
| Genome browser | A database of genome information with graphical user interfaces. |
| Haplotype | A series of specific adjacent SNP alleles that are inherited together because they have not yet been shuffled by meiotic recombination. |
| Incomplete penetrance | A subset of individuals inheriting a disease-causing mutation exhibit no symptoms of the disease. |
| Indel | Insertion or deletion of a few nucleotides. |
| Intron | A portion of the gene that is transcribed into RNA but is removed by splicing to generate the mature mRNA that is translated into protein. |
| Linkage disequilibrium (LD) | A statistical association between genetic alleles such that discrete blocks of genetic sequences tend to be inherited together more often (or less often) than expected by chance. |
| Locus | The chromosomal location of a gene or a specific genetic sequence. |
| Locus heterogeneity | The same genetic disorder, trait, or pattern of traits that can be caused by mutations in one of a set of two or more genes. |
| Mendelian | A trait or disease whose pattern of inheritance suggests it is caused by variation at a single genetic locus (a monogenic character). |
| miRNA | Small RNA molecules that are processed from small hairpin precursor RNAs (pre-miRNAs) produced from miRNA genes. miRNAs are 21 to 23 nucleotides in length, and through the RNA-induced silencing complex, they target and silence mRNAs in a sequence-specific manner. |
| Missense mutation | Change in the genetic sequence that causes one encoded amino acid to be changed to a different amino acid. |
| Modifier gene | Genes acting together with environmental and stochastic effects to modulate the effect of a major Mendelian gene. |
| Mutation | Any alteration in a gene from its ancestral state; may be disease-causing or a benign variant. |
| Next-generation sequencing (NGS) | New DNA sequencing methods based on massively parallel sequencing of thousands of DNA sequences simultaneously. |
| Nonsense-mediated decay (NMD) | A cellular mRNA surveillance mechanism used to detect nonsense mutations and to prevent the expression of truncated or potentially harmful proteins. |
| Nonsense mutation | A change from an amino acid codon to a stop codon. |
| Nonsynonymous mutation | A mutation that changes the encoded amino acid to another amino acid or to a stop codon. |
| Phenocopy | An individual with a phenotype (often environmentally induced) mimicking one usually produced by a specific genotype. |
| Point mutation | The replacement of a single nucleotide with another nucleotide. |
| Polymorphism | Any sequence or trait for which the less common or minor form(s) exhibit a population frequency of ≥1%. |
| Positional cloning | Using knowledge of chromosomal location to identify a disease gene. |
| Proteome | The set of proteins expressed by the genetic material of an organism under a given set of environmental conditions. |
| Quantitative trait | Many complex traits and diseases are quantitative, meaning that the observable human trait or disease varies over a continuous (and measurable) range of phenotypes. |
| RNA-seq | Transcriptome profiling technique that uses deep-sequencing technologies to provide a precise measurement of levels of transcripts and their isoforms. |
| Semidominance | Also known as incomplete dominance, this occurs when the phenotype of the heterozygote is intermediate between the phenotypes of the two homozygous genotypes. |
| SNP | A position in the genome where two or occasionally three alternative nucleotides are common in the population, with the less common or minor form(s) exhibiting a population frequency of ≥1%. In common usage, the term is often applied to all SNVs. |
| SNV | Similar to an SNP, but includes those with minor allele frequency <1%. |
| Synonymous mutation | A genetic mutation that does not alter the encoded amino acid sequence. |
| Transcriptome | The total mRNA content of a cell at a given time. |
| Variable expressivity | A wide range in disease severity or other features of the phenotype among individuals inheriting the same disease-causing mutation. |



Another useful source of sequence variation data is the single nucleotide polymorphism database (dbSNP), a free public archive cataloging genetic variation within and across different species. In contrast to the HapMap Project, dbSNP contains information about SNPs as well as several other types of sequence variants from multiple species. The most recent dbSNP release (Build 135, November 2011) includes a total of 436,271,275 SNPs from 102 different organisms.27 Finally, the database of Genotypes and Phenotypes (dbGaP) reports the results of studies investigating the interaction of genotype and phenotype, including data from genome-wide association studies (GWAS), medical sequencing, and other molecular diagnostic assays (Table 3.3).









Table 3.3 Other useful resources and web sites

General Resources

| Resource | Description | URL |
| --- | --- | --- |
| All about the Human Genome Project | General information sheet about the Human Genome Project | http://www.genome.gov/10001772 |
| The 1000 Genomes Project | Public effort to provide a comprehensive resource on human genetic variation | http://www.1000genomes.org |
| NIH Genetics Home Reference | Guide to understanding basic genetics concepts, genetic conditions, and additional genetics resources | http://ghr.nlm.nih.gov/ |

Genome Browsers

| Database | Source | Description | URL |
| --- | --- | --- | --- |
| Ensembl | Wellcome Trust Sanger Institute/European Bioinformatics Institute | Genome databases for vertebrates and other organisms presented with convenient user interfaces | http://www.ensembl.org |
| NCBI map viewer | US National Center for Biotechnology Information | | http://www.ncbi.nlm.nih.gov/mapview/ |
| UCSC Genome Browser | University of California at Santa Cruz | | http://genome.ucsc.edu |

General Nucleotide Sequence Databases

| Database | Source | Description | URL |
| --- | --- | --- | --- |
| GenBank | National Center for Biotechnology Information | Annotated collection of all publicly available DNA sequences | http://ncbi.nlm.nih.gov |
| dbSNP | National Center for Biotechnology Information | Archive for genetic variation within and across different species | http://www.ncbi.nlm.nih.gov/projects/SNP/ |
| dbGaP | The database of Genotypes and Phenotypes (NCBI) | Archive of study results for interactions between genotype and phenotype | http://www.ncbi.nlm.nih.gov/gap |
| HapMap | The International HapMap Association | Identifies and catalogs genetic similarities and differences in humans | hapmap.ncbi.nlm.nih.gov; http://snp.cshl.org/abouthapmap.html |
| ENCODE | Encyclopedia of DNA elements/NHGRI | Effort to identify all functional elements in the human genome | http://www.genome.gov/10005107 |

General Protein Sequence Databases

| Database | Source | Description | URL |
| --- | --- | --- | --- |
| UniProt | UniProt Consortium | Comprehensive database of protein sequence and functional information | http://www.uniprot.org/ |
| SWISS-PROT | Swiss Institute of Bioinformatics | High-quality annotated and nonredundant protein sequence database | http://ca.expasy.org/sprot |

Diseases and Clinical Genetics Resources

| Database | Source | Description | URL |
| --- | --- | --- | --- |
| Online Mendelian Inheritance in Man (OMIM) | US National Center for Biotechnology Information | Comprehensive database of human genes and genetic phenotypes | http://www.ncbi.nlm.nih.gov/omim |
| Human Gene Mutation Database (HGMD) | Institute of Medical Genetics in Cardiff | Published mutations causing or associated with human inherited disease | http://www.hgmd.cf.ac.uk/ac/index.php |
| GeneCards | The Human Genome Compendium | Web-based cards integrating automatically generated information on human genes | http://www.genecards.org/ |
| GeneTests/The Genetic Testing Registry | NIH | Information on genetic testing and its use in diagnosis, management, and genetic counseling | http://www.ncbi.nlm.nih.gov/sites/GeneTests/ |
| HAMSTeRS | The Haemophilia A Mutation Database and Factor VIII Resource Site | General resource for FVIII genetic variation | http://hadb.org.uk/ |
| Pharmacogenomics Factsheet: One size does not fit all: the promise of pharmacogenomics | NCBI | A basic primer on pharmacogenomics and its implications for clinical medicine | http://www.ncbi.nlm.nih.gov/About/primer/pharm.html |

Gene Expression Profiling Databases

| Resource | Description | URL |
| --- | --- | --- |
| The Genotype-Tissue Expression (GTEx) Project (NIH Roadmap Initiative) | Large-scale study of human gene expression and regulation in multiple tissues to establish correlation between genotype- and tissue-specific gene expression levels | http://nihroadmap.nih.gov/GTEx/index.asp |
| Mammalian Gene Collection (NHGRI) | Sequence-validated full-length protein-coding cDNA clones for most known human, mouse, and rat genes | http://mgc.nci.nih.gov/ |
| Mouse Transcriptome Project (NHGRI) | Database and public repository of gene transcripts for many mouse tissues (microarray, NGS, and other high-throughput functional genomic data) | http://www.ncbi.nlm.nih.gov/projects/geo/info/mouse-trans.html |


Jun 21, 2016 | Posted by in HEMATOLOGY | Comments Off on Molecular and Clinical Genetics
