Exome sequencing, also known as whole exome sequencing, is a genomic technique for sequencing all of the protein-coding regions of genes in a genome. It consists of two steps. The first step is to select only the subset of DNA that encodes proteins; these regions are known as exons. Humans have about 180,000 exons, constituting about 1% of the human genome, or 30 million base pairs. The second step is to sequence the exonic DNA using any high-throughput DNA sequencing technology. The goal of this approach is to identify genetic variants that alter protein sequences, and to do this at a much lower cost than whole-genome sequencing. Since these variants can be responsible for both Mendelian and common polygenic diseases, such as Alzheimer's disease, whole exome sequencing has been applied both in academic research and as a clinical diagnostic. Exome sequencing is effective in the study of rare Mendelian diseases, because it is an efficient way to identify the genetic variants in all of an individual's genes.
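The figures quoted above can be cross-checked with a few lines of arithmetic. The following is a minimal sketch that uses only the numbers stated in the text (about 180,000 exons, about 30 million base pairs, about 1% of the genome); the function names are illustrative and do not come from any sequencing toolkit.

```python
# Figures quoted in the text; all are approximate.
EXON_COUNT = 180_000          # "about 180,000 exons"
EXOME_BP = 30_000_000         # "30 million base pairs"
EXOME_FRACTION = 0.01         # "about 1% of the human genome"


def mean_exon_length(exome_bp: int, exon_count: int) -> float:
    """Average exon length in base pairs implied by the two totals."""
    return exome_bp / exon_count


def implied_genome_size(exome_bp: int, fraction: float) -> float:
    """Genome size in base pairs implied by the exome size and its share."""
    return exome_bp / fraction


if __name__ == "__main__":
    print(f"mean exon length: {mean_exon_length(EXOME_BP, EXON_COUNT):.0f} bp")
    print(f"implied genome size: {implied_genome_size(EXOME_BP, EXOME_FRACTION):.2e} bp")
```

Under these rounded inputs the average exon is well under 200 base pairs, which is consistent with the later remark that short-read instruments suit exome work.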
These diseases are most often caused by rare genetic variants that are only present in a tiny number of individuals. Furthermore, because severe disease-causing variants are much more likely to be in the protein-coding sequence, focusing on this 1% costs far less than whole genome sequencing but still detects a high yield of relevant variants. In the past, clinical genetic tests were chosen based on the clinical presentation of the patient, or surveyed only certain types of variation, but provided definitive genetic diagnoses in fewer than half of all patients. Exome sequencing is now used to complement these other tests, both to find mutations in genes known to cause disease and to identify novel genes by comparing exomes from patients with similar features. Target-enrichment methods allow one to selectively capture genomic regions of interest from a DNA sample prior to sequencing. Several target-enrichment strategies have been developed since the original description of the direct genomic selection method in 2005.
Though many techniques have been described for targeted capture, only a few of these have been extended to capture entire exomes. The first target enrichment strategy to be applied to whole exome sequencing was the array-based hybrid capture method in 2007, but in-solution capture has gained popularity in recent years. Twist Bioscience introduced Human Core Exome Enrichment Kit that enables researchers to perform more efficient capture of exomes than any other available method resulting in more complete enrichment of target sequences and lower sequencing depth requirements. Microarrays contain single-stranded oligonucleotides with sequences from the human genome to tile the region of interest fixed to the surface. Genomic DNA is sheared to form double-stranded fragments; the fragments undergo end-repair to produce blunt ends and adaptors with universal priming sequences are added. These fragments are hybridized to oligos on the microarray. Unhybridized fragments are washed away and the desired fragments are eluted.
The fragments are amplified using PCR. Roche NimbleGen was first to take the original direct genomic selection (DGS) technology and adapt it for next-generation sequencing; they developed the Sequence Capture Human Exome 2.1M Array to capture ~180,000 coding exons. This method is cost-effective compared to PCR-based methods. The Agilent Capture Array and the comparative genomic hybridization array are other methods that can be used for hybrid capture of target sequences. Limitations of this technique include the need for expensive hardware as well as a large amount of DNA. To capture genomic regions of interest using in-solution capture, a pool of custom oligonucleotides (probes) is synthesized and hybridized in solution to a fragmented genomic DNA sample. The probes (labeled with beads) selectively hybridize to the genomic regions of interest, after which the beads can be pulled down and washed to clear excess material. The beads are then removed and the genomic fragments can be sequenced, allowing for selective DNA sequencing of genomic regions of interest.
This method was developed to improve on the hybridization capture target-enrichment method. In solution capture, as opposed to hybrid capture, there is an excess of probes to target regions of interest over the amount of template required. The optimal target size is about 3.5 megabases and yields excellent sequence coverage of the target regions. The preferred method depends on several factors, including the number of base pairs in the region of interest, demands for reads on target, and the equipment available in house. There are many next-generation sequencing (NGS) platforms available, postdating classical Sanger sequencing methodologies. These platforms include the Roche 454 sequencer, the Life Technologies SOLiD systems, the Life Technologies Ion Torrent, and Illumina's Genome Analyzer II and subsequent MiSeq, HiSeq and NovaSeq series instruments, all of which can be used for massively parallel exome sequencing. These 'short read' NGS systems are well suited to analyse many short stretches of DNA sequence, as found in human exons.
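The "demands for reads on target" mentioned above can be made concrete with the usual back-of-the-envelope estimate of mean sequencing depth. This is a sketch under stated assumptions: the read count, read length and on-target fraction in the example are hypothetical values chosen for illustration, and only the 30 million base pair target size comes from the text.

```python
def mean_target_depth(n_reads: int, read_length_bp: int,
                      on_target_fraction: float, target_size_bp: int) -> float:
    """Estimate mean depth over the target as on-target bases / target size.

    This ignores duplicates, uneven capture and partial overlaps at exon
    edges, so it is an upper-bound style estimate, not a measured depth.
    """
    if target_size_bp <= 0:
        raise ValueError("target_size_bp must be positive")
    if not 0.0 <= on_target_fraction <= 1.0:
        raise ValueError("on_target_fraction must be between 0 and 1")
    return n_reads * read_length_bp * on_target_fraction / target_size_bp


if __name__ == "__main__":
    # Hypothetical run: 40 million reads of 100 bp, 60% on target,
    # against a 30 Mb exome target.
    depth = mean_target_depth(40_000_000, 100, 0.6, 30_000_000)
    print(f"estimated mean depth: {depth:.0f}x")
```

The same relation shows why a capture method with a higher on-target fraction lowers the number of reads needed for a given depth.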
There are multiple technologies available. Each technology has disadvantages in terms of technical and financial factors. Two such technologies are microarrays and whole-genome sequencing. Microarrays use hybridization probes
Nuclear magnetic resonance
Nuclear magnetic resonance (NMR) is a physical phenomenon in which nuclei in a strong static magnetic field are perturbed by a weak oscillating magnetic field and respond by producing an electromagnetic signal with a frequency characteristic of the magnetic field at the nucleus. This process occurs near resonance, when the oscillation frequency matches the intrinsic frequency of the nuclei, which depends on the strength of the static magnetic field, the chemical environment, and the magnetic properties of the isotope involved. NMR results from specific magnetic properties of certain atomic nuclei. Nuclear magnetic resonance spectroscopy is used to determine the structure of organic molecules in solution and to study molecular physics, crystals, and non-crystalline materials. NMR is also routinely used in advanced medical imaging techniques, such as magnetic resonance imaging (MRI). All isotopes that contain an odd number of protons and/or neutrons have an intrinsic nuclear magnetic moment and angular momentum, in other words a nonzero nuclear spin, while all nuclides with even numbers of both have a total spin of zero.
The most commonly used nuclei are 1H and 13C, although isotopes of many other elements can be studied by high-field NMR spectroscopy as well. A key feature of NMR is that the resonance frequency of a particular simple substance is directly proportional to the strength of the applied magnetic field. It is this feature that is exploited in imaging techniques. Since the resolution of the imaging technique depends on the magnitude of the magnetic field gradient, many efforts are made to develop increased gradient field strength. The principle of NMR involves three sequential steps: (1) the alignment of the magnetic nuclear spins in an applied, constant magnetic field B0; (2) the perturbation of this alignment of the nuclear spins by a weak oscillating magnetic field, referred to as a radio-frequency (RF) pulse, where the oscillation frequency required for significant perturbation depends on the static magnetic field and the nuclei of observation; and (3) the detection of the NMR signal during or after the RF pulse, due to the voltage induced in a detection coil by precession of the nuclear spins around B0.
After an RF pulse, precession occurs with the nuclei's intrinsic Larmor frequency and, in itself, does not involve transitions between spin states or energy levels. The two magnetic fields are chosen to be perpendicular to each other as this maximizes the NMR signal strength; the frequencies of the time-signal response by the total magnetization of the nuclear spins are analyzed in NMR spectroscopy and magnetic resonance imaging. Both use applied magnetic fields of great strength produced by large currents in superconducting coils, in order to achieve dispersion of response frequencies and of high homogeneity and stability in order to deliver spectral resolution, the details of which are described by chemical shifts, the Zeeman effect, Knight shifts; the information provided by NMR can be increased using hyperpolarization, and/or using two-dimensional, three-dimensional and higher-dimensional techniques. NMR phenomena are utilized in low-field NMR, NMR spectroscopy and MRI in the Earth's magnetic field, in several types of magnetometers.
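The proportionality between resonance (Larmor) frequency and applied field can be illustrated numerically. The sketch below assumes the standard relation frequency = (gamma / 2 pi) x B0 and uses commonly tabulated reduced gyromagnetic ratios for 1H and 13C; these constants and the function name are not taken from the text above, and chemical-shift effects are ignored.

```python
# Reduced gyromagnetic ratios gamma / (2*pi) in MHz per tesla
# (commonly tabulated values, rounded; an assumption of this sketch).
GAMMA_OVER_2PI_MHZ_PER_T = {
    "1H": 42.577,
    "13C": 10.708,
}


def larmor_frequency_mhz(nucleus: str, b0_tesla: float) -> float:
    """Resonance frequency in MHz of a bare nucleus in a static field B0."""
    return GAMMA_OVER_2PI_MHZ_PER_T[nucleus] * b0_tesla


if __name__ == "__main__":
    for b0 in (1.5, 9.4):
        for nucleus in ("1H", "13C"):
            f = larmor_frequency_mhz(nucleus, b0)
            print(f"{nucleus:>3} at {b0:4.1f} T: {f:7.2f} MHz")
```

Doubling B0 doubles every frequency, which is why a spatial field gradient maps position onto frequency in imaging.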
Nuclear magnetic resonance was first described and measured in molecular beams by Isidor Rabi in 1938, by extending the Stern–Gerlach experiment; in 1944, Rabi was awarded the Nobel Prize in Physics for this work. In 1946, Felix Bloch and Edward Mills Purcell expanded the technique for use on liquids and solids, for which they shared the Nobel Prize in Physics in 1952. Yevgeny Zavoisky observed nuclear magnetic resonance in 1941, well before Felix Bloch and Edward Mills Purcell, but dismissed the results as not reproducible. Russell H. Varian filed the "Method and means for correlating nuclear properties of atoms and magnetic fields", U.S. Patent 2,561,490, on July 24, 1951. Varian Associates developed the first NMR unit, called NMR HR-30, in 1952. Purcell had worked on the development of radar during World War II at the Massachusetts Institute of Technology's Radiation Laboratory. His work during that project on the production and detection of radio frequency power and on the absorption of such RF power by matter laid the foundation for his discovery of NMR in bulk matter.
Rabi and Purcell observed that magnetic nuclei, like 1H and 31P, could absorb RF energy when placed in a magnetic field and when the RF was of a frequency specific to the identity of the nuclei. When this absorption occurs, the nucleus is described as being in resonance. Different atomic nuclei within a molecule resonate at different frequencies for the same magnetic field strength; the observation of such magnetic resonance frequencies of the nuclei present in a molecule allows any trained user to discover essential chemical and structural information about the molecule. The development of NMR as a technique in analytical chemistry and biochemistry parallels the development of electromagnetic technology and advanced electronics and their introduction into civilian use. All nucleons, neutrons and protons, composing any atomic nucleus, have the intrinsic quantum property of spin, an intrinsic angular momentum analogous to the classical angular momentum of a spinning sphere; the overall spin of the nucleus is determined b
Ribonucleic acid (RNA) is a polymeric molecule essential in various biological roles in coding, decoding and expression of genes. RNA and DNA are nucleic acids. Along with proteins, lipids and carbohydrates, nucleic acids constitute the four major macromolecules essential for all known forms of life. Like DNA, RNA is assembled as a chain of nucleotides, but unlike DNA it is more often found in nature as a single strand folded onto itself, rather than a paired double strand. Cellular organisms use messenger RNA (mRNA) to convey genetic information that directs synthesis of specific proteins. Many viruses encode their genetic information using an RNA genome. Some RNA molecules play an active role within cells by catalyzing biological reactions, controlling gene expression, or sensing and communicating responses to cellular signals. One of these active processes is protein synthesis, a universal function in which RNA molecules direct the assembly of proteins on ribosomes. This process uses transfer RNA (tRNA) molecules to deliver amino acids to the ribosome, where ribosomal RNA (rRNA) links amino acids together to form proteins.
Like DNA, most biologically active RNAs, including mRNA, tRNA, rRNA, snRNAs, and other non-coding RNAs, contain self-complementary sequences that allow parts of the RNA to fold and pair with itself to form double helices. Analysis of these RNAs has revealed that they are highly structured. Unlike DNA, their structures do not consist of long double helices, but rather collections of short helices packed together into structures akin to proteins. In this fashion, RNAs can achieve chemical catalysis. For instance, determination of the structure of the ribosome, an RNA-protein complex that catalyzes peptide bond formation, revealed that its active site is composed entirely of RNA. Each nucleotide in RNA contains a ribose sugar, with carbons numbered 1' through 5'. A base is attached to the 1' position: in general, adenine, cytosine, guanine, or uracil. Adenine and guanine are purines; cytosine and uracil are pyrimidines. A phosphate group is attached to the 3' position of one ribose and the 5' position of the next. The phosphate groups have a negative charge each. The bases form hydrogen bonds between cytosine and guanine, between adenine and uracil, and between guanine and uracil.
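The pairing rules just listed (C with G, A with U, and the G with U wobble pair) are enough to check whether two stretches of an RNA could form a short helix. This is a minimal sketch; `can_pair` and `forms_stem` are hypothetical helper names, and the check considers base identity only, not thermodynamic stability.

```python
# Unordered base pairs allowed by the rules in the text.
ALLOWED_PAIRS = {frozenset("CG"), frozenset("AU"), frozenset("GU")}


def can_pair(a: str, b: str) -> bool:
    """True if bases a and b can hydrogen-bond under the listed rules."""
    return frozenset((a.upper(), b.upper())) in ALLOWED_PAIRS


def forms_stem(five_prime_arm: str, three_prime_arm: str) -> bool:
    """True if two equal-length arms, both written 5'->3', pair antiparallel.

    The first base of the 5' arm is matched with the last base of the
    3' arm, and so on inwards.
    """
    if len(five_prime_arm) != len(three_prime_arm):
        return False
    return all(can_pair(a, b)
               for a, b in zip(five_prime_arm, reversed(three_prime_arm)))


if __name__ == "__main__":
    print(can_pair("G", "U"))          # wobble pair is allowed
    print(forms_stem("GCGU", "GCGC"))  # last position is a U-G wobble
```

A pair such as A with A is rejected by this check, even though the following paragraph notes that less common interactions do occur in real structures.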
However, other interactions are possible, such as a group of adenine bases binding to each other in a bulge, or the GNRA tetraloop that has a guanine–adenine base-pair. An important structural component of RNA that distinguishes it from DNA is the presence of a hydroxyl group at the 2' position of the ribose sugar. The presence of this functional group causes the helix to take the A-form geometry, although in single-strand dinucleotide contexts RNA can also adopt the B-form most commonly observed in DNA. The A-form geometry results in a deep and narrow major groove and a shallow and wide minor groove. A second consequence of the presence of the 2'-hydroxyl group is that in conformationally flexible regions of an RNA molecule, it can chemically attack the adjacent phosphodiester bond to cleave the backbone. RNA is transcribed with only four bases, but these bases and attached sugars can be modified in numerous ways as the RNAs mature. Pseudouridine, in which the linkage between uracil and ribose is changed from a C–N bond to a C–C bond, and ribothymidine are found in various places.
Another notable modified base is hypoxanthine, a deaminated adenine base whose nucleoside is called inosine. Inosine plays a key role in the wobble hypothesis of the genetic code. There are more than 100 other naturally occurring modified nucleosides. The greatest structural diversity of modifications can be found in tRNA, while pseudouridine and nucleosides with 2'-O-methylribose, often present in rRNA, are the most common. The specific roles of many of these modifications in RNA are not fully understood. However, it is notable that, in ribosomal RNA, many of the post-transcriptional modifications occur in functional regions, such as the peptidyl transferase center and the subunit interface, implying that they are important for normal function. The functional form of single-stranded RNA molecules, just like that of proteins, requires a specific tertiary structure. The scaffold for this structure is provided by secondary structural elements formed by hydrogen bonds within the molecule. This leads to several recognizable "domains" of secondary structure, such as hairpin loops and internal loops.
Since RNA is charged, metal ions such as Mg2+ are needed to stabilise many secondary and tertiary structures. The naturally occurring enantiomer of RNA is D-RNA, composed of D-ribonucleotides. All chirality centers are located in the D-ribose. By the use of L-ribose, or rather L-ribonucleotides, L-RNA can be synthesized. L-RNA is much more stable against degradation by RNase. As with other structured biopolymers such as proteins, one can define the topology of a folded RNA molecule. This is done based on the arrangement of intra-chain contacts within a folded RNA, termed circuit topology. Synthesis of RNA is catalyzed by an enzyme, RNA polymerase, using DNA as a template, a process known as transcription. Initiation of transcription begins with the binding of the enzyme to a promoter sequence in the DNA. The DNA double helix is unwound by the helicase activity of the enzyme. The enzyme progresses along the template strand in the 3' to 5' direction, synthesizing a complementary RNA molecule with elongation occ
DNA sequencing is the process of determining the nucleic acid sequence – the order of nucleotides in DNA. It includes any method or technology that is used to determine the order of the four bases: adenine, guanine, cytosine and thymine. The advent of rapid DNA sequencing methods has accelerated biological and medical research and discovery. Knowledge of DNA sequences has become indispensable for basic biological research, and in numerous applied fields such as medical diagnosis, forensic biology and biological systematics. The rapid speed of sequencing attained with modern DNA sequencing technology has been instrumental in the sequencing of complete DNA sequences, or genomes, of numerous types and species of life, including the human genome and other complete DNA sequences of many animal and microbial species. The first DNA sequences were obtained in the early 1970s by academic researchers using laborious methods based on two-dimensional chromatography. Following the development of fluorescence-based sequencing methods with a DNA sequencer, DNA sequencing has become easier and orders of magnitude faster.
DNA sequencing may be used to determine the sequence of individual genes, larger genetic regions, full chromosomes, or entire genomes of any organism. DNA sequencing is also the most efficient way to indirectly sequence RNA or proteins. In fact, DNA sequencing has become a key technology in many areas of biology and other sciences such as medicine and anthropology. Sequencing is used in molecular biology to study genomes and the proteins they encode. Information obtained using sequencing allows researchers to identify changes in genes, find associations with diseases and phenotypes, and identify potential drug targets. Since DNA is an informative macromolecule in terms of transmission from one generation to another, DNA sequencing is used in evolutionary biology to study how different organisms are related and how they evolved. The field of metagenomics involves identification of organisms present in a body of water, dirt, debris filtered from the air, or swab samples from organisms. Knowing which organisms are present in a particular environment is critical to research in ecology, epidemiology and other fields.
Sequencing enables researchers to determine which types of microbes may be present in a microbiome, for example. Medical technicians may sequence genes from patients to determine if there is risk of genetic diseases; this is a form of genetic testing. DNA sequencing may be used along with DNA profiling methods for forensic identification and paternity testing. DNA testing has evolved tremendously in the last few decades to link a DNA print to what is under investigation. The DNA patterns in fingerprints, hair follicles, etc. uniquely separate each living organism from another. DNA testing is a technique that can detect specific sequences in a DNA strand to produce a unique and individualized pattern. Every living organism has a one-of-a-kind DNA pattern, which can be determined through DNA testing. It is rare for two people to have the same DNA pattern, which is why DNA testing is successful. The canonical structure of DNA has four bases: thymine, adenine, cytosine and guanine. DNA sequencing is the determination of the physical order of these bases in a molecule of DNA.
However, there are many other bases that may be present in a molecule. In some viruses, cytosine may be replaced by hydroxy methyl glucose cytosine. In mammalian DNA, variant bases with methyl groups or phosphosulfate may be found. Depending on the sequencing technique, a particular modification, e.g. the 5mC (5-methylcytosine) common in humans, may or may not be detected. Deoxyribonucleic acid was first discovered and isolated by Friedrich Miescher in 1869, but it remained understudied for many decades because proteins, rather than DNA, were thought to hold the genetic blueprint to life. This situation changed after 1944 as a result of some experiments by Oswald Avery, Colin MacLeod, and Maclyn McCarty demonstrating that purified DNA could change one strain of bacteria into another. This was the first time that DNA was shown to be capable of transforming the properties of cells. In 1953, James Watson and Francis Crick put forward their double-helix model of DNA, based on crystallized X-ray structures being studied by Rosalind Franklin – and without crediting her. According to the model, DNA is composed of two strands of nucleotides coiled around each other, linked together by hydrogen bonds and running in opposite directions.
Each strand is composed of four complementary nucleotides – adenine, cytosine, guanine and thymine – with an A on one strand always paired with T on the other, and C always paired with G. They proposed that such a structure allowed each strand to be used to reconstruct the other, an idea central to the passing on of hereditary information between generations. The foundation for sequencing proteins was first laid by the work of Frederick Sanger, who by 1955 had completed the sequence of all the amino acids in insulin, a small protein secreted by the pancreas. This provided the first conclusive evidence that proteins were chemical entities with a specific molecular pattern rather than a random mixture of material suspended in fluid. Sanger's success in sequencing insulin electrified x-ray crystallographers, including Watson and Crick, who by now were trying to understand how DNA directed the formation of proteins within a cell. Soon after attending a series of lectures given by Frederick Sanger in October 1954, Crick began to develo
Edman degradation, developed by Pehr Edman, is a method of sequencing amino acids in a peptide. In this method, the amino-terminal residue is labeled and cleaved from the peptide without disrupting the peptide bonds between other amino acid residues. Phenyl isothiocyanate is reacted with an uncharged N-terminal amino group, under mildly alkaline conditions, to form a cyclical phenylthiocarbamoyl derivative. Under acidic conditions, this derivative of the terminal amino acid is cleaved as a thiazolinone derivative. The thiazolinone amino acid is selectively extracted into an organic solvent and treated with acid to form the more stable phenylthiohydantoin (PTH)-amino acid derivative, which can be identified by using chromatography or electrophoresis. This procedure can be repeated to identify the next amino acid. A major drawback to this technique is that the peptides being sequenced in this manner cannot have more than 50 to 60 residues. The peptide length is limited because the cyclical derivatization does not always go to completion.
The derivatization problem can be resolved by cleaving large peptides into smaller peptides before proceeding with the reaction. The method is able to sequence up to 30 amino acids with modern machines capable of over 99% efficiency per amino acid. An advantage of the Edman degradation is that it only uses 10 to 100 picomoles of peptide for the sequencing process. The Edman degradation reaction was automated in 1967 by Edman and Beggs to speed up the process, and 100 automated devices were in use worldwide by 1973. Because the Edman degradation proceeds from the N-terminus of the protein, it will not work if the N-terminus has been chemically modified. Sequencing will stop if a non-α-amino acid is encountered, since the favored five-membered ring intermediate is unable to be formed. Edman degradation is not useful for determining the positions of disulfide bridges. It requires peptide amounts of 1 picomole or above for discernible results. Following 2D SDS PAGE, the proteins can be transferred to a polyvinylidene difluoride (PVDF) blotting membrane for further analysis.
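The link between per-cycle efficiency and the practical length limit can be shown with a short calculation. This sketch assumes a simple model in which each cycle independently succeeds with the same probability, so the fraction of peptide still in phase after n cycles is the efficiency raised to the power n; the function name is illustrative.

```python
def cumulative_yield(per_cycle_efficiency: float, cycles: int) -> float:
    """Fraction of peptide molecules still in phase after `cycles` rounds.

    Assumes every cycle has the same, independent efficiency, which is a
    simplification of real instrument behaviour.
    """
    if not 0.0 <= per_cycle_efficiency <= 1.0:
        raise ValueError("efficiency must be between 0 and 1")
    if cycles < 0:
        raise ValueError("cycles must be non-negative")
    return per_cycle_efficiency ** cycles


if __name__ == "__main__":
    for n in (10, 30, 60):
        print(f"{n:2d} cycles at 99%: {cumulative_yield(0.99, n):.1%} in phase")
```

Even at 99% per residue, roughly a quarter of the signal is out of phase by residue 30 and close to half by residue 60, which is consistent with the length limits quoted above.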
Edman degradations can be performed directly from a PVDF membrane. N-terminal residue sequencing resulting in five to ten amino acids may be sufficient to identify a protein of interest. See also: Bergmann degradation; Dansyl chloride.
Deoxyribonucleic acid (DNA) is a molecule composed of two chains that coil around each other to form a double helix carrying the genetic instructions used in the growth, development and reproduction of all known organisms and many viruses. DNA and ribonucleic acid (RNA) are nucleic acids. The two DNA strands are known as polynucleotides, as they are composed of simpler monomeric units called nucleotides. Each nucleotide is composed of one of four nitrogen-containing nucleobases, a sugar called deoxyribose, and a phosphate group. The nucleotides are joined to one another in a chain by covalent bonds between the sugar of one nucleotide and the phosphate of the next, resulting in an alternating sugar-phosphate backbone. The nitrogenous bases of the two separate polynucleotide strands are bound together, according to base pairing rules, with hydrogen bonds to make double-stranded DNA. The complementary nitrogenous bases are divided into two groups, pyrimidines and purines. In DNA, the pyrimidines are thymine and cytosine; the purines are adenine and guanine. Both strands of double-stranded DNA store the same biological information.
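Because the base pairing rules are fixed (A with T, C with G), either strand determines the other, which is the sense in which both strands store the same information. A minimal sketch of that reconstruction follows; the function name is illustrative, and only the four canonical bases are handled.

```python
# Translation table implementing the A-T and C-G pairing rules.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(strand: str) -> str:
    """Return the partner strand, written 5'->3', for a strand given 5'->3'.

    Each base is complemented and the result is reversed, because the two
    strands of the double helix run in opposite directions.
    """
    strand = strand.upper()
    if set(strand) - set("ACGT"):
        raise ValueError("only the canonical bases A, C, G, T are supported")
    return strand.translate(_COMPLEMENT)[::-1]


if __name__ == "__main__":
    top = "ATGC"
    bottom = reverse_complement(top)
    print(top, bottom, reverse_complement(bottom))
```

Applying the function twice returns the original sequence, which mirrors how each strand can serve as the template for rebuilding its partner.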
This information is replicated as and when the two strands separate. A large part of DNA is non-coding, meaning that these sections do not serve as patterns for protein sequences. The two strands of DNA run in opposite directions to each other and are thus antiparallel. Attached to each sugar is one of four types of nucleobases. It is the sequence of these four nucleobases along the backbone that encodes genetic information. RNA strands are created using DNA strands as a template in a process called transcription. Under the genetic code, these RNA strands specify the sequence of amino acids within proteins in a process called translation. Within eukaryotic cells, DNA is organized into long structures called chromosomes. Before typical cell division, these chromosomes are duplicated in the process of DNA replication, providing a complete set of chromosomes for each daughter cell. Eukaryotic organisms store most of their DNA inside the cell nucleus as nuclear DNA, and some in the mitochondria as mitochondrial DNA or in chloroplasts as chloroplast DNA. In contrast, prokaryotes store their DNA only in circular chromosomes.
Within eukaryotic chromosomes, chromatin proteins, such as histones, compact and organize DNA. These compacting structures guide the interactions between DNA and other proteins, helping control which parts of the DNA are transcribed. DNA was first isolated by Friedrich Miescher in 1869. Its molecular structure was first identified by Francis Crick and James Watson at the Cavendish Laboratory within the University of Cambridge in 1953, whose model-building efforts were guided by X-ray diffraction data acquired by Raymond Gosling, a post-graduate student of Rosalind Franklin. DNA is used by researchers as a molecular tool to explore physical laws and theories, such as the ergodic theorem and the theory of elasticity. The unique material properties of DNA have made it an attractive molecule for material scientists and engineers interested in micro- and nano-fabrication. Among notable advances in this field are DNA origami and DNA-based hybrid materials. DNA is a long polymer made from repeating units called nucleotides.
The structure of DNA is dynamic along its length, being capable of coiling into tight loops and other shapes. In all species it is composed of two helical chains, bound to each other by hydrogen bonds. Both chains are coiled around the same axis and have the same pitch of 34 angstroms. The pair of chains has a radius of 10 angstroms. According to another study, when measured in a different solution, the DNA chain measured 22 to 26 angstroms wide, and one nucleotide unit measured 3.3 Å long. Although each individual nucleotide is small, a DNA polymer can be large and contain hundreds of millions of nucleotides, such as in chromosome 1. Chromosome 1 is the largest human chromosome, with 220 million base pairs, and would be 85 mm long if straightened. DNA does not usually exist as a single strand, but instead as a pair of strands that are held together. These two long strands coil in the shape of a double helix. The nucleotide contains both a segment of the backbone of the molecule and a nucleobase. A nucleobase linked to a sugar is called a nucleoside, and a base linked to a sugar and to one or more phosphate groups is called a nucleotide.
A biopolymer comprising multiple linked nucleotides is called a polynucleotide. The backbone of the DNA strand is made from alternating phosphate and sugar residues. The sugar in DNA is 2-deoxyribose, a pentose sugar. The sugars are joined together by phosphate groups that form phosphodiester bonds between the third and fifth carbon atoms of adjacent sugar rings. These are known as the 3′-end and 5′-end carbons, the prime symbol being used to distinguish these carbon atoms from those of the base to which the deoxyribose forms a glycosidic bond. When imagining DNA, each phosphoryl is considered to "belong" to the nucleotide whose 5′ carbon forms a bond therewith. Any DNA strand therefore has one end at which there is a phosphoryl attached to the 5′ carbon of a ribose and another end a
Transcription is the first step of gene expression, in which a particular segment of DNA is copied into RNA by the enzyme RNA polymerase. Both DNA and RNA are nucleic acids. During transcription, a DNA sequence is read by an RNA polymerase, which produces a complementary, antiparallel RNA strand called a primary transcript. Transcription proceeds in the following general steps: RNA polymerase, together with one or more general transcription factors, binds to promoter DNA. RNA polymerase creates a transcription bubble; this is done by breaking the hydrogen bonds between complementary DNA nucleotides. RNA polymerase adds RNA nucleotides. RNA sugar-phosphate backbone forms with assistance from RNA polymerase to form an RNA strand. Hydrogen bonds of the RNA–DNA helix break, freeing the newly synthesized RNA strand. If the cell has a nucleus, the RNA may be further processed; this may include polyadenylation and splicing. The RNA may exit to the cytoplasm through the nuclear pore complex; the stretch of DNA transcribed into an RNA molecule is called a transcription unit and encodes at least one gene.
If the gene encodes a protein, the transcription produces messenger RNA. Alternatively, the transcribed gene may encode non-coding RNA such as microRNA, ribosomal RNA, transfer RNA, or enzymatic RNA molecules called ribozymes. Overall, RNA helps synthesize and process proteins. In virology, the term transcription may also be used when referring to mRNA synthesis from an RNA molecule. For instance, the genome of a negative-sense single-stranded RNA virus may be the template for a positive-sense single-stranded RNA. This is because the positive-sense strand contains the information needed to translate the viral proteins for viral replication afterwards. This process is catalyzed by a viral RNA replicase. A DNA transcription unit encoding a protein may contain both a coding sequence, which will be translated into the protein, and regulatory sequences, which direct and regulate the synthesis of that protein. The regulatory sequence before the coding sequence is called the five prime untranslated region. As opposed to DNA replication, transcription results in an RNA complement that includes the nucleotide uracil in all instances where thymine would have occurred in a DNA complement.
Only one of the two DNA strands serves as a template for transcription. The antisense strand of DNA is read by RNA polymerase from the 3' end to the 5' end during transcription. The complementary RNA is created in the opposite direction, in the 5' → 3' direction, matching the sequence of the sense strand with the exception of switching uracil for thymine. This directionality is because RNA polymerase can only add nucleotides to the 3' end of the growing mRNA chain. This use of only the 3' → 5' DNA strand eliminates the need for the Okazaki fragments that are seen in DNA replication. It also removes the need for an RNA primer to initiate RNA synthesis, as is the case in DNA replication. The non-template strand of DNA is called the coding strand, because its sequence is the same as the newly created RNA transcript (except for the substitution of uracil for thymine). This is the strand that is used by convention when presenting a DNA sequence. Transcription has some proofreading mechanisms, but they are fewer and less effective than the controls for copying DNA.
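The relationship between the template strand, the coding strand and the transcript can be sketched in a few lines. This is an illustrative model of the base-level bookkeeping only, not of the enzyme; the function names are hypothetical and only the four canonical DNA bases are handled.

```python
# DNA template base -> RNA base added opposite it.
_TEMPLATE_TO_RNA = {"A": "U", "T": "A", "G": "C", "C": "G"}


def transcribe_template(template_3_to_5: str) -> str:
    """Build the RNA (written 5'->3') from a template written 3'->5'.

    The polymerase reads the template 3'->5' and extends the RNA 5'->3',
    so reading the template string left to right yields the RNA left to right.
    """
    return "".join(_TEMPLATE_TO_RNA[base] for base in template_3_to_5.upper())


def coding_to_rna(coding_5_to_3: str) -> str:
    """RNA implied by the coding (sense) strand: same sequence with U for T."""
    return coding_5_to_3.upper().replace("T", "U")


if __name__ == "__main__":
    coding = "ATGGCC"     # 5'-ATGGCC-3'
    template = "TACCGG"   # 3'-TACCGG-5', base-paired with the coding strand
    print(transcribe_template(template), coding_to_rna(coding))
```

Both routes give the same transcript, which is why sequences are conventionally reported as the coding strand.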
As a result, transcription has a lower copying fidelity than DNA replication. Transcription is divided into initiation, promoter escape, elongation and termination. Transcription begins with the binding of RNA polymerase, together with one or more general transcription factors, to a specific DNA sequence referred to as a "promoter" to form an RNA polymerase-promoter "closed complex". In the "closed complex" the promoter DNA is still double-stranded. RNA polymerase, assisted by one or more general transcription factors, unwinds 14 base pairs of DNA to form an RNA polymerase-promoter "open complex". In the "open complex" the promoter DNA is unwound and single-stranded. The exposed, single-stranded DNA is referred to as the "transcription bubble." RNA polymerase, assisted by one or more general transcription factors, selects a transcription start site in the transcription bubble, binds to an initiating NTP and an extending NTP complementary to the transcription start site sequence, and catalyzes bond formation to yield an initial RNA product.
In bacteria, RNA polymerase core enzyme consists of five subunits: 2 α subunits, 1 β subunit, 1 β' subunit, and 1 ω subunit. In bacteria, there is one general RNA transcription factor: sigma. RNA polymerase core enzyme binds to the bacterial general transcription factor sigma to form RNA polymerase holoenzyme, which then binds to a promoter. In archaea and eukaryotes, RNA polymerase contains subunits homologous to each of the five RNA polymerase subunits in bacteria and also contains additional subunits. In archaea and eukaryotes, the functions of the bacterial general transcription factor sigma are performed by multiple general transcription factors that work together. In archaea, there ar