1.
G banding
–
G-banding, G banding, or Giemsa banding is a technique used in cytogenetics to produce a visible karyotype by staining condensed chromosomes. It is useful for identifying genetic diseases through the representation of the entire chromosome complement. The metaphase chromosomes are treated with trypsin and stained with Giemsa stain. Heterochromatic regions, which tend to be rich in adenine and thymine (AT-rich DNA) and relatively gene-poor, stain more darkly in G-banding. The bands are numbered on each arm of the chromosome from the centromere to the telomere. This numbering system allows any band on the chromosome to be identified and described precisely. The reverse of the G-band pattern is obtained in R-banding. Banding can be used to identify chromosomal abnormalities, such as translocations, because each chromosome has a unique pattern of light and dark bands. Chromosomes are difficult to identify and group based on simple staining because their uniform colour makes it hard to differentiate between them. Therefore, techniques like G-banding were developed that made bands appear on the chromosomes. These bands were the same in appearance on homologous chromosomes, so identification became easier and more accurate. The less condensed the chromosomes are, the more bands appear when G-banding. This means that the different chromosomes are more distinct in prophase than they are in metaphase.
2.
Karyotype
–
A karyotype is the number and appearance of chromosomes in the nucleus of a eukaryotic cell. The term is also used for the complete set of chromosomes in a species or in an individual organism. Karyotypes describe the chromosome count of an organism and what these chromosomes look like under a light microscope. Attention is paid to their length, the position of the centromeres, banding pattern, any differences between the sex chromosomes, and any other physical characteristics. The preparation and study of karyotypes is part of cytogenetics. The study of whole sets of chromosomes is sometimes known as karyology. The chromosomes are depicted in a format known as a karyogram or idiogram: in pairs, ordered by size. The basic number of chromosomes in the somatic cells of an individual or a species is called the somatic number and is designated 2n. In the germ-line the chromosome number is n. Thus, in normal diploid organisms, autosomal chromosomes are present in two copies. There may, or may not, be sex chromosomes. Polyploid cells have multiple copies of chromosomes and haploid cells have single copies. The study of karyotypes is important for cell biology and genetics. Karyotypes can be used for many purposes, such as to study chromosomal aberrations, cellular function, and taxonomic relationships, and to gather information about past evolutionary events. Chromosomes were first observed in plant cells by Carl Wilhelm von Nägeli in 1842, and their behavior in animal cells was described by Walther Flemming, the discoverer of mitosis, in 1882. The name chromosome was coined by another German anatomist, Heinrich von Waldeyer, in 1888. The word karyotype is New Latin from Ancient Greek κάρυον (karyon: kernel, seed, or nucleus) and τύπος (typos: general form). The next stage took place after the development of genetics in the early 20th century. Lev Delaunay seems to have been the first person to define the karyotype as the phenotypic appearance of the somatic chromosomes, in contrast to their genic contents.
The subsequent history of the concept can be followed in the works of C. D. Darlington. Investigation into the human karyotype took many years to settle the most basic question: how many chromosomes does a normal diploid human cell contain? In 1912, Hans von Winiwarter reported 47 chromosomes in spermatogonia and 48 in oogonia, concluding an XX/XO sex determination mechanism. Painter in 1922 was not certain whether the diploid number of humans was 46 or 48, at first favouring 46, but he later revised his opinion from 46 to 48. Considering the techniques of the time, these results were remarkable. In textbooks, the number of human chromosomes remained at 48 for over thirty years. New techniques were needed to correct this error. The work took place in 1955 and was published in 1956: the karyotype of humans includes only 46 chromosomes. The other great apes, by contrast, have 48 chromosomes.
3.
Base pair
–
A base pair is a unit consisting of two nucleobases bound to each other by hydrogen bonds. Base pairs form the building blocks of the DNA double helix. Dictated by specific hydrogen bonding patterns, Watson-Crick base pairs allow the DNA helix to maintain a regular helical structure that is dependent on its nucleotide sequence. The complementary nature of this structure provides a backup copy of all genetic information encoded within double-stranded DNA. Many DNA-binding proteins can recognize specific base pairing patterns that identify particular regulatory regions of genes. Intramolecular base pairs can occur within single-stranded nucleic acids. The size of a gene or an organism's entire genome is often measured in base pairs because DNA is usually double-stranded. Hence, the number of base pairs is equal to the number of nucleotides in one of the strands. The haploid human genome is estimated to be about 3.2 billion bases long and to contain 20,000–25,000 distinct protein-coding genes. A kilobase is a unit of measurement in molecular biology equal to 1000 base pairs of DNA or RNA. The total amount of related DNA base pairs on Earth is estimated at 5.0 × 10^37. In comparison, the total mass of the biosphere has been estimated to be as much as 4 TtC (trillion tons of carbon). Hydrogen bonding is the interaction that underlies the base-pairing rules. Appropriate geometrical correspondence of hydrogen donors and acceptors allows only the right pairs to form stably. Purine-pyrimidine base pairing of AT or GC or UA results in proper duplex structure. The only other purine-pyrimidine pairings would be AC and GT and UG; these pairings are mismatches because the patterns of hydrogen donors and acceptors do not correspond. The GU pairing, with two hydrogen bonds, does occur fairly often in RNA. Higher GC content results in higher melting temperatures. Conversely, regions of a genome that need to separate frequently, for example the promoter regions for often-transcribed genes, are comparatively GC-poor.
GC content and melting temperature must also be taken into account when designing primers for PCR reactions. When paired double-stranded sequences are written out, the top strand is by convention written from the 5′ end to the 3′ end; thus, the bottom strand is written 3′ to 5′. Chemical analogs of nucleotides can take the place of proper nucleotides because of their isosteric chemistry. One common mutagenic base analog is 5-bromouracil, which resembles thymine. Intercalating agents are another class of mutagen; most intercalators are large polyaromatic compounds and are known or suspected carcinogens. Examples include ethidium bromide and acridine. An unnatural base pair is a designed subunit of DNA which is created in a laboratory and does not occur in nature.
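The pairing rules and the link between GC content and melting temperature can be illustrated with a short Python sketch. The function names are hypothetical, and the melting-temperature estimate uses the Wallace rule of thumb (2 °C per A or T, 4 °C per G or C), which is an assumption brought in for illustration, is not taken from the text above, and is only a rough guide for short oligonucleotides.

```python
# Minimal sketch: Watson-Crick complementarity, GC content, and a rough
# melting-temperature estimate for a short DNA primer.

# Translation table mapping each base to its Watson-Crick partner.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the partner strand, written 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def gc_content(seq: str) -> float:
    """Fraction of bases in the sequence that are G or C."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def wallace_tm(seq: str) -> int:
    """Rule-of-thumb melting temperature in degrees Celsius (assumption:
    Wallace rule, 2 per A/T and 4 per G/C; short oligos only)."""
    seq = seq.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2 * at + 4 * gc


if __name__ == "__main__":
    primer = "ATGCGC"
    print(reverse_complement(primer), gc_content(primer), wallace_tm(primer))
```

Under this estimate a GC-rich primer always scores a higher melting temperature than an AT-rich primer of the same length, which is the qualitative behaviour described above.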
4.
Reference genome
–
A reference genome is a digital nucleic acid sequence database, assembled by scientists as a representative example of a species' set of genes. As they are assembled from the sequencing of DNA from a number of donors, reference genomes do not accurately represent the set of genes of any single individual. Instead, a reference provides a haploid mosaic of different DNA sequences from each donor. For example, GRCh37, the Genome Reference Consortium human genome, is derived from thirteen anonymous volunteers from Buffalo, New York. The ABO blood group system differs among humans, but the reference genome contains only an O allele. As the cost of DNA sequencing falls and new full genome sequencing technologies emerge, more genome sequences continue to be generated. Reference genomes are typically used as a guide on which new genomes are built, enabling them to be assembled much more quickly and cheaply than the initial Human Genome Project. Most individuals with their entire genome sequenced, such as James D. Watson, had their genome assembled in this manner. For much of a genome, the reference provides a good approximation of the DNA of any single individual. For regions where there is known to be large-scale variation, sets of alternate loci are assembled alongside the reference locus. Reference genomes can be accessed online at several locations, using dedicated browsers such as Ensembl or the UCSC Genome Browser. The length of a genome can be measured in different ways. A simple way to measure genome length is to count the number of base pairs in the assembly. The golden path is a measure of length that omits redundant regions such as haplotypes. It is usually constructed by layering sequencing information over a map to combine scaffold information. It is a best estimate of what the genome will look like and typically includes gaps. The GRC continues to improve reference genomes by building new alignments that contain fewer gaps and by fixing misrepresentations in the sequence. The human reference genome GRCh38 was released on 24 December 2013. The previous human reference genome was the nineteenth version.
This build contained around 250 gaps, whereas the first version had about 150,000 gaps.
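The simplest length measure described above, counting the base pairs in an assembly, can be sketched in Python as follows. The sketch assumes the assembly is available as FASTA text and that gaps are written as runs of the letter N, a common convention that the text above does not itself specify. The function names and the toy sequences in the usage example are hypothetical.

```python
# Minimal sketch: total length, gap count, and ungapped length of an
# assembly supplied as FASTA text. Assumes gaps are encoded as runs of N.
import re


def parse_fasta(text: str) -> dict[str, str]:
    """Parse FASTA text into a mapping of sequence name to sequence."""
    records: dict[str, list[str]] = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Keep only the first word of the header as the sequence name.
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


def assembly_stats(records: dict[str, str]) -> dict[str, int]:
    """Count total bases, the number of N-runs (gaps), and non-N bases."""
    total = sum(len(seq) for seq in records.values())
    gap_count = sum(len(re.findall(r"[Nn]+", seq)) for seq in records.values())
    gap_bases = sum(seq.upper().count("N") for seq in records.values())
    return {
        "total_bases": total,
        "gap_count": gap_count,
        "ungapped_bases": total - gap_bases,
    }


if __name__ == "__main__":
    toy = ">chrA demo\nACGTNNNNACGT\nNNAC\n>chrB\nGGGG\n"
    print(assembly_stats(parse_fasta(toy)))
```

A real assembly is far too large to hold comfortably in a single string per chromosome on modest hardware, so a production version would stream the file rather than join it in memory.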
5.
Gene
–
A gene is a locus of DNA which is made up of nucleotides and is the molecular unit of heredity. The transmission of genes to an offspring is the basis of the inheritance of phenotypic traits. These genes make up different DNA sequences called genotypes. Genotypes, along with environmental and developmental factors, determine what the phenotypes will be. Most biological traits are under the influence of polygenes as well as gene–environment interactions. Genes can acquire mutations in their sequence, leading to different variants, known as alleles, in the population. These alleles encode slightly different versions of a protein, which cause different phenotypic traits. Usage of the term "having a gene" typically refers to containing a different allele of the same, shared gene. Genes evolve due to natural selection or survival of the fittest of the alleles. The concept of a gene continues to be refined as new phenomena are discovered. For example, regulatory regions of a gene can be far removed from its coding regions, some viruses store their genome in RNA instead of DNA, and some gene products are functional non-coding RNAs. The existence of discrete inheritable units was first suggested by Gregor Mendel. From 1857 to 1864, in Brno, he studied inheritance patterns in 8000 common edible pea plants, tracking distinct traits from parent to offspring. He described these mathematically as 2^n combinations, where n is the number of differing characteristics in the original peas. Although he did not use the term gene, he explained his results in terms of discrete inherited units that give rise to observable physical characteristics. This description prefigured the distinction between genotype and phenotype. Charles Darwin developed a theory of inheritance he termed pangenesis, from Greek pan and genesis / genos.
Darwin used the term gemmule to describe hypothetical particles that would mix during reproduction. De Vries called these units pangenes, after Darwin's 1868 pangenesis theory. In 1909 the Danish botanist Wilhelm Johannsen shortened the name to gene. Advances in understanding genes and inheritance continued throughout the 20th century. Deoxyribonucleic acid (DNA) was shown to be the repository of genetic information by experiments in the 1940s to 1950s. In the early 1950s the prevailing view was that the genes in a chromosome acted like discrete entities, indivisible by recombination. Collectively, the research of this period established the central dogma of molecular biology, which states that proteins are translated from RNA, which is transcribed from DNA. This dogma has since been shown to have exceptions, such as reverse transcription in retroviruses. The modern study of genetics at the level of DNA is known as molecular genetics. In 1972, Walter Fiers and his team at the University of Ghent were the first to determine the sequence of a gene: the gene for bacteriophage MS2 coat protein. The subsequent development of chain-termination DNA sequencing in 1977 by Frederick Sanger improved the efficiency of sequencing. An automated version of the Sanger method was used in early phases of the Human Genome Project. The theories developed in the 1930s and 1940s to integrate molecular genetics with Darwinian evolution are called the evolutionary synthesis.
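The central dogma described above (DNA is transcribed to RNA, which is translated to protein) can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: transcription is modelled as replacing T with U on the coding strand, translation starts at the first base and stops at the first stop codon, and the standard genetic code is used. The function names are hypothetical.

```python
# Minimal sketch of the central dogma: coding-strand DNA -> mRNA -> protein.
from itertools import product

# Standard genetic code, with codons enumerated in U, C, A, G order for each
# of the three positions; "*" marks a stop codon.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_dna: str) -> str:
    """Model transcription as T -> U on the coding (sense) strand."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translate codon by codon from position 0 until a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    mrna = transcribe("ATGTTTGGCTAA")
    print(mrna, translate(mrna))
```

Reverse transcription, mentioned above as an exception to the dogma, would correspond to the inverse of `transcribe`; it is left out to keep the sketch short.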
6.
Consensus CDS Project
–
The Consensus Coding Sequence (CCDS) Project is a collaborative effort to maintain a dataset of protein-coding regions that are identically annotated on the human and mouse reference genome assemblies. The integrity of the CCDS dataset is maintained through stringent quality assurance testing and ongoing manual curation. Biological and biomedical research has come to rely on accurate and consistent annotation of genes and their products on genome assemblies. Reference annotations of genomes are available from various sources, each with their own independent goals and policies. The CCDS gene sets that have been arrived at by consensus of the different partners now consist of over 18,000 human genes. The CCDS dataset represents more alternative splicing events with each new release. A combination of manual and automated genome annotations provided by NCBI and Ensembl are compared to identify annotations with matching genomic coordinates. In order to ensure that CDSs are of high quality, multiple quality assurance (QA) tests are performed. All tests are performed following the annotation comparison step of each CCDS build and are independent of individual annotation group QA tests performed prior to the annotation comparison. Annotations that fail QA tests undergo a round of manual checking that may improve results or reach a decision to reject annotation matches based on QA failure. The CCDS database is unique in that the review process must be carried out by multiple collaborators. This is made possible with a coordination system that includes a work process flow and forums for analysis. When a collaborating CCDS group member identifies a CCDS ID that may need review, coordinated manual curation is supported by a restricted-access website and a discussion e-mail list. CCDS curation guidelines were established to address specific conflicts that were observed at a higher frequency.
These standards address specific problem areas, are not a comprehensive set of annotation guidelines, and do not restrict the annotation policies of any collaborating group. Curation occurs continuously, and any of the collaborating centers can flag a CCDS ID as a potential update or withdrawal. Conflicting opinions are addressed by consulting with experts or other annotation curation groups such as the HUGO Gene Nomenclature Committee. If a conflict cannot be resolved, then the collaborators agree to withdraw the CCDS ID until more information becomes available. Nonsense-mediated decay (NMD) is the most powerful mRNA surveillance process. NMD eliminates defective mRNA before it can be translated into protein. This is important because if the defective mRNA is translated, the resulting protein may cause disease.
7.
Autosome
–
An autosome is a chromosome that is not an allosome (a sex chromosome). Autosomes appear in pairs whose members have the same form but differ from other pairs in a diploid cell, whereas members of an allosome pair may differ from one another. The DNA in autosomes is collectively known as atDNA or auDNA. For example, humans have a diploid genome that usually contains 22 pairs of autosomes and one allosome pair. The autosome pairs are labeled with numbers roughly in order of their sizes in base pairs. By contrast, the allosome pair consists of two X chromosomes in females or one X and one Y chromosome in males. Autosomes still contain sexual determination genes even though they are not sex chromosomes. For example, the SRY gene on the Y chromosome encodes the transcription factor TDF and is vital for male sex determination during development. TDF functions by activating the SOX9 gene on chromosome 17, so mutations of the SOX9 gene can cause humans with a Y chromosome to develop as females. All human autosomes have been identified and mapped by extracting the chromosomes from a cell arrested in metaphase or prometaphase. These chromosomes are typically viewed as karyograms for easy comparison. Clinical geneticists can compare the karyogram of an individual to a reference karyogram to discover the basis of certain phenotypes. For example, the karyogram of someone with Patau syndrome would show that they possess three copies of chromosome 13. Karyograms and staining techniques can only detect large-scale disruptions to chromosomes; chromosomal aberrations smaller than a few million base pairs generally cannot be seen on a karyogram. Autosomal genetic disorders can arise due to a number of causes. Autosomal genetic disorders which exhibit Mendelian inheritance can be inherited in either an autosomal dominant or an autosomal recessive fashion. These disorders manifest in and are passed on by either sex with equal frequency.
Autosomal dominant disorders are often present in both parent and child, as the child needs to inherit only one copy of the deleterious allele to manifest the disease. Autosomal recessive diseases, however, require two copies of the allele for the disease to manifest. Autosomal aneuploidy can also result in disease conditions. Aneuploidy of autosomes is not well tolerated and usually results in miscarriage of the developing fetus. Possessing a single copy of an autosome is nearly always incompatible with life. Having three copies of an autosome is far more compatible with life, however. A common example is Down syndrome, which is caused by possessing three copies of chromosome 21 instead of the usual two. Partial aneuploidy can also occur as a result of unbalanced translocations during meiosis. Deletions of part of a chromosome cause partial monosomies, while duplications can cause partial trisomies. If the duplication or deletion is large enough, it can be discovered by analyzing a karyogram of the individual. Autosomal translocations can be responsible for a number of diseases, ranging from cancer to schizophrenia. Unlike single gene disorders, diseases caused by aneuploidy are the result of improper gene dosage, not nonfunctional gene product.
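The difference between autosomal dominant and autosomal recessive inheritance can be made concrete with a small Punnett-square calculation. The following Python sketch is illustrative only: it assumes a single gene with two alleles, writes the dominant allele as "A" and the recessive allele as "a", and assumes each parental allele is passed on with equal probability. The function names are hypothetical.

```python
# Minimal sketch: offspring genotype probabilities for one autosomal gene,
# and the chance that a child is affected under each inheritance mode.
from collections import Counter
from fractions import Fraction
from itertools import product


def offspring_genotypes(parent1: str, parent2: str) -> dict[str, Fraction]:
    """Enumerate the Punnett square for two genotypes such as 'Aa' and 'aa'."""
    # Sorting puts the uppercase allele first, so 'aA' and 'Aa' are merged.
    counts = Counter("".join(sorted(a + b)) for a, b in product(parent1, parent2))
    total = sum(counts.values())
    return {genotype: Fraction(n, total) for genotype, n in counts.items()}


def affected_probability(parent1: str, parent2: str, mode: str) -> Fraction:
    """Probability that a child manifests the disease.

    mode='dominant':  the disease allele is 'A'; one copy is enough.
    mode='recessive': the disease allele is 'a'; two copies are required.
    """
    dist = offspring_genotypes(parent1, parent2)
    if mode == "dominant":
        return sum((p for g, p in dist.items() if "A" in g), Fraction(0))
    if mode == "recessive":
        return sum((p for g, p in dist.items() if g == "aa"), Fraction(0))
    raise ValueError("mode must be 'dominant' or 'recessive'")


if __name__ == "__main__":
    print(affected_probability("Aa", "aa", "dominant"))
    print(affected_probability("Aa", "Aa", "recessive"))
```

Because the gene is autosomal, the calculation does not depend on the sex of either parent or of the child, which matches the equal-frequency statement above.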
8.
Centromere
–
The centromere is the part of a chromosome that links sister chromatids (a dyad). During mitosis, spindle fibers attach to the centromere via the kinetochore. Centromeres were first thought to be genetic loci that direct the behavior of chromosomes. There are, broadly speaking, two types of centromeres. Point centromeres bind to specific proteins that recognise particular DNA sequences with high efficiency. Any piece of DNA with the point centromere DNA sequence on it will form a centromere if present in the appropriate species. The best characterised point centromeres are those of the budding yeast. Regional centromeres is the term coined to describe most centromeres, which typically form on regions of preferred DNA sequence, but which can form on other DNA sequences as well. The signal for formation of a regional centromere appears to be epigenetic. Most organisms, ranging from the fission yeast Schizosaccharomyces pombe to humans, have regional centromeres. Regarding mitotic chromosome structure, centromeres represent a region of the chromosome where two identical sister chromatids are most closely in contact. When cells enter mitosis, the chromatids are linked along their length by the action of the cohesin complex. Each chromosome has two arms, labeled p (the short arm) and q (the long arm). Many remember that the short arm p is named for the French word petit, meaning small, although this explanation was shown to be apocryphal. The arms can be connected in a metacentric, submetacentric, acrocentric or telocentric manner. Metacentric chromosomes are X-shaped, with the centromere in the middle so that the two arms of the chromosome are almost equal. A chromosome is metacentric if its two arms are roughly equal in length. In a normal human karyotype, five chromosomes are considered metacentric: chromosomes 1, 3, 16, 19 and 20. In some cases, a metacentric chromosome is formed by balanced translocation.
If the arms' lengths are unequal, the chromosome is said to be submetacentric. If the p arm is so short that it is hard to observe, but still present, then the chromosome is acrocentric. The human genome includes six acrocentric chromosomes: 13, 14, 15, 21, 22 and the Y chromosome. The domestic horse genome includes one metacentric chromosome that is homologous to two acrocentric chromosomes in the conspecific but undomesticated Przewalski's horse. A telocentric chromosome's centromere is located at the terminal end of the chromosome. Telomeres may extend from both ends of the chromosome. For example, the standard house mouse karyotype has only telocentric chromosomes. Humans do not possess telocentric chromosomes. If the chromosome's centromere is located closer to its end than to its center, it may be described as subtelocentric.
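The four categories above can be expressed as a rule on the ratio of the long arm (q) to the short arm (p). The Python sketch below is only an illustration: the text gives no numeric cut-offs, so the two thresholds `meta_max` and `acro_min` are hypothetical, adjustable parameters rather than an established standard, and the function name is likewise an assumption.

```python
# Minimal sketch: classify a chromosome by centromere position from the
# lengths of its p (short) and q (long) arms, in any consistent unit.


def classify_centromere(
    p_len: float,
    q_len: float,
    meta_max: float = 1.7,   # hypothetical cut-off: arms "almost equal"
    acro_min: float = 7.0,   # hypothetical cut-off: p arm "hard to observe"
) -> str:
    """Return metacentric, submetacentric, acrocentric, or telocentric."""
    if q_len <= 0 or p_len < 0 or p_len > q_len:
        raise ValueError("expected 0 <= p_len <= q_len and q_len > 0")
    if p_len == 0:
        # No short arm at all: the centromere sits at the chromosome end.
        return "telocentric"
    ratio = q_len / p_len
    if ratio <= meta_max:
        return "metacentric"
    if ratio >= acro_min:
        return "acrocentric"
    return "submetacentric"


if __name__ == "__main__":
    for p, q in [(50, 50), (30, 60), (5, 60), (0, 60)]:
        print(p, q, classify_centromere(p, q))
```

A subtelocentric class, mentioned at the end of the passage, could be added as a further band between the submetacentric and acrocentric cut-offs.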
9.
HUGO Gene Nomenclature Committee
–
The HUGO Gene Nomenclature Committee (HGNC) is a committee of the Human Genome Organisation (HUGO) that sets the standards for human gene nomenclature. The HGNC approves a unique and meaningful name for every known human gene. In addition to the name, which is usually 1 to 10 words long, the HGNC also assigns a symbol to every gene. As with an SI symbol, a gene symbol is like an abbreviation but is more than that. It may not necessarily stand for the initials of the name. Gene abbreviations and symbols especially, but also full gene names, are often not specific to a single gene. A marked example is CAP, which can refer to any of 6 different genes. The HGNC short gene names, or gene symbols, unlike previously used or published symbols, are specifically assigned to one gene only. This can result in less common abbreviations being selected but reduces confusion as to which gene is referred to. The full description of the HGNC's nomenclature guidelines can be found on its web site. HGNC advocates the suffixes _v1, _v2, and so on to distinguish between different splice variants, and _pr1, _pr2, and so on for promoter variants of a single gene. HGNC also states that gene nomenclature should evolve with new technology rather than be restrictive. HGNC also coordinates with the related Mouse and Rat Genomic Nomenclature Committees, other database curators, and experts for given specific gene families or sets of genes. The HGNC aims to change a name only if agreement for that change can be reached among a majority of researchers working on that gene.
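The suffix convention described above (_v1, _v2 for splice variants and _pr1, _pr2 for promoter variants) is regular enough to parse mechanically. The Python sketch below shows one way to do so. It is an illustration only, not an HGNC tool: the function name, the returned dictionary layout, and the example symbol GENE1 are all hypothetical.

```python
# Minimal sketch: split a variant label such as "GENE1_v2" into the gene
# symbol, the variant kind (splice or promoter), and the variant number.
import re

# Symbol, then "_v<digits>" (splice variant) or "_pr<digits>" (promoter variant).
_VARIANT_LABEL = re.compile(r"^(?P<symbol>.+?)_(?P<kind>v|pr)(?P<number>\d+)$")


def parse_variant_label(label: str) -> dict:
    """Return the symbol plus variant kind and number, or None for both
    when the label carries no recognised suffix."""
    match = _VARIANT_LABEL.match(label)
    if match is None:
        return {"symbol": label, "kind": None, "number": None}
    kind = "splice" if match.group("kind") == "v" else "promoter"
    return {
        "symbol": match.group("symbol"),
        "kind": kind,
        "number": int(match.group("number")),
    }


if __name__ == "__main__":
    for label in ("GENE1_v2", "GENE1_pr1", "GENE1"):
        print(parse_variant_label(label))
```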
10.
UniProt
–
UniProt is a freely accessible database of protein sequence and functional information, many entries being derived from genome sequencing projects. It contains a large amount of information about the biological function of proteins derived from the research literature. The UniProt consortium comprises the European Bioinformatics Institute (EBI), the Swiss Institute of Bioinformatics (SIB), and the Protein Information Resource (PIR). EBI, located at the Wellcome Trust Genome Campus in Hinxton, UK, hosts a large resource of bioinformatics databases and services. SIB, located in Geneva, Switzerland, maintains the ExPASy servers that are a resource for proteomics tools. In 2002, EBI, SIB, and PIR joined forces as the UniProt consortium. Each consortium member is heavily involved in protein database maintenance and annotation. Before the merger, EBI and SIB together produced the Swiss-Prot and TrEMBL databases. These databases coexisted with differing protein sequence coverage and annotation priorities. Swiss-Prot aimed to provide reliable protein sequences associated with a high level of annotation. Recognizing that sequence data were being generated at a pace exceeding Swiss-Prot's ability to keep up, TrEMBL was created to provide automated annotation for proteins not in Swiss-Prot. Meanwhile, PIR maintained the PIR-PSD and related databases, including iProClass, a database of protein sequences and curated families. The consortium members pooled their resources and expertise, and launched UniProt in December 2003. UniProt provides four core databases, among them UniProtKB, UniParc, and UniRef. The UniProt Knowledgebase (UniProtKB) is a protein database partially curated by experts, consisting of two sections: UniProtKB/Swiss-Prot and UniProtKB/TrEMBL. As of 19 March 2014, release 2014_03 of UniProtKB/Swiss-Prot contains 542,782 sequence entries. UniProtKB/Swiss-Prot is a manually annotated, non-redundant protein sequence database. It combines information extracted from the literature and biocurator-evaluated computational analysis. The aim of UniProtKB/Swiss-Prot is to provide all known relevant information about a particular protein.
Annotation is regularly reviewed to keep up with current scientific findings. The manual annotation of an entry involves detailed analysis of the protein sequence and of the scientific literature. Sequences from the same gene and the same species are merged into the same database entry. Differences between sequences are identified, and their cause documented. A range of sequence analysis tools is used in the annotation of UniProtKB/Swiss-Prot entries. Computer predictions are manually evaluated, and relevant results selected for inclusion in the entry. These predictions include post-translational modifications, transmembrane domains and topology, signal peptides, domain identification, and protein family classification. Relevant publications are identified by searching databases such as PubMed. The full text of each paper is read, and information is extracted and added to the entry.
11.
National Center for Biotechnology Information
–
The National Center for Biotechnology Information (NCBI) is part of the United States National Library of Medicine (NLM), a branch of the National Institutes of Health. The NCBI is located in Bethesda, Maryland, and was founded in 1988 through legislation sponsored by Senator Claude Pepper. The NCBI houses a series of databases relevant to biotechnology and biomedicine and is an important resource for bioinformatics tools and services. Major databases include GenBank for DNA sequences and PubMed, a database for the biomedical literature. Other databases include the NCBI Epigenomics database. All these databases are available online through the Entrez search engine. NCBI is directed by David Lipman, one of the authors of the BLAST sequence alignment program. He also leads a research program, including groups led by Stephen Altschul, David Landsman, Eugene Koonin, John Wilbur, and Teresa Przytycka. NCBI is listed in the Registry of Research Data Repositories, re3data.org. NCBI has had responsibility for making available the GenBank DNA sequence database since 1992. GenBank coordinates with individual laboratories and other databases such as those of the European Molecular Biology Laboratory. Since 1992, NCBI has grown to provide other databases in addition to GenBank. The NCBI assigns a unique identifier to each species of organism. The NCBI has software tools that are available by WWW browsing or by FTP. For example, BLAST is a sequence similarity searching program. BLAST can do sequence comparisons against the GenBank DNA database in less than 15 seconds. The NCBI Bookshelf is a collection of freely accessible, downloadable books. Some of the books are online versions of previously published books, while others, such as Coffee Break, are written and edited by NCBI staff. BLAST is a tool used for calculating sequence similarity between biological sequences such as nucleotide sequences of DNA and amino acid sequences of proteins.
BLAST is a tool for finding sequences similar to the query sequence within the same organism or in different organisms. It searches for the query sequence on NCBI databases and servers and posts the results back to the browser in the chosen format. Input sequences to BLAST are mostly in FASTA or GenBank format, while output can be delivered in a variety of formats such as HTML and XML. HTML is the output format for NCBI's web page. Entrez is both an indexing and a retrieval system, holding data from various sources for biomedical research.
12.
Ensembl genome database project
–
Ensembl is one of several well-known genome browsers for the retrieval of genomic information. Similar databases and browsers are found at NCBI and the University of California, Santa Cruz (UCSC). The human genome consists of three billion base pairs, which code for approximately 20,000–25,000 genes. However, the genome alone is of little use unless the locations of individual genes can be identified. One option is manual annotation, whereby a team of scientists tries to locate genes using experimental data from scientific journals. However, this is a slow, painstaking task. The alternative, known as automated annotation, is to use the power of computers to do the complex pattern-matching of protein to DNA. In the Ensembl project, sequence data are fed into the gene annotation system, which creates a set of predicted gene locations and saves them in a MySQL database for subsequent analysis. Ensembl makes these data freely accessible to the world research community. All the data and code produced by the Ensembl project are available to download. In addition, the Ensembl website provides computer-generated visual displays of much of the data. Over time the project has expanded to include additional species as well as a wider range of genomic data, including genetic variations. Central to the Ensembl concept is the ability to automatically generate graphical views of the alignment of genes. These are shown as data tracks, and individual tracks can be turned on and off, allowing the user to customise the display to suit their research interests. The interface also enables the user to zoom in to a region or move along the genome in either direction. The graphics are complemented by tabular displays, and in many cases data can be exported directly from the page in a variety of standard file formats such as FASTA. Externally produced data can also be added to the display, either via a DAS server on the internet or by uploading a file in one of the supported formats, such as BAM or BED.
Graphics are generated using a suite of custom Perl modules based on GD. In addition to its website, Ensembl provides a Perl API that models biological objects such as genes and proteins, allowing simple scripts to be written to retrieve data of interest. The same API is used internally by the web interface to display the data. It is divided into sections such as the core API, the compara API, the variation API, and the functional genomics API. The Ensembl website provides information on how to install and use the API. This software can be used to access the public MySQL database. Users can even choose to retrieve data from MySQL with direct SQL queries, but this requires an extensive knowledge of the current database schema. Large datasets can be retrieved using the BioMart data-mining tool, which provides a web interface for downloading datasets using complex queries. Lastly, there is an FTP server which can be used to download entire MySQL databases as well as some selected data sets in other formats. The annotated genomes include most fully sequenced vertebrates and selected model organisms. All of them are eukaryotes; there are no prokaryotes.
13.
Entrez
–
The name Entrez was chosen to reflect the spirit of welcoming the public to search the content available from the NLM. Entrez Global Query is a search and retrieval system that provides access to all databases simultaneously with a single query string. Entrez can efficiently retrieve related sequences, structures, and references. The Entrez system can provide views of gene and protein sequences and chromosome maps. Some textbooks are available online through the Entrez system. The Entrez front page provides, by default, access to the global query. All databases indexed by Entrez can be searched via a single query string, supporting boolean operators and search term tags to limit parts of the search statement to particular fields. This returns a unified results page that shows the number of hits for the search in each of the databases. Entrez also provides a similar interface for searching each particular database and for refining search results. The Limits feature allows the user to narrow a search using a web forms interface. The History feature gives a numbered list of recently performed queries. Results of previous queries can be referred to by number and combined via boolean operators. Search results can be saved temporarily in a Clipboard. Users with a MyNCBI account can save queries indefinitely and also choose to have updates with new search results e-mailed for saved queries of most databases. Entrez is widely used in the field of biotechnology as a reference tool for students and professionals alike. The databases searched by Entrez include PubMed, which holds biomedical literature citations and abstracts, including MEDLINE articles from journals. In addition to using the search engine forms to query the data in Entrez, NCBI provides the Entrez Programming Utilities (eUtils) for more direct access to query results. The eUtils are accessed by posting specially formed URLs to the NCBI server. There was also an eUtils SOAP interface, which was terminated in July 2015.
In 1991, Entrez was introduced in CD form. In 1993, a client-server version of the software provided connectivity with the internet. In 1994, NCBI established a website, and Entrez was a part of the initial release. In 2001, Entrez Bookshelf was released, and in 2003 the Entrez Gene database was developed.
14.
UCSC Genome Browser
–
The UCSC Genome Browser is an on-line genome browser hosted by the University of California, Santa Cruz. The site provides the Genome Browser Database, browsing tools, and downloadable data files. Today the browser is used by geneticists, molecular biologists and physicians, as well as students and teachers of evolution, for access to genomic information. High coverage is necessary to allow overlap to guide the construction of contiguous regions. The species hosted with full-featured genome browsers are shown in the table. The large amount of data about biological systems that is accumulating in the literature makes it necessary to collect and digest information using the tools of bioinformatics. The basic paradigm of display is to show the sequence in the horizontal dimension. Blocks of color along the coordinate axis show the locations of the alignments of the different data types. The ability to show this large variety of data types on a single coordinate axis makes the browser a handy tool for the vertical integration of the data. To find a specific gene or genomic region, the user may type in the gene name or an accession number for an RNA. Presenting the data in this format allows the browser to link to detailed information about any of the annotations. Designed for the presentation of complex and voluminous data, the UCSC Browser is optimized for speed. By pre-aligning the 55 million RNAs of GenBank to each of the 81 genome assemblies, the browser allows instant access to the alignments of any RNA to any of the hosted species. The juxtaposition of the types of data allows researchers to display exactly the combination of data that will answer specific questions. A PDF/PostScript output function allows export of an image for publication in academic journals. One unique and useful feature that distinguishes the UCSC Browser from other genome browsers is the variable nature of the display.
Sequence of any size can be displayed, from a single DNA base up to an entire chromosome with full annotation tracks. Researchers can display a single gene, an exon, or an entire chromosome band, showing dozens or hundreds of genes. A convenient drag-and-zoom feature allows the user to zoom in on any region in the genome image. Researchers may also use the browser to display their own data via the Custom Tracks tool. This feature allows users to upload a file of their own data and view the data in the context of the reference genome assembly. Users may also use the data hosted by UCSC, creating subsets of the data of their choosing with the Table Browser tool. Any browser view created by a user, including those containing Custom Tracks, may be shared with other users via the Saved Sessions tool.
15.
Nucleic acid sequence
–
A nucleic acid sequence is a succession of letters that indicate the order of nucleotides within a DNA or RNA molecule. By convention, sequences are presented from the 5′ end to the 3′ end. For DNA, the sense strand is used. Because nucleic acids are linear polymers, specifying the sequence is equivalent to defining the covalent structure of the entire molecule. For this reason, the nucleic acid sequence is also termed the primary structure. The sequence has the capacity to represent information. Biological deoxyribonucleic acid represents the information which directs the functions of a living thing. Nucleic acids also have a secondary structure and a tertiary structure. Primary structure is sometimes referred to as primary sequence. Conversely, there is no concept of secondary or tertiary sequence. Nucleic acids consist of a chain of linked units called nucleotides. Each nucleotide consists of three subunits: a phosphate group and a sugar make up the backbone of the nucleic acid strand, and attached to the sugar is one of a set of nucleobases. The nucleobases are important in base pairing of strands to form secondary and tertiary structure such as the famed double helix. For DNA, the possible letters are A, C, G, and T. In the typical case, the sequences are printed abutting one another without gaps, as in the sequence AAAGTCTGAC, read left to right in the 5′ to 3′ direction. With regard to transcription, a sequence is on the coding strand if it has the same order as the transcribed RNA. One sequence can be complementary to another sequence, meaning that they have the complementary base at each position (A pairs with T, C pairs with G) and run in the reverse order. For example, the complementary sequence to TTAC is GTAA. If one strand of the double-stranded DNA is considered the sense strand, then the other strand, considered the antisense strand, will have the complementary sequence to the sense strand. Apart from adenine, cytosine, guanine, thymine and uracil, nucleic acids also contain modified bases. In DNA, the most common modified base is 5-methylcytidine.
In RNA, there are many modified bases, including pseudouridine, dihydrouridine, inosine, ribothymidine and 7-methylguanosine. Hypoxanthine and xanthine are two of the many bases created through mutagen presence, both of them through deamination. Hypoxanthine is produced from adenine, and xanthine from guanine. Similarly, deamination of cytosine results in uracil.
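The complementarity rule described above (A pairs with T, C pairs with G, with the result read in the reverse direction so that it is again written 5′ to 3′) can be sketched in a few lines of Python. The function name `reverse_complement` is chosen here for illustration, and the sketch assumes a plain DNA sequence with only the four standard letters.

```python
# Watson-Crick pairing for the four standard DNA letters.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(seq):
    """Return the complementary strand, written 5' to 3'.

    Walks the input from its 3' end back to its 5' end and swaps each base
    for its pairing partner. Raises KeyError on non-ACGT letters.
    """
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


print(reverse_complement("TTAC"))  # GTAA, matching the example in the text
```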
16.
GenBank
–
The GenBank sequence database is an open access, annotated collection of all publicly available nucleotide sequences and their protein translations. This database is produced and maintained by the National Center for Biotechnology Information (NCBI) as part of the International Nucleotide Sequence Database Collaboration. The NCBI is a part of the National Institutes of Health in the United States. GenBank and its collaborators receive sequences produced in laboratories throughout the world from more than 100,000 distinct organisms. GenBank continues to grow, doubling roughly every 18 months. Release 194, produced in February 2013, contained over 150 billion nucleotide bases in more than 162 million sequences. GenBank is built by direct submissions from individual laboratories, as well as from bulk submissions from large-scale sequencing centers. Only original sequences can be submitted to GenBank. Direct submissions are made to GenBank using BankIt, which is a Web-based form, or the stand-alone submission program, Sequin. Upon receipt of a submission, the GenBank staff examines the originality of the data and assigns an accession number to the sequence. The submissions are then released to the database, where the entries are retrievable by Entrez or downloadable by FTP. Bulk submissions include Expressed Sequence Tag, Sequence-tagged site, and Genome Survey Sequence data. The GenBank direct submissions group also processes complete microbial genome sequences. Funding was provided by the National Institutes of Health, the National Science Foundation, and the Department of Energy. Los Alamos National Laboratory (LANL) collaborated on GenBank with the firm Bolt, Beranek, and Newman, and by the end of 1983 more than 2,000 sequences were stored in it.
In the mid 1980s, the Intelligenetics bioinformatics company at Stanford University managed the GenBank project in collaboration with LANL. As one of the earliest bioinformatics community projects on the Internet, the GenBank project started BIOSCI/Bionet news groups for promoting open access communications among bioscientists. From 1989 to 1992, the GenBank project transitioned to the newly created National Center for Biotechnology Information. The GenBank release notes for release 162.0 state that from 1982 to the present, the number of bases in GenBank has doubled approximately every 18 months. As of 15 June 2016, GenBank release 214.0 has 194,463,572 loci and 213,200,907,819 bases, from 194,463,572 reported sequences. The GenBank database includes data sets that are constructed mechanically from the main sequence data collection. On the other hand, while commercial databases potentially contain high-quality filtered sequence data, the results showed that analyses performed using GenBank combined with EzTaxon-e were more discriminative than using GenBank or other databases alone.
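The statement that the number of bases doubles approximately every 18 months corresponds to simple exponential growth, N(t) = N0 · 2^(t/18) with t in months. The Python sketch below works through that arithmetic; the function name and the starting value of 1,000 bases are illustrative assumptions, not GenBank figures.

```python
def projected_bases(initial_bases, months_elapsed, doubling_months=18.0):
    """Project a base count forward under a constant doubling time.

    Implements N(t) = N0 * 2 ** (t / T), where T is the doubling time.
    This is an idealised model; real release sizes will deviate from it.
    """
    return initial_bases * 2 ** (months_elapsed / doubling_months)


# Hypothetical starting size, chosen only to make the arithmetic visible.
print(projected_bases(1_000, 18))  # one doubling period
print(projected_bases(1_000, 36))  # two doubling periods
```

Under this model a tenfold increase takes about 18 × log2(10), or roughly 60 months.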
17.
Chromosome
–
A chromosome is a DNA molecule with part or all of the genetic material of an organism. Prokaryotes usually have one single circular chromosome, whereas most eukaryotes are diploid. Chromosomes in eukaryotes are composed of chromatin fiber. Chromatin fiber is made of nucleosomes; a nucleosome is a histone octamer with part of a longer DNA strand attached to and wrapped around it. Chromatin fiber, together with associated proteins, is known as chromatin. Chromatin is present in most cells, with a few exceptions, for example, red blood cells. Occurring only in the nucleus of cells, chromatin contains the vast majority of DNA, except for a small amount inherited maternally. Chromosomes are normally visible under a microscope only when the cell is undergoing the metaphase of cell division. Before this happens every chromosome is copied once, and the copy is joined to the original by a centromere, resulting in an X-shaped structure. The original chromosome and the copy are now called sister chromatids. During metaphase a chromosome is in its most condensed state. In this highly condensed form chromosomes are easiest to distinguish and study. In prokaryotic cells, chromatin occurs free-floating in the cytoplasm, as these cells lack organelles. The main information-carrying macromolecule is a single piece of coiled double-helix DNA, containing many genes, regulatory elements and other noncoding DNA. The DNA-bound macromolecules are proteins that serve to package the DNA. Chromosomes vary widely between different organisms. Some species, such as certain bacteria, also contain plasmids or other extrachromosomal DNA. These are circular structures in the cytoplasm that contain cellular DNA and play a role in horizontal gene transfer. Chromosomal recombination during meiosis and subsequent sexual reproduction plays a significant role in genetic diversity.
In prokaryotes and viruses, the DNA is often densely packed and organized; in the case of archaea, by homologs to eukaryotic histones. Small circular genomes called plasmids are often found in bacteria and also in mitochondria and chloroplasts, reflecting their bacterial origins. Some use the term chromosome in a wider sense, to refer to the individualized portions of chromatin in cells. However, others use the concept in a narrower sense, to refer to the individualized portions of chromatin during cell division. The word chromosome comes from the Greek χρῶμα and σῶμα, describing their strong staining by particular dyes. Schleiden, Virchow and Bütschli were among the first scientists who recognized the structures now so familiar to everyone as chromosomes. The term was coined by von Waldeyer-Hartz, referring to the term chromatin. In a series of experiments beginning in the mid-1880s, Theodor Boveri gave the definitive demonstration that chromosomes are the vectors of heredity. His two principles were the continuity of chromosomes and the individuality of chromosomes. It is the second of these principles that was so original.
18.
Human
–
Modern humans are the only extant members of the subtribe Hominina, a branch of the tribe Hominini belonging to the family of great apes. Several of these hominins used fire and occupied much of Eurasia. They began to exhibit evidence of behavioral modernity around 50,000 years ago. In several waves of migration, anatomically modern humans ventured out of Africa. The spread of humans and their large and increasing population has had a profound impact on large areas of the environment and millions of native species worldwide. Humans are uniquely adept at utilizing systems of communication for self-expression and the exchange of ideas. Humans create complex social structures composed of many cooperating and competing groups, beginning with families. Social interactions between humans have established a wide variety of values, social norms, and rituals. These human societies subsequently expanded in size, establishing various forms of government and religion. Today the global human population is estimated by the United Nations to be near 7.5 billion. In common usage, the word generally refers to the only extant species of the genus Homo: anatomically and behaviorally modern Homo sapiens. In scientific terms, the meanings of hominid and hominin have changed during recent decades with advances in discovery. There is also a distinction between anatomically modern humans and Archaic Homo sapiens, the earliest fossil members of the species. The English adjective human is a Middle English loanword from Old French humain, ultimately from Latin hūmānus. The word's use as a noun dates to the 16th century. The native English term man can refer to the species generally. The species binomial Homo sapiens was coined by Carl Linnaeus in his 18th century work Systema Naturae. The generic name Homo is a learned 18th century derivation from Latin homō, man. The species name sapiens means wise or sapient.
Note that the Latin word homo refers to humans of either gender. The genus Homo evolved and diverged from other hominins in Africa, after the human clade split from the chimpanzee lineage of the hominids branch of the primates. The closest living relatives of humans are chimpanzees and gorillas. With the sequencing of both the human and chimpanzee genome, current estimates of similarity between human and chimpanzee DNA sequences range between 95% and 99%. The gibbons and orangutans were the first groups to split from the line leading to humans. The splitting date between human and chimpanzee lineages is placed around 4–8 million years ago, during the late Miocene epoch. During this split, chromosome 2 was formed from two other chromosomes, leaving humans with only 23 pairs of chromosomes, compared to 24 for the other apes. There is little evidence for the divergence of the gorilla and chimpanzee lineages. Each of these species has been argued to be an ancestor of later hominins.
19.
DNA
–
Deoxyribonucleic acid (DNA) is a molecule that carries the genetic instructions used in the growth, development, functioning and reproduction of all known living organisms and many viruses. DNA and RNA are nucleic acids, which, alongside proteins, lipids and complex carbohydrates, are among the major types of macromolecules. Most DNA molecules consist of two biopolymer strands coiled around each other to form a double helix. The two DNA strands are termed polynucleotides since they are composed of simpler units called nucleotides. Each nucleotide is composed of one of four nitrogen-containing nucleobases (cytosine, guanine, adenine, or thymine), a sugar called deoxyribose, and a phosphate group. The nucleotides are joined to one another in a chain by covalent bonds between the sugar of one nucleotide and the phosphate of the next, resulting in an alternating sugar-phosphate backbone. The nitrogenous bases of the two polynucleotide strands are bound together, according to base pairing rules, with hydrogen bonds to make double-stranded DNA. The total amount of related DNA base pairs on Earth is estimated at 5.0 × 10^37. In comparison, the total mass of the biosphere has been estimated to be as much as 4 trillion tons of carbon. The DNA backbone is resistant to cleavage, and both strands of the double-stranded structure store the same biological information. This information is replicated as and when the two strands separate. A large part of DNA is non-coding, meaning that these sections do not serve as patterns for protein sequences. The two strands of DNA run in opposite directions to each other and are thus antiparallel. Attached to each sugar is one of four types of nucleobases. It is the sequence of these four nucleobases along the backbone that encodes biological information. RNA strands are created using DNA strands as a template in a process called transcription. Under the genetic code, these RNA strands are translated to specify the sequence of amino acids within proteins in a process called translation.
Within eukaryotic cells DNA is organized into structures called chromosomes. During cell division these chromosomes are duplicated in the process of DNA replication. Eukaryotic organisms store most of their DNA inside the cell nucleus and some of their DNA in organelles, such as mitochondria or chloroplasts. In contrast, prokaryotes store their DNA only in the cytoplasm. Within the eukaryotic chromosomes, chromatin proteins such as histones compact and organize DNA. These compact structures guide the interactions between DNA and other proteins, helping control which parts of the DNA are transcribed. DNA was first isolated by Friedrich Miescher in 1869. DNA is used by researchers as a tool to explore physical laws and theories, such as the ergodic theorem. The unique material properties of DNA have made it an attractive molecule for material scientists and engineers interested in micro- and nano-fabrication. Among notable advances in this field are DNA origami and DNA-based hybrid materials. DNA is a polymer made from repeating units called nucleotides.
20.
Cell (biology)
–
The cell is the basic structural, functional, and biological unit of all known living organisms. A cell is the smallest unit of life that can replicate independently. The study of cells is called cell biology. Cells consist of cytoplasm enclosed within a membrane, which contains many biomolecules such as proteins. Organisms can be classified as unicellular or multicellular. While the number of cells in plants and animals varies from species to species, humans contain more than 10 trillion cells. Most plant and animal cells are visible only under a microscope. The cell was discovered by Robert Hooke in 1665, who named the unit for its resemblance to cells inhabited by Christian monks in a monastery. Cells emerged on Earth at least 3.5 billion years ago. Cells are of two types: eukaryotic, which contain a nucleus, and prokaryotic, which do not. Prokaryotes are single-celled organisms, while eukaryotes can be either single-celled or multicellular. Prokaryotic cells were the first form of life on Earth, characterised by having vital biological processes including cell signaling and being self-sustaining. They are simpler and smaller than eukaryotic cells, and lack membrane-bound organelles such as the nucleus. Prokaryotes include two of the domains of life, bacteria and archaea. The DNA of a prokaryotic cell consists of a single chromosome that is in direct contact with the cytoplasm. The nuclear region in the cytoplasm is called the nucleoid. Most prokaryotes are the smallest of all organisms, ranging from 0.5 to 2.0 µm in diameter. Though most prokaryotes have both a cell membrane and a cell wall, there are exceptions such as Mycoplasma and Thermoplasma, which only possess the cell membrane layer. The envelope gives rigidity to the cell and separates the interior of the cell from its environment. The cell wall consists of peptidoglycan in bacteria, and acts as an additional barrier against exterior forces.
It also prevents the cell from expanding and bursting from osmotic pressure due to a hypotonic environment. Some eukaryotic cells also have a cell wall. Inside the cell is the cytoplasmic region that contains the genome and ribosomes. The genetic material is found in the cytoplasm. Prokaryotes can carry extrachromosomal DNA elements called plasmids, which are usually circular. Linear bacterial plasmids have been identified in several species of spirochete bacteria, including members of the genus Borrelia, notably Borrelia burgdorferi, which causes Lyme disease. Though not forming a nucleus, the DNA is condensed in a nucleoid. Plasmids encode additional genes, such as antibiotic resistance genes.
21.
Genome annotation
–
DNA annotation or genome annotation is the process of identifying the locations of genes and all of the coding regions in a genome and determining what those genes do. An annotation is a note added by way of explanation or commentary. Once a genome is sequenced, it needs to be annotated to make sense of it. This annotation is stored in databases such as Mouse Genome Informatics and FlyBase. Educational materials on some aspects of biological annotation are available from the 2006 Gene Ontology annotation camp. The National Center for Biomedical Ontology develops tools for automated annotation of database records based on the textual descriptions of those records. Genome annotation consists of three main steps: identifying portions of the genome that do not code for proteins; identifying elements on the genome, a process called gene prediction; and attaching biological information to these elements. Automatic annotation tools try to perform all this by computer analysis, as opposed to manual annotation, which involves human expertise. Ideally, these approaches co-exist and complement each other in the same annotation pipeline. The basic level of annotation is using BLAST for finding similarities. However, nowadays more and more additional information is added to the annotation platform. The additional information allows manual annotators to deconvolute discrepancies between genes that are given the same annotation. Some databases use genome context information, similarity scores, experimental data, and integrations of other resources to provide genome annotations through their Subsystems approach. Other databases rely on both curated data sources and a range of different software tools in their automated genome annotation pipeline. Structural annotation consists of the identification of genomic elements.
These include ORFs and their localisation, gene structure, coding regions, and the location of regulatory motifs. Functional annotation consists of attaching biological information to genomic elements: biochemical function, biological function, involved regulation and interactions, and expression. These steps may involve both biological experiments and in silico analysis. Proteogenomics-based approaches utilize information from expressed proteins, often derived from mass spectrometry. A variety of software tools have been developed to permit scientists to view and share genome annotations. Identifying the locations of genes and other control elements is often described as defining the biological parts list for the assembly. Scientists are still at an early stage in the process of delineating this parts list.
22.
Gene prediction
–
In computational biology, gene prediction or gene finding refers to the process of identifying the regions of genomic DNA that encode genes. This includes protein-coding genes as well as RNA genes, but may also include prediction of other elements such as regulatory regions. Gene finding is one of the first and most important steps in understanding the genome of a species once it has been sequenced. In its earliest days, gene finding was based on painstaking experimentation on living cells and organisms. Today, with comprehensive genome sequence and powerful computational resources at the disposal of the research community, gene finding has become largely a computational problem. Determining that a sequence is functional should be distinguished from determining the function of the gene or its product. Gene prediction is one of the key steps in genome annotation, following sequence assembly. Gene prediction is closely related to the so-called target search problem, which investigates how DNA-binding proteins locate specific binding sites within the genome. Given an mRNA sequence, it is trivial to derive a unique genomic DNA sequence from which it had to have been transcribed. Given a protein sequence, a family of possible coding DNA sequences can be derived by reverse translation of the genetic code. Once candidate DNA sequences have been determined, it is a relatively straightforward algorithmic problem to efficiently search a target genome for matches, complete or partial, and exact or inexact. Given a sequence, local alignment algorithms such as BLAST and FASTA look for regions of similarity between the target sequence and possible candidate matches. The success of this approach is limited by the contents and accuracy of the sequence database. A high degree of similarity to a known messenger RNA or protein product is strong evidence that a region of a target genome is a protein-coding gene.
However, to apply this approach systematically requires extensive sequencing of mRNA. Thus, to collect extrinsic evidence for most or all of the genes in a complex organism requires the study of many hundreds or thousands of cell types, which presents further difficulties. For example, some genes may be expressed only during development as an embryo or fetus. Despite these difficulties, extensive transcript and protein sequence databases have been generated for human as well as other important model organisms in biology, such as mice and yeast. For example, the RefSeq database contains transcript and protein sequence from different species. It is, however, likely that these databases are both incomplete and contain small but significant amounts of erroneous data. In prokaryotes it is essential to consider horizontal gene transfer when searching for gene sequence homology. An additional important factor underused in current gene detection tools is the existence of gene clusters (operons) in both prokaryotes and eukaryotes. Most popular gene detectors treat each gene in isolation, independent of others, which is not biologically accurate. Ab initio gene prediction is a method based on gene content and on signs in the sequence that indicate a gene. These signs can be categorized as either signals, specific sequences that indicate the presence of a gene nearby, or content. Ab initio gene finding might be accurately characterized as gene prediction.
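The idea of using sequence signals can be made concrete with a deliberately simple open reading frame (ORF) scan: a start codon and an in-frame stop codon are treated as the only signals. This is a toy sketch under stated assumptions (forward strand only, no introns, ATG as the only start codon); it is not how production gene finders work, and the function name `find_orfs` and the example sequence are invented for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=2):
    """Return sorted (start, end) pairs of ORFs on the forward strand.

    start is the index of the A in the ATG start codon; end is exclusive
    and includes the stop codon. min_codons counts the codons before the
    stop codon, including the ATG itself.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):  # scan each of the three reading frames
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # close the ORF and keep scanning
    return sorted(orfs)


example = "CCATGAAATGACC"  # hypothetical sequence with one short ORF
for begin, end in find_orfs(example):
    print(begin, end, example[begin:end])
```

Real ab initio predictors combine many such signals with statistical models of coding content; the scan above only shows the signal-detection half of that idea.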
23.
Human genome
–
The human genome is the complete set of nucleic acid sequence for humans, encoded as DNA within the 23 chromosome pairs in cell nuclei and in a small DNA molecule found within individual mitochondria. Human genomes include both protein-coding DNA genes and noncoding DNA. Haploid human genomes, which are contained in germ cells, consist of three billion DNA base pairs, while diploid genomes have twice the DNA content. The Human Genome Project produced the first complete sequences of human genomes, starting with a first draft sequence. The human genome was the first of all vertebrates to be completely sequenced. As of 2012, thousands of human genomes have been completely sequenced, and many more have been mapped at lower levels of resolution. The resulting data are used worldwide in biomedical science, anthropology, and forensics. There is a widely held expectation that genomic studies will lead to advances in the diagnosis and treatment of diseases, and to new insights in many fields of biology, including human evolution. Although the sequence of the genome has been determined by DNA sequencing, it is not yet fully understood. There are an estimated 19,000–20,000 human protein-coding genes. In June 2016, scientists formally announced HGP-Write, a plan to synthesize the human genome. The total length of the genome is over 3 billion base pairs. The genome is organized into 22 paired chromosomes, plus the X chromosome and, in males only, the Y chromosome. These are all large linear DNA molecules contained within the cell nucleus. The genome also includes the mitochondrial DNA, a small circular molecule present in each mitochondrion. Basic information about these molecules and their content, based on a reference genome that does not represent the sequence of any specific individual, is provided in the following table.
Chromosome lengths were estimated by multiplying the number of base pairs by 0.34 nanometers. Variations are unique DNA sequence differences that have been identified in the individual human genome sequences analyzed by Ensembl as of December 2016. The number of identified variations is expected to increase as further personal genomes are sequenced and analyzed. In addition to the gene content shown in this table, a large number of non-expressed functional sequences have been identified throughout the human genome. Links open windows to the reference chromosome sequences in the EBI genome browser. Small non-coding RNAs are RNAs of as many as 200 bases that do not have protein-coding potential. These include microRNAs (miRNAs) and small nuclear RNAs (snRNAs). Long non-coding RNAs are RNA molecules longer than 200 bases that do not have protein-coding potential. Although the human genome has been sequenced for all practical purposes, gaps remain. A recent study noted more than 160 euchromatic gaps, of which 50 were closed. However, there are still numerous gaps in the heterochromatic parts of the genome, which are much harder to sequence due to numerous repeats and other intractable sequence features. The content of the genome is commonly divided into coding and noncoding DNA sequences.
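The length estimate mentioned above is a single multiplication: base pairs times 0.34 nanometers per base pair. The Python sketch below works it through for the roughly three billion base pairs of a haploid genome quoted in the text; the function name `dna_length_m` is an assumption made for illustration.

```python
NM_PER_BASE_PAIR = 0.34  # rise per base pair, as used in the text


def dna_length_m(base_pairs):
    """Estimate the stretched-out length of a DNA molecule in meters."""
    length_nm = base_pairs * NM_PER_BASE_PAIR
    return length_nm * 1e-9  # convert nanometers to meters


haploid_bp = 3_000_000_000  # approximate haploid genome size from the text
print(f"{dna_length_m(haploid_bp):.2f} m")      # about 1.02 m
print(f"{dna_length_m(2 * haploid_bp):.2f} m")  # diploid content, about 2.04 m
```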
24.
Non-coding RNA
–
A non-coding RNA (ncRNA) is an RNA molecule that is not translated into a protein. Less frequently used synonyms are non-protein-coding RNA, non-messenger RNA, or functional RNA. The DNA sequence from which a functional non-coding RNA is transcribed is often called an RNA gene. The number of ncRNAs encoded within the genome is unknown. Many of the newly identified ncRNAs have not been validated for their function. It is also likely that many ncRNAs are non-functional, and are the product of spurious transcription. Nucleic acids were first discovered in 1868 by Friedrich Miescher, and by 1939 RNA had been implicated in protein synthesis. Two decades later, Francis Crick predicted a functional RNA component which mediated translation. The first non-coding RNA to be characterised was an alanine tRNA found in baker's yeast; its structure was published in 1965. To produce a purified alanine tRNA sample, Robert W. Holley et al. used 140 kg of commercial baker's yeast to give just 1 g of purified tRNAAla for analysis. The 80-nucleotide tRNA was sequenced by first being digested with pancreatic ribonuclease and then with takadiastase ribonuclease T1. Chromatography and identification of the 5′ and 3′ ends then helped arrange the fragments to establish the RNA sequence. Of the three structures originally proposed for this tRNA, the cloverleaf structure was independently proposed in several following publications. The cloverleaf secondary structure was finalised following X-ray crystallography analysis performed by two independent research groups in 1974. Ribosomal RNA was next to be discovered, followed by URNA in the early 1980s. Since then, the discovery of new non-coding RNAs has continued with snoRNAs, Xist, CRISPR and others. Recent notable additions include riboswitches and miRNA; the discovery of the RNAi mechanism associated with the latter earned Craig C. Mello and Andrew Fire the 2006 Nobel Prize in Physiology or Medicine.
Noncoding RNAs belong to several groups and are involved in many cellular processes. These range from ncRNAs of central importance that are conserved across all or most cellular life through to more transient ncRNAs specific to one or a few related species. Many of the conserved, essential and abundant ncRNAs are involved in translation. Ribonucleoprotein particles called ribosomes are the factories where translation takes place in the cell. The ribosome consists of more than 60% ribosomal RNA, which is made up of 3 ncRNAs in prokaryotes and 4 ncRNAs in eukaryotes. Ribosomal RNAs catalyse the translation of nucleotide sequences to protein. Another set of ncRNAs, transfer RNAs, form an adaptor molecule between mRNA and protein. The H/ACA box and C/D box snoRNAs are ncRNAs found in archaea and eukaryotes. RNase MRP is restricted to eukaryotes. Both groups of ncRNA are involved in the maturation of rRNA. The snoRNAs guide covalent modifications of rRNA, tRNA and snRNAs. The ubiquitous ncRNA, RNase P, is an evolutionary relative of RNase MRP. RNase P matures tRNA sequences by generating mature 5′-ends of tRNAs through cleaving the 5′-leader elements of precursor-tRNAs. Another ubiquitous RNP called SRP recognizes and transports specific nascent proteins to the endoplasmic reticulum in eukaryotes and the plasma membrane in prokaryotes.
25.
Pseudogene
–
Pseudogenes are segments of DNA that are related to real genes. Pseudogenes have lost at least some functionality relative to the complete gene. They often result from the accumulation of multiple mutations within a gene whose product is not required for the survival of the organism. Although not fully functional, pseudogenes may still have some function, similar to other kinds of noncoding DNA. The "pseudo" in pseudogene implies a variation in sequence relative to the parent coding gene. Despite being non-coding, many pseudogenes have important roles in normal physiology. Although some pseudogenes do not have introns or promoters, others have some gene-like features such as promoters, CpG islands, and splice sites. They differ from normal genes, for example through a lack of protein-coding ability resulting from a variety of disabling mutations. The term pseudogene was coined in 1977 by Jacq et al. Because pseudogenes were initially thought of as the last stop for genomic material that could be removed from the genome, they were often labeled as junk DNA. Nonetheless, pseudogenes contain biological and evolutionary histories within their sequences. Pseudogenes are usually characterized by a combination of homology to a known gene and loss of some functionality. That is, although every pseudogene has a DNA sequence that is similar to some functional gene, it has lost some of that gene's functionality. Homology is implied by sequence identity between the DNA sequences of the pseudogene and the parent gene. After aligning the two sequences, the percentage of identical base pairs is computed. A high sequence identity means that it is highly likely that the two sequences diverged from a common ancestral sequence, and highly unlikely that they evolved independently. Nonfunctionality can manifest itself in many ways. Normally, a gene must go through several steps to yield a fully functional protein: transcription, pre-mRNA processing, translation, and protein folding are all required parts of this process. 
If any of these steps fails, then the sequence may be considered nonfunctional. Pseudogenes for RNA genes are usually more difficult to discover, as they do not need to be translated and thus do not have reading frames. Pseudogenes can complicate molecular genetic studies. For example, amplification of a gene by PCR may simultaneously amplify a pseudogene that shares similar sequences. This is known as PCR bias or amplification bias. Similarly, pseudogenes are sometimes annotated as genes in genome sequences. Processed pseudogenes often pose a problem for gene prediction programs, as they are frequently misidentified as real genes or exons. It has been proposed that identification of processed pseudogenes can help improve the accuracy of gene prediction methods. Recently, 140 human pseudogenes have been shown to be translated; however, the function, if any, of the protein products is unknown. There are four types of pseudogenes, all with distinct mechanisms of origin. The classifications of pseudogenes are as follows. Processed pseudogenes: in higher eukaryotes, particularly mammals, retrotransposition is a fairly common event that has had a huge impact on the composition of the genome.
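The identity calculation described in this entry (align the pseudogene to its parent gene, then count identical base pairs) can be sketched in a few lines of Python. This is a minimal illustration rather than a method taken from any particular tool: the function name `percent_identity`, the gap character `-`, the decision to count only columns where neither sequence has a gap, and the two example sequences are all assumptions, and the input is assumed to be already aligned.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Return the percentage of identical bases between two pre-aligned sequences.

    Columns in which either sequence has a gap are skipped (an assumed
    convention; other definitions divide by the full alignment length).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")

    compared = 0   # columns where both sequences have a base
    identical = 0  # columns where those bases match

    for base_a, base_b in zip(aligned_a.upper(), aligned_b.upper()):
        if base_a == gap or base_b == gap:
            continue
        compared += 1
        if base_a == base_b:
            identical += 1

    if compared == 0:
        raise ValueError("no comparable columns in the alignment")
    return 100.0 * identical / compared


# Hypothetical aligned fragments: one substitution and one deletion.
parent_gene = "ATGGCCTTAGCA"
pseudogene = "ATGGCTTT-GCA"
print(f"{percent_identity(parent_gene, pseudogene):.1f}% identity")
```

Skipping gapped columns keeps insertions and deletions from lowering the score. Whether that is appropriate depends on the analysis, so the denominator is the first thing to revisit when adapting this sketch.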
26.
Protein
–
Proteins are large biomolecules, or macromolecules, consisting of one or more long chains of amino acid residues. Proteins perform a vast array of functions within organisms, including catalysing metabolic reactions, DNA replication, and responding to stimuli. A linear chain of amino acid residues is called a polypeptide. A protein contains at least one long polypeptide. Short polypeptides, containing fewer than 20–30 residues, are rarely considered to be proteins and are commonly called peptides, or sometimes oligopeptides. The individual amino acid residues are bonded together by peptide bonds. The sequence of amino acid residues in a protein is defined by the sequence of a gene, which is encoded in the genetic code. In general, the genetic code specifies 20 standard amino acids. Sometimes proteins have non-peptide groups attached, which can be called prosthetic groups or cofactors. Proteins can also work together to achieve a particular function, and they often associate to form stable protein complexes. Once formed, proteins only exist for a certain period of time and are then degraded and recycled by the cell's machinery through the process of protein turnover. A protein's lifespan is measured in terms of its half-life and covers a wide range. Proteins can exist for minutes or years, with an average lifespan of 1–2 days in mammalian cells. Abnormal or misfolded proteins are degraded more rapidly, either because they are targeted for destruction or because they are unstable. Like other biological macromolecules such as polysaccharides and nucleic acids, proteins are essential parts of organisms. Many proteins are enzymes that catalyse biochemical reactions and are vital to metabolism. Proteins also have structural or mechanical functions, such as actin and myosin in muscle and the proteins in the cytoskeleton. Other proteins are important in cell signaling, immune responses, cell adhesion, and the cell cycle. 
In animals, proteins are needed in the diet to provide the essential amino acids that cannot be synthesized. Digestion breaks the proteins down for use in metabolism. Methods commonly used to study protein structure and function include immunohistochemistry, site-directed mutagenesis, X-ray crystallography, and nuclear magnetic resonance. Most proteins consist of linear polymers built from series of up to 20 different L-α-amino acids. All proteinogenic amino acids possess common structural features, including an α-carbon to which an amino group, a carboxyl group, and a variable side chain are bonded. Only proline differs from this basic structure, as it contains an unusual ring to the N-end amine group. The amino acids in a polypeptide chain are linked by peptide bonds. Once linked in the chain, an individual amino acid is called a residue, and the linked series of carbon, nitrogen, and oxygen atoms is known as the main chain or protein backbone. The peptide bond has two resonance forms that contribute some double-bond character and inhibit rotation around its axis, so that the alpha carbons are roughly coplanar.
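The half-life measure of protein lifespan mentioned in this entry can be made concrete with a short calculation. The sketch below assumes simple first-order (exponential) decay, which is a common simplification that the text itself does not state. The function name `fraction_remaining` and the 24-hour half-life are illustrative assumptions, not measured values.

```python
def fraction_remaining(elapsed_hours: float, half_life_hours: float) -> float:
    """Fraction of an initial protein pool left after a given time.

    Assumes first-order decay: the pool halves once per half-life.
    """
    if half_life_hours <= 0:
        raise ValueError("half-life must be positive")
    if elapsed_hours < 0:
        raise ValueError("elapsed time cannot be negative")
    return 0.5 ** (elapsed_hours / half_life_hours)


# Hypothetical protein with a 24-hour half-life, sampled over three days.
for hours in (0, 24, 48, 72):
    print(f"after {hours:>2} h: {fraction_remaining(hours, 24):.3f} of the pool remains")
```

Under this model a protein with a half-life of minutes is almost entirely replaced within an hour, while one with a half-life of years is essentially stable over a cell's lifetime, which matches the wide range described in the entry.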
27.
Wilson disease protein
–
Wilson disease protein, also known as ATP7B protein, is a copper-transporting P-type ATPase that is encoded by the ATP7B gene. ATP7B protein is located in the trans-Golgi network of the liver and brain, and balances the copper level in the body by excreting excess copper into bile and plasma. Genetic disorders of the ATP7B gene may cause Wilson's disease, a disease in which copper accumulates in tissues, leading to neurological or psychiatric issues. This protein functions as a monomer, exporting copper out of cells, as in the efflux of hepatic copper into the bile. Alternate transcriptional splice variants, encoding different isoforms with distinct cellular localizations, have been characterized. Wilson disease is caused by various mutations. One of the common mutations is the single base pair mutation H1069Q. ATP7B protein is a copper-transporting P-type ATPase, synthesized as a membrane protein of 165 kDa in a human hepatoma cell line, and is 57% homologous to the Menkes disease-associated protein ATP7A. The copper-binding motif also shows affinity for other transition metal ions such as zinc (Zn), cadmium (Cd), and gold (Au). As a P-type ATPase, ATP7B undergoes auto-phosphorylation of a key conserved aspartic acid residue in the DKTGT motif. ATP binding to the protein initiates the reaction, and copper binds to the transmembrane region. Phosphorylation then occurs at the aspartic acid residue in the DKTGT motif, with release of Cu. Dephosphorylation of the aspartic acid residue then returns the protein to a state ready for the next transport cycle. Most ATP7B protein is located in the trans-Golgi network of hepatocytes. A small amount of ATP7B is located in the brain. As a copper-transporting protein, one of its major functions is delivering copper to copper-dependent enzymes in the Golgi apparatus. 
In the human body, the liver plays an important role in copper regulation, including the removal of excess copper. ATP7B participates in the physiological pathway of copper removal in two ways: secreting copper into plasma and excreting copper into bile. ATP7B receives copper from the cytosolic protein antioxidant 1 copper chaperone (ATOX1). This protein targets ATP7B directly in the liver in order to transport copper. ATOX1 transfers copper from the cytosol to the binding domain of ATP7B, which controls the catalytic activity of ATP7B. Several mutations in ATOX1 can block the copper pathways and cause Wilson disease. Subsequent transport is promoted through the reduction of intramolecular disulphide bonds by GLRX catalysis. Wilson disease happens when accumulation of copper inside the liver causes mitochondrial damage and cell destruction. The loss of copper excretion in bile then leads to an increasing concentration of copper in urine and causes kidney problems. The symptoms of Wilson disease can therefore be varied, including kidney disease. The major cause is the malfunction of ATP7B due to single base pair mutations, deletions, frame-shifts, or splice errors in the ATP7B gene.
28.
BRCA2
–
BRCA2 denotes both a human gene and its protein product. The official symbol and the name are maintained by the HGNC. One alternative symbol, FANCD1, recognizes its association with the FANC protein complex. Orthologs, styled Brca2, are common in other mammal species. BRCA2 is a tumor suppressor gene found in all humans. BRCA2 and BRCA1 are normally expressed in the cells of breast and other tissue. They are involved in the repair of chromosomal damage, with an important role in the error-free repair of DNA double-strand breaks. If BRCA1 or BRCA2 itself is damaged by a BRCA mutation, damaged DNA is not repaired properly. BRCA1 and BRCA2 have been described as breast cancer susceptibility genes and breast cancer susceptibility proteins. The BRCA2 gene is located on the long (q) arm of chromosome 13 at position 12.3. The human reference BRCA2 gene contains 27 exons, and the cDNA has 10,254 base pairs coding for a protein of 3418 amino acids. The gene was first cloned by scientists at Myriad Genetics, Endo Recherche, Inc., HSC Research & Development Limited Partnership, and the University of Pennsylvania. Methods to diagnose the likelihood of a patient with mutations in BRCA1 and BRCA2 getting cancer were covered by patents owned or controlled by Myriad Genetics. Although the structures of the BRCA1 and BRCA2 genes are very different, the proteins made by both genes are essential for repairing damaged DNA. BRCA2 binds single-strand DNA and directly interacts with the recombinase RAD51 to stimulate strand invasion, a vital step of homologous recombination. The localization of RAD51 to the DNA double-strand break requires the formation of the BRCA1-PALB2-BRCA2 complex. PALB2 can function synergistically with a BRCA2 chimera to further promote strand invasion. Double-strand breaks are also generated during repair of DNA cross-links. 
By repairing DNA, these proteins play a role in maintaining the stability of the human genome and prevent dangerous gene rearrangements that can lead to hematologic and other cancers. Like BRCA1, BRCA2 probably regulates the activity of other genes. Certain variations of the BRCA2 gene increase risks for breast cancer as part of a hereditary breast-ovarian cancer syndrome. Researchers have identified hundreds of mutations in the BRCA2 gene, many of which cause an increased risk of cancer. BRCA2 mutations are usually insertions or deletions of a number of DNA base pairs in the gene. As a result of these mutations, the protein product of the BRCA2 gene is abnormal. Researchers believe that the defective BRCA2 protein is unable to repair DNA damage that occurs throughout the genome. People who have two mutated copies of the BRCA2 gene have one type of Fanconi anemia.
29.
CARKD
–
Carbohydrate kinase domain containing protein, encoded by the CARKD gene, is a human protein of unknown function. The CARKD gene encodes proteins with a predicted mitochondrial propeptide, a peptide, or neither of them. The protein is conserved throughout many species and has predicted orthologs in eukaryotes and bacteria. The human CARKD gene has 10 exons and resides on chromosome 13 at q34. The following genes are near CARKD on the chromosome: COL4A2 (A2 subunit of type IV collagen), RAB20, CARS2 (mitochondrial cysteinyl-tRNA synthetase 2), and ING1 (tumor-suppressor protein). This protein is part of the phosphomethylpyrimidine kinase, ribokinase / pfkB superfamily. This family is characterized by the presence of a domain shared by the family. CARKD contains a carbohydrate kinase domain. This family is related to Pfam PF02210 and Pfam PF00294, implying that it also is a carbohydrate kinase. The following properties of CARKD were predicted using bioinformatic analysis: molecular weight, 41.4 kDa; isoelectric point, 9.377. CARKD orthologs have highly variable isoelectric points. CARKD appears to be expressed at high levels. Expression data exist for the protein and for its mouse ortholog. One peculiar expression pattern of CARKD is its expression through the development of oligodendrocytes. Its expression is lower in oligodendrocyte progenitor cells than in mature oligodendrocytes. The human protein apolipoprotein A-1 binding precursor (APOA1BP) was predicted to be a binding partner for CARKD. This prediction is based on co-occurrence across genomes and co-expression. In addition to these data, the orthologs of CARKD in E. coli contain a domain similar to APOA1BP. Based on allele-specific expression of CARKD, CARKD may play a role in acute lymphoblastic leukemia. In addition, microarray data indicate that CARKD is up-regulated in glioblastoma multiforme tumors.
30.
CKAP2
–
Cytoskeleton-associated protein 2 is a protein that in humans is encoded by the CKAP2 gene. Alternative titles are tumor- and microtubule-associated protein (TMAP) and LB1. High transcriptional activity of the human CKAP2 gene has been observed in the testes, thymus, and diffuse B-cell lymphomas. The gene codes for a protein of 683 residues, which lacks homology to known amino acid sequences. On evidence of immunofluorescence analysis, the CKAP2 product is a cytoplasmic protein associated with cytoskeletal fibrils. The CKAP2 gene is on chromosome 13q14. Rearrangements of this region result in various tumors.
31.
EDNRB
–
Endothelin receptor type B, also known as ETB, is a protein that in humans is encoded by the EDNRB gene. Endothelin receptor type B is a G protein-coupled receptor which activates a second messenger system. Its ligand, endothelin, consists of a family of three potent vasoactive peptides: ET1, ET2, and ET3. A splice variant, named SVR, has been described; the sequence of the ETB-SVR receptor is identical to ETRB except for the intracellular C-terminal domain. While both splice variants bind ET1, they exhibit different responses upon binding, which suggests that they may be functionally distinct. In melanocytic cells the EDNRB gene is regulated by the microphthalmia-associated transcription factor. Mutations in either gene are linked to Waardenburg syndrome. The multigenic disorder Hirschsprung disease type 2 is due to mutation in the endothelin receptor type B gene. In horses, a mutation in the middle of the EDNRB gene, Ile118Lys, causes lethal white syndrome when homozygous. In this mutation, a mismatch in DNA replication causes lysine to be made instead of isoleucine. The resulting EDNRB protein is unable to fulfill its role in the development of the embryo, limiting the migration of melanocytes. A single copy of the EDNRB mutation, the heterozygous state, produces an identifiable and completely benign spotted coat color called frame overo. Endothelin receptor type B has been shown to interact with caveolin 1. Agonists include IRL-1620; antagonists include A-192,621, BQ-788, and bosentan.
32.
FLT1
–
Vascular endothelial growth factor receptor 1 is a protein that in humans is encoded by the FLT1 gene. Oncogene FLT belongs to the src gene family and is related to oncogene ROS. Like other members of this family, it shows tyrosine protein kinase activity that is important for the control of cell proliferation and differentiation. The sequence structure of the FLT gene resembles that of the FMS gene, hence the name FLT, for FMS-like tyrosine kinase. FLT1 has been shown to interact with PLCG1 and vascular endothelial growth factor B.
33.
Vascular endothelial growth factor
–
Vascular endothelial growth factor (VEGF), originally known as vascular permeability factor, is a signal protein produced by cells that stimulates vasculogenesis and angiogenesis. It is part of the system that restores the oxygen supply to tissues when blood circulation is inadequate, such as in hypoxic conditions. Serum concentration of VEGF is high in bronchial asthma and diabetes mellitus. VEGF's normal function is to create new blood vessels during embryonic development, new blood vessels after injury, muscle following exercise, and new vessels to bypass blocked vessels. When VEGF is overexpressed, it can contribute to disease. Solid cancers cannot grow beyond a limited size without an adequate blood supply; cancers that can express VEGF are able to grow and metastasize. Overexpression of VEGF can cause disease in the retina of the eye. Drugs such as aflibercept, bevacizumab, and ranibizumab can inhibit VEGF. VEGF is a sub-family of growth factors within the platelet-derived growth factor family of cystine-knot growth factors. They are important signaling proteins involved in both vasculogenesis and angiogenesis. VEGF was first identified in guinea pigs, hamsters, and mice by Senger et al. in 1983. It was purified and cloned by Ferrara and Henzel in 1989. VEGF alternative splicing was discovered by Tischer et al. in 1991. Between 1996 and 1997, Christinger and De Vos obtained the structure of VEGF, first at 2.5 Å resolution. Fms-like tyrosine kinase-1 was shown to be a VEGF receptor by Ferrara et al. in 1992. The kinase insert domain receptor was shown to be a VEGF receptor by Terman et al. in 1992 as well. In 1998, neuropilin 1 and neuropilin 2 were shown to act as VEGF receptors. In mammals, the VEGF family comprises five members: VEGF-A, placenta growth factor, VEGF-B, VEGF-C, and VEGF-D. 
The latter members were discovered later than VEGF-A; before their discovery, VEGF-A was known simply as VEGF. A number of VEGF-related proteins encoded by viruses and found in the venom of some snakes have also been discovered. Activity of VEGF-A, as its name implies, has been studied mostly on cells of the vascular endothelium. In vitro, VEGF-A has been shown to stimulate endothelial cell mitogenesis and cell migration. VEGF-A is also a vasodilator and increases microvascular permeability, and was originally referred to as vascular permeability factor. There are multiple isoforms of VEGF-A that result from alternative splicing of mRNA from a single gene. These are classified into two groups, referred to according to their terminal exon splice site: the proximal splice site or the distal splice site. In addition, alternate splicing of exons 6 and 7 alters their heparin-binding affinity. These domains have important functional consequences for the VEGF splice variants, as the terminal splice site determines whether the proteins are pro-angiogenic or anti-angiogenic. Recently, VEGF-C has been shown to be an important inducer of neurogenesis in the subventricular zone. VEGF-A binds to VEGFR-1 and VEGFR-2. VEGFR-2 appears to mediate almost all of the known cellular responses to VEGF.
34.
GJB2
–
Gap junction beta-2 protein, also known as connexin 26, is a protein that in humans is encoded by the GJB2 gene. Defects in this gene lead to the most common form of congenital deafness in developed countries. Gap junctions were first characterized by electron microscopy as regionally specialized structures on plasma membranes of contacting adherent cells. These structures were shown to consist of cell-to-cell channels. Proteins, called connexins, purified from fractions of enriched gap junctions from different tissues differ. The connexins are designated by their molecular mass. Another system of nomenclature divides gap junction proteins into two categories, alpha and beta, according to sequence similarities at the nucleotide and amino acid levels. For example, CX43 is designated alpha-1 gap junction protein, whereas CX32 and CX26 are called beta-1 and beta-2 gap junction proteins, respectively. This nomenclature emphasizes that CX32 and CX26 are more homologous to each other than either of them is to CX43.
35.
Glypican 5
–
Glypican-5 is a protein that in humans is encoded by the GPC5 gene. Cell surface heparan sulfate proteoglycans are composed of a protein core substituted with a variable number of heparan sulfate chains. Members of the integral membrane proteoglycan family contain a core protein anchored to the cytoplasmic membrane via a glycosyl phosphatidylinositol linkage. These proteins may play a role in the control of cell division and growth regulation.