Genomics Terminology

From bradwiki
Revision as of 00:22, 29 October 2020 by Bradley Monk (talk | contribs)

gnomad

The Genome Aggregation Database (gnomAD or gnomad) is a resource developed to aggregate exome and genome sequencing data from a wide variety of large-scale sequencing projects and to make summary data available to the wider scientific community. Version 2 (v2) of the gnomAD data set (GRCh37) spans 125,748 exome sequences and 15,708 whole-genome sequences from unrelated individuals sequenced as part of various disease-specific and population genetic studies. The version 3 (v3) data set (GRCh38) spans 71,702 genomes. All data are released without restriction on use.


proband

A proband is an individual serving as the starting point for the genetic study of a family (used especially in medicine). A proband is usually the first affected individual in a family who brings a genetic disorder to the attention of the medical community.


trio analysis

A trio refers to 2 parents + 1 offspring (2 + 1 = 3, hence trio). In medical genetics, trio analysis usually means analyzing a proband's genome together with the genomes of both parents. An exome trio-based approach is fundamental to identifying heterozygous dominant pathogenic variants that are present in an affected proband but absent from both unaffected parents (de novo variants).
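
The genotype pattern a trio makes visible (heterozygous in the affected proband, absent from both unaffected parents) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: genotypes are encoded as alternate-allele counts, the variant IDs and toy data are hypothetical, and real pipelines also weigh read depth, genotype quality, and population frequency before calling a variant de novo.

```python
# Genotypes are encoded as the number of alternate alleles carried:
# 0 = homozygous reference, 1 = heterozygous, 2 = homozygous alternate.
HOM_REF, HET, HOM_ALT = 0, 1, 2


def is_candidate_de_novo(proband, mother, father):
    """True when the proband is heterozygous and both parents are homozygous reference."""
    return proband == HET and mother == HOM_REF and father == HOM_REF


def candidate_de_novo_variants(trio_genotypes):
    """Return the variant IDs whose trio genotypes fit the de novo pattern.

    trio_genotypes maps a variant ID to a (proband, mother, father) tuple.
    """
    return [
        variant_id
        for variant_id, (proband, mother, father) in trio_genotypes.items()
        if is_candidate_de_novo(proband, mother, father)
    ]


# Hypothetical toy data, invented for illustration only.
example_trio = {
    "var_A": (HET, HOM_REF, HOM_REF),  # absent from both parents: candidate
    "var_B": (HET, HET, HOM_REF),      # inherited from the mother
    "var_C": (HOM_ALT, HET, HET),      # recessive pattern, not de novo
}

print(candidate_de_novo_variants(example_trio))
```

Only `var_A` passes the filter here; the other two variants can be explained by inheritance from at least one parent.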


Hail

Hail is an open-source library for scalable data exploration and analysis, with a particular emphasis on genomics. See the overview for a high-level walkthrough of the library, the GWAS tutorial for a simple example of conducting a genome-wide association study, and the installation page to get started using Hail.


GATK

The Genome Analysis ToolKit (GATK) is a genomic analysis toolkit focused on variant discovery. GATK is the industry standard for identifying SNPs and indels in germline DNA and RNA-seq data. Its scope also includes somatic short variant calling, as well as copy number variation (CNV) and structural variation (SV). In addition to the variant callers themselves, the GATK includes many utilities for related tasks such as processing and quality control of high-throughput sequencing data, and it bundles the popular Picard toolkit.

These tools were primarily designed to process exomes and whole genomes generated with Illumina sequencing technology, but they can be adapted to handle a variety of other technologies and experimental designs. And although it was originally developed for human genetics, the GATK has since evolved to handle genome data from any organism, with any level of ploidy.


human genome

The total length of the human genome is over 3 billion base pairs. The genome is organized into 22 paired chromosomes, termed autosomes, plus a 23rd pair of sex chromosomes: XX in the female and XY in the male. These are all large linear DNA molecules contained within the cell nucleus. The genome also includes the mitochondrial DNA, a comparatively small circular molecule present in each mitochondrion. Basic information about these molecules and their gene content, based on a reference genome that does not represent the sequence of any specific individual, is provided in the following table.

| Chromosome | Length (mm) | Base pairs | Variations | Protein-coding genes | Pseudogenes | Long ncRNA | Small ncRNA | miRNA | rRNA | snRNA | snoRNA | gnomAD exome.vcf | Links | Centromere pos (Mbp) | Cumulative (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 85 | 248,956,422 | 12,151,146 | 2058 | 1220 | 1200 | 496 | 134 | 66 | 221 | 145 | 5.77 GiB | EBI | 125 | 7.9 |
| 2 | 83 | 242,193,529 | 12,945,965 | 1309 | 1023 | 1037 | 375 | 115 | 40 | 161 | 117 | 4.20 GiB | EBI | 93.3 | 16.2 |
| 3 | 67 | 198,295,559 | 10,638,715 | 1078 | 763 | 711 | 298 | 99 | 29 | 138 | 87 | 3.29 GiB | EBI | 91 | 23 |
| 4 | 65 | 190,214,555 | 10,165,685 | 752 | 727 | 657 | 228 | 92 | 24 | 120 | 56 | 2.17 GiB | EBI | 50.4 | 29.6 |
| 5 | 62 | 181,538,259 | 9,519,995 | 876 | 721 | 844 | 235 | 83 | 25 | 106 | 61 | 2.51 GiB | EBI | 48.4 | 35.8 |
| 6 | 58 | 170,805,979 | 9,130,476 | 1048 | 801 | 639 | 234 | 81 | 26 | 111 | 73 | 2.83 GiB | EBI | 61 | 41.6 |
| 7 | 54 | 159,345,973 | 8,613,298 | 989 | 885 | 605 | 208 | 90 | 24 | 90 | 76 | 2.88 GiB | EBI | 59.9 | 47.1 |
| 8 | 50 | 145,138,636 | 8,221,520 | 677 | 613 | 735 | 214 | 80 | 28 | 86 | 52 | 2.13 GiB | EBI | 45.6 | 52 |
| 9 | 48 | 138,394,717 | 6,590,811 | 786 | 661 | 491 | 190 | 69 | 19 | 66 | 51 | 2.40 GiB | EBI | 49 | 56.3 |
| 10 | 46 | 133,797,422 | 7,223,944 | 733 | 568 | 579 | 204 | 64 | 32 | 87 | 56 | 2.23 GiB | EBI | 40.2 | 60.9 |
| 11 | 46 | 135,086,622 | 7,535,370 | 1298 | 821 | 710 | 233 | 63 | 24 | 74 | 76 | 3.61 GiB | EBI | 53.7 | 65.4 |
| 12 | 45 | 133,275,309 | 7,228,129 | 1034 | 617 | 848 | 227 | 72 | 27 | 106 | 62 | 3.07 GiB | EBI | 35.8 | 70 |
| 13 | 39 | 114,364,328 | 5,082,574 | 327 | 372 | 397 | 104 | 42 | 16 | 45 | 34 | 0.98 GiB | EBI | 17.9 | 73.4 |
| 14 | 36 | 107,043,718 | 4,865,950 | 830 | 523 | 533 | 239 | 92 | 10 | 65 | 97 | 2.02 GiB | EBI | 17.6 | 76.4 |
| 15 | 35 | 101,991,189 | 4,515,076 | 613 | 510 | 639 | 250 | 78 | 13 | 63 | 136 | 2.08 GiB | EBI | 19 | 79.3 |
| 16 | 31 | 90,338,345 | 5,101,702 | 873 | 465 | 799 | 187 | 52 | 32 | 53 | 58 | 3.04 GiB | EBI | 36.6 | 82 |
| 17 | 28 | 83,257,441 | 4,614,972 | 1197 | 531 | 834 | 235 | 61 | 15 | 80 | 71 | 3.62 GiB | EBI | 24 | 84.8 |
| 18 | 27 | 80,373,285 | 4,035,966 | 270 | 247 | 453 | 109 | 32 | 13 | 51 | 36 | 0.88 GiB | EBI | 17.2 | 87.4 |
| 19 | 20 | 58,617,616 | 3,858,269 | 1472 | 512 | 628 | 179 | 110 | 13 | 29 | 31 | 4.30 GiB | EBI | 26.5 | 89.3 |
| 20 | 21 | 64,444,167 | 3,439,621 | 544 | 249 | 384 | 131 | 57 | 15 | 46 | 37 | 1.44 GiB | EBI | 27.5 | 91.4 |
| 21 | 16 | 46,709,983 | 2,049,697 | 234 | 185 | 305 | 71 | 16 | 5 | 21 | 19 | 0.65 GiB | EBI | 13.2 | 92.6 |
| 22 | 17 | 50,818,468 | 2,135,311 | 488 | 324 | 357 | 78 | 31 | 5 | 23 | 23 | 1.43 GiB | EBI | 14.7 | 93.8 |
| X | 53 | 156,040,895 | 5,753,881 | 842 | 874 | 271 | 258 | 128 | 22 | 85 | 64 | 1.33 GiB | EBI | 60.6 | 99.1 |
| Y | 20 | 57,227,415 | 211,643 | 71 | 388 | 71 | 30 | 15 | 7 | 17 | 3 | 15.66 GiB | EBI | 10.4 | 100 |
| mtDNA | 0.0054 | 16,569 | 929 | 13 | 0 | 0 | 24 | 0 | 2 | 0 | 0 | NA | EBI | N/A | 100 |
| total | | 3,088,286,401 | 155,630,645 | 20412 | 14600 | 14727 | 5037 | 1756 | 532 | 1944 | 1521 | 58.81 GiB | | | |

Table 1 (above) summarizes the physical organization and gene content of the human reference genome, with links to the original analysis, as published in the Ensembl database at the European Bioinformatics Institute (EBI) and Wellcome Trust Sanger Institute. Chromosome lengths were estimated by multiplying the number of base pairs by 0.34 nanometers, the distance between base pairs in the DNA double helix. A recent estimation of human chromosome lengths based on updated data reports 205.00 cm for the diploid male genome and 208.23 cm for female, corresponding to weights of 6.41 and 6.51 picograms (pg), respectively. The number of proteins is based on the number of initial precursor mRNA transcripts, and does not include products of alternative pre-mRNA splicing, or modifications to protein structure that occur after translation.
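
The length estimate described above (base pairs multiplied by 0.34 nm) is simple to reproduce. The sketch below is a worked example of that arithmetic for three chromosomes from the table; the function and constant names are illustrative choices, not part of any library.

```python
NM_PER_BASE_PAIR = 0.34   # distance between base pairs in the double helix
NM_PER_MM = 1_000_000     # nanometers in one millimeter


def chromosome_length_mm(base_pairs):
    """Estimate the stretched-out length of a DNA molecule in millimeters."""
    return base_pairs * NM_PER_BASE_PAIR / NM_PER_MM


# Base-pair counts copied from the table above.
base_pairs = {"1": 248_956_422, "21": 46_709_983, "X": 156_040_895}

for name, count in base_pairs.items():
    print(f"chromosome {name}: {chromosome_length_mm(count):.2f} mm")
```

Rounded to the nearest millimeter, these estimates come to 85, 16, and 53 mm, matching the Length column for chromosomes 1, 21, and X.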

human genes (count)

The number of genes in the human genome (see: full gene list) is not precisely known because the function of numerous transcripts remains unclear. This is especially true for non-coding RNA (see below). The number of protein-coding genes is better established, but there are still on the order of 1,400 questionable genes, usually encoded by short open reading frames, which may or may not encode functional proteins. Table 2 gives estimates from various projects and shows these discrepancies.

Table 2. Number of human genes (according to different databases)
Gencode Ensembl Refseq CHESS
protein-coding genes 19,901 20,376 20,345 21,306
lncRNA genes 15,779 14,720 17,712 18,484
antisense RNA 5501 28 2694
miscellaneous RNA 2213 2222 13,899 4347
Pseudogenes 14,723 1740 15,952
total transcripts 203,835 203,903 154,484 328,827
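
To put a number on the disagreement between databases, the short sketch below compares the protein-coding gene counts from Table 2. The variable names are illustrative. The resulting spread is of the same order as the roughly 1,400 questionable genes mentioned above, although the table itself does not say that the two figures describe the same set of genes.

```python
# Protein-coding gene counts copied from Table 2.
protein_coding_genes = {
    "Gencode": 19_901,
    "Ensembl": 20_376,
    "Refseq": 20_345,
    "CHESS": 21_306,
}

# Find the databases reporting the fewest and the most protein-coding genes.
lowest = min(protein_coding_genes, key=protein_coding_genes.get)
highest = max(protein_coding_genes, key=protein_coding_genes.get)
spread = protein_coding_genes[highest] - protein_coding_genes[lowest]

print(f"lowest:  {lowest} ({protein_coding_genes[lowest]:,})")
print(f"highest: {highest} ({protein_coding_genes[highest]:,})")
print(f"spread:  {spread:,} genes")
```

The gap between the lowest count (Gencode) and the highest (CHESS) is 1,405 genes.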