Genomics Terminology

gnomad

The Genome Aggregation Database (gnomAD, also written gnomad) is a resource developed with the goal of aggregating both exome and genome sequencing data from a wide variety of large-scale sequencing projects, and making summary data available to the wider scientific community. Version 2 (v2) of the gnomAD dataset (GRCh37) spans 125,748 exome sequences and 15,708 whole-genome sequences from unrelated individuals sequenced as part of various disease-specific and population genetic studies. The version 3 (v3) dataset (GRCh38) spans 71,702 genomes. All data is released without restriction on use.

proband

A proband is an individual serving as the starting point for the genetic study of a family (used especially in medicine). A proband is usually the first affected individual in a family who brings a genetic disorder to the attention of the medical community.

trio analysis

A trio refers to two parents and one offspring (2 + 1 = 3, hence trio). In medical genetics, trio analysis usually means analyzing a proband's genome together with the genomes of both parents. An exome trio-based approach is fundamental to the identification of heterozygous dominant pathogenic variants in an affected proband with unaffected parents.
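As a rough illustration of the trio idea, the sketch below filters for variants that are heterozygous in the proband and absent in both parents. It assumes a fully penetrant dominant model, so unaffected parents are expected not to carry the variant. The genotype encoding, the `candidate_de_novo` function, and the variant IDs are all hypothetical and do not come from any particular tool.

```python
from typing import Dict, List

# Assumed encoding: number of alternate alleles per genotype
# (0 = homozygous reference, 1 = heterozygous, 2 = homozygous alternate).
TrioCall = Dict[str, int]


def candidate_de_novo(calls: Dict[str, TrioCall]) -> List[str]:
    """Return IDs of variants that are heterozygous in the proband
    and homozygous reference in both parents."""
    hits = []
    for variant_id, genotype in calls.items():
        if (
            genotype["proband"] == 1
            and genotype["mother"] == 0
            and genotype["father"] == 0
        ):
            hits.append(variant_id)
    return hits


# Made-up genotype calls for three illustrative variants.
example_calls = {
    "var_a": {"proband": 1, "mother": 0, "father": 0},  # absent in both parents
    "var_b": {"proband": 1, "mother": 1, "father": 0},  # inherited from mother
    "var_c": {"proband": 2, "mother": 1, "father": 1},  # homozygous in proband
}

print(candidate_de_novo(example_calls))
```

A real pipeline would also consider genotype quality, read depth, and population allele frequency before treating such a variant as a candidate.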

Hail

Hail is an open-source library for scalable data exploration and analysis, with a particular emphasis on genomics. See the overview for a high-level walkthrough of the library, the GWAS tutorial for a simple example of conducting a genome-wide association study, and the installation page to get started using Hail.

GATK

The Genome Analysis ToolKit (GATK) is a genomic analysis toolkit focused on variant discovery. GATK is the industry standard for identifying SNPs and indels in germline DNA and RNAseq data. Its scope also includes somatic short variant calling and the detection of copy number variation (CNV) and structural variation (SV). In addition to the variant callers themselves, the GATK includes many utilities for related tasks such as processing and quality control of high-throughput sequencing data, and it bundles the popular Picard toolkit.

These tools were primarily designed to process exomes and whole genomes generated with Illumina sequencing technology, but they can be adapted to handle a variety of other technologies and experimental designs. Although it was originally developed for human genetics, the GATK has since evolved to handle genome data from any organism, with any level of ploidy.

human genome

The total length of the human genome is over 3 billion base pairs. The genome is organized into 22 paired chromosomes, termed autosomes, plus a 23rd pair of sex chromosomes: XX in females and XY in males. These are all large linear DNA molecules contained within the cell nucleus. The genome also includes the mitochondrial DNA, a comparatively small circular molecule present in each mitochondrion. Basic information about these molecules and their gene content, based on a reference genome that does not represent the sequence of any specific individual, is provided in the following table.

Chromosome	Length (mm)	Base pairs	Variations	Protein coding genes	Pseudogenes	long ncRNA	small ncRNA	miRNA	rRNA	snRNA	snoRNA	gnomAD exome.vcf	Links	Centromere pos (Mbp)	Cumulative (%)
1	85	248,956,422	12,151,146	2058	1220	1200	496	134	66	221	145	5.77 GiB	EBI	125	7.9
2	83	242,193,529	12,945,965	1309	1023	1037	375	115	40	161	117	4.20 GiB	EBI	93.3	16.2
3	67	198,295,559	10,638,715	1078	763	711	298	99	29	138	87	3.29 GiB	EBI	91	23
4	65	190,214,555	10,165,685	752	727	657	228	92	24	120	56	2.17 GiB	EBI	50.4	29.6
5	62	181,538,259	9,519,995	876	721	844	235	83	25	106	61	2.51 GiB	EBI	48.4	35.8
6	58	170,805,979	9,130,476	1048	801	639	234	81	26	111	73	2.83 GiB	EBI	61	41.6
7	54	159,345,973	8,613,298	989	885	605	208	90	24	90	76	2.88 GiB	EBI	59.9	47.1
8	50	145,138,636	8,221,520	677	613	735	214	80	28	86	52	2.13 GiB	EBI	45.6	52
9	48	138,394,717	6,590,811	786	661	491	190	69	19	66	51	2.40 GiB	EBI	49	56.3
10	46	133,797,422	7,223,944	733	568	579	204	64	32	87	56	2.23 GiB	EBI	40.2	60.9
11	46	135,086,622	7,535,370	1298	821	710	233	63	24	74	76	3.61 GiB	EBI	53.7	65.4
12	45	133,275,309	7,228,129	1034	617	848	227	72	27	106	62	3.07 GiB	EBI	35.8	70
13	39	114,364,328	5,082,574	327	372	397	104	42	16	45	34	0.98 GiB	EBI	17.9	73.4
14	36	107,043,718	4,865,950	830	523	533	239	92	10	65	97	2.02 GiB	EBI	17.6	76.4
15	35	101,991,189	4,515,076	613	510	639	250	78	13	63	136	2.08 GiB	EBI	19	79.3
16	31	90,338,345	5,101,702	873	465	799	187	52	32	53	58	3.04 GiB	EBI	36.6	82
17	28	83,257,441	4,614,972	1197	531	834	235	61	15	80	71	3.62 GiB	EBI	24	84.8
18	27	80,373,285	4,035,966	270	247	453	109	32	13	51	36	0.88 GiB	EBI	17.2	87.4
19	20	58,617,616	3,858,269	1472	512	628	179	110	13	29	31	4.30 GiB	EBI	26.5	89.3
20	21	64,444,167	3,439,621	544	249	384	131	57	15	46	37	1.44 GiB	EBI	27.5	91.4
21	16	46,709,983	2,049,697	234	185	305	71	16	5	21	19	0.65 GiB	EBI	13.2	92.6
22	17	50,818,468	2,135,311	488	324	357	78	31	5	23	23	1.43 GiB	EBI	14.7	93.8
X	53	156,040,895	5,753,881	842	874	271	258	128	22	85	64	1.33 GiB	EBI	60.6	99.1
Y	20	57,227,415	211,643	71	388	71	30	15	7	17	3	15.66 GiB	EBI	10.4	100
mtDNA	0.0054	16,569	929	13	0	0	24	0	2	0	0	NA	EBI	N/A	100
total		3,088,286,401	155,630,645	20412	14600	14727	5037	1756	532	1944	1521	58.81 GiB

Table 1 (above) summarizes the physical organization and gene content of the human reference genome, with links to the original analysis, as published in the Ensembl database at the European Bioinformatics Institute (EBI) and Wellcome Trust Sanger Institute. Chromosome lengths were estimated by multiplying the number of base pairs by 0.34 nanometers, the distance between base pairs in the DNA double helix. A recent estimation of human chromosome lengths based on updated data reports 205.00 cm for the diploid male genome and 208.23 cm for the diploid female genome, corresponding to weights of 6.41 and 6.51 picograms (pg), respectively. The number of proteins is based on the number of initial precursor mRNA transcripts, and does not include products of alternative pre-mRNA splicing, or modifications to protein structure that occur after translation.
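The length estimate described above is a single multiplication followed by a unit conversion. The sketch below applies it to three base-pair counts taken from Table 1. The constant and function names are chosen here for illustration only.

```python
# Distance between adjacent base pairs in the DNA double helix, in nanometers.
NM_PER_BASE_PAIR = 0.34
# One nanometer is 1e-6 millimeters.
MM_PER_NM = 1e-6


def length_mm(base_pairs: int) -> float:
    """Estimate the physical length of a DNA molecule in millimeters."""
    return base_pairs * NM_PER_BASE_PAIR * MM_PER_NM


# Base-pair counts copied from Table 1.
table_base_pairs = {
    "1": 248_956_422,
    "21": 46_709_983,
    "X": 156_040_895,
}

for chromosome, base_pairs in table_base_pairs.items():
    print(f"chromosome {chromosome}: {length_mm(base_pairs):.2f} mm")
```

Rounded to the nearest millimeter, these estimates give 85, 16, and 53 mm, matching the Length (mm) column for chromosomes 1, 21, and X.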

human genes (count)

The number of genes in the human genome (see: full gene list) is not entirely clear because the function of numerous transcripts remains unclear. This is especially true for non-coding RNA (see below). The number of protein-coding genes is better known but there are still on the order of 1,400 questionable genes which may or may not encode functional proteins, usually encoded by short open reading frames. Table 2 gives estimates from various projects and shows these discrepancies.

Table 2. Number of human genes
(according to different databases)
	Gencode	Ensembl	Refseq	CHESS
protein-coding genes	19,901	20,376	20,345	21,306
lncRNA genes	15,779	14,720	17,712	18,484
antisense RNA	5501		28	2694
miscellaneous RNA	2213	2222	13,899	4347
Pseudogenes	14,723	1740	15,952
total transcripts	203,835	203,903	154,484	328,827
