Multiple evidence strands suggest that there may be as few as 19 000 human protein-coding genes

Ezkurdia, Iakes; Juan, David; Manuel Rodriguez, Jose; Frankish, Adam; Diekhans, Mark; Harrow, Jennifer; Vazquez, Jesus; Valencia, Alfonso; Tress, Michael L.

dc.contributor.author	Ezkurdia, Iakes
dc.contributor.author	Juan, David
dc.contributor.author	Manuel Rodriguez, Jose
dc.contributor.author	Frankish, Adam
dc.contributor.author	Diekhans, Mark
dc.contributor.author	Harrow, Jennifer
dc.contributor.author	Vazquez, Jesus
dc.contributor.author	Valencia, Alfonso
dc.contributor.author	Tress, Michael L.
dc.contributor.funder	National Institutes of Health (United States)
dc.contributor.funder	Ministerio de Ciencia e Innovación (Spain)
dc.date.accessioned	2017-12-01T07:37:29Z
dc.date.available	2017-12-01T07:37:29Z
dc.date.issued	2014
dc.description.abstract	Determining the full complement of protein-coding genes is a key goal of genome annotation. The most powerful approach for confirming protein-coding potential is the detection of cellular protein expression through peptide mass spectrometry (MS) experiments. Here, we mapped peptides detected in seven large-scale proteomics studies to almost 60% of the protein-coding genes in the GENCODE annotation of the human genome. We found a strong relationship between detection in proteomics experiments and both gene family age and cross-species conservation. Most of the genes for which we detected peptides were highly conserved. We found peptides for >96% of genes that evolved before bilateria. At the opposite end of the scale, we identified almost no peptides for genes that have appeared since primates, for genes that did not have any protein-like features or for genes with poor cross-species conservation. These results motivated us to describe a set of 2001 potential non-coding genes based on features such as weak conservation, a lack of protein features, or ambiguous annotations from major databases, all of which correlated with low peptide detection across the seven experiments. We identified peptides for just 3% of these genes. We show that many of these genes behave more like non-coding genes than protein-coding genes and suggest that most are unlikely to code for proteins under normal circumstances. We believe that their inclusion in the human protein-coding gene catalogue should be revised as part of the ongoing human genome annotation effort.
dc.description.peerreviewed	Yes
dc.description.sponsorship	This work was supported by the National Institutes of Health (NIH, grant number U41 HG007234) and by the Spanish Ministry of Science and Innovation (grant numbers BIO2007-666855, RD07-0067-0014, COMBIOMED). J.M.R. is supported by the Spanish National Institute of Bioinformatics (www.inab.org), a platform of the 'Instituto de Salud Carlos III'. Funding to pay the Open Access publication charges for this article was provided by the National Institutes of Health (NIH, grant number U41 HG007234).
dc.format.page	5866-5878
dc.format.volume	23
dc.identifier	ISI:000344671900002
dc.identifier.citation	Hum Mol Genet. 2014; 23(22):5866-78
dc.identifier.doi	10.1093/hmg/ddu309
dc.identifier.e-issn	1460-2083
dc.identifier.issn	0964-6906
dc.identifier.journal	Human Molecular Genetics
dc.identifier.pubmedID	24939910
dc.identifier.uri	http://hdl.handle.net/20.500.12105/5536
dc.language.iso	eng
dc.publisher	Oxford University Press
dc.relation.publisherversion	https://doi.org/10.1093/hmg/ddu309
dc.repisalud.institucion	CNIC
dc.repisalud.orgCNIC	CNIC::Research groups::Cardiovascular proteomics
dc.repisalud.orgCNIC	CNIC::Technical units::Proteomics / Metabolomics
dc.rights.accessRights	open access	es_ES
dc.rights.license	Attribution-NonCommercial 4.0 International	*
dc.rights.uri	http://creativecommons.org/licenses/by-nc/4.0/	*
dc.subject	HUMAN GENOME
dc.subject	EVOLUTIONARY INFORMATION
dc.subject	MASS-SPECTROMETRY
dc.subject	CELL-LINE
dc.subject	PROTEOMICS
dc.subject	DATABASE
dc.subject	ANNOTATION
dc.subject	PREDICTION
dc.subject	PROJECT
dc.subject	SEQUENCES
dc.title	Multiple evidence strands suggest that there may be as few as 19 000 human protein-coding genes
dc.type	journal article
dc.type.hasVersion	VoR
dspace.entity.type	Publication
relation.isAuthorOfPublication	bd96f60d-98c7-45d3-b247-22b4b53c78b6
relation.isAuthorOfPublication	9743763b-919c-4fa9-a53c-57c41be5e0ac
relation.isAuthorOfPublication	d691c3d3-9e05-4217-a923-08e68ba16baa
relation.isAuthorOfPublication.latestForDiscovery	bd96f60d-98c7-45d3-b247-22b4b53c78b6