Loose ends: almost one in five human genes still have unresolved coding status

Abascal, Federico; Juan, David; Jungreis, Irwin; Martinez, Laura; Rigau, Maria; Rodriguez, Jose Manuel; Vazquez, Jesus; Tress, Michael L.

dc.contributor.author	Abascal, Federico
dc.contributor.author	Juan, David
dc.contributor.author	Jungreis, Irwin
dc.contributor.author	Martinez, Laura
dc.contributor.author	Rigau, Maria
dc.contributor.author	Rodriguez, Jose Manuel
dc.contributor.author	Vazquez, Jesus
dc.contributor.author	Tress, Michael L.
dc.contributor.funder	National Institutes of Health (United States)
dc.date.accessioned	2018-10-26T07:59:26Z
dc.date.available	2018-10-26T07:59:26Z
dc.date.issued	2018
dc.description.abstract	Seventeen years after the sequencing of the human genome, the human proteome is still under revision. One in eight of the 22 210 coding genes listed by the Ensembl/GENCODE, RefSeq and UniProtKB reference databases are annotated differently across the three sets. We have carried out an in-depth investigation on the 2764 genes classified as coding by one or more sets of manual curators and not coding by others. Data from large-scale genetic variation analyses suggests that most are not under protein-like purifying selection and so are unlikely to code for functional proteins. A further 1470 genes annotated as coding in all three reference sets have characteristics that are typical of non-coding genes or pseudogenes. These potential non-coding genes also appear to be undergoing neutral evolution and have considerably less supporting transcript and protein evidence than other coding genes. We believe that the three reference databases currently overestimate the number of human coding genes by at least 2000, complicating and adding noise to large-scale biomedical experiments. Determining which potential non-coding genes do not code for proteins is a difficult but vitally important task since the human reference proteome is a fundamental pillar of most basic research and supports almost all large-scale biomedical projects.
dc.description.peerreviewed	Yes
dc.description.sponsorship	National Institutes of Health [2 U41 HG007234 to I.J., L.M., J.M.R. and M.L.T., R01 HG004037 to I.J.]. Funding for open access charge: NIH [2 U41 HG007234].
dc.format.page	7070-7084
dc.format.volume	46
dc.identifier	ISI:000444131400017
dc.identifier.citation	Nucleic Acids Res. 2018; 46(14):7070-7084
dc.identifier.doi	10.1093/nar/gky587
dc.identifier.e-issn	1362-4962
dc.identifier.issn	0305-1048
dc.identifier.journal	Nucleic Acids Research
dc.identifier.pubmedID	29982784
dc.identifier.uri	http://hdl.handle.net/20.500.12105/6538
dc.language.iso	eng
dc.publisher	Oxford University Press
dc.relation.publisherversion	https://doi.org/10.1093/nar/gky587
dc.repisalud.institucion	CNIC
dc.repisalud.orgCNIC	CNIC::Research groups::Cardiovascular proteomics
dc.rights.accessRights	open access	es_ES
dc.rights.license	Attribution 4.0 International	*
dc.rights.uri	http://creativecommons.org/licenses/by/4.0/	*
dc.subject	HUMAN GENOME
dc.subject	EVOLUTIONARY INFORMATION
dc.subject	FUNCTIONALLY IMPORTANT
dc.subject	INTEGRATED MAP
dc.subject	PROTEOME
dc.subject	PREDICTION
dc.subject	TOPOLOGY
dc.subject	DATABASE
dc.subject	PROJECT
dc.subject	NUMBER
dc.title	Loose ends: almost one in five human genes still have unresolved coding status
dc.type	journal article
dc.type.hasVersion	VoR
dspace.entity.type	Publication
relation.isAuthorOfPublication	63e55d34-c1c9-439c-bc46-f5b9830e538a
relation.isAuthorOfPublication	9743763b-919c-4fa9-a53c-57c41be5e0ac
relation.isAuthorOfPublication.latestForDiscovery	63e55d34-c1c9-439c-bc46-f5b9830e538a