Publication:
An audit of the PeptideAtlas database uncovers evidence for repurposed pseudogenes and co-opted retroviral ORFs.

Abstract

The human genome has been the subject of scrutiny for more than two decades, yet new protein-coding genes are still being uncovered. Recently, ribosome profiling experiments have provided evidence for the translation of thousands of novel open reading frames (ORFs). To determine how many of these novel ORFs have peptide support, we carried out an in-depth investigation of an entire mass spectrometry proteomics database. We analysed the peptides housed in the human build of the PeptideAtlas database and identified reliable evidence for 35 potential coding genes not annotated in the Ensembl/GENCODE reference gene set. Evidence from complementary sources confirmed that 16 were almost certainly coding genes, but we believe that at least 14 are most likely undergoing aberrant translation. These 14 genes had reading frames that were not preserved beyond human, and their peptides were restricted to cancers or cell lines. Remarkably, three of the 16 likely coding genes were derived from endogenous retroviral ORFs and were expressed only in placenta. All three had evidence of purifying selection. Retroviral ORFs (syncytins) with distinct origins are expressed in almost all mammalian placentae, and these results suggest that co-opted ORFs may also play an important role in placental development. Our analysis shows that proteomics data can be used in conjunction with evolutionary evidence to confirm the existence of new coding genes. The evidence suggests that testis and placenta are the tissues most likely to express still-to-be-identified coding genes, and that there may be other transposon-derived ORFs that have been co-opted as coding genes. The strong evidence for the translation of regions under dysregulated conditions has important implications for the annotation of coding genes and for the analysis of cancer and other degenerative diseases.

The online version contains supplementary material available at 10.1186/s12864-025-12238-w.

Bibliographic citation

BMC Genomics. 2025 Nov 21;26(1):1087.
