SQANTI: extensive characterization of long-read transcript sequences for quality control in full-length transcriptome identification and quantification

Tardaguila, Manuel; de la Fuente, Lorena; Marti, Cristina; Pereira, Cecile; Pardo-Palacios, Francisco Jose; del Risco, Hector; Ferrell, Marc; Mellado, Maravillas; Macchietto, Marissa; Verheggen, Kenneth; Edelmann, Mariola; Ezkurdia, Iakes; Vazquez, Jesus; Tress, Michael; Mortazavi, Ali; Martens, Lennart; Rodriguez-Navarro, Susana; Moreno-Manzano, Victoria; Conesa, Ana

dc.contributor.author	Tardaguila, Manuel
dc.contributor.author	de la Fuente, Lorena
dc.contributor.author	Marti, Cristina
dc.contributor.author	Pereira, Cecile
dc.contributor.author	Pardo-Palacios, Francisco Jose
dc.contributor.author	del Risco, Hector
dc.contributor.author	Ferrell, Marc
dc.contributor.author	Mellado, Maravillas
dc.contributor.author	Macchietto, Marissa
dc.contributor.author	Verheggen, Kenneth
dc.contributor.author	Edelmann, Mariola
dc.contributor.author	Ezkurdia, Iakes
dc.contributor.author	Vazquez, Jesus
dc.contributor.author	Tress, Michael
dc.contributor.author	Mortazavi, Ali
dc.contributor.author	Martens, Lennart
dc.contributor.author	Rodriguez-Navarro, Susana
dc.contributor.author	Moreno-Manzano, Victoria
dc.contributor.author	Conesa, Ana
dc.contributor.funder	National Institutes of Health (United States)
dc.contributor.funder	University of Florida (United States)
dc.contributor.funder	Ministerio de Economía y Competitividad (Spain)
dc.contributor.funder	Ministerio de Educación (Spain)
dc.date.accessioned	2018-11-22T08:10:53Z
dc.date.available	2018-11-22T08:10:53Z
dc.date.issued	2018
dc.description.abstract	High-throughput sequencing of full-length transcripts using long reads has paved the way for the discovery of thousands of novel transcripts, even in well-annotated mammalian species. The advances in sequencing technology have created a need for studies and tools that can characterize these novel variants. Here, we present SQANTI, an automated pipeline for the classification of long-read transcripts that can assess the quality of data and the preprocessing pipeline using 47 unique descriptors. We apply SQANTI to a neuronal mouse transcriptome using Pacific Biosciences (PacBio) long reads and illustrate how the tool is effective in characterizing and describing the composition of the full-length transcriptome. We perform extensive evaluation of ToFU PacBio transcripts by PCR to reveal that an important number of the novel transcripts are technical artifacts of the sequencing approach and that SQANTI quality descriptors can be used to engineer a filtering strategy to remove them. Most novel transcripts in this curated transcriptome are novel combinations of existing splice sites, resulting more frequently in novel ORFs than novel UTRs, and are enriched in both general metabolic and neural-specific functions. We show that these new transcripts have a major impact in the correct quantification of transcript levels by state-of-the-art short-read-based quantification algorithms. By comparing our iso-transcriptome with public proteomics databases, we find that alternative isoforms are elusive to proteogenomics detection. SQANTI allows the user to maximize the analytical outcome of long-read technologies by providing the tools to deliver quality-evaluated and curated full-length transcriptomes.
dc.description.peerreviewed	Yes
dc.description.sponsorship	We thank Eric Triplett (University of Florida) for support in sequencing experiments and Elizabeth Tseng (PacBio) for helping in running the ToFU pipeline and critically reading this manuscript. This work has been partially funded by the University of Florida Preeminence hires program, the Spanish Ministry of Economy and Competitiveness grants BIO2015-71658-R, BFU2014-57636-P, Spanish Ministry of Education grant FPU2013/02348, and GENCODE NIH grant 2U41 HG007234.
dc.format.page	396-411
dc.format.volume	28
dc.identifier	ISI:000426355600012
dc.identifier.citation	Genome Res. 2018; 28(3):396-411
dc.identifier.doi	10.1101/gr.222976.117
dc.identifier.e-issn	1549-5469
dc.identifier.issn	1088-9051
dc.identifier.journal	Genome Research
dc.identifier.pubmedID	29440222
dc.identifier.uri	http://hdl.handle.net/20.500.12105/6686
dc.language.iso	eng
dc.publisher	Cold Spring Harbor Laboratory Press
dc.relation.projectID	info:eu-repo/grantAgreement/ES/BIO2015-71658-R	es_ES
dc.relation.projectID	info:eu-repo/grantAgreement/ES/BFU2014-57636-P	es_ES
dc.relation.projectID	info:eu-repo/grantAgreement/ES/FPU2013/02348	es_ES
dc.relation.publisherversion	https://doi.org/10.1101/gr.222976.117
dc.repisalud.institucion	CNIC
dc.repisalud.orgCNIC	CNIC::Research groups::Cardiovascular proteomics
dc.repisalud.orgCNIC	CNIC::Technical units::Proteomics / Metabolomics
dc.rights.accessRights	open access	es_ES
dc.rights.license	Attribution 4.0 International	*
dc.rights.uri	http://creativecommons.org/licenses/by/4.0/	*
dc.title	SQANTI: extensive characterization of long-read transcript sequences for quality control in full-length transcriptome identification and quantification
dc.type	journal article
dc.type.hasVersion	VoR
dspace.entity.type	Publication
relation.isAuthorOfPublication	bd96f60d-98c7-45d3-b247-22b4b53c78b6
relation.isAuthorOfPublication	9743763b-919c-4fa9-a53c-57c41be5e0ac
relation.isAuthorOfPublication	4cd57a02-4264-435c-a2be-ac764f9a0ae6
relation.isAuthorOfPublication.latestForDiscovery	bd96f60d-98c7-45d3-b247-22b4b53c78b6