Nucleic Acids Research, 2018
doi: 10.1093/nar/gky955

GENCODE reference annotation for the human and mouse genomes

Adam Frankish1, Mark Diekhans2, Anne-Maud Ferreira3, Rory Johnson4,5, Irwin Jungreis6,7, Jane Loveland1, Jonathan M. Mudge1, Cristina Sisu8,9, James Wright10, Joel Armstrong2, If Barnes1, Andrew Berry1, Alexandra Bignell1, Silvia Carbonell Sala11, Jacqueline Chrast3, Fiona Cunningham1, Tomás Di Domenico12, Sarah Donaldson1, Ian T. Fiddes2, Carlos García Girón1, Jose Manuel Gonzalez1, Tiago Grego1, Matthew Hardy1, Thibaut Hourlier1, Toby Hunt1, Osagie G. Izuogu1, Julien Lagarde11, Fergal J. Martin1, Laura Martínez12, Shamika Mohanan1, Paul Muir13,14, Fabio C.P. Navarro8, Anne Parker1, Baikang Pei8, Fernando Pozo12, Magali Ruffier1, Bianca M. Schmitt1, Eloise Stapleton1, Marie-Marthe Suner1, Irina Sycheva1, Barbara Uszczynska-Ratajczak15, Jinuri Xu8, Andrew Yates1, Daniel Zerbino1, Yan Zhang8,16, Bronwen Aken1, Jyoti S. Choudhary10, Mark Gerstein8,17,18, Roderic Guigó11,19, Tim J.P. Hubbard20, Manolis Kellis6,7, Benedict Paten2, Alexandre Reymond3, Michael L. Tress12 and Paul Flicek1,*

1European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK, 2UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA 95064, USA, 3Center for Integrative Genomics, University of Lausanne, 1015 Lausanne, Switzerland, 4Department of Medical Oncology, Inselspital, University Hospital, University of Bern, Bern, Switzerland, 5Department of Biomedical Research (DBMR), University of Bern, Bern, Switzerland, 6MIT Computer Science and Artificial Intelligence Laboratory, 32 Vasser St, Cambridge, MA 02139, USA, 7Broad Institute of MIT and Harvard, 415 Main Street, Cambridge, MA 02142, USA, 8Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT 06520, USA, 9Department of Bioscience, Brunel University London, Uxbridge UB8 3PH, UK, 10Functional Proteomics, Division of Cancer Biology, Institute of Cancer Research, 123 Old Brompton Road, London SW7 3RP, UK, 11Centre for Genomic Regulation (CRG), The Barcelona Institute for Science and Technology, Dr.
Aiguader 88, Barcelona, E-08003 Catalonia, Spain, 12Bioinformatics Unit, Spanish National Cancer Research Centre (CNIO), Madrid, Spain, 13Department of Molecular, Cellular & Developmental Biology, Yale University, New Haven, CT 06520, USA, 14Systems Biology Institute, Yale University, West Haven, CT 06516, USA, 15Centre of New Technologies, University of Warsaw, Warsaw, Poland, 16Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH 43210, USA, 17Program in Computational Biology & Bioinformatics, Yale University, Bass 432, 266 Whitney Avenue, New Haven, CT 06520, USA, 18Department of Computer Science, Yale University, Bass 432, 266 Whitney Avenue, New Haven, CT 06520, USA, 19Universitat Pompeu Fabra (UPF), Barcelona, E-08003 Catalonia, Spain and 20Department of Medical and Molecular Genetics, King’s College London, Guys Hospital, Great Maze Pond, London SE1 9RT, UK

Received August 15, 2018; Revised September 20, 2018; Editorial Decision October 02, 2018; Accepted October 08, 2018

ABSTRACT

The accurate identification and description of the genes in the human and mouse genomes is a fundamental requirement for high quality analysis of data informing both genome biology and clinical genomics. Over the last 15 years, the GENCODE consortium has been producing reference quality gene annotations to provide this foundational resource. The GENCODE consortium includes both experimental and computational biology groups who work together to improve and extend the GENCODE gene annotation. Specifically, we generate primary data, create bioinformatics tools and provide analysis to support the work of expert manual gene annotators and automated gene annotation pipelines. In addition, manual and computational annotation workflows use any and all publicly available data and analysis, along with the research literature to identify and characterise gene loci to the highest standard. GENCODE gene annotations are accessible via the Ensembl and UCSC Genome Browsers, the Ensembl FTP site, Ensembl Biomart, Ensembl Perl and REST APIs as well as https://www.gencodegenes.org.

*To whom correspondence should be addressed. Tel: +44 1223 492581; Fax: +44 1223 494494; Email: flicek@ebi.ac.uk

© The Author(s) 2018. Published by Oxford University Press on behalf of Nucleic Acids Research. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

Downloaded from https://academic.oup.com/nar/advance-article-abstract/doi/10.1093/nar/gky955/5144133 by guest on 29 November 2018

INTRODUCTION

The GENCODE consortium produces foundational reference genome annotation for the human and mouse genomes as well as tools and data to maintain and improve these annotations. Our overall goal is to identify and classify, with high accuracy, all gene features in the human and mouse genomes based on defined biological evidence and to make these annotations freely available for the benefit of biomedical research and genome interpretation.

The GENCODE project was founded in 2003 as part of the pilot phase of the ENCODE project to provide reference quality manual gene annotation for the 30Mb (∼1%) of the reference human genome targeted by the ENCODE pilot (1–3). In 2007, we expanded our scope to the whole human genome as the ENCODE project did the same (4,5). In 2012, we began annotating the mouse reference genome to the same standards as human, while continuing to improve the existing gene annotation in both species via targeted reinvestigation of loci flagged by external users and internal QC pipelines.
Today, the GENCODE consortium is a long-running partnership of manual annotation, computational biology and experimental groups including four of the founding groups (HAVANA, CRG, Yale and UCSC) and three groups that joined in 2007 (Ensembl, MIT and CNIO). Our gene annotations are regularly released as the Ensembl/GENCODE gene sets. The gene sets are comprehensive and include protein-coding and non-coding loci, along with alternatively spliced isoforms and pseudogenes. To produce the annotations, we leverage computational and experimental methods to identify new genes and new transcript isoforms, directing manual annotation to regions requiring expert investigation. The Ensembl/GENCODE annotations are the default human and mouse annotation for the Ensembl project (6), while the UCSC Genome Browser (7) uses the human annotation as default and the mouse annotation as a secondary resource until the mouse clone-by-clone annotation is complete (see below). For each versioned release, the underlying genome annotation is exactly the same whether it is accessed at Ensembl, UCSC or https://www.gencodegenes.org, although there are minor differences in presentation associated with genome assembly patches and representation of the pseudoautosomal regions on the X and Y chromosomes. We also provide subsets of the annotation as described below. For simplicity, we will here refer to the annotation holistically as GENCODE.

GENCODE is the reference annotation of choice adopted by many large international consortia including ENCODE, GTEx (8), the International Cancer Genome Consortium (ICGC) (9), component projects of the International Human Epigenome Consortium (10), the 1000 Genomes Project (11), the Exome Aggregation Consortium (EXAC) and Genome Aggregation Database (gnomAD) (12) and the Human Cell Atlas (HCA) (13).
GENCODE ANNOTATION METHODS AND RESULTS

The GENCODE consortium annotates protein-coding genes, pseudogenes, long non-coding RNAs (lncRNAs) and small non-coding RNAs (sncRNAs). We define protein-coding genes as loci where the weight of available evidence supports the presence of a coding sequence (CDS). Evidence for a CDS may come from high-throughput experimental assays, the demonstration of physiological function in the research literature, the observation of homology to a known protein-coding gene, or the interpretation of evolutionary conservation data. Pseudogenes are sequences derived from protein-coding genes, containing disabling mutations such as in-frame stop codons, frameshifting indels, truncations or insertions, or for which there is no evidence of transcription. lncRNA genes are identified by a combination of transcriptional evidence and a lack of potential to be assigned as protein-coding. We do not absolutely require lncRNA genes to be longer than 200 bp, but very few annotated lncRNAs fall below this threshold, as we also require annotated lncRNAs to be free of secondary structures found in known functional sncRNAs. Currently, sncRNAs are almost entirely annotated by computational pipelines that use homology to known sncRNA sequences and predicted secondary structure to identify functional copies.

Our annotation processes use primary transcript and proteomics data, evolutionary conservation, computational methods and curated public databases such as UniProt (14). These data are integrated using a combination of expert manual annotators and computational methods to identify regions of the genome with genic potential, annotate the exon-intron structures of transcripts identified at the locus under investigation and assign a functional classification to both the individual transcript and the locus.

Broad functional classes (referred to as ‘biotypes’) of protein-coding, pseudogene, lncRNA and sncRNA are assigned as described above.
More detailed functional categories are also added. For example, at the locus level we describe the provenance of pseudogenes as processed (derived via retrotransposition), unprocessed (derived via a gene duplication event) or unitary (arising from the lineage specific disruption of an ancestral protein-coding gene). At the transcript level we define transcripts belonging to protein-coding loci as protein-coding, nonsense mediated decay (NMD) (containing a premature stop codon believed likely to lead to the transcript being targeted by the nonsense-mediated decay pathway) or retained intron (containing sequence that is intronic in other transcripts from the locus). Following the structural and functional classification of transcripts, a subset of GENCODE annotation is subject to targeted experimental validation as described below to ensure consistent high quality of the gene annotation.

To cater for a variety of use cases, we create a number of annotation sets. Examples of these are our ‘GENCODE comprehensive’ and ‘GENCODE basic’ gene sets. GENCODE comprehensive includes the complete set of annotations including partial transcripts (i.e. transcripts that are not full length, but represent a unique splice form based on available evidence) and biotypes such as NMD. GENCODE basic is a subset of GENCODE comprehensive that contains only transcripts with full-length CDS. For non-coding loci, GENCODE basic includes the smallest number of transcripts that cover 80% of the exonic features, while ensuring all loci are represented by at least 1 transcript. Computational methods add additional information.
For example, APPRIS, described in more detail below, identifies the most likely functional translations at protein-coding loci and TSL (transcript support level) calculates the amount and quality of supporting evidence for each transcript.

Manual annotation

The GENCODE gene set is created by merging the results of manual and computational gene annotation methods. Manual gene annotation has two major modes of operation: clone-by-clone and targeted annotation. ‘Clone-by-clone’ annotation involves ‘walking’ across a genomic region, investigating the sequence, aligned expression data and computational predictions for each BAC clone. In doing so, an expert annotator investigates all possible genic features and considers all possible annotations and biotypes simultaneously. We believe this approach carries substantial advantages. For example, the decision to annotate a locus as protein-coding or pseudogenic benefits from being able to weigh both possibilities in light of all available evidence. This process helps prevent false positive and false negative misclassifications. Targeted annotation is designed to answer specific questions such as ‘is there an unannotated protein-coding gene in this position?’ Ranked target lists are generated by computational analysis based, for example, on transcriptomic data, shotgun proteomic data or conservation measures. Over the last two years mouse annotation has been dominated by the clone-by-clone approach while the human genome has been refined entirely via targeted reannotation except for the annotation of human assembly patches and haplotypes released by the Genome Reference Consortium (15), which take a clone-by-clone approach.

Over the last two years, we have focused on two broad areas: completing the first pass manual annotation across the entire mouse reference genome and a dedicated effort to improve the annotation of protein-coding genes in human and mouse.
We have completed the annotation of novel protein-coding genes, lncRNAs and pseudogenes, plus QC and updating previous annotation where necessary for mouse chromosomes 9, 10, 11, 12, 13, 14, 15, 16 and 17. These updates bring the fraction of the mouse genome with completed first pass manual annotation to approximately 97%. In addition, we have continued to work with the NCBI and Mouse Genome Informatics project at the Jackson Laboratory to resolve annotation differences for protein-coding, pseudogene and lncRNA loci. For protein-coding genes this is under the umbrella of the Consensus Coding Sequence (CCDS) project (16).

We have also manually investigated unannotated regions of high protein-coding potential identified by whole genome analysis using PhyloCSF (17) (a tool described in more detail below). In human, this led to the addition of 144 novel protein-coding genes and 271 pseudogenes (of which 42 were unitary pseudogenes). In mouse, we annotated orthologous loci for all but 11 of the 144 human protein-coding genes. We have also revisited the annotation of all olfactory receptor loci in both human and mouse, using RNAseq data to define 5′ and 3′ UTR sequences for ∼1400 loci. In human we have also targeted a ‘deep dive’ manual reannotation of genes on clinical panels for paediatric neurological disorders to identify missing functional alternative splicing. Incorporating second and third generation transcriptomic data, we reannotated ∼190 genes and added more than 3600 alternatively spliced transcripts, including ∼1400 entirely novel exons and an additional ∼30kb of CDS. We have also completed an effort to capture all recently described unannotated microexons (18) into GENCODE, and added a further 146 novel microexons mined from public SLRseq data (19).

As part of the CCDS collaboration with RefSeq, we have checked a large subset of human loci where there was disagreement over gene biotype.
Similarly, we have checked all UniProt manually annotated and reviewed (i.e. Swiss-Prot) accessions that lack an equivalent in GENCODE. As a result, we added 32 novel protein-coding loci to GENCODE and rejected more than 200 putative coding loci. Finally, we are manually reviewing genes previously annotated as protein-coding, but with weak or no support based on a method incorporating UniProt, APPRIS, PhyloCSF, Ensembl comparative genomics, RNA-seq, mass spectrometry and variation data (20,21). Of the 821 loci investigated to date, 54 have had their coding status removed while a further 110 potentially dubious cases remain under review.

The approach taken is reflected in the kinds of updates captured in the annotation. For example, the targeted reannotation in human leads to the annotation of few novel protein-coding loci but many novel transcripts at updated protein-coding and lncRNA loci. Conversely, in mouse the emphasis on clone-by-clone annotation identifies many more novel loci and transcripts across a broader range of biotypes (Figure 1).

Figure 1. New and updated manually annotated genes and transcripts from July 2016 to June 2018. For both human (left) and mouse (right) the numbers of completely new genes and transcripts, updated genes and transcripts and the total number of manually added or edited genes and transcripts for each of four broad categories of annotation. A new gene annotation can represent a completely de novo locus with no overlap with pre-existing annotation or the reclassification of an existing complex locus into multiple loci to better represent the biology of the locus inferred from transcriptomic and/or proteomic data. A new transcript represents the annotation of a unique exon-intron structure, including novel alternative splicing at an annotated locus. Updated genes and transcripts represent pre-existing loci or transcript models that have been edited to improve the representation of biotype (e.g. changed from lncRNA to protein-coding) or structure (e.g. by extension, addition of novel exons).

Computational annotation of small RNAs

We annotate small non-coding RNAs (sncRNAs) using a variety of mechanisms. Specifically, miRNA annotations are imported directly from miRBase (22), while tRNAs are identified ab initio using tRNAScan-SE (23) although they are not included directly in the gene set. For other classes of sncRNA, including small nucleolar RNAs (snoRNAs), small nuclear RNAs (snRNAs) and small Cajal body-specific RNAs (scaRNAs), we use a homology-based, computational pipeline (24), which first compares sequences of known RNA families in Rfam (25) to the genome using BLAST (26). This initial step reduces the genomic search space and excludes sequences with sub-optimal alignments to the genome. We define putative sncRNA models after clustering top BLAST hits and evaluating these predictions by performing sequence and structure searches against covariance models in the Infernal suite of tools (27).

Pseudogenes

Pseudogene annotations across 18 mouse strains were generated using a combination of manual annotation liftover and computational methods. Additionally, we were able to annotate 88 new human and 131 new mouse unitary pseudogenes relative to each other. Amongst the strains we find roughly 20 unitary pseudogenes per strain. We identified nearly 3000 ancestral pseudogenes conserved across all strains. Meanwhile, ∼20% of the pseudogenes in each strain are strain specific. In line with previous results in human, 15% of pseudogenes exhibit transcriptional activity (bioRxiv: https://doi.org/10.1101/386656).
EXPERIMENTAL ANNOTATION APPROACHES

lncRNA annotation using capture long Seq

Determining the precise boundaries and the exonic structure of low-abundance transcripts, such as lncRNAs, is challenging. We previously showed that 3′ and 5′ boundaries of lncRNAs annotated in GENCODE V7 (April 2011) were less supported by CAGE and PET tags than those of protein-coding genes, even when accounting for differences in expression (28). Methods to assemble transcript sequences from short sequence reads have also been shown to produce poor results when used to resolve the exonic structure of lncRNAs (29,30). To improve lncRNA annotation, we developed the RNA Capture Long Seq (CLS) method (31). Here, probes are first designed against targeted lncRNAs (or suspected, unannotated lncRNA loci). Full-length cDNAs generated from diverse cell types were captured, resulting in cDNA libraries that are highly enriched for the targeted lncRNAs. Libraries were then sequenced using long-read sequencing technologies (31,32). Our initial efforts created a comprehensive capture library targeting the set of intergenic GENCODE lncRNAs in human and mouse, and used it in a set of matched human and mouse tissues (31). This resulted in novel lncRNA transcripts at 3574 loci in human, and 561 in mouse. The long transcript sequences obtained, which often correspond to complete 5′-to-3′ RNA molecules, substantially informed manual annotation. Indeed, CLS produces near manual-quality full-length transcript models at high-throughput scales (32). Our current efforts are to include samples across a more diverse panel of tissues such as fetal timepoints.

Proteomics

Proteomic mass spectrometry datasets are a powerful resource contributing to the validation and annotation of protein-coding genes and transcripts.
In GENCODE, we use proteomics data as an additional layer of evidence when defining the structure and protein-coding potential of a genomic locus. We apply strict criteria to the peptide evidence we consider from mass spectrometry datasets (33–35) to minimize the incorporation of false positive and ambiguous or variant peptide species. In highly curated genomes such as human, the contribution from mass spectrometry experiments requires considerable scale of data and effort, with correspondingly small returns. Our experimental efforts in GENCODE incorporate targeted proteomics experiments, specific experimental designs and synthetically generated peptides to find these elusive protein-coding genes.

Annotation validation and RACEseq

We used RT-PCR amplification followed by highly multiplexed sequencing readout (36) to assess the quality of the annotations. This method evaluates low confidence transcribed loci (novel or putative). Splice site loci were systematically experimentally tested in eight tissues (brain, heart, kidney, liver, lung, spleen, skeletal muscle, and testis) by RT-PCR-seq (36). From human GENCODE versions 3 to 19, a total of 18 132 splice junctions were analyzed and experimentally tested. Seventy eight percent of all assessed junctions were confirmed through experimental validation. Similar to the human annotation, we assessed the quality of the mouse annotation. A total of 3956 splice junctions from GENCODE versions M2 and M4 were tested with a validation rate of 53%. Finally, to assess the completeness of the annotations we amplified and sequenced the transcripts of 527 deeply annotated human protein-coding genes, which are routinely used for diagnostic tests by the UK Genetic Testing Network (UKGTN).
We performed 5′- and 3′-nested-RACEs in seven different tissues (brain, testis, heart, kidney, liver, lung, and spleen) followed by long-read sequencing, which revealed 10 380 novel splice junction candidates.

GENCODE ANNOTATION TOOLS

Comparative annotation toolkit

We developed the Comparative Annotation Toolkit (CAT) (37) to leverage the GENCODE annotations of mouse and human to annotate laboratory mouse strains (38) and great apes (39,40). CAT uses whole genome alignments from Cactus (41) to project GENCODE annotations from mouse or human to related species, and then performs a variety of filtering and clean-up steps to generate a high quality annotation set for these other genomes. The GENCODE M11 mouse annotation was used with CAT to annotate 16 laboratory mouse strains, and these annotations are available in Ensembl. Over 20 000 protein-coding and 12 000 non-coding genes were comparatively annotated in each lab strain. Novel gene predictions using Comparative Augustus (42) also found an average of 22 new loci in classical strains, including the discovery of the gene Efcab3-like in the reference mouse, which was included in subsequent GENCODE releases. Additionally, the GENCODE 27 (August 2017) human annotation set was used to annotate chimpanzee, gorilla and orangutan, and these annotations were incorporated into Genbank, with over 19 000 protein-coding and 36 000 non-coding genes comparatively annotated in all of the great apes.

APPRIS

The APPRIS Database (http://appris-tools.org) (43) was developed to provide annotations for alternative splice variants. APPRIS also determines principal splice isoforms based on cross-species conservation and the conservation of protein structure and function. Most coding genes have a single dominant protein isoform and this main isoform is almost always the APPRIS principal isoform (44).
APPRIS maintains up-to-date annotations for the GENCODE and RefSeq reference sets and has been extended to the UniProtKB proteome and to six model species as well as human and mouse (45). Technical improvements include incremental improvements to the core modules that make up the APPRIS pipeline, the implementation of a UCSC Track Hub to make annotation access easier, and Docker images to allow the execution of the annotation pipeline (45).

APPRIS is an integral part of the pipeline for the prediction of potential non-coding genes (20). For the GENCODE 27 (August 2017) human annotation the completed pipeline flagged 2432 genes.

PhyloCSF

Comparative genomics is one of the most powerful tools available for distinguishing protein-coding genomic regions. Previously, we developed PhyloCSF to support annotation of coding sequences based on the alignment of multiple genome sequences (17). As described above, we combine whole-genome PhyloCSF data with experimental evidence and expert manual annotation to detect novel coding sequences. The workflow begins with PhyloCSF scores computed on every codon in the human genome in each of the six reading frames; applies a Hidden Markov Model to these scores to find candidate coding intervals; excludes intervals previously annotated as coding or pseudogene, or antisense to such intervals, as well as very short intervals; and uses a Support Vector Machine to prioritize the resulting ‘Novel PhyloCSF Regions’. We have created publicly available PhyloCSF track hubs for viewing the whole-genome PhyloCSF data and novel PhyloCSF Regions from human and mouse in the UCSC and Ensembl genome browsers.

Pseudopipe

Pseudopipe identifies and annotates pseudogenes across the genome (46). It takes as input an organism’s protein-coding gene set and searches for homology across the genome using BLAST. Hits overlapping functional genes are removed and the remaining hits are then assembled into pseudogene annotations.
Each annotation is also assigned a parent gene, the functional paralog that gave rise to the pseudogene, as well as a biotype (processed, duplicated, or ambiguous). Unitary pseudogenes are also identified via Pseudopipe by using a different organism’s protein-coding gene set as the input. We inform our annotation with results from Retrofinder (47) and RCPedia (48). In addition to our core annotation files, further information is available at http://www.pseudogene.org. These computational annotations are then combined with manual annotations in order to produce the full pseudogene complement. Pseudogene annotations are given a confidence level based on the intersection with manual annotations. Annotations detected by both the computational pipelines and manual annotators are assigned level 1, those only detected by manual annotators are given level 2, and the consensus annotations detected by PseudoPipe and RetroFinder are given level 3 and made available in a separate annotation file at https://www.gencodegenes.org.

DATA ACCESS

Versioned GENCODE gene sets are currently released approximately four times a year for mouse and twice a year for human. This asymmetric update pattern reflects the fact that the first pass of the human annotation was completed in GENCODE 15 (January 2013), while the mouse first pass is approaching completion (expected for GENCODE M20) and therefore has been the subject of more intensive annotation. The most recent release of the human geneset is GENCODE 29 (October 2018), while the most recent mouse update is GENCODE M19 (October 2018). Each release incorporates the continuous updates arising from expert manual annotation. Figure 2 shows the increase in the numbers of genes and transcripts in human and mouse GENCODE releases over the past two years.
The human genesets look relatively static, although headline figures do not capture updates made to existing annotation and the balancing effect of both adding and removing loci during a release cycle. In mouse, however, there is clear growth in the numbers of both genes and transcripts driven predominantly by the addition of lncRNAs and pseudogenes.

Figure 2. Annotation statistics for human and mouse GENCODE releases from July 2016 to June 2018, encompassing human releases GENCODE 25–28 and mouse releases M10 to M18. The panels on the left show the total number of genes by broad biotype (protein-coding, lncRNA, pseudogene and sncRNA) for each release for human and mouse respectively and panels on the right show the total numbers of genes and transcripts of all biotypes.

Extensive data resources for current and archival GENCODE releases are available at https://www.gencodegenes.org. As described above, the GENCODE gene sets are available as default in the Ensembl genome browser and also accessible via the UCSC genome browser. Other interfaces include the Ensembl FTP site (ftp://ftp.ensembl.org/pub/), which includes gene sets in GFF3, Genbank and GTF formats and full download of the complete Ensembl databases. More complex and customizable gene set queries can be created via the Ensembl Biomart (https://www.ensembl.org/biomart/).

Programmatic access to the GENCODE gene sets is possible via the extensive Ensembl Perl API and the language-agnostic Ensembl REST API. Programmatic access facilitates advanced genome-wide analysis such as retrieval of supporting features and associated gene trees. Examples of REST endpoint usage and starter scripts in different languages are at https://rest.ensembl.org.
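As an illustration of the REST access described above, the following is a minimal sketch rather than a reference client. It assumes the `lookup/id` endpoint with its `expand` option, and it assumes that an expanded gene record lists its transcripts under a `Transcript` key, each with a `biotype` field; both assumptions should be checked against the documentation at https://rest.ensembl.org. The helper names (`build_lookup_url`, `fetch_gene`, `transcripts_by_biotype`) are hypothetical and are not part of any Ensembl library.

```python
import json
import urllib.parse
import urllib.request

ENSEMBL_REST = "https://rest.ensembl.org"


def build_lookup_url(stable_id, expand=True):
    """Build the URL of the `lookup/id` endpoint for one stable ID.

    Assumption: `expand=1` asks the server to include the gene's
    transcripts in the response.
    """
    path = "/lookup/id/" + urllib.parse.quote(stable_id)
    query = urllib.parse.urlencode({"expand": 1 if expand else 0})
    return f"{ENSEMBL_REST}{path}?{query}"


def fetch_gene(stable_id, timeout=30):
    """Fetch one gene record as a dict (performs a network request)."""
    request = urllib.request.Request(
        build_lookup_url(stable_id),
        headers={"Content-Type": "application/json"},  # request JSON output
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def transcripts_by_biotype(gene_record):
    """Count the transcripts of an expanded gene record per biotype.

    Assumption: transcripts are listed under the "Transcript" key.
    Records without that key yield an empty dict.
    """
    counts = {}
    for transcript in gene_record.get("Transcript", []):
        biotype = transcript.get("biotype", "unknown")
        counts[biotype] = counts.get(biotype, 0) + 1
    return counts
```

Under these assumptions, `transcripts_by_biotype(fetch_gene("ENSG00000157764"))` would return a per-biotype transcript count for that gene (the stable ID is only an illustrative choice). Separating URL construction and summarising from the network call keeps the parts that depend on the live service isolated, so the rest can be checked offline.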
GENCODE has been created exclusively on the GRCh38 human assembly since GENCODE 20 (August 2014). However, versions of selected releases since then that have been projection mapped from GRCh38 to GRCh37 are available at UCSC and from https://www.gencodegenes.org. Referred to as the ‘lift37’ annotation set, these data help identify genes where the annotations may have changed between GRCh37 and GRCh38. Because accurate projections are difficult to generate, the ‘lift37’ annotation set is not considered official reference annotation and only minimal support is available.

We welcome questions and feedback from the community directly via the helpdesks at https://www.gencodegenes.org, Ensembl and UCSC. In addition, the Ensembl and UCSC outreach activities annually reach thousands of researchers via workshops at institutions and meetings, web-based training forums and ‘how-to’ guides focused on using the genome browsers and making best use of their features and data.

CONCLUSION

The GENCODE consortium continues to improve the quality of the reference gene annotation in human and mouse. We have integrated cutting-edge developments in the technology and scientific understanding of genome biology into our annotation workflows to improve the representation of existing loci and extend annotation coverage via the addition of entirely novel loci and alternatively spliced transcripts. While the high quality of our existing transcript annotation is extensively supported by both public data and data generated within the consortium, the abundance of evidence from new transcriptomic and proteomic datasets makes it clear that the annotation is not yet complete.
ACKNOWLEDGEMENTS

We thank Tim Hubbard and Jennifer Harrow for their leadership in the GENCODE project from 2003 to 2016, as well as all groups and group members involved in the GENCODE project since its inception, including the HAVANA manual annotation group formerly at Wellcome Sanger Institute now at EMBL-EBI (founder), the Guigó group at Centre for Genomic Regulation (founder), the Gerstein group at Yale (founder), the Center for Biomolecular Science & Engineering at UCSC (founder), the Ensembl team at EMBL-EBI (joined 2007), the Kellis group at MIT (joined 2007), the Tress group at CNIO (joined 2007), the Choudhary group formerly at Wellcome Sanger Institute now at Institute of Cancer Research (joined 2012), the Reymond group at University of Lausanne (2003–2017), the Antonarakis group at University of Geneva (2003–2007), the Wei group at Genome Institute of Singapore (2003–2007), the Gingeras group at Affymetrix Ltd (2003–2007) and the Brent group at Washington University in St. Louis (2007–2012).

FUNDING

National Human Genome Research Institute of the National Institutes of Health [U41HG007234]. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health; Wellcome Trust [WT108749/Z/15/Z, WT200990/Z/16/Z]; European Molecular Biology Laboratory; Swiss National Science Foundation through the National Center of Competence in Research ‘RNA & Disease’ (to R.J.); Medical Faculty of the University of Bern (to R.J.). Funding for open access charge: National Institutes of Health.

Conflict of interest statement. Paul Flicek is a member of the Scientific Advisory Boards of Fabric Genomics, Inc., and Eagle Genomics, Ltd.