################################################################################################
This directory contains data files produced by the GENCODE project, which is a subproject of
ENCODE (http://www.genome.gov/10005107) and is headed by Tim Hubbard at the Wellcome Trust
Sanger Institute, UK.
For questions, please contact Felix Kokocinski, fsk@sanger.ac.uk.
################################################################################################

#######################################
General format of the data freeze files
#######################################

We supply genome-wide features on three different confidence levels. To get annotation close
to the GENCODE gene set we are aiming for, levels 1 + 2 should be used:

* Level 1: validated
  At this time only pseudogene loci that were predicted by the analysis pipelines from YALE
  and UCSC as well as by HAVANA manual annotation from WTSI.

* Level 2: manual annotation
  HAVANA manual annotation from WTSI.
  The following regions are considered "fully annotated" and contain level 2 annotation from
  HAVANA only, although they will still be updated:
  chromosomes 1, 2, 6, 9, 10, 13, 20, 21, 22, X, Y, ENCODE pilot regions, chr11:2353995-3878750.

* Level 3: automated annotation
  ENSEMBL loci in regions where no HAVANA annotation can be found.

This data will be supplied in a single file in GTF2.2 format, with the following tags added to
the attributes column where appropriate:

* level [1,2,3]: validation status as described.
* tag "pseudo_consens": member of the pseudogene set predicted by YALE, UCSC and HAVANA.
* tag "CCDS": member of the consensus CDS gene set, confirming coding regions between ENSEMBL,
  UCSC, NCBI and HAVANA.

Please note: if a start codon is split between two exons, two start-codon features will be listed.

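The recommendation to use levels 1 + 2 can be sketched as a small filter. This is only an illustration, not part of the GENCODE tooling: the function name keep_levels_1_2 is hypothetical, and it assumes the file is tab-separated with the level written in the 9th column as `level N;`, as in the example records in this README.

```shell
# Hypothetical helper: keep header lines and level 1/2 records only.
# Assumes tab-separated GTF with attributes such as: level 2;
keep_levels_1_2() {
  awk -F'\t' '
    # pass "##" header lines through unchanged
    /^#/ { print; next }
    # a space is prepended so that "level" is only matched as a whole attribute key
    (" " $9) ~ / level [12];/ { print }
  '
}
```

A possible invocation, where the output file name is just an example:

    gunzip -c gencode_data.rel2.gtf.gz | keep_levels_1_2 > gencode_levels_1_2.gtf
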
################################################################################################
Updated dump (23.06.2009)
################################################################################################

Updated version of the January 2009 freeze file, produced for use in the 1000 Genomes project
and others.

gencode_data.rel2b.gtf.gz

The field "CCDSID" with a valid CCDS id has been added to the 9th column.
The field "CCDSOL" has been added, indicating that the transcript overlaps a CCDS transcript
but was not flagged as such directly.
A few genes have been removed and a few have been added.
Some formatting issues have been resolved.

################################################################################################
Updated dumps (21.04.2009)
################################################################################################

1. protein-coding transcripts, their sequences and translations
   gencode_data.rel2.pc_transcripts.gtf.gz
   gencode_data.rel2.pc_transcripts_cdnas.fa.gz
   gencode_data.rel2.pc_transcripts_translations.fa.gz

2. re-dump (as January freeze) without external annotation
   gencode_data.rel2a.gtf.gz

################################################################################################
For the analysis data freeze of January 2009 there are the following files in this directory:
################################################################################################

1. gencode_data.rel2.gtf.gz
   Data file in GTF format, compressed with gzip, containing annotation on three levels.
   Same format as described below, with the addition of one line for every gene and transcript.
   To remove these from the uncompressed file, do something like

   awk '{if(($3 != "gene") && ($3 != "transcript")){print $0}}' \
       gencode_data.rel2.gtf > gencode_data.rel2_mod.gtf

2. gencode_tRNAscans.rel2.gtf.gz
   tRNAscan predictions from the ENSEMBL simpleFeature table (level 3).

3. gencode_polyAs.rel2.gtf.gz
   polyA signals from the loutre database (polyA_site, pseudo_polyA) (separate level).

################################################################################################
For the initial data freeze of October 1st 2008 there are the following files in this directory:
################################################################################################

1. gencode_data.rel1.v2.gtf.gz
   Data file in GTF format, compressed with gzip, containing annotation on three levels.
   Here is what the first 11 lines look like (containing the first transcript):

##description: evidence-based annotation of the human genome (NCBI36)
##provider: GENCODE
##contact: fsk@sanger.ac.uk
##format: gtf 2.2
##date: 2008-10-02
chr1 HAVANA exon 1873 1920 . + . gene_id "OTTHUMG00000000961"; transcript_id "OTTHUMT00000002844"; transcript_type "unprocessed_pseudogene"; transcript_status "UNKNOWN"; gene_type "unprocessed_pseudogene"; gene_status "NOVEL"; gene_name "RP11-34P13.1"; transcript_name "RP11-34P13.1-001"; level 2;
chr1 HAVANA exon 2042 2090 . + . gene_id "OTTHUMG00000000961"; transcript_id "OTTHUMT00000002844"; transcript_type "unprocessed_pseudogene"; transcript_status "UNKNOWN"; gene_type "unprocessed_pseudogene"; gene_status "NOVEL"; gene_name "RP11-34P13.1"; transcript_name "RP11-34P13.1-001"; level 2;
chr1 HAVANA exon 2476 2560 . + . gene_id "OTTHUMG00000000961"; transcript_id "OTTHUMT00000002844"; transcript_type "unprocessed_pseudogene"; transcript_status "UNKNOWN"; gene_type "unprocessed_pseudogene"; gene_status "NOVEL"; gene_name "RP11-34P13.1"; transcript_name "RP11-34P13.1-001"; level 2;
chr1 HAVANA exon 2838 2915 . + . gene_id "OTTHUMG00000000961"; transcript_id "OTTHUMT00000002844"; transcript_type "unprocessed_pseudogene"; transcript_status "UNKNOWN"; gene_type "unprocessed_pseudogene"; gene_status "NOVEL"; gene_name "RP11-34P13.1"; transcript_name "RP11-34P13.1-001"; level 2;
chr1 HAVANA exon 3084 3237 . + . gene_id "OTTHUMG00000000961"; transcript_id "OTTHUMT00000002844"; transcript_type "unprocessed_pseudogene"; transcript_status "UNKNOWN"; gene_type "unprocessed_pseudogene"; gene_status "NOVEL"; gene_name "RP11-34P13.1"; transcript_name "RP11-34P13.1-001"; level 2;
chr1 HAVANA exon 3316 3533 . + . gene_id "OTTHUMG00000000961"; transcript_id "OTTHUMT00000002844"; transcript_type "unprocessed_pseudogene"; transcript_status "UNKNOWN"; gene_type "unprocessed_pseudogene"; gene_status "NOVEL"; gene_name "RP11-34P13.1"; transcript_name "RP11-34P13.1-001"; level 2;

2. gencode_data.rel1.v2.regions.txt
   List of regions (genomic coordinates) where only HAVANA annotation (level 1 & 2) can be found.

3. gencode_data.rel1.v2.regions_with_ids.txt
   List of regions (genomic coordinates) where only HAVANA annotation (level 1 & 2) can be found,
   with all OTT-ids from the region listed.
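
As an illustration of how the attributes column of the example records can be read, the following sketch sums exon lengths per transcript_id. It is not part of the GENCODE distribution: the function name exon_length_per_transcript is hypothetical, and it assumes tab-separated columns and 1-based inclusive coordinates, as usual for GTF.

```shell
# Hypothetical helper: total exon length per transcript_id.
# Assumes tab-separated GTF with 1-based, inclusive start/end coordinates.
exon_length_per_transcript() {
  awk -F'\t' '
    # skip "##" header lines
    /^#/ { next }
    $3 == "exon" {
      # locate the transcript_id attribute in the 9th column
      if (match($9, /transcript_id "[^"]+"/)) {
        # drop the 15-character key prefix and the closing quote
        id = substr($9, RSTART + 15, RLENGTH - 16)
        total[id] += $5 - $4 + 1
      }
    }
    # print one "id <TAB> length" line per transcript (order is not defined)
    END { for (id in total) print id "\t" total[id] }
  '
}
```

A possible invocation:

    gunzip -c gencode_data.rel1.v2.gtf.gz | exon_length_per_transcript

For the six example exons shown above, the lengths 48 + 49 + 85 + 78 + 154 + 218 add up to 632 bases for OTTHUMT00000002844.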