################################################################################################
This directory contains data files produced by the 

           GENCODE project

which is a subproject of ENCODE (http://www.genome.gov/10005107) and is headed by Tim Hubbard
at the Wellcome Trust Sanger Institute, UK.
For questions, please contact Felix Kokocinski, fsk@sanger.ac.uk.

################################################################################################

#######################################
General format of the data freeze files
#######################################

We supply genome-wide features on three different confidence levels. To get annotation close to the 
GENCODE gene set we are aiming for, levels 1 and 2 should be used:

* Level 1: validated 

At this time, this level contains only pseudogene loci that were predicted by the analysis pipelines 
from YALE and UCSC as well as by HAVANA manual annotation from WTSI. 

* Level 2: manual annotation 

HAVANA manual annotation from WTSI. 
The following regions are considered "fully annotated" and contain level 2 annotation from HAVANA only, 
although they will still be updated: 
chromosomes 1, 2, 6, 9, 10, 13, 20, 21, 22, X, Y, ENCODE pilot regions, chr11:2353995-3878750. 

* Level 3: automated annotation 

ENSEMBL loci in regions where no HAVANA annotation can be found. 

This data will be supplied in a single file in GTF2.2 format as defined here, with the following tags 
added to the attributes column where appropriate:

* level [1,2,3]: validation status as described.
* tag "pseudo_consens": member of the pseudogene set predicted by YALE, UCSC and HAVANA.
* tag "CCDS": member of the consensus CDS gene set, confirming coding regions between ENSEMBL, UCSC, NCBI and HAVANA. 

Please note: if start codons are split between two exons, two start-codon features will be listed. 
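
As an illustration of how the "level" and "tag" attributes described above could be read, here is a 
minimal Python sketch. It assumes tab-separated GTF lines with attributes written as key/value pairs 
ending in ";", as in the sample lines further down. The function names parse_attributes and 
is_level_1_or_2 are hypothetical and not part of any GENCODE tool.

```python
import re

# Matches one `key value;` pair of the GTF attributes column. Values may be
# quoted (gene_id "X") or bare (level 2).
ATTRIBUTE_RE = re.compile(r'(\S+)\s+("[^"]*"|[^;\s]+)\s*;')


def parse_attributes(column):
    """Return a dict mapping each attribute key to a list of its values.

    A list is used because a key such as `tag` may occur more than once.
    """
    attributes = {}
    for key, value in ATTRIBUTE_RE.findall(column):
        attributes.setdefault(key, []).append(value.strip('"'))
    return attributes


def is_level_1_or_2(gtf_line):
    """True for a feature line whose `level` attribute is 1 or 2."""
    if gtf_line.startswith("#"):
        return False  # header/comment line
    fields = gtf_line.rstrip("\n").split("\t")
    if len(fields) < 9:
        return False  # not a complete 9-column GTF line
    levels = parse_attributes(fields[8]).get("level", [])
    return any(level in ("1", "2") for level in levels)
```

Filtering a file with is_level_1_or_2 keeps the validated and manually annotated features and drops 
the automated (level 3) ones as well as the header lines.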


################################################################################################
Updated dump (23.06.2009)
#################################################################################################

Updated version of the January 2009 freeze file, produced for use in the 1000 Genomes project and others.
gencode_data.rel2b.gtf.gz

The field "CCDSID" with a valid CCDS id has been added to the 9th column.
The field "CCDSOL" has been added, indicating that the transcript overlaps a CCDS transcript but was not flagged as such directly.
A few genes have been removed and a few others added. Some formatting issues have been resolved.


################################################################################################
Updated dumps (21.04.2009)
#################################################################################################

1. protein-coding transcripts, their sequences and translations
gencode_data.rel2.pc_transcripts.gtf.gz
gencode_data.rel2.pc_transcripts_cdnas.fa.gz
gencode_data.rel2.pc_transcripts_translations.fa.gz

2. re-dump (as January freeze) without external annotation 
gencode_data.rel2a.gtf.gz

################################################################################################
For the analysis data freeze of January 2009 there are the following files in this directory:
################################################################################################

1. gencode_data.rel2.gtf.gz
  Data file in GTF format, compressed with gzip, containing annotation on three levels.
  Same format as described below, with the addition of one line for every gene and transcript.
  To remove these, uncompress the file and do something like 
  awk -F'\t' '$3 != "gene" && $3 != "transcript"' \
    gencode_data.rel2.gtf > gencode_data.rel2_mod.gtf 

2. gencode_tRNAscans.rel2.gtf.gz
   tRNAscan predictions from the ENSEMBL simpleFeature table (level 3).

3. gencode_polyAs.rel2.gtf.gz
   polyA features from the loutre database (polyA_site, pseudo_polyA) (separate level).
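
The gene and transcript lines mentioned for gencode_data.rel2.gtf.gz above can also be dropped 
without uncompressing the file first. The following Python sketch does this under the assumption 
that the feature type is in the third tab-separated column. The function name 
drop_gene_and_transcript_lines is hypothetical.

```python
import gzip


def drop_gene_and_transcript_lines(in_path, out_path):
    """Copy a gzipped GTF to a plain-text GTF, skipping gene/transcript lines.

    Returns the number of lines written.
    """
    written = 0
    with gzip.open(in_path, "rt") as src, open(out_path, "w") as dst:
        for line in src:
            fields = line.split("\t")
            # Header lines ("##...") have no third column and are kept.
            if len(fields) > 2 and fields[2] in ("gene", "transcript"):
                continue
            dst.write(line)
            written += 1
    return written
```

It could be called, for example, as 
drop_gene_and_transcript_lines("gencode_data.rel2.gtf.gz", "gencode_data.rel2_mod.gtf").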

################################################################################################
For the initial data freeze of October 1st 2008 there are the following files in this directory:
################################################################################################

1. gencode_data.rel1.v2.gtf.gz
  Data file in GTF format, compressed with gzip, containing annotation on three levels.
  Here is what the first 11 lines look like (containing first transcript):

  ##description: evidence-based annotation of the human genome (NCBI36)
  ##provider: GENCODE
  ##contact: fsk@sanger.ac.uk
  ##format: gtf 2.2
  ##date: 2008-10-02
  chr1	HAVANA	exon	1873	1920	.	+	.	 gene_id "OTTHUMG00000000961"; transcript_id "OTTHUMT00000002844"; transcript_type "unprocessed_pseudogene"; transcript_status "UNKNOWN"; gene_type "unprocessed_pseudogene"; gene_status "NOVEL"; gene_name "RP11-34P13.1"; transcript_name "RP11-34P13.1-001"; level 2;
  chr1	HAVANA	exon	2042	2090	.	+	.	 gene_id "OTTHUMG00000000961"; transcript_id "OTTHUMT00000002844"; transcript_type "unprocessed_pseudogene"; transcript_status "UNKNOWN"; gene_type "unprocessed_pseudogene"; gene_status "NOVEL"; gene_name "RP11-34P13.1"; transcript_name "RP11-34P13.1-001"; level 2;
  chr1	HAVANA	exon	2476	2560	.	+	.	 gene_id "OTTHUMG00000000961"; transcript_id "OTTHUMT00000002844"; transcript_type "unprocessed_pseudogene"; transcript_status "UNKNOWN"; gene_type "unprocessed_pseudogene"; gene_status "NOVEL"; gene_name "RP11-34P13.1"; transcript_name "RP11-34P13.1-001"; level 2;
  chr1	HAVANA	exon	2838	2915	.	+	.	 gene_id "OTTHUMG00000000961"; transcript_id "OTTHUMT00000002844"; transcript_type "unprocessed_pseudogene"; transcript_status "UNKNOWN"; gene_type "unprocessed_pseudogene"; gene_status "NOVEL"; gene_name "RP11-34P13.1"; transcript_name "RP11-34P13.1-001"; level 2;
  chr1	HAVANA	exon	3084	3237	.	+	.	 gene_id "OTTHUMG00000000961"; transcript_id "OTTHUMT00000002844"; transcript_type "unprocessed_pseudogene"; transcript_status "UNKNOWN"; gene_type "unprocessed_pseudogene"; gene_status "NOVEL"; gene_name "RP11-34P13.1"; transcript_name "RP11-34P13.1-001"; level 2;
  chr1	HAVANA	exon	3316	3533	.	+	.	 gene_id "OTTHUMG00000000961"; transcript_id "OTTHUMT00000002844"; transcript_type "unprocessed_pseudogene"; transcript_status "UNKNOWN"; gene_type "unprocessed_pseudogene"; gene_status "NOVEL"; gene_name "RP11-34P13.1"; transcript_name "RP11-34P13.1-001"; level 2;

2. gencode_data.rel1.v2.regions.txt
  List of regions (genomic coordinates) where only HAVANA annotation (level 1 & 2) can be found.

3. gencode_data.rel1.v2.regions_with_ids.txt
  List of regions (genomic coordinates) where only HAVANA annotation (level 1 & 2) can be found, with all OTT-ids from the region listed.
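
The exon lines shown for gencode_data.rel1.v2.gtf.gz above can be grouped by transcript to get, 
for example, the total exon length of each transcript. The Python sketch below assumes 
tab-separated lines and relies on GTF coordinates being 1-based and inclusive, so an exon spans 
end - start + 1 bases. The function name exon_length_per_transcript is hypothetical.

```python
import re
from collections import defaultdict

# Pulls the transcript identifier out of the attributes column.
TRANSCRIPT_ID_RE = re.compile(r'transcript_id "([^"]+)"')


def exon_length_per_transcript(gtf_lines):
    """Sum exon lengths per transcript_id over an iterable of GTF lines."""
    totals = defaultdict(int)
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # header/comment line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue  # only exon features contribute
        match = TRANSCRIPT_ID_RE.search(fields[8])
        if match is None:
            continue  # exon without a transcript_id attribute
        start, end = int(fields[3]), int(fields[4])
        # GTF coordinates are 1-based and inclusive.
        totals[match.group(1)] += end - start + 1
    return dict(totals)
```

Applied to the six sample exon lines, this gives a total of 632 bases for OTTHUMT00000002844 
(48 + 49 + 85 + 78 + 154 + 218).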