################################################################################
README: Matched Annotation by NCBI and EMBL-EBI (MANE)

Last updated: 2024-03-06
################################################################################

The RefSeq project at the NCBI and the Ensembl/GENCODE project at EMBL-EBI have
provided independent high-quality human reference gene datasets to biologists
since the sequencing of the human genome. In response to user requests for
consistent annotation, we initiated the Matched Annotation from the NCBI and
EMBL-EBI (MANE) collaboration in 2018 to provide a matched set of
well-supported transcripts for human protein-coding genes and to define one
representative transcript for each gene.

At this time, the MANE deliverables include:
1. MANE Select, a single representative transcript for every protein-coding
   gene, for clinical reporting and other applications
2. MANE Plus Clinical, additional transcripts for genes where the MANE Select
   alone is inadequate for describing all publicly available pathogenic
   variants

A complete list of all MANE Select and MANE Plus Clinical transcripts in a
given MANE release is available in the MANE.GRCh38.v##.summary.txt.gz file.
Each MANE transcript pair (represented by a row in this file) includes RefSeq
and Ensembl identifiers for the transcript and protein, along with the RefSeq
and Ensembl gene identifiers. Clinical users should consider both MANE Select
and MANE Plus Clinical transcripts when interpreting variants.

Users interested in MANE data for a subset of clinically relevant genes can
obtain the list of genes from the GenCC (Gene Curation Coalition) download page
(https://search.thegencc.org/download) and use that list to filter the MANE
data file. GenCC is a global coalition that curates information on the validity
of disease-gene associations.
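The filtering step described above can be sketched in Python as follows. This is only an illustration, not an official MANE tool. It assumes the summary file starts with a single tab-delimited header row naming the fields listed in this README (NCBI_GeneID, symbol, MANE_status, and so on), possibly prefixed with "#". The function names are hypothetical, and turning the GenCC download into a list of gene symbols is left to the user because that file's layout is not described here.

```python
import csv
import gzip


def read_mane_summary(path):
    """Return a list of dicts, one per row of a gzipped MANE summary file.

    Assumption: the first line is a tab-delimited header row, possibly
    starting with '#', whose names match the fields listed in this README.
    """
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        header = handle.readline().rstrip("\r\n").lstrip("#").split("\t")
        reader = csv.DictReader(
            handle,
            fieldnames=header,
            delimiter="\t",
            quoting=csv.QUOTE_NONE,  # treat quote characters as plain text
        )
        return list(reader)


def filter_by_symbols(rows, symbols):
    """Keep every row whose 'symbol' field is in the given gene list.

    All rows for a gene are kept, so a gene that has both a MANE Select
    and a MANE Plus Clinical transcript yields more than one row.
    """
    wanted = set(symbols)
    return [row for row in rows if row["symbol"] in wanted]


# Example usage (file name and gene symbols are placeholders):
#
#   rows = read_mane_summary("MANE.GRCh38.v##.summary.txt.gz")
#   for row in filter_by_symbols(rows, ["GENE1", "GENE2"]):
#       print(row["symbol"], row["MANE_status"], row["RefSeq_nuc"],
#             row["Ensembl_nuc"], sep="\t")
```

Filtering on the symbol field is the simplest choice. If the gene list carries HGNC identifiers, matching on the HGNC_ID field instead may be more robust to symbol changes, provided both files write the identifier in the same form.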

More information about the MANE project is available here:
    NCBI RefSeq        -- https://www.ncbi.nlm.nih.gov/refseq/MANE/
    NCBI Insights Blog -- http://go.usa.gov/xpSDt
    Ensembl            -- https://ensembl.org/info/genome/genebuild/mane.html
    Ensembl Blog       -- http://bit.ly/ens-mane

NOTES:
1. An update to the filenames was made in January 2022 to remove the word
   'select'. All releases prior to 1.0 still include the word 'select' in the
   filenames. For example, MANE.GRCh38.v0.95.select_refseq_genomic.gtf.gz vs.
   MANE.GRCh38.v1.0.refseq_genomic.gff.gz.
2. Updates were made to the file "MANE.GRCh38.v##.genes_not_in_mane.txt.gz" in
   March 2024. Only protein-coding genes are now included in this file; any
   non-coding genes that were previously present have been removed. To
   accurately reflect the contents, the filename was changed to
   "MANE.GRCh38.v##.protein_coding_genes_not_in_mane.txt.gz".

The directories and files included in this FTP path are described below.

1. MANE.GRCh38.v##.ensembl_genomic.gtf.gz
   Transcripts from the MANE Project, with Ensembl identifiers for nucleotide,
   protein and genes in GTF format

2. MANE.GRCh38.v##.ensembl_genomic.gff.gz
   Transcripts from the MANE Project, with Ensembl identifiers for nucleotide,
   protein and genes in GFF3 format

3. MANE.GRCh38.v##.refseq_genomic.gtf.gz
   Transcripts from the MANE Project, with NCBI RefSeq identifiers for
   nucleotide, protein and genes in GTF format

4. MANE.GRCh38.v##.refseq_genomic.gff.gz
   Transcripts from the MANE Project, with NCBI RefSeq identifiers for
   nucleotide, protein and genes in GFF3 format

5. MANE.GRCh38.v##.ensembl_protein.faa.gz
   Protein sequences, with Ensembl identifiers in FASTA format

6. MANE.GRCh38.v##.ensembl_rna.fna.gz
   Transcript sequences, with Ensembl identifiers in FASTA format

7. MANE.GRCh38.v##.refseq_protein.faa.gz
   Protein sequences, with NCBI RefSeq identifiers in FASTA format

8. MANE.GRCh38.v##.refseq_protein.gpff.gz
   Proteins in GenPept flatfile view

9. MANE.GRCh38.v##.refseq_rna.fna.gz
   Transcript sequences, with NCBI RefSeq identifiers in FASTA format

10. MANE.GRCh38.v##.refseq_rna.gbff.gz
    Transcripts in GenBank flatfile view

11. MANE.GRCh38.v##.summary.txt.gz
    A summary file with the following tab-delimited fields:
    [ 1] NCBI_GeneID
    [ 2] Ensembl_Gene
    [ 3] HGNC_ID
    [ 4] symbol
    [ 5] name
    [ 6] RefSeq_nuc
    [ 7] RefSeq_prot
    [ 8] Ensembl_nuc
    [ 9] Ensembl_prot
    [10] MANE_status
    [11] GRCh38_chr
    [12] chr_start
    [13] chr_end
    [14] chr_strand

12. trackhub
    This directory includes files that can be used to add a custom track hub to
    the NCBI Genome Data Viewer, the UCSC genome browser or the Ensembl genome
    browser. Specifically, the URL for the track hub is:
        http://ftp.ncbi.nlm.nih.gov/refseq/MANE/trackhub/hub.txt

    Please see the following pages for help with adding track hubs to specific
    browsers:
        NCBI GDV -- https://www.ncbi.nlm.nih.gov/genome/gdv/browser/help/#TRACKHUBS
        UCSC     -- http://genome.ucsc.edu/goldenPath/help/hgTrackHubHelp.html
        Ensembl  -- http://ensembl.org/info/website/adding_trackhubs.html

    Users interested in consuming these data in a stand-alone genome browser
    can use the following files in bigBed format:
        trackhub/data/release_##/MANE.GRCh38.v##.refseq.bb     RefSeq Identifiers
        trackhub/data/release_##/MANE.GRCh38.v##.ensembl.bb    EMBL-EBI Identifiers

13. MANE.GRCh38.v##.protein_coding_genes_not_in_mane.txt.gz
    This file includes a list of human protein-coding genes not in the MANE
    set, and a category associated with each gene indicating the reason why it
    is not in MANE. Genes in the category "pending MANE review" are yet to be
    reviewed by RefSeq and Ensembl/GENCODE curators. Genes in the following
    categories are currently out of scope for MANE:
    * genome error on assembled chromosomes: The gene cannot be represented
      accurately at this time due to an assembly error in GRCh38.
    * non-coding allele on assembled chromosomes: The gene may be annotated as
      protein-coding on an unlocalized scaffold in the primary assembly or on
      an alternate loci scaffold, but cannot be represented as protein-coding
      on an assembled chromosome due to allelic differences.
    * gene not on assembled chromosomes: The gene is only found on an unplaced,
      unlocalized, alternate loci, or patch scaffold.
    * gene in false duplication region: The gene falls within a region of
      GRCh38 that has now been identified as a false duplication by the Genome
      Reference Consortium (GRC).
    * gene undergoes ribosomal slippage: Four human genes known to undergo
      ribosomal slippage are currently out of scope for MANE due to data model
      differences between RefSeq and Ensembl/GENCODE.

################################################################################
CHANGELOG
################################################################################

2022-01-31
* Updated introduction
* Added note about change in filenames
* Corrected filenames of bigBed files
* Added "genes_not_in_mane" file and description of categories

2024-03-06
* Added note about changes to the "genes_not_in_mane" file contents and name