MGA (Mass sequence for Genome Annotation) data

First release: 24 Jan, 2005.
Last update:   13 Dec, 2007

DNA Data Bank of Japan has collected data in the MGA category since January
2005. The definition of MGA is as follows.

********************************************************************************
MGA is defined as those sequences which are produced in large quantity in view
of genome annotation.
********************************************************************************

MGA is classified into a special category like WGS (Whole Genome Shotgun),
but not into a specific division like EST or GSS. Thus, the taxonomic
division is assigned to the MGA data.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Contents
 1. Accession number
 2. Distribution file of MGA data
     2-1) The master record
     2-2) The variable record
 3. History
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

1. Accession Number

The accession number for each MGA entry is composed of a 5-letter prefix
and 7 digits, as in the example given below.

  ABCDE0000001, in which
    AB;            PROJECT identifier,
    CDE;           RESOURCE identifier of each PROJECT,
    ABCDE;         RESOURCE identifier in a given project, and
    ABCDE0000001;  the accession number for each sequence in the PROJECT.

A cross reference between the project name and the first two letters of the
prefix is shown in project_index.html/.txt.

RESOURCE, which shows the source from which the sequences were obtained, is
unified for each PROJECT. One RESOURCE is comparable with one cDNA library.
For details about the relationship between the resource information and the
prefix, refer to resource.html/.text.

2. Distribution file of MGA data

MGA data are released per RESOURCE of a PROJECT as two types of files:
master and variable records. The master record consists of the information
common to each RESOURCE of the PROJECT. The variable record includes the
sequence data and related information for each sequence entry.

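As an illustration of the accession number layout described in section 1,
the following is a minimal sketch in Python, not an official DDBJ tool. The
names MgaAccession and parse_mga_accession are hypothetical, and the
restriction to uppercase letters A-Z is an assumption taken from the
ABCDE0000001 example.

```python
import re
from typing import NamedTuple

# Assumption: the prefix consists of uppercase letters A-Z only,
# as in the ABCDE0000001 example; the source does not state this.
_MGA_ACCESSION = re.compile(r"([A-Z]{2})([A-Z]{3})([0-9]{7})")


class MgaAccession(NamedTuple):
    project: str   # first two letters: PROJECT identifier
    resource: str  # all five letters: RESOURCE identifier
    serial: int    # the 7 digits, as an integer


def parse_mga_accession(accession: str) -> MgaAccession:
    """Split an MGA accession number into its documented parts."""
    match = _MGA_ACCESSION.fullmatch(accession)
    if match is None:
        raise ValueError("not an MGA accession number: %r" % accession)
    project, resource_tail, digits = match.groups()
    return MgaAccession(project, project + resource_tail, int(digits))
```

For example, parse_mga_accession("ABCDE0000001") gives project "AB",
resource "ABCDE" and serial 1.
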
2-1) The master record

Table 1. The format of LOCUS line
It is adjusted to the GenBank format from release 51.
================================================================================
1        10        20        30        40        50        60        70      79
+--------+---------+---------+---------+---------+---------+---------+---------
LOCUS       AAAAA0000000                       mRNA    linear   ROD 24-JAN-2005

format specification:
---------   --------
Positions   Contents
---------   --------
01-05       'LOCUS'
06-12       spaces
13-24       Locus name
25-47       space
48-53       mRNA (messenger RNA). Left justified.
54-55       space
56-63       'linear' followed by two spaces, or 'circular'
64-64       space
65-67       The division code
68-68       space
69-79       Date, in the form dd-MMM-yyyy (e.g., 24-JAN-2005)
================================================================================

Table 2. The format of the master record for RESOURCE of the MGA data
================================================================================
1        10        20        30        40        50        60        70      79
+--------+---------+---------+---------+---------+---------+---------+---------
LOCUS       AAAAA0000000                       mRNA    linear   ROD 24-JAN-2005
DEFINITION  Mus musculus 1 month adult cerebellum RIKEN Cap Analysis Gene
            Expression (CAGE) library.
ACCESSION   AAAAA0000000
VERSION     AAAAA0000000.1
KEYWORDS    MGA; CAGE (Cap Analysis Gene Expression).
SOURCE      Mus musculus
  ORGANISM  Mus musculus
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Rodentia; Sciurognathi; Muridae; Murinae; Mus.
REFERENCE   1
  AUTHORS   Arakawa,T., Carninci,P., Fukuda,S., Hasegawa,A., Hayashida,K.,
            Hori,F., Iida,J., Imamura,K., Izumi,H., Kai,C., Kanetaka,H.,
            Katayama,S., Kawazu,C., Kodzius,R., Kojima,M., Komatsu,S.,
            Kondo,S., Matsuda,N., Murata,M., Nakamura,K., Nakamura,M.,
            Nishiyori,H., Noma,S., Nomura,K., Ohono,M., Sasaki,D., Sato,H.,
            Shimamura,K., Suwa,M., Tagami,M., Tagami-Takeda,Y., Tanaka,N.,
            Tashiro,Y., Tokuyasu,M., Usami,Y., Waki,K., Watahiki,A., Yoshino,K.
            and Hayashizaki,Y.
  TITLE     Direct Submission
  JOURNAL   Submitted (19-NOV-2004) to the DDBJ/EMBL/GenBank databases.
            Yoshihide Hayashizaki, The Institute of Physical and Chemical
            Research (RIKEN), Laboratory for Genome Exploration Research Group,
            RIKEN Genomic Sciences Center (GSC), RIKEN Yokohama Institute;
            1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama, Kanagawa 230-0045, Japan
            (E-mail:genome-res@gsc.riken.jp, URL:http://genome.gsc.riken.jp/,
            Tel:81-45-503-9222, Fax:81-45-503-9216)
REFERENCE   2
  AUTHORS   Arakawa,T., Carninci,P., Fukuda,S., Hasegawa,A., Hayashida,K.,
            Hori,F., Iida,J., Imamura,K., Izumi,H., Kai,C., Kanetaka,H.,
            Katayama,S., Kawazu,C., Kodzius,R., Kojima,M., Komatsu,S.,
            Kondo,S., Matsuda,N., Murata,M., Nakamura,K., Nakamura,M.,
            Nishiyori,H., Noma,S., Nomura,K., Ohono,M., Sasaki,D., Sato,H.,
            Shimamura,K., Suwa,M., Tagami,M., Tagami-Takeda,Y., Tanaka,N.,
            Tashiro,Y., Tokuyasu,M., Usami,Y., Waki,K., Watahiki,A., Yoshino,K.
            and Hayashizaki,Y.
  TITLE     The gene expression analysis of CAGE tags
  JOURNAL   Published Only in Database(2004)
REFERENCE   3
  AUTHORS   Shiraki,T., Kondo,S., Katayama,S., Waki,K., Kasukawa,T., Kawaji,H.,
            Kodzius,R., Watahiki,A., Nakamura,M., Arakawa,T., Fukuda,S.,
            Sasaki,D., Podhajska,A., Harbers,M., Kawai,J., Carninci,P. and
            Hayashizaki,Y.
  TITLE     Cap analysis gene expression for high-throughput analysis of
            transcriptional starting point and identification of promoter usage
  JOURNAL   Proc. Natl. Acad. Sci. U.S.A. 100, 15776-15781 (2003)
REFERENCE   4
  AUTHORS   Kodzius,R., Matsumura,Y., Kasukawa,T., Shimokawa,K., Fukuda,S.,
            Shiraki,T., Nakamura,M., Arakawa,T., Sasaki,D., Kawai,J.,
            Harbers,M., Carninci,P. and Hayashizaki,Y.
  TITLE     Absolute expression values for mouse transcripts: re-annotation of
            the READ expression database by the use of CAGE and EST sequence
            tags
  JOURNAL   FEBS Lett. 559, 22-26 (2004)
COMMENT     CAGE library was prepared and sequenced in Genome Science
            Laboratory (Wako) and Genome Exploration Research Group Genomics
            Science Center (GSC) in RIKEN Yokohama Institute
            Please visit our web site for further details.
            URL:http://genome.gsc.riken.jp/
FEATURES             Location/Qualifiers
     source          /clone_lib="mouse Cap Analysis Gene Expression (CAGE)
                     library"
                     /db_xref="taxon:10090"
                     /dev_stage="1 month adult"
                     /mol_type="mRNA"
                     /note="primer Oligo dT derived"
                     /note="MA_id: MA:0000198"
                     /note="eVOC_id: EVM:2280099"
                     /organism="Mus musculus"
                     /strain="C57BL/6J"
                     /tissue_type="cerebellum"
MGA         AAAAA0000001-AAAAA0240780
            total number of count : 346609
            Header Format
            >[ACC#]|[submitter's identifier]|[number of sequence count]|[map]|
            [free text]|[db_xref1(,db_xref2,...)]|
//
+--------+---------+---------+---------+---------+---------+---------+---------
1        10        20        30        40        50        60        70      79
================================================================================

The word "MGA" must appear in the KEYWORDS line. Other words related to how
the sequences were obtained are also given if necessary.

The summary of the registered sequences is given in the MGA field, below the
source feature key. The MGA field includes the accession number range, the
total count of tags (the number of sequences obtained from the RESOURCE),
and the Header Format, which explains the header line of the variable record
file.

Table 3. The format of the summary field of registered sequence in a source
         shown as MGA.
================================================================================
1        10        20        30        40        50        60        70      79
+--------+---------+---------+---------+---------+---------+---------+---------
MGA         AAAAA0000001-AAAAA0240780
            total number of count : 346609
            Header Format
            >[ACC#]|[submitter's identifier]|[number of sequence count]|[map]|
            [free text]|[db_xref1(,db_xref2,...)]|
//
================================================================================

2-2) The variable record

The variable record holds, for each sequence, the sequence data and the
related information described by the Header Format of the master record.
The variable record is written in multi-FASTA style. A single sequence entry
consists of two lines.

The first line (see Table 4) starts with [>], followed by these fields, each
terminated by a pipe (|): the accession number, the identifier provided by
the submitter(s), the number of identical sequences obtained from the
resource, map information, other information in free text, and
cross-references to external databases. The second line shows the nucleotide
sequence. This pair of lines is repeated once for each distinct sequence
obtained from the source.

Table 4. The format of the Variable record for the RESOURCE of MGA.
================================================================================
>AAAAA0000001|BC1004AA60F1902|1||||
gcggaagtcggaccggtcgc
>AAAAA0000002|BC1003AE78G1607|1||||
gactgtcttcggtgaatgca
>AAAAA0000003|BC1003AE72P1806|1||||
gggagaccgatccgggatct
>AAAAA0000004|BC1003AE30G1801|2||||
gagtcgggtcggtggggctgt
>AAAAA0000005|BC1003AA45J1501|1||||
ggggaatctgcagcctgggc
>AAAAA0000006|BC1003AE67B0902|1||||
gagccgtccccgacgccgcca
              (the remaining data are omitted)
================================================================================

3. History

################################################################################
13-DEC-2007  Change of Master record format.
             Deletion of E-mail address, phone and fax numbers.
             Please refer to "Release 71.0", Sep. 2007, for more details.
             (ftp://ftp.ddbj.nig.ac.jp/database/release_note/ddbj/ddbjrel.71.txt)
             Addition of keyword items (ex. 5'-end tag).

02-SEP-2005  Change of publication format of Variable record.
             The "//" lines that separated entries were removed.
################################################################################
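
Example: reading a variable record

The two-line entry layout described in section 2-2 can be read with a short
script. The following is a minimal sketch in Python, not an official DDBJ
tool. The names MgaEntry and read_variable_record are hypothetical. It
assumes that each sequence occupies exactly one line, that the free text
field contains no pipe character, and that any "//" separator lines (used
before 02-SEP-2005) can simply be skipped.

```python
from typing import Iterable, Iterator, List, NamedTuple


class MgaEntry(NamedTuple):
    accession: str       # [ACC#]
    submitter_id: str    # [submitter's identifier]
    count: int           # [number of sequence count]
    map_info: str        # [map]
    free_text: str       # [free text]
    db_xrefs: List[str]  # [db_xref1(,db_xref2,...)]
    sequence: str        # nucleotide sequence from the second line


def read_variable_record(lines: Iterable[str]) -> Iterator[MgaEntry]:
    """Yield one MgaEntry per header/sequence pair."""
    header = None
    for raw in lines:
        line = raw.strip()
        # Skip blank lines and old-style "//" separators (assumption).
        if not line or line == "//":
            continue
        if line.startswith(">"):
            header = line
            continue
        if header is None:
            raise ValueError("sequence line without a header: %r" % line)
        # Every field ends with "|", so the split yields a trailing "".
        fields = header[1:].split("|")
        if len(fields) < 6:
            raise ValueError("header has fewer than 6 fields: %r" % header)
        acc, sub_id, count, map_info, free_text, xrefs = fields[:6]
        yield MgaEntry(acc, sub_id, int(count), map_info, free_text,
                       [x for x in xrefs.split(",") if x], line)
        header = None
```

Passing an open file object, for example read_variable_record(open(path))
with a path of your own, yields the entries in file order. The count field
is the number of identical sequences, so summing it over all entries should
correspond to the "total number of count" in the master record.
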