DDBJ Amino Acid Sequence Database (DAD)
Release 68.0, Sep. 2014, including
35,066,241 entries, 10,653,219,680 residues

Last published date in the present release: August 29, 2014

-------------------------------------------------------------------------------
Table of contents
-------------------------------------------------------------------------------
1. Introduction
   1.1. Announcement for changes in the present release
   1.2. Announcement for the forthcoming changes
2. Format of DAD entries
3. DAD categories
4. Contact information
5. Disclaimer
6. DAD file categories
7. A sample of DAD entries
8. Release history
9. Statistics of DAD
-------------------------------------------------------------------------------

1. Introduction

This is release 68.0 of the DDBJ Amino Acid Sequence Database (DAD). This
database has been produced by extracting all translated sequences from the
DDBJ periodical release 98.0 and the TPA dataset (August 2014).

1.1. Announcement for changes in the present release

Nothing particular.

1.2. Announcement for the forthcoming changes

Nothing particular.

2. Format of DAD entries

The standard format of DAD is almost the same as that of the DDBJ nucleotide
sequence database, except for the points described below.

Accession numbers of DAD entries are written in the lines labeled
"ACCESSION." A DAD accession number consists of a DDBJ accession number and
an integer that begins with 1, joined by a hyphen (-). For example, two amino
acid sequences extracted from the DDBJ entry D12345 have the accession numbers
D12345-1 and D12345-2, respectively. The number is useful for identifying a
DAD entry.

An amino acid sequence begins on the line after "BEGIN." Up to sixty amino
acids are written in one line. The amino acid sequence is followed by a double
slash (//), which marks the end of the entry.
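
The rules above (ACCESSION label, "BEGIN" marker, "//" terminator) can be
sketched as a small reader. This is only a minimal illustration and not a DDBJ
tool: the names DadEntry, split_dad_accession and iter_dad_entries are
hypothetical, and the sketch assumes that every sequence line holds only a
position number, blanks and residue letters.

```python
import re
from typing import Iterable, Iterator, NamedTuple, Tuple


class DadEntry(NamedTuple):
    ddbj_accession: str  # e.g. "D12345"
    serial: int          # e.g. 2 for "D12345-2"
    sequence: str        # residues with numbering and blanks removed


def split_dad_accession(accession: str) -> Tuple[str, int]:
    """Split a DAD accession such as 'D12345-2' into ('D12345', 2)."""
    base, _, serial = accession.rpartition("-")
    if not base or not serial.isdigit() or int(serial) < 1:
        raise ValueError(f"not a DAD accession: {accession!r}")
    return base, int(serial)


def iter_dad_entries(lines: Iterable[str]) -> Iterator[DadEntry]:
    """Yield one DadEntry per '//'-terminated record (assumed layout)."""
    accession = None
    in_sequence = False
    residues = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("//"):
            # End of entry: emit what was collected and reset the state.
            if accession is not None:
                base, serial = split_dad_accession(accession)
                yield DadEntry(base, serial, "".join(residues))
            accession, in_sequence, residues = None, False, []
        elif in_sequence:
            # Keep residue letters only; drop position numbers and blanks.
            residues.append(re.sub(r"[^A-Za-z]", "", line))
        elif line.startswith("ACCESSION"):
            accession = line.split()[1]
        elif line.startswith("BEGIN"):
            in_sequence = True


example = [
    "ACCESSION   D12345-2",
    "BEGIN",
    "        1 MSMGLEITGT ALAVLGWLGT",
    "//",
]
for entry in iter_dad_entries(example):
    print(entry.ddbj_accession, entry.serial, len(entry.sequence))
```

The reader is written as a generator so that a 1.5 GB .DAD file can be
processed line by line without loading it into memory.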

The LOCUS line contains the locus name, the length of the protein, the
molecular type (always "PRT"), the division name, and the release date of the
DNA counterpart.

The DEFINITION line contains the species name and the protein name.

The other parts of a DAD entry, including FEATURES, are almost the same as
those of the corresponding DDBJ entry.

3. DAD categories

DAD entries are classified into 23 categories: the 21 categories of the DDBJ
periodical release plus TPA and TPACON. Please refer to the release note of
the DDBJ release for details (filename: ddbjrel.txt).

Also, there are two types of DAD files for each division: files with the
suffix ".DAD" in the DAD standard format, and files with the suffix
".DAD.fasta" in a FASTA-compatible format.

[DDBJ release note]
ftp://ftp.ddbj.nig.ac.jp/ddbj_database/ddbj/ddbjrel.txt

4. Contact information

DNA Data Bank of Japan
DDBJ Center
National Institute of Genetics
Research Organization of Information and Systems
Mishima 411-8540, Japan
Phone:  +81 55 981 6853
FAX:    +81 55 981 6849
E-mail: ddbj@ddbj.nig.ac.jp
WWW:    http://www.ddbj.nig.ac.jp/

5. Disclaimer

While DDBJ endeavors to keep its data correct, DDBJ makes no representations
or warranties of any kind about the completeness, accuracy or reliability of
the entries contained in the DAD periodical release. DDBJ also assumes no
legal liability or responsibility for merchantability or fitness for a
particular purpose, or that the use of the sequence data will not infringe any
patent or other rights. Any receipt of, reliance on, or use of such data is
therefore strictly at your own risk.

6. DAD file categories

This release covers 23 categories (see also '3. DAD categories')
of organisms and others as follows:

------------------------------------------------------------------------------
ddbjbct;    Category for bacteria
ddbjcon;    Category for CON (contigs)
ddbjenv;    Category for ENV (environmental samples)
ddbjest;    Category for EST (expressed sequence tags)
ddbjgss;    Category for GSS (genome survey sequences)
ddbjhtc;    Category for HTC (high throughput cDNA sequences)
ddbjhtg;    Category for HTG (high throughput genomic sequences)
ddbjhum;    Category for human
ddbjinv;    Category for invertebrates
ddbjmam;    Category for mammals other than primates and rodents
ddbjpat;    Category for patents
ddbjphg;    Category for phages
ddbjpln;    Category for plants
ddbjpri;    Category for primates other than human
ddbjrod;    Category for rodents
ddbjsts;    Category for STS (sequence tagged sites)
ddbjsyn;    Category for synthetic DNAs
ddbjtpa;    Category for TPA (third party annotations)
ddbjtpacon; Category for CON (contigs) of TPA (third party annotations)
ddbjtsa;    Category for TSA (transcriptome shotgun assemblies)
ddbjuna;    Category for unannotated sequences
ddbjvrl;    Category for viruses
ddbjvrt;    Category for vertebrates other than mammals
------------------------------------------------------------------------------

In the present release, each of the above categories is recorded in
ddbj***##.DAD files as follows.

file prefix   number of files
-------------------------------
ddbjbct       28
ddbjcon       36
ddbjenv        1
ddbjest        1
ddbjgss        1
ddbjhtc        1
ddbjhtg        1
ddbjhum        2
ddbjinv        3
ddbjmam        1
ddbjpat        1
ddbjphg        1
ddbjpln        4
ddbjpri        1
ddbjrod        1
ddbjsts        1
ddbjsyn        1
ddbjtpa        1
ddbjtpacon     1
ddbjtsa        1
ddbjuna        1
ddbjvrl        4
ddbjvrt        2
-------------------------------

7. A sample of DAD entries

Below is a typical DAD entry. It may be useful for understanding the format
and contents.

----- ----- ----- ----- sample begin ----- ----- ----- -----
LOCUS       BAA22986.1               220 aa    PRT              HUM 28-OCT-1997
DEFINITION  Homo sapiens RVP1 protein.
ACCESSION   AB000714-1
PROTEIN_ID  BAA22986.1
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
            Eukaryotae; Metazoa; Chordata; Vertebrata; Mammalia; Eutheria;
            Primates; Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 1250)
  AUTHORS   Katahira,J.
  TITLE     Direct Submission
  JOURNAL   Submitted (26-JAN-1997) to the DDBJ/EMBL/GenBank databases.
            Contact:Jun Katahira
            Institute for Microbial Diseases, Osaka University, Department of
            Bacterial Toxinology; 3-1, Yamadaoka, Suita, Osaka 565, Japan
REFERENCE   2
  AUTHORS   Katahira,J., Sugiyama,H., Inoue,N., Horiguchi,Y., Matsuda,M. and
            Sugimoto,N.
  TITLE     Clostridium perfringens enterotoxin utilizes two structurally
            related membrane proteins as functional receptors in vivo
  JOURNAL   J. Biol. Chem. 272, 26652-26658 (1997)
COMMENT
FEATURES             Qualifiers
     source          /db_xref="H-InvDB:HIT000057926"
                     /mol_type="mRNA"
                     /organism="Homo sapiens"
                     /tissue_lib="lung"
     protein         /gene="hRVP1"
                     /transl_table=1
BEGIN
        1 MSMGLEITGT ALAVLGWLGT IVCCALPMWR VSAFIGSNII TSQNIWEGLW MNCVVQSTGQ
       61 MQCKVYDSLL ALPQDLQAAR ALIVVAILLA AFGLLVALVG AQCTNCVQDD TAKAKITIVA
      121 GVLFLLAALL TLVPVSWSAN TIIRDFYNPV VPEAQKREMG AGLYVGWAAA ALQLLGGALL
      181 CCSCPPREKK YTATKVVYSA PRSTGPGASL GTGYDRKDYV
//
----- ----- ----- ----- sample end ----- ----- ----- -----

8. Release history

------------------
 Since release 50
------------------
The format of the SOURCE line in the DAD flat file has been changed:

As a result of this change, 1) the order of the organism name and the
organelle name is changed, and 2) some DAD flat files now include a common
name, as GenBank flat files do. The change is shown below in detail.

----------------
 Old (-rel. 49)
----------------
Format:  SOURCE []
Example: SOURCE Homo sapiens mitochondrion

----------------
 New (rel. 50-)
----------------
Format:  SOURCE [] [()]
Example: SOURCE mitochondrion Homo sapiens (human)

See also '7. A sample of DAD entries'.
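
As a rough illustration of the rel. 50 SOURCE format, the sketch below
separates a trailing common name in parentheses from the rest of a SOURCE
line. It is not part of any DDBJ software; the function name
split_source_line is hypothetical, and the sketch assumes that the common
name, when present, is the last parenthesized item on the line and contains
no nested parentheses. It does not try to tell the organelle name from the
organism name.

```python
import re
from typing import Optional, Tuple

# Assumption: an optional "(common name)" closes the line.
_COMMON_NAME = re.compile(r"^(?P<rest>.*?)\s*\((?P<common>[^()]*)\)\s*$")


def split_source_line(line: str) -> Tuple[str, Optional[str]]:
    """Return (name part, common name or None) for a SOURCE line."""
    body = line.strip()
    if body.startswith("SOURCE"):
        body = body[len("SOURCE"):].strip()
    match = _COMMON_NAME.match(body)
    if match:
        return match.group("rest"), match.group("common")
    return body, None


print(split_source_line("SOURCE      mitochondrion Homo sapiens (human)"))
print(split_source_line("SOURCE      Homo sapiens mitochondrion"))
```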

------------------
 Since release 45
------------------
A new division, TSA (Transcriptome Shotgun Assembly), has been started:

A new division for assembled mRNA sequences, Transcriptome Shotgun Assembly
(TSA), is included in the present release.

With new sequencing technologies, INSDC has faced many requests to accept
assembled EST sequences. These sequence data have become more useful than
they used to be, although they may not be correctly assembled or may not
exist in nature. Therefore, INSDC decided to collect assembled EST sequences
into the new division 'TSA'.

TSA sequences are shotgun assemblies of primary sequences deposited in the
EST division of INSDC, the Trace Archive (TA) or the Short-Read Archive (SRA).
Two specific keywords, "TSA" and "Transcriptome Shotgun Assembly", are present
in all TSA entries. The new division code, "TSA", also appears in the LOCUS
line of all TSA entries.

No format changes are anticipated for this new division; however, note that
TSA entries make use of the same PRIMARY line that is described for the
entries in the TPA category. The PRIMARY block contains references to the
underlying reads/transcripts that were assembled to construct a TSA record.

------------------
 Since release 42
------------------
Deletion of E-mail address, phone and fax numbers from the DAD flat file

To comply with the Japanese law on the protection of personal information,
DDBJ deletes phone and fax numbers and E-mail addresses from the flat files of
entries submitted to DDBJ. This also helps protect DAD releases against spam
mail senders.

DDBJ retrofitted almost all entries submitted to DDBJ (not to GenBank or
EMBL) by the DDBJ periodical release 72.

Before the DAD periodical release 42, the submitter information was described
in the JOURNAL line of REFERENCE 1 as follows:
-------------------------------------------------------------------------------
REFERENCE   1  (bases 1 to 1200)
  AUTHORS   Mishima,T.
  TITLE     Direct Submission
  JOURNAL   Submitted (01-Jan-1990) to the DDBJ/EMBL/GenBank databases.
            Taro Mishima, DNA Data Bank of Japan, National Institute of
            Genetics; 1111, Yata, Mishima, Shizuoka 411-8540, Japan
            (E-mail:ddbj@ddbj.nig.ac.jp, URL:http://www.ddbj.nig.ac.jp/,
            Tel:81-12-345-6789, Fax:81-12-345-9876)
-------------------------------------------------------------------------------

After the deletion of the information in question, a DAD flat file takes one
of the following two forms.

Type 1: Phone and fax numbers and E-mail address are deleted.
-------------------------------------------------------------------------------
REFERENCE   1  (bases 1 to 1200)
  AUTHORS   Mishima,T.
  TITLE     Direct Submission
  JOURNAL   Submitted (01-Jan-1990) to the DDBJ/EMBL/GenBank databases.
            Contact:Taro Mishima
            DNA Data Bank of Japan, National Institute of Genetics; 1111,
            Yata, Mishima, Shizuoka 411-8540, Japan
            URL    :http://www.ddbj.nig.ac.jp/
-------------------------------------------------------------------------------

Type 2: When the submitters wish to keep their contact information disclosed,
        it is described as follows.
-------------------------------------------------------------------------------
REFERENCE   1  (bases 1 to 1200)
  AUTHORS   Mishima,T.
  TITLE     Direct Submission
  JOURNAL   Submitted (01-Jan-1990) to the DDBJ/EMBL/GenBank databases.
            Contact:Taro Mishima
            DNA Data Bank of Japan, National Institute of Genetics; 1111,
            Yata, Mishima, Shizuoka 411-8540, Japan
            URL    :http://www.ddbj.nig.ac.jp/
            E-mail :ddbj@ddbj.nig.ac.jp
            Phone  :81-12-345-6789
            Fax    :81-12-345-9876
-------------------------------------------------------------------------------

------------------
 Since release 40
------------------
The CON division has been included.

CON; Contig / Constructed
To join a series of entries, such as those submitted from a genome project,
each of the three data banks constructs an entry and assigns an accession
number to a large-scale sequence dataset.
Such entries are classified into the CON division.

------------------
 Since release 38
------------------
From the present release, the maximum file size is changed to 1.5 GB, because
network capacity has increased remarkably. Each file named ddbj***##.DAD is
at most 1.5 GB in size. See also section '9. Statistics of DAD'.

------------------
 Since release 32
------------------
Introduction of the ENV division:

Recently, submissions of sequences derived from environmental samples have
increased rapidly. To accommodate such submissions, a new division, ENV, has
been created. This division contains sequences obtained via direct molecular
isolation such as PCR, DGGE, or any anonymous method. In the past, sequences
derived from environmental samples belonged to taxonomic divisions, mainly
BCT.

At DDBJ, the retrofit to transfer relevant entries from taxonomic divisions
to the ENV division starts in the present release and ends by the next
periodical release. Please note that during this transitional period, some
entries that will eventually be placed in the ENV division will be found in
other divisions.

------------------
 Since release 30
------------------
"H-InvDB" has been added to db_xref (cross-reference) as a qualifier key.
The following is an example.

FEATURES             Location/Qualifiers
     source          1..5589
                     /clone="hf00223s1"
                     /clone_lib="pBluescriptII SK plus"
                     /db_xref="H-InvDB:HIT000000001"

------------------
 Since release 29
------------------
The GSS division has been included since release 29. GSS stands for Genome
Survey Sequence, which is similar to EST, except that GSS is genomic DNA
whereas EST is cDNA.

------------------
 Since release 21
------------------
1) Some information on introns has been added. It is given as "intron_pos"
   in the Feature/Qualifiers.

   Examples:
   intron_pos 142:1 (2/12) means that the 2nd intron among 12 in total is
   located between the 1st and 2nd bases of the 142nd codon (amino acid
   residue).
   intron_pos 228:0 (4/12) means that the 4th intron among 12 in total is
   located between the 227th and 228th codons (between the 3rd base of the
   227th codon and the 1st base of the 228th codon).

2) The LOCUS line has been changed. The following is an example and its
   explanation:

LOCUS       BAA21794.1               263 aa    PRT              BCT 05-FEB-1999

   Positions  Contents
   ---------  --------
   01-05      'LOCUS'
   06-12      spaces
   13-28      Locus name
   29-29      space
   30-40      Length of sequence, right-justified
   41-41      space
   42-43      'aa'
   44-47      spaces
   48-53      'PRT'
   54-64      spaces
   65-67      Division code
   68-68      space
   69-79      Date, in the form DD-MMM-YYYY (e.g., 15-MAR-1991)
   ---------------------

3) TPA data have been provided in a separate file (ddbjtpa.DAD).

9. Statistics of DAD

The following are the statistics of this release of DAD.

total number of entries       35,066,241
total length of sequences     10,653,219,680
average length                303 aa
name of longest sequence      CP000108-608 PID:ABB27887.1
length of longest sequence    36,805 aa (CP000108-608)

=========================================================================
file name         no. of entries  no. of amino acids     file size
=========================================================================
ddbjbct1.DAD              322506            97606017    1468006406
ddbjbct2.DAD              495048           150537613    1468042140
ddbjbct3.DAD              588019           181888359    1468010119
ddbjbct4.DAD              496442           157090773    1468006862
ddbjbct5.DAD              427234           137271972    1468008079
ddbjbct6.DAD              432791           134897131    1468010655
ddbjbct7.DAD              434753           136235157    1468008563
ddbjbct8.DAD              462281           141928192    1468010073
ddbjbct9.DAD              329572           107480222    1468008536
ddbjbct10.DAD             391221           122438087    1468006460
ddbjbct11.DAD             396018           124353901    1468011165
ddbjbct12.DAD             337552           107472091    1468006779
ddbjbct13.DAD             401888           127531741    1468009876
ddbjbct14.DAD             465349           145955411    1468009494
ddbjbct15.DAD             425142           133146982    1468011614
ddbjbct16.DAD             409671           129593446    1468008120
ddbjbct17.DAD             576004           181342781    1468006591
ddbjbct18.DAD             511986           157016942    1468007359
ddbjbct19.DAD             458003           141538460    1468007360
ddbjbct20.DAD             391875           121057778    1468008870
ddbjbct21.DAD             332151           100708673    1468009945
ddbjbct22.DAD             321128            96325443    1468009396
ddbjbct23.DAD             365283           111395044    1468008487
ddbjbct24.DAD             450685           138526889    1468006410
ddbjbct25.DAD             557445           163393941    1468006811
ddbjbct26.DAD             616597           185298967    1468007745
ddbjbct27.DAD             702547           201675609    1468007583
ddbjbct28.DAD             118327            36300997     249601661
ddbjcon1.DAD              205463            86181429    1468007532
ddbjcon2.DAD              270104           108673637    1468010526
ddbjcon3.DAD              206524            88376864    1468014242
ddbjcon4.DAD              303156           113274980    1468007852
ddbjcon5.DAD              306700           120320826    1468007952
ddbjcon6.DAD              475215           191851710    1468007115
ddbjcon7.DAD              470328           154051117    1468006526
ddbjcon8.DAD              366501            63971386    1468010061
ddbjcon9.DAD              366467            64050680    1468008286
ddbjcon10.DAD             366533            63965841    1468008319
ddbjcon11.DAD             366514            63958104    1468009270
ddbjcon12.DAD             366556            63919450    1468006470
ddbjcon13.DAD             366456            64149346    1468009205
ddbjcon14.DAD             367155            62874943    1468009066
ddbjcon15.DAD             366168            64725739    1468007483
ddbjcon16.DAD             361555            76602646    1468008619
ddbjcon17.DAD             362461            74500614    1468009525
ddbjcon18.DAD             361764            73978197    1468008889
ddbjcon19.DAD             362366            74466464    1468007446
ddbjcon20.DAD             361111            77733958    1468009508
ddbjcon21.DAD             358066            84549115    1468009352
ddbjcon22.DAD             356964            87141501    1468007449
ddbjcon23.DAD             357526            83857191    1468008066
ddbjcon24.DAD             407059           125888420    1468008003
ddbjcon25.DAD             458665           173038846    1468007491
ddbjcon26.DAD             404786           148350321    1468008876
ddbjcon27.DAD             403372           164056576    1468009725
ddbjcon28.DAD             478276           197044281    1468008498
ddbjcon29.DAD             324325           133849234    1468008451
ddbjcon30.DAD             346798           147023442    1468006919
ddbjcon31.DAD             371217           157117242    1468008529
ddbjcon32.DAD             454699           202704622    1468007506
ddbjcon33.DAD             453066           180453362    1468009201
ddbjcon34.DAD             426917           192849187    1468007215
ddbjcon35.DAD             429671           179555734    1468008209
ddbjcon36.DAD             190723            75580971     542432678
ddbjenv1.DAD              560060           111810701    1184654990
ddbjest1.DAD                1163              153762       2558150
ddbjgss1.DAD                2859              905772       7498415
ddbjhtc1.DAD              106984            33381853     408705684
ddbjhtg1.DAD               37571            11757618     195859434
ddbjhum1.DAD              621766           182085151    1468007895
ddbjhum2.DAD                1900              529087       4325917
ddbjinv1.DAD              575943           167394139    1468006422
ddbjinv2.DAD              666342           177473497    1468006677
ddbjinv3.DAD              498331           125242765    1072290035
ddbjmam1.DAD              224856            56505217     456794223
ddbjpat1.DAD              390844           163615251     579082428
ddbjphg1.DAD              214745            44191698     486318089
ddbjpln1.DAD              464976           162727747    1468007458
ddbjpln2.DAD              549154           166208930    1468008428
ddbjpln3.DAD              706724           198389926    1468008003
ddbjpln4.DAD              334674            84400575     677478562
ddbjpri1.DAD               71226            16846683     155931965
ddbjrod1.DAD              195395            61851070     499862379
ddbjsts1.DAD                   9                 812         21985
ddbjsyn1.DAD              116510            43558187     311310376
ddbjtpa1.DAD               47479            12093295     125187147
ddbjtpacon1.DAD            71668            31580322     309013607
ddbjtsa1.DAD              120360            49444651     314503099
ddbjuna1.DAD                 214               35721        355926
ddbjvrl1.DAD              668976           209063076    1468006525
ddbjvrl2.DAD              691671           208148435    1468009208
ddbjvrl3.DAD              632037           205571384    1468006480
ddbjvrl4.DAD              143973            50614163     328663385
ddbjvrt1.DAD              690972           169660145    1468007254
ddbjvrt2.DAD              238114            53281452     479602901
=========================================================================
Total                   35066241         10653219680  114088686266
=========================================================================
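
The per-file rows of the statistics table can be aggregated with a short
script, for example to total the entries of one division or to re-derive the
average length. The sketch below is an illustration only: the names ROW,
summarize_by_prefix and average_length are hypothetical, each row is assumed
to be whitespace-separated in the order file name, entries, amino acids, file
size, and the use of integer division for the average is an assumption that
happens to reproduce the 303 aa figure reported in this section.

```python
import re
from collections import defaultdict

# Assumed row layout: <prefix><number>.DAD <entries> <amino acids> <file size>
ROW = re.compile(r"^(ddbj[a-z]+)(\d+)\.DAD\s+(\d+)\s+(\d+)\s+(\d+)$")


def summarize_by_prefix(rows):
    """Sum file count, entries, residues and bytes per file prefix."""
    totals = defaultdict(
        lambda: {"files": 0, "entries": 0, "residues": 0, "bytes": 0}
    )
    for row in rows:
        match = ROW.match(row.strip())
        if match is None:
            continue  # skip rules, the header and the Total line
        bucket = totals[match.group(1)]
        bucket["files"] += 1
        bucket["entries"] += int(match.group(3))
        bucket["residues"] += int(match.group(4))
        bucket["bytes"] += int(match.group(5))
    return dict(totals)


def average_length(total_residues, total_entries):
    """Average sequence length, truncated to a whole number of residues."""
    return total_residues // total_entries


rows = [
    "ddbjhum1.DAD              621766           182085151    1468007895",
    "ddbjhum2.DAD                1900              529087       4325917",
]
print(summarize_by_prefix(rows)["ddbjhum"])
print(average_length(10653219680, 35066241))  # 303
```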