DDBJ Amino Acid Sequence Database (DAD)
Release 74.0, Mar. 2016, including
47,659,541 entries, 14,850,873,151 residues

Last published date in the present release: February 26, 2016

-------------------------------------------------------------------------------
Table of contents
-------------------------------------------------------------------------------
 1. Introduction
    1.1. Announcement for changes in the present release
    1.2. Announcement for the forthcoming changes
 2. Format of DAD entries
 3. DAD categories
 4. Citation
 5. Contact information
 6. Disclaimer
 7. DAD file categories
 8. A sample of DAD entries
 9. Release history
10. Statistics of DAD
-------------------------------------------------------------------------------

1. Introduction

This is release 74.0 of the DDBJ Amino Acid Sequence Database (DAD). This
database has been produced by extracting all translated sequences from the
DDBJ periodical release 104.0 and the TPA dataset (February 2016).

1.1. Announcement for changes in the present release

  Nothing in particular.

1.2. Announcement for the forthcoming changes

  Nothing in particular.

2. Format of DAD entries

The standard format of DAD is almost the same as that of the DDBJ nucleotide
sequence database, except for the points described below.

Accession numbers of DAD entries are written in the lines labeled
"ACCESSION". A DAD accession number consists of a DDBJ accession number and
an integer that begins with 1, joined by a hyphen (-). For example, two amino
acid sequences extracted from the DDBJ entry D12345 have the accession numbers
D12345-1 and D12345-2, respectively. The number is useful for identifying a
DAD entry.

An amino acid sequence begins on the line following "BEGIN". Up to sixty
amino acids are written on one line. The amino acid sequence is followed by a
double slash (//), which marks the end of the entry.
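
The accession and sequence rules above can be sketched in a few lines of
Python. This is only an illustration, not a DDBJ tool: the function names
parse_dad_accession and read_sequence are hypothetical, and the sketch assumes
that each sequence line starts with a residue position counter, as in the
sample entry of section 8.

```python
import re
from typing import Iterable, NamedTuple


class DadAccession(NamedTuple):
    ddbj_accession: str  # accession number of the source DDBJ entry
    index: int           # integer that begins with 1 within that entry


def parse_dad_accession(value: str) -> DadAccession:
    """Split a DAD accession such as 'D12345-2' at its final hyphen."""
    match = re.fullmatch(r"(\S+)-([1-9][0-9]*)", value.strip())
    if match is None:
        raise ValueError(f"not a DAD accession number: {value!r}")
    return DadAccession(match.group(1), int(match.group(2)))


def read_sequence(entry_lines: Iterable[str]) -> str:
    """Collect the residues between the BEGIN line and the '//' terminator."""
    residues = []
    in_sequence = False
    for line in entry_lines:
        stripped = line.strip()
        if stripped == "//":
            break  # end of the entry
        if in_sequence:
            # Assumption: digits are position counters and blanks only
            # separate blocks of residues, so both can be dropped.
            residues.append(re.sub(r"[\d\s]", "", stripped))
        elif stripped == "BEGIN":
            in_sequence = True
    return "".join(residues)
```

For example, parse_dad_accession("D12345-2") yields the DDBJ accession D12345
and the index 2. Splitting at the final hyphen is a cautious choice in case a
nucleotide accession ever contains a hyphen itself.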

The LOCUS line contains the locus name, the length of the protein, the
molecular type (always "PRT"), the division name, and the release date of the
DNA counterpart. The DEFINITION line contains the species name and the protein
name. The other parts of a DAD entry, including FEATURES, are almost the same
as those of the corresponding DDBJ entry.

3. DAD categories

DAD entries are classified into 23 categories, adding TPA and TPACON to the
21 categories of the DDBJ periodical release. Please refer to the release note
of the DDBJ release for details (filename: ddbjrel.txt).
Also, there are two types of DAD files for each division: files with the
suffix ".DAD" in the DAD standard format, and files with the suffix
".DAD.fasta" in a FASTA-compatible format.

[DDBJ release note]
ftp://ftp.ddbj.nig.ac.jp/ddbj_database/ddbj/ddbjrel.txt

4. Citation

When you use DAD in your research, we would appreciate it if you would
include a reference to DDBJ in the related publications.
When citing an entry in the DAD database, it is appropriate to give the
protein_id and its accession number. It is also recommended to cite the first
publication in REFERENCE of the entry other than the submitter information.

DDBJ suggests that authors add a reference to DDBJ itself. The following
publication, which describes the recent activities of the DDBJ center, would
be appropriate to cite:

  Mashima J, Kodama Y, Kosuge T, Fujisawa T, Katayama T, Nagasaki H,
  Okuda Y, Kaminuma E, Ogasawara O, Okubo K, Nakamura Y and Takagi T.
  DNA data bank of Japan (DDBJ) progress report.
  Nucleic Acids Res. 44 (Database issue), D51-D57 (2016)
  DOI: 10.1093/nar/gkv1105

The following sentence is an example of how to cite an entry in the DAD
database:
-----------------------------------------------------------------------------
"We searched the DAD database (1) by sequence similarities and found an
amino acid sequence (2), with protein_id BAA22986.1 in DDBJ accession number
AB000714, which had significant similarity with ..."

(1) Mashima, J. et al, Nucleic Acids Res. 44(Database issue), D51-D57 (2016).
(2) Katahira, J. et al, J. Biol. Chem. 272, 26652-26658 (1997).
------------------------------------------------------------------------------

5. Contact information

DNA Data Bank of Japan
DDBJ Center
National Institute of Genetics
Research Organization of Information and Systems
Mishima 411-8540, Japan
  Phone:  +81 55 981 6853
  FAX:    +81 55 981 6849
  E-mail: ddbj@ddbj.nig.ac.jp
  WWW:    http://www.ddbj.nig.ac.jp/

6. Disclaimer

While DDBJ endeavors to keep its data correct, DDBJ makes no representations
or warranties of any kind about the completeness, accuracy or reliability with
respect to the entries contained in the DAD periodical release. DDBJ also
assumes no legal liability or responsibility for merchantability or fitness
for a particular purpose or that the use of the sequence data will not
infringe any patent or other rights. Any receipt, reliance or use you place on
such data is therefore strictly at your own risk.

7. DAD file categories

This release covers the following 23 categories of organisms and others
(see also '3. DAD categories'):
------------------------------------------------------------------------------
ddbjbct;     Category for bacteria
ddbjcon;     Category for CON (contigs)
ddbjenv;     Category for ENV (environmental samples)
ddbjest;     Category for EST (expressed sequence tags)
ddbjgss;     Category for GSS (genome survey sequences)
ddbjhtc;     Category for HTC (high throughput cDNA sequences)
ddbjhtg;     Category for HTG (high throughput genomic sequences)
ddbjhum;     Category for human
ddbjinv;     Category for invertebrates
ddbjmam;     Category for mammals other than primates and rodents
ddbjpat;     Category for patents
ddbjphg;     Category for phages
ddbjpln;     Category for plants
ddbjpri;     Category for primates other than human
ddbjrod;     Category for rodents
ddbjsts;     Category for STS (sequence tagged sites)
ddbjsyn;     Category for synthetic DNAs
ddbjtpa;     Category for TPA (third party annotations)
ddbjtpacon;  Category for CON (contigs) of TPA (third party annotations)
ddbjtsa;     Category for TSA (transcriptome shotgun assemblies)
ddbjuna;     Category for unannotated sequences
ddbjvrl;     Category for viruses
ddbjvrt;     Category for vertebrates other than mammals
------------------------------------------------------------------------------

In the present release, each of the above categories is recorded in
ddbj***##.DAD files as follows.

file prefix    number of files
-------------------------------
ddbjbct        45
ddbjcon        42
ddbjenv        2
ddbjest        1
ddbjgss        1
ddbjhtc        1
ddbjhtg        1
ddbjhum        2
ddbjinv        4
ddbjmam        1
ddbjpat        1
ddbjphg        1
ddbjpln        6
ddbjpri        1
ddbjrod        1
ddbjsts        1
ddbjsyn        1
ddbjtpa        1
ddbjtpacon     1
ddbjtsa        1
ddbjuna        1
ddbjvrl        5
ddbjvrt        2
-------------------------------

8. A sample of DAD entries

Below is a typical DAD entry. It may be useful for understanding the format
and contents.

----- ----- ----- ----- sample begin ----- ----- ----- -----
LOCUS       BAA22986.1               220 aa    PRT              HUM 28-OCT-1997
DEFINITION  Homo sapiens RVP1 protein.
ACCESSION   AB000714-1
PROTEIN_ID  BAA22986.1
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
            Eukaryotae; Metazoa; Chordata; Vertebrata; Mammalia; Eutheria;
            Primates; Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 1250)
  AUTHORS   Katahira,J.
  TITLE     Direct Submission
  JOURNAL   Submitted (26-JAN-1997) to the DDBJ/EMBL/GenBank databases.
            Contact:Jun Katahira
            Institute for Microbial Diseases, Osaka University, Department of
            Bacterial Toxinology; 3-1, Yamadaoka, Suita, Osaka 565, Japan
REFERENCE   2
  AUTHORS   Katahira,J., Sugiyama,H., Inoue,N., Horiguchi,Y., Matsuda,M. and
            Sugimoto,N.
  TITLE     Clostridium perfringens enterotoxin utilizes two structurally
            related membrane proteins as functional receptors in vivo
  JOURNAL   J. Biol. Chem. 272, 26652-26658 (1997)
COMMENT
FEATURES             Qualifiers
     source          /db_xref="H-InvDB:HIT000057926"
                     /mol_type="mRNA"
                     /organism="Homo sapiens"
                     /tissue_lib="lung"
     protein         /gene="hRVP1"
                     /transl_table=1
BEGIN
        1 MSMGLEITGT ALAVLGWLGT IVCCALPMWR VSAFIGSNII TSQNIWEGLW MNCVVQSTGQ
       61 MQCKVYDSLL ALPQDLQAAR ALIVVAILLA AFGLLVALVG AQCTNCVQDD TAKAKITIVA
      121 GVLFLLAALL TLVPVSWSAN TIIRDFYNPV VPEAQKREMG AGLYVGWAAA ALQLLGGALL
      181 CCSCPPREKK YTATKVVYSA PRSTGPGASL GTGYDRKDYV
//
----- ----- ----- ----- sample end ----- ----- ----- -----

9. Release history

------------------
Since release 50
------------------
The format of the SOURCE line in the DAD flat file has been changed.
As a result of this change, 1) the order of the organism name and the
organelle name is reversed, and 2) some DAD flat files include a common name,
as GenBank flat files do.
The change is shown below in detail.

----------------
Old (-rel. 49)
----------------
Format:  SOURCE      <organism name> [<organelle name>]
Example: SOURCE      Homo sapiens mitochondrion

----------------
New (rel. 50-)
----------------
Format:  SOURCE      [<organelle name>] <organism name> [(<common name>)]
Example: SOURCE      mitochondrion Homo sapiens (human)

See also '8. A sample of DAD entries'.
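
As a rough illustration of the SOURCE format used since release 50 (an
optional organelle name, then the organism name, then an optional common name
in parentheses), the following Python sketch splits such a line into its
parts. It is not part of any DDBJ software: the function name
parse_source_line is hypothetical, and the organelle list is an assumption
that contains only the value from the example plus one more; it is certainly
not exhaustive.

```python
import re
from typing import NamedTuple, Optional

# Assumed organelle vocabulary, for illustration only (not from DDBJ).
ORGANELLES = ("mitochondrion", "chloroplast")

_SOURCE_RE = re.compile(
    r"^SOURCE\s+"
    r"(?:(?P<organelle>" + "|".join(ORGANELLES) + r")\s+)?"
    r"(?P<organism>.+?)"
    r"(?:\s+\((?P<common>[^()]*)\))?\s*$"
)


class Source(NamedTuple):
    organelle: Optional[str]
    organism: str
    common_name: Optional[str]


def parse_source_line(line: str) -> Source:
    """Parse a release 50 or later SOURCE line into its three parts."""
    match = _SOURCE_RE.match(line)
    if match is None:
        raise ValueError(f"not a SOURCE line: {line!r}")
    return Source(match["organelle"], match["organism"], match["common"])
```

With the example line from the release note, "SOURCE mitochondrion Homo
sapiens (human)", the sketch returns the organelle "mitochondrion", the
organism "Homo sapiens" and the common name "human". A line in the old format
would not be split correctly, because there the organelle name follows the
organism name.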

------------------
Since release 45
------------------
A new division, TSA (Transcriptome Shotgun Assembly), has been started.

A new division for assembled mRNA sequences, Transcriptome Shotgun Assembly
(TSA), is included in the present release. With new sequencing technologies,
INSDC has received many requests to accept assembled EST sequences. These
sequence data have become more useful than they used to be, although they may
not be correctly assembled or may not exist in nature. Therefore, INSDC
decided to collect assembled EST sequences in the new division 'TSA'.

TSA sequences are shotgun assemblies of primary sequences deposited in the
EST division of INSDC, the Trace Archive (TA) or the Short-Read Archive (SRA).
Two specific keywords, "TSA" and "Transcriptome Shotgun Assembly", are present
in all TSA entries. The new division code, "TSA", also appears in the LOCUS
line of all TSA entries.
No format changes are anticipated for this new division; note, however, that
TSA entries make use of the same PRIMARY line that is described for entries in
the TPA category. The PRIMARY block contains references to the underlying
reads/transcripts that were assembled to construct a TSA record.

------------------
Since release 42
------------------
Deletion of E-mail address, phone and fax numbers from the DAD flat file

To comply with the Japanese law on the protection of personal information,
DDBJ deletes phone and fax numbers and E-mail addresses from the flat files of
entries submitted to DDBJ. This also helps to protect DAD releases against
senders of spam mail.
DDBJ retrofitted almost all entries submitted to DDBJ (not to GenBank or
EMBL) by the DDBJ periodical release 72.

Before the DAD periodical release 42, the submitter information was described
in the JOURNAL line of REFERENCE 1 as follows:
-------------------------------------------------------------------------------
REFERENCE   1  (bases 1 to 1200)
  AUTHORS   Mishima,T.
  TITLE     Direct Submission
  JOURNAL   Submitted (01-Jan-1990) to the DDBJ/EMBL/GenBank databases.
            Taro Mishima, DNA Data Bank of Japan, National Institute of
            Genetics; 1111, Yata, Mishima, Shizuoka 411-8540, Japan
            (E-mail:ddbj@ddbj.nig.ac.jp, URL:http://www.ddbj.nig.ac.jp/,
            Tel:81-12-345-6789, Fax:81-12-345-9876)
-------------------------------------------------------------------------------

After the deletion of the information in question, the DAD flat file takes
one of the following two forms.

Type 1: Phone and fax numbers and E-mail address are deleted.
-------------------------------------------------------------------------------
REFERENCE   1  (bases 1 to 1200)
  AUTHORS   Mishima,T.
  TITLE     Direct Submission
  JOURNAL   Submitted (01-Jan-1990) to the DDBJ/EMBL/GenBank databases.
            Contact:Taro Mishima
            DNA Data Bank of Japan, National Institute of Genetics; 1111,
            Yata, Mishima, Shizuoka 411-8540, Japan
            URL    :http://www.ddbj.nig.ac.jp/
-------------------------------------------------------------------------------

Type 2: When the submitters wish to keep their contact information
        disclosed, it is described as follows:
-------------------------------------------------------------------------------
REFERENCE   1  (bases 1 to 1200)
  AUTHORS   Mishima,T.
  TITLE     Direct Submission
  JOURNAL   Submitted (01-Jan-1990) to the DDBJ/EMBL/GenBank databases.
            Contact:Taro Mishima
            DNA Data Bank of Japan, National Institute of Genetics; 1111,
            Yata, Mishima, Shizuoka 411-8540, Japan
            URL    :http://www.ddbj.nig.ac.jp/
            E-mail :ddbj@ddbj.nig.ac.jp
            Phone  :81-12-345-6789
            Fax    :81-12-345-9876
-------------------------------------------------------------------------------

------------------
Since release 40
------------------
The CON division has been included.

CON; Contig / Constructed
To connect a series of entries, such as those submitted from a genome
project, each of the three data banks constructs an entry and assigns an
accession number to a large-scale sequence dataset.
Such entries are classified into the CON division.

------------------
Since release 38
------------------
From this release, the maximum file size has been changed to 1.5 GB, because
network capacity has increased remarkably. Each file named ddbj***##.DAD is
at most 1.5 GB in size.
See also '10. Statistics of DAD'.

------------------
Since release 32
------------------
Introduction of the ENV division:

Recently, submissions of sequences derived from environmental samples have
rapidly increased. To accommodate such submissions, a new division, ENV, has
been created. This division contains sequences obtained via direct molecular
isolation such as PCR, DGGE, or any anonymous method. In the past, sequences
derived from environmental samples belonged to taxonomic divisions, mainly
BCT.
At DDBJ, the retrofit to transfer relevant entries from taxonomic divisions
to the ENV division starts in the present release and ends by the next
periodical release. Please note that during this transitional period, some
entries that will eventually be placed in the ENV division will be found in
other divisions.

------------------
Since release 30
------------------
"H-InvDB" has been added to the db_xref (cross-reference) qualifier.
The following is an example.

FEATURES             Location/Qualifiers
     source          1..5589
                     /clone="hf00223s1"
                     /clone_lib="pBluescriptII SK plus"
                     /db_xref="H-InvDB:HIT000000001"

------------------
Since release 29
------------------
The GSS division has been included since release 29. GSS stands for Genome
Survey Sequence, which is similar to EST, except that GSS is genomic DNA
whereas EST is cDNA.

------------------
Since release 21
------------------
1) Some information on introns has been added. It is given as "intron_pos"
   in the Feature/Qualifiers.

   Examples:
   intron_pos 142:1 (2/12)
     means that the 2nd intron among 12 in total is located between the 1st
     and 2nd bases of the 142nd codon (amino acid residue).
   intron_pos 228:0 (4/12)
     means that the 4th intron among 12 in total is located between the 227th
     and 228th codons (between the 3rd base of the 227th codon and the 1st
     base of the 228th codon).

2) The LOCUS line has been changed. The following is an example and its
   explanation:

LOCUS       BAA21794.1               263 aa    PRT              BCT 05-FEB-1999

   Positions  Contents
   ---------  --------
   01-05      'LOCUS'
   06-12      spaces
   13-28      Locus name
   29-29      space
   30-40      Length of sequence, right-justified
   41-41      space
   42-43      'aa'
   44-47      spaces
   48-53      'PRT'
   54-64      spaces
   65-67      Division code
   68-68      space
   69-79      Date, in the form DD-MMM-YYYY (e.g., 15-MAR-1991)
   ---------------------

3) TPA data have been provided in a separate file (ddbjtpa.DAD).

10. Statistics of DAD

The following are the statistics of this release of DAD.

total number of entries         47,659,541
total length of sequences       14,850,873,151
average length                  311
name of longest sequence        CP000108-608  PID:ABB27887.1
length of longest sequence      36,805 aa (CP000108-608)

=========================================================================
file name       no. of entries  no. of amino acids     file size
=========================================================================
ddbjbct1.DAD            328407            99340280    1468006777
ddbjbct2.DAD            496011           150775410    1468030923
ddbjbct3.DAD            581966           184023155    1468007881
ddbjbct4.DAD            575527           173607351    1468009319
ddbjbct5.DAD            453798           145569803    1468009335
ddbjbct6.DAD            434476           138405981    1468009361
ddbjbct7.DAD            426209           132513228    1468006747
ddbjbct8.DAD            457942           141700234    1468007130
ddbjbct9.DAD            386423           122478500    1468009830
ddbjbct10.DAD           354449           112624133    1468008824
ddbjbct11.DAD           391797           122127121    1468007209
ddbjbct12.DAD           354999           112796142    1468010834
ddbjbct13.DAD           393008           125870018    1468009117
ddbjbct14.DAD           428307           134946586    1468006961
ddbjbct15.DAD           449294           141451430    1468007993
ddbjbct16.DAD           429104           134732135    1468007519
ddbjbct17.DAD           471196           146909046    1468008254
ddbjbct18.DAD           544860           172371913    1468008961
ddbjbct19.DAD           469945           145034520    1468008688
ddbjbct20.DAD           460698           141045409    1468007648
ddbjbct21.DAD           399027           124071461    1468007796
ddbjbct22.DAD           367567           113594129    1468009386
ddbjbct23.DAD           275192            84543315    1468008571
ddbjbct24.DAD           298806            91159845    1468009052
ddbjbct25.DAD           384011           116871048    1468008081
ddbjbct26.DAD           467733           143895795    1468007265
ddbjbct27.DAD           441515           140793527    1468009111
ddbjbct28.DAD           436560           138652563    1468009464
ddbjbct29.DAD           429333           134458729    1468007532
ddbjbct30.DAD           464376           142510587    1468007614
ddbjbct31.DAD           431168           133632786    1468006633
ddbjbct32.DAD           414043           126543067    1468007946
ddbjbct33.DAD           421195           130311822    1468008565
ddbjbct34.DAD           433483           135980839    1468009133
ddbjbct35.DAD           411775           129444238    1468007214
ddbjbct36.DAD           406440           127474122    1468010196
ddbjbct37.DAD           403495           125641380    1468006839
ddbjbct38.DAD           411446           130466230    1468008099
ddbjbct39.DAD           399133           126476222    1468009838
ddbjbct40.DAD           334262           104877223    1468007944
ddbjbct41.DAD           532425           156465469    1468007306
ddbjbct42.DAD           621594           187180017    1468006642
ddbjbct43.DAD           705645           204028029    1468007435
ddbjbct44.DAD           775375           226732681    1468006466
ddbjbct45.DAD              903              190237       2950421
ddbjcon1.DAD            211249            92618928    1468010411
ddbjcon2.DAD            277466           115225480    1468010230
ddbjcon3.DAD            180507            95265222    1468016516
ddbjcon4.DAD            251102           102984306    1468007825
ddbjcon5.DAD            327247           124110199    1468010758
ddbjcon6.DAD            314530           134216426    1468013337
ddbjcon7.DAD            505148           213161768    1468007513
ddbjcon8.DAD            454217           172445138    1468007740
ddbjcon9.DAD            436953           111311941    1468008131
ddbjcon10.DAD           366566            63878336    1468009790
ddbjcon11.DAD           366465            63959015    1468008429
ddbjcon12.DAD           366468            64026649    1468007816
ddbjcon13.DAD           366535            64001365    1468009933
ddbjcon14.DAD           366514            64007880    1468007064
ddbjcon15.DAD           366593            63944965    1468006487
ddbjcon16.DAD           366991            62768015    1468008059
ddbjcon17.DAD           364546            69676202    1468007119
ddbjcon18.DAD           361345            77032933    1468007181
ddbjcon19.DAD           363074            71701768    1468008390
ddbjcon20.DAD           360780            77361333    1468008465
ddbjcon21.DAD           363577            71202294    1468008749
ddbjcon22.DAD           358787            83590710    1468009107
ddbjcon23.DAD           357979            84650389    1468010478
ddbjcon24.DAD           356684            86631126    1468010164
ddbjcon25.DAD           357043            85885915    1468008351
ddbjcon26.DAD           457432           166777213    1468006438
ddbjcon27.DAD           431767           159848905    1468007459
ddbjcon28.DAD           397025           159941338    1468007851
ddbjcon29.DAD           380489           150784268    1468008193
ddbjcon30.DAD           473534           191967120    1468007897
ddbjcon31.DAD           315309           135860905    1468010677
ddbjcon32.DAD           364369           156810470    1468007110
ddbjcon33.DAD           413568           176240385    1468010212
ddbjcon34.DAD           432632           188632254    1468007698
ddbjcon35.DAD           441985           177330475    1468007445
ddbjcon36.DAD           431654           191456612    1468007834
ddbjcon37.DAD           483186           194655602    1468007920
ddbjcon38.DAD           468502           205697514    1468012437
ddbjcon39.DAD           330959           110190544    1468009944
ddbjcon40.DAD           414416           176646180    1468009654
ddbjcon41.DAD           452443           197159288    1468006718
ddbjcon42.DAD           165456            63528382     488119538
ddbjenv1.DAD            684733           138181839    1468007823
ddbjenv2.DAD             46846             9049647      92195919
ddbjest1.DAD              1163              153762       2565756
ddbjgss1.DAD              3137              962078       8036938
ddbjhtc1.DAD            115924            35183445     427479345
ddbjhtg1.DAD             64287            17592284     262498796
ddbjhum1.DAD            620039           180957946    1468006481
ddbjhum2.DAD             82673            20993216     171974891
ddbjinv1.DAD            584572           176084889    1468006802
ddbjinv2.DAD            690472           179177293    1468007049
ddbjinv3.DAD            696527           150238576    1468007314
ddbjinv4.DAD            414378           106628961     939066027
ddbjmam1.DAD            265906            67123099     543934553
ddbjpat1.DAD            391458           163940387     581382852
ddbjphg1.DAD            336735            70536100     727891340
ddbjpln1.DAD            455611           161101626    1468006882
ddbjpln2.DAD            463587           181090115    1468008559
ddbjpln3.DAD            517365           229427381    1468007297
ddbjpln4.DAD            699786           200975752    1468007222
ddbjpln5.DAD            748668           163678144    1468008128
ddbjpln6.DAD            147276            49919772     319235543
ddbjpri1.DAD             79997            19025434     174355821
ddbjrod1.DAD            213377            66513929     536613596
ddbjsts1.DAD                 9                 812         22053
ddbjsyn1.DAD            129533            48437166     346754165
ddbjtpa1.DAD             63306            25368312     194625050
ddbjtpacon1.DAD          71628            31568870     308879820
ddbjtsa1.DAD            121318            49616887     325908206
ddbjuna1.DAD               229               39764        391999
ddbjvrl1.DAD            659414           209376982    1468008856
ddbjvrl2.DAD            691350           208300631    1468006529
ddbjvrl3.DAD            634608           202750107    1468007718
ddbjvrl4.DAD            596885           226622954    1468009653
ddbjvrl5.DAD             68721            20359826     133569829
ddbjvrt1.DAD            693321           170691515    1468007103
ddbjvrt2.DAD            426692            95700136     858347441
=========================================================================
Total                 47659541         14850873151  154247660284
=========================================================================
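
The summary figures in this section can be cross-checked with a short Python
sketch. It is an illustration only: the function names parse_stat_rows and
column_totals are hypothetical, and the sketch assumes that the per-file table
is available as text with one file per line and four whitespace-separated
fields. It also assumes that the reported average length is the total number
of residues divided by the total number of entries with the fraction dropped,
which is consistent with the reported value of 311.

```python
from typing import List, Tuple

StatRow = Tuple[str, int, int, int]  # file name, entries, residues, file size


def parse_stat_rows(text: str) -> List[StatRow]:
    """Extract per-file rows, skipping rules, headers and the Total line."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if (
            len(fields) == 4
            and fields[0].endswith(".DAD")
            and all(field.isdigit() for field in fields[1:])
        ):
            rows.append((fields[0], int(fields[1]), int(fields[2]), int(fields[3])))
    return rows


def column_totals(rows: List[StatRow]) -> Tuple[int, int, int]:
    """Sum the entries, residues and file size columns."""
    return (
        sum(row[1] for row in rows),
        sum(row[2] for row in rows),
        sum(row[3] for row in rows),
    )


# Totals as reported in this release note.
TOTAL_ENTRIES = 47_659_541
TOTAL_RESIDUES = 14_850_873_151

# Integer division drops the fractional part of the average (about 311.6).
AVERAGE_LENGTH = TOTAL_RESIDUES // TOTAL_ENTRIES
```

Running column_totals over all rows of the table and comparing the result
with the Total line is a simple way to detect a truncated or damaged copy of
this file.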