ABOUT THE DATASETS

There are three simulated Illumina datasets (one standard dataset and two additional datasets) plus a small test dataset. For each dataset, the sequencing reads are in FASTQ files (*.fq) and the alignments of reads to the reference are in ALN files (*.aln). For each paired-read dataset, *1.fq contains the first reads and *2.fq the second reads. Similarly, *1.aln contains the alignments of the first reads and *2.aln those of the second reads. The FASTQ and ALN formats are described in the DATA FILE FORMAT section below.

READ ID CONVENTION

A read id has the form part1-part2-part3, where part1 is the reference sequence id, part2 is a number assigned by ART that is unique to a paired-end read pair, and part3 is either 1 for the first read or 2 for the second read.

DATASETS

The files listed below are gzip-compressed (*.gz). For each paired-read dataset, *1.fq.gz contains the first reads and *2.fq.gz the second reads. Similarly, *1.aln.gz contains the alignments of the first reads and *2.aln.gz those of the second reads.

STANDARD DATASETS

Read type: 35bp x 2 (paired-end reads)
Read fragment size distribution: Gaussian (MEAN 200, STD 20)

1) 6 trio individuals, 60X each
   chr17p12_trio_60x1.fq.gz    chr17p12_trio_60x1.aln.gz
   chr17p12_trio_60x2.fq.gz    chr17p12_trio_60x2.aln.gz

2) 1,000 unrelated individuals, 4X each
   chr17p12_1000_4x1.fq.gz     chr17p12_1000_4x1.aln.gz
   chr17p12_1000_4x2.fq.gz     chr17p12_1000_4x2.aln.gz

ADDITIONAL DATASET #1

Read type: 35bp x 2 (paired-end reads)
Read fragment size distribution: Gaussian (MEAN 200, STD 20)

1) 200 individuals, 4X each
   Subjects: first 66 from pop0, first 67 from pop1, first 67 from pop3 in the 1,000 unrelated individuals above
   chr17p12_200_4x1.fq.gz      chr17p12_200_4x1.aln.gz
   chr17p12_200_4x2.fq.gz      chr17p12_200_4x2.aln.gz

2) 100 individuals, 8X each
   Subjects: first 33 from pop0, first 33 from pop1, first 34 from pop3 in the 1,000 unrelated individuals above
   chr17p12_100_8x1.fq.gz      chr17p12_100_8x1.aln.gz
   chr17p12_100_8x2.fq.gz      chr17p12_100_8x2.aln.gz

ADDITIONAL DATASET #2:
Subject: a single trio individual
Read type: 35bp x 2 (paired-end reads)

1) a single trio individual, 60X, fragment size: mean 500, STD 50
   aTrio_child_5h60x1.fq.gz    aTrio_child_5h60x1.aln.gz
   aTrio_child_5h60x2.fq.gz    aTrio_child_5h60x2.aln.gz

2) the same trio individual, 60X, fragment size: mean 3000, STD 300
   aTrio_child_3k60x1.fq.gz    aTrio_child_3k60x1.aln.gz
   aTrio_child_3k60x2.fq.gz    aTrio_child_3k60x2.aln.gz

A SMALL TEST DATASET (for data validation only)

Read type: 35bp x 2 (paired-end reads)
Read fragment size distribution: Gaussian (MEAN 200, STD 20)

1) 6 trio individuals, 8X each
   test_trio_200_20_8x1.fq.gz  test_trio_200_20_8x1.aln.gz
   test_trio_200_20_8x2.fq.gz  test_trio_200_20_8x2.aln.gz

SIMULATION TOOL

ART (Artificial Read Transcriber), version 0.8.1.6

READ ERROR PROFILE

The read error profile is based on the empirical error model derived from calibrated Illumina paired-end read data by Richard Durbin's group at Sanger.

FRAGMENT SIZE DISTRIBUTION

The fragment sizes of all paired-end read data above follow a Gaussian distribution.

DATA FILE FORMAT

FASTQ format

A FASTQ file contains both the sequence bases and the quality scores of the sequencing reads, in the following format:

   @read_id
   sequence_read
   +
   base_quality_scores

Each base quality score is encoded as a single character: the quality score equals the ASCII code of the character minus 33.

Example:

   @pop0_ind0_chr0-4028550-1
   caacgccactcagcaatgatcggtttattcacgat
   +
   ????????????7???????=&?< [truncated]

ALN format

An ALN file records the alignment of each read to the reference, in the following format:

   >ref_seq_id read_id aln_start_pos ref_seq_strand
   ref_seq_aligned
   read_seq_aligned

aln_start_pos is the alignment start position in the reference sequence. It is always relative to the strand of the reference sequence given by ref_seq_strand. That is, aln_start_pos 10 on the plus (+) strand is a different position from aln_start_pos 10 on the minus (-) strand.

ref_seq_aligned is the aligned region of the reference sequence, which can be from the plus strand or the minus strand of the reference sequence.
read_seq_aligned is the aligned sequence read, which is always in the same orientation as the same read in the corresponding FASTQ file.

Example:

   >pop0_ind348_chr0 pop0_ind348_chr0-4028548-1 2644400 +
   gtaccatacttccttagtcgtgtactcttggtcta
   gtaccatacttccttagtcgtgtacgcttggtcta
   >pop0_ind348_chr0 pop0_ind348_chr0-4028550-1 4557676 -
   ttcatgggggctccttcacgtatgactgaaaagtg
   ttcatgggggctccttcacgtatgactgaaaagtg
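
READING THE FILES (ILLUSTRATIVE SKETCH)

The conventions described above (the three-part read id, the ASCII-minus-33 quality encoding, and the three-line ALN record) can be sketched in Python as follows. This is a minimal illustration, not a tool shipped with the datasets. The names ReadId, AlnRecord, parse_read_id, decode_qualities, parse_aln and count_mismatches are hypothetical. The sketch assumes that every ALN record consists of one ">" header line followed by exactly two sequence lines, and that any other line can be skipped.

```python
from typing import Iterable, Iterator, List, NamedTuple


class ReadId(NamedTuple):
    ref_seq_id: str   # part1: reference sequence id
    pair_number: int  # part2: number assigned by ART, shared by both reads of a pair
    mate: int         # part3: 1 = first read, 2 = second read


class AlnRecord(NamedTuple):
    ref_seq_id: str
    read_id: str
    aln_start_pos: int     # relative to the strand given by ref_seq_strand
    ref_seq_strand: str    # "+" or "-"
    ref_seq_aligned: str
    read_seq_aligned: str


def parse_read_id(read_id: str) -> ReadId:
    """Split a read id of the form part1-part2-part3."""
    # Split from the right so a hyphen inside part1 would not break parsing.
    ref_seq_id, pair_number, mate = read_id.rsplit("-", 2)
    return ReadId(ref_seq_id, int(pair_number), int(mate))


def decode_qualities(quality_string: str) -> List[int]:
    """Convert quality characters to scores: ASCII code minus 33."""
    return [ord(ch) - 33 for ch in quality_string]


def parse_aln(lines: Iterable[str]) -> Iterator[AlnRecord]:
    """Yield one AlnRecord per '>' header line and its two sequence lines."""
    it = (line.rstrip("\n") for line in lines)
    for header in it:
        if not header.startswith(">"):
            continue  # assumption: lines outside records can be ignored
        ref_seq_id, read_id, start, strand = header[1:].split()
        ref_aligned = next(it, None)
        read_aligned = next(it, None)
        if ref_aligned is None or read_aligned is None:
            raise ValueError(f"truncated ALN record for read {read_id}")
        yield AlnRecord(ref_seq_id, read_id, int(start), strand,
                        ref_aligned, read_aligned)


def count_mismatches(record: AlnRecord) -> int:
    """Count alignment columns where the read differs from the reference."""
    return sum(a != b for a, b in
               zip(record.ref_seq_aligned, record.read_seq_aligned))


if __name__ == "__main__":
    example = [
        ">pop0_ind348_chr0 pop0_ind348_chr0-4028548-1 2644400 +",
        "gtaccatacttccttagtcgtgtactcttggtcta",
        "gtaccatacttccttagtcgtgtacgcttggtcta",
    ]
    for rec in parse_aln(example):
        print(parse_read_id(rec.read_id), count_mismatches(rec))
    print(decode_qualities("?7=&"))
```

For the compressed files listed above, the same parser can be fed from the standard library, for example parse_aln(gzip.open("chr17p12_trio_60x1.aln.gz", "rt")) after "import gzip". Because both reads of a pair share part1 and part2 of the read id, the pair (ref_seq_id, pair_number) returned by parse_read_id can serve as a key for matching a record in *1.aln.gz with its mate in *2.aln.gz.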