TITLE: SNP detection by Applied Biosystems SOLiD sequencing of HapMap individual NA19240
AUTHORS: McLaughlin S., Peckham H., Hyland F.C., Scafe C., Costa G., Muzny D., Gibbs R., De La Vega F.M., McKernan K.
YEAR: 2008
STATUS: 1
||
TYPE: METHOD
HANDLE: 1000GENOMES
ID: SOLiD.Corona.v2
METHOD_CLASS: Sequence
SEQ_BOTH_STRANDS: YES
TEMPLATE_TYPE: DIPLOID
MULT_PCR_AMPLIFICATION: NA
MULT_CLONES_TESTED: NO
METHOD:
Summary: The YRI 1000 Genomes sample NA19240 was sequenced with the SOLiD System at Baylor College of Medicine HGSC and at Applied Biosystems. The sample was sequenced to 29x coverage.

Mapping and SNP Calling: Mapping, pairing, and SNP calling were performed with v2 of the SOLiD Corona SNP caller (source code can be downloaded at www.solidsoftwaretools.com). The SOLiD Corona SNP caller is a heuristic, rules-based method for calling SNPs. It removes sequencing error while retaining evidence of a genetic variant, using SOLiD colorspace rules. The method incorporates information about coverage, the number of observations of the reference allele and the alternate allele, and the quality values of the reads, and produces a score reflecting the characteristics of the reads that provide evidence of a SNP.

Filtering: The SNP list was filtered to eliminate candidate false-positive SNPs as follows (about 4% of total SNPs were removed by filtering):

Novel SNPs within 10bp of any dbSNP entry                93,674
Within 10bp of a SOLiD Indel                             21,600
Invalid Double SOLiD SNPs                                21,465
SNPs with >75x coverage                                  17,055
Heterozygotes in SOLiD CNVs with increased copy number   15,112

* Notes: We identified all SNPs within 10bp of any dbSNP entry (SNPs and InDels), but we removed only those that were novel. For the other filters, we removed both dbSNP and novel SNPs. "SOLiD Indels" means the small InDels discovered in this sample using these SOLiD sequencing runs; "SOLiD CNVs" is defined in the same way.
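The distance-based and coverage-based filters listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Corona implementation: the `Snp` record, the `near` and `passes_filters` helpers, and the inclusive reading of "within 10bp" are all hypothetical. The "Invalid Double SOLiD SNPs" and CNV heterozygote filters are left out because the text does not define them in enough detail to reproduce.

```python
from bisect import bisect_left
from dataclasses import dataclass


@dataclass
class Snp:
    chrom: str
    pos: int        # 1-based coordinate
    coverage: int   # number of reads covering the position
    in_dbsnp: bool  # True if the SNP already has a dbSNP entry


def near(pos, sorted_positions, window=10):
    """Return True if any entry of sorted_positions lies within `window` bp of pos.

    Assumption: "within 10bp" is read as an inclusive distance of <= 10.
    """
    i = bisect_left(sorted_positions, pos - window)
    return i < len(sorted_positions) and sorted_positions[i] <= pos + window


def passes_filters(snp, dbsnp_pos, indel_pos, max_cov=75):
    """Apply three of the filters described in the text.

    dbsnp_pos and indel_pos map a chromosome name to a sorted list of positions.
    """
    # Only novel SNPs are removed for proximity to a dbSNP entry.
    if not snp.in_dbsnp and near(snp.pos, dbsnp_pos.get(snp.chrom, [])):
        return False
    # Both novel and known SNPs are removed near an indel found in this sample.
    if near(snp.pos, indel_pos.get(snp.chrom, [])):
        return False
    # Both novel and known SNPs are removed at very high coverage (>75x).
    if snp.coverage > max_cov:
        return False
    return True
```

Keeping the per-chromosome positions sorted lets each proximity check run as a binary search, which matters when millions of candidate SNPs are tested against all of dbSNP.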
Further filtering removed all SNPs that were not also detected in this family using the Sanger Centre trio calling method applied to Illumina high-throughput sequencing data averaging 12.6x, 15.1x and 17x coverage for individuals NA19238, NA19239 and NA19240 (mother, father and daughter), respectively.

Validation: Validation was done by Baylor College of Medicine HGSC for novel SNPs only, i.e., those not in dbSNP129. Validation was done by sequencing, using primers already in Baylor freezers (so biased towards SNPs in ENCODE regions but not in dbSNP, and towards SNPs in exons but not in dbSNP, since these were over-represented in the freezers). AB provided a SNP list, and Baylor randomly selected from this list SNPs covered by primers in their freezer. Validation was performed by Sanger sequencing. A major advantage of validation by Sanger sequencing is that the primers were specifically designed to amplify regions that may contain repeats, homologous genes, etc.; that is, regions that are hard to access by genotyping and by arrays, and so more difficult than the 'HapMappable' genome. These should be regions representative of sequencing results on the whole genome (excluding highly repetitive regions such as telomeres, centromeres, Y chromosome, etc.) and not just the 'easier to sequence and map' regions represented by arrays and HapMap.

In the two summary data files, the interpretation of the columns is as follows:
1. Number of reads covering the position with 2 color space calls (i.e., reads that only partially cover the base with the last color space call of the read are not counted)
2. Reference base at this position
3. Consensus base at this position (IUPAC codes are used for heterozygotes)
4. Score* of the reads of the reference allele
5. Confidence* of the reads of the reference allele
6. Number of single tag (from fragment runs) reads containing the reference allele / number of unique start positions of reads containing the reference allele
7. Number of mate pair tag (from paired-end runs) reads containing the reference allele / number of unique start positions of reads containing the reference allele
8. Score* of the reads of the non-reference allele
9. Confidence* of the reads of the non-reference allele
10. Number of single tag (from fragment runs) reads containing the non-reference allele / number of unique start positions of reads containing the non-reference allele
11. Number of mate pair tag (from paired-end runs) reads containing the non-reference allele / number of unique start positions of reads containing the non-reference allele
12. Base space coordinate of the position (1-based)
13. Chromosome number (where 23=X and 24=Y)
14. Chromosome name
15. rs_id if in dbSNP

Score* = the weighted score assigned to the reference or consensus base, based on all the reads that cover it with 2 color space calls (the weighted scores for each of the 16 types of dibase combinations sum to 1)
Confidence* = the average confidence of each read that was used to generate the weighted score of the reference or consensus base (aka Valid Score)
||
TYPE: POPULATION
HANDLE: 1000GENOMES
ID: 1000GENOMES_pilot_2_YRI_child
POP_CLASS: west africa
POPULATION: One individual, NA19240, from the HapMap YRI sample
||
TYPE: SNPASSAY
HANDLE: 1000GENOMES
BATCH: 1000GENOMES-NA19240-2008-12-12
MOLTYPE: Genomic
METHOD: SOLiD.Corona.v2
SAMPLESIZE: 2 chromosomes
ORGANISM: Homo sapiens
COMMENT: initial data release
||
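As an illustration of the 15-column layout described in the METHOD section, the following Python sketch reads one row of a summary data file and expands the IUPAC consensus code. Several details are assumptions, since the record does not state them: the files are taken to be tab-delimited, the rs_id column is taken to be empty or absent for novel SNPs, and the column names, `parse_row` and `genotype` are hypothetical. The "reads / unique start positions" fields are kept as raw strings because their exact encoding is not given.

```python
# Hypothetical names for the 15 columns, in the order given in the METHOD text.
COLUMNS = [
    "coverage", "ref_base", "consensus_base",
    "ref_score", "ref_confidence", "ref_single_tag", "ref_mate_pair",
    "alt_score", "alt_confidence", "alt_single_tag", "alt_mate_pair",
    "position", "chrom_number", "chrom_name", "rs_id",
]

# Standard IUPAC nucleotide codes for single bases and two-base ambiguities.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
}


def parse_row(line, sep="\t"):
    """Split one summary-file row into a dict keyed by COLUMNS.

    Assumption: tab-delimited rows with 14 or 15 fields (rs_id may be missing).
    """
    fields = line.rstrip("\n").split(sep)
    if not 14 <= len(fields) <= 15:
        raise ValueError(f"expected 14 or 15 fields, got {len(fields)}")
    fields += [""] * (len(COLUMNS) - len(fields))  # pad a missing rs_id
    row = dict(zip(COLUMNS, fields))
    for key in ("coverage", "position", "chrom_number"):
        row[key] = int(row[key])
    return row


def genotype(row):
    """Return ('het' or 'hom', alleles) from the IUPAC consensus base."""
    alleles = IUPAC[row["consensus_base"].upper()]
    return ("het" if len(alleles) == 2 else "hom"), alleles


# Made-up example row; the values are illustrative only, not real data.
example = "31\tA\tR\t0.52\t0.91\t3/3\t12/11\t0.48\t0.90\t2/2\t13/12\t123456\t1\tchr1\t"
row = parse_row(example)
print(genotype(row), row["position"], row["rs_id"] or "novel")
```

An empty rs_id is treated here as "not in dbSNP", which matches the description of column 15 but should be checked against the actual files.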