description |
Here we provide deeper sequencing coverage of the HT-SELEX massively parallel sequencing libraries that we described in our 2013 Cell article "DNA-binding specificities of human transcription factors" (Jolma et al. 2013; PMID:23332764). The DNA fragments produced by the experiments were re-pooled using a lower degree of multiplexing (~55 samples per lane vs. ~800 samples in the original study), and these libraries were then sequenced on the Illumina HiSeq 2000 system. We note that the PWM models and all other analyses published in the original article were generated using the earlier data, which is also available through ENA under accession PRJEB3289. Although the data in this study is based on the same experiments and thus leads to very similar results, it has not been scrutinized as extensively as our earlier data and could contain new artifacts derived from, e.g., the different multiplexing scheme. Samples consist of single-read sequencing of synthetic DNA fragments with a fixed-length randomized region, or of samples derived from such an initial library by selection with a sequence-specific DNA-binding protein. Originally, multiple samples with different "barcode" tag sequences were run on the same Illumina sequencing lane, but the released files have already been de-multiplexed, and the constant regions and "barcodes" of each sequence have been removed from the sequencing reads to facilitate use of the data. Barcodes and oligonucleotide designs are indicated in the names of the individual entries. Depending on the selection ligand design, the sequences in each of these FASTQ files are 14, 20, 30 or 40 bases long and had different flanking regions on both sides of the sequence. 
The names of the sequencing result files are the same as in the previous data for the same experiments and selection cycles, except that the letters ES (Extended Sequencing) have been added to the "experimental batch" field to distinguish them from the original data. The run entries are named in one of the following two ways: Example 1) "BCL6B_DBD_ESAC_TGCGGG20NGA_1", where the name is composed of the fields ProteinName_CloneType_Batch_BarcodeDesign_SelectionCycle. This experiment used the barcoded ligand TGCGGG20NGA, where both of the variable flanking regions are indicated as they appeared in the original sequence reads. This ligand has been selected for one round of HT-SELEX using a recombinant protein that contained the DNA-binding domain of the human transcription factor BCL6B. The name also indicates that the file is based on Extended Sequencing ("ES") of an experiment that was performed in an original batch of experiments named "AC". Example 2) "ES0_TGCGGG20NGA_0", where the name is composed of ES(zero)_BarcodeDesign_(zero). These sequences have been generated by extended sequencing of the initial non-selected pool. The same initial pools have been used in multiple experiments that were performed in different batches; thus, for example, this background sequence pool is the shared background for all of the following samples: BCL6B_DBD_ESAC_TGCGGG20NGA_1, ZNF784_full_ESAE_TGCGGG20NGA_3, DLX6_DBD_ESY_TGCGGG20NGA_4 and MSX2_DBD_ESW_TGCGGG20NGA_2. This new, deeper sequencing data has been used in the publication "Transcription factor family‐specific DNA shape readout revealed by quantitative specificity models" by Yang et al. 2016 (http://msb.embopress.org/content/13/2/910). Pre-processed data as described by Yang et al. is available at http://rohslab.cmb.usc.edu/MSB2017/ |
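The naming scheme described above can be split into its fields programmatically. The following Python snippet is a minimal sketch, not part of the original data release: the names `RunName`, `parse_run_name` and `background_name` are hypothetical, and it assumes that no field itself contains an underscore and that the digits before "N" in the barcode design give the length of the randomized region.

```python
import re
from typing import NamedTuple, Optional


class RunName(NamedTuple):
    """Fields of a run entry name (hypothetical helper type)."""
    protein: Optional[str]        # None for the non-selected initial pool
    clone_type: Optional[str]     # e.g. "DBD" or "full"; None for the initial pool
    batch: str                    # e.g. "ESAC", or "ES0" for the initial pool
    barcode_design: str           # e.g. "TGCGGG20NGA"
    cycle: int                    # selection cycle; 0 for the initial pool
    random_length: Optional[int]  # assumed: digits before "N" in the design


# Assumed layout of a barcode design: flank, length, "N", flank.
_DESIGN = re.compile(r"^[ACGT]*(\d+)N[ACGT]*$")


def parse_run_name(name: str) -> RunName:
    """Split a run entry name into its underscore-separated fields."""
    fields = name.split("_")
    if len(fields) == 5:
        # ProteinName_CloneType_Batch_BarcodeDesign_SelectionCycle
        protein, clone_type, batch, design, cycle = fields
        if not batch.startswith("ES"):
            raise ValueError(f"batch field lacks the ES prefix: {name!r}")
    elif len(fields) == 3:
        # ES0_BarcodeDesign_0 (initial non-selected pool)
        batch, design, cycle = fields
        protein = clone_type = None
        if batch != "ES0" or cycle != "0":
            raise ValueError(f"not an initial-pool name: {name!r}")
    else:
        raise ValueError(f"unexpected number of fields: {name!r}")
    match = _DESIGN.match(design)
    random_length = int(match.group(1)) if match else None
    return RunName(protein, clone_type, batch, design, int(cycle), random_length)


def background_name(name: str) -> str:
    """Return the name of the initial pool sharing this entry's barcode design."""
    return f"ES0_{parse_run_name(name).barcode_design}_0"
```

For example, `parse_run_name("BCL6B_DBD_ESAC_TGCGGG20NGA_1")` yields the protein BCL6B, clone type DBD, batch ESAC and selection cycle 1, and `background_name` maps each of the four samples listed above to ES0_TGCGGG20NGA_0.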