THiCweed: fast, sensitive detection of sequence features by clustering big datasets
| Title | THiCweed: fast, sensitive detection of sequence features by clustering big datasets |
| Publication Type | Journal Article |
| Year of Publication | 2018 |
| Authors | Agrawal, A, Sambare, SV, Narlikar, L, Siddharthan, R |
| Journal | Nucleic Acids Research |
| Volume | 46 |
| Issue | 5 |
| Pagination | e29 |
| Date Published | MAR |
| ISSN | 0305-1048 |
| Abstract | We present THiCweed, a new approach to analyzing transcription factor binding data from high-throughput chromatin immunoprecipitation-sequencing (ChIP-seq) experiments. THiCweed clusters bound regions based on sequence similarity using a divisive hierarchical clustering approach based on sequence similarity within sliding windows, while exploring both strands. THiCweed is specially geared toward data containing mixtures of motifs, which present a challenge to traditional motif-finders. Our implementation is significantly faster than standard motif-finding programs, able to process 30 000 peaks in 1-2 h, on a single CPU core of a desktop computer. On synthetic data containing mixtures of motifs it is as accurate or more accurate than all other tested programs. THiCweed performs best with large 'window' sizes (>50 bp), much longer than typical binding sites (7-15 bp). On real data it successfully recovers literature motifs, but also uncovers complex sequence characteristics in flanking DNA, variant motifs and secondary motifs even when they occur in <5% of the input, all of which appear biologically relevant. We also find recurring sequence patterns across diverse ChIP-seq datasets, possibly related to chromatin architecture and looping. THiCweed thus goes beyond traditional motif finding to give new insights into genomic transcription factor-binding complexity. |
| DOI | 10.1093/nar/gkx1251 |
| Type of Journal (Indian or Foreign) | Foreign |
| Impact Factor (IF) | 10.162 |
