CisBP FAQ

What is CisBP?
CisBP (Catalog of Inferred Sequence Binding Preferences) is a freely available online database of transcription factor (TF) binding specificities. It currently incorporates data from >700 species covering >300 TF families, totaling >390,000 TFs (of which >165,000 have at least one DNA binding motif). CisBP collects data from >70 sources, including other databases such as Transfac, JASPAR, HOCOMOCO, FactorBook, UniProbe, and Fly Factor Survey, as well as dozens of additional publications. In addition to housing these “directly determined” DNA binding motifs, CisBP also includes “inferred” motifs. Inferences are performed by mapping motifs across and within species, using DNA binding domain similarity thresholds established separately for each TF family (see publication for details). In other words, if a mouse TF has a known motif, we can infer its human ortholog’s motif, provided that the ortholog’s DNA binding domain is “similar enough”.

Who made CisBP?
CisBP is a collaborative effort between the labs of Tim Hughes (University of Toronto) and Matt Weirauch (Cincinnati Children’s Hospital). It originated while Matt was a postdoc in Tim’s lab. In addition to Matt and Tim, extensive contributions have been made by Ally Yang (experimental), Mihai Albu (database/web server), and Sam Lambert (computational).

How should I cite CisBP?
Please cite our paper:
Determination and inference of eukaryotic transcription factor sequence specificity.
Weirauch MT, Yang A, Albu M, Cote AG, Montenegro-Montero A, Drewe P, Najafabadi HS, Lambert SA, Mann I, Cook K, Zheng H, Goity A, van Bakel H, Lozano JC, Galli M, Lewsey MG, Huang E, Mukherjee T, Chen X, Reece-Hoyes JS, Govindarajan S, Shaulsky G, Walhout AJ, Bouget FY, Ratsch G, Larrondo LF, Ecker JR, Hughes TR.
Cell. 2014 Sep 11;158(6):1431-43. doi: 10.1016/j.cell.2014.08.009.
PMID: 25215497

To cite the Similarity Regression method:
Lambert SA, Yang A, Sasse A, Cowley G, Albu M, Caddick MX, Morris QD, Weirauch MT, and Hughes TR. Similarity Regression predicts evolution of transcription factor sequence specificity. Nature Genetics, 2019. (in press)

What is the difference between a “direct” and “inferred” motif?
“Direct” motifs were determined specifically for the TF of interest. For example, if a Protein Binding Microarray experiment was performed for the mouse Gata3 TF, then the associated motif would be “direct” for mouse Gata3. If the human GATA3 TF has a DNA binding domain that is “similar enough” to that of the mouse Gata3 TF, then we can infer that the human TF will have the same motif, which is then an “inferred” motif for human GATA3. Inference thresholds are quantified separately within each TF family; see Lambert et al., Nature Genetics 2019 for details of the “Similarity Regression” method that we use.
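
As a rough illustration of the direct/inferred distinction, the sketch below assigns a motif to a TF either directly or by transfer from a characterized TF in the same family. It is not the CisBP pipeline: it swaps in plain percent identity over pre-aligned DBDs in place of Similarity Regression, and the threshold value, the toy DBD strings, the placeholder motif string, and all names (FAMILY_THRESHOLDS, TF, assign_motif) are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical per-family identity threshold (illustrative value only;
# CisBP establishes its own thresholds separately for each TF family).
FAMILY_THRESHOLDS = {"GATA": 0.70}


@dataclass
class TF:
    name: str
    family: str
    dbd: str  # toy, pre-aligned DBD amino acid sequence
    direct_motif: Optional[str] = None  # placeholder consensus string


def dbd_identity(a: str, b: str) -> float:
    """Fraction of identical positions between two pre-aligned DBDs."""
    if not a or len(a) != len(b):
        raise ValueError("DBDs must be pre-aligned to the same non-zero length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def assign_motif(query: TF, characterized: list):
    """Return (motif, status), where status is 'direct', 'inferred', or 'none'."""
    if query.direct_motif is not None:
        return query.direct_motif, "direct"
    threshold = FAMILY_THRESHOLDS[query.family]
    best_identity, best_motif = 0.0, None
    for tf in characterized:
        # Only TFs of the same family with their own direct motif can donate one.
        if tf.family != query.family or tf.direct_motif is None:
            continue
        identity = dbd_identity(query.dbd, tf.dbd)
        if identity >= threshold and identity > best_identity:
            best_identity, best_motif = identity, tf.direct_motif
    if best_motif is None:
        return None, "none"
    return best_motif, "inferred"


# Toy sequences, not real GATA DNA binding domains.
mouse_gata3 = TF("Gata3 (mouse)", "GATA", "CRNCGATSTPLWRR", direct_motif="AGATAA")
human_gata3 = TF("GATA3 (human)", "GATA", "CRNCGATSTPLWRR")
distant = TF("distant GATA-family TF", "GATA", "CRNCGAAAAAAAAA")

print(assign_motif(mouse_gata3, [mouse_gata3]))  # direct
print(assign_motif(human_gata3, [mouse_gata3]))  # inferred from mouse Gata3
print(assign_motif(distant, [mouse_gata3]))      # below threshold, no motif
```

Keeping the threshold in a per-family table mirrors the point made above: the same level of DBD similarity may be enough for one family and not for another.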

Whom should I contact with questions or suggestions?
Please use our “Contact us” page (a link is located on the left navigation panel). You can also contact Matt directly at matthew.weirauch@cchmc.org

Why are some of the logos and motifs empty or missing?
We have incorporated TF binding motifs from the Transfac database. The majority of their motifs require a license. We therefore cannot give these data away. We have included their “public” (freely available) motifs, and we indicate cases where a non-public motif is available, so that users with licenses can still use them.

I have TF data, and would like to add it to CisBP. What should I do?
We would love to add your data to a future build of CisBP! Please contact us using the “Contact us” page, or email Matt directly at matthew.weirauch@cchmc.org

How do you identify your TFs?
TFs are identified by scanning all available eukaryotic proteomes for putative DNA binding domains (DBDs). DBDs are identified by using the HMMER tool to scan for Pfam models. We use a set of ~90 Pfam models that describe established DNA binding domains (taken from Weirauch and Hughes, Subcell Biochem. 2011;52:25-73). See the CisBP manuscript for more details.
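
The final step of this procedure, deciding which proteins count as TFs once domain hits are available, can be sketched as follows. This is a minimal illustration, assuming the HMMER output has already been parsed into (protein, Pfam model, E-value) tuples. The four model names are a small illustrative subset of the ~90 models, and the E-value cutoff, the function name, and the protein IDs are hypothetical rather than taken from the CisBP pipeline.

```python
# Illustrative subset of Pfam DBD model names; the real list has ~90 entries.
DBD_PFAM_MODELS = {"Homeobox", "zf-C2H2", "GATA", "HLH"}


def find_putative_tfs(domain_hits, evalue_cutoff=1e-3):
    """Map each protein with a significant DBD hit to the set of DBD models it hit.

    domain_hits: iterable of (protein_id, pfam_model, evalue) tuples.
    evalue_cutoff: hypothetical significance cutoff chosen for this example.
    """
    tfs = {}
    for protein_id, model, evalue in domain_hits:
        if model in DBD_PFAM_MODELS and evalue <= evalue_cutoff:
            tfs.setdefault(protein_id, set()).add(model)
    return tfs


# Made-up hits for four made-up proteins.
hits = [
    ("P1", "Homeobox", 1e-20),
    ("P2", "Pkinase", 1e-50),   # significant, but not a DNA binding domain
    ("P3", "zf-C2H2", 1e-8),
    ("P3", "GATA", 1e-12),
    ("P4", "HLH", 0.5),         # DBD model, but hit is not significant
]

putative_tfs = find_putative_tfs(hits)
print(putative_tfs)
```

Recording the set of DBD models per protein, rather than a yes/no flag, keeps the information needed to assign each TF to a family.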

What is a Protein Binding Microarray (PBM)?
PBMs were originally developed by Martha Bulyk, and have since been adopted by many other groups. Briefly, PBMs contain ~40,000 double-stranded 60-base DNA probes, which are used to systematically measure the binding preferences of a GST-tagged TF construct of interest. PBMs are unique because they offer an unbiased survey of the binding of a given protein to all possible DNA sequences: the probe sequences of a given PBM array are designed such that each of the 32,896 possible 8-base sequences (counting a sequence and its reverse complement as one) appears in diverse flanking sequence contexts on 32 different probes. The resulting data, which track well with both in vivo-derived motifs and motifs derived from other in vitro assays, therefore offer a complete, robust, unbiased survey of the binding preferences of a given TF.
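
The figure of 32,896 is smaller than 4^8 = 65,536 because, on double-stranded DNA, an 8-base sequence and its reverse complement describe the same site. The short check below counts 8-mers with each reverse-complement pair collapsed into one and compares the result with the closed-form count. The function names are our own and are used only for this illustration.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def count_nonredundant_kmers(k: int) -> int:
    """Count k-mers, treating a k-mer and its reverse complement as one."""
    canonical = set()
    for letters in product("ACGT", repeat=k):
        kmer = "".join(letters)
        # Keep the alphabetically smaller of the two strands as the representative.
        canonical.add(min(kmer, reverse_complement(kmer)))
    return len(canonical)


k = 8
total = 4 ** k               # 65,536 8-mers on a single strand
palindromes = 4 ** (k // 2)  # 256 8-mers equal to their own reverse complement

print(count_nonredundant_kmers(k))   # 32896
print((total + palindromes) // 2)    # 32896
```

The closed form works because every non-palindromic 8-mer pairs up with a distinct reverse complement, while each of the 256 palindromic 8-mers pairs with itself.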

How are your DNA binding motifs obtained from experimental data?
We apply a panel of algorithms to PBM data in order to extract the single PWM that performs best in terms of its ability to “predict” the intensities of a replicate array, using a scheme similar to that originally described in Weirauch et al. Nature Biotech 2013. For the purposes of the CisBP database, we compare the performance of four different algorithms: BEEML-PBM, FeatureREDUCE, PWM_align, and PWM_align_Z. Because of this procedure, the motifs included here might differ slightly from those of other databases. For example, UniProbe’s motifs are instead derived from PBM data using the Seed-and-Wobble algorithm. See our manuscript for more details. For extracting PWMs from ChIP-seq data, we use the ChIPMunk algorithm with default settings.
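
The selection idea described above (keep whichever candidate PWM best predicts a replicate array) can be sketched in a few lines. This is a simplified illustration and not our evaluation pipeline: the scoring function (best forward-strand window, product of per-position probabilities), the use of Pearson correlation as the performance measure, the toy probes and intensities, and all function names are assumptions made for the example.

```python
import math


def consensus_pwm(consensus: str, hit: float = 0.85):
    """Build a toy PWM (list of base->probability dicts) from a consensus string."""
    miss = (1.0 - hit) / 3.0
    return [{b: (hit if b == c else miss) for b in "ACGT"} for c in consensus]


def best_window_score(pwm, probe: str) -> float:
    """Highest PWM score over all forward-strand windows of the probe."""
    width = len(pwm)
    return max(
        math.prod(pwm[i][probe[start + i]] for i in range(width))
        for start in range(len(probe) - width + 1)
    )


def pearson(x, y) -> float:
    """Pearson correlation; returns 0.0 if either vector has no variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 0.0 if sx == 0 or sy == 0 else cov / (sx * sy)


def select_best_pwm(candidates, probes, replicate_intensities):
    """Return (name of best candidate, correlation per candidate)."""
    scores = {
        name: pearson([best_window_score(pwm, p) for p in probes],
                      replicate_intensities)
        for name, pwm in candidates.items()
    }
    return max(scores, key=scores.get), scores


# Toy replicate array: two probes containing GATA are bright, two are dim.
probes = ["AAGATAAG", "CCGATACC", "TTTTTTTT", "ACGTACGT"]
replicate_intensities = [9.0, 8.0, 1.0, 2.0]

# Stand-ins for PWMs produced by different motif-derivation algorithms.
candidates = {"gata_like": consensus_pwm("GATA"), "poly_t": consensus_pwm("TTTT")}

best, scores = select_best_pwm(candidates, probes, replicate_intensities)
print(best, scores)
```

Evaluating on a replicate array rather than the array used to derive the PWM is what guards against choosing a motif that merely fits the noise of a single experiment.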