CIS-BP

Introduction

The Catalog of Inferred Sequence Binding Preferences (CIS-BP) is a library of transcription factor (TF) DNA binding motifs and specificities. The data are organized in a user friendly manner for ease of searching, browsing, and downloading. CIS-BP also includes built-in web tools for scanning DNA sequences for putative TF binding sites, predicting the DNA binding motif of a given TF, and identifying a TF that might recognize a given DNA motif.

Searching or browsing for TFs

Searching and browsing are available for users interested in a specific TF, organism, data source, or TF family. To search for a specific TF by name or identifier, enter the search string into the box at the top of the home page labeled "Search for a TF by identifier", and press the GO! button. Wildcards (denoted as '*') are accepted, and the search is case insensitive. For example, a search for "hox*" will return all TFs whose names begin with "hox", in any organism. A spreadsheet file containing the search/browse results can be obtained by clicking on "Download excel spreadsheet (csv text format)" at the top of the page.

Searches can be restricted using the pull-down menus under the text search box. For example, all mouse bZIP family TFs whose names start with "cebp" can be found by entering "cebp*" in the search box, selecting "Mus_musculus" under the "Species" pull-down menu, and selecting "bZIP" under the "Domain Type" pull-down menu. To browse all mouse bZIP family TFs, simply remove the "cebp*" search string from the search box.

The "Motif evidence" pull-down menu offers several options for restricting the results to TFs with a specific motif evidence status. The motif evidence status indicates how the motif for a given TF was determined. "Direct" indicates that the motif was determined for the TF itself using an experimental assay. "Inferred" indicates that the motif was determined indirectly, by transferring it from a TF with a similar DNA binding domain (DBD). For example, the mouse Gata4 TF has a motif that has been directly determined using a Protein Binding Microarray (PBM) assay, so its motif status is "Direct". The zebrafish (Danio rerio) gata4 TF has not had its motif directly determined, but its DNA binding domain is 98.6% identical to that of the mouse Gata4 TF, so its motif can be "inferred" to be similar to the directly determined mouse motif. Separate inference thresholds, based on percent DBD identity, were determined for each TF family in our publication (Weirauch et al., Cell 2014).
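
The rule behind the "Direct" and "Inferred" statuses can be sketched in a few lines of Python. This is an illustration only, not the CIS-BP implementation: the function name, the threshold value, and the treatment of an identity exactly equal to the threshold are all assumptions made for the example. The published per-family thresholds are given in Weirauch et al., Cell 2014.

```python
def motif_evidence(has_direct_motif, best_dbd_identity, family_threshold):
    """Return "Direct", "Inferred", or "None" for one TF.

    best_dbd_identity is the highest DBD identity (as a fraction, 0 to 1)
    to any TF in the family with a directly determined motif, or None if
    there is no such TF.
    """
    if has_direct_motif:
        return "Direct"
    # Assumption: an identity equal to the threshold counts as passing.
    if best_dbd_identity is not None and best_dbd_identity >= family_threshold:
        return "Inferred"
    return "None"


# Placeholder threshold for illustration; not the published GATA value.
HYPOTHETICAL_THRESHOLD = 0.70

print(motif_evidence(True, None, HYPOTHETICAL_THRESHOLD))    # mouse Gata4 (PBM motif)
print(motif_evidence(False, 0.986, HYPOTHETICAL_THRESHOLD))  # zebrafish gata4
```

With the placeholder threshold, the mouse Gata4 example is classified as "Direct" and the zebrafish gata4 example as "Inferred".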

TF pages

Each of the 160,000+ TFs contained in CIS-BP has its own page, which can be reached using the search and browse capabilities discussed above. At the top of each TF page is the name, organism, and TF family for the given TF. Each TF page is divided into several different sections, which are outlined below.


TF information
The "TF information" section provides basic information about the TF, and links to external databases. Clicking on the "Pfam ID" or "Interpro ID" links opens a new window for the corresponding domain database. Clicking on the "Gene ID" opens a link to the corresponding organism's genomic database (e.g., SGD for Saccharomyces cerevisiae or WormBase for Caenorhabditis elegans). Clicking on the "Sequence source" opens a link to the database from which the given TF's amino acid sequence was obtained. A link to the AnimalTF database is also provided for metazoan TFs.


Directly determined binding motifs
This section contains information about the DNA binding motif(s) that have been directly experimentally determined for the given TF. Sequence logos are displayed that summarize the binding preferences for the given TF (in forward and reverse orientations). Clicking on a sequence logo opens a popup window with the corresponding position frequency matrix (PFM). Under "Type/Study/Study ID", information is provided about the technology used to generate the motif (e.g., PBM, HT-SELEX, or ChIP-seq). A link to PubMed is also provided for the publication from which the data were obtained, along with the ID used in the study. Note: many motifs derived from Transfac require a license, so we do not provide these motifs and instead indicate that a "Transfac license is required."
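
The relationship between the forward and reverse orientations of a motif can be illustrated with a short Python sketch. It assumes a PFM held in memory as a list of rows, one per motif position, with columns in the order A, C, G, T. That layout, the function names, and the toy matrix are assumptions made for the example, not the CIS-BP file format.

```python
def reverse_complement_pfm(pfm):
    """Reverse the position order and swap A<->T and C<->G in each row."""
    return [[t, g, c, a] for a, c, g, t in reversed(pfm)]


def consensus(pfm):
    """Most frequent base at each position (first base wins on ties)."""
    return "".join("ACGT"[row.index(max(row))] for row in pfm)


# Toy GATA-like matrix, for illustration only.
toy_pfm = [
    [0.1, 0.1, 0.7, 0.1],  # G
    [0.7, 0.1, 0.1, 0.1],  # A
    [0.1, 0.1, 0.1, 0.7],  # T
    [0.7, 0.1, 0.1, 0.1],  # A
]

print(consensus(toy_pfm))                          # GATA
print(consensus(reverse_complement_pfm(toy_pfm)))  # TATC
```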


Motifs from related TFs
This section provides motifs obtained for related TFs (i.e., TFs with DNA binding domains that are similar to that of the given TF). The format is similar to that of the "Directly determined binding motifs" section, with two differences. First, clicking on the name of a TF takes the user directly to the CIS-BP page for that TF. Second, the final column contains values indicating the degree of similarity of the corresponding TF to the current TF. A value of 1 means that the corresponding TF has an identical amino acid sequence in its DNA binding domain (based on Clustal Omega alignments within each TF family; see Weirauch et al., Cell 2014 for more details). Different TF families have different identity thresholds for consideration as an inferred motif; the threshold for the corresponding family is indicated at the bottom of this section, and only TFs exceeding this threshold are displayed.
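
As a rough illustration of a similarity value of this kind, the Python sketch below computes the fraction of identical residues between two DBD sequences that have already been aligned (gaps written as "-"). The exact definition used by CIS-BP is described in Weirauch et al., Cell 2014; the choice here to ignore columns that contain a gap, along with the function name and the toy sequences, is an assumption made for the example.

```python
def fraction_identity(aligned_a, aligned_b):
    """Fraction of identical residues over columns where neither row has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    return sum(a == b for a, b in pairs) / len(pairs)


# Toy aligned fragments, for illustration only.
print(fraction_identity("ACDE", "ACDE"))  # 1.0, identical DBDs
print(fraction_identity("ACDE", "ACDF"))  # 0.75
```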


Experimental Constructs
This section provides information about the DNA binding domain(s) of the corresponding experimental construct used to assay the TF’s binding specificity (when known). At the top, a schematic indicates the location of each DNA binding domain within each construct. Below, a table indicates the location of each domain, along with its corresponding amino acid sequence. Clicking on a “Motif ID” provides the full amino acid sequence of the construct.


DNA Binding Domains
This section provides information about the DNA binding domain(s) of the corresponding TF protein isoform. At the top, a schematic indicates the location of each DNA binding domain within each isoform of the corresponding TF. Below, a table indicates the location of each domain, along with its corresponding amino acid sequence. Clicking on a “Protein ID” provides the full amino acid sequence of the protein.


Links
This section provides links to other TFs from the same organism, or from the same TF family.


Related TFs
This section shows all related TFs across all organisms, regardless of their motif. The “motif evidence” section indicates if the corresponding TF has a Direct or Inferred motif, or None.


Bulk downloads

The bulk downloads section can be reached via the left navigation toolbar. Pre-compiled .zip files are available containing bulk downloads of various subsets of the data (and the entire dataset). Users can obtain all data for a specific organism or TF family, including sequence logos (in .png format), E- and Z-scores (as tab-delimited text files), PBM probe intensities (as tab-delimited text files), Position Frequency Matrices (text files), and TF information (see the "TF download cart" section below for more information). We also provide raw MySQL table dumps.


TF download cart

Throughout CIS-BP, you will find buttons for adding TFs to your cart. The CIS-BP cart acts in a similar manner to the carts on popular shopping websites such as Amazon, allowing the user to browse and search for TFs and add interesting TFs to the cart for later use. TFs can be added to (or removed from) the cart individually, or in groups (depending on the corresponding button). At any time, the user can view the contents of the cart by clicking on the "View cart" button in the left navigation window. The cart contains information on its current contents, as well as links to the individual TF pages. The cart can be emptied by clicking on the "Remove all TFs from the cart" link at the top. Data for the TFs currently in the cart can be obtained by clicking on the "Download TFs in cart" link. Doing so opens a page allowing the user to download information such as sequence logos (in .png format), E- and Z-scores (which provide comprehensive scores for all possible 8-base sequences and are available only for PBM data), Position Frequency Matrices (in simple text format), and information about the corresponding TFs (tab-delimited text format). Clicking on "Download Archive" starts the download of a zipped archive containing the relevant files. Be aware that E- and Z-score files are large, and hence might take a while to download when the cart contains many TFs.

Tools

Scan a single sequence for TF binding
This tool allows the user to input a DNA sequence (or sequences) in multiple formats and scan for putative TF binding sites (on both strands) for any organism, using one of three different scoring systems.

Accepted input formats (maximum 8000 bases):
  1. Plain text
    ATTGCTAGTAGCACTAGCA...
  2. Fasta
    >Header
    TAGCTAGCATCGATCAGCA...
  3. Multi-fasta
    >Header 1
    TAGCTACGATCAGCTAGCAT....
    >Header 2
    ATATCTATCTATCTATATTCA...
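
For users preparing input programmatically, the three formats above can be checked locally before submission. The Python sketch below is a hypothetical helper, not part of CIS-BP: it reads plain text, FASTA, or multi-FASTA into (header, sequence) pairs and applies the 8000-base limit. Whether the site applies the limit per sequence or to the whole submission is not stated here; the sketch assumes the whole submission.

```python
MAX_BASES = 8000  # limit stated above; assumed to apply to the total input


def parse_sequences(text):
    """Parse plain text, FASTA, or multi-FASTA into (header, sequence) pairs.

    Plain-text input yields a single record with an empty header.
    """
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the record collected so far, if any.
            if header is not None or chunks:
                records.append((header or "", "".join(chunks)))
            header, chunks = line[1:].strip(), []
        else:
            chunks.append(line.upper())
    if header is not None or chunks:
        records.append((header or "", "".join(chunks)))

    total = sum(len(seq) for _, seq in records)
    if total > MAX_BASES:
        raise ValueError(f"input has {total} bases; the limit is {MAX_BASES}")
    return records


print(parse_sequences("ATTGCTAGTAGCACTAGCA"))
print(parse_sequences(">Header 1\nTAGCTACG\nATCAG\n>Header 2\nATATCTATC"))
```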


Scoring system options:
  1. 8 mers - E-scores
    This option is only available for TFs that have been characterized using PBM assays (or TFs with motifs inferred from a PBM assay). For these TFs, each sequence is scanned for 8-mers whose E-scores exceed the chosen threshold (the minimum possible threshold is 0.45). See Berger et al., Nature Biotech 2006 (PMID 16998473) for more information on E-scores.
  2. PWMs - Energy
    This option scores each position in each sequence with all PWMs, using an energy-based scoring method. A description of this scoring scheme is provided in Zhao and Stormo 2011 Nature Biotech (PMID 21654662).
  3. PWMs - Log Odds
    This option scores each position in each sequence with all PWMs, using a standard log odds scoring method. A description of this scoring scheme is provided in Stormo 1990 Methods Enzymol (PMID 2179676).
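
To make the log odds option concrete, the following Python sketch converts a PFM to log odds scores and scans both strands of a sequence. It is a minimal illustration of the standard technique, not the CIS-BP code: the uniform background of 0.25, the pseudocount, the score threshold, the PFM layout (rows are positions, columns are A, C, G, T), and the toy matrix are all assumptions made for the example.

```python
import math

BASES = "ACGT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def pfm_to_log_odds(pfm, background=0.25, pseudocount=0.01):
    """Convert each PFM row to log2(frequency / background) scores."""
    pwm = []
    for row in pfm:
        total = sum(row) + 4 * pseudocount
        pwm.append([math.log2(((x + pseudocount) / total) / background) for x in row])
    return pwm


def scan(seq, pwm, threshold):
    """Return (start, strand, score) for windows scoring at least threshold.

    start is always the 0-based position on the forward strand.
    """
    seq = seq.upper()
    width = len(pwm)
    hits = []
    strands = (("+", seq), ("-", seq.translate(COMPLEMENT)[::-1]))
    for strand, s in strands:
        for i in range(len(s) - width + 1):
            window = s[i:i + width]
            if any(b not in BASES for b in window):
                continue  # skip windows containing N or other symbols
            score = sum(pwm[j][BASES.index(b)] for j, b in enumerate(window))
            if score >= threshold:
                start = i if strand == "+" else len(seq) - width - i
                hits.append((start, strand, score))
    return hits


# Toy GATA-like matrix, for illustration only.
toy_pfm = [
    [0.1, 0.1, 0.7, 0.1],
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.7],
    [0.7, 0.1, 0.1, 0.1],
]
pwm = pfm_to_log_odds(toy_pfm)
for start, strand, score in scan("CCGATACC", pwm, threshold=4.0):
    print(start, strand, round(score, 2))
```

The energy-based option uses a different scoring function (see Zhao and Stormo 2011) and is not covered by this sketch.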

Scan two sequences for differential TF binding
This tool allows the user to scan two DNA sequences, in order to identify TFs that might bind to one sequence, but not the other. For example, it can be used to scan the alleles of a SNP associated with a disease, in order to identify TFs that might differentially bind these sequences. This is, in fact, the same method used in our manuscript to identify known TFs that differentially bind to disease-associated SNPs (Weirauch et al., Cell 2014).

Accepted input formats (maximum 8000 bases):
  1. Plain text
    ATTGCTAGTAGCACTAGCA...
  2. Fasta
    >Header
    TAGCTAGCATCGATCAGCA...


Scoring system options:
  For now, only E-scores are available. This method identifies all TFs with maximum E-score > 0.45 for one allele, and maximum E-score < 0.45 for the other.
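
The E-score rule described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the CIS-BP implementation: the per-TF tables are plain dictionaries mapping 8-mers to E-scores, each table is assumed to already contain the orientation of every 8-mer that is looked up, and the TF names and score values are invented for the example.

```python
def max_escore(seq, escores):
    """Highest E-score over all 8-mers in seq, or None if none are in the table."""
    scores = [
        escores[seq[i:i + 8]]
        for i in range(len(seq) - 7)
        if seq[i:i + 8] in escores
    ]
    return max(scores) if scores else None


def differential_tfs(seq_a, seq_b, escores_by_tf, threshold=0.45):
    """List (tf, favored) pairs where one sequence is above and the other below."""
    hits = []
    for tf, table in escores_by_tf.items():
        a = max_escore(seq_a, table)
        b = max_escore(seq_b, table)
        if a is None or b is None:
            continue
        if a > threshold and b < threshold:
            hits.append((tf, "first"))
        elif b > threshold and a < threshold:
            hits.append((tf, "second"))
    return hits


# Invented E-scores for two hypothetical TFs and the two alleles of a toy SNP.
toy_tables = {
    "TF_X": {"AAGATAAG": 0.48, "AGATAAGC": 0.47, "AAGCTAAG": 0.20, "AGCTAAGC": 0.15},
    "TF_Y": {"AAGATAAG": 0.10, "AGATAAGC": 0.12, "AAGCTAAG": 0.11, "AGCTAAGC": 0.09},
}
print(differential_tfs("AAGATAAGC", "AAGCTAAGC", toy_tables))
```

In this toy case only TF_X is reported, because its maximum E-score is above 0.45 for the first allele and below 0.45 for the second.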

Protein Scan
This tool takes an amino acid sequence as input (without a “>” fasta-style header), identifies any putative DNA binding domains (DBDs) it contains (using the same methods as Weirauch et al., Cell 2014), and compares these DBDs to all DBDs in our database to predict the recognized DNA motif. The results page depicts all identified DBDs, along with their corresponding HMMER E-values. It then presents a ranked list of possible DNA motifs the protein might recognize, in descending order (based on DBD %Identity), along with links to the corresponding TFs’ webpages.

Motif Scan
This tool takes a DNA motif as input, and compares it to all motifs in the database, in order to identify the TF(s) that might recognize it. Motif comparisons are made using the TomTom algorithm. Three input types are allowed: (1) a PFM (in CIS-BP format); (2) a multi-sequence alignment, in fasta format (the bases are tallied at each position to create a PFM); and (3) IUPAC nucleic acid ambiguity codes (which are converted to a PFM based on the definition of each letter). By default, the input motif is aligned at all positions to all motifs in the database (in both orientations).
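
The conversions behind input types (2) and (3) can be sketched in Python. The sketch assumes that each IUPAC letter is split evenly among the bases it stands for, and that aligned sequences are tallied with equal weight and contain no gaps. The weighting CIS-BP actually applies is not specified here, so treat the code as an illustration rather than a description of the site. Rows are motif positions and columns are A, C, G, T.

```python
# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def iupac_to_pfm(motif):
    """Give each base allowed by a code an equal share of that position."""
    pfm = []
    for code in motif.upper():
        allowed = IUPAC[code]
        pfm.append([1 / len(allowed) if b in allowed else 0.0 for b in "ACGT"])
    return pfm


def alignment_to_pfm(seqs):
    """Tally base frequencies column by column across equal-length sequences."""
    width = len(seqs[0])
    if any(len(s) != width for s in seqs):
        raise ValueError("all aligned sequences must have the same length")
    pfm = []
    for i in range(width):
        column = [s[i].upper() for s in seqs]
        pfm.append([column.count(b) / len(column) for b in "ACGT"])
    return pfm


print(iupac_to_pfm("GATW"))
print(alignment_to_pfm(["GATA", "GATT"]))
```

Both calls produce the same matrix, because W stands for A or T and the two aligned sequences differ only at the last position.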