Pre- and post-processing tips

On Unix systems, awk, sed and grep can be used for file modifications and are powerful enough for most tasks. A common way of processing a file looks like this:

  cat infile.txt | filter > outfile.txt

Common filters:

  • Print only certain columns, delimited by some character, in specific order:
      awk '{ print $2 ", " $1 ", " $4 }'
  • As above but only if a certain condition is met in some column:
      awk '$3>1.4 { print $1 "\t" $2 }'
  • Remove a certain string or regex-pattern (e.g. labels from proteins prot1@species -> prot1):
      sed -E 's/@[^[:space:]]+//g'
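
As a small worked example, the filters above can be chained into one pipeline. The sketch below is illustrative only: the sample rows, the file names and the 1.4 threshold are made up, and the input is assumed to be tab-delimited with labeled proteins in columns 1 and 2 and a score in column 3.

```shell
# Work in a scratch directory so no existing files are overwritten.
tmp=$(mktemp -d)

# Hypothetical sample input: proteinA@label, proteinB@label, score.
printf 'p1@human\tp2@human\t2.0\n'  > "$tmp/infile.txt"
printf 'p3@human\tp4@human\t0.5\n' >> "$tmp/infile.txt"
printf 'p5@human\tp6@human\t1.5\n' >> "$tmp/infile.txt"

# Keep rows whose third column exceeds 1.4, print columns 1 and 2,
# then strip the "@label" suffix from every protein identifier.
awk -F'\t' '$3 > 1.4 { print $1 "\t" $2 }' "$tmp/infile.txt" \
  | sed -E 's/@[^[:space:]]+//g' > "$tmp/outfile.txt"

cat "$tmp/outfile.txt"
```

Only the rows for p1/p2 and p5/p6 pass the threshold, and both are written without their labels.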

Frequently asked questions

  • Where do I get PPINs in simple accession number - accession number format?

    There are many good sources for data, some of which we have referenced in our publications.
    For BioGRID, download the latest organism release. Extract columns 2 and 3 (Entrez Gene), or 8 and 9 (Official Symbol). If appropriate, filter by other columns (e.g. allow only certain types of experimental evidence).
      awk -F"\t" '{print $8 "\t" $9}'
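
Filtering by experimental evidence could look like the sketch below. It assumes that column 13 of the tab-delimited BioGRID file holds the experimental system type (e.g. "physical" or "genetic") and that the header line starts with '#'. Column positions differ between BioGRID formats, so check the header of your download first. The function name and the file names are placeholders.

```shell
# Print Official Symbol pairs (columns 8 and 9) for physical interactions only.
# Assumption: column 13 is the experimental system type; verify this against
# the header line of your BioGRID file before relying on it.
extract_physical() {
  awk -F'\t' '!/^#/ && $13 == "physical" { print $8 "\t" $9 }' "$1" | sort -u
}

# Usage (biogrid.tab is a placeholder for the downloaded organism file):
#   extract_physical biogrid.tab > ppin.txt
```

The `sort -u` step removes pairs that are reported more than once, for instance by several experiments. Pairs listed in both orders (A B and B A) are kept as they are.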

  • How do I translate between accession numbers, gene name, ...?

    UniProt offers an online mapping tool that can translate between different protein identification schemas. The full DB can be downloaded and used locally for repeated, large-scale lookup.

  • How to create BLAST all-vs-all bit scores for my PPINs?

    There are many online tools for one-off lookups, but for large-scale all-against-all comparisons it is probably best to run BLAST locally. Create a DB first:
      makeblastdb -in my_prots.fasta \
        -dbtype prot -out my_db

    Then do the lookup:
      blastp -db my_db -query my_prots.fasta \
        -outfmt 6 -out scores.tab

    The NCBI BLAST Command Line Applications User Manual may serve as a reference.
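
    The tabular output can then be reduced to a node-node similarity file (protein pair followed by a score). The sketch below relies on the default column order of -outfmt 6, where the bit score is column 12. Dropping self-hits and keeping only the best score per pair are assumptions about what is wanted, not requirements stated by the server; the function name is a placeholder.

```shell
# Default -outfmt 6 columns: qseqid sseqid pident length mismatch gapopen
# qstart qend sstart send evalue bitscore (so the bit score is column 12).
blast_to_similarities() {
  awk -F'\t' '
    $1 != $2 {                                  # assumption: ignore self-hits
      key = $1 "\t" $2
      # keep the highest bit score when a pair has several hits
      if (!(key in best) || $12 + 0 > best[key] + 0) best[key] = $12
    }
    END { for (key in best) print key "\t" best[key] }
  ' "$1" | sort
}

# Usage:
#   blast_to_similarities scores.tab > similarities.txt
```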

Infos, Warnings, Errors and Debug messages

Infos
  Stats and useful information presented with the results.
Warnings
  Collection of non-fatal issues that will affect results.
Errors
  Collection of fatal issues that prevent generating results.
Debug
  Detailed (verbose) information that might help with troubleshooting errors and warnings.

Instructions

File formats and conventions
Creating MNAs from PNAs
Expanding existing SMAL MNAs with additional PNAs
Visualizing MNAs
Creating PNAs

File formats and conventions

  1. PPINs should be presented in text files where each line represents an interaction between two proteins, delimited by 'tab/space'.
      proteinA proteinB
      proteinC proteinD
      ...
    Each element (protein identifier) can consist of alphanumeric characters and a few select special characters specified via the following regex:
      '\w\)\(\]\[\+\-\.'
    The following characters have special meaning (and hence are not allowed in protein names):
      '_' is used as the delimiter between two proteins forming an interaction in the edge alignment.
      '#/;' can be used to comment out lines in PPINs or alignment files.
      '@' can be used to label proteins for easier identification with their source PPIN (e.g. A0AUZ9@human).
    The order in which interactions are presented can be arbitrary (no directionality).
  2. Each line in the PNA files represents corresponding proteins from the two aligned PPINs delimited by 'tab/space'. PNAs may consist of one-to-one or many-to-many mappings.
      proteinA@species1 proteinB@species1 proteinX@species2
      proteinC@species1 proteinY@species2
      ...
    Most PNA algorithms are symmetric with regard to the input PPINs (i.e. the order in which PPINs are presented doesn't influence the alignment). When that is not the case, ensure that the PNA represents a mapping of the non-scaffold PPIN onto the scaffold (PNA: scaffold <- PPIN).
    Some pairwise aligners append a "similarity score" to each set of aligned proteins. This score must be removed before uploading the file. If your PNA contains similarity scores, try one of the following. The first command strips a trailing numeric column and requires a grep that supports -P (e.g. GNU grep); the second keeps only the first two columns and is therefore suitable for one-to-one PNAs only:
       cat pna_sim.txt | grep -Po '^.*(?=\s+[\d\.eE+-]*\d+\s*$)' > pna_clean.txt
       cat pna_sim.txt | awk '{print $1 "\t" $2}' > pna_clean.txt
  3. Each line of the node-node similarity files used for PNA creation contains a pair of nodes followed by some kind of similarity score (e.g. the BLAST bit score) delimited by space or tab.
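
Before uploading, a PPIN file can be checked against the conventions in item 1 with a short script. This is a minimal sketch and not the server's own validation. It assumes that a comment line starts with '#', '/' or ';', that an identifier consists of letters, digits and the characters ( ) [ ] + - . and that an optional @label is alphanumeric. The function name is a placeholder.

```shell
# Report lines of a PPIN file that do not follow the conventions above.
# Exit status is 0 if the file looks fine and 1 otherwise.
check_ppin() {
  awk '
    # skip blank lines and (assumed) comment lines
    NF == 0 || index("#/;", substr($0, 1, 1)) { next }
    NF != 2 {
      print FILENAME ":" NR ": expected 2 columns, found " NF
      bad = 1
      next
    }
    {
      for (i = 1; i <= NF; i++)
        # "]" comes first and "-" last so both are literal in the bracket list
        if ($i !~ /^[]A-Za-z0-9()+.[-]+(@[A-Za-z0-9]+)?$/) {
          print FILENAME ":" NR ": unexpected character in \"" $i "\""
          bad = 1
        }
    }
    END { exit bad + 0 }
  ' "$1"
}

# Usage:
#   check_ppin my_ppin.txt && echo "format looks fine"
```

Underscores are reported deliberately, since '_' is reserved as the delimiter in the edge alignment.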

Creating MNAs from PNAs

  1. Select one protein-protein interaction network (PPIN) to be used as the scaffold for the MNA.
  2. Select at least two pairwise network alignments (PNAs) between the scaffold and some other PPIN. Both the PNAs and the non-scaffold PPINs have to be uploaded. PNAs need to be created up-front using a pairwise alignment algorithm of choice.
  3. A label can be provided for each PPIN. It will be appended to the protein identifiers of that PPIN (e.g. to indicate the species).
    Labels may contain up to 8 alphanumeric characters.
    If no label is provided, the first 8 alphanumeric characters from the filename (without extension) will be used.
  4. Once all required files are selected, click the "Create MNA" button.
    A page containing links to MNA files in several formats as well as stats and additional functionality (modifying and visualizing the MNA) will be created. Please note that result directories and files are deleted after a day or when storage limits are reached, so download important results promptly.
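
The default-label rule from step 3 can be reproduced locally, for example to predict which label an uploaded file will receive. This is only a sketch of the rule as described above, not the server's code. In particular, it assumes that "without extension" means dropping the last dot-separated suffix; the function name is a placeholder.

```shell
# Derive a default label from a file name: drop the directory and the last
# extension, keep only alphanumeric characters, truncate to 8 characters.
default_label() {
  name=$(basename "$1")
  name=${name%.*}                 # assumption: only the last extension is removed
  printf '%s\n' "$name" | LC_ALL=C tr -cd '[:alnum:]' | cut -c1-8
}

default_label homo_sapiens-2023.txt   # prints: homosapi
```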

Expanding existing SMAL MNAs with additional PNAs

This can be done in many ways depending on the exact data that is provided. The server is designed to cover the most common use cases and to be flexible (e.g. it accepts different combinations of data and does what is possible with them). If you have a specific use case that is not covered here, we encourage you to download the source and set up a local instance (which can be easily modified) or to reach out to us for help.

Incremental modification of SMAL MNAs in their online result directories on this server is straightforward, as all required input files are already accessible to the software. The instructions here are for the dedicated "Modify MNA" page.

  1. Select the scaffold PPIN central to the existing MNA.
  2. Select an existing SMAL MNA node alignment.
  3. Select an existing SMAL MNA edge alignment.
  4. Select at least one pairwise network alignment (PNA) between the scaffold and some other PPIN. Both the new PNAs and the corresponding non-scaffold PPINs have to be uploaded. The alignments need to be created up-front using a pairwise alignment algorithm of choice.
  5. Note that MNAs can be expanded using less data than outlined above:
      If the scaffold PPIN is missing, it will be reconstructed from the MNA (either completely from the MNA edge alignment or at least the nodes from an MNA node alignment).
      If the MNA edge alignment is missing, the MNA node alignments will be extended.
      If no PPINs are uploaded together with the PNAs, MNA node alignments will be extended but not the MNA edge alignments.
  6. Once all required files are selected, click the "Expand MNA" button.
    A page containing links to MNA files in several formats as well as stats and additional functionality (modifying and visualizing the MNA) will be created. Please note that result directories and files are deleted after a day or when storage limits are reached, so download important results promptly.

Visualizing MNAs

Note that visualization of networks is resource intensive. Be patient when interacting with large graphs.

When creating or modifying an MNA, a link to the cytoscape.js canvas used for visualization is provided. Alternatively, the 'Visualize MNA' tab allows entering the location of a network file on this server. The location is expected to be the relative path from the SMAL pages. Example:
  results/YYYYMMDD-HHMMSS-random/nw.js

To make use of the full browser window or to visualize external graphs, such as previously generated MNAs that are no longer available on the server, the SMAL implementation of a cytoscape.js canvas can be downloaded and used locally.

SMAL MNAs are visualized on the basis of the scaffold PPIN. Each scaffold protein is displayed as a node representing the alignment cluster, that is, the set of specific nodes (proteins or other entities) that are aligned. Each scaffold interaction is then an edge between two nodes. Edge thickness and node diameter indicate the level of conservation: the more correspondences, the thicker the edge or the larger the node. Additionally, each species that is part of the MNA is associated with a specific color. When a species is represented in an alignment cluster, the corresponding node contains a 'slice' in the appropriate color. Conserved interactions are likewise drawn between nodes in the respective species color.

The following functions are available:

  • Nodes can be grabbed and moved using the mouse. Zooming is controlled via the mouse wheel or the buttons in the control panel. Note that graph labels will be disabled when zooming out too far. Click on an empty spot on the canvas and drag to move the whole graph.
  • Nodes can be searched using the scaffold protein name (without species label). If the protein is found, it will be highlighted and the view zoomed and centered on the requested node. Multiple node IDs can be entered and searched all at once (separated by space or comma).
  • A selection of layouts can be applied to the collection of nodes. 'grid' can be created very quickly but does not take interactions into account (nodes are arranged on the grid in alphabetic order). Other layouts might be better suited to visualize small clusters of interactions but take longer to compute.
  • Proteins and/or interactions can be filtered based on the level of conservation. This reduces the current collection to only those proteins or interactions that have correspondences in at least the specified number of PPINs.
    "scaffold+1 PPINs" will show all scaffold nodes and edges that have at least one correspondence in any other network. The filter is applied to the current collection of nodes. Once a node or edge has been filtered out, it cannot be brought back using filters. Either expand the collection by adding neighborhoods or restore the full graph.
  • "Remove unconnected Nodes" hides any nodes that do not have visible edges in the current collection. Note that this function evaluates the currently visible interactions as opposed to the degree in the full network.
  • "Restore previous Collection" undoes the last change to the visible collection of nodes.
  • Clicking an edge or a node will show further information such as the alignment cluster constitution, a list of functions that allow modifying the displayed collection, and links to external information.

Creating PNAs

Upload two PPINs and a file containing node-node similarities. Since computing a PNA can be time-consuming, alignments are computed asynchronously. The results page will contain a timer (updated every 30 seconds), and links to the alignment files will be created as soon as the computation is complete. The integrated SMETANA algorithm requires node-node similarities; an error will be thrown if none are provided. Other algorithms work without node-node similarities and compute alignments purely based on topology if no external similarity scores are provided.