RODEO :: Rapid ORF Description & Evaluation Online

Advanced uses

The sections here primarily describe how to interface RODEO output with some handy visualization or analysis tools. We found that the following programs were particularly useful:

Note: the most helpful practice to adopt when subjecting your dataset to a variety of analyses in several different programs is to embrace the use of primary keys. A primary key is simply a unique identifier used to unambiguously identify each record and helps as an index when importing/exporting data. If you just give each peptide and each protein in your output CSVs a unique number, this can serve as the primary key.

Pfam counts

Using Excel, counts of Pfam pHMM hits can readily be calculated for an output RODEO dataset.

1. Generate RODEO data including a protein CSV

First, make sure to generate RODEO data including an output CSV (via the -csva option) to yield a list of Pfam pHMM hits. In this example we use as input knownlassos.inp, a list of lasso cyclases from numerous characterized lasso peptide biosynthetic gene clusters.

python rodeo_main.py knownlassos.txt -out known

When the run has completed, open the .csv file in Excel. Select the first column containing the PFxxxxx IDs and copy these to a new worksheet. Do the same with the second column containing PFxxxxx IDs; copy this to the same worksheet, below the first set of PFxxxxx IDs.

2. Generate a list of unique Pfams

Select the list of Pfam IDs created in the last step. Then, in Excel, choose "Remove Duplicates" from the "Data" tab.

This will leave only one instance of each Pfam ID.

3. Obtain the count of each Pfam

This is accomplished using the COUNTIF() function in Excel. For instance, assuming your Pfam list occupies cells A2:A185 in your Unique Pfams sheet, and you've named your RODEO output sheet as Output (as above), enter the following in cell B2:

=COUNTIF(Output!H:H,A2)

Next, copy/paste this cell into cells B3:B185 to apply the formula to all of them.

Note: the above procedure will get you the counts of the top Pfam hits. To also include second-place Pfam hits, use the following formula instead:

=COUNTIF(Output!H:H,A2)+COUNTIF(Output!J:J,A2)

Co-occurrence analysis

Using the output generated from the Pfam counts section, you can determine when and how frequently certain proteins co-occur within a local region.

1. Generate list of clusters

First we'll want to determine a list of unique clusters. This is simply the collection of unique input GI/accession identifiers.

To do this, select the list of input identifiers in your protein list spreadsheet and copy them to a new worksheet.

As in the section above, in your new worksheet select the column containing this list and from the "Data" tab select "Remove Duplicates".

This will leave one GI/accession to represent each cluster.

2. Annotate each cluster with the presence of a specific Pfam

Put any Pfam IDs of interest as column heads in the first row of the Clusters sheet (here we have chosen the 12 most common Pfams as determined in our analysis of Pfam count above).

Then, assuming column A is the list of identifiers, type the following formula in cell B2:

=IF((COUNTIFS(Output!$A:$A,$A2,Output!$H:$H,B$1)+COUNTIFS(Output!$A:$A,$A2,Output!$J:$J,B$1))>0,1,0)

3. Determine Pfam co-occurrence

Copy/paste this formula to the remaining rows and columns. This will display 1 if the cluster contains the Pfam and 0 if it does not.

All Pfam counts are revealed on a per-cluster basis

Phylogenetic tree annotation

Using the output generated from the co-occurrence analysis section, you can annotate a phylogenetic tree made with iTOL in order to show co-occurring genes with icons.

RODEO allows annotation of phylogenetic trees

References

Letunic, I.; Bork, P. Interactive tree of life (iTOL) v3: an online tool for the display and annotation of phylogenetic and other trees Nucleic Acids Res (2016) Online ahead of print.
Letunic, I.; Bork, P. Interactive Tree Of Life (iTOL): an online tool for phylogenetic tree display and annotation Bioinformatics (2007) 23:127-128.

Taxonomic analysis

The NCBI Taxonomy database can be used to annotate RODEO output with taxonomic details of the organisms associated with proteins.

References

Agarwala, R. et al. Database resources of the National Center for Biotechnology Information Nucleic Acids Res (2016) 44:D7-19

Sequence similarity analysis annotation

RODEO output, including co-occurrence analysis, can be imported into Cytoscape to annotate a sequence-similarity network generated with EFI-EST or BLAST (Gerlt et al., 2015). The resulting networks can then be colored to show protein/protein co-occurrence, for example.

RODEO allows annotation of sequence similarity networks

References

Gerlt, J.A. et al. Enzyme Function Initiative-Enzyme Similarity Tool (EFI-EST): A web tool for generating protein sequence similarity networks. Biochem Biophs Acta (2015) 1854:1019-1037
Atkinson, H.J. et al. Using sequence similarity networks for visualization of relationships across diverse protein superfamilies. PLoS One (2009) 4:e4345
Shannon, P. et al. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res (2003) 13:2498-2504

Circos diagrams

The program Circos can be used to give information-rich overviews of a large number of gene clusters. We've found it particularly informative to annotate a Circos diagram to show the discovery status, sequence similarity, local genomic region, and taxonomy of a large group of BGCs.

References

Krzywinski, M. et al. Circos: an Information Aesthetic for Comparative Genomics. Genome Res (2009) 19:1639-1645

Amino acid sequence composition

Trends in amino acid composition within a set of precursor peptides can give broad insight into the selectivity of the biosynthetic enzymes as well as reveal the chemical diversity or physicochemical properties of a class of natural products. Given a set of leader, core, or whole precursor peptides, this can be accomplished in Excel.

A template Excel file that can be filled with any input data is given in the sample data.

The following is a brief run-through of how to accomplish this.

Open a spreadsheet containing in one column (here we'll use a single column, A, with citrulassin-family lasso peptide precursors, to illustrate this) a list of peptides.

Spreadsheet containing a list of sequences.

In cell B1 (the header for the second column), write the one-letter code for an amino acid of interest (here, "A"). Then in cell B2, type the following formula:

=LEN($A2)-LEN(SUBSTITUTE($A2,B$1,""))

By way of explanation, this calculates the number of a particular character in a string by comparing its full length to the length if you delete all instances of that character.

In adjacent row headers, fill in any other amino acids of interest.

It's important to use the "$" above (and in other examples)--these specify absolute rather than relative cell references and enable you to copy & paste formulas quickly. In this case, we are keeping column A constant for each row (enabling us to copy/paste to evaluate many peptide sequences), and keeping row 1 constant for each column (enabling us to copy/paste to evaluate many different amino acids). For more on absolute and relative references, read further.

Copy & paste this formula into all pertinent columns and rows. This will give the amino acid count for each peptide.

Amino acid counts for all peptides can be given.

Alternatively, you can get the amino acid percentage by using the following formula instead:

=(LEN($A2)-LEN(SUBSTITUTE($A2,B$1,"")))/LEN($A2)

Finding strict sequence motifs

You use the SEARCH() function to find strict sequence motifs in the list. Wildcards (including ? and *) are useful here.

Within the same spreadsheet as above, we can use the following function to find instances of a YxxP motif (where x is any amino acid). Type Y??P into cell B1 and then the following formula into B2:

=IF(ISNUMBER(SEARCH(B$1,$A2,1)),SEARCH(B$1,$A2,1),"")

The above formula will return the numerical position of the first instance of the motif, or if the motif is not found, the cell will remain blank.

Alternatively, you can return the position relative to the end of the string with the following formula:

=LEN($A2)-IF(ISNUMBER(SEARCH(B$1,$A2,1)),SEARCH(B$1,$A2,1),"")+LEN(B$1)

The motif distance from the C-terminus can also be ascertained.

References & further reading

LEN, LENB functions (Office Support)
SUBSTITUTE function (Office Support)
SEARCH, SEARCHB functions (Office Support)
IS functions (Office Support)
IF function (Office Support)