RODEO

RODEO (Rapid ORF Description & Evaluation Online) is an algorithm to help biosynthetic gene cluster (BGC) analysis, with an emphasis on ribosomal natural product (RiPP) discovery.

RODEO diagram

The sections here primarily describe how to interface RODEO output with some handy visualization or analysis tools. We found that the following programs were particularly useful:

Pfam counts

Using Excel, counts of Pfam pHMM hits can readily be calculated for an output RODEO dataset.

1. Generate RODEO data including a protein CSV

First, make sure to generate RODEO data including an output CSV (via the -csva option) to yield a list of Pfam pHMM hits. In this example we use as input knownlassos.inp, a list of lasso cyclases from numerous characterized lasso peptide biosynthetic gene clusters.

python rodeo_main.py knownlassos.txt -out known

When the run has completed, open the .csv file in Excel. Select the first column containing the PFxxxxx IDs and copy these to a new worksheet. Do the same with the second column containing PFxxxxx IDs; copy this to the same worksheet, below the first set of PFxxxxx IDs.

Copy one or both columns of Pfam IDs

2. Generate a list of unique Pfams

Select the list of Pfam IDs created in the last step. Then, in Excel, choose "Remove Duplicates" from the "Data" tab.

Remove duplicate Pfam IDs

This will leave only one instance of each Pfam ID.

Duplicate Pfam IDs were removed

3. Obtain the count of each Pfam

This is accomplished using the COUNTIF() function in Excel. For instance, assuming your Pfam list occupies cells A2:A185 in your Unique Pfams sheet, and you've named your RODEO output sheet as Output (as above), enter the following in cell B2:

=COUNTIF(Output!H:H,A2)
Obtaining the count for each Pfam

Next, copy/paste this cell into cells B3:B185 to apply the formula to all of them.

Obtain the count for each Pfam

Note: the above procedure will get you the counts of the top Pfam hits. To also include second-place Pfam hits, use the following formula instead:

=COUNTIF(Output!H:H,A2)+COUNTIF(Output!J:J,A2)

Back to top

Co-occurrence analysis

Using the output generated from the Pfam counts section, you can determine when and how frequently certain proteins co-occur within a local region.

1. Generate list of clusters

First we'll want to determine a list of unique clusters. This is simply the collection of unique input GI/accession identifiers.

To do this, select the list of input identifiers in your protein list spreadsheet and copy them to a new worksheet.

Select input identifiers

As in the section above, in your new worksheet select the column containing this list and from the "Data" tab select "Remove Duplicates".

Remove duplicate records

This will leave one GI/accession to represent each cluster.

Only unique values remain

2. Annotate each cluster with the presence of a specific Pfam

Put any Pfam IDs of interest as column heads in the first row of the Clusters sheet (here we have chosen the 12 most common Pfams as determined in our analysis of Pfam count above).

Then, assuming column A is the list of identifiers, type the following formula in cell B2:

=IF((COUNTIFS(Output!$A:$A,$A2,Output!$H:$H,B$1)+COUNTIFS(Output!$A:$A,$A2,Output!$J:$J,B$1))>0,1,0)
Input desired Pfam IDs as column heads

3. Determine Pfam co-occurrence

Copy/paste this formula to the remaining rows and columns. This will display 1 if the cluster contains the Pfam and 0 if it does not.

All Pfam counts are revealed on a per-cluster basis

Back to top

Phylogenetic tree annotation

Using the output generated from the co-occurrence analysis section, you can annotate a phylogenetic tree made with iTOL in order to show co-occurring genes with icons.

RODEO allows annotation of phylogenetic trees

References

Back to top

Taxonomic analysis

The NCBI Taxonomy database can be used to annotate RODEO output with taxonomic details of the organisms associated with proteins.

Taxonomy can be annotated

References

Back to top

Sequence similarity analysis annotation

RODEO output, including co-occurrence analysis, can be imported into Cytoscape to annotate a sequence-similarity network generated with EFI-EST or BLAST (Gerlt et al., 2015). The resulting networks can then be colored to show protein/protein co-occurrence, for example.

RODEO allows annotation of sequence similarity networks

References

Back to top

Circos diagrams

The program Circos can be used to give information-rich overviews of a large number of gene clusters. We've found it particularly informative to annotate a Circos diagram to show the discovery status, sequence similarity, local genomic region, and taxonomy of a large group of BGCs.

References

Back to top

Amino acid sequence composition

Trends in amino acid composition within a set of precursor peptides can give broad insight into the selectivity of the biosynthetic enzymes as well as reveal the chemical diversity or physicochemical properties of a class of natural products. Given a set of leader, core, or whole precursor peptides, this can be accomplished in Excel.

A template Excel file that can be filled with any input data is given in the sample data.

The following is a brief run-through of how to accomplish this.

Open a spreadsheet containing in one column (here we'll use a single column, A, with citrulassin-family lasso peptide precursors, to illustrate this) a list of peptides.

Spreadsheet containing a list of sequences.

In cell B1 (the header for the second column), write the one-letter code for an amino acid of interest (here, "A"). Then in cell B2, type the following formula:

=LEN($A2)-LEN(SUBSTITUTE($A2,B$1,""))

By way of explanation, this calculates the number of a particular character in a string by comparing its full length to the length if you delete all instances of that character.

In adjacent row headers, fill in any other amino acids of interest.

Counts of individual amino acids.

It's important to use the "$" above (and in other examples)--these specify absolute rather than relative cell references and enable you to copy & paste formulas quickly. In this case, we are keeping column A constant for each row (enabling us to copy/paste to evaluate many peptide sequences), and keeping row 1 constant for each column (enabling us to copy/paste to evaluate many different amino acids). For more on absolute and relative references, read further.

Copy & paste this formula into all pertinent columns and rows. This will give the amino acid count for each peptide.

Amino acid counts for all peptides can be given.

Alternatively, you can get the amino acid percentage by using the following formula instead:

=(LEN($A2)-LEN(SUBSTITUTE($A2,B$1,"")))/LEN($A2)
Residue percents can be calculated.

Back to top

Finding strict sequence motifs

You use the SEARCH() function to find strict sequence motifs in the list. Wildcards (including ? and *) are useful here.

Within the same spreadsheet as above, we can use the following function to find instances of a YxxP motif (where x is any amino acid). Type Y??P into cell B1 and then the following formula into B2:

=IF(ISNUMBER(SEARCH(B$1,$A2,1)),SEARCH(B$1,$A2,1),"")

The above formula will return the numerical position of the first instance of the motif, or if the motif is not found, the cell will remain blank.

Motif position can be calculated.

Alternatively, you can return the position relative to the end of the string with the following formula:

=LEN($A2)-IF(ISNUMBER(SEARCH(B$1,$A2,1)),SEARCH(B$1,$A2,1),"")+LEN(B$1)
The motif distance from the C-terminus can also be ascertained.

References & further reading

Back to top