Sneato






Minimum Spanning Trees with Genetic Data



Sneato generates minimum spanning trees that depict relationships among biological sequences such as DNA, RNA or amino acid sequences. To use Sneato, you provide it with a datafile of sequences. The program will then infer the minimum spanning tree and draw it for you.

Sneato uses Prim's algorithm, a standard algorithm for generating minimum spanning trees. To work properly, Prim's algorithm requires the distances between the things being compared. Sneato estimates the distance between a pair of sequences as the number of positions at which they differ (amino acid, nucleotide, or similar). A nice tutorial on Prim's and other algorithms for generating minimum spanning trees is available on Papagelis Athanasios's Algorithms Tutoring Web Page: http://students.ceid.upatras.gr/~papagel.
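
To make the idea concrete, here is a minimal Python sketch of the two steps described above: counting differences between sequences and running Prim's algorithm on those counts. It is an illustration only, not Sneato's actual source code. The function names (hamming, prim_mst) and the toy sequences are hypothetical, and the sketch assumes that all sequences have the same length and that each distinct sequence appears once in the input list.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    # Assumption: sequences are aligned and of equal length.
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def prim_mst(seqs):
    """Return minimum spanning tree edges as (i, j, distance) tuples.

    i and j are indices into seqs; the tree is grown from seqs[0].
    """
    n = len(seqs)
    if n == 0:
        return []
    in_tree = [False] * n
    in_tree[0] = True
    # best[j] = (distance, i): cheapest known link from tree node i to node j
    best = [(hamming(seqs[0], seqs[j]), 0) for j in range(n)]
    edges = []
    for _ in range(n - 1):
        # Pick the node outside the tree that is closest to the tree.
        j = min((k for k in range(n) if not in_tree[k]),
                key=lambda k: best[k][0])
        d, i = best[j]
        edges.append((i, j, d))
        in_tree[j] = True
        # The newly added node may offer cheaper links to the remaining nodes.
        for k in range(n):
            if not in_tree[k]:
                dk = hamming(seqs[j], seqs[k])
                if dk < best[k][0]:
                    best[k] = (dk, j)
    return edges


# Toy data, made up for illustration.
haplotypes = ["GATC", "GATT", "GGTT", "CCCC"]
edges = prim_mst(haplotypes)
total = sum(d for _, _, d in edges)
for i, j, d in edges:
    print(haplotypes[i], "-", haplotypes[j], "differences:", d)
```

Each tuple in edges corresponds to one connecting line in the drawn tree, and the third element is the number of differences between the two sequences it joins.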

Sneato is an extension of VGJ, which was developed by some very clever students at Auburn University. They very kindly distribute the source code under the GNU General Public License on their web site.


1. Installing and starting the Sneato program

Prerequisites

Sneato requires the Java Runtime Environment. This is installed by default on most modern computers. If it is not installed, you can get it by going to http://www.java.com, choosing "Free Download" and following the instructions.

Downloading Sneato

Download the latest version of Sneato (v. 2) by clicking here: Sneato_v_2.zip.

Installing Sneato

1. Unpack the Sneato zip file by double-clicking on it.
2. Double-click on the Sneato icon to start the program.

[Screenshot: Sneato immediately after starting]

To start an analysis, click on the "Start an Analysis" button and adjust the size of the window. Next, choose "Open DB File" from the File menu, and choose the file you want to analyze. This file must be in the format described below.

2. Formatting a *.db file

Sneato works with plain text files, which must be in the *.db file format.
DB File Format:

There are two kinds of lines in the DB format: comment lines and data lines. Comment lines start with the # character and are ignored. Data lines have two parts, a name and a sequence, separated by a colon (:). If the same sequence appears on multiple data lines, it is drawn as a single node whose area is proportional to the sequence's frequency, so you will get a graph with big circles representing common sequences and little circles representing rare sequences.
Example:
# This is a comment about my otter data.
# The data are faked.
otter1:GATCGATCGATC
otter2:GATCGATCGATC
otter3:GATCGATCGATC
# This last otter is especially interesting...
otter4:GGGGGGGGGGGG
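
As an illustration of the format rules above, the following Python sketch reads DB-formatted text and counts how often each distinct sequence occurs. It is not Sneato's own parser. The function name parse_db is hypothetical, and two behaviors are assumptions rather than documented Sneato behavior: blank lines are skipped, and a data line without a colon is treated as an error.

```python
from collections import Counter


def parse_db(text):
    """Return a Counter mapping each distinct sequence to its occurrence count."""
    counts = Counter()
    for line in text.splitlines():
        line = line.strip()
        # Comment lines start with '#'. Skipping blank lines is an assumption.
        if not line or line.startswith("#"):
            continue
        # Split on the first colon: name before it, sequence after it.
        name, sep, sequence = line.partition(":")
        if not sep:
            # Assumption: a data line without a colon is malformed.
            raise ValueError(f"data line has no colon: {line!r}")
        counts[sequence.strip()] += 1
    return counts


example = """\
# This is a comment about my otter data.
# The data are faked.
otter1:GATCGATCGATC
otter2:GATCGATCGATC
otter3:GATCGATCGATC
# This last otter is especially interesting...
otter4:GGGGGGGGGGGG
"""

counts = parse_db(example)
for sequence, k in counts.items():
    print(f"{sequence}  k = {k}")
```

For the otter example, the three identical sequences collapse into one entry with a count of 3, and the fourth sequence has a count of 1.
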
Three test datasets are available:
test.db - A simple fabricated dataset.
ptc.db - Real DNA sequence data from the PTC gene, which encodes the receptor that enables some people to taste phenylthiocarbamide (Wooding 2004).
jcv_aa_types.db - Real amino acid sequence data from the JC virus, an interesting circular DNA virus with a worldwide distribution (Wooding 2001).

3. Generating and drawing the tree

To infer and draw a minimum spanning tree that describes your data, choose "Spring" from the "Algorithms" menu and watch the fun!

Sneato often draws trees with nodes that are too big and too close together. To solve this problem, click on the "Scale / 2" button two or three times and choose "Spring" from the "Algorithms" menu again.

[Screenshot: Sneato after running an analysis of PTC DNA haplotypes (Wooding 2004)]


The graph that appears on the screen is a minimum spanning tree that shows the relationships among your sequences. In the graph, each circle (node) represents a different sequence. The size of the circle represents the sequence's relative frequency, and the number of occurrences of each sequence is listed as "k = ..." below the corresponding circle. Each edge (connecting line) is labeled with the number of positional differences between the two sequences it connects.
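
One consequence of sizing circles by frequency is worth spelling out. If a node's area is proportional to k, as the file format section states, then its radius grows only with the square root of k. The short Python sketch below shows the arithmetic. The function name node_radius and the unit_area parameter are hypothetical and are not taken from Sneato or VGJ.

```python
import math


def node_radius(k, unit_area=100.0):
    """Radius of a circle whose area is k * unit_area.

    unit_area is a hypothetical scaling constant: the area given to a
    sequence that occurs exactly once.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    return math.sqrt(k * unit_area / math.pi)


for k in (1, 4, 9):
    print(f"k = {k}: radius = {node_radius(k):.2f}")
```

Under this assumption, a sequence that occurs four times is drawn with twice the radius, not four times the radius, of a sequence that occurs once.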

If you don't like the appearance of your graph, you can click and drag the nodes (circles) into suitable positions. You can also add and delete nodes and edges and change their labels and properties (size, shape, color) by choosing the appropriate mode from the control panel and double-clicking. A detailed discussion of this interface is provided in the VGJ Manual.

4. Saving the results

Sneato saves results in PostScript format. PostScript is a convenient format because it looks good, is easy to print, and can be imported into a number of graphics editing programs. To save your file, choose "PostScript Output" from the File menu, then choose "Save". I prefer to then modify the saved diagram to my liking in Adobe Illustrator.