Gegenees Fragmented Aligner version 3.0

This is a beta version for testing functionality. Please look for updates at www.gegenees.org and report any problems to bo.segerman@sva.se or info@gegenees.org.

Installation

Windows

Download the Windows version of Gegenees-FA. If you have a 64-bit environment, download the ”x_86_64” version; if you have a 32-bit environment, download the ”x86” version. Note that even on a 64-bit Windows version, the installed Java version may be 32-bit. To check this, open a command prompt (search for cmd in the Windows menu) and type ”java -version”. If Java is installed, its version will be reported. The Java version must be 1.8.x or higher. If the Java version is 64-bit, the version information will contain the text ”64-Bit”. If Java stops and returns ”exit code 13” when you try to start Gegenees-FA, you are probably trying to run a 64-bit Gegenees version with a 32-bit Java version. Gegenees-FA is started by double-clicking the Gegenees-FA program located in the eclipse folder.

Macintosh

Download the Macintosh version of Gegenees-FA. Gegenees-FA requires Java version 1.8.x or higher. Note that on Macintosh the command-line version of Java is not always the same as the graphical version. Gegenees-FA uses the command-line version. To check your Java version, open a terminal window and type ”java -version”. If Java is installed in the command-line environment, version information will be displayed. The version must be 1.8.x or higher. If you need to install or update your command-line version of Java, download and install the latest Java Development Kit (JDK). Note that it must be the JDK, not the JRE, for Java to become accessible from the command line. Gegenees-FA is started by double-clicking the Gegenees-FA executable file in the MacOS folder. You may need to open it by right-clicking (ctrl-clicking) and explicitly allowing it to run.

Linux

The first Linux version is now released for beta testing. Gegenees-FA requires Java version 1.8.x or higher. Type ”java -version” in a terminal to check. Start the program by double-clicking Gegenees in the eclipse folder. If it does not start, check that the launcher is marked as executable.
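
The Java check described for all three platforms can also be scripted. The following is a minimal Python sketch, not part of Gegenees-FA: the function names parse_java_version and check_java are hypothetical, and the sketch assumes that the version report contains a quoted version string such as ”1.8.0_131” and, for 64-bit Java, the text ”64-Bit”.

```python
import re
import subprocess


def parse_java_version(text):
    """Return (major, is_64bit) parsed from `java -version` output, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', text)
    if match is None:
        return None
    first = int(match.group(1))
    second = int(match.group(2) or 0)
    # Java 8 and older report "1.8", "1.7", ...; Java 9+ report "9", "11", ...
    major = second if first == 1 else first
    # Assumption: a 64-bit Java mentions "64-Bit" in its version report.
    return major, "64-Bit" in text


def check_java():
    """Run `java -version` and summarize whether it meets the 1.8.x requirement."""
    try:
        # `java -version` writes its report to stderr, not stdout.
        result = subprocess.run(["java", "-version"], capture_output=True, text=True)
    except FileNotFoundError:
        return "java was not found on the command line"
    parsed = parse_java_version(result.stderr + result.stdout)
    if parsed is None:
        return "could not parse the java version report"
    major, is_64bit = parsed
    bits = "64-bit" if is_64bit else "no 64-Bit marker found"
    status = "OK" if major >= 8 else "too old"
    return f"Java {major} ({bits}): {status}"


if __name__ == "__main__":
    print(check_java())
```

A result without the 64-Bit marker together with a 64-bit Gegenees-FA download would match the ”exit code 13” situation described for Windows.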

Getting started

The workspace

The first time you start Gegenees-FA, you need to select a workspace directory (File->Select workspace…). If you have no previous Gegenees-FA workspace, you may create or select an empty directory. The workspace directory will contain your databases with genomic sequences (sub-directories starting with ”database_”) and comparisons (sub-directories starting with ”comparison_”). You may have several different workspaces for your different projects. If you press the ”Projects” tab, you can see the name of the active workspace at the bottom of the window. In general, paths containing spaces may cause problems for command-line programs such as BLAST, but Gegenees-FA should work with workspace paths containing spaces. If you experience problems, try moving or renaming the workspace directory so that the path contains no spaces.
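
To see what a workspace contains without opening Gegenees-FA, the directory layout can be inspected directly. This is a small Python sketch under the assumptions stated in this manual only: databases are sub-directories named ”database” or starting with ”database_”, and comparisons are sub-directories starting with ”comparison_”. The function name list_workspace is hypothetical.

```python
from pathlib import Path


def list_workspace(workspace):
    """Group the sub-directories of a workspace by the prefixes used by Gegenees-FA."""
    found = {"databases": [], "comparisons": [], "path_has_spaces": False}
    root = Path(workspace)
    # Spaces in the workspace path should work, but are worth flagging.
    found["path_has_spaces"] = " " in str(root)
    for entry in sorted(root.iterdir()):
        if not entry.is_dir():
            continue
        if entry.name == "database" or entry.name.startswith("database_"):
            found["databases"].append(entry.name)
        elif entry.name.startswith("comparison_"):
            found["comparisons"].append(entry.name)
    return found
```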

BLAST

Gegenees-FA depends on the standalone executable version of NCBI BLAST. Some versions of NCBI BLAST have had problems with multithreading on some operating systems (mainly Windows). This version of Gegenees-FA has been tested with the latest BLAST version (2.6.0). NCBI BLAST can be downloaded from ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/. Download the version that matches your operating system (Windows: win64.tar.gz or alternatively the installer-wrapped win64.exe version; Macintosh: macosx.tar.gz; Linux: linux.tar.gz). Extract the archive file and place it in a path that does not contain spaces. Note that spaces are not allowed anywhere in the BLAST path (the directory and all parent directories). Windows: You may need a tool such as 7-zip to unpack the archive. If you do not have permission to unpack or install BLAST on your computer, you may extract the files on another computer and move them to the computer of interest. Macintosh: It should be possible to double-click the tar.gz file. Alternatively, use the Linux-style extraction method in the terminal window. Linux: Extract with a command such as ”tar -zxvf ncbi-blast-2.6.0……tar.gz”. The BLAST path (which is the path to the ”bin” directory of the extracted directory structure) must then be specified in the ”File->Configure Blast path…” dialog of Gegenees-FA.
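
Before entering the BLAST path in the dialog, the two requirements above (no spaces anywhere in the path, and the path pointing at the ”bin” directory) can be checked with a short script. This is a Python sketch only; the function name blast_path_problems is hypothetical, and the sketch assumes that an absolute path is passed in so that all parent directories are included in the check.

```python
from pathlib import Path


def blast_path_problems(blast_bin):
    """Return a list of problems found with a candidate BLAST path (empty if none)."""
    problems = []
    path = Path(blast_bin)
    # Spaces are not allowed in the bin directory or in any parent directory.
    if " " in str(path):
        problems.append("path contains a space")
    # The configured path should be the "bin" directory of the extracted archive.
    if path.name != "bin":
        problems.append("path does not end in a 'bin' directory")
    if not path.is_dir():
        problems.append("directory does not exist")
    return problems
```

An empty list means that none of these three problems was detected; it does not prove that the BLAST installation itself works.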

Databases

In a workspace there is a directory called ”database”. This is the ”default database”, which stores genome sequences that can be used to set up a comparison. Each genome is stored in a directory named after the genome, containing one or several GenBank-formatted files. Gegenees-FA may also create an ”info.geg” file with some statistics on the genome. It is also possible to have additional databases, each of which is a directory whose name begins with ”database_”. The databases are handled by the ”Database Manager”, which can be launched with the Data->Database manager menu command. When starting a new comparison, the available genomes come from the currently active database. If you need to change database, do so through the database manager. To ensure correct formatting, genomes (e.g. in FASTA format) may be imported into a database using the database manager.
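
The database layout described above (one directory per genome, holding its sequence files and possibly an ”info.geg” file) can be summarized with a few lines of Python. This is an illustrative sketch only: the function name list_genomes is hypothetical, and because the file extensions used for the GenBank files are not specified here, every file other than ”info.geg” is simply reported as a sequence file.

```python
from pathlib import Path


def list_genomes(database_dir):
    """Map each genome directory name to its files and whether info.geg exists."""
    genomes = {}
    for entry in sorted(Path(database_dir).iterdir()):
        if not entry.is_dir():
            continue  # genomes are stored as directories, one per genome
        files = sorted(f.name for f in entry.iterdir() if f.is_file())
        genomes[entry.name] = {
            # Assumption: everything except info.geg is a sequence file.
            "sequence_files": [name for name in files if name != "info.geg"],
            "has_info": "info.geg" in files,
        }
    return genomes
```

A listing like this is meant for inspection only; changes to a database are best made through the database manager.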

The Projects tab

The Projects tab lists all comparisons in the current workspace directory. Selecting a comparison updates the contents of the other tabs so that they reflect the selected comparison. In an empty workspace no comparisons are listed. A new comparison can be started with File->New Comparison…. The path of the current workspace is shown on the bottom line.

The Run Alignments Tab

When a new comparison has been created (the New Comparison wizard is completed), you will end up on this tab. Start the comparison by pressing ”start”. The log window and the status bar show the progress of the alignment. After some initial work, the log window will be dominated by messages such as ”Thread=12/625 (P=13) blast producing: G8_G17.result” and ”Thread=6/625 (P=17) done: OK!”. In this case there are 625 BLAST commands that need to be run (each comparing all fragments from one genome against the unfragmented sequence of another genome). The P value shows how many parallel threads are executing BLAST commands. It is based on the number of processor cores reported by the system and can be limited on the preferences page. Occasionally a thread may not produce a valid BLAST result file in reasonable time; the thread is then killed and that particular BLAST job is started again after all other threads have finished. A few restarts are tolerable, but a large number of failures or restarts indicates that something is wrong. After some post-processing, the log window will reload its contents and start with the row ”COMPLETED!”. The data can then be explored in the other tabs. Once the comparison has completed, the ”Run Alignments” tab has little function other than viewing the log file. The log file can also be opened in a text editor. It is called ”logfilecomparison.txt” and is located in the ”alignment_analysis” directory of the comparison directory.
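
For long runs it can be convenient to summarize the log file with a script. The following Python sketch parses only the two message types quoted above; the exact wording of other messages (for example for killed or restarted jobs) is not documented here, so such lines are ignored. The function name summarize_log is hypothetical.

```python
import re

# Matches e.g. "Thread=12/625 (P=13) blast producing: G8_G17.result"
# and          "Thread=6/625 (P=17) done: OK!"
LINE = re.compile(r"Thread=(\d+)/(\d+) \(P=(\d+)\) (blast producing|done): (.+)")


def summarize_log(lines):
    """Count started and successfully finished BLAST jobs in log lines."""
    summary = {"total": None, "started": 0, "done_ok": 0, "max_parallel": 0}
    for line in lines:
        match = LINE.search(line)
        if match is None:
            continue  # not one of the two message types quoted above
        _job, total, parallel, kind, detail = match.groups()
        summary["total"] = int(total)
        summary["max_parallel"] = max(summary["max_parallel"], int(parallel))
        if kind == "blast producing":
            summary["started"] += 1  # a restarted job is counted again
        elif detail.strip() == "OK!":
            summary["done_ok"] += 1
    return summary
```

Since restarted jobs appear as additional ”blast producing” lines, a ”started” count well above ”total” would hint at the many restarts that the text above warns about.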

The Included Genomes Tab

This tab lists the genomes in the comparison. Information about the genomes can be displayed, and the comparison can be modified by adding or removing genomes. The progress reporting of the modify function does not work properly in this beta version.

The Group settings Tab

This tab allows the included genomes to be assigned to the ”background” or the ”target” group. The Signature tabs use this information. A signature represents what is conserved in the target group but absent from the background group. A reference genome can also be selected, which defines the coordinate system when the signature is plotted graphically.

The heatmap Tab

In this tab, a heatmap based on the average fragment similarity between the genomes is displayed. The color profile can be changed and the rows/columns can be sorted by similarity. A new sort order is given a name and stored so that it can be reloaded. A threshold can also be set so that fragments with poor alignment are filtered out. This allows the ”core genomes” to be compared. However, if too much of the genome is filtered out, the data becomes unreliable. How much of the genome is included at a certain threshold level can be investigated by switching from ”show score” to ”show core”.
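
The relation between the threshold, the average score and the included ”core” can be illustrated with a small calculation. This Python sketch is not the Gegenees-FA implementation: it assumes that each fragment has a similarity score between 0 and 1 and that a fragment is kept when its score is at least the threshold. The function name heatmap_cell is hypothetical.

```python
def heatmap_cell(fragment_scores, threshold=0.0):
    """Return (average score, core fraction) for one genome pair.

    Assumption: scores are normalized to 0..1 and fragments scoring
    below the threshold are filtered out before averaging.
    """
    kept = [score for score in fragment_scores if score >= threshold]
    if not kept:
        return None, 0.0  # nothing left to average
    average = sum(kept) / len(kept)
    core_fraction = len(kept) / len(fragment_scores)
    return average, core_fraction


# Three fragments, one of which aligns poorly:
print(heatmap_cell([1.0, 0.5, 0.0]))        # no filtering
print(heatmap_cell([1.0, 0.5, 0.0], 0.4))   # poor fragment filtered out
```

In this toy example, raising the threshold increases the average score while the core fraction shrinks, which is why a high score computed from a very small core should be treated with caution.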

The Signature graph Tab

In this tab, the selected reference genome is plotted, with ”biomarker score” values plotted above and below it. The (max/min) biomarker score is calculated from the maximum value in the background group (the worst false positive) and the minimum value in the target group (the worst false negative). This is the most stringent way to look at signatures: a high score requires full conservation in the target group and no trace of the sequence in the background group. Alternative scores based on average values can be used. If you select a single genome as the target group, you will highlight what is unique to this genome.
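
The idea behind the max/min and average variants can be sketched in Python. Note that the exact formula used by Gegenees-FA is not given in this text; combining the two values by subtraction, and the 0 to 1 score range, are assumptions made purely for illustration, and both function names are hypothetical.

```python
def max_min_biomarker_score(target_scores, background_scores):
    """Stringent score for one fragment (illustrative formula, an assumption).

    Uses the worst false negative (lowest target score) and the worst
    false positive (highest background score).
    """
    worst_false_negative = min(target_scores)
    worst_false_positive = max(background_scores)
    return max(0.0, worst_false_negative - worst_false_positive)


def average_biomarker_score(target_scores, background_scores):
    """Less stringent variant based on group averages (also an assumption)."""
    target_mean = sum(target_scores) / len(target_scores)
    background_mean = sum(background_scores) / len(background_scores)
    return max(0.0, target_mean - background_mean)


# Fragment fully conserved in the target group and absent from the background:
print(max_min_biomarker_score([1.0, 1.0], [0.0, 0.0]))
# One poorly conserved target genome lowers the stringent score:
print(max_min_biomarker_score([1.0, 0.5], [0.25, 0.0]))
print(average_biomarker_score([1.0, 0.5], [0.25, 0.0]))
```

With this toy formula a single deviating genome in either group is enough to lower the max/min score, whereas the average-based variant is more forgiving.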

The Signature Table Tab

In this tab, fragments with different ranges of biomarker score can be sorted out and analyzed.