FALCON is an alignment-free, unsupervised machine learning system to detect pathogens in genomic and metagenomic samples. The core of the method is the relative algorithmic entropy, a notion that uses model freezing and exclusive information from a reference, which allows it to run with much lower computational resources. It also uses variable multi-threading without multiplying the memory for each thread, so it runs efficiently on anything from a powerful server to a common laptop. To measure similarity, the system builds multiple finite-context models that are frozen at the end of the reference sequence. The target reads are then measured using a mixture of the frozen models. The mixture estimates the probabilities according to the performance of each model, which adapts the usage of the models to the nature of the target sequence. Furthermore, it uses fault-tolerant (substitution edits) finite-context models that bridge the gap between context sizes. The tool can also identify where, in each database sequence, the similarity occurs locally. FALCON provides programs to filter the local results (FALCON-FILTER) and to visualize them (FALCON-EYE). Several running modes are available for different hardware and speed specifications. The system learns automatically how to measure similarity, a property characteristic of the Artificial Intelligence field.
CMake (http://www.cmake.org/) is needed for installation on systems not using Linux. You can download it directly from http://www.cmake.org/cmake/resources/software.html or use an appropriate package manager. The following instructions show how to install FALCON:
    git clone https://github.com/pratas/falcon.git
    cd falcon/src/
    cmake .
    make
    cp FALCON ../../
    cp FALCON-FILTER ../../
    cp FALCON-EYE ../../
    cd ../../
Alternatively, use wget instead of git:
    wget https://github.com/pratas/falcon/archive/master.zip
    unzip master.zip
    cd falcon-master/src
    cmake .
    make
    cp FALCON ../../
    cp FALCON-FILTER ../../
    cp FALCON-EYE ../../
    cd ../../
Or, on Linux, use the provided Makefile instead of cmake:
    git clone https://github.com/pratas/falcon.git
    cd falcon/src/
    cp Makefile.linux Makefile
    make
    cp FALCON ../../
    cp FALCON-FILTER ../../
    cp FALCON-EYE ../../
    cd ../../
This will create three binary files:
    FALCON
    FALCON-FILTER
    FALCON-EYE
FALCON is the main program, FALCON-FILTER filters the local interactions, and FALCON-EYE visualizes the output of FALCON-FILTER.
After installation, search for the top 10 most similar viruses in Chimpanzee chromosome 18:
    cp falcon/scripts/DownloadViruses.pl .
    perl DownloadViruses.pl
    wget --trust-server-names -q \
      ftp://ftp.ncbi.nlm.nih.gov/genomes/Pan_troglodytes/CHR_18/ptr_ref_Pan_troglodytes-2.1.4_chr18.fa.gz \
      -O PT18.fa.gz
    gunzip PT18.fa.gz
    ./FALCON -v -n 4 -c 20 -t 10 -l 15 PT18.fa viruses.fa
FALCON will use less than 3.5 GB of RAM and take about 1 minute to run on a common laptop.
If there are problems with perl, run the following:
perl -MCPAN -e'install "LWP::Simple"'
To see the available options of FALCON, type:
./FALCON
or
./FALCON -h
These will print the following options:
    Usage: FALCON [OPTION]... [FILE1] [FILE2]
    Machine learning system to detect pathogens in metagenomic samples.
    Non-mandatory arguments:
      -h                       give this help,
      -F                       force mode (overwrites top file),
      -V                       display version number,
      -v                       verbose mode (more information),
      -Z                       database local similarity,
      -s                       show compression levels,
      -l <level>               compression level [1;44],
      -p <sample>              subsampling (default: 1),
      -t <top>                 top of similarity (default: 20),
      -n <nThreads>            number of threads (default: 2),
      -x <FILE>                similarity top filename,
      -y <FILE>                local similarities filename,
    Mandatory arguments:
      [FILE1]                  metagenomic filename (FASTA or FASTQ),
      [FILE2]                  database filename (FASTA or Multi-FASTA).
    Report issues to <{pratas,ap,pjf,jmr}@ua.pt>.
The parameters are explained in more detail in the following table:
Parameters | Meaning |
---|---|
-h | It will print the parameters menu (help menu) |
-F | It will use the force mode, namely overwriting the output top file. |
-V | It will print the FALCON version number, license type and authors. |
-v | It will print progress information. |
-Z | It measures the local complexity to localize specific events. |
-s | It will show pre-defined running levels/modes. |
-l <level> | It will use the selected running levels/modes. |
-p <sample> | If FALCON is using a single model, it will sample (use) bases only at this periodic interval. |
-t <top> | It will create a similarity top of this size (for example, 10 for the 10 most similar database sequences). |
-n <nThreads> | It will use this number of threads. The task will finish much faster without using more RAM. |
-x <FILE> | Output top filename. |
-y <FILE> | Output local similarities filename (profile). Only when -Z option is used. |
[FILE1] | The metagenomic filename (direct from the NGS sequencing platform). Possible file formats: FASTQ, multi-FASTA, FASTA or sequence [ACGTN]. |
[FILE2] | The database filename (e.g. virus or bacteria database). Possible file formats: FASTA, multi-FASTA or sequence [ACGTN]. The scripts directory contains several scripts to download databases. |
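
Since [FILE1] may be FASTA or FASTQ while [FILE2] is expected to be FASTA or multi-FASTA, a quick check of the inputs before a long run can save time. The following is a minimal sketch, not part of FALCON: the function names `guess_format` and `count_fasta_records` are hypothetical, and the check only looks at the first character of the file (`>` for FASTA, `@` for FASTQ), so a raw [ACGTN] sequence is reported as `unknown`.

```shell
# Guess a sequence file's format from its first character.
# Hypothetical helper, not part of FALCON.
guess_format() {
  case "$(head -c 1 "$1")" in
    ">") echo fasta ;;
    "@") echo fastq ;;
    *)   echo unknown ;;   # e.g. a raw [ACGTN] sequence or an empty file
  esac
}

# Count the records in a FASTA or multi-FASTA file
# (one header line starting with '>' per record).
count_fasta_records() {
  grep -c '^>' "$1"
}
```

For example, `count_fasta_records viruses.fa` shows how many database sequences are available, which is an upper bound on a meaningful value for `-t`.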
To detect and visualize local interactions, the package provides FALCON-FILTER and FALCON-EYE.
To see the available options of FALCON-FILTER, type:
./FALCON-FILTER
or
./FALCON-FILTER -h
These will print the following options:
    Usage: FALCON-FILTER [OPTION]... [FILE]
    Filter and segment FALCON output.
    Non-mandatory arguments:
      -h                       give this help,
      -F                       force mode (overwrites top file),
      -V                       display version number,
      -v                       verbose mode (more information),
      -s <size>                filter window size,
      -w <type>                filter window type,
      -x <sampling>            filter window sampling,
      -t <threshold>           threshold,
      -o <FILE>                output filename,
    Mandatory arguments:
      [FILE]                   profile filename (from FALCON),
    Report issues to <{pratas,ap,pjf,jmr}@ua.pt>.
The parameters are explained in more detail in the following table:
Parameters | Meaning |
---|---|
-h | It will print the parameters menu (help menu) |
-F | It will use the force mode, namely overwriting the output top file. |
-V | It will print the FALCON version number, license type and authors. |
-v | It will print progress information. |
-s <size> | Filtering window size. |
-w <type> | Window type [0;3]. Types: 0-Hamming, 1-Hann, 2-Blackman, 3-Rectangular. |
-x <sampling> | Filtering window sampling (it will drop this number of bases). |
-t <threshold> | Threshold to segment regions of similarity [0;2]. |
-o <FILE> | Output filename, to be used, for example, as input to FALCON-EYE. It contains the local positions with the intervals describing similarity. |
[FILE] | Profile filename given by the output of FALCON (option:-Z -y <FILE>). |
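
To give a feel for the four window types selectable with `-w`, the sketch below prints window coefficients using the standard textbook definitions of the Hamming, Hann, Blackman and rectangular windows. It is an illustration only: FALCON-FILTER's own implementation may differ in details such as normalization, and the function name `window_coeffs` is hypothetical.

```shell
# Print the coefficients of a smoothing window, one per line.
# Usage: window_coeffs <type> <size>
# Types follow the -w numbering: 0-Hamming, 1-Hann, 2-Blackman, 3-Rectangular.
# Textbook formulas; not taken from the FALCON-FILTER source code.
window_coeffs() {
  awk -v type="$1" -v size="$2" 'BEGIN {
    pi = atan2(0, -1)
    for (n = 0; n < size; n++) {
      # Phase across the window; a single-point window has phase 0.
      x = (size > 1) ? 2 * pi * n / (size - 1) : 0
      if (type == 0)      w = 0.54 - 0.46 * cos(x)
      else if (type == 1) w = 0.5 * (1 - cos(x))
      else if (type == 2) w = 0.42 - 0.5 * cos(x) + 0.08 * cos(2 * x)
      else                w = 1
      printf "%.4f\n", w
    }
  }'
}
```

For instance, `window_coeffs 1 5` prints a five-point Hann window. The tapered windows (types 0 to 2) give less weight to the edges of the window than the rectangular one, which weights every position equally.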
To see the available options of FALCON-EYE, type:
./FALCON-EYE
or
./FALCON-EYE -h
These will print the following options:
    Usage: FALCON-EYE [OPTION]... [FILE]
    Vizualise FALCON-FILTER output.
    Non-mandatory arguments:
      -h                       give this help,
      -F                       force mode (overwrites top file),
      -V                       display version number,
      -v                       verbose mode (more information),
      -w <width>               square width (for each value),
      -s <ispace>              square inter-space (between each value),
      -i <indexs>              color index start,
      -r <indexr>              color index rotations,
      -u <hue>                 color hue,
      -sl <lower>              similarity lower bound,
      -su <upper>              similarity upper bound,
      -dl <lower>              size lower bound,
      -du <upper>              size upper bound,
      -g <color>               color gamma,
      -e <size>                enlarge painted regions,
      -ss                      do NOT show global scale,
      -sn                      do NOT show names,
      -o <FILE>                output image filename,
    Mandatory arguments:
      [FILE]                   profile filename (from FALCON-FILTER),
    Report issues to <{pratas,ap,pjf,jmr}@ua.pt>.
The parameters are explained in more detail in the following table:
Parameters | Meaning |
---|---|
-h | It will print the parameters menu (help menu) |
-F | It will use the force mode, namely overwriting the output top file. |
-V | It will print the FALCON version number, license type and authors. |
-v | It will print progress information. |
-w <width> | Square width (for each value). |
-s <ispace> | Space between squares. |
-i <indexs> | color index start. |
-r <indexr> | color index rotations. |
-u <hue> | color hue. |
-g <color> | color gamma. |
-sl <lower> | similarity lower bound. |
-su <upper> | similarity upper bound. |
-dl <lower> | size lower bound. |
-du <upper> | size upper bound. |
-e <size> | Enlarge painted local regions. |
-ss | Does not show global scale. |
-sn | Does not show names. |
-o <FILE> | Output SVG image filename. |
[FILE] | Profile filename given by the output of FALCON-FILTER. |
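
The three programs described above can be chained: FALCON writes a local similarity profile (`-Z -y`), FALCON-FILTER segments it, and FALCON-EYE renders the result as an SVG image. The following is a minimal sketch of that workflow using only options listed in the tables above. The function names, the output file naming scheme, and all numeric values (window size, sampling, threshold) are illustrative assumptions, not recommended settings.

```shell
# Print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Usage: falcon_pipeline <reads> <database> <output-prefix>
# Hypothetical wrapper; file suffixes and numeric values are placeholders.
falcon_pipeline() {
  reads=$1    # [FILE1]: metagenomic reads (FASTA or FASTQ)
  db=$2       # [FILE2]: database (FASTA or multi-FASTA)
  prefix=$3   # prefix for all output files

  # 1. Global top plus local similarity profile (-Z enables -y).
  run ./FALCON -v -Z -n 4 -t 10 -l 15 \
      -x "$prefix.top" -y "$prefix.profile" "$reads" "$db" &&
  # 2. Filter and segment the profile (window type 1 is Hann).
  run ./FALCON-FILTER -v -s 500 -w 1 -x 10 -t 0.5 \
      -o "$prefix.positions" "$prefix.profile" &&
  # 3. Draw the segmented regions as an SVG image.
  run ./FALCON-EYE -v -o "$prefix.svg" "$prefix.positions"
}
```

Running `DRY_RUN=1 falcon_pipeline PT18.fa viruses.fa run1` prints the three commands without executing them, which is a convenient way to review the options before a long run.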
If you use this software or method, please cite:
D. Pratas, A. J. Pinho, P. J. S. G. Ferreira, J. M. O. S. Rodrigues (2015). FALCON: a machine learning system to detect pathogens in genomic and metagenomic samples. Zenodo. 10.5281/zenodo.35745.
D. Pratas, R. M. Silva, A. J. Pinho, P. J. S. G. Ferreira. An alignment-free method to find and visualise rearrangements between pairs of DNA sequences. Sci. Rep. 5, 10203; doi: 10.1038/srep10203 (2015).
For any issue, let us know at the issues link.
GPL v3.
For more information, see the LICENSE file or visit
http://www.gnu.org/licenses/gpl-3.0.html