Skip to content
/ fminer Public

Molecular feature miner for path- and tree-shaped structures from "Large Scale Graph Mining using Backbone Refinement Classes"

License

Notifications You must be signed in to change notification settings

amaunz/fminer

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

71 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

 _   _ ______ ______   ___   _____  _____ 
| | | || ___ \|  _  \ / _ \ |_   _||  ___|
| | | || |_/ /| | | |/ /_\ \  | |  | |__  
| | | ||  __/ | | | ||  _  |  | |  |  __| 
| |_| || |    | |/ / | | | |  | |  | |___ 
 \___/ \_|    |___/  \_| |_/  \_/  \____/ 

A NEWER VERSION OF FMINER IS AVAILABLE AT
--- https://github.com/amaunz/fminer2 ----


 Welcome to Fminer.

 This is Fminer (application), available from https://github.com/amaunz/fminer/tree/master .
 It depends on LibFminer, available from https://github.com/amaunz/libfminer/tree/master . 
 The official website with documentation is http://www.maunz.de/libfminer-doc .

 For installation and documentation see INSTALL.
 For license information see LICENSE.


 File Formats:
 =============
 Graphs are allowed in SMILES and gSpan format. This is to support both general graph mining, but to also make life more convenient when mining molecular data. For instance, for SMILES formatted input there is a switch to enable/disable aromatic ring perception, which, in gSpan format, requires re-formatting the input.
  
  SMILES file line format is defined by 
    "ID\t SMILES", e.g.:
     1    COC1=CC=C(C=C1)C2=NC(=C([NH]2)C3=CC=CC=C3)C4=CC=CC=C4

  gSpan file block format (see also: http://illimine.cs.uiuc.edu/resources/readme_plain.php?program=gSpan) is defined by:
    "t # ID" means the Nth graph,
    "v M L" means that the Mth vertex in this graph has label L,
    "e P Q L" means that there is an edge connecting the Pth vertex with the Qth vertex. The edge has label L.

- Activity file line format is defined by 
    "ID\t endpoint\t activity", e.g.:
     1    Salmonella Mutagenicity    1

 The set of IDs in a gSpan or SMILES format file must be a subset of the set of activity IDs in the respective activity format file, i.e., not every Activity ID must be matched by a graph id, but vice versa.


 Switches (run Fminer w/o arguments for allowable parameter ranges, defaults and explanations):
 ========
 Constraint parameters:
 -f Minimum frequency constraint, used for anti-monotonic pruning.
 -l Subgraph type, choices are paths and trees.
 -p Chi-square significance level, used for statistical upper-bound pruning.

 Constraint switches:
 -d Deactivate dynamic adjustment of upper bound (less efficient, use only for performance evaluation). Can only be set when -c is not set.
 -b Switch off mining of only the most significant/most general representative of each backbone.
 -m Switch on mining of maximal trees, as induced by constraints.
 -u Deactivate upper bound pruning for chi-square constraint (less efficient, use only for performance evaluation).

 General switches:
 -s Refine fragments with frequency 1.
 -a Enable aromatic ring perception for SMILES input.
 -r Insert empty vectors in result vector to indicate BBRC borders.
 -n Output line numbers from input file instead of IDs in result file.
 -o Switch off output.


 Environment variables:
 ======================
 Define the FMINER_SMARTS environment variable to produce output in SMARTS format (e.g. export FMINER_SMARTS=1).
 Additionally define the FMINER_LAZAR environment variable to produce output in linfrag format which can be used as input to Lazar (e.g. export FMINER_LAZAR=1).


 Remarks:
 ========
 Structures are output to STDOUT in the specified format.
 Output consists of blocks of gSpan graphs.
 When using SMARTS output (environment variable FMINER_SMARTS is defined), each line is a YAML sequence, containing SMARTS fragment, p-value, and two sequences denoting positive and negative class occurrences (default compound ids, see also switch -n). 
 When also FMINER_LAZAR is set, the linfrag output format is used.


 Examples:
 =========

 In any case, 1-frequent patterns are not refined further, unless you use the -s switch 

 Use BBRC mining (default):

 # BBRC representatives (min frequency 2, min significance 95%), using dynamic UB pruning
 ./fminer <graphs> <activities> 
 # same as above, but much slower (explicitly no dynamic UB pruning)
 ./fminer -d <graphs> <activities>                                        

 Switch off BBRC mining:

 # all 2-frequent and 95%-significant features
 # Note, that the -d is mandatory (no dynamic UB pruning possible here)!
 ./fminer -d -b <graphs> <activities>
 # All 20-frequent patterns (standard frequent pattern mining)
 ./fminer -f 20 <graphs>

 
Andreas Maunz, 2008.

About

Molecular feature miner for path- and tree-shaped structures from "Large Scale Graph Mining using Backbone Refinement Classes"

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages