net.sf.classifier4J.bayesian
Class BayesianClassifier

java.lang.Object
  |
  +--net.sf.classifier4J.AbstractClassifier
        |
        +--net.sf.classifier4J.AbstractCategorizedTrainableClassifier
              |
              +--net.sf.classifier4J.bayesian.BayesianClassifier
All Implemented Interfaces:
ICategorisedClassifier, IClassifier, ITrainable, ITrainableClassifier

public class BayesianClassifier
extends AbstractCategorizedTrainableClassifier

An implementation of IClassifier based on Bayes' theorem (see http://www.wikipedia.org/wiki/Bayes_theorem).

The basic usage pattern for this class is:

  1. Create an instance of IWordsDataSource
  2. Create a new instance of BayesianClassifier, passing the IWordsDataSource to the constructor
  3. Call IClassifier.classify(java.lang.String) or IClassifier.isMatch(java.lang.String)

For example:
IWordsDataSource wds = new SimpleWordsDataSource();
IClassifier classifier = new BayesianClassifier(wds);
System.out.println( "Matches = " + classifier.classify("This is a sentence") );

Author:
Nick Lothian, Peter Leschev

Field Summary
 
Fields inherited from class net.sf.classifier4J.AbstractClassifier
cutoff
 
Fields inherited from interface net.sf.classifier4J.ICategorisedClassifier
DEFAULT_CATEGORY
 
Fields inherited from interface net.sf.classifier4J.IClassifier
DEFAULT_CUTOFF, LOWER_BOUND, NEUTRAL_PROBABILITY, UPPER_BOUND
 
Constructor Summary
BayesianClassifier()
          Default constructor that uses the SimpleWordsDataSource & a DefaultTokenizer (set to BREAK_ON_WORD_BREAKS).
BayesianClassifier(IWordsDataSource wd)
          Constructor for BayesianClassifier that specifies a datasource.
BayesianClassifier(IWordsDataSource wd, ITokenizer tokenizer)
          Constructor for BayesianClassifier that specifies a datasource & tokenizer
BayesianClassifier(IWordsDataSource wd, ITokenizer tokenizer, IStopWordProvider swp)
          Constructor for BayesianClassifier that specifies a datasource, tokenizer and stop words provider
 
Method Summary
protected  double calculateOverallProbability(WordProbability[] wps)
          NOTE: Override this method with care.
 double classify(java.lang.String category, java.lang.String input)
          Determines the probability that a string matches the criteria for a given category.
protected  double classify(java.lang.String category, java.lang.String[] words)
           
 IStopWordProvider getStopWordProvider()
           
 ITokenizer getTokenizer()
           
 IWordsDataSource getWordsDataSource()
           
 boolean isCaseSensitive()
           
 boolean isMatch(java.lang.String category, java.lang.String input)
          Determines whether a string matches the criteria for a given category.
protected  boolean isMatch(java.lang.String category, java.lang.String[] input)
           
protected static double normaliseSignificance(double sig)
           
 void setCaseSensitive(boolean b)
           
 void teachMatch(java.lang.String category, java.lang.String input)
           
protected  void teachMatch(java.lang.String category, java.lang.String[] words)
           
 void teachNonMatch(java.lang.String category, java.lang.String input)
           
protected  void teachNonMatch(java.lang.String category, java.lang.String[] words)
           
 java.lang.String toString()
           
protected  java.lang.String transformWord(java.lang.String word)
          Allows transformations to be done to word.
 
Methods inherited from class net.sf.classifier4J.AbstractCategorizedTrainableClassifier
classify, teachMatch, teachNonMatch
 
Methods inherited from class net.sf.classifier4J.AbstractClassifier
getMatchCutoff, isMatch, isMatch, setMatchCutoff
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 
Methods inherited from interface net.sf.classifier4J.IClassifier
isMatch, isMatch, setMatchCutoff
 

Constructor Detail

BayesianClassifier

public BayesianClassifier()
Default constructor that uses the SimpleWordsDataSource & a DefaultTokenizer (set to BREAK_ON_WORD_BREAKS).


BayesianClassifier

public BayesianClassifier(IWordsDataSource wd)
Constructor for BayesianClassifier that specifies a datasource. The DefaultTokenizer (set to BREAK_ON_WORD_BREAKS) will be used.

Parameters:
wd - an IWordsDataSource

BayesianClassifier

public BayesianClassifier(IWordsDataSource wd,
                          ITokenizer tokenizer)
Constructor for BayesianClassifier that specifies a datasource & tokenizer

Parameters:
wd - an IWordsDataSource
tokenizer - an ITokenizer

BayesianClassifier

public BayesianClassifier(IWordsDataSource wd,
                          ITokenizer tokenizer,
                          IStopWordProvider swp)
Constructor for BayesianClassifier that specifies a datasource, tokenizer and stop words provider

Parameters:
wd - an IWordsDataSource
tokenizer - an ITokenizer
swp - an IStopWordProvider
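
As a rough illustration of what a tokenizer and a stop word provider contribute before any probabilities are computed, the sketch below splits the input on non-word characters and drops stop words. It uses only the JDK. PreprocessSketch, tokenize and the stop word list are hypothetical stand-ins chosen for this example; they are not the library's DefaultTokenizer or any IStopWordProvider implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class PreprocessSketch {
    // Hypothetical stop word list, chosen only for this example.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "is", "the", "this"));

    // Splits on runs of non-word characters, then drops empty tokens and stop words.
    static List<String> tokenize(String input) {
        List<String> words = new ArrayList<>();
        for (String token : input.split("\\W+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token.toLowerCase())) {
                words.add(token);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("This is a sentence"));
    }
}
```

Only the words that survive this kind of filtering would be looked up in the words data source.
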
Method Detail

isMatch

public boolean isMatch(java.lang.String category,
                       java.lang.String input)
                throws WordsDataSourceException
Description copied from interface: ICategorisedClassifier
Determines whether a string matches the criteria for a given category.

Parameters:
category - the category to check against
input - the string to classify
Returns:
true if the input string has a probability >= the cutoff probability of matching
Throws:
WordsDataSourceException
See Also:
ICategorisedClassifier.isMatch(java.lang.String, java.lang.String)
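
The documented return value amounts to a threshold test on the classification probability. A minimal sketch, assuming a cutoff of 0.9 purely for illustration (the library's actual default is the inherited DEFAULT_CUTOFF, whose value is not stated on this page); CutoffSketch and its members are hypothetical names:

```java
class CutoffSketch {
    // Assumed value for illustration only; see IClassifier.DEFAULT_CUTOFF for the real default.
    static final double ASSUMED_CUTOFF = 0.9;

    // True when the probability reaches or exceeds the cutoff, as the Returns text describes.
    static boolean isMatch(double probability, double cutoff) {
        return probability >= cutoff;
    }

    public static void main(String[] args) {
        System.out.println(isMatch(0.95, ASSUMED_CUTOFF));
        System.out.println(isMatch(0.50, ASSUMED_CUTOFF));
    }
}
```
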

classify

public double classify(java.lang.String category,
                       java.lang.String input)
                throws WordsDataSourceException
Description copied from interface: ICategorisedClassifier
Determines the probability that a string matches the criteria for a given category.

Parameters:
category - the category to check against
input - the string to classify
Returns:
the likelihood that this string is a match for this classifier. 1 means 100% likely.
Throws:
WordsDataSourceException
See Also:
ICategorisedClassifier.classify(java.lang.String, java.lang.String)

teachMatch

public void teachMatch(java.lang.String category,
                       java.lang.String input)
                throws WordsDataSourceException
Throws:
WordsDataSourceException

teachNonMatch

public void teachNonMatch(java.lang.String category,
                          java.lang.String input)
                   throws WordsDataSourceException
Throws:
WordsDataSourceException
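
The training methods carry no description here. One plausible reading, sketched below under that assumption, is that training records how often each word was seen in matching and non-matching input, and that a per-word probability is later derived from those counts. TrainingSketch is a hypothetical in-memory stand-in that ignores categories; it is not the library's IWordsDataSource, and the 0.5 value for unseen words is an assumption.

```java
import java.util.HashMap;
import java.util.Map;

class TrainingSketch {
    // word -> {matchCount, nonMatchCount}
    private final Map<String, int[]> counts = new HashMap<>();

    // Records one matching sighting of each word.
    void teachMatch(String[] words) {
        for (String w : words) {
            counts.computeIfAbsent(w, k -> new int[2])[0]++;
        }
    }

    // Records one non-matching sighting of each word.
    void teachNonMatch(String[] words) {
        for (String w : words) {
            counts.computeIfAbsent(w, k -> new int[2])[1]++;
        }
    }

    // Fraction of sightings that were matches; 0.5 (assumed neutral) for unseen words.
    double wordProbability(String word) {
        int[] c = counts.get(word);
        if (c == null) {
            return 0.5;
        }
        return (double) c[0] / (c[0] + c[1]);
    }

    public static void main(String[] args) {
        TrainingSketch t = new TrainingSketch();
        t.teachMatch(new String[] {"cheap", "offer"});
        t.teachNonMatch(new String[] {"cheap", "meeting"});
        System.out.println(t.wordProbability("cheap"));
    }
}
```
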

isMatch

protected boolean isMatch(java.lang.String category,
                          java.lang.String[] input)
                   throws WordsDataSourceException
Throws:
WordsDataSourceException

classify

protected double classify(java.lang.String category,
                          java.lang.String[] words)
                   throws WordsDataSourceException
Throws:
WordsDataSourceException

teachMatch

protected void teachMatch(java.lang.String category,
                          java.lang.String[] words)
                   throws WordsDataSourceException
Throws:
WordsDataSourceException

teachNonMatch

protected void teachNonMatch(java.lang.String category,
                             java.lang.String[] words)
                      throws WordsDataSourceException
Throws:
WordsDataSourceException

transformWord

protected java.lang.String transformWord(java.lang.String word)
Allows transformations to be done to word. This implementation transforms the word to lowercase if the classifier is in case-insensitive mode.

Parameters:
word - the word to transform
Returns:
the transformed word
Throws:
java.lang.IllegalArgumentException - if a null is passed
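
The documented behaviour can be sketched as follows. TransformWordSketch is a hypothetical class that mirrors only what is described above: lowercasing in case-insensitive mode (the documented default) and an IllegalArgumentException for null. The use of Locale.ROOT is this sketch's choice, not something the page specifies.

```java
import java.util.Locale;

class TransformWordSketch {
    // Case-insensitive by default, as documented for isCaseSensitive().
    private boolean caseSensitive = false;

    void setCaseSensitive(boolean b) {
        caseSensitive = b;
    }

    // Lowercases the word unless the classifier is case sensitive.
    String transformWord(String word) {
        if (word == null) {
            throw new IllegalArgumentException("word must not be null");
        }
        return caseSensitive ? word : word.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        TransformWordSketch s = new TransformWordSketch();
        System.out.println(s.transformWord("Hello"));
        s.setCaseSensitive(true);
        System.out.println(s.transformWord("Hello"));
    }
}
```

A subclass that overrides transformWord could apply other normalisation, such as stemming, at the same point.
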

calculateOverallProbability

protected double calculateOverallProbability(WordProbability[] wps)
NOTE: Override this method with care. There is a good chance it will be removed or have signature changes in later versions.
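
This page does not say how the per-word probabilities are combined. A common approach in naive Bayesian text filters, shown here only as an assumption about what an implementation might do, is to divide the product of the probabilities by the sum of that product and the product of their complements. OverallProbabilitySketch, combine and the neutral value of 0.5 are hypothetical.

```java
class OverallProbabilitySketch {
    // Assumed neutral result when there is nothing to combine.
    static final double NEUTRAL = 0.5;

    // Computes prod(p) / (prod(p) + prod(1 - p)).
    // Assumes each p lies strictly between 0 and 1; otherwise the
    // denominator can be zero and the result is NaN.
    static double combine(double[] wordProbabilities) {
        if (wordProbabilities == null || wordProbabilities.length == 0) {
            return NEUTRAL;
        }
        double match = 1.0;
        double nonMatch = 1.0;
        for (double p : wordProbabilities) {
            match *= p;
            nonMatch *= (1.0 - p);
        }
        return match / (match + nonMatch);
    }

    public static void main(String[] args) {
        System.out.println(combine(new double[] {0.9, 0.8}));
    }
}
```

With this formula, several moderately strong indicators reinforce each other: 0.9 and 0.8 combine to roughly 0.97.
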


normaliseSignificance

protected static double normaliseSignificance(double sig)
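
No description is given for this method. Judging by its name and the inherited LOWER_BOUND and UPPER_BOUND fields, it may clamp a value into that range so that no single word yields a probability of exactly 0 or 1. The sketch below shows such a clamp; the bounds 0.01 and 0.99 are assumed values for illustration, and NormaliseSketch is a hypothetical name.

```java
class NormaliseSketch {
    // Assumed bounds; the real values are IClassifier.LOWER_BOUND and IClassifier.UPPER_BOUND.
    static final double LOWER_BOUND = 0.01;
    static final double UPPER_BOUND = 0.99;

    // Clamps sig into [LOWER_BOUND, UPPER_BOUND].
    static double normalise(double sig) {
        return Math.max(LOWER_BOUND, Math.min(UPPER_BOUND, sig));
    }

    public static void main(String[] args) {
        System.out.println(normalise(1.0));
        System.out.println(normalise(0.0));
        System.out.println(normalise(0.5));
    }
}
```
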

isCaseSensitive

public boolean isCaseSensitive()
Returns:
true if the classifier is case sensitive, false otherwise (false by default)

setCaseSensitive

public void setCaseSensitive(boolean b)
Parameters:
b - true if the classifier should be case sensitive, false otherwise

getWordsDataSource

public IWordsDataSource getWordsDataSource()
Returns:
the IWordsDataSource used by this classifier

getTokenizer

public ITokenizer getTokenizer()
Returns:
the ITokenizer used by this classifier

getStopWordProvider

public IStopWordProvider getStopWordProvider()
Returns:
the IStopWordProvider used by this classifier

toString

public java.lang.String toString()
Overrides:
toString in class java.lang.Object


Copyright © 2003-2005 Nick Lothian. All Rights Reserved.