Using Classifier4J

The basic usage model for Classifier4J is as follows:

  1. Create an instance of an implementation of the IClassifier interface.
  2. Call either isMatch(String) (for boolean matching) or classify(String) (to get a match rating).

Basic usage

The simplest example possible is:

	SimpleClassifier classifier = new SimpleClassifier();
	classifier.setSearchWord( "java" );
	String sentence = "This is a sentence about java";
	System.out.println( "The string " + sentence +
		" contains the word java:" + classifier.isMatch(sentence) );

The SimpleClassifier class is an implementation of IClassifier that looks in the string passed to it for the word set using setSearchWord(String). It isn't very useful for real-world work, but it can be used for testing.
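
To make the difference between a match rating and a boolean match concrete, here is a minimal, self-contained sketch. It does not use Classifier4J: the class WordMatchSketch, its CUTOFF value and its scoring rule are all assumptions made for illustration, and the library's own classes may behave differently.

```java
/** Hypothetical illustration only; not part of Classifier4J. */
final class WordMatchSketch {
    /** Arbitrary illustrative threshold; ratings at or above it count as a match. */
    static final double CUTOFF = 0.5;

    private final String searchWord;

    WordMatchSketch(String searchWord) {
        this.searchWord = searchWord.toLowerCase();
    }

    /** Match rating: 1.0 if the input contains the search word (ignoring case), else 0.0. */
    double classify(String input) {
        boolean found = input != null && input.toLowerCase().contains(searchWord);
        return found ? 1.0 : 0.0;
    }

    /** Boolean matching: the rating compared against the cutoff. */
    boolean isMatch(String input) {
        return classify(input) >= CUTOFF;
    }
}
```

In this sketch isMatch is just classify compared against a fixed threshold, which is one simple way the two calls can relate to each other.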

Using BayesianClassifier

The BayesianClassifier is an implementation of the IClassifier interface which uses Bayes' theorem to rate the text against a known input.

	IWordsDataSource wds = new SimpleWordsDataSource();
	IClassifier classifier = new BayesianClassifier(wds);
	System.out.println( "Matches = " + classifier.classify("This is a sentence") );

Some applications will find the JDBCWordsDataSource more useful than the SimpleWordsDataSource. This can be used almost as simply:

	DriverMangerJDBCConnectionManager cm = new DriverMangerJDBCConnectionManager(JDBCConnectionString, username, password);
	JDBCWordsDataSource wds = new JDBCWordsDataSource(cm);
	IClassifier classifier = new BayesianClassifier(wds);

However, the performance of the JDBCWordsDataSource is quite poor. If performance is a concern, the JDBMWordsDataSource (in the Classifier4J-Optional download) may be a better option.

The BayesianClassifier can be trained using the teachMatch and teachNonMatch methods. Note that it must be trained with both matches and non-matches for the algorithm to work.
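
The need for both kinds of training can be seen in a small, self-contained sketch of a naive Bayesian word scorer. This is not Classifier4J's implementation: the class TinyBayes, the clamping to 0.01 and 0.99, and the neutral 0.5 for unseen words are assumptions chosen for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration only; not part of Classifier4J. */
final class TinyBayes {
    private final Map<String, Integer> matchCounts = new HashMap<>();
    private final Map<String, Integer> nonMatchCounts = new HashMap<>();

    void teachMatch(String text) { count(text, matchCounts); }

    void teachNonMatch(String text) { count(text, nonMatchCounts); }

    private static void count(String text, Map<String, Integer> counts) {
        for (String w : text.toLowerCase().split("\\W+")) {
            if (!w.isEmpty()) counts.merge(w, 1, Integer::sum);
        }
    }

    /** Probability that one word indicates a match, clamped to [0.01, 0.99]. */
    double wordProbability(String word) {
        int m = matchCounts.getOrDefault(word, 0);
        int n = nonMatchCounts.getOrDefault(word, 0);
        if (m + n == 0) return 0.5; // unseen word: treated as neutral
        double p = (double) m / (m + n);
        return Math.max(0.01, Math.min(0.99, p));
    }

    /** Combines the per-word probabilities with Bayes' theorem, assuming independent words. */
    double classify(String text) {
        double product = 1.0; // product of p
        double inverse = 1.0; // product of (1 - p)
        for (String w : text.toLowerCase().split("\\W+")) {
            if (w.isEmpty()) continue;
            double p = wordProbability(w);
            product *= p;
            inverse *= 1.0 - p;
        }
        return product / (product + inverse);
    }
}
```

If only teachMatch were ever called, every known word would score 0.99 in this sketch, so any text containing a known word would look like a match. The non-match counts are what let a word be judged as evidence against a match.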

Using VectorClassifier

The VectorClassifier is an implementation of IClassifier that uses the vector space search algorithm. This algorithm is quite fast (compared to the Bayesian algorithm) and does not require training of non-matches. It also has the advantage that its match ratings (as returned by the classify method) are fairly well distributed, unlike the BayesianClassifier, which tends to return 0.99 or 0.01. This characteristic makes it well suited to categorization tasks.

Sample code:

    TermVectorStorage storage = new HashMapTermVectorStorage();
    VectorClassifier vc = new VectorClassifier(storage);

    vc.teachMatch("category", "hello there is this a long sentence yes it is blah blah hello.");
    double result = vc.classify("category", "hello blah");


Currently it has the disadvantage that, once a category has been trained, it is impossible to add more training to it incrementally.
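
The general idea behind vector space matching can be sketched without the library: each text becomes a vector of term counts, and the rating is the cosine of the angle between two such vectors. The class CosineSketch below is hypothetical, and its tokenisation and weighting are assumptions; it is not a description of how VectorClassifier is implemented.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration only; not part of Classifier4J. */
final class CosineSketch {
    /** Builds a term-count vector from lower-cased words. */
    static Map<String, Integer> termVector(String text) {
        Map<String, Integer> vector = new HashMap<>();
        for (String w : text.toLowerCase().split("\\W+")) {
            if (!w.isEmpty()) vector.merge(w, 1, Integer::sum);
        }
        return vector;
    }

    /** Cosine similarity: dot product divided by the product of the vector lengths. */
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
        }
        double denominator = length(a) * length(b);
        return denominator == 0.0 ? 0.0 : dot / denominator;
    }

    private static double length(Map<String, Integer> vector) {
        double sumOfSquares = 0.0;
        for (int count : vector.values()) sumOfSquares += (double) count * count;
        return Math.sqrt(sumOfSquares);
    }

    /** Convenience wrapper: similarity of two raw strings. */
    static double similarity(String a, String b) {
        return cosine(termVector(a), termVector(b));
    }
}
```

Because the result varies smoothly between 0 (no shared terms) and 1 (identical term proportions), ratings of this kind spread across the range rather than clustering at the extremes.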

Using ISummariser

Using the ISummariser is very simple. Give it some input and decide how many sentences you'd like the summary to contain.

	String input = "Classifier4J is a java package for working with text. Classifier4J includes a summariser. A Summariser allows the summary of text. A Summariser is really cool. I don't think there are any other java summarisers.";
	String result = summariser.summarise(input, 2);

returns:

	Classifier4J is a java package for working with text. Classifier4J includes a summariser.

That would be kind of boring, except that

	String result = summariser.summarise(input, 1);

returns:

	Classifier4J includes a summariser.
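
One plausible way a summary can skip the first sentence is word-frequency selection: find the most frequent words, then keep the first sentence that mentions each one. The sketch below is a guess at that general technique, not the library's algorithm; the class SummarySketch, its stop-word list and its tie-breaking rule are all assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical illustration only; not part of Classifier4J. */
final class SummarySketch {
    // Small illustrative stop-word list (an assumption, not the library's).
    private static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "a", "an", "the", "is", "are", "for", "with", "of", "and", "to", "in", "it"));

    /** Lower-cased words of the text, with stop words removed. */
    private static List<String> words(String text) {
        List<String> result = new ArrayList<>();
        for (String w : text.toLowerCase().split("\\W+")) {
            if (!w.isEmpty() && !STOP_WORDS.contains(w)) result.add(w);
        }
        return result;
    }

    static String summarise(String input, int numberOfSentences) {
        // Count word frequencies; LinkedHashMap keeps first-appearance order.
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String w : words(input)) counts.merge(w, 1, Integer::sum);

        // Most frequent first; the sort is stable, so ties keep first-appearance order.
        List<String> ranked = new ArrayList<>(counts.keySet());
        ranked.sort((x, y) -> counts.get(y) - counts.get(x));

        // Split after sentence-ending punctuation followed by whitespace.
        String[] sentences = input.trim().split("(?<=[.!?])\\s+");

        // For each top word, choose the first sentence that contains it.
        int top = Math.max(0, Math.min(numberOfSentences, ranked.size()));
        Set<Integer> chosen = new TreeSet<>();
        for (String word : ranked.subList(0, top)) {
            for (int i = 0; i < sentences.length; i++) {
                if (words(sentences[i]).contains(word)) {
                    chosen.add(i);
                    break;
                }
            }
        }

        // Emit the chosen sentences in their original order.
        StringBuilder out = new StringBuilder();
        for (int i : chosen) {
            if (out.length() > 0) out.append(' ');
            out.append(sentences[i]);
        }
        return out.toString();
    }
}
```

On the sample input above, the most frequent remaining word is "summariser", whose first mention is in the second sentence, so a one-sentence summary from this sketch is that second sentence. That agreement with the outputs shown does not confirm the library works this way.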