net.sf.classifier4J
Class DefaultTokenizer

java.lang.Object
  |
  +--net.sf.classifier4J.DefaultTokenizer
All Implemented Interfaces:
ITokenizer
Direct Known Subclasses:
SimpleHTMLTokenizer

public class DefaultTokenizer
extends java.lang.Object
implements ITokenizer

Author:
Peter Leschev

Field Summary
static int BREAK_ON_WHITESPACE
          Use the "\s" (whitespace) regexp to split the string passed to classify.
static int BREAK_ON_WORD_BREAKS
          Use the "\W" (non-word characters) regexp to split the string passed to classify.
 
Constructor Summary
DefaultTokenizer()
          Constructor that uses the BREAK_ON_WORD_BREAKS tokenizer config by default.
DefaultTokenizer(int tokenizerConfig)
           
DefaultTokenizer(java.lang.String regularExpression)
           
 
Method Summary
 java.lang.String getCustomTokenizerRegExp()
           
 int getTokenizerConfig()
           
 void setCustomTokenizerRegExp(java.lang.String string)
          Allows the use of custom regular expressions to split up the input to IClassifier.classify(java.lang.String).
 void setTokenizerConfig(int tokConfig)
           
 java.lang.String[] tokenize(java.lang.String input)
          Splits up the string passed into the tokens which have individual probabilities.
 java.lang.String toString()
           
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Field Detail

BREAK_ON_WORD_BREAKS

public static int BREAK_ON_WORD_BREAKS
Use the "\W" (non-word characters) regexp to split the string passed to classify.


BREAK_ON_WHITESPACE

public static int BREAK_ON_WHITESPACE
Use the "\s" (whitespace) regexp to split the string passed to classify.
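
The difference between the two patterns named above can be illustrated with plain java.lang.String.split. This is only a sketch: the class and method names are hypothetical, it does not call classifier4J, and it assumes the patterns are applied through String.split, which this page does not state.

```java
import java.util.Arrays;

// Hypothetical demo class; it does not use classifier4J itself.
class SplitModesSketch {

    // Split on single non-word characters, the pattern named by BREAK_ON_WORD_BREAKS.
    static String[] splitOnWordBreaks(String input) {
        return input.split("\\W");
    }

    // Split on single whitespace characters, the pattern named by BREAK_ON_WHITESPACE.
    static String[] splitOnWhitespace(String input) {
        return input.split("\\s");
    }

    public static void main(String[] args) {
        String text = "Buy now: it's free!";

        // Punctuation is a separator here, so "it's" becomes "it" and "s", and the
        // adjacent ':' and ' ' leave an empty token between them.
        System.out.println(Arrays.toString(splitOnWordBreaks(text)));
        // prints [Buy, now, , it, s, free]

        // Only whitespace separates tokens, so punctuation stays attached to words.
        System.out.println(Arrays.toString(splitOnWhitespace(text)));
        // prints [Buy, now:, it's, free!]
    }
}
```

Because each pattern matches a single character, two adjacent separators produce an empty token under String.split; callers that do not want empty tokens would need to filter them out.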

Constructor Detail

DefaultTokenizer

public DefaultTokenizer()
Constructor that uses the BREAK_ON_WORD_BREAKS tokenizer config by default.


DefaultTokenizer

public DefaultTokenizer(int tokenizerConfig)

DefaultTokenizer

public DefaultTokenizer(java.lang.String regularExpression)

Method Detail

getCustomTokenizerRegExp

public java.lang.String getCustomTokenizerRegExp()
Returns:
the custom regular expression to use for tokenize(String)

getTokenizerConfig

public int getTokenizerConfig()
Returns:
The configuration setting used by tokenize(String).

setCustomTokenizerRegExp

public void setCustomTokenizerRegExp(java.lang.String string)

Allows the use of custom regular expressions to split up the input to IClassifier.classify(java.lang.String). Note that this regular expression will only be used if tokenizerConfig is set to BREAK_ON_CUSTOM_REGEXP.

Parameters:
string - the custom regular expression to use for tokenize(String). Must not be null.
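
The interaction described above (a custom expression is stored, but only takes effect once the configuration selects it) can be sketched with a small stand-in class. Everything here is hypothetical: the class name, the constant values, and the exceptions thrown are assumptions made for illustration, and this is not the classifier4J implementation.

```java
import java.util.Arrays;

// Hypothetical stand-in for illustration; NOT the classifier4J implementation.
class CustomRegexTokenizerSketch {
    // Constant values are assumed; this page does not document the real ones.
    static final int BREAK_ON_WORD_BREAKS = 1;
    static final int BREAK_ON_WHITESPACE = 2;
    static final int BREAK_ON_CUSTOM_REGEXP = 3;

    private int tokenizerConfig = BREAK_ON_WORD_BREAKS;
    private String customTokenizerRegExp;

    void setTokenizerConfig(int tokConfig) {
        if (tokConfig < BREAK_ON_WORD_BREAKS || tokConfig > BREAK_ON_CUSTOM_REGEXP) {
            throw new IllegalArgumentException("unknown tokenizer config: " + tokConfig);
        }
        this.tokenizerConfig = tokConfig;
    }

    void setCustomTokenizerRegExp(String regExp) {
        if (regExp == null) {
            throw new IllegalArgumentException("regular expression must not be null");
        }
        this.customTokenizerRegExp = regExp;
    }

    String[] tokenize(String input) {
        if (input == null) {
            return new String[0]; // never return null
        }
        switch (tokenizerConfig) {
            case BREAK_ON_WHITESPACE:
                return input.split("\\s");
            case BREAK_ON_CUSTOM_REGEXP:
                if (customTokenizerRegExp == null) {
                    throw new IllegalStateException("custom mode selected but no expression set");
                }
                return input.split(customTokenizerRegExp);
            default:
                return input.split("\\W");
        }
    }

    public static void main(String[] args) {
        CustomRegexTokenizerSketch tokenizer = new CustomRegexTokenizerSketch();
        tokenizer.setCustomTokenizerRegExp("[,;]");

        // Still in word-break mode, so the custom expression is ignored.
        System.out.println(Arrays.toString(tokenizer.tokenize("a,b;c d")));

        // After switching modes, only ',' and ';' separate tokens.
        tokenizer.setTokenizerConfig(BREAK_ON_CUSTOM_REGEXP);
        System.out.println(Arrays.toString(tokenizer.tokenize("a,b;c d")));
    }
}
```

In this sketch the first call yields the four tokens a, b, c and d, while the second keeps "c d" together because the space is no longer a separator.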

setTokenizerConfig

public void setTokenizerConfig(int tokConfig)
Parameters:
tokConfig - The configuration setting for use by tokenize(String). Valid values are BREAK_ON_CUSTOM_REGEXP, BREAK_ON_WORD_BREAKS and BREAK_ON_WHITESPACE.

tokenize

public java.lang.String[] tokenize(java.lang.String input)
Description copied from interface: ITokenizer

Splits up the string passed into the tokens which have individual probabilities.

Specified by:
tokenize in interface ITokenizer
Returns:
Must never be null; implementations should return an empty array of Strings if there are no elements to return.
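
The return contract above means callers can use the result without a null check. The following is a minimal sketch of a method that honours it; the class and method names are hypothetical, and dropping empty tokens is a choice made for this sketch, not documented behaviour of DefaultTokenizer.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of classifier4J.
class NullSafeTokenizeSketch {

    // Returns an empty array, never null, when there is nothing to return.
    static String[] tokenize(String input) {
        if (input == null) {
            return new String[0];
        }
        List<String> tokens = new ArrayList<>();
        for (String piece : input.split("\\W")) {
            // Sketch-only choice: skip the empty strings that split() can produce.
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens.toArray(new String[0]);
    }

    // A caller can rely on the contract and read the length directly.
    static int countTokens(String input) {
        return tokenize(input).length;
    }

    public static void main(String[] args) {
        System.out.println(countTokens(null));            // prints 0
        System.out.println(countTokens("!!!"));           // prints 0
        System.out.println(countTokens("Hello, world!")); // prints 2
    }
}
```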

toString

public java.lang.String toString()
Overrides:
toString in class java.lang.Object


Copyright © 2003-2005 Nick Lothian. All Rights Reserved.