java.lang.Object
  |
  +--net.sf.classifier4J.DefaultTokenizer
        |
        +--net.sf.classifier4J.SimpleHTMLTokenizer
Simple HTML Tokenizer. Its goal is to tokenize words that would be displayed in a normal web browser.
It does not handle meta tags or alt or text attributes, but it does remove CSS style definitions and JavaScript code.
It handles entity references by replacing them with a space (!!). This behavior can be overridden in a subclass via resolveEntities.
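The behavior described above can be approximated in plain Java. The following is a minimal sketch, not the library's implementation: the class name HtmlTokenizeSketch, its regular expressions, and the choice of splitting on non-word characters are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stand-in that mimics the documented behavior; not part of classifier4J.
class HtmlTokenizeSketch {
    // Whole <script>...</script> and <style>...</style> blocks, case-insensitive, spanning lines.
    private static final Pattern SCRIPT_OR_STYLE =
            Pattern.compile("(?is)<(script|style)\\b[^>]*>.*?</\\1\\s*>");
    // Any remaining tag, including its attributes (so alt/meta content is dropped, not tokenized).
    private static final Pattern TAG = Pattern.compile("(?s)<[^>]*>");
    // Named or numeric entity references such as &nbsp; or &#233;.
    private static final Pattern ENTITY = Pattern.compile("&#?[A-Za-z0-9]+;");

    static String[] tokenize(String html) {
        if (html == null) {
            return new String[0];
        }
        String text = SCRIPT_OR_STYLE.matcher(html).replaceAll(" ");
        text = TAG.matcher(text).replaceAll(" ");
        // Entity references become a single space, as the class description states.
        text = ENTITY.matcher(text).replaceAll(" ");

        List<String> tokens = new ArrayList<>();
        // Assumed approximation of word-break splitting: split on runs of non-word characters.
        for (String token : text.split("\\W+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens.toArray(new String[0]);
    }
}
```

With the library itself on the classpath, the equivalent call would presumably be new SimpleHTMLTokenizer().tokenize(html), using the constructor and method documented on this page.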
| Field Summary | 
| Fields inherited from class net.sf.classifier4J.DefaultTokenizer | 
| BREAK_ON_WHITESPACE, BREAK_ON_WORD_BREAKS | 
| Constructor Summary | |
| SimpleHTMLTokenizer(): Constructor that uses the BREAK_ON_WORD_BREAKS tokenizer config by default. | |
| SimpleHTMLTokenizer(int tokenizerConfig) | |
| SimpleHTMLTokenizer(java.lang.String regularExpression) | |
| Method Summary | |
| protected  java.lang.String | resolveEntities(java.lang.String contentsWithUnresolvedEntityReferences): Replaces entity references with spaces. | 
|  java.lang.String[] | tokenize(java.lang.String input): Splits up the string passed into the tokens which have individual probabilities. | 
| Methods inherited from class net.sf.classifier4J.DefaultTokenizer | 
| getCustomTokenizerRegExp, getTokenizerConfig, setCustomTokenizerRegExp, setTokenizerConfig, toString | 
| Methods inherited from class java.lang.Object | 
| clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait | 
| Constructor Detail | 
public SimpleHTMLTokenizer()
public SimpleHTMLTokenizer(int tokenizerConfig)
public SimpleHTMLTokenizer(java.lang.String regularExpression)
| Method Detail | 
protected java.lang.String resolveEntities(java.lang.String contentsWithUnresolvedEntityReferences)
Parameters: contentsWithUnresolvedEntityReferences - the contents containing the unresolved entity references
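Because resolveEntities is protected, a subclass can replace the default space substitution. The helper below is a hedged sketch of what such an override might delegate to: the class EntityResolverSketch, its small entity table, and the fallback to a space for unknown entities are assumptions, not library code.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; an override could look like:
//   protected String resolveEntities(String contents) {
//       return EntityResolverSketch.resolve(contents);
//   }
class EntityResolverSketch {
    // Matches decimal numeric references (&#65;) and named references (&amp;).
    private static final Pattern ENTITY =
            Pattern.compile("&(#[0-9]+|[A-Za-z][A-Za-z0-9]*);");

    // Deliberately tiny table of named entities, for illustration only.
    private static final Map<String, String> NAMED = new HashMap<>();
    static {
        NAMED.put("amp", "&");
        NAMED.put("lt", "<");
        NAMED.put("gt", ">");
        NAMED.put("quot", "\"");
        NAMED.put("nbsp", " ");
    }

    static String resolve(String contents) {
        Matcher matcher = ENTITY.matcher(contents);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            String name = matcher.group(1);
            String replacement = name.startsWith("#")
                    ? decodeNumeric(name.substring(1))
                    : NAMED.getOrDefault(name, " "); // unknown names fall back to a space
            // quoteReplacement keeps '$' and '\' in the replacement literal.
            matcher.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        matcher.appendTail(out);
        return out.toString();
    }

    private static String decodeNumeric(String digits) {
        try {
            int codePoint = Integer.parseInt(digits);
            return Character.isValidCodePoint(codePoint)
                    ? new String(Character.toChars(codePoint))
                    : " ";
        } catch (NumberFormatException e) {
            return " "; // value too large to parse: fall back to a space
        }
    }
}
```

Resolving entities to real characters keeps words such as "caf&#233;" intact as one token, whereas the default space substitution would split them in two.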
public java.lang.String[] tokenize(java.lang.String input)
Description copied from interface: ITokenizer
Splits up the string passed into the tokens which have individual probabilities.
Specified by: tokenize in interface ITokenizer
Overrides: tokenize in class DefaultTokenizer
See Also: ITokenizer.tokenize(java.lang.String)