java.lang.Object
  |
  +--net.sf.classifier4J.DefaultTokenizer
        |
        +--net.sf.classifier4J.SimpleHTMLTokenizer
Simple HTML tokenizer. Its goal is to tokenize the words that would be displayed in a normal web browser.
It does not handle meta tags or alt or text attributes, but it does remove CSS style definitions and JavaScript code.
Note that it handles entity references by replacing each one with a space rather than decoding it. Subclasses can change this by overriding resolveEntities.
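The behavior described above can be approximated with plain `java.util.regex`. The following is a minimal, self-contained sketch and not Classifier4J's actual implementation: the class name `HtmlTokenizeSketch`, its regular expressions, and the choice to split on non-word characters (standing in for BREAK_ON_WORD_BREAKS) are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Illustrative stand-in only; this is NOT the SimpleHTMLTokenizer source.
public class HtmlTokenizeSketch {
    // Assumption: script and style elements are dropped together with their contents.
    private static final Pattern SCRIPT_OR_STYLE = Pattern.compile(
            "<(script|style)\\b[^>]*>.*?</\\1\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    // Assumption: every remaining tag becomes a space, so attributes such as alt are ignored.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");
    // Assumption: named and numeric entity references, e.g. &amp; or &#160;.
    private static final Pattern ENTITY = Pattern.compile("&#?[A-Za-z0-9]+;");

    public static String[] tokenize(String html) {
        String text = SCRIPT_OR_STYLE.matcher(html).replaceAll(" ");
        text = TAG.matcher(text).replaceAll(" ");
        // Mirrors the documented default: entity references are replaced with a space.
        text = ENTITY.matcher(text).replaceAll(" ");

        List<String> tokens = new ArrayList<>();
        for (String t : text.split("\\W+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String html = "<html><style>p { color: red; }</style>"
                + "<body><p>Fish &amp; chips</p><script>var x = 1;</script></body></html>";
        System.out.println(String.join(",", tokenize(html))); // prints "Fish,chips"
    }
}
```

With Classifier4J on the classpath, the equivalent call would go through `new SimpleHTMLTokenizer().tokenize(input)`; the exact tokens it returns may differ from this sketch.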
Field Summary
Fields inherited from class net.sf.classifier4J.DefaultTokenizer
BREAK_ON_WHITESPACE, BREAK_ON_WORD_BREAKS
Constructor Summary
SimpleHTMLTokenizer()
Constructor that uses the BREAK_ON_WORD_BREAKS tokenizer configuration by default.
SimpleHTMLTokenizer(int tokenizerConfig)
SimpleHTMLTokenizer(java.lang.String regularExpression)
Method Summary
protected java.lang.String resolveEntities(java.lang.String contentsWithUnresolvedEntityReferences)
Replaces entity references with spaces.
java.lang.String[] tokenize(java.lang.String input)
Splits the given string into the tokens that have individual probabilities.
Methods inherited from class net.sf.classifier4J.DefaultTokenizer
getCustomTokenizerRegExp, getTokenizerConfig, setCustomTokenizerRegExp, setTokenizerConfig, toString
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
Constructor Detail
public SimpleHTMLTokenizer()
public SimpleHTMLTokenizer(int tokenizerConfig)
public SimpleHTMLTokenizer(java.lang.String regularExpression)
Method Detail
protected java.lang.String resolveEntities(java.lang.String contentsWithUnresolvedEntityReferences)
Parameters:
contentsWithUnresolvedEntityReferences - the contents with the entity references
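Because `resolveEntities` is protected, a subclass can change how entity references are handled. The sketch below is self-contained so that it compiles without Classifier4J: `BaseSketch` is a hypothetical stand-in for `SimpleHTMLTokenizer`, and the small entity table in `DecodingSketch` is an assumption for illustration, not something the library provides.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ResolveEntitiesSketch {
    // Assumption: named and numeric entity references, e.g. &amp; or &#160;.
    private static final Pattern ENTITY = Pattern.compile("&#?[A-Za-z0-9]+;");

    // Hypothetical stand-in for SimpleHTMLTokenizer: every entity reference becomes a space.
    static class BaseSketch {
        protected String resolveEntities(String contentsWithUnresolvedEntityReferences) {
            return ENTITY.matcher(contentsWithUnresolvedEntityReferences).replaceAll(" ");
        }
    }

    // Overrides the hook: decodes a few known entities, falls back to a space otherwise.
    static class DecodingSketch extends BaseSketch {
        private static final Map<String, String> KNOWN = new HashMap<>();
        static {
            KNOWN.put("&lt;", "<");
            KNOWN.put("&gt;", ">");
            KNOWN.put("&quot;", "\"");
            KNOWN.put("&amp;", "&");
        }

        @Override
        protected String resolveEntities(String contentsWithUnresolvedEntityReferences) {
            Matcher m = ENTITY.matcher(contentsWithUnresolvedEntityReferences);
            StringBuffer out = new StringBuffer();
            // Single pass, so already-decoded text is never decoded a second time.
            while (m.find()) {
                String decoded = KNOWN.getOrDefault(m.group(), " ");
                m.appendReplacement(out, Matcher.quoteReplacement(decoded));
            }
            m.appendTail(out);
            return out.toString();
        }
    }

    // Convenience entry point for callers outside the nested classes.
    public static String decode(String contents) {
        return new DecodingSketch().resolveEntities(contents);
    }

    public static void main(String[] args) {
        System.out.println(decode("a &lt; b &amp;&nbsp;c")); // prints "a < b & c"
    }
}
```

With the real library, the same idea would be a subclass of `SimpleHTMLTokenizer` that overrides `resolveEntities`. Note that decoding entities into punctuation only changes the resulting tokens if the tokenizer configuration keeps those characters.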
public java.lang.String[] tokenize(java.lang.String input)
Description copied from interface: ITokenizer
Splits the given string into the tokens that have individual probabilities.
Specified by:
tokenize in interface ITokenizer
Overrides:
tokenize in class DefaultTokenizer
See Also:
ITokenizer.tokenize(java.lang.String)