net.sf.classifier4J
Class SimpleHTMLTokenizer

java.lang.Object
  |
  +--net.sf.classifier4J.DefaultTokenizer
        |
        +--net.sf.classifier4J.SimpleHTMLTokenizer
All Implemented Interfaces:
ITokenizer

public class SimpleHTMLTokenizer
extends DefaultTokenizer

Simple HTML Tokenizer. Its goal is to tokenize words that would be displayed in a normal web browser.

It does not handle meta tags, alt or text attributes, but it does remove CSS style definitions and JavaScript code.

It handles entity references by replacing each one with a space. Subclasses can change this behavior by overriding resolveEntities.
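
The behavior described above can be approximated with a short, self-contained sketch. It is not the classifier4J source: the class name HtmlTokenizeSketch, the three regular expressions, and the choice to split on non-word characters (a rough stand-in for BREAK_ON_WORD_BREAKS) are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stand-in for illustration only; NOT the classifier4J implementation.
class HtmlTokenizeSketch {

    // Assumption: script and style elements are dropped together with their contents.
    private static final Pattern SCRIPT_OR_STYLE = Pattern.compile(
            "<(script|style)\\b[^>]*>.*?</\\1\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Assumption: every remaining tag becomes a space, so attribute values are never tokenized.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    // Assumption: named and numeric entity references such as &amp; or &#160;.
    private static final Pattern ENTITY = Pattern.compile("&#?[A-Za-z0-9]+;");

    static String[] tokenize(String html) {
        if (html == null) {
            return new String[0]; // never return null
        }
        String text = SCRIPT_OR_STYLE.matcher(html).replaceAll(" ");
        text = TAG.matcher(text).replaceAll(" ");
        text = ENTITY.matcher(text).replaceAll(" ");

        // Split on runs of non-word characters and discard empty fragments.
        List<String> tokens = new ArrayList<>();
        for (String token : text.split("\\W+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String html = "<html><head><style>p { color: red; }</style></head>"
                + "<body><p>Fish &amp; chips</p><script>var x = 1;</script></body></html>";
        System.out.println(String.join(",", tokenize(html))); // prints "Fish,chips"
    }
}
```

With the library on the classpath, the corresponding call would be new SimpleHTMLTokenizer().tokenize(html); the exact tokens it returns may differ from this sketch.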

Since:
18 Nov 2003
Author:
Nick Lothian

Field Summary
 
Fields inherited from class net.sf.classifier4J.DefaultTokenizer
BREAK_ON_WHITESPACE, BREAK_ON_WORD_BREAKS
 
Constructor Summary
SimpleHTMLTokenizer()
          Constructor that uses the BREAK_ON_WORD_BREAKS tokenizer config by default
SimpleHTMLTokenizer(int tokenizerConfig)
           
SimpleHTMLTokenizer(java.lang.String regularExpression)
           
 
Method Summary
protected  java.lang.String resolveEntities(java.lang.String contentsWithUnresolvedEntityReferences)
          Replaces entity references with spaces
 java.lang.String[] tokenize(java.lang.String input)
          Splits the given string into the tokens that have individual probabilities.
 
Methods inherited from class net.sf.classifier4J.DefaultTokenizer
getCustomTokenizerRegExp, getTokenizerConfig, setCustomTokenizerRegExp, setTokenizerConfig, toString
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Constructor Detail

SimpleHTMLTokenizer

public SimpleHTMLTokenizer()
Constructor that uses the BREAK_ON_WORD_BREAKS tokenizer config by default


SimpleHTMLTokenizer

public SimpleHTMLTokenizer(int tokenizerConfig)

SimpleHTMLTokenizer

public SimpleHTMLTokenizer(java.lang.String regularExpression)

Method Detail

resolveEntities

protected java.lang.String resolveEntities(java.lang.String contentsWithUnresolvedEntityReferences)
Replaces entity references with spaces

Parameters:
contentsWithUnresolvedEntityReferences - the contents with the entity references
Returns:
the contents with the entities replaced with spaces
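
Because resolveEntities is protected, a subclass can supply its own entity handling. The sketch below shows the override pattern with stand-in classes so that it compiles without the library. SpaceEntityResolver, DecodingEntityResolver, the ENTITY pattern, and the small lookup table are hypothetical and are not part of classifier4J. In real code the subclass would extend SimpleHTMLTokenizer instead.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the documented default: each entity reference becomes a space.
class SpaceEntityResolver {
    // Assumption: matches named and numeric references such as &amp; or &#160;.
    protected static final Pattern ENTITY = Pattern.compile("&#?[A-Za-z0-9]+;");

    protected String resolveEntities(String contentsWithUnresolvedEntityReferences) {
        return ENTITY.matcher(contentsWithUnresolvedEntityReferences).replaceAll(" ");
    }
}

// Hypothetical override: decodes a few well-known entities and uses a space for the rest.
class DecodingEntityResolver extends SpaceEntityResolver {
    private static final Map<String, String> KNOWN = new HashMap<>();
    static {
        KNOWN.put("&amp;", "&");
        KNOWN.put("&lt;", "<");
        KNOWN.put("&gt;", ">");
        KNOWN.put("&quot;", "\"");
    }

    @Override
    protected String resolveEntities(String contentsWithUnresolvedEntityReferences) {
        Matcher m = ENTITY.matcher(contentsWithUnresolvedEntityReferences);
        StringBuffer out = new StringBuffer();
        // Single pass, so a decoded "&" is never re-read as the start of another entity.
        while (m.find()) {
            String replacement = KNOWN.getOrDefault(m.group(), " ");
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Whether decoding helps depends on the tokenizer configuration: with word-break splitting, characters such as < or & are likely to act as separators anyway, so the default space replacement often yields the same tokens.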

tokenize

public java.lang.String[] tokenize(java.lang.String input)
Description copied from interface: ITokenizer

Splits the given string into the tokens that have individual probabilities.

Specified by:
tokenize in interface ITokenizer
Overrides:
tokenize in class DefaultTokenizer
Returns:
The tokens. Should never be null; if there are no tokens to return, an empty array of Strings should be returned instead.
See Also:
ITokenizer.tokenize(java.lang.String)


Copyright © 2003-2005 Nick Lothian. All Rights Reserved.