SimpleHTMLTokenizer (Classifier4J 0.6 API)

net.sf.classifier4J
Class SimpleHTMLTokenizer

java.lang.Object
  |
  +--net.sf.classifier4J.DefaultTokenizer
        |
        +--net.sf.classifier4J.SimpleHTMLTokenizer

All Implemented Interfaces: ITokenizer

public class SimpleHTMLTokenizer
extends DefaultTokenizer

Simple HTML Tokenizer. Its goal is to tokenize words that would be displayed in a normal web browser.

It does not handle meta tags, alt or text attributes, but it does remove CSS style definitions and JavaScript code.

Entity references are not decoded: each one is replaced with a space. Subclasses can change this behavior by overriding resolveEntities.
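
The behavior described above can be illustrated with a small, self-contained sketch. It is not the Classifier4J source: the class name HtmlTokenizerSketch, the regular expressions, and the word-break rule are all assumptions made for illustration, and only the JDK is used. It shows script and style contents being dropped, tags being stripped, and each entity reference becoming a space before the text is split into words.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in, NOT the Classifier4J implementation. It only mirrors
// the behavior described in the class comment.
class HtmlTokenizerSketch {

    // Mirrors the documented default: every entity reference becomes a space.
    // The pattern is an assumption; subclasses may override this method.
    protected String resolveEntities(String contents) {
        return contents.replaceAll("&#?[A-Za-z0-9]+;", " ");
    }

    public String[] tokenize(String input) {
        // Drop <script> and <style> elements together with their contents.
        String text = input.replaceAll("(?is)<(script|style)\\b.*?</\\1\\s*>", " ");
        // Drop all remaining tags, so attribute values are never tokenized.
        text = text.replaceAll("(?s)<[^>]*>", " ");
        text = resolveEntities(text);
        // Assumed word-break rule: split on runs of non-word characters.
        List<String> tokens = new ArrayList<>();
        for (String part : text.split("\\W+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String html = "<html><head><style>p { color: red; }</style></head>"
                + "<body><p>Fish &amp; chips</p><script>var x = 1;</script></body></html>";
        // Prints: Fish,chips
        System.out.println(String.join(",", new HtmlTokenizerSketch().tokenize(html)));
    }
}
```

In the sketch, a subclass that overrides resolveEntities (for example, to map &amp;amp; to a word instead of a space) changes the resulting tokens without touching the tag-stripping step. With the library itself, the corresponding calls would be the SimpleHTMLTokenizer constructors and tokenize method listed in the summaries on this page.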

Since: 18 Nov 2003
Author: Nick Lothian

Field Summary

Fields inherited from class net.sf.classifier4J.DefaultTokenizer

BREAK_ON_WHITESPACE, BREAK_ON_WORD_BREAKS

Constructor Summary

SimpleHTMLTokenizer()
          Constructor that uses the BREAK_ON_WORD_BREAKS tokenizer config by default

SimpleHTMLTokenizer(int tokenizerConfig)


SimpleHTMLTokenizer(java.lang.String regularExpression)


Method Summary

protected java.lang.String resolveEntities(java.lang.String contentsWithUnresolvedEntityReferences)
          Replaces entity references with spaces

java.lang.String[] tokenize(java.lang.String input)
          Splits the input string into tokens, each of which has an individual probability.

Methods inherited from class net.sf.classifier4J.DefaultTokenizer

getCustomTokenizerRegExp, getTokenizerConfig, setCustomTokenizerRegExp, setTokenizerConfig, toString

Methods inherited from class java.lang.Object

clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait

Constructor Detail