java.lang.Object
  org.apache.uima.internal.util.TextStringTokenizer
public class TextStringTokenizer
An implementation of a text tokenizer for whitespace-separated natural language text.
The tokenizer knows about four different character classes: regular word characters, whitespace characters, sentence delimiters and separator characters. Tokens can consist of
The character classes are completely user-definable. By default, whitespace characters are the Unicode whitespace characters. All other characters are word characters. The two separator classes are empty by default. The different classes may have non-empty intersections. When determining the class of a character, the user-defined classes are considered in the following order: end-of-sentence delimiters before other separators, separators before whitespace, and whitespace before word characters. That is, if a character is defined to be both a separator and a whitespace character, it will be considered to be a separator.
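The precedence rule can be illustrated with a small standalone sketch. It is not the UIMA implementation: the class CharClassSketch and the numeric values of the type constants are assumptions made for this example, and user-defined word characters are left out for brevity.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for illustration only; not the UIMA class.
final class CharClassSketch {
  // Constant names follow the Field Summary; the values are assumed.
  static final int EOS = 0;
  static final int SEP = 1;
  static final int WSP = 2;
  static final int WCH = 3;

  private final Set<Character> eosChars = new HashSet<>();
  private final Set<Character> sepChars = new HashSet<>();
  private final Set<Character> wspChars = new HashSet<>();

  void addToEndOfSentenceChars(String chars) { addAll(eosChars, chars); }

  void addSeparators(String chars) { addAll(sepChars, chars); }

  void addWhitespaceChars(String chars) { addAll(wspChars, chars); }

  private static void addAll(Set<Character> set, String chars) {
    for (int i = 0; i < chars.length(); i++) {
      set.add(chars.charAt(i));
    }
  }

  // Classes are checked in priority order, so a character that belongs to
  // several classes is reported as the first class that matches.
  int getCharType(char c) {
    if (eosChars.contains(c)) return EOS;
    if (sepChars.contains(c)) return SEP;
    if (wspChars.contains(c) || Character.isWhitespace(c)) return WSP;
    return WCH; // everything else is a word character
  }
}
```

With ";" registered as both a separator and a whitespace character, getCharType(';') in this sketch reports SEP, which matches the rule stated above.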
By default, the tokenizer will return all tokens, including whitespace. That is, appending the sequence of tokens will recover the original input text. This behavior can be changed so that whitespace and/or separator tokens are skipped.
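The round-trip property can be checked with a simplified standalone sketch. It only distinguishes whitespace from word characters, and the class RoundTripSketch and its showWhitespace parameter are assumptions that loosely mirror the setShowWhitespace flag; this is not the UIMA implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for illustration only; not the UIMA class.
final class RoundTripSketch {
  // Splits text into maximal runs of whitespace and non-whitespace.
  // Whitespace runs are dropped when showWhitespace is false.
  static List<String> tokenize(String text, boolean showWhitespace) {
    List<String> tokens = new ArrayList<>();
    int i = 0;
    while (i < text.length()) {
      boolean ws = Character.isWhitespace(text.charAt(i));
      int start = i;
      // Extend the run while the character class stays the same.
      while (i < text.length() && Character.isWhitespace(text.charAt(i)) == ws) {
        i++;
      }
      if (!ws || showWhitespace) {
        tokens.add(text.substring(start, i));
      }
    }
    return tokens;
  }

  public static void main(String[] args) {
    String input = "Hello  world ";
    // Joining all tokens, whitespace included, rebuilds the input.
    System.out.println(String.join("", tokenize(input, true)).equals(input));
    // With whitespace hidden, only the word tokens remain.
    System.out.println(tokenize(input, false));
  }
}
```

Once whitespace tokens are skipped, the original text can no longer be rebuilt from the tokens alone; token offsets such as those exposed by getTokenStart() and getTokenEnd() would be needed for that.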
A tokenizer provides a standard iterator interface similar to StringTokenizer. The validity of the iterator can be queried with hasNext(), and the next token can be queried with nextToken(). In addition, getNextTokenType() returns the type of the token as an integer. NB that you need to call getNextTokenType() before calling nextToken(), since calling nextToken() will advance the iterator.
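The ordering constraint on the type query can be shown with a standalone sketch. The class PeekTypeSketch is hypothetical and is not the UIMA class: it borrows the method names used in the description above, only distinguishes whitespace from word characters, and uses assumed values for the type constants.

```java
// Hypothetical stand-in for illustration only; not the UIMA class.
final class PeekTypeSketch {
  static final int WSP = 2; // assumed value
  static final int WCH = 3; // assumed value

  private final String text;
  private int pos = 0;

  PeekTypeSketch(String text) {
    this.text = text;
  }

  boolean hasNext() {
    return pos < text.length();
  }

  // Type of the token that the next call to nextToken() will return,
  // or -1 if there is no next token.
  int getNextTokenType() {
    if (!hasNext()) {
      return -1;
    }
    return Character.isWhitespace(text.charAt(pos)) ? WSP : WCH;
  }

  // Returns the next token and advances the iterator past it.
  String nextToken() {
    int type = getNextTokenType();
    int start = pos;
    while (hasNext() && getNextTokenType() == type) {
      pos++;
    }
    return text.substring(start, pos);
  }

  public static void main(String[] args) {
    PeekTypeSketch t = new PeekTypeSketch("two words");
    while (t.hasNext()) {
      int type = t.getNextTokenType(); // query the type first
      String token = t.nextToken();    // then consume the token
      System.out.println(type + " [" + token + "]");
    }
  }
}
```

Swapping the two calls inside the loop would report the type of the following token instead, because nextToken() has already moved the iterator.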
Field Summary | |
---|---|
static int | EOS: Sentence delimiter character/word type. |
static int | SEP: Separator character/word type. |
static int | WCH: Word character/word type. |
static int | WSP: Whitespace character/word type. |
Constructor Summary | |
---|---|
TextStringTokenizer(java.lang.String string) | Construct a tokenizer from a Java string. |
Method Summary | |
---|---|
void | addSeparators(java.lang.String chars): Add to the set of separator characters. |
void | addToEndOfSentenceChars(java.lang.String chars): Add to the set of sentence delimiters. |
void | addWhitespaceChars(java.lang.String chars): Add to the set of whitespace characters. |
void | addWordChars(java.lang.String chars): Add to the set of word characters. |
int | getCharType(char c): Get the type of an individual character. |
java.lang.String | getToken(): Return the next token. |
int | getTokenEnd(): Get the end of the token. |
int | getTokenStart(): Get the start of the token. |
int | getTokenType(): Get the type of the token returned by the next call to nextToken(). |
boolean | isValid(): Return true iff there is a next token. |
void | setEndOfSentenceChars(java.lang.String chars): Set the set of sentence delimiters. |
void | setSeparators(java.lang.String chars): Set the set of separator characters. |
void | setShowSeparators(boolean b): Set the flag for showing separator tokens. |
void | setShowWhitespace(boolean b): Set the flag for showing whitespace tokens. |
void | setToFirst(): Reset the tokenizer at any time. |
void | setToNext(): Compute the next token. |
void | setWhitespaceChars(java.lang.String chars): Set the set of whitespace characters (in addition to the Unicode whitespace chars). |
void | setWordChars(java.lang.String chars): Set the set of word characters. |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
public static final int EOS
public static final int SEP
public static final int WSP
public static final int WCH
Constructor Detail |
---|
public TextStringTokenizer(java.lang.String string)
Parameters:
string - The string to tokenize.
Method Detail |
---|
public void setShowWhitespace(boolean b)
Parameters:
b - The whitespace flag.
public void setShowSeparators(boolean b)
Parameters:
b - The flag.
public void setEndOfSentenceChars(java.lang.String chars)
Parameters:
chars - A string containing EOS chars.
public void addToEndOfSentenceChars(java.lang.String chars)
Parameters:
chars - A string containing EOS chars.
public void setSeparators(java.lang.String chars)
Parameters:
chars - The separator chars.
public void addSeparators(java.lang.String chars)
Parameters:
chars - Separator chars.
public void setWhitespaceChars(java.lang.String chars)
Parameters:
chars - Whitespace chars.
public void addWhitespaceChars(java.lang.String chars)
Parameters:
chars - Whitespace chars.
public void setWordChars(java.lang.String chars)
Parameters:
chars - Word chars.
public void addWordChars(java.lang.String chars)
Parameters:
chars - Word chars.
public int getTokenType()
Get the type of the token returned by the next call to nextToken().
Returns:
-1 if there is no next token.
public boolean isValid()
Returns:
true iff there is a next token.
public void setToFirst()
public java.lang.String getToken()
public int getTokenStart()
public int getTokenEnd()
public void setToNext()
public int getCharType(char c)