public class PreprocessingContext.AllTokens extends Object
Information about all tokens of the input PreprocessingContext.documents.
Each element of each of the arrays corresponds to one individual token from the
input or a synthetic separator inserted between documents, fields and sentences.
The last element of each array is a special terminator entry.
All arrays in this class have the same length; values at the same index across the different arrays describe the same token.
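The parallel-array layout described above can be sketched with hand-made example data (the field names mirror this class; the values are illustrative, not produced by the real Tokenizer):

```java
// Sketch of the AllTokens parallel-array layout. A token is "real" when its
// image is non-null; separators and the terminator carry a null image.
public class AllTokensSketch {
    static int countRealTokens(char[][] image) {
        int n = 0;
        for (char[] img : image) {
            if (img != null) n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // Two documents, "data mining" and "text mining", with a null
        // document separator between them and a null terminator at the end.
        char[][] image = {
            "data".toCharArray(), "mining".toCharArray(),
            null, // document separator
            "text".toCharArray(), "mining".toCharArray(),
            null  // terminator
        };
        // Same length as image; same index refers to the same token.
        int[] documentIndex = { 0, 0, -1, 1, 1, -1 };

        System.out.println("arrays aligned: "
            + (image.length == documentIndex.length));
        System.out.println("real tokens: " + countRealTokens(image));
    }
}
```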
| Modifier and Type | Field and Description |
|---|---|
| int[] | documentIndex: Index of the document this token came from, points to elements of PreprocessingContext.documents. |
| byte[] | fieldIndex: Document field the token came from. |
| char[][] | image: Token image as it appears in the input. |
| int[] | lcp: The Longest Common Prefix for the adjacent suffix-sorted token sequences. |
| int[] | suffixOrder: The suffix order of tokens. |
| short[] | type: Token's ITokenizer bit flags. |
| int[] | wordIndex: A pointer to PreprocessingContext.AllWords arrays for this token. |

| Constructor and Description |
|---|
| PreprocessingContext.AllTokens() |
public char[][] image
Token image as it appears in the input. On positions where type is
equal to one of ITokenizer.TF_TERMINATOR,
ITokenizer.TF_SEPARATOR_DOCUMENT or
ITokenizer.TF_SEPARATOR_FIELD, image is null.
This array is produced by Tokenizer.
public short[] type
Token's ITokenizer bit flags.
This array is produced by Tokenizer.
public byte[] fieldIndex
Document field the token came from. Points to elements of
PreprocessingContext.AllFields, equal to -1 for document and field separators.
This array is produced by Tokenizer.
public int[] documentIndex
Index of the document this token came from, points to elements of
PreprocessingContext.documents. Equal to -1 for document separators.
This array is produced by Tokenizer.
This array is accessed in CaseNormalizer and PhraseExtractor
to compute by-document statistics, e.g. tf by document, which are then needed
to build a VSM or assign documents to labels. An alternative to this representation
would be creating an AllDocuments holder that keeps an array
of start token indexes for each document, and then refactoring the model-building code
to do a binary search to determine the document index for a given token index. This is
likely to be a significant performance hit because the model-building code accesses
the documentIndex array pretty much randomly (in suffix order), so we'd be
doing twice-the-number-of-tokens binary searches. Unless there's some other
data structure that can help us here.
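The alternative weighed above (per-document start token indexes plus a binary search) can be sketched as follows; the class and method names are hypothetical, not part of this API:

```java
import java.util.Arrays;

// Hypothetical sketch of the alternative representation discussed above:
// store only the first token index of each document and recover a token's
// document index via binary search, instead of storing one int per token.
public class DocumentStarts {
    static int documentIndexOf(int[] documentStartToken, int tokenIndex) {
        int i = Arrays.binarySearch(documentStartToken, tokenIndex);
        // An exact hit means tokenIndex is a document's first token.
        // Otherwise binarySearch returns (-(insertionPoint) - 1), and the
        // token belongs to the document starting just before it.
        return i >= 0 ? i : -i - 2;
    }

    public static void main(String[] args) {
        // Three documents starting at tokens 0, 3 and 7.
        int[] starts = { 0, 3, 7 };
        System.out.println(documentIndexOf(starts, 5)); // token 5 -> document 1
        System.out.println(documentIndexOf(starts, 7)); // token 7 -> document 2
    }
}
```

This is exactly the lookup that would have to run once per access, which is why the text estimates twice-the-number-of-tokens binary searches for the model-building pass.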
public int[] wordIndex
A pointer to
PreprocessingContext.AllWords arrays for this token. Equal to -1
for document, field and ITokenizer.TT_PUNCTUATION tokens (including
sentence separators).
This array is produced by CaseNormalizer.
public int[] suffixOrder
The suffix order of tokens.
This array is produced by PhraseExtractor.
public int[] lcp
The Longest Common Prefix for the adjacent suffix-sorted token sequences.
This array is produced by PhraseExtractor.
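To illustrate how suffixOrder and lcp work together (this is an illustrative sketch, not Carrot2's PhraseExtractor): sort all token suffixes, then the LCP of two adjacent suffixes in that order is the length of a token phrase they share, which is how repeated phrases surface.

```java
import java.util.Arrays;

// Sketch: suffix order and LCP over a sequence of token codes
// (think wordIndex values). An lcp value of k between adjacent sorted
// suffixes means a phrase of k tokens occurs at both starting positions.
public class SuffixOrderSketch {
    static int lcp(int[] tokens, int a, int b) {
        int k = 0;
        while (a + k < tokens.length && b + k < tokens.length
                && tokens[a + k] == tokens[b + k]) {
            k++;
        }
        return k;
    }

    public static void main(String[] args) {
        // Token codes for the sequence: a b a b c
        int[] tokens = { 0, 1, 0, 1, 2 };

        // Suffix order: start positions 0..4 sorted by their suffix.
        Integer[] order = { 0, 1, 2, 3, 4 };
        Arrays.sort(order, (x, y) -> {
            int i = x, j = y;
            while (i < tokens.length && j < tokens.length) {
                if (tokens[i] != tokens[j]) return tokens[i] - tokens[j];
                i++;
                j++;
            }
            // A shorter suffix that is a prefix of the other sorts first.
            return Integer.compare(tokens.length - x, tokens.length - y);
        });

        // lcp(0, 2) == 2: the two-token phrase "a b" repeats.
        for (int r = 1; r < order.length; r++) {
            System.out.println("lcp(" + order[r - 1] + "," + order[r] + ") = "
                + lcp(tokens, order[r - 1], order[r]));
        }
    }
}
```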