StandardTokenizer
implements the Word Break rules from the Unicode Text Segmentation algorithm, as specified in
Unicode Standard Annex #29.
| Class | Description |
|---|---|
| ClassicAnalyzer | Filters ClassicTokenizer with ClassicFilter, LowerCaseFilter and StopFilter, using a list of English stop words. |
| ClassicFilter | Normalizes tokens extracted with ClassicTokenizer. |
| ClassicFilterFactory | Factory for ClassicFilter. |
| ClassicTokenizer | A grammar-based tokenizer constructed with JFlex. |
| ClassicTokenizerFactory | Factory for ClassicTokenizer. |
| EmojiTokenizationTestUnicode_11_0 | This class was automatically generated by generateEmojiTokenizationTest.pl from http://www.unicode.org/Public/emoji/11.0/emoji-test.txt. emoji-test.txt contains emoji character sequences, which are represented as tokenization tests in this class. |
| StandardAnalyzer | Filters StandardTokenizer with LowerCaseFilter and StopFilter, using a configurable list of stop words. |
| StandardFilter | Deprecated. StandardFilter is a no-op and can be removed from code. |
| StandardFilterFactory | Deprecated. StandardFilter is a no-op and can be removed from filter chains. |
| StandardTokenizer | A grammar-based tokenizer constructed with JFlex. |
| StandardTokenizerFactory | Factory for StandardTokenizer. |
| StandardTokenizerImpl | This class implements Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29. |
| UAX29URLEmailAnalyzer | Filters UAX29URLEmailTokenizer with LowerCaseFilter and StopFilter, using a list of English stop words. |
| UAX29URLEmailTokenizer | This class implements Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29. URLs and email addresses are also tokenized according to the relevant RFCs. |
| UAX29URLEmailTokenizerFactory | Factory for UAX29URLEmailTokenizer. |
| UAX29URLEmailTokenizerImpl | This class implements Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29. URLs and email addresses are also tokenized according to the relevant RFCs. |
| WordBreakTestUnicode_9_0_0 | This class was automatically generated by generateJavaUnicodeWordBreakTest.pl from http://www.unicode.org/Public/9.0.0/ucd/auxiliary/WordBreakTest.txt. WordBreakTest.txt indicates the points in the provided character sequences at which conforming implementations must and must not break words. |

StandardTokenizer
implements the Word Break rules from the Unicode Text Segmentation algorithm, as specified in
Unicode Standard Annex #29.
Unlike UAX29URLEmailTokenizer from the analysis module, it does not emit URLs and
email addresses as single tokens; it splits them into several tokens
according to the UAX#29 word break rules.
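
The effect of plain word-break tokenization on an email address can be illustrated with a small sketch. It does not use Lucene: as a stand-in it relies on the JDK's `java.text.BreakIterator`, whose word boundaries are similar in spirit to UAX#29 but are not guaranteed to match StandardTokenizer's output. The class name `WordBreakSketch` and the method `tokenize` are hypothetical names chosen for this illustration.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustrative stand-in only: uses the JDK's BreakIterator,
// not Lucene's StandardTokenizer (assumption for this sketch).
final class WordBreakSketch {

    // Splits text at word boundaries and keeps only the segments
    // that contain at least one letter or digit.
    static List<String> tokenize(String text) {
        BreakIterator words = BreakIterator.getWordInstance(Locale.ROOT);
        words.setText(text);
        List<String> tokens = new ArrayList<>();
        int start = words.first();
        for (int end = words.next(); end != BreakIterator.DONE; start = end, end = words.next()) {
            String segment = text.substring(start, end);
            // Drop whitespace and punctuation-only segments such as " " or "@".
            if (segment.codePoints().anyMatch(Character::isLetterOrDigit)) {
                tokens.add(segment);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The address is not kept as one token; it is broken at the '@' sign.
        System.out.println(tokenize("Mail user@example.com today"));
    }
}
```

A tokenizer that recognizes URLs and email addresses, as UAX29URLEmailTokenizer is described to do, would instead keep `user@example.com` together as a single token.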
StandardAnalyzer includes StandardTokenizer, LowerCaseFilter and StopFilter.

Copyright © 2000–2025 The Apache Software Foundation. All rights reserved.