public class RIDFTermPruningPolicy extends TermPruningPolicy
A TermPruningPolicy that uses the "residual IDF" metric to determine which term postings to keep or remove, as defined in
http://www.dc.fi.udc.es/~barreiro/publications/blanco_barreiro_ecir2007.pdf.
Residual IDF measures the difference between the collection-wide IDF of a term (which assumes a uniform distribution of occurrences) and the actual observed total number of occurrences of the term in all documents. Positive values indicate that a term is informative (e.g. a rare term); negative values indicate that a term is not informative (e.g. a term too popular to offer good selectivity).
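As a rough illustration of the idea, the sketch below computes a residual IDF from three collection statistics. It is not taken from this class. It assumes the common textbook formulation, in which the observed IDF is compared with the IDF predicted by a Poisson model of term occurrences, and it assumes base-2 logarithms. The class name `ResidualIdfSketch` and its method names are hypothetical, and the numeric scale may differ from the values this policy computes internally.

```java
/** Illustrative residual IDF calculation; NOT the code used by RIDFTermPruningPolicy. */
final class ResidualIdfSketch {

    private ResidualIdfSketch() {}

    /** Observed IDF: -log2(docFreq / numDocs). */
    public static double observedIdf(long numDocs, long docFreq) {
        return -log2((double) docFreq / numDocs);
    }

    /**
     * IDF expected if occurrences were scattered at random (Poisson assumption):
     * -log2(1 - e^(-collectionFreq / numDocs)).
     */
    public static double expectedIdf(long numDocs, long collectionFreq) {
        double lambda = (double) collectionFreq / numDocs; // mean occurrences per document
        // -expm1(-lambda) equals 1 - e^(-lambda), but is more accurate for small lambda.
        return -log2(-Math.expm1(-lambda));
    }

    /** Residual IDF: observed IDF minus expected IDF. */
    public static double residualIdf(long numDocs, long docFreq, long collectionFreq) {
        return observedIdf(numDocs, docFreq) - expectedIdf(numDocs, collectionFreq);
    }

    private static double log2(double x) {
        return Math.log(x) / Math.log(2.0);
    }

    public static void main(String[] args) {
        // Term occurring once in each of 10 documents: close to the random model,
        // so the residual is a small value near zero.
        System.out.println(residualIdf(1000, 10, 10));
        // Term with 100 occurrences concentrated in the same 10 documents:
        // far burstier than the random model, so the residual is clearly positive.
        System.out.println(residualIdf(1000, 10, 100));
    }
}
```

Under these assumptions, a term whose occurrences cluster in few documents scores higher than a term with the same document frequency spread evenly, which matches the intuition that positive values mark informative terms.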
This metric produces small values, roughly in the range [-1, 1], so useful thresholds under this metric generally fall somewhere between 0 and 1. The higher the threshold, the more informative (and rarer) the retained terms will be. For filtering out common words, a value close to or slightly below 0 (e.g. -0.1) should be a good starting point.
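The effect of a threshold can be sketched in plain Java without the Lucene classes. The example below is only an assumption about how such a decision might be applied: it keys the `thresholds` map by field name, falls back to `defThreshold` when a field has no entry, and keeps a term when its score is strictly above the threshold. The class `RidfThresholdSketch`, the field names, and the scores are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative threshold check; the lookup and comparison rules are assumptions. */
final class RidfThresholdSketch {

    private RidfThresholdSketch() {}

    /** Returns the per-field threshold, or the default when the field has none. */
    public static double thresholdFor(String field,
                                      Map<String, Double> thresholds,
                                      double defThreshold) {
        Double t = thresholds.get(field);
        return t != null ? t : defThreshold;
    }

    /** Keeps a term only when its score is strictly above the threshold (assumed rule). */
    public static boolean keep(double ridf, double threshold) {
        return ridf > threshold;
    }

    public static void main(String[] args) {
        Map<String, Double> thresholds = new HashMap<>();
        thresholds.put("body", -0.1); // lenient: drop only clearly uninformative terms
        double defThreshold = 0.0;    // applied to every field without its own entry

        // Made-up scores, for illustration only.
        double commonWordScore = -0.3;
        double rareWordScore = 0.4;

        double bodyThreshold = thresholdFor("body", thresholds, defThreshold);
        System.out.println("common word kept: " + keep(commonWordScore, bodyThreshold));
        System.out.println("rare word kept: " + keep(rareWordScore, bodyThreshold));
    }
}
```

Raising the threshold toward 1 in this sketch discards progressively more terms and leaves only the highest-scoring ones.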
Inherited fields:
fieldFlags, in
DEL_ALL, DEL_PAYLOADS, DEL_POSTINGS, DEL_STORED, DEL_VECTOR

| Constructor and Description |
|---|
| RIDFTermPruningPolicy(org.apache.lucene.index.IndexReader in, java.util.Map<java.lang.String,java.lang.Integer> fieldFlags, java.util.Map<java.lang.String,java.lang.Double> thresholds, double defThreshold) |

| Modifier and Type | Method and Description |
|---|---|
| void | initPositionsTerm(org.apache.lucene.index.TermPositions tp, org.apache.lucene.index.Term t): Called when moving TermPositions to a new Term. |
| boolean | pruneAllPositions(org.apache.lucene.index.TermPositions termPositions, org.apache.lucene.index.Term t): Prune all postings per term (invoked once per term per doc). |
| int | pruneSomePositions(int docNum, int[] positions, org.apache.lucene.index.Term curTerm): Prune some postings per term (invoked once per term per doc). |
| boolean | pruneTermEnum(org.apache.lucene.index.TermEnum te): Pruning of all postings for a term (invoked once per term). |
| int | pruneTermVectorTerms(int docNumber, java.lang.String field, java.lang.String[] terms, int[] freqs, org.apache.lucene.index.TermFreqVector v): Pruning of individual terms in term vectors. |

Inherited methods:
pruneAllFieldPostings, prunePayload, pruneWholeTermVector

public RIDFTermPruningPolicy(org.apache.lucene.index.IndexReader in,
java.util.Map<java.lang.String,java.lang.Integer> fieldFlags,
java.util.Map<java.lang.String,java.lang.Double> thresholds,
double defThreshold)

public void initPositionsTerm(org.apache.lucene.index.TermPositions tp,
org.apache.lucene.index.Term t)
throws java.io.IOException
Description copied from class TermPruningPolicy: Called when moving TermPositions to a new Term.
initPositionsTerm in class TermPruningPolicy
Parameters:
tp - input term positions
t - current term
Throws:
java.io.IOException

public boolean pruneTermEnum(org.apache.lucene.index.TermEnum te)
throws java.io.IOException
Description copied from class TermPruningPolicy: Pruning of all postings for a term (invoked once per term).
pruneTermEnum in class TermPruningPolicy
Parameters:
te - positioned term enum.
Throws:
java.io.IOException

public boolean pruneAllPositions(org.apache.lucene.index.TermPositions termPositions,
org.apache.lucene.index.Term t)
throws java.io.IOException
Description copied from class TermPruningPolicy: Prune all postings per term (invoked once per term per doc).
pruneAllPositions in class TermPruningPolicy
Parameters:
termPositions - positioned term positions. Implementations MUST NOT advance this by calling TermPositions methods that advance either the position pointer (next, skipTo) or term pointer (seek).
t - current term
Throws:
java.io.IOException

public int pruneTermVectorTerms(int docNumber,
java.lang.String field,
java.lang.String[] terms,
int[] freqs,
org.apache.lucene.index.TermFreqVector v)
throws java.io.IOException
Description copied from class TermPruningPolicy: Pruning of individual terms in term vectors.
pruneTermVectorTerms in class TermPruningPolicy
Parameters:
docNumber - document number
field - field name
terms - array of terms
freqs - array of term frequencies
v - the original term frequency vector
Throws:
java.io.IOException

public int pruneSomePositions(int docNum,
int[] positions,
org.apache.lucene.index.Term curTerm)
Description copied from class TermPruningPolicy: Prune some postings per term (invoked once per term per doc).
pruneSomePositions in class TermPruningPolicy
Parameters:
docNum - current document number
positions - original term positions in the document (and indirectly term frequency)
curTerm - current term

Copyright © 2000-2022 Apache Software Foundation. All Rights Reserved.