java.lang.Object
org.apache.lucene.misc.index.BPIndexReorderer
Implementation of "recursive graph bisection", also called "bipartite graph partitioning" and
often abbreviated BP, an approach to doc ID assignment that aims at reducing the sum of the log
gap between consecutive postings. While originally targeted at reducing the size of postings,
this algorithm has been observed to also speed up queries significantly by clustering documents
that have similar sets of terms together.
This algorithm was initially described by Dhulipala et al. in "Compressing graphs and inverted indexes with recursive graph bisection". This implementation takes advantage of some optimizations suggested by Mackenzie et al. in "Tradeoff Options for Bipartite Graph Partitioning".
Typical usage would look like this:
LeafReader reader; // reader to reorder
Directory targetDir; // Directory where to write the reordered index
Directory targetDir = FSDirectory.open(targetPath);
BPIndexReorderer reorderer = new BPIndexReorderer();
ForkJoinPool pool = new ForkJoinPool(Runtime.getRuntime().availableProcessors(),
p -> new ForkJoinWorkerThread(p) {}, null, false);
reorderer.setForkJoinPool(pool);
reorderer.setFields(Collections.singleton("body"));
CodecReader reorderedReaderView = reorderer.reorder(SlowCodecReaderWrapper.wrap(reader), targetDir);
try (IndexWriter w = new IndexWriter(targetDir, new IndexWriterConfig().setOpenMode(OpenMode.CREATE))) {
w.addIndexes(reorderedReaderView);
}
DirectoryReader reorderedReader = DirectoryReader.open(targetDir);
Note: This is a slow operation that consumes O(maxDoc + numTerms * numThreads) memory.
-
Nested Class Summary
Nested ClassesModifier and TypeClassDescriptionprivate classprivate classprivate static final classA forward index.(package private) static classUse a LSB Radix Sorter to sort the (docID, termID) entries.private class(package private) static interfacestatic classException that is thrown when not enough RAM is available.private static class -
Field Summary
FieldsModifier and TypeFieldDescriptionstatic final intDefault maximum number of iterations per recursion level: 20.static final intMinimum required document frequency for terms to be considered: 4,096.static final intMinimum size of partitions.private static final intMinimum problem size that will result in tasks being splitted.private ForkJoinPoolprivate static final float[]private floatprivate intprivate intprivate intprivate doubleprivate static final intBlock size for terms in the forward index -
Constructor Summary
Constructors -
Method Summary
Modifier and TypeMethodDescriptionprivate BPIndexReorderer.ForwardIndexbuildForwardIndex(Directory tempDir, String postingsFileName, int maxDoc, int maxTerm) computeDocMap(CodecReader reader, Directory tempDir) Expert: Compute theSorter.DocMapthat holds the new doc ID numbering.private int[]computePermutation(CodecReader reader, Set<String> fields, Directory dir) Compute a permutation of the doc ID space that reduces log gaps between consecutive postings.private static longdocRAMRequirements(int maxDoc) (package private) static floatfastLog2(int i) An approximate log() function in base 2 which trades accuracy for much better performance.private int(package private) static intreadMonotonicInts(DataInput in, int[] ints) Decoding logic forwriteMonotonicInts(int[], int, DataOutput).reorder(CodecReader reader, Directory tempDir) Reorder the givenCodecReaderinto a reader that tries to minimize the log gap between consecutive documents in postings, which usually helps improve space efficiency and query evaluation efficiency.voidSets the fields to use to perform partitioning.voidsetForkJoinPool(ForkJoinPool forkJoinPool) Set theForkJoinPoolto run graph partitioning concurrently.voidsetMaxDocFreq(float maxDocFreq) Set the maximum document frequency for terms to be considered, as a ratio ofmaxDoc.voidsetMaxIters(int maxIters) Set the maximum number of iterations on each recursion level, 20 by default.voidsetMinDocFreq(int minDocFreq) Set the minimum document frequency for terms to be considered, 4096 by default.voidsetMinPartitionSize(int minPartitionSize) Set the minimum partition size, when the algorithm stops recursing, 32 by default.voidsetRAMBudgetMB(double ramBudgetMB) Set the amount of RAM that graph partitioning is allowed to use.private static booleanReturns true if, and only if, the givenIntsRefis sorted.private static long(package private) static voidwriteMonotonicInts(int[] ints, int len, DataOutput out) Simple bit packing that focuses on the common / efficient case when term IDs can be encoded on 16 bits.private intwritePostings(CodecReader reader, Set<String> fields, Directory tempDir, DataOutput postingsOut)
-
Field Details
-
TERM_IDS_BLOCK_SIZE
private static final int TERM_IDS_BLOCK_SIZEBlock size for terms in the forward index- See Also:
-
FORK_THRESHOLD
private static final int FORK_THRESHOLDMinimum problem size that will result in tasks being splitted.- See Also:
-
DEFAULT_MIN_DOC_FREQ
public static final int DEFAULT_MIN_DOC_FREQMinimum required document frequency for terms to be considered: 4,096.- See Also:
-
DEFAULT_MIN_PARTITION_SIZE
public static final int DEFAULT_MIN_PARTITION_SIZEMinimum size of partitions. The algorithm will stop recursing when reaching partitions below this number of documents: 32.- See Also:
-
DEFAULT_MAX_ITERS
public static final int DEFAULT_MAX_ITERSDefault maximum number of iterations per recursion level: 20. Higher numbers of iterations typically don't help significantly.- See Also:
-
minDocFreq
private int minDocFreq -
maxDocFreq
private float maxDocFreq -
minPartitionSize
private int minPartitionSize -
maxIters
private int maxIters -
forkJoinPool
-
ramBudgetMB
private double ramBudgetMB -
fields
-
LOG2_TABLE
private static final float[] LOG2_TABLE
-
-
Constructor Details
-
BPIndexReorderer
public BPIndexReorderer()Constructor.
-
-
Method Details
-
setMinDocFreq
public void setMinDocFreq(int minDocFreq) Set the minimum document frequency for terms to be considered, 4096 by default. -
setMaxDocFreq
public void setMaxDocFreq(float maxDocFreq) Set the maximum document frequency for terms to be considered, as a ratio ofmaxDoc. This is useful because very frequent terms (stop words) add significant overhead to the reordering logic while not being very relevant for ordering. This value must be in (0, 1]. Default value is 1. -
setMinPartitionSize
public void setMinPartitionSize(int minPartitionSize) Set the minimum partition size, when the algorithm stops recursing, 32 by default. -
setMaxIters
public void setMaxIters(int maxIters) Set the maximum number of iterations on each recursion level, 20 by default. Experiments suggests that values above 20 do not help much. However, values below 20 can be used to trade effectiveness for faster reordering. -
setForkJoinPool
Set theForkJoinPoolto run graph partitioning concurrently.NOTE: A value of
nullcan be used to run in the current thread, which is the default. -
getParallelism
private int getParallelism() -
setRAMBudgetMB
public void setRAMBudgetMB(double ramBudgetMB) Set the amount of RAM that graph partitioning is allowed to use. More RAM allows running faster. If not enough RAM is provided, aBPIndexReorderer.NotEnoughRAMExceptionwill be thrown. This is 10% of the total heap size by default. -
setFields
Sets the fields to use to perform partitioning. Anullvalue indicates that all indexed fields should be used. -
writePostings
private int writePostings(CodecReader reader, Set<String> fields, Directory tempDir, DataOutput postingsOut) throws IOException - Throws:
IOException
-
buildForwardIndex
private BPIndexReorderer.ForwardIndex buildForwardIndex(Directory tempDir, String postingsFileName, int maxDoc, int maxTerm) throws IOException - Throws:
IOException
-
computeDocMap
Expert: Compute theSorter.DocMapthat holds the new doc ID numbering. This is exposed to enable integration intoBPReorderingMergePolicy,reorder(CodecReader, Directory)should be preferred in general.- Throws:
IOException
-
reorder
Reorder the givenCodecReaderinto a reader that tries to minimize the log gap between consecutive documents in postings, which usually helps improve space efficiency and query evaluation efficiency. Note that the returnedCodecReaderis slow and should typically be used in a call toIndexWriter.addIndexes(CodecReader...).- Throws:
BPIndexReorderer.NotEnoughRAMException- if not enough RAM is providedIOException
-
computePermutation
private int[] computePermutation(CodecReader reader, Set<String> fields, Directory dir) throws IOException Compute a permutation of the doc ID space that reduces log gaps between consecutive postings.- Throws:
IOException
-
sorted
Returns true if, and only if, the givenIntsRefis sorted. -
docRAMRequirements
private static long docRAMRequirements(int maxDoc) -
termRAMRequirementsPerThreadPerTerm
private static long termRAMRequirementsPerThreadPerTerm() -
fastLog2
static float fastLog2(int i) An approximate log() function in base 2 which trades accuracy for much better performance. -
writeMonotonicInts
Simple bit packing that focuses on the common / efficient case when term IDs can be encoded on 16 bits.- Throws:
IOException
-
readMonotonicInts
Decoding logic forwriteMonotonicInts(int[], int, DataOutput). It should get auto-vectorized.- Throws:
IOException
-