Class AnalyzingSuggester
- All Implemented Interfaces:
Accountable
- Direct Known Subclasses:
FuzzySuggester
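This suggester analyzes each surface form at index time and stores the analyzed form in a weighted FST, then analyzes the query text the same way at lookup time. Matching is therefore done on the analyzed form, while the suggestions returned are the original surface forms.
This can result in powerful suggester functionality. For example, if you use an analyzer that removes stop words, then the partial text "ghost chr..." could see the suggestion "The Ghost of Christmas Past". Note that position increments MUST NOT be preserved for this example to work, so you should call the constructor with the preservePositionIncrements parameter set to false.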
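The stop-word example can be made concrete with a toy model in plain Java. This is a sketch under stated assumptions, not Lucene code: the class name, the analyze helper, and the stop-word set are all hypothetical. It lowercases the text, drops stop words without leaving a gap (analogous to not preserving position increments), and treats a suggestion as a match when the analyzed query is a prefix of the analyzed suggestion.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy model only: not Lucene code. All names here are hypothetical.
final class StopWordPrefixSketch {
    // Assumed stop-word list for this illustration.
    static final Set<String> STOP_WORDS = Set.of("the", "of", "a", "an");

    // Lowercase, split on whitespace, strip trailing dots, drop stop words.
    // Dropped words leave no gap, mimicking preservePositionIncrements=false.
    static String analyze(String text) {
        List<String> kept = new ArrayList<>();
        for (String raw : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            String token = raw.replaceAll("\\.+$", "");
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                kept.add(token);
            }
        }
        return String.join(" ", kept);
    }

    // A suggestion matches when the analyzed query is a prefix of its analyzed form.
    static boolean matches(String query, String surfaceForm) {
        return analyze(surfaceForm).startsWith(analyze(query));
    }

    public static void main(String[] args) {
        System.out.println(analyze("The Ghost of Christmas Past"));
        System.out.println(matches("ghost chr...", "The Ghost of Christmas Past"));
    }
}
```

If the removed stop words left a hole in the analyzed suggestion, the query "ghost chr..." (which contains no hole) would no longer line up with it; that is the intuition behind the requirement above.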
If SynonymFilter is used to map wifi and wireless network to hotspot, then the partial text "wirele..." could suggest "wifi router". Token normalization such as stemming or accent removal would allow suggestions to ignore such variations.
When two matching suggestions have the same weight, they are tie-broken by the analyzed form. If their analyzed form is the same then the order is undefined.
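The ordering rule can be sketched as a comparator. This is an illustration only: Candidate is a hypothetical type, and comparing analyzed forms as Java strings is an assumption made for readability, not a statement about how the suggester compares them internally.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustrative only: Candidate is a hypothetical type, not a Lucene class.
final class TieBreakSketch {
    static final class Candidate {
        final String surfaceForm;
        final String analyzedForm;
        final long weight;

        Candidate(String surfaceForm, String analyzedForm, long weight) {
            this.surfaceForm = surfaceForm;
            this.analyzedForm = analyzedForm;
            this.weight = weight;
        }
    }

    // Higher weight first; equal weights fall back to the analyzed form.
    static final Comparator<Candidate> ORDER = (a, b) -> {
        int byWeight = Long.compare(b.weight, a.weight);
        return byWeight != 0 ? byWeight : a.analyzedForm.compareTo(b.analyzedForm);
    };

    // Returns a sorted copy and leaves the input list untouched.
    static List<Candidate> rank(List<Candidate> candidates) {
        List<Candidate> sorted = new ArrayList<>(candidates);
        sorted.sort(ORDER);
        return sorted;
    }

    public static void main(String[] args) {
        List<Candidate> ranked = rank(List.of(
            new Candidate("Wifi", "wifi", 5),
            new Candidate("Apple", "apple", 5),
            new Candidate("Zed", "zed", 9)));
        for (Candidate c : ranked) {
            System.out.println(c.surfaceForm + " " + c.weight);
        }
    }
}
```

When both the weight and the analyzed form are equal, this comparator returns 0, so the relative order of such entries depends on the sort used; callers should not rely on it.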
There are some limitations:
- A lookup from a query like "net" in English won't be any different from "net " (i.e., the user added a trailing space), because analyzers don't reflect whether they have seen a token separator.
- If you're using StopFilter and the user intends to type "fast apple" but so far has typed only "fast a", then, again because the analyzer doesn't convey whether it has seen a token separator after the "a", StopFilter will remove that "a", causing far more matches than you'd expect.
- Lookups with the empty string return no results instead of all results.
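The first two limitations can be reproduced with a toy whitespace analyzer. This is a sketch, not Lucene's analysis chain: the class name, the analyze helper, and the stop-word set are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy model only: not Lucene code. All names here are hypothetical.
final class LimitationsSketch {
    // Assumed stop-word list for this illustration.
    static final Set<String> STOP_WORDS = Set.of("a", "an", "the");

    // Lowercase, split on whitespace, and drop stop words.
    // The result carries no record of trailing separators.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Trailing space is invisible after analysis.
        System.out.println(analyze("net").equals(analyze("net ")));
        // The partial word "a" is removed as a stop word.
        System.out.println(analyze("fast a"));
    }
}
```

In this model, "fast a" analyzes to the single token "fast", so it would match every suggestion beginning with "fast" rather than only those whose second word starts with "a".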
-
Nested Class Summary
Nested Classes
Nested classes/interfaces inherited from class org.apache.lucene.search.suggest.Lookup
Lookup.LookupPriorityQueue, Lookup.LookupResult
-
Field Summary
Fields
- private long count: Number of entries the lookup was built with
- private static final int END_BYTE: Marks end of the analyzed input and start of dedup byte.
- static final int EXACT_FIRST: Include this flag in the options parameter to AnalyzingSuggester(Directory,String,Analyzer,Analyzer,int,int,int,boolean) to always return the exact match first, regardless of score.
- private final boolean exactFirst: True if exact match suggestions should always be returned first.
- private FST<PairOutputs.Pair<Long, BytesRef>> fst: FST<Weight,Surface>: input is the analyzed form, with a null byte between terms; weights are encoded as costs: (Integer.MAX_VALUE - weight); surface is the original, unanalyzed form.
- private boolean hasPayloads
- private final Analyzer indexAnalyzer: Analyzer that will be used for analyzing suggestions at index time.
- private int maxAnalyzedPathsForOneInput: Highest number of analyzed paths we saw for any single input surface form.
- private final int maxGraphExpansions: Maximum graph paths to index for a single analyzed surface form.
- private final int maxSurfaceFormsPerAnalyzedForm: Maximum number of dup surface forms (different surface forms for the same analyzed form).
- private static final int PAYLOAD_SEP
- static final int PRESERVE_SEP: Include this flag in the options parameter to AnalyzingSuggester(Directory,String,Analyzer,Analyzer,int,int,int,boolean) to preserve token separators when matching.
- private boolean preservePositionIncrements: Whether position holes should appear in the automaton.
- private final boolean preserveSep: True if separator between tokens should be preserved.
- private final Analyzer queryAnalyzer: Analyzer that will be used for analyzing suggestions at query time.
- private static final int SEP_LABEL: Represents the separation between tokens, if PRESERVE_SEP was specified
- private final Directory tempDir
- private final String tempFileNamePrefix
- (package private) static final Comparator<PairOutputs.Pair<Long, BytesRef>> weightComparator
Fields inherited from class org.apache.lucene.search.suggest.Lookup
CHARSEQUENCE_COMPARATOR
Fields inherited from interface org.apache.lucene.util.Accountable
NULL_ACCOUNTABLE
-
Constructor Summary
Constructors
- AnalyzingSuggester(Directory tempDir, String tempFileNamePrefix, Analyzer analyzer)
- AnalyzingSuggester(Directory tempDir, String tempFileNamePrefix, Analyzer indexAnalyzer, Analyzer queryAnalyzer)
- AnalyzingSuggester(Directory tempDir, String tempFileNamePrefix, Analyzer indexAnalyzer, Analyzer queryAnalyzer, int options, int maxSurfaceFormsPerAnalyzedForm, int maxGraphExpansions, boolean preservePositionIncrements): Creates a new suggester.
-
Method Summary
- void build(InputIterator iterator): Builds up a new internal Lookup representation based on the given InputIterator.
- protected Automaton convertAutomaton: Used by subclass to change the lookup automaton, if necessary.
- private static int decodeWeight(long encoded): cost -> weight
- private static int encodeWeight(long value): weight -> cost
- get(CharSequence key): Returns the weight associated with an input string, or null if it does not exist.
- getChildResources: Returns nested resources of this class.
- long getCount(): Get the number of entries the lookup was built with
- protected List<FSTUtil.Path<PairOutputs.Pair<Long, BytesRef>>> getFullPrefixPaths(List<FSTUtil.Path<PairOutputs.Pair<Long, BytesRef>>> prefixPaths, Automaton lookupAutomaton, FST<PairOutputs.Pair<Long, BytesRef>> fst): Returns all prefix paths to initialize the search.
- private Lookup.LookupResult getLookupResult(Long output1, BytesRef output2, CharsRefBuilder spare)
- (package private) TokenStreamToAutomaton getTokenStreamToAutomaton()
- boolean load: Discard current lookup data and load it from a previously saved copy.
- List<Lookup.LookupResult> lookup(CharSequence key, Set<BytesRef> contexts, boolean onlyMorePopular, int num): Look up a key and return possible completion for this key.
- long ramBytesUsed(): Returns byte size of the underlying FST.
- private Automaton replaceSep
- private boolean sameSurfaceForm(BytesRef key, BytesRef output2)
- boolean store(DataOutput output): Persist the constructed lookup data to a directory.
- (package private) final Automaton toAutomaton(BytesRef surfaceForm, TokenStreamToAutomaton ts2a)
- (package private) final Automaton toLookupAutomaton
-
Field Details
-
fst
FST<Weight,Surface>: input is the analyzed form, with a null byte between terms; weights are encoded as costs: (Integer.MAX_VALUE - weight); surface is the original, unanalyzed form.
-
indexAnalyzer
Analyzer that will be used for analyzing suggestions at index time. -
queryAnalyzer
Analyzer that will be used for analyzing suggestions at query time. -
exactFirst
private final boolean exactFirst
True if exact match suggestions should always be returned first.
-
preserveSep
private final boolean preserveSep
True if separator between tokens should be preserved.
-
EXACT_FIRST
public static final int EXACT_FIRST
Include this flag in the options parameter to AnalyzingSuggester(Directory,String,Analyzer,Analyzer,int,int,int,boolean) to always return the exact match first, regardless of score. This has no performance impact but could result in low-quality suggestions.
-
PRESERVE_SEP
public static final int PRESERVE_SEP
Include this flag in the options parameter to AnalyzingSuggester(Directory,String,Analyzer,Analyzer,int,int,int,boolean) to preserve token separators when matching.
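Since both flags are passed through a single int options parameter, they are presumably combined with bitwise OR. The sketch below shows that pattern in isolation; the numeric values are placeholders chosen for illustration (the actual constants are not given in this section), and isSet is a hypothetical helper.

```java
// Sketch only: the flag values below are placeholders, not taken from this page.
final class OptionFlagsSketch {
    static final int EXACT_FIRST = 1;   // placeholder value
    static final int PRESERVE_SEP = 2;  // placeholder value

    // True when the given flag bit is present in options.
    static boolean isSet(int options, int flag) {
        return (options & flag) != 0;
    }

    public static void main(String[] args) {
        int options = EXACT_FIRST | PRESERVE_SEP; // request both behaviours
        System.out.println(isSet(options, EXACT_FIRST));
        System.out.println(isSet(options, PRESERVE_SEP));
    }
}
```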
-
SEP_LABEL
private static final int SEP_LABEL
Represents the separation between tokens, if PRESERVE_SEP was specified.
-
END_BYTE
private static final int END_BYTE
Marks end of the analyzed input and start of dedup byte.
-
maxSurfaceFormsPerAnalyzedForm
private final int maxSurfaceFormsPerAnalyzedForm
Maximum number of dup surface forms (different surface forms for the same analyzed form).
-
maxGraphExpansions
private final int maxGraphExpansions
Maximum graph paths to index for a single analyzed surface form. This only matters if your analyzer makes lots of alternate paths (e.g. contains SynonymFilter).
-
tempDir
-
tempFileNamePrefix
-
maxAnalyzedPathsForOneInput
private int maxAnalyzedPathsForOneInput
Highest number of analyzed paths we saw for any single input surface form. For analyzers that never create graphs this will always be 1.
-
hasPayloads
private boolean hasPayloads
-
PAYLOAD_SEP
private static final int PAYLOAD_SEP
-
preservePositionIncrements
private boolean preservePositionIncrements
Whether position holes should appear in the automaton.
-
count
private volatile long count
Number of entries the lookup was built with
-
weightComparator
-
-
Constructor Details
-
AnalyzingSuggester
-
AnalyzingSuggester
-
AnalyzingSuggester
public AnalyzingSuggester(Directory tempDir, String tempFileNamePrefix, Analyzer indexAnalyzer, Analyzer queryAnalyzer, int options, int maxSurfaceFormsPerAnalyzedForm, int maxGraphExpansions, boolean preservePositionIncrements)
Creates a new suggester.
- Parameters:
indexAnalyzer - Analyzer that will be used for analyzing suggestions while building the index.
queryAnalyzer - Analyzer that will be used for analyzing query text during lookup.
options - see EXACT_FIRST, PRESERVE_SEP
maxSurfaceFormsPerAnalyzedForm - Maximum number of surface forms to keep for a single analyzed form. When there are too many surface forms we discard the lowest weighted ones.
maxGraphExpansions - Maximum number of graph paths to expand from the analyzed form. Set this to -1 for no limit.
preservePositionIncrements - Whether position holes should appear in the automata.
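The effect of maxSurfaceFormsPerAnalyzedForm can be pictured as a top-N cut by weight over the surface forms that share one analyzed form. The following is a minimal sketch of that idea in plain Java, not the suggester's actual build logic; the class name, method name, and map-based input are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Sketch only: models the "discard the lowest weighted ones" rule in isolation.
final class SurfaceFormLimitSketch {
    // Returns up to max surface forms, highest weight first.
    static List<String> keepHeaviest(Map<String, Long> surfaceFormWeights, int max) {
        List<Map.Entry<String, Long>> entries = new ArrayList<>(surfaceFormWeights.entrySet());
        // Sort by descending weight so the lowest weighted forms end up last.
        entries.sort((a, b) -> Long.compare(b.getValue(), a.getValue()));
        int limit = Math.max(0, Math.min(max, entries.size()));
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, Long> entry : entries.subList(0, limit)) {
            kept.add(entry.getKey());
        }
        return kept;
    }

    public static void main(String[] args) {
        // Three surface forms assumed to share one analyzed form.
        Map<String, Long> weights = Map.of("WiFi", 10L, "wifi", 3L, "Wi-Fi", 7L);
        System.out.println(keepHeaviest(weights, 2));
    }
}
```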
-
-
Method Details
-
ramBytesUsed
public long ramBytesUsed()
Returns byte size of the underlying FST.
-
getChildResources
Description copied from interface: Accountable
Returns nested resources of this class. The result should be a point-in-time snapshot (to avoid race conditions).
-
replaceSep
-
convertAutomaton
Used by subclass to change the lookup automaton, if necessary. -
getTokenStreamToAutomaton
TokenStreamToAutomaton getTokenStreamToAutomaton() -
build
Description copied from class: Lookup
Builds up a new internal Lookup representation based on the given InputIterator. The implementation might re-sort the data internally.
- Specified by:
build in class Lookup
- Throws:
IOException
-
store
Description copied from class: Lookup
Persist the constructed lookup data to a directory. Optional operation.
- Specified by:
store in class Lookup
- Parameters:
output - DataOutput to write the data to.
- Returns:
true if successful, false if unsuccessful or not supported.
- Throws:
IOException - when fatal IO error occurs.
-
load
Description copied from class: Lookup
Discard current lookup data and load it from a previously saved copy. Optional operation.
- Specified by:
load in class Lookup
- Parameters:
input - the DataInput to load the lookup data.
- Returns:
true if completed successfully, false if unsuccessful or not supported.
- Throws:
IOException - when fatal IO error occurs.
-
getLookupResult
-
sameSurfaceForm
-
lookup
public List<Lookup.LookupResult> lookup(CharSequence key, Set<BytesRef> contexts, boolean onlyMorePopular, int num)
Description copied from class: Lookup
Look up a key and return possible completion for this key.
- Specified by:
lookup in class Lookup
- Parameters:
key - lookup key. Depending on the implementation this may be a prefix, misspelling, or even infix.
contexts - contexts to filter the lookup by, or null if all contexts are allowed; if the suggestion contains any of the contexts, it's a match
onlyMorePopular - return only more popular results
num - maximum number of results to return
- Returns:
a list of possible completions, with their relative weight (e.g. popularity)
-
getCount
public long getCount()
Description copied from class: Lookup
Get the number of entries the lookup was built with
-
getFullPrefixPaths
protected List<FSTUtil.Path<PairOutputs.Pair<Long,BytesRef>>> getFullPrefixPaths(List<FSTUtil.Path<PairOutputs.Pair<Long,BytesRef>>> prefixPaths, Automaton lookupAutomaton, FST<PairOutputs.Pair<Long,BytesRef>> fst) throws IOException
Returns all prefix paths to initialize the search.
- Throws:
IOException
-
toAutomaton
- Throws:
IOException
-
toLookupAutomaton
- Throws:
IOException
-
get
Returns the weight associated with an input string, or null if it does not exist. -
decodeWeight
private static int decodeWeight(long encoded)
cost -> weight
-
encodeWeight
private static int encodeWeight(long value)
weight -> cost
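The fst field description states that weights are encoded as costs using (Integer.MAX_VALUE - weight). The sketch below works through that arithmetic in a standalone class; the range check and the class name are assumptions, and the private methods of the real class may differ in detail.

```java
// Sketch of the weight <-> cost arithmetic described for the fst field.
final class WeightCostSketch {
    // weight -> cost: a higher weight yields a lower cost.
    static int encodeWeight(long weight) {
        // Assumed guard: only weights that fit in a non-negative int are encodable.
        if (weight < 0 || weight > Integer.MAX_VALUE) {
            throw new IllegalArgumentException("cannot encode weight: " + weight);
        }
        return Integer.MAX_VALUE - (int) weight;
    }

    // cost -> weight: the inverse of encodeWeight.
    static int decodeWeight(long cost) {
        return (int) (Integer.MAX_VALUE - cost);
    }

    public static void main(String[] args) {
        int cost = encodeWeight(42);
        System.out.println(cost + " -> " + decodeWeight(cost));
    }
}
```

One consequence of this encoding is that a search for the lowest-cost paths visits the highest-weight suggestions first.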
-