Class Dictionary
java.lang.Object
org.apache.lucene.analysis.hunspell.Dictionary
In-memory structure for the dictionary (.dic) and affix (.aff) data of a hunspell dictionary.
-
Nested Class Summary
Nested Classes
Modifier and Type    Class    Description
(package private) static class
    Possible word breaks according to BREAK directives.
private static class
    Used to read flags as UTF-8 even if the rest of the file is in the default (8-bit) encoding.
private static class
    Implementation of Dictionary.FlagParsingStrategy that assumes each flag is encoded as two ASCII characters whose codes must be combined into a single character.
(package private) static class
    Abstraction of the process of parsing flags taken from the affix and dic files.
private static class
    Implementation of Dictionary.FlagParsingStrategy that assumes each flag is encoded in its numerical form.
private static class
    Simple implementation of Dictionary.FlagParsingStrategy that treats the chars in each String as individual flags.
-
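The three flag encodings named in the nested class summary can be illustrated with a small sketch. This is not Lucene's implementation: the class and method names (`FlagParsingSketch`, `parseSimple`, `parseNumeric`, `parseDoubleAscii`) are hypothetical, and both the comma separator for numeric flags and the packing order of the two ASCII codes (first character in the high byte) are assumptions based on common Hunspell conventions.

```java
import java.util.Arrays;

/** Illustrative stand-ins for the flag parsing strategies; not Lucene code. */
class FlagParsingSketch {

  /** Simple strategy: every char of the raw string is one flag. */
  static char[] parseSimple(String rawFlags) {
    return rawFlags.toCharArray();
  }

  /** Numeric strategy: assumed to be comma-separated decimal numbers, e.g. "12,345". */
  static char[] parseNumeric(String rawFlags) {
    String[] parts = rawFlags.split(",");
    char[] flags = new char[parts.length];
    for (int i = 0; i < parts.length; i++) {
      flags[i] = (char) Integer.parseInt(parts[i].trim());
    }
    return flags;
  }

  /** Double-ASCII strategy: two ASCII chars are packed into one char (assumed: first char in the high byte). */
  static char[] parseDoubleAscii(String rawFlags) {
    if (rawFlags.length() % 2 != 0) {
      throw new IllegalArgumentException("Expected an even number of characters: " + rawFlags);
    }
    char[] flags = new char[rawFlags.length() / 2];
    for (int i = 0; i < flags.length; i++) {
      char first = rawFlags.charAt(2 * i);
      char second = rawFlags.charAt(2 * i + 1);
      flags[i] = (char) ((first << 8) | second);
    }
    return flags;
  }

  public static void main(String[] args) {
    System.out.println(Arrays.toString(parseSimple("AB")));
    System.out.println((int) parseNumeric("12,345")[1]);
    System.out.println(parseDoubleAscii("AaBb").length);
  }
}
```

Whatever the encoding, each flag ends up as a single `char`, which is why the rest of the class can store flags in plain `char` fields and `char[]` arrays.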
Field Summary
Fields
Modifier and Type    Field    Description
(package private) static final int
private static final int
(package private) static final int
(package private) static final int
(package private) char[]
private int
private String[]
private boolean
private static final byte[]
(package private) Dictionary.Breaks
(package private) boolean
(package private) boolean
(package private) List<CheckCompoundPattern>
(package private) boolean
(package private) boolean
(package private) boolean
(package private) char
(package private) boolean
(package private) char
(package private) char
(package private) char
(package private) char
(package private) int
(package private) char
(package private) int
(package private) char
(package private) CompoundRule[]
private int
(package private) CharsetDecoder
(package private) static final Charset
private static final int
(package private) boolean
private static final char
(package private) static final char
(package private) final FlagEnumerator.Lookup
    The list of unique flagsets (wordforms).
(package private) Dictionary.FlagParsingStrategy
(package private) char
(package private) char
(package private) boolean
(package private) boolean
    we set this during sorting, so we know to add an extra int (index in morphData) to FST output
(package private) static final char
(package private) ConvTable
private char[]
(package private) boolean
(package private) char
(package private) String
(package private) static final int
(package private) int
(package private) int
private static final char
private int
private String[]
(package private) char
(package private) String[]
(package private) static final char[]
(package private) char
(package private) ConvTable
(package private) char
(package private) boolean
(package private) ArrayList<AffixCondition>
    All condition checks used by prefixes and suffixes.
private char[]
    All flags used in affix continuation classes.
private char[]
    All flags used in affix continuation classes.
(package private) boolean
(package private) char[]
(package private) int[]
(package private) char
(package private) String
(package private) String
(package private) WordStorage
    The entries in the .dic file, mapping to their set of flags
-
Constructor Summary
Constructors
Constructor    Description
Dictionary(InputStream affix, List<InputStream> dictionaries, boolean ignoreCase, SortingStrategy sortingStrategy)
    Creates a new Dictionary containing the information read from the provided InputStreams to hunspell affix and dictionary files.
Dictionary(Directory tempDir, String tempFileNamePrefix, InputStream affix, InputStream dictionary)
    Creates a new Dictionary containing the information read from the provided InputStreams to hunspell affix and dictionary files.
Dictionary(Directory tempDir, String tempFileNamePrefix, InputStream affix, List<InputStream> dictionaries, boolean ignoreCase)
    Creates a new Dictionary containing the information read from the provided InputStreams to hunspell affix and dictionary files.
-
Method Summary
Modifier and Type    Method    Description
private void addHiddenCapitalizedWord(StringBuilder reuse, SortingStrategy.EntryAccumulator acc, String word, String afterSep)
private int addMorphFields(Map<String, Integer> indices, String morphFields)
private void addPhoneticRepEntries(String word, String ph)
(package private) char affixData(int affixIndex, int offset)
affixFST(TreeMap<String, IntArrayList> affixes)
(package private) char[] allNonSuggestibleFlags()
(package private) char caseFold(char c)
    folds single character (according to LANG if present)
private void checkCriticalDirectiveSame(String directive, LineNumberReader reader, Object expected, Object actual)
(package private) CharSequence cleanInput(CharSequence input, StringBuilder reuse)
(package private) DictEntry
(package private) static String extractLanguageCode(String isoCode)
private String firstArgument(LineNumberReader reader, String line)
(package private) int formStep()
(package private) int getAffixCondition(int affix)
private String getAliasValue(int id)
private CharsetDecoder getDecoder(String encoding)
    Retrieves the CharsetDecoder for the given encoding.
(package private) static Path
    Returns the default temporary directory pointed to by java.io.tmpdir.
(package private) static Dictionary.FlagParsingStrategy getFlagParsingStrategy(String flagLine, Charset charset)
    Determines the appropriate Dictionary.FlagParsingStrategy based on the FLAG definition line taken from the affix file
boolean getIgnoreCase()
    Returns true if this dictionary was constructed with the ignoreCase option
(package private) boolean hasFlag(int entryId, char flag)
(package private) boolean
protected double hashFactor()
    The factor determining the size of the internal hash table used for storing the entries.
(package private) boolean hasLanguage(String... langCodes)
(package private) static int indexOfSpaceOrTab(String text, int start)
(package private) boolean isCrossProduct(int affix)
(package private) boolean isDotICaseChangeDisallowed(char[] word)
(package private) boolean isFlagAppendedByAffix(int affixId, char flag)
(package private) boolean isSecondStagePrefix(char flag)
(package private) boolean isSecondStageSuffix(char flag)
private IntsRef lookupEntries(String root)
(package private) IntsRef lookupPrefix(char[] word)
(package private) IntsRef lookupSuffix(char[] word)
(package private) IntsRef lookupWord(char[] word, int offset, int length)
    Looks up Hunspell word forms from the dictionary
private static boolean maybeConsume(BufferedInputStream stream, byte[] bytes)
    Consume the provided byte sequence in full, if present.
(package private) boolean mayNeedInputCleaning()
private void mergeDictionaries(List<InputStream> dictionaries, CharsetDecoder decoder, SortingStrategy.EntryAccumulator acc)
private static int morphBoundary(String line)
(package private) boolean needsInputCleaning(CharSequence input)
(package private) static IntsRef
private void parseAffix(TreeMap<String, IntArrayList> affixes, CharHashSet secondStageFlags, String header, LineNumberReader reader, AffixKind kind, Map<String, Integer> seenPatterns, Map<String, Integer> seenStrips, FlagEnumerator flags)
    Parses a specific affix rule putting the result into the provided affix map
private void parseAlias(String line)
private Dictionary.Breaks parseBreaks(LineNumberReader reader, String line)
private CompoundRule[] parseCompoundRules(LineNumberReader reader, int num)
private ConvTable parseConversions(LineNumberReader reader, int num)
parseMapEntry(LineNumberReader reader, String line)
private void parseMorphAlias(String line)
private int parseNum(LineNumberReader reader, String line)
private void readAffixFile(InputStream affixStream, CharsetDecoder decoder, FlagEnumerator flags)
    Reads the affix file through the provided InputStream, building up the prefix and suffix maps
private void readConfig(InputStream stream, Charset streamCharset)
    Parses the encoding and flag format specified in the provided InputStream
readMorphFields(String word, String unparsed)
private WordStorage readSortedDictionaries(FlagEnumerator flags, SortingStrategy.EntrySupplier sorted)
private static CharsetDecoder replacingDecoder(Charset charset)
private static boolean shouldSkipEscapedChar(char ch)
private String singleArgument(LineNumberReader reader, String line)
private String[] splitBySpace(LineNumberReader reader, String line, int expectedParts)
private String[] splitBySpace(LineNumberReader reader, String line, int minParts, int maxParts)
splitMorphData(String morphData)
protected boolean tolerateAffixRuleCountMismatches()
    Whether incorrect PFX/SFX rule counts should be silently ignored.
protected boolean tolerateDuplicateConversionMappings()
    Whether duplicate ICONV/OCONV lines should be silently ignored.
(package private) String toLowerCase(String word)
(package private) static char[]
(package private) String toTitleCase(String word)
private String unescapeEntry(String entry)
private void writeNormalizedWordEntry(StringBuilder reuse, String line, SortingStrategy.EntryAccumulator acc)
-
Field Details
-
MAX_PROLOGUE_SCAN_WINDOW
static final int MAX_PROLOGUE_SCAN_WINDOW
-
NOFLAGS
static final char[] NOFLAGS -
FLAG_UNSET
static final char FLAG_UNSET
-
DEFAULT_FLAGS
private static final int DEFAULT_FLAGS
-
HIDDEN_FLAG
static final char HIDDEN_FLAG
-
DEFAULT_CHARSET
-
decoder
CharsetDecoder decoder -
prefixes
-
suffixes
-
breaks
Dictionary.Breaks breaks -
patterns
ArrayList<AffixCondition> patterns
All condition checks used by prefixes and suffixes. These are typically reused across many affix stripping rules, so they are deduplicated to save RAM.
-
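The deduplication of condition checks can be pictured as an interning table: each distinct condition is stored once, and every rule keeps only its index. The sketch below is a simplified illustration, not the library's code. It uses plain strings in place of `AffixCondition`, and `PatternDedupSketch` and `indexOf` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative interning of affix conditions; conditions are plain strings here. */
class PatternDedupSketch {
  /** Unique conditions, addressed by index. */
  final List<String> patterns = new ArrayList<>();
  /** Map from condition text to its index in {@link #patterns}. */
  private final Map<String, Integer> seenPatterns = new HashMap<>();

  /** Returns the index of the condition, storing it only on first sight. */
  int indexOf(String condition) {
    Integer existing = seenPatterns.get(condition);
    if (existing != null) {
      return existing;
    }
    int index = patterns.size();
    patterns.add(condition);
    seenPatterns.put(condition, index);
    return index;
  }

  public static void main(String[] args) {
    PatternDedupSketch dedup = new PatternDedupSketch();
    // Three rules, two of which share the same condition.
    System.out.println(dedup.indexOf("[^y]"));
    System.out.println(dedup.indexOf("."));
    System.out.println(dedup.indexOf("[^y]"));
    System.out.println(dedup.patterns.size());
  }
}
```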
words
WordStorage words
The entries in the .dic file, mapping to their set of flags.
-
flagLookup
The list of unique flagsets (wordforms). Theoretically huge, but small in practice (for Polish this is 756); otherwise humans wouldn't be able to deal with it either.
-
stripData
char[] stripData -
stripOffsets
int[] stripOffsets -
wordChars
String wordChars -
affixData
char[] affixData -
currentAffix
private int currentAffix -
AFFIX_FLAG
static final int AFFIX_FLAG
-
AFFIX_STRIP_ORD
static final int AFFIX_STRIP_ORD
-
AFFIX_CONDITION
private static final int AFFIX_CONDITION
-
AFFIX_APPEND
static final int AFFIX_APPEND
-
flagParsingStrategy
Dictionary.FlagParsingStrategy flagParsingStrategy -
aliases
-
aliasCount
private int aliasCount -
morphAliases
-
morphAliasCount
private int morphAliasCount -
morphData
-
hasCustomMorphData
boolean hasCustomMorphData
Set during sorting, so we know to add an extra int (index in morphData) to the FST output.
-
ignoreCase
boolean ignoreCase -
checkSharpS
boolean checkSharpS -
complexPrefixes
boolean complexPrefixes -
secondStagePrefixFlags
private char[] secondStagePrefixFlags
All flags used in affix continuation classes. If an outer affix's flag isn't here, there's no need to do 2-level affix stripping with it.
-
secondStageSuffixFlags
private char[] secondStageSuffixFlags
All flags used in affix continuation classes. If an outer affix's flag isn't here, there's no need to do 2-level affix stripping with it.
-
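A membership test over such a flag array is what lets the stemmer skip 2-level stripping early. The following sketch assumes the flags are kept as a sorted `char[]` and checked with a binary search. The page lists a `toSortedCharArray` helper, but the exact lookup strategy is an assumption here, and `SecondStageFlagsSketch` and `isSecondStage` are hypothetical names.

```java
import java.util.Arrays;

/** Illustrative sorted-array membership check for continuation flags. */
class SecondStageFlagsSketch {
  private final char[] secondStageFlags;

  SecondStageFlagsSketch(char[] continuationFlags) {
    // Copy and sort once so that lookups can use binary search.
    char[] copy = continuationFlags.clone();
    Arrays.sort(copy);
    this.secondStageFlags = copy;
  }

  /** True if the flag occurs in some affix continuation class. */
  boolean isSecondStage(char flag) {
    return Arrays.binarySearch(secondStageFlags, flag) >= 0;
  }

  public static void main(String[] args) {
    SecondStageFlagsSketch flags = new SecondStageFlagsSketch(new char[] {'Z', 'B', 'M'});
    System.out.println(flags.isSecondStage('M'));
    System.out.println(flags.isSecondStage('A'));
  }
}
```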
circumfix
char circumfix -
keepcase
char keepcase -
forceUCase
char forceUCase -
needaffix
char needaffix -
forbiddenword
char forbiddenword -
onlyincompound
char onlyincompound -
compoundBegin
char compoundBegin -
compoundMiddle
char compoundMiddle -
compoundEnd
char compoundEnd -
compoundFlag
char compoundFlag -
compoundPermit
char compoundPermit -
compoundForbid
char compoundForbid -
checkCompoundCase
boolean checkCompoundCase -
checkCompoundDup
boolean checkCompoundDup -
checkCompoundRep
boolean checkCompoundRep -
checkCompoundTriple
boolean checkCompoundTriple -
simplifiedTriple
boolean simplifiedTriple -
compoundMin
int compoundMin -
compoundMax
int compoundMax -
compoundRules
CompoundRule[] compoundRules -
checkCompoundPatterns
List<CheckCompoundPattern> checkCompoundPatterns -
ignore
private char[] ignore -
tryChars
String tryChars -
neighborKeyGroups
String[] neighborKeyGroups -
enableSplitSuggestions
boolean enableSplitSuggestions -
repTable
-
mapTable
-
maxDiff
int maxDiff -
maxNGramSuggestions
int maxNGramSuggestions -
onlyMaxDiff
boolean onlyMaxDiff -
noSuggest
char noSuggest -
subStandard
char subStandard -
iconv
ConvTable iconv -
oconv
ConvTable oconv -
fullStrip
boolean fullStrip -
language
String language -
alternateCasing
private boolean alternateCasing -
BOM_UTF8
private static final byte[] BOM_UTF8 -
CHARSET_ALIASES
-
FLAG_SEPARATOR
private static final char FLAG_SEPARATOR
-
MORPH_SEPARATOR
private static final char MORPH_SEPARATOR
-
-
Constructor Details
-
Dictionary
public Dictionary(Directory tempDir, String tempFileNamePrefix, InputStream affix, InputStream dictionary) throws IOException, ParseException
Creates a new Dictionary containing the information read from the provided InputStreams to hunspell affix and dictionary files. You have to close the provided InputStreams yourself.
- Parameters:
tempDir - Directory to use for offline sorting
tempFileNamePrefix - prefix to use to generate temp file names
affix - InputStream for reading the hunspell affix file (won't be closed).
dictionary - InputStream for reading the hunspell dictionary file (won't be closed).
- Throws:
IOException - Can be thrown while reading from the InputStreams
ParseException - Can be thrown if the content of the files does not meet expected formats
-
Dictionary
public Dictionary(Directory tempDir, String tempFileNamePrefix, InputStream affix, List<InputStream> dictionaries, boolean ignoreCase) throws IOException, ParseException
Creates a new Dictionary containing the information read from the provided InputStreams to hunspell affix and dictionary files. You have to close the provided InputStreams yourself.
- Parameters:
tempDir - Directory to use for offline sorting
tempFileNamePrefix - prefix to use to generate temp file names
affix - InputStream for reading the hunspell affix file (won't be closed).
dictionaries - InputStreams for reading the hunspell dictionary files (won't be closed).
- Throws:
IOException - Can be thrown while reading from the InputStreams
ParseException - Can be thrown if the content of the files does not meet expected formats
-
Dictionary
public Dictionary(InputStream affix, List<InputStream> dictionaries, boolean ignoreCase, SortingStrategy sortingStrategy) throws IOException, ParseException
Creates a new Dictionary containing the information read from the provided InputStreams to hunspell affix and dictionary files. You have to close the provided InputStreams yourself.
- Parameters:
affix - InputStream for reading the hunspell affix file (won't be closed).
dictionaries - InputStreams for reading the hunspell dictionary files (won't be closed).
sortingStrategy - the entry strategy for the dictionary loading
- Throws:
IOException - Can be thrown while reading from the InputStreams
ParseException - Can be thrown if the content of the files does not meet expected formats
-
-
Method Details
-
formStep
int formStep() -
lookupWord
Looks up Hunspell word forms from the dictionary -
lookupPrefix
-
lookupSuffix
-
lookup
-
nextArc
-
readAffixFile
private void readAffixFile(InputStream affixStream, CharsetDecoder decoder, FlagEnumerator flags) throws IOException, ParseException
Reads the affix file through the provided InputStream, building up the prefix and suffix maps.
- Parameters:
affixStream - InputStream to read the content of the affix file from
decoder - CharsetDecoder to decode the content of the file
- Throws:
IOException - Can be thrown while reading from the InputStream
ParseException
-
checkCriticalDirectiveSame
private void checkCriticalDirectiveSame(String directive, LineNumberReader reader, Object expected, Object actual) throws ParseException - Throws:
ParseException
-
parseMapEntry
- Throws:
ParseException
-
hasLanguage
-
lookupEntries
- Parameters:
root - a string to look up in the dictionary. No case conversion or affix removal is performed. To get the possible roots of any word, you may call Hunspell.getRoots(String)
- Returns:
the dictionary entries for the given root, or null if there's none
-
dictEntry
-
extractLanguageCode
-
parseNum
- Throws:
ParseException
-
singleArgument
- Throws:
ParseException
-
firstArgument
- Throws:
ParseException
-
splitBySpace
private String[] splitBySpace(LineNumberReader reader, String line, int expectedParts) throws ParseException - Throws:
ParseException
-
splitBySpace
private String[] splitBySpace(LineNumberReader reader, String line, int minParts, int maxParts) throws ParseException - Throws:
ParseException
-
parseCompoundRules
private CompoundRule[] parseCompoundRules(LineNumberReader reader, int num) throws IOException, ParseException - Throws:
IOException
ParseException
-
parseBreaks
private Dictionary.Breaks parseBreaks(LineNumberReader reader, String line) throws IOException, ParseException - Throws:
IOException
ParseException
-
affixFST
- Throws:
IOException
-
parseAffix
private void parseAffix(TreeMap<String, IntArrayList> affixes, CharHashSet secondStageFlags, String header, LineNumberReader reader, AffixKind kind, Map<String, Integer> seenPatterns, Map<String, Integer> seenStrips, FlagEnumerator flags) throws IOException, ParseException
Parses a specific affix rule, putting the result into the provided affix map.
- Parameters:
affixes - Map where the result of the parsing will be put
header - Header line of the affix rule
reader - BufferedReader to read the content of the rule from
seenPatterns - map from condition -> index of patterns, for deduplication.
- Throws:
IOException - Can be thrown while reading the rule
ParseException
-
affixData
char affixData(int affixIndex, int offset) -
isCrossProduct
boolean isCrossProduct(int affix) -
getAffixCondition
int getAffixCondition(int affix) -
parseConversions
private ConvTable parseConversions(LineNumberReader reader, int num) throws IOException, ParseException - Throws:
IOException
ParseException
-
readConfig
private void readConfig(InputStream stream, Charset streamCharset) throws IOException, ParseException
Parses the encoding and flag format specified in the provided InputStream.
- Throws:
IOException
ParseException
-
maybeConsume
Consume the provided byte sequence in full, if present. Otherwise leave the input stream intact.
- Returns:
true if the sequence matched and has been consumed.
- Throws:
IOException
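The "consume in full or leave intact" contract maps naturally onto `BufferedInputStream.mark` and `reset`. The sketch below is one way to meet that contract and is not necessarily how the library does it. The class name `MaybeConsumeSketch` is hypothetical, and the UTF-8 byte order mark is used only as a sample sequence.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;

/** Illustrative all-or-nothing prefix consumption using mark/reset. */
class MaybeConsumeSketch {

  /** Consumes {@code bytes} if the stream starts with them; otherwise rewinds and returns false. */
  static boolean maybeConsume(BufferedInputStream stream, byte[] bytes) throws IOException {
    // Remember the current position; at most bytes.length bytes are read before a reset.
    stream.mark(bytes.length);
    for (byte expected : bytes) {
      int next = stream.read();
      if (next != (expected & 0xff)) {
        // Mismatch or end of stream: restore the stream to where it was.
        stream.reset();
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) throws IOException {
    byte[] bom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};
    byte[] data = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'S', 'E', 'T'};
    BufferedInputStream in = new BufferedInputStream(new ByteArrayInputStream(data));
    System.out.println(maybeConsume(in, bom));
    System.out.println((char) in.read());
  }
}
```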
-
getDecoder
Retrieves the CharsetDecoder for the given encoding. Note: this isn't perfect, as encodings such as ISCII-DEVANAGARI and MICROSOFT-CP1251 are presumably also allowed.
- Parameters:
encoding - Encoding to retrieve the CharsetDecoder for
- Returns:
CharsetDecoder for the given encoding
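As a rough illustration of resolving an encoding name to a decoder, the sketch below first consults an alias table and then configures the decoder to replace undecodable input instead of failing. The alias entry, the replace-on-error policy, and the names `DecoderSketch`, `ALIASES` and `decode` are assumptions made for this example. The page only shows that a `CHARSET_ALIASES` field and a `replacingDecoder` method exist.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.util.HashMap;
import java.util.Map;

/** Illustrative encoding-name resolution; not the library's implementation. */
class DecoderSketch {
  // Hypothetical alias table: Hunspell encoding name -> Java charset name.
  private static final Map<String, String> ALIASES = new HashMap<>();

  static {
    ALIASES.put("microsoft-cp1251", "windows-1251");
  }

  /** Resolves aliases, then builds a decoder that substitutes U+FFFD for bad input. */
  static CharsetDecoder getDecoder(String encoding) {
    String name = ALIASES.getOrDefault(encoding, encoding);
    return Charset.forName(name)
        .newDecoder()
        .onMalformedInput(CodingErrorAction.REPLACE)
        .onUnmappableCharacter(CodingErrorAction.REPLACE);
  }

  /** Convenience wrapper that decodes a byte array with the resolved decoder. */
  static String decode(String encoding, byte[] bytes) {
    try {
      return getDecoder(encoding).decode(ByteBuffer.wrap(bytes)).toString();
    } catch (CharacterCodingException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    // 0xFF is never valid in UTF-8, so it is replaced rather than rejected.
    System.out.println(decode("UTF-8", new byte[] {'a', (byte) 0xFF, 'b'}));
  }
}
```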
-
replacingDecoder
-
getFlagParsingStrategy
Determines the appropriate Dictionary.FlagParsingStrategy based on the FLAG definition line taken from the affix file.
- Parameters:
flagLine - Line containing the flag information
- Returns:
FlagParsingStrategy that handles parsing flags in the way specified in the FLAG definition
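To make the selection step concrete, here is a hedged sketch that maps a FLAG line to a format tag. The accepted values (`long`, `num`, `UTF-8`) follow the Hunspell affix file format. The enum, the class name `FlagLineSketch` and the error messages are hypothetical, and the real method returns a strategy object rather than an enum.

```java
/** Illustrative mapping from a FLAG definition line to a flag format. */
class FlagLineSketch {
  enum FlagFormat { LONG, NUM, UTF8 }

  /** Parses a line such as "FLAG long" and returns the declared format. */
  static FlagFormat flagFormat(String flagLine) {
    String[] parts = flagLine.trim().split("\\s+");
    if (parts.length != 2 || !parts[0].equals("FLAG")) {
      throw new IllegalArgumentException("Illegal FLAG specification: " + flagLine);
    }
    switch (parts[1]) {
      case "long":
        return FlagFormat.LONG; // two ASCII characters per flag
      case "num":
        return FlagFormat.NUM; // decimal numbers
      case "UTF-8":
        return FlagFormat.UTF8; // one UTF-8 encoded character per flag
      default:
        throw new IllegalArgumentException("Unknown flag type: " + parts[1]);
    }
  }

  public static void main(String[] args) {
    System.out.println(flagFormat("FLAG long"));
    System.out.println(flagFormat("FLAG num"));
  }
}
```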
-
unescapeEntry
-
shouldSkipEscapedChar
private static boolean shouldSkipEscapedChar(char ch) -
morphBoundary
-
indexOfSpaceOrTab
-
mergeDictionaries
private void mergeDictionaries(List<InputStream> dictionaries, CharsetDecoder decoder, SortingStrategy.EntryAccumulator acc) throws IOException - Throws:
IOException
-
writeNormalizedWordEntry
private void writeNormalizedWordEntry(StringBuilder reuse, String line, SortingStrategy.EntryAccumulator acc) throws IOException - Throws:
IOException
-
addHiddenCapitalizedWord
private void addHiddenCapitalizedWord(StringBuilder reuse, SortingStrategy.EntryAccumulator acc, String word, String afterSep) throws IOException - Throws:
IOException
-
toLowerCase
-
toTitleCase
-
readSortedDictionaries
private WordStorage readSortedDictionaries(FlagEnumerator flags, SortingStrategy.EntrySupplier sorted) throws IOException - Throws:
IOException
-
hashFactor
protected double hashFactor()
The factor determining the size of the internal hash table used for storing the entries. The table size is entry_count * hashFactor. The default factor is 1.0. If there are too many hash collisions, the factor can be increased, resulting in faster access but higher memory usage.
-
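The size formula can be checked with a few lines of arithmetic. The sketch below only demonstrates the relationship between entry count, factor and table size. `HashFactorSketch`, `tableSize` and `bucket` are hypothetical names, and the rounding and hashing used internally are not described on this page.

```java
/** Illustrative arithmetic for a hash table sized as entry_count * hashFactor. */
class HashFactorSketch {

  /** Table size for the given number of entries (at least one slot). */
  static int tableSize(int entryCount, double hashFactor) {
    return Math.max(1, (int) (entryCount * hashFactor));
  }

  /** Slot for a word; a larger table spreads the same words over more slots. */
  static int bucket(String word, int tableSize) {
    return Math.floorMod(word.hashCode(), tableSize);
  }

  public static void main(String[] args) {
    int entries = 1000;
    System.out.println(tableSize(entries, 1.0));
    System.out.println(tableSize(entries, 1.5));
    System.out.println(bucket("word", tableSize(entries, 1.5)));
  }
}
```

A subclass that sees many collisions would override `hashFactor()` to return a value above 1.0, trading memory for fewer entries per slot.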
tolerateAffixRuleCountMismatches
protected boolean tolerateAffixRuleCountMismatches()
Whether incorrect PFX/SFX rule counts should be silently ignored. False by default: a ParseException will be thrown.
-
tolerateDuplicateConversionMappings
protected boolean tolerateDuplicateConversionMappings()
Whether duplicate ICONV/OCONV lines should be silently ignored. False by default: an IllegalStateException will be thrown.
-
allNonSuggestibleFlags
char[] allNonSuggestibleFlags() -
readMorphFields
-
addMorphFields
-
addPhoneticRepEntries
-
isDotICaseChangeDisallowed
boolean isDotICaseChangeDisallowed(char[] word) -
parseAlias
-
getAliasValue
-
parseMorphAlias
-
splitMorphData
-
hasFlag
-
isFlagAppendedByAffix
boolean isFlagAppendedByAffix(int affixId, char flag) -
hasFlag
boolean hasFlag(int entryId, char flag) -
mayNeedInputCleaning
boolean mayNeedInputCleaning() -
needsInputCleaning
-
cleanInput
-
toSortedCharArray
-
isSecondStagePrefix
boolean isSecondStagePrefix(char flag) -
isSecondStageSuffix
boolean isSecondStageSuffix(char flag) -
caseFold
char caseFold(char c)
Folds a single character (according to LANG if present).
-
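Language-dependent folding matters mostly for Turkic languages, where the dotted and dotless forms of the letter I are distinct. The sketch below shows the general idea only. The language codes checked, the exact mappings, and the names `CaseFoldSketch` and `caseFold(char, String)` are assumptions. The real method takes just the character and consults the dictionary's own LANG setting.

```java
/** Illustrative language-aware folding of a single character. */
class CaseFoldSketch {

  /** Lowercases {@code c}, using Turkic rules for I when the language is "tr" or "az" (assumed codes). */
  static char caseFold(char c, String language) {
    boolean turkic = "tr".equals(language) || "az".equals(language);
    if (turkic) {
      if (c == 'I') {
        return '\u0131'; // LATIN SMALL LETTER DOTLESS I
      }
      if (c == '\u0130') { // LATIN CAPITAL LETTER I WITH DOT ABOVE
        return 'i';
      }
    }
    return Character.toLowerCase(c);
  }

  public static void main(String[] args) {
    System.out.println(caseFold('I', "en"));
    System.out.println(caseFold('I', "tr"));
  }
}
```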
getIgnoreCase
public boolean getIgnoreCase()
Returns true if this dictionary was constructed with the ignoreCase option.
-
getDefaultTempDir
Returns the default temporary directory pointed to by java.io.tmpdir. If not accessible or not available, an IOException is thrown.
- Throws:
IOException
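The documented behavior (resolve `java.io.tmpdir`, fail with an `IOException` when it is unusable) can be approximated as follows. This is a sketch under stated assumptions: "not accessible" is interpreted here as "not writable", and the class name `DefaultTempDirSketch` and the message texts are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Illustrative lookup of the default temporary directory. */
class DefaultTempDirSketch {

  /** Returns the directory named by java.io.tmpdir, or throws if it cannot be used. */
  static Path defaultTempDir() throws IOException {
    String tmp = System.getProperty("java.io.tmpdir");
    if (tmp == null) {
      throw new IOException("System property java.io.tmpdir is not set");
    }
    Path dir = Paths.get(tmp);
    // Assumption: a usable temp directory must be writable.
    if (!Files.isWritable(dir)) {
      throw new IOException("Temporary directory is not writable: " + dir.toAbsolutePath());
    }
    return dir;
  }

  public static void main(String[] args) throws IOException {
    System.out.println(defaultTempDir());
  }
}
```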
-