fts-dovecot Plugin
Dovecot Pro Full Text Search (FTS) is a proprietary, Pro-only FTS plugin. It provides fast and compact indexing of search data.
All Dovecot indexes, including FTS indexes, are stored in the same storage (including object storage) used to store the mail and index data. No separate permanent storage media is needed for the FTS indexes.
The pre- and post-processing of input data and search terms relies heavily on the upper-level fts plugin and lib-fts. Most of the configuration options affect lib-fts functionality.
The Dovecot FTS indexes are created and queried by a custom FTS engine. The FTS engine component is loaded into the Dovecot FTS plugin as an index backend and it processes text input from the FTS tokenizer and filter chains and search queries constructed by the FTS plugin.
Detailed setting information can be found below.
Example configuration (in dovecot.conf):
# These are assumed below,
# mail_location = obox:%2Mu/%2.3Mu/%u:INDEX=~/:CONTROL=~/
# obox_fs = fscache:512M:/var/cache/mails/%4Nu:s3:http://mails.s3.example.com/
mail_plugins = $mail_plugins fts fts_dovecot
plugin {
fts = dovecot
# Fall back to built in search.
#fts_enforced = no
# Local filesystem example:
# Use local filesystem storing FTS indexes
#fts_dovecot_fs = posix:prefix=%h/fts/
# OBOX example:
# Keep this the same as obox_fs plus the root "directory" in mail_location
# setting. Then append e.g. /fts/
# Example: s3:http://<ip.address.>/%2Mu/%2.3Mu/%u/fts/
fts_dovecot_fs = fts-cache:fscache:512M:/var/cache/fts/%4Nu:s3:http://fts.s3.example.com/%2Mu/%2.3Mu/%u/fts/
# Detected languages. Languages that are not recognized, default to the
# first enumerated language, i.e. en.
fts_languages = en fr # English and French.
# This chain of filters first normalizes and lower cases the text, then
# stems the words and lastly removes stopwords.
fts_filters = normalizer-icu snowball stopwords
# This chain of filters will first lowercase all text, stem the words,
# remove possessive suffixes, and remove stopwords.
fts_filters_en = lowercase snowball english-possessive stopwords
# These tokenizers will preserve addresses as complete search tokens, but
# otherwise tokenize the text into "words".
fts_tokenizers = generic email-address
fts_tokenizer_generic = algorithm=simple
# Proactively index mail as it is delivered or appended, not only when
# searching.
fts_autoindex=yes
# How many \Recent flagged mails a mailbox is allowed to have, before it
# is not autoindexed.
# This setting can be used to exclude mailboxes that are seldom accessed
# from automatic indexing.
fts_autoindex_max_recent_msgs=99
# Exclude mailboxes we do not wish to index automatically.
# These will be indexed on demand, if they are used in a search.
fts_autoindex_exclude = \Junk
fts_autoindex_exclude2 = \Trash
fts_autoindex_exclude3 = .DUMPSTER
}
Note
Dovecot Pro FTS engine relies on Dovecot core FTS libraries (and configuration) for several features, including filtering and tokenization.
See: fts plugin.
fts_dovecot_fs
Default | [None] |
---|---|
Value | string |
See Also |
Define the location for the FTS cache and indexes path on remote filesystems.
It must be somewhat synchronized with obox_fs and mail_location.
It is recommended that the FTS and email fscaches point to DIFFERENT locations.
fts_dovecot_mail_flush_interval
Default | 10 |
---|---|
Value | unsigned integer |
Changes |
|
Advanced Setting; this should not normally be changed. |
Upload locally cached FTS indexes to object storage every N new emails. This reduces the number of emails that have to be read after backend failure to update the FTS indexes, but at the cost of doing more writes to object storage.
fts_dovecot_max_triplets
Default | 200 |
---|---|
Value | unsigned integer |
Changes |
|
Advanced Setting; this should not normally be changed. |
FTS lookups will fail and an error message will be logged when the number of triplets exceeds the threshold specified in this setting. 0 means there is no maximum number of triplets.
fts_dovecot_message_count_stats
Default | no |
---|---|
Value | string |
Enable tracking per-folder message counts in the fts.S stats file. This is useful for the doveadm fts check fast command to return per-folder results. Note that this changes the fts.S file format to be backwards incompatible, so this should be enabled only after all backends in the cluster have been upgraded.
Old Dovecot versions won't fail when they see the new fts.S file, but it needs to be regenerated, which can temporarily cause bad performance.
fts_dovecot_min_merge_l_file_size
Default | 128 kB |
---|---|
Value | size |
Advanced Setting; this should not normally be changed. |
The smallest FTS triplet is recreated whenever new mails are indexed, until it reaches this size. After that, the triplet is merged with the next largest triplet.
When fts-cache is used, this effectively controls how large the fts.L file can become in metacache until the FTS triplet is uploaded to object storage.
fts_dovecot_prefix
Default | no |
---|---|
Value | string |
Specifies how prefix search should be invoked. May not work with some filters.
Options:
Value | Description |
---|---|
yes | Equivalent to 0-255 |
<num>-[<num>] | Search strings with that length will be treated as prefixes (e.g. 4-, 3-10) |
no | No prefix searching is performed |
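The option table above can be read as a length range. The following is a minimal sketch of that reading, not the plugin's actual parser. It assumes that an open-ended range such as 4- is capped at 255 (by analogy with yes being 0-255), that both bounds are inclusive, and that length is counted in characters. The function names are hypothetical.

```python
from typing import Optional, Tuple

# Assumption: open-ended ranges ("4-") share the upper bound of "yes" (0-255).
MAX_LEN = 255


def parse_prefix_setting(value: str) -> Optional[Tuple[int, int]]:
    """Return the (min, max) search-string lengths treated as prefixes, or None."""
    value = value.strip()
    if value == "no":
        return None
    if value == "yes":
        return (0, MAX_LEN)
    low, sep, high = value.partition("-")
    if not sep or not low.isdecimal() or (high and not high.isdecimal()):
        raise ValueError(f"invalid fts_dovecot_prefix value: {value!r}")
    return (int(low), int(high) if high else MAX_LEN)


def is_prefix_search(value: str, term: str) -> bool:
    """True if a search term of this length falls inside the configured range."""
    bounds = parse_prefix_setting(value)
    return bounds is not None and bounds[0] <= len(term) <= bounds[1]
```

Under these assumptions, a setting of 3-10 would treat a four-character search string as a prefix, while 4- would not treat a two-character string as one.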
fts_dovecot_too_many_triplets
Emitted when the number of triplets exceeds the limit defined by fts_dovecot_max_triplets.
Field | Description |
---|---|
duration | Duration of the event (in microseconds) |
reason_code | List of reason code strings why the event happened. See event reasons for possible values. |
triplet_count | Number of triplets found |
user | Username of the user. |
session | Session ID for the storage session. |
service | Name of the service. Examples: Added: 3.0.0 |
The Dovecot Pro FTS backend provides doveadm fts check commands, which can be used to determine whether a rescan is necessary.
The FTS indexes can sometimes become out-of-sync with the actual mailbox. Some messages could be missing and some could be leaked. In theory it should not be possible to have missing mails in FTS, but there still seem to be some bugs left. Leaked messages (i.e. already deleted messages that still appear in FTS) are possible in case of unexpected crashes or storage errors.
The consistency of FTS indexes can be checked using the doveadm fts check fast and doveadm fts check full commands. These are intended to be run in e.g. nightly batch jobs. The "fast" check is expected to be run nightly for all the users in local metacache, since it doesn't access object storage. However, it might not always have all the information needed to give a reliable answer on whether the FTS indexes are synced, in which case some of the numbers may be either "?" or "123?". There is a --refresh parameter, which can be used to do the necessary object storage accesses to give reliable results. However, at that point it might be better to just run a "full" check instead.
After all backends in the cluster have been upgraded to the new Dovecot version, the fts_dovecot_message_count_stats setting should be enabled. This allows per-folder results for the "fast" scan, which makes the scan more reliable and more detailed. After the setting is enabled, all the triplets in fts.S files still need to be refreshed for the per-folder result to work. This happens for newly written triplets automatically, but eventually it is necessary to use the --refresh parameter (or some other method) to add the missing information for older triplets.
When these checks are run nightly, it's possible to find out quickly when something breaks. This means it's possible to fix the FTS indexes before users notice that search isn't finding some messages. It also makes it easier for Dovecot developers to find and fix any remaining FTS bugs, because we can be sure that the bug happened within the last 24 hours and all the logs are still available during that time.
The idea for the nightly script is to:
1. Run doveadm fts check fast for all users that have recently been accessed in the metacache.
2. Run doveadm fts check full for users to find exactly what differences there really are.
3. Run doveadm fts rescan followed by doveadm index to reindex users that have missing mails. This unfortunately for now requires reindexing all of the messages for the user.
doveadm fts check fast fields:
Field | Description |
---|---|
autoindex | "yes" or "no" depending on whether the mailbox matches fts_autoindex settings. |
mailbox uidnext | The expected UID for the next message that is saved to the mailbox. This can be compared against the "fts highest uid"+1. |
fts highest uid | The highest UID in the mailbox that has been FTS indexed. |
mailbox total count | Total number of messages in the mailbox, also including messages that haven't even been attempted to be FTS indexed. |
expected fts count | Expected number of messages in FTS index, based on "fts highest uid" and the current mailbox state. |
fts count | Actual number of messages in FTS index. |
fts expunges | Number of messages marked as expunged in the fts.X file, but not yet purged from the FTS triplets. This is already included in the calculation to produce the "fts count" field, so it's only for informative/debugging purposes. |
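One possible way to interpret these fields in a batch job is sketched below. This is an illustration only: the input is a hypothetical dictionary keyed by the field names in the table, with values as strings, and the result labels ("uncertain", "behind", "count-mismatch", "ok") are invented for this example. The real command output format is not assumed here.

```python
from typing import Dict, List, Optional

CHECKED_FIELDS = (
    "mailbox uidnext",
    "fts highest uid",
    "expected fts count",
    "fts count",
)


def parse_count(raw: str) -> Optional[int]:
    """Return the number, or None for unreliable values such as "?" or "123?"."""
    return int(raw) if raw.isdecimal() else None


def assess_fast_check(fields: Dict[str, str]) -> List[str]:
    """Summarize one mailbox row of a "fast" check (hypothetical labels)."""
    values = {name: parse_count(fields[name]) for name in CHECKED_FIELDS}
    if any(v is None for v in values.values()):
        # The fast check lacked information; a refresh or full check is needed.
        return ["uncertain"]
    findings = []
    # "mailbox uidnext" can be compared against "fts highest uid" + 1.
    if values["fts highest uid"] + 1 < values["mailbox uidnext"]:
        findings.append("behind")
    # The actual FTS count should match the expected one.
    if values["fts count"] != values["expected fts count"]:
        findings.append("count-mismatch")
    return findings or ["ok"]
```

A "behind" result is not necessarily an error: a mailbox that does not match the fts_autoindex settings is only indexed on demand.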
doveadm fts check full states:
State | Description |
---|---|
synced | Message exists in both mailbox and in FTS. |
synced_expunged | Message doesn't exist in mailbox, but it's correctly marked as expunged in FTS (but not yet purged out of the triplets). |
missing | Message exists in mailbox, but is missing from FTS. It needs to be reindexed. |
unexpunged | Message exists in mailbox, but it was already marked as expunged in FTS, although it's not yet purged from triplets. This isn't supposed to happen. |
missing_unexpunged | Message exists in mailbox, but it was already marked as expunged in FTS and already purged from triplets. This really isn't supposed to happen. |
leaked | Message doesn't exist in mailbox, but it exists in FTS. The same message may be leaked multiple times in different triplets (they are not counted as "duplicate"). |
expunge_leaked | Message doesn't exist in mailbox or triplets, but it is marked as expunged in FTS. The messages were never removed from the fts.X file. There were various bugs that caused this to happen. |
duplicate | Message exists in mailbox, and multiple times in FTS. The first time is counted as "synced", "synced_expunged" or "unexpunged" while the other instances are "duplicate". |
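The states above amount to a truth table over three facts: whether the message exists in the mailbox, how many times it appears in the triplets, and whether it is marked as expunged in FTS. The sketch below encodes that table. It is a reading of the descriptions, not the command's implementation, and the function and parameter names are hypothetical.

```python
from typing import List


def classify(in_mailbox: bool, fts_copies: int, marked_expunged: bool) -> List[str]:
    """Return the states that one message would contribute to a "full" check.

    fts_copies is the number of times the message appears in FTS triplets.
    """
    if fts_copies == 0:
        if in_mailbox:
            return ["missing_unexpunged" if marked_expunged else "missing"]
        # Not in the mailbox and not in any triplet.
        return ["expunge_leaked"] if marked_expunged else []
    if not in_mailbox and not marked_expunged:
        # Leaked copies are counted individually, never as "duplicate".
        return ["leaked"] * fts_copies
    if in_mailbox:
        first = "unexpunged" if marked_expunged else "synced"
    else:
        first = "synced_expunged"
    # The first copy gets the regular state, further copies are "duplicate".
    return [first] + ["duplicate"] * (fts_copies - 1)
```

For example, a message that exists in the mailbox and appears three times in FTS yields one "synced" and two "duplicate" entries.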
See doveadm-fts(1) for a detailed list of parameters and command exit codes.
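The nightly workflow described earlier could be organized roughly as follows. This is a sketch of the control flow only: the three callables are hypothetical wrappers that a site would implement around doveadm fts check fast, doveadm fts check full, and doveadm fts rescan followed by doveadm index. No command-line options or output formats are assumed, and running the full check only for users whose fast check looks suspicious is a choice made for this example.

```python
from typing import Callable, Dict, Iterable, List


def nightly_fts_check(
    users: Iterable[str],
    fast_check_ok: Callable[[str], bool],
    full_check: Callable[[str], Dict[str, int]],
    reindex: Callable[[str], None],
) -> List[str]:
    """Run the fast/full/reindex sequence and return the reindexed users."""
    reindexed = []
    for user in users:
        # Step 1: cheap check that does not access object storage.
        if fast_check_ok(user):
            continue
        # Step 2: full check, returning a count per state (e.g. "missing").
        states = full_check(user)
        # Step 3: reindex only users that actually have missing mails.
        if states.get("missing", 0) > 0:
            reindex(user)
            reindexed.append(user)
    return reindexed
```

Passing the commands in as callables keeps the decision logic testable without a running Dovecot installation.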
Each account's mail is indexed into a small set of control files, and one or more triplets of files.
The control files are:
File | Purpose | Description |
---|---|---|
S | 'Stats' cache | Contains information about all of the triplets |
X | 'eXpunge' file | A list of mails to be expunged |
Y | 'expunged' file | A list of mails that have been expunged |
Both X and Y grow by being appended to. When Y grows to sufficient size to indicate that the X file contains old stuff, the contents of Y will be subtracted from X, and Y will be deleted. This is automatic as part of an expunge.
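The X/Y bookkeeping can be pictured with a small model. The sketch below treats both files as in-memory lists of entries and uses a made-up ratio as the trigger, because the text does not state the actual size criterion. The entry type, function name, and threshold are all assumptions for illustration.

```python
from typing import List, Tuple

Entry = Tuple[str, int]  # hypothetical entry: (mailbox_guid, uid)


def compact_expunge_lists(
    x: List[Entry], y: List[Entry], ratio: float = 0.5
) -> Tuple[List[Entry], List[Entry]]:
    """Subtract Y from X and empty Y once Y is large relative to X.

    The ratio is an invented stand-in for "Y grows to sufficient size".
    """
    if not x or len(y) < ratio * len(x):
        return x, y  # Y is still small: keep appending to both
    done = set(y)
    remaining = [entry for entry in x if entry not in done]
    return remaining, []  # Y is deleted after the subtraction
```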
Each triplet consists of the following files:
File | Purpose | Description |
---|---|---|
D | 'Docindex', or index of documents | Contains { mailbox_guid, uid, header/mime_part } info |
W | 'Wordlist' | Contains all the indexed words, and offsets into the L file |
L | 'docList' | Contains lists of indices into the D file |
To perform a lookup of a word, find the L-offset for that word from the W file. From that offset in the L file, read the list of docidx (document index) values. From the D file, look up the { guid, uid, hdr/part } values.
This sounds complicated, but if a word is not found, you don't need to touch the L and D files. If (AND) searching for multiple words, and one of the words is not in the W file, then you don't need to touch the L file. If (AND) searching, and the intersection of the lists in the L file is empty, then you don't need to touch the D file.
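The lookup order and its early exits can be illustrated with an in-memory model of one triplet. This is a sketch under simplifying assumptions: W is a dictionary from word to L-offset, L is a dictionary from offset to a list of docidx values, and D is a list of document entries. The real on-disk formats are not modeled, and the names are hypothetical.

```python
from typing import Dict, List, Tuple

DocRef = Tuple[str, int, str]  # (mailbox_guid, uid, header/mime_part)


def and_lookup(
    words: List[str],
    w_file: Dict[str, int],
    l_file: Dict[int, List[int]],
    d_file: List[DocRef],
) -> List[DocRef]:
    """Return the documents containing all of the given words."""
    if not words:
        return []
    offsets = []
    for word in words:
        if word not in w_file:
            return []  # a word is missing: L and D are never touched
        offsets.append(w_file[word])
    # Read one docidx list per word from L and intersect them.
    hits = set(l_file[offsets[0]])
    for offset in offsets[1:]:
        hits &= set(l_file[offset])
    if not hits:
        return []  # empty intersection: D is never touched
    return [d_file[docidx] for docidx in sorted(hits)]
```

Only a query that survives both early exits reads the D entries for its matches.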
These three files can be considered as 2 dimensional data, with W and D being the two axes, and L being the 2D region itself. Preferably in typical use the L files dominate the sizes. However, because deciding what is and isn't a "word" is hard, the W files also can grow very large.
For storage planning, the product decision is to assume that no FTS file will exceed 500 MB. Theoretically, files could grow past that size, but allowing non-sparse objects to be used in Scality (for obox) is a valid trade-off for better performance.
Stats for each triplet are cached in the 'S' file - this includes the number of entities (documents (= headers + parts) for D, words for W, and matches for L files).
Maxuid stats for every mailbox_guid in each triplet are also cached in the same file. This helps give fast answers to some common queries.
By default FTS has no read or write caches. When indexing a new mail the FTS indexes are immediately written to the storage. With object storages this means quite a lot of write and delete operations. To optimize this, "fts-cache" was implemented for write caching. The fts-cache causes the last triplet to be kept in local metacache until one of the following happens:
- The triplet reaches the size set by fts_dovecot_min_merge_l_file_size (default: 128 kB).
- New mails have been indexed up to the fts_dovecot_mail_flush_interval number of mails.
FTS is commonly also configured to use fscache, which caches reading of FTS triplets that were already saved to the object storage.
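The two write-cache conditions can be expressed as a simple predicate. The sketch below is an interpretation of the setting descriptions rather than the plugin's logic: it assumes that 128 kB means 128 * 1024 bytes, that both limits are inclusive, and that the function receives the current fts.L size and the number of mails indexed since the last upload. The function and parameter names are hypothetical.

```python
def should_upload_triplet(
    l_file_size: int,
    mails_since_upload: int,
    min_merge_l_file_size: int = 128 * 1024,  # fts_dovecot_min_merge_l_file_size
    mail_flush_interval: int = 10,  # fts_dovecot_mail_flush_interval
) -> bool:
    """True when the locally cached triplet would be written to object storage."""
    size_reached = l_file_size >= min_merge_l_file_size
    interval_reached = mails_since_upload >= mail_flush_interval
    return size_reached or interval_reached
```

A lower flush interval means fewer mails to re-read after a backend failure, at the cost of more object storage writes.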
The precise techniques for doing lookups depend on whether it's an AND or an OR query. AND permits early aborts before any of the L file is even touched. OR invites no such optimization.
Note
The kuromoji tokenizer is not distributed as part of the base Dovecot Pro package. This tokenizer requires separate licensing to use. Contact Open-Xchange Support for further information.
This tokenizer is used for Japanese text. This tokenizer utilizes Atilika Kuromoji tokenizer library to tokenize Japanese text.
This tokenizer also does NFKC normalization before tokenization, namely half-width and full-width character normalizations.
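The effect of NFKC on half-width and full-width characters can be seen with Python's standard unicodedata module. This only demonstrates the Unicode normalization form itself. It is not the tokenizer's own code, and the sample strings are chosen for this illustration.

```python
import unicodedata


def nfkc(text: str) -> str:
    """Apply Unicode NFKC normalization to the given text."""
    return unicodedata.normalize("NFKC", text)


# Full-width Latin letters and digits become their ordinary ASCII forms.
fullwidth_latin = nfkc("\uff21\uff22\uff23\uff11\uff12\uff13")

# Half-width katakana becomes regular full-width katakana.
halfwidth_kana = nfkc("\uff76\uff80\uff76\uff85")
```

Normalizing both the indexed text and the search terms this way lets either width variant match the other.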
We use the predefined set of stopwords which is recommended by Atilika. Those stopwords are reasonable and they have been made by tokenizing Japanese Wikipedia and have been reviewed by us. This set of stopwords is also included in the Apache Lucene and Solr projects and it is used by many Japanese search implementations.
maxlen
Maximum length of a token before an arbitrary cut-off is made. The default value for the kuromoji tokenizer is 1024.
kuromoji_split_compounds
This setting enables "search mode" in the Atilika Kuromoji library.
The setting defaults to enabled (i.e. 1) and should not be changed unless there is a compelling reason. To disable, set the value to 0.
WARNING
If this setting is changed, existing FTS indexes will produce unexpected results. The FTS indexes should be recreated in this case.
id
Description of the normalizing/transliterating rules to use. See Normalizer Format for syntax.
Defaults to Any-NFKC, which is quite good for CJK text mixed with Latin alphabet languages. It transforms CJK characters to full-width encoding and transforms Latin ones to half-width. The NFKC transformation is described above.
WARNING
If this setting is changed, existing FTS indexes will produce unexpected results. The FTS indexes should be recreated in this case.
The kuromoji tokenizer should be added to fts_tokenizers. Configuration should be done via the fts_tokenizer_kuromoji setting.
Example:
fts_tokenizers = generic email-address kuromoji
fts_tokenizer_kuromoji = maxlen=1024