Dictmap is a required component of obox.
Using obox with Cassandra is done via the fs-dictmap wrapper, which translates internal "lib-fs paths" into dict API calls.
The dict API paths are in turn translated to SQL/CQL queries via dict-sql.
Cassandra requires installing the Dovecot Pro Cassandra plugin package and the cpp-driver from the 3rdparty repository.
For obox, dictmap requires using the Cassandra dictionary.
TIP
Cassandra support is implemented via Dovecot's SQL dict, because Cassandra CQL is implemented as a lib-sql driver.
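To illustrate the layering, the sketch below follows a single email object lookup through the translation steps described above. The dict key comes from the dict paths listed later on this page and the columns from the example schema at the end; the exact dict-sql mapping and query shape are assumptions, not the literal queries Dovecot issues.
-- Illustration only: one email object lookup through the layers
--   lib-fs path:  <user>/mailboxes/<mailbox guid>/<object name>
--   dict key:     shared/dictmap/<user>/mailboxes/<mailbox guid>/<object name>
--   a prepared CQL query dict-sql might issue against the example schema:
SELECT i FROM user_mailbox_objects WHERE u = ? AND g = ? AND b = ? AND n = ?;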
@fs_dictmap_defaults
Default | [None] |
---|---|
Value | Groups Includes |
Allowed Values | cassandra |
See Also |
fs_dictmap_bucket_cache_path
Default | [None], obox { "%{home}/buckets.cache" } |
---|---|
Value | string |
See Also |
Required when fs_dictmap_bucket_size is set. Bucket counters are cached in this file. This path should be located under the obox indexes directory (on the SSD-backed cache mount point), e.g. %{home}/buckets.cache.
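A minimal sketch combining this setting with fs_dictmap_bucket_size, using the values shown on this page; placing the settings inside the obox filter is an assumption and may differ in your configuration:
obox {
  fs_dictmap_bucket_size = 10000
  # Bucket counters cached on the SSD-backed metacache mount point
  fs_dictmap_bucket_cache_path = %{home}/buckets.cache
}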
fs_dictmap_bucket_deleted_days
Default | 0, obox { 11 } |
---|---|
Value | unsigned integer |
See Also |
Track Cassandra's tombstones in the buckets.cache file to avoid creating excessively large buckets when a lot of mails are saved and deleted in a folder. The value should be one day longer than gc_grace_seconds for the user_mailbox_objects table. By default that is 10 days, so in that case fs_dictmap_bucket_deleted_days = 11 should be used. With this setting, the tombstones are also taken into account when determining whether fs_dictmap_bucket_size has been reached and a new bucket needs to be created. This tracking is preserved only as long as the buckets.cache file exists.
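For example, a hedged sketch assuming the user_mailbox_objects table uses Cassandra's default gc_grace_seconds of 10 days (verify against your own schema; the obox filter placement is an assumption):
# gc_grace_seconds = 864000 (10 days) -> track tombstones one day longer
obox {
  fs_dictmap_bucket_deleted_days = 11
}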
fs_dictmap_bucket_size
Default | 0, obox { 10000 } |
---|---|
Value | unsigned integer |
Dependencies | |
See Also |
Separate email objects into buckets, where each bucket can have a maximum of
this many emails. This should be set to 10000
with Cassandra to avoid
partitions becoming too large when there are a lot of emails.
fs_dictmap_cleanup_uncertain
Default | yes |
---|---|
Value | boolean |
See Also |
If a write to Cassandra fails with uncertainty and this setting is enabled, Dovecot attempts to clean it up.
fs_dictmap_delete_dangling_links
Default | no |
---|---|
Value | boolean |
See Also |
If an object exists in dict, but not in storage, delete it automatically from dict when it's noticed.
WARNING
This setting isn't safe to enable by default, because storage may temporarily return "object doesn't exist" errors during a split-brain situation.
fs_dictmap_delete_timestamp
Default | 10s |
---|---|
Value | time (milliseconds) |
Increase Cassandra's DELETE
timestamp by this value. This is useful to make
sure the DELETE
isn't ignored because Dovecot backends' times are slightly
different.
WARNING
If the same key is intentionally attempted to be written again soon afterwards,
the write becomes ignored. Dovecot doesn't normally do this, but this can
happen if the user is deleted with doveadm obox user delete
and the same
user is recreated. This can also happen with doveadm backup
that reverts
changes by deleting a mailbox; running the doveadm backup
again will
recreate the mailbox with the same GUID.
fs_dictmap_dict_prefix
Default | [None] |
---|---|
Value | string |
Prefix that is added to all dict keys.
fs_dictmap_diff_table
Default | no, metacache { yes } |
---|---|
Value | boolean |
See Also |
Store diff and self index bundle objects in a separate table. This is a Cassandra-backend optimization.
fs_dictmap_lock_path
Default | [None] |
---|---|
Value | string |
See Also |
If fs_dictmap_refcounting_table is enabled, use this directory for creating lock files for objects while they're being copied or deleted. This attempts to prevent race conditions where an object copy and delete run simultaneously and both succeed, but the copied object no longer exists. This can't be fully prevented if different servers do this concurrently. If the lazy-expunge plugin is used, this setting isn't really needed, because such race conditions are practically nonexistent. Not using this setting also improves performance by avoiding a Cassandra SELECT when copying mails.
fs_dictmap_max_parallel_iter
Default | 10 |
---|---|
Value | unsigned integer |
Changes | |
Describes how many parallel dict iterations can be created internally. The
default value is 10
. Parallel iterations can especially help speed up
reading huge folders.
fs_dictmap_nlinks_limit
Default | 0, obox { 3 } |
---|---|
Value | unsigned integer |
Defines the maximum number of results returned from a dictionary iteration lookup (i.e. a Cassandra CQL query) when checking the number of links to an object. Limiting this may improve performance. Currently Dovecot only cares whether the link count is 0, 1, or "more than 1", so for a bit of extra safety we recommend setting it to 3.
fs_dictmap_refcounting_index
Default | no |
---|---|
Value | boolean |
See Also |
Similar to the fs_dictmap_refcounting_table
setting, but instead of
using a reverse table to track the references, assume that the database has a
reverse index set up.
fs_dictmap_refcounting_table
Default | no, obox { yes } |
---|---|
Value | boolean |
See Also |
Enable reference counted objects. Reference counting allows a single mail object to be stored in multiple mailboxes, without the need to create a new copy of the message data in object storage.
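A hedged sketch that enables reference counting together with the related settings described above; the values follow the defaults and recommendations on this page, and the lock directory is a hypothetical example (only useful when lazy-expunge isn't used):
obox {
  fs_dictmap_refcounting_table = yes
  fs_dictmap_nlinks_limit = 3
  # Hypothetical lock directory for the copy/delete race described above
  fs_dictmap_lock_path = %{home}/lock
}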
fs_dictmap_storage_objectid_migrate
Default | no |
---|---|
Value | boolean |
This is expected to be used together with fs_dictmap_storage_objectid_prefix when adding fs-dictmap to an existing installation. Newly created object IDs have a <storage-objectid-prefix>/<object-id> path, while migrated object IDs have a <user>/mailboxes/<mailbox-guid>/<oid> path. Newly created object IDs can be detected from the 0x80 bit in the object ID's extra-data. Migrated object IDs can't be copied directly within the dict; they're first copied to a new object ID using the parent fs.
fs_dictmap_storage_objectid_prefix
Default | [None] |
---|---|
Value | string |
See Also |
Use fake object IDs with object storage that internally uses paths. This makes performance much better, since it allows caching object IDs in Dovecot index files and copying them via the dict. This works by storing objects in <prefix>/<objectid>. This setting should be used inside the obox named filter for storing mails under <prefix> (but not for metacache or fts).
For example:
fs_dictmap_storage_objectid_prefix = %{user}/mails/
fs_dictmap_storage_passthrough_paths
Default | none |
---|---|
Value | string |
Allowed Values | none, full, read-only |
See Also |
Use fake object IDs with object storage that internally uses paths. Assume that the object ID is the same as the path. Objects can't be copied within the dict. This setting should be used inside the metacache and fts_dovecot named filters, because they don't need to support copying objects. For mails, use fs_dictmap_storage_objectid_prefix instead.
Value | Description |
---|---|
none | Don't use fake object IDs. |
full | The object ID is written to dict as an empty value, because it's not used. |
read-only | Useful for backwards compatibility. The path is written to the dict as the object ID even though it is not used (except potentially by an older Dovecot version). |
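A minimal sketch of using passthrough object IDs for index bundle and FTS objects as described above; the filter names follow those mentioned on this page, but their exact placement may differ in your configuration:
metacache {
  fs_dictmap_storage_passthrough_paths = full
}
fts dovecot {
  fs_dictmap_storage_passthrough_paths = full
}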
The fs-dictmap uses the following dict paths:
Main access:
shared/dictmap/<path>
If fs_dictmap_refcounting_table is used:
shared/dictrevmap/<user>/mailboxes/<folder guid>/<object id>
shared/dictrevmap/<object id>/<object name>
shared/dictrevmap/<object id>
If fs_dictmap_diff_table is used:
shared/dictdiffmap/<user>/idx/<host>
shared/dictdiffmap/<user>/mailboxes/<folder guid>/idx/<host>
See Cassandra configuration for all Cassandra-specific settings.
cassandra_hosts = cassandra-host-3 cassandra-host-2 cassandra-host-1
dict_server {
dict mails {
driver = sql
sql_driver = cassandra
cassandra_keyspace = mails
@fs_dictmap_defaults = cassandra
}
}
The Cassandra settings are described in more detail in Cassandra configuration.
The following base tables are always needed by fs-dictmap:
user_index_objects
user_mailbox_index_objects
user_mailbox_objects
user_mailbox_buckets
user_fts_objects
For more details on Cassandra, see:
Cassandra doesn't handle row deletions very efficiently. The more rows are deleted, the larger the number of tombstones and the longer it takes to do lookups from the same partition.
Most of the deletions Dovecot does are index diff & self-bundle updates.
Each Dovecot backend server always writes only a single such object per folder, which allows storing them with a (user, folder, host) primary key and updating the rows on changes, instead of inserting and deleting rows.
The fs-dictmap fs_dictmap_diff_table
setting enables this behavior.
Diff-table requires these additional tables to exist in Cassandra:
user_index_diff_objects
user_mailbox_index_diff_objects
Reference counting allows a single mail object to be stored in multiple mailboxes, without the need to create a new copy of the message data in object storage. There are downsides to it, though, such as the extra Cassandra tracking and the copy/delete race conditions described under fs_dictmap_lock_path.
However, the benefits outweigh the concerns, as reference counting exchanges expensive storage operations for relatively cheap Cassandra row updates.
The fs-dictmap fs_dictmap_refcounting_table
setting enables this behavior.
Reference counting requires an additional table: user_mailbox_objects_reverse (see the example schema below).
There are only two configurations that are currently recommended:
Quorum within a single datacenter (default):
cassandra_read_consistency = local-quorum
cassandra_write_consistency = local-quorum
cassandra_delete_consistency = local-quorum
Local-quorum guarantees that reads after writes are always returning the latest data. Dovecot requires strong consistency within a datacenter.
Quorum within multiple datacenters:
cassandra_read_consistency = local-quorum
#cassandra_read_fallback_consistency = quorum
cassandra_write_consistency = each-quorum
cassandra_write_fallback_consistency = local-quorum
cassandra_delete_consistency = each-quorum
cassandra_delete_fallback_consistency = local-quorum
As long as the datacenters are talking to each other, this uses each-quorum for writes. If there's a problem, Cassandra nodes fall back to local-quorum and periodically try to switch back to each-quorum. The main benefit of each-quorum is that in case the local datacenter suddenly dies and loses data, Dovecot will not have responded OK to any mail deliveries that weren't already replicated to the other datacenters. Using local-quorum as the fallback ensures that in case of a network split the local datacenter still keeps working. Of course, if the local datacenter dies while the network is also split, there will be data loss.
Using cassandra_read_fallback_consistency = quorum allows reads to succeed even in cases when multiple Cassandra nodes have failed in the local datacenter. For example, with a replication factor of 3 in each of two datacenters, a quorum read needs 4 of the 6 replicas, which can still be reached via the remote datacenter even if 2 of the 3 local replicas are unavailable.
Note that if there are only a total of 3 Cassandra nodes per datacenter and 2 of them are lost, writes can't succeed with either each-quorum or local-quorum. In this kind of a configuration, having cassandra_read_fallback_consistency = quorum is not very useful.
Also note that there are no consistency settings that allow Dovecot to reliably continue operating if Cassandra in the local datacenter no longer has quorum, i.e. at least half of its nodes have gone down. In this case writes will always fail. If this happens, all users should be moved to be processed by another datacenter.
Dovecot normally sends the Cassandra queries with the primary consistency setting. If a query fails in a way that indicates the required consistency level can't currently be reached, Dovecot attempts the query again using the fallback consistency. When this happens, Dovecot also switches all the following queries to use the fallback consistency for a while. The consistency is switched back when a query with the primary consistency level succeeds again.
While fallback consistency is being used, the queries are periodically still retried with primary consistency level. The initial retry happens after 50 ms and the retries are doubled until they reach the maximum of 60 seconds.
Cassandra doesn't perform any rollbacks to writes. When Cassandra reports a write as failed, it only means that it wasn't able to verify that the required consistency level was reached yet. It's still likely/possible that the write was successful to some nodes. If even a single copy was written, Cassandra will eventually be consistent after hinted handoffs or repairs. This means that even though a write may initially have looked like it failed, the data can become visible sooner or later.
Changed in 3.0.0: When this happens, Dovecot attempts to revert the Cassandra write by deleting it. If the deletion was successful, the object is deleted from storage as well. This is indicated by appending - Object ID ... deleted to the original write error message.
If the deletion was unsuccessful, file write state is uncertain for object ID ... is logged.
For some writes the revert isn't possible, and success is uncertain, not deleting object ID ... is logged instead. This also happens when fs_dictmap_cleanup_uncertain is disabled. In these cases the object is not deleted from storage.
When the revert wasn't performed, the Cassandra write may become visible at some point later (possibly leading to duplicate mails). If it doesn't become visible, the object becomes leaked in the storage. Currently, to handle these situations, an external tool has to monitor the logs or exported events and fix up the uncertain writes once Cassandra is working normally again. See fs_dictmap_dict_write_uncertain.
fs-dictmap can also be used with object storages that are accessed by paths rather than by object IDs (e.g. S3). This needs special configuration to avoid unnecessary Cassandra lookups.
Use fs_dictmap_storage_objectid_prefix = <prefix>
inside obox
{ ... } filter to enable fake object IDs for email objects. These fake object IDs are stored in Dovecot index files, which can be translated into object paths without doing a Cassandra lookup. The translation is simply <prefix>/<object ID>
, unless the migrate feature is used.
Use fs_dictmap_storage_passthrough_paths = full
inside metacache
{ ... } and fts dovecot
{ ... } filters to enable passthrough object IDs. With these the object ID is the same as the object path. The object ID is written as an empty string into Cassandra. If this setting is used, the object can't be copied (which is fine, because it is not done for index bundle or FTS objects).
Use fs_dictmap_storage_objectid_migrate
to enable migration. Also make sure to disable periodic metacache uploads during migration by setting metacache_upload_interval
to infinite
. Use the storage-objectid-migrate-mails(1)
and storage-objectid-migrate-index(1)
scripts to migrate the indexes and mails. These scripts list all (index bundle, fts and email) objects for the user and add them to Cassandra. Note that the user must be completely inaccessible (imap, pop3, managesieve, mail deliveries) while these scripts are run to avoid data loss.
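A hedged sketch of the migration-related settings mentioned above; the prefix reuses the example given earlier on this page, and the filter placement is an assumption:
# Disable periodic metacache uploads for the duration of the migration
metacache_upload_interval = infinite
obox {
  fs_dictmap_storage_objectid_prefix = %{user}/mails/
  fs_dictmap_storage_objectid_migrate = yes
}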
Before migration the mails are stored in <user>/mailboxes/<mailbox_guid>/<oid> paths. The migration script adds all these mails to Cassandra using <oid> as the object ID. The obox-raw-id record is also set to <oid>. The "extra data" byte in the <oid> for path-based object storages is always 0. For all newly written emails, when storage-objectid-prefix is non-empty, the 0x80 bit is set in the "extra data" byte. This allows generating the object path from the obox-raw-id (<object-id>) without a Cassandra lookup:
If the 0x80 bit is set (newly written): <prefix>/<object-id>
Otherwise (migrated): <user>/mailboxes/<mailbox_guid>/<object-id>
Note that listing object IDs with e.g. doveadm fs iter --object-ids
doesn't add the path prefix. It only returns the <object-id>
.
Newly saved mails can be efficiently copied within dictmap, but migrated mails must first be copied from <user>/mailboxes/<mailbox_guid>/<old-object-id>
to <prefix>/<new-object-id>
.
CREATE KEYSPACE IF NOT EXISTS mails
WITH replication = {
'class': 'SimpleStrategy',
'replication_factor': 3
};
USE mails;
CREATE TABLE IF NOT EXISTS user_index_objects (
u text,
n text,
i blob,
primary key (u, n)
);
CREATE TABLE IF NOT EXISTS user_mailbox_index_objects (
u text,
g blob,
n text,
i blob,
primary key ((u, g), n)
);
CREATE TABLE IF NOT EXISTS user_mailbox_objects (
u text,
g blob,
b int,
n blob,
i blob,
primary key ((u, g, b), n)
);
CREATE TABLE IF NOT EXISTS user_mailbox_buckets (
u text,
g blob,
b int,
primary key ((u, g))
);
CREATE TABLE IF NOT EXISTS user_fts_objects (
u text,
n text,
i blob,
primary key (u, n)
);
CREATE TABLE IF NOT EXISTS user_index_diff_objects (
u text,
h text,
m text,
primary key (u, h)
);
CREATE TABLE IF NOT EXISTS user_mailbox_index_diff_objects (
u text,
g blob,
h text,
m text,
primary key (u, g, h)
);
CREATE TABLE IF NOT EXISTS user_mailbox_objects_reverse (
u text,
g blob,
n blob,
i blob,
primary key (i, n)
);
fs-dictmap works by providing a view to Cassandra that ends up looking like a filesystem, which is compatible with the obox mailbox format.
There are several hardcoded paths necessary to accomplish this.
The mapping between the filesystem and the dict keys is:
Filesystem Path | Dict Keys (shared/ prefix not included) | Files |
---|---|---|
$user | Hardcoded idx/ and mailboxes/ | |
$user/idx/ | | User root index bundles |
$user/mailboxes/ | | Folder GUID directories |
$user/mailboxes/$mailbox_guid/ | | Email objects |
$user/mailboxes/$mailbox_guid/idx/ | | Folder index bundles |
$user/fts/ | | Full text search index objects |
The filesystem can be accessed using the doveadm fs or doveadm mail fs commands. The config-filter-name parameter is either obox or metacache, depending on whether you're accessing email objects or index objects.
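For example, a hedged sketch of listing a user's email objects through the obox filter; the username is hypothetical and the exact argument order may vary between versions:
doveadm fs iter obox user@example.com/mailboxes/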
The included obox-user-objects(1)
and obox-user-iter(1)
scripts can be used to list all objects for a user.
Below is a list of operations that dictmap does for accessing data.
👁️: Cassandra operations | 📦: Object storage operations
Refreshing user root index
Refreshing folder index
Writing user root self/diff index
Writing user root base index
Writing folder diff/self index
Writing folder base index
Delivering a new email via LMTP, or saving a new email via IMAP APPEND
Reading email
Deleting email
Copying email
Moving email
Running "doveadm force-resync"
In the (extremely unlikely) case that all Cassandra (fs-dictmap) data is lost, it is possible to recover this information by iterating through all objects stored in the object store.
A rough overview of the process is as follows:
Per benchmark data, Cassandra node sizing can be estimated by assuming that 50 bytes are required per email. Thus, assuming 512 GB of total storage per Cassandra node (= 256 GB of usable storage + 256 GB for repairs/rebuilds), each node can store data for up to about 5.1 billion emails.
For high availability, a minimum of three nodes is required for each data center.
The Cassandra cpp-driver library requires a lot of VSZ memory. Make sure the dict process doesn't immediately die from running out of memory (this may also show up as strange crashes at startup) by disabling VSZ limits:
service dict-async {
vsz_limit = 0
}
Usually there should be only a single dict-async process running, because each process creates its own connections to the Cassandra cluster, increasing its load. The Cassandra cpp-driver can also use multiple IO threads; this is controlled by the cassandra_io_thread_count setting. Each IO thread can handle 32k requests simultaneously, so usually 1 IO thread is enough. Note that each IO thread creates more connections to Cassandra, so again it's better not to create too many threads unnecessarily. If all the IO threads are full of pending requests, queries start failing with an "All connections on all I/O threads are busy" error.
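A minimal sketch reflecting the guidance above; increase the thread count only if queries start failing with the busy error:
# One IO thread handles ~32k simultaneous requests, so one is usually enough
cassandra_io_thread_count = 1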
If you encounter Object exists in dict, but not in storage errors in the Dovecot Pro log file, you have most likely resurrected deleted data because of replication inconsistencies.