The error "Object exists in dict, but not in storage" means that an object (mail, index bundle, or FTS object) is listed as existing in Cassandra, but the referenced object ID doesn't exist in the configured object storage. This situation is not fixed automatically, to make sure there's no accidental data loss. There have been various reasons in the past why this has happened, and new instances still occur even after the latest fixes. This may be due to operational issues with Cassandra causing deleted data to become undeleted, since it doesn't happen in all customer installations.
These errors can also happen temporarily if there are delays with the object storage or Cassandra replication, for example during network splits. In this situation the errors go away automatically once the object storage and Cassandra replication catch up. This is the only expected reason for these errors.
These errors usually show up when users are moved between backends (e.g. after backend failures). Normally Dovecot doesn't do any unnecessary Cassandra work, so it doesn't notice when these problems happen. When a user/folder is accessed for the first time on a new backend, Dovecot lists all the index/mail objects instead of using the locally cached lists, to make sure no changes are lost when a backend crashes. This object listing is what usually reveals that there are suddenly new objects listed in Cassandra, which then aren't found in the object storage when Dovecot tries to access them. If users rarely move between backends, the original problem that caused the situation may have happened even months earlier.
Cassandra repairs must be run at least once every gc_grace_seconds (default: 10 days) to prevent already deleted objects from becoming undeleted. From the Cassandra manual at https://cassandra.apache.org/doc/latest/operating/repair.html :
At a minimum, repair should be run often enough that the gc grace period never expires on unrepaired data. Otherwise, deleted data could reappear. With a default gc grace period of 10 days, repairing every node in your cluster at least once every 7 days will prevent this, while providing enough slack to allow for delays.
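As an example only (the schedule and the keyspace name "mails" below are placeholders, not part of this documentation), repairs could be scheduled from cron on every Cassandra node, staggered per node so that each node is repaired at least once a week:

# Weekly primary-range repair of the placeholder keyspace "mails".
# -pr limits the repair to this node's primary token ranges, so running
# the job on every node covers the whole cluster once per week.
0 3 * * 0   nodetool repair -pr mails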
If a Cassandra node hasn't been connected to the cluster for gc_grace_seconds (default: 10 days), it must not be added back, or many deleted objects will become undeleted.
It's highly recommended to use the Lazy Expunge plugin. However, using short autoexpunge times for the Lazy Expunge folder may cause these problems. It's recommended to use an autoexpunge time of at least 1 day (although a few hours should usually be sufficient).
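As an illustration only (the EXPUNGED folder name and namespace layout are assumptions; adapt them to the installation's existing configuration), a folder-based lazy_expunge setup with a one-day autoexpunge might look like:

mail_plugins = $mail_plugins lazy_expunge
plugin {
  lazy_expunge = EXPUNGED
}
namespace inbox {
  mailbox EXPUNGED {
    autoexpunge = 1d
  }
}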
If the Lazy Expunge plugin is not used, the "lockdir" parameter must be used in the obox_fs setting.
Typically these problems are fixed either by manually running a fixup script or by writing automatic fixup scripts. The automation either looks at the log files or at fs_dictmap_object_lost events, and runs commands to fix the found problems.
WARNING
The fixup scripts must not be run if the object storage system is not properly connected/consistent. In some situations, if the object storage has internal problems connecting to all of its nodes, it may start returning "404 Object not found" errors for objects that actually still exist but just aren't accessible at the time. If a fixup script is run while this is happening, it may wrongly start deleting emails that still exist!
These errors may also temporarily pop up if Cassandra has had connectivity problems and its replication is lagging behind. These problems get fixed automatically as replication catches up with the pending deletes. However, there's no real harm in running a fixup script in this state, since it just causes the object to be deleted earlier by the fixup script rather than by Cassandra replication.
The "Object exists in dict, but not in storage" errors can be found by parsing the log files. Alternatively it's possible to export the fs_dictmap_object_lost
event to a HTTP endpoint:
metric fs_dictmap_object_lost {
  filter = event=fs_dictmap_object_lost
  exporter = json
}
event_exporter json {
  format = json
  format_args = time-rfc3339
  transport = http-post
  transport_args = https://example.com/api/endpoint/object-lost
  transport_timeout = 1 sec
}
The ideal automatic solution would track these errors separately for each object and, once the same error has repeated multiple times for the same object for over 1 hour, fix it automatically. A simpler solution would be to repair all objects whose errors were first seen over 1 hour ago. In both cases, the script should check the overall object storage state and not repair anything if the storage doesn't appear to be healthy.
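A minimal sketch of the simpler approach (the log and state file paths are assumptions): collect the object IDs from the error lines on each run, and only treat as repair candidates the IDs that were already present in the previous run, so the error has persisted across at least one scan interval. The object storage health check still needs to be added around this.

# Initialize once with: touch /var/tmp/object-lost.old
# Collect the object IDs seen in this scan.
grep -o 'Object exists in dict, but not in storage: [0-9a-f]*' /var/log/dovecot.log |
    awk '{print $NF}' | sort -u > /var/tmp/object-lost.new
# IDs present in both the previous and the current scan have persisted long
# enough to be repaired; newly seen IDs wait until the next run.
comm -12 /var/tmp/object-lost.old /var/tmp/object-lost.new > /var/tmp/object-lost.repair
mv /var/tmp/object-lost.new /var/tmp/object-lost.old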
The overall idea for the fixup scripts is to verify that the referenced object really doesn't exist in the object storage and then delete the dangling reference to it. This can be done either with direct object storage/Cassandra accesses, using doveadm dict/fs commands, or by accessing the objects via regular doveadm mailbox access commands with the delete-dangling-links option enabled. Only the delete-dangling-links method is described below.
Lost email objects can be found in two situations:
A mail object that already exists in the mailbox indexes is accessed via GET/HEAD. This can show up in different ways in the logs, for example:
Error: read(userid/mailboxes/48bf220df60f526073fb1b0042497420/ea931f3822b45860408b068342497420) failed: Object exists in dict, but not in storage: b1281fa5311bc3744d578ee8b4847e835b010020 (uid=1496, box=INBOX, read reason=access)
A mail object is being added to the mailbox index. This requires looking up its GUID, which requires HEADing the object. This shows up in the logs as
Warning: Rescanned index in ..., n assumed temporarily lost
where n specifies how many such mails were inaccessible. These errors don't prevent accessing the mailbox, but they cause inefficient behavior in Dovecot.
The script below enables the delete-dangling-links option for a doveadm command, which fetches all the mails' headers in the user's INBOX. It's necessary to trigger HTTP HEAD or GET calls to the object storage so that Dovecot sees the missing object and is then able to delete it. This fixes the mails that are already in the index, but not necessarily the "n assumed temporarily lost" mails. Those need to be fixed by having delete-dangling-links enabled while the index is being rescanned, which is easiest done with force-resync. The problematic emails may of course also be in folders other than INBOX.
obox_fs="`doveconf -h plugin/obox_fs`:delete-dangling-links"
obox_fs_expanded="`doveadm user -e "$obox_fs" user@domain`"
# Fix the "n assumed temporarily lost" (not needed if there aren't those):
doveadm -o plugin/obox_fs="$obox_fs_expanded" force-resync -u user@domain INBOX > /dev/null
# Fix the "object exists in dict, but not in storage" errors:
doveadm -o plugin/obox_fs="$obox_fs_expanded" fetch -u user@domain hdr mailbox INBOX > /dev/null
The above command opens all the email objects in the INBOX. This is rather inefficient and may take a long time if there are a lot of mails. You can also limit the access to only the emails that are visible in the error messages. For example, if the error message says the problem is with uid=1496, then you can use:
doveadm -o plugin/obox_fs="$obox_fs_expanded" fetch -u user@domain hdr mailbox INBOX UID 1496 > /dev/null
The problem with this method is that there may be other lost emails in the same folder that just weren't logged about yet, so you may need to keep monitoring and fixing the same user/folder multiple times with different email UIDs.
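Since the lost mails may also be in folders other than INBOX, another option is to run the same fetch over every folder of the user. A rough sketch, reusing the $obox_fs_expanded variable from above (this opens every mail in every folder, so it can be slow for large accounts):

doveadm mailbox list -u user@domain | while read -r mailbox; do
    doveadm -o plugin/obox_fs="$obox_fs_expanded" fetch -u user@domain hdr mailbox "$mailbox" > /dev/null
done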
Example log line for a lost index bundle object:
Error: Failed to load index files for mailbox 'INBOX': Failed to read index files from storage: Failed to download bundle testuser/mailboxes/c92f64f79f0d1ed01e6d5b314f04886c/idx/bundle.5e6205ad.42fb6.f05e5a10ad05625e0dbf0000b70eb21c.host: Failed to read bundle testuser/mailboxes/c92f64f79f0d1ed01e6d5b314f04886c/idx/bundle.5e6205ad.42fb6.f05e5a10ad05625e0dbf0000b70eb21c.host: read(testuser/mailboxes/c92f64f79f0d1ed01e6d5b314f04886c/idx/bundle.5e6205ad.42fb6.f05e5a10ad05625e0dbf0000b70eb21c.host (object ID: 34c101f80cce917854ae07cfe629f683380f0020)) failed: Object exists in dict, but not in storage: 34c101f80cce917854ae07cfe629f683380f0020
The fixup script adds delete-dangling-links to the obox_index_fs setting and then attempts to open the mailbox. Opening the mailbox is enough; there's no need to access any emails. For extra safety, the index bundle list cache should be disabled with metacache_disable_bundle_list_cache = yes.
obox_index_fs="`doveconf -h plugin/obox_index_fs`:delete-dangling-links"
obox_index_fs_expanded="`doveadm user -e "$obox_index_fs" user@domain`"
doveadm -o plugin/obox_index_fs="$obox_index_fs_expanded" -o plugin/metacache_disable_bundle_list_cache=yes fetch -u user@domain uid mailbox INBOX 1 > /dev/null
The fixup script below fixes INBOX's indexes first and then other folders:
#!/usr/bin/perl
use strict;
use warnings;

# Usage: fix-user-indexes.pl user@domain
my $user = $ARGV[0];

# Append delete-dangling-links to the obox_index_fs setting and expand the
# %variables in it for this user.
my $obox_index_fs = `doveconf -h plugin/obox_index_fs`;
chomp($obox_index_fs);
$obox_index_fs .= ":delete-dangling-links";
my $obox_index_fs_expanded = `doveadm user -e "$obox_index_fs" $user`;
chomp($obox_index_fs_expanded);

# Fix INBOX's indexes first, then all the other folders.
`doveadm -o plugin/obox_index_fs="$obox_index_fs_expanded" -o plugin/metacache_disable_bundle_list_cache=yes fetch -u $user uid mailbox "INBOX" 1`;
my @mailboxes = `doveadm -o plugin/metacache_disable_bundle_list_cache=yes mailbox list -u $user`;
foreach my $mailbox (@mailboxes) {
    chomp($mailbox);
    `doveadm -o plugin/obox_index_fs="$obox_index_fs_expanded" -o plugin/metacache_disable_bundle_list_cache=yes fetch -u $user uid mailbox "$mailbox" 1`;
}
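If the script above is saved as, for example, fix-user-indexes.pl (the filename is only illustrative), it is run once per affected user:

./fix-user-indexes.pl user@domain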
Example log line for a lost FTS object:
Error: fts-dovecot: fts.D_f8469003906358608b463f0042497420.00000024-00000458.0001: Couldn't read max UID: (-2) read(fts.D_f8469003906358608b463f0042497420.00000024-00000458.0001) failed: Object exists in dict, but not in storage: 98816abe000da349dadef26859c6ae83fa030010
The fixup script adds "delete-dangling-links" to the fts_dovecot_fs setting and then attempts to access all the FTS objects. The best way is to perform an FTS search that finds as many mails as possible. This hopefully triggers reading the problematic object, although it's not guaranteed. For example, the word "subject" should be found in just about every mail:
fts_dovecot_fs="`doveconf -h plugin/fts_dovecot_fs`:delete-dangling-links"
fts_dovecot_fs_expanded="`doveadm user -e "$fts_dovecot_fs" user@domain`"
doveadm -o plugin/fts_dovecot_fs="$fts_dovecot_fs_expanded" search -u user@domain mailbox INBOX text "subject" > /dev/null
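Because the lost FTS objects may belong to any folder, the search can also be run without the mailbox restriction, which searches all of the user's folders. This is only a variation of the command above, with the same caveat that reading the problematic object is not guaranteed:

doveadm -o plugin/fts_dovecot_fs="$fts_dovecot_fs_expanded" search -u user@domain text "subject" > /dev/null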
The above scripts may not work in all situations, and they're more inefficient than is really necessary. An alternative way is to parse the log messages and explicitly delete the objects. Before deletion, the script should verify whether the object still exists or not. This can be done rather easily with obox-fs(1), which is included in the obox package. For example, use the stat command of the obox-fs(1) script to see whether the object exists:
obox-fs.sh stat user@domain userid/mailboxes/2abee7074b597058a52b0000ce9fcb97/b851bd241c5b705818400000ce9fcb97
obox-fs.sh -set obox_index_fs stat user@domain userid/mailboxes/2abee7074b597058a52b0000ce9fcb97/idx/bundle.12345
obox-fs.sh -set fts_dovecot_fs stat user@domain fts.D_f8469003906358608b463f0042497420.00000024-00000458.0001
Verify from the stderr output whether the error is still "Object exists in dict, but not in storage". If yes, use the delete command of the obox-fs(1) script to delete the same path.
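For example, continuing the first stat example above, the deletion would target the same path (run it only after the stat has confirmed the object is still reported as lost):

obox-fs.sh delete user@domain userid/mailboxes/2abee7074b597058a52b0000ce9fcb97/b851bd241c5b705818400000ce9fcb97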
The automation script below requires obox-fs(1) with the -delete-dangling-links parameter. The idea is that you can feed it log lines and it prints out the fixup commands. It could be extended to run the commands as well.
#!/usr/bin/env perl
use strict;
use warnings;

# Reads log lines from stdin (or from files given as arguments) and prints
# obox-fs.sh commands that repair the matching lost objects.

sub fts_fix {
  my ($userid, $fts_fname, $object_id) = @_;
  print "obox-fs.sh -set fts_dovecot_fs stat $userid $fts_fname\n";
}
sub folder_bundle_fix {
  my ($userid, $folder_guid, $bundle_fname, $object_id) = @_;
  print "obox-fs.sh -delete-dangling-links -set obox_index_fs stat $userid $userid/mailboxes/$folder_guid/idx/$bundle_fname\n";
}
sub user_bundle_fix {
  my ($userid, $bundle_fname, $object_id) = @_;
  print "obox-fs.sh -delete-dangling-links -set obox_index_fs stat $userid $userid/idx/$bundle_fname\n";
}
sub email_fix {
  my ($userid, $folder_guid, $email_oid, $object_id) = @_;
  print "obox-fs.sh -delete-dangling-links -set obox_fs stat $userid $userid/mailboxes/$folder_guid/$email_oid\n";
}

while (<>) {
  chomp;
  # NOTE: This fts regexp is mail_log_prefix setting-specific in catching the
  # username. It likely needs to be adjusted for different installations.
  if (/(user=[^ ]+).*(fts\.._[0-9a-f_.-]+).*Object exists in dict, but not in storage: ([0-9a-f]+)/) {
    fts_fix($1, $2, $3);
  } elsif (/([^\/ (]+)\/mailboxes\/([0-9a-f]+)\/idx\/(bundle[^):]+).*Object exists in dict, but not in storage: ([0-9a-f]+)/) {
    folder_bundle_fix($1, $2, $3, $4);
  } elsif (/([^\/ (]+)\/idx\/(bundle[^):]+).*Object exists in dict, but not in storage: ([0-9a-f]+)/) {
    user_bundle_fix($1, $2, $3, $4);
  } elsif (/([^\/ (]+)\/mailboxes\/([0-9a-f]+)\/([0-9a-f]+).*Object exists in dict, but not in storage: ([0-9a-f]+)/) {
    email_fix($1, $2, $3, $4);
  } elsif (/Object exists in dict, but not in storage/) {
    print STDERR "Couldn't match line: $_\n";
  }
}
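One possible way to use it (the script name and log path are only illustrative): feed it the relevant log lines and review the printed commands before executing anything.

grep 'Object exists in dict, but not in storage' /var/log/dovecot.log | ./object-lost-fix.pl > fixup-commands.sh
# Review fixup-commands.sh manually, then run it:
sh fixup-commands.sh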
When trying to debug why this still happens, it's important to find out when the lost object was originally saved and when it was first reported as missing.
It can help to add %{storage_id} to the deliver_log_format setting, which expands to the appropriate storage's object ID, although this applies only to new mail deliveries, not to IMAP APPENDs. It's also possible to find out the "save time" of an object based on its filename. Use the fname-parse(1) script that comes with the obox package to find out the timestamp:
./fname-parse.pl userid/mailboxes/50a4dd1e5443a750244a0000527f2970/aaae550815ac2e5873660000e7abd5f6
userid/mailboxes/50a4dd1e5443a750244a0000527f2970/aaae550815ac2e5873660000e7abd5f6 mail 2016-11-18 09:21:57
This is unfortunately not 100% reliable for mail objects, because when copying or moving mails the timestamp points to the copy/move time. The copy/move usually succeeds even if the email object was already lost in the object storage, so recent timestamps in folders other than INBOX/Sent/Drafts are not necessarily a correct indicator that new objects are being lost.