The error "Object exists in dict, but not in storage" means that an object (mail, index bundle, or FTS object) is listed as existing in Cassandra, but the referenced object ID doesn't exist in the configured object storage. This situation is not fixed automatically, to make sure there's no accidental data loss. There have been various reasons in the past why this has happened, and new instances still occur even after the latest fixes. This may be due to operational issues with Cassandra causing deleted data to become undeleted, since it doesn't happen in all customer installations.
These errors can also happen temporarily if there are delays with the object storage or Cassandra replication, for example during network splits. In this situation the errors go away automatically once the object storage and Cassandra replication catch up. This is the only expected reason for these errors.
These errors usually show up when users are moved between backends (e.g. after backend failures). Normally Dovecot doesn't do any unnecessary Cassandra work, so it doesn't notice when these problems happen. When a user/folder is accessed for the first time on a new backend, Dovecot lists all the index/mail objects instead of using the locally cached lists, to make sure no changes are lost when a backend crashes. This object listing is what usually reveals that there are suddenly new objects listed in Cassandra, which then aren't found in the object storage when Dovecot tries to access them. If users rarely move between backends, the original problem that caused the situation may have happened even months earlier.
Cassandra repairs must be run at least once every gc_grace_seconds (default: 10 days) to prevent already deleted objects from becoming undeleted. From the Cassandra manual at https://cassandra.apache.org/doc/latest/operating/repair.html :
At a minimum, repair should be run often enough that the gc grace period never expires on unrepaired data. Otherwise, deleted data could reappear. With a default gc grace period of 10 days, repairing every node in your cluster at least once every 7 days will prevent this, while providing enough slack to allow for delays.
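As an example only (the schedule and the keyspace name "mails" below are placeholders, not part of this documentation), repairs could be scheduled from cron on every Cassandra node, staggered per node so that each node is repaired at least once a week:

# Weekly primary-range repair of the placeholder keyspace "mails".
# -pr limits the repair to this node's primary token ranges, so running
# the job on every node covers the whole cluster once per week.
0 3 * * 0   nodetool repair -pr mails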
If a Cassandra node hasn't been connected to the cluster for gc_grace_seconds (default: 10 days), it must not be added back, or many deleted objects will become undeleted.
It's highly recommended to use the Lazy Expunge plugin. However, using short autoexpunge times for the Lazy Expunge folder may cause these problems. It's recommended to use an autoexpunge time of at least 1 day (although a few hours should usually be sufficient).
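As an illustration only (the EXPUNGED folder name and namespace layout are assumptions; adapt them to the installation's existing configuration), a folder-based lazy_expunge setup with a one-day autoexpunge might look like:

mail_plugins = $mail_plugins lazy_expunge
plugin {
  lazy_expunge = EXPUNGED
}
namespace inbox {
  mailbox EXPUNGED {
    autoexpunge = 1d
  }
}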
If the Lazy Expunge plugin is not used, the "lockdir" parameter must be used in the obox_fs setting.
Typically these problems are fixed either by manually running a fixup script or by writing automatic fixup scripts. The automation either looks at the log files or at fs_dictmap_object_lost events, and runs commands to fix the found problems.
WARNING
The fixup scripts must not be run if the object storage system is not properly connected/consistent. In some situations, if the object storage has internal problems connecting to all of its nodes, it may start returning "404 Object not found" errors for objects that actually still exist but just aren't accessible at the time. If a fixup script is run while this is happening, it may wrongly start deleting emails that still exist!
These errors may also temporarily pop up if Cassandra has had connectivity problems and its replication is lagging behind. These problems get fixed automatically as replication catches up with the pending deletes. However, there's no real harm in running a fixup script in this state, since it just causes the object to be deleted earlier by the fixup script rather than by Cassandra replication.
The "Object exists in dict, but not in storage" errors can be found by parsing the log files. Alternatively it's possible to export the fs_dictmap_object_lost
event to a HTTP endpoint:
metric fs_dictmap_object_lost {
  filter = event=fs_dictmap_object_lost
  exporter = json
}
event_exporter json {
  format = json
  format_args = time-rfc3339
  transport = http-post
  transport_args = https://example.com/api/endpoint/object-lost
  transport_timeout = 1 sec
}
The ideal automatic solution would track these errors separately for each object and, once the same error has repeated multiple times for the same object for over 1 hour, fix it automatically. A simpler solution would be to repair all objects whose errors were first seen over 1 hour ago. In both cases, the script should check the overall object storage state and not repair anything if the storage doesn't appear to be healthy.
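A minimal sketch of the simpler approach (the log and state file paths are assumptions): collect the object IDs from the error lines on each run, and only treat as repair candidates the IDs that were already present in the previous run, so the error has persisted across at least one scan interval. The object storage health check still needs to be added around this.

# Initialize once with: touch /var/tmp/object-lost.old
# Collect the object IDs seen in this scan.
grep -o 'Object exists in dict, but not in storage: [0-9a-f]*' /var/log/dovecot.log |
    awk '{print $NF}' | sort -u > /var/tmp/object-lost.new
# IDs present in both the previous and the current scan have persisted long
# enough to be repaired; newly seen IDs wait until the next run.
comm -12 /var/tmp/object-lost.old /var/tmp/object-lost.new > /var/tmp/object-lost.repair
mv /var/tmp/object-lost.new /var/tmp/object-lost.old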
The overall idea for the fixup scripts is to verify that the referenced object really doesn't exist in the object storage and then delete the dangling reference to it. This can be done either with direct object storage/Cassandra accesses, using doveadm dict/fs commands, or by accessing the objects via regular doveadm mailbox access commands with the delete-dangling-links option enabled. Only the delete-dangling-links method is described below.
Lost email objects can be found in two situations:
A mail object that already exists in the mailbox indexes is accessed via GET/HEAD. This can show up in different ways in the logs, for example:
Error: read(userid/mailboxes/48bf220df60f526073fb1b0042497420/ea931f3822b45860408b068342497420) failed: Object exists in dict, but not in storage: b1281fa5311bc3744d578ee8b4847e835b010020 (uid=1496, box=INBOX, read reason=access)
A mail object is being added to the mailbox index. This requires looking up its GUID, which requires HEADing the object. This shows up in the logs as
Warning: Rescanned index in ..., n assumed temporarily lost
where n specifies how many such mails were inaccessible. These errors don't prevent accessing the mailbox, but they cause inefficient behavior in Dovecot.
The script below enables the delete-dangling-links option for a doveadm command, which fetches all the mails' headers in the user's INBOX. It's necessary to trigger HTTP HEAD or GET calls to the object storage so that Dovecot sees the missing object and is then able to delete it. This fixes the mails that are already in the index, but not necessarily the "n assumed temporarily lost" mails. Those need to be fixed by having delete-dangling-links enabled while the index is being rescanned, which is easiest done with force-resync. The problematic emails may of course also be in folders other than INBOX.
obox_fs="`doveconf -h plugin/obox_fs`:delete-dangling-links"
obox_fs_expanded="`doveadm user -e "$obox_fs" user@domain`"
# Fix the "n assumed temporarily lost" (not needed if there aren't those):
doveadm -o plugin/obox_fs="$obox_fs_expanded" force-resync -u user@domain INBOX > /dev/null
# Fix the "object exists in dict, but not in storage" errors:
doveadm -o plugin/obox_fs="$obox_fs_expanded" fetch -u user@domain hdr mailbox INBOX > /dev/null
The above command opens all the email objects in the INBOX. This is rather inefficient and may take a long time if there are a lot of mails. You can also limit the access to only the emails that are visible in the error messages. For example, if the error message says the problem is with uid=1496, then you can use:
doveadm -o plugin/obox_fs="$obox_fs_expanded" fetch -u user@domain hdr mailbox INBOX UID 1496 > /dev/null
The problem with this method is that there may be other lost emails in the same folder that just weren't logged about yet, so you may need to keep monitoring and fixing the same user/folder multiple times with different email UIDs.
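Since the lost mails may also be in folders other than INBOX, another option is to run the same fetch over every folder of the user. A rough sketch, reusing the $obox_fs_expanded variable from above (this opens every mail in every folder, so it can be slow for large accounts):

doveadm mailbox list -u user@domain | while read -r mailbox; do
    doveadm -o plugin/obox_fs="$obox_fs_expanded" fetch -u user@domain hdr mailbox "$mailbox" > /dev/null
done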
Example log line for a lost index bundle object:
Error: Failed to load index files for mailbox 'INBOX': Failed to read index files from storage: Failed to download bundle testuser/mailboxes/c92f64f79f0d1ed01e6d5b314f04886c/idx/bundle.5e6205ad.42fb6.f05e5a10ad05625e0dbf0000b70eb21c.host: Failed to read bundle testuser/mailboxes/c92f64f79f0d1ed01e6d5b314f04886c/idx/bundle.5e6205ad.42fb6.f05e5a10ad05625e0dbf0000b70eb21c.host: read(testuser/mailboxes/c92f64f79f0d1ed01e6d5b314f04886c/idx/bundle.5e6205ad.42fb6.f05e5a10ad05625e0dbf0000b70eb21c.host (object ID: 34c101f80cce917854ae07cfe629f683380f0020)) failed: Object exists in dict, but not in storage: 34c101f80cce917854ae07cfe629f683380f0020
The fixup script adds delete-dangling-links to the obox_index_fs setting and then attempts to open the mailbox. Opening the mailbox is enough; there's no need to access any emails. For extra safety, the index bundle list cache should be disabled with metacache_disable_bundle_list_cache = yes.
obox_index_fs="`doveconf -h plugin/obox_index_fs`:delete-dangling-links"
obox_index_fs_expanded="`doveadm user -e "$obox_index_fs" user@domain`"
doveadm -o plugin/obox_index_fs="$obox_index_fs_expanded" -o plugin/metacache_disable_bundle_list_cache=yes fetch -u user@domain uid mailbox INBOX 1 > /dev/null
The fixup script below fixes INBOX's indexes first and then other folders:
#!/usr/bin/perl
use strict;
use warnings;

# Usage: fix-user-indexes.pl user@domain
my $user = $ARGV[0];

# Append delete-dangling-links to the obox_index_fs setting and expand the
# %variables in it for this user.
my $obox_index_fs = `doveconf -h plugin/obox_index_fs`;
chomp($obox_index_fs);
$obox_index_fs .= ":delete-dangling-links";
my $obox_index_fs_expanded = `doveadm user -e "$obox_index_fs" $user`;
chomp($obox_index_fs_expanded);

# Fix INBOX's indexes first, then all the other folders.
`doveadm -o plugin/obox_index_fs="$obox_index_fs_expanded" -o plugin/metacache_disable_bundle_list_cache=yes fetch -u $user uid mailbox "INBOX" 1`;
my @mailboxes = `doveadm -o plugin/metacache_disable_bundle_list_cache=yes mailbox list -u $user`;
foreach my $mailbox (@mailboxes) {
    chomp($mailbox);
    `doveadm -o plugin/obox_index_fs="$obox_index_fs_expanded" -o plugin/metacache_disable_bundle_list_cache=yes fetch -u $user uid mailbox "$mailbox" 1`;
}
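If the script above is saved as, for example, fix-user-indexes.pl (the filename is only illustrative), it is run once per affected user:

./fix-user-indexes.pl user@domain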
Example log line for a lost FTS object:
Error: fts-dovecot: fts.D_f8469003906358608b463f0042497420.00000024-00000458.0001: Couldn't read max UID: (-2) read(fts.D_f8469003906358608b463f0042497420.00000024-00000458.0001) failed: Object exists in dict, but not in storage: 98816abe000da349dadef26859c6ae83fa030010
The fixup script adds "delete-dangling-links" to the fts_dovecot_fs setting and then attempts to access all the FTS objects. The best way is to perform an FTS search that finds as many mails as possible. This hopefully triggers reading the problematic object, although it's not guaranteed. For example, the word "subject" should be found in just about every mail:
fts_dovecot_fs="`doveconf -h plugin/fts_dovecot_fs`:delete-dangling-links"
fts_dovecot_fs_expanded="`doveadm user -e "$fts_dovecot_fs" user@domain`"
doveadm -o plugin/fts_dovecot_fs="$fts_dovecot_fs_expanded" search -u user@domain mailbox INBOX text "subject" > /dev/null
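Because the lost FTS objects may belong to any folder, the search can also be run without the mailbox restriction, which searches all of the user's folders. This is only a variation of the command above, with the same caveat that reading the problematic object is not guaranteed:

doveadm -o plugin/fts_dovecot_fs="$fts_dovecot_fs_expanded" search -u user@domain text "subject" > /dev/null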
The above scripts may not work in all situations, and they're more inefficient than is really necessary. An alternative way is to parse the log messages and explicitly delete the objects. Before deletion, the script should verify whether the object still exists or not. This can be done rather easily with obox-fs(1), which is included in the obox package. For example, use the stat command of the obox-fs(1) script to see whether the object exists:
obox-fs.sh stat user@domain userid/mailboxes/2abee7074b597058a52b0000ce9fcb97/b851bd241c5b705818400000ce9fcb97
obox-fs.sh -set obox_index_fs stat user@domain userid/mailboxes/2abee7074b597058a52b0000ce9fcb97/idx/bundle.12345
obox-fs.sh -set fts_dovecot_fs stat user@domain fts.D_f8469003906358608b463f0042497420.00000024-00000458.0001
Verify from the stderr output whether the error is still "Object exists in dict, but not in storage". If yes, use the delete command of the obox-fs(1) script to delete the same path.
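For example, continuing the first stat example above, the deletion would target the same path (run it only after the stat has confirmed the object is still reported as lost):

obox-fs.sh delete user@domain userid/mailboxes/2abee7074b597058a52b0000ce9fcb97/b851bd241c5b705818400000ce9fcb97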
The automation script below requires obox-fs(1) with the -delete-dangling-links parameter. The idea is that you can feed it log lines and it prints out the fixup commands. It could be extended to run the commands as well.
#!/usr/bin/env perl
use strict;
use warnings;

# Reads log lines from stdin (or from files given as arguments) and prints
# obox-fs.sh commands that repair the matching lost objects.

sub fts_fix {
  my ($userid, $fts_fname, $object_id) = @_;
  print "obox-fs.sh -set fts_dovecot_fs stat $userid $fts_fname\n";
}
sub folder_bundle_fix {
  my ($userid, $folder_guid, $bundle_fname, $object_id) = @_;
  print "obox-fs.sh -delete-dangling-links -set obox_index_fs stat $userid $userid/mailboxes/$folder_guid/idx/$bundle_fname\n";
}
sub user_bundle_fix {
  my ($userid, $bundle_fname, $object_id) = @_;
  print "obox-fs.sh -delete-dangling-links -set obox_index_fs stat $userid $userid/idx/$bundle_fname\n";
}
sub email_fix {
  my ($userid, $folder_guid, $email_oid, $object_id) = @_;
  print "obox-fs.sh -delete-dangling-links -set obox_fs stat $userid $userid/mailboxes/$folder_guid/$email_oid\n";
}

while (<>) {
  chomp;
  # NOTE: This fts regexp is mail_log_prefix setting-specific in catching the
  # username. It likely needs to be adjusted for different installations.
  if (/(user=[^ ]+).*(fts\.._[0-9a-f_.-]+).*Object exists in dict, but not in storage: ([0-9a-f]+)/) {
    fts_fix($1, $2, $3);
  } elsif (/([^\/ (]+)\/mailboxes\/([0-9a-f]+)\/idx\/(bundle[^):]+).*Object exists in dict, but not in storage: ([0-9a-f]+)/) {
    folder_bundle_fix($1, $2, $3, $4);
  } elsif (/([^\/ (]+)\/idx\/(bundle[^):]+).*Object exists in dict, but not in storage: ([0-9a-f]+)/) {
    user_bundle_fix($1, $2, $3, $4);
  } elsif (/([^\/ (]+)\/mailboxes\/([0-9a-f]+)\/([0-9a-f]+).*Object exists in dict, but not in storage: ([0-9a-f]+)/) {
    email_fix($1, $2, $3, $4);
  } elsif (/Object exists in dict, but not in storage/) {
    print STDERR "Couldn't match line: $_\n";
  }
}
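One possible way to use it (the script name and log path are only illustrative): feed it the relevant log lines and review the printed commands before executing anything.

grep 'Object exists in dict, but not in storage' /var/log/dovecot.log | ./object-lost-fix.pl > fixup-commands.sh
# Review fixup-commands.sh manually, then run it:
sh fixup-commands.sh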
When trying to debug why this still happens, it's important to find out when the lost object was originally saved and when it was first reported as missing.
It can help to add %{storage_id} to the deliver_log_format setting, which expands to the appropriate storage's object ID, although this applies only to new mail deliveries, not to IMAP APPENDs. It's also possible to find out the "save time" of an object based on its filename. Use the fname-parse(1) script that comes with the obox package to find out the timestamp:
./fname-parse.pl userid/mailboxes/50a4dd1e5443a750244a0000527f2970/aaae550815ac2e5873660000e7abd5f6
userid/mailboxes/50a4dd1e5443a750244a0000527f2970/aaae550815ac2e5873660000e7abd5f6 mail 2016-11-18 09:21:57
This is unfortunately not 100% reliable for mail objects, because when copying or moving mails the timestamp points to the copy/move time. The copy/move usually succeeds even if the email object was already lost in the object storage, so recent timestamps in folders other than INBOX/Sent/Drafts are not necessarily a correct indicator that new objects are being lost.