"Success is uncertain, not deleting object ID" cleanup

These errors happen on Cassandra write timeouts. If Dovecot can't be sure that the write succeeded, it logs this error and keeps the object in the object storage. These objects should eventually be deleted, once Cassandra is having fewer problems.

To do this, use the uncertain-delete(1) script in the obox package.

NOTE

The error and the situation described on this page should generally not occur on a default setup. This page is relevant only if the no-cleanup-uncertain parameter (see Dictmap Parameters) is explicitly given.

TIP

Monitor the fs_dictmap_dict_write_uncertain event, especially whether the cleanup field contains failed. Also inspect the logs for messages like "file write state is uncertain for object ID".
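As a rough illustration of the log inspection suggested in this tip, the following Python sketch picks out log lines that contain either of the two messages quoted on this page. Only the message substrings come from this page; the function name and the idea of scanning plain-text lines are assumptions, and the real log line layout may differ.

```python
from typing import Iterable, List

# Substrings quoted on this page. Matching is case-insensitive because the
# exact capitalization used in real log lines is not confirmed here.
UNCERTAIN_MARKERS = (
    "success is uncertain, not deleting object id",
    "file write state is uncertain for object id",
)


def find_uncertain_lines(lines: Iterable[str]) -> List[str]:
    """Return the log lines that mention an uncertain write (hypothetical helper)."""
    matches = []
    for line in lines:
        lowered = line.lower()
        if any(marker in lowered for marker in UNCERTAIN_MARKERS):
            matches.append(line.rstrip("\n"))
    return matches
```

The function accepts any iterable of strings, so an open log file can be passed directly. Recording when each match was logged helps later when deciding how long to wait before cleanup.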

A detailed explanation:

When Cassandra can't achieve the requested consistency (each-quorum in this case) at write time, it returns a failure. However, the write may still have partially succeeded, and Cassandra may eventually repair/replicate it enough times that the write effectively succeeds. In these cases Dovecot can't be sure whether the write will eventually succeed or not, and it logs the "success is uncertain, not deleting object ID" error. If the failure concerned an email object, Dovecot replies to the IMAP/LMTP client that the mail couldn't be saved, so the client will typically re-deliver the mail at a later time. There are two possible outcomes after this:

The Cassandra write eventually becomes visible, and Dovecot sees it. The next time Dovecot lists the email objects, it sees a new email and adds it to the Dovecot indexes. Since the client has most likely already re-delivered the mail, this new mail shows up as a duplicate. Or, if the user had already expunged the mail, it looks as if the expunged mail has become un-expunged.
The Cassandra write never finishes. The storage object is never deleted automatically, so it just wastes disk space. There are no user-visible issues with this.
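The reason a failed write can still become visible later can be shown with a toy model of the each-quorum check. This is only a sketch: the datacenter names, replication factors and function names are hypothetical, and the model ignores everything except counting the replicas that stored the write. The quorum size used is the usual majority, floor(RF / 2) + 1.

```python
from typing import Dict


def quorum(replication_factor: int) -> int:
    """Majority of replicas: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def each_quorum_met(stored: Dict[str, int], rf: Dict[str, int]) -> bool:
    """True if every datacenter has a quorum of replicas holding the write."""
    return all(stored.get(dc, 0) >= quorum(n) for dc, n in rf.items())


def write_is_uncertain(stored: Dict[str, int], rf: Dict[str, int]) -> bool:
    """The write failed each-quorum, yet some replica holds the data.

    Such a write can still be repaired/replicated later and become visible.
    """
    return not each_quorum_met(stored, rf) and any(n > 0 for n in stored.values())


# Hypothetical cluster: two datacenters with three replicas each.
rf = {"dc1": 3, "dc2": 3}
print(write_is_uncertain({"dc1": 2, "dc2": 1}, rf))  # dc2 lacks a quorum
```

In this model a write stored on two replicas in dc1 but only one in dc2 is reported as failed, while the data it wrote still exists and may later be replicated.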

After an uncertain write, Dovecot immediately attempts to delete the uncertain data. The deletion may of course also fail with uncertainty, but unless it failed completely, it will still eventually take effect.

The purpose of the uncertain-delete(1) script is to find these leaked storage objects and delete them to avoid wasting disk space; there is no user-visible impact. The script should therefore be run a long time (e.g. 1 day) after the "success is uncertain, not deleting object ID" errors, to give Cassandra time to finish its repairs/replication so that it is known whether the write actually succeeded. If uncertain-delete(1) is run too early, Cassandra could still repair the write, which would then point to a storage object that no longer exists. This results in "Object exists in dict, but not in storage" errors (see Obox Troubleshooting).
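The waiting rule can be expressed as a simple age check. The sketch below is not part of uncertain-delete(1); the function name and the one-day default are assumptions taken from the "e.g. 1 day" guidance above, and the right interval depends on how long Cassandra repairs take in a given deployment.

```python
from datetime import datetime, timedelta

# Assumed grace period, following the "e.g. 1 day" guidance on this page.
GRACE_PERIOD = timedelta(days=1)


def old_enough_for_cleanup(error_time: datetime, now: datetime,
                           grace: timedelta = GRACE_PERIOD) -> bool:
    """Return True once at least `grace` has passed since the error was logged."""
    return now - error_time >= grace


logged = datetime(2024, 1, 1, 12, 0)
print(old_enough_for_cleanup(logged, datetime(2024, 1, 1, 18, 0)))  # too early
print(old_enough_for_cleanup(logged, datetime(2024, 1, 2, 12, 0)))  # one day later
```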

It's not safe to run a script that simply deletes all uncertain writes a long time after they happen, because mails could get lost:

A mail is delivered to the user; the delivery fails and logs "uncertain write" ("mail A").
A few minutes later the mail is re-delivered and the delivery succeeds ("mail B").
Cassandra repairs the uncertain write and makes "mail A" visible.
Dovecot re-syncs the INBOX by listing mails in Cassandra (this happens e.g. if the user moves from one backend to another, or if the INBOX has been cleaned from metacache).
The re-sync sees "mail A" and adds it to the index. Now both mails A and B are visible to the user.
The user wonders why there are two mails in the INBOX and happens to delete "mail B".
The script that deletes uncertain writes is run, which deletes "mail A".
Both A and B are now deleted.
(If user had deleted mail A instead of mail B there wouldn't have been a problem, but users will likely randomly decide to delete either one.)
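The sequence above can be replayed as a small simulation. It is a toy model only: mails are plain strings in Python sets and the function name is hypothetical, but it shows why the outcome depends on which duplicate the user happens to delete.

```python
def simulate_naive_cleanup(user_deletes: str = "B") -> set:
    """Replay the scenario and return the mails still visible at the end."""
    visible = set()    # mails the user can see in the INBOX
    uncertain = set()  # writes that were logged as uncertain

    uncertain.add("A")             # delivery of mail A fails with an uncertain write
    visible.add("B")               # re-delivery succeeds as mail B
    visible.add("A")               # Cassandra repairs the write; re-sync indexes A
    visible.discard(user_deletes)  # the user deletes one of the two duplicates
    visible -= uncertain           # naive script deletes every uncertain write
    return visible


print(simulate_naive_cleanup("B"))  # both copies are gone
print(simulate_naive_cleanup("A"))  # mail B survives
```

When the user deletes mail B, the naive cleanup removes the only remaining copy. When the user deletes mail A, the cleanup has nothing visible left to remove and mail B survives.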
