Dovecot Pro Palomar Architecture

The Dovecot Pro Palomar Architecture in v3.x replaces the director component in older Dovecot versions with a new cluster service.

The cluster service provides:

Load balancing between Dovecot Proxies and Backends
User-stickiness to backends
High availability for backends and other sites
Load redistribution / user moving
Multi-site support

The cluster always attempts to access each user via only a single backend at a time. This allows caching to work efficiently. When using object storage and multiple sites, a user may be accessed simultaneously by multiple sites when the sites' networks cannot see each other (split brain). The obox format handles this by eventually merging the changes and moving the user's handling back to a single site soon after the split brain is over.

Overview

User Groups

Users are assigned to groups. Only groups can be moved between backends, not individual users. The number of groups should be sized in proportion to the number of backends. For example, allocating 100 groups per backend allows the backend load to be changed in 1% increments when moving groups.
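The sizing rule above can be checked with a little arithmetic. This is a minimal sketch, assuming every group carries an equal share of its backend's load; the function names are hypothetical and not part of Dovecot.

```python
def total_groups(backends: int, groups_per_backend: int = 100) -> int:
    """Number of user groups to create for a cluster of the given size."""
    return backends * groups_per_backend


def load_step_percent(groups_per_backend: int) -> float:
    """Share of one backend's load shifted by moving a single group,
    assuming all groups on that backend are equally loaded."""
    return 100.0 / groups_per_backend


print(total_groups(8))         # 800
print(load_step_percent(100))  # 1.0
```

Fewer groups per backend give coarser steps: with 20 groups per backend, each move shifts about 5% of that backend's load.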

Ideally all the groups have equal "load", i.e. moving any of the groups in a backend elsewhere would reduce the backend's load by the same amount. A later Palomar version will support automatically moving users between groups to try to make them more balanced.

Group Moving

doveadm cluster group move and group moves initiated by the controller work by updating the user group fields in GeoDB:

alt_backend_id is set to the destination backend
moving is set to a non-empty value

This is followed by (automatically) accessing any user in the group, which starts the actual moving in the source backend. Any kind of mail user access counts, such as mail delivery, IMAP login or doveadm access. It is the source backend's responsibility to finish the move.

While a group is moving, its users are first forwarded to the source backend. The source backend tracks in its LocalDB which users have already been moved to the destination. If the user logging in has already been moved, the login is rejected with a referral to the destination backend.

Once the group is fully moved, the user group is updated in GeoDB:

backend_id is set to the destination backend
alt_backend_id is set to a new backend
moving is cleared

After this the users start logging in directly to the destination backend. Due to race conditions and GeoDB caches it's still possible that some proxies forward the connection to the source backend. The source backend remembers for up to 1 hour that the group move happened, and will reject such logins with a referral to the destination backend.

If the source backend is marked offline/standby before the move is finished, the next user access immediately marks the move as finished. Because all proxies know the destination backend (alt_backend_id), this can be done safely even if multiple proxies do it simultaneously.
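The field updates and referral behaviour described in this section can be modelled as a small state machine. The sketch below is an illustration only: `GroupRecord`, `SourceBackend` and the helper functions are hypothetical in-memory stand-ins, not Dovecot or GeoDB APIs, and the 1 hour post-move memory and the offline/standby shortcut are not modelled.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class GroupRecord:
    """Hypothetical stand-in for a user group's fields in GeoDB."""
    backend_id: str
    alt_backend_id: Optional[str] = None
    moving: str = ""


@dataclass
class SourceBackend:
    """Hypothetical stand-in for the source backend and its LocalDB."""
    name: str
    moved_users: Set[str] = field(default_factory=set)

    def login(self, user: str, group: GroupRecord) -> str:
        # Users that were already moved are rejected with a referral.
        if group.moving and user in self.moved_users:
            return f"referral:{group.alt_backend_id}"
        return f"ok:{self.name}"


def start_move(group: GroupRecord, destination: str) -> None:
    """Mark the group as moving towards the destination backend."""
    group.alt_backend_id = destination
    group.moving = "1"  # any non-empty value


def finish_move(group: GroupRecord, new_alt_backend: str) -> None:
    """Make the destination the primary backend and clear the moving flag."""
    group.backend_id = group.alt_backend_id
    group.alt_backend_id = new_alt_backend
    group.moving = ""


def proxy_route(group: GroupRecord) -> str:
    """While moving, proxies still forward to the source (backend_id)."""
    return group.backend_id


group = GroupRecord(backend_id="backend-a")
source = SourceBackend("backend-a")

start_move(group, "backend-b")
print(proxy_route(group))            # backend-a
print(source.login("alice", group))  # ok:backend-a

source.moved_users.add("alice")      # the source backend finished moving alice
print(source.login("alice", group))  # referral:backend-b

finish_move(group, new_alt_backend="backend-c")
print(proxy_route(group))            # backend-b
```

Keeping the destination in alt_backend_id for the whole duration of the move is what lets any proxy finish the move on its own if the source backend goes offline.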

Automatic Load Rebalancing

If enabled, the controller can perform automatic load balancing based on data collected from the backends. In short, every backend is assigned a load index called the Z-score. If the load varies too much between backends, a group move is triggered between the backend with the highest load and the backend with the lowest load. The load difference that triggers the balancing is set by the HOST_LOAD_BALANCE_SCORE_DELTA_THRESHOLD_RATIO controller setting.
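How the controller computes its score and applies the threshold is not specified here, so the following is only a rough sketch under stated assumptions. It ranks backends by the textbook z-score, (load - mean) / standard deviation, and triggers a move when the gap between the highest and lowest load, relative to the mean load, exceeds a hypothetical `gap_ratio_threshold`. Neither the formula nor the parameter is claimed to match the real HOST_LOAD_BALANCE_SCORE_DELTA_THRESHOLD_RATIO semantics.

```python
from statistics import mean, pstdev
from typing import Dict, Optional, Tuple


def z_scores(loads: Dict[str, float]) -> Dict[str, float]:
    """Textbook z-score of each backend's load relative to the cluster."""
    mu = mean(loads.values())
    sigma = pstdev(loads.values())
    if sigma == 0:
        return {name: 0.0 for name in loads}
    return {name: (load - mu) / sigma for name, load in loads.items()}


def pick_move(loads: Dict[str, float],
              gap_ratio_threshold: float) -> Optional[Tuple[str, str]]:
    """Return (busiest, idlest) if the load gap is large enough, else None.

    gap_ratio_threshold is a hypothetical parameter for this sketch.
    """
    mu = mean(loads.values())
    if mu == 0:
        return None
    scores = z_scores(loads)
    busiest = max(scores, key=scores.get)
    idlest = min(scores, key=scores.get)
    gap_ratio = (loads[busiest] - loads[idlest]) / mu
    if gap_ratio > gap_ratio_threshold:
        return busiest, idlest
    return None


print(pick_move({"a": 80.0, "b": 50.0, "c": 50.0}, 0.2))  # ('a', 'b')
print(pick_move({"a": 50.0, "b": 51.0}, 0.2))             # None
```

The z-score is used only for ranking because it is scale-free: two backends with slightly different loads always score +1 and -1, so some measure of the absolute gap is needed to decide whether a move is worthwhile.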

A group move is an operation where the controller decides that all users of a group should be routed to a different Dovecot backend. When this decision is made, the controller updates the group in GeoDB with the new backend. The actual moving starts the next time a user that belongs to the group logs in or receives an email via LMTP.

Monitoring and adjusting backend load is a continuous process that the controller performs periodically. By default, 1 group is moved per hour. At the start and end of a group move, Dovecot emits the cluster_user_group_move_started and cluster_user_group_move_finished events, respectively.

To keep the site from making premature decisions about load, automatic load balancing is deferred until Prometheus has collected a sufficient amount of data. The minimum number of data samples can be tuned with the LOAD_BALANCE_MIN_SAMPLES setting. By default, it is set to 3000 samples, which takes roughly 12 hours to collect.
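The two default figures above imply an approximate sampling interval. This is a back-of-the-envelope sketch, assuming samples arrive at evenly spaced intervals; the function name is hypothetical and the real scrape interval is not stated here.

```python
def implied_sample_interval_seconds(min_samples: int,
                                    collection_hours: float) -> float:
    """Average spacing between samples if min_samples are collected
    evenly over collection_hours."""
    return collection_hours * 3600 / min_samples


print(implied_sample_interval_seconds(3000, 12))  # 14.4
```

By the same arithmetic, raising LOAD_BALANCE_MIN_SAMPLES to 6000 would roughly double the warm-up time to about 24 hours at the same sampling rate.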

Automatic Group Size Rebalancing

The controller can also normalize the size of the groups assigned to backends by moving users between them. If enabled (GROUP_BALANCE_ENABLED), the groups on each backend converge to roughly the same size.

Group rebalancing is performed in two stages.

In the first stage, the user groups of each backend are analyzed, and if any group size difference is higher than GROUP_BALANCE_GROUP_SIZE_SLACK_PERCENT, a plan is generated and stored in Redis for later execution. This stage runs once per day at midnight (UTC). It decides which specific users should be moved to which groups. The number of users moved is capped at the value of GROUP_BALANCE_MAX_USER_MOVE_BETWEEN_GROUPS.
In the second stage, which runs throughout the day, the plan is read periodically and users are reassigned to new groups according to it. This stage runs every 30 minutes, and each round moves at most GROUP_BALANCE_MAX_USER_MOVES_PER_PASS users to new groups.
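The two stages can be sketched as a planner and a batch executor. This is a minimal illustration, not the controller's actual algorithm: the greedy largest-to-smallest strategy, the interpretation of the slack as a percentage of the average group size, and all function and parameter names are assumptions. Redis storage and scheduling are left out, so the plan is just an in-memory list.

```python
from typing import Dict, List, Tuple

Move = Tuple[str, str, str]  # (user, source group, destination group)


def build_plan(groups: Dict[str, List[str]],
               slack_percent: float,
               max_moves: int) -> List[Move]:
    """Stage 1: decide which users move to which groups, capped at max_moves."""
    if not groups:
        return []
    sizes = [len(users) for users in groups.values()]
    # Assumption: slack is a percentage of the average group size.
    slack = (sum(sizes) / len(sizes)) * slack_percent / 100.0
    if max(sizes) - min(sizes) <= slack:
        return []
    pools = {name: list(users) for name, users in groups.items()}  # work on copies
    plan: List[Move] = []
    while len(plan) < max_moves:
        largest = max(pools, key=lambda g: len(pools[g]))
        smallest = min(pools, key=lambda g: len(pools[g]))
        # Stop once the groups are within the slack (or cannot get closer).
        if len(pools[largest]) - len(pools[smallest]) <= max(slack, 1):
            break
        user = pools[largest].pop()
        pools[smallest].append(user)
        plan.append((user, largest, smallest))
    return plan


def run_pass(plan: List[Move], max_per_pass: int) -> List[Move]:
    """Stage 2: take the next batch off the plan, at most max_per_pass moves."""
    batch = plan[:max_per_pass]
    del plan[:max_per_pass]
    return batch


groups = {"g1": ["u1", "u2", "u3", "u4", "u5", "u6"], "g2": ["u7", "u8"]}
plan = build_plan(groups, slack_percent=10, max_moves=100)
print(plan)               # [('u6', 'g1', 'g2'), ('u5', 'g1', 'g2')]
print(run_pass(plan, 1))  # [('u6', 'g1', 'g2')]
print(plan)               # [('u5', 'g1', 'g2')]
```

Separating planning from execution keeps each pass small: the daily plan fixes the decisions once, and the periodic passes only have to apply a bounded number of moves.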
