The Dovecot Pro Palomar Architecture in v3.x replaces the director component in older Dovecot versions with a new cluster service.
The cluster service provides:
The cluster always attempts to access a user via only a single backend at a time. This allows caching to work efficiently. When using object storage and multiple sites, a user may be accessed simultaneously by multiple sites when the sites' networks cannot see each other (split brain). The obox format handles this by eventually merging the changes and moving the user handling back to a single site soon after the split brain is over.
Users are assigned to groups. Only groups can be moved between backends, not individual users. The number of groups should be sized in proportion to the number of backends. For example, allocating 100 groups per backend allows the backend load to be changed in increments of roughly 1% when moving groups.
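The sizing rule above can be illustrated with a few lines of arithmetic. This is only a sketch: the function name `group_plan` is made up for illustration, and it assumes every group carries an equal share of its backend's load.

```python
def group_plan(backends: int, groups_per_backend: int = 100) -> tuple[int, float]:
    """Return (total group count, percent of one backend's load per group).

    Hypothetical helper; assumes all groups on a backend are equally loaded.
    """
    total_groups = backends * groups_per_backend
    step_percent = 100.0 / groups_per_backend
    return total_groups, step_percent


total, step = group_plan(backends=8)
print(f"{total} groups, {step:.1f}% of a backend's load per group move")
```

With fewer groups per backend the steps get coarser: 50 groups per backend would shift load in 2% increments.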
Ideally all groups have equal "load", i.e. moving any one of a backend's groups elsewhere would reduce that backend's load by the same amount. A later Palomar version will support automatically moving users between groups to try to make them more balanced.
Both doveadm cluster group move and group moves initiated by the controller work by updating the user group fields in GeoDB:
alt_backend_id is set to the destination backend
moving is set to a non-empty value
This is followed by (automatically) accessing any user in the group, which starts the actual move on the source backend. Any kind of mail user access counts, such as mail delivery, an IMAP login or doveadm access. It is the source backend's responsibility to finish the move.
While a group is moving, its users are first forwarded to the source backend. The source backend tracks in its LocalDB which users have already been moved to the destination. If the user logging in has already been moved, the login is rejected with a referral to the destination backend.
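The in-progress behaviour described above can be modelled roughly as follows. This is a simplified sketch, not Dovecot code: `GroupRecord`, `SourceBackend`, `start_move` and `handle_login` are hypothetical names, and only the field names `backend_id`, `alt_backend_id` and `moving` come from the text.

```python
from dataclasses import dataclass, field


@dataclass
class GroupRecord:
    """Hypothetical model of a user group entry in GeoDB."""
    backend_id: str
    alt_backend_id: str = ""
    moving: str = ""  # non-empty while a move is in progress


@dataclass
class SourceBackend:
    """Hypothetical model of the source backend and its LocalDB bookkeeping."""
    name: str
    moved_users: set[str] = field(default_factory=set)

    def handle_login(self, user: str, group: GroupRecord) -> tuple[str, str]:
        # Users already moved are rejected with a referral to the destination.
        if group.moving and user in self.moved_users:
            return ("referral", group.alt_backend_id)
        return ("accept", self.name)


def start_move(group: GroupRecord, destination: str) -> None:
    """Apply the two GeoDB field updates that begin a group move."""
    group.alt_backend_id = destination
    group.moving = "yes"


group = GroupRecord(backend_id="backend-a")
source = SourceBackend("backend-a")
start_move(group, "backend-b")
source.moved_users.add("alice")  # alice's move has completed, bob's has not

print(source.handle_login("alice", group))
print(source.handle_login("bob", group))
```

Note that `backend_id` still points at the source while `moving` is set, which is why proxies keep forwarding the group's users there first.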
Once the group is fully moved, the user group is updated in GeoDB:
backend_id is set to the destination backend
alt_backend_id is set to a new backend
moving is cleared
After this the users start logging in directly to the destination backend. Due to race conditions and GeoDB caches it's still possible that some proxies forward the connection to the source backend. The source backend remembers for up to 1 hour that the group move happened, and will reject the logins with a referral to the destination backend.
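The completion step and the source backend's short-lived memory of finished moves could be sketched like this. Everything here is illustrative: `FinishedMoves`, `finish_move` and the dictionary layout are assumptions, and only the field names and the 1 hour limit are taken from the text.

```python
import time

REFERRAL_TTL_SECONDS = 3600  # "up to 1 hour", per the text above


class FinishedMoves:
    """Hypothetical per-backend memory of recently finished group moves."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._moves = {}  # group name -> (destination backend, finish time)

    def remember(self, group: str, destination: str) -> None:
        self._moves[group] = (destination, self._clock())

    def referral_for(self, group: str):
        """Return the destination backend while the memory is fresh, else None."""
        entry = self._moves.get(group)
        if entry is None:
            return None
        destination, finished_at = entry
        if self._clock() - finished_at > REFERRAL_TTL_SECONDS:
            del self._moves[group]  # forget moves older than the TTL
            return None
        return destination


def finish_move(record: dict, new_alt_backend: str) -> dict:
    """Return the group record with the three 'move finished' updates applied."""
    return {
        "backend_id": record["alt_backend_id"],  # destination becomes primary
        "alt_backend_id": new_alt_backend,       # how this is chosen is not specified here
        "moving": "",                            # cleared
    }
```

A stale proxy that still forwards a login to the old source backend would hit `referral_for()` and be redirected to the destination until the entry expires.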
If the source backend is marked offline/standby before the move is finished, the next user access immediately marks the move as finished. Because all proxies know the destination backend (alt_backend_id), this can be done safely even if multiple proxies do it simultaneously.
If enabled, the controller can perform automatic load balancing based on data collected from the backends. In a nutshell, each backend is assigned a load index called the Z-score. If the load varies too much between backends, a group move from the backend with the highest load to the backend with the lowest load is triggered. The difference in load that triggers the balancing is set by the HOST_LOAD_BALANCE_SCORE_DELTA_THRESHOLD_RATIO controller setting.
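The selection step can be sketched as below. This is not the controller's actual algorithm: the text does not say how the Z-score is computed or how the threshold ratio is applied, so the sketch takes per-backend scores as given, assumes they are non-negative, and assumes the ratio compares the gap between the highest and lowest score to the highest score. The function name `pick_group_move` is hypothetical.

```python
def pick_group_move(scores: dict[str, float], threshold_ratio: float):
    """Return (source, destination) backend names, or None if balanced enough.

    Assumption: a move is triggered when the gap between the highest and the
    lowest score, relative to the highest score, exceeds threshold_ratio.
    """
    if len(scores) < 2:
        return None
    busiest = max(scores, key=scores.get)
    idlest = min(scores, key=scores.get)
    top = scores[busiest]
    if top <= 0:
        return None  # nothing meaningful to balance
    if (top - scores[idlest]) / top > threshold_ratio:
        return busiest, idlest
    return None


print(pick_group_move({"backend-a": 10.0, "backend-b": 6.0, "backend-c": 8.0}, 0.2))
print(pick_group_move({"backend-a": 10.0, "backend-b": 9.5}, 0.2))
```

In the first call the relative gap is 0.4, so one group would move from backend-a to backend-b; in the second it is 0.05, so nothing is moved.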
A group move is an operation where the controller decides that all users of a group should be routed to a different Dovecot backend. When this decision is made, the controller updates the group in GeoDB with the new backend. The actual moving starts the next time a user that belongs to the group logs in or receives an email via LMTP.
Monitoring and adjusting backend load is a continuous process that the controller performs periodically. By default, 1 group per hour is moved. At the start and end of a group move, Dovecot emits the cluster_user_group_move_started and cluster_user_group_move_finished events, respectively.
To safeguard the site from making premature decisions about load, automatic load balancing is deferred until a sufficient amount of data has been collected by Prometheus. The minimum number of data samples can be tuned with the LOAD_BALANCE_MIN_SAMPLES setting. By default, it is set to 3000 samples, which take roughly 12 hours to collect.
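The two figures above imply a sampling interval, which helps when estimating the warm-up time for a different LOAD_BALANCE_MIN_SAMPLES value. The sketch below assumes samples arrive at a constant interval; the interval itself is inferred from the numbers in the text, not a documented setting, and both function names are made up for illustration.

```python
def implied_sample_interval(samples: int = 3000, hours: float = 12.0) -> float:
    """Seconds between samples implied by 'samples collected in hours'."""
    return hours * 3600 / samples


def collection_hours(min_samples: int, interval_seconds: float) -> float:
    """Hours needed to collect min_samples at a constant sampling interval."""
    return min_samples * interval_seconds / 3600


interval = implied_sample_interval()
print(f"implied interval: {interval:.1f} s")
print(f"6000 samples would take about {collection_hours(6000, interval):.0f} h")
```

Under that assumption, the defaults correspond to one sample roughly every 14.4 seconds, and doubling the minimum sample count doubles the deferral period.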
The controller can also normalize the size of the groups assigned to backends by moving users between them. If enabled (GROUP_BALANCE_ENABLED), this causes the groups on each backend to converge to roughly the same size.
Group rebalancing is performed in two stages.