Cassandra is an open source NoSQL distributed database that offers scalability and high availability without compromising performance.
Cassandra is an important component of Dovecot Pro, required for obox and Palomar and optionally used for other features (e.g., last login).
This page contains suggestions and recommendations on how to configure and operate Cassandra in your installation.
WARNING
Documentation on this page is for informational purposes only. These are not requirements, and Open-Xchange/Dovecot cannot provide direct Cassandra support.
For specific assistance with a local Cassandra installation, consultation with a Cassandra expert is recommended.
Pre-install deployment checklist:
Apache Cassandra is a distributed database with tunable consistency. The normal Dovecot configuration implements quorum consistency, which provides strong consistency while tolerating failures. With a replication factor of 3, quorum is 2 (floor(3/2) + 1 replicas), so the cluster can tolerate one unavailable replica and still serve reads and writes.
Cassandra nodetool repair runs the anti-entropy service, which uses Merkle trees to detect and repair inconsistencies in data between replicas.
Another important element is gc_grace_seconds (10 days by default), which is the time-to-live marker for tombstones. If a node is still missing the tombstone after the gc_grace_seconds period, the deleted data will be resurrected.
In the Dovecot Pro log file, if you start seeing Object exists in dict, but not in storage errors, you most likely have resurrected deleted data, which has to be removed manually. See Obox Troubleshooting: Object exists in dict, but not in storage.
TIP

To prevent Cassandra data resurrection, you must regularly run nodetool repair (within gc_grace_seconds) via cron for the entire cluster.
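For example, a minimal cron sketch that runs a weekly primary-range repair on one node; the schedule, user, and log path are assumptions to adapt to your environment:

```bash
# /etc/cron.d/cassandra-repair (example only; stagger the schedule per node so
# that the whole cluster is repaired well within gc_grace_seconds)
# m  h  dom mon dow  user       command
0    2  *   *   0    cassandra  /usr/bin/nodetool repair -pr >> /var/log/cassandra/repair.log 2>&1
```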
Dovecot attempts to prevent creating too many tombstones within the same cluster key. Sometimes this may not work properly, and Cassandra queries towards a specific cluster key start failing (timing out) due to too many tombstones. This can be repaired by getting rid of the tombstones (a sketch of the procedure follows below):

1. Set gc_grace_seconds to a smaller value that covers the existing tombstones (e.g. 1 day).
2. Once the tombstones have been purged, set gc_grace_seconds back to the original value (10 days).

Other potential changes that may help:

- Set page_size=1000 in the dovecot-dict-cql.conf.ext connect setting so that large results are paged into multiple queries.
- Increase tombstone_failure_threshold.
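A minimal sketch of this procedure, assuming the affected table is mails.example_table (a placeholder name) and that a major compaction is used to drop the expired tombstones:

```bash
# 1. Temporarily lower gc_grace_seconds so the problem tombstones expire (1 day).
cqlsh -e "ALTER TABLE mails.example_table WITH gc_grace_seconds = 86400;"

# 2. Compact so that the expired tombstones are actually removed (run on each node).
nodetool compact mails example_table

# 3. Restore the original value (10 days).
cqlsh -e "ALTER TABLE mails.example_table WITH gc_grace_seconds = 864000;"
```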
For multiple data centers, the NetworkTopologyStrategy replication strategy is recommended for production environments.
When the mails keyspace is created, set replication to NetworkTopologyStrategy. This example sets the replication factor to 3 in each data center:
CREATE KEYSPACE IF NOT EXISTS mails WITH REPLICATION = {
    'class' : 'NetworkTopologyStrategy',
    'dc1_name' : 3,
    'dc2_name' : 3
};
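After creating the keyspace, the replication settings can be verified from cqlsh, for example:

```bash
# Show the keyspace definition, including its replication strategy and factors.
cqlsh -e "DESCRIBE KEYSPACE mails;"
```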
Note

Every node will only define itself (its own data center and rack).
A snitch determines which data centers and racks nodes belong to. The replication (in this case NetworkTopologyStrategy) places the replicas based on snitch information.
The GossipingPropertyFileSnitch uses Cassandra's gossip protocol to automatically update snitch information when adding new nodes. This is the recommended snitch for production.
Configure GossipingPropertyFileSnitch on each node with only that node's own information; Cassandra discovers all the nodes in the ring via the gossip protocol on startup. This configuration is set in cassandra-rackdc.properties.
Examples (one cassandra-rackdc.properties file per node):

dc=dc1_name
rack=rack1

dc=dc1_name
rack=rack2

dc=dc2_name
rack=rack1
Modify the cassandra.yaml file and change endpoint_snitch to use GossipingPropertyFileSnitch:
endpoint_snitch: GossipingPropertyFileSnitch
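Once the nodes have joined the ring, you can confirm that each node reports the expected data center and rack with nodetool, for example:

```bash
# Lists all nodes grouped by data center, with rack, load, and state per node.
nodetool status
```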
Cassandra nodes should have at least two disks and two network interfaces.

One disk is for the commit log, which should be fast enough to receive all writes as sequential I/O. The size of the commit log is controlled by the commitlog_total_space_in_mb setting in the cassandra.yaml file.

The other disk is for data and should be fast enough to satisfy both read and write I/O. The recommendation is to keep space utilization below 50% of the total disk size to leave room for repair and compaction operations.

Ideally there is a separate set of mirrored OS disks and a separate disk for the Cassandra commit log.
If you are sharing the OS disk with the Cassandra commit log, use the following paths:

/var/lib/cassandra/commitlog
/var/lib/cassandra/saved_caches
/var/lib/cassandra/hints

If you have a dedicated disk for the Cassandra commit log, create a mount point named /var/lib/cassandra and use the path names above.

The Cassandra data disk should be a dedicated SSD drive. Create the mount point /var/lib/cassandra/data for it.
Note

Cassandra configuration files are normally located in /etc/cassandra and logs in /var/log/cassandra.

The path names above are the default values in the cassandra.yaml file:
# Directory where Cassandra should store hints.
# If not set, the default directory is $CASSANDRA_HOME/data/hints.
hints_directory: /var/lib/cassandra/hints
# Directories where Cassandra should store data on disk. Cassandra
# will spread data evenly across them, subject to the granularity of
# the configured compaction strategy.
# If not set, the default directory is $CASSANDRA_HOME/data/data.
data_file_directories:
- /var/lib/cassandra/data
# commit log. when running on magnetic HDD, this should be a
# separate spindle than the data directories.
# If not set, the default directory is $CASSANDRA_HOME/data/commitlog.
commitlog_directory: /var/lib/cassandra/commitlog
# saved caches
# If not set, the default directory is $CASSANDRA_HOME/data/saved_caches.
saved_caches_directory: /var/lib/cassandra/saved_caches
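A minimal sketch for preparing these directories on a new node, assuming Cassandra runs as the cassandra user and any dedicated disks are already mounted:

```bash
# Create the standard Cassandra directories and give them to the cassandra user.
mkdir -p /var/lib/cassandra/{data,commitlog,saved_caches,hints}
chown -R cassandra:cassandra /var/lib/cassandra
```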
Cassandra gives you the ability to segregate inter-node traffic from client traffic. Most deployments use a single IP address for both Cassandra inter-node and client communication.

In Cassandra, the listen_address: / listen_interface: settings are used for inter-node cluster communication. Cassandra replication, repair, gossip, and compaction can generate significant traffic from time to time. Isolating cluster traffic onto its own IP address range (cluster VLAN), if possible, can improve network performance and reduce network latency within the Cassandra cluster. In the case of multi-site deployments, you may need a VLAN-over-WAN solution.

The Cassandra rpc_address: / rpc_interface: settings are for client communication. This is the IP address (client access VLAN) that Dovecot backends would be configured to use.
Inter-node (default) ports:
Port | Service |
---|---|
7000 | Inter-node |
7001 | Inter-node (SSL) |
7199 | JMX |
Client (default) ports:
Port | Service |
---|---|
9042 | Client Port |
9160 | Thrift |
9142 | Native (SSL) |
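A quick way to check that a node is listening on the expected inter-node and client ports (assuming the iproute2 ss tool is available):

```bash
# Show listening TCP sockets for gossip (7000/7001), JMX (7199) and CQL (9042).
ss -ltn | grep -E ':(7000|7001|7199|9042) '
```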
Cassandra runs inside a Java JVM. The JVM settings are defined in cassandra-env.sh (older releases) or the jvm.options file in the /etc/cassandra configuration directory. The runtime settings are computed during startup based on system resources. The recommendation is to use the default settings and monitor the system.
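For example, heap usage and internal thread-pool pressure can be watched per node with nodetool:

```bash
# Node summary, including current and maximum heap memory usage.
nodetool info
# Per-stage thread pool statistics; growing "Pending" counts indicate pressure.
nodetool tpstats
```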
Various tips to assist in Cassandra configuration.
# The name of the cluster. This is mainly used to prevent machines in
# one logical cluster from joining another.
cluster_name: cluster1
Configure the first and fourth nodes in each data center to be your seed nodes (example IPs only):
- seeds: "172.16.0.1,172.16.0.4,172.16.1.1,172.16.1.4"
Configure the following concurrent_* values:
# For workloads with more data than can fit in memory, Cassandra's
# bottleneck will be reads that need to fetch data from
# disk. "concurrent_reads" should be set to (16 * number_of_drives) in
# order to allow the operations to enqueue low enough in the stack
# that the OS and drives can reorder them. Same applies to
# "concurrent_counter_writes", since counter writes read the current
# values before incrementing and writing them back.
#
# On the other hand, since writes are almost never IO bound, the ideal
# number of "concurrent_writes" is dependent on the number of cores in
# your system; (8 * number_of_cores) is a good rule of thumb.
concurrent_reads: 32              # default; with 1 SSD set to 16
concurrent_writes: 32             # default; with 8 cores set to 64
concurrent_counter_writes: 32
Configure the concurrent_compactors values:
# concurrent_compactors defaults to the smaller of (number of disks,
# number of cores), with a minimum of 2 and a maximum of 8.
#
# If your data directories are backed by SSD, you should increase this
# to the number of cores.
concurrent_compactors: 1          # default; with 8 cores set to 8
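The rules of thumb from the comments above can be computed from the node's hardware; a small sketch (the data-drive count is an assumption you must supply):

```bash
# Rule-of-thumb sizing for the concurrent_* settings in cassandra.yaml.
DATA_DRIVES=1        # number of drives backing data_file_directories
CORES=$(nproc)
echo "concurrent_reads:          $((16 * DATA_DRIVES))"
echo "concurrent_counter_writes: $((16 * DATA_DRIVES))"
echo "concurrent_writes:         $((8 * CORES))"
echo "concurrent_compactors:     $CORES   # with SSD-backed data directories"
```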
Adjust based on actual SSTable counts and compaction backlog observed in monitoring (see the sketch after this block):
# Throttles compaction to the given total throughput across the entire
# system. The faster you insert data, the faster you need to compact in
# order to keep the sstable count down, but in general, setting this to
# 16 to 32 times the rate you are inserting data is more than sufficient.
# Setting this to 0 disables throttling. Note that this accounts for all types
# of compaction, including validation compaction.
compaction_throughput_mb_per_sec: 16
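Whether the current throttle keeps up can be judged from the compaction backlog and per-table SSTable counts, for example:

```bash
# Pending and active compactions; a steadily growing backlog suggests the
# throughput limit (or the hardware) is too low.
nodetool compactionstats
# Per-table statistics for the mails keyspace, including "SSTable count".
nodetool tablestats mails
```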
Adjust based on WAN capacity:
# Throttles all streaming file transfer between the datacenters,
# this setting allows users to throttle inter dc stream throughput in addition
# to throttling all network stream traffic as configured with
# stream_throughput_outbound_megabits_per_sec
# When unset, the default is 200 Mbps or 25 MB/s
# inter_dc_stream_throughput_outbound_megabits_per_sec: 200
Adjust timeouts:
read_request_timeout_in_ms: 5000
range_request_timeout_in_ms: 10000
write_request_timeout_in_ms: 2000
counter_write_request_timeout_in_ms: 5000
cas_contention_timeout_in_ms: 1000
truncate_request_timeout_in_ms: 60000
request_timeout_in_ms: 10000
slow_query_log_timeout_in_ms: 500
# internode_compression controls whether traffic between nodes is compressed.
# Can be:
#
# all
# all traffic is compressed
#
# dc
# traffic between different datacenters is compressed
#
# none
# nothing is compressed.
internode_compression: dc
Long stop-the-world GC pauses are harmful; you may want to adjust the warning threshold:
# GC Pauses greater than gc_warn_threshold_in_ms will be logged at WARN level
# Adjust the threshold based on your application throughput requirement
# By default, Cassandra logs GC Pauses greater than 200 ms at INFO level
gc_warn_threshold_in_ms: 1000
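Cassandra logs GC pauses via GCInspector in the system log, so pauses above the threshold can be spotted with a simple search, for example:

```bash
# List logged GC pauses; WARN-level entries exceeded gc_warn_threshold_in_ms.
grep GCInspector /var/log/cassandra/system.log
```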
Configure your JVM (in jvm.options) to use the G1 garbage collector. By default, jvm.options uses the older CMS garbage collector.
#################
# GC SETTINGS #
#################
### CMS Settings
-XX:+UseParNewGC
-XX:+UseConcMarkSweepGC
-XX:+CMSParallelRemarkEnabled
-XX:SurvivorRatio=8
-XX:MaxTenuringThreshold=1
-XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly
-XX:CMSWaitDuration=10000
-XX:+CMSParallelInitialMarkEnabled
-XX:+CMSEdenChunksRecordAlways
# some JVMs will fill up their heap when accessed via JMX, see CASSANDRA-6541
-XX:+CMSClassUnloadingEnabled
### G1 Settings (experimental, comment previous section and uncomment section below to enable)
## Use the Hotspot garbage-first collector.
#-XX:+UseG1GC
#
## Have the JVM do less remembered set work during STW, instead
## preferring concurrent GC. Reduces p99.9 latency.
#-XX:G1RSetUpdatingPauseTimePercent=5
#
## Main G1GC tunable: lowering the pause target will lower throughput and vice versa.
## 200ms is the JVM default and lowest viable setting
## 1000ms increases throughput. Keep it smaller than the timeouts in cassandra.yaml.
#-XX:MaxGCPauseMillis=500
## Optional G1 Settings
# Save CPU time on large (>= 16GB) heaps by delaying region scanning
# until the heap is 70% full. The default in Hotspot 8u40 is 40%.
#-XX:InitiatingHeapOccupancyPercent=70
# For systems with > 8 cores, the default ParallelGCThreads is 5/8 the number of logical cores.
# Otherwise equal to the number of cores when 8 or less.
# Machines with > 10 cores should try setting these to <= full cores.
#-XX:ParallelGCThreads=16
# By default, ConcGCThreads is 1/4 of ParallelGCThreads.
# Setting both to the same value can reduce STW durations.
#-XX:ConcGCThreads=16
### GC logging options -- uncomment to enable
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-XX:+PrintHeapAtGC
-XX:+PrintTenuringDistribution
-XX:+PrintGCApplicationStoppedTime
-XX:+PrintPromotionFailure
#-XX:PrintFLSStatistics=1
# Recommended: enable GC logging and rotate the logs
-Xloggc:/var/log/cassandra/gc.log
-XX:+UseGCLogFileRotation
-XX:NumberOfGCLogFiles=10
-XX:GCLogFileSize=10M
The following is an OS tuning script provided by DataStax.
Note
This was created for use with CentOS 6; it may not reflect settings needed for more modern versions of Linux.
#!/usr/bin/env bash
# use the deadline I/O scheduler for the data disk
echo deadline > /sys/block/vda/queue/scheduler
# disable NUMA zone reclaim
echo 0 > /proc/sys/vm/zone_reclaim_mode
# readahead to 64
blockdev --setra 64 /dev/vda1
# swap off
echo 0 > /proc/sys/vm/swappiness
swapoff --all
# user limits
grep -q -F '* - nproc 32768' /etc/security/limits.d/90-nproc.conf || echo '* - nproc 32768' >> /etc/security/limits.d/90-nproc.conf
grep -q -F 'vm.max_map_count = 131072' /etc/sysctl.conf || echo 'vm.max_map_count = 131072' >> /etc/sysctl.conf
This describes the format of the JSON output produced when the metrics configuration option is activated.
Source: https://docs.datastax.com/en/developer/cpp-driver/2.17/api/struct.CassMetrics/
{
"Requests": {
# Minimum in microseconds
"min": [Number: integer],
# Maximum in microseconds
"max": [Number: integer],
# Mean in microseconds
"mean": [Number: integer],
# Standard deviation in microseconds
"stddev": [Number: integer],
# Median in microseconds
"median": [Number: integer],
# 75th percentile in microseconds
"percentile_75th": [Number: integer],
# 95th percentile in microseconds
"percentile_95th": [Number: integer],
# 98th percentile in microseconds
"percentile_98th": [Number: integer],
# 99th percentile in microseconds
"percentile_99th": [Number: integer],
# 99.9th percentile in microseconds
"percentile_999th": [Number: integer],
# Mean rate in requests per second
"mean_rate": [Number: fraction],
# 1 minute rate in requests per second
"one_minute_rate": [Number: fraction],
# 5 minute rate in requests per second
"five_minute_rate": [Number: fraction],
# 15 minute rate in requests per second
"fifteen_minute_rate": [Number: fraction]
},
"stats": {
# The total number of connections
"total_connections": [Number: integer],
# The number of connections available to take requests
"available_connections": [Number: integer],
# Occurrences when requests exceeded a pool's water mark
"exceeded_pending_requests_water_mark": [Number: integer],
# Occurrences when number of bytes exceeded a connection's water mark
"exceeded_write_bytes_water_mark": [Number: integer]
},
"queries": {
# Number of queries sent to Cassandra
"sent": [Number: integer],
# Number of successful responses
"recv_ok": [Number: integer],
# Number of requests that couldn’t be sent, because the local
# Cassandra driver’s queue was full.
"recv_err_queue_full": [Number: integer],
# Number of requests that didn’t succeed because the Cassandra
# driver couldn’t connect to the server.
"recv_err_no_hosts": [Number: integer],
# Number of requests that didn’t succeed because the Cassandra
# driver timed out while waiting for response from server.
"recv_err_client_timeout": [Number: integer],
# Number of requests that didn’t succeed because the Cassandra
# server reported a timeout communicating with other nodes.
"recv_err_server_timeout": [Number: integer],
# Number of requests which couldn’t succeed, because not enough
# Cassandra nodes were available for the consistency level.
"recv_err_server_unavailable": [Number: integer]
# Number of requests which couldn’t succeed for other reasons.
"recv_err_other": [Number: integer]
},
"errors": {
# Occurrences of a connection timeout
"connection_timeouts": [Number: integer],
# [No description provided]
"pending_request_timeouts": [Number: integer],
# Occurrences of requests that timed out waiting for a connection
"request_timeouts": [Number: integer]
}
}
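As an illustration only, if the emitted metrics JSON is captured to a file (metrics.json is a placeholder name), individual values can be extracted with jq:

```bash
# 99th percentile request latency (microseconds) and the 1-minute request rate.
jq '.Requests.percentile_99th, .Requests.one_minute_rate' metrics.json
```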