Cassandra is an open source NoSQL distributed database that offers scalability and high availability without compromising performance.
Cassandra is an important component of Dovecot Pro, required for obox and Palomar and optionally used for other features (e.g., last login).
This page contains suggestions and recommendations on how to configure and operate Cassandra in your installation.
WARNING
Documentation on this page is for informational purposes only. These are not requirements, and Open-Xchange/Dovecot cannot provide direct Cassandra support.
For specific assistance with a local Cassandra installation, consultation with a Cassandra expert is recommended.
Pre-install deployment checklist:
Apache Cassandra is a distributed database with tunable consistency. The normal Dovecot configuration implements quorum consistency, which provides strong consistency while tolerating failures. With a replication factor of 3, quorum is 2 (floor(3/2) + 1 replicas), so the cluster can tolerate one unavailable replica and still serve reads and writes.
Cassandra nodetool repair runs the anti-entropy service, which uses Merkle trees to detect and repair inconsistencies in data between replicas.
Another important element is gc_grace_seconds (10 days by default), which is the time-to-live marker for tombstones. If a node is still missing the tombstone after the gc_grace_seconds period, the deleted data will be resurrected.
In the Dovecot Pro log file, if you start seeing Object exists in dict, but not in storage errors, you most likely have resurrected deleted data, which has to be removed manually. See Obox Troubleshooting: Object exists in dict, but not in storage.
TIP

To prevent Cassandra data resurrection, you must regularly run nodetool repair (within gc_grace_seconds) via cron for the entire cluster.
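For example, a minimal cron sketch that runs a weekly primary-range repair on one node; the schedule, user, and log path are assumptions to adapt to your environment:

```bash
# /etc/cron.d/cassandra-repair (example only; stagger the schedule per node so
# that the whole cluster is repaired well within gc_grace_seconds)
# m  h  dom mon dow  user       command
0    2  *   *   0    cassandra  /usr/bin/nodetool repair -pr >> /var/log/cassandra/repair.log 2>&1
```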
Dovecot attempts to prevent creating too many tombstones within the same cluster key. Sometimes this may not work properly, and Cassandra queries towards a specific cluster key start failing (timing out) due to too many tombstones. This can be repaired by getting rid of the tombstones (a sketch of the procedure follows below):

1. Set gc_grace_seconds to a smaller value that covers the existing tombstones (e.g. 1 day).
2. Once the tombstones have been purged, set gc_grace_seconds back to the original value (10 days).

Other potential changes that may help:

- Set page_size=1000 in the dovecot-dict-cql.conf.ext connect setting so that large results are paged into multiple queries.
- Increase tombstone_failure_threshold.
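A minimal sketch of this procedure, assuming the affected table is mails.example_table (a placeholder name) and that a major compaction is used to drop the expired tombstones:

```bash
# 1. Temporarily lower gc_grace_seconds so the problem tombstones expire (1 day).
cqlsh -e "ALTER TABLE mails.example_table WITH gc_grace_seconds = 86400;"

# 2. Compact so that the expired tombstones are actually removed (run on each node).
nodetool compact mails example_table

# 3. Restore the original value (10 days).
cqlsh -e "ALTER TABLE mails.example_table WITH gc_grace_seconds = 864000;"
```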
For multiple data centers, the NetworkTopologyStrategy replication strategy is recommended for production environments.
When the mails keyspace is created, set replication to NetworkTopologyStrategy. This example sets the replication factor to 3 in each data center:
CREATE KEYSPACE IF NOT EXISTS mails WITH REPLICATION = {
    'class' : 'NetworkTopologyStrategy',
    'dc1_name' : 3,
    'dc2_name' : 3
};
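After creating the keyspace, the replication settings can be verified from cqlsh, for example:

```bash
# Show the keyspace definition, including its replication strategy and factors.
cqlsh -e "DESCRIBE KEYSPACE mails;"
```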
Note

Every node will only define itself (its own data center and rack).
A snitch determines which data centers and racks nodes belong to. The replication (in this case NetworkTopologyStrategy) places the replicas based on snitch information.
The GossipingPropertyFileSnitch uses Cassandra's gossip protocol to automatically update snitch information when adding new nodes. This is the recommended snitch for production.
Configure GossipingPropertyFileSnitch on each node with only that node's own information; Cassandra discovers all the nodes in the ring via the gossip protocol on startup. This configuration is set in cassandra-rackdc.properties.
Examples (one cassandra-rackdc.properties file per node):

dc=dc1_name
rack=rack1

dc=dc1_name
rack=rack2

dc=dc2_name
rack=rack1
Modify the cassandra.yaml file and change endpoint_snitch to use GossipingPropertyFileSnitch:
endpoint_snitch: GossipingPropertyFileSnitch
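Once the nodes have joined the ring, you can confirm that each node reports the expected data center and rack with nodetool, for example:

```bash
# Lists all nodes grouped by data center, with rack, load, and state per node.
nodetool status
```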
Cassandra nodes should have at least two disks and two network interfaces.

One disk is for the commit log, which should be fast enough to receive all writes as sequential I/O. The size of the commit log is controlled by the commitlog_total_space_in_mb setting in the cassandra.yaml file.

The other disk is for data and should be fast enough to satisfy both read and write I/O. The recommendation is to keep space utilization below 50% of the total disk size to leave room for repair and compaction operations.

Ideally there is a separate set of mirrored OS disks and a separate disk for the Cassandra commit log.
If you are sharing the OS disk with the Cassandra commit log, use the following paths:

/var/lib/cassandra/commitlog
/var/lib/cassandra/saved_caches
/var/lib/cassandra/hints

If you have a dedicated disk for the Cassandra commit log, create a mount point named /var/lib/cassandra and use the path names above.

The Cassandra data disk should be a dedicated SSD drive. Create the mount point /var/lib/cassandra/data for it.
Note

Cassandra configuration files are normally located in /etc/cassandra and logs in /var/log/cassandra.

The path names above are the default values in the cassandra.yaml file:
# Directory where Cassandra should store hints.
# If not set, the default directory is $CASSANDRA_HOME/data/hints.
hints_directory: /var/lib/cassandra/hints
# Directories where Cassandra should store data on disk. Cassandra
# will spread data evenly across them, subject to the granularity of
# the configured compaction strategy.
# If not set, the default directory is $CASSANDRA_HOME/data/data.
data_file_directories:
- /var/lib/cassandra/data
# commit log. when running on magnetic HDD, this should be a
# separate spindle than the data directories.
# If not set, the default directory is $CASSANDRA_HOME/data/commitlog.
commitlog_directory: /var/lib/cassandra/commitlog
# saved caches
# If not set, the default directory is $CASSANDRA_HOME/data/saved_caches.
saved_caches_directory: /var/lib/cassandra/saved_caches
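A minimal sketch for preparing these directories on a new node, assuming Cassandra runs as the cassandra user and any dedicated disks are already mounted:

```bash
# Create the standard Cassandra directories and give them to the cassandra user.
mkdir -p /var/lib/cassandra/{data,commitlog,saved_caches,hints}
chown -R cassandra:cassandra /var/lib/cassandra
```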
Cassandra gives you the ability to segregate inter-node traffic from client traffic. Most deployments use a single IP address for both Cassandra inter-node and client communication.

In Cassandra, the listen_address: / listen_interface: settings are used for inter-node cluster communication. Cassandra replication, repair, gossip, and compaction can generate significant traffic from time to time. Isolating cluster traffic onto its own IP address range (cluster VLAN), if possible, can improve network performance and reduce network latency within the Cassandra cluster. In the case of multi-site deployments, you may need a VLAN-over-WAN solution.

The Cassandra rpc_address: / rpc_interface: settings are for client communication. This is the IP address (client access VLAN) that Dovecot backends would be configured to use.
Inter-node (default) ports:
Port | Service |
---|---|
7000 | Inter-node |
7001 | Inter-node (SSL) |
7199 | JMX |
Client (default) ports:
Port | Service |
---|---|
9042 | Client Port |
9160 | Thrift |
9142 | Native (SSL) |
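A quick way to check that a node is listening on the expected inter-node and client ports (assuming the iproute2 ss tool is available):

```bash
# Show listening TCP sockets for gossip (7000/7001), JMX (7199) and CQL (9042).
ss -ltn | grep -E ':(7000|7001|7199|9042) '
```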
Cassandra runs inside a Java JVM. The JVM settings are defined in cassandra-env.sh (older releases) or the jvm.options file in the /etc/cassandra configuration directory. The runtime settings are computed during startup based on system resources. The recommendation is to use the default settings and monitor the system.
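For example, heap usage and internal thread-pool pressure can be watched per node with nodetool:

```bash
# Node summary, including current and maximum heap memory usage.
nodetool info
# Per-stage thread pool statistics; growing "Pending" counts indicate pressure.
nodetool tpstats
```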
Various tips to assist in Cassandra configuration.
# The name of the cluster. This is mainly used to prevent machines in
# one logical cluster from joining another.
cluster_name: cluster1
Configure the first and fourth nodes in each data center to be your seed nodes (example IPs only):
- seeds: "172.16.0.1,172.16.0.4,172.16.1.1,172.16.1.4"
Configure the following concurrent_* values:
# For workloads with more data than can fit in memory, Cassandra's
# bottleneck will be reads that need to fetch data from
# disk. "concurrent_reads" should be set to (16 * number_of_drives) in
# order to allow the operations to enqueue low enough in the stack
# that the OS and drives can reorder them. Same applies to
# "concurrent_counter_writes", since counter writes read the current
# values before incrementing and writing them back.
#
# On the other hand, since writes are almost never IO bound, the ideal
# number of "concurrent_writes" is dependent on the number of cores in
# your system; (8 * number_of_cores) is a good rule of thumb.
concurrent_reads: 32              # default; with 1 SSD set to 16
concurrent_writes: 32             # default; with 8 cores set to 64
concurrent_counter_writes: 32
Configure the concurrent_compactors values:
# concurrent_compactors defaults to the smaller of (number of disks,
# number of cores), with a minimum of 2 and a maximum of 8.
#
# If your data directories are backed by SSD, you should increase this
# to the number of cores.
concurrent_compactors: 1          # default; with 8 cores set to 8
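The rules of thumb from the comments above can be computed from the node's hardware; a small sketch (the data-drive count is an assumption you must supply):

```bash
# Rule-of-thumb sizing for the concurrent_* settings in cassandra.yaml.
DATA_DRIVES=1        # number of drives backing data_file_directories
CORES=$(nproc)
echo "concurrent_reads:          $((16 * DATA_DRIVES))"
echo "concurrent_counter_writes: $((16 * DATA_DRIVES))"
echo "concurrent_writes:         $((8 * CORES))"
echo "concurrent_compactors:     $CORES   # with SSD-backed data directories"
```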
Adjust based on actual SSTable counts and compaction backlog observed in monitoring (see the sketch after this block):
# Throttles compaction to the given total throughput across the entire
# system. The faster you insert data, the faster you need to compact in
# order to keep the sstable count down, but in general, setting this to
# 16 to 32 times the rate you are inserting data is more than sufficient.
# Setting this to 0 disables throttling. Note that this accounts for all types
# of compaction, including validation compaction.
compaction_throughput_mb_per_sec: 16
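Whether the current throttle keeps up can be judged from the compaction backlog and per-table SSTable counts, for example:

```bash
# Pending and active compactions; a steadily growing backlog suggests the
# throughput limit (or the hardware) is too low.
nodetool compactionstats
# Per-table statistics for the mails keyspace, including "SSTable count".
nodetool tablestats mails
```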
Adjust based on WAN capacity:
# Throttles all streaming file transfer between the datacenters,
# this setting allows users to throttle inter dc stream throughput in addition
# to throttling all network stream traffic as configured with
# stream_throughput_outbound_megabits_per_sec
# When unset, the default is 200 Mbps or 25 MB/s
# inter_dc_stream_throughput_outbound_megabits_per_sec: 200
Adjust timeouts:
read_request_timeout_in_ms: 5000
range_request_timeout_in_ms: 10000
write_request_timeout_in_ms: 2000
counter_write_request_timeout_in_ms: 5000
cas_contention_timeout_in_ms: 1000
truncate_request_timeout_in_ms: 60000
request_timeout_in_ms: 10000
slow_query_log_timeout_in_ms: 500
# internode_compression controls whether traffic between nodes is compressed.
# Can be:
#
# all
# all traffic is compressed
#
# dc
# traffic between different datacenters is compressed
#
# none
# nothing is compressed.
internode_compression: dc
Long stop-the-world GC pauses are harmful; you may want to adjust the warning threshold:
# GC Pauses greater than gc_warn_threshold_in_ms will be logged at WARN level
# Adjust the threshold based on your application throughput requirement
# By default, Cassandra logs GC Pauses greater than 200 ms at INFO level
gc_warn_threshold_in_ms: 1000
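Cassandra logs GC pauses via GCInspector in the system log, so pauses above the threshold can be spotted with a simple search, for example:

```bash
# List logged GC pauses; WARN-level entries exceeded gc_warn_threshold_in_ms.
grep GCInspector /var/log/cassandra/system.log
```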
Configure your JVM (in jvm.options) to use the G1 garbage collector. By default, jvm.options uses the older CMS garbage collector.
#################
# GC SETTINGS #
#################
### CMS Settings
-XX:+UseParNewGC
-XX:+UseConcMarkSweepGC
-XX:+CMSParallelRemarkEnabled
-XX:SurvivorRatio=8
-XX:MaxTenuringThreshold=1
-XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly
-XX:CMSWaitDuration=10000
-XX:+CMSParallelInitialMarkEnabled
-XX:+CMSEdenChunksRecordAlways
# some JVMs will fill up their heap when accessed via JMX, see CASSANDRA-6541
-XX:+CMSClassUnloadingEnabled
### G1 Settings (experimental, comment previous section and uncomment section below to enable)
## Use the Hotspot garbage-first collector.
#-XX:+UseG1GC
#
## Have the JVM do less remembered set work during STW, instead
## preferring concurrent GC. Reduces p99.9 latency.
#-XX:G1RSetUpdatingPauseTimePercent=5
#
## Main G1GC tunable: lowering the pause target will lower throughput and vice versa.
## 200ms is the JVM default and lowest viable setting
## 1000ms increases throughput. Keep it smaller than the timeouts in cassandra.yaml.
#-XX:MaxGCPauseMillis=500
## Optional G1 Settings
# Save CPU time on large (>= 16GB) heaps by delaying region scanning
# until the heap is 70% full. The default in Hotspot 8u40 is 40%.
#-XX:InitiatingHeapOccupancyPercent=70
# For systems with > 8 cores, the default ParallelGCThreads is 5/8 the number of logical cores.
# Otherwise equal to the number of cores when 8 or less.
# Machines with > 10 cores should try setting these to <= full cores.
#-XX:ParallelGCThreads=16
# By default, ConcGCThreads is 1/4 of ParallelGCThreads.
# Setting both to the same value can reduce STW durations.
#-XX:ConcGCThreads=16
### GC logging options -- uncomment to enable
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-XX:+PrintHeapAtGC
-XX:+PrintTenuringDistribution
-XX:+PrintGCApplicationStoppedTime
-XX:+PrintPromotionFailure
#-XX:PrintFLSStatistics=1
# Recommended: enable GC logging and rotate the logs
-Xloggc:/var/log/cassandra/gc.log
-XX:+UseGCLogFileRotation
-XX:NumberOfGCLogFiles=10
-XX:GCLogFileSize=10M
The following is an OS tuning script provided by DataStax.
Note
This was created for use with CentOS 6; it may not reflect settings needed for more modern versions of Linux.
#!/usr/bin/env bash
# use the deadline I/O scheduler for the data disk
echo deadline > /sys/block/vda/queue/scheduler
# disable NUMA zone reclaim
echo 0 > /proc/sys/vm/zone_reclaim_mode
# readahead to 64
blockdev --setra 64 /dev/vda1
# swap off
echo 0 > /proc/sys/vm/swappiness
swapoff --all
# user limits
grep -q -F '* - nproc 32768' /etc/security/limits.d/90-nproc.conf || echo '* - nproc 32768' >> /etc/security/limits.d/90-nproc.conf
grep -q -F 'vm.max_map_count = 131072' /etc/sysctl.conf || echo 'vm.max_map_count = 131072' >> /etc/sysctl.conf
This describes the format of the JSON output produced when the metrics configuration option is activated.
Source: https://docs.datastax.com/en/developer/cpp-driver/2.17/api/struct.CassMetrics/
{
"Requests": {
# Minimum in microseconds
"min": [Number: integer],
# Maximum in microseconds
"max": [Number: integer],
# Mean in microseconds
"mean": [Number: integer],
# Standard deviation in microseconds
"stddev": [Number: integer],
# Median in microseconds
"median": [Number: integer],
# 75th percentile in microseconds
"percentile_75th": [Number: integer],
# 95th percentile in microseconds
"percentile_95th": [Number: integer],
# 98th percentile in microseconds
"percentile_98th": [Number: integer],
# 99th percentile in microseconds
"percentile_99th": [Number: integer],
# 99.9th percentile in microseconds
"percentile_999th": [Number: integer],
# Mean rate in requests per second
"mean_rate": [Number: fraction],
# 1 minute rate in requests per second
"one_minute_rate": [Number: fraction],
# 5 minute rate in requests per second
"five_minute_rate": [Number: fraction],
# 15 minute rate in requests per second
"fifteen_minute_rate": [Number: fraction]
},
"stats": {
# The total number of connections
"total_connections": [Number: integer],
# The number of connections available to take requests
"available_connections": [Number: integer],
# Occurrences when requests exceeded a pool's water mark
"exceeded_pending_requests_water_mark": [Number: integer],
# Occurrences when number of bytes exceeded a connection's water mark
"exceeded_write_bytes_water_mark": [Number: integer]
},
"queries": {
# Number of queries sent to Cassandra
"sent": [Number: integer],
# Number of successful responses
"recv_ok": [Number: integer],
# Number of requests that couldn’t be sent, because the local
# Cassandra driver’s queue was full.
"recv_err_queue_full": [Number: integer],
# Number of requests that didn’t succeed because the Cassandra
# driver couldn’t connect to the server.
"recv_err_no_hosts": [Number: integer],
# Number of requests that didn’t succeed because the Cassandra
# driver timed out while waiting for response from server.
"recv_err_client_timeout": [Number: integer],
# Number of requests that didn’t succeed because the Cassandra
# server reported a timeout communicating with other nodes.
"recv_err_server_timeout": [Number: integer],
# Number of requests which couldn’t succeed, because not enough
# Cassandra nodes were available for the consistency level.
"recv_err_server_unavailable": [Number: integer]
# Number of requests which couldn’t succeed for other reasons.
"recv_err_other": [Number: integer]
},
"errors": {
# Occurrences of a connection timeout
"connection_timeouts": [Number: integer],
# [No description provided]
"pending_request_timeouts": [Number: integer],
# Occurrences of requests that timed out waiting for a connection
"request_timeouts": [Number: integer]
}
}
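As an illustration only, if the emitted metrics JSON is captured to a file (metrics.json is a placeholder name), individual values can be extracted with jq:

```bash
# 99th percentile request latency (microseconds) and the 1-minute request rate.
jq '.Requests.percentile_99th, .Requests.one_minute_rate' metrics.json
```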