4 releases
Uses old Rust 2015
Version | Released
---|---
0.1.7 | Jun 18, 2018
0.1.3 |
0.1.2 | Sep 18, 2017
0.1.1 | Sep 6, 2017
0.0.9 | Aug 7, 2017
rs-collector
rs-collector is a Bosun compatible collector for various services that are not covered by scollector, and that we use at CenterDevice.
Attention: Please be advised, even though we have been running rs-collector on our production systems successfully for months, this is not stable software.
Collectors
- Galera - Collects metrics about the cluster status and cluster sync performance of a MySQL Galera cluster.
- HasIpAddr - Checks if a host has bound specific IPv4 addresses.
- JVM - Collects garbage collection statistics.
- Megaraid - Collects Megaraid disk statistics.
- MongoDB - Collects replicaset metrics.
- Postfix - Collects queue lengths for all postfix queues.
- rs-collector - Collects internal metrics for rs-collector.
See below for details about the collectors.
Galera
The Galera collector collects metrics about the cluster status and cluster sync performance of a MySQL Galera cluster. We use it to watch for cluster split brain and general degradation situations. There is a full list of all available metrics in galera.rs, function metadata.
The Galera collector supports SSL transport encryption on Linux. See the example configuration for how to enable SSL.
Example Alarms
alert galera.cluster.state.uuid.no.consensus {
template = ...
critNotification = default
$metric = avg:galera.wsrep.cluster.state.uuid{domain=wildcard(*)}
$q = q("$metric", "5m", "")
$a = avg($q)
$f = first($q)
$q_alert = ($a - $f) != 0
crit = $q_alert
}
alert galera.cluster.state.not.primary {
template = ...
critNotification = default
$metric = sum:galera.wsrep.cluster.status{host=wildcard(*),domain=wildcard(*)}
$q = q("$metric", "5m", "")
$t = t(last($q), "domain")
$q_alert = sum($t)
$primaryValue = 0
crit = $q_alert != $primaryValue
}
alert galera.local.state.not.synced {
template = ...
critNotification = default
$metric = zimsum:5m-avg:galera.wsrep.local.state{domain=wildcard(*)}
$q = q("$metric", "5m", "")
$q_alert = last($q)
$syncedValue = 12
crit = $q_alert != $syncedValue
}
alert galera.cluster.size.degraded {
template = ...
critNotification = default
$metric = avg:galera.wsrep.cluster.size{domain=wildcard(*)}
$q = q("$metric", "5m", "")
$q_alert = last($q)
$critValue = 3
crit = $q_alert != $critValue
}
HasIpAddr
The HasIpAddr collector sends 1 if a host has bound a specific IPv4 address and 0 if it has not. This is helpful in cases where hosts bind or release IPv4 addresses dynamically. For example, in a keepalived VRRP cluster it allows Bosun to check whether, and on how many hosts, a virtual, highly available IP address is bound.
In our production clusters we have observed situations when none of the cluster members had bound the virtual IP address. This collector allows us to define an alarm for such cases.
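The 1-or-0 semantics described above can be sketched as follows. This is a minimal illustration, not the collector's actual code: the function name `has_ipv4_samples` and its signature are hypothetical, and the list of currently bound addresses is assumed to have been obtained from the operating system elsewhere.

```rust
use std::net::Ipv4Addr;

/// For each configured address, yield 1 if it is currently bound on this
/// host and 0 otherwise. `bound` is assumed to come from the OS; how it is
/// obtained is out of scope for this sketch.
fn has_ipv4_samples(configured: &[Ipv4Addr], bound: &[Ipv4Addr]) -> Vec<(Ipv4Addr, u8)> {
    configured
        .iter()
        .map(|addr| (*addr, u8::from(bound.contains(addr))))
        .collect()
}
```

Summing these values per address across all hosts gives the number of hosts that currently hold the address, which is what an alarm on a virtual IP would compare against the expected count of 1.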
Example Alarm
alert os.net.vrrp-vip-failed {
template = ...
critNotification = default
$metric = sum:os.net.has_ipv4s{ipv4=wildcard(*)}
$q_alert = sum(t(last(q("$metric", "5m", "")), "ipv4"))
$expected = 1
$critValue = $expected
crit = $q_alert != $critValue
}
JVM
The JVM collector collects garbage collection statistics, i.e., those that jstat -gc reveals, for each specified, running JVM. This collector has been tested with OpenJDK "7u51-2.4.6-1ubuntu4" and Oracle JDK "1.8.0_121". JVMs are identified by a regular expression that matches the class name or the command line arguments.
This collector only collects statistics for specified JVMs; cf. example configuration. It currently does not distinguish between multiple instances of the same identified JVM.
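To illustrate what collecting jstat -gc statistics involves, the following sketch parses the tool's two-line output (a header line with column names, then one line of values) into a map. It is an assumption about one possible approach, not the collector's actual parser; the function name `parse_jstat_gc` is hypothetical.

```rust
use std::collections::HashMap;

/// Parse the output of `jstat -gc <pid>`: one header line with column names
/// followed by one line of numeric values. Returns None if a line is
/// missing, the column counts differ, or a value is not a number.
fn parse_jstat_gc(output: &str) -> Option<HashMap<String, f64>> {
    let mut lines = output.lines().filter(|l| !l.trim().is_empty());
    let header: Vec<&str> = lines.next()?.split_whitespace().collect();
    let values: Vec<&str> = lines.next()?.split_whitespace().collect();
    if header.len() != values.len() {
        return None;
    }
    // Collecting an iterator of Option<(K, V)> into Option<HashMap<K, V>>
    // yields None as soon as a single value fails to parse.
    header
        .iter()
        .zip(values.iter())
        .map(|(name, value)| value.parse::<f64>().ok().map(|v| (name.to_string(), v)))
        .collect()
}
```

Because the column names are taken from the header line, the same function copes with column sets that differ between JDK versions.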
Megaraid
The Megaraid collector collects disk drive statistics using the MegaCLI tool. It collects statistics like
- hw.storage.drivestats.mediaerrors: Number of media errors reported for the device by the RAID controller.
- hw.storage.drivestats.othererrors: Number of other errors reported for the device by the RAID controller.
- hw.storage.drivestats.predfailerrors: Number of errors that are considered critical by the RAID controller.
- hw.storage.drivestats.smartflag
- hw.storage.drivestats.firmwarestate
- hw.storage.drivestats.predfaileventno: Sequence number of the most recent recorded predictive failure event.
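As a rough sketch of how such counters could be extracted from MegaCLI's textual output, the code below looks for "label: value" lines in the report for a single drive. The label strings and their mapping to the metric names are assumptions made for illustration, and both function names are hypothetical; the collector's real parsing may differ.

```rust
/// Return the integer after the colon on the first line that starts with
/// `label`, or None if there is no such line or the value is not a number.
fn count_after_label(output: &str, label: &str) -> Option<u64> {
    output
        .lines()
        .map(str::trim)
        .find(|line| line.starts_with(label))
        .and_then(|line| line.splitn(2, ':').nth(1))
        .and_then(|value| value.trim().parse().ok())
}

/// Map assumed MegaCLI labels to metric names for one drive's report.
/// Labels that are missing from the output are skipped.
fn drive_error_metrics(output: &str) -> Vec<(&'static str, u64)> {
    // Assumption: these label-to-metric pairs are illustrative only.
    const LABELS: [(&str, &str); 3] = [
        ("Media Error Count", "hw.storage.drivestats.mediaerrors"),
        ("Other Error Count", "hw.storage.drivestats.othererrors"),
        ("Predictive Failure Count", "hw.storage.drivestats.predfailerrors"),
    ];
    LABELS
        .iter()
        .filter_map(|(label, metric)| count_after_label(output, label).map(|v| (*metric, v)))
        .collect()
}
```

A report covering several drives would first have to be split into per-drive blocks, since this sketch only reads the first occurrence of each label.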
Mongo
The Mongo collector collects MongoDB connection, op counter, replicaset and cluster metrics. We use it to check for cluster split brain and general degradation situations. There is a full list of all available metrics in mongo.rs, function metadata.
For connection and op statistics, the following metrics are helpful:
- mongo.connections.current collects the number of incoming connections from clients to the database server. This includes all incoming connections, including the current shell session and connections from other servers such as replica set members or mongos instances. Consider the value of connections.available to add more context to this datum.
- mongo.connections.available collects the number of unused incoming connections available. Consider this value in combination with the value of connections.current to understand the connection load on the database, and see the UNIX ulimit Settings document for more information about system thresholds on available connections.
- mongo.connections.totalCreated counts all incoming connections created to the server. This number includes connections that have since closed.
- mongo.opcounters.insert collects the total number of insert operations received since the mongod instance last started.
- mongo.opcounters.query collects the total number of queries received since the mongod instance last started.
- mongo.opcounters.update collects the total number of update operations received since the mongod instance last started.
- mongo.opcounters.delete collects the total number of delete operations since the mongod instance last started.
- mongo.opcounters.getmore collects the total number of "getmore" operations since the mongod instance last started. This counter can be high even if the query count is low. Secondary nodes send getMore operations as part of the replication process.
- mongo.opcounters.command collects the total number of commands issued to the database since the mongod instance last started. It counts all commands except the write commands: insert, update, and delete.
For replicaset and cluster monitoring, the following two metrics are helpful:
- mongo.replicasets.members.mystate collects the "myState" variable from each replica set member. This makes it possible to determine whether that particular replica set is in a sane state.
- mongo.replicasets.oplog_lag.[min,avg,max] collects the min, avg, and max oplog replication lag between a replica set's primary and the corresponding secondaries. These values are measured only on the currently active primary.
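The min, avg, and max oplog lag can be pictured with the following sketch. It assumes that the primary's and each secondary's latest oplog timestamps are already available as seconds since the epoch; the function name `oplog_lag` and this input format are hypothetical and not taken from mongo.rs.

```rust
/// Compute (min, avg, max) replication lag in seconds between the primary's
/// latest oplog timestamp and each secondary's. Returns None if there are
/// no secondaries. A secondary that appears ahead of the primary (for
/// example due to clock granularity) is counted as lag 0.
fn oplog_lag(primary_optime: u64, secondary_optimes: &[u64]) -> Option<(u64, f64, u64)> {
    let lags: Vec<u64> = secondary_optimes
        .iter()
        .map(|s| primary_optime.saturating_sub(*s))
        .collect();
    let min = *lags.iter().min()?;
    let max = *lags.iter().max()?;
    let avg = lags.iter().sum::<u64>() as f64 / lags.len() as f64;
    Some((min, avg, max))
}
```

A steadily growing max value indicates a secondary that is falling behind, even when min and avg still look healthy.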
Example Alarms
alert mongo.replicaset.state.unexpected {
template = ...
critNotification = default
$metric = sum:mongo.replicasets.members.mystate{host=wildcard(*),replicaset=wildcard(*)}
$q = q("$metric", "5m", "")
$t = t(last($q), "replicaset")
$q_alert = sum($t)
$critValue = 5
crit = $q_alert != $critValue
}
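The critical value of 5 in this alarm is plausibly the sum of the myState values of a three-member replica set with one primary and two secondaries; the size of three is an assumption, not something the text states. MongoDB reports myState 1 for PRIMARY and 2 for SECONDARY. The sketch below computes the expected sum and, as a stricter alternative, checks the individual states; both function names are hypothetical.

```rust
const PRIMARY: u32 = 1; // MongoDB replica set member state PRIMARY
const SECONDARY: u32 = 2; // MongoDB replica set member state SECONDARY

/// Expected sum of myState values for a replica set with exactly one
/// primary and `members - 1` secondaries. Returns None for an empty set.
fn expected_state_sum(members: u32) -> Option<u32> {
    if members == 0 {
        return None;
    }
    Some(PRIMARY + (members - 1) * SECONDARY)
}

/// Stricter check: exactly one primary and every other member a secondary.
/// Arbiters and other states are not considered in this sketch.
fn replica_set_is_sane(states: &[u32]) -> bool {
    let primaries = states.iter().filter(|s| **s == PRIMARY).count();
    let secondaries = states.iter().filter(|s| **s == SECONDARY).count();
    primaries == 1 && primaries + secondaries == states.len()
}
```

Comparing the sum alone is compact, but different state combinations can add up to the same value, which is why the per-member check is shown alongside it.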
Postfix
The Postfix collector collects metrics about Postfix' queues. This is helpful to monitor how the queues fill and empty over time, as well as to see if the queues are emptied at all, in order to alarm when mail delivery stalls. There is a full list of all available metrics in postfix.rs, function metadata.
Example Alarms
alert postfix.mailqueue.deferred.too.long {
template = ...
critNotification = default
warnNotification = default
$metric = sum:5m-min:postfix.queues.deferred{domain=wildcard(*)}
$q = q("$metric", "5m", "")
$t = t(last($q), "domain")
$q_alert = sum($t)
warn = $q_alert
}
alert postfix.mailqueue.deferred.unchanged {
template = ...
warnNotification = default
$period = 4h
$metric = postfix.queues.deferred{domain=wildcard(*)}
$q_min = q("min:$metric", "$period", "")
$q_max = q("max:$metric", "$period", "")
$min_queue_len = min($q_min)
$max_queue_len = max($q_max)
$q_alert = $min_queue_len > 0 && $max_queue_len == $min_queue_len
warn = $q_alert
}
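The condition of the postfix.mailqueue.deferred.unchanged alarm can be restated in a few lines of Rust. This is only a sketch of the logic, with a hypothetical function name; it assumes the queue lengths sampled during the period are available as a slice.

```rust
/// True if the deferred queue was never empty during the window and its
/// length never changed, i.e. min > 0 and max == min. An empty window
/// yields false.
fn deferred_queue_unchanged(samples: &[u64]) -> bool {
    match (samples.iter().min(), samples.iter().max()) {
        (Some(min), Some(max)) => *min > 0 && min == max,
        _ => false,
    }
}
```

A queue that stays at the same non-zero length for the whole period suggests that delivery attempts are not draining it, which is the stall the alarm is meant to catch.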
rs-collector Internal Metrics
- rs-collector.stats.rss collects the resident set size (physical memory) in KB consumed by rs-collector; only supported on Linux.
- rs-collector.stats.samples collects the number of transmitted samples.
- rs-collector.versio collects the version 'x.y.z' of rs-collector as x * 1.000.0000 + y * 1000 + z.
These metrics can also be used to check the liveness of rs-collector and as a heartbeat.
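The numeric version encoding can be sketched as below. The multiplier for x is written ambiguously in the text, so this sketch assumes 1,000,000, which pairs naturally with the factor 1000 for y; the function name `encode_version` is hypothetical.

```rust
/// Encode a version string "x.y.z" as x * 1_000_000 + y * 1_000 + z.
/// Assumption: the major multiplier is 1,000,000. Returns None unless the
/// input consists of exactly three non-negative integers.
fn encode_version(version: &str) -> Option<u64> {
    let mut parts = version.split('.').map(|p| p.parse::<u64>().ok());
    let major = parts.next()??;
    let minor = parts.next()??;
    let patch = parts.next()??;
    if parts.next().is_some() {
        return None;
    }
    Some(major * 1_000_000 + minor * 1_000 + patch)
}
```

Under this assumption, version 0.1.7 is reported as 1007, and the encoding stays unambiguous as long as y and z remain below 1000.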
Configuration
Please see this example.
Installation
Ubuntu [x86_64 and Raspberry Pi]
Please add my PackageCloud open source repository and install rs-collector via apt.
curl -s https://packagecloud.io/install/repositories/lukaspustina/opensource/script.deb.sh | sudo bash
sudo apt-get install rs-collector
Linux Binaries [x86_64 and Raspberry Pi]
There are binaries available at the GitHub release page. The binaries get compiled on Ubuntu.
macOS
Please use Homebrew to install rs-collector on your system.
brew install lukaspustina/os/rs-collector
Sources
Please install Rust via rustup and then run
cargo install rs-collector
Ansible
There is also an Ansible role available at Ansible Galaxy that automates the installation of rs-collector.
Known Issues
- General: Minor memory leak in chan::tick; cf. Roadmap.
- JVM: Does not distinguish between JVMs with the same name assigned via configuration, i.e., multiple instances of the same Java application.
Roadmap
Please see Todos.