Now that the nstat Telegraf plugin can collect data from
`/proc/net/softnet_stat` on dropped packets and rx_net_action
(a.k.a. time squeeze), enable it globally on all hosts.
Also update the Grafana dashboard to include the new graphs and add
four new Prometheus alerts.
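For reference, a minimal Python sketch of reading these counters. It assumes
the usual layout of `/proc/net/softnet_stat`: one row per CPU, hexadecimal
columns, the first three being processed, dropped and time_squeeze. The
function name and the sample rows are made up for illustration; this is not
the Telegraf plugin's code.

```python
def parse_softnet_stat(text):
    """Return one dict per row of /proc/net/softnet_stat content."""
    rows = []
    for index, line in enumerate(text.splitlines()):
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed rows
        rows.append({
            "row": index,
            # All columns are hexadecimal counters.
            "processed": int(fields[0], 16),
            "dropped": int(fields[1], 16),
            "time_squeeze": int(fields[2], 16),
        })
    return rows


# Made-up sample content (two CPUs), not taken from a real host.
SAMPLE = (
    "0000272d 00000000 00000001 00000000 00000000\n"
    "0000034a 00000002 0000000a 00000000 00000000\n"
)

stats = parse_softnet_stat(SAMPLE)
total_dropped = sum(row["dropped"] for row in stats)
total_squeezed = sum(row["time_squeeze"] for row in stats)
```

On a real host the input would come from
`open("/proc/net/softnet_stat").read()`.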
Related-Bug: PROD-21090
Change-Id: I9dfe87bdc8b677a51e3f305dd3c75c7d4cc4e0d4
On mirror.fuel-infra.org there is a package,
td-agent-additional-plugins, which ships all the additional
fluentd gems (plugins and their dependencies).
Change-Id: I0f66c793de67e9574d38b30ee3f62d534aa0bb75
Related-Bug: PROD-17532
It's possible for fluentd to match and report false positives with
the current regex for hdd errors. The following example log line:
failed to deactivate service binding for container
jenkins_slave02.1.tijvdstzxrs6gikbwrtu85078" error=
can be caught by the regex, and a (false positive) report about the
issue will be sent to Prometheus. The new regex must therefore be
stricter in order to avoid such alerts.
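To illustrate the idea in Python: the actual fluentd regexes are not shown
in this message, so both patterns below are hypothetical, as is the kernel
log line used as the true positive. The loose pattern fires on any
device-like substring near the word "error" (here the `vds` inside the
container ID), while the strict one requires an explicit I/O error on a
named device.

```python
import re

# Hypothetical patterns, for illustration only.
LOOSE = re.compile(r"error.*[sv]d[a-z]|[sv]d[a-z].*error")
STRICT = re.compile(r"I/O error.*\bdev(?:ice)? ([sv]d[a-z]+\d*)\b")

# The log line from this commit message.
docker_line = (
    "failed to deactivate service binding for container "
    'jenkins_slave02.1.tijvdstzxrs6gikbwrtu85078" error='
)
# Assumed example of a genuine disk error line.
disk_line = "Buffer I/O error on dev sda1, logical block 0"

loose_hits_docker = LOOSE.search(docker_line) is not None    # false positive
strict_hits_docker = STRICT.search(docker_line) is not None  # rejected
strict_match_disk = STRICT.search(disk_line)                 # still caught
```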
Change-Id: Ieb27ca39a32ad7bf6e1d0e88d564405e460a4f5f
Closes-Bug: PROD-17883
The other disk alerts use predict_linear() to trigger before a disk gets
full but they don't trigger when the disk is effectively (or nearly)
full.
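A rough Python sketch of why this happens. The `predict_linear` function
below is a simplified least-squares extrapolation written for this example,
not Prometheus's implementation, and the 5% threshold and sample values are
assumptions. A disk that is already nearly full has a flat trend, so the
prediction never crosses zero, while a plain threshold check still fires.

```python
def predict_linear(samples, horizon):
    """Least-squares extrapolation of (t, value) samples, `horizon`
    seconds past the last sample. Simplified for illustration."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    slope = cov / var
    intercept = mean_v - slope * mean_t
    return intercept + slope * (samples[-1][0] + horizon)


# Free space in percent, sampled every 60 seconds (made-up values).
filling = [(0, 50.0), (60, 49.0), (120, 48.0)]    # steadily filling up
nearly_full = [(0, 1.0), (60, 1.0), (120, 1.0)]   # already nearly full, flat

# Predictive alert: free space expected to drop below zero within an hour.
predictive_fires_filling = predict_linear(filling, 3600) < 0
predictive_fires_full = predict_linear(nearly_full, 3600) < 0

# Static alert (assumed threshold): less than 5% free right now.
static_fires_full = nearly_full[-1][1] < 5.0
```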
Change-Id: I8e6794d35bf96378ca3e3d527db4315d2b3a868d
This is typically used to mount Docker containers but it generates too
many volatile metrics which aren't useful.
Change-Id: I00117895570515b2c8f9690542e83061309464c3
Since metrics on dropped packets are counters, the alerts should use
the rate() function. This change also fixes some inconsistencies in the
alert descriptions.
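The reasoning can be sketched in Python. `simple_rate` is a simplified
stand-in written for this example: Prometheus's `rate()` also extrapolates
to the window boundaries, which is omitted here. The sample values are made
up. A raw counter stays above zero forever once a single packet has been
dropped, whereas the rate returns to zero when drops stop.

```python
def simple_rate(samples):
    """Per-second increase over (timestamp, counter) samples,
    tolerating counter resets. Simplified for illustration."""
    if len(samples) < 2:
        return None
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop in value means the counter was reset; count from zero.
        increase += cur - prev if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else None


# Dropped-packet counter sampled every 60 seconds (made-up values).
quiet = [(0, 100), (60, 100), (120, 100)]    # old drops, nothing new
dropping = [(0, 100), (60, 100), (120, 160)]  # 60 new drops in 120 s

raw_alert_quiet = quiet[-1][1] > 0         # fires although nothing is wrong
rate_alert_quiet = simple_rate(quiet) > 0  # stays silent
rate_dropping = simple_rate(dropping)      # 0.5 drops per second
```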
Change-Id: I9abbc0a49f45ba760836c436a3e7e65aa62f652e
This change adds the Telegraf configuration to collect swap metrics,
along with the associated Prometheus alarms and Grafana dashboard graphs.
Change-Id: I3595fd0b8cab06215c620642da69dd29c398396a
The `linux_netlink.ls` function used a regex to choose which interfaces
to collect metrics for.
`_alphanum_re = re.compile(r'^[a-z0-9]+$')`
Unfortunately, by default this excludes vlan and tap interfaces, which
are kind of important, e.g. `bond0.120` or `tap2a3dab86-fb`.
There is a second problem: even if we update the regex to include
these interfaces, when someone deletes an instance and spawns a new one
the tap device name changes on the compute host, and the new device will
not be monitored unless someone re-runs the `collectd` on the compute
again. Less than ideal.
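A quick Python check of the pattern quoted above against these names. The
wider pattern `_wider_re` is a hypothetical alternative added for
comparison only; it is not what this commit does.

```python
import re

# The pattern quoted above: lowercase letters and digits only.
_alphanum_re = re.compile(r'^[a-z0-9]+$')

# Hypothetical wider pattern (an assumption, for comparison): also accepts
# '.' for VLAN sub-interfaces and '-' for tap devices.
_wider_re = re.compile(r'^[a-z0-9.-]+$')

names = ["eth0", "bond0", "bond0.120", "tap2a3dab86-fb"]

matched = [n for n in names if _alphanum_re.match(n)]
matched_wider = [n for n in names if _wider_re.match(n)]

print(matched)        # ['eth0', 'bond0']
print(matched_wider)  # all four names
```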
This commit lets us choose `VerboseInterface "all"` using Pillar data
to avoid this problem.