Michal Kobus
97242f156a
Cosmetic changes for alerts
Change-Id: I9e8464e3ee5ef28ca5eb7eb84e645e42fb6576cd
Closes-bug: PROD-20466
6 years ago
Michal Kobus
d40d0f1e24
Alerts reworked
Change alert names, severities and descriptions.
Closes-bug: PROD-19718
Change-Id: I238fbcd51cf48389b504ccb531ba9b2bc9dd4be6
6 years ago
Mateusz Matuszkowiak
a7a3bda4f6
Be able to gather hdd_errors with syslog
Change-Id: I562455b281f1a92f674e859fa237b38f3432df7b
Closes-Bug: PROD-19728
6 years ago
Mateusz Matuszkowiak
c7be65af8b
Added bond Dashboard
Change-Id: Icf8d0fde5120012b449befc8cd4ebea915da9d0d
Partial-Bug: PROD-16264
6 years ago
Mateusz Matuszkowiak
734ab84c19
Added one more alert regarding bond
Partial-Bug: PROD-16264
Change-Id: I4f548a95bfb83076301f4669c1ff662c213c4aa3
6 years ago
Mateusz Matuszkowiak
55ca321447
Added bond related Prometheus alerts
Change-Id: Ic3c3186f42762062a65d340010b0ebff40f7c577
Partial-Bug: PROD-16264
6 years ago
Mateusz Matuszkowiak
7399519b94
Support monitoring for bond interfaces via telegraf
Change-Id: I963dbca50f9ce9f7ad4913640e18833039b68992
Partial-Bug: PROD-16264
6 years ago
Mateusz Matuszkowiak
120f611a0c
Use DEB pkgs for Fluentd plugins
On mirror.fuel-infra.org there is a package,
td-agent-additional-plugins, which ships all the additional
fluentd gems (plugins and their deps).
Change-Id: I0f66c793de67e9574d38b30ee3f62d534aa0bb75
Related-Bug: PROD-17532
6 years ago
Michael Fladischer
1e41e3065d
Use items() instead of iteritems() for Python3 compatibility.
iteritems() was removed in Python 3, while items() is also available in
Python 2.7.
6 years ago
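A minimal sketch of the compatibility point made in the commit above. The helper name `render_pairs` and the sample dict are hypothetical and are not taken from the formula.

```python
def render_pairs(mapping):
    """Return sorted 'key=value' strings for a dict.

    dict.items() exists on both Python 2.7 (returns a list) and
    Python 3 (returns a view), so this loop runs unchanged on either.
    """
    return sorted("%s=%s" % (key, value) for key, value in mapping.items())


# Hypothetical sample data, for illustration only.
pairs = render_pairs({"b": 2, "a": 1})
print(pairs)
```

On Python 2.7 `items()` builds a full list rather than iterating lazily, which only matters for very large dicts.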
Mateusz Matuszkowiak
03538a8530
Change regex for hdd errors to be more strict
It's possible for fluentd to match and report false positives with the
current regex for hdd errors. The following example log line:
failed to deactivate service binding for container
jenkins_slave02.1.tijvdstzxrs6gikbwrtu85078" error=
can be caught by the regex, and a (false positive) report about the
issue will be sent to Prometheus. So the new regex must be stricter
in order to avoid such alerts.
Change-Id: Ieb27ca39a32ad7bf6e1d0e88d564405e460a4f5f
Closes-Bug: PROD-17883
6 years ago
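The false positive described in the commit above can be reproduced with a small sketch. Both patterns below are assumptions made for illustration: the commit message does not quote the old or the new regex, and the kernel-style sample line is also invented.

```python
import re

# Hypothetical "loose" pattern: fires on any line mentioning an error.
loose = re.compile(r"error", re.IGNORECASE)

# Hypothetical "strict" pattern: requires a kernel-style I/O error message
# that names a block device.
strict = re.compile(
    r"(?:blk_update_request: I/O error, dev|Buffer I/O error on dev) (sd[a-z]+\d*)"
)

# The log line quoted in the commit message, joined onto one line.
container_line = (
    "failed to deactivate service binding for container "
    'jenkins_slave02.1.tijvdstzxrs6gikbwrtu85078" error='
)
# Invented example of a genuine disk error line.
disk_line = "blk_update_request: I/O error, dev sda, sector 1234"

loose_hits_container = loose.search(container_line) is not None
strict_hits_container = strict.search(container_line) is not None
strict_device = strict.search(disk_line).group(1)
```

Anchoring the match on the device name is one way to tighten such a pattern. The actual change may have taken a different approach.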
Bartosz Kupidura
6616077674
Generate metrics from logs
Change-Id: I5a8ccb235d36c1b4115794904f373a5704c2296d
7 years ago
Kirill Mashchenko
01ad2ccdce
Increase disk issues timeout for alerts
Change-Id: I646a852be587598ff0866e5941d954a6ac1fdd08
7 years ago
Kirill Mashchenko
f2a380d42a
Reduce alerting noise for system disk issues
Change-Id: I4fb69e8defa44a9d92a9fb7c23a6280fffc1a3e9
7 years ago
Filip Pytloun
36200a423f
Fix meta/salt.yml to workaround broken formulas
Change-Id: I6b6fbaaebd3e349bf76aa05cb9eb2004a842d9c5
7 years ago
Bartosz Kupidura
3852f9cffc
Move fluentd config under agent role
Change-Id: I22e7e4713e20f6a0f79c5ab4b3066f1f0129feb0
7 years ago
sgudz
f73b92fddf
Fix for disabled repos
Change-Id: Icbbd64144e6619eaa56e02c6c9362c7bcad9dd96
7 years ago
Bartosz Kupidura
f2706bc09a
Fix pos_file location
Change-Id: Ifa2787d76cf18e29f54583046349c071c5e9a25e
7 years ago
Bartosz Kupidura
19330f5e9e
Add fluentd support
Change-Id: I64a93135daebe7d55430adc51de2c9186c7a5ad7
7 years ago
Szymon Bańka
a0dd1737af
Fix SystemDiskInodesTooLow alert
Change-Id: I715f78983c69084c81d4efd4a5625d5dfe0f276f
7 years ago
Ramon Melero
14ef04f504
Adds alert to warn for open files being depleted
Change-Id: I87d132ce6473715b0992e561b2855456f24bcb3b
7 years ago
Dmitry Kalashnik
2dd3b450d5
Raise severity for System(Tx,Rx)PacketsDroppedTooHigh
Raise severity from warning to critical
Partial-Bug: PROD-15203
Change-Id: I32f19b5520bc200d61280da57f4ab5842b060454
7 years ago
Bartosz Kupidura
652ed7ced6
Remove SwapUsed alert
Change-Id: I67531b6ad15a2e96ee05178f17aae2504b3362bf
7 years ago
Serhiy Ovsianikov
67bd56a83c
Add atop
Change-Id: I59297736406469e5314236cb40851d9a6f94386e
7 years ago
Simon Pasquier
0ab8d27812
Update Telegraf config to ignore aufs partitions
Change-Id: I94f09359f976ccd0f207277da52d20e659b36a69
7 years ago
Simon Pasquier
b9d6e99ca1
Add alerts on disk full
The other disk alerts use predict_linear() to trigger before a disk gets
full but they don't trigger when the disk is effectively (or nearly)
full.
Change-Id: I8e6794d35bf96378ca3e3d527db4315d2b3a868d
7 years ago
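The gap described in the commit above can be illustrated with a simplified model of linear prediction. This is a sketch only: `predict_linear_sketch` is a hypothetical helper that does a plain least-squares fit and extrapolates from the last sample, which approximates but does not reproduce Prometheus's `predict_linear()`. The sample values and the 5% threshold are also assumptions.

```python
def predict_linear_sketch(samples, horizon):
    """Least-squares fit over (timestamp, value) samples, extrapolated
    `horizon` seconds past the last sample."""
    n = float(len(samples))
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = cov / var
    intercept = mean_v - slope * mean_t
    return slope * (samples[-1][0] + horizon) + intercept


# Free space in percent. One disk is filling up, the other is already
# nearly full but no longer changing.
filling = [(0, 50.0), (60, 40.0), (120, 30.0)]
nearly_full = [(0, 1.0), (60, 1.0), (120, 1.0)]

# Predictive alert: fires if free space is forecast to drop below zero.
filling_predicted = predict_linear_sketch(filling, 3600)
nearly_full_predicted = predict_linear_sketch(nearly_full, 3600)

# Threshold alert: fires if free space is already below 5 percent (assumed).
nearly_full_threshold_fires = nearly_full[-1][1] < 5.0
```

In this model the nearly full disk has a flat trend, so the forecast never crosses zero and only the threshold check catches it.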
Ales Komarek
7a7ddfbf8f
Fixed the dns records grain
Change-Id: I574c6e1a31f71502eb279cdc3c5768ee483d73fa
7 years ago
Ales Komarek
417e8c5cdb
Allow mining for the dns records for local hosts records
Change-Id: I8f2a66c6edafc425794d7cedc8b9217df7ee5951
7 years ago
Jaymes Mosher
a2c295dc68
Add bond member status monitoring.
Pillar values:
linux.monitoring.bond_status.interfaces = [ 'bond0', 'all', 'etc' ]
Leave bond_status.interfaces undefined to disable (default).
Depends-On: Ia07d4c473bf64d98170f51599caaedb46645ede3
Change-Id: I62a7d59251d37cb6c7fc7b761f63a5599930f1dc
7 years ago
Simon Pasquier
05a8fd2bb1
Don't collect metrics from overlay filesystems
This is typically used to mount Docker containers but it generates too
many volatile metrics which aren't useful.
Change-Id: I00117895570515b2c8f9690542e83061309464c3
7 years ago
Simon Pasquier
1483c5b3d3
Add a critical alert on low memory
Change-Id: I1c8e752de9ad3479da830706ae736df6846b977f
7 years ago
Simon Pasquier
c462fdfe27
Fix typos in linux/meta/prometheus.yml
Change-Id: Ia7df4918732ce8fcf28b1d6eed629073146a567c
7 years ago
Bartosz Kupidura
3d2af0c43f
Don't collect metrics from 'virtual' filesystems
Change-Id: I456ed02ad54a9b55486b4c4a61c9cebfb8f28613
7 years ago
Bartosz Kupidura
d2c6bc323a
Disable unused metrics exposed per CPU
Change-Id: Ie3f9da382c23148836e4a20ff0f37c3929e062cf
7 years ago
Simon Pasquier
db768fb47c
Fix Prometheus alerts on dropped packets
Since metrics on dropped packets are counters, the alerts should use
the rate() function. This change also fixes some inconsistencies in the
alert descriptions.
Change-Id: I9abbc0a49f45ba760836c436a3e7e65aa62f652e
7 years ago
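The reasoning in the commit above, that counters need a rate rather than a raw-value comparison, can be sketched as follows. `simple_rate` is a hypothetical helper that ignores counter resets, which the real `rate()` function handles. The sample values are invented.

```python
def simple_rate(samples):
    """Per-second increase between the first and last (timestamp, value)
    samples of a monotonically increasing counter. Ignores counter resets."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / float(t1 - t0)


# Both interfaces have a large cumulative drop count, so comparing the raw
# counter to a threshold would flag both forever.
quiet = [(0, 5000), (60, 5000), (120, 5000)]      # no new drops
dropping = [(0, 5000), (60, 5600), (120, 6200)]   # still dropping packets

quiet_rate = simple_rate(quiet)
dropping_rate = simple_rate(dropping)
```

Only the second interface shows a non-zero per-second rate, and that rate is what an alert on dropped packets should track.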
Simon Pasquier
c7b79ad6b4
Rename Prometheus alerts for consistency
Change-Id: I1cc00b41a6a1774d1401a9f71ab4c6364c65d139
7 years ago
Olivier Bourdon
0723131ffd
Fix linux/meta/prometheus.yml for the CI
Change-Id: Idc73c152a0e71d5ac2a8c10f46c955755d8e77ae
7 years ago
Jaymes Mosher
aa2a52cf9b
Scratch using interfaces_override
7 years ago
Jaymes Mosher
603e62ab9e
Keep regex as default but still allow overrides.
7 years ago
Simon Pasquier
9083abf8a3
Add monitoring of the swap usage
This change adds the Telegraf configuration to collect swap metrics, the
associated Prometheus alarms and graphs to the Grafana dashboard.
Change-Id: I3595fd0b8cab06215c620642da69dd29c398396a
7 years ago
Jaymes Mosher
cf6dbf1d6a
Use Pillar to choose which interfaces to monitor.
The `linux_netlink.ls` function used a regex to choose which interfaces
to collect metrics for.
`_alphanum_re = re.compile(r'^[a-z0-9]+$')`
Unfortunately, by default this excludes vlan and tap interfaces, which
are kind of important, e.g. `bond0.120` or `tap2a3dab86-fb`.
We also have a problem where, even if we update the regex to include
these interfaces, the tap device name changes on the compute host when
someone deletes and spawns a new instance, and the new device will not be
monitored unless someone re-runs `collectd` on the compute host again.
Less than ideal.
This commit lets us choose `VerboseInterface "all"` using Pillar data
to avoid this problem.
7 years ago
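The exclusion described in the commit above is easy to check with the regex it quotes. The two excluded names come from the commit message. `eth0` is an assumed extra name added for contrast.

```python
import re

# Regex quoted in the commit message.
_alphanum_re = re.compile(r'^[a-z0-9]+$')

# 'eth0' is an assumed example; the other two names are from the commit.
names = ['eth0', 'bond0.120', 'tap2a3dab86-fb']

matched = [name for name in names if _alphanum_re.match(name)]
skipped = [name for name in names if not _alphanum_re.match(name)]
```

The `.` in the vlan name and the `-` in the tap name fall outside `[a-z0-9]`, so both interfaces are skipped.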
Simon Pasquier
4d290b5eec
Add Prometheus alerts for dropped packets
Change-Id: If50f18367b22338b3fba1ff15902d557a0bdf2ea
7 years ago
Simon Pasquier
d32688e7aa
Reword Prometheus alert messages
Change-Id: I54e02e0741d53ec7b2335145dc968b7b8c8f5e00
7 years ago
Ales Komarek
02f35a537c
Graph metadata
Change-Id: If0ee6f1ac5ab697559fcd853225e1520de2e8c1c
7 years ago
Simon Pasquier
234e14acda
Add Grafana dashboard for Prometheus datasource
Change-Id: Icacb0ca22a34f1ff438a895700040563d250bac9
7 years ago
Simon Pasquier
b1813426dc
Enable kernel, net and process metrics for Telegraf
Change-Id: I008818853c2058746be08365283b949177efa254
Depends-On: I3c3c569a013aff8c3ab8e46cffb93a60d74ddf09
7 years ago
Swann Croiset
d66a782570
Enable diskio input telegraf plugin
Change-Id: I80193afad1842f67967d1bab164f049078e3cd75
7 years ago
Erick Cantwell
e5770ac50f
[MMO-132] Check the length of the dict instead of whether it is defined
(it will always be defined, since the default is an empty dict)
7 years ago
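A rough Python analogue of the check described in the commit above. The original change is likely in a template rather than in Python, and the function and argument names here are hypothetical.

```python
def check_interfaces(interfaces=None):
    """Return (is_defined, is_non_empty) for a setting defaulting to {}."""
    # Mimic a lookup whose default is an empty dict.
    interfaces = {} if interfaces is None else interfaces
    is_defined = interfaces is not None   # always True after the default
    is_non_empty = len(interfaces) > 0    # the check that distinguishes cases
    return is_defined, is_non_empty


unset = check_interfaces()
configured = check_interfaces({"bond0": {}})
```

The "defined" test passes in both cases, so only the length check tells an unset value apart from a configured one.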
Filip Pytloun
ea11327afe
Fix grains generation when linux_netlink.ls is not available
Change-Id: Id4b0b405872457bd8b20f450e4031d6808d3cf59
7 years ago
Filip Pytloun
e70606d0d2
Manage grains using support metadata
Change-Id: I25fb0eb0d4b922b8853eceb0c1c220a4040e1704
7 years ago
Bartosz Kupidura
d8b54c95da
Add variables in prometheus alerts
Change-Id: I1765fc6aa4a8c3da25330f19bb043ddbf548b9ad
7 years ago