= Prometheus Alertmanager =
== Goal ==
Replace Icinga with Prometheus. This article describes which features the Alertmanager needs in order to replace Icinga.
Information about Prometheus can be found [[./Prometheus|here]].


== Alerts ==
The tables below summarize the alerts of the legacy system and show how each one can be carried over to the Alertmanager.
From 20_base2.cfg
||  Title  || Command || Implementation || Notes ||
|| (1) check_crond  || /usr/lib64/nagios/plugins/check_procs -v -w 1: -c 1: -C crond || script_exporter.pl/check_systemd_process() || (./) ||
|| (2) check_rsyslogd  || /usr/lib64/nagios/plugins/check_procs -v -w 1: -c 1: -C rsyslogd || script_exporter.pl/check_process() || (./) ||
|| (3) check_zombie_procs  || /usr/lib64/nagios/plugins/check_procs -w 5 -c 10 -s ZN || script_exporter.pl/check_zombie_process() || (./) [[https://www.unix.com/unix-for-dummies-questions-and-answers/100737-how-do-you-create-zombie-process.html|Script]] for creating a zombie process  ||
|| (4) check_swap  || /usr/lib64/nagios/plugins/check_swap -w 50% -c 20% || node_vmstat_kswapd_* node_vmstat_pgscan_kswapd_* node_memory_Swap* || (./) ||
|| (5) check_load || /usr/lib64/nagios/plugins/check_load -w 15,10,5 -c 30,25,20 || node_load* || (./) ||
|| (6) check_total_procs || /usr/lib64/nagios/plugins/check_procs -k -w 700 -c 800 || script_exporter.pl/check_sum_process()  || (./) ||
|| (7) check_disks || /usr/lib/nagios/plugins/check_disks -w 10% -c 5% || node_filesystem_free or node_filesystem_avail (non-root) ||  (./) ||
|| (8) check_mem || /usr/lib/nagios/plugins/check_mem -w 80% -c 95%  || node_memory_*  || (./) ||
|| (9) check_memcache || /usr/lib/nagios/plugins/check_memcache -w 30 -c 25 || node_memory_* || (./) ||
|| (10) check_ramspeed  || /usr/nagios/libexec/check_ramspeed -w 5 -c 2 || script_exporter.pl/memory_speed() || (./) ||
|| (11) check_crl || /usr/lib/nagios/plugins/check_crl -w 2 -c 4 || gridsecurity_cert_exporter.pl/check_crl() || (./) ||
|| (12) check_cvmfs || /usr/lib/nagios/plugins/check_cvmfs || cvmfs_exporter.pl || Note: check_cvmfs_repo.sh from CERN must be rolled out as well. Tested on WGS02 and WGS15. ||
|| (13) check_mounts || /usr/lib/nagios/plugins/check_mounts || afs_exporter.pl || (./) ||
|| (14) check_ipmisel || /usr/lib/nagios/plugins/check_ipmisel ||  || TODO: Can this be implemented with the IPMI exporter? Can we use https://github.com/lovoo/ipmi_exporter ? Test host: arktos. The Perl check is 737 lines long. ||
|| (15) check_bonding || /usr/lib/nagios/plugins/check_bonding ||  || (./) ||
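
Several of the check_procs commands above use Nagios-style threshold ranges such as `1:` (alert below 1) or `5` (alert above 5). When these are carried over into alert conditions it helps to be explicit about what each range means. The following is only a minimal sketch of the standard range notation; the function names are made up for illustration, and plugin-specific formats such as `50%` (check_swap) or `15,10,5` (check_load) are not covered.

{{{#!python
def parse_range(spec):
    """Parse a Nagios-style threshold range into (low, high, inside)."""
    inside = spec.startswith("@")        # "@" inverts the range: alert when inside
    if inside:
        spec = spec[1:]
    if ":" in spec:
        low_text, high_text = spec.split(":", 1)
    else:
        low_text, high_text = "0", spec  # a bare "N" means the range 0..N
    low = float("-inf") if low_text == "~" else float(low_text or 0)
    high = float("inf") if high_text == "" else float(high_text)
    return low, high, inside


def alerts(spec, value):
    """Return True if value should raise an alert for the given range."""
    low, high, inside = parse_range(spec)
    in_range = low <= value <= high
    return in_range if inside else not in_range


# check_crond uses "-c 1:": critical as soon as no crond process is found.
print(alerts("1:", 0))
# check_zombie_procs uses "-w 5": no warning while there are at most 5 zombies.
print(alerts("5", 5))
}}}

In a Prometheus rule the same conditions would become simple comparisons on the exported value, for example "less than 1" for `1:` and "greater than 5" for `5`.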

From 50_dell-openmanage.cfg
||  Title  || Command || Implementation || Notes ||
|| (16) check_openmanage  || sudo /usr/nagios/libexec/check_openmanage -f /usr/nagios/etc/check_openmanage.conf ||  || Test host: pear20. Can we use https://github.com/galexrt/dellhw_exporter ? If we have IPMI, do we still need amsa? The Perl check is 5486 lines long :-( ||

From 50_gridcert.cfg
||  Title  || Command || Implementation || Notes ||
|| (17) check_gridcert  || /usr/nagios/libexec/check_sslcert -f /etc/grid-security/hostcert.pem -p /etc/grid-security/certificates || gridsecurity_cert_exporter.pl || (./) Tested on arktos. Defaults: warning 30 days before expiry, error 14 days before expiry. ||


From 60_tomcat.cfg (found on arktos)
||  Title  || Command || Implementation || Notes ||
|| (18) check_http   || /usr/nagios/libexec/check_http -H localhost -p 8080 ||  || TODO: should be implemented with https://github.com/prometheus/blackbox_exporter ||

From 80_server.cfg (found on arktos)
|| Title  || Command || Implementation || Notes ||
|| (19) check_users || /usr/nagios/libexec/check_users -w 5 -c 10 ||  || disabled  ||
|| (20) check_load || /usr/nagios/libexec/check_load -w 15,10,5 -c 30,25,20 ||  || same as (5) ||
|| (21) check_total_procs || /usr/nagios/libexec/check_procs -w 600 -c 800  ||  || TODO: extend (2) so that a list of processes can be configured ||
|| (22) check_tomcat || /usr/nagios/libexec/check_http -H localhost -p 8080 ||  || TODO: should be implemented with https://github.com/prometheus/blackbox_exporter ||
|| (23) check_bonding || /usr/nagios/libexec/check_bonding ||  || see (15) ||
|| (24) check_sslcert || /usr/nagios/libexec/check_sslcert -f /etc/pki/tls/hostcert.pem -p /etc/pki/tls/certs ||  || see (17) ||
|| (25) check_tsm || /usr/nagios/libexec/check_procs -w 3  -c 1: -C dsmcad ||  || TODO: extend (2) so that a list of processes can be configured ||

From 85_2000procs.cfg (found on wgs15)
|| Title  || Command || Implementation || Notes ||
|| check_total_procs || /usr/nagios/libexec/check_procs -w 1500 -c 2000 ||  || TODO: extend (2) so that a list of processes can be configured ||


From 90_tsm.cfg (found on arktos)
|| Title  || Command || Implementation || Notes ||
|| check_tsm || /usr/nagios/libexec/check_procs -w 3  -c 1: -C dsmcad ||  || TODO: extend (2) so that a list of processes can be configured ||

Checks that are executed directly on Icinga:
/usr1/icinga/etc/shared/checkcommands.cfg

/!\ TODO 

== Mapping Nagios to Prometheus configuration ==
The script nagios_announce_services determines which checks are to be configured for Nagios.
For Prometheus, what matters is which exporter is registered.

The table breaks down the relationship.

|| Nagios check || Prometheus exporter ||
|| Crond  || script ||
|| Mounts || afs ||
|| Disk || node ||
|| RSyslogd || node ||
|| MemCache || node ||
|| SSH ||  ||
|| Load || node ||
|| Procs || script ||
|| Zombies || script ||
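
The mapping in the table can also be expressed as a small lookup, for example to derive which exporter modules a host needs from its list of Nagios checks. This is only an illustrative sketch: the names `CHECK_TO_EXPORTER` and `exporters_for` are invented here and are not part of nagios_announce_services.

{{{#!python
# Mapping copied from the table above; None marks a check without an exporter yet.
CHECK_TO_EXPORTER = {
    "Crond": "script",
    "Mounts": "afs",
    "Disk": "node",
    "RSyslogd": "node",
    "MemCache": "node",
    "SSH": None,
    "Load": "node",
    "Procs": "script",
    "Zombies": "script",
}


def exporters_for(checks):
    """Return the sorted exporter modules needed for the given Nagios checks."""
    needed = set()
    for check in checks:
        exporter = CHECK_TO_EXPORTER.get(check)
        if exporter is not None:      # unknown or unmapped checks are skipped
            needed.add(exporter)
    return sorted(needed)


print(exporters_for(["Crond", "Disk", "SSH"]))
}}}

Checks that have no exporter yet (such as SSH) are silently skipped in this sketch; a real script would probably want to report them.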

== Open questions ==
Regarding check_ramspeed (10)
 * Where does the legacy system define which speed each host must deliver before an alert is raised?
 * How often is it called?
 * Why is [vulcan01] /etc/prometheus/ssl/client.crl empty?
 * What is the best way to export errors with the exporter? Variant 1: if the metric only has positive values, report -1. Variant 2: use labels (tags). See https://prometheus.io/docs/instrumenting/exposition_formats/#basic-info for an example (error tag).
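
To make the two variants from the last question concrete, the sketch below renders both in the Prometheus text exposition format. The function names and the metric name used in the example call are hypothetical and only serve as illustration; they are not taken from script_exporter.pl.

{{{#!python
def render_sentinel(name, value, error=None):
    """Variant 1: report -1 when the check failed (the metric is never negative otherwise)."""
    if error is not None:
        value = -1
    return "%s %s\n" % (name, value)


def render_labelled(name, value, error=None):
    """Variant 2: on failure, emit a separate sample that carries the error as a label."""
    if error is None:
        return "%s %s\n" % (name, value)
    # Label values must escape backslash, double quote and line feed.
    escaped = error.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return '%s_error{error="%s"} 1\n' % (name, escaped)


# Hypothetical metric name, used for illustration only.
print(render_sentinel("memory_speed", 42, error="dd failed"), end="")
print(render_labelled("memory_speed", 42, error="dd failed"), end="")
}}}

Variant 1 keeps a single time series but forces every alert rule and graph to filter out the -1. Variant 2 keeps the value series clean, but every distinct error text creates a new time series, so the set of possible messages should stay small.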

== To-do list ==
Open
 * Discuss the requested changes to the Git project (inventory as INI and a playbook without includes) with Timm
 * Clarify with Fabian which alert is sent when.
 * Open the firewall for the Alertmanager GUI

Done
 * Move the project to Git: https://stash.desy.de/projects/ZNDV/repos/ansible-prometheus/browse
 * Set up a test host that exports the new metrics (flaco-vm10)

== Problems ==
 * Moving the configuration out to /etc/prometheus/config does not work. I cannot find anything about this option in the documentation of the [[https://github.com/QubitProducts/exporter_exporter|exporter_exporter]] either.

== Performance test ==
ab (ApacheBench) was used to test the exporters.

=== script_exporter.pl via exporter_exporter ===
Performance test:
{{{
ab -n 100 -c 4 "https://flaco-vm10:9998/proxy?module=script"
}}}

{{{
Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        4    8   6.3      6      52
Processing:  1699 1753  27.2   1748    1824
Waiting:     1699 1753  27.2   1748    1824
Total:       1705 1761  28.4   1758    1832
}}}

top shows that the ramspeed check (which uses dd) consumes the most resources.
Same test without dd:
{{{
Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        4    8   5.0      6      25
Processing:    97  177  16.1    174     231
Waiting:       97  177  16.1    174     231
Total:        101  185  18.2    182     246
}}}
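
As a rough worked example, the cost of the dd-based ramspeed check can be estimated from the mean "Total" times of the two runs above. This assumes that the runs are otherwise comparable, which was not verified separately.

{{{#!python
mean_total_with_dd_ms = 1761      # "Total" mean of the first run
mean_total_without_dd_ms = 185    # "Total" mean of the run without dd

dd_cost_ms = mean_total_with_dd_ms - mean_total_without_dd_ms
dd_share = dd_cost_ms / mean_total_with_dd_ms

print("dd adds about %d ms per request (%.0f%% of the total)" % (dd_cost_ms, dd_share * 100))
}}}

In other words, nearly nine tenths of the response time of the script module goes into the ramspeed measurement.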

Memory leak test:

{{{
ab -n 5000000 -c 2 "https://flaco-vm10:9998/proxy?module=script"
}}}
Aborted after 17 hours:
{{{
Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        4    6   1.8      6      48
Processing:   714  887  39.0    880    1700
Waiting:      714  887  39.0    880    1700
Total:        722  893  39.2    886    1707
}}}

== Memory leak test ==
The following command produces a list of the processes and the memory they use:
{{{
ps -eo size,pid,user,command --sort -size | awk '{ hr=$1/1024 ; printf("%13.2f Mb ",hr) } { for ( x=4 ; x<=NF ; x++ ) { printf("%s ",$x) } print "" }' |cut -d "" -f2 | cut -d "-" -f1
}}}

Output after one night of continuous load:
{{{
       439.63 Mb /usr/lib/polkit
       305.84 Mb /usr/bin/node_exporter 
       213.29 Mb /usr/sbin/rsyslogd 
       146.49 Mb /usr/lib/systemd/systemd 
        72.36 Mb /usr/sbin/chronyd 
        10.91 Mb /usr/bin/exporter_exporter 
         8.55 Mb /usr/bin/dbus
         1.84 Mb /usr/lib/systemd/systemd
         1.45 Mb /usr/lib/systemd/systemd
         1.28 Mb /usr/sbin/crond 
}}}

<!> Why does the node_exporter use so much memory? After a restart it was only 104.80 Mb in size.
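
One way to follow this up would be to track the exporter's memory through its own metrics rather than through ps. Exporters built on the Prometheus Go client normally expose `process_resident_memory_bytes`; the sketch below extracts that value from exposition text. The sample text is made up for illustration, and the ps `size` column is not the same as resident memory, so the numbers will not match the list above exactly.

{{{#!python
def resident_memory_mb(metrics_text):
    """Return process_resident_memory_bytes from exposition text in MiB, or None."""
    for line in metrics_text.splitlines():
        if line.startswith("#"):
            continue                  # skip HELP/TYPE comment lines
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "process_resident_memory_bytes":
            return float(parts[1]) / (1024 * 1024)
    return None


# Made-up sample in the text exposition format, for illustration only.
sample = """# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 1.09890765e+08
"""
print(round(resident_memory_mb(sample), 2))
}}}

Since Prometheus scrapes this metric anyway, graphing it over the duration of a load test would show whether the memory grows steadily or levels off.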

== Tips ==
If you have written an exporter yourself and want to test its output, promtool can be used for that.

Example call for a script via the exporter_exporter:
{{{
# Newer promtool releases (Prometheus 2.x) spell the subcommand "promtool check metrics".
curl -s "http://127.0.0.1:9997/proxy?module=script" | promtool check-metrics
}}}

== Further links ==
 * [[https://github.com/QubitProducts/exporter_exporter|exporter_exporter]]