Differences between revisions 20 and 21

Design decisions

announce_services

Queries for RPM packages. The initial idea was a construct like {{{ map { chomp; $installed{$_} = 1 } `$rpm -qa`; }}} which would have allowed a fast and elegant check like use_service($service_name) if $installed{$package};. This would, however, require exact package names including version numbers in announce_nagios (like firefox-1.0.8-1.4.1.SL3.1.i386).

So I chose a different way of retrieving the installed packages: {{{ my $installed = 'XxX'.`$rpm -qa`; $installed =~ s/\n/XxX/g; }}} to allow a more flexible, yet slower query like use_service($service_name) if $installed =~ /XxX$package/;
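The trade-off between the two lookups can be sketched as follows. This is a minimal illustration in Python, not the Perl used in the actual script; the function names and the second package entry are made up for the example, and the package name is matched literally rather than as a regular expression.

```python
MARKER = "XxX"  # same separator idea as in the Perl snippet


def build_exact_index(rpm_lines):
    """First idea: a set of full package names (name-version-release.arch)."""
    return {line.strip() for line in rpm_lines if line.strip()}


def build_marker_string(rpm_lines, marker=MARKER):
    """Chosen approach: one long string with a marker in front of each package."""
    return "".join(marker + line.strip() for line in rpm_lines if line.strip())


def is_installed(marker_string, package, marker=MARKER):
    """True if some installed package name starts with `package`."""
    return (marker + package) in marker_string


# Sample `rpm -qa` output; the second entry is invented for illustration.
rpm_output = [
    "firefox-1.0.8-1.4.1.SL3.1.i386\n",
    "example-tool-2.0-1.i386\n",
]

exact = build_exact_index(rpm_output)
joined = build_marker_string(rpm_output)

print("firefox" in exact)               # exact lookup needs the full name
print(is_installed(joined, "firefox"))  # marker lookup accepts the bare name
```

Note that the marker lookup is a prefix match: a short name also matches every installed package whose name merely begins with it.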

update_nagios

The subs that make changes to cfg-files may appear quite confusing. They were designed after realizing that, originally, each file was looped over in a very similar fashion (i.e., heavy code redundancy in the script). The applied model closely resembles functional programming: the filter sub does little more than concatenate sets of input lines into "sections" (which are described by a parameter) and perform a specific action on each such section. The action is chosen by passing the desired sub reference to filter. A swarm of subs defines how all kinds of sections from cfg-files are to be treated. This solution does not improve readability (on the contrary, actually), but it should make changes and additions easy.
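The general shape of such a filter can be sketched like this. It is a hypothetical Python rendering, not the actual Perl sub: the names filter_sections and drop_host, and the choice of describing a section by a start and an end pattern, are assumptions made for the example. The sample input uses the "define host { ... }" block syntax of Nagios object files.

```python
import re


def filter_sections(lines, start_pattern, end_pattern, action):
    """Collect lines into sections and replace each section by action(section).

    Lines outside any section are passed through unchanged.
    """
    start = re.compile(start_pattern)
    end = re.compile(end_pattern)
    output = []
    section = None  # lines of the section currently being collected
    for line in lines:
        if section is None:
            if start.search(line):
                section = [line]
            else:
                output.append(line)
        else:
            section.append(line)
            if end.search(line):
                output.extend(action(section))
                section = None
    if section is not None:
        # Unterminated section at end of file: keep it untouched.
        output.extend(section)
    return output


def drop_host(name):
    """Build an action that removes the host section with the given host_name."""
    def action(section):
        if any(line.split() == ["host_name", name] for line in section):
            return []
        return section
    return action


cfg = [
    "# hosts\n",
    "define host {\n",
    "    host_name pi\n",
    "}\n",
    "define host {\n",
    "    host_name pica\n",
    "}\n",
]

result = filter_sections(cfg, r"^\s*define\s+host\s*\{", r"^\s*\}", drop_host("pica"))
print("".join(result), end="")
```

Adding a new kind of change then only means writing another small action and passing it in; the looping code stays in one place.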


SNMP trapping

v20z

The Sun v20z keeps sending mystery traps. All SP-EVENT traps are supposed to be located under .1.3.6.1.4.1.9237.2.1.1.6 (SP-MasterAgent-MIB::spEvent), but we receive traps with the all-too-short OID .1.3.6.1.4.1.9237 (SP-MasterAgent-MIB::newisys), which is the beginning of Sun's (?) enterprise tree. The cause might be a bug in trapd2. TODO:

1. check back with nino and upgrade the original trapd
2. try to debug our solution
3. apply a workaround: appending .2.1.1.6 to the OIDs may well make the traps more meaningful. Still, no variables were ever received along with such a trap, and nothing appeared in any logs on the source machines, so the traps may remain inconclusive

Conclusion: Some debugging suggested that the traps in question are not truncated or misinterpreted on our side, but are indeed sent malformed and inconclusive. Two flavors of too-short OID have been seen so far:

.1.3.6.1.4.1.9237
.1.3.6.1.4.1.9237.2.1

A workaround similar to the one described in the third TODO item above is still possible, but chances are the traps won't tell us much.
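Recognizing such too-short OIDs comes down to a prefix comparison against the spEvent subtree. A minimal Python sketch (the function names and the return labels are made up for illustration; the OIDs are the ones quoted above):

```python
# SP-MasterAgent-MIB::spEvent, as quoted above
SP_EVENT = (1, 3, 6, 1, 4, 1, 9237, 2, 1, 1, 6)


def parse_oid(text):
    """Turn '.1.3.6.1' into the tuple (1, 3, 6, 1)."""
    return tuple(int(part) for part in text.strip(".").split("."))


def classify(oid_text, subtree=SP_EVENT):
    """Return 'inside', 'truncated' or 'unrelated' relative to the subtree."""
    oid = parse_oid(oid_text)
    if oid[:len(subtree)] == subtree:
        return "inside"      # at or below spEvent, as expected
    if subtree[:len(oid)] == oid:
        return "truncated"   # a proper prefix of spEvent
    return "unrelated"


for seen in (".1.3.6.1.4.1.9237", ".1.3.6.1.4.1.9237.2.1"):
    print(seen, classify(seen))
```

Both OIDs seen so far are proper prefixes of the spEvent subtree, which is what makes them look truncated rather than simply foreign.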

v65x

We are in possession of MIBs that at least seem to contain usable TRAP OIDs. However, we haven't found a way to enable SNMP-traps yet (i.e., we need to tell the machines where to direct the traps).

[update 2006-08-04]

Andreas found a way to feed the trap-sink to the v65x-es. We have had no opportunity to test it so far. Firewall reconfiguration will probably be necessary. The MIBs could not be verified yet either.

X4100

Sun X4100 (galaxy) machines are rather mysterious. Their traps come from the OID subtree .1.3.6.1.4.1.3183. Few MIBs covering it can be found using Google (and none on the Sun site, AFAIK), and those we found were from Dell and Intel, respectively. The defined traps do seem to follow a certain standard, however, one backed by Dell and Intel ([http://www.dell.com/content/topics/global.aspx/power/en/ps1q03_intel], PDF: [http://www.dell.com/downloads/global/vectors/2002_asf.pdf]). Wolfgang also found [ftp://ftp.us.dell.com/sysman] with ALL of Dell's published material.

It appears that Sun went so far as to use traps inside that "Wired for Management" OID tree, but the traps themselves seem to be Sun's own invention. We tried provoking a galaxy machine into sending a "Power redundancy degraded" trap (which is in the DELL-ASF-MIB we found online). A trap was generated, but not only did it lack the expected OID, it was not anywhere in the ASF MIB at all.

We have not yet found a source of information about the true OIDs of the PETs (Platform Event Traps) from Sun machines. IBM also hosts information on PETs: [http://publib.boulder.ibm.com/infocenter/eserver/v1r2/index.jsp?topic=/diricinfo/fqm0_r_events_pet.html].

[update 2006-07-07]

Apparently, certain voltage warnings are processed OK, while their respective recovery traps are not recognized (so snmptt discards them and does not inform nagios). We also noted other events that were never recognized in the first place and not reported to nagios at all.

[update 2006-08-07]

We have always been in possession of the "SUN PLATFORM MIB", which was again pointed out to us by our MCS contact. It is known to be unrelated to the hardware traps. We thoroughly searched another X4100 CD (N1 System Manager) and found a set of MIBs in a package meant for Solaris on x86 (not in the Linux or Sparc version of the same package). Most of those MIBs covered numerous kinds of management and other information. Nothing on the received hardware traps could be found; the CD and all MIBs on it proved just as useless as the other one.

[update 2006-09-07]

MCS suggested downloading the image of a new X4100 support CD. We found nothing but the SUN-PLATFORM-MIB, which isn't any more helpful than it was half a year ago. Waltraut sent another request, including more details on our problem with the missing OIDs.

[update 2006-10-20]

Waltraut sent another mail to MCS, requesting translation of the 10 unknown OIDs we've seen so far. As of Nov 2nd, there has been no further followup.

[update 2007-01-02]

In December, WolfgangFriebel did some in-depth research and discovered the pattern that quite accurately defines how PET OIDs are constructed (basically, some bits control whether the PET is a problem/solution notification and the others identify the component in question). He also managed to find the latest ASF MIB (available from Dell), which is supposed to define all PETs in existence. A number of traps generated by the X4100 are not to be found in this MIB.

I agree with Wolfgang's conclusion that SUN may have silently extended the ASF MIB for their own purposes. For one reason or another, they also seem to keep these changes from the public. Thus, a search for a MIB that translates the so far unknown PETs may remain unsuccessful for the time being.

[conclusion 2007-07-31]

The case seems to be solved. We stumbled upon the most unlikely download option on the SUN website. The SUN-ILOM-PET-EVENTS MIB appears to hold definitions for all traps that can be expected from the X4100 and other architectures.

There are collisions with Dell's ASF MIB, of course. There appear to be no semantic differences among the colliding entries, though: the same OID denotes the same problem. Some severities are lower in the SUN MIB, however. Here is the new policy:

only one mib entry is used for any incoming trap
the globally ignored OIDs have highest priority for snmptt (see the wiki pages Nagios_Administration/Nagios1 and Nagios_Administration/Nagios2)
the (new) X4100 PET entries have lowest priority (double definitions are overruled by Dell ASF)

This way, we should always see the most accurate result if any, and nothing if the trap is being ignored.
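The priority order of this policy can be sketched as a simple lookup chain. This is an illustrative Python model only, not how snmptt is actually configured; the function name, the table layout and all OIDs and descriptions in the sample tables are hypothetical.

```python
def resolve_trap(oid, ignored, dell_asf, sun_pet):
    """Pick the single MIB entry used for an incoming trap.

    Priority: global ignore list, then Dell ASF, then the Sun PET entries.
    Returns (source, entry), or None if the trap is ignored or unknown.
    """
    if oid in ignored:
        return None
    for source, table in (("dell-asf", dell_asf), ("sun-pet", sun_pet)):
        if oid in table:
            return source, table[oid]
    return None


# Hypothetical tables; OIDs and texts are invented for the example.
ignored = {".1.3.6.1.4.1.3183.1.1.0.1"}
dell_asf = {
    ".1.3.6.1.4.1.3183.1.1.0.1": "ignored anyway",
    ".1.3.6.1.4.1.3183.1.1.0.2": "defined by Dell ASF",
}
sun_pet = {
    ".1.3.6.1.4.1.3183.1.1.0.2": "double definition, overruled",
    ".1.3.6.1.4.1.3183.1.1.0.3": "defined by Sun only",
}

for oid in sorted(sun_pet):
    print(oid, resolve_trap(oid, ignored, dell_asf, sun_pet))
```

Because the first matching table wins, a double definition never produces two results for the same trap.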

HIROSS

The MIB that was provided by the manufacturer declared 2 Traps:

EVENT lgpTrapConditionEntryAdded .1.3.6.1.4.1.476.1.42.3.3.1
This trap is sent each time a condition is inserted into the conditions table.
EVENT lgpTrapConditionEntryRemoved .1.3.6.1.4.1.476.1.42.3.3.2
This trap is sent each time a condition is removed from the conditions table.

On the net ([http://support.ipmonitor.com/mibs_byoidtree.aspx?oid=1.3.6.1.4.1.476.1.42/]) we found a different MIB version that declared a whole branch of traps:
```
EVENT lgpEventConditionEntryAdded .1.3.6.1.4.1.476.1.42.3.3.0.1
EVENT lgpEventConditionEntryRemoved .1.3.6.1.4.1.476.1.42.3.3.0.2
EVENT lgpEventLowBatteryWarning .1.3.6.1.4.1.476.1.42.3.3.0.3
... 
```
(Note the difference from the traps above, too: the names use lgpEvent instead of lgpTrap, and the OIDs carry an additional .0 sub-identifier.)
The manual of the HiSNMP module specified numerous traps, too:
```
1.3.6.1.4.1.476.1.42.2.3.0.1.0.34 NetzwerkFehler (network error)
1.3.6.1.4.1.476.1.42.2.3.0.1.0.70 Kein Anschluss an Geraet 1 (no connection to device 1)
1.3.6.1.4.1.476.1.42.2.3.0.1.0.77 Netzwerk-Fehler (network error)
1.3.6.1.4.1.476.1.42.3.2.1.1.0.18 Raum-Temperatur zu hoch (room temperature too high)
...
1.3.6.1.4.1.476.1.42.3.2.1.14.0.10 Wasserleckstelle (water leak)
...
1.3.6.1.4.1.476.1.42.3.2.1.21.4.0.31 Stoerung des Raumsensors (room sensor fault)
```
Those are not specified as traps in any MIB we could find. They are, however, INTEGER values that can be read using snmpget. The manual specifies a triple of "HM-Werte" (HM values) for each OID, indicating "Alarm (Warnung) aktiv / Alarm (Warnung) lernen / Kein Alarm (Warnung)", i.e. alarm (warning) active / alarm (warning) learned / no alarm (warning), the last value always being 0.

Initial test results

Upon manipulating the warning thresholds on the devices, a couple of traps were generated by each:

hiross1:

.1.3.6.1.4.1.476.1.42.3.2.1.1.0.18
.1.3.6.1.4.1.476.1.42.3.2.1.0.0.0
.1.3.6.1.4.1.476.1.42.3.2.1.0.0.0
.1.3.6.1.4.1.476.1.42.3.2.1.1.0.18
.1.3.6.1.4.1.476.1.42.3.2.1.0.0.0

hiross2:

.1.3.6.1.4.1.476.1.42.3.2.1.3.0.20
.1.3.6.1.4.1.476.1.42.3.2.1.0.0.0
.1.3.6.1.4.1.476.1.42.3.2.1.0.0.0
.1.3.6.1.4.1.476.1.42.3.2.1.3.0.20
.1.3.6.1.4.1.476.1.42.3.2.1.0.0.0

hiross3:

.1.3.6.1.4.1.476.1.42.3.2.1.2.0.19
.1.3.6.1.4.1.476.1.42.3.2.1.0.0.0
.1.3.6.1.4.1.476.1.42.3.2.1.5.0.7
.1.3.6.1.4.1.476.1.42.3.2.1.0.0.0
.1.3.6.1.4.1.476.1.42.3.2.1.0.0.0
.1.3.6.1.4.1.476.1.42.3.2.1.2.0.19

with no further indication whether a trap signifies a failure or a recovery.

During the alarm state, we performed the following snmpget:

[triton] ~ % snmpget -v 1 -c public hiross1 .1.3.6.1.4.1.476.1.42.3.2.1.1     
SNMPv2-SMI::enterprises.476.1.42.3.2.1.1 = INTEGER: 11

This seemed to indicate a failure condition for the "lgpConditionHighTemperature".

Thesis

The manual presents two general kinds of value triples: 15/11/0 and 23/19/0, distributed among all variables with a slight bias towards the former. We were able to produce the following values using snmpget:

[triton] ~ % snmpget -v 1 -c public hiross3 .1.3.6.1.4.1.476.1.42.3.2.1.1
LIEBERT-GP-CONDITIONS-MIB::lgpConditionHighTemperature = INTEGER: 8
[triton] ~ % snmpget ...
LIEBERT-GP-CONDITIONS-MIB::lgpConditionHighTemperature = INTEGER: 15
...
LIEBERT-GP-CONDITIONS-MIB::lgpConditionHighTemperature = INTEGER: 9

These represent the following conditions, in the order of the values shown (8, 15, 9):

everything OK (no alert whatsoever)
alert is active and has not been acknowledged ("device beeping")
alert is acknowledged but not reset, although the temperature value is back inside normal parameters

Our conclusion:

15 = 01111b = alert present, unacknowledged, bad sensor value
11 = 01011b = alert present, acknowledged, bad sensor value
 9 = 01001b = alert present, acknowledged, sensor OK
 8 = 01000b = alert absent, -, sensor OK

The value 0 (as suggested by the manual to indicate an "OK" status) was never encountered. We deduced the following bit mappings:

00100b = unacknowledged alert bit
00010b = bad sensor value bit
00001b = alert present bit

The 8-bit seems to be generally set on all values with a supposed value-triple of 15/11/0. We were encouraged to notice that on the other (23/19/0) variables, the 16-bit seems to be always set:

16 = 10000b = OK (that we actually confirmed)
23 = 10111b = alert present (see 15 above, in theory)
19 = 10011b = alert acknowledged (see 11 above, in theory)

We didn't go through the trouble of producing the actual errors, but as 23 and 19 are documented in the manual and the 16 seems to confirm the thesis so far, we assume that the 3 least significant bits indeed encode the error states for all variables.
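The deduced bit layout can be written down as a small decoder. This is a Python sketch of the thesis above, not code from the handler script; the constant and function names are made up, and the bit meanings are exactly the ones deduced here, not something taken from the manual.

```python
ALERT_PRESENT = 0b00001   # alert present bit
BAD_SENSOR = 0b00010      # bad sensor value bit
UNACKNOWLEDGED = 0b00100  # unacknowledged alert bit
STATE_MASK = 0b00111      # the 3 least significant bits


def decode_condition(value):
    """Split a polled INTEGER into the three state flags and the remaining bits."""
    return {
        "alert_present": bool(value & ALERT_PRESENT),
        "bad_sensor_value": bool(value & BAD_SENSOR),
        "unacknowledged": bool(value & UNACKNOWLEDGED),
        # 8 for the 15/11/0 variables, 16 for the 23/19/0 variables
        "family_bits": value & ~STATE_MASK,
    }


for value in (15, 11, 9, 8, 23, 19, 16):
    print(value, format(value, "05b"), decode_condition(value))
```

The values 15 and 23 decode to the same three flags and differ only in the remaining high bit, which is what the thesis predicts.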

Approach

As any received trap is passed to Nagios through a handler script, this script will have to be modified to perform snmpget operations whenever it is handed a trap inside the .1.3.6.1.4.1.476.1.42.3.2.1 tree. Only the 3 LSBs of the received value need to be checked:

0 - OK
non 0 - ALERT
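The check itself could look roughly like the following Python sketch. It is not the actual handler script: all function names are hypothetical, which variable to poll for a given trap is left to the caller, and poll_condition simply wraps the snmpget call shown earlier (SNMP v1, community "public") and parses a reply line of the form "... = INTEGER: 11".

```python
import re
import subprocess

# The tree named above, as a tuple of sub-identifiers
HIROSS_TREE = (1, 3, 6, 1, 4, 1, 476, 1, 42, 3, 2, 1)


def parse_oid(text):
    """Turn '.1.3.6.1' into the tuple (1, 3, 6, 1)."""
    return tuple(int(part) for part in text.strip(".").split("."))


def in_hiross_tree(trap_oid):
    """True if the trap OID lies inside the condition tree."""
    return parse_oid(trap_oid)[:len(HIROSS_TREE)] == HIROSS_TREE


def parse_integer_reply(line):
    """Extract the number from an snmpget reply such as '... = INTEGER: 11'."""
    match = re.search(r"=\s*INTEGER:\s*(-?\d+)\s*$", line)
    if match is None:
        raise ValueError("not an INTEGER reply: %r" % line)
    return int(match.group(1))


def state_from_value(value):
    """Apply the rule above: 3 LSBs all zero means OK, anything else ALERT."""
    return "OK" if (value & 0b111) == 0 else "ALERT"


def poll_condition(host, oid, community="public"):
    """Run snmpget as in the examples above and return the INTEGER value."""
    reply = subprocess.run(
        ["snmpget", "-v", "1", "-c", community, host, oid],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_integer_reply(reply.strip().splitlines()[-1])


reply = "SNMPv2-SMI::enterprises.476.1.42.3.2.1.1 = INTEGER: 11"
print(state_from_value(parse_integer_reply(reply)))
```

With this rule, an acknowledged alert whose sensor value is already back to normal (value 9) still counts as ALERT until it is reset on the device.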

undiscussed problems

Swap vs. Memory in Nagios

We intentionally prevented nagios from generating mail about Swap problems if Memory is perfectly OK (on hosts that register both checks). On Nov 2nd, the following occurred: both pi and pica ran quite short on memory and very short on swap (both had been the case on pica for a while by then). Interestingly enough, their memory use kept scraping along the "warning" threshold (but the meters stayed "OK" most of the time), while the swap checks went "Critical". At the moment that occurred on pica, memory was obviously in a warning state, as the "swap critical" mail was correctly generated. On pi, however, memory must have been OK at that moment, though perfparse shows a short surge at 2pm. So the dependency matched, no mail about the swap problem was generated, and the problem was only noticed by accident one hour later.

It remains to be seen if maybe an email will get generated the moment memory usage surpasses the "warning" threshold on pi (if it does).
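The rule behind this incident can be modelled in a few lines. This Python sketch only restates the rule as described above (swap mail is suppressed while memory is OK); it is not the Nagios configuration itself, and the function name is made up.

```python
OK, WARNING, CRITICAL = "OK", "WARNING", "CRITICAL"


def swap_mail_sent(swap_state, memory_state):
    """Mail about swap is generated only if memory is not OK at that moment."""
    if swap_state == OK:
        return False
    return memory_state != OK


# The two situations from Nov 2nd, as described above
print("pica:", swap_mail_sent(CRITICAL, WARNING))  # memory in warning state
print("pi:  ", swap_mail_sent(CRITICAL, OK))       # memory OK at that moment
```

The model shows why the outcome depends on the memory state at the very moment swap goes critical, not on how close memory was to its threshold.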

CategoryHomepage

Revision 20 as of 2007-12-06 12:36:17 (Size: 11139, Editor: FelixFrank, Comment: hiross log)
Revision 21 as of 2007-12-06 14:27:17 (Size: 13372, Editor: FelixFrank, Comment: completed HIROSS log)