Email: <Felix.Frank@Desy.De>

Design decisions


  1. Queries for RPM packages. The initial idea was a construct like

    map { chomp; $installed{$_} = 1 } `$rpm -qa`; 

    which would have allowed a fast and elegant check like use_service($service_name) if $installed{$package};. This would, however, require exact package names including version numbers in announce_nagios (like firefox-1.0.8-1.4.1.SL3.1.i386). So I chose a different way of retrieving the installed packages:

    my $installed = 'XxX'.`$rpm -qa`;
    $installed =~ s/\n/XxX/g; 

    to allow a more flexible, yet slower query like use_service($service_name) if $installed =~ /XxX$package/;


  2. The subs that make changes to cfg-files may appear quite confusing. They were designed after realizing that, originally, each file was looped over in very similar fashion (i.e., heavy script code redundancy). The applied model closely resembles functional programming: the filter sub does little more than concatenate sets of input lines into "sections" (which are described by a parameter) and perform a specific action on each such section. The action is chosen by passing the desired sub-reference to filter. A swarm of subs defines how all kinds of sections from cfg-files are to be treated. This solution does not improve readability (on the contrary, actually), but it should make changes and additions easy.
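
The trade-off from item 1 can be sketched as follows. This is only an illustration: the hard-coded list stands in for the output of `$rpm -qa`, the helper name is_installed and the sample package strings are made up, and the \Q...\E quoting is an addition of the sketch (so that regex metacharacters such as the dots in a version string are matched literally), not part of the original one-liner.

```perl
use strict;
use warnings;

# Sample data standing in for the output of `$rpm -qa` (made-up package list).
my @rpm_output = (
    "firefox-1.0.8-1.4.1.SL3.1.i386\n",
    "gcc-c++-3.2.3-56.i386\n",
    "openssh-3.6.1p2-33.30.9.i386\n",
);

# Variant A: hash keyed by the full package string (fast, but exact match only).
my %installed_exact;
for my $line (@rpm_output) {
    chomp(my $name = $line);
    $installed_exact{$name} = 1;
}

# Variant B: one long string with the XxX marker in front of every package.
my $installed = 'XxX' . join('', @rpm_output);
$installed =~ s/\n/XxX/g;

# Hypothetical helper: true if some installed package string starts with $package.
sub is_installed {
    my ($package) = @_;
    return $installed =~ /XxX\Q$package\E/ ? 1 : 0;
}

print "exact lookup of 'firefox': ", ($installed_exact{'firefox'} ? 'yes' : 'no'), "\n";
print "prefix match of 'firefox': ", (is_installed('firefox') ? 'yes' : 'no'), "\n";
print "prefix match of 'thunderbird': ", (is_installed('thunderbird') ? 'yes' : 'no'), "\n";
```

Note that the prefix match is deliberately loose: a name like firefox would also match a package string that merely begins with firefox, which is the price paid for not having to know version numbers.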
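
The model from item 2 can be sketched roughly like this. Everything except the name filter is an assumption made for the example: the real signature of filter, the way sections are delimited (here a start and an end pattern), the action drop_sections_for and the sample configuration lines are all hypothetical.

```perl
use strict;
use warnings;

# Hypothetical signature: filter(\@lines, $start_re, $end_re, $action).
# Lines from one matching $start_re up to the next one matching $end_re
# (both inclusive) are collected into a section and handed to $action,
# which returns the lines that replace the section. Other lines pass through.
sub filter {
    my ($lines, $start_re, $end_re, $action) = @_;
    my (@out, @section);
    my $in_section = 0;
    for my $line (@$lines) {
        if (!$in_section && $line =~ $start_re) {
            $in_section = 1;
        }
        if ($in_section) {
            push @section, $line;
            if ($line =~ $end_re) {
                push @out, $action->(@section);
                @section = ();
                $in_section = 0;
            }
        } else {
            push @out, $line;
        }
    }
    push @out, @section;    # unterminated section: keep its lines unchanged
    return @out;
}

# One possible action: drop every section that belongs to a given host.
sub drop_sections_for {
    my ($host) = @_;
    return sub {
        my @section = @_;
        return (grep { /host_name\s+\Q$host\E\s*$/ } @section) ? () : @section;
    };
}

my @cfg = (
    "# services\n",
    "define service {\n", "  host_name pi\n",   "}\n",
    "define service {\n", "  host_name pica\n", "}\n",
);
my @result = filter(\@cfg, qr/^define service/, qr/^\}/, drop_sections_for('pi'));
print @result;
```

The loop over the file exists only once; a new kind of change to a cfg-file then only needs a new action sub, which is the point of the design even if it reads less directly.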

SNMP trapping


Sun v20z keeps sending mystery traps. Normally, all SP-EVENT traps are supposed to be located under . (SP-MasterAgent-MIB::spEvent), but we receive traps with the all-too-short OID of . (SP-MasterAgent-MIB::newisys), which is the beginning of Sun's (?) enterprise tree. The cause might be a bug in trapd2. TODO:

  1. check back with nino and upgrade the original trapd :(

  2. try and debug our solution (done)

  3. apply a workaround: it may well be possible that appending . to the OIDs will make the traps make more sense. Still, no variables were ever received along with a trap like that, and nothing appeared in any logs on the source machines, so they may still remain inconclusive :\

Conclusion: Some debugging suggested that the traps in question are not truncated or misinterpreted but indeed malformed and inconclusive. Two flavors of truncated OID have been seen so far:

A workaround similar to the one described in 3 is still possible, but chances are the traps won't tell us much.


We are in possession of MIBs that at least seem to contain usable TRAP OIDs. However, we haven't found a way to enable SNMP-traps yet (i.e., we need to tell the machines where to direct the traps).

[update 2006-08-04]

Andreas found a way to feed the trap-sink to the v65x-es. We had no testing opportunity so far. Firewall reconfiguration will probably be necessary. The MIBs couldn't be verified either.


Sun X4100 (galaxy) machines are rather mysterious. The traps are from the OID-subtree . Few MIBs covering it can be found using Google (and none on the Sun page, AFAIK), and those we found were from DELL and Intel, respectively. The defined traps seem to follow a certain standard, however, backed by DELL and Intel ([ ], pdf: [ ]). Wolfgang also found [ ] with ALL of Dell's published material.

It appears that Sun chose to go so far as to use traps inside that "Wired for Management" OID tree, but the traps themselves seem to be rather invented. We tried provoking a galaxy machine into sending a "Power redundancy degraded" trap (which is in the DELL-ASF-MIB we found online). A trap was generated, but not only did it not have the expected OID, it wasn't anywhere in the ASF-MIB at all.

We have not yet found a source of information about the true OIDs of the PETs from Sun machines. IBM also hosts information on PETs:

[update 2006-07-07]

Apparently, certain voltage warnings are processed OK, while their respective recovery traps are not recognized (making snmptt discard them without informing nagios). We also noted other events that were never recognized in the first place and not reported to nagios at all.

[update 2006-08-07]

We have always been in possession of the "SUN PLATFORM MIB", again pointed out to us by our MCS contact. It is known to be unrelated to the hardware traps. We thoroughly searched another X4100 CD (N1 System Manager) and found a set of MIBs in a package meant for Solaris on x86 (not in the Linux or Sparc version of the same package). Most of those MIBs referred to numerous kinds of management or other related information. Nothing on the received hardware traps could be found; the CD and all MIBs on it proved just as useless as the other one.

[update 2006-09-07]

MCS suggested downloading the image of a new X4100 support CD. We found naught but the SUN-PLATFORM-MIB, which isn't any more helpful than it was half a year ago. Waltraut sent another request, including more details on our problem with the missing OIDs.

[update 2006-10-20]

Waltraut sent another mail to MCS, requesting translation of the 10 unknown OIDs we've seen so far. As of Nov 2nd, there has been no further follow-up.

[update 2007-01-02]

In December, WolfgangFriebel did some in-depth research and discovered the pattern which quite accurately defines how PET-OIDs are constructed (basically, some bits control whether the PET is a problem or a solution notification, and the others identify the component in question). He also managed to find the latest ASF MIB (available from Dell), which is supposed to define all PETs in existence. A number of traps generated by the X4100 are not to be found in this MIB.

I agree with Wolfgang's conclusion that SUN may have silently enhanced the ASF MIB for their purposes. For one reason or another they also seem to keep their meddling from the public. Thus a search for a MIB that will translate the so far unknown PETs may remain without success for the time being.
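
The bit pattern mentioned above can be illustrated with a small decoder. This is a sketch under an assumption, not Wolfgang's actual notes: it uses the field layout of the IPMI Platform Event Trap format as far as we understand it (sensor type in bits 23-16, event type in bits 15-8, bit 7 as the problem/recovery flag, event offset in the low four bits). The helper name decode_pet and the two sample numbers are invented for the example and should be checked against the specification and against real traps.

```perl
use strict;
use warnings;

# Hypothetical helper: split the specific-trap number of a PET into its fields.
# The layout is an assumption (see the text above), not taken from a Sun MIB.
sub decode_pet {
    my ($specific) = @_;
    return {
        sensor_type => ($specific >> 16) & 0xFF,      # which kind of component
        event_type  => ($specific >> 8)  & 0xFF,      # class of event
        recovery    => ($specific & 0x80) ? 1 : 0,    # bit 7: problem (0) or recovery (1)
        offset      => $specific & 0x0F,              # event within its class
    };
}

# Two invented sample numbers that differ only in bit 7.
for my $specific (0x040100, 0x040180) {
    my $pet = decode_pet($specific);
    printf "%d: sensor type 0x%02X, event type 0x%02X, offset %d, %s\n",
        $specific, $pet->{sensor_type}, $pet->{event_type}, $pet->{offset},
        $pet->{recovery} ? 'recovery' : 'problem';
}
```

Under this layout a problem trap and its recovery trap share everything but one bit, so a MIB that defines only one of the two would explain warnings being recognized while their recoveries are discarded.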

[conclusion 2007-07-31]

The case seems to be solved. We stumbled upon the most unlikely download option on the SUN website. The SUN-ILOM-PET-EVENTS MIB appears to hold definitions for all traps that can be expected to be flung by X4100 and other architectures.

There are collisions with Dell's ASF MIB, of course. There appear to be no semantic differences among these, though: the same OID denotes the same problem. Some severities are lower in the SUN MIB, though. Here is the new policy:

  1. only one MIB entry is used for any incoming trap
  2. the globally ignored OIDs have highest priority for snmptt (see Nagios1 and Nagios2)
  3. the (new) X4100 PET entries have lowest priority (double definitions are overruled by Dell ASF)

This way, we should always see the most accurate result if any, and nothing if the trap is being ignored.
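
The policy amounts to a first-match lookup over sources in a fixed order. The following sketch only models that logic; it is not snmptt configuration, and the OIDs, messages and the name resolve_trap are placeholders invented for the example. In snmptt itself this presumably corresponds to the order of the configuration files combined with processing only the first matching definition, but we state that as an assumption.

```perl
use strict;
use warnings;

# Placeholder tables in priority order; all OIDs and texts are invented.
my @sources = (
    [ 'ignore list' => { '.1.2.3.1' => undef } ],
    [ 'Dell ASF'    => { '.1.2.3.1' => 'CRITICAL: something ignored anyway',
                         '.1.2.3.2' => 'CRITICAL: fan failure' } ],
    [ 'X4100 PET'   => { '.1.2.3.2' => 'WARNING: fan failure',
                         '.1.2.3.3' => 'WARNING: voltage low' } ],
);

# Hypothetical helper: return (source name, message) of the first source that
# knows the OID. An undefined message means "ignore"; an empty list means
# that no source knows the OID at all.
sub resolve_trap {
    my ($oid) = @_;
    for my $source (@sources) {
        my ($name, $table) = @$source;
        return ($name, $table->{$oid}) if exists $table->{$oid};
    }
    return;
}

for my $oid ('.1.2.3.1', '.1.2.3.2', '.1.2.3.3', '.1.2.3.4') {
    my ($name, $message) = resolve_trap($oid);
    if    (!defined $name)    { print "$oid: unknown trap\n" }
    elsif (!defined $message) { print "$oid: ignored\n" }
    else                      { print "$oid: $message (from $name)\n" }
}
```

With this ordering a colliding OID is always reported with the Dell ASF text and severity, and the X4100 entries only fill the gaps.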


The protocol now lives in Nagios_Administration/HIROSS.

Undiscussed problems

Swap vs. Memory in Nagios

We intentionally keep nagios from generating mail about swap problems if memory is perfectly OK (on hosts that register both checks). On Nov 2nd, the following occurred: both pi and pica ran quite short on memory and very short on swap (both had been the case on pica for a while by then). Interestingly enough, they kept scraping along the "warning" threshold with their memory use (but the meters stayed "OK" most of the time), while the swap checks went "Critical". At the moment that occurred on pica, memory was obviously in a warning state, as the "swap critical" mail was correctly generated. On pi, however, memory must have been quite OK at that moment, though perfparse shows a short surge at 2pm. So the dependency matched, no mail about the swap problem was generated, and the problem only got noticed by accident one hour later.

It remains to be seen if maybe an email will get generated the moment memory usage surpasses the "warning" threshold on pi (if it does).
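
The rule and the two observed cases can be written down as a small decision function. This is a simplified model for discussion only: the helper name swap_mail_allowed is made up, the states are the ones described above, and the real behaviour also depends on when nagios evaluates the dependency, which the sketch ignores.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the dependency: mail about swap is only
# allowed while the memory check on the same host is not OK.
sub swap_mail_allowed {
    my ($memory_state, $swap_state) = @_;
    return 0 if $swap_state eq 'OK';          # nothing to report
    return $memory_state eq 'OK' ? 0 : 1;     # suppressed while memory is OK
}

# States at the moment swap went critical, as described above.
my %observed = (
    pica => { memory => 'WARNING', swap => 'CRITICAL' },
    pi   => { memory => 'OK',      swap => 'CRITICAL' },
);

for my $host (sort keys %observed) {
    my $s = $observed{$host};
    printf "%-4s memory=%-8s swap=%-8s mail=%s\n", $host, $s->{memory}, $s->{swap},
        swap_mail_allowed($s->{memory}, $s->{swap}) ? 'yes' : 'no';
}
```

The pi case is the gap in the rule: swap can be critical while memory hovers just below its warning threshold, and then nobody is told.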


FelixFrank (last edited 2008-11-03 12:39:23 by FelixFrank)