Using Nagios with SNMP

From ConShell

Implementing Nagios Service Checks with SNMP

This article shows how to set up a custom check with Nagios to monitor the mail queue on a set of servers. Mail servers are often plagued by backed-up queues, and this happens quite a lot even on regular servers, at least in the environments I've had the pleasure of working in. A perfect example is the web server that is expected to deliver data from feedback forms and the like to "people who care".

In the examples shown, host1 is running Nagios, and host2 is the system whose mail queue size we want to monitor. The MTA running on host2 is postfix.

First, realize that the size of the mail queue is easy enough to check by hand using the mailq command, which in a typical sendmail installation is a link to the sendmail binary. In postfix installations the situation is quite similar.

[mf1@host1 ~]$ file `which mailq`
/usr/bin/mailq: symbolic link to ../sbin/sendmail
[mf1@host2 ~]$ file `which mailq`
/usr/bin/mailq: symbolic link to /usr/bin/mailq.postfix 

OK, enough about that. Invoking the mailq command on either of these two hosts will do one of two things: if the mail queue is empty, it will say so; otherwise it will print a summary of the messages in the queue. The useful thing for us to know is that mailq will also return an exit value of 0 if the queue is empty or 1 if it is not.

Lucky for us there is a check_mailq command bundled with the nagios-plugins. This command can check the queue size on sendmail, postfix, exim and qmail.

It runs against the local queue, but we want to run it remotely and obtain the results via a nagios service check.

Step 1 is to copy check_mailq to the remote system (host2). I put it in /usr/local/nagios/libexec/check_mailq and verified that it works by running it by hand on host2.

[mf1@host2 ~]$ /usr/local/nagios/libexec/check_mailq -w2 -c4 -t7 -Mpostfix
OK: mailq reports queue is empty|unsent=0;2;4;0

Step 2 is to set up snmpd on host2 to run the script for us on demand. The configuration looks like this.

[mf1@host2 ~]$ grep mailq /etc/snmp/snmpd.conf
exec 1.3.6.1.4.1.5001.100 mailq-check /usr/local/nagios/libexec/check_mailq -w2 -c4 -t7 -Mpostfix

This command says to return a warning if there are 2 or more messages (-w2), return critical if there are 4 or more (-c4), and time out after 7 seconds (-t7); -Mpostfix tells the plugin that the MTA is postfix. The values corresponding to warning and critical are explained in the nagios plugin documentation.
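To make the threshold logic concrete, here is a minimal sketch of the comparison described above. The function name queue_state is my own invention for illustration and is not part of nagios-plugins; it only assumes the "N or more" semantics stated for -w and -c.

```shell
# Hypothetical helper (not part of nagios-plugins): map a queue length to a
# Nagios state name using ">= threshold" comparisons.
# Arguments: message count, warning threshold, critical threshold.
queue_state() {
    if [ "$1" -ge "$3" ]; then
        echo CRITICAL
    elif [ "$1" -ge "$2" ]; then
        echo WARNING
    else
        echo OK
    fi
}

# Thresholds taken from the snmpd.conf line: -w2 -c4
for n in 0 1 2 3 4 10; do
    printf '%s queued -> %s\n' "$n" "$(queue_state "$n" 2 4)"
done
```

With these thresholds even a couple of stuck messages raise a warning, which is handy for testing but probably too sensitive for a busy mail server.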

Now we can query the value remotely, using snmpget or similar utilities. For example:

[mf1@host1 ~]$ snmpget -v 2c -c public host2  .1.3.6.1.4.1.5001.100.101.1
SNMPv2-SMI::enterprises.5001.100.101.1 = STRING: "OK: mailq reports queue is empty|unsent=0;2;4;0"
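The returned string has two parts separated by a pipe: the human-readable status, and the performance data, which follows the usual Nagios plugin layout of label=value;warn;crit;min. As a rough sketch (the variable names are mine, and the input is simply the sample string from above rather than a live query), it can be pulled apart with plain shell parameter expansion:

```shell
# Sample plugin output, copied from the snmpget result in this article.
output='OK: mailq reports queue is empty|unsent=0;2;4;0'

status_text=${output%%|*}   # everything before the first "|"
perfdata=${output#*|}       # everything after the first "|"

label=${perfdata%%=*}       # "unsent"
values=${perfdata#*=}       # "0;2;4;0"

# Split the values on semicolons into value, warn, crit and min.
old_ifs=$IFS
IFS=';'
set -- $values
IFS=$old_ifs
value=$1
warn=$2
crit=$3
min=$4

echo "status: $status_text"
echo "label=$label value=$value warn=$warn crit=$crit min=$min"
```

Note that the warn and crit fields echo back the -w2 and -c4 thresholds from the snmpd.conf line, which is a quick way to confirm which thresholds the remote host is actually using.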

Step 3 is to use a check_snmp plugin script on the Nagios server to trigger the remote script and make sense of the output. The script is based on some other SNMP check scripts that come with nagios-plugins, but this one allows us to query an arbitrary OID.

Usage: check_snmp -H <hostname> -C <community_string> -o <oid>
./check_snmp -H host2 -C public -o 1.3.6.1.4.1.5001.100.101.1
OK: mailq reports queue is empty|unsent=0;2;4;0

Note that check_snmp is quite powerful and complicated, so definitely consult the --help output for more possibilities.
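One detail worth spelling out: Nagios decides the service state from the plugin's exit code (0 for OK, 1 for WARNING, 2 for CRITICAL, 3 for UNKNOWN), not from the text. Over SNMP we only get the text back, so "making sense of the output" means turning the leading state word into an exit code again. The following is only a sketch of that idea, not the actual script used here; state_to_exit_code is a hypothetical name, and the input is the sample string from this article rather than a live SNMP query.

```shell
# Hypothetical helper: translate the leading state word of a plugin's text
# output into the exit code Nagios expects. Anything unrecognised is UNKNOWN.
state_to_exit_code() {
    case $1 in
        OK*)       echo 0 ;;
        WARNING*)  echo 1 ;;
        CRITICAL*) echo 2 ;;
        *)         echo 3 ;;
    esac
}

# In a real wrapper this string would come from the SNMP query.
result='OK: mailq reports queue is empty|unsent=0;2;4;0'
code=$(state_to_exit_code "$result")

echo "$result"
echo "would exit with: $code"
```

A real wrapper would print the string and then call exit with that code, and should treat an SNMP timeout or empty reply as UNKNOWN, which the catch-all branch already does.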

Here is the command definition from the checkcommand.cfg used by Nagios:

define  command {
    command_name    check_snmp
    command_line    $USER1$/check_snmp -H $HOSTADDRESS$ -C public -o $ARG1$
    }

Here is the corresponding clause from services.cfg

define  service {
       host_name                       host2
       service_description             mailq
       is_volatile                     0
       check_command                   check_snmp!1.3.6.1.4.1.5001.100.101.1
       max_check_attempts              5
       normal_check_interval           10
       retry_check_interval            5
       active_checks_enabled           1
       passive_checks_enabled          1
       check_period                    workhours
       parallelize_check               1
       obsess_over_service             1
       check_freshness                 0
       event_handler                   notify-by-email
       event_handler_enabled           1
       flap_detection_enabled          1
       process_perf_data               0
       retain_status_information       1
       retain_nonstatus_information    1
       contact_groups                  lue
       notification_interval           60
       notification_period             workhours
       notification_options            w,u,c,r,f
       notifications_enabled           1
       register                        1
       }

One of the hardest parts of setting this all up and making sure it works is generating some email to purposefully clog the remote queue. This can be accomplished by setting the relayhost parameter (postfix) or DS value (sendmail) to some non-existent host, then generating some email using a command such as 'mail -s "test clogger 1" somebody@example.com'. Since the message can't be delivered to the non-existent smart host, it will stay in the queue for testing purposes. Once testing is complete, simply unset the relayhost or use postsuper to delete the queued messages.

Once set up, the configuration can simply be duplicated for each remote host that you want to monitor. Obviously that doesn't scale well, so next time I'll explain how to leverage CFengine to minimize the pain of rolling out this setup on a group of heterogeneous hosts.
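In the meantime, the duplication on the Nagios side can at least be scripted. This is not the CFengine approach mentioned above, just a minimal sketch: mailq_service is a hypothetical function, host3 and host4 are placeholder host names, and only a few directives are printed, so the rest of the service definition shown earlier would still have to be added or inherited from a template.

```shell
# Hypothetical generator: print one minimal mailq service definition for the
# host name given as the first argument.
mailq_service() {
    cat <<EOF
define  service {
       host_name                       $1
       service_description             mailq
       check_command                   check_snmp!1.3.6.1.4.1.5001.100.101.1
       }

EOF
}

# host3 and host4 are placeholders, not hosts from this article.
for host in host2 host3 host4; do
    mailq_service "$host"
done
```

Redirecting the output into a file that Nagios reads gives one service per host. The snmpd.conf line and the check_mailq script still have to be present on every monitored host.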


See Also: Snmp


--fostermarkd 11:23, 25 April 2006 (EDT)