Monday, May 21, 2018

How to configure Nagios to monitor your systems and network

Before we start monitoring something with Nagios, we need to first understand its configuration structure.

# cd /usr/local/nagios/etc
# ls -l
-rw-rw-r-- 1 nagios nagios 12999 Apr 24 21:55 cgi.cfg
-rw-r--r-- 1 root   root      50 May 12 11:55 htpasswd.users
-rw-rw-r-- 1 nagios nagios 44868 May 12 14:46 nagios.cfg
drwxrwxr-x 2 nagios nagios  4096 May 14 01:22 objects
-rw-rw---- 1 nagios nagios  1312 Apr 24 21:55 resource.cfg

nagios.cfg is the main Nagios configuration file.  It holds the global parameters and includes the other, user-customized configuration files, e.g.
cfg_file=/usr/local/nagios/etc/objects/commands.cfg
cfg_file=/usr/local/nagios/etc/objects/contacts.cfg
cfg_file=/usr/local/nagios/etc/objects/timeperiods.cfg
cfg_file=/usr/local/nagios/etc/objects/templates.cfg

# Definitions for monitoring the local (Linux) host
cfg_file=/usr/local/nagios/etc/objects/localhost.cfg
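
If you want to see at a glance which object files a nagios.cfg pulls in, a few lines of Python will do.  This is only a rough sketch: the sample text is a shortened stand-in for the real file, the log_file line is there just to show that other directives are skipped, and the function name included_files is my own.

```python
# Sketch: list the object files that a nagios.cfg-style text includes.
# SAMPLE_NAGIOS_CFG is a shortened stand-in, not the real file.
SAMPLE_NAGIOS_CFG = """\
# Definitions for monitoring the local (Linux) host
cfg_file=/usr/local/nagios/etc/objects/commands.cfg
cfg_file=/usr/local/nagios/etc/objects/localhost.cfg
log_file=/usr/local/nagios/var/nagios.log
"""


def included_files(cfg_text):
    """Return the value of every cfg_file= directive, in file order."""
    files = []
    for raw in cfg_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        key, sep, value = line.partition("=")
        if sep and key.strip() == "cfg_file":
            files.append(value.strip())
    return files


if __name__ == "__main__":
    for path in included_files(SAMPLE_NAGIOS_CFG):
        print(path)
```

To try it on a real installation, read /usr/local/nagios/etc/nagios.cfg into a string and pass that in instead of the sample.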


Let's get started with an example:

First, we define something for Nagios to monitor.  The basic unit is a host, which may have many services:
/usr/local/nagios/etc/objects/localhost.cfg
define host{
        use                     linux-server  ; Name of host template to use
                                                        ; This host definition will inherit all variables that are defined
                                                        ; in (or inherited by) the linux-server host template definition.
        host_name         localhost
        alias                   localhost
        address              127.0.0.1
}
define service{
        use                             local-service         ; Name of service template to use
        host_name                 localhost
        service_description    PING
        check_command        check_ping!100.0,20%!500.0,60%
}

The "use" statement in each definition tells the host and service to inherit from templates defined in templates.cfg, so let's have a look.  Note that "linux-server" is itself the child of another template, "generic-host".

/usr/local/nagios/etc/objects/templates.cfg
define host{
        name                                        generic-host ; The name of this host template
        notifications_enabled              1                   ; Host notifications are enabled
        event_handler_enabled           1                   ; Host event handler is enabled
        flap_detection_enabled           1                   ; Flap detection is enabled
        process_perf_data                   1                   ; Process performance data
        retain_status_information       1                   ; Retain status information across program restarts
        retain_nonstatus_information 1                   ; Retain non-status information across program restarts
        notification_period                  24x7            ; Send host notifications at any time
        register                                     0               ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL HOST, JUST A TEMPLATE!
}
define host{
        name                          linux-server        ; The name of this host template
        use                             generic-host        ; This template inherits other values from the generic-host template
        check_period             24x7                    ; By default, Linux hosts are checked round the clock
        check_interval           5                          ; Actively check the host every 5 minutes
        retry_interval             1                          ; Schedule host check retries at 1 minute intervals
        max_check_attempts 10                        ; Check each Linux host 10 times (max)
        check_command        check-host-alive ; Default command to check Linux hosts
        notification_period     workhours          ; Linux admins hate to be woken up, so we only notify during the day
         ; Note that the notification_period variable is being overridden from
         ; the value that is inherited from the generic-host template!

        notification_interval   120             ; Resend notifications every 2 hours
        notification_options    d,u,r           ; Only send notifications for specific host states
        contact_groups            admins       ; Notifications get sent to the admins by default
        register                        0                 ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL HOST, JUST A TEMPLATE!
}

A template defines common parameters that are used over and over again by many hosts and services.  So, instead of repeating these parameters in every host and service definition, we create a template.  The template basically tells Nagios how, and how often, to check the host or service, and what to do when its state changes.  Most parameters are pretty self-explanatory.  For example, "check_period  24x7" and "check_interval  5" say that this host should be monitored 24 hours a day, 7 days a week, and that Nagios should check it every 5 minutes.
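
To make the inheritance chain concrete, here is a small Python sketch of how localhost ends up with its settings.  It is not how Nagios implements it: the dictionaries are trimmed copies of the definitions above, only a single "use" parent is handled, and the function name resolve is my own.  It also assumes that "register" is not passed down from a template to its children.

```python
# Trimmed copies of the templates shown above, as plain dictionaries.
TEMPLATES = {
    "generic-host": {
        "notifications_enabled": "1",
        "notification_period": "24x7",
        "register": "0",
    },
    "linux-server": {
        "use": "generic-host",
        "check_interval": "5",
        "notification_period": "workhours",
        "register": "0",
    },
}

LOCALHOST = {
    "use": "linux-server",
    "host_name": "localhost",
    "address": "127.0.0.1",
}


def resolve(definition, templates):
    """Merge a definition with its chain of templates; child values win."""
    merged = {}
    parent_name = definition.get("use")
    if parent_name is not None:
        merged.update(resolve(templates[parent_name], templates))
        merged.pop("register", None)  # assumption: register is not inherited
    # Values set locally override whatever came from the parent.
    merged.update({k: v for k, v in definition.items() if k != "use"})
    return merged


if __name__ == "__main__":
    for key, value in sorted(resolve(LOCALHOST, TEMPLATES).items()):
        print(key, value)
```

Note how notification_period comes out as workhours for localhost: linux-server overrides the 24x7 value it inherited from generic-host, and localhost itself does not set one.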

The parameters below may be less obvious, so I will talk more about them.

"notification_options" - In which situations should Nagios send out notifications?  If we don't specify any, Nagios will send out notifications in all situations, but sometimes that is not what we want.  In the example above, "d,u,r" means "send me notifications when the host goes DOWN, becomes UNREACHABLE, or RECOVERS from either state".  Flapping means the host/service keeps switching between bad (d, u) and good (r) states; we will probably talk more about that later.

d = DOWN state
u = UNREACHABLE state
r = recoveries (OK state)
f = starts and stops flapping
s = scheduled downtime starts and ends
n = none (no host notifications will be sent out)
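
The option letters above boil down to a simple filter.  Here is a rough Python sketch of that idea; the event labels and the function name should_notify are my own, and real Nagios applies further checks (time periods, the contact's own options, and so on) before anything is sent.

```python
# Map each host notification option letter to the kind of event it covers.
# The event labels are illustrative names, not Nagios identifiers.
HOST_OPTION_EVENTS = {
    "d": "DOWN",
    "u": "UNREACHABLE",
    "r": "RECOVERY",
    "f": "FLAPPING",
    "s": "DOWNTIME",
}


def should_notify(options, event):
    """Return True if the notification_options string allows this event."""
    enabled = {opt.strip() for opt in options.split(",") if opt.strip()}
    if "n" in enabled:
        return False  # "n" switches host notifications off
    if not enabled:
        return True  # nothing specified: notify in all situations
    return any(HOST_OPTION_EVENTS.get(opt) == event for opt in enabled)


if __name__ == "__main__":
    for event in ("DOWN", "RECOVERY", "FLAPPING"):
        print(event, should_notify("d,u,r", event))
```

With "d,u,r" as in the linux-server template, a flapping event is filtered out while DOWN and recovery events go through.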


check_command        check-host-alive
This is the command Nagios calls to determine the host's state.  To find out what it does, we have to look at another configuration file, commands.cfg:

define command{
        command_name    check-host-alive
        command_line    $USER1$/check_ping -H $HOSTADDRESS$ -w 3000.0,80% -c 5000.0,100% -p 5
 }

OK, so what is "$USER1$"?  What is "check_ping"?  Again, we need to look at yet another configuration file, resource.cfg, which is quite simple:
$USER1$=/usr/local/nagios/libexec
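
So before running a command, Nagios swaps each $NAME$ macro for its value.  A minimal Python sketch of that substitution, assuming a plain dictionary of macro values (the function name expand_macros is my own, and leaving unknown macros untouched is a choice made for this sketch, not a statement about what Nagios does):

```python
import re


def expand_macros(command_line, macros):
    """Replace each $NAME$ token with its value from the macros dict."""
    def substitute(match):
        # Leave the token untouched if we have no value for it.
        return macros.get(match.group(1), match.group(0))

    return re.sub(r"\$([A-Z0-9_]+)\$", substitute, command_line)


# Values taken from resource.cfg and the localhost definition above.
MACROS = {
    "USER1": "/usr/local/nagios/libexec",
    "HOSTADDRESS": "127.0.0.1",
}

COMMAND_LINE = (
    "$USER1$/check_ping -H $HOSTADDRESS$ -w 3000.0,80% -c 5000.0,100% -p 5"
)

if __name__ == "__main__":
    print(expand_macros(COMMAND_LINE, MACROS))
```

For localhost this gives /usr/local/nagios/libexec/check_ping -H 127.0.0.1 followed by the unchanged threshold arguments.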

Let's now run the command: /usr/local/nagios/libexec/check_ping

# /usr/local/nagios/libexec/check_ping
check_ping: Could not parse arguments
Usage:
check_ping -H <host_address> -w <wrta>,<wpl>% -c <crta>,<cpl>%  [-p packets] [-t timeout] [-4|-6]

In Nagios, a check can report WARNING or CRITICAL when it detects a problem, so in most commands -w sets the warning threshold and -c sets the critical threshold.  When a time unit is involved, it is usually milliseconds.  In check_ping, "rta" is the round-trip average and "pl" is the packet loss.  So let's get back to the command_line:
$USER1$/check_ping -H $HOSTADDRESS$ -w 3000.0,80% -c 5000.0,100% -p 5
This means we ping the host 5 times (-p 5) and mark it WARNING if rta is > 3000 ms or there is 80% packet loss; we mark it CRITICAL if rta is > 5000 ms or there is 100% packet loss.
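
Here is the same threshold logic written out as a small Python sketch.  The function names are my own, and treating a value exactly on the threshold as a breach (>=) is an assumption; check the plugin itself if the boundary matters to you.

```python
def parse_threshold(spec):
    """Split a string like '3000.0,80%' into (rta_ms, packet_loss_percent)."""
    rta, loss = spec.split(",")
    return float(rta), float(loss.rstrip("%"))


def ping_state(rta_ms, loss_pct, warn_spec, crit_spec):
    """Classify a ping result against -w and -c style thresholds."""
    warn_rta, warn_loss = parse_threshold(warn_spec)
    crit_rta, crit_loss = parse_threshold(crit_spec)
    # Check the critical thresholds first, since they are the stricter verdict.
    if rta_ms >= crit_rta or loss_pct >= crit_loss:
        return "CRITICAL"
    if rta_ms >= warn_rta or loss_pct >= warn_loss:
        return "WARNING"
    return "OK"


if __name__ == "__main__":
    warn, crit = "3000.0,80%", "5000.0,100%"
    print(ping_state(0.5, 0, warn, crit))     # fast reply, no loss
    print(ping_state(3500.0, 0, warn, crit))  # slow reply
    print(ping_state(10.0, 100, warn, crit))  # every packet lost
```

These are very relaxed thresholds, which makes sense for a host-alive check: the host only goes CRITICAL when every packet is lost or replies take more than 5 seconds.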


notification_period     workhours
This time we go to "timeperiods.cfg".  You will find a few examples in this file, such as work hours and specific holidays.  The 24x7 period, for instance, looks like this:

define timeperiod{
        timeperiod_name 24x7
        alias           24 Hours A Day, 7 Days A Week
        sunday        00:00-24:00
        monday      00:00-24:00
        tuesday       00:00-24:00
        wednesday  00:00-24:00
        thursday      00:00-24:00
        friday          00:00-24:00
        saturday      00:00-24:00
}
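
A time period is just a list of allowed time ranges per weekday.  The Python sketch below shows how a moment in time can be tested against one.  It is an illustration only: the names are my own, and the OFFICE period with its 09:00-17:00 hours is a hypothetical example, not copied from timeperiods.cfg.

```python
from datetime import datetime

# datetime.weekday() returns 0 for Monday, so the list starts there.
DAY_NAMES = ["monday", "tuesday", "wednesday", "thursday",
             "friday", "saturday", "sunday"]


def minutes(hhmm):
    """Convert 'HH:MM' to minutes since midnight ('24:00' becomes 1440)."""
    hours, mins = hhmm.split(":")
    return int(hours) * 60 + int(mins)


def in_timeperiod(period, moment):
    """Return True if moment falls inside one of that weekday's ranges."""
    ranges = period.get(DAY_NAMES[moment.weekday()], "")
    now = moment.hour * 60 + moment.minute
    for rng in filter(None, (r.strip() for r in ranges.split(","))):
        start, end = rng.split("-")
        if minutes(start) <= now < minutes(end):
            return True
    return False


# The 24x7 period from above, plus a hypothetical weekday-only period.
ALWAYS = {day: "00:00-24:00" for day in DAY_NAMES}
OFFICE = {day: "09:00-17:00" for day in DAY_NAMES[:5]}

if __name__ == "__main__":
    night = datetime(2018, 5, 21, 3, 0)  # a Monday, 03:00
    print(in_timeperiod(ALWAYS, night), in_timeperiod(OFFICE, night))
```

A weekday that is missing from the definition, like Saturday in the hypothetical OFFICE period, simply never matches.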


contact_groups            admins
Contacts are how Nagios notifies you when there are state changes.  Let's have a look at contacts.cfg:

define contact{
        contact_name            nagiosadmin             ; Short name of user
        use                             generic-contact         ; this is from templates.cfg
        alias                           Nagios Admin            ; Full name of user
        email                          your_email_address@your_domain
        }
define contactgroup{
        contactgroup_name       admins
        alias                   Nagios Administrators
        members                 nagiosadmin
}
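
So "contact_groups  admins" on a host really means "everyone listed under members of the admins group".  A tiny Python sketch of that lookup, using dictionaries that mirror the two definitions above (the function name group_emails is my own):

```python
# Plain-dictionary copies of the contact and contactgroup shown above.
CONTACTS = {
    "nagiosadmin": {
        "alias": "Nagios Admin",
        "email": "your_email_address@your_domain",
    },
}

CONTACT_GROUPS = {
    "admins": {
        "alias": "Nagios Administrators",
        "members": "nagiosadmin",
    },
}


def group_emails(group_name, contacts, groups):
    """Return the email address of every known member of a contact group."""
    members = [m.strip() for m in groups[group_name]["members"].split(",")]
    return [contacts[m]["email"] for m in members if m in contacts]


if __name__ == "__main__":
    print(group_emails("admins", CONTACTS, CONTACT_GROUPS))
```

Adding a second admin is then just a matter of defining another contact and appending its name to members, separated by a comma.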


NOTE that "generic-contact" is in templates.cfg
define contact{
        name                                            generic-contact    ; The name of this contact template
        service_notification_period         24x7                    ; service notifications can be sent anytime
        host_notification_period              24x7                    ; host notifications can be sent anytime
        service_notification_options        w,u,c,r,f,s           ; send notifications for all service states, flapping events, and scheduled downtime events
        host_notification_options             d,u,r,f,s               ; send notifications for all host states, flapping events, and scheduled downtime events
        service_notification_commands   notify-service-by-email 
        host_notification_commands        notify-host-by-email 
        register                        0
}

NOTE that "notify-host-by-email" and "notify-service-by-email" are in commands.cfg.  These simply use the "/bin/mail" command that comes with the OS to send out the emails.  You can certainly use means other than email to send out notifications.  For instance, we could talk about how to use Telegram to send out the alarms.

define command{
        command_name    notify-host-by-email
        command_line    /usr/bin/printf "%b" "***** Nagios *****\n\nNotification Type: $NOTIFICATIONTYPE$\nHost: $HOSTNAME$\nState: $HOSTSTATE$\nAddress: $HOSTADDRESS$\nInfo: $HOSTOUTPUT$\n\nDate/Time: $LONGDATETIME$\n" | /bin/mail -s "** $NOTIFICATIONTYPE$ Host Alert: $HOSTNAME$ is $HOSTSTATE$ **" $CONTACTEMAIL$
 }

# 'notify-service-by-email' command definition
define command{
        command_name    notify-service-by-email
        command_line    /usr/bin/printf "%b" "***** Nagios *****\n\nNotification Type: $NOTIFICATIONTYPE$\n\nService: $SERVICEDESC$\nHost: $HOSTALIAS$\nAddress: $HOSTADDRESS$\nState: $SERVICESTATE$\n\nDate/Time: $LONGDATETIME$\n\nAdditional Info:\n\n$SERVICEOUTPUT$\n" | /bin/mail -s "** $NOTIFICATIONTYPE$ Service Alert: $HOSTALIAS$/$SERVICEDESC$ is $SERVICESTATE$ **" $CONTACTEMAIL$
 }


You may be feeling a bit dizzy by now, since we keep jumping between configuration files...  I got that feeling at first too, but once you get used to the template style of configuration, it is actually not that difficult.  Next, we will work through more real examples, as I found that the easiest way to learn Nagios.
