A server infrastructure can't be considered fully operational if it isn't monitored in some way.
Various tools are available on the market, covering different facets of common monitoring needs (alerting, trending, performance, security...), but whichever you choose, you need to configure it in some way.
One of the nice side effects of having modules that include automatic monitoring functions, such as the Example42 ones, is that while deploying a Puppet infrastructure you add the relevant checks to your monitoring software so that you can quickly understand what is working out of the box and what has to be fixed.
All the Example42 Puppet modules provide built-in monitoring features. You can activate them just by setting the $monitor variable to "yes" (whatever method you use to define and classify nodes) and specifying at least one $monitor_tool.
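As a minimal sketch, assuming you classify nodes with plain node definitions and node-scoped variables (the node name below is hypothetical), activating the built-in checks for a host could look like this:

```puppet
# Hypothetical node: the hostname is an assumption made for illustration.
node 'fileserver01.example42.com' {

    # Turn on the built-in monitoring of the Example42 modules included below
    $monitor      = "yes"

    # At least one monitoring tool must be named
    $monitor_tool = "nagios"

    include samba
}
```

If you classify nodes in some other way, the same two variables should simply be set through that method instead.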
A unique feature of the Example42 modules is the abstraction embedded in all of them, which makes it quick and easy to introduce new monitoring tools without modifying anything in the modules themselves.
Typically a module has two kinds of checks enabled by default: its listening port, if it is a network service, and its process name. For example, in the samba module you have:
    monitor::port { "samba_${samba::params::protocol}_${samba::params::port}":
        protocol => "${samba::params::protocol}",
        port     => "${samba::params::port}",
        target   => "${samba::params::monitor_target_real}",
        enable   => "${samba::params::monitor_port_enable}",
        tool     => "${monitor_tool}",
    }

    monitor::process { "samba_process":
        process  => "${samba::params::processname}",
        service  => "${samba::params::servicename}",
        pidfile  => "${samba::params::pidfile}",
        enable   => "${samba::params::monitor_process_enable}",
        tool     => "${monitor_tool}",
    }
Although it may not be immediately obvious, from the above lines we can deduce:
- Two custom defines are used to specify what to check (a port and a process)
- The arguments given to the defines are obtained via qualified variables set in the samba::params class
- The user variable $monitor_tool specifies the monitoring tool(s) to be used for the above resources.
Currently the Example42 Puppet modules support various tools (you can define them using an array): Nagios, Munin, Monit, Collectd and Puppi.
They differ in nature and scope, but the most interesting ones for actually checking whether something works as expected are Nagios and, to some extent, Puppi.
Besides port and process checking, for these two tools it is also possible to define URL tests based on pattern matching, so that you can verify different functionalities of your web application by checking whether custom URLs contain specific strings.
An example of a URL check:
    monitor::url { "Url-Example42_TestDatabase":
        url     => "http://www.example42.com/testdb.php",
        port    => '80',
        target  => "${fqdn}",
        pattern => 'Database OK',
        enable  => "true",
        tool    => "${monitor_tool}",
    }
If the http://www.example42.com/testdb.php page contains the string "Database OK" the check is positive. Note that the host on which the check is run is defined with the target argument. Note also that if you set enable to false, the check is removed/disabled.
Another available check is for mount points. With a define like:
    monitor::mount { "/var/www/repo":
        name    => "/var/www/repo",
        fstype  => "nfs",
        ensure  => mounted,
        options => "defaults",
        device  => "nfs.example42.com:/data/repo",
        atboot  => true,
    }
you both mount and monitor the specified resource.
A proper test-driven infrastructure does not only check whether the services delivered by Puppet are running or the mount points are mounted; it also verifies HOW they work.
You can have Apache running but the web application failing in one or more elements. While the basic service/port checks are added automatically when the relevant module is included, for more accurate tests you need to write some (Puppet) code.
For this, the monitor::url define is useful for web applications, but we still haven't identified a good method to abstract application-specific tests (for example: is ldap/mysql/activemq responding correctly?), performance and security checks, proactive failure detection and other generally needed features.
Probably there is no real way to abstract certain specifics, and some custom approach, strictly tied to the software in use and to the circumstances, is required.
One possible approach to managing arbitrary checks could be to treat Nagios plugins as a de facto standard and refer to them for custom checks, considering that they are used by different tools besides Nagios and are easily extensible.
Currently the Example42 monitor module has a monitor::plugin define, but its usage is not yet standardized and it is oriented more towards managing plugins for software like Collectd or Munin than towards referring to Nagios plugins. Practice and operational needs will drive our choice on this point.
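Until such an abstraction exists, one possible stopgap is to declare a tool-specific check directly. The sketch below assumes the Example42 nagios::service define (shown later in this article) and a Nagios command called check_mysql that wraps the standard plugin of the same name on your Nagios server. The resource title and the command wiring are illustrative assumptions, not part of the monitor module.

```puppet
# Assumption: a "check_mysql" command is already defined on the Nagios
# server and wraps the standard check_mysql plugin. Its arguments
# (credentials, database name...) are site specific and not shown here.
nagios::service { "mysql_responding":
    check_command       => "check_mysql",
    service_description => "MySQL responding",
    ensure              => "present",
}
```

A check declared this way bypasses the monitor abstraction, so it only reaches Nagios and has to be rewritten for any other tool.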
Understanding the monitor module
All the above references to the monitor classes or defines imply the usage of the Example42 monitor module.
This is an implementation entirely based on Puppet's DSL of a (strongly needed) monitor abstraction type.
Different approaches and implementations would be welcomed, as we think that for the Puppet ecosystem it would be advisable to define at least standard naming and syntax for the monitoring elements to be included in every module.
The Example42 monitor implementation favors linearity and extensibility over performance and resource optimization.
The generic monitor defines are placed in files like:
monitor/manifests/process.pp, monitor/manifests/port.pp, monitor/manifests/url.pp.
Let's see for example monitor/manifests/port.pp:
    define monitor::port (
        $port,
        $protocol,
        $target,
        $tool,
        $checksource='remote',
        $enable='true'
        ) {

        [...]

        if ($tool =~ /nagios/) {
            monitor::port::nagios { "$name":
                target      => $target,
                protocol    => $protocol,
                port        => $port,
                checksource => $checksource,
                enable      => $enable,
            }
        }

        if ($tool =~ /puppi/) {
            monitor::port::puppi { "$name":
                target      => $target,
                protocol    => $protocol,
                port        => $port,
                checksource => $checksource,
                enable      => $enable,
            }
        }

    }
Note that here, depending on the requested tool, tool-specific defines are called. They are configured in files like:
monitor/manifests/port/nagios.pp, monitor/manifests/process/port.pp
which in turn call the defines that actually "do" the checks.
Let's see for example monitor/manifests/port/nagios.pp:
    define monitor::port::nagios (
        $target,
        $port,
        $protocol,
        $checksource,
        $enable
        ) {

        $ensure = $enable ? {
            "false" => "absent",
            "no"    => "absent",
            "true"  => "present",
            "yes"   => "present",
        }

        # Use for Example42 nagios/nrpe modules
        nagios::service { "$name":
            ensure        => $ensure,
            check_command => $protocol ? {
                tcp => $checksource ? {
                    local   => "check_nrpe!check_port_tcp!localhost!${port}",
                    default => "check_tcp!${port}",
                },
                udp => $checksource ? {
                    local   => "check_nrpe!check_port_udp!localhost!${port}",
                    default => "check_udp!${port}",
                },
            }
        }

        # Use for Camptocamp Nagios Module
        # nagios::service::distributed { "$name":
        #     ensure        => $ensure,
        #     check_command => $protocol ? {
        #         tcp => "check_tcp!${port}",
        #         udp => "check_udp!${port}",
        #     }
        # }

    }
Note that here you can choose between different implementations of the tool-specific module, so you are free to change the whole module used for a given monitoring tool by editing just these few files. For example, if you don't like the Example42 Nagios module, you can use the Camptocamp one just by changing the references in this file.
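To illustrate how a new tool would plug into the same scheme, here is a rough sketch for an imaginary tool called "mytool". The tool name, the file location, the configuration path and the file content are all invented for the example; only the parameter list and the $enable handling mirror the Nagios define above.

```puppet
# monitor/manifests/port/mytool.pp (hypothetical file for an imaginary tool)
define monitor::port::mytool (
    $target,
    $port,
    $protocol,
    $checksource,
    $enable
    ) {

    # Same enable/ensure mapping used by monitor::port::nagios
    $ensure = $enable ? {
        "false" => "absent",
        "no"    => "absent",
        "true"  => "present",
        "yes"   => "present",
    }

    # Assumption: mytool reads one small configuration file per check.
    # $checksource is accepted for interface parity but not used here.
    file { "/etc/mytool/checks/port-${name}.conf":
        ensure  => $ensure,
        content => "host=${target} proto=${protocol} port=${port}\n",
    }

}
```

The only other change would be one more branch in monitor/manifests/port.pp, testing $tool against /mytool/ and calling monitor::port::mytool with the same arguments as the existing branches. The modules that declare monitor::port would stay untouched.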
Note, incidentally, that the port check can be triggered either from the Nagios server or from the monitored host itself via NRPE, according to the value of the checksource parameter.
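As a small sketch of the second case, the declaration below asks for a local check. The resource title and the port number are assumptions chosen for illustration; the parameters are the ones accepted by the monitor::port define shown earlier.

```puppet
# Check a TCP port from the monitored host itself, through NRPE
monitor::port { "samba_tcp_445_local":
    protocol    => "tcp",
    port        => "445",
    target      => "${fqdn}",
    checksource => "local",
    enable      => "true",
    tool        => "nagios",
}
```

With these values the selector in monitor::port::nagios resolves check_command to "check_nrpe!check_port_tcp!localhost!445". Leaving checksource at its default ('remote') would give "check_tcp!445", run from the Nagios server.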
In order to manage per-site, per-module and per-role or per-host exceptions, the Example42 modules provide a fat but functional approach, generally managed in the params.pp class of each module.
You can enable or disable monitoring for all the modules at once, or module by module, by setting the value of some variables.
There are some "node-wide" variables you can set; their defaults are defined in the params.pp of each module:
$monitor_port (true|false) : Set if you want to enable port monitoring for the host.
$monitor_process (true|false) : Set if you want to enable process checking.
$monitor_target : Set the ip/hostname you want to use on an external monitoring server to monitor the host
These variables can be overridden on a per-module basis (needed, for example, if you want to enable process monitoring for some services but not all):
$foo_monitor_port (true|false) : Set if you want to monitor foo's port(s), if any. Default: As defined in $monitor_port
$foo_monitor_process (true|false) : Set if you want to monitor foo's process, if any. Default: As defined in $monitor_process
$foo_monitor_target : Define how to reach (IP, fqdn...) the host to monitor foo from an external server. Default: As defined in $monitor_target
Note that generally you don't have to care about them, as sensible defaults are set. It is important to note, though, that with the $monitor_target variable you can set HOW to reach the host to be monitored: by default it is its $fqdn, but on multihomed nodes you might want to reach it via an alternative IP or name (possibly defined from, or based on, a fact value in order to avoid manual settings).
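Putting the node-wide and per-module variables together, a hedged sketch for a multihomed host might look like the following. The hostname and the choice of the eth1 interface are assumptions; the values are written as quoted strings, as in the other examples of this article, so check each module's params.pp for the exact form it expects.

```puppet
# Hypothetical multihomed node: hostname and interface are assumptions.
node 'fileserver02.example42.com' {

    $monitor         = "yes"
    $monitor_tool    = "nagios"

    # Node-wide: no process checks by default ...
    $monitor_process = "false"

    # ... except for Samba ($foo_monitor_process with foo = samba)
    $samba_monitor_process = "true"

    # Reach this host through its second interface instead of $fqdn,
    # using a fact so that no address is hard-coded
    $monitor_target  = "${ipaddress_eth1}"

    include samba
}
```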
Finally note that for tools that imply a central monitoring node and a variety of nodes to check, we have introduced the possibility to define a "grouplogic" variable to automatically manage different monitoring servers according to custom groups of nodes.
Let's see how it works, for example for the Nagios module.
You just have to define a variable, $nagios_grouplogic, and set as its value the name of another variable you use to group your nodes. For example, you may want to have different Nagios servers according to custom variables such as zones, environments, datacenters etc. (e.g. $nagios_grouplogic = "env").
By default all the checks go to the same server (managed by the same PuppetMaster). If you set $nagios_grouplogic to the name of the variable you want to use as discriminator, you will have different Nagios servers, each monitoring the group of nodes that share the same value for that variable.
Note that you need to add your own variable name to the list below, if it is not already provided.
In nagios/manifests/params.pp you have:
    # Define according to what criteria you want to organize
    # what nodes your Nagios servers monitor
    $grouptag = $nagios_grouplogic ? {
        ''            => "",
        'type'        => $type,
        'env'         => $env,
        'environment' => $environment,
        'zone'        => $zone,
        'site'        => $site,
        'role'        => $role,
    }
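For instance, if you group your nodes with a custom variable that is not in the list, the selector needs one more entry. In the sketch below $datacenter is a hypothetical variable name chosen only for illustration.

```puppet
$grouptag = $nagios_grouplogic ? {
    ''            => "",
    'type'        => $type,
    'env'         => $env,
    'environment' => $environment,
    'zone'        => $zone,
    'site'        => $site,
    'role'        => $role,
    # Added entry: $datacenter is a hypothetical custom variable
    'datacenter'  => $datacenter,
}
```

Setting $nagios_grouplogic = "datacenter" would then group the checks by the value each node has for $datacenter.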
In nagios/manifests/service.pp the define nagios::service used to specify every Nagios service check (as we've seen before) is:
    define nagios::service (
        $host_name           = $fqdn,
        $check_command       = '',
        $service_description = '',
        $use                 = 'generic-service',
        $ensure              = 'present'
        ) {

        require nagios::params

        # Autoinclude the target host class
        include nagios::target

        # Set defaults based on the same define $name
        $real_check_command = $check_command ? {
            ''      => $name,
            default => $check_command
        }

        $real_service_description = $service_description ? {
            ''      => $name,
            default => $service_description
        }

        @@file { "${nagios::params::customconfigdir}/services/${host_name}-${name}.cfg":
            mode    => "${nagios::params::configfile_mode}",
            owner   => "${nagios::params::configfile_owner}",
            group   => "${nagios::params::configfile_group}",
            ensure  => "${ensure}",
            require => Class["nagios::extra"],
            notify  => Service["nagios"],
            content => template( "nagios/service.erb" ),
            tag     => "${nagios::params::grouptag}" ? {
                ''      => "nagios_service",
                default => "nagios_service_$nagios::params::grouptag",
            },
        }

    }
In nagios/manifests/init.pp (used only on the Nagios servers) you collect exported resources with:
    case $nagios::params::grouptag {
        "": {
            File <<| tag == "nagios_host" |>>
            File <<| tag == "nagios_service" |>>
        }
        default: {
            File <<| tag == "nagios_host_$nagios::params::grouptag" |>>
            File <<| tag == "nagios_service_$nagios::params::grouptag" |>>
        }
    }
This lets you automatically deploy different Nagios servers monitoring different groups of nodes according to custom variables.
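A minimal sketch of how this could look in practice, assuming nodes are grouped by an $env variable. The hostnames and the "prod" value are hypothetical, and $nagios_grouplogic is assumed to be set at top scope (for example in site.pp).

```puppet
# Group monitored nodes and Nagios servers by the value of $env
$nagios_grouplogic = "env"

# Hypothetical monitored node: its checks are exported with the
# tag nagios_service_prod
node 'web01.prod.example42.com' {
    $env          = "prod"
    $monitor      = "yes"
    $monitor_tool = "nagios"
    include samba
}

# Hypothetical Nagios server for the same group: it collects only the
# resources tagged nagios_host_prod and nagios_service_prod
node 'nagios.prod.example42.com' {
    $env = "prod"
    include nagios
}
```

A second Nagios server declared with a different $env value would collect only the checks exported by the nodes sharing that value.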
If you need to use a variable different from the ones already defined (type, env, environment, zone, site, role) just add a line to the selector shown in nagios/manifests/params.pp. Neat, isn't it?
In this article we have seen how to use Example42 modules to automatically monitor the resources included in Puppet-managed nodes, and how to add checks for mount points or for string patterns in URLs. We have also seen how all this is done through a layer of abstraction that makes it possible to introduce a new monitoring tool that reuses all the checks already present. Finally, we have seen how, de facto, systematic automatic monitoring implies a test-driven deployment: you find yourself checking what you want on your servers, and you can quickly see, for example on your Nagios server, what is up and running and what needs to be fixed.
Further work on the Example42 modules will go into supporting other monitoring tools, defining other sufficiently abstract monitor defines and, more generally, exploring ways to automate more specific and complete checks.