.\" $Id: phl.troff 1.1 Wed, 19 Feb 2003 22:46:50 -0500 trockij $
.li 0
.au "http://www.kernel.org/software/mon/, $ Revision: 1.5 $"
.sp 5
.c The Design and Operation of "mon"
.sp 3
.ce 999
Jim Trocki
Transmeta Corporation
.T trockij@transmeta.com
.T http://www.kernel.org/software/mon/
.ce 0
.\" =============================================================
.bp
.1 "Overview"
.2 "What is `mon'?"
.2 "Server responsibilities"
.2 "Monitor responsibilities"
.2 "Alerts"
.2 "Alert management"
.\" =============================================================
.bp
.1 "Overview (cont'd)"
.2 "Clients and their function"
.2 "Configuration details and examples"
.2 "Example extensions"
.2 "Interesting applications"
.2 Experience
.\" =============================================================
.bp
.c What is "mon"?
.1 "mon" is a tool for monitoring the availability of services and applications.
.2 Used by NOCs and IT staff for fault detection and alert management. For example:
.3 Send an alphanumeric page to NOC staff when routing breaks
.3 Submit a trouble ticket when an application becomes inoperable
.3 Record the routing history between HQ and a branch office, send notification when the path changes
.2 Written in Perl
.2 Distributed under GNU General Public License v2
.\" =============================================================
.bp
.c Features of mon
.2 Portable (thanks to Perl)
.3 Linux, Solaris, BSD, Cygwin, ...
.2 Simple yet very adaptable design
.2 Can monitor anything, no clients required
.2 Configurable, extensible
.2 Supportive community, mailing list mon@linux.kernel.org
.\" =============================================================
.bp
.c Design Goals of "mon"
.2 Simple to add alerts and monitors
.2 Simple way of cross-connecting tests and alerts
.2 Simple way of gathering data for report generation
.2 General-purpose: if you can test it with software, you can monitor it
.\" =============================================================
.bp
.c Components
.2 Server
.3 Schedules monitors, handles traps, alerts, clients, logs.
.2 Clients
.3 Query and control the server
.2 Monitors
.3 Communicate with monitored systems via HTTP, SNMP, etc.
.2 Traps
.3 Send notifications from remote systems
.2 Alerts
.3 Perform actions on failures: page, email, etc.
.\" =============================================================
.PSPIC architecture.eps 6i 6i
.\" =============================================================
.c Server Responsibilities
.2 Schedule tests
.3 "Run monitors when necessary"
.3 "Gather output and exit status"
.2 Accept remote traps
.2 Serve clients
.3 "Deliver operational status"
.3 "Accept control commands"
.2 Manage Alerts
.3 "Suppress repetitive alerts"
.3 "Alert only during specified time periods"
.3 "Evaluate dependencies"
.\" =============================================================
.bp
.c Configuration File
.2 "Cross-connects" monitors to alerts
.3 "Think of telephone switchboard or patch panel"
.3 "Any monitor can be wired to any alert"
.2 Defines what is to be monitored and how
.3 "What monitors are to be used"
.3 "What hosts are to be monitored"
.3 "What services are to be monitored"
.2 Defines when alerts happen
.3 "On which failure"
.3 "On which failure->success transition"
.3 "Frequency"
.3 "Time of day"
.\" =============================================================
.bp
.c Monitors
.2 Test the condition of a service
.3 "Usually one service test per monitor"
.3 "Tests are user-definable"
.3 "SNMP, HTTP, SMTP, ICMP echo, etc."
.3 "Application-level tests possible"
.2 Report summary and detailed results
.2 Exit status reports success/failure
.\" =============================================================
.bp
.c Monitors cont'd
.2 Written in an arbitrary language
.3 "Most are in Perl, /bin/sh"
.3 "May call third-party software"
.3 "No binary linkage with mon itself"
.3 "Independent from the mon server"
.2 Invoked as separate processes
.3 "Many may be run in parallel"
.3 "Hundreds may run in a minute"
.2 Short-lived
.3 "Start, test, report, exit"
.3 "Helps minimize impact of memory leaks"
.2 Simple to write
.\" =============================================================
.bp
.c Many Available Monitors
.1 "Numerous server tests"
.2 "http, lpd, smtp, ldap, imap, pop3, telnet, dns, disk quotas, netware"
.2 "msql, mysql, oracle, postgres, informix, sybase"
.2 "reboot, processes, rpc, clock, disk space, RAID"
.2 "Brocade fcal switches, traceroutes, router interfaces, ipsec tunnels, Foundry router chassis, bgp, RADIUS"
.2 "Compaq chassis, NT services, samba, printers"
.\" =============================================================
.bp
.c Traps
.2 Traps are notifications sent to a mon server from an external entity
.3 "another mon server"
.3 "a stand-alone probe"
.2 Contain the same information as passed by monitor scripts
.3 "summary"
.3 "detail"
.3 "exit status"
.2 Allow distributed mon agents to send their status to a centralized mon server
.\" =============================================================
.bp
.c Alerts
.2 Report the failure status detected by a monitor
.2 Independent from the mon server
.2 Accept input from the mon server
.2 Invoked as separate processes
.2 Written in any language
.2 Simple to write
.\" =============================================================
.bp
.c Available Alerts
.2 E-Mail
.2 "SNPP (alphanumeric paging via TCP/IP)"
.2 "Qpage (alphanumeric paging via modem and TAP/IXO)"
.2 "Trap to other mon server"
.2 AIM
.2 Bugzilla
.2 GNATS
.2 "HP Openview"
.2 SMS
.2 WinPopup
.2 "NetApp snap delete"
.\" =============================================================
.bp
.c Alert Management
.2 "Alert decision logic in the server"
.2 "Squelch repetitive alerts"
.3 "time period"
.3 "alertafter num"
.3 "alertafter num timeval"
.3 "alertafter timeval"
.3 "alertevery"
.3 "numalerts"
.2 "Dependencies"
.3 "If router is down, don't alert for unreachable things beyond it"
.3 "A simple first pass at root-cause analysis"
.3 "Dependencies are Perl expressions"
.\" =============================================================
.bp
.c Time::Period Specifications
.1 Time::Period by Patrick Ryan
.2 Returns true or false depending on whether a time(2) value falls within a given period
.2 scale {range [range ...]}
.3 "scales: yr, mo, wk, yd, md, wd, hr, min, sec"
.3 "ranges: Mon-Fri, 1-365, 9am-5pm, ..."
.2 Examples
.3 "wd {Sun-Sat}"
.3 "wd {Mon-Fri} hr {9am-4pm}"
.3 "wd {Mon Wed Fri} hr {9am-4pm}, wd {Tue Thu} hr {9am-2pm}"
.3 "sec {0-4 10-14 20-24 30-34 40-44 50-54}"
.\" =============================================================
.bp
.c Clients
.2 "mon" protocol, port 2583 registered with IANA
.2 "Easy Perl interface, Mon::Client"
.2 "Get operational status of things monitored"
.2 "Disable/enable monitoring and alerting"
.2 "Acknowledge alerts sent"
.2 "Allows for many reports"
.\" =============================================================
.bp
.c Example clients
.2 "Multiple WWW interfaces"
.3 mon.cgi
.3 monshow
.3 minotaur.cgi
.3 "Big Brother facade"
.2 Command-line
.2 WAP
.2 2-Way pager
.2 "dtquery" query tool and report generator
.\" =============================================================
.bp
.c Simple Configuration Example
.1 Send email when any web server becomes unpingable:
.bc
hostgroup webservers www1 www2 www3 www4

watch webservers
    service fping
        monitor fping.monitor
        interval 1m
        period wd {Sun-Sat}
            alert mail.alert trockij
            alertevery 24h
            upalert mail.alert trockij
.ec
.\" =============================================================
.bp
.c Complex Example
.bc
watch webserver.corp.com
    service fping
        monitor fping.monitor
        interval 1m
        period P1: wd {Sun-Sat}
            alert mail.alert trockij
            alertevery 12h
            upalert mail.alert trockij
        period P2: wd {Sun-Sat}
            alert mail.alert trockij-pager
            alertevery 24h
            alertafter 3 10m
        period P3: wd {Mon-Fri} hr {7am-10pm}
            alert mail.alert daytime-staff
            alertevery 4h
    service http
        monitor http.monitor
        interval 2m
        depend SELF:fping
        period wd {Sun-Sat}
            alert mail.alert
            alertafter 10m
            numalerts 1
.ec
.\" =============================================================
.bp
.c Escalation using Multiple Periods
.bc
watch webserver.corp.com
    service fping
        monitor fping.monitor
        interval 1m
        period P1: wd {Sun-Sat}
            alert mail.alert trockij
            alertafter 3
            numalerts 1
        period P2: wd {Sun-Sat}
            alert qpage.alert trockij
            alertafter 6
            numalerts 1
        period P3: wd {Sun-Sat}
            alert call911.alert
            alertafter 12h
            alertevery 24h
.ec
.\" =============================================================
.bp
.c Making Monitors
.2 "Monitors are simple"
.3 "expect a list of items to poll from @ARGV"
.3 "some standard environment variables are set (MON_LOGDIR, etc.)"
.3 "perform tests on items"
.3 "first line of output is the summary line"
.3 "remaining lines are the detail (not interpreted)"
.3 "exit status is zero on success, nonzero on failure"
.\" =============================================================
.bp
.c Example Monitor
.1 Detect non-operational mountd on NFS servers:
.bc
#!/usr/bin/perl
my @failed;
my $detail;
# Each argument is an NFS server to probe.
foreach my $item (@ARGV) {
    my $output = `showmount -e $item 2>&1`;
    # A nonzero exit status from showmount marks the server as failed.
    if ($?) {
        push @failed, $item;
        $detail .= "$item failed:\\n$output\\n";
    } else {
        $detail .= "$item ok:\\n$output\\n";
    }
}
# Summary line first (the failed servers), then the detail.
print join (" ", @failed), "\\n";
print $detail;
@failed == 0 ?
exit 0 : exit 1;
.ec
.\" =============================================================
.bp
.c Making Alerts
.2 "Alerts are even simpler than monitors"
.3 "@ARGV has some options supplied by server"
.3 "rest of @ARGV is from the config file"
.3 "first line of stdin is summary"
.3 "rest is detail"
.3 "perform whatever action desired"
.\" =============================================================
.bp
.c Example Alert
.1 Send email:
.bc
#!/usr/bin/perl
# First line of stdin is the summary.
chomp (my $summary = <STDIN>);
my $to = join (",", @ARGV);
open (MAIL, "| /usr/lib/sendmail -oi -t") || die;
print MAIL <<EOF;
To: $to
Subject: $summary

EOF
# The rest of stdin is the detail.
print MAIL <STDIN>;
close (MAIL);
.ec
.\" =============================================================
.bp
.c Example Client
.bc
#!/usr/bin/perl
use Mon::Client;
my $cl = new Mon::Client (host => "mon-bd2");
$cl->connect;
my %s = $cl->list_opstatus;
$cl->disconnect;
foreach my $var (keys %{$s{"server"}->{"service"}}) {
    print "$var=$s{server}->{service}->{$var}\\n";
}
.ec
.\" =============================================================
.bp
.c Parallelization
.1 Parallelization is handled using two methods:
.2 Monitors are parallel processes
.3 Each "service" process runs independently
.3 Leverages multiprocessing architectures
.2 Monitors should parallelize their own checks
.3 Minimize serialization delay when checking large numbers of items
.3 fping.monitor operates asynchronously
.3 phttp.monitor operates asynchronously
.\" =============================================================
.bp
.c Interesting Applications 1
.1 "Simple home-brew failover"
.2 "Several web servers"
.2 "Each with eth0 admin and eth0:0 virtual addr"
.2 "eth0:0 addresses are published as DNS A records"
.2 "mon server polls http servers"
.2 "On failure, 'failover.alert' sshes to a secondary server and brings up the dead virtual IP on eth0:1"
.\" =============================================================
.bp
.c Interesting Applications 2
.1 "Adding on-call schedule support"
.2 "Alert uses Schedule::Oncall module"
.2 "No changes to the server are needed"
.2 "Sends mail to the person on call"
.2 "Optionally sends an alphanumeric page as well"
.2 "Now mon supports on-call schedules!"
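.\" =============================================================
.bp
.c Sketch: Alert Input in /bin/sh
.1 A minimal sketch of the stdin contract from "Making Alerts", not code shipped with mon.
.2 format_alert is a hypothetical helper name
.2 It assumes only what the slide states: first line of stdin is the summary, the rest is detail
.2 Options that the server places in the argument list are ignored here
.bc
```shell
#!/bin/sh
# Sketch only: format_alert is a hypothetical helper, not part of mon.
# It consumes an alert's standard input as laid out on the
# "Making Alerts" slide: first line is the summary, the rest is detail.
format_alert () {
    IFS= read -r summary
    echo "SUMMARY: $summary"
    # Indent each detail line to set it apart from the summary.
    sed 's/^/    /'
}
```
.ec
.2 One way to use it: end the script with format_alert piped to logger(1) or appended to a log file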
.\" =============================================================
.bp
.c Interesting Applications 3
.1 "Debugging WAN"
.2 "Traceroute monitor"
.2 "Show when path changes"
.2 "Record history of traces"
.2 "Call ISP with evidence rather than speculation"
.\" =============================================================
.bp
.c Interesting Applications 4
.1 "Print queues jamming"
.2 "Clumsy, unreliable printers; need to tune lprng"
.2 "Catch them when they jam so data can be collected"
.2 "Shows when a queue is making no progress because of paper or toner deficit"
.\" =============================================================
.bp
.c Interesting Applications 5
.1
.c "Hierarchical Monitoring System"
.PSPIC hierarchical.eps 6.5i 6.5i
.\" =============================================================
.bp
.c Interesting Applications 6
.1 mon-syslog
.PSPIC mon-syslog.eps 4i 5i
.\" =============================================================
.bp
.c Interesting Applications 7
.1 dtquery
.2 CGI-based tool, mon client
.2 queries mon downtime logs for specific downtime events
.2 on specific hosts/groups/services
.2 during specified date ranges
.2 supplies graphs summarizing the results
.\" =============================================================
.bp
.c Experience
.2 "Useful as a debugging tool"
.3 "Whip up custom monitors for debugging"
.3 "Logs help investigation of past events"
.3 "Identify that a disaster has been resolved"
.2 "If it failed twice before, write a monitor"
.2 "Helps keep admins in tune with system problems"
.2 "Admin team knows problems before users report them"
.\" =============================================================
.bp
.c Hints
.2 Take time to tune alerts to maintain your sanity
.2 Monitor only what you care about, not everything
.2 That is, keep it simple and digestible
.2 Use alphanumeric paging via a modem if monitoring networks
.2 Post your monitors and alerts to the mailing list!
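.\" =============================================================
.bp
.c Sketch: A Monitor in /bin/sh
.1 A minimal sketch of the monitor contract from "Making Monitors", not code shipped with mon.
.2 run_monitor and check_item are hypothetical names
.2 check_item is a stand-in test (does the path exist?); substitute a real probe
.2 Summary line lists the failed items, detail follows, status is zero only if every item passes
.bc
```shell
#!/bin/sh
# Sketch only: run_monitor and check_item are hypothetical names,
# not part of mon.

# Stand-in test, assumed for illustration: an item passes if the
# path exists. Replace this with a real probe.
check_item () {
    [ -e "$1" ]
}

run_monitor () {
    failed=""
    for item in "$@"; do
        check_item "$item" || failed="$failed $item"
    done
    # First line of output: the summary (names of failed items).
    echo $failed
    # Remaining lines: one detail line per item.
    for item in "$@"; do
        case " $failed " in
            *" $item "*) echo "$item failed" ;;
            *)           echo "$item ok" ;;
        esac
    done
    # Status is zero only when nothing failed.
    [ -z "$failed" ]
}
```
.ec
.2 To use it as a monitor, finish the script by calling run_monitor with the script's arguments and exiting with its status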
.\" =============================================================
.bp
.sp 5
.c The Design and Operation of "mon"
.sp 3
.ce 999
Jim Trocki
Transmeta Corporation
.T trockij@transmeta.com
.T http://www.kernel.org/software/mon/
.ce 0