July 15, 2005
Trapping Windows Events with SNMP
When it comes to managing Windows-based systems, there is no greater
source of information than the native event-logging subsystem. Windows
and the applications that run on it use the event log repository to record
all kinds of significant system events, ranging from excessive user authentication
failures that may indicate a hack attempt, to time- and
directory-synchronization problems that underlie secondary application
problems, to filesystem-level errors that may indicate a particular drive
is on the verge of failure, and numerous other problems.
Unfortunately, trying to pull information out of multiple event logs
in a way that is both timely and usable can be difficult and convoluted.
In one common scenario, network administrators will try to integrate
the Windows event logs into a broader logfile-analysis toolkit that requires
all of the system messages to be transmitted to a central server for
string and context analysis. As a result of the complexity and overhead
of managing this kind of multi-system synchronization architecture, many
administrators limit this effort to specific servers (leaving
their workstations and secondary servers completely unsupervised),
or forgo the effort entirely.
There is a better way, however, and one which reuses SNMP technology
that is already bundled into Windows to generate lightweight alerts against
pre-selected events, thus providing the basis for a flexible and scalable
notification system that can work with existing network management tools.
Cumulatively, this means that network administrators can use the built-in
alert system and an SNMP management station to trap critical events and
automatically respond to them as soon as they happen, with the only additional
requirements being the desired features of the network management console
that is used to monitor and respond to the event traps.
The Pieces
All of the 32-bit Windows versions come with an SNMP agent that has
the ability to generate explicit SNMP trap messages from any of the discrete
Windows event messages that can be logged. However, the component pieces
to get this working are not visible by default, and some of them are
entirely undocumented. In this primer, we'll walk through these components
and explain how to go about generating and trapping the specific SNMP
trap messages that you may be interested in.
The first part of the puzzle is the SNMP agent itself, which is bundled
with Windows, but isn't installed by default. Once this component is
installed, it also has to be configured with basic SNMP settings such
as the trap destination(s), the community string(s) to use, and other
kinds of site-specific SNMP details. Once the basic SNMP agent is configured,
you then need to delve into the event agent components, which is where
you actually define the event messages that you want to capture and retransmit
as SNMP messages.
The next part involves the actual generation of the system events.
Although Windows will log several thousand events of its own accord out-of-the-box,
some applications and subsystems require additional configuration tweaks
before the desired events will be generated in the first place. For example,
Windows does not log login events until it is told to do so, while some
third-party applications may require some kind of syslog-to-eventlog
proxy agent before their events can be captured.
At the other end of the wire, some kind of SNMP management station
is also needed to receive and process the alerts that are generated.
There are dozens of such products on the market today, ranging in price
from a few hundred dollars to tens of thousands of dollars, offering
different features for event disposition, escalation and automation.
You may even have an SNMP management station already that you just don't
know about - the system-management products included with IBM and HP
servers have SNMP capabilities, for example - but finding a workable
platform isn't too difficult if you don't.
The last major component of this architecture is extending the management
console to support the Windows event traps. Most SNMP management
systems provide some kind of option to compile Management Information
Base (MIB) files, so this isn't very difficult in principle. However,
the Windows SNMP event agent does not include a pre-built MIB (the reasons
for this will become clear later in this primer), and you will need to
manually construct this MIB such that it specifically reflects the events
that you want to trap. For example, if you want to trap DNS-related
events, you will need to construct the MIB file so that those events
are accurately recognized, and then import and compile the resulting
MIB in your management system.
Once these components are configured and operational, it's effectively
possible to generate structured SNMP traps for almost any event that
can be logged, and for your SNMP management station to capture and react
to these events according to its capabilities. Having said that, however,
it's also important to recognize that there are some significant caveats
with this overall approach, and these are also discussed later in this
primer.
Generating Windows Events
By default, Windows systems will log most system-level events on their
own without any further administrative action being required. However, some
events are only generated after a related set of system "auditing" features
is enabled. For example, if you want to generate events
and traps whenever logon operations fail, you will need to enable the "Audit
logon events" option in the appropriate Windows policy editor. Similarly,
if you want to generate events and traps whenever a system is rebooted,
you'll need to enable the "Audit system events" option in the same policy
editor.
You can define these kinds of policy settings on a per-system basis
by using the "Local Security Settings" applet in the "Administrative
Tools" program group, or can enable them on a domain-wide basis by using
one of the policy editors available for your server operating system.
Figure 1 shows what the domain-wide settings look like inside the Local
Security Settings applet (the padlock and domain indicators show that
these are domain-wide settings that cannot be overridden locally).
Some events simply cannot be trapped in the Windows event log without
the use of external tools. For example, programs that write event data
to their own specific logfile and don't use the Windows event system
cannot be integrated with SNMP traps unless you use a third-party tool
to stuff the logfile entries into the event log, while open source applications
that have been ported to Windows often rely on a "syslog" interface that
may require a local proxy. Even in these cases, however, you still may
not be able to trap individual events, with the overall functionality
being dependent upon whether or not the external program is able to generate
discrete system events for different kinds of entries.
But in general, if you are able to cause events to be stored into one
of the Windows event logs, then you should be able to generate SNMP traps
from those entries.
Managing the Event-to-Trap Mappings
Once your systems are generating the relevant events, you have to instruct
the Windows SNMP event subsystem to generate SNMP traps for each of the
desired events.
To manage the event-to-trap mappings, you have three basic choices.
The easiest tool to use is evntwin.exe, which provides a graphical list
of all the registered events, and lets you choose the ones that you want
to map to SNMP traps. This program isn't linked into any of the default
program groups, so you'll have to type in the name from the "Run" menu
or a command prompt. By default, evntwin starts in a "view" mode, and
you'll need to click the "Edit" button to actually manage the available
events and their corresponding SNMP traps.
Once the "edit" view is open, you'll see something similar to what's
shown in Figure 2. The top half of the screen contains a list of all
the events that are already configured, while the bottom left shows a
list of the available event logs and their subordinate event sources,
and the bottom right shows the events that are known for each of the
sources. Double-click an event that you want to monitor, and another
dialog opens to let you set any rate-limiting options that you may need
(more on this later). If you choose the "OK" button the event will get
added to the top window list, while the "Cancel" button does what you'd
expect. The buttons along the right side of the main window also allow
you to set these same options, as well as some global options. You can
remove entries from the top list by deleting them directly, or by using
the right-mouse menu options. Once the list of desired events is constructed,
you can export and import the list among multiple systems.
There is also a bare-bones, command-line utility called evntcmd.exe,
but it is only really suitable for importing configuration settings into
a system. However, if you have already configured the list of events
that you want to trap somewhere else, you can use evntcmd to import
the list into other systems through a network logon script or some kind
of shell interface like SSH. The evntcmd utility also has the ability
to write to a remote system's registry, allowing you to push configuration
settings down to a node immediately if you don't want to wait for it
to execute the utility itself.
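As a rough sketch of what such an import list might look like, the following
Python fragment builds the text of a configuration file for evntcmd. It assumes
the "#pragma ADD" and "#pragma ADD_TRAP_DEST" directive syntax described in
Microsoft's evntcmd documentation. The function name, the community string
"public", the address 192.0.2.10 and the chosen event are illustrative
placeholders only, and the output should be compared against a list exported
from evntwin before it is pushed to any system.

```python
def evntcmd_config(events, trap_dests=()):
    """Build the text of an evntcmd import file.

    events     -- iterable of (log, source, event_id, count, period) tuples
    trap_dests -- iterable of (community, host) tuples
    The directive syntax is an assumption taken from Microsoft's
    evntcmd documentation, not from this article.
    """
    lines = []
    for community, host in trap_dests:
        lines.append(f"#pragma ADD_TRAP_DEST {community} {host}")
    for log, source, event_id, count, period in events:
        # Quote the source name, since source names can contain spaces.
        lines.append(f'#pragma ADD {log} "{source}" {event_id} {count} {period}')
    return "\n".join(lines) + "\n"


# Placeholder values: trap the logon failure event (529) from the Security
# log. A count of 1 and a period of 0 are assumed to mean "no rate limiting".
config_text = evntcmd_config(
    events=[("Security", "Security", 529, 1, 0)],
    trap_dests=[("public", "192.0.2.10")],
)
print(config_text, end="")
```

The resulting text would be saved to a file and handed to evntcmd as its
file name argument on each target system.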
Both utilities are really just front-ends to the registry, and the underlying
registry settings can also be manipulated directly with other tools if you
prefer (such as policy-manager extensions or any other registry-editing tool).
Either way, once you have configured the events that you want to trap,
you'll need to restart the SNMP service in order for the changes to be
recognized. Once that's done, the event monitoring subsystem will wait
for the selected events to fire, and will then shoot off a relevant SNMP
trap to the specified trap destination(s).
The SNMP Trap Structure
By far, the most complicated part of this process comes from defining
the SNMP MIB data that your management station needs to properly handle
the events. Part of this complexity is due to the fact that Microsoft
doesn't document the SNMP trap format, and also because the traps use
a free-form model that is not entirely predictable. As such, making the
whole system work depends in large part on your willingness to poke around
inside the SNMP traps.
A sample MIB file that traps a handful of events is available for download
from http://www.ehsco.com/software/snmp/EVNTAGENT-MIB.mib,
and illustrates the kind of information that has to be filled-in by the
administrator. We suggest that readers download this MIB file and use
it as a reference throughout the remainder of this discussion, as some
points are best understood by studying the example.
To start with, the default base OID for the SNMP traps is defined as "1.3.6.1.4.1.311.1.13.1",
and all of the OID sequences in the event agent will use this base OID
value (this can be overridden by changing the "BaseEnterpriseOID" registry
key value if needed, although this should not be necessary). The "1.3.6.1.4.1" sequence
is the "enterprise" branch of the public OID hierarchy, while
"311" is the OID assigned to Microsoft Corporation, and "1" is the OID
that Microsoft uses for "software". The "13.1" OID pair represents the
event log messages that are sent as SNMP traps, although there is no
known authoritative reference for these OID values, and Microsoft did
not provide definitive names for these values when asked (we have unilaterally
defined them as "eventlog" and "evntagent" respectively, but they could be
anything).
The Event Traps
The SNMP traps have additional OID values under the base OID that identify
the named event source for the canonical Windows event. Specifically,
these OID sequences indicate the length of the event source name, and
also carry the ASCII values of each letter from that name. For example,
events from the "DNS" source will have the OID sequence of "3.68.78.83" under
the base OID described above, where "3" indicates that there are three
characters in the name of the event source, with the ASCII decimal values
of "68" ("D"), "78" ("N"), and "83" ("S") respectively. Along these same
lines, events from the "Security" event source are identified by the
OID sequence of "8.83.101.99.117.114.105.116.121", and so forth.
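The encoding described above is simple enough to reproduce in a few lines. The
following Python sketch (the function name is our own, and it assumes the event
source name contains only ASCII characters) returns the length of the name
followed by the decimal ASCII value of each character:

```python
def encode_event_source(source):
    """Return the OID fragment for an event source: the length of the
    name, then one value per character holding its ASCII code."""
    codes = source.encode("ascii")  # raises UnicodeEncodeError on non-ASCII names
    return ".".join(str(n) for n in (len(codes), *codes))


print(encode_event_source("DNS"))       # 3.68.78.83
print(encode_event_source("Security"))  # 8.83.101.99.117.114.105.116.121
```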
The last OID in the full sequence indicates the canonical Windows event
that was fired. Sometimes this OID value mirrors the event number, but
most of the time it is a calculated value of some kind. For example,
the explicit OID for "logon failure" is "529", which is the same value
as the event identifier for the canonical event itself. On the other
hand, the explicit OID for the NTP synchronization success event is "1113194531",
which is nothing at all like the canonical Windows event identifier.
Because of this vagary, you will likely need to use some kind of network
analyzer in order to determine which exact OID value will be generated.
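Once an identifier has been captured, it can be taken apart mechanically. The
Python sketch below is only an illustration: it assumes the identifier has been
written out as a single dotted string in the form described above (the default
base OID, then the encoded source name, then one trailing event value), and the
function name is our own. A tool that displays traps in some other layout would
need its output adjusted before being passed in.

```python
BASE_OID = "1.3.6.1.4.1.311.1.13.1"  # default base OID given in this article


def decode_trap_oid(oid, base=BASE_OID):
    """Split a trap OID into (event source name, trailing event value)."""
    prefix = base + "."
    if not oid.startswith(prefix):
        raise ValueError("OID is not under the event agent base OID")
    values = [int(part) for part in oid[len(prefix):].split(".")]
    length, rest = values[0], values[1:]
    if len(rest) != length + 1:
        raise ValueError("expected the source name followed by one event value")
    # Each value after the length is the ASCII code of one character.
    source = "".join(chr(code) for code in rest[:length])
    return source, rest[length]


print(decode_trap_oid(BASE_OID + ".8.83.101.99.117.114.105.116.121.529"))
# ('Security', 529)
```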
Most MIBs require naming contexts, but Microsoft does not provide any
kind of naming or guidance here, so you will have to come up with your
own. While most MIBs map single OID values to a logical name, this doesn't
work with the approach that Microsoft has taken, and you will instead
need to map a sequence of relative values to a single name in order to
manage categories. For example, you can define the relative OID sequence
of "3 68 78 83" (without the dot-separators) as "w32Dns" (or something
similar), and then define discrete child OID values with their own
trap names. We have tried to be flexible and predictable here, using
names like "w32LogonFailure" to indicate login failure errors, and we
would encourage others to behave similarly in case their definitions
leak out to the external world.
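To illustrate the category mapping, the following Python sketch prints an
OBJECT IDENTIFIER assignment for an event source. The naming rule (a "w32"
prefix followed by the capitalized words of the source name) is only our guess
at a convention modeled on "w32Dns", the parent name "evntagent" is the name we
chose for the last value of the base OID, and the layout of the downloadable
sample MIB may differ from what is generated here.

```python
import re


def mib_node_name(source, prefix="w32"):
    """Derive a name such as 'w32Dns' from an event source name."""
    words = re.findall(r"[A-Za-z0-9]+", source)
    return prefix + "".join(word.capitalize() for word in words)


def mib_source_definition(source, parent="evntagent"):
    """Return an OBJECT IDENTIFIER assignment for one event source, using
    the space-separated relative OID sequence (length, then ASCII codes)."""
    codes = source.encode("ascii")
    sequence = " ".join(str(n) for n in (len(codes), *codes))
    return f"{mib_node_name(source)} OBJECT IDENTIFIER ::= {{ {parent} {sequence} }}"


print(mib_source_definition("DNS"))
# w32Dns OBJECT IDENTIFIER ::= { evntagent 3 68 78 83 }
```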
The Trap Details
The SNMP trap data itself is provided as an enterprise-specific alert,
using the OID value of 9999 after the base OID value described above.
Every trap message has at least five sub-fields, while some of them can
provide a dozen or more additional event-specific variables. The five
fields that are always present are the textual event message, the user
ID of the process that triggered the event, the computer name of the
event system, a numeric representation of the event "type", and a numeric
representation of the event "category", in that order. We have named
these as "eventText", "eventUserId", "eventSystem", "eventType" and
"eventCategory" respectively in our sample MIB.
Note that the event "type" value indicates whether the SNMP trap carries
an error, a warning, an information message, an audited success event,
or an audited failure event. Meanwhile, the event "category" values
are the same as the categories that are available in
the Event Viewer for filtering purposes, except they have a numeric value
instead of a textual representation (events from the "login/logout" category
have a numeric value of "2", for example). Finally, note that the event-specific
variable data changes for each event (for example, authentication events
typically provide information about the user account, the authentication
domain, the security provider, and so forth), and will mirror the structure
of the canonical event. Since the event-specific variable data is so
unpredictable, it is best to define it that way, and in our case we have
created MIB definitions for "eventVar1" through "eventVar20" just to
catch them all.
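The field layout described above can be sketched as a small labeling routine.
This Python fragment assumes the trap values have already been extracted, in
order, by whatever SNMP tool is in use. The function name is our own, and the
sample values are invented for illustration rather than captured from a real
trap.

```python
FIXED_FIELDS = ("eventText", "eventUserId", "eventSystem",
                "eventType", "eventCategory")


def label_varbinds(values):
    """Pair trap values, in the order received, with the names used in
    the sample MIB: five fixed fields, then eventVar1, eventVar2, ..."""
    if len(values) < len(FIXED_FIELDS):
        raise ValueError("an event trap carries at least five values")
    labeled = dict(zip(FIXED_FIELDS, values))
    for index, value in enumerate(values[len(FIXED_FIELDS):], start=1):
        labeled[f"eventVar{index}"] = value
    return labeled


# Invented sample values, not taken from a real trap.
sample = ["Logon Failure", "SYSTEM", "WKSTN01", 16, 2, "jdoe", "EXAMPLE"]
for name, value in label_varbinds(sample).items():
    print(f"{name}: {value}")
```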
Overall, this may seem like a goofy design model, but it makes some
sense when you consider the open-ended nature of the Windows event subsystem.
New event logs, sources and canonical events can be defined at will in
the Windows logging model, so some kind of extensible model had to be
used for the SNMP traps as well (and preferably one which did not require
developers to register their logging extensions with Microsoft). This
model achieves that goal, but with the unfortunate side effect that administrators
have to do some legwork if they want to trap a variety of events from
a variety of different sources. Alternatively, Microsoft could have provided a
MIB file that defined all known Windows events, but it would be huge
(there are thousands of discrete events), and it would not easily accommodate
extensions.
An example SNMP trap is shown in Figure 3, using the w32LogonFailure
event discussed above. In that example, we used MG-SOFT's Trap Ringer
software to compile the MIB definition, and then pointed the systems
on our LAN to that server. We successfully used the same MIB with IBM
Director as well. Both of these tools allowed us to associate actions
with these events, such as paging a manager when login failures were
detected on one of the monitored systems.
Caveats Galore
Overall, this mechanism is extremely useful for monitoring the systems
on our network for a variety of trouble indicators. For example, we can
monitor for Service Control Manager events that indicate a service has
crashed or has refused to start. Similarly, we can monitor for NTP synchronization
problems among our different servers, and for filesystem errors that
indicate a disk error may be coming. We can also be alerted to login
attempts, and notified when an event log has been purged, among many
other potential security considerations. Best of all, this is all taken
care of through our existing management systems, and we don't need to
manage secondary systems for the exclusive purpose of managing event
logs in particular.
However, not everything is rosy with this model, and there are some
areas of concern. For one thing, Microsoft has stated that the alerting
mechanism won't always fire, or that it may be slow sometimes (essentially,
the events aren't always trapped immediately). Also, some events will
fire multiple times, and those have to be managed a little differently
(this is what the rate-limiting options in evntwin.exe are provided for).
One of the more annoying factors here is that different systems will
behave differently, making it hard to get a universally-applicable solution
in place. For example, Windows XP will trap the "Shutdown" security audit
events, but doesn't trap the corresponding "Startup" events, while Windows
Server 2003 does the exact opposite. [update 10/31/2005: Shaun Skillin has
pointed out that the eventcreate tool can be used in the startup and
shutdown scripts to overcome some of these annoyances.]
We've also encountered some problems with very large OID values. Although
the SNMP specifications state that these values are unsigned 32-bit integers,
some management systems insist on treating them as signed values, so
some of the high-numbered OIDs are not recognized correctly.
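The effect is easy to demonstrate. The following Python sketch (the function
name and the second test value are our own, chosen purely for illustration)
shows what happens when an unsigned 32-bit value is reinterpreted as a signed
one: anything at or above 2147483648 comes out negative, which is why a
management system that makes this mistake fails to match the OID.

```python
def as_signed_32(value):
    """Reinterpret an unsigned 32-bit value as a signed 32-bit integer,
    the way a parser that assumes signed values would read it."""
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError("value does not fit in 32 bits")
    return value - 0x100000000 if value > 0x7FFFFFFF else value


print(as_signed_32(1113194531))  # 1113194531 (below 2**31, so unaffected)
print(as_signed_32(3221225472))  # -1073741824 (hypothetical high-numbered value)
```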
There are also many people who have ongoing security concerns with
SNMP and its use of unencrypted community strings. In particular, by installing
SNMP on each of the managed nodes, we are potentially exposing a tremendous
amount of information that we would rather keep private. This isn't much
of a problem for our internal network resources, but we certainly appreciate
the concerns that people have here, and share in some of them.