Storage event reporting and monitoring - PoC

Introduction

In the previous blog post we presented a proposal for reporting and monitoring storage-related events using journald and structured logging. To test whether the proposal is viable, we need a proof of concept. Such a PoC should demonstrate the complexity of the proposed solution as well as the sufficiency of the proposed set of stored (logged) items and the catalog entry.

Out of all the storage subsystems like LVM, Btrfs,... and MD RAID, only the last one provides a monitoring tool that can run an arbitrary executable when an event occurs, passing it information about the event as arguments. MD RAID is thus the logical choice for implementing a PoC tool that reports such events using structured logging.

Reporting MD RAID events

The monitoring tool for MD RAID is actually the well-known mdadm tool, just run with the --monitor option. There is also the systemd service mdmonitor.service [1] that makes sure the monitoring is started as part of the boot process, but only when the configuration file /etc/mdadm.conf exists and specifies either which tool to run or where to send emails with the event reports. This is described in the mdadm(8) man page (in the Monitor mode section), together with specifications of all the events that can be reported.

The program specified by the PROGRAM directive in /etc/mdadm.conf is run with two or three arguments for every event. The first one is the name of the event, the second one is the path of the MD RAID array the event is reported for, and in some cases the third argument specifies a member device the event is related to (e.g. when a member device is marked as failed).
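
For illustration, the relevant part of /etc/mdadm.conf hooking up such a reporting tool could look like this minimal sketch (the path of the tool is just an assumed example):

# have mdadm --monitor run our reporting tool for every event
PROGRAM /usr/local/bin/md_report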

Here is a trivial implementation of the code needed to handle the events:

/* argv[1] is the event name, argv[2] the MD array and argv[3] (optional)
   the member device; error () comes from the GNU <error.h> */
const char *event, *md_dev, *member = NULL;

if (argc < 3)
    error (1, 0, "Not enough arguments given!");

event = argv[1];
md_dev = argv[2];

if (argc > 3)
    member = argv[3];

Once the program has the data it can get from mdadm, it needs to derive some more in order to have everything required to report the event using structured logging with all the fields specified by the proposal. Every report is supposed to describe a change in a device's state. Thus the program has to derive which state a given device (MD RAID array or a member) is in, together with the severity of such a change/state and a description of it to make it easier for users to understand.

mdadm only gives the program the name of the event, with no details. What every event means is described in the mdadm(8) man page, so the easiest way for the PoC is to store this information in the tool itself and do the matching based on the event name, using an array of structures like these:

{"DeviceDisappeared", "deactivated", "MD array was deactivated", LOG_WARNING},
{"RebuildStarted", "rebuilding", "MD array is rebuilding", LOG_INFO},
{"RebuildFinished", "rebuilt", "MD array is now rebuilt", LOG_INFO},
{"Fail", "failed", "Device was marked as failed", LOG_WARNING},
...

where the first item is the event name as defined by mdadm, which allows the program to do the matching, followed by the additional/derived information.
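
For completeness, here's a minimal sketch of what such a structure and the matching could look like (the type and identifier names are illustrative, not necessarily the ones used in the actual PoC):

#include <string.h>
#include <syslog.h>   /* LOG_WARNING, LOG_INFO,... */

struct event_info {
    const char *name;     /* event name as reported by mdadm */
    const char *state;    /* derived state of the device */
    const char *details;  /* human-readable description of the change/state */
    int priority;         /* log level/severity */
};

static const struct event_info event_infos[] = {
    {"DeviceDisappeared", "deactivated", "MD array was deactivated", LOG_WARNING},
    {"Fail", "failed", "Device was marked as failed", LOG_WARNING},
    /* ... */
};

static const struct event_info *
match_event (const char *event)
{
    size_t i;

    for (i = 0; i < sizeof event_infos / sizeof event_infos[0]; i++)
        if (strcmp (event_infos[i].name, event) == 0)
            return &event_infos[i];

    return NULL;   /* unknown event */
}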

With all the data it needs, the program can use the journal API to log the message (in other words, save the data):

ret = sd_journal_send ("MESSAGE_ID=3183267b90074a4595e91daef0e01462",
                       "MESSAGE=mdadm reported %s on the MD device %s", event, md_dev,
                       "SOURCE=MD RAID", "SOURCE_MAN=mdadm(8)",
                       "DEVICE=%s", md_dev, "STATE=%s", info->state,
                       "PRIORITY=%i", info->priority,
                       "PRIORITY_DESC=%s", log_lvl_desc[info->priority],
                       "DETAILS=%s", info->details,
                       md_fields (event, md_dev, ""),
                       NULL);

It should be obvious what's going on in the above function call. info is the matched structure with extra information, log_lvl_desc provides a string description for the given log level (an enum and thus int). Last but not least, there's the md_fields macro that provides extra fields specific to an MD RAID event so that no information is lost when reporting the event:

#define md_fields(event, md_dev, member) "MD_EVENT=%s", event, \
                                         "MD_ARRAY=%s", md_dev, \
                                         "MD_MEMBER=%s", member

And that's basically it. Everything else in the above-described tool is just handling of unexpected failures (invalid data, no journal,...). The complete source can be found in my repository on GitHub.
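
Just to show the other side too, here is a minimal sketch of how a monitoring tool or daemon could consume such entries using the same journal API (error handling omitted for brevity):

#include <stdio.h>
#include <systemd/sd-journal.h>

int
main (void)
{
    sd_journal *j;
    const void *data;
    size_t len;

    sd_journal_open (&j, SD_JOURNAL_LOCAL_ONLY);

    /* only entries logged with our message ID */
    sd_journal_add_match (j, "MESSAGE_ID=3183267b90074a4595e91daef0e01462", 0);

    SD_JOURNAL_FOREACH (j)
        /* prints e.g. "STATE=failed" -- the field name is part of the data */
        if (sd_journal_get_data (j, "STATE", &data, &len) >= 0)
            printf ("%.*s\n", (int) len, (const char *) data);

    sd_journal_close (j);
    return 0;
}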

And they lived happily ever after

Well, not so much. Everything that has been written in this blog post so far sounds like a nice fairy tale about the prince md_report and the princess desktop_notification. However, like in every fairy tale, there's an evil in this one too, and it's called mdadm --monitor. Reading its man page one gets the illusion of how great it is and how well it covers everything that is needed here. But the reality is quite disappointing.

The biggest problem is that once mdmonitor.service is started, it doesn't catch any new MD RAID creation/activation unless it is restarted. Only then does it report all the MD RAID arrays it discovers (as either NewArray or DegradedArray) and actually start monitoring them.

Another problem is that when a device is marked as failed, Fail or FailSpare is reported, but if the MD array becomes degraded as a result, nothing like DegradedArray is reported. Thus to determine whether the array became degraded, the tool being run has to somehow "remember" what the state prior to the event was and whether the failed device caused the array to become degraded or not. Note that mdadm spawns the tool for every event; it is not running in the meantime. Of course, this could be worked around by having some persistent process running and specifying, as the PROGRAM for mdadm, just a simple tool that passes the information to it (and exits). But that's quite cumbersome.

Also, there is no event for a device being added to an array. So unless the array was degraded prior to the addition, in which case adding a device triggers the SpareActive event, there's no way to report it. And again, in order to determine whether a new device was added to a degraded array or an old/existing spare was activated, the tool needs to compare the old and new states of the MD array and its member devices.

Conclusions

Due to the above limitations (and others not mentioned here or not yet discovered), the current state of mdmonitor.service doesn't allow any easy implementation of reliable and useful reporting and monitoring of MD RAID events using structured logging. However, the two proof-of-concept tools demonstrate that the proposed mechanisms are viable, easy to implement, and sufficient for the problem area. If/once the storage subsystems improve their event reporting, journal structured logging will definitely be a good way to implement a common solution for reporting and monitoring storage-related events.

In the next blog post we will focus on actions and events that are not failures or restorations, for example an LV being renamed or resized. These should definitely be reported too, and there might be tools and daemons interested in getting such information.

[1] At least on Fedora/RHEL systems, but surely on others too.