Reporting and monitoring storage events

With any type of storage, failures and errors happen. Depending on the setup and their extent, such failures and errors might be anything from merely annoying to absolutely critical. However, they all have one thing in common -- they need to be properly reported so that counter-actions and fixups (e.g. replacing a failed drive, restoring data from a backup, etc.) can happen.

But what does properly mean here? The failures and errors need to be reported:

- in a visible place,
- persistently (and archived),
- with timestamps,
- in an unambiguous format that can be understood by both humans and machines.

Ideally, there should be some common method of such reporting across all storage subsystems so that it's easy to get all the storage-related events and distinguish them from other events.

Current situation

It is very suboptimal (to put it mildly). Every technology/subsystem (MD RAID, LVM,...) doing some kind of monitoring and reporting of storage events uses its own way. MD RAID sends an email to root@localhost and logs messages to syslog, LVM (or dmeventd more precisely) logs messages to syslog, smartd logs messages to syslog, prints warnings to console(s) and is also listened to by udisks, which provides DBus signals for SMART events, etc. (no idea about Btrfs, pointers welcome!). So all the reports are in visible places, persistent, archived and timestamped. However, they are far from having an unambiguous format that can be understood by both humans and machines. MD RAID provides a way to run an executable on every event which may be used for automation (so does dmeventd, but only for a very limited set of events).

There's again one common thing -- all the reports are logged. Nevertheless, it's nearly impossible to get all storage-related reports and distinguish them from other reports in the logs because they are more or less just free-form text.

How to improve the current situation?

It's obvious that to make getting all storage-related reports easy, they need to be in one place. That is already happening -- they are all being logged. However, it happens in a way that makes it hard to distinguish them from other messages and basically impossible to automatically analyze and process them further.

Fortunately, modern GNU/Linux systems use journald for logging, which supports so-called structured logging. In other words, logging where a log entry is not just a plain free-form text message, but a set of key-value pairs (with values potentially being arbitrary BLOBs). So a trivial solution would be to add an extra field TYPE=storage-event to all messages for storage events being logged (with MESSAGE being the standard field for the free-form text) and then use the field to filter logs for storage event reports, as shown below. And similarly, other extra fields can be used to provide machine-readable information like DEVICE, STATE,... as well as tips for human administrators like links to documentation or man pages.
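
For illustration, with the hypothetical TYPE field mentioned above, getting all (and only) the storage event reports would then be a single journalctl invocation, since journalctl treats any FIELD=VALUE argument as a match on that field:

# journalctl TYPE=storage-event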

Since this is a very common use case, and not only for storage-related event reporting, journal has very nice built-in support for all this. There are a few predefined/well-known fields and a few automatically-computed fields added by journald itself, but the logging process can pass arbitrary extra fields on its own, even with multiple key-value pairs having the same key (and thus multiple values for the key). And for the role of the TYPE field there is a predefined field with the rather confusing name MESSAGE_ID, which is not an ID of the given message but an ID of the message's type/category/... The value of MESSAGE_ID is supposed to be a 128-bit UUID written as hexadecimal digits without any separators. Creating a new such UUID can be done with:

# journalctl --new-id128

Human readability

Moreover, journal also supports so-called catalogs. A catalog entry defines a verbose email-like template that can be used to generate a nice message for a given MESSAGE_ID using the values from the log entry. For a storage-related event it could look like this:

-- 3183267b90074a4595e91daef0e01462
Subject: Storage event reported by @SOURCE@
Defined-by: storage-event-reporting
Documentation: man:storage-event-reporting(7)
Documentation: man:@SOURCE_MAN@

Event: @DEVICE@ is @STATE@: @DETAILS@.
Event-level: @PRIORITY@ (@PRIORITY_DESC@)

In case the device is failing or failed, or if the event level is
4 (warning) or lower (more significant), an immediate action may be
needed to prevent system malfunction!

where the first line is the MESSAGE_ID followed by Subject, Defined-by and Documentation items (similar to email headers) which are predefined and recommended. Then there is a description of the event specifying the affected device, its state and some details about it. The Event-level specifies the severity and is followed by a generic message explaining the meaning of the above fields and warning the user/admin about the potential need of immediate action.
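
For journald to pick the template up, it has to be installed as a catalog file and the binary catalog index has to be rebuilt. A minimal sketch, with storage-event-reporting.catalog being just an example file name:

# cp storage-event-reporting.catalog /usr/lib/systemd/catalog/
# journalctl --update-catalog
# journalctl --list-catalog 3183267b90074a4595e91daef0e01462

journalctl -x then augments matching log entries with the message rendered from the template, which is how the example below was produced [1].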

A concrete message can then look like this [1]:

Mar 24 11:27:15 localhost.localdomain logger[28370]: SMART is reporting too many ECC failures on /dev/sda
-- Subject: Storage event reported by smartd
-- Defined-by: storage-event-reporting
-- Documentation: man:storage-event-reporting(7)
-- Documentation: man:smartd(8)
--
-- Event: /dev/sda is failing: Too many ECC failures.
-- Event-level: 4 (warning)
--
-- In case the device is failing or failed, or if the event level is
-- 4 (warning) or lower (more significant), an immediate action may be needed
-- to prevent system malfunction!

Machine readability

Automated processing would of course need the raw key-value pairs, but those are exactly what is stored in the journal -- the verbose message above is just generated on demand based on the template. So for example udisks could easily use the journal API to monitor the journal, filtering the messages based on the above MESSAGE_ID, and use the key-value pairs to generate DBus signals. Those DBus signals could then be connected to e.g. GNOME Shell to inform the user that something is going on (as is already happening now for warnings/errors coming from smartd and "forwarded" by udisks). Of course not all messages need to be errors and failures. There are harmless and good things happening too, like a thin pool being grown in a VG on demand or a multipath path coming back "online". But those can be easily recognized by the PRIORITY field and processed accordingly (e.g. ignored altogether).

It's very easy to work with the journal in code. This is how a storage event report can be done [2]:

import systemd.journal

# the first argument becomes the MESSAGE field; all other keyword
# arguments are stored as extra fields of the journal entry
systemd.journal.send("SMART is reporting too many ECC failures on /dev/sda",
                     MESSAGE_ID="3183267b90074a4595e91daef0e01462",
                     DETAILS="Too many ECC failures", DEVICE="/dev/sda",
                     STATE="failing", SOURCE_MAN="smartd(8)",
                     SOURCE="smartd", PRIORITY="4", PRIORITY_DESC="warning")

To get all such reports later:

import systemd.journal

reader = systemd.journal.Reader()
reader.this_boot()  # limit to entries from the current boot
reader.add_match(MESSAGE_ID="3183267b90074a4595e91daef0e01462")
for entry in reader:
    print(entry)  # entry is a dict

To watch the journal for new entries and process them:

state = reader.wait(-1)  # -1 means block until something happens
while state > 0:  # state < 0 means error
    # could also be INVALIDATE -> should create a new Reader()
    if state == systemd.journal.APPEND:
        print(reader.get_next())
    state = reader.wait(-1)
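
Putting the pieces together, a consumer like udisks (or a simple desktop notifier) could look roughly like the following sketch. The reaction (here just a print) stands in for emitting a DBus signal or a notification, the PRIORITY threshold mirrors the catalog text above, and the MESSAGE_ID is the example one used throughout this post:

import systemd.journal

STORAGE_EVENT_ID = "3183267b90074a4595e91daef0e01462"

reader = systemd.journal.Reader()
reader.add_match(MESSAGE_ID=STORAGE_EVENT_ID)
reader.seek_tail()       # we only care about events from now on
reader.get_previous()    # position the cursor right before the tail

while True:
    if reader.wait(-1) != systemd.journal.APPEND:
        continue
    for entry in reader:
        # PRIORITY 4 (warning) or lower (more significant) -> act on it,
        # anything less important is ignored here
        if int(entry.get("PRIORITY", 7)) > 4:
            continue
        # a real consumer would emit a DBus signal or a notification here
        print("%(DEVICE)s is %(STATE)s: %(DETAILS)s" % entry)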

Rationale

It's obvious that a unified storage event reporting framework providing both human- and machine-readable data would bring many benefits to administrators, users and the many automation solutions that already exist or are yet to come in this "era of automation". And there are definitely many ways to implement such a framework.

It could be based on a new daemon gathering all the reports via a new API and then propagating/sharing them across the system via another new API. Both of the APIs would have to be very low-level from the language and data format perspective, because the originators as well as (some of) the consumers of the event reports are core low-level system utilities (or even the kernel) that cannot or don't want to bother with JSON, a Python API or anything like that. Of course there could be alternative higher-level APIs, probably built on top of the low-level ones. The new daemon would also have to make sure the event reports are stored persistently and archived, and that they can be filtered based on some keys.

As a matter of fact there already is an implementation of the daemon described above. See the analogy with journald? journald provides a low-level API for both event reporting and gathering/monitoring using just pure elementary C (not relying on libraries like GLib), it has bindings for high-level languages like Python (and many others [3]) as well as support for exporting data in the JSON format. In fact, it provides a low-level key-value database which runs from very early in the boot process for the whole uptime on the vast majority of modern GNU/Linux systems. Plus it also brings all the benefits of existing solutions and infrastructure built around system logging, like replicating/storing logs to/on another machine, gathering/reading logs from a VM or a running container, tools for viewing and inspecting logs, etc. And all this while requiring the smallest possible changes to the storage technologies' codebases, because currently all the reports already go to the system log, just in a bad format using the wrong API(s).

In other words, there seems to be no reason to implement and run a new daemon for storage event reporting.

Note about Storage Instantiation Daemon

Once the Storage Instantiation Daemon [4] is implemented and running on systems, it can be used to get a better picture of what's affected by an event. It will have a tree of block devices with all their dependencies. So knowing there's a problem on e.g. /dev/sda, SID will be able to tell that it is, for example, actually a backing device for a LUKS mapping which is a member of an MD RAID device /dev/md127 which in turn is used as a PV in the VG system that has the LVs root, swap and home. This may sound like a crazily complex example, but it is actually a common scenario/setup.

Final notes

This blog post is an outcome of the scoping and early research efforts for the design and development of a unified cross-distribution solution for reporting and processing storage-related events. It's partially based on even earlier work done by Anne Mulhern [5] (going a few steps backwards, actually).

The goal is to start as wide a discussion on this topic as possible. Do you have any suggestions, concerns, ideas or comments? Please share them in the discussion below and send the link to this post to all people who might be interested in the future of storage event reporting and monitoring.

As the first step towards a real implementation we need a proof of concept. And since only MD RAID provides a way to run arbitrary code as a reaction to an event, it makes sense to start with it. A hopefully short piece of code should demonstrate how hard or easy it is to use structured logging to report storage-related events. And also how hard it is going to be to push such changes upstream so that the mdadm --monitor command does the reporting itself by default, without spawning another executable on every event.
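
Just to illustrate the direction, here is a rough sketch of such a hook for mdadm --monitor --program (mdadm passes the event name, the array device and possibly a component device as arguments to the program); the event-to-state mapping and field values are made up for illustration and reuse the example MESSAGE_ID from above:

#!/usr/bin/python3
# invoked by: mdadm --monitor --program /usr/local/bin/md-event-to-journal ...
import sys
import systemd.journal

# mdadm passes: <event> <md device> [<component device>]
event, md_device = sys.argv[1], sys.argv[2]
component = sys.argv[3] if len(sys.argv) > 3 else ""

# crude mapping of a few mdadm events to the fields used in this post
FAILURE_EVENTS = ("Fail", "FailSpare", "DegradedArray", "DeviceDisappeared")
failure = event in FAILURE_EVENTS

systemd.journal.send("MD RAID event %s on %s %s" % (event, md_device, component),
                     MESSAGE_ID="3183267b90074a4595e91daef0e01462",
                     DEVICE=md_device,
                     STATE="failing" if failure else "healthy",
                     DETAILS=("%s %s" % (event, component)).strip(),
                     SOURCE="mdadm", SOURCE_MAN="mdadm(8)",
                     PRIORITY="4" if failure else "6",
                     PRIORITY_DESC="warning" if failure else "info")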

Reports from these efforts are definitely going to be published here on this blog. So stay tuned!

[1] output from journalctl -xe MESSAGE_ID=3183267b90074a4595e91daef0e01462
[2] All the code examples are in Python to keep them short, but C code doing the same looks almost the same -- even the C API provides these high-level functions and a single "object" to operate on.
[3] https://wiki.freedesktop.org/www/Software/systemd (bottom of the page)
[4] a link to a web page or blog post about SID will be added here once something like that exists
[5] https://github.com/mulkieran/storage_alerts