IPMI/Environmental Monitoring

experimental

Overview

Most modern computer systems can report temperatures and fan speeds (as well as other information) via ipmi. However, the format of the device names is not standardized, making it extremely difficult, if not impossible, to interpret and report them programmatically. This feature has been declared experimental in order to evaluate its success on a broader set of hardware, which I expect will be determined by the number of bug reports. As such it is disabled in collectl.conf, so if you want to enable it when running as a daemon be sure to include an E in the -s string in the DaemonCommands line as shown below:
DaemonCommands = -f /var/log/collectl -r00:01,7 -m -F60 -s+CEYZ

Prerequisites

Collectl depends on the open source tool ipmitool, which must be installed first. Installation is as simple as pulling down and unpacking the kit with tar and executing the commands:
./configure
make
make install
That's all it takes. However, your system must support ipmi, and one way to tell is whether the command dmidecode | grep IPMI produces any output. If not, your system does not support ipmi, and even if you were able to install ipmitool you won't be able to use it.

The next step is to start the ipmi driver, which is generally done via the command service ipmi start on a RedHat system or something like /etc/init.d/ipmi start on others. On some systems, such as HP blades, you may need to install a custom ipmi driver such as hp-OpenIPMI and start that instead of the standard driver.

At this point you should be able to execute the command ipmitool sdr and see all your sensor data, or the commands ipmitool sdr type fan and ipmitool sdr type temp to see just fan and temperature data:

[root@bl460-63 ipmitool-1.8.9]# ipmitool sdr
UID Light        | 0 unspecified     | ok
Int. Health LED  | 0 unspecified     | ok
VRM 1            | 0 unspecified     | cr
VRM 2            | 0 unspecified     | cr
Temp 1           | 47 degrees C      | ok
Temp 2           | 34 degrees C      | ok
Temp 3           | 30 degrees C      | ok
Temp 4           | 30 degrees C      | ok
Temp 5           | 31 degrees C      | ok
Temp 6           | 30 degrees C      | ok
Temp 7           | 30 degrees C      | ok
Temp 8           | 66 degrees C      | ok
Temp 9           | 20 degrees C      | ok
Virtual Fan      | 37.24 unspecifi | nc
Enclosure Status | 0 unspecified     | nc
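
If you want to post-process this listing yourself, the three-column layout is easy to split apart. The following is only a minimal sketch in Python (collectl itself is written in perl and does not work this way internally); the function names parse_sdr and temperatures are made up for illustration, and it assumes every sensor line has exactly three fields separated by | as in the sample above.

```python
def parse_sdr(text):
    """Split `ipmitool sdr` style output into (name, reading, status) tuples."""
    rows = []
    for line in text.splitlines():
        fields = [field.strip() for field in line.split("|")]
        if len(fields) != 3:
            continue  # skip shell prompts, blank lines, anything not in 3 columns
        rows.append(tuple(fields))
    return rows


def temperatures(rows):
    """Return {sensor name: degrees C} for readings given in degrees C."""
    temps = {}
    for name, reading, _status in rows:
        value, _, unit = reading.partition(" ")
        if unit == "degrees C":
            temps[name] = float(value)
    return temps


# Two lines copied from the listing above.
SAMPLE = """\
Temp 1           | 47 degrees C      | ok
Virtual Fan      | 37.24 unspecifi | nc
"""

if __name__ == "__main__":
    print(temperatures(parse_sdr(SAMPLE)))
```

Note that the reading column can be truncated by ipmitool, as in the Virtual Fan line, so the unit text is not always reliable.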

The collectl interface

Collectl uses a 3rd monitoring interval to collect ipmi data, which by default is 2 minutes. This is done for several reasons: Although you can run collectl interactively to report a single sample, its typical use is expected to be as a daemon. When collectl is run as a daemon and -sE is specified, which is currently the default, it will first check whether ipmitool is present and whether a communications device of the form /dev/ipmi* is present. If so, it will start collecting ipmi data for fans and temperature sensors.
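
The two startup checks described above can be approximated outside of collectl. This is a rough sketch in Python, not collectl's own code: the function name ipmi_available and its parameters are hypothetical, and it assumes that "ipmitool is present" means the command can be found on the PATH.

```python
import glob
import shutil


def ipmi_available(tool="ipmitool", device_glob="/dev/ipmi*"):
    """Return True if the tool is on PATH and a matching device node exists."""
    tool_found = shutil.which(tool) is not None
    device_found = bool(glob.glob(device_glob))
    return tool_found and device_found


if __name__ == "__main__":
    print("ipmi data can be collected:", ipmi_available())
```

Running this before enabling -sE in a daemon configuration is one way to confirm that both prerequisites are in place.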

You can control the way ipmi data is displayed in playback mode using --envopts and one of 3 switches. Two of them restrict the report to only fan or only temperature data. If you are reporting both, which is the default, the third requests that the 2 types of data be displayed on separate lines. This last option can be useful if you have a lot of devices on which to report.

The following is an example of time-stamped output on an HP dl380-g5, first without any options:

collectl.pl -sE -i::1 -oT
# ENVIRONMENTAL STATISTICS
#            Fan1    Fan2    Fan3    Fan4    Fan5    Fan6     Fan   Temp1   Temp2   Temp3   Temp4   Temp5   Temp6   Temp7
07:00:58   45.080  45.080  41.944  36.064  36.064  36.064       0      47      22      31      31      52      31      31
07:00:59   45.080  45.080  41.944  36.064  36.064  36.064       0      47      22      31      31      52      31      31
07:01:00   45.080  45.080  41.944  36.064  36.064  36.064       0      47      22      31      31      52      31      31
Here we see the effect of --envopts M, something I'm not particularly a fan of since it generates a lot more noise in the output. Its main purpose is dealing with too much data to display comfortably on a single line.
collectl.pl -sE -i::1 -oT --envopts M
### RECORD    1 >>> opteron167 <<< (1218022891.002) (Wed Aug  6 07:41:31 2008) ###

# ENVIRONMENTAL STATISTICS
#   CFAN1   CFAN2   CFAN3   CFAN4   CFAN5   CFAN6   CFAN7   CFAN8   CFAN9  CFAN10   SFAN1   SFAN2
     6200    6000    6200    6200    6200    5800    6200    6000    6200    6000    6000    6200
#  CTEMP0  CTEMP1   STEMP
       51      48      29

### RECORD    2 >>> opteron167 <<< (1218022892.002) (Wed Aug  6 07:41:32 2008) ###

# ENVIRONMENTAL STATISTICS
#   CFAN1   CFAN2   CFAN3   CFAN4   CFAN5   CFAN6   CFAN7   CFAN8   CFAN9  CFAN10   SFAN1   SFAN2
     6200    6000    6200    6200    6200    5800    6200    6000    6200    6000    6000    6200
#  CTEMP0  CTEMP1   STEMP
       51      48      29
If you choose to convert the data to plot format, a file with the extension env will be created.

Device Names and the collectl challenge

As already mentioned, there is no standard for naming an ipmi device, and as a result the names used even on the small sample of systems tested during development have been quite different. Here are just a few ways fan names are reported:
Fan 1
Fans
CPU FAN1
SYS FAN1
Fan1A (CPU)
FAN CPU0
FAN MOD 1A RPM
Fan Redundancy
On the one hand, collectl could simply report the names exactly as they are reported, but formatting them in a way that provides a compact display is impossible. Given that the collectl standard reporting format is a single data header line, multiple-line headers are not an option. While it is tempting to simply determine the widest device name and use that for a header width, for systems that report over a dozen devices you couldn't fit them on the same line, and that's only for systems that have been tested.

After looking at all these different names and formats, one common theme did emerge. All devices appear to have optional numbers (I didn't see any with just letters) and those numbers have optional letters. Furthermore, there seems to be some sort of optional type associated with many as well. This led to the idea of a standard naming for these devices as follows:

[type]Fan|Temp[devicenumber[deviceletter]]

in which the type field would be limited to a single character. Applying this scheme to the examples above leads to the following name mapping:

Fan 1              Fan1
Fans               Fan
CPU FAN1           CFAN1
SYS FAN1           SFAN1
Fan1A (CPU)        CFan1A
FAN CPU0           CFAN0
FAN MOD 1A RPM     MFAN1A
Fan Redundancy     RFan
This is admittedly not perfect but seems like a reasonable compromise, and since collectl reports the device names in the same order returned by ipmitool, it is not all that difficult to figure out how collectl chose to map them.
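
To make the naming scheme concrete, here is a small sketch in Python that reproduces the mapping table above. It is not collectl's parser (which is written in perl and behaves differently on some names, as discussed in the next section); the function short_name and its heuristics are assumptions chosen only so that the eight sample names come out as listed: the type is the first letter of the first other word of two or more letters, and the instance is the first number with an optional trailing letter.

```python
import re


def short_name(device):
    """Build [type]Fan|Temp[number[letter]] from a raw ipmi device name."""
    kind = re.search(r"fan|temp", device, re.IGNORECASE)
    if kind is None:
        return device  # not a fan or temperature sensor, leave untouched
    # Everything except the Fan/Temp keyword itself.
    rest = device[:kind.start()] + " " + device[kind.end():]
    # Instance: digits plus at most one letter, e.g. '1', '0' or '1A'.
    number = re.search(r"(\d+)([A-Za-z]?)(?![A-Za-z])", rest)
    # Type: first letter of the first remaining word such as CPU, SYS or MOD.
    words = re.findall(r"[A-Za-z]{2,}", rest)
    type_letter = words[0][0].upper() if words else ""
    instance = number.group(1) + number.group(2) if number else ""
    return type_letter + kind.group(0) + instance


if __name__ == "__main__":
    for raw in ("Fan 1", "CPU FAN1", "Fan1A (CPU)", "FAN MOD 1A RPM"):
        print(raw, "->", short_name(raw))
```

The keyword keeps whatever case the device reported, which is why the table mixes names like CFAN1 and CFan1A.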

Parsing Names and Customization

This is where the fun begins or things get really ugly, depending on your perspective.

After examining many different types of device name formats, it was determined that most tended to follow a pattern of

prefix type instanceNumber suffix

Where things get a little crazy is that sometimes the actual instance number can be part of the prefix OR sometimes the instance contains a letter.

All that said, collectl breaks a device name into these components, assuming a numeric instance. It then applies a minimal set of tests/modifications; note that there are examples of all these cases in the sample names shown earlier.

Since this can get very confusing, a special switch named --envdebug has been included which will show the actual parsing of the device names. The following is an example of parsing some of the names listed above:
Fan CPU0 Tach,3480
  Prefix:   Name: Fan  Instance:   Suffix: CPU0 Tach
Fan1A (CPU),EAh,ok,29.3,Performance Met
  Prefix:   Name: Fan  Instance: 1  Suffix: A (CPU)
FAN MOD 1A RPM,5775,RPM,ok
  Prefix:   Name: FAN  Instance:   Suffix: MOD 1A RPM
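
The split shown in the debug output can be imitated with a single regular expression. The sketch below is in Python and is only an approximation inferred from the three examples above, not collectl's actual perl code; split_name is a hypothetical helper, and it assumes the name is Fan or Temp in any case and that the instance is the run of digits immediately following it.

```python
import re

# prefix, then Fan/Temp, then an optional numeric instance, then the suffix
_PARTS = re.compile(r"^(.*?)\s*(fan|temp)\s*(\d*)\s*(.*)$", re.IGNORECASE)


def split_name(device):
    """Return the prefix/name/instance/suffix parts of a device name."""
    match = _PARTS.match(device)
    if match is None:
        return None  # no Fan or Temp keyword in the name
    prefix, name, instance, suffix = match.groups()
    return {"prefix": prefix, "name": name, "instance": instance, "suffix": suffix}


if __name__ == "__main__":
    for line in ("Fan CPU0 Tach,3480", "Fan1A (CPU),EAh,ok,29.3,Performance Met"):
        device = line.split(",")[0]  # the name is the first comma-separated field
        print(device, "->", split_name(device))
```

As in the debug output, FAN MOD 1A RPM ends up with an empty instance because the digits do not directly follow FAN.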

User Defined Parsing Rules

Collectl provides a mechanism for dealing with device names that do not result in the generation of satisfactory names as described in the last section. This is done by supplying collectl with a file containing perl pattern matching/replacement expressions which are very similar to standard regular expressions and are then applied to device names before their initial parsing and/or immediately after a device name is generated.

To use this feature one includes a file containing the directives and points collectl to it using --envrules. The file itself contains lines of the following form, noting that spaces and comments (lines preceded with a #) are permitted:

[pre]
/pattern1/replace1/
/pattern2/replace2/
...
/patternN/replaceN/
[post]
/pattern1/replace1/
/pattern2/replace2/
...
/patternN/replaceN/
If you know perl (and you really should if you use this), collectl builds a perl pattern substitution command out of the pattern and replace strings. So looking at the string
FAN MOD 1A RPM
and the processing rules described in the previous section, the MOD suffix will be prepended to FAN and the first letter used to name the device MFAN, losing the instance information, which is 1A.

There are at least 3 options here. The first is to simply remove MOD from each name which we can do with the rule:

/ MOD//
which will result in the instance names being picked up correctly because they will now immediately follow FAN. In fact, if you include --envdebug along with your rules you'll see the results of the replacement:
FAN MOD 1A RPM,5775,RPM,ok
  Pre-Remapped 'FAN MOD 1A RPM' to 'FAN 1A RPM'
  Prefix:   Name: FAN  Instance: 1  Suffix: A RPM
The second option would be to move MOD to the front of the string so that the rule that uses the first letter as part of the final name will take effect and that rule will look like:
/(.*) MOD (.*)/MOD $1 $2/
and results in the following parsing:
FAN MOD 1A RPM,5775,RPM,ok
  Pre-Remapped 'FAN MOD 1A RPM' to 'MOD FAN 1A RPM'
  Prefix: MOD  Name: FAN  Instance: 1  Suffix: A RPM

Unfortunately, in order to make perl interpret the $1 and $2 symbols an eval is required, which generates a little extra overhead. While that is not horrible, an even better solution is the third option, which doesn't use any special $ symbols:

/FAN MOD/MOD FAN/
which produces exactly the same results as the previous example except without the eval command.
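
If you want to try a rule before handing it to collectl, the substitution can be imitated in a few lines. The following sketch uses Python's re module as a stand-in for perl, which is an assumption that only holds for simple patterns like these; apply_rule is a hypothetical helper, and the second rule is written with a space between the two captured groups so that it yields the remapped string shown in the debug output.

```python
import re


def apply_rule(rule, name):
    """Apply one /pattern/replace/ rule to a device name, first match only."""
    # Naive split: assumes neither the pattern nor the replacement contains '/'.
    _, pattern, replacement, _ = rule.split("/")
    # perl writes captured groups as $1, $2; Python's re module uses \g<1>, \g<2>.
    replacement = re.sub(r"\$(\d+)", lambda m: r"\g<%s>" % m.group(1), replacement)
    return re.sub(pattern, replacement, name, count=1)


if __name__ == "__main__":
    device = "FAN MOD 1A RPM"
    for rule in ("/ MOD//", "/(.*) MOD (.*)/MOD $1 $2/", "/FAN MOD/MOD FAN/"):
        print(rule, "->", repr(apply_rule(rule, device)))
```

The first rule gives 'FAN 1A RPM' and the other two both give 'MOD FAN 1A RPM', matching the three options described above.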

There is in fact at least one other mechanism for those who are not all that familiar with perl, included here only for completeness, and that is to simply hardcode the replacement of each device with the desired output. In other words

/FAN MOD 1A RPM/MOD FAN1 A/
/FAN MOD 2A RPM/MOD FAN2 A/
/FAN MOD 3A RPM/MOD FAN3 A/
etc
will produce strings that can also be properly parsed without involving $ variables, but this means you need to specify each unique device name to remap, and it will also cause all pattern matching statements to be executed for each device, resulting in slightly more overhead.
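
The layout of the rules file described in this section is simple enough to check by hand or with a short script. Here is a minimal sketch in Python of reading such a file into separate pre and post lists; it is not how collectl reads the file, parse_rules is a hypothetical name, the /MFAN/ModFan/ rule in the sample is invented purely for illustration, and it assumes "spaces" means blank lines and leading or trailing whitespace.

```python
def parse_rules(text):
    """Return {'pre': [(pattern, replace), ...], 'post': [...]} from rule text."""
    rules = {"pre": [], "post": []}
    section = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments are permitted
        if line in ("[pre]", "[post]"):
            section = line[1:-1]
            continue
        # A rule looks like /pattern/replace/ (no '/' inside either part).
        parts = line.split("/")
        if section is None or len(parts) != 4 or parts[0] or parts[3]:
            raise ValueError("unrecognized rule line: %r" % raw)
        rules[section].append((parts[1], parts[2]))
    return rules


SAMPLE_RULES = """\
# move the instance next to FAN before parsing
[pre]
/ MOD//

[post]
/MFAN/ModFan/
"""

if __name__ == "__main__":
    print(parse_rules(SAMPLE_RULES))
```

A check like this catches a missing trailing slash or a rule placed before any section header, both of which are easy mistakes to make when hardcoding many device names.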

Restrictions

Some systems report what appear to be device codes in the data field and the data in the 4th field, and I don't know why. For now, when this occurs collectl reports the 4th column as the data instead. If this breaks other things it will have to be removed, and invalid data reported for those systems that do not report it in column 2.
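
The fallback can be pictured with the comma-separated sample lines from the --envdebug section. This Python sketch is a guess at the idea rather than collectl's actual test: the helper name reading is hypothetical, and it assumes a "device code" is simply anything in the 2nd field that does not parse as a number.

```python
def reading(line):
    """Return (device name, value), preferring field 2 and falling back to field 4."""
    fields = line.split(",")
    name = fields[0]
    for index in (1, 3):  # 2nd field first, then the 4th
        if index < len(fields):
            try:
                return name, float(fields[index])
            except ValueError:
                continue  # e.g. 'EAh' looks like a device code, not data
    return name, None


if __name__ == "__main__":
    print(reading("Fan1A (CPU),EAh,ok,29.3,Performance Met"))
    print(reading("FAN MOD 1A RPM,5775,RPM,ok"))
```

A purely numeric device code would defeat this test, which is one way the workaround could break other things.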