Tutorial - The Basics

Getting started with collectl may seem a little challenging to the new user because of its many options, but it shouldn't be. After all, how many people simply run the top command and don't even realize there is a rich set of options available? In that same spirit, you can simply enter the collectl command and get a lot of useful information, but you would also be missing out on a lot more. The intent of this tutorial is to give you a better appreciation of what you can do with collectl and hopefully encourage you to experiment with even more options than those described below.

Measuring Disk Activity

For this first set of examples I'll be using Robin Miller's dt to write a large file to /tmp using the command dt of=/tmp/test limit=1g bs=1m disable=compare,verify dispose=keep while running collectl in another window.
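If you'd like to reproduce the setup yourself, the following sketch shows both windows; the dt flags are the exact ones from this test, and collectl needs no switches at all since its defaults are cpu, disk and network summaries:

# window 1: generate a 1GB sequential write load
dt of=/tmp/test limit=1g bs=1m disable=compare,verify dispose=keep

# window 2: watch with collectl's default subsystems
collectl

Here's what collectl reports while the test runs: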
#<--------CPU--------><----------Disks-----------><----------Network---------->
#cpu sys inter  ctxsw KBRead  Reads KBWrit Writes netKBi pkt-in netKBo pkt-out
  30  30   254     65      8      2   7920     97      0      4      0       2
  10  10   377     65      0      0  32500    282      4     52      2      19
  10  10   332     61      0      0  29312    246      0      3      0       3
   9   9   330     65      0      0  32512    275      3     45      1       9
  11  11   331     53      4      1  29684    270      0      2      0       2
   8   8   352     63      0      0  35004    273      3     33      1       8
  13  12   329    116      0      0  28924    249      0      2      0       2
Here we see a few things, including a burst of cpu activity when the test first starts as well as an I/O rate of about 30MB/sec, which corresponds to what dt is telling us in the following summary line:
Average transfer rates: 32051995 bytes/sec, 31300.776 Kbytes/sec
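As a quick sanity check, averaging the KBWrit column over the steady-state samples above works out to just about 31300KB/sec, which is dt's number almost exactly.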
If we compare the write rates to the number of writes we can also infer writes on the order of 128KB, which is good to know because it means we're being efficient in the size of the data blocks being handed to the driver. However, if we don't mind the extra columns, we can include --iosize, which tells collectl to add the average I/O size to this default display format, also known as brief mode. In verbose mode the I/O sizes are always included, at least as of Version 2.6.3.
#<--------CPU--------><---------------Disks---------------->
#cpu sys inter  ctxsw KBRead  Reads Size KBWrit Writes Size
   9   8   381     71      0      0    0  30644    276  111
  14  13   325     85      0      0    0  32888    258  127
  11  10   313     80      0      0    0  31064    261  119
  12  11   421    186      0      0    0  32376    276  117
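The display above was presumably produced by adding the new switch to a cpu-plus-disk selection, and the Size column is easy to verify by hand from the first line:

# brief mode plus average I/O sizes (network left out to save width)
collectl -scd --iosize

# e.g. 30644 KBWrit / 276 Writes = 111KB, exactly what Size reports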

This may also be a good time to mention screen real estate. There is a lot of information collectl can display and everything takes space! More often than not you don't really care about time, and so by default it isn't displayed. However there may be times you do care, in which case you can simply add the switch -oT to add time to the display. In fact, sometimes you may want to include the date as well, in which case -oD will do both. You can even show the times in msec by including m with -o, which can be useful when running at sub-second monitoring levels and/or when you want to correlate data with system or application logs which may themselves have finer-grained timestamps. Here's an example of the command collectl -scd -i.25 -oDm, which shows the cpu and disk loads every quarter second and includes the date and time in msecs:

#                      <--------CPU--------><----------Disks----------->
#Date    Time          cpu sys inter  ctxsw KBRead  Reads KBWrit Writes
20080212 11:22:47.008    2   0   364     84      0      0  31328    284
20080212 11:22:47.258    8   6   392     92      0      0  30832    356
20080212 11:22:47.508    8   6   308     84      0      0  36256    268
20080212 11:22:47.758    2   0   292     44      0      0  31152    196
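To recap the timestamp options just described, all of which combine freely with any subsystem selection:

collectl -scd -oT          # prefix each sample with the time
collectl -scd -oD          # prefix each sample with the date and time
collectl -scd -i.25 -oDm   # quarter-second samples with msec timestamps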

So what about that CPU load? Given that this is a 2-CPU system we might be interested in seeing how that load is being distributed by running the command collectl -sC, since an uppercase subsystem type, whether cpu, disk or network, tells collectl to show instance-level details:

# SINGLE CPU STATISTICS
#   CPU  USER NICE  SYS WAIT IRQ  SOFT STEAL IDLE
      0     0    0   17    0    0    0     0   83
      1     0    0    4    0    0    0     0   96
      0     0    0   14    0    0    0     0   86
      1     0    0    0    0    0    0     0  100
      0     0    0   20    0    0    0     0   80
      1     0    0    0    0    0    0     0  100
noting that all the load is being delivered by a single CPU, as expected.
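The same uppercase convention works for the other subsystems too, so as a quick sketch, any of these will switch from summary to per-instance detail:

collectl -sC   # per-CPU statistics
collectl -sD   # per-disk statistics
collectl -sN   # per-network-interface statistics

Ok, so now let's read back the 1G file we just wrote and see what happens: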
#<--------CPU--------><----------Disks-----------><----------Network---------->
#cpu sys inter  ctxsw KBRead  Reads KBWrit Writes netKBi pkt-in netKBo pkt-out
  38  37   248    189   7283    111      0      0      1      9      1       8
  24  23   153     81     32      0      0      0      2     32      1       9
Now we see a big burst of CPU load and not much from disk. Furthermore, dt is reporting
Average transfer rates: 872960833 bytes/sec, 852500.813 Kbytes/sec
which in fact confirms that the reads are coming from cache and not disk, since no local disk can read at this rate! In general, when doing disk I/O testing one should use file sizes larger than cache to force all I/O to come from disk. Repeating the test with a larger file, we now see more realistic read rates:
#<--------CPU--------><----------Disks-----------><----------Network---------->
#cpu sys inter  ctxsw KBRead  Reads KBWrit Writes netKBi pkt-in netKBo pkt-out
   9   8   773    743  41376    629      0      0      1      8      1       7
   9   8   619    639  31716    476      0      0      2     33      1       8
  16  15   510    554  23016    370      0      0      0      4      0       2
  10  10   572    624  27272    429      0      0      2     27      1       8
  16  15   458    504  19560    306     12      2      0      4      0       2
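If you'd rather not write a file bigger than memory, another way to force reads back to disk is to flush the page cache before rereading; a hypothetical sequence (the drop_caches step needs root and a reasonably recent kernel, and the dd-style if= read syntax is an assumption about dt):

# flush the page cache so the reread can't be satisfied from memory
sync; echo 3 > /proc/sys/vm/drop_caches

# reread the file with dt
dt if=/tmp/test limit=1g bs=1m disable=compare,verify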
So just what is happening to cache during testing? To see memory utilization we can simply add the memory subsystem to the default selections with collectl -s+m, but that also makes the display wider, and since for our purposes we don't need network information I'm just going to run collectl -scmd:
#<--------CPU--------><-----------Memory----------><----------Disks----------->
#cpu sys inter  ctxsw free buff cach inac slab  map KBRead  Reads KBWrit Writes
   3   0   159     80   2G 395M 189M   1M    0    0      0      0     20      3
   1   0   153     52   2G 395M 189M   1M    0    0      0      0      0      0
  43  42   238     68   2G 395M 340M 152M    0    0      0      0   3060     72
  25  25   376     53   1G 395M 431M 242M    0    0      0      0  29808    273
   6   6   377     59   1G 395M 455M 266M    0    0      0      0  30900    266
  10  10   347     55   1G 395M 492M 303M    0    0      0      0  35004    265
   5   4   389     60   1G 395M 506M 318M    0    0      0      0  27308    262
and watch the cache fill up. In fact, if we let the test keep running we eventually use up all available memory (but that's what it's there for), and even after the test completes and there is no more I/O, we still see hardly any free memory. That too is ok, because until some other process needs the memory or the file is deleted, that data stays in cache. Look at the last sample, where I manually deleted the file. You can see the cache drop to 204M and the free memory rise to 2G during a single reporting interval:
#<--------CPU--------><-----------Memory----------><----------Disks----------->
#cpu sys inter  ctxsw free buff cach inac slab  map KBRead  Reads KBWrit Writes
   1   1   374     91 171M 397M   2G   2G    0    0      0      0  34624    288
   1   1   368     82 171M 397M   2G   2G    0    0      0      0  31408    260
   2   2   319     56 171M 397M   2G   2G    0    0      0      0  31148    266
   0   0   385     70 172M 397M   2G   2G    0    0      0      0  25844    273
   0   0   167     70 172M 397M   2G   2G    0    0      0      0      0      0
   0   0   173     51 172M 397M   2G   2G    0    0      0      0      0      0
   2   0   181    108 172M 397M   2G   2G    0    0      0      0     12      2
  41  41   148     52   2G 397M 204M  15M    0    0      0      0     72      5
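As an aside, both forms of subsystem selection used above are worth remembering:

collectl -s+m    # add memory to the defaults of cpu, disk and network
collectl -scmd   # name the subsystems explicitly, leaving network out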
For one more test, I'm going to write that same 1G file to my home directory and look at what collectl tells me:
#<--------CPU--------><----------Disks-----------><----------Network---------->
#cpu sys inter  ctxsw KBRead  Reads KBWrit Writes netKBi pkt-in netKBo pkt-out
   0   0   145     48      0      0      0      0      2     38      2      13
  13  13  6716   3491      0      0      0      0    136    682  21144   14762
  18  18  6802   3426      0      0      0      0    248   1256  39111   27278
  14  14  4680   2420      0      0     28      2    252   1256  40166   28008
   7   7  3105   1520      0      0      0      0    148    752  23256   16228
Since my home directory is mounted via nfs, all I/O goes through the network! In fact, if I run collectl as collectl -scfn I see:
#<--------CPU--------><----------Network----------><--NFS Svr Summary-->
#cpu sys inter  ctxsw netKBi pkt-in netKBo pkt-out   read  write  calls
   0   0   144     48      0      2      0       2      0      0      0
  13  12  2839   1440      5     79      1      10      0      0      0
  17  16  6539   3061    216   1081  33976   23747      0      0      0
  15  14  6887   2704    253   1275  40020   27876      0      0      0
  12  11  4683   2209    252   1256  40166   28008      0      0      0
I can clearly see the network traffic, but why don't I see any nfs activity? That's because by default collectl shows server activity rather than client activity, and so we need to add the option --nfsopts C to the command, which now looks like collectl -scfn --nfsopts C and produces the following:
#<--------CPU--------><----------Network----------><--NFS Clt Summary-->
#cpu sys inter  ctxsw netKBi pkt-in netKBo pkt-out   read  write  calls
   2   1   142     44      0      5      0       3      0      0      0
   3   2  1562    693     46    253   6348    4456      0    256    285
  23  22  6735   3026    251   1250  40166   28007      0   1216   1255
  22  22  6633   3309    241   1220  38053   26541      0   1216   1253
  20  19  6657   3338    245   1223  39109   27271      0   1204   1242
So what is the difference between NFS Writes and NFS Calls? To best demonstrate this let's add F to the subsystems, which, like the other uppercase subsystems, indicates details; in the case of nfs this translates to breaking things down by metadata operations, commits and retransmissions.
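Assuming the detail subsystem simply gets appended to the previous command, something like collectl -scfnF --nfsopts C should produce a display like this: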
#<--------CPU--------><----------Network----------><--NFS Clt Summary--><----NFS MetaOps---->
#cpu sys inter  ctxsw netKBi pkt-in netKBo pkt-out   read  write  calls   meta commit retran
  13  13  3625   1462    128    662  20114   14010      0    640    660      0     20      0
   9   9  3764   1165    132    662  21110   14742      0    672    693      0     21      0
and we can now see those extra calls are in fact nfs commits.
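In fact the arithmetic works out exactly: in the first line above, 640 writes plus 20 commits accounts for all 660 calls. A commit is just the client asking the server to flush previously written data to stable storage, which is why commits track the writes so closely.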

So in conclusion, you can see there is really quite a lot you can do with just a few basic switches, and I haven't even gotten into --verbose, which as they say is an exercise left for the student. So try some simple dt tests yourself, or use your own personal favorite load generator, while trying out collectl -sc --verbose or collectl -sm --verbose or even collectl -sn --verbose. You can even put them all together as collectl -scmn --verbose, but as you'll see you end up using a lot of that valuable screen real estate. As a final bonus, try adding the -ot switch, which puts collectl into something I loosely call top format since each sample is displayed at the top of the screen. That command would then look like collectl -smcn --verbose -ot.

enjoy...