Categories

Donate

Advert

ZFS Training Exercises

In 2015 I ran a training session on BTRFS and ZFS. Here are the notes for ZFS updated for the latest versions of everything when running on KVM. You can do all the same exercises on a non virtual machine, but it’s safest not to do this on a machine with important data. For the KVM setup I use /dev/vdc, /dev/vdd, and /dev/vde are the devices for test data.

For running ZFS on Linux in serious use I recommend having more than 4G of RAM in the system or VM that’s doing it to prevent kernel OOM issues. You should be able to complete this training exercise on a 64bit VM with 2G of RAM. To do this on Debian/Buster just replace the references to “bullseye” with “buster” in the APT configuration.

  1. Installing zfsonlinux on Debian/Bullseye
    1. Note that /etc/apt/sources.list has the following lines (the “contrib” is very important):
      deb http://mirror.internode.on.net/pub/debian/ bullseye-updates contrib main
      deb http://deb.debian.org/debian bullseye-backports contrib main
      
    2. Run the following commands:
      apt install linux-headers-amd64
      apt install -t bullseye-backports dkms spl-dkms zfs-dkms
      apt install -t bullseye-backports zfsutils-linux
      The apt command installing spl-dkms takes ages, 4 minutes real time and 9 minutes CPU time on my last test on a 4 core VM. It will take much longer if several people are doing it at the same time.
  2. Making the filesystem
    1. Create the pool:
      zpool create -o ashift=12 nyancat raidz /dev/vdc /dev/vdd /dev/vde
      NB You need to add a “-f” to overwrite the BTRFS filesystem
    2. Make some default filesystem settings:
      zfs set devices=off nyancat
      zfs set atime=off nyancat
      zfs set setuid=off nyancat
      zfs set compression=on nyancat
    3. Get information on the pool:
      zfs get all nyancat|less
      zpool list
      zpool status
    4. Make a filesystem:
      zfs create nyancat/friday
      df -h
    5. See if there are any errors, shouldn’t be any (yet):
      zpool scrub nyancat
      zpool status
    6. Copy some files to the filesystem:
      cp -r /usr /nyancat/friday
    7. Scrub the filesystem:
      zpool scrub nyancat
      zpool status
    1. Online corruption

    2. Corrupt the filesystem:
      dd if=/dev/zero of=/dev/vdd bs=1024k count=2000 seek=50
    3. Scrub again, should give a warning about errors:
      zpool scrub nyancat
      zpool status
    4. Verify that the data is intact:
      diff -ru /usr /nyancat/friday/usr
    5. Corrupt it again:
      dd if=/dev/zero of=/dev/vdd bs=1024k count=2000 seek=50
    6. Unmount the ZFS filesystems and export (stop using) the zpool:
      zfs umount -a
      zpool export -a
      zpool status
    7. “Import” the pool, mount the filesystems, and note that the checksum error count is back to zero (it counts since the last mount not for all time – unlike BTRFS):
      zpool import -a
      zfs mount -a
      zpool status
    8. Check that the data is still intact and look for increases in the error count:
      zpool status
      diff -ru /usr /nyancat/friday/usr
      zpool status
    1. Offline corruption

    2. Umount the filesystem and corrupt the start of one device:
      zfs umount -a
      zpool export -a
      dd if=/dev/zero of=/dev/vdd bs=1024k count=200
    3. Try mounting it again which will work even though one block device is unavailable:
      zpool import -a
      zpool status
    4. Replace the missing device, look for the device name in the “zpool status” output to replace “123456789” in the following sample command:
      zpool replace -f nyancat 123456789 /dev/vdd
    5. Check that everything is ok:
      zpool status
      diff -ru /usr /nyancat/friday/usr

BTRFS Training Exercises

In 2015 I ran a training session on BTRFS and ZFS. Here are the notes for BTRFS updated for the latest versions of everything when running on KVM. You can do all the same exercises on a non virtual machine, but it’s safest not to do this on a machine with important data. For the KVM setup I use /dev/vdc and /dev/vdd are the devices for test data.

On a Debian Buster or Bullseye system all that you need to do to get ready for this is to install the btrfs-progs package.

  1. Making the filesystem
    1. Make the filesystem, this makes a filesystem that spans 2 devices (note you must use the-f option if there was already a filesystem on those devices):
      mkfs.btrfs /dev/vdc /dev/vdd
    2. Use file(1) to see basic data from the superblocks:
      file -s /dev/vdc /dev/vdd
    3. Mount the filesystem (can mount either block device, the kernel knows they belong together):
      mount /dev/vdd /mnt/tmp
    4. See a BTRFS df of the filesystem, shows what type of RAID is used (note the use of “single” IE RAID-0 for data):
      btrfs filesystem df /mnt/tmp
    5. See more information about FS device use:
      btrfs filesystem show /mnt/tmp
    6. Balance the filesystem to change it to RAID-1 and verify the change, note that some parts of the filesystem were single and RAID-0 before this change):
      btrfs balance start -dconvert=raid1 -mconvert=raid1 -sconvert=raid1 --force /mnt/tmp
      btrfs filesystem df /mnt/tmp
    7. See if there are any errors, shouldn’t be any (yet):
      btrfs device stats /mnt/tmp
    8. Copy some files to the filesystem:
      cp -r /usr /mnt/tmp
    9. Check the filesystem for basic consistency (only checks checksums):
      btrfs scrub start -B -d /mnt/tmp
  2. Online corruption
    1. Corrupt the filesystem:
      dd if=/dev/zero of=/dev/vdd bs=1024k count=2000 seek=50
    2. Scrub again, should give a warning about errors:
      btrfs scrub start -B /mnt/tmp
    3. Check error count:
      btrfs device stats /mnt/tmp
      dmesg
    4. Corrupt it again:
      dd if=/dev/zero of=/dev/vdd bs=1024k count=2000 seek=50
    5. Unmount it:
      umount /mnt/tmp
    6. In another terminal follow the kernel log:
      tail -f /var/log/kern.log
    7. Mount it again and observe it reporting (and maybe correcting) errors on mount:
      mount /dev/vdd /mnt/tmp
    8. Run a diff, observe kernel error messages and observe that diff reports no file differences:
      diff -ru /usr /mnt/tmp/usr/
    9. Run another diff, you probably won’t observe kernel error messages as all errors in data read by diff were probably fixed in the previous run. Observe that diff reports no file differences:
      diff -ru /usr /mnt/tmp/usr/
    10. Run another scrub, this may correct some errors which weren’t discovered by diff:
      btrfs scrub start -B -d /mnt/tmp
  3. Offline corruption
    1. Umount the filesystem, corrupt the start, then try mounting it again which will fail because the superblocks were wiped:
      umount /mnt/tmp
      dd if=/dev/zero of=/dev/vdd bs=1024k count=200
      mount /dev/vdd /mnt/tmp
      mount /dev/vdc /mnt/tmp
    2. Note that the filesystem was not mountable due to a lack of a superblock. It might be possible to recover from this but that’s more advanced so we will restore the RAID.
      Mount the filesystem in a degraded RAID mode, this allows full operation.
      mount /dev/vdc /mnt/tmp -o degraded
    3. Add /dev/vdd back to the RAID:
      btrfs device add /dev/vdd /mnt/tmp
    4. Show the filesystem devices, observe that vdd is listed twice, the missing device and the one that was just added:
      btrfs filesystem show /mnt/tmp
    5. Remove the missing device and observe the change:
      btrfs device delete missing /mnt/tmp
      btrfs filesystem show /mnt/tmp
    6. Balance the filesystem, not sure this is necessary but it’s good practice to do it when in doubt:
      btrfs balance start /mnt/tmp
    7. Umount and mount it, note that the degraded option is not needed:
      umount /mnt/tmp
      mount /dev/vdc /mnt/tmp
  4. Compression
    1. Get the properties of the root of the btrfs filesystem, set it to zstd compression (which is inherited when creating subdirectories) and inspect the change:
      btrfs property get /mnt/tmp
      btrfs property set /mnt/tmp compression zstd
      btrfs property get /mnt/tmp
    2. View the space used, convert the filesystem to zstd compression, and view the new space used:
      df -h /mnt/tmp
      btrfs filesystem defrag -czstd -r /mnt/tmp
      df -h /mnt/tmp

  5. Experiment
    1. Experiment with the “btrfs subvolume create” and “btrfs subvolume delete” commands (which act like mkdir and rmdir).
    2. Experiment with “btrfs subvolume snapshot SOURCE DEST” and “btrfs subvolume snapshot -r SOURCE DEST” for creating regular and read-only snapshots of other subvolumes (including the root).

Postal

This is a program I wrote to benchmark SMTP servers. I started work on this because I need to know which mail server will give the best performance with more than 1,000,000 users. I have decided to release it under the GPL because there is no benefit in keeping the source secret, and the world needs to know which mail servers perform well and which don’t!

At the OSDC conference in 2006 I presented a paper on mail relay performance based on the new BHM program that is now part of Postal.

I have a Postal category on my main blog that I use for a variety of news related to Postal. This post (which will be updated periodically) will be the main reference page for the software. Please use the comments section for bug reports and feature requests.

It works by taking a list of email addresses to use as FROM and TO addresses. I originally used a template to generate the list of users because if each email address takes 30 bytes of storage then 3,000,000 accounts would take 90M of RAM which would be more than the memory in the test machine I was using at the time. Since that time the RAM size in commodity machines has increased far faster than the size of ISP mail servers so I removed the template feature (which seemed to confuse many people).

When sending the mail the subject and body will be random data. A header field X-Postal will be used so that procmail can easily filter out such email just in case you accidentally put your own email address as one of the test addresses. ;)

I have now added two new programs to the suite, postal-list, and rabid. Postal-list will list all the possible expansions for an
account name (used for creating a list of accounts to create on your test server). Rabid is the mad Biff, it is a POP benchmark.

Postal now adds a MD5 checksum in the header X-PostalHash to all messages it sends (checksum is over the Subject, Date, Message-ID, From, and To headers and the message body including the “\r\n” that ends each line of text in the SMTP protocol). Rabid now checks the MD5 checksum and displays error messages when it doesn’t match.

I have added rate limiting support in Rabid and Postal. This means that you can specify that these programs send a specific number of messages and perform a specific number of POP connections per minute respectively. This should make it easy to determine the amount of system resources that are used by a particular volume of traffic. Also if you want to run performance analysis software to determine what the bottlenecks are on your mail server then you could set Postal and Rabid to only use half the maximum speed (so the CPU and disk usage of the analysis software won’t impact on the mail server).

I will not release a 1.0 version until the following features are implemented:

  • Matching email sent by Postal and mail received by BHM and Rabid to ensure that each message is delivered correctly (no repeats and no corruption)
  • IMAP support in Rabid that works
  • Support for simulating large numbers of source addresses in Postal. This needs to support at least 2^24 addresses so it is entirely impractical to have so many IP addresses permanently assigned to the test machine.
  • Support for simulating slow servers in Postal and BHM (probably reducing TCP window size and delaying read() calls)
  • Making BHM simulate the more common anti-spam measures that are in use to determine the impact that they have on list servers
  • Determining a solution to the problem of benchmarking DNS servers. This may mean just including documentation on how to simulate the use patterns of a mail server using someone else’s DNS benchmark, but may mean writing my own DNS benchmark.

Here are links to download the source:

  • postal-0.76.tgz – #define PATH_MAX if it’s not already defined, for HURD.
  • postal-0.75.tgz – Fix some compiler warnings. Stop autoconf controlling stripping. Add -b option to bind to an address to bhm. Do not #include when unnecessary.
    Stop linking against libgcrypt. Build against GnuTLS v3. Fix buffer underrun. Made it build with GCC-6.
  • postal-0.73.tgz – Make postal correctly issue the quit command after delivery failure. Fixed a bunch of valgrind errors. Fixed a nasty bug with tab separated fields.
  • postal-0.72.tgz – made LMTP work and accept TAB as a field delimiter.
  • postal-0.71.tgz – rewrote the md5 checking code and fixed lots of little bugs.
  • postal-0.70.tgz – tidied up the man pages and made it build without SSL support.
  • postal-0.69.tgz – fixed some compile warnings, and really made it compile with GCC 4.3
  • postal-0.68.tgz – fixed some compile warnings, made it compile with GCC 4.3, and I think I made it compile correctly with OpenSolaris.
  • postal-0.67.tgz – changed the license to GPL 3
  • postal-0.66.tgz – made GNUTLS work in BHM and added MessageId to Postal.
  • postal-0.65.tgz – significant improvement, many new features and many bugs fixed!
  • postal-0.62.tgz – Slightly improved the installation documents and made it build with GCC 3.2.
  • postal-0.61.tgz – version 0.61. Fixed the bug with optind that stopped it working on BSD systems, and a few other minor bugs.
  • postal-0.60.tgz – version 0.60. Fixed the POP deletion bug, made it compile with GCC 3.0, and added logging of all network IO to disk.
  • postal-0.59.tgz – version 0.59.
  • postal-0.58.tgz – version 0.58. Added some new autoconf stuff, RPM build support, and the first steps to OS/2 and Win32 portability.
  • postal-0.57.tgz – version 0.57. Fixed lots of trivial bugs and some BSD portability issues.
  • postal-0.56.tgz – version 0.56. Added Solaris package manager support. Made it compile without SSL. Added heaps of autoconf stuff.
  • postal-0.55.tgz – version 0.55. Made Rabid work with POP servers that support the CAPA command. Fixed some compile problems on Solaris.
  • postal-0.54.tgz – version 0.54. Added a ./configure option to turn off name expansion (for systems with buggy regex). Fixed a locking bug that allowed Rabid to access the same account through two threads.
  • postal-0.53.tgz – version 0.53. Don’t use NIS domain name etc for SMTP protocol.
  • postal-0.52.tgz – version 0.52. Better portability with autoconf.
  • postal-0.51.tgz – version 0.51. Supports compiling without SSL and some hacky Solaris support.
  • postal-0.50.tgz – version 0.50. Adds SSL support to Postal (Rabid comes next).

Bonnie++

This version starts the re-write of Bonnie++! I will make it totally threaded (the new code does not use fork()). It will also support testing with a specified number of threads doing the same test, this will allow you to really thrash those RAID arrays!

  • bonnie++-2.00a.tgz – fixed a column assignment issue in bon_csv2html and called it 2.00a because 1.9x hasn’t changed much for years and because I stuffed up the first release of 2.00.
  • bonnie++-1.98.tgz – Allow specifying the number of random seeks and the number of seeker processes and store that in the CSV (for testing NVMe). Changed the text output to use KiB/MiB/GiB as units of measurement so we can fit NVMe results on screen.
  • bonnie++-1.97.3.tgz – man page fixes, fixed symlink test, and CSS fix for bon_csv2html.
  • bonnie++-1.97.2.tgz – make it build with GCC-6 and fix some Debian bugs.
  • bonnie++-1.97.tgz – fixed a bunch of bugs including bad CSV output.
  • bonnie++-1.96.tgz – fixed a bunch of bugs and got the colors working correctly in the HTML output.
  • bonnie++-1.95.tgz – support direct IO and made some changes to the build process (including dropping NT and OS/2 support).
  • bonnie++-1.94.tgz – major improvements to zcav.
  • bonnie++-1.93d.tgz – added write support to ZCAV.
  • bonnie++-1.93c.tgz – added support for GCC 3.2.
  • bonnie++-1.93b.tgz – version 1.93b. Fixed an error in calculating seek time and added support for large numbers of directories for the file creation tests.
  • bonnie++-1.93a.tgz – version 1.93a. Better support of NT, better RPM packaging code, and a minor warning fix.
  • bonnie++-1.93.tgz – version 1.93. Added a new test program for per-char IO for the people on the linux-kernel mailing list. ;)
  • bonnie++1.92b.tgz – version 1.92b. Fixed a bunch of bugs in the random seed selection code.
  • bonnie++1.92a.tgz – version 1.92a. Added support for setting the seed for random numbers and fixed a few bugs.
  • bonnie++1.92.tgz – version 1.92. Changed the code to do per-char tests with read() and write() instead. Now reports much lower results for those tests which are more useful IMHO.
  • bonnie++1.91c.tgz – version 1.91c, it now compiles with namespace support in GCC 3.0 and fixed some minor bugs.
  • bonnie++1.91b.tgz – version 1.91b
  • bonnie++1.91a.tgz – version 1.91a, fixed zcav properly and fixed a bunch of minor bugs in Bonnie++.
  • bonnie++1.91.tgz – version 1.91, fixes all known bugs in the 1.90 series.
  • bonnie++1.90g.tgz – version 1.90g, now latency works properly and always gets parsed properly. Changed the -f option to allow tests of per-char IO for small amounts of data.
  • bonnie++1.90f.tgz – version 1.90f, more work on latency in bonnie++ and some slight changes for OS/2 and NT portability.
  • bonnie++1.90e.tgz – version 1.90e, better OS/2 and NT support.
  • bonnie++1.90d.tgz – version 1.90d, contains code from the OS/2 and NT ports, may break things for some versions of UNIX.
  • bonnie++1.90c.tgz – version 1.90c, produces better web pages (full color) with bon_csv2html.
  • bonnie++1.90b.tgz – version 1.90b, adds support for measuring latency.
  • bonnie++1.90a.tgz – version 1.90a, adds basic threading, changes the format of the CSV files, and updates the programs for managing CSV files (and the man pages).

My Programming Profile

Job adverts are asking for a Github or Stack Overflow account nowadays, I don’t use those sites much so here’s a list of comparable resources:

Here is my list of Debian packages with their QA status. I’ve been a Debian Developer since November 2000.

Here is my Salsa profile, Salsa is the Debian equivalent to Github (using Gitlab software). My account shows as active since 2017 because Salsa was created in 2017.

Here is a list of Projects that I’m the primary maintainer of.

Here are some of the papers I’ve presented at conferences.

TED Talks I Like

ZFS and BTRFS

This article was published in Linux Journal in 2014 [1].

BTRFS and ZFS are two options for protecting against data corruption.

Which one should you use, and how should you use it?

For a long time, the software RAID implementation in the Linux kernel has worked well to protect data against drive failure. It provides great protection against a drive totally failing and against the situation where a drive returns read errors. But what it doesn’t offer is protection against silent data corruption (where a disk returns corrupt data and claims it to be good). It also doesn’t have good support for the possibility of drive failure during RAID reconstruction.

Drives have been increasing in size significantly, without comparable increases in speed. Modern drives have contiguous read speeds 300 times faster than drives from 1988 but are 40,000 times larger (I’m comparing a recent 4TB SATA disk with a 100M ST-506 disk that can sustain 500K/s reads). So the RAID rebuild time is steadily increasing, while the larger storage probably increases the risk of data corruption.

Currently, there are two filesystems available on Linux that support internal RAID with checksums on all data to prevent silent corruption: ZFS and BTRFS. ZFS is from Sun and has some license issues, so it isn’t included in most Linux distributions. It is available from the ZFS On Linux Web site (zfsonlinux.org) [2]. BTRFS has no license problems and is included in most recent distributions, but it is at an earlier stage of development. When discussing BTRFS in this article, I concentrate on the theoretical issues of data integrity and not the practical issues of kernel panics (which happen regularly to me but don’t lose any data). [NB This hasn’t happened in recent years.]

Do Drives Totally Fail?

For a drive totally to fail (that is, be unable to read any data successfully at all), the most common problem used to be “stiction”. That is when the heads stick to the platters, and the drive motor is unable to spin the disk. This seems to be very uncommon in recent times. I presume that drive manufacturers fixed most of the problems that caused it.

In my experience, the most common reason for a drive to become totally unavailable is due to motherboard problems, cabling or connectors—that is, problems outside the drive. Such problems usually can be fixed but may cause some downtime, and the RAID array needs to keep working with a disk missing.

Serious physical damage (for example, falling on concrete) can cause a drive to become totally unreadable. But, that isn’t a problem that generally happens to a running RAID array. Even when I’ve seen drives fail due to being in an uncooled room in an Australian summer, the result has been many bad sectors, not a drive that’s totally unreadable. It seems that most drive “failures” are really a matter of an increasing number of bad sectors.

There aren’t a lot of people who can do research on drive failure. An individual can’t just buy a statistically significant number of disks and run them in servers for a few years. I couldn’t find any research on the incidence of excessive bad sectors vs. total drive failure. It’s widely regarded that the best research on the incidence of hard drive “failure” is the Google Research paper “Failure Trends in a Large Disk Drive Population” [3], which although very informative, gives no information on the causes of “failure”. Google defines “failure” as anything other than an upgrade that causes a drive to be replaced. Not only do they not tell us the number of drives that totally died vs. the number that had some bad sectors, but they also don’t tell us how many bad sectors would be cause for drive replacement.

Lakshmi N. Bairavasundaram, Garth R. Goodson, Bianca Schroeder, Andrea C. Arpaci-Dusseau and Remzi H. Arpaci-Dusseau from the University of Wisconsin-Madison wrote a paper titled “An Analysis of Data Corruption in the Storage Stack” [4]. That paper gives a lot of information about when drives have corrupt data, but it doesn’t provide much information about the case of major failure (tens of thousands of errors), as distinct from cases where there are dozens or hundreds of errors. One thing it does say is that the 80th percentile of latent sector errors per disk with errors is “about 50”, and the 80th percentile of checksum mismatches for disks with errors is “about 100”. So most disks with errors have only a very small number of errors. It’s worth noting that this research was performed with data that NetApp obtained by analyzing the operation of its hardware in the field. NetApp has a long history of supporting a large number of disks in many sites with checksums on all stored data.

I think this research indicates that the main risks of data loss are corruption on disk or a small number of read errors, and that total drive failure is an unusual case.

Redundancy on a Single Disk

By default, a BTRFS filesystem that is created for a single device that’s not an SSD will use “dup” mode for metadata. This means that every metadata block will be written to two parts of the disk. In practice, this can allow for recovering data from drives with many errors. I recently had a 3TB disk develop about 14,000 errors. In spite of such a large number of errors, the duplication of metadata meant that there was little data loss. About 2,000 errors in metadata blocks were corrected with the duplicates, and the 12,000 errors in data blocks (something less than 48M of data) was a small fraction of a 3TB disk. If an older filesystem was used in that situation, a metadata error could corrupt a directory and send all it’s entries to lost+found.

ZFS supports even greater redundancy via the copies= option. If you specify copies=2 for a filesystem, then every data block will be written to two different parts of the disk. The number of copies of metadata will be one greater than the number of copies of data, so copies=2 means that there will be three copies of every metadata block. The maximum number of copies for data blocks in ZFS is three, which means that the maximum number of copies of metadata is four.

The paper “An Analysis of Data Corruption in the Storage Stack” shows that for “nearline” disks (that is, anything that will be in a typical PC or laptop), you can expect a 9.5% probability of read errors (latent sector errors) and a 0.466% probability of silent data corruption (checksum mismatches). The typical Linux Journal reader probably can expect to see data loss from hard drive read errors on an annual basis from the PCs owned by their friends and relatives. The probability of silent data corruption is low enough that every user has a less than 50% chance of seeing it on their own PC during their life—unless they purchased one of the disks with a firmware bug that corrupts data.

If you run BTRFS on a system with a single disk (for example, a laptop), you can expect that if the disk develops any errors, they will result in no metadata loss due to duplicate metadata, and any file data that is lost will be reported to the application by a file read error. If you run ZFS on a single disk, you can set copies=2 or copies=3 for the filesystem that contains your most important data (such as /home on a workstation) to decrease significantly the probability that anything less than total disk failure will lose data. This option of providing extra protection for data is a significant benefit for ZFS when compared to BTRFS.

If given a choice between a RAID-1 array with Linux software RAID (or any other RAID implementation that doesn’t support checksums) and a single disk using BTRFS, I’d choose the single disk with BTRFS in most cases. That is because on a single disk with BTRFS, the default configuration is to use “dup” for metadata. This means that a small number of disk errors will be unlikely to lose any metadata, and a scrub will tell you which file data has been lost due to errors. Duplicate metadata alone can make the difference between a server failing and continuing to run. It is possible to run with “dup” for data as well, but this isn’t a well supported configuration (it requires mixed data and metadata chunks that require you to create a very small filesystem and grow it).

It is possible to run RAID-1 on two partitions on a single disk if you are willing to accept the performance loss. I have a 2TB disk running as a 1TB BTRFS RAID-1, which has about 200 bad sectors and no data loss.

Finally, it’s worth noting that a “single disk” from the filesystem perspective can mean a RAID array. There’s nothing wrong with running BTRFS or ZFS over a RAID-5 array. The metadata duplication that both those filesystems offer will reduce the damage if a RAID-5 array suffers a read error while replacing a failed disk. A hardware RAID array can offer features that ZFS doesn’t offer (such as converting from RAID-1 to RAID-5 and then RAID-6 by adding more disks), and hardware RAID arrays often include a write-back disk cache that can improve performance for RAID-5/6 significantly. There’s also nothing stopping you from using BTRFS or ZFS RAID-1 over a pair of hardware RAID-5/6 arrays.

Drive Replacement

When you replace a disk in Linux Software RAID, the old disk will be marked as faulty first, and all the data will be reconstructed from other disks. This is fine if the other disks are all good, but if the other disks have read errors or corrupt data, you will lose data. What you really need is to have the new disk directly replace the old disk, so the data for the new disk can be read from the old disk or from redundancy in the array, whichever works.

ZFS has a zpool replace command that will rebuild the array from the contents of the old disk and from the other disks in a redundant set. BTRFS supports the same thing with the btrfs replace command. In the most common error situations (where a disk has about 50 bad sectors), this will give you the effect of having an extra redundant disk in the array. So a RAID-5 array in BTRFS or in ZFS (which they call a RAID-Z) should give as much protection as a RAID-6 array in a RAID implementation that requires removing the old disk before adding a new disk. At this time, RAID-5 and RAID-6 support in BTRFS is still fairly new, and I don’t expect it to be ready to use seriously by the time this article is published. But the design of RAID-5 in BTRFS is comparable to RAID-Z in ZFS, and they should work equally well when BTRFS RAID-5 code has been adequately tested and debugged.

Hot spare disks are commonly used to allow replacing a disk more quickly than someone can get to the server. The idea is that the RAID array might be reconstructed before anyone even can get to it. But it seems to me that the real benefit of a hot-spare when used with a modern filesystem, such as ZFS or BTRFS, is that the system has the ability to read from the disk with errors as well as the rest of the array while constructing the new disk. If you have a server where every disk bay contains an active disk (which is a very common configuration in my experience), it is unreasonably difficult to support a disk replacement operation that reads from the failing disk (using an eSATA device for the rebuild isn’t easy). Note that BTRFS doesn’t have automatic hot-spare support yet, but it presumably will get it eventually. In the meantime, a sysadmin has to instruct it to replace the disk manually.

As modern RAID systems (which on Linux servers means ZFS as the only fully functional example at this time) support higher levels of redundancy, one might as well use RAID-Z2 (the ZFS version of RAID-6) instead of RAID-5 with a hot spare, or a RAID-Z3 instead of a RAID-6 with a hot-spare. When a disk is being replaced in a RAID-6/RAID-Z2 array with no hot-spare, you are down to a RAID-5/RAID-Z array, so there’s no reason to use a disk as a hot-spare instead of using it for extra redundancy in the array.

How Much Redundancy Is Necessary?

The way ZFS works is that the copies= option (and the related metadata duplication) is applied on top of the RAID level that’s used for the storage “pool”. So if you use copies=2 on a ZFS filesystem that runs on a RAID-1, there will be two copies of the data on each of the disks. The allocation of the copies is arranged such that it covers different potential failures to the RAID level, so if you had copies=3 for data stored on a three-disk RAID-Z pool, each disk in the pool would have a copy of the data (and parity to help regenerate two other copies). The amount of space required for some of these RAID configurations is impractical for most users. For example, a RAID-Z3 array composed of six 1TB disks would have 3TB of RAID-Z3 capacity. If you then made a ZFS filesystem with copies=3, you would get 1TB of usable capacity out of 6TB of disks. 5/6 disks is more redundancy than most users need.

If data is duplicated in a RAID-1 array, the probability of two disks having errors on matching blocks from independent random errors is going to be very low. The paper from the University of Wisconsin-Madison notes that firmware bugs can increase the probability of corrupt data on matching blocks and suggests using staggered stripes to cover that case. ZFS does stagger some of its data allocation to deal with that problem. Also, it’s fairly common for people to buy disks from two different companies for a RAID-1 array to prevent a firmware bug or common manufacturing defect from corrupting data on two identical drives. The probability of both disks in a BTRFS RAID-1 array having enough errors that data is lost is very low. With ZFS, the probability is even lower due to the mandatory duplication of metadata on top of the RAID-1 configuration and the option of duplication of data. At this time, BTRFS doesn’t support duplicate metadata on a RAID array.

The probability of hitting a failure case that can’t be handled by RAID-Z2 but that can be handled by RAID-Z3 is probably very low. In many deployments, the probability of the server being stolen or the building catching on fire will be greater than the probability of a RAID-Z2 losing data. So it’s worth considering when to spend more money on extra disks and when to spend money on better off-site backups.

In 2007, Val Bercovici of NetApp suggested in a StorageMojo interview that “protecting online data only via RAID 5 today verges on professional malpractice” (storagemojo.com/2007/02/26/netapp-weighs-in-on-disks [link broken]). During the past seven years, drives have become bigger, and the difficulties we face in protecting data have increased. While Val’s claim is hyperbolic, it does have a basis in fact. If you have only the RAID-5 protection (a single parity block protecting each stripe), there is a risk of having a second error before the replacement disk is brought on-line. However, if you use RAID-Z (the ZFS equivalent of RAID-5), every metadata block is stored at least twice in addition to the RAID-5 type protection, so if a RAID-Z array entirely loses a disk and then has a read error on one of the other disks, you might lose some data but won’t lose metadata. For metadata to be lost on a RAID-Z array, you need to have one disk die entirely and then have matching read errors on two other disks. If disk failures are independent, it’s a very unlikely scenario. If, however, the disk failures are not independent, you could have a problem with all disks (and lose no matter what type of RAID you use).

Snapshots

One nice feature of BTRFS and ZFS is the ability to make snapshots of BTRFS subvolumes and ZFS filesystems. It’s not difficult to write a cron job that makes a snapshot of your important data every hour or even every few minutes. Then when you accidentally delete an important file, you easily can get it back. Both BTRFS and ZFS can be configured such that files can be restored from snapshots without root access so users can recover their own files without involving the sysadmin.

Snapshots aren’t strictly related to the the topic of data integrity, but they solve the case of accidental deletion, which is the main reason for using backups. From a sysadmin perspective, snapshots and RAID are entirely separate issues. From the CEO perspective, “is the system working or not?”, they are part of the same issue.

Comparing BTRFS and ZFS

For a single disk in a default configuration, both BTRFS and ZFS will store two copies of each metadata block. They also use checksums to detect when data is corrupted, which is much better than just providing corrupt data to an application and allowing errors to propagate. ZFS supports storing as many as three copies of data blocks on a single disk, which is a significant benefit.

For a basic RAID-1 installation, BTRFS and ZFS offer similar features by default (storing data on both devices with checksums to cover silent corruption). ZFS offers duplicate metadata as a mandatory feature and the option of duplicate data on top of the RAID configuration.

BTRFS supports RAID-0, which is a good option to have when you are working with data that is backed up well. The combination of the use of BTRFS checksums to avoid data corruption and RAID-0 for performance would be good for a build server or any other system that needs large amounts of temporary file storage for repeatable jobs but for which avoiding data corruption is important.

BTRFS supports dynamically increasing or decreasing the size of the filesystem. Also, the filesystem can be rebalanced to use a different RAID level (for example, migrating between RAID-1 and RAID-5). ZFS, however, has a very rigid way of managing storage. For example, if you have a RAID-1 array in a pool, you can never remove it, and you can grow it only by replacing all the disks with larger ones. Changing between RAID-1 and RAID-Z in ZFS requires a backup/format/restore operation, while on BTRFS, you can just add new disks and rebalance.

ZFS supports different redundancy levels (via the copies= setting) on different “filesystems” within the same “pool” (where a “pool” is group of one or more RAID sets). BTRFS “subvolumes” are equivalent in design to ZFS “filesystems”, but BTRFS doesn’t support different RAID parameters for subvolumes at this time.

ZFS supports RAID-Z and RAID-Z2, which are equivalent to BTRFS RAID-5, RAID-6—except that RAID-5 and RAID-6 are new on BTRFS, and many people aren’t ready to trust important data to them. There is no feature in BTRFS or planned for the near future that compares with RAID-Z3 on ZFS. There are plans for future development of extreme levels of redundancy in BTRFS at some future time, but it probably won’t happen soon.

Generally, it seems that ZFS is designed to offer significantly greater redundancy than BTRFS supports, while BTRFS is designed to be easier to manage for smaller systems.

Currently, BTRFS doesn’t give good performance. It lacks read optimization for RAID-1 arrays and doesn’t have any built-in support for using SSDs to cache data from hard drives. ZFS has many performance features and is as fast as a filesystem that uses so much redundancy can be.

Finally, BTRFS is a new filesystem, and people are still finding bugs in it—usually not data loss bugs but often bugs that interrupt service. I haven’t yet deployed BTRFS on any server where I don’t have access to the console, but I have Linux servers running ZFS in another country.

DKIM and Mailing Lists

The Problem

DKIM is a standard for digitally signing mail to prove it’s authenticity and prove that it was not modified. In an ideal situation it can be used to reject corrupted or forged messages.

DMARC and ADSP are standards for indicating that mail from a domain should be signed. This prevents hostile parties from removing the DKIM signature and modifying the message. DKIM is only half as useful without them (it can still prove authenticity but it can’t prove that mail was forged and allow rejecting forged mail).

A mailing list is a software system that receives a message from one person and then generates messages to many people with the same content. A common setting of a mailing list is to insert “[listname]” at the start of the subject line of each message that goes through, this breaks the DKIM signature. Another common setting is to append a footer to the message giving information about the list, this breaks the DKIM signature unless the signature uses the “l=” flag (which Gmail doesn’t). When the “l=” flag is used a hostile party can append text to a message without the signature breaking which is often undesired. Mailman (one of the most common mailing list systems) parses and regenerates headers, so it can break DKIM signatures on messages with unexpected header formatting. Mailman also in some situations uses a different MIME encoding for the body which breaks DKIM signatures.

It seems almost impossible to reliably get all mail to go through a Mailman list without something happening to it that breaks DKIM signatures. The problem is that Mailman doesn’t just send the message through, it creates new messages with new headers (created from a parsed copy of the original headers not copying the original headers), and it sometimes parses and re-encodes the body. Even if you don’t choose to use the features for appending a message footer or changing the subject DKIM signatures will often be broken.

Stripping the Signatures

As there is no way to reliably (IE for every message from every sending domain that uses DKIM) pass through messages with DKIM signatures intact the only option is to strip them. To do that with Mailman edit /etc/mailman/mm_cfg.py, add the directive “REMOVE_DKIM_HEADERS = Yes“, and then restart Mailman. If none of the people who send to your list used DMARC or ADSP then that solves your problem.

However if there are senders who use DMARC or ADSP and recipients who check those features then mail will still be rejected and users will get unsubscribed. When DMARC or ADSP are in use the mailing list can’t send out list mail purporting to be from a list member, it must send out mail from it’s own domain.

A Legitimate From Field

In the web based configuration for Mailman there is a dmarc_moderation_action setting that can munge the From field on messages with a DMARC policy. One thing to note is that when one list uses the dmarc_moderation_action setting it causes DKIM users to configure DMARC which then makes more problems for the people who run lists with no settings for DKIM. Also that doesn’t solve things for ADSP messages or messages that don’t use either DMARC or ADSP. It’s not uncommon for people to have special configuration to prevent forged mail from their own domain, requiring a valid DKIM signature is one way of doing this. Finally many users of DKIM enabled mail servers don’t have the option of enabling DMARC.

If you use the from_is_list setting in the web based configuration for Mailman then all mail will have a From field which shows who the message is from as well as the fact that it came From a list server. This combined with REMOVE_DKIM_HEADERS will allow DKIM signed mail sent to the list to go through correctly in all cases.

If you run many lists then changing them all through the web interface can be tedious. Below is a sample of shell code that will use the Mailman config_list program to change the settings to use from_is_list. NB I haven’t actually run this on a Mailman server with lots of lists so check it before you use it, consider it pseudo-code.

for n in lista listb listc ; do
  config_list -o /tmp/$n $n
  sed -i -e "s/from_is_list = 0/from_is_list = 1/" /tmp/$n
  config_list -i /tmp/$n $n
done

The from_is_list setting makes a change like the following:
-From: Russell Coker <russell at coker.com.au>
+From: Russell Coker via linux-aus <linux-aus at lists.linux.org.au>

SPF

There are similar problems with SPF and other anti-forgery methods. The use of from_is_list solves them too.

Signing List Mail

An ideal list configuration has the list server checking DKIM signatures and DMARC settings before receiving mail. There is normally no reason for a mailing list to send mail to another mailing list so mail that the list server receives should pass all DKIM, DMARC, and ADSP checks. Then the list server should send mail out with it’s own DKIM signature.

When a user receives mail from the list they can verify that the DKIM signature is valid. Then if they know that the sender used DKIM (EG the mail originated from gmail.com or another domain that’s well known to use DKIM) then they know that it was verified at the list server and therefore as long as the list server was not compromised the message was not altered from what the sender wrote.

Resources

The Debian Wiki page about OpenDKIM is worth reading [1]. OpenDKIM is generally regarded as the best free software DKIM verification and signing daemon available. The Debian Wiki only documents how to install it with Postfix but the milter interface is used by other MTAs so it shouldn’t be too hard to get it working with other MTAs. Also the Debian Wiki documents the “relaxed” setting which will in some situations solve some of the problems with Mailman munging messages, but it doesn’t guarantee that they will all be solved. Also in most cases it’s not possible to get every user of your list to change the settings of their DKIM signing to “relaxed” for the convenience of the list admin.

The Mailman Wiki page about DMARC [2] and the Mailman Wiki page about DKIM [3] are both good resources. But this article summarises all you really need to know to get things working.

Here is an example of how to use SpamAssassin to score DKIM signatures and give a negative weight to mail from lists that are known to have problems [4]. Forcing list users to do this means more work overall than just having the list master configure the list server to pass DKIM checks.

Mon

Here are the details of some Mon tests I run:

DNS

The following tests the local DNS cache. I didn’t use example.com in my real tests, I used the domain of a multi-national corporation that has a very short DNS timeout that seems related to their use of the Akamai CDN. I won’t tell people which company to use, but I’m sure that any company that can afford Akamai can afford a query from my server every 5 minutes. ;)

watch 127.0.0.1
  service dnscache
    description DNS caching
    interval 5m
    monitor dns.monitor -caching_only -query www.example.com
    period
    numalerts 1
    alert mailxmpp.alert -x russell@coker.com.au -m russell@coker.com.au
    upalert mailxmpp.alert -x russell@coker.com.au -m russell@coker.com.au

The following section of mon.cf.m4 monitors Google DNS for the validity of domains that I host on my name server. The aim of this is to catch the case where someone forgets to pay for zone renewal so that they can pay while the zone is locked before it becomes available for domain squatters. It uses M4 so it can be generated from the BIND configuration.

watch 8.8.8.8
  service myzones
    description check Google DNS has my zones
    interval 1h
    monitor dns.monitor -caching_only QUERYDOMAINS
    period
    numalerts 1
    alert mailxmpp.alert -x russell@coker.com.au -m russell@coker.com.au
    upalert mailxmpp.alert -x russell@coker.com.au -m russell@coker.com.au

The following Makefile generates a mon.cf file from the BIND configuration that monitors the www entries in zones and the first PTR entries in IPv6 reverse zones. Note that the spaces will need to be converted to a TAB if you want to cut/paste this.

all: mon.cf

mon.cf: mon.cf.m4 /etc/bind/named.conf.local Makefile
        m4 -DQUERYDOMAINS="$(shell for n in $$(grep zone /etc/bind/named.conf.local|sed -e s/^zone..// -e s/\"\ .$$//|grep -v ^//| grep -v arpa$$ ; for n in $$(grep zone.*ip6.arpa /etc/bind/named.conf.local|sed -e s/^zone..// -e s/\”\ .$$//|grep -v ^//) ; do echo -query 1.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.$$n:PTR ; done) ; do echo -query www.$$n ; done)" mon.cf.m4 > mon.cf
        /etc/init.d/mon restart

Free Educational Android Apps

This post has a list of educational apps for Android phones and tablets that I have discovered. I can’t claim that each app is the best for it’s purpose, but I’ve tested out each one and found it to be useful.

I am separating the lists into apps that have full functionality when offline and those which require Internet access. Sometimes it’s handy to be able to load up your phone with apps that you can use later when there’s no net access.

Quick Apps that Don’t Need Internet Access

Here are some apps that can be quickly used without much prior knowledge and without Internet access.

Quick Apps that Need Internet Access

Here are some apps that are easy to use but require Internet access.

More Complex Apps

Classes of Common Apps

There are some educational related categories of apps where there are many apps performing similar tasks, so instead of trying to find one app that could be claimed to be best I’ll just list what you can search for in the Android market. If you know of one particular app in some category that stands out then let me know.

  • “Wikipedia”. There are many apps that read Wikipedia online that work in different ways and none of them really satisfy me. But you need to have one of them.
  • “Conway’s Life” is the classic cellular automata game.
  • “Bridge construction” games are good for teaching some principles of engineering. There are many such games with slightly different features.
  • A “graphing” calculator. In the olden days a graphing calculator cost $100 or more, now there is a range of free apps to do it. Some apps only support a single graph, but apart from that they all seem OK.
  • “Fractal” generating programs can be educational, but only if you have some idea of the maths behind them.
  • “Stop motion” generation programs, also “gif creator” can find some good matches. I haven’t been really satisfied with any of the programs I’ve tried, but some have worked well enough for my needs. Let me know if you find a really good one.