ZFS and BTRFS

This article was published in Linux Journal in 2014 [1].

BTRFS and ZFS are two options for protecting against data corruption.

Which one should you use, and how should you use it?

For a long time, the software RAID implementation in the Linux kernel has worked well to protect data against drive failure. It provides great protection against a drive totally failing and against the situation where a drive returns read errors. But what it doesn’t offer is protection against silent data corruption (where a disk returns corrupt data and claims it to be good). It also doesn’t have good support for the possibility of drive failure during RAID reconstruction.

Drives have been increasing in size significantly, without comparable increases in speed. Modern drives have contiguous read speeds 300 times faster than drives from 1988 but are 40,000 times larger (I’m comparing a recent 4TB SATA disk with a 100M ST-506 disk that can sustain 500K/s reads). So the RAID rebuild time is steadily increasing, while the larger storage probably increases the risk of data corruption.

Currently, there are two filesystems available on Linux that support internal RAID with checksums on all data to prevent silent corruption: ZFS and BTRFS. ZFS is from Sun and has some license issues, so it isn’t included in most Linux distributions. It is available from the ZFS On Linux Web site (zfsonlinux.org) [2]. BTRFS has no license problems and is included in most recent distributions, but it is at an earlier stage of development. When discussing BTRFS in this article, I concentrate on the theoretical issues of data integrity and not the practical issues of kernel panics (which happen regularly to me but don’t lose any data). [NB This hasn’t happened in recent years.]

Do Drives Totally Fail?

For a drive totally to fail (that is, be unable to read any data successfully at all), the most common problem used to be “stiction”. That is when the heads stick to the platters, and the drive motor is unable to spin the disk. This seems to be very uncommon in recent times. I presume that drive manufacturers fixed most of the problems that caused it.

In my experience, the most common reason for a drive to become totally unavailable is due to motherboard problems, cabling or connectors—that is, problems outside the drive. Such problems usually can be fixed but may cause some downtime, and the RAID array needs to keep working with a disk missing.

Serious physical damage (for example, falling on concrete) can cause a drive to become totally unreadable. But, that isn’t a problem that generally happens to a running RAID array. Even when I’ve seen drives fail due to being in an uncooled room in an Australian summer, the result has been many bad sectors, not a drive that’s totally unreadable. It seems that most drive “failures” are really a matter of an increasing number of bad sectors.

There aren’t a lot of people who can do research on drive failure. An individual can’t just buy a statistically significant number of disks and run them in servers for a few years. I couldn’t find any research on the incidence of excessive bad sectors vs. total drive failure. It’s widely regarded that the best research on the incidence of hard drive “failure” is the Google Research paper “Failure Trends in a Large Disk Drive Population” [3], which although very informative, gives no information on the causes of “failure”. Google defines “failure” as anything other than an upgrade that causes a drive to be replaced. Not only do they not tell us the number of drives that totally died vs. the number that had some bad sectors, but they also don’t tell us how many bad sectors would be cause for drive replacement.

Lakshmi N. Bairavasundaram, Garth R. Goodson, Bianca Schroeder, Andrea C. Arpaci-Dusseau and Remzi H. Arpaci-Dusseau from the University of Wisconsin-Madison wrote a paper titled “An Analysis of Data Corruption in the Storage Stack” [4]. That paper gives a lot of information about when drives have corrupt data, but it doesn’t provide much information about the case of major failure (tens of thousands of errors), as distinct from cases where there are dozens or hundreds of errors. One thing it does say is that the 80th percentile of latent sector errors per disk with errors is “about 50”, and the 80th percentile of checksum mismatches for disks with errors is “about 100”. So most disks with errors have only a very small number of errors. It’s worth noting that this research was performed with data that NetApp obtained by analyzing the operation of its hardware in the field. NetApp has a long history of supporting a large number of disks in many sites with checksums on all stored data.

I think this research indicates that the main risks of data loss are corruption on disk or a small number of read errors, and that total drive failure is an unusual case.

Redundancy on a Single Disk

By default, a BTRFS filesystem that is created for a single device that’s not an SSD will use “dup” mode for metadata. This means that every metadata block will be written to two parts of the disk. In practice, this can allow for recovering data from drives with many errors. I recently had a 3TB disk develop about 14,000 errors. In spite of such a large number of errors, the duplication of metadata meant that there was little data loss. About 2,000 errors in metadata blocks were corrected with the duplicates, and the 12,000 errors in data blocks (something less than 48M of data) were a small fraction of a 3TB disk. If an older filesystem were used in that situation, a metadata error could corrupt a directory and send all its entries to lost+found.

ZFS supports even greater redundancy via the copies= option. If you specify copies=2 for a filesystem, then every data block will be written to two different parts of the disk. The number of copies of metadata will be one greater than the number of copies of data, so copies=2 means that there will be three copies of every metadata block. The maximum number of copies for data blocks in ZFS is three, which means that the maximum number of copies of metadata is four.
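
For example, this is set per filesystem on a ZFS On Linux system; the pool and filesystem names below are just examples, not anything from a particular setup:

zfs create -o copies=2 tank/home    # new filesystem with two copies of every data block
zfs set copies=2 tank/home          # or change an existing filesystem (only affects data written afterwards)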

The paper “An Analysis of Data Corruption in the Storage Stack” shows that for “nearline” disks (that is, anything that will be in a typical PC or laptop), you can expect a 9.5% probability of read errors (latent sector errors) and a 0.466% probability of silent data corruption (checksum mismatches). The typical Linux Journal reader probably can expect to see data loss from hard drive read errors on an annual basis from the PCs owned by their friends and relatives. The probability of silent data corruption is low enough that every user has a less than 50% chance of seeing it on their own PC during their life—unless they purchased one of the disks with a firmware bug that corrupts data.

If you run BTRFS on a system with a single disk (for example, a laptop), you can expect that if the disk develops any errors, they will result in no metadata loss due to duplicate metadata, and any file data that is lost will be reported to the application by a file read error. If you run ZFS on a single disk, you can set copies=2 or copies=3 for the filesystem that contains your most important data (such as /home on a workstation) to decrease significantly the probability that anything less than total disk failure will lose data. This option of providing extra protection for data is a significant benefit for ZFS when compared to BTRFS.

If given a choice between a RAID-1 array with Linux software RAID (or any other RAID implementation that doesn’t support checksums) and a single disk using BTRFS, I’d choose the single disk with BTRFS in most cases. That is because on a single disk with BTRFS, the default configuration is to use “dup” for metadata. This means that a small number of disk errors will be unlikely to lose any metadata, and a scrub will tell you which file data has been lost due to errors. Duplicate metadata alone can make the difference between a server failing and continuing to run. It is possible to run with “dup” for data as well, but this isn’t a well supported configuration (it requires mixed data and metadata chunks that require you to create a very small filesystem and grow it).
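
As a rough sketch (the device name and mount point here are examples), creating a single-disk BTRFS filesystem with duplicate metadata and later checking it looks like this:

mkfs.btrfs -m dup /dev/sdb     # dup metadata is the default for a single non-SSD device anyway
mount /dev/sdb /mnt
btrfs scrub start /mnt         # verify checksums and repair from the duplicate copies where possible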

It is possible to run RAID-1 on two partitions on a single disk if you are willing to accept the performance loss. I have a 2TB disk running as a 1TB BTRFS RAID-1, which has about 200 bad sectors and no data loss.

Finally, it’s worth noting that a “single disk” from the filesystem perspective can mean a RAID array. There’s nothing wrong with running BTRFS or ZFS over a RAID-5 array. The metadata duplication that both those filesystems offer will reduce the damage if a RAID-5 array suffers a read error while replacing a failed disk. A hardware RAID array can offer features that ZFS doesn’t offer (such as converting from RAID-1 to RAID-5 and then RAID-6 by adding more disks), and hardware RAID arrays often include a write-back disk cache that can improve performance for RAID-5/6 significantly. There’s also nothing stopping you from using BTRFS or ZFS RAID-1 over a pair of hardware RAID-5/6 arrays.

Drive Replacement

When you replace a disk in Linux Software RAID, the old disk will be marked as faulty first, and all the data will be reconstructed from other disks. This is fine if the other disks are all good, but if the other disks have read errors or corrupt data, you will lose data. What you really need is to have the new disk directly replace the old disk, so the data for the new disk can be read from the old disk or from redundancy in the array, whichever works.

ZFS has a zpool replace command that will rebuild the array from the contents of the old disk and from the other disks in a redundant set. BTRFS supports the same thing with the btrfs replace command. In the most common error situations (where a disk has about 50 bad sectors), this will give you the effect of having an extra redundant disk in the array. So a RAID-5 array in BTRFS or in ZFS (which they call a RAID-Z) should give as much protection as a RAID-6 array in a RAID implementation that requires removing the old disk before adding a new disk. At this time, RAID-5 and RAID-6 support in BTRFS is still fairly new, and I don’t expect it to be ready to use seriously by the time this article is published. But the design of RAID-5 in BTRFS is comparable to RAID-Z in ZFS, and they should work equally well when BTRFS RAID-5 code has been adequately tested and debugged.
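
To illustrate the two commands (the pool name, device names, and mount point are examples):

zpool replace tank sdb sdc                     # ZFS: rebuild onto sdc, reading from sdb and the rest of the pool
btrfs replace start /dev/sdb /dev/sdc /mnt     # BTRFS: the equivalent replace operation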

Hot spare disks are commonly used to allow replacing a disk more quickly than someone can get to the server. The idea is that the RAID array might be reconstructed before anyone even can get to it. But it seems to me that the real benefit of a hot-spare when used with a modern filesystem, such as ZFS or BTRFS, is that the system has the ability to read from the disk with errors as well as the rest of the array while constructing the new disk. If you have a server where every disk bay contains an active disk (which is a very common configuration in my experience), it is unreasonably difficult to support a disk replacement operation that reads from the failing disk (using an eSATA device for the rebuild isn’t easy). Note that BTRFS doesn’t have automatic hot-spare support yet, but it presumably will get it eventually. In the meantime, a sysadmin has to instruct it to replace the disk manually.

As modern RAID systems (which on Linux servers means ZFS as the only fully functional example at this time) support higher levels of redundancy, one might as well use RAID-Z2 (the ZFS version of RAID-6) instead of RAID-5 with a hot spare, or a RAID-Z3 instead of a RAID-6 with a hot-spare. When a disk is being replaced in a RAID-6/RAID-Z2 array with no hot-spare, you are down to a RAID-5/RAID-Z array, so there’s no reason to use a disk as a hot-spare instead of using it for extra redundancy in the array.
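
For example, using six disks for a RAID-Z2 pool instead of five disks plus a hot-spare would look something like this (pool and device names are examples):

zpool create tank raidz2 sda sdb sdc sdd sde sdf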

How Much Redundancy Is Necessary?

The way ZFS works is that the copies= option (and the related metadata duplication) is applied on top of the RAID level that’s used for the storage “pool”. So if you use copies=2 on a ZFS filesystem that runs on a RAID-1, there will be two copies of the data on each of the disks. The allocation of the copies is arranged so that they cover potential failures different from those covered by the RAID level, so if you had copies=3 for data stored on a three-disk RAID-Z pool, each disk in the pool would have a copy of the data (and parity to help regenerate two other copies). The amount of space required for some of these RAID configurations is impractical for most users. For example, a RAID-Z3 array composed of six 1TB disks would have 3TB of RAID-Z3 capacity. If you then made a ZFS filesystem with copies=3, you would get 1TB of usable capacity out of 6TB of disks. Giving up 5/6 of the disk space for redundancy is more than most users need.

If data is duplicated in a RAID-1 array, the probability of two disks having errors on matching blocks from independent random errors is going to be very low. The paper from the University of Wisconsin-Madison notes that firmware bugs can increase the probability of corrupt data on matching blocks and suggests using staggered stripes to cover that case. ZFS does stagger some of its data allocation to deal with that problem. Also, it’s fairly common for people to buy disks from two different companies for a RAID-1 array to prevent a firmware bug or common manufacturing defect from corrupting data on two identical drives. The probability of both disks in a BTRFS RAID-1 array having enough errors that data is lost is very low. With ZFS, the probability is even lower due to the mandatory duplication of metadata on top of the RAID-1 configuration and the option of duplication of data. At this time, BTRFS doesn’t support duplicate metadata on a RAID array.

The probability of hitting a failure case that can’t be handled by RAID-Z2 but that can be handled by RAID-Z3 is probably very low. In many deployments, the probability of the server being stolen or the building catching on fire will be greater than the probability of a RAID-Z2 losing data. So it’s worth considering when to spend more money on extra disks and when to spend money on better off-site backups.

In 2007, Val Bercovici of NetApp suggested in a StorageMojo interview that “protecting online data only via RAID 5 today verges on professional malpractice” (storagemojo.com/2007/02/26/netapp-weighs-in-on-disks [link broken]). During the past seven years, drives have become bigger, and the difficulties we face in protecting data have increased. While Val’s claim is hyperbolic, it does have a basis in fact. If you have only the RAID-5 protection (a single parity block protecting each stripe), there is a risk of having a second error before the replacement disk is brought on-line. However, if you use RAID-Z (the ZFS equivalent of RAID-5), every metadata block is stored at least twice in addition to the RAID-5 type protection, so if a RAID-Z array entirely loses a disk and then has a read error on one of the other disks, you might lose some data but won’t lose metadata. For metadata to be lost on a RAID-Z array, you need to have one disk die entirely and then have matching read errors on two other disks. If disk failures are independent, it’s a very unlikely scenario. If, however, the disk failures are not independent, you could have a problem with all disks (and lose no matter what type of RAID you use).

Snapshots

One nice feature of BTRFS and ZFS is the ability to make snapshots of BTRFS subvolumes and ZFS filesystems. It’s not difficult to write a cron job that makes a snapshot of your important data every hour or even every few minutes. Then when you accidentally delete an important file, you easily can get it back. Both BTRFS and ZFS can be configured such that files can be restored from snapshots without root access so users can recover their own files without involving the sysadmin.
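
A minimal sketch of such a cron job (the subvolume, dataset, and snapshot paths are assumptions; note that % must be escaped in crontab entries, and /home must be a BTRFS subvolume for the first line to work):

# /etc/cron.d/snapshots
0 * * * * root btrfs subvolume snapshot -r /home /home/.snapshots/home-$(date +\%Y\%m\%d\%H\%M)
0 * * * * root zfs snapshot tank/home@$(date +\%Y\%m\%d\%H\%M)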

Snapshots aren’t strictly related to the topic of data integrity, but they solve the case of accidental deletion, which is the main reason for using backups. From a sysadmin perspective, snapshots and RAID are entirely separate issues. From the CEO’s perspective (“is the system working or not?”), they are part of the same issue.

Comparing BTRFS and ZFS

For a single disk in a default configuration, both BTRFS and ZFS will store two copies of each metadata block. They also use checksums to detect when data is corrupted, which is much better than just providing corrupt data to an application and allowing errors to propagate. ZFS supports storing as many as three copies of data blocks on a single disk, which is a significant benefit.

For a basic RAID-1 installation, BTRFS and ZFS offer similar features by default (storing data on both devices with checksums to cover silent corruption). ZFS offers duplicate metadata as a mandatory feature and the option of duplicate data on top of the RAID configuration.

BTRFS supports RAID-0, which is a good option to have when you are working with data that is backed up well. The combination of the use of BTRFS checksums to avoid data corruption and RAID-0 for performance would be good for a build server or any other system that needs large amounts of temporary file storage for repeatable jobs but for which avoiding data corruption is important.

BTRFS supports dynamically increasing or decreasing the size of the filesystem. Also, the filesystem can be rebalanced to use a different RAID level (for example, migrating between RAID-1 and RAID-5). ZFS, however, has a very rigid way of managing storage. For example, if you have a RAID-1 array in a pool, you can never remove it, and you can grow it only by replacing all the disks with larger ones. Changing between RAID-1 and RAID-Z in ZFS requires a backup/format/restore operation, while on BTRFS, you can just add new disks and rebalance.
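
For example, migrating a BTRFS filesystem to RAID-5 after adding a disk might look like this (the device name and mount point are examples):

btrfs device add /dev/sdc /mnt
btrfs balance start -dconvert=raid5 -mconvert=raid5 /mnt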

ZFS supports different redundancy levels (via the copies= setting) on different “filesystems” within the same “pool” (where a “pool” is a group of one or more RAID sets). BTRFS “subvolumes” are equivalent in design to ZFS “filesystems”, but BTRFS doesn’t support different RAID parameters for subvolumes at this time.

ZFS supports RAID-Z and RAID-Z2, which are equivalent to BTRFS RAID-5 and RAID-6—except that RAID-5 and RAID-6 are new on BTRFS, and many people aren’t ready to trust important data to them. There is no feature in BTRFS or planned for the near future that compares with RAID-Z3 on ZFS. There are plans for such extreme levels of redundancy in BTRFS at some future time, but it probably won’t happen soon.

Generally, it seems that ZFS is designed to offer significantly greater redundancy than BTRFS supports, while BTRFS is designed to be easier to manage for smaller systems.

Currently, BTRFS doesn’t give good performance. It lacks read optimization for RAID-1 arrays and doesn’t have any built-in support for using SSDs to cache data from hard drives. ZFS has many performance features and is as fast as a filesystem that uses so much redundancy can be.

Finally, BTRFS is a new filesystem, and people are still finding bugs in it—usually not data loss bugs but often bugs that interrupt service. I haven’t yet deployed BTRFS on any server where I don’t have access to the console, but I have Linux servers running ZFS in another country.

RAM Speed according to Memtest86+

Here are some speed results for RAM according to Memtest86+ on some machines that I have run. Note that the reported speed varies between runs by a few percent.

Thinkpad 600e PentiumII 400Mhz PC-66 RAM (2 DIMMs) 174MB/s
Compaq P3-866MHz PC133 RAM (3 DIMMs, 2*128 + 256) 190MB/s
Compaq Athlon 1GHz PC133 RAM (3 DIMMs) 219MB/s
Compaq P3-800MHz PC133 RAM (1 DIMM) 270MB/s
Compaq P3-800MHz PC133 RAM (3 DIMMs, 2*128 + 256) 240MB/s
Compaq P4 1.5GHz PC133 RAM (3 DIMMs) 486MB/s
Compaq P4 1.5GHz PC133 RAM (1 or 2 DIMMs) 490MB/s
EeePC 701, DDR2-665 PC2-5300 running at DDR2-333 speed 798MB/s
HP Celeron 1.8GHz PC2100/DDR266 (1 DIMM) 824MB/s
HP Celeron 2.4GHz PC2100/DDR266 RAM (2 DIMMs) 984MB/s
Celeron D (32bit) 2.93GHz PC2400/DDR300 PC3200 RAM 1,140MB/s
HP Celeron 2.4GHz PC2700/DDR333 RAM (2 DIMMs) 1,027MB/s
HP Celeron 2.4GHz PC2700/DDR333 RAM (2 DIMMs) 1,375MB/s
Dell PowerEdge T105 Dual-core Opteron 1212 (2GHz) single DDR2-667 ECC RAM 1,670MB/s
Dell PowerEdge T105 Dual-core Opteron 1212 (2GHz) pair of DDR2-667 ECC RAM 1,826MB/s
NEC Pentium E2160 1.8GHz DDR663 (two mismatched DIMMs) 2,307MB/s
IBM Pentium E2160 1.8GHz DDR2-667 PC2-5300 (single DIMM) 2,371MB/s
IBM Pentium E2160 1.8GHz PC2-4200 (paired DIMMs) 2,436MB/s
NEC Pentium E4600 2.4GHz DDR2-667 PC2-5300 (single DIMM) 2,528MB/s
NEC Pentium D 2.8GHz DDR 533 (unpaired DIMMS) 1,600MB/s
NEC Pentium D 2.8GHz DDR 533 (paired DIMMS) 2,600MB/s
Thinkpad T61 DDR2-665 PC2-5300 (single DIMM) 2,023MB/s
Thinkpad T61 DDR2-665 PC2-5300 (paired or mismatched DIMMs) 2,823MB/s
Core Gen2 2492MHz 3,804MB/s
Core Gen2 2492MHz 2 DIMMs 3,930MB/s
Q8400 2*PC6400 4,830MB/s
Intel X3450 2.67GHz, DDR3-1066 (4 DIMMs) 10.0GB/s
i7-2600 3.4GHz, DDR3-1333 PC3-10600 (paired DIMMs) 16.6GB/s
Dell PowerEdge T110 Core i3-3220 3GHz, DDR3-1600 PC3-12800 (paired DIMMs) 17.9GB/s
Dell PowerEdge T30 Core E3-1225 3.3GHz, DDR4-2133 PC4-17000 (single DIMM) 12.5GB/s

The Wikipedia page on SDRAM lists the theoretical speeds and the various names of the different types of DDR RAM (each type seems to have at least two names).

DDR266 theoretically can do 2100MB/s, but I’ve only seen it do 984MB/s (with two DIMMs).

Mon

Here are the details of some Mon tests I run:

DNS

The following tests the local DNS cache. I didn’t use example.com in my real tests; I used the domain of a multi-national corporation that has a very short DNS timeout that seems related to their use of the Akamai CDN. I won’t tell people which company to use, but I’m sure that any company that can afford Akamai can afford a query from my server every 5 minutes. ;)

watch 127.0.0.1
  service dnscache
    description DNS caching
    interval 5m
    monitor dns.monitor -caching_only -query www.example.com
    period
    numalerts 1
    alert mailxmpp.alert -x russell@coker.com.au -m russell@coker.com.au
    upalert mailxmpp.alert -x russell@coker.com.au -m russell@coker.com.au

The following section of mon.cf.m4 monitors Google DNS for the validity of domains that I host on my name server. The aim of this is to catch the case where someone forgets to pay for zone renewal so that they can pay while the zone is locked before it becomes available for domain squatters. It uses M4 so it can be generated from the BIND configuration.

watch 8.8.8.8
  service myzones
    description check Google DNS has my zones
    interval 1h
    monitor dns.monitor -caching_only QUERYDOMAINS
    period
    numalerts 1
    alert mailxmpp.alert -x russell@coker.com.au -m russell@coker.com.au
    upalert mailxmpp.alert -x russell@coker.com.au -m russell@coker.com.au

The following Makefile generates a mon.cf file from the BIND configuration that monitors the www entries in zones and the first PTR entries in IPv6 reverse zones. Note that the spaces will need to be converted to a TAB if you want to cut/paste this.

all: mon.cf

mon.cf: mon.cf.m4 /etc/bind/named.conf.local Makefile
        m4 -DQUERYDOMAINS="$(shell for n in $$(grep zone /etc/bind/named.conf.local|sed -e s/^zone..// -e s/\"\ .$$//|grep -v ^//|grep -v arpa$$) ; do echo -query www.$$n ; done ; for n in $$(grep zone.*ip6.arpa /etc/bind/named.conf.local|sed -e s/^zone..// -e s/\"\ .$$//|grep -v ^//) ; do echo -query 1.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.$$n:PTR ; done)" mon.cf.m4 > mon.cf
        /etc/init.d/mon restart

Debian Repositories

Here is a list of the Debian repositories I maintain. I include sources.list lines both directly and via a TOR Hidden Service (see Petter’s blog post about TOR and APT [1]).

All my repositories support i386 and amd64 architectures. Not all packages will be supported in both architectures. In the past I’ve had to rebuild lots of i386 packages to avoid execmem, while amd64 packages needed no changes, due in part to amd64 having more registers. But sometimes I just don’t need a package on an architecture for my own systems.

While I aim to make i386 and amd64 equally usable for everyone for SE Linux and WordPress, the misc repository is mostly to suit my own needs.

Stretch

Currently the only repository I have for Stretch is for WordPress. I will probably have a SE Linux repository soon.

I am no longer updating the Jessie repository for WordPress, but I expect that the Stretch packages will work on Jessie without any problems. So you can use the Stretch repository if you are still running Jessie.

deb http://www.coker.com.au stretch wordpress

deb http://6kbiotdr2vvb7ftu.onion stretch wordpress

Jessie

The SE Linux repository has policy and various updates of Jessie packages for better support as well as backports from testing/unstable. The backport of systemd has no support for AppArmor as supporting that was inconvenient and it’s impossible to run both SE Linux and AppArmor at the same time.

The misc repository has random changes that aren’t liked by the Debian maintainer, backports, etc. Currently the only thing in there is a build of Sendmail that uses const more often in Milter headers.

The wordpress repository has lots of things that I haven’t added to Debian because of the difficulty in providing useful security support. Also, the Jessie WordPress repository worked with a Wheezy system the last time I tested; it depends on the latest version of WordPress, not on any other package.

deb http://www.coker.com.au jessie selinux
deb http://www.coker.com.au jessie misc
deb http://www.coker.com.au jessie wordpress

deb http://6kbiotdr2vvb7ftu.onion jessie selinux
deb http://6kbiotdr2vvb7ftu.onion jessie misc
deb http://6kbiotdr2vvb7ftu.onion jessie wordpress

Wheezy

The wheezy repositories for SE Linux, misc, and WordPress have much the same aims as the Jessie versions – but I’ve had more time to work on them. Note that I am no longer updating the WordPress repository.

The ZFS repository is only for amd64. I generally don’t recommend that people use it. Install ZFS from zfsonlinux.org. I created the wheezy repository before they created their repository. I’m not deleting it because it MIGHT be useful in some situation.

deb http://www.coker.com.au wheezy selinux
deb http://www.coker.com.au wheezy misc
deb http://www.coker.com.au wheezy wordpress
deb http://www.coker.com.au wheezy zfs

deb http://6kbiotdr2vvb7ftu.onion wheezy selinux
deb http://6kbiotdr2vvb7ftu.onion wheezy misc
deb http://6kbiotdr2vvb7ftu.onion wheezy wordpress
deb http://6kbiotdr2vvb7ftu.onion wheezy zfs

Squeeze

I’ve got a bunch of old Squeeze repositories. Apart from the SE Linux one I don’t think that any of them provide much benefit at this time. I’m not supporting Squeeze SE Linux nowadays except as a consulting service.

deb http://www.coker.com.au squeeze selinux
deb http://6kbiotdr2vvb7ftu.onion squeeze selinux

Free Educational Android Apps

This post has a list of educational apps for Android phones and tablets that I have discovered. I can’t claim that each app is the best for its purpose, but I’ve tested out each one and found it to be useful.

I am separating the lists into apps that have full functionality when offline and those which require Internet access. Sometimes it’s handy to be able to load up your phone with apps that you can use later when there’s no net access.

Quick Apps that Don’t Need Internet Access

Here are some apps that can be quickly used without much prior knowledge and without Internet access.

Quick Apps that Need Internet Access

Here are some apps that are easy to use but require Internet access.

More Complex Apps

Classes of Common Apps

There are some education-related categories of apps where there are many apps performing similar tasks, so instead of trying to find one app that could be claimed to be best I’ll just list what you can search for in the Android market. If you know of one particular app in some category that stands out then let me know.

  • “Wikipedia”. There are many apps that read Wikipedia online that work in different ways and none of them really satisfy me. But you need to have one of them.
  • “Conway’s Life” is the classic cellular automata game.
  • “Bridge construction” games are good for teaching some principles of engineering. There are many such games with slightly different features.
  • A “graphing” calculator. In the olden days a graphing calculator cost $100 or more; now there is a range of free apps to do it. Some apps only support a single graph, but apart from that they all seem OK.
  • “Fractal” generating programs can be educational, but only if you have some idea of the maths behind them.
  • “Stop motion” generation programs; searching for “gif creator” can also find some good matches. I haven’t been really satisfied with any of the programs I’ve tried, but some have worked well enough for my needs. Let me know if you find a really good one.

SE Linux Play Machine

Free root access on a SE Linux machine!

To access my Debian play machine ssh to zp7zwyd5t3aju57m.onion as root, the password is “SELINUX“.
I give no-one permission to distribute this password. If you want to share information on this machine you must give the URL to this web site. In some jurisdictions it would be considered a crime to distribute the password without my permission (IE without giving the URL to this web page).

Note that such machines require a lot of skill if you are to run them successfully. If you have to ask whether you should run one then the answer is “no“.

The aim of this is to demonstrate that all necessary security can be provided by SE Linux without any Unix permissions (however it is still recommended that you use Unix permissions as well for real servers). Also it gives you a chance to log in to an SE Linux machine and see what it’s like.

When you login to a SE Linux play machine make sure that you use the -x option to disable X11 forwarding or set ForwardX11 no in your /etc/ssh/ssh_config file before you login. Also make sure that you use the -a option to disable ssh agent forwarding or set ForwardAgent no in your /etc/ssh/ssh_config file before you login.
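
For example (torsocks is just one way of reaching a Tor hidden service; any Tor SOCKS wrapper or a suitable ProxyCommand will do):

torsocks ssh -a -x root@zp7zwyd5t3aju57m.onion    # -a disables agent forwarding, -x disables X11 forwarding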

If you don’t correctly disable these settings then logging in to the play machine will put you at risk of being attacked through your SSH client.

There is an IRC channel for discussing this, it is #selinux on irc.freenode.net.

FAQ

  • Editing thanks.txt_append_only with vi won’t work, use cat or echo to append data to the file. The following commands will work:
    echo something >> thanks.txt_append_only_dont_edit_with_vi
    cat >> thanks.txt_append_only_dont_edit_with_vi
    
  • There is no harm in letting you see dmesg output for such a machine, security by obscurity isn’t much good anyway. For a serious server you would probably deny dmesg access, but this is a play machine. One of the purposes of the machine is to teach people about SE Linux, and you can learn a lot from the dmesg output.
  • This is not a simulated machine or honeypot. It’s a real Lenovo ThinkCenter desktop PC running Debian/Jessie (pre-release) SE Linux in a Xen DomU. You really have UID==0. The Xen configuration is a default Debian install with a standard Debian kernel. SE Linux does its own permission checks in addition to the Unix permission checks. If you don’t believe me you are free to write assembler programs to call getuid() etc. But it would be a lot easier for you to just install a recent version of Debian or Fedora, see how it works, and read the source if you wish.
  • I will provide instructions on installing such machines soon.
  • To administer a SE Linux machine you need to have sysadm_r (the SE Linux administrative role) and UID==0 (the regular Unix admin account). So there needs to be a UID==0 account. As in regular Linux the UID==0 account does not need to be named “root”. In the case of this machine the root account has UID 0, but it has few privs in SE Linux.
  • The default policy in Fedora is known as the targeted policy; it has no restrictions on user login sessions (so can never be used for such a machine). The policy I use for this machine is known as the strict policy. The default configuration of the strict policy does not support running in such a manner and requires some changes.
  • This machine is intentionally more permissive than some other play machines. I let you see the policy files so you can learn how to configure a machine in this way.
  • Regarding core-dumping bash to read the history. That’s nice work, but you could have just used cat, grep, or any of your favourite tools on /root/.bash_history with much less effort.
  • Some people have asked for ping, telnet, etc access. I would like to provide such access (and have provided it in the past). I removed ping access because some people were using ping with large packet sizes to attack machines with small network connections. I removed telnet access because people were running scripts to try and discover (and attack) hosts with broken telnetd’s. As for whether the machine is usable without such access, for its intended purpose (demonstrating what SE Linux can do) it is quite useful. As a general shell server it’s not very useful because you share your account with lots of people who may rm your files or kill your processes.
  • Some types of files and directories may not be stat’d by unprivileged users (this includes shadow_t for /etc/shadow). Such files and directories show up in flashing red in the output of “ls -l” because ls can’t even determine whether it’s a file or a directory.

Software vs Hardware RAID

It’s a commonly held myth that hardware RAID is unconditionally better than software RAID. That claim is not true in all cases and is particularly wrong at the low end.

Really Cheap Hardware RAID

The cheapest so-called hardware RAID uses RAID in the BIOS and relies on an OS driver for support when running in protected mode. This is essentially a different sort of software RAID but with BIOS support to boot from it. Using a different disk format to the standard software RAID for your OS can make it more difficult to recover when things go wrong and there’s no benefit to this. If you use software RAID-1 from your OS and set things up correctly then you can boot from either disk. Using software RAID-1 for booting and RAID-5 or RAID-6 for the OS and data is a viable option.
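
A minimal sketch of the software RAID-1 setup described above (partition names are examples; installing the boot loader on both disks is what lets you boot from either one):

mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
grub-install /dev/sda
grub-install /dev/sdb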

Cheap Hardware RAID

Cheap hardware RAID doesn’t have write-back caching and therefore can’t give any significant performance benefit over software RAID. Note that there are different options for how RAID stripes are laid out which can affect performance, so if a cheap hardware RAID device gives any significant performance benefit over software RAID then it’s probably due to where the blocks happen to be stored working well with your filesystem, which is of course a benefit you could get from tuning software RAID.

The Mythical CPU Benefits of Hardware RAID

It’s widely regarded that hardware RAID is faster due to taking the processing away from the CPU. But the truth is that for at least the last 10 years CPUs have been fast enough and in fact it’s often been the case that RAID controllers have been the bottleneck.

When I loaded the Linux RAID-5/RAID-6 driver on my Thinkpad T61, its 2.2GHz T7500 CPU (which isn’t a particularly new or powerful laptop CPU) was measured as capable of 3227MB/s for RAID-6 calculations. The fastest SATA disk I’ve benchmarked was capable of sustaining almost 120MB/s on its outer tracks. If we assume that newer disks are capable of 150MB/s then my Thinkpad could handle the RAID calculations for an array of 20 such disks.

An old P3-1GHz desktop system I use for a low-end server can do 591MB/s of RAID-6 calculations in software; if I were able to connect SATA disks to that old system then it could drive four of them in a RAID array at full speed!
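
You can see the numbers for your own CPU when the RAID-5/RAID-6 driver is loaded; the exact wording of the kernel messages varies between kernel versions:

modprobe raid456
dmesg | grep raid6     # shows the gen()/xor() speeds measured for each algorithm and which one was chosen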

It’s often regarded that a benefit of hardware RAID is to avoid CPU use. Contiguous IO can use a moderate amount of CPU power; I could potentially use 20% of one core of a T7500 if I had four disks running at once. But usually contiguous IO isn’t that common. If you are using a Gigabit Ethernet port to transfer data then you are limited to something slightly more than 100MB/s, and most applications don’t involve large contiguous data transfers anyway, so the amount of data actually transferred (and the CPU time spent on RAID calculations) is lower still.

One way that hardware RAID can save CPU time is if the interface to the hard drives was inefficient. The IDE interface didn’t seem particularly efficient and large transfers to IDE disks used to often require more CPU time than was expected. For such disks having them on a RAID controller that emulated a giant SCSI disk could save some CPU time.

Back in 2000 I did some tests on a Mylex DAC 960 hardware RAID controller that was only capable of sustaining 10MB/s. This wasn’t a problem as the applications were seek intensive and the Mylex performed well for that task. But for contiguous IO software RAID would have given much better performance.

The Real Benefits of Hardware RAID

A good hardware RAID system will have NVRAM for a write-back cache. This can dramatically improve write performance which is very important on RAID-5 and RAID-6 systems that perform really badly for small writes.

Good hardware RAID controllers will often support many more disks than a non-RAID controller. If you want to have more than 4 disks then hardware RAID has some serious benefits. But it has to have NVRAM write-back cache, otherwise you get no useful benefits and you might as well use software RAID.

Conclusion

If you can’t afford a high-end RAID system like a HP CCISS then use software RAID. Software RAID will be faster and more reliable than cheap hardware RAID.

If you need more than four disks then you can probably benefit a lot from hardware RAID with write-back caching.

SE Linux Terminology

Security Context is the SE Linux label for a process, file, or other resource. Each process or object that a process may access has exactly one security context. It has four main parts separated by colons: User:Role:Domain/Type:Sensitivity Label. Note that the Sensitivity Label is a compile-time option that all distributions enable nowadays.
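
For example, the following commands show the context of your login session and of a file; the exact contexts depend on the policy and distribution, the ones in the comments are just plausible examples:

id -Z                 # EG user_u:user_r:user_t:s0
ls -Z /etc/shadow     # EG system_u:object_r:shadow_t:s0 /etc/shadow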

User in terms of SE Linux is also known as the Identity. The program semanage can be used to add new identities and to change the roles and sensitivities assigned to them. System users often end in “_u” (EG user_u, unconfined_u, and system_u) but this is just a convention used to distinguish system users from users that associate directly with Unix accounts – which are typically the same as the name of the account. So the user with Unix account john might have a SE Linux user/identity of john. Note that as the local sysadmin can change the user names with semanage you can’t make any strong assumptions about a naming convention. When a process creates a resource (such as a file on disk) then by default the resource will have the same user as the process.
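
As a sketch of adding an identity with semanage (the identity name, role, and Unix account here are examples, and the exact options accepted vary between policy versions), something like the following should work:

semanage user -a -R "staff_r" john_u     # create a new SE Linux user/identity with the staff_r role
semanage login -a -s john_u john         # map the Unix account john to that identity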

Role for a process determines the set of domains that may be used for running a child process. Through semanage you can configure which roles may be entered by each user. The default policy has the roles user_r, staff_r, sysadm_r, and system_r. Adding new roles requires recompiling the policy which is something that most sysadmins don’t do. So you can expect that all role names end in “_r“.

Object Class refers to the object that is to be accessed; there are 82 object classes in the latest policy, many of which are related to things such as the X server. Some object classes are file, dir, chr_file, and blk_file. The reason for having an object class is so that access can be granted to one object with a given type label but not be granted to another object of a different object class.

Type is the primary label for the Domain/Type or Type-Enforcement model of access control, by tradition a type name ends in “_t“. There is no strong difference between a domain and a type, a domain is the type of a process. In the DT model there are a set of rules which specify what happens when a domain tries to access an object of a certain object class for a particular access (read, write, etc).

MLS stands for Multi Level Security; it’s a hierarchical system for restricting access to sensitive data. Its core principle is that of no write-down and no read-up. In a MLS system you can only write data to a resource with an equal or higher sensitivity label.

MCS stands for Multi Category Security.

Sensitivity Level is for a hierarchical level of sensitivity in the MLS policy. In the default policy there are 16 levels from s0 to s15. The MCS policy uses some of the mechanisms of MLS but not the level, so in MCS the level is always set to s0. The policy can be recompiled to have different numbers of levels.

Category is a primitive for the MCS and MLS policies. The default policy has 1024 categories from c0 to c1023, the policy can be recompiled to have different numbers of categories.

Sensitivity Label is for implementing MLS and MCS access controls. It may be ranged, in which case it has a form “LOW-HIGH” where both LOW and HIGH are comprised of a Sensitivity Level and a set of categories separated by a colon – EG “s0:c1-s1:c1.c10” means the range from level s0 with category c1 to the level s1 with the set of categories from c1 to c10 inclusive. If it isn’t ranged then it just has a level and a set of categories separated by a colon. In a set of categories a dot is used to indicate a range of categories (all categories between the low one and the high one are included) while a comma indicates a discontinuity in the range. So “c1.c10,c13” means the set of all categories between c1 and c10 inclusive plus the category c13. The kernel will canonicalise category sets, so if it is passed “c1,c2,c3” then it will return “c1.c3“. These raw labels may be translated into a more human readable form by mcstransd.

Constraint is a rule that restricts access. SE Linux is based on the concept of deny by default and the domain-type model uses rules to allow certain actions. Constraints are used for special cases where access needs to be restricted outside of the domain-type model. MCS and MLS are implemented using constraints.

MySQL Cheat Sheet

This document is designed to be a cheat-sheet for MySQL. I don’t plan to cover everything, just most things that a novice MySQL DBA is likely to need often or in a hurry.

Configuring mysqld

If you are going to provide a database service to other machines edit /etc/mysql/my.cnf and set the bind-address parameter to a suitable value. A value of 0.0.0.0 will cause it to accept connections on any of the server’s addresses. I recommend using a private address range (10.0.0.0/8, 192.168.0.0/16, or 172.16.0.0/12) for such database connections and ideally a back-end VLAN or Ethernet switch that doesn’t carry any public data.

For the purpose of this post let’s consider the MySQL server to have a private IP address of 192.168.42.1. So you want the my.cnf file to have bind-address = 192.168.42.1
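
So the relevant part of the config file would look like this:

[mysqld]
bind-address = 192.168.42.1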

To start mysql administration use the command mysql -u root. In Debian the root account has no password by default; on CentOS 5.x starting mysql for the first time gives a message:
PLEASE REMEMBER TO SET A PASSWORD FOR THE MySQL root USER !
To do so, start the server, then issue the following commands:
/usr/bin/mysqladmin -u root password 'new-password'
/usr/bin/mysqladmin -u root -h server password 'new-password'

That is wrong; for the second mysqladmin command you need a “-p” option (or you can reverse the order of the commands).

There is also the /usr/bin/mysql_secure_installation script that has an interactive dialog for locking down the MySQL database.

Administrative Password Recovery

If you lose the administration password the recovery process is as follows:

  1. Stop mysqld; this may require killing the daemon if the password for the system account used for shutdown access is also lost.
  2. Start mysqld with the --skip-grant-tables option.
  3. Use SQL commands such as “UPDATE mysql.user SET Password=PASSWORD('password') WHERE User='root';” to recover the passwords you need.
  4. Use the SQL command “FLUSH PRIVILEGES;”.
  5. Restart mysqld in the normal manner.

User Configuration

For an account to automatically log in to mysql you need to create a file named ~/.my.cnf with the following contents:
[client]
user=USERNAME
password=PASSWORD
database=DBNAME

Replace USERNAME, PASSWORD, and DBNAME with the appropriate values; they are all optional parameters. This saves using the mysql client’s “-u” option for the username, “-p” for the password, and specifying the database name on the command line. Note that using the “-pPASSWORD” command-line option to the mysql client is insecure on multi-user systems as (in the absence of any security system such as SE Linux) any user can briefly see the password via ps.

Note that the presence of the database= option in the config file breaks mysqlshow and mysqldump for MySQL 5.1.51 (and presumably earlier versions too). So it’s often a bad idea to use it.

Grants

To grant all access to a new database:
CREATE DATABASE foo_db;
USE foo_db;
GRANT ALL PRIVILEGES ON foo_db.* to 'user'@'10.1.2.3' IDENTIFIED BY 'pass';

Where 10.1.2.3 is the client address and pass is the password. Replace 10.1.2.3 with % if you want to allow access from any client address.

Note that if you use “foo_db” instead of “foo_db.*” then you will end up granting access to foo_db.foo_db (a table named foo_db in the foo_db database) which generally is not what you want.

To grant read-only access replace “ALL PRIVILEGES” with “SELECT“.

To show what is granted to the current user run “SHOW GRANTS;” .

To show the privs for a particular user run “SHOW GRANTS FOR 'user'@'10.1.2.3';”.

To show all entries in the user table (user-name, password, and hostname):
USE mysql;
SELECT Host,User,Password FROM user;

To do the same thing at the command-line:
echo "SELECT Host,User,Password FROM user;" | mysql mysql

To revoke access:
REVOKE ALL PRIVILEGES ON foo_db.* FROM 'user'@'10.1.2.3';

To test a user’s access connect as the user with a command such as the following:
mysql -u user -h 10.1.2.4 -p foo_db

Then test that the user can create tables with the following mysql commands:
CREATE TABLE test (id INT);
DROP TABLE test;

Listing the Databases

To list all databases that are active on the selected server run “mysqlshow”; it uses the same methods of determining the username and password as the mysql client program.

To list all tables in a database run “SHOW TABLES;”. For more detail select from INFORMATION_SCHEMA.TABLES or run “SHOW TABLE STATUS;”.

For example to see the engine that is used for each table you can use the command echo "SELECT table_schema, table_name, engine FROM INFORMATION_SCHEMA.TABLES;" | mysql.

But INFORMATION_SCHEMA.TABLES is only in MySQL 5 and above; for prior versions you can use mysqldump -d to get the schema, or “SHOW CREATE TABLE table_name;” at the command-line.

Also the mysqlshow program can be used to display the tables in a database via “mysqlshow database” or the columns in a table via “mysqlshow database table”.

To list active connections: “SHOW PROCESSLIST;”

Database backup

The program mysqldump is used to make a SQL dump of the database. EG: “mysqldump mysql” to dump the system tables. The data compresses well (being plain text of a regular format) so piping it through “gzip -9” is a good idea. To backup the system database you could run “mysqldump mysql | gzip -9 > mysql.sql.gz”. To restore simply run “mysql -u user database < file”; in the case of the previous example, “zcat mysql.sql.gz | mysql -u root database”.
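
Putting the above together as commands you can copy directly (the target database name on the restore is whatever you dumped, here the mysql system database):

mysqldump mysql | gzip -9 > mysql.sql.gz    # dump the system tables
zcat mysql.sql.gz | mysql -u root mysql     # restore them (the last argument is the target database)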

To dump only selected tables you can run “mysqldump database table1 [table2]”.

The option --skip-extended-insert means that a single INSERT statement will be used for each row. This gives a bigger dump file but allows running diff on multiple dump files.

The option --all-databases or -A dumps all databases.

The option --add-locks causes the tables to be locked on insert and improves performance.

Note that mysqldump blocks other database write operations so don’t pipe it through less or any other process that won’t read all the data in a small amount of time.

mysqldump -d DB_NAME dumps the schema.

The option --single-transaction causes mysqldump to use a transaction for the dump (so that the database can be used in the mean time). This only works with INNODB. To convert a table to INNODB the following command can be used:
ALTER TABLE tablename ENGINE = INNODB;

To create a slave run mysqldump with the --master-data=1 option.
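
A minimal sketch of such a dump, assuming InnoDB tables (so --single-transaction can be used) and that the output will be loaded on the new slave:

mysqldump --master-data=1 --single-transaction --all-databases | gzip -9 > master.sql.gz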

When a master’s binary logs get too big, a command such as “PURGE MASTER LOGS BEFORE '2008-12-02 22:46:26';” will purge the old logs. An alternate version is of the form “PURGE MASTER LOGS TO 'mysql-bin.010';”. The MySQL documentation describes how to view the slave status to make sure that this doesn’t break replication.