Based on past mistakes, both my own and other people's, here is a checklist to go through before putting a Linux (or other Unix) server online:
- Run memtest86+ (or an equivalent program for other architectures) before going live, ideally before installing the OS. Run it again every time you upgrade the RAM.
- Reboot the machine after every significant change. For example, if you install a new daemon then reboot the machine to make sure that the daemon starts correctly at boot. It’s better to have 5 minutes of downtime for a scheduled reboot than a few hours of downtime after something goes wrong at 2AM.
- Make sure that every account that is used for cron jobs has its email directed somewhere that a human will see it. Make sure that root has its mail sent somewhere useful even if you don’t plan to have any root cron jobs.
- Make sure that ntpd is running and has at least two servers to look at. If you have a big site then run two NTP servers yourself and have each of them look to two servers in the outside world, or to one outside server and a GPS receiver.
- Make sure that you have some sort of daily cron job doing basic log analysis. The Red Hat logwatch program is quite effective. You then need some way of making sure that you notice if an email stops being sent (most people won’t notice getting 11 messages from logwatch in the morning instead of 12).
- Make sure that you will notice when (not if) a hard drive in your RAID array dies.
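
As a rough illustration of the RAID item, here is a minimal Python sketch that looks for degraded arrays. It assumes Linux software RAID (md), where /proc/mdstat shows a status field such as [UU] for a healthy mirror and [U_] when a member is missing. Hardware RAID controllers need their vendor's tools instead. The function name and the sample text are made up for the example.

```python
import re

# Member-status field in /proc/mdstat, e.g. "[UU]" (healthy) or
# "[U_]" (one member missing). Assumes Linux software RAID (md).
STATUS_RE = re.compile(r"\[([U_]+)\]")
# Start of an array stanza, e.g. "md0 : active raid1 sdb1[1] sda1[0]".
ARRAY_RE = re.compile(r"^(md\S*)\s*:")


def degraded_arrays(mdstat_text):
    """Return the names of arrays whose status field contains '_'."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        header = ARRAY_RE.match(line)
        if header:
            current = header.group(1)
            continue
        status = STATUS_RE.search(line)
        if status and current and "_" in status.group(1):
            degraded.append(current)
    return degraded


# Hypothetical sample in the /proc/mdstat layout: md1 has lost a member.
SAMPLE = """\
Personalities : [raid1]
md0 : active raid1 sdb1[1] sda1[0]
      1048512 blocks [2/2] [UU]

md1 : active raid1 sda2[0]
      2097088 blocks [2/1] [U_]

unused devices: <none>
"""

print(degraded_arrays(SAMPLE))
```

On a real machine you would pass in the contents of /proc/mdstat and print something only when the list is not empty. Run from cron, that output gets mailed to the account owner, which is why the mail forwarding item matters.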
Any suggestions on other things I can add?
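
For the log analysis item, one way to notice a missing logwatch message is to compare the hosts you expect to hear from against the hosts that actually sent something. Below is a minimal Python sketch of that comparison. The host names are invented, and how you collect the list of received senders (for example by reading a mailbox) is left out because it depends on your mail setup.

```python
def missing_reports(expected_hosts, received_hosts):
    """Return the expected hosts that sent no report, sorted by name."""
    return sorted(set(expected_hosts) - set(received_hosts))


# Hypothetical site with 12 servers, one of which stayed silent.
expected = [f"server{n:02d}" for n in range(1, 13)]
received = [host for host in expected if host != "server07"]

print(missing_reports(expected, received))
```

The point is that the check is driven by an explicit list of expected hosts, so getting 11 messages instead of 12 produces a named missing host rather than relying on someone counting emails.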