
Linux tuning for dCache

Overview

I myself had a lot of problems with our dCache storage pool nodes, which already at low traffic levels (20MB/s writes) started to lose control of the disk I/O. The basic symptoms were always high I/O wait on the CPUs, a high number of blocked processes in the vmstat output and the system load creeping above 20.

Although I had my worries about XFS and the 3ware RAID controllers, they proved to be unjustified. In the following descriptions I assume that you run a fairly recent Linux 2.6 series kernel (the latest at the time of writing was 2.6.23), as there are numerous fixes in the kernel which affect system performance. Most notably there are plenty of XFS related updates in kernels after 2.6.15, with anything below 2.6.18 still having problems, so it is best to go with even later versions.

As a quick warning, I came to these results through discussions with friends, reading through half of the internet (big thanks, Google) and heavy experimentation (during which I lost 3000 files, ca 1.5TB worth of data, though that was my own stupidity: a simple mix-up of logging in to the production system instead of a test machine) until I found satisfactory settings. So use these settings at your own risk; I will not be responsible for how they change the behaviour of your system. All of the usual disclaimers which call for your own common sense also hold :P

XFS tuning

When using XFS on top of a RAID system which uses striping (RAID 0, 3, 4, 5, 6, 10, 50) it is wise to also tell XFS that it sits on top of such a device. In our case we use RAID 5 for the protection it provides against a single disk loss. The important parameters are the stripe size of the RAID (in our case 64k, the default setting for the 9550SX cards) and the number of data disks. For RAID 0 the number of data disks is the number of disks in the RAID, for RAID 3, 4 and 5 it is one disk less (so an 8-disk RAID has 7 data disks), and for RAID 6 it is two disks less, as two disks are used for parity. When creating the XFS filesystem, use the following options:

$ mkfs.xfs -d su=<stripe_size>,sw=<nr_data_disks> -l version=2,su=<stripe_size> /dev/sdX

as an example, our RAID5 with 8 disks and 64k stripe size has the following XFS on top of it:

$ mkfs.xfs -d su=64k,sw=7 -l version=2,su=64k /dev/sda

You can read more about the striping use in XFS from the XFS site.
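
If you want to double-check the stripe settings of an existing XFS filesystem, xfs_info prints them as sunit/swidth (in filesystem blocks). For example, assuming the filesystem is mounted at /pool (substitute your own mount point):

$ xfs_info /pool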

Block device settings

One problem which could manifest itself in older kernels (2.6.9, which for example is the stock kernel in Scientific Linux 4) is the number of I/O requests which are queued up before they are sent to the disk. In 2.6.9 the default value for every block device is 8192; in the newer kernels it is 128, so the change in defaults is substantial. The trouble with the big value is that if the individual requests are large, then queueing a huge number of them can easily lead you to run out of memory and run into all kinds of problems.

You can check your current settings easily by just looking into the following file:

/sys/block/<DEV>/queue/nr_requests

so for example:

$ cat /sys/block/sda/queue/nr_requests

To set a new value you just echo it into the same file. For example, the value 3ware recommends for nr_requests on their 9550SX series controllers is 512.
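
So, assuming again that the device is sda, setting the recommended value would simply be:

$ echo 512 > /sys/block/sda/queue/nr_requests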

Bear in mind that this setting only lasts until a reboot, so you may want to add it to /etc/rc.d/rc.local for example (a combined sketch covering all of the block device settings is given at the end of the I/O scheduler section).

Readahead

Another feature of the 3ware controllers is that if you run with the default block device settings, their read performance is terrible. On our own systems the default settings produced a read speed of the order of 40MB/s for a single dd operation reading 10GB from disk to /dev/zero.
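
For reference, the kind of test I mean is a plain streaming read, something along these lines (illustrative only; the device, block size and count are placeholders to adjust for your own setup):

$ dd if=/dev/sda of=/dev/zero bs=1M count=10240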

3ware itself recommends setting the readahead value to 16384 (the default is 256), which for streamed copies indeed increases the speed to above 400MB/s on our systems. This probably has some memory cost with many parallel streams and smaller files, but on a dCache pool, where transfers are usually sequential per stream and the files are large, setting the readahead to a higher value definitely pays off.

To see the current readahead setting of a block device use the blockdev command:

$ blockdev --getra /dev/sdX

for example:

$ blockdev --getra /dev/sda
256

Setting the readahead to a new value also happens with the same command:

$ blockdev --setra 16384 /dev/sda

I/O scheduler

The newer kernels also give you the option to select the I/O scheduler, which decides how read and write requests are queued for the underlying device. You can read more about the schedulers in the Linux kernel source documentation: linux/Documentation/block/*iosched.txt

The scheduler which a number of sites have found through testing to be the most useful for high I/O rates is the deadline scheduler. You can either select the default scheduler at kernel compilation time or select it later by echo-ing the right one into the block device settings.

You can see the current scheduler by catting the block device queue scheduler file:

$ cat /sys/block/sda/queue/scheduler
noop anticipatory [deadline] cfq

and you can set the new scheduler with:

$ echo deadline > /sys/block/sda/queue/scheduler
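
None of these block device settings survive a reboot, so one option is to collect them all in /etc/rc.d/rc.local. A sketch, assuming a single pool device sda (repeat the lines for every data device you have):

# block device tuning for the dCache pool disk
echo 512 > /sys/block/sda/queue/nr_requests
blockdev --setra 16384 /dev/sda
echo deadline > /sys/block/sda/queue/scheduler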

Kernel virtual memory management

In the latest 2.6 kernels it seems that a few settings have changed with regard to how the virtual memory management is performed. Let's have a quick look at a few of them.

Dirty pages cleanup

There are two important settings which control the kernel behaviour with regard to dirty pages in memory. They are:

vm.dirty_background_ratio
vm.dirty_ratio

The first of the two (vm.dirty_background_ratio) defines the percentage of memory that can become dirty before background flushing of the pages to disk starts. Until this percentage is reached no pages are flushed to disk; once the flushing starts, it is done in the background without disrupting the running processes in the foreground.

The second parameter (vm.dirty_ratio) defines the percentage of memory which can be occupied by dirty pages before a forced flush starts. If the percentage of dirty pages reaches this number, then writes from all processes become synchronous: they are not allowed to continue until the I/O they have requested is actually performed and the data is on disk. On high-performance I/O machines this causes a problem, as the write caching is effectively cut away and all processes doing I/O (the important ones on a dCache pool) block waiting for it. This results in a large number of hanging processes, which leads to high load, which leads to an unstable system and crappy performance.

The default values for these settings in Scientific Linux 4 with the stock 2.6.9-cern{smp} kernel are a background ratio of 10% and a synchronous ratio of 40%. With the 2.6.20+ kernels, however, the defaults are respectively 5% and 10%. It is not hard to reach that 10% level and block your system; this is exactly what I faced when trying to understand why my systems were performing poorly and sitting under high load while doing almost nothing. I finally managed to find a few parameters to watch which showed me what the system was doing. The two values to monitor are in the /proc/vmstat file:

$ grep -A 1 dirty /proc/vmstat
nr_dirty 30931
nr_writeback 0

If you monitor the values in /proc/vmstat you will notice that, before the system reaches the vm.dirty_ratio barrier, nr_dirty is a lot higher than nr_writeback; usually nr_writeback is close to 0, or occasionally flicks higher and then calms down again. If you do reach the vm.dirty_ratio barrier, you will see nr_writeback start to climb fast and become higher than ever before without dropping back, or at least it will not drop back easily if dirty_ratio is set too low.
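
A convenient way to keep an eye on these two values while running a transfer test is simply to repeat the grep every second, for example with watch:

$ watch -n 1 'grep -A 1 dirty /proc/vmstat'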

On my servers I personally use vm.dirty_background_ratio = 3 and vm.dirty_ratio = 40. You can set these variables by appending the following at the end of your /etc/sysctl.conf file:

$ grep dirty /etc/sysctl.conf
vm.dirty_background_ratio = 3
vm.dirty_ratio = 40

and then executing:

$ sysctl -p

To see your current settings for the dirty ratios, do the following:

$ sysctl -a | grep dirty

PS! My original vm.dirty_background_ratio was 15%, but I lowered it after an e-mail from Stijn De Weirdt, who explained that it is not a good idea to let too high a level of dirty memory build up before starting the disk writes; he recommended around 3-5% so that flushing to disk starts quickly, as the underlying hardware should be able to handle it. In addition, what I did not mention is that if you grep for "dirty" in the sysctl -a output you will see a few more parameters, which for example force the flushing of pages older than X seconds etc. As I didn't tune any of these I decided to just leave them be and not describe them here; you can investigate their effects on your own.

VM overcommit

Another important virtual memory management setting is the behaviour of memory overcommitting. In a number of cases when the systems were under high load we actually hit the point of running out of real memory, and as Linux by default quite freely hands out more memory than it really has, once we reached the real limit we had processes dying with "Out of memory" errors, which at least once also caused one of the systems to crash.

Basically, when someone asks Linux for memory, it will happily grant it even if it does not have that much memory. It assumes that because processes usually ask for more memory than they really use, it will not actually run out. There is also a kernel setting which limits the overcommitting to a certain percentage of the total memory (vm.overcommit_ratio), and by default it is set to 50%. So when Linux hands out memory to processes, it assumes it actually has 150% of the memory it really has, and to be honest, in most cases this is not a problem, as applications are greedy and ask for more than they really need.

However, on a high throughput machine there is a real likelihood of running out of memory with real allocations, and hence I have stopped my Linux kernel from handing out too much memory by setting vm.overcommit_memory to 2 (the default is 0), which disables the overcommitment feature fully and makes the kernel do strict accounting of what it hands out.
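
To see what your kernel is currently set to, and how much memory it is willing to promise, you can check both the sysctl values and the commit accounting in /proc/meminfo (on reasonably recent 2.6 kernels CommitLimit shows the total the kernel will hand out in the strict mode and Committed_AS what has been promised so far):

$ sysctl vm.overcommit_memory vm.overcommit_ratio
$ grep -i commit /proc/meminfo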

Again, just add the setting with its value at the end of /etc/sysctl.conf and let sysctl apply it from the config file by running:

$ sysctl -p

The settings in /etc/sysctl.conf are re-read at every boot, so they remain persistent.
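
For completeness, the relevant line in /etc/sysctl.conf would then look something like this (analogous to the dirty ratio lines above):

$ grep overcommit /etc/sysctl.conf
vm.overcommit_memory = 2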

Conclusion

Though my understanding of what is going on is somewhat clearer now, based on three weeks of observing and hacking the storage systems, I'm not quite sure how good or bad these settings may be for your system in other situations. Our systems, which used not to survive more than 2-6 hours, have now (at the time of writing) been performing magnificently for 6 days, doing a lot more transfers and carrying orders of magnitude less load (it actually stays below the number of CPUs, which means the machines could do even more). So I personally am happy with these settings, but I must remind you: use them at your own risk.
