I performed a simple test on the storage node to see, at a low level, whether the driver is handling things correctly.

Linux: 2.6.9-55.0.9.EL.cernsmp
readahead: 16384 (the default caused extremely slow reads)
RAID5 config: 64k stripe, write cache on, storsave set to performance, auto verify and queuing disabled, 8 disks

XFS creation:

[root@jupiter /]# mkfs.xfs -d sunit=128,swidth=896 -f /dev/sdb
meta-data=/dev/sdb               isize=256    agcount=32, agsize=26702320 blks
         =                       sectsz=512   attr=0
data     =                       bsize=4096   blocks=854473984, imaxpct=25
         =                       sunit=16     swidth=112 blks, unwritten=1
naming   =version 2              bsize=4096
log      =internal log           bsize=4096   blocks=32768, version=1
         =                       sectsz=512   sunit=0 blks, lazy-count=0
realtime =none                   extsz=4096   blocks=0, rtextents=0

Single write of 10GB in blocks of 7x64k (448k, i.e. one full stripe):

sync; time dd if=/dev/zero of=testfile bs=448k count=23405 && sync
23405+0 records in
23405+0 records out

real    0m28.087s
user    0m0.018s
sys     0m25.733s

28s ~ 365MB/s

vmstat output during the run:

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
0 0 184 3319776 324 11288 0 0 0 0 1004 13 0 0 100 0
0 0 184 3319784 324 11288 0 0 0 0 1004 15 0 0 100 0
2 0 184 2937128 332 384640 0 0 44 149230 1465 97 0 46 53 0
2 0 184 2536040 332 776720 0 0 0 450464 3849 34 0 69 28 2
2 0 184 2136168 332 1170620 0 0 0 356972 3814 40 0 67 27 6
2 1 184 1740456 332 1563740 0 0 0 277304 3848 40 0 64 25 11
1 2 184 1346728 332 1954520 0 0 0 268376 3779 29 0 64 24 13
1 1 184 957544 340 2334112 0 0 0 463614 3816 55 0 73 22 6
1 0 184 541864 340 2738152 0 0 0 507540 3837 51 0 73 9 18
2 1 184 76200 340 3202772 0 0 0 219644 3819 59 0 61 8 31
1 3 184 23528 172 3251904 0 0 0 414705 3734 296 0 76 25 0
2 2 184 24872 172 3249044 0 0 0 412328 3843 337 0 75 25 0
3 4 184 21928 68 3253568 0 0 0 304384 3780 720 0 71 10 18
3 0 184 25384 68 3248628 0 0 0 415164 3818 338 0 75 0 25
1 4 184 21288 68 3252008 0 0 0 351884 3767 554 0 74 12 15
2 0 184 23464 68 3256428 0 0 0 192121 3839 373 0 67 4 29
1 6 184 24168 68 3244468 0 0 0 638184 3776 903 0 85 15 0
1 6 184 22504 68 3259548 0 0 0 0 3775 404 0 58 41 0
4 5 184 23592 68 3239788 0 0 0 862816 3814 353 0 95 5 0
1 7 184 22824 68 3253048 0 0 0 4800 3775 436 0 59 41 1
2 6 184 23464 68 3249668 0 0 0 436692 3801 397 0 77 24 0
1 8 184 22632 68 3244988 0 0 0 488016 3764 405 0 79 3 18
3 7 184 22824 68 3254608 0 0 0 311857 6061 2048 0 49 0 51
1 2 184 25320 68 3241868 0 0 8 622512 3781 3387 0 76 17 6
1 8 184 24744 68 3255648 0 0 0 0 3777 9235 0 53 0 48
2 0 184 24296 68 3242388 0 0 4 725972 3811 316 0 87 4 9
1 7 184 22504 68 3254088 0 0 0 67042 3755 1975 0 61 8 32
2 0 184 22056 68 3249408 0 0 0 507092 3839 348 0 77 4 20
1 8 184 24680 68 3247068 0 0 0 327152 3747 604 0 74 9 16
1 0 184 25256 100 3237156 0 0 76 610580 3797 316 0 51 4 45
0 0 184 34344 100 3237156 0 0 0 99167 3773 20 0 5 92 3
0 0 184 46376 108 3237148 0 0 8 73 3464 43 0 2 99 0
0 0 184 46376 108 3237148 0 0 0 0 1004 19 0 0 100 0
0 0 184 46376 108 3237148 0 0 0 0 1004 15 0 0 100 0

As can be seen, the writes are uneven: they mostly sit around 300-400 MB/s (the bo column is in KiB/s), but there are seconds with 0 and some with 800+ MB/s. The I/O wait also fluctuates between 0 and 50%, and quite a number of processes hang in uninterruptible sleep (the b column).
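For reference, here is how the mkfs and dd numbers fit together; this is my own arithmetic from the RAID config above (mkfs.xfs takes sunit/swidth in 512-byte sectors):

  # sunit  = 64 KiB stripe / 512 B                     = 128 sectors
  # swidth = 7 data disks (8 disks - 1 parity) * sunit = 896 sectors = 448 KiB
  mkfs.xfs -d sunit=128,swidth=896 -f /dev/sdb

  # 448 KiB is one full stripe, hence bs=448k for dd. If the readahead of
  # 16384 is in 512-byte sectors (i.e. 8 MiB), it can be set with e.g.:
  blockdev --setra 16384 /dev/sdb

The mkfs output confirms the geometry: in the data section, sunit=16 and swidth=112 are expressed in 4 KiB blocks, which is again 64 KiB and 448 KiB.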
Same test with a raw device (the system has two identical controllers):

sync; time dd if=/dev/zero of=/dev/sda bs=448k count=23405 && sync
23405+0 records in
23405+0 records out

real    0m30.718s
user    0m0.026s
sys     0m28.229s

30s ~ 341MB/s

vmstat output:

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
0 0 184 46696 380 3237396 0 0 597 1216 167 70 0 1 98 1
0 0 184 46696 380 3237396 0 0 0 0 1012 22 0 0 100 0
0 0 184 46696 380 3237396 0 0 0 0 1006 23 0 0 100 0
0 0 184 46696 380 3237396 0 0 0 0 1005 13 0 0 100 0
0 0 184 46696 380 3237396 0 0 0 0 1006 19 0 0 100 0
1 0 184 24040 222132 3038264 0 0 0 81 1022 549 0 23 77 0
1 0 184 25384 588792 2659644 0 0 0 541320 3134 534 0 60 41 0
1 0 184 22312 784808 2466488 0 0 0 355276 3759 321 0 56 40 5
1 0 184 23848 785472 2466604 0 0 0 290968 3769 349 0 57 39 5
1 0 184 24552 781672 2466504 0 0 0 437360 3761 333 0 61 39 0
1 0 184 23656 785288 2466788 0 0 0 276584 3768 385 0 57 33 10
1 0 184 23848 782164 2466792 0 0 0 431200 3759 326 0 62 38 0
1 0 184 25448 783104 2466372 0 0 0 286404 3759 380 0 56 39 5
1 0 184 21864 783768 2466748 0 0 0 430768 3758 294 0 61 39 0
1 0 184 25448 782864 2466352 0 0 0 279424 3756 410 0 57 38 5
1 0 184 22120 783324 2466672 0 0 0 430300 3765 287 0 62 35 4
1 0 184 25448 782824 2466652 0 0 0 275584 3764 429 0 57 38 5
1 0 184 25448 781204 2466712 0 0 0 389636 3774 260 0 57 38 5
1 0 184 23528 784112 2466404 0 0 0 337232 3759 435 0 62 38 0
1 0 184 23464 784956 2466600 0 0 0 311108 3761 282 0 56 39 6
1 0 184 24872 781920 2466516 0 0 0 413248 3757 388 0 61 39 0
1 0 184 25384 783728 2466788 0 0 0 282772 3773 349 0 57 39 5
1 0 184 23976 782624 2466592 0 0 0 428776 3760 361 0 62 38 0
1 0 184 21800 787380 2466776 0 0 0 272188 3753 327 0 57 38 5
1 0 184 23848 782924 2466552 0 0 0 427356 3759 376 0 62 38 0
1 0 184 24488 784608 2466428 0 0 0 276504 3762 324 0 57 34 8
1 0 184 25448 780856 2466800 0 0 0 439144 3761 360 0 61 39 0
1 0 184 24232 784624 2466672 0 0 0 272732 3764 361 0 56 39 5
1 0 184 22888 784380 2466396 0 0 0 371472 3613 327 0 56 39 5
1 0 184 24424 781292 2466624 0 0 0 411252 3732 314 0 62 39 0
1 0 184 23144 784928 2466628 0 0 0 276336 3754 405 0 57 38 5
1 0 184 23912 781640 2466536 0 0 0 428512 3746 288 0 62 38 0
1 0 184 22504 785528 2466548 0 0 0 279220 3757 411 0 57 38 5
1 0 184 24936 780916 2466480 0 0 0 418880 3753 286 0 61 38 0
1 0 184 25448 783228 2466508 0 0 0 264216 3758 385 0 56 39 5
0 1 184 32744 783412 2466584 0 0 8 149684 3761 4368 0 14 48 39
0 0 184 818856 100 2466516 0 0 68 64 1178 373 0 20 75 6
0 0 184 818856 100 2466516 0 0 0 0 1006 11 0 0 100 0
0 0 184 818856 100 2466516 0 0 0 0 1005 17 0 0 100 0

As can be seen, the speed is much more uniform: it fluctuates between roughly 270 and 430 MB/s, and the I/O wait is almost non-existent; the only place it comes up is at the end of the write, probably due to the sync command. There are also no blocked processes whatsoever during the write, which means the system load will not go up.

Same test for reads on XFS:

sync; time dd if=testfile of=/dev/zero bs=448k count=23405 && sync
23405+0 records in
23405+0 records out

real    0m47.733s
user    0m0.028s
sys     0m16.518s

Here one can see the first indication of trouble: real time and sys time differ by about 30s, i.e. real time is roughly 3x sys time. The resulting total read speed is 213MB/s.
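For completeness, the throughput figures quoted here all follow directly from the dd numbers:

  # total data: 23405 * 448 KiB = 10,485,440 KiB ~ 10 GiB
  # XFS write: 10240 MiB / 28 s ~ 365 MB/s
  # raw write: 10240 MiB / 30 s ~ 341 MB/s
  # XFS read:  10240 MiB / 48 s ~ 213 MB/s
  # (the raw read further below works out to 10240 MiB / 25 s ~ 410 MB/s)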
The vmstat output for the run:

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
0 0 184 819304 272 2466604 0 0 595 1313 169 70 0 1 98 1
0 0 184 819304 272 2466604 0 0 0 0 1008 10 0 0 100 0
0 0 184 819304 272 2466604 0 0 0 0 1004 23 0 0 100 0
0 0 184 819304 272 2466604 0 0 0 0 1015 33 0 0 100 0
1 0 184 550824 280 2734396 0 0 267880 81 3013 1049 0 20 67 13
1 0 184 173992 280 3106716 0 0 372224 0 3883 1478 0 29 50 22
1 0 184 24168 148 3259728 0 0 406016 0 4194 6135 0 37 50 14
1 0 184 22376 148 3259728 0 0 409600 0 4183 13055 0 39 49 12
1 0 184 22248 148 3264668 0 0 208152 0 2662 6461 0 45 48 6
0 1 212 25448 60 3261868 0 0 173308 12 2898 6610 0 38 49 14
1 0 212 21544 60 3265768 0 0 409600 0 4175 12746 0 41 48 11
1 1 212 21800 60 3265508 0 0 409600 0 4182 12921 0 40 48 12
2 0 212 25064 60 3261868 0 0 409600 4 4166 13059 0 40 49 12
3 0 212 21288 60 3259788 0 0 409600 0 4186 13126 0 39 49 11
3 0 212 22824 60 3260568 0 0 409600 0 4181 12928 0 40 49 12
3 0 212 23464 60 3256928 0 0 409600 0 4183 13105 0 39 49 13
0 1 212 23144 60 3265248 0 0 303376 0 4669 12011 0 30 49 21
0 1 212 22696 60 3265768 0 0 117052 0 5873 11593 0 14 50 36
0 1 212 21160 60 3267588 0 0 113588 0 5963 11728 0 14 50 36
0 1 212 23720 60 3265248 0 0 170016 0 5668 11207 0 20 48 31
0 1 212 25384 60 3263688 0 0 104692 0 5943 11526 0 15 50 36
0 1 212 22376 60 3266548 0 0 104408 0 5996 11579 0 14 50 37
1 0 212 23912 60 3259268 0 0 106748 0 5828 11545 0 15 50 36
0 1 212 24552 60 3264728 0 0 104904 0 5997 11468 0 14 50 36
1 0 212 21672 60 3267588 0 0 110752 0 6015 11699 0 15 50 36
0 1 212 23784 60 3265768 0 0 111352 0 6090 11995 0 14 50 37
1 0 212 24296 60 3263948 0 0 118716 0 6025 12058 0 14 50 36
0 1 212 21672 60 3268368 0 0 106572 0 5903 11300 0 14 50 37
0 1 212 22056 60 3268108 0 0 111740 0 5838 11449 0 14 50 36
1 0 212 24808 60 3265248 0 0 118176 0 5837 11682 0 15 50 35
0 1 212 22952 60 3267328 0 0 104776 0 5910 11321 0 13 50 37
0 1 212 23464 60 3266808 0 0 105228 0 5996 11637 0 15 50 36
1 0 212 25384 60 3264468 0 0 117468 0 5936 11889 0 15 50 36
0 1 212 21544 60 3269148 0 0 112500 0 6003 11599 0 14 50 36
0 1 212 24936 60 3266288 0 0 111456 0 6067 11998 0 14 50 36
3 0 212 23912 60 3267328 0 0 111620 0 5760 11240 0 14 50 36
0 1 212 22376 60 3269148 0 0 112304 0 5998 11667 0 14 49 36
0 1 212 24040 60 3267588 0 0 111620 8 5775 11333 0 15 50 36
0 1 212 23528 60 3268628 0 0 105244 0 5760 11159 0 13 50 38
0 1 212 24488 60 3267588 0 0 95572 0 5328 10146 0 14 50 36
0 1 212 25448 60 3267068 0 0 121616 0 5760 11390 0 14 49 36
1 0 212 21736 60 3272268 0 0 394528 0 4227 11104 0 41 47 12
1 0 212 22568 60 3272528 0 0 409600 0 4176 12611 0 41 47 11
1 0 212 24296 60 3272788 0 0 408824 0 4198 12575 0 41 49 11
0 1 212 24488 60 3273308 0 0 116864 0 5814 11305 0 15 50 35
0 1 212 25384 60 3272788 0 0 112768 0 5786 11334 0 14 50 37
0 1 212 24040 60 3274608 0 0 106520 0 6152 11919 0 15 50 36
0 1 212 24488 60 3274868 0 0 110404 0 6013 11751 0 14 50 37
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
1 1 212 24744 60 3275128 0 0 107784 0 5876 11469 0 14 50 37
0 1 212 23336 60 3277208 0 0 112632 0 5680 11052 0 14 50 36
0 1 212 24488 60 3276688 0 0 100712 0 5837 11268 0 13 50 37
0 1 212 24552 88 3277960 0 0 92472 0 5763 10956 0 14 50 37
0 0 212 24552 100 3277948 0 0 12 72 1026 58 0 0 99 1
0 0 212 24552 100 3277948 0 0 0 0 1004 15 0 0 100 0

As can be seen, the I/O wait sits around 35% almost all of the time, and there is on average one blocked process, which will cause the load to increase over time. The read rate is also uneven: it alternates between bursts pinned at 409600 KiB/s (400 MB/s) and long stretches around 100-120 MB/s.
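The b column in vmstat is the number of processes in uninterruptible sleep. To see which processes are blocked at a given moment, something along these lines works (standard procps, nothing specific to this setup):

  # list processes currently in uninterruptible sleep (state D)
  ps -eo state,pid,comm | awk '$1 == "D"'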
Same test on the raw device:

sync; time dd if=/dev/sda of=/dev/zero bs=448k count=23405 && sync
23405+0 records in
23405+0 records out

real    0m24.998s
user    0m0.013s
sys     0m24.199s

There is basically no difference between real time and sys time, and the speed is also a nice 400MB/s. The vmstat output:

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
0 0 212 24744 176 3278132 0 0 683 1308 172 75 0 1 98 1
0 0 212 24744 176 3278132 0 0 0 0 1007 14 0 0 100 0
0 0 212 24744 176 3278132 0 0 0 0 1004 15 0 0 100 0
0 1 212 24936 4280 3273508 0 0 4096 81 1048 183 0 0 86 14
1 0 212 24360 366340 2911188 0 0 362064 0 4672 11257 0 41 49 11
1 0 212 25448 803944 2466824 0 0 437944 0 5646 10985 0 52 48 1
1 0 212 25448 808556 2466372 0 0 438388 0 5903 909 0 66 34 0
1 0 212 25384 809096 2466092 0 0 426220 0 5815 225 0 67 33 0
1 0 212 25448 804880 2466408 0 0 430124 12 5806 239 0 67 33 0
1 0 212 25320 809184 2466004 0 0 430036 4 5827 224 0 67 33 0
1 0 212 25384 806212 2466116 0 0 431636 0 5856 271 0 67 33 0
1 0 212 25448 809048 2466400 0 0 428524 0 5841 225 0 67 33 0
1 0 212 25448 805332 2466216 0 0 430680 0 5816 485 0 67 33 0
1 0 212 25448 809144 2466304 0 0 429480 0 5854 221 0 67 33 0
1 0 212 25448 808604 2466324 0 0 433944 0 5829 235 0 67 32 0
1 0 212 25448 806488 2466360 0 0 432272 0 5906 247 0 67 33 0
1 0 212 25384 809124 2466064 0 0 428120 0 5842 229 0 67 34 0
1 0 212 25384 809200 2466248 0 0 434176 0 5796 228 0 68 33 0
1 0 212 25384 809216 2466232 0 0 425984 0 5808 229 0 66 34 0
1 0 212 25384 809164 2466284 0 0 434176 0 5837 226 0 67 33 0
1 0 212 25384 805848 2466220 0 0 431196 0 5861 238 0 66 34 0
1 0 212 25448 809144 2466304 0 0 428964 0 5873 227 0 68 33 0
1 0 212 25384 805692 2466116 0 0 431112 0 5853 426 0 65 34 1
1 0 212 25384 805376 2466172 0 0 442084 0 5955 228 0 68 32 0
1 0 212 25448 805744 2466324 0 0 442796 0 5902 695 0 67 33 0
1 0 212 25384 809072 2466116 0 0 445288 0 5877 234 0 67 33 0
1 0 212 25448 809016 2466172 0 0 434176 0 5865 679 0 66 34 0
1 0 212 25448 806020 2466308 0 0 431388 0 5861 224 0 67 33 0
0 0 212 836008 108 2466220 0 0 174828 64 3029 160 0 38 61 2
0 0 212 836008 108 2466220 0 0 0 0 1009 15 0 0 100 0
0 0 212 836008 108 2466220 0 0 0 0 1016 45 0 0 100 0
0 0 212 836008 108 2466220 0 0 0 0 1007 13 0 0 100 0

The read speed is nice, uniform, and high; the I/O wait is 0 for practically the whole duration, as is the number of blocked processes. Hence a nice and stable system.

Final verdict: the system is unstable due to XFS being untuned. But how do I tune it to behave like the raw device?
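Since that question is open, here is what I intend to experiment with next. This is only a sketch of candidate knobs, not a verified recipe: the values are guesses, and /data merely stands in for wherever /dev/sdb is actually mounted.

  # vary the block-device readahead (value in 512-byte sectors)
  blockdev --setra 8192 /dev/sdb

  # allow more outstanding requests in the block-layer queue
  echo 512 > /sys/block/sdb/queue/nr_requests

  # remount XFS with more and larger in-core log buffers
  umount /data
  mount -t xfs -o logbufs=8,logbsize=262144 /dev/sdb /data

  # possibly also boot with a different I/O scheduler, e.g. elevator=deadline

Since the readahead was already raised from the default, the read-side gap between the raw device (a smooth ~430 MB/s) and XFS (alternating between 400 and ~110 MB/s) is where I would look first.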