erinye2-vm2 writes a large file (16 GB):
# /afs/ipp-garching.mpg.de/.cs/perftest/i386_rh90/write_test /afs/ifh.de/testsuite/testosd/testfile-large 0 17179869184
...
write of 17179869184 bytes took 187.631 sec.
close took 0.602 sec.
Total data rate = 89130 Kbytes/sec. for write
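As a sanity check on the reported figure, the rate works out when the close time is counted in the elapsed time. A minimal sketch (the constants are taken from the output above; treating "Kbytes" as KiB is an assumption):

```python
# Verify the reported write rate: 17179869184 bytes written in
# 187.631 s plus 0.602 s for the close. The tool's "Total data rate"
# matches bytes / (write time + close time), expressed in KiB/s.
nbytes = 17179869184            # 16 GiB, as passed to write_test
write_s, close_s = 187.631, 0.602

rate_kib = nbytes / (write_s + close_s) / 1024
print(round(rate_kib))          # ~89130, matching the reported rate
```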
...is considerable after all. When a batch job starts, read throughput has been observed to drop from ~115 MB/s to ~50 MB/s.
Things to find out:
- what role does CPU load play?
- what role does system load play?
- is the amount of time jobs spend in kernel context important?
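To get at the third question, one could sample the aggregate CPU counters in /proc/stat before and after (or during) a batch job and compute the fraction of ticks spent in kernel context. A minimal sketch, assuming a Linux client and the proc(5) field layout; the helper names and the 10-second interval are illustrative, not part of the test suite:

```python
# Estimate the share of CPU time spent in kernel (system) context by
# sampling the aggregate "cpu" line of /proc/stat. Field order
# (user, nice, system, idle, iowait, irq, softirq, ...) follows proc(5).

def read_cpu_ticks():
    """Return the tick counters from the aggregate 'cpu' line of /proc/stat."""
    with open("/proc/stat") as f:
        for line in f:
            if line.startswith("cpu "):
                return [int(x) for x in line.split()[1:]]
    raise RuntimeError("no aggregate cpu line in /proc/stat")

def kernel_fraction(before, after):
    """Fraction of elapsed ticks spent in kernel context between two samples.

    Counts system + irq + softirq ticks (fields 2, 5, 6) against the
    total ticks elapsed between the samples.
    """
    delta = [a - b for a, b in zip(after, before)]
    total = sum(delta)
    if total == 0:
        return 0.0
    return (delta[2] + delta[5] + delta[6]) / total

# Usage sketch: sample around (or during) a batch job, e.g.
#   import time
#   t0 = read_cpu_ticks(); time.sleep(10); t1 = read_cpu_ticks()
#   print("kernel fraction: %.1f%%" % (100 * kernel_fraction(t0, t1)))
```

Correlating this fraction (and the load average from /proc/loadavg) with the observed read throughput should show whether kernel-context time is what matters.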
We observed cache corruption at least once. This is what five machines reported during one run:
wrong offset found: (0x1, 0xa71c0000) instead of (0x0, 0x15300000)
wrong offset found: (0x0, 0x73f90000) instead of (0x1, 0x22160000)
wrong offset found: (0x1, 0x7d120000) instead of (0x0, 0x94400000)
wrong offset found: (0x2, 0xab0f0000) instead of (0x0, 0xbd0000)
wrong offset found: (0x1, 0x50580000) instead of (0x0, 0x88900000)
Apparently, all of them were running pre-r690 versions of the AFS+OSD client, which seems to confirm that
- there was indeed a problem with the vicep-access code on Lustre, and
- r690 may in fact have fixed that very problem.