Revision 3 as of 2009-06-09 12:48:22 (comment: brief problem description and extrapolation)
Write performance
erinye2-vm2 writes a large file (16 GB):
# /afs/ipp-garching.mpg.de/.cs/perftest/i386_rh90/write_test /afs/ifh.de/testsuite/testosd/testfile-large 0 17179869184
...
write of 17179869184 bytes took 187.631 sec.
close took 0.602 sec.
Total data rate = 89130 Kbytes/sec. for write
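As a side note, the reported rate can be reproduced from the figures in the log; it appears to be computed over the write time plus the close time. A quick check (not part of the testsuite):

```python
# Sanity-check the reported rate: 17179869184 bytes in 187.631 s (write)
# plus 0.602 s (close). The logged "Total data rate" matches
# bytes / (write + close) time, expressed in KiB/s.
nbytes = 17179869184
write_s, close_s = 187.631, 0.602

rate_kib = nbytes / (write_s + close_s) / 1024
print(round(rate_kib))  # → 89130, matching the log above
```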
Read performance
Load impact
...is considerable after all: upon the start of a batch job, read throughput has been observed to drop from ~115 MB/s to ~50 MB/s.
Things to find out:
- what role does CPU load play?
- what role does system load play?
- is the amount of time jobs spend in kernel context important?
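One way to attack these questions is to sample the load average around timed sequential reads of a test file and correlate the two. A minimal sketch of such a measurement (the helper, buffer size, and calling convention are illustrative, not taken from the testsuite):

```python
import os
import time

def read_throughput(path, bufsize=1 << 20):
    """Read `path` sequentially and return (MB/s, mean 1-min loadavg).

    Hypothetical helper for correlating system load with read rates;
    not part of the AFS+OSD testsuite.
    """
    load_before = os.getloadavg()[0]
    total = 0
    start = time.monotonic()
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(bufsize)
            if not chunk:
                break
            total += len(chunk)
    # Guard against a sub-resolution elapsed time for tiny files.
    elapsed = max(time.monotonic() - start, 1e-9)
    load_after = os.getloadavg()[0]
    return total / elapsed / 1e6, (load_before + load_after) / 2
```

Running this repeatedly while batch jobs start and stop would show whether throughput tracks the load average, or whether CPU and kernel time need to be sampled separately (e.g. from /proc/stat).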
Problems
On at least one occasion we observed cache corruption. This is what five machines reported during one run:
wrong offset found: (0x1, 0xa71c0000) instead of (0x0, 0x15300000)
wrong offset found: (0x0, 0x73f90000) instead of (0x1, 0x22160000)
wrong offset found: (0x1, 0x7d120000) instead of (0x0, 0x94400000)
wrong offset found: (0x2, 0xab0f0000) instead of (0x0, 0xbd0000)
wrong offset found: (0x1, 0x50580000) instead of (0x0, 0x88900000)
Apparently, all of them were running pre-r690 versions of the AFS+OSD client, so this seems to confirm that
- there was indeed a problem with the vicep-access code on Lustre and
- r690 may have indeed fixed that very problem.
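The "wrong offset found" messages suggest the test data is self-describing: each block carries its own file offset, so a reader can detect misplaced or stale cache pages. A sketch of that idea (the block size and on-disk layout here are invented; the real testsuite's format is unknown):

```python
import struct

BLOCK = 64 * 1024  # illustrative block size, not the testsuite's

def write_stamped(f, nblocks):
    """Stamp each block with its own offset as a (hi, lo) 32-bit pair."""
    for i in range(nblocks):
        off = i * BLOCK
        hdr = struct.pack(">II", off >> 32, off & 0xFFFFFFFF)
        f.write(hdr + b"\0" * (BLOCK - len(hdr)))

def verify_stamped(f, nblocks):
    """Re-read the file and report mismatches like the log above."""
    errors = []
    for i in range(nblocks):
        expect = i * BLOCK
        hi, lo = struct.unpack(">II", f.read(BLOCK)[:8])
        if (hi << 32) | lo != expect:
            errors.append(
                "wrong offset found: (0x%x, 0x%x) instead of (0x%x, 0x%x)"
                % (hi, lo, expect >> 32, expect & 0xFFFFFFFF))
    return errors
```

A check of this shape would flag exactly the failure mode seen here: data served from the wrong offset, as a cache or vicep-access bug on Lustre could produce.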