= Subcommands =

This is the remainder of our first tentative tries with the OpenAFS+OSD system. Bug hints should be checked for their validity; this information is already months old!

== fs createstripedfile ==

=== creating mirrored files ===

{{{
# fs creates /afs/desytest/osdtest/mirrored-file 1 14 -copies 2
}}}
where 1 is the ''number of stripes'' and 14 is ''log_2(stripesize in bytes)''.

 Note:: Allowed exponents are 12..19, allowing stripe sizes 4kB..512kB.

The above command created an unstriped but mirrored file.

=== creating striped files ===

This is how we created the first striped file:
{{{
# fs creates /afs/desytest/osdtest/striped_file_1 2 14
done
}}}
We stored a kernel image inside:
{{{
# cp /boot/vmlinuz-2.6.18-8.1.10.el5 striped_file_1
# mv striped_file_1 vmlinuz-2.6.18-8.1.10.el5
# ll
total 7615
-rw-r--r-- 1 bin root 417032 Jan 14 14:54 libafs.ko
-rw-r--r-- 1 bin root 5612872 Jan 28 15:31 mirrored._usr_afs.2007-12-17.tar.bz2
-rw-r--r-- 1 bin root 1765652 Feb 14 13:37 vmlinuz-2.6.18-8.1.10.el5
# fs osd vmlinuz-2.6.18-8.1.10.el5
vmlinuz-2.6.18-8.1.10.el5 has 172 bytes of osd metadata, v=3
On-line, 1 segm, flags=0x0
  segment: lng=0, offs=0, stripes=2, strsize=16384, cop=1, 2 objects
    object: pid=536870924, oid=720578573194231814, osd=12, stripe=0 obj=536870924.6.167772773.0
    object: pid=536870924, oid=3026421582407925766, osd=11, stripe=1 obj=536870924.6.704643685.0
}}}
As the stripes were put in OSDs 11 and 12 respectively:
{{{
# osd listo
id name(loc) ---total space--- flag prior. own. server lun size range
 1 local_disk wr rd (0kb-1mb)
 8 io_a 279 gb 0.0 % up hsm 50 0 io.ifh.d 0 (1mb-64gb)
 9 io_b 279 gb 0.0 % up hsm 50 0 io.ifh.d 1 (1mb-64gb)
10 io_c 279 gb 0.0 % up hsm 50 0 io.ifh.d 2 (1mb-64gb)
11 io_d 279 gb 0.0 % up hsm 50 0 io.ifh.d 3 (1mb-64gb)
12 io_e 278 gb 0.0 % up hsm 50 0 io.ifh.d 4 (1mb-64gb)
# osd vol io
1 volumes found:
536870924
# osd obj io 536870924 3
536870924.4.33555043.0 fid 536870924.4.611 tag 0 not-striped lng 4209654 lc 1
536870924.6.704643685.0 fid 536870924.6.613 tag 0 1/2/16384 lng 880916 lc 1
536870924.4.704643683.1 fid 536870924.4.611 tag 1 1/2/16384 lng 0 lc 1
3 object(s) with totally 5090570 bytes for volume 536870924 found
# osd obj io 536870924 4
536870924.4.167772771.0 fid 536870924.4.611 tag 0 0/2/16384 lng 0 lc 1
536870924.6.167772773.0 fid 536870924.6.613 tag 0 0/2/16384 lng 884736 lc 1
2 object(s) with totally 884736 bytes for volume 536870924 found
}}}
The first file on ''io/3'' (vicepd) belongs to the mirrored file created above, not the striped file.

=== adding segments to an existing OSD file ===

{{{
# fs creates /afs/desytest/osdtest/mirrored._usr_afs.2007-12-17.tar.bz2 2 14
File /afs/desytest/osdtest/mirrored._usr_afs.2007-12-17.tar.bz2 already exists. Create new segment? (Y|N) y
done
# fs osd mirrored._usr_afs.2007-12-17.tar.bz2
mirrored._usr_afs.2007-12-17.tar.bz2 has 436 bytes of osd metadata, v=3
On-line, 3 segm, flags=0x0
  segment: lng=5612872, offs=0, stripes=1, strsize=16384, cop=2, 2 objects
    object: pid=536870924, oid=144117812300873732, osd=10, stripe=0 obj=536870924.4.33555043.0
    object: pid=536870924, oid=144117812300873732, osd=11, stripe=0 obj=536870924.4.33555043.0
  segment: lng=5612872, offs=5612872, stripes=2, strsize=16384, cop=1, 2 objects
    object: pid=536870924, oid=720578564604297220, osd=12, stripe=0 obj=536870924.4.167772771.0
    object: pid=536870924, oid=3026421573885100036, osd=11, stripe=1 obj=536870924.4.704643683.1
  segment: lng=0, offs=5612872, stripes=2, strsize=16384, cop=1, 2 objects
    object: pid=536870924, oid=720578564671406084, osd=12, stripe=0 obj=536870924.4.167772771.1
    object: pid=536870924, oid=3026421573952208900, osd=11, stripe=1 obj=536870924.4.704643683.2
}}}

== fs osd ==

{{{
# fs osd /afs/desytest/osdtest/mirrored-file
/afs/desytest/osdtest/mirrored-file has 172 bytes of osd metadata, v=3
On-line, 1 segm, flags=0x0
  segment: lng=0, offs=0, stripes=1, strsize=16384, cop=2, 2 objects
    object: pid=536870924, oid=144117812300873732, osd=10, stripe=0 obj=536870924.4.33555043.0
    object: pid=536870924, oid=144117812300873732, osd=11, stripe=0 obj=536870924.4.33555043.0
}}}
where ''osd=10'' and ''osd=11'' identify the storing OSD of each file copy. IDs are looked up using `osd listo`.

== osd wipecandidates ==

{{{
# osd wipe io
# osd wipe iokaste
[ some waiting time ]
RXOSD_wipe_candidates failed with code -1
Request aborted.
# osd wipe /afs/desytest/osdtest/mirrored._usr_afs.2007-12-17.tar.bz2
(no error output)
# osd wipe io 0
# osd wipe io 1
# osd wipe io 2
0 Jan 24 16:03 4209654 536870924.4.33555043.0 536870924.4.33555043
# osd wipe io 3
0 Jan 28 15:31 4209654 536870924.4.33555043.0 536870924.4.33555043
# osd wipe io 4
RXOSD_wipe_candidates failed with code 5
Request aborted.
}}}
 Note:: The last request probably failed because there was no /vicepe/AFSIDat/ directory on io at the time.
 Note:: Using a filename as the only argument to `osd wipe` should have at least yielded an error similar to when using ''iokaste'' (which is not an OSD server).
 Note:: The entries for io/2 and io/3 are the two copies of the file created above.

== fs wipe ==

{{{
# fs wipe mirrored._usr_afs.2007-12-17.tar.bz2
Could not wipe mirrored._usr_afs.2007-12-17.tar.bz2, error code was 22
}}}
 Note:: This appears to be the default answer to an attempted wipe of a file that has not yet been archived using `fs archive`.

== osd createvolume ==

{{{
[iokaste] ~ # vos create iokaste a osdtestvol2
Volume 536870927 created on partition /vicepa of iokaste
[iokaste] ~ # osd createvol io 536870927 0
Createvolume: volume 536870927 successfully created.
[iokaste] ~ # osd volu io 0
2 volumes found:
536870924
536870927
[iokaste] ~ # osd volu io 1
1 volumes found:
536870924
}}}

== osd volumes ==

{{{
# osd volumes io
1 volumes found:
536870924
}}}
That's the same VolID as in AFS:
{{{
# vos listvl
VLDB entries for all servers

osdtestvol
    RWrite: 536870924
    number of sites -> 1
       server iokaste.ifh.de partition /vicepa RW Site

root.cell
    RWrite: 536870921
    number of sites -> 1
       server iokaste.ifh.de partition /vicepa RW Site

Total entries: 2
}}}
 Note:: This information seems to be persistent, even after removing all of a volume's files from Object Storage:
{{{
[iokaste] /afs/desytest/osdtest # osd volu io 0
1 volumes found:
536870924
[iokaste] /afs/desytest/osdtest # osd volu io 1
1 volumes found:
536870924
[iokaste] /afs/desytest/osdtest # osd volu io 2
1 volumes found:
536870924
[iokaste] /afs/desytest/osdtest # osd volu io 3
1 volumes found:
536870924
[iokaste] /afs/desytest/osdtest # osd volu io 4
1 volumes found:
536870924
[iokaste] /afs/desytest/osdtest # osd volu io 5
0 volumes found:
[iokaste] /afs/desytest/osdtest # osd obj io 536870924 0
0 object(s) with totally 0 bytes for volume 536870924 found
[iokaste] /afs/desytest/osdtest # osd obj io 536870924 1
0 object(s) with totally 0 bytes for volume 536870924 found
[iokaste] /afs/desytest/osdtest # osd obj io 536870924 2
0 object(s) with totally 0 bytes for volume 536870924 found
[iokaste] /afs/desytest/osdtest # osd obj io 536870924
0 object(s) with totally 0 bytes for volume 536870924 found
[iokaste] /afs/desytest/osdtest # osd obj io 536870924 3
0 object(s) with totally 0 bytes for volume 536870924 found
[iokaste] /afs/desytest/osdtest # osd obj io 536870924 4
0 object(s) with totally 0 bytes for volume 536870924 found
}}}

== osd objects ==

Both volume and partition are needed:
{{{
# osd obj io 536870924
0 object(s) with totally 0 bytes for volume 536870924 found
# osd obj io 536870924 2
536870924.4.33555043.0 fid 536870924.4.611 tag 0 not-striped lng 4209654 lc 1
1 object(s) with totally 4209654 bytes for volume 536870924 found
}}}
 Note:: Without specifying a lun, the default of 0 (vicepa) is used.

== osd examine ==

This command also has issues with missing LUNs:
{{{
[iokaste] /afs/desytest/osdtest # osd exa io 536870924.6.621.0
RXOSD_examine failed with code 5
Request aborted.
[iokaste] /afs/desytest/osdtest # osd exa io 536870924.6.621.0 -lun 4
536870924.6.621.0 fid 536870924.6.621 tag 0 not-striped lng 44537893 lc 1 Feb 21 13:53
}}}

== osd read ==

This works for unstriped files, but suffers from the "missing LUN syndrome" as well:
{{{
[iokaste] /afs/desytest/osdtest # osd read io 536870924.6.621
RX xdr error
Cannot read the object
rx_EndCall returns: 5
Request aborted.
[iokaste] /afs/desytest/osdtest # osd read io 536870924.6.621 0 512 -lun 4
sscanf failed at offset (0x0, 0x0)
[1177711200] LOG ROTATION: DAILY
[1177711200] LOG VERSION: 2.0
[1177711200] CURRENT HOST STATE: a;UP;HARD;1;PING OK - Packet loss = 0%, RTA = 0.13 ms
[1177711200] CURRENT HOST STATE: amafs;UP;HARD;1;PING OK - Packet loss = 0%, RTA = 0.14 ms
[1177711200] CURRENT HOST STATE: aphrodite;UP;HARD;1;PING OK - Packet loss = 0%, RTA = 0.54 ms
[1177711200] CURRENT HOST STATE: aquila;UP;HARD;1;PING OK - Packet loss = 0%, RTA = 0.25 ms
[1177711200] CURRENT HOST STATE: ares;UP;HARD;1;PING OK - Packet loss = 0%, RTA = 0.
reading of 512 bytes took 0.003 sec.
Total data rate = 180 Kbytes/sec. for read.
}}}
 Note:: '''0''' and '''512''' in example 2 are ''offset'' and ''bytes to read'', resp.

== fs replaceosd ==

First tries were very discouraging:
{{{
[iokaste] ~ # fs osd /afs/desytest/osdtest/nagios-04-29-2007-00.log
/afs/desytest/osdtest/nagios-04-29-2007-00.log has 132 bytes of osd metadata, v=3
On-line, 1 segm, flags=0x0
  segment: lng=0, offs=0, stripes=1, strsize=0, cop=1, 1 objects
    object: pid=536870924, oid=2667174690822, osd=12, stripe=0 obj=536870924.6.621.0
[iokaste] ~ # fs rep /afs/desytest/osdtest/nagios-04-29-2007-00.log 12 8
failed to replace osd 12 for /afs/desytest/osdtest/nagios-04-29-2007-00.log, error code is 5
[iokaste] ~ # fs rep /afs/desytest/osdtest/nagios-04-29-2007-00.log 12
}}}
The last command hung and could not be TERMed or KILLed. A subsequent
{{{
[iokaste] ~ # fs osd /afs/desytest/osdtest/nagios-04-29-2007-00.log
}}}
got stuck much the same.
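
== a note on stripe sizes and object lengths ==

The object lengths reported by `osd objects` for our striped files are consistent with data being dealt out round-robin to the stripes in chunks of the stripe size. The Python sketch below reproduces the numbers for the kernel image stored with `fs creates ... 2 14`. Treat it as our reading of the observed output, not as a description of the OpenAFS+OSD implementation: the round-robin layout is an assumption, and the function names `stripe_size` and `stripe_object_lengths` are made up for illustration.

```python
def stripe_size(exponent):
    """Stripe size in bytes for the exponent passed to `fs createstripedfile`.

    The allowed range 12..19 is taken from the note above (4kB..512kB).
    """
    if not 12 <= exponent <= 19:
        raise ValueError("exponent must be in 12..19")
    return 1 << exponent


def stripe_object_lengths(file_size, stripes, exponent):
    """Bytes ending up in each stripe object.

    ASSUMPTION: chunk i of the file (each chunk is one stripe size long)
    goes to stripe i % stripes; the trailing partial chunk follows the
    same rule.
    """
    size = stripe_size(exponent)
    full_chunks, tail = divmod(file_size, size)
    lengths = []
    for stripe in range(stripes):
        # Number of full chunks whose index is congruent to this stripe.
        count = full_chunks // stripes
        if stripe < full_chunks % stripes:
            count += 1
        lengths.append(count * size)
    if tail:
        # The partial chunk has index full_chunks.
        lengths[full_chunks % stripes] += tail
    return lengths


if __name__ == "__main__":
    # Kernel image from the striped-file example: 1765652 bytes, 2 stripes, 2^14.
    print(stripe_object_lengths(1765652, 2, 14))
```

For the kernel image this yields 884736 bytes for stripe 0 and 880916 bytes for stripe 1, which are the '''lng''' values shown by `osd obj io 536870924 4` and `osd obj io 536870924 3` respectively.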

= Case study: Creating a long striped file =

{{{
[iokaste] /afs/desytest/osdtest # fs creates striped-ascii-44m 2 12
done
[iokaste] /afs/desytest/osdtest # cp /root/nagios-04-29-2007-00.log striped-ascii-44m
[ took about 3 seconds ]
[iokaste] /afs/desytest/osdtest # fs osd striped-ascii-44m
striped-ascii-44m has 172 bytes of osd metadata, v=3
On-line, 1 segm, flags=0x0
  segment: lng=0, offs=0, stripes=2, strsize=4096, cop=1, 2 objects
    object: pid=536870924, oid=576463393708310532, osd=12, stripe=0 obj=536870924.4.134218343.0
    object: pid=536870924, oid=2882306402922004484, osd=11, stripe=1 obj=536870924.4.671089255.0
[iokaste] /afs/desytest/osdtest # osd obj io 536870924 3
536870924.4.671089255.0 fid 536870924.4.615 tag 0 1/2/4096 lng 22267941 lc 1
1 object(s) with totally 22267941 bytes for volume 536870924 found
[iokaste] /afs/desytest/osdtest # osd obj io 536870924 4
536870924.4.134218343.0 fid 536870924.4.615 tag 0 0/2/4096 lng 22269952 lc 1
1 object(s) with totally 22269952 bytes for volume 536870924 found
}}}
 Note:: The lengths ('''lng''') of both objects add up to the size of the stored file: `22267941 + 22269952 = 44537893`.

= patch analysis =

This is a log of how the `goodies` branch came into existence.

== general problems and remarks ==

 * OSD awareness is not strictly protected by `#ifdef`s. VolSplit for example sports osd flags whether compiled in or not.
  * should/can the xg files respect definition of AFS_RXOSD_SUPPORT?
  * is it feasible to engross the declaration of, say, `UV_CreateVolume2` in a way that uses OSD-specific parameter(s) only if enabled?
   * inserting alternative calls will be fairly easy on the other hand
  Answer:: This is basically reserving of spare fields, and extra info doesn't hurt seeing as its use is #ifdef-ed.
 * can `fs fidvnode` be done without OSD support enabled?
  Answer:: Being implemented right now.
 * the `fid*` subcommands to `fs` sort of pollute its range of subcommands, `fs help` has gone quite long
  * could not a fid be specified in place of the file name in any case? or is ambiguity too great an issue?
   * in the latter case, ambiguity could be caught and reported as an error, forcing the user to specify a parameter to choose
  Answer:: We don't really want that, as fids are generally valid file names as well. Lengthening the `fs help` output is tragic but to be solved in another way, if at all.
 * AFSFetchStatus.SyncCount is still used in WINNT/afsd/cm_[sd]cache.c
  * however, these are only initializations, the values are apparently never used, so it should be OK to erase/replace the initializations?
  Status:: fixed in the branch already, will be backported to trunk after build test

== unidentified snippets ==

 * {o} src/viced/afsfileprocs.c:2472
  * Hartmut inserted an `#ifdef AFS_NT40_ENV` protecting two functions and some definitions. Probably a fix.
  Answer:: As the calls to the functions were thus #ifdef-ed already, so are their definitions now.
 * {o} src/viced/afsfileprocs.c:229
  * this function is declared here in Hartmut's code. not yet sure why. probably because of the use introduced into common_StoreData64. not yet sure what kind of fixes are hidden here and to what extent they're required. '''Watch patches for calls to `common_StoreData64`'''!
  Answer:: There have been changes to their structure and some unification. It has to do with the different prefixes ''SAFS_'' and ''SRXAFS_''.
 * {o} src/viced/afsfileprocs.c:3715
  * as mentioned above, the function common_StoreData64 was patched quite a bit. might have to ask Hartmut.
 * {o} src/viced/afsfileprocs.c:4858
  * looks like a bug fix
  Answer:: The concept of file quota is new and interesting for MR/wipeable volumes.
 * {o} src/viced/afsfileprocs.c:5283
  * this line moved inside the following if block. bug fix?
  Answer:: Yes.
 * /!\ src/viced/afsfileprocs.c:7379
  * this appears to be lacking the matching setActive()
  * (!) probably matched to the ones in GiveUpCallBacks() and GiveUpAllCallbacks(), but cannot be sure yet.
  Answer:: It's about right.
 * afsfileprocs.c:
  * {o} `static char *locktype[4]`
   * what's this about?
  * {o} `SRXAFS_SetVolumeStatus`
   * something was stripped from here?
 * {o} fs.c
  * Hartmut seems to pass NULL instead of 0 to cmd_CreateSyntax conventionally. Where the hell is the declaration for this? Looks like a style fix/correction.
  Answer:: Yes. It had been done in 1.4.7 in some places, anyway, and it's correct.
 * {o} vol/vnode.c
  * This new log message is more compelling. I'd have included this in the fs_listvnode patch but it looks like a general fix and should go into a general patch as such.
 * {o} fsint/afsint.xg:136
  * Looks like a common sense fix to exclude negative values here.
  Answer:: Pretty much. Allows for larger numbers, too.
 * {o} general question: why does `fs vnode` not work without OSD support in code?
  * have I missed something?
  * are vnodes fundamentally different with Hartmut's code? this couldn't be - it's interoperable w/ vanilla openafs when not built with --object-storage
  Answer:: There are still bugs.
 * {o} volser/vsutils.c:469
  * this also has a fix-ish look to it
  Answer:: Allows for a read to terminate after the VolID (i.e. at the first '.')
 * {o} venus/kdump.c
  * not understood yet what this does at all
  Answer:: This is a debug utility that's supposed to allow the user to peek inside the kernel. According to Hartmut, there's trouble with 64-bit.
 * {o} vol/fssync.c
  * exactly what is that "good idea" and what do the other changes do?
  Answer:: GOOD_IDEA is just a random define to make new code more visible.
 * {o} venus/fs.c:155
  * why is InitializeCBService() protected by `#ifdef AFS_RXOSD_SUPPORT`?
  * ListOffline() depends on it, but the command definition is not protected. Which one's the error here?
  Answer:: Most subcommands probably don't really need it. It's required when inquiring about files that are in object storage.
 * {*} venus/fs.c:3306
  * this looks like a fix but I can't quite figure it out
  Answer:: Actually that was an addition to the functionality (changing work station cell on the fly).
 * {o} Makefile.in
  * target ''vol'' is now dependency of both viced and afsd. how come?
  * from a certain `rm -rf` statement, Hartmut deleted the ${SYS_NAME} argument. why?
 * vol/vol-salvage.c
  * {o} 2939: why introduce the ''length'' variable here?

== isolating vos splitvol ==

 * sought through the patch, identifying places where
  * the client command was changed
  * the server was given the extra RPC
 * used `vimdiff` to introduce changes into
  * rxgen file
  * vos implementation
  * volser implementation
 and added the new ''vol_split.c'' code file, therefore
 * also updating volser's ''Makefile.in''

 Conclusion:: It turns out that the volsplit functionality depends on the '''reverse lookup''' functionality, also new with the full patch. Thus isolating volsplit failed at this stage and must be postponed until reverse lookups have been isolated.
 Addendum:: Apparently, InverseLookup is only used by vos splitvol, so both possibly belong in the same sub patch. For better

= Making the Rxosd server a fileserver+Rxosd server =

 1. {{{ osd setosd 12 -wrprior 0 }}} where 12 is the ID of the OSD we want cleared.
 2. {{{ for fid in `vos listobjects 12 iokaste` ; do fs fidreplaceosd $fid 12 ; done }}} `vos listo 12 iokaste` returns the ''object id''s of all objects stored on OSD 12 and metadata in volumes on ''iokaste''.
 3. {{{ osd deleteosd 12 }}}
 4. {{{ bos restart io rxosd }}}
 5. unmount ''/vicepe'' and create new filesystem (data is still being kept otherwise, salvager tries and creates non-attachable volumes from what's left on the disk)
 6. mount ''/vicepe''
 7. touch file ''OnlyRxosd'' on vice partitions a-d
 8. create the fs instance on io

= Stress testing =

== 2009-02-11 ==

 * galaxies 11 through 30 setup with osd147plus and 64MB memcache
 * reading several gigabytes of data from OSDs
 * performance was very poor, we discovered high client system loads due to heavy syslog activity
  * AFS module generated many warnings: `afs_get_hash_stats: Warning! exceeded max bucket len xx` with xx some number over 30
 * things we tried
  * client 1.4.8, did '''not''' help
  * don't use blocksize of 4MB, did '''not''' help
  * client satyr4, did help (slower network?)
  * client phobos, did help (disk cache?)
  * disk cache, did help
  * read from AFS fileserver, did help
 * apparently, reading from OSD with memcache is the culprit

/!\ For the time being, galaxy11 remains our only 1.4.8 farm node for testing

= Volume splitting =

Test scripts lived in root.cell. We didn't like that anymore, thus:
{{{
[phobos] ~ # fs getfid /afs/desytest/wn-tests
File /afs/desytest/wn-tests (536870922.5.7641) contained in volume 536870922
[phobos] ~ # vos split root.cell wntests 5 -verbose
Volume wntests 536871067 created and brought online
Created the VLDB entry for the volume wntests 536871067
1st step: extract vnode essence from large vnode file
5 large vnodes found
2nd step: look for name of vnode 5 in directory 536870921.1.1
name of 5 is wn-tests
3rd step: find all directory vnodes belonging to the subtree under 5 "wn-tests"
2 large vnodes will go into the new volume
4th step extract vnode essence from small vnode file
14 small vnodes found
5th step: find all small vnodes belonging to the subtree under 5 "wn-tests"
7 small vnodes will go into the new volume
6th step: create hard links in the AFSIDat tree between files of the old and new volume
7th step: create hard links in the AFSIDat tree between directories of the old and new volume and make dir 5 to new volume's root directory.
8th step: write new volume's metadata to disk
9th step: create mountpoint "wn-tests" for new volume in old volume's directory 1.
10th step: delete large vnodes belonging to subtree in the old volume.
11th step: delete small vnodes belonging to subtree in the old volume.
Finished!
[phobos] ~ # ls /afs/desytest
files import_satyr4 phobos_volumes testing wn-tests
[phobos] ~ # fs lsm /afs/desytest/wn-tests
'/afs/desytest/wn-tests' is not a mount point.
[phobos] ~ # ls /afs/desytest/wn-tests
Ice IceLarge IceLargeNOOSD Output scripts
[phobos] ~ # vos release root.cell
Released volume root.cell successfully
[phobos] ~ # fs lsm /afs/desytest/wn-tests
'/afs/desytest/wn-tests' is a mount point for volume '#wntests'
[phobos] ~ # ls /afs/desytest/wn-tests
Ice IceLarge IceLargeNOOSD Output scripts
[phobos] ~ # fs lq /afs/desytest/wn-tests
Volume Name Quota Used %Used Partition
wntests 5000 10 0% 20%
}}}
Yes, it's really that simple!