[[TableOfContents]]

= General concepts =

== Goodies ==
 * the `threads` subcommand lets you know what each thread is currently doing {{{osd threads io }}}

== Working with an AFS+OSD cell ==
The following techniques are only useful when using OSD as a frontend for HSM:
 * {{{vos listobj }}} is useful for finding all objects stored on a given OSD
 * {{{fs prefetch }}} allows the user to schedule tape restore operations
 * the fetch queue for a given OSD can be examined: {{{osd fetchqueue io_a osd f io_a}}}
 * it is useful to have ''bosserver'' run a script to wipe objects off the OSDs according to OSD fill rate etc.
  * the atime can be used to reach wiping decisions
   * this requires a filesystem with atime support
  * the OSDDB can tell wipeable and non-wipeable OSDs apart

Other HSM-related features:
 * md5 checksums can be retrieved for archival copies of objects (unverified) {{{ osd md5 io 536870930.2.1413.0 4 }}}

This is how a fileserver and an OSD server can share one machine:
 * touch the file ''/vicepx/OnlyRXOSD'' to keep the fileserver from using this partition

The new `fs ls` subcommand can tell files apart:
{{{
[iokaste] /afs/desytest/osdtest_2 # fs ls
m rwx root       2048 2008-05-08 13:21:01 .
m rwx root       2048 2008-05-08 16:31:48 ..
f rw- bin    44537893 2008-05-08 11:36:40 ascii-file
f rw- bin     1048576 2008-05-08 13:21:14 ascii-file.osd
d rwx bin        2048 2008-05-08 09:06:52 dir
f rw- bin           0 2008-05-08 13:08:46 empty-file
o rw- bin    44537893 2008-05-08 13:19:58 new-after-move
}}}
where '''m''' is a mountpoint, '''f''' a file, '''d''' a directory and '''o''' an object. `fs ls` will also identify files whose objects have been wiped from on-line object storage (i.e., files with archival copies only).

=== How to migrate data from an OSD ===
 1. set a low write priority to stop fileservers from storing data on the OSD in question {{{osd setosd -wrprior 0 }}}
 2. use {{{vos listobj }}} to identify the files (by fid) that have data on the OSD
 3. use {{{fs replaceosd }}} to move each file's data to another OSD

== Priorities and choice of storing OSD ==
 * OSDs are used in a round-robin fashion
 * priorities add weight to each OSD
 * static priorities can be set by the administrator {{{osd setosd -wrprior ... -rdprior ... }}}
 * priorities are dynamic
  * ownerships apply a modifier of +/- 10 to the static priority
  * locations apply a modifier of +/- 20
  * the fill percentage plays a role if above 50% or 95% respectively

Customizing owner, location:
 * the `osd addserver` subcommand {{{osd adds 141.34.22.101 iokaste dtc ifh }}} makes an AFS fileserver (!) known to the OSDDB
  * this is required in order to make use of ''ownerships'' and ''locations''
  * this data can be examined using
  {{{
[iokaste] /afs/desytest/osdtest_2 # osd servers
Server 'iokaste' with id=141.34.22.101:
        owner = 1685349120 = 'dtc'
        location = 1768318976 = 'ifh'
  }}}
 * /!\ The help on `osd addserver` is misleading:
 {{{
# osd help adds
osd addserver: create server entry in osddb
Usage: osd addserver -id -name [-owner ] [-location ] [-cell ] [-help]
 }}}
 as the name to be specified is not actually an "osd name" but an alias name for the file server you're adding.

== Data held in volumes, DBs etc. ==
 * all metadata belonging to files in a volume is stored in a designated volume special file
 * metadata references OSDs by ''id''
  * the OSD ''id'' is an index into the OSDDB
  * these IDs are permanent and can never be reused after deletion of an OSDDB entry
  * view the OSDDB using {{{osd l }}}
 * a file's metadata is found using the ''metadataindex'' stored in the file's vnode
  * view the metadata using {{{fs osd }}}

== How to upgrade a cell to AFS+OSD ==
 1. set up OSDDB on the database servers
 2. set up pristine AFS+OSD fileservers + OSDs
 3. move volumes to the AFS+OSD fileservers
  * volserver is supposed to be armed with a `-convertvolumes` switch for that purpose
  * otherwise, set the osdflag by hand {{{vos setfields -osd 1 }}}

= Policies =
 * policies are going to decide 4 basic things:
  1. whether object storage is to be used for a given file
  2. whether the objects are to be mirrored and how often
  3. how many stripes the file will comprise
  4. the size of the stripes
 * two pieces of information will be used to make decisions:
  1. file size
  2. file name (i.e. prefix and suffix)

== Open questions ==
 * will inheritance be supported for policy definitions?
  * how many levels of inheritance are permitted (1 or ''n'')?
 * in what way can policies be represented/stored?

= Technical aspects =

== Performance ==
 * client connections to OSDs are cheap because they are unauthorized
 * clients connect to every single OSD they have business with
 * `osd psread` acts as though reading stripes from multiple OSDs
  * i.e., it opens several connections; in this case, however, all targets are the same
  * each stripe has an altered offset (e.g., normally the first n stripes start at offset 0 on each OSD, here it is 0+(i*stripesize) etc.)
 * this is not impossible for access to AFS fileservers, but making connections is more costly

= Notes on the code (changes) =
 * hashing is used for management of DB contents (not a change)
 * the choice of OSDs resides in ''osddbuser.c''

== Link tables ==
 * original link table entries consisted of 5 columns of 3 bits each
  * thus, 5 different versions of a file were supported, each with a max. link count of 7
 * AFS+OSD (like MR-AFS) supports more places for one and the same file to live:
 {{{
 * 2 RW volumes (the 2nd during a move operation)
 * 1 BK volume
 * 13 RO volumes
 * 1 clone during move
 }}}
 which amounts to up to 17 places for a file
 * also, there might be as many as 6 versions of a file:
 {{{
 * 1 RW volume
 * 1 BK volume
 * 1 clone during move
 * 1 RO
 * 1 RO-old during vos release
 * 1 may be an old RO which was not reachable during the last vos release.
 }}}
 * as at least 6 columns are needed with at least 5 bits per column, a link table row now consists of 32 bits instead of 16

The explanations are from ''vol/namei_ops.c''. The new format is used as ''Linktable version 2'', with the original format still being supported as ''version 1''.

== Debugging techniques ==
 * add trace* calls to the code
  * `CM_TRACE_WASHERE` is especially handy
 * use {{{fstrace }}} to enable debugging on the client/server (?)

= Open issues =
 * multihomed servers are a problem
  * this actually requires changes to the DB
  * possible coding technique: change existing RPCs, provide the original RPC as "Old..." that the clients can know about
 * the link count is a critical datum
  * it is controlled by UDP-based RPCs
  * this can cause data loss
  * are correcting algorithms necessary?
 * stripe sizes appear to be chosen too small for realistic use
  * current possible values would require immense read-ahead by the client
 * file size is not always known at creation time
  * this is a general problem for files that are larger than the client cache, as those ''will'' be fsync'd to the fileserver prematurely
 * is there any kind of salting to the round-robin algorithm used for choosing the storing OSDs?
  * i.e., will a job that writes ''n'' files of fixed but different sizes periodically to ''n'' OSDs always put the ''i''th file onto the ''i''th OSD?
 * how can one change the layout of a file segment (e.g., add a mirror)?
  * this would require a new subcommand and a new RPC for the RXOSD server
 * can the changes be ported to openafs-1.5.x?
  * this has been done for an earlier version of the 1.5 branch and should be possible in general
 * many osd subcommands require the user to enter IP+LUN, where the OSD ID would be better
  * this has to be changed, and care must be taken as these commands are in production at Garching

= Unsorted =
 * something about Copy``On``Write for volume replicas
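
= Worked example: link table row layout =
The bit arithmetic from the ''Link tables'' section can be checked with a small sketch. This is not the code from ''vol/namei_ops.c''. The column widths and counts are the ones given in the notes, but the helper names (`max_count`, `get_count`, `set_count`) and the bit ordering (column 0 in the lowest bits) are assumptions made purely for illustration.

```python
# Illustrative only: packs per-version link counts into one link table row.
# Widths/column counts are taken from the notes; the bit ordering
# (column 0 occupies the lowest bits) is an assumption of this sketch.

V1 = {"columns": 5, "bits": 3, "row_bits": 16}  # original format
V2 = {"columns": 6, "bits": 5, "row_bits": 32}  # "Linktable version 2"


def max_count(fmt):
    """Largest link count one column can hold."""
    return (1 << fmt["bits"]) - 1


def get_count(row, column, fmt):
    """Read the link count stored in the given column of a row."""
    if not 0 <= column < fmt["columns"]:
        raise IndexError("no such column")
    return (row >> (column * fmt["bits"])) & max_count(fmt)


def set_count(row, column, count, fmt):
    """Return a copy of row with the given column set to count."""
    if not 0 <= column < fmt["columns"]:
        raise IndexError("no such column")
    if not 0 <= count <= max_count(fmt):
        raise ValueError("link count does not fit into the column")
    shift = column * fmt["bits"]
    mask = max_count(fmt) << shift
    return (row & ~mask) | (count << shift)


if __name__ == "__main__":
    for name, fmt in (("version 1", V1), ("version 2", V2)):
        used = fmt["columns"] * fmt["bits"]
        print(name, "uses", used, "of", fmt["row_bits"], "bits,",
              "max link count", max_count(fmt))
```

With 3 bits per column the largest count is 7, which cannot represent the 17 places listed above; 5 bits allow counts up to 31, and 6 such columns need 30 bits, hence the 32-bit row.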