[[TableOfContents]]

= General concepts =

== Goodies ==
 * the `threads` subcommand shows what each thread is currently doing:
{{{
osd threads io
}}}

== Working with an AFS+OSD cell ==
The following techniques are only useful when OSD acts as a frontend for HSM:
 * {{{vos listobj}}} is useful for finding all objects stored on a given OSD
 * {{{fs prefetch}}} allows the user to schedule tape restore operations
 * it is useful to have ''bosserver'' run a script that wipes objects off the OSDs according to OSD fill rate etc.

These are general techniques:
 * {{{osd addserver}}} makes an AFS fileserver (!) known to the OSDDB
  * this is required in order to make use of ''ownerships'' and ''locations''

This is how a fileserver and an OSD server can share one machine:
 * touch the file ''/vicepx/OnlyRXOSD'' to forbid the fileserver from using this partition

=== How to migrate data from an OSD ===
 1. set a low write priority to stop fileservers from storing new data on the OSD in question:
{{{
osd setosd -wrprior 0
}}}
 2. use {{{vos listobj}}} to identify the files (by fid) that have data on the OSD
 3. use {{{fs replaceosd}}} to move each file's data to another OSD

== Priorities and choice of storing OSD ==
 * OSDs are used in a round-robin fashion
 * priorities add weight to each OSD
 * static priorities can be set by the administrator:
{{{
osd setosd -wrprior ... -rdprior ...
}}}
 * priorities are dynamic:
  * ownerships apply a modifier of +/- 10 to the static priority
  * locations apply a modifier of +/- 20
  * the fill percentage plays a role once it exceeds 50% or 95%, respectively
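The priority scheme above can be sketched as follows. This is illustrative Python only, ''not'' the actual implementation (which lives in ''osddbuser.c''); the documented facts are the +/- 10 ownership modifier, the +/- 20 location modifier, and the 50%/95% fill thresholds — the concrete penalty values and all names here are assumptions.

```python
# Hypothetical sketch of the OSD write-priority logic -- not the real code
# from osddbuser.c.  The +/-10 / +/-20 modifiers and the 50%/95% fill
# thresholds come from the wiki text; the penalty magnitudes for a full
# OSD are invented for illustration.
from dataclasses import dataclass

@dataclass
class Osd:
    id: int
    wr_prior: int          # static write priority ('osd setosd -wrprior')
    owned: bool = False    # requesting fileserver "owns" this OSD
    local: bool = False    # OSD shares a "location" with the fileserver
    fill_pct: float = 0.0  # how full the OSD is, in percent

def effective_priority(osd: Osd) -> int:
    """Static priority plus the dynamic modifiers described above."""
    prio = osd.wr_prior
    prio += 10 if osd.owned else -10     # ownership: +/- 10
    prio += 20 if osd.local else -20     # location:  +/- 20
    if osd.fill_pct > 95:                # nearly full: avoid if possible
        prio -= 100                      # (assumed magnitude)
    elif osd.fill_pct > 50:              # more than half full: mild penalty
        prio -= 10                       # (assumed magnitude)
    return prio

def choose_osd(osds: list[Osd], rr_counter: int) -> Osd:
    """Round robin among the OSDs sharing the highest effective priority."""
    best = max(effective_priority(o) for o in osds)
    candidates = [o for o in osds if effective_priority(o) == best]
    return candidates[rr_counter % len(candidates)]
```

With two equally weighted OSDs, successive calls alternate between them; a nearly full OSD drops out of the candidate set.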
== Data held in volumes, DBs etc. ==
 * all metadata belonging to the files in a volume is stored in a designated volume special file
 * metadata references OSDs by ''id''
  * an OSD ''id'' is an index into the OSDDB
  * these IDs are permanent and can never be reused after deletion of an OSDDB entry
  * view the OSDDB using:
{{{
osd l
}}}
 * a file's metadata is found via the ''metadataindex'' stored in the file's vnode
  * view the metadata using:
{{{
fs osd
}}}

== How to upgrade a cell to AFS+OSD ==
 1. set up the OSDDB on the database servers
 2. set up pristine AFS+OSD fileservers + OSDs
 3. move volumes to the AFS+OSD fileservers
  * the volserver is supposed to be armed with a `-convertvolumes` switch for that purpose
  * otherwise, set the osd flag by hand:
{{{
vos setfields -osd 1
}}}

= Policies =
 * policies are going to decide 4 basic things:
  1. whether object storage is to be used for a given file
  2. whether the objects are to be mirrored, and how often
  3. how many stripes the file will comprise
  4. the size of the stripes
 * two pieces of information will be used to make these decisions:
  1. file size
  2. file name (i.e. prefix and suffix)

== Open questions ==
 * will inheritance be supported for policy definitions?
 * how many levels of inheritance are permitted (1 or ''n'')?
 * in what way can policies be represented/stored?

= Technical aspects =

== Performance ==
 * client connections to OSDs are cheap because they are unauthenticated
 * clients connect to every single OSD they have business with
 * `osd psread` acts as though it were reading stripes from multiple OSDs
  * i.e. it opens several connections; in this case, however, all targets are the same
  * each stripe has an altered offset (normally the first ''n'' stripes each start at offset 0 on their OSD; here stripe ''i'' starts at offset 0 + i*stripesize)
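The offset arithmetic just described can be sketched as follows — an illustrative Python fragment, not the client's actual code; the function name and signature are assumptions.

```python
# Sketch of the per-connection offsets used by 'osd psread' (illustrative
# only).  With real striping, stripe i is a separate object on OSD i and
# each read starts at object offset 0.  psread reads one unstriped object
# over several parallel connections to the same OSD, so connection i must
# instead start at offset i * stripesize within that single object.
def stripe_offsets(nconns: int, stripesize: int, striped: bool) -> list[int]:
    if striped:
        # one object per OSD, every read begins at the start of its object
        return [0] * nconns
    # psread: all connections target the same object, at staggered offsets
    return [i * stripesize for i in range(nconns)]
```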
 * this is not impossible for access to AFS fileservers, but making such connections is more costly

= Notes on the code (changes) =
 * the original link table was insufficient; further fields were required
  * link table v2 uses 30 bits per entry instead of 15
 * hashing is used for management of DB contents
 * the choice of OSDs resides in ''osddbuser.c''

== Debugging techniques ==
 * add trace* calls to the code
  * CM_TRACE_WASHERE is especially handy
 * use {{{fstrace}}} to enable debugging on the client/server (?)

= Open issues =
 * multihomed servers are a problem
  * this actually requires changes to the DB
  * possible coding technique: change the existing RPCs and provide each original RPC as "Old..." so that existing clients can still be served
 * the link count is a critical datum
  * it is controlled by UDP-based RPCs
  * this can cause data loss
  * are correcting algorithms necessary?
 * stripe sizes appear to be chosen too small for realistic use
  * the currently possible values would require immense read-ahead by the client
 * the file size is not always known at creation time
  * this is a general problem for files that are larger than the client cache, as those ''will'' be fsync'd to the fileserver prematurely
 * is there any kind of salting of the round-robin algorithm used for choosing the storing OSDs?
  * i.e., will a job that periodically writes ''n'' files of fixed but different sizes to ''n'' OSDs always put the ''i''th file onto the ''i''th OSD?
 * how can one change the layout of a file segment (e.g., add a mirror)?
  * this would require a new subcommand and a new RPC for the RXOSD server
 * can the changes be ported to openafs-1.5.x?
  * this has been done for an earlier version of the 1.5 branch and should be possible in general
 * many osd subcommands require the user to enter IP+LUN where an OSD ID would be better
  * this has to be changed, and care must be taken, as these commands are in production at Garching

= Unsorted =
 * something about Copy``On``Write for volume replicas
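Returning to the ''Policies'' section above: the four decisions a policy makes (use of object storage, mirroring, stripe count, stripe size) based on file size and file name could be sketched as below. This is purely hypothetical — no such implementation exists yet; every name, threshold, and rule here is an invented example of the planned mechanism.

```python
# Hypothetical sketch of policy evaluation as outlined in the Policies
# section: file size and file name (prefix/suffix) drive four decisions.
# All thresholds and rules are made-up examples, not a real policy format.
from dataclasses import dataclass

@dataclass
class PolicyDecision:
    use_osd: bool    # 1. use object storage at all?
    copies: int      # 2. how many mirrored copies
    nstripes: int    # 3. how many stripes
    stripesize: int  # 4. bytes per stripe (0 if not striped)

def evaluate_policy(filename: str, size: int) -> PolicyDecision:
    # small files stay on the classic fileserver (example threshold: 1 MB)
    if size < 1024 * 1024:
        return PolicyDecision(False, 1, 1, 0)
    # example suffix rule: mirror precious raw data twice, stripe widely
    if filename.endswith(".raw"):
        return PolicyDecision(True, 2, 4, 1024 * 1024)
    # default for large files: unmirrored, moderately striped
    return PolicyDecision(True, 1, 2, 1024 * 1024)
```

Whether real policies will look anything like this depends on the open questions above (inheritance, representation, storage).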