[[TableOfContents]]

= General concepts =

== Goodies ==
 * the `threads` subcommand lets you know what each thread is currently doing {{{osd threads io
}}}

== Working with an AFS+OSD cell ==
The following techniques are only useful when using OSD as a frontend for HSM:
 * {{{vos listobj
}}} is useful for finding all objects stored on a given OSD
 * {{{fs prefetch
}}} allows the user to schedule tape restore operations
  * the fetch queue for a given osd can be examined: {{{osd fetchqueue io_a
osd f io_a}}}
 * it is useful to have ''bosserver'' run a script to wipe objects off the OSDs according to OSD fill rate etc.
 * the atime can be used to reach wiping decisions
  * this requires a filesystem with atime support
 * the OSDDB can tell wipeable and non-wipeable OSDs apart

Other HSM related features:
 * md5 checksums can be retrieved for archival copies of objects (unverified) {{{ osd md5 io 536870930.2.1413.0 4
}}}

This is how a fileserver and OSD server can share one machine:
 * touch the file ''/vicepx/OnlyRXOSD'' to keep the fileserver from using this partition

The new `fs ls` subcommand can tell files apart: 
 {{{ [iokaste] /afs/desytest/osdtest_2 # fs ls
m rwx    root        2048 2008-05-08 13:21:01 .
m rwx    root        2048 2008-05-08 16:31:48 ..
f rw-     bin    44537893 2008-05-08 11:36:40 ascii-file
f rw-     bin     1048576 2008-05-08 13:21:14 ascii-file.osd
d rwx     bin        2048 2008-05-08 09:06:52 dir
f rw-     bin           0 2008-05-08 13:08:46 empty-file
o rw-     bin    44537893 2008-05-08 13:19:58 new-after-move }}} where '''m''' is a mount point, '''f''' a file, '''d''' a directory and '''o''' an object. `fs ls` will also identify files whose objects have been wiped from on-line object storage (i.e., files with archival copies only).





=== How to migrate data from an OSD ===
 1. set a low write priority to stop fileservers from storing data on the OSD in question {{{osd setosd -wrprior 0
}}}
 2. use {{{vos listobj
}}} to identify the files (by fid) that have data on the OSD
 3. use {{{fs replaceosd
}}} to move each file's data to another OSD

== Priorities and choice of storing OSD ==
 * OSDs are used in a Round Robin fashion
 * priorities add weight to each OSD
 * static priorities can be set by the administrator {{{osd setosd -wrprior ... -rdprior ...
}}}
 * priorities are dynamic
  * ownerships apply a modifier of +/- 10 to the static priority
  * locations apply a modifier of +/- 20
  * the fill percentage plays a role if above 50% or 95% respectively
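
A minimal sketch of how the modifiers listed above could combine into an effective write priority. The `Osd` record, the function name and the sign convention (a bonus on a match, a penalty on a mismatch) are assumptions made for illustration. The fill-percentage adjustment is left out because the notes give only its thresholds, not its size.

```python
from dataclasses import dataclass

OWNER_MODIFIER = 10      # from the notes: ownerships apply +/- 10
LOCATION_MODIFIER = 20   # from the notes: locations apply +/- 20


@dataclass
class Osd:
    """Hypothetical record; the field names are not taken from the OSDDB."""
    name: str
    wrprior: int     # static write priority (osd setosd -wrprior)
    owner: str       # up to 3 characters, e.g. 'dtc'
    location: str    # up to 3 characters, e.g. 'ifh'


def effective_write_priority(osd, server_owner, server_location):
    """Static priority plus owner and location modifiers.

    Assumption: a match adds the modifier and a mismatch subtracts it.
    """
    prio = osd.wrprior
    prio += OWNER_MODIFIER if osd.owner == server_owner else -OWNER_MODIFIER
    if osd.location == server_location:
        prio += LOCATION_MODIFIER
    else:
        prio -= LOCATION_MODIFIER
    return prio
```

Under these assumptions, an OSD with a static priority of 50 ends up between 20 and 80 depending on whether owner and location match those of the fileserver.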

Customizing owner, location:
 * the `osd addserver` subcommand {{{osd adds 141.34.22.101 iokaste dtc ifh
}}} makes an AFS fileserver (!) known to the OSDDB
  * this is required in order to make use of ''ownerships'' and ''locations''
  * this data can be examined using {{{[iokaste] /afs/desytest/osdtest_2 # osd servers
Server 'iokaste' with id=141.34.22.101:
        owner           = 1685349120 = 'dtc'
        location        = 1768318976 = 'ifh'}}}
  * /!\ The help on `osd addserver` is misleading: {{{ # osd help adds
osd addserver: create server entry in osddb 
Usage: osd addserver -id <ip address> -name <osd name> [-owner <group name (max 3 char)>] [-location <max 3 characters>] [-cell <cell name>] [-help] }}} as the name to be specified is not actually an "osd name" but an alias name for the file server you're adding.
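
The numeric values in the `osd servers` output above are consistent with the (at most) three characters being packed as ASCII bytes into a 32-bit integer, most significant byte first, with the remaining bytes set to zero. This is inferred from the two values shown, not confirmed from the code. The function names are made up for this sketch.

```python
def pack_tag(name: str) -> int:
    """Pack a 1..3 character owner/location name into a 32-bit integer."""
    raw = name.encode("ascii")
    if not 1 <= len(raw) <= 3:
        raise ValueError("owner/location names are limited to 3 characters")
    # Pad with zero bytes on the right, then read as a big-endian integer.
    return int.from_bytes(raw.ljust(4, b"\0"), "big")


def unpack_tag(value: int) -> str:
    """Inverse of pack_tag: strip the zero padding and decode."""
    return value.to_bytes(4, "big").rstrip(b"\0").decode("ascii")


print(pack_tag("dtc"))         # 1685349120, as in the output above
print(pack_tag("ifh"))         # 1768318976, as in the output above
print(unpack_tag(1685349120))  # dtc
```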

== Data held in volumes, DBs etc. ==
 * all metadata belonging to files in a volume are stored in a designated volume special file
 * metadata references OSDs by ''id''
 * OSD ''id'' is an index into the OSDDB
  * these IDs are permanent and can never be reused after deletion of an OSDDB entry
  * view the OSDDB using {{{osd l
}}}
 * a file's metadata is found using the ''metadataindex'' stored in the file's vnode
  * view the metadata using {{{fs osd <file>
}}}

== How to upgrade a cell to AFS+OSD ==
 1. set up OSDDB on the database servers
 2. set up pristine AFS+OSD fileservers + OSDs
 3. move volumes to the AFS+OSD fileservers
  * volserver is supposed to be armed with a `-convertvolumes` switch for that purpose
  * otherwise, set the osdflag by hand {{{vos setfields <volume> -osd 1
}}}



= Policies =
 * policies are going to decide 4 basic things:
  1. whether object storage is to be used for a given file
  2. whether the objects are to be mirrored and how often
  3. how many stripes the file will comprise
  4. the size of the stripes
 * two pieces of information will be used to make decisions:
  1. file size
  2. file name (i.e. prefix and suffix)

== Open questions ==
 * will inheritance be supported for policy definitions?
  * how many levels of inheritance are permitted (1 or ''n'')?
 * in what way can policies be represented/stored?





= Technical aspects =

== Performance ==
 * client connections to OSDs are cheap because they are unauthorized
 * clients connect to each individual OSD they have business with
 * `osd psread` acts as though reading stripes from multiple OSDs
  * i.e. it opens several connections, although in this case they all target the same OSD
  * each stripe has an altered offset (normally the first ''n'' stripes start at offset 0 on each OSD; here stripe ''i'' starts at 0+(i*stripesize), etc.)
  * this is not impossible for access to AFS fileservers, but making connections is more costly
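
The offset rule described above can be written down as follows. This is only an illustration of the arithmetic; the function names are hypothetical and nothing here is taken from the `osd psread` implementation.

```python
def striped_start_offsets(n_stripes):
    """Real striping: the first n stripes each start at offset 0 on their own OSD."""
    return [0 for _ in range(n_stripes)]


def psread_start_offsets(n_streams, stripe_size):
    """psread: all streams read from one OSD, so stream i starts at i * stripe_size."""
    return [i * stripe_size for i in range(n_streams)]


# Four parallel streams with a stripe size of 4096 bytes.
print(striped_start_offsets(4))       # [0, 0, 0, 0]
print(psread_start_offsets(4, 4096))  # [0, 4096, 8192, 12288]
```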




= Notes on the code (changes) =
 * hashing is used for management of DB contents (not a change)
 * choice of OSDs resides in ''osddbuser.c''

== Link tables ==
 * original link table entries consisted of 5 columns of 3 bits each
  * thus, 5 different versions of a file were supported, each with a max. link count of 7
 * AFS+OSD (like MR-AFS) supports more places for one and the same file to live: {{{ 
 *      2 RW volumes (the 2nd during a move operation)
 *      1 BK volume
 *     13 RO volumes
 *      1 clone during move}}} which amounts to up to 17 places for a file
 * also, there might be as many as 6 versions of a file: {{{
 *      1 RW volume
 *      1 BK volume 
 *      1 clone during move
 *      1 RO
 *      1 RO-old during vos release
 *      1 may be an old RO which was not reachable during the last vos release.}}}
 * as at least 6 columns are needed with at least 5 bits per column, a link table row now consists of 32 bits instead of 16
The explanations are from ''vol/namei_ops.c''. The new format is used as ''Linktable version 2'', with the original format still being supported as ''version 1''.
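
A small sketch of the bit arithmetic behind the two row formats. It assumes that column ''i'' occupies the bits starting at position i*width, which is a guess for illustration; the authoritative layout is in ''vol/namei_ops.c''. The helper names are hypothetical.

```python
def get_count(row, column, bits):
    """Read the link count stored in one column of a link table row."""
    return (row >> (column * bits)) & ((1 << bits) - 1)


def set_count(row, column, bits, count):
    """Return the row with one column replaced by a new link count."""
    mask = (1 << bits) - 1
    if not 0 <= count <= mask:
        raise ValueError(f"link count {count} does not fit into {bits} bits")
    shift = column * bits
    return (row & ~(mask << shift)) | (count << shift)


# Version 1: 5 columns of 3 bits (15 bits used of a 16-bit row), max count 7.
v1_row = set_count(0, 4, 3, 7)
# Version 2: 6 columns of 5 bits (30 bits used of a 32-bit row), max count 31,
# enough for the 17 places listed above.
v2_row = set_count(0, 5, 5, 17)
print(get_count(v1_row, 4, 3), get_count(v2_row, 5, 5))  # 7 17
```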

== Debugging techniques ==
 * add trace* calls to the code
  * CM_TRACE_WASHERE especially handy
 * use {{{fstrace
}}} to enable debugging on the client/server (?)





= Open issues =

 * multihomed servers are a problem
  * this actually requires changes to the DB
  * possible coding technique: change the existing RPCs and keep the originals available under "Old..." names that the clients already know about

 * the link count is a critical datum
  * it is controlled by UDP-based RPCs <!>
  * this can cause data loss
   * are correcting algorithms necessary?

 * stripe sizes appear to be chosen too small for realistic use
  * current possible values would require immense read-ahead by the client

 * file size is not always known at creation time
  * this is a general problem for files that are larger than the client cache, as those ''will'' be fsync'd to the fileserver prematurely

 * is there any kind of salting to the round robin algorithm used for choosing the storing OSDs?
  * i.e., will a job that writes ''n'' files of fixed but different sizes periodically to ''n'' OSDs always put the ''i''th file onto the ''i''th OSD?
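
To make the concern concrete, here is a sketch of a plain, unsalted round robin. It deliberately ignores priorities and is not the algorithm from ''osddbuser.c''; the class and function names are hypothetical.

```python
class RoundRobin:
    """Plain round robin over a fixed list of OSDs, without salting or weights."""

    def __init__(self, osds):
        self.osds = list(osds)
        self.counter = 0

    def pick(self):
        osd = self.osds[self.counter % len(self.osds)]
        self.counter += 1
        return osd


def run_job(chooser, files):
    """One period of the job: write each file once, in a fixed order."""
    return {name: chooser.pick() for name in files}


chooser = RoundRobin(["osd1", "osd2", "osd3"])
files = ["small", "medium", "large"]
first = run_job(chooser, files)
second = run_job(chooser, files)
print(first == second)  # True: 'large' lands on 'osd3' in every period
```

If the file sizes differ, such a fixed mapping fills the OSDs unevenly, which is what some form of salting would avoid.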

 * how can one change the layout of a file segment (e.g., add a mirror)?
  * this would require a new subcommand and a new RPC for the RXOSD server

 * can the changes be ported to openafs-1.5.x?
  * this has been done for an earlier version of the 1.5-branch and should be possible in general

 * many osd subcommands require the user to enter IP+LUN, where the OSD ID would be better
  * this has to be changed, and care must be taken as these commands are in production at Garching

= Unsorted =
 * something about Copy``On``Write for volume replicas