[[TableOfContents]]

= General concepts =

== Goodies ==
 * the `threads` subcommand shows what each thread is currently doing:
{{{
osd threads io
}}}

== Working with an AFS+OSD cell ==
The following techniques are only useful when OSD acts as a frontend for HSM:
 * {{{vos listobj}}} is useful for finding all objects stored on a given OSD
 * {{{fs prefetch}}} allows the user to schedule tape restore operations
 * it is useful to have ''bosserver'' run a script that wipes objects off the OSDs according to OSD fill rate etc.

These are general techniques:
 * {{{osd addserver}}} makes an AFS fileserver (!) known to the OSDDB
  * this is required in order to make use of ''ownerships'' and ''locations''

This is how a fileserver and an OSD server can share one machine:
 * touch the file ''/vicepx/OnlyRXOSD'' to forbid the fileserver from using this partition

=== How to migrate data from an OSD ===
 1. set a low write priority to stop fileservers from storing new data on the OSD in question:
{{{
osd setosd -wrprior 0
}}}
 2. use {{{vos listobj}}} to identify the files (by fid) that have data on the OSD
 3. use {{{fs replaceosd}}} to move each file's data to another OSD

== Priorities and choice of storing OSD ==
 * OSDs are used in a round-robin fashion
 * priorities add weight to each OSD
 * static priorities can be set by the administrator:
{{{
osd setosd -wrprior ... -rdprior ...
}}}
 * priorities are dynamic:
  * ownerships apply a modifier of +/- 10 to the static priority
  * locations apply a modifier of +/- 20
  * the fill percentage plays a role once it exceeds 50% or 95%, respectively
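The priority scheme above can be sketched as follows. This is illustrative Python only, ''not'' the actual implementation (which lives in ''osddbuser.c''); the documented facts are the +/- 10 ownership modifier, the +/- 20 location modifier, and the 50%/95% fill thresholds — the concrete penalty values and all names here are assumptions.

```python
# Hypothetical sketch of the OSD write-priority logic -- not the real code
# from osddbuser.c.  The +/-10 / +/-20 modifiers and the 50%/95% fill
# thresholds come from the wiki text; the penalty magnitudes for a full
# OSD are invented for illustration.
from dataclasses import dataclass

@dataclass
class Osd:
    id: int
    wr_prior: int          # static write priority ('osd setosd -wrprior')
    owned: bool = False    # requesting fileserver "owns" this OSD
    local: bool = False    # OSD shares a "location" with the fileserver
    fill_pct: float = 0.0  # how full the OSD is, in percent

def effective_priority(osd: Osd) -> int:
    """Static priority plus the dynamic modifiers described above."""
    prio = osd.wr_prior
    prio += 10 if osd.owned else -10     # ownership: +/- 10
    prio += 20 if osd.local else -20     # location:  +/- 20
    if osd.fill_pct > 95:                # nearly full: avoid if possible
        prio -= 100                      # (assumed magnitude)
    elif osd.fill_pct > 50:              # more than half full: mild penalty
        prio -= 10                       # (assumed magnitude)
    return prio

def choose_osd(osds: list[Osd], rr_counter: int) -> Osd:
    """Round robin among the OSDs sharing the highest effective priority."""
    best = max(effective_priority(o) for o in osds)
    candidates = [o for o in osds if effective_priority(o) == best]
    return candidates[rr_counter % len(candidates)]
```

With two equally weighted OSDs, successive calls alternate between them; a nearly full OSD drops out of the candidate set.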
== Data held in volumes, DBs etc. ==
 * all metadata belonging to the files in a volume is stored in a designated volume special file
 * metadata references OSDs by ''id''
  * an OSD ''id'' is an index into the OSDDB
  * these IDs are permanent and can never be reused after deletion of an OSDDB entry
  * view the OSDDB using:
{{{
osd l
}}}
 * a file's metadata is found via the ''metadataindex'' stored in the file's vnode
  * view the metadata using:
{{{
fs osd
}}}

== How to upgrade a cell to AFS+OSD ==
 1. set up the OSDDB on the database servers
 2. set up pristine AFS+OSD fileservers + OSDs
 3. move volumes to the AFS+OSD fileservers
  * the volserver is supposed to be armed with a `-convertvolumes` switch for that purpose
  * otherwise, set the osd flag by hand:
{{{
vos setfields -osd 1
}}}

= Policies =
 * policies are going to decide 4 basic things:
  1. whether object storage is to be used for a given file
  2. whether the objects are to be mirrored, and how often
  3. how many stripes the file will comprise
  4. the size of the stripes
 * two pieces of information will be used to make these decisions:
  1. file size
  2. file name (i.e. prefix and suffix)

== Open questions ==
 * will inheritance be supported for policy definitions?
 * how many levels of inheritance are permitted (1 or ''n'')?
 * in what way can policies be represented/stored?

= Technical aspects =

== Performance ==
 * client connections to OSDs are cheap because they are unauthenticated
 * clients connect to every single OSD they have business with
 * `osd psread` acts as though it were reading stripes from multiple OSDs
  * i.e. it opens several connections; in this case, however, all targets are the same
  * each stripe has an altered offset (normally the first ''n'' stripes each start at offset 0 on their OSD; here stripe ''i'' starts at offset 0 + i*stripesize)
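The offset arithmetic just described can be sketched as follows — an illustrative Python fragment, not the client's actual code; the function name and signature are assumptions.

```python
# Sketch of the per-connection offsets used by 'osd psread' (illustrative
# only).  With real striping, stripe i is a separate object on OSD i and
# each read starts at object offset 0.  psread reads one unstriped object
# over several parallel connections to the same OSD, so connection i must
# instead start at offset i * stripesize within that single object.
def stripe_offsets(nconns: int, stripesize: int, striped: bool) -> list[int]:
    if striped:
        # one object per OSD, every read begins at the start of its object
        return [0] * nconns
    # psread: all connections target the same object, at staggered offsets
    return [i * stripesize for i in range(nconns)]
```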
 * this is not impossible for access to AFS fileservers, but making such connections is more costly

= Notes on the code (changes) =
 * the original link table was insufficient; further fields were required
  * link table v2 uses 30 bits per entry instead of 15
 * hashing is used for management of DB contents
 * the choice of OSDs resides in ''osddbuser.c''

== Debugging techniques ==
 * add trace* calls to the code
  * CM_TRACE_WASHERE is especially handy
 * use {{{fstrace}}} to enable debugging on the client/server (?)

= Open issues =
 * multihomed servers are a problem
  * this actually requires changes to the DB
  * possible coding technique: change the existing RPCs and provide each original RPC as "Old..." so that existing clients can still be served
 * the link count is a critical datum
  * it is controlled by UDP-based RPCs
  * this can cause data loss
  * are correcting algorithms necessary?
 * stripe sizes appear to be chosen too small for realistic use
  * the currently possible values would require immense read-ahead by the client
 * the file size is not always known at creation time
  * this is a general problem for files that are larger than the client cache, as those ''will'' be fsync'd to the fileserver prematurely
 * is there any kind of salting of the round-robin algorithm used for choosing the storing OSDs?
  * i.e., will a job that periodically writes ''n'' files of fixed but different sizes to ''n'' OSDs always put the ''i''th file onto the ''i''th OSD?
 * how can one change the layout of a file segment (e.g., add a mirror)?
  * this would require a new subcommand and a new RPC for the RXOSD server
 * can the changes be ported to openafs-1.5.x?
  * this has been done for an earlier version of the 1.5 branch and should be possible in general
 * many osd subcommands require the user to enter IP+LUN where an OSD ID would be better
  * this has to be changed, and care must be taken, as these commands are in production at Garching

= Unsorted =
 * something about Copy``On``Write for volume replicas
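Returning to the ''Policies'' section above: the four decisions a policy makes (use of object storage, mirroring, stripe count, stripe size) based on file size and file name could be sketched as below. This is purely hypothetical — no such implementation exists yet; every name, threshold, and rule here is an invented example of the planned mechanism.

```python
# Hypothetical sketch of policy evaluation as outlined in the Policies
# section: file size and file name (prefix/suffix) drive four decisions.
# All thresholds and rules are made-up examples, not a real policy format.
from dataclasses import dataclass

@dataclass
class PolicyDecision:
    use_osd: bool    # 1. use object storage at all?
    copies: int      # 2. how many mirrored copies
    nstripes: int    # 3. how many stripes
    stripesize: int  # 4. bytes per stripe (0 if not striped)

def evaluate_policy(filename: str, size: int) -> PolicyDecision:
    # small files stay on the classic fileserver (example threshold: 1 MB)
    if size < 1024 * 1024:
        return PolicyDecision(False, 1, 1, 0)
    # example suffix rule: mirror precious raw data twice, stripe widely
    if filename.endswith(".raw"):
        return PolicyDecision(True, 2, 4, 1024 * 1024)
    # default for large files: unmirrored, moderately striped
    return PolicyDecision(True, 1, 2, 1024 * 1024)
```

Whether real policies will look anything like this depends on the open questions above (inheritance, representation, storage).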