Differences between revisions 14 and 15
Revision 14 as of 2008-10-30 14:44:41
Size: 15461
Editor: FelixFrank
Comment: syntax repair
Revision 15 as of 2009-06-04 10:49:45
Size: 15353
Editor: FelixFrank
Comment: some corrections / additions
1. General concepts

1.1. Goodies

  • the threads subcommand lets you know what each thread is currently doing

    osd threads io

1.2. Working with an AFS+OSD cell

The following techniques are only useful when using OSD as a frontend for HSM:

  • vos listobj
    is useful for finding all objects stored on a given OSD
  • fs prefetch
    allows the user to schedule tape restore operations
    • the fetch queue for a given osd can be examined:

      osd fetchqueue io_a
      osd f io_a
  • using osd setvariable, one can specify the max. allowed number of parallel fetches from an OSD server:

    # osd setvar io maxParallelFetches 4
    • this is the only variable (!) currently available in OSD servers

    • this is not to be confused with osd setosd, which sets fields in the OSDDB

  • it is useful to have bosserver run a script to wipe objects off the OSDs according to OSD fill rate etc.

  • the atime can be used to reach wiping decisions
    • this requires a filesystem with atime support
  • the OSDDB distinguishes wipeable from non-wipeable OSDs
  • on wipeable OSDs, set the high water mark using osd setosd:

    # osd seto 12 -highwatermark 900
    where 12 is the OSD's ID from the OSDDB.
    • examine this setting using osd osd:

      # osd osd 12
      Osd 'io_e' with id=12:
              type            = 0
              minSize         = 1024 KB
              maxSize         = 67108864 KB
              totalSize       = 284819 MB
              pmUsed          = 1 per mille used
              totalFiles      = 0 M Files
              pmFilesUsed     = 0 per mille used
              ip              = 141.34.22.100
              server          = 0
              lun             = 4
              alprior         = 50
              rdprior         = 0
              migage          = 0 seconds
              flags           = 0
              unavail         = 0
              owner           = 0 = ''
              location        = 0 = ''
              timeStamp       = 1211807708 =  May 26 15:15
              highWaterMark   = 900 per mille used
              lowWaterMark    = 0 per mille used (obsolete)
              chosen          = 0 (should be zero) 
  • md5 checksums can be retrieved for archival copies of objects

    osd md5 io 536870930.2.1413.0 4
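
The atime-based wiping decision mentioned above can be sketched as follows. This is only an illustration of the idea, not the script actually run by bosserver: the StoredObject record, the function name and the sample numbers are hypothetical, and the sketch assumes that every listed object already has an archival copy.

```python
from dataclasses import dataclass


@dataclass
class StoredObject:
    name: str      # hypothetical object identifier
    size_kb: int   # space the object occupies on the OSD
    atime: int     # last access time, seconds since the epoch


def pick_wipe_candidates(objects, total_kb, high_water_pm):
    """Pick least recently accessed objects until usage is at or below the mark."""
    used_kb = sum(o.size_kb for o in objects)
    limit_kb = total_kb * high_water_pm // 1000   # per mille, like highWaterMark
    victims = []
    for obj in sorted(objects, key=lambda o: o.atime):
        if used_kb <= limit_kb:
            break
        victims.append(obj)
        used_kb -= obj.size_kb
    return victims


objs = [
    StoredObject("a", 400, 300),
    StoredObject("b", 300, 100),
    StoredObject("c", 250, 200),
]
victims = pick_wipe_candidates(objs, total_kb=1000, high_water_pm=900)
print([v.name for v in victims])   # ['b']
```

Sorting by atime is why the underlying filesystem needs atime support: without it, every object would look equally old.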

This is how a fileserver and OSD server can share one machine:

  • touch the file /vicepx/OnlyRXOSD to keep the fileserver from using this partition

The new fs ls subcommand can tell files apart:

  • [iokaste] /afs/desytest/osdtest_2 # fs ls
    m rwx    root        2048 2008-05-08 13:21:01 .
    m rwx    root        2048 2008-05-08 16:31:48 ..
    f rw-     bin    44537893 2008-05-08 11:36:40 ascii-file
    f rw-     bin     1048576 2008-05-08 13:21:14 ascii-file.osd
    d rwx     bin        2048 2008-05-08 09:06:52 dir
    f rw-     bin           0 2008-05-08 13:08:46 empty-file
    o rw-     bin    44537893 2008-05-08 13:19:58 new-after-move 

    where m is a mountpoint, f a regular AFS file, d a directory and o a file that uses object storage. fs ls will also identify files with their objects wiped from on-line object storage (i.e., with archival copies only).

1.2.1. How to migrate data from an OSD

  1. set a low write priority to stop fileservers from storing data on the OSD in question

    osd setosd -wrprior 0
  2. use

    vos listobj
    to identify the files (by fid) that have data on the OSD
  3. use

    fs fidreplaceosd
    to move each file's data to another OSD

1.3. Backup

  • dumping of volumes is possible:

    vos dump -osd
    will include the data from object storage in your dump
  • whether incremental dumps work like this will have to be tested (and implemented) /!\

1.4. Priorities and choice of storing OSD

  • OSDs are used in a Round Robin fashion
  • priorities add weight to each OSD
  • static priorities can be set by the administrator

    osd setosd -wrprior ... -rdprior ...
  • priorities are dynamic
    • ownerships apply a modifier of +/- 10 to the static priority
    • locations apply a modifier of +/- 20
    • the fill percentage plays a role if above 50% or 95% respectively
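
A rough sketch of how the ownership and location modifiers could combine with the static priority. The notes only give the magnitudes (10 and 20), so the direction (bonus on a match, penalty on a mismatch) and the rule that unset fields are ignored are assumptions. The fill-percentage effect is left out because its exact form is not described here. The function name is hypothetical.

```python
def effective_priority(static_prio, osd_owner, osd_location, srv_owner, srv_location):
    """Apply the +/- 10 (owner) and +/- 20 (location) modifiers to a static priority."""
    prio = static_prio
    if osd_owner:        # assumption: OSDs without an owner are not adjusted
        prio += 10 if osd_owner == srv_owner else -10
    if osd_location:     # assumption: OSDs without a location are not adjusted
        prio += 20 if osd_location == srv_location else -20
    return prio


# Same owner, different location: 50 + 10 - 20
print(effective_priority(50, "dtc", "ifh", "dtc", "zn"))   # 40
```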

Customizing owner, location:

  • the osd addserver subcommand

    osd adds 141.34.22.101 iokaste dtc ifh

    makes an AFS fileserver (!) known to the OSDDB

    • this is required in order to make use of ownerships and locations

    • this data can be examined using

      [iokaste] /afs/desytest/osdtest_2 # osd servers
      Server 'iokaste' with id=141.34.22.101:
              owner           = 1685349120 = 'dtc'
              location        = 1768318976 = 'ifh'
    • /!\ The help on osd addserver is misleading:

      # osd help adds
      osd addserver: create server entry in osddb 
      Usage: osd addserver -id <ip address> -name <osd name> [-owner <group name (max 3 char)>] [-location <max 3 characters>] [-cell <cell name>] [-help] 
      as the name to be specified is not actually an "osd name" but an alias name for the file server you're adding.
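
The numeric owner and location values in the osd servers output appear to be the (at most three) ASCII characters packed into a 32-bit integer, most significant byte first. This is inferred from the sample output above, not from the source code. The helper names are hypothetical.

```python
def decode_tag(value):
    """Turn a 32-bit owner/location number into its short name."""
    return value.to_bytes(4, "big").rstrip(b"\0").decode("ascii")


def encode_tag(name):
    """Pack a name of at most 3 ASCII characters into a 32-bit number."""
    if len(name) > 3:
        raise ValueError("owner/location names are limited to 3 characters")
    return int.from_bytes(name.encode("ascii").ljust(4, b"\0"), "big")


print(decode_tag(1685349120))   # dtc
print(decode_tag(1768318976))   # ifh
print(encode_tag("dtc"))        # 1685349120
```

An unset field (value 0) decodes to the empty string, which matches the owner = 0 = '' lines shown by osd osd.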

1.5. Data held in volumes, DBs etc.

  • all metadata belonging to files in a volume are stored in a designated volume special file
  • metadata references OSDs by id

  • OSD id is an index into the OSDDB

    • these IDs are permanent and can never be reused after deletion of an OSDDB entry
    • view the OSDDB using

      osd l
  • a file's metadata is found using the metadataindex stored in the file's vnode

    • view the metadata using

      fs osd <file>

1.6. How to upgrade a cell to AFS+OSD

  1. set up OSDDB on the database servers
  2. set up pristine AFS+OSD fileservers + OSDs
  3. move volumes to the AFS+OSD fileservers
    • volserver is supposed to be armed with a -convertvolumes switch for that purpose

    • otherwise, set the osdflag by hand

      vos setfields <volume> -osd 1

2. Policies

  • policies are going to decide 4 basic things:
    1. whether object storage is to be used for a given file
    2. whether the objects are to be mirrored and how often
    3. how many stripes the file will comprise
    4. the size of the stripes
  • two pieces of information will be used to make decisions:
    1. file size
    2. file name (i.e. prefix and suffix)
  • policies will most likely be held in the OSDDB
    • could they reside right in the volumes? would that be maintainable? /!\

2.1. Open questions

  • will inheritance be supported for policy definitions?
    • how many levels of inheritance are permitted (1 or n)?

      Answer
      No, policies use aggregation instead.
  • could we use policies for other things as well?
    • i.e. information for client: should the cache be bypassed for reading? /!\

      Answer
      No, clients are not concerned with policies at all.
  • could a policy also give a hint about whether wipeable OSDs are acceptable?
    Answer
    Currently not. Such amendments would add an order of magnitude of complexity to the user interface.
  • must policies be stored in a database? /!\

    • instead, each directory could have a .policy/ subdirectory

      • policies could be encoded in file names or file contents inside of that
    • instead, policies could live directly inside the large vnodes
      • blobs would then be used not only for filenames anymore
    • Answer
      Has not been considered further.

Using the uniquifier field in the vnode:

  • this field is largely unnecessary as it is redundant (?)
  • the lower 24 bits are unused
    • however, if upper 8 bits are != 0, the file can not be stored in OSD. <!> Why?

2.2. Possible representations for a policy

So far we have thought of 3 possible notations for policies, each with implications for the overall expressiveness.

2.2.1. Disjunctive Normal Form

A policy consists of an arbitrary number of predicates that can be thought of as logically ORed. Evaluation stops as soon as one predicate evaluates to true. Each predicate consists of a number of atomic predicates which are logically ANDed:

( suffix(".root") ) or ( size > 1M and size < 20M ) or ( size > 20M ) 

Of course, each case needs to return a definite "answer" to all aspects covered, e.g.

( suffix(".root") )          => OSD, 1 stripe, 1 site
( size > 1M and size < 20M ) => OSD, 1 stripe, 2 sites
( size > 20M )               => OSD, 2 stripes, 1 site
else                         => No Object Storage

the last case being the default. (!) This would need to be set cell-wide.

Discussion
This data model allows for rather efficient evaluation and might easily be represented to an administrator.
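
First-match evaluation of such a policy can be sketched in a few lines. This is an illustration of the data model only; the dictionary keys, the file_info layout and the function name are made up for the example and do not reflect an existing implementation.

```python
M = 1024 * 1024

# Each entry: (atomic predicates that are ANDed, answer if all of them hold)
POLICY = [
    ([lambda f: f["name"].endswith(".root")],
     {"osd": True, "stripes": 1, "sites": 1}),
    ([lambda f: f["size"] > 1 * M, lambda f: f["size"] < 20 * M],
     {"osd": True, "stripes": 1, "sites": 2}),
    ([lambda f: f["size"] > 20 * M],
     {"osd": True, "stripes": 2, "sites": 1}),
]
DEFAULT = {"osd": False}   # the cell-wide "else" case


def evaluate_dnf(policy, default, file_info):
    """Return the answer of the first conjunction that matches, else the default."""
    for conjunction, answer in policy:
        if all(pred(file_info) for pred in conjunction):
            return answer
    return default


print(evaluate_dnf(POLICY, DEFAULT, {"name": "run.root", "size": 50 * M}))
print(evaluate_dnf(POLICY, DEFAULT, {"name": "data.bin", "size": 5 * M}))
print(evaluate_dnf(POLICY, DEFAULT, {"name": "notes.txt", "size": 100}))
```

Because the first match wins, a 50 MB file ending in .root gets 1 stripe here, not 2. Note also that a file of exactly 20M falls through to the default with the predicates as written.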

2.2.2. Variable list of rules

Like above, the policy consists of a list of predicates. Each can have arbitrary effects on how to store the file's data wrt. the aspects covered by policies (see above). Here too, the predicates must be evaluated in a fixed order. They are limited to logical AND, too. However, evaluation cannot terminate before reaching the end of the list. The default behaviour has to take effect before evaluation starts:

Default                     => No Object Storage
( suffix(".root") )         => OSD
( size > 1M )               => 2 sites
( size > 20M )              => 1 site, 2 stripes 

(this would have much the same effect as the example policy outlined above for DNF).

Discussion
In places, this might have benefits compared to DNF. In human readable form, it would become a sequence of "but if..."s.
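
The same kind of sketch for the rule list: the default is applied first, and every matching rule then overrides individual aspects, so evaluation always runs to the end of the list. As before, the names, the dictionary keys and the stripes/sites values in the default are assumptions for illustration.

```python
M = 1024 * 1024

DEFAULT = {"osd": False, "stripes": 1, "sites": 1}
RULES = [
    ([lambda f: f["name"].endswith(".root")], {"osd": True}),
    ([lambda f: f["size"] > 1 * M], {"sites": 2}),
    ([lambda f: f["size"] > 20 * M], {"sites": 1, "stripes": 2}),
]


def evaluate_rules(rules, default, file_info):
    """Start from the default, then let each matching rule override some aspects."""
    result = dict(default)
    for conjunction, changes in rules:
        if all(pred(file_info) for pred in conjunction):
            result.update(changes)
    return result


print(evaluate_rules(RULES, DEFAULT, {"name": "run.root", "size": 30 * M}))
print(evaluate_rules(RULES, DEFAULT, {"name": "data.bin", "size": 5 * M}))
```

With the rules exactly as listed, a 5 MB file that does not end in .root ends up with 2 sites but with object storage still switched off, so the list only approximates the DNF example unless the size rules also switch OSD on.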

2.2.3. Fixed list of predicates

A policy consists of a fixed number of rules, each made up of an arbitrary number of atomic predicates that can be linked using AND, OR and NOT and parenthesized. Each rule corresponds to one aspect of how the file is stored in OSDs:

OSD           = ( size > 1M or suffix(".root") )
Stripes 1     = true
Stripes 2     = size > 20M
Sites 1       = true
Sites 2       = size < 20M

This appears extremely complex though.

Discussion
Constructing expressions for arbitrary rules is difficult. There are invalid combinations of expressions (e.g., the sets of matches of "Stripes1" and "Stripes2" need to be disjoint, the matches to "OSD" must be a superset of all others etc.). Apart from printing the logical expressions in semi-mathematical form, it would be difficult to bring this into a human readable form.
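
To illustrate how easily invalid combinations arise, the following sketch tests a policy in this notation against a few sample files, applying the two constraints named above under their strictest reading (no overlap within "Stripes" or "Sites", and nothing may match unless "OSD" matches). The checker and the sample files are hypothetical.

```python
M = 1024 * 1024

POLICY = {
    "OSD": lambda f: f["size"] > 1 * M or f["name"].endswith(".root"),
    "Stripes 1": lambda f: True,
    "Stripes 2": lambda f: f["size"] > 20 * M,
    "Sites 1": lambda f: True,
    "Sites 2": lambda f: f["size"] < 20 * M,
}


def find_conflicts(policy, samples):
    """Report overlapping rules and matches that lack the OSD rule."""
    problems = []
    for f in samples:
        hits = [name for name, pred in policy.items() if pred(f)]
        for group in ("Stripes", "Sites"):
            matched = [h for h in hits if h.startswith(group)]
            if len(matched) > 1:
                problems.append((f["name"], "overlap: " + ", ".join(matched)))
        if hits and "OSD" not in hits:
            problems.append((f["name"], "matches without OSD"))
    return problems


samples = [{"name": "small.txt", "size": 100}, {"name": "big.dat", "size": 30 * M}]
for problem in find_conflicts(POLICY, samples):
    print(problem)
```

Under this strict reading, even the short example policy is flagged for both sample files. A real implementation would therefore need either such a validation step or a tie-breaking convention, for example that the highest matching number wins.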

2.3. Complex policies

  • policies could be given a full-fledged inheritance hierarchy
  • policies could be composed of an arbitrary (or limited) number of "basic policies"
    • this can be seen as single-level multiple inheritance

3. Technical aspects

3.1. Performance

  • client connections to OSDs are cheap because they are unauthenticated
  • clients connect to every OSD they have business with
  • osd psread acts as though reading stripes from multiple OSDs

    • i.e. it opens several connections, in this case all targets are the same however
    • each stripe has an altered offset (e.g., normally first n stripes start at offset 0 on each OSD, here it is 0+(i*stripesize) etc.)
    • this is not impossible for access to AFS fileservers, but making connections is more costly
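
The offset arithmetic behind the pseudo-striped read can be sketched like this. It assumes that chunks are handed to the parallel connections in round-robin order, which the notes only hint at. The function name is hypothetical.

```python
def psread_chunks(file_size, n_streams, stripe_size):
    """Yield (stream, offset, length) for reading one object over n connections."""
    offset = 0
    chunk = 0
    while offset < file_size:
        length = min(stripe_size, file_size - offset)
        # Stream i starts at offset i * stripe_size instead of offset 0.
        yield chunk % n_streams, offset, length
        offset += length
        chunk += 1


print(list(psread_chunks(10, 2, 4)))   # [(0, 0, 4), (1, 4, 4), (0, 8, 2)]
```

With real striping each stream would read from offset 0 of its own OSD. Here all streams target the same object, so the offsets interleave.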

3.2. Backwards compatibility

  • the file server detects clients that are unable to recognize object storage
  • for such clients, the file server reads the OSD contents itself and forwards them to the client
    • this uses doubled bandwidth
    • low performance

4. Notes on the code (changes)

  • hashing is used for management of DB contents (not a change)
  • choice of OSDs resides in osddbuser.c

  • the DataXchange() function plays a major role

  • vnodes are modified to hold the metadataindex

    • this is without consequence to large vnodes - directories have no meta data

  • original link table entries consisted of 5 columns of 3 bits each
    • thus, 5 different versions of a file were supported, each with a max. link count of 7
  • AFS+OSD (like MR-AFS) supports more places for one and the same file to live:

     *      2 RW volumes (the 2nd during a move operation)
     *      1 BK volume
     *     13 RO volumes
     *      1 clone during move
    which amounts to up to 17 places for a file
  • also, there might be as many as 6 versions of a file:

     *      1 RW volume
     *      1 BK volume 
     *      1 clone during move
     *      1 RO
     *      1 RO-old during vos release
     *      1 may be an old RO which was not reachable during the last vos release.
  • as at least 6 columns are needed with at least 5 bits per column, a link table row now consists of 32 bits instead of 16
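
The arithmetic behind the two row formats: 5 columns of 3 bits need 15 bits (one 16-bit row, link counts up to 7), while 6 columns of 5 bits need 30 bits (one 32-bit row, link counts up to 31, enough for the 17 places listed above). A minimal sketch of reading and writing one column follows. Placing column 0 in the least significant bits is an assumption, and the helper names are hypothetical rather than taken from vol/namei_ops.c.

```python
def get_count(row, column, bits):
    """Extract the link count stored in one column of a link table row."""
    return (row >> (column * bits)) & ((1 << bits) - 1)


def set_count(row, column, bits, count):
    """Return the row with one column replaced by a new link count."""
    mask = (1 << bits) - 1
    if not 0 <= count <= mask:
        raise ValueError("link count does not fit into the column")
    shift = column * bits
    return (row & ~(mask << shift)) | (count << shift)


row = set_count(0, 5, 5, 17)   # version 2: column 5 of 6, 5 bits each
print(get_count(row, 5, 5))    # 17
print(row < 2 ** 32)           # True
```

In the version 1 layout (3 bits per column) the same count of 17 is rejected, which is the limitation the wider rows remove.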

The explanations are from vol/namei_ops.c. The new format is used as Linktable version 2, with the original format still being supported as version 1. /!\ Does the code need to support legacy link table format? Volumes are incompatible anyway.

4.2. Technical details on ubik databases

  • consist of a head and a hash table
    • entries need to be of fixed length to enable hashing
    • data is held in hierarchical structures
  • transactions are used for consistent modifications
  • blocksize currently fixed at 1024 (bytes?)
    • might possibly be changed (would this be wise?)

4.3. Debugging techniques

  • add trace* calls to the code
    • CM_TRACE_WASHERE especially handy
    • RX_AFS_GLOCK needs to be held apparently
  • use

    fstrace
    to enable debugging on the client/server (?)
  • the translate_et command can be handy in case of wild error reports:

    [iokaste] /afs/desytest/osdtest_2 # osd exa io 536870930.2.1413.0 
    RXOSD_examine failed with code 5
    Request aborted.
    [iokaste] /afs/desytest/osdtest_2 # translate_et 5                
    5 ().5 = Input/output error 
  • printfs in the client code will turn up in dmesg / messages

5. Open issues

  • multihomed servers are a problem
    • this actually requires changes to the DB
    • possible coding technique: change existing RPCs, provide the original RPC as "Old..." that the clients can know about
  • the link count is a critical datum
    • it is controlled by UDP-based RPCs <!>

    • this can cause data loss
      • are correcting algorithms necessary? /!\

  • is there any kind of salting to the round robin algorithm used for choosing the storing OSDs?
    • i.e., will a job that writes n files of fixed but different sizes periodically to n OSDs always put the ith file onto the ith OSD? this may be undesirable /!\

  • how can one change the layout of a file segment (e.g., add a mirror)?
    • this would require a new subcommand and a new RPC for the RXOSD server
      Answer
      No. Create a new file instead.
  • can the changes be ported to openafs-1.5.x?
    • this has been done for an earlier version of the 1.5-branch and should be possible in general /!\

  • many osd subcommands require the user to enter IP+LUN, where OSD ID would be better
    • this has to be changed, and care must be taken as these commands are in production at Garching /!\

  • is there a concept for actual permissions on OSD usage?
    • ownership only affects priority
    • in case of a full OSD, other OSDs will be used (?)

    • can OSD usage be forbidden? can usage of owned OSDs be enforced? /!\

  • could the vnode index be a pointer into the metadata table the same way it is a pointer into the vnode table?
    • could the additional metadataindex field in the vnodes thus be saved? /!\

  • is lazy replication conceivable? /!\
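
The concern about unsalted round robin raised above can be made concrete with a toy simulation. It assumes a single writer, equal priorities on all OSDs and a strictly cyclic choice, which is a simplification and not a statement about the actual algorithm in osddbuser.c.

```python
def round_robin_placement(n_osds, n_files, rounds):
    """Return, per round, the OSD index chosen for each file written in order."""
    next_osd = 0
    placement = []
    for _ in range(rounds):
        this_round = []
        for _ in range(n_files):
            this_round.append(next_osd)
            next_osd = (next_osd + 1) % n_osds
        placement.append(this_round)
    return placement


print(round_robin_placement(3, 3, 2))   # [[0, 1, 2], [0, 1, 2]]
```

With as many files per round as OSDs, the ith file always lands on the ith OSD, so the OSD that keeps receiving the largest file fills up first. Any offset between rounds (other writers, a random starting point) breaks the pattern.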

AfsOsd/Notes (last edited 2009-06-04 10:49:45 by FelixFrank)