Differences between revisions 1 and 4 (spanning 3 versions)

General concepts

Goodies

the threads subcommand lets you know what each thread is currently doing {{{osd threads io

}}}

Working with an AFS+OSD cell

The following techniques are only useful when using OSD as a frontend for HSM:

{{{vos listobj

}}} is useful for finding all objects stored on a given OSD

{{{fs prefetch

}}} allows the user to schedule tape restore operations

it is useful to have bosserver run a script to wipe objects off the OSDs according to OSD fill rate etc.
the atime can be used to reach wiping decisions
- this requires a filesystem with atime support
the OSDDB can tell wipeable and non-wipeable OSDs apart
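
A wiping script of the kind mentioned above could rank its candidates by atime once an OSD gets too full. The following is only a minimal sketch in Python, not a script in use at any site: the `StoredObject` record, the `fill_threshold` of 0.9 and the `target_fill` of 0.8 are assumptions made for illustration, and the actual wiping would have to go through the cell's own tools.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StoredObject:
    """Hypothetical record for one object held on an OSD partition."""
    name: str
    size: int      # bytes
    atime: float   # last access time, seconds since the epoch


def choose_wipe_candidates(objects: List[StoredObject],
                           capacity: int,
                           fill_threshold: float = 0.9,
                           target_fill: float = 0.8) -> List[StoredObject]:
    """Return the least recently accessed objects to wipe, oldest first.

    Nothing is selected unless the fill rate exceeds fill_threshold.
    Selection stops once the projected fill rate is at or below target_fill.
    """
    used = sum(o.size for o in objects)
    if used <= fill_threshold * capacity:
        return []
    selected = []
    for obj in sorted(objects, key=lambda o: o.atime):
        if used <= target_fill * capacity:
            break
        selected.append(obj)
        used -= obj.size
    return selected
```

Sorting by atime only makes sense if the underlying filesystem actually maintains it, which is why atime support is a requirement.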

Other HSM related features:

md5 checksums can be retrieved for archival copies of objects (unverified) {{{ osd md5 io 536870930.2.1413.0 4

}}}

These are general techniques:

{{{osd addserver

}}} makes an AFS fileserver known to the OSDDB

this is required in order to make use of ownerships and locations

This is how a fileserver and OSD server can share one machine:

touch the file /vicepx/OnlyRXOSD to keep the fileserver from using this partition

The new fs ls subcommand can tell files apart:

{{{ [iokaste] /afs/desytest/osdtest_2 # fs ls
m rwx    root        2048 2008-05-08 13:21:01 .
m rwx    root        2048 2008-05-08 16:31:48 ..
f rw-     bin    44537893 2008-05-08 11:36:40 ascii-file
f rw-     bin     1048576 2008-05-08 13:21:14 ascii-file.osd
d rwx     bin        2048 2008-05-08 09:06:52 dir
f rw-     bin           0 2008-05-08 13:08:46 empty-file
o rw-     bin    44537893 2008-05-08 13:19:58 new-after-move
}}}

where m is a mountpoint, f a file, d a directory and o an object. fs ls will also identify files with their objects wiped from on-line object storage (i.e., with archival copies only).

How to migrate data from an OSD

set a low write priority to stop fileservers from storing data on the OSD in question {{{osd setosd -wrprior 0

}}}

use {{{vos listobj

}}} to identify the files (by fid) that have data on the OSD

use {{{fs replaceosd

}}} to move each file's data to another OSD

Priorities and choice of storing OSD

OSDs are used in a Round Robin fashion
priorities add weight to each OSD
static priorities can be set by the administrator {{{osd setosd -wrprior ... -rdprior ...

}}}

priorities are dynamic
- ownerships apply a modifier of +/- 10 to the static priority
- locations apply a modifier of +/- 20
- the fill percentage plays a role if above 50% or 95% respectively
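
The priority arithmetic above can be sketched as follows. This is an illustration, not the logic found in osddbuser.c: the function names, the encoding of ownership and location as +1/0/-1, the clamping at 0 and the weighted random pick standing in for the weighted round robin are all assumptions. The fill-percentage adjustment is left out because its exact form is not given here.

```python
import random
from typing import Dict, Optional

OWNERSHIP_MODIFIER = 10   # from the notes: ownerships apply +/- 10
LOCATION_MODIFIER = 20    # from the notes: locations apply +/- 20


def effective_priority(static_priority: int,
                       ownership: int = 0,
                       location: int = 0) -> int:
    """Combine a static priority with ownership and location modifiers.

    ownership and location use an assumed encoding:
    +1 for a match, -1 for a mismatch, 0 when not applicable.
    The result is clamped at 0 (assumption).
    """
    if ownership not in (-1, 0, 1) or location not in (-1, 0, 1):
        raise ValueError("ownership and location must be -1, 0 or 1")
    priority = (static_priority
                + ownership * OWNERSHIP_MODIFIER
                + location * LOCATION_MODIFIER)
    return max(priority, 0)


def pick_osd(priorities: Dict[int, int],
             rng: Optional[random.Random] = None) -> int:
    """Pick an OSD id with probability proportional to its priority.

    OSDs with priority 0 are never chosen.
    """
    rng = rng or random.Random()
    candidates = {osd: p for osd, p in priorities.items() if p > 0}
    if not candidates:
        raise ValueError("no OSD with a positive priority")
    ids = list(candidates)
    weights = [candidates[i] for i in ids]
    return rng.choices(ids, weights=weights, k=1)[0]
```

In this model an OSD whose write priority has been set to 0 simply drops out of the selection, which matches the use of `-wrprior 0` when migrating data away from an OSD.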

Data held in volumes, DBs etc.

all metadata belonging to files in a volume are stored in a designated volume special file
metadata references OSDs by id
OSD id is an index into the OSDDB
- these IDs are permanent and can never be reused after deletion of an OSDDB entry
- view the OSDDB using {{{osd l

}}}

a file's metadata is found using the metadataindex stored in the file's vnode
- view the metadata using {{{fs osd <file>

}}}

How to upgrade a cell to AFS+OSD

set up OSDDB on the database servers
set up pristine AFS+OSD fileservers + OSDs
move volumes to the AFS+OSD fileservers
- volserver is supposed to be armed with a -convertvolumes switch for that purpose
- otherwise, set the osdflag by hand {{{vos setfields <volume> -osd 1

}}}

Policies

policies are going to decide 4 basic things:
1. whether object storage is to be used for a given file
2. whether the objects are to be mirrored and how often
3. how many stripes the file will comprise
4. the size of the stripes
two pieces of information will be used to make decisions:
1. file size
2. file name (i.e. prefix and suffix)
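
Since the policy design is still open, the following is only a sketch of what evaluating such rules could look like, assuming a simple first-match-wins rule list. Every name in it (`Decision`, `Rule`, `decide`) is hypothetical and does not describe an existing AFS+OSD interface.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Decision:
    """The four things a policy decides (hypothetical representation)."""
    use_osd: bool
    copies: int = 1        # number of mirrored copies
    stripes: int = 1       # number of stripes
    stripe_size: int = 0   # bytes, 0 when the file is not striped


@dataclass(frozen=True)
class Rule:
    """Matches on the two inputs named above: file size and file name."""
    decision: Decision
    min_size: int = 0
    prefix: str = ""
    suffix: str = ""

    def matches(self, name: str, size: int) -> bool:
        return (size >= self.min_size
                and name.startswith(self.prefix)
                and name.endswith(self.suffix))


def decide(rules: List[Rule], name: str, size: int) -> Decision:
    """Return the decision of the first matching rule.

    Files matching no rule stay on the fileserver (assumed default).
    """
    for rule in rules:
        if rule.matches(name, size):
            return rule.decision
    return Decision(use_osd=False)
```

A flat rule list like this sidesteps the inheritance question raised below; supporting inheritance would mean consulting more than one such list per file.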

Open questions

will inheritance be supported for policy definitions?
- how many levels of inheritance are permitted (1 or n)?
in what way can policies be represented/stored?

Technical aspects

Performance

client connections to OSDs are cheap because they are unauthorized
clients connect directly to every single OSD they need to access
osd psread acts as though reading stripes from multiple OSDs
- i.e. it opens several connections, although in this case they all go to the same target
- each stripe has an altered offset (e.g., normally the first n stripes each start at offset 0 on their own OSD; here stripe i starts at 0+(i*stripesize), etc.)
- the same would be possible for access to AFS fileservers, but making connections there is more costly
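
The offset scheme used by osd psread can be illustrated with a small Python sketch. It is not taken from the osd source: the function name `psread_plan` is made up, and it assumes that after its first chunk each stream advances by n stripes, which the notes above imply but do not state.

```python
from typing import List, Tuple


def psread_plan(length: int,
                n_streams: int,
                stripe_size: int) -> List[List[Tuple[int, int]]]:
    """Return, per stream, the (offset, size) chunks read from one object.

    Stream i starts at offset i * stripe_size. Chunks are handed out to the
    streams in turn, as if the object were striped over n_streams OSDs.
    """
    if n_streams < 1 or stripe_size < 1 or length < 0:
        raise ValueError("invalid arguments")
    plan: List[List[Tuple[int, int]]] = [[] for _ in range(n_streams)]
    offset = 0
    chunk = 0
    while offset < length:
        size = min(stripe_size, length - offset)
        plan[chunk % n_streams].append((offset, size))
        offset += size
        chunk += 1
    return plan
```

With real striping each stream would read from offset 0 of its own object; here all streams share one object, so only the starting offsets differ.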

Notes on the code (changes)

hashing is used for management of DB contents (not a change)
choice of OSDs resides in osddbuser.c

Link tables

original link table entries consisted of 5 columns of 3 bits each
- thus, 5 different versions of a file were supported, each with a max. link count of 7

AFS+OSD (like MR-AFS) supports more places for one and the same file to live:

 *      2 RW volumes (the 2nd during a move operation)
 *      1 BK volume
 *     13 RO volumes
 *      1 clone during move

which amounts to up to 17 places for a file

also, there might be as many as 6 versions of a file:

 *      1 RW volume
 *      1 BK volume 
 *      1 clone during move
 *      1 RO
 *      1 RO-old during vos release
 *      1 may be an old RO which was not reachable during the last vos release.

as at least 6 columns are needed with at least 5 bits per column, a link table row now consists of 32 bits instead of 16

The explanations are from vol/namei_ops.c. The new format is used as Linktable version 2, with the original format still being supported as version 1.
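
The arithmetic behind the two row sizes can be checked with a short Python sketch. It is not the code from vol/namei_ops.c: the helper names are made up, and placing column 0 in the lowest bits is an assumption about the layout.

```python
from typing import List, Sequence

# Column layout per link table version, as described above.
V1_COLUMNS, V1_BITS = 5, 3   # 15 bits used in a 16-bit row
V2_COLUMNS, V2_BITS = 6, 5   # 30 bits used in a 32-bit row


def pack_row(counts: Sequence[int], bits: int) -> int:
    """Pack per-column link counts into one row, column 0 in the lowest bits."""
    mask = (1 << bits) - 1
    row = 0
    for i, count in enumerate(counts):
        if not 0 <= count <= mask:
            raise ValueError(f"link count {count} does not fit in {bits} bits")
        row |= count << (i * bits)
    return row


def unpack_row(row: int, columns: int, bits: int) -> List[int]:
    """Split a row back into its per-column link counts."""
    mask = (1 << bits) - 1
    return [(row >> (i * bits)) & mask for i in range(columns)]
```

A 3-bit column tops out at a link count of 7, so the 17 possible places for a file need at least 5 bits (maximum 31), and 6 such columns occupy 30 of the 32 bits in a version 2 row.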

Debugging techniques

add trace* calls to the code
- CM_TRACE_WASHERE especially handy
use

-  ⇤ ← Revision 1 as of 2008-05-19 17:32:23 → 
  Size: 4722
  Editor: FelixFrank
  Comment: first buch of info from garching and aftermath
+   ← Revision 4 as of 2008-05-26 14:50:55 → ⇥
  Size: 7500
  Editor: FelixFrank
  Comment: notes on md5 sums and fs ls subcommand
 Line 4:
+== Goodies ==
 * the `threads` subcommand lets you know what each thread is currently doing {{{osd threads io
}}}
-Line 12:
+Line 16:
+ * the atime can be used to reach wiping decisions
  * this requires a filesystem with atime support
 * the OSDDB can tell non/wipeable OSDs apart

Other HSM related features:
 * md5 checksums can be retrieved for archival copies of objects (unverified) {{{ osd md5 io 536870930.2.1413.0 4
}}}
-Line 17:
+Line 28:
+This is how a fileserver and OSD server can share one machine:
 * touch the file ''/vicepx/OnlyRXOSD'' to forbid the fileserver to use this partition

The new `fs ls` subcommand can tell files apart: 
 {{{ [iokaste] /afs/desytest/osdtest_2 # fs ls
m rwx    root        2048 2008-05-08 13:21:01 .
m rwx    root        2048 2008-05-08 16:31:48 ..
f rw-     bin    44537893 2008-05-08 11:36:40 ascii-file
f rw-     bin     1048576 2008-05-08 13:21:14 ascii-file.osd
d rwx     bin        2048 2008-05-08 09:06:52 dir
f rw-     bin           0 2008-05-08 13:08:46 empty-file
o rw-     bin    44537893 2008-05-08 13:19:58 new-after-move }}} where '''m''' is a mountpoint, '''f''' a file, '''d''' a directory and '''o''' an object. `fs ls` will also identify files with their objects wiped from on-line object storage (i.e., with archival copies only).
-Line 34:
+Line 59:
+  * the fill percentage plays a role if above 50% or 95% respectively
-Line 46:
+Line 72:
+== How to upgrade a cell to AFS+OSD ==
 1. set up OSDDB on the database servers
 2. set up pristine AFS+OSD fileservers + OSDs
 3. move volumes to the AFS+OSD fileservers
  * volserver is supposed to be armed with a `-convertvolumes` switch for that purpose
  * otherwise, set the osdflag by hand {{{vos setfields <volume> -osd 1
}}}
-Line 83:
+Line 115:
- * original linktable was insufficient, further fields were required
  * link table v2 uses 30 bits instead of 15
+ * hashing is used for management of DB contents (not a change)
 * choice of OSDs resides in ''osddbuser.c''
-Line 86:
+Line 118:
- * hashing is used for management of DB contents
 * choice of OSDs resides in ''osddbuser.c''
+== Link tables ==
 * original link table entries consisted of 5 columns of 3 bits each
  * thus, 5 different versions of a file were supported, each with a max. link count of 7
 * AFS+OSD (like MR-AFS) supports more places for one and the same file to live: {{{ 
 *      2 RW volumes (the 2nd during a move operation)
 *      1 BK volume
 *     13 RO volumes
 *      1 clone during move}}} which amounts to up to 17 places for a file
 * also, there might be as many as 6 versions of a file: {{{
 *      1 RW volume
 *      1 BK volume 
 *      1 clone during move
 *      1 RO
 *      1 RO-old during vos release
 *      1 may be an old RO which was not reachable during the last vos release.}}}
 * as at least 6 columns are needed with at least 5 bits per column, a link table row now consists of 32 bits instead of 16
The explanations are from ''vol/namei_ops.c''. The new format is used as ''Linktable version 2'', with the original format still being supported as ''version 1''.
-Line 126:
+Line 172:
+ * many osd subcommands require the user to enter IP+LUN, where OSD ID would be better
  * this has to be changed, and care must be taken as these commands are in production at Garching
