#acl DvGroup:read,write,revert All:read

[[attachment:Lustre_1.6_man_v1.10.pdf]]

<<TableOfContents>>

== Overview ==

|| || John (OSS) || Paul (OSS) || George (OSS) || Ringo (MGS/MDS) ||
||c0d0||{{attachment:DL380-2.gif}}|| {{attachment:DL380-2.gif}}|| {{attachment:DL380-2.gif}}|| {{attachment:DL380-4.gif}}||
||c1d1|| {{attachment:MSA60.gif}} OST3 || {{attachment:MSA60.gif}} OST4|| {{attachment:MSA60.gif}} OST5|| ||
||c1d0|| {{attachment:MSA60.gif}} OST0 || {{attachment:MSA60.gif}} OST1|| {{attachment:MSA60.gif}} OST2|| ||

The internal 146 GB SAS disks are each attached to a P400 controller and run as RAID1 or RAID10. The arrays were created in the BIOS, which results in a stripe size of 16 kB. This is fine for the way these arrays are used.

Apart from hosting the OS, the internal arrays serve the following purposes:
 * on the MGS/MDS: mgs/mds file system (/dev/vg00/mgsmds)
 * on the OSSs: external journals for the OST file systems (1 GB each, /dev/vg00/j-ost''x'')
According to the Lustre manual, all of these should reside on RAID1, not RAID5/6.

The external MSA60 shelves are each attached to one channel of a P800 controller. They host the arrays for the OSTs. The 12 750 GB SATA disks in an MSA60 shelf form one RAID6 array (10 disks net, i.e. about 7.5 TB). Each such array is assigned to exactly one OST.

== Inodes ==
The 6 OSTs have a capacity of about 7.5 TB each. Ringo's RAID10 has about 230 GB of space left after subtracting the OS.

The system should therefore be designed for an average of 1 MB/inode. This yields about 45 million inodes in total. Lustre recommends 4 kB/inode on the MDS, i.e. the MGS/MDS file system needs 180 GB. This leaves enough room for snapshots of this file system for metadata backups and, if needed, volumes for the lfsck database (which can grow to about 10 GB) should an lfsck of the entire FS become necessary.

For these purposes we also reserve a partition of about 200 GB on each of the OST arrays.
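
The sizing above can be reproduced with a few lines of shell arithmetic. This is only a sketch using the rounded figures from the text (SI units, 7.5 TB per OST); the variable names are made up for illustration.

```shell
# Sizing sketch for the MDT, using the rounded figures from the text.
ost_count=6
ost_capacity_gb=7500      # ca. 7.5 TB per OST, in SI GB
mb_per_inode=1            # design target: 1 MB of data per inode
mdt_kb_per_inode=4        # Lustre recommendation for the MDS

# Total capacity in MB divided by MB per inode gives the inode count.
total_inodes=$(( ost_count * ost_capacity_gb * 1000 / mb_per_inode ))
# Space needed on the MDT in GB at 4 kB per inode.
mdt_size_gb=$(( total_inodes * mdt_kb_per_inode / 1000 / 1000 ))

echo "inodes: $total_inodes, MDT size: ${mdt_size_gb} GB"
```

With these inputs the script reports 45000000 inodes and an MDT size of 180 GB.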

== SELinux ==
SELinux is disabled on the servers. Lustre was also tested with SELinux enabled, and no direct problems were found. However, with SELinux active, an extended attribute ''security.selinux="system_u:object_r:unlabeled_t:s0\000"'' is stored on the MDS for every file. This could later conflict with the use of labels on the clients.

== Module parameters ==
In `/etc/modprobe.d/lustre` on all servers:
{{{
options lnet networks=tcp(bond0)
options ost oss_num_threads=256
options mds mds_num_threads=256
}}}
LNET is to use bond0 (not eth0 or eth1). The ..._num_threads options prevent too many threads from being created.

== Data arrays on the external RAID arrays ==
Management is done with `hpacucli`. Delete old arrays if necessary:
{{{
ctrl slot=4 array B delete
ctrl slot=4 array A delete
}}}
Create new arrays:
{{{
ctrl slot=4 create type=ld drives=1E:1:1-1E:1:12 raid=6 ss=128
ctrl slot=4 create type=ld drives=2E:1:1-2E:1:12 raid=6 ss=128
ctrl slot=4 modify ssd=15
}}}
The stripe size of 128 kB proved favourable in tests and matches the maximum request size of a SATA disk (?). The capacity of a "full stripe" of such an array is therefore 10*128 kB = 1280 kB, which fits well with the design inode density of one inode per MB: if files are striped at the Lustre level, these 1280 kB are surely the smallest sensible stripe size.
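
The full-stripe figure and the net capacity of one shelf follow directly from the array geometry. A small sketch with the numbers from this page (the variable names are illustrative only):

```shell
# Geometry of one MSA60 RAID6 array as described in the text.
disks_per_shelf=12
parity_disks=2            # RAID6 spends two disks' worth of space on parity
stripe_size_kb=128        # per-disk stripe size set with ss=128
disk_size_gb=750

data_disks=$(( disks_per_shelf - parity_disks ))
full_stripe_kb=$(( data_disks * stripe_size_kb ))
net_capacity_gb=$(( data_disks * disk_size_gb ))

echo "data disks: $data_disks"
echo "full stripe: ${full_stripe_kb} kB"
echo "net capacity: ${net_capacity_gb} GB"
```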

The surface scan delay of 15 seconds matches the value that arrays receive when they are created in the BIOS.

== Creating partitions on the OST arrays ==
Since the arrays are larger than 2 TB, only GPT disk labels created with `parted` work. We align the start of the partition to a stripe boundary (128 kB). Caution: parted uses SI units by default (128kB = 128*10^3^ bytes), whereas the controllers (hopefully) use binary units (128kB = 128 * 2^10^ bytes), so we must use kiB (not kB) in parted commands.
{{{
# parted /dev/cciss/c1d0
GNU Parted 1.8.1
Using /dev/cciss/c1d0
Welcome to GNU Parted! Type 'help' to view a list of commands.
(parted) mklabel
Warning: The existing disk label on /dev/cciss/c1d0 will be destroyed and all
data on this disk will be lost. Do you want to continue?
Yes/No? yes                                                               
New disk label type?  [gpt]?                                              
(parted) mkpart                                                           
Partition name?  []? ost1                                                 
File system type?  [ext2]?                                                
Start? 128kiB                                                             
End? -200GiB                                                              
(parted) p                                                                

Model: Compaq Smart Array (cpqarray)
Disk /dev/cciss/c1d0: 7501GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt

Number  Start  End     Size    File system  Name  Flags
 1      131kB  7286GB  7286GB  ext3         ost1      
 
(parted) mkpart                                                           
Partition name?  []? bup1                                                 
File system type?  [ext2]?                                                
Start? 7286GB                                                             
End? -0                                                                   
(parted) p                                                                

Model: Compaq Smart Array (cpqarray)
Disk /dev/cciss/c1d0: 7501GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt

Number  Start   End     Size    File system  Name  Flags
 1      131kB   7286GB  7286GB  ext3         ost1       
 2      7286GB  7501GB  215GB                bup1       

(parted)
}}}

Proceed accordingly for the other servers and partitions.
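
The difference between kB and kiB for the partition start can be checked with a short sketch. It assumes, as the text does, that the controller's 128 kB stripe is 128 * 1024 bytes; the helper `is_aligned` is hypothetical and not part of parted.

```shell
stripe_bytes=$(( 128 * 1024 ))   # controller stripe size, binary units (assumed)

# Succeeds if the byte offset $1 is a multiple of the stripe size.
is_aligned() {
  [ $(( $1 % stripe_bytes )) -eq 0 ]
}

start_kib=$(( 128 * 1024 ))      # "128kiB" as entered in parted
start_kb=$(( 128 * 1000 ))       # "128kB" in parted's SI units

for start in "$start_kib" "$start_kb"; do
  if is_aligned "$start"; then
    echo "$start bytes: aligned"
  else
    echo "$start bytes: not aligned"
  fi
done
```

The start of 131kB that parted prints in the transcript is the 131072-byte offset displayed in SI units.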

== Creating the MGS/MDT on ringo ==
Create the logical volume (10 GB more than required):
{{{
[ringo] /root # lvcreate -L 190G -n mgsmds vg00
}}}
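Options for `mkfs`:
 * 16 kB stripe size with 4 kB blocks => `-E stride=4`
 * 4-disk RAID10 => 2 data disks => `-E stripe-width=2` (note: mke2fs documents this option in file system blocks, typically stride * data disks, which would be 8 here; 2 is the value that was actually used)
Create the Lustre file system: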
{{{
[ringo] /root # mkfs.lustre --fsname=fs1 --mdt --mgs --mkfsoptions="-E stride=4 -E stripe-width=2" /dev/vg00/mgsmds

   Permanent disk data:
Target:     fs1-MDTffff
Index:      unassigned
Lustre FS:  fs1
Mount type: ldiskfs
Flags:      0x75
              (MDT MGS needs_index first_time update )
Persistent mount opts: errors=remount-ro,iopen_nopriv,user_xattr
Parameters: mdt.group_upcall=/usr/sbin/l_getgroups

checking for existing Lustre data: not found
device size = 194560MB
formatting backing filesystem ldiskfs on /dev/vg00/mgsmds
        target name  fs1-MDTffff
        4k blocks     0
        options       -E stride=4 -E stripe-width=2 -J size=400 -i 4096 -I 512 -q -O dir_index -F
mkfs_cmd = mkfs.ext2 -j -b 4096 -L fs1-MDTffff -E stride=4 -E stripe-width=2 -J size=400 -i 4096 -I 512 -q -O dir_index -F /dev/vg00/mgsmds
Writing CONFIGS/mountdata
[ringo] /root # tune2fs -i 0 -c 0 /dev/vg00/mgsmds
tune2fs 1.40.4.cfs1 (31-Dec-2007)
Setting maximal mount count to -1
Setting interval between checks to 0 seconds
[ringo] /root #
}}}

fstab entry:
{{{
/dev/vg00/mgsmds        /mds/fs1                lustre  noauto          0 0
}}}

Mount:
{{{
[ringo] /root # mount /mds/fs1/
[ringo] /root # df -H /mds/fs1
Filesystem             Size   Used  Avail Use% Mounted on
/dev/vg00/mgsmds       179G   483M   168G   1% /mds/fs1
[ringo] /root # df -i /mds/fs1
Filesystem            Inodes   IUsed   IFree IUse% Mounted on
/dev/vg00/mgsmds     49807360      22 49807338    1% /mds/fs1
}}}
The number of inodes should be sufficient: about 49.8 million are available, compared with the roughly 45 million planned for.
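
The inode count reported by `df -i` can be cross-checked against the mkfs parameters. The sketch below assumes that the "device size" printed by mkfs.lustre is in units of 2^20 bytes; with that assumption the result matches the `df -i` output exactly.

```shell
device_size_mb=194560     # "device size" from the mkfs.lustre output (assumed MiB)
bytes_per_inode=4096      # from the mkfs option "-i 4096"
required_inodes=45000000  # design target from the Inodes section

mdt_inodes=$(( device_size_mb * 1024 * 1024 / bytes_per_inode ))
headroom=$(( mdt_inodes - required_inodes ))

echo "MDT inodes: $mdt_inodes (headroom: $headroom)"
```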

== Creating OST0 on john ==

The journal should not reside on RAID5/6, so it is created on the internal RAID1:
{{{
[john] /root # lvcreate -L 1G -n j-ost0 vg00
[john] /root # mke2fs -b 4096 -O journal_dev /dev/vg00/j-ost0
mke2fs 1.40.4.cfs1 (31-Dec-2007)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
0 inodes, 262144 blocks
0 blocks (0.00%) reserved for the super user
First data block=0
0 block group
32768 blocks per group, 32768 fragments per group
0 inodes per group
Superblock backups stored on blocks: 

Zeroing journal device: done                            
[john] /root #
}}}

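Options for `mkfs`:
 * 128 kB stripe size with 4 kB blocks => `-E stride=32`
 * 12-disk RAID6 => 10 data disks => `-E stripe-width=10` (note: mke2fs documents this option in file system blocks, typically stride * data disks, which would be 320 here; 10 is the value that was actually used)
 * external journal: `-J device=/dev/vg00/j-ost0`
 * 1 MB/inode: `-i 1048576`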
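
The stride values used for the MDT and the OSTs are simply the RAID stripe size divided by the file system block size. A sketch of that arithmetic, with a hypothetical helper `stride_blocks` and the 4 kB block size shown in the mkfs output:

```shell
block_kb=4    # file system block size (mkfs.ext2 -b 4096)

# Prints the stride in blocks for a per-disk stripe size given in kB.
stride_blocks() {
  echo $(( $1 / block_kb ))
}

mdt_stride=$(stride_blocks 16)     # internal RAID10, 16 kB stripe size
ost_stride=$(stride_blocks 128)    # MSA60 RAID6, 128 kB stripe size

# Full stripe expressed in blocks: stride times number of data disks.
mdt_full_stripe_blocks=$(( mdt_stride * 2 ))
ost_full_stripe_blocks=$(( ost_stride * 10 ))

echo "MDT: stride=$mdt_stride, full stripe=$mdt_full_stripe_blocks blocks"
echo "OST: stride=$ost_stride, full stripe=$ost_full_stripe_blocks blocks"
```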

Create the OST file system:
{{{
[john] /root # mkfs.lustre --fsname=fs1 --ost --mgsnode=ringo@tcp0 --mkfsoptions="-E stride=32 -E stripe-width=10 -J device=/dev/vg00/j-ost0 -i 1048576" /dev/cciss/c1d0p1 

   Permanent disk data:
Target:     fs1-OSTffff
Index:      unassigned
Lustre FS:  fs1
Mount type: ldiskfs
Flags:      0x72
              (OST needs_index first_time update )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=141.34.22.24@tcp

checking for existing Lustre data: not found
device size = 6948927MB
formatting backing filesystem ldiskfs on /dev/cciss/c1d0p1
        target name  fs1-OSTffff
        4k blocks     0
        options       -E stride=32 -E stripe-width=10 -J device=/dev/vg00/j-ost0 -i 1048576 -I 256 -q -O dir_index -F
mkfs_cmd = mkfs.ext2 -j -b 4096 -L fs1-OSTffff -E stride=32 -E stripe-width=10 -J device=/dev/vg00/j-ost0 -i 1048576 -I 256 -q -O dir_index -F /dev/cciss/c1d0p1
Writing CONFIGS/mountdata
[john] /root # tune2fs -i 0 -c 0 /dev/cciss/c1d0p1
tune2fs 1.40.4.cfs1 (31-Dec-2007)
Setting maximal mount count to -1
Setting interval between checks to 0 seconds
}}}
Repeat accordingly for c1d1p1 (with external journal j-ost3) and on the other servers paul (j-ost1/4) and george (j-ost2/5).

As with the MDT, the automatically assigned label is initially ''fs1-OSTffff'' everywhere. It is adjusted on the first mount. To obtain the intended order of the OSTs, they should be mounted for the first time in exactly that order:
{{{
[john] /root # mount -t lustre -odata=writeback /dev/cciss/c1d0p1 /ost/0
[john] /root # e2label /dev/cciss/c1d0p1
fs1-OST0000
[paul] /root # mount -t lustre -odata=writeback /dev/cciss/c1d0p1 /ost/1
[paul] /root # e2label /dev/cciss/c1d0p1
fs1-OST0001
[george] /root # mount -t lustre -odata=writeback /dev/cciss/c1d0p1 /ost/2
[george] /root # e2label /dev/cciss/c1d0p1
fs1-OST0002
[john] /root # mount -t lustre -odata=writeback /dev/cciss/c1d1p1 /ost/3
[john] /root # e2label /dev/cciss/c1d1p1
fs1-OST0003
[paul] /root # mount -t lustre -odata=writeback /dev/cciss/c1d1p1 /ost/4
[paul] /root # e2label /dev/cciss/c1d1p1
fs1-OST0004
[george] /root # mount -t lustre -odata=writeback /dev/cciss/c1d1p1 /ost/5
[george] /root # e2label /dev/cciss/c1d1p1
fs1-OST0005
}}}
After that, the corresponding fstab entries can be created.

 * John:
 {{{
LABEL=fs1-OST0000       /ost/0                  lustre  noauto,data=writeback 0 0
LABEL=fs1-OST0003       /ost/3                  lustre  noauto,data=writeback 0 0
}}}
 * Paul:
 {{{
LABEL=fs1-OST0001       /ost/1                  lustre  noauto,data=writeback 0 0
LABEL=fs1-OST0004       /ost/4                  lustre  noauto,data=writeback 0 0
}}}
 * George:
 {{{
LABEL=fs1-OST0002       /ost/2                  lustre  noauto,data=writeback 0 0
LABEL=fs1-OST0005       /ost/5                  lustre  noauto,data=writeback 0 0
}}}
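
The fstab entries follow a regular pattern, so they can also be generated. The sketch below assumes that the label contains the OST index as four hexadecimal digits (for indices 0 to 5 this is indistinguishable from decimal) and that OST ''i'' lives on the server at position ''i'' mod 3, as in the overview table. The functions `ost_host` and `fstab_line` are hypothetical helpers, and the fields are separated by tabs rather than spaces.

```shell
fsname=fs1

# Prints the server that hosts the OST with index $1.
ost_host() {
  case $(( $1 % 3 )) in
    0) echo john ;;
    1) echo paul ;;
    2) echo george ;;
  esac
}

# Prints the fstab line for the OST with index $1.
fstab_line() {
  printf 'LABEL=%s-OST%04x\t/ost/%d\tlustre\tnoauto,data=writeback 0 0\n' \
    "$fsname" "$1" "$1"
}

for idx in 0 1 2 3 4 5; do
  echo "# $(ost_host "$idx")"
  fstab_line "$idx"
done
```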

== Starting and stopping the file system ==
The MDS is a client of the OSTs. Therefore:
 * Start:
  1. mount all OSTs on John/Paul/George; the order does not matter
  2. mount the MDS
 * Stop:
  1. stop all clients (umount /lustre/fs1; lustre_rmmod)
  2. stop the MDS (umount /mds/fs1; lustre_rmmod)
  3. stop the OSTs (umount /ost/*; lustre_rmmod)
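
The ordering above can be written down as a dry-run plan. The sketch only prints the commands per host and does not execute anything; how they are run on the servers (for example via ssh) is left open. The function names are hypothetical, and the mount commands rely on the fstab entries shown earlier.

```shell
# Prints the server that hosts the OST with index $1.
host_of_ost() {
  case $(( $1 % 3 )) in
    0) echo john ;;
    1) echo paul ;;
    2) echo george ;;
  esac
}

# Start: all OSTs first (in any order), then the MDS.
start_plan() {
  for idx in 0 1 2 3 4 5; do
    echo "$(host_of_ost "$idx"): mount /ost/$idx"
  done
  echo "ringo: mount /mds/fs1"
}

# Stop: clients first, then the MDS, then the OSTs.
stop_plan() {
  echo "clients: umount /lustre/fs1; lustre_rmmod"
  echo "ringo: umount /mds/fs1; lustre_rmmod"
  for host in john paul george; do
    echo "$host: umount /ost/*; lustre_rmmod"
  done
}

start_plan
stop_plan
```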

== Automounter ==
auto.master: `/lustre /etc/auto.lustre --timeout=0`

1. Problem: autofs does not accept ''@''
 * works:
 {{{
*  -fstype=lustre ringo:/&
}}}
 * does not work:
 {{{
*  -fstype=lustre ringo@tcp:/&
}}}
2. Problem: SELinux. The following module is required:
 {{{
module gaga 1.0;

require {
	type unlabeled_t;
	type automount_t;
	type mount_t;
	class dir getattr;
	class filesystem { mount unmount getattr };
}

#============= automount_t ==============
allow automount_t unlabeled_t:dir getattr;
allow automount_t unlabeled_t:filesystem getattr;

#============= mount_t ==============
allow mount_t unlabeled_t:filesystem mount;
allow mount_t unlabeled_t:filesystem unmount;
}}}
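
A module like this is usually compiled and installed with the standard SELinux policy tools. The following is a minimal sketch, assuming the source above was saved as `gaga.te` in the current directory (the file name is an assumption). The function is only defined here; calling it requires root and the policy build tools.

```shell
# Assumption: the policy source above was saved as gaga.te.
build_and_load_module() {
  # Compile the source into a binary policy module.
  checkmodule -M -m -o gaga.mod gaga.te || return 1
  # Wrap the module into a policy package.
  semodule_package -o gaga.pp -m gaga.mod || return 1
  # Install the package into the running policy (requires root).
  semodule -i gaga.pp
}
```

Run `build_and_load_module` as root on a client to install the module.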

The better solution, however, is to treat lustre (and panfs) like nfs, afs etc.,
i.e. to assign nfs_t via genfs_context. This is implemented in selinux-policy-2.4.6-106.el5_1.3.dz.1 (Patch1000). The corresponding change was proposed to Red Hat and will hopefully be included in EL5.3 (end of 2008), see https://bugzilla.redhat.com/show_bug.cgi?id=437793 . The modified packages (note when building: --define "dist .el_5.1") are located in 51/{i386|x86_64}_extra/SL/selinux on the install server and are thereby rolled out automatically to all SL5.1 systems. This extension of the policy also solves the "mv /tmp/xyz /lustre/" problem.

Potential problem in the future: HA with multiple MDS uses the syntax "srv1@net1:srv2@net2:/fs", and autofs probably does not like that either :-(.

=== Mounting with an init script ===
On SL4, Lustre is mounted by the init script from the DL_lustreclient package. Beforehand, the mount points must be entered in /etc/fstab with noauto; there is currently no mechanism for this.