[Gluster-devel] [RFC] BD xlator - Supporting multi-brick and posix

Fri Apr 12 07:28:32 UTC 2013

bd: [RFC] posix/multi-brick support to BD xlator

The current BD xlator (block backend) has a few limitations:
* Creation of directories is not supported
* Only a single brick is supported
* It does not use extended attributes (and client gfid) like the posix xlator
* Creation of special files (symbolic links, device nodes, etc.) is not
supported

The basic limitation of not allowing directory creation prevents
oVirt/VDSM from consuming the BD xlator as part of a Gluster domain, since
VDSM creates multi-level directories when GlusterFS is used as the storage
backend for VM images.

To overcome these limitations, a new BD xlator with the following
improvements is suggested.

* New hybrid BD xlator that handles both regular files and block device files
* The volume will have both POSIX and BD bricks. Regular files are
  created on POSIX bricks, block devices are created on the BD brick (VG)
* BD xlator leverages the existing POSIX xlator for most POSIX calls and
  hence sits above the POSIX xlator
* Block device file is differentiated from regular file by an extended attribute
* The xattr 'trusted.glusterfs.bd' (BD_XATTR) plays a role in mapping a
  posix file to Logical Volume (LV).
* When a client sends a request to set BD_XATTR on a posix file, a new
  LV is created and mapped to the posix file. So every block device will
  have a representative file in the POSIX brick with
  'trusted.glusterfs.bd.path' (BD_XATTR_PATH) set.
* Hereafter, all operations on this file result in LV-related operations.

For example, opening a file that has BD_XATTR_PATH set opens the LV block
device, and reading from it reads the corresponding LV block device.
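
The dispatch described above can be pictured with a small sketch. This is
only an illustration, not the xlator code (which is C): the function name
backing_path, the dict-of-xattrs input and the brick path in the usage
lines are assumptions made for the example. The /dev/<vg>/<lv> form is the
usual LVM device node path.

```python
BD_XATTR_PATH = "trusted.glusterfs.bd.path"  # xattr name taken from the RFC


def backing_path(posix_path, xattrs):
    """Return the path that open/read on this file would be directed to.

    posix_path: path of the file in the POSIX brick.
    xattrs: dict of xattr name -> bytes for that file (hypothetical input;
            a real xlator would fetch these from the posix xlator).
    """
    mapping = xattrs.get(BD_XATTR_PATH)
    if mapping is None:
        # No BD mapping: a regular file, served by the POSIX brick.
        return posix_path
    # The xattr value has the form "vg_name/lv_name".
    vg_name, lv_name = mapping.decode().split("/", 1)
    return f"/dev/{vg_name}/{lv_name}"


# Usage (paths are made up for the example):
print(backing_path("/bricks/b1/posix", {}))
print(backing_path("/bricks/b1/image/lv1", {BD_XATTR_PATH: b"vg1/lv1"}))
```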

The new BD xlator code is placed in the xlators/storage/bd directory. It also
disables the existing bd_map xlator (i.e. you can't create a gluster volume
that uses the bd_map xlator). In the next version, however, support for
the bd_map xlator will be retained.

When the BD xlator gets a request to set BD_XATTR via a setxattr call, it
creates an LV [1] and places information about this LV in an xattr of
the posix file. The xattr "trusted.glusterfs.bd.path" maps the posix file
to "vg_name/lv_name".
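
As a rough model of this setxattr handling, here is an in-memory sketch.
Everything in it is hypothetical (the ToyBrick class, using the file's
basename as the LV name, and refusing a second mapping are assumptions for
illustration); the real xlator is C code and creates the LV through LVM.

```python
import posixpath

BD_XATTR = "trusted.glusterfs.bd"            # from the RFC
BD_XATTR_PATH = "trusted.glusterfs.bd.path"  # from the RFC


class ToyBrick:
    """In-memory stand-in for one brick: posix files plus one VG."""

    def __init__(self, vg_name):
        self.vg_name = vg_name
        self.xattrs = {}   # posix path -> {xattr name: value}
        self.lvs = set()   # names of LVs "created" in the VG

    def create(self, path):
        # Models creating a regular posix file (e.g. via touch).
        self.xattrs.setdefault(path, {})

    def setxattr(self, path, name, value):
        attrs = self.xattrs[path]
        if name != BD_XATTR:
            # Any other xattr is simply stored, as the posix xlator would.
            attrs[name] = value
            return
        if BD_XATTR_PATH in attrs:
            # Assumption: an already mapped file is not mapped twice.
            raise FileExistsError(f"{path} is already mapped to an LV")
        # Assumption: the basename is used as LV name to keep this short.
        lv_name = posixpath.basename(path)
        self.lvs.add(lv_name)  # stands in for the LV creation
        attrs[BD_XATTR_PATH] = f"{self.vg_name}/{lv_name}"


brick = ToyBrick("vg1")
brick.create("/image/lv1")
brick.setxattr("/image/lv1", BD_XATTR, "1")
print(brick.xattrs["/image/lv1"][BD_XATTR_PATH])
```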

Usage:
Server side:
[root@host1 ~]# gluster volume create bdvol device vg host1:/vg1 host2:/vg2
This creates a distributed gluster volume 'bdvol' with Volume Group vg1 on
host1 and Volume Group vg2 on host2. [2]

[root@host1 ~]# gluster volume start bdvol

Client side:
[root@node ~]# mount -t glusterfs host1:/bdvol /media
[root@node ~]# touch /media/posix
This creates a regular posix file 'posix' in either the host1:/vg1 or the
host2:/vg2 brick.

[root@node ~]# mkdir /media/image
[root@node ~]# touch /media/image/lv1
This also creates a regular posix file 'lv1' in either the host1:/vg1 or
the host2:/vg2 brick.

[root@node ~]# setfattr -n "trusted.glusterfs.bd" -v 1 /media/image/lv1
[root@node ~]#
The above setxattr creates a new LV in the corresponding brick's VG
and sets 'trusted.glusterfs.bd.path' to the value 'vg1/lv1' (assuming
it is created in host1:vg1).

[root@node ~]# truncate -s5G /media/image/lv1
This resizes LV 'lv1' to 5G.

Todos/Fixme:
[1] LV name generation: the LV name is generated from the full path of the
    posix file, but that may exceed the LV name length limit. So a unique
    LV name of limited length has to be generated from the full path.
[2] As of now, the posix brick directory is inferred from the VG name. Enhance
    the gluster volume creation CLI command to specify the VG name, similar
    to this:
        # gluster volume create bdvol host1:/<brick1>:vg1 host2:/<brick2>:vg2
    <brick1> is a standard posix brick and :vg1 specifies that VG 'vg1' has
    to be mapped to that brick path. The syntax will be
        # gluster volume create <volname> <NEW-BRICK[:VG]>

    The second ':' in the brick path is used to differentiate between a posix
    and a BD volume file and to specify the associated VG.
[3] The new BD xlator code does not work with the open-behind xlator. To
    work with open-behind, the BD xlator has to use an approach similar
    to the posix xlator, where a <brick-path>/.glusterfs/gfid file is opened
    when an fd is needed. But exposing the posix brick-path to the BD xlator
    may not be a good idea.
[4] The BD xlator does no BD-specific operation in the readv/writev fops, so
    they can be forwarded to posix readv/writev if the posix xlator could
    also handle opening the BD device. I.e. when an open request comes for a
    BD-mapped posix file, the posix_open routine has to open both the posix
    file and the BD (in this case the LV) and embed the posix_fd and bd_fd_t
    structures in the fd_t structure. Later readv/writev calls will then
    read from or write to the BD (and/or the posix file). This also solves
    the open-behind issue with the BD
    xlator. But it needs changes in posix xlator code and posix xlator
[5] When a new brick is added a file may be moved to a new brick and
    with BD volume it should also move the mapped LV to the new brick's
    VG.
[6] In a BD volume file, if the VG is served by a SAN, it is suggested to
    take the posix brick directory from the same SAN as well, so that data
    (LVs) and metadata (posix files) are stored in the same SAN.
[7] Some fops copy inode->gfid to loc->gfid before forwarding the
    request to posix xlator
[8] Add dm-thin support
[9] Add support for full and linked clones of LV images.
[10] Retain support for bd_map xlator
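
For todos [1] and [2], one possible direction is sketched below. It is only
an illustration under stated assumptions, not the implementation: the
function names, the 64-character default limit, the SHA-1 suffix and the
example brick paths are all hypothetical. The character set follows the one
LVM documents for VG/LV names (a-z A-Z 0-9 + _ . -); a hash suffix makes
collisions unlikely but does not strictly rule them out.

```python
import hashlib
import re


def lv_name_from_path(path, max_len=64):
    """Derive a bounded-length LV name from the full posix path (todo [1]).

    max_len is a hypothetical limit chosen for the example.
    """
    digest = hashlib.sha1(path.encode("utf-8")).hexdigest()[:16]
    if max_len < len(digest):
        raise ValueError("max_len too small for the hash suffix")
    # Keep a readable prefix from the basename; replace anything outside a
    # conservative subset of the LVM name characters ('-' included, so the
    # name can never start with a hyphen).
    base = path.strip("/").rsplit("/", 1)[-1]
    base = re.sub(r"[^A-Za-z0-9_.+]", "_", base)
    keep = max(0, max_len - len(digest) - 1)
    prefix = base[:keep]
    return f"{prefix}_{digest}" if prefix else digest


def parse_brick(spec):
    """Split 'host:/brick[:VG]' into (host, brick_path, vg or None) (todo [2])."""
    host, sep, rest = spec.partition(":")
    if not sep or not host or not rest.startswith("/"):
        raise ValueError(f"expected host:/path[:VG], got {spec!r}")
    brick_path, sep, vg = rest.partition(":")
    if sep and not vg:
        raise ValueError(f"empty VG name in {spec!r}")
    return host, brick_path, (vg or None)


# Usage (brick paths are made up for the example):
print(lv_name_from_path("/image/lv1"))
print(parse_brick("host1:/bricks/b1:vg1"))
print(parse_brick("host2:/bricks/b2"))
```

With this shape, a brick spec without a second ':' stays a plain posix brick,
and two files with the same basename in different directories map to
different LV names because the hash covers the full path.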