[Gluster-devel] [RFC] BD xlator - Supporting multi-brick and posix

Fri Apr 12 07:45:18 UTC 2013

"M.Mohan Kumar" <mohan at in.ibm.com> writes:

Patches posted to Gerrit http://review.gluster.com/4809

Also added interested people to the CC list.

> bd: [RFC] posix/multi-brick support to BD xlator
>
> Current BD xlator (block backend) has a few limitations such as
> * Creation of directories not supported
> * Supports only single brick
> * Does not use extended attributes (and client gfid) like posix xlator
> * Creation of special files (symbolic links, device nodes etc) not
> supported
>
> The basic limitation of not allowing directory creation blocks
> oVirt/VDSM from consuming the BD xlator as part of a Gluster domain, since VDSM
> creates multi-level directories when GlusterFS is used as the storage
> backend for storing VM images.
>
> To overcome these limitations, a new BD xlator with the following improvements
> is proposed.
>
> * New hybrid BD xlator that handles both regular files and block device files
> * The volume will have both POSIX and BD bricks. Regular files are
>   created on POSIX bricks, block devices are created on the BD brick (VG)
> * BD xlator leverages the existing POSIX xlator for most POSIX calls and
>   hence sits above the POSIX xlator
> * Block device file is differentiated from regular file by an extended attribute
> * The xattr 'trusted.glusterfs.bd' (BD_XATTR) plays a role in mapping a
>   posix file to Logical Volume (LV).
> * When a client sends a request to set BD_XATTR on a posix file, a new
>   LV is created and mapped to posix file. So every block device will
>   have a representative file in POSIX brick with 'trusted.glusterfs.bd.path'
>   (BD_XATTR_PATH) set.
> * Hereafter, all operations on this file result in LV-related operations.
>
> For example, opening a file that has BD_XATTR_PATH set opens the LV block
> device, and reading from it reads the corresponding LV block device.
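The open-time decision described above can be sketched roughly as follows. This is only an illustrative model, not the xlator code: `backing_path` is a hypothetical helper, a plain dict stands in for the file's extended attributes, and the `/dev/<vg>/<lv>` device path is an assumption based on the usual LVM layout.

```python
from typing import Mapping

# xattr name taken from the proposal (BD_XATTR_PATH)
BD_XATTR_PATH = "trusted.glusterfs.bd.path"


def backing_path(posix_path: str, xattrs: Mapping[str, bytes]) -> str:
    """Return the path an open() on this file would actually use.

    `xattrs` is a dict standing in for the extended attributes of the
    posix file; a real xlator would read them from the POSIX brick.
    """
    value = xattrs.get(BD_XATTR_PATH)
    if value is None:
        # No mapping present: an ordinary file handled by the POSIX xlator.
        return posix_path
    vg_name, lv_name = value.decode().split("/", 1)
    # Assumption: the LV is reachable under /dev/<vg>/<lv>.
    return f"/dev/{vg_name}/{lv_name}"
```

With this model, a file without the xattr resolves to itself, while a file whose xattr holds `vg1/lv1` resolves to `/dev/vg1/lv1`.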
>
> The new BD xlator code is placed in the xlators/storage/bd directory. It also
> disables the existing bd_map xlator (i.e. you can't create a gluster volume
> that uses the bd_map xlator). In the next version, support for the bd_map
> xlator will be retained.
>
> When the BD xlator gets a request to set BD_XATTR via a setxattr call, it
> creates an LV [1] and places information about this LV in an xattr of
> the posix file. The xattr 'trusted.glusterfs.bd.path' (BD_XATTR_PATH) stores
> the "vg_name/lv_name" mapping in the posix file.
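A minimal sketch of that setxattr flow, under stated assumptions: `handle_setxattr` and the `create_lv` callback are hypothetical names, a dict models the xattrs of the posix file, and refusing to map an already-mapped file is a guess at sensible behaviour rather than something the proposal specifies.

```python
from typing import Callable, MutableMapping

# xattr names taken from the proposal
BD_XATTR = "trusted.glusterfs.bd"
BD_XATTR_PATH = "trusted.glusterfs.bd.path"


def handle_setxattr(
    xattrs: MutableMapping[str, bytes],
    name: str,
    value: bytes,
    vg_name: str,
    lv_name: str,
    create_lv: Callable[[str, str], None],
) -> None:
    """Model of the BD xlator's setxattr handling for one posix file."""
    if name != BD_XATTR:
        # Any other xattr is simply stored, as the POSIX xlator would do.
        xattrs[name] = value
        return
    if BD_XATTR_PATH in xattrs:
        # Assumption: a file that is already mapped cannot be mapped again.
        raise FileExistsError("file is already mapped to an LV")
    # Create the LV first, then record the mapping on the posix file.
    create_lv(vg_name, lv_name)
    xattrs[BD_XATTR_PATH] = f"{vg_name}/{lv_name}".encode()
```

Passing the LV creation in as a callback keeps the sketch independent of any particular LVM binding; the value passed with BD_XATTR is ignored here because the proposal does not say how it is interpreted.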
>
> Usage:
> Server side:
> [root@host1 ~]# gluster volume create bdvol device vg host1:/vg1 host2:/vg2
> It creates a distributed gluster volume 'bdvol' with Volume Group vg1 on
> host1 and Volume Group vg2 on host2. [2]
>
> [root@host1 ~]# gluster volume start bdvol
>
> Client side:
> [root@node ~]# mount -t glusterfs host1:/bdvol /media
> [root@node ~]# touch /media/posix
> It creates a regular posix file 'posix' in either the host1:/vg1 or the host2:/vg2 brick
>
> [root@node ~]# mkdir /media/image
> [root@node ~]# touch /media/image/lv1
> It also creates a regular posix file 'lv1' in either the host1:/vg1 or
> the host2:/vg2 brick
>
> [root@node ~]# setfattr -n "trusted.glusterfs.bd" -v 1 /media/image/lv1
> [root@node ~]#
> The above setxattr creates a new LV in the corresponding brick's VG
> and sets 'trusted.glusterfs.bd.path' to the value 'vg1/lv1' (assuming
> it is created in host1:vg1)
>
> [root@node ~]# truncate -s5G /media/image/lv1
> It results in resizing LV 'lv1' to 5G
>
> Todos/Fixme:
> [1] LV name generation: LV name is generated from the full path of the posix
>     file. But it may exceed the LV name length limit. So from the full
>     path a unique LV name with limited length has to be generated
> [2] As of now posix brick directory is assumed from the VG name. Enhance
>     gluster volume creation CLI command to mention the VG name similar to
>     this:
>         # gluster volume create bdvol host1:/<brick1>:vg1 host2:/<brick2>:vg2
>     <brick1> is standard posix brick and :vg1 specifies it has to
>     map VG 'vg1' to that brick path. The syntax will be
>         # gluster volume create <volname> <NEW-BRICK[:VG]>
>
>     Second ':' in the brick path is used to differentiate between posix
>     and BD volume file and to specify the associated VG.
> [3] New bd xlator code is not working with open-behind xlator. In order to
>     work with open-behind, BD xlator has to use the same approach as the
>     posix xlator, where a <brick-path>/.glusterfs/gfid file is opened
>     when an fd is needed. But exposing the posix brick-path to BD xlator may
>     not be a good idea.
> [4] BD xlator is not doing any BD-specific operation in the readv/writev fops,
>     so they can be forwarded to posix readv/writev if the posix xlator could
>     also handle opening the BD device, i.e. when an open request comes for a
>     BD-mapped posix file, the posix_open routine has to open the posix file and
>     the BD (in this case LV) and embed the posix_fd and bd_fd_t structures in fd_t
>     structure. So later readv/writev will result in reading/writing to a
>     BD (and/or posix file). This also solves open-behind issue with BD
>     xlator. But it needs changes in posix xlator code and posix xlator
> [5] When a new brick is added a file may be moved to a new brick and
>     with BD volume it should also move the mapped LV to the new brick's
>     VG.
> [6] In a BD volume file, if the VG is served by a SAN, it is suggested to use
>     a posix brick directory from the same SAN as well, so that data (LVs)
>     and metadata (posix files) are stored in the same SAN.
> [7] Some fops copy inode->gfid to loc->gfid before forwarding the
>     request to posix xlator
> [8] Add dm-thin support
> [9] Add support for full and linked clones of LV images.
> [10] Retain support for bd_map xlator
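Todo items [1] and [2] could be approached along the following lines. This is a sketch under assumptions, not the planned implementation: the function names, the 64-character limit, and the 16-character SHA-1 prefix are all hypothetical choices, and the example brick paths are made up. The name generator prefixes a truncated hash of the full path, so names are bounded in length and collisions are unlikely though not impossible; it does not check LVM's reserved names. The parser splits the proposed `<NEW-BRICK[:VG]>` syntax on the second ':'.

```python
import hashlib
import re
from typing import NamedTuple, Optional

MAX_LV_NAME = 64  # hypothetical limit; substitute the real LV name limit
DIGEST_LEN = 16   # hex characters kept from the SHA-1 of the full path


def lv_name_for(path: str, max_len: int = MAX_LV_NAME) -> str:
    """Derive a bounded-length LV name from the full path of a posix file."""
    digest = hashlib.sha1(path.encode("utf-8")).hexdigest()[:DIGEST_LEN]
    # Keep only letters, digits, '_' and '.'; everything else becomes '_'.
    readable = re.sub(r"[^A-Za-z0-9_.]", "_", path.strip("/"))
    room = max_len - DIGEST_LEN - 1
    if not readable or room <= 0:
        return digest[:max_len]
    # The hash comes first for uniqueness; the path tail aids readability.
    return f"{digest}-{readable[-room:]}"


class Brick(NamedTuple):
    host: str
    path: str
    vg: Optional[str]  # None means a plain posix brick


def parse_brick(spec: str) -> Brick:
    """Parse 'host:/brick/path[:vg]' as proposed in todo item [2]."""
    host, sep, rest = spec.partition(":")
    if not sep or not host or not rest.startswith("/"):
        raise ValueError(f"invalid brick specification: {spec!r}")
    path, _, vg = rest.partition(":")
    return Brick(host, path, vg or None)
```

For instance, `parse_brick("host1:/bricks/b1:vg1")` yields host `host1`, path `/bricks/b1` and VG `vg1`, while a specification without the second ':' is treated as a plain posix brick.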