[Gluster-devel] [RFC] BD xlator - Supporting multi-brick and posix
M. Mohan Kumar
mohan at in.ibm.com
Fri Apr 12 07:45:18 UTC 2013
"M.Mohan Kumar" <mohan at in.ibm.com> writes:
Patches posted to Gerrit http://review.gluster.com/4809
Also added interested people in the CC list.
> bd: [RFC] posix/multi-brick support to BD xlator
> Current BD xlator (block backend) has a few limitations such as
> * Creation of directories is not supported
> * Supports only a single brick
> * Does not use extended attributes (and client gfid) like the posix xlator
> * Creation of special files (symbolic links, device nodes etc) is not supported
> The basic limitation of not allowing directory creation blocks
> oVirt/VDSM from consuming the BD xlator as part of a Gluster domain, since VDSM
> creates multi-level directories when GlusterFS is used as the storage
> backend for storing VM images.
> To overcome these limitations, a new BD xlator with the following improvements
> is suggested.
> * New hybrid BD xlator that handles both regular files and block device files
> * The volume will have both POSIX and BD bricks. Regular files are
> created on POSIX bricks, block devices are created on the BD brick (VG)
> * BD xlator leverages the existing POSIX xlator for most POSIX calls and
> hence sits above the POSIX xlator
> * Block device file is differentiated from regular file by an extended attribute
> * The xattr 'trusted.glusterfs.bd' (BD_XATTR) plays a role in mapping a
> posix file to Logical Volume (LV).
> * When a client sends a request to set BD_XATTR on a posix file, a new
> LV is created and mapped to the posix file. So every block device will
> have a representative file in the POSIX brick with 'trusted.glusterfs.bd.path'
> (BD_XATTR_PATH) set.
> * Hereafter, all operations on this file result in LV-related operations.
> For example, opening a file that has BD_XATTR_PATH set results in opening the LV
> block device, and reading it results in reading the corresponding LV block device.
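The BD_XATTR-to-LV mapping described above can be pictured with a small Python sketch. This is not the actual xlator code (which is C); xattrs are modeled with a plain dict, and the `Brick`/`create` names and the path-to-LV-name derivation are hypothetical illustrations of the flow.

```python
# Hypothetical sketch of the BD_XATTR -> LV mapping described above.
# The real xlator is C; xattrs here are modeled with a plain dict.

BD_XATTR = "trusted.glusterfs.bd"
BD_XATTR_PATH = "trusted.glusterfs.bd.path"

class Brick:
    """A posix brick backed by a Volume Group (VG)."""
    def __init__(self, vg_name):
        self.vg_name = vg_name
        self.xattrs = {}          # path -> {xattr_name: value}
        self.lvs = set()          # LVs created in this VG

    def setxattr(self, path, name, value):
        attrs = self.xattrs.setdefault(path, {})
        if name == BD_XATTR:
            # Setting BD_XATTR triggers LV creation; the LV name is
            # derived from the file path for this illustration only.
            lv_name = path.strip("/").replace("/", "-")
            self.lvs.add(lv_name)
            attrs[BD_XATTR_PATH] = f"{self.vg_name}/{lv_name}"
        attrs[name] = value

    def is_block_device(self, path):
        # A file is treated as a block device iff BD_XATTR_PATH is set.
        return BD_XATTR_PATH in self.xattrs.get(path, {})

brick = Brick("vg1")
brick.setxattr("/image/lv1", BD_XATTR, "1")
print(brick.xattrs["/image/lv1"][BD_XATTR_PATH])  # vg1/image-lv1
print(brick.is_block_device("/image/lv1"))        # True
```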
> The new BD xlator code is placed in the xlators/storage/bd directory. It also
> disables the existing bd_map xlator (i.e., you can't create a gluster volume
> that uses the bd_map xlator), but in the next version support for the bd_map
> xlator will be retained.
> When the BD xlator gets a request to set BD_XATTR via a setxattr call, it
> creates an LV, and information about this LV is placed in an xattr of
> the posix file. The xattr "glusterfs.bd.path" is used to map
> "vg_name/lv_name" to the posix file.
> Server side:
> [root@host1 ~]# gluster volume create bdvol device vg host1:/vg1 host2:/vg2
> It creates a distributed gluster volume 'bdvol' with Volume Group vg1 in
> host1 and Volume Group vg2 in host2. 
> [root@host1 ~]# gluster volume start bdvol
> Client side:
> [root@node ~]# mount -t glusterfs host1:/bdvol /media
> [root@node ~]# touch /media/posix
> It creates a regular posix file 'posix' in either the host1:/vg1 or host2:/vg2 brick
> [root@node ~]# mkdir /media/image
> [root@node ~]# touch /media/image/lv1
> It also creates a regular posix file 'lv1' in either the host1:/vg1 or
> host2:/vg2 brick
> [root@node ~]# setfattr -n "trusted.glusterfs.bd" -v 1 /media/image/lv1
> [root@node ~]#
> The above setxattr results in creating a new LV in the corresponding brick's VG,
> and it sets 'trusted.glusterfs.bd.path' to the value 'vg1/lv1' (assuming
> it is created in host1:vg1)
> [root@node ~]# truncate -s5G /media/image/lv1
> It results in resizing LV 'lv1' to 5G
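The truncate-to-resize translation above can be sketched as follows. The command shape is illustrative only (the `-f`/`-L` flags are real lvresize options, but the actual xlator would use the lvm2 library rather than shelling out), and `truncate_to_lv_command` is a hypothetical helper name.

```python
# Illustrative mapping of a truncate fop on a BD-mapped file to an
# equivalent LVM resize command. The real xlator would use the lvm2
# library APIs, not a shell command.

def truncate_to_lv_command(bd_xattr_path, new_size_bytes):
    """bd_xattr_path is the 'vg_name/lv_name' value stored in
    trusted.glusterfs.bd.path; returns the equivalent lvresize call."""
    return ["lvresize", "-f", "-L", f"{new_size_bytes}b", bd_xattr_path]

# truncate -s5G /media/image/lv1 on a file mapped to vg1/lv1:
cmd = truncate_to_lv_command("vg1/lv1", 5 * 1024**3)
print(" ".join(cmd))  # lvresize -f -L 5368709120b vg1/lv1
```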
>  LV name generation: The LV name is generated from the full path of the posix
> file, but this may exceed the LV name length limit. So a unique LV name of
> limited length has to be generated from the full path.
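One way to generate such a unique, length-limited name is to keep a readable prefix of the path and append a hash of the full path. This is a sketch, not the scheme the patch will use; the 64-character limit and the SHA-1/16-hex-digit choice are illustrative assumptions.

```python
import hashlib

# Sketch of deriving a unique, length-limited LV name from the full
# posix path. Truncating the path alone could collide for long paths
# sharing a prefix, so a hash of the full path keeps names unique.

MAX_LV_NAME = 64   # illustrative limit, not the exact LVM maximum

def lv_name_from_path(path):
    safe = path.strip("/").replace("/", "-")
    if len(safe) <= MAX_LV_NAME:
        return safe
    digest = hashlib.sha1(path.encode()).hexdigest()[:16]
    # Keep a readable prefix and append the hash for uniqueness.
    return f"{safe[:MAX_LV_NAME - 17]}-{digest}"

short = lv_name_from_path("/image/lv1")
hashed = lv_name_from_path("/a" * 100)
print(short)                      # image-lv1
print(len(hashed) <= MAX_LV_NAME) # True
```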
>  As of now the posix brick directory is assumed from the VG name. The gluster
> volume creation CLI command should be enhanced to mention the VG name, similar to
> # gluster volume create bdvol host1:/<brick1>:vg1 host2:/<brick2>:vg2
> <brick1> is a standard posix brick and :vg1 specifies that VG 'vg1' has to be
> mapped to that brick path. The syntax will be
> # gluster volume create <volname> <NEW-BRICK[:VG]>
> The second ':' in the brick path differentiates between a posix and a BD
> volume file and specifies the associated VG.
>  The new BD xlator code does not work with the open-behind xlator. In order to
> work with open-behind, the BD xlator would have to use the same approach as the
> posix xlator, where a <brick-path>/.glusterfs/gfid file is opened
> when an fd is needed. But exposing the posix brick path to the BD xlator may
> not be a good idea.
>  The BD xlator does no BD-specific work in the readv/writev fops, so these
> could be forwarded to posix readv/writev if the posix xlator could also handle
> opening the BD device. That is, when an open request comes for a BD-mapped
> posix file, the posix_open routine would open both the posix file and the BD
> (in this case the LV) and embed the posix_fd and bd_fd_t structures in the
> fd_t structure, so that later readv/writev results in reading/writing the
> BD (and/or the posix file). This would also solve the open-behind issue with
> the BD xlator, but it needs changes in the posix xlator code.
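The dual-fd idea above can be modeled as follows. This is a Python illustration of the structure, not the C fd_t change itself; in-memory buffers stand in for the real posix file and LV device, and the `DualFd` name is hypothetical.

```python
import io

# Sketch of the suggested posix_open change: for a BD-mapped file,
# open yields a handle that embeds both the posix fd and the block
# device fd, so readv/writev can be routed to the device. In-memory
# buffers stand in for the real posix file and LV device here.

class DualFd:
    """Models an fd_t carrying both a posix_fd and a bd_fd_t."""
    def __init__(self, posix_file, bd_device):
        self.posix_fd = posix_file   # metadata side (posix file)
        self.bd_fd = bd_device       # data side (LV block device)

    def writev(self, data):
        # Data I/O goes to the block device, not the posix file.
        return self.bd_fd.write(data)

    def readv(self, size, offset=0):
        self.bd_fd.seek(offset)
        return self.bd_fd.read(size)

fd = DualFd(io.BytesIO(), io.BytesIO())
fd.writev(b"vm-image-data")
print(fd.readv(8))  # b'vm-image'
```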
>  When a new brick is added, a file may be moved to the new brick; with a BD
> volume the mapped LV should also be moved to the new brick's VG.
>  For a BD volume whose VG is served by a SAN, it is suggested that the posix
> brick directory also be served from the same SAN, so that the data (LVs)
> and metadata (posix files) are stored in the same SAN.
>  Some fops copy inode->gfid to loc->gfid before forwarding the
> request to the posix xlator.
>  Add dm-thin support
>  Add support to full and linked clone of LV images.
>  Retain support for bd_map xlator
> Gluster-devel mailing list
> Gluster-devel at nongnu.org