[Gluster-devel] Change in glusterfs[master]: bd: posix/multi-brick support to BD xlator

Anand Avati avati at redhat.com
Thu Sep 5 23:50:17 UTC 2013


On 09/01/2013 11:26 AM, M. Mohan Kumar (Code Review) wrote:
> Hello Anand Avati, Gluster Build System,
>
> I'd like you to reexamine a change.  Please visit
>
>      http://review.gluster.org/4809
>
> to look at the new patch set (#4).
>
> Change subject: bd: posix/multi-brick support to BD xlator
> ......................................................................
>
> bd: posix/multi-brick support to BD xlator
>
> Current BD xlator (block backend) has a few limitations such as
> * Creation of directories not supported
> * Supports only single brick
> * Does not use extended attributes (and client gfid) like posix xlator
> * Creation of special files (symbolic links, device nodes etc) not
> supported
>
> The basic limitation of not allowing directory creation blocks
> oVirt/VDSM from consuming the BD xlator as part of a Gluster domain,
> since VDSM creates multi-level directories when GlusterFS is used as
> the storage backend for storing VM images.
>
> To overcome these limitations a new BD xlator with following
> improvements is suggested.
>
> * New hybrid BD xlator that handles both regular files and block device files
> * The volume will have both POSIX and BD bricks. Regular files are
>    created on POSIX bricks, block devices are created on the BD brick (VG)
> * BD xlator leverages the existing POSIX xlator for most POSIX calls and
>    hence sits above the POSIX xlator
> * Block device file is differentiated from regular file by an extended attribute
> * The xattr 'user.glusterfs.bd' (BD_XATTR) plays a role in mapping a
>    posix file to Logical Volume (LV).
> * When a client sends a request to set BD_XATTR on a posix file, a new
>    LV is created and mapped to posix file. So every block device will
>    have a representative file in POSIX brick with 'user.glusterfs.bd'
>    (BD_XATTR) and 'user.glusterfs.bd.size' (BD_XATTR_SIZE) set.
> * Hereafter, all operations on this file result in LV-related operations.
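
To make the mapping described in the list above concrete, here is a minimal in-memory sketch of the dispatch idea: a file is served from the POSIX side until BD_XATTR is set on it, after which reads go to the backing LV. This is only a toy model, not the xlator code. The class name HybridBrick, the default_size parameter and the use of the file path as the LV name are assumptions made for illustration (the changelog further down says the real LV name is the gfid).

```python
BD_XATTR = "user.glusterfs.bd"
BD_XATTR_SIZE = "user.glusterfs.bd.size"


class HybridBrick:
    """Toy model of one brick: POSIX files plus the LVs of a volume group."""

    def __init__(self):
        self.files = {}   # posix path -> file contents (bytes)
        self.xattrs = {}  # posix path -> {xattr name: value}
        self.lvs = {}     # LV name -> bytearray standing in for the device

    def create(self, path):
        # Every file, block-backed or not, starts as a regular posix file.
        self.files[path] = b""
        self.xattrs[path] = {}

    def setxattr(self, path, name, value, default_size=4):
        if name == BD_XATTR and BD_XATTR not in self.xattrs[path]:
            # Assumption: the LV is keyed by path here; default_size is a
            # made-up stand-in for the default extent size.
            self.lvs[path] = bytearray(default_size)
            self.xattrs[path][BD_XATTR_SIZE] = str(default_size)
        self.xattrs[path][name] = value

    def is_bd(self, path):
        # The xattr alone decides whether the file is mapped to an LV.
        return BD_XATTR in self.xattrs[path]

    def read(self, path):
        if self.is_bd(path):
            return bytes(self.lvs[path])  # LV-backed: read the device
        return self.files[path]           # otherwise: plain posix read
```

The point of the sketch is that the representative posix file always exists, and the extended attribute is the only thing that switches an operation from the POSIX path to the LV path.
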
>
> New BD xlator code is placed in xlators/storage/bd directory.
>
> For example opening a file that has BD_XATTR_PATH set results in opening
> the LV block device, reading results in reading the corresponding LV block
> device.
>
> When the BD xlator gets a request to set BD_XATTR via a setxattr call,
> it creates an LV and places information about this LV in the xattrs of
> the posix file. The xattrs "user.glusterfs.bd" and
> "user.glusterfs.bd.size" identify that the posix file is mapped to BD.
>
> Usage:
> Server side:
> [root@host1 ~]# gluster volume create bdvol device vg host1:/storage/vg1_info?vg1 host2:/storage/vg2_info?vg2
> It creates a distributed gluster volume 'bdvol' with Volume Group vg1
> using posix brick /storage/vg1_info in host1 and Volume Group vg2 using
> /storage/vg2_info in host2.
>
> [root@host1 ~]# gluster volume start bdvol
>
> Client side:
> [root@node ~]# mount -t glusterfs host1:/bdvol /media
> [root@node ~]# touch /media/posix
> It creates a regular posix file 'posix' in either host1:/vg1 or
> host2:/vg2 brick
>
> [root@node ~]# mkdir /media/image
> [root@node ~]# touch /media/image/lv1
> It also creates a regular posix file 'lv1' in either host1:/vg1 or
> host2:/vg2 brick
>
> [root@node ~]# setfattr -n "user.glusterfs.bd" -v "lv" /media/image/lv1
> [root@node ~]#
> The above setxattr results in creating a new LV in the corresponding
> brick's VG, and it sets 'user.glusterfs.bd' with value 'lv' and
> 'user.glusterfs.bd.size' with the default extent size.
>
> [root@node ~]# truncate -s5G /media/image/lv1
> It results in resizing LV 'lv1' to 5G
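
For management tools that would rather not shell out to touch, setfattr and truncate, the client-side sequence above could be sketched in Python roughly as follows. The xattr name and the values "lv" and "thin" come from the commit message; the function name make_block_device and the injectable setxattr parameter are hypothetical and exist only to make the sketch testable without a GlusterFS mount. os.setxattr is Linux-only.

```python
import os

BD_XATTR = "user.glusterfs.bd"


def make_block_device(path, size_bytes, lv_type="lv", setxattr=None):
    """Create a posix file, map it to an LV, then resize the LV.

    Mirrors: touch PATH; setfattr -n user.glusterfs.bd -v TYPE PATH;
    truncate -sSIZE PATH.
    """
    if lv_type not in ("lv", "thin"):
        raise ValueError("lv_type must be 'lv' or 'thin'")
    if setxattr is None:
        setxattr = os.setxattr  # resolved lazily; only exists on Linux

    # Equivalent of `touch`: create the representative posix file.
    with open(path, "a"):
        pass

    # Setting BD_XATTR asks the BD xlator to create and map the LV.
    setxattr(path, BD_XATTR, lv_type.encode())

    # On a BD-mapped file, truncate is expected to resize the LV.
    os.truncate(path, size_bytes)
```

On a mounted BD volume, the example from the commit message would correspond to make_block_device("/media/image/lv1", 5 * 1024**3).
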
>
> Changes from previous version V3:
> * Added support in FUSE to support full/linked clone
> * Added support to merge snapshots and provide information about origin
> * bd_map xlator removed
> * iatt structure used in inode_ctx. iatt is cached and updated during
> fsync/flush
> * aio support
> * Type and capabilities of volume are exported through getxattr
>
> Changes from version 2:
> * Used inode_context for caching BD size and to check if loc/fd is BD or
>    not.
> * Added GlusterFS server offloaded copy and snapshot through setfattr
>    FOP. As part of this libgfapi is modified.
> * BD xlator supports stripe
> * During unlinking, if an LV file is already opened, it is added to the
>    delete list, and bd_del_thread tries to delete it from this list when
>    the last reference to that file is closed.
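
The deferred-delete behaviour in the last item can be illustrated with a small reference-counting sketch. This is an assumption-laden simplification: the patch uses a separate thread (bd_del_thread), whereas this model performs the deletion synchronously on the final close, and the class and method names are invented for the example.

```python
class DeferredDeleter:
    """Toy model: an LV unlinked while open is removed on the last close."""

    def __init__(self):
        self.open_refs = {}   # LV name -> number of open references
        self.pending = set()  # LVs unlinked while still open
        self.deleted = []     # LVs actually removed, in order

    def open(self, lv):
        self.open_refs[lv] = self.open_refs.get(lv, 0) + 1

    def unlink(self, lv):
        if self.open_refs.get(lv, 0) > 0:
            # Still in use: remember it instead of deleting right away.
            self.pending.add(lv)
        else:
            self._delete(lv)

    def close(self, lv):
        self.open_refs[lv] -= 1
        if self.open_refs[lv] == 0:
            del self.open_refs[lv]
            if lv in self.pending:
                # Last reference dropped: perform the deferred delete.
                self.pending.discard(lv)
                self._delete(lv)

    def _delete(self, lv):
        # Stand-in for the real LV removal.
        self.deleted.append(lv)
```

With two opens followed by an unlink, nothing is deleted until the second close, which matches the "last reference" wording above.
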
>
> Changes from previous version:
> * gfid is used as name of LV
> * ? is used to specify VG name for creating BD volume in volume
>    create, add-brick. gluster volume create volname host:/path?vg
> * open-behind issue is fixed
> * A replicate brick can be added dynamically and LVs from source brick are
>    replicated to destination brick
> * A distribute brick can be added dynamically and rebalance operation
>    distributes existing LVs/files to the new brick
> * Thin provisioning support added.
> * bd_map xlator support retained
> * setfattr -n user.glusterfs.bd -v "lv" creates a regular LV and
>    setfattr -n user.glusterfs.bd -v "thin" creates thin LV
> * Capability and backend information added to gluster volume info (and --xml) so
>    that management tools can exploit BD xlator.
> * tracing support for bd xlator added
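
As an illustration of the host:/path?vg brick syntax mentioned in the list above, a tool consuming such specs might split them as in the sketch below. This is not the glusterd parser; BrickSpec and parse_brick_spec are hypothetical names, and the sketch assumes a hostname without colons (so no bare IPv6 addresses) and an absolute brick path.

```python
from typing import NamedTuple, Optional


class BrickSpec(NamedTuple):
    host: str
    path: str
    vg: Optional[str]  # None for a plain posix brick


def parse_brick_spec(spec):
    """Split 'host:/path?vg' into host, posix path and optional VG name."""
    location, qmark, vg = spec.partition("?")
    host, colon, path = location.partition(":")
    if not colon or not host or not path.startswith("/"):
        raise ValueError("expected host:/path[?vg], got %r" % spec)
    if qmark and not vg:
        raise ValueError("'?' given without a VG name in %r" % spec)
    return BrickSpec(host, path, vg if qmark else None)
```

For the volume created in the usage section, parse_brick_spec("host1:/storage/vg1_info?vg1") yields host "host1", path "/storage/vg1_info" and VG "vg1".
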
>
> TODO:
> * Add support to display snapshots for a given LV
> * Display posix filename for list-origin instead of gfid
>
> Change-Id: I00d32dfbab3b7c806e0841515c86c3aa519332f2
> Signed-off-by: M. Mohan Kumar <mohan at in.ibm.com>
> ---
> M configure.ac
> M xlators/storage/Makefile.am
> A xlators/storage/bd/Makefile.am
> A xlators/storage/bd/src/Makefile.am
> A xlators/storage/bd/src/bd-helper.c
> A xlators/storage/bd/src/bd.c
> A xlators/storage/bd/src/bd.h
> 7 files changed, 638 insertions(+), 1 deletion(-)
>
>
>    git pull ssh://git.gluster.org/glusterfs refs/changes/09/4809/4
>


Mohan,
In general, other than the specific comments in the various patches, we 
should probably squash some of the patches into a smaller set (from 15) -

0 - remove old bd_map xlator
1 - implement basic new bd xlator (include everything not listed below)
2 - other translators' changes to support BD
3 - add snapshot/clone support
4 - add aio support

There are a lot of instances in the patch set where an earlier patch 
does things a certain way and a later patch changes it. All this seems 
quite redundant and hard to review for a new feature which is adding 
code from scratch.

Avati




