[Gluster-devel] Progress on adding support for SEEK_DATA and SEEK_HOLE

Niels de Vos ndevos at redhat.com
Mon Jul 6 07:57:47 UTC 2015


On Mon, Jul 06, 2015 at 09:37:08AM +0200, Xavier Hernandez wrote:
> On 07/06/2015 01:15 AM, Niels de Vos wrote:
> >On Wed, Jul 01, 2015 at 09:41:19PM +0200, Niels de Vos wrote:
> >>On Wed, Jul 01, 2015 at 07:15:12PM +0200, Xavier Hernandez wrote:
> >>>On 07/01/2015 08:53 AM, Niels de Vos wrote:
> >>>>On Tue, Jun 30, 2015 at 11:48:20PM +0530, Ravishankar N wrote:
> >>>>>
> >>>>>
> >>>>>On 06/22/2015 03:22 PM, Ravishankar N wrote:
> >>>>>>
> >>>>>>
> >>>>>>On 06/22/2015 01:41 PM, Miklos Szeredi wrote:
> >>>>>>>On Sun, Jun 21, 2015 at 6:20 PM, Niels de Vos <ndevos at redhat.com> wrote:
> >>>>>>>>Hi,
> >>>>>>>>
> >>>>>>>>it seems that there could be a reasonable benefit for virtual machine
> >>>>>>>>images on a FUSE mountpoint if SEEK_DATA and SEEK_HOLE were
> >>>>>>>>available. At the moment, FUSE does not pass lseek() on to the
> >>>>>>>>userspace process that handles the I/O.
> >>>>>>>>
> >>>>>>>>Other filesystems that do not (need to) track the position in the
> >>>>>>>>file-descriptor are starting to support SEEK_DATA/HOLE. One example is
> >>>>>>>>NFS:
> >>>>>>>>
> >>>>>>>>https://tools.ietf.org/html/draft-ietf-nfsv4-minorversion2-38#section-15.11
> >>>>>>>>
> >>>>>>>>I would like to add this feature to Gluster, and am wondering if there
> >>>>>>>>are any reasons why it should/could not be added to FUSE.
> >>>>>>>I don't see any reason why it couldn't be added.  Please go ahead.
> >>>>>>
> >>>>>>Thanks for bouncing the mail to me, Niels. I would be happy to work
> >>>>>>on this; I'll submit a patch by next Monday.
> >>>>>>
> >>>>>
> >>>>>
> >>>>>Sent a patch @
> >>>>>http://thread.gmane.org/gmane.comp.file-systems.fuse.devel/14752
> >>>>>I've tested it with some skeleton code in gluster-fuse to handle lseek().
> >>>>
> >>>>Ravi also sent his patch for glusterfs-fuse:
> >>>>
> >>>>   http://review.gluster.org/11474
> >>>>
> >>>>I have posted my COMPLETELY UNTESTED patches to their own Gerrit topic
> >>>>so that we can easily track the progress:
> >>>>
> >>>>   http://review.gluster.org/#/q/status:open+project:glusterfs+branch:master+topic:wip/SEEK_HOLE
> >>>>
> >>>>My preference is to share things early so that everyone can follow
> >>>>progress (and knows where to find the latest patches). Assistance in
> >>>>testing, reviewing and improving is welcome! There are some outstanding
> >>>>items, like seek() for ec and sharding, and probably more.
> >>>>
> >>>>This was all done at the suggestion of Christopher (kripper) Pereira,
> >>>>to improve the handling of sparse files (like most VM images).
> >>>
> >>>I've posted the patch for ec in the same Gerrit topic:
> >>>
> >>>     http://review.gluster.org/11494/
> >>
> >>Thanks!
> >>
> >>>It has not been tested, and we will need some discussion about whether
> >>>it is really necessary to send the request to all subvolumes.
> >>>
> >>>The lock and the xattrop are absolutely needed. Even if we send the request
> >>>to only one subvolume, we need to know which ones are healthy (to avoid
> >>>sending the request to a brick that could have invalid hole information).
> >>>This could have been done in open, but since NFS does not issue open calls,
> >>>we cannot rely on that.
> >>
> >>Ok, yes, that makes sense. We will likely have SEEK as an operation in
> >>NFS-Ganesha at some point, and that will use the handle-based gfapi
> >>functions.
> >>
> >>>Once we know which bricks are healthy, we could opt to send the request
> >>>to only one of them. In that case we need to be aware that even healthy
> >>>bricks could have different hole locations.
> >>
> >>I'm not sure I understand what you mean, but that is likely because I
> >>don't know much about ec. I'll try to think it through later this week.
> >
> >The only thing that would need to be guaranteed is that the offset of
> >the hole/data is safe. The whole purpose is to improve the handling of
> >sparse files; this does not need to be perfect. The holes themselves are
> >not important, but the non-holes are.
> >
> >When a sparse file (think VM image) is copied, the goal is to not read
> >the holes, which would only return NUL bytes. If the calculated start or
> >end of a hole is not exact, that is not a fatal issue. Reading and
> >backing up a series of NUL bytes before/after the hole should be
> >acceptable.
> >
> >A drawing can probably explain things a little better.
> >
> >
> >                         lseek(SEEK_HOLE)
> >                           |       |
> >                   perfect |       | acceptable
> >                     match |       | match
> >                           |       |
> >      .....................|.......|.....................
> >      :file                |       |                    :
> >      : .----------------. v       v           .------. :
> >      : | DATA DATA DATA | NUL NUL NUL NUL NUL | DATA | :
> >      : '----------------'                 ^   '------' :
> >      :                                    |   ^        :
> >      .....................................|...|.........
> >                                           |   |
> >                                acceptable |   | perfect
> >                                     match |   | match
> >                                           |   |
> >                                         lseek(SEEK_DATA)
> >
> >
> >I have no idea how ec can figure out the offset of holes/data, that
> >would be interesting to know. Is it something that is available in a
> >design document somewhere?
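
To make the intent concrete, a sparse-aware copy loop would use the two
seek operations roughly like this. This is only an untested userspace
sketch (not code from any of the patches under review); the function
name and buffer size are made up:

    #define _GNU_SOURCE
    #include <errno.h>
    #include <unistd.h>

    /* Copy only the data extents of 'in' to 'out', skipping holes.  If
     * the reported hole boundaries are not exact, we merely copy a few
     * extra NUL bytes; no data can be lost. */
    static int
    copy_sparse (int in, int out)
    {
            char buf[65536];
            off_t data = 0, hole, end;

            for (;;) {
                    data = lseek (in, data, SEEK_DATA);
                    if (data < 0) {
                            if (errno == ENXIO)
                                    break;      /* no more data before EOF */
                            return -1;
                    }
                    hole = lseek (in, data, SEEK_HOLE);
                    if (hole < 0)
                            return -1;
                    /* SEEK_HOLE moved the file position, go back */
                    if (lseek (in, data, SEEK_SET) < 0 ||
                        lseek (out, data, SEEK_SET) < 0)
                            return -1;
                    while (data < hole) {
                            size_t want = hole - data;
                            ssize_t n;

                            if (want > sizeof (buf))
                                    want = sizeof (buf);
                            n = read (in, buf, want);
                            if (n <= 0 || write (out, buf, n) != n)
                                    return -1;
                            data += n;
                    }
            }
            /* extend the destination to preserve a trailing hole */
            end = lseek (in, 0, SEEK_END);
            if (end < 0 || ftruncate (out, end) < 0)
                    return -1;
            return 0;
    }

An "acceptable match" in the drawing above only makes the inner read
loop copy some NUL bytes that a "perfect match" would have skipped.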
> 
> EC splits the file into chunks of 512 * #data bricks. Each brick receives
> a fragment of 512 bytes for each chunk. These fragments are the minimal
> units of data: each one is either entirely a hole or entirely data, never
> a mix (if part of a fragment should be a hole, it is filled with 0's).
> This means that the backend filesystems can only have data/holes aligned
> to offsets that are multiples of 512 bytes.

Thanks for the explanation, makes sense to me.
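
If I read that correctly, translating between file offsets and brick
offsets is simple arithmetic. Something like this (hypothetical helper
names, untested, only to check my understanding):

    #include <sys/types.h>

    /* With k data bricks, the file is striped in chunks of 512 * k
     * bytes, and each brick stores one 512-byte fragment per chunk. */
    #define EC_FRAGMENT 512

    /* file offset -> offset of the matching fragment on each brick */
    static off_t
    ec_brick_offset (off_t file_offset, int k)
    {
            return (file_offset / (EC_FRAGMENT * k)) * EC_FRAGMENT;
    }

    /* brick offset (e.g. the result of a seek on one brick)
     * -> start of the corresponding chunk in the file */
    static off_t
    ec_file_offset (off_t brick_offset, int k)
    {
            return (brick_offset / EC_FRAGMENT) * (EC_FRAGMENT * k);
    }

Since a fragment is entirely data or entirely a hole, an answer mapped
back this way is aligned to the chunk size, which lands in the
"acceptable match" case from the drawing.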

> After reading some other information, and from your explanation, I will
> need to change the logic that detects data/holes. I'll update the patch
> as soon as possible.

No need to hurry. This is a nice feature to have, but I am not aware of
any release schedule that it needs to fit into.

> >My inclination is to have the same consistency for the seek() FOP as for
> >read(). The same locking and health-checks would apply. Does that help?
> 
> What provides consistency to read() is the initial check done just after
> taking the lock. I think this is enough to choose one healthy brick, so
> I'll also update the patch to send the seek() request to a single brick
> instead of to multiple bricks.

Sounds good!

> Even though the data/hole positions can differ between healthy bricks,
> any brick that has data where another has a hole must contain 0's there
> (otherwise it wouldn't be healthy). So I think it's not that important to
> query multiple bricks to obtain more accurate information.

Indeed, and the handling of files with holes already depends on the
behaviour of the actual filesystem used. Some filesystems may (or at
least used to) anticipate that (small?) holes will get filled, and
allocate the underlying blocks anyway.
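
Because of that, callers can never assume that holes will be reported
at all: getting the end of the file back from SEEK_HOLE is a perfectly
valid answer. A rough illustration, assuming the usual Linux semantics
(untested sketch, hypothetical helper):

    #define _GNU_SOURCE
    #include <errno.h>
    #include <unistd.h>

    /* A filesystem without hole tracking reports the whole file as
     * data, leaving only the implicit hole at EOF.  Returns 1 if a
     * hole exists before EOF, 0 if not, -1 on error. */
    static int
    has_hole (int fd)
    {
            off_t end = lseek (fd, 0, SEEK_END);
            off_t hole = lseek (fd, 0, SEEK_HOLE);

            if (end < 0)
                    return -1;
            if (hole < 0)
                    return errno == ENXIO ? 0 : -1;  /* ENXIO: empty file */
            return hole < end;
    }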

Niels

