[Gluster-devel] Progress on adding support for SEEK_DATA and SEEK_HOLE

Mon Jul 6 07:37:08 UTC 2015

On 07/06/2015 01:15 AM, Niels de Vos wrote:
> On Wed, Jul 01, 2015 at 09:41:19PM +0200, Niels de Vos wrote:
>> On Wed, Jul 01, 2015 at 07:15:12PM +0200, Xavier Hernandez wrote:
>>> On 07/01/2015 08:53 AM, Niels de Vos wrote:
>>>> On Tue, Jun 30, 2015 at 11:48:20PM +0530, Ravishankar N wrote:
>>>>>
>>>>>
>>>>> On 06/22/2015 03:22 PM, Ravishankar N wrote:
>>>>>>
>>>>>>
>>>>>> On 06/22/2015 01:41 PM, Miklos Szeredi wrote:
>>>>>>> On Sun, Jun 21, 2015 at 6:20 PM, Niels de Vos <ndevos at redhat.com> wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> it seems that there could be a reasonable benefit for virtual machine
>>>>>>>> images on a FUSE mountpoint if SEEK_DATA and SEEK_HOLE were
>>>>>>>> available. At the moment, FUSE does not pass lseek() on to the
>>>>>>>> userspace process that handles the I/O.
>>>>>>>>
>>>>>>>> Other filesystems that do not (need to) track the position in the
>>>>>>>> file-descriptor are starting to support SEEK_DATA/HOLE. One example is
>>>>>>>> NFS:
>>>>>>>>
>>>>>>>> https://tools.ietf.org/html/draft-ietf-nfsv4-minorversion2-38#section-15.11
>>>>>>>>
>>>>>>>> I would like to add this feature to Gluster, and am wondering if there
>>>>>>>> are any reasons why it should/could not be added to FUSE.
>>>>>>> I don't see any reason why it couldn't be added.  Please go ahead.
>>>>>>
>>>>>> Thanks for bouncing the mail to me Niels, I would be happy to work on
>>>>>> this. I'll submit a patch by Monday next.
>>>>>>
>>>>>
>>>>>
>>>>> Sent a patch @
>>>>> http://thread.gmane.org/gmane.comp.file-systems.fuse.devel/14752
>>>>> I've tested it with some skeleton code in gluster-fuse to handle lseek().
>>>>
>>>> Ravi also sent his patch for glusterfs-fuse:
>>>>
>>>>    http://review.gluster.org/11474
>>>>
>>>> I have posted my COMPLETELY UNTESTED patches to their own Gerrit topic
>>>> so that we can easily track the progress:
>>>>
>>>>    http://review.gluster.org/#/q/status:open+project:glusterfs+branch:master+topic:wip/SEEK_HOLE
>>>>
>>>> My preference is to share things early so that everyone can follow
>>>> progress (and knows where to find the latest patches). Assistance in
>>>> testing, reviewing and improving is welcome! There are some outstanding
>>>> things like seek() for ec and sharding, and probably more.
>>>>
>>>> This all was done as a suggestion from Christopher (kripper) Pereira,
>>>> for improving the handling of sparse files (like most VM images).
>>>
>>> I've posted the patch for ec in the same Gerrit topic:
>>>
>>>      http://review.gluster.org/11494/
>>
>> Thanks!
>>
>>> It has not been tested, and we still need to discuss whether it is really
>>> necessary to send the request to all subvolumes.
>>>
>>> The lock and the xattrop are absolutely needed. Even if we send the request
>>> to only one subvolume, we need to know which ones are healthy (to avoid
>>> sending the request to a brick that could have invalid hole information).
>>> This could have been done in open, but since NFS does not issue open calls,
>>> we cannot rely on that.
>>
>> Ok, yes, that makes sense. We will likely have SEEK as an operation in
>> NFS-Ganesha at one point, and that will use the handle-based gfapi
>> functions.
>>
>>> Once we know which bricks are healthy we could opt for sending the request
>>> only to one of them. In this case we need to be aware that even healthy
>>> bricks could have different hole locations.
>>
>> I'm not sure I understand what you mean, but that is probably because I
>> don't know much about ec. I'll try to think it through later this
>> week.
>
> The only thing that would need to be guaranteed is that the offset of
> the hole/data is safe. The whole purpose is to improve handling of
> sparse files; this does not need to be perfect. The holes themselves are
> not important, but the non-holes are.
>
> When a sparse file (think VM image) is copied, the goal is to not read
> the holes which would return NUL bytes. If calculating the start of a
> hole or the end is not exact, that is not a fatal issue. Reading and
> backing up a series of NUL bytes before/after the hole should be
> acceptable.
>
> A drawing can probably explain things a little better.
>
>
>                          lseek(SEEK_HOLE)
>                            |       |
>                    perfect |       | acceptable
>                      match |       | match
>                            |       |
>       .....................|.......|.....................
>       :file                |       |                    :
>       : .----------------. v       v           .------. :
>       : | DATA DATA DATA | NUL NUL NUL NUL NUL | DATA | :
>       : '----------------'                 ^   '------' :
>       :                                    |   ^        :
>       .....................................|...|.........
>                                            |   |
>                                 acceptable |   | perfect
>                                      match |   | match
>                                            |   |
>                                          lseek(SEEK_DATA)
>
>
> I have no idea how ec can figure out the offset of holes/data, that
> would be interesting to know. Is it something that is available in a
> design document somewhere?

EC splits the file into chunks of 512 * (number of data bricks) bytes.
Each brick receives a fragment of 512 bytes for each chunk. These
fragments are the minimal units of data: each one is either entirely a
hole or entirely data, never a mix (if part of a fragment should be a
hole, it is filled with 0's). This means that backend filesystems can
only have data/holes aligned to offsets that are multiples of 512 bytes.
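
To make the arithmetic concrete, here is a rough sketch of how a file
offset relates to a brick offset under that layout. The function names
are made up for illustration, and the last helper (clamping a brick's
answer to the requested offset) is only my assumption of how a
translation could work; it is not the logic in the ec patch.

```python
FRAGMENT_SIZE = 512  # bytes each brick stores per chunk (from the text above)


def chunk_size(data_bricks):
    """Size in bytes of one EC chunk as seen by the client."""
    return FRAGMENT_SIZE * data_bricks


def file_to_brick_offset(file_offset, data_bricks):
    """Brick-side offset of the fragment for the chunk containing file_offset."""
    chunk_index = file_offset // chunk_size(data_bricks)
    return chunk_index * FRAGMENT_SIZE


def brick_to_file_offset(brick_offset, data_bricks):
    """File offset of the start of the chunk whose fragment holds brick_offset."""
    fragment_index = brick_offset // FRAGMENT_SIZE
    return fragment_index * chunk_size(data_bricks)


def brick_seek_to_file_offset(requested, brick_result, data_bricks):
    """Translate a brick's seek answer to a file offset (hypothetical helper).

    The brick can only answer at fragment granularity, so the result is
    mapped to the start of the matching chunk, but never to a position
    before the offset the caller asked for.
    """
    return max(requested, brick_to_file_offset(brick_result, data_bricks))
```

With 4 data bricks a chunk is 2048 bytes, so file offset 5000 falls in
chunk 2, which lives at brick offset 1024. A brick answer of 1024 for
that request maps back to file offset 4096 and is clamped to 5000.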

After reading some other information and your explanation, I will need
to change the logic that detects data/holes. I'll update the patch as
soon as possible.

>
> My inclination is to have the same consistency for the seek() FOP as for
> read(). The same locking and health-checks would apply. Does that help?

What provides consistency to read() is the initial check done just after 
the locking. I think this is enough to choose one healthy brick, so I'll 
also update the patch to only use a single brick for seek() instead of 
sending the request to multiple bricks.

Even if the data/hole positions differ between healthy bricks, any
brick that has data where others have holes must contain 0's there
(otherwise it shouldn't be considered healthy). So I think it's not so
important to query multiple bricks to obtain more accurate information.
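
As an illustration of why an inexact answer is harmless, here is a
minimal sketch of the kind of sparse-aware copy this thread is about.
It is not taken from any of the patches; copy_sparse and its bufsize
parameter are names invented for the example, and it assumes a Linux
system where Python exposes os.SEEK_DATA and os.SEEK_HOLE. If a range
of 0's is reported as data instead of as a hole, the loop simply copies
those 0's and the result is byte-for-byte the same.

```python
import errno
import os


def copy_sparse(src_path, dst_path, bufsize=1 << 16):
    """Copy src to dst, reading only the ranges lseek() reports as data."""
    src = os.open(src_path, os.O_RDONLY)
    dst = os.open(dst_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        size = os.fstat(src).st_size
        offset = 0
        while offset < size:
            try:
                # Start of the next data range at or after offset.
                data = os.lseek(src, offset, os.SEEK_DATA)
            except OSError as err:
                if err.errno == errno.ENXIO:
                    break  # nothing but a hole up to the end of the file
                raise
            # End of that data range (EOF counts as a hole).
            hole = os.lseek(src, data, os.SEEK_HOLE)
            pos = data
            while pos < hole:
                buf = os.pread(src, min(bufsize, hole - pos), pos)
                if not buf:
                    break  # file shrank while copying
                os.pwrite(dst, buf, pos)
                pos += len(buf)
            offset = hole
        # Skipped ranges stay holes in dst; extend it to the full size.
        os.ftruncate(dst, size)
    finally:
        os.close(src)
        os.close(dst)
```

On a filesystem without real SEEK_DATA/SEEK_HOLE support the kernel
reports the whole file as one data range, so the copy is still correct,
just not sparse.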

Xavi