[Gluster-devel] fallocate behavior in glusterfs

FNU Raghavendra Manjunath rabhat at redhat.com
Mon Jul 8 16:00:24 UTC 2019


I have sent an RFC patch [1] for review.

https://review.gluster.org/#/c/glusterfs/+/23011/

On Thu, Jul 4, 2019 at 1:13 AM Pranith Kumar Karampuri <pkarampu at redhat.com>
wrote:

>
>
> On Wed, Jul 3, 2019 at 10:59 PM FNU Raghavendra Manjunath <
> rabhat at redhat.com> wrote:
>
>>
>>
>> On Wed, Jul 3, 2019 at 3:28 AM Pranith Kumar Karampuri <
>> pkarampu at redhat.com> wrote:
>>
>>>
>>>
>>> On Wed, Jul 3, 2019 at 10:14 AM Ravishankar N <ravishankar at redhat.com>
>>> wrote:
>>>
>>>>
>>>> On 02/07/19 8:52 PM, FNU Raghavendra Manjunath wrote:
>>>>
>>>>
>>>> Hi All,
>>>>
>>>> In glusterfs, there is an issue with the fallocate behavior. In
>>>> short, if someone does fallocate from the mount point with a size that
>>>> is greater than the available space in the backend filesystem where the
>>>> file is present, then fallocate can allocate a subset of the required
>>>> number of blocks and then fail in the backend filesystem with an ENOSPC
>>>> error.
>>>>
>>>> The behavior of fallocate in itself is similar to how it would behave
>>>> on a disk filesystem (at least on xfs, where it was checked), i.e. it
>>>> allocates a subset of the required number of blocks and then fails with
>>>> ENOSPC, and stat on the file shows the number of blocks that were
>>>> allocated before the failure. Please refer to [1], where the issue is
>>>> explained.
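>>>>
>>>> To make this concrete, here is a minimal, self-contained sketch (plain
>>>> C, not glusterfs code) that asks fallocate for more space than the
>>>> filesystem has free and then prints st_blocks. Running it once against
>>>> an xfs mount and once against a glusterfs mount shows the difference
>>>> discussed below:
>>>>
>>>>     #define _GNU_SOURCE
>>>>     #include <errno.h>
>>>>     #include <fcntl.h>
>>>>     #include <stdio.h>
>>>>     #include <stdlib.h>
>>>>     #include <string.h>
>>>>     #include <sys/stat.h>
>>>>     #include <unistd.h>
>>>>
>>>>     int main(int argc, char **argv)
>>>>     {
>>>>         if (argc != 3) {
>>>>             fprintf(stderr, "usage: %s <path> <bytes>\n", argv[0]);
>>>>             return 1;
>>>>         }
>>>>
>>>>         int fd = open(argv[1], O_CREAT | O_RDWR, 0644);
>>>>         if (fd < 0) {
>>>>             perror("open");
>>>>             return 1;
>>>>         }
>>>>
>>>>         /* Ask for more space than the filesystem has free. */
>>>>         if (fallocate(fd, 0, 0, atoll(argv[2])) < 0)
>>>>             fprintf(stderr, "fallocate: %s\n", strerror(errno));
>>>>
>>>>         struct stat st;
>>>>         fstat(fd, &st);
>>>>         /* xfs: st_blocks shows what was allocated before ENOSPC.
>>>>          * glusterfs mount: st_blocks comes back as 0. */
>>>>         printf("st_size=%lld st_blocks=%lld\n",
>>>>                (long long)st.st_size, (long long)st.st_blocks);
>>>>         close(fd);
>>>>         return 0;
>>>>     }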
>>>>
>>>> Now, there is one small difference in behavior between glusterfs and
>>>> xfs. On xfs, after fallocate fails, doing 'stat' on the file shows the
>>>> number of blocks that were allocated. In glusterfs, the number of
>>>> blocks is shown as zero, which makes tools like "du" report zero
>>>> consumption. This difference comes from how libglusterfs handles
>>>> sparse files etc. when calculating the number of blocks (mentioned in
>>>> [1]).
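>>>>
>>>> To illustrate the kind of calculation involved, here is a hypothetical
>>>> sketch of that sparse-file handling (my paraphrase of the idea, not the
>>>> literal libglusterfs code; the function name is made up): the reported
>>>> block count is capped at what the file size alone would account for, so
>>>> a file whose fallocate failed midway (st_size still 0, st_blocks > 0 on
>>>> the brick) ends up reporting 0 blocks.
>>>>
>>>>     #include <stdint.h>
>>>>
>>>>     /* Hypothetical sketch, not the actual libglusterfs code. */
>>>>     static uint64_t
>>>>     adjusted_blocks(uint64_t ia_size, uint64_t ia_blocks)
>>>>     {
>>>>         /* Blocks that the file size alone would account for,
>>>>          * in the 512-byte units stat uses. */
>>>>         uint64_t size_blocks = (ia_size + 511) / 512;
>>>>
>>>>         /* Anything beyond that is treated as preallocation and
>>>>          * hidden, which is what makes "du" report zero here. */
>>>>         return (ia_blocks > size_blocks) ? size_blocks : ia_blocks;
>>>>     }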
>>>>
>>>> At this point I can think of three ways to handle this.
>>>>
>>>> 1) Except for the number of blocks shown in the stat output for the
>>>> file from the mount point (on which fallocate was done), the remaining
>>>> behavior of attempting to allocate the requested size and failing when
>>>> the filesystem becomes full is already similar to that of XFS.
>>>>
>>>> Hence, what is required is a solution for how libglusterfs calculates
>>>> blocks for sparse files etc. (without breaking any of the existing
>>>> components and features). That would make the behavior similar to that
>>>> of the backend filesystem, but it might take its own time to fix the
>>>> libglusterfs logic without impacting anything else.
>>>>
>>>> I think we should just revert commit
>>>> b1a5fa55695f497952264e35a9c8eb2bbf1ec4c3 (BZ 817343) and see if it
>>>> really breaks anything (or check whether whatever it breaks is
>>>> something we can live with). XFS speculative preallocation is not
>>>> permanent and the extra space is freed up eventually. It can be sped
>>>> up via a procfs tunable:
>>>> http://xfs.org/index.php/XFS_FAQ#Q:_How_can_I_speed_up_or_avoid_delayed_removal_of_speculative_preallocation.3F.
>>>> We could also tune the xfs allocsize mount option to a low value like
>>>> 4k so that glusterfs quota is not affected.
>>>>
>>>> FWIW, ENOSPC is not the only fallocate problem in gluster caused by
>>>> the 'iatt->ia_blocks' tweaking. It also breaks the --keep-size option
>>>> (i.e. the FALLOC_FL_KEEP_SIZE flag in fallocate(2)) and reports an
>>>> incorrect du size.
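>>>>
>>>> A quick way to see the --keep-size breakage (again a plain C sketch,
>>>> not glusterfs code; "testfile" is a made-up name): with
>>>> FALLOC_FL_KEEP_SIZE, st_size must stay 0 while st_blocks grows, but
>>>> through a glusterfs mount the tweaked block count hides the
>>>> allocation entirely.
>>>>
>>>>     #define _GNU_SOURCE
>>>>     #include <fcntl.h>
>>>>     #include <stdio.h>
>>>>     #include <sys/stat.h>
>>>>     #include <unistd.h>
>>>>
>>>>     int main(void)
>>>>     {
>>>>         int fd = open("testfile", O_CREAT | O_RDWR, 0644);
>>>>         if (fd < 0)
>>>>             return 1;
>>>>
>>>>         /* Reserve 1MB without changing the file size. */
>>>>         if (fallocate(fd, FALLOC_FL_KEEP_SIZE, 0, 1 << 20) < 0)
>>>>             return 1;
>>>>
>>>>         struct stat st;
>>>>         fstat(fd, &st);
>>>>         /* xfs: st_size == 0, st_blocks == 2048.
>>>>          * glusterfs mount: st_blocks == 0, so "du" sees nothing. */
>>>>         printf("size=%lld blocks=%lld\n",
>>>>                (long long)st.st_size, (long long)st.st_blocks);
>>>>         close(fd);
>>>>         return 0;
>>>>     }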
>>>>
>>>> Regards,
>>>> Ravi
>>>>
>>>>
>>>> OR
>>>>
>>>> 2) Once the fallocate fails in the backend filesystem, make the posix
>>>> xlator in the brick truncate the file back to the size it had before
>>>> the fallocate attempt. A patch [2] has been sent for this. But there
>>>> is an issue with this when parallel writes and fallocate operations
>>>> happen on the same file: it can lead to data loss (see the sketch
>>>> after the steps below).
>>>>
>>>> a) statpre is obtained, i.e. before fallocate is attempted, get the
>>>> stat and hence the size of the file
>>>> b) a parallel write fop that extends the same file succeeds
>>>> c) fallocate fails
>>>> d) ftruncate truncates the file to the size given by statpre (the
>>>> stat obtained in step a), discarding the data written in step b
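>>>>
>>>> In code terms, the idea in [2] looks roughly like the following
>>>> sketch (illustrative only, not the literal patch; the function name
>>>> is made up), with the race window marked:
>>>>
>>>>     #define _GNU_SOURCE
>>>>     #include <errno.h>
>>>>     #include <fcntl.h>
>>>>     #include <sys/stat.h>
>>>>     #include <unistd.h>
>>>>
>>>>     /* Sketch of option 2: roll the file back to its pre-op size
>>>>      * when fallocate fails. */
>>>>     static int
>>>>     fallocate_with_rollback(int fd, off_t offset, off_t len)
>>>>     {
>>>>         struct stat statpre;
>>>>
>>>>         if (fstat(fd, &statpre) < 0)        /* (a) pre-op size */
>>>>             return -errno;
>>>>
>>>>         /* (b) RACE WINDOW: a parallel write may extend the
>>>>          * file here. */
>>>>
>>>>         if (fallocate(fd, 0, offset, len) < 0) {
>>>>             int err = errno;                /* (c) e.g. ENOSPC */
>>>>             /* (d) Rollback also discards whatever the write in
>>>>              * (b) appended beyond statpre.st_size: data loss. */
>>>>             ftruncate(fd, statpre.st_size);
>>>>             return -err;
>>>>         }
>>>>         return 0;
>>>>     }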
>>>>
>>>> OR
>>>>
>>>> 3) Make posix check the available disk space before doing fallocate,
>>>> i.e. once posix gets the number of bytes to be allocated for the file
>>>> from a particular offset, it checks whether that many bytes are
>>>> available on the disk. If not, it fails the fallocate fop with ENOSPC
>>>> (without attempting it on the backend filesystem), roughly as in the
>>>> sketch below.
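>>>>
>>>> A rough sketch of that check (illustrative only; "export_path" stands
>>>> in for the brick's export directory and the function name is made up):
>>>>
>>>>     #define _GNU_SOURCE
>>>>     #include <errno.h>
>>>>     #include <fcntl.h>
>>>>     #include <stdint.h>
>>>>     #include <sys/statvfs.h>
>>>>     #include <sys/types.h>
>>>>
>>>>     /* Sketch of option 3: refuse up front when the filesystem
>>>>      * does not have enough free space. */
>>>>     static int
>>>>     fallocate_checked(int fd, const char *export_path,
>>>>                       off_t offset, off_t len)
>>>>     {
>>>>         struct statvfs svfs;
>>>>
>>>>         if (statvfs(export_path, &svfs) == 0) {
>>>>             uint64_t avail =
>>>>                 (uint64_t)svfs.f_bavail * svfs.f_frsize;
>>>>             /* Conservative: assumes the whole range needs new
>>>>              * blocks. */
>>>>             if ((uint64_t)len > avail)
>>>>                 return -ENOSPC;  /* backend fs never touched */
>>>>         }
>>>>
>>>>         /* (a parallel write can consume the space right here,
>>>>          * which is the race discussed below) */
>>>>
>>>>         if (fallocate(fd, 0, offset, len) < 0)
>>>>             return -errno;
>>>>         return 0;
>>>>     }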
>>>>
>>>> There is still a chance of a parallel write happening while this
>>>> fallocate is in progress, so that by the time the fallocate system
>>>> call is attempted on the disk, the available space has become less
>>>> than what was calculated before fallocate.
>>>> i.e. the following can happen
>>>>
>>>>  a) statfs: get the available space of the backend filesystem
>>>>  b) a parallel write succeeds and extends the file, consuming space
>>>>  c) fallocate is attempted assuming there is sufficient space in the
>>>> backend
>>>>
>>>> While the above situation can arise, I think we are still fine,
>>>> because fallocate is attempted from the offset received in the fop.
>>>> So, irrespective of whether a write extended the file or not, the
>>>> fallocate itself will be attempted for that many bytes from the
>>>> offset, which we had found to be available from the statfs
>>>> information.
>>>>
>>>> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1724754#c3
>>>> [2] https://review.gluster.org/#/c/glusterfs/+/22969/
>>>>
>>>>
>>> option 2) will affect performance if we have to serialize all the data
>>> operations on the file.
>>> option 3) can still lead to the same problem we are trying to solve,
>>> just reached in a different way.
>>>          - thread-1: fallocate comes in with 1MB size; statfs says
>>> there is 1MB of space.
>>>          - thread-2: a write of 128KB on a different file is attempted
>>> and succeeds.
>>>          - thread-1: fallocate on the file fails after partially
>>> allocating, because 1MB is no longer available.
>>>
>>>
>> Here I have a doubt. Even if the 128KB write succeeds, IIUC fallocate
>> will try to reserve 1MB of space relative to the offset that was
>> received as part of the fallocate call, and that space was found to be
>> available. So, despite the write succeeding, the region fallocate aimed
>> at was 1MB of space from a particular offset. As long as that is
>> available, can posix still go ahead and perform the fallocate
>> operation?
>>
>
> It can go ahead and perform the operation. It is just that, in the case
> I mentioned, it will lead to partial success, because the space
> fallocate wants to reserve is not available.
>
>
>>
>> Regards,
>> Raghavendra
>>
>>
>>
>>
>>> So option 1 is what we need to explore and fix, so that the behavior
>>> is closer to that of other posix filesystems. Maybe start with what
>>> Ravi suggested?
>>>
>>>
>>>> Please provide feedback.
>>>>
>>>> Regards,
>>>> Raghavendra
>>>>
>>>>
>>>> _______________________________________________
>>>>
>>>> Community Meeting Calendar:
>>>>
>>>> APAC Schedule -
>>>> Every 2nd and 4th Tuesday at 11:30 AM IST
>>>> Bridge: https://bluejeans.com/836554017
>>>>
>>>> NA/EMEA Schedule -
>>>> Every 1st and 3rd Tuesday at 01:00 PM EDT
>>>> Bridge: https://bluejeans.com/486278655
>>>>
>>>> Gluster-devel mailing list
>>>> Gluster-devel at gluster.org
>>>> https://lists.gluster.org/mailman/listinfo/gluster-devel
>>>>
>>>>
>>>
>>> --
>>> Pranith
>>>
>>
>
> --
> Pranith
>

