[Gluster-devel] fallocate behavior in glusterfs

Pranith Kumar Karampuri pkarampu at redhat.com
Thu Jul 4 05:13:00 UTC 2019


On Wed, Jul 3, 2019 at 10:59 PM FNU Raghavendra Manjunath <rabhat at redhat.com>
wrote:

>
>
> On Wed, Jul 3, 2019 at 3:28 AM Pranith Kumar Karampuri <
> pkarampu at redhat.com> wrote:
>
>>
>>
>> On Wed, Jul 3, 2019 at 10:14 AM Ravishankar N <ravishankar at redhat.com>
>> wrote:
>>
>>>
>>> On 02/07/19 8:52 PM, FNU Raghavendra Manjunath wrote:
>>>
>>>
>>> Hi All,
>>>
>>> In glusterfs, there is an issue with the fallocate behavior. In
>>> short, if someone does fallocate from the mount point with a size that
>>> is greater than the space available in the backend filesystem where the
>>> file is present, then fallocate can fail in the backend filesystem with
>>> an ENOSPC error after allocating only a subset of the required number of
>>> blocks.
>>>
>>> The behavior of fallocate itself is similar to how it would have been
>>> on a disk filesystem (at least on xfs, where it was checked): it allocates
>>> a subset of the required number of blocks and then fails with ENOSPC, and
>>> stat on the file shows the number of blocks to be whatever was allocated
>>> as part of the fallocate. Please refer to [1], where the issue is
>>> explained.
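>>>
>>> (For illustration, a minimal standalone reproducer of the behavior
>>> described above; the mount path and the requested size are hypothetical,
>>> and the program should be run against a nearly full filesystem.)
>>>
>>>   /* fallocate() more space than the filesystem has left and inspect
>>>    * what stat() reports afterwards.  On xfs the failed call still
>>>    * leaves st_blocks showing the partially allocated extent. */
>>>   #define _GNU_SOURCE
>>>   #include <errno.h>
>>>   #include <fcntl.h>
>>>   #include <stdio.h>
>>>   #include <string.h>
>>>   #include <sys/stat.h>
>>>   #include <unistd.h>
>>>
>>>   int main(void)
>>>   {
>>>       int fd = open("/mnt/glustervol/testfile", O_CREAT | O_RDWR, 0644);
>>>       if (fd < 0) {
>>>           perror("open");
>>>           return 1;
>>>       }
>>>
>>>       /* Ask for more than the available space so the call fails midway
>>>        * (10GB here is just an example). */
>>>       if (fallocate(fd, 0, 0, 10LL * 1024 * 1024 * 1024) < 0)
>>>           fprintf(stderr, "fallocate failed: %s\n", strerror(errno));
>>>
>>>       struct stat st;
>>>       if (fstat(fd, &st) == 0)
>>>           printf("st_size=%lld st_blocks=%lld\n",
>>>                  (long long)st.st_size, (long long)st.st_blocks);
>>>
>>>       close(fd);
>>>       return 0;
>>>   }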
>>>
>>> Now, there is one small difference between the behavior of glusterfs and
>>> that of xfs. On xfs, after fallocate fails, doing 'stat' on the file shows
>>> the number of blocks that have been allocated, whereas in glusterfs the
>>> number of blocks is shown as zero, which makes tools like "du" show zero
>>> consumption. This difference in glusterfs comes from how libglusterfs
>>> calculates the number of blocks for sparse files etc. (mentioned in
>>> [1]).
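>>>
>>> (Illustrative only, and not the actual libglusterfs code: a hypothetical
>>> sketch of the kind of block-count adjustment described above. If the
>>> reported block count is capped at what the file size implies, blocks
>>> allocated by a fallocate that failed before the size was extended become
>>> invisible, and "du" shows zero consumption.)
>>>
>>>   /* Hypothetical sketch; NOT the actual libglusterfs implementation. */
>>>   #include <stdint.h>
>>>   #include <sys/stat.h>
>>>
>>>   static uint64_t
>>>   adjusted_block_count(const struct stat *st)
>>>   {
>>>       /* Blocks that the file size alone would account for
>>>        * (st_blocks is in 512-byte units). */
>>>       uint64_t expected = ((uint64_t)st->st_size + 511) / 512;
>>>
>>>       /* If the backend reports more blocks than the size implies
>>>        * (e.g. xfs speculative preallocation), report only the
>>>        * size-derived count.  For a file whose size was never extended,
>>>        * such as one on which fallocate failed, this returns 0. */
>>>       if ((uint64_t)st->st_blocks > expected)
>>>           return expected;
>>>
>>>       return (uint64_t)st->st_blocks;
>>>   }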
>>>
>>> At this point I can think of three ways to handle this.
>>>
>>> 1) Except for the number of blocks shown in the stat output for the file
>>> from the mount point (on which the fallocate was done), the remaining
>>> behavior of attempting to allocate the requested size and failing when the
>>> filesystem becomes full is already similar to that of XFS.
>>>
>>> Hence, what is required is a solution for how libglusterfs calculates
>>> blocks for sparse files etc. (without breaking any of the existing
>>> components and features). That would make the behavior similar to that of
>>> the backend filesystem, but fixing the libglusterfs logic without
>>> impacting anything else might take its own time.
>>>
>>> I think we should just revert commit
>>> b1a5fa55695f497952264e35a9c8eb2bbf1ec4c3 (BZ 817343) and see if it really
>>> breaks anything (or check whether whatever it breaks is something we can
>>> live with). XFS speculative preallocation is not permanent and the extra
>>> space is freed up eventually. It can be sped up via a procfs tunable:
>>> http://xfs.org/index.php/XFS_FAQ#Q:_How_can_I_speed_up_or_avoid_delayed_removal_of_speculative_preallocation.3F.
>>> We could also tune the allocsize option to a low value like 4k so that
>>> glusterfs quota is not affected.
>>>
>>> FWIW, ENOSPC is not the only fallocate problem in gluster caused by the
>>> 'iatt->ia_block' tweaking. It also breaks the --keep-size option (i.e. the
>>> FALLOC_FL_KEEP_SIZE flag in fallocate(2)) and reports incorrect du sizes.
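>>>
>>> (A minimal sketch of the --keep-size point, with a hypothetical path:
>>> FALLOC_FL_KEEP_SIZE allocates blocks without changing the file size, so
>>> any block count derived from st_size alone will report zero for it.)
>>>
>>>   #define _GNU_SOURCE
>>>   #include <fcntl.h>
>>>   #include <stdio.h>
>>>   #include <sys/stat.h>
>>>   #include <unistd.h>
>>>
>>>   int main(void)
>>>   {
>>>       int fd = open("/mnt/glustervol/keepsize", O_CREAT | O_RDWR, 0644);
>>>       if (fd < 0)
>>>           return 1;
>>>
>>>       /* Preallocate 1MB without extending the visible file size. */
>>>       fallocate(fd, FALLOC_FL_KEEP_SIZE, 0, 1024 * 1024);
>>>
>>>       /* On a local filesystem: st_size stays 0 while st_blocks grows to
>>>        * ~2048 (512-byte units); a size-derived count would report 0. */
>>>       struct stat st;
>>>       if (fstat(fd, &st) == 0)
>>>           printf("st_size=%lld st_blocks=%lld\n",
>>>                  (long long)st.st_size, (long long)st.st_blocks);
>>>
>>>       close(fd);
>>>       return 0;
>>>   }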
>>>
>>> Regards,
>>> Ravi
>>>
>>>
>>> OR
>>>
>>> 2) Once the fallocate fails in the backend filesystem, make the posix
>>> xlator in the brick truncate the file back to the size it had before the
>>> fallocate was attempted. A patch [2] has been sent for this. But there is
>>> an issue with this when parallel writes and fallocate operations happen on
>>> the same file: it can lead to data loss.
>>>
>>> a) statpre is obtained ===> before fallocate is attempted, get the stat
>>> and hence the size of the file
>>> b) a parallel write fop on the same file extends the file and succeeds
>>> c) the fallocate fails
>>> d) ftruncate truncates the file to the size given by statpre (i.e. the
>>> previous stat and the size obtained in step a), discarding the data
>>> written in step b
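>>>
>>> (A simplified, hypothetical sketch of option 2 in plain syscall terms,
>>> not the actual posix xlator code or the patch in [2]; the comment marks
>>> the window in which a parallel extending write can be lost.)
>>>
>>>   #define _GNU_SOURCE
>>>   #include <errno.h>
>>>   #include <fcntl.h>
>>>   #include <sys/stat.h>
>>>   #include <unistd.h>
>>>
>>>   static int
>>>   fallocate_with_rollback(int fd, off_t offset, off_t len)
>>>   {
>>>       struct stat statpre;
>>>
>>>       /* step a: remember the size of the file before the operation */
>>>       if (fstat(fd, &statpre) < 0)
>>>           return -errno;
>>>
>>>       /* <-- step b can happen here: a parallel write extends the file */
>>>
>>>       if (fallocate(fd, 0, offset, len) == 0)
>>>           return 0;
>>>
>>>       int err = errno;
>>>       if (err == ENOSPC) {
>>>           /* steps c and d: fallocate failed, roll back to the pre-op
>>>            * size; any data written by the parallel write in step b is
>>>            * truncated away -- the data-loss scenario above. */
>>>           ftruncate(fd, statpre.st_size);
>>>       }
>>>       return -err;
>>>   }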
>>>
>>> OR
>>>
>>> 3) Make the posix xlator check the available disk space before doing the
>>> fallocate, i.e. once posix knows the number of bytes to be allocated for
>>> the file from a particular offset, it checks whether that many bytes are
>>> available on the disk. If not, it fails the fallocate fop with ENOSPC
>>> (without attempting it on the backend filesystem).
>>>
>>> There is still a possibility of a parallel write happening while this
>>> fallocate is in progress, so that by the time the fallocate system call
>>> is attempted on the disk, the available space is less than what was
>>> calculated before the fallocate.
>>> i.e. the following things can happen:
>>>
>>>  a) statfs ===> get the available space of the backend filesystem
>>>  b) a parallel write succeeds and extends the file
>>>  c) fallocate is attempted assuming there is sufficient space in the
>>> backend
>>>
>>> While the above situation can arise, I think we are still fine, because
>>> the fallocate is attempted from the offset received in the fop. So,
>>> irrespective of whether the write extended the file or not, the fallocate
>>> itself will be attempted for that many bytes from the offset, which we
>>> found to be available from the statfs information.
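>>>
>>> (A simplified, hypothetical sketch of option 3 in plain syscall terms,
>>> not the actual posix xlator code: check the free space with fstatvfs()
>>> before attempting the fallocate and fail early with ENOSPC if the
>>> request clearly cannot fit.)
>>>
>>>   #define _GNU_SOURCE
>>>   #include <errno.h>
>>>   #include <fcntl.h>
>>>   #include <sys/statvfs.h>
>>>
>>>   static int
>>>   fallocate_with_space_check(int fd, off_t offset, off_t len)
>>>   {
>>>       struct statvfs vfs;
>>>
>>>       /* step a: how much space is left on the backend filesystem */
>>>       if (fstatvfs(fd, &vfs) < 0)
>>>           return -errno;
>>>
>>>       unsigned long long avail =
>>>           (unsigned long long)vfs.f_bavail * vfs.f_frsize;
>>>
>>>       /* Conservative check (it ignores blocks already allocated in the
>>>        * requested range).  A parallel write -- step b -- can still
>>>        * consume space before the call below, so this is best effort. */
>>>       if (avail < (unsigned long long)len)
>>>           return -ENOSPC;
>>>
>>>       /* step c: attempt the actual allocation */
>>>       if (fallocate(fd, 0, offset, len) < 0)
>>>           return -errno;
>>>
>>>       return 0;
>>>   }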
>>>
>>> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1724754#c3
>>> [2] https://review.gluster.org/#/c/glusterfs/+/22969/
>>>
>>>
>> option 2) will affect performance if we have to serialize all the data
>> operations on the file.
>> option 3) can still lead to the same problem we are trying to solve, just
>> in a different way:
>>          - thread-1: fallocate comes in with a 1MB size; statfs says
>> there is 1MB of space.
>>          - thread-2: a 128KB write on a different file is attempted and
>> succeeds.
>>          - thread-1: the fallocate fails on the file after partially
>> allocating space, because 1MB is no longer available.
>>
>>
> Here I have a doubt. Even if a 128KB write on the file succeeds, IIUC
> fallocate will try to reserve 1MB of space relative to the offset that
> was received as part of the fallocate call, which was found to be
> available. So, despite the write succeeding, the region fallocate aimed
> at was 1MB of space from a particular offset. As long as that is
> available, can posix still go ahead and perform the fallocate operation?
>

It can go ahead and perform the operation. It is just that, in the case I
mentioned, it will lead to partial success because the space fallocate
wants to reserve is not available.


>
> Regards,
> Raghavendra
>
>
>
>
>> So option 1 is what we need to explore and fix so that the behavior is
>> closer to that of other POSIX filesystems. Maybe start with what Ravi
>> suggested?
>>
>>
>>> Please provide feedback.
>>>
>>> Regards,
>>> Raghavendra
>>>
>>
>> --
>> Pranith
>>
>

-- 
Pranith