<div dir="ltr"><div dir="ltr"><br></div><div dir="ltr"></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, Jul 3, 2019 at 3:28 AM Pranith Kumar Karampuri &lt;<a href="mailto:pkarampu@redhat.com" target="_blank">pkarampu@redhat.com</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div dir="ltr"><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, Jul 3, 2019 at 10:14 AM Ravishankar N &lt;<a href="mailto:ravishankar@redhat.com" target="_blank">ravishankar@redhat.com</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

  <div bgcolor="#FFFFFF">

    <p><br>

    </p>

    <div class="gmail-m_-7593982123975099570gmail-m_7510057219184194703gmail-m_-7097077499810939718moz-cite-prefix">On 02/07/19 8:52 PM, FNU Raghavendra

      Manjunath wrote:<br>

    </div>

    <blockquote type="cite">

      <div dir="ltr"><br>

        <div>Hi All,</div>

        <div><br>

        </div>

        <div>In glusterfs, there is an issue with fallocate

          behavior. In short, if someone runs fallocate from the mount

          point with a size greater than the space available

          in the backend filesystem where the file resides, then

          fallocate can allocate only a subset of the required

          blocks and then fail in the backend filesystem

          with an ENOSPC error.</div>

        <div><br>

        </div>

        <div>The behavior of fallocate itself is similar to how it

          would be on a disk filesystem (at least on xfs, where it was

          checked), i.e. it allocates a subset of the required number of

          blocks and then fails with ENOSPC. The file itself would

          show the number of blocks in stat as whatever was allocated

          as part of fallocate. Please refer to [1], where the issue is

          explained.</div>
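        <div>For illustration, a minimal sketch (plain Python, not gluster code) of comparing the apparent size of a file with the space reported as allocated after a fallocate call. The helper name allocation_report is made up for this example, and the 512-byte unit for st_blocks is the Linux convention.</div>
<pre>
```python
import os
import tempfile


def allocation_report(path):
    """Return (apparent size in bytes, allocated bytes) for path."""
    st = os.stat(path)
    # On Linux, st_blocks is counted in 512-byte units.
    return st.st_size, st.st_blocks * 512


with tempfile.NamedTemporaryFile() as tmp:
    # Ask the filesystem to reserve 1 MiB starting at offset 0.
    os.posix_fallocate(tmp.fileno(), 0, 1024 * 1024)
    size, allocated = allocation_report(tmp.name)
    print("apparent size:", size, "allocated bytes:", allocated)
```
</pre>
        <div>Running the same check on a file after a fallocate that failed with ENOSPC, once on the backend filesystem and once through the mount point, is one way to see the difference in the reported block count.</div>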

        <div><br>

        </div>

        <div>Now, there is one small difference in behavior

          between glusterfs and xfs.</div>

        <div>In xfs, after fallocate fails, doing &#39;stat&#39; on the file

          shows the number of blocks that have been allocated, whereas

          in glusterfs the number of blocks is shown as zero, which

          makes tools like &quot;du&quot; show zero consumption. This difference

          in glusterfs comes from how libglusterfs

          handles sparse files etc. when calculating the number of blocks

          (mentioned in [1]).</div>

        <div><br>

        </div>

        <div>At this point I can think of 3 ways to handle

          this.</div>

        <div><br>

        </div>

        <div>1) Except for how many blocks are shown in the stat output

          for the file from the mount point (on which fallocate was

          done), the remaining behavior of attempting to allocate the

          requested size and failing when the filesystem becomes full is

          similar to that of XFS. </div>

        <div><br>

        </div>

        <div>Hence, what is required is a solution for

          how libglusterfs calculates blocks for sparse files etc.

          (without breaking any of the existing components and

          features). This makes the behavior similar to that of the backend

          filesystem. Fixing the libglusterfs logic without impacting

          anything else might take some time.</div>

      </div>

    </blockquote>

    <p>I think we should just revert the commit

      b1a5fa55695f497952264e35a9c8eb2bbf1ec4c3 (BZ 817343) and see if it

      really breaks anything (or check whether whatever it breaks is something

      that we can live with). XFS speculative preallocation is not

      permanent, and the extra space is eventually freed. This can be

      sped up via a procfs tunable:

<a class="gmail-m_-7593982123975099570gmail-m_7510057219184194703gmail-m_-7097077499810939718moz-txt-link-freetext" href="http://xfs.org/index.php/XFS_FAQ#Q:_How_can_I_speed_up_or_avoid_delayed_removal_of_speculative_preallocation.3F" target="_blank">http://xfs.org/index.php/XFS_FAQ#Q:_How_can_I_speed_up_or_avoid_delayed_removal_of_speculative_preallocation.3F</a>. 

      We could also tune the allocsize option to a low value like 4k so

      that glusterfs quota is not affected.<br>

    </p>

    <p> FWIW, ENOSPC is not the only fallocate problem in gluster

      because of  &#39;iatt-&gt;ia_block&#39; tweaking. It also breaks the

      --keep-size option (i.e. the FALLOC_FL_KEEP_SIZE flag in

      fallocate(2)) and reports incorrect du size. <br></p></div></blockquote><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div bgcolor="#FFFFFF"><p>

    </p>

    Regards,<br>

    Ravi<br>

    <blockquote type="cite">

      <div dir="ltr">

        <div><br>

        </div>

        <div>OR</div>

        <div><br>

        </div>

        <div>2) Once the fallocate fails in the backend filesystem, make

          the posix xlator in the brick truncate the file to the

          size it had before fallocate was attempted. A patch [2] has

          been sent for this. But there is an issue with this when

          parallel write and fallocate operations happen on the

          same file: it can lead to data loss.</div>

        <div><br>

        </div>

        <div><span style="color:rgb(53,53,53);font-family:sans-serif;white-space:pre-wrap">a) statpre is obtained ===&gt; before fallocate is attempted, get the stat, and hence the size, of the file

b) A parallel write fop on the same file that extends the file succeeds

c) Fallocate fails 

d) ftruncate truncates the file to the size given by statpre (i.e. the size obtained in step a), discarding the data written in step b</span><br>

        </div>
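        <div>The sequence a) to d) can be reproduced deterministically on a local file. This is only a sketch in Python, not the posix xlator code: failed_fallocate_rollback is a hypothetical stand-in for the rollback in patch [2], and the failing fallocate in step c is assumed rather than actually triggered.</div>
<pre>
```python
import os
import tempfile


def failed_fallocate_rollback(fd, size_before):
    """Hypothetical rollback: truncate back to the size recorded before fallocate."""
    os.ftruncate(fd, size_before)


with tempfile.TemporaryFile() as f:
    f.write(b"A" * 100)
    f.flush()
    fd = f.fileno()

    # step a: statpre records the size before fallocate is attempted
    size_before = os.fstat(fd).st_size
    # step b: a parallel write extends the file by 50 bytes
    os.pwrite(fd, b"B" * 50, size_before)
    size_after_write = os.fstat(fd).st_size
    # step c: assume fallocate failed with ENOSPC here (not actually run)
    # step d: rollback truncates to the stale size from step a
    failed_fallocate_rollback(fd, size_before)
    size_final = os.fstat(fd).st_size

print(size_before, size_after_write, size_final)  # prints: 100 150 100
```
</pre>
        <div>The 50 bytes written in step b are gone after step d, which is the data loss described above.</div>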

        <div><br>

        </div>

        <div>OR </div>

        <div><br>

        </div>

        <div>3) Make posix check the available disk space before doing

          fallocate, i.e. once posix gets the number of

          bytes to be allocated for the file from a particular offset,

          it checks whether that many bytes are available on the

          disk. If not, it fails the fallocate fop with ENOSPC (without

          attempting it on the backend filesystem). </div>
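        <div>A rough sketch of such a pre-check, in Python rather than the posix xlator's C. The function name fallocate_with_precheck is made up, and using f_bavail (space available to unprivileged users) instead of f_bfree is an assumption. It also treats the whole requested length as new space, ignoring blocks already allocated in that range.</div>
<pre>
```python
import errno
import os
import tempfile


def fallocate_with_precheck(fd, offset, length):
    """Fail early with ENOSPC if statvfs reports too little free space."""
    vfs = os.fstatvfs(fd)
    # Assumption: f_bavail * f_frsize approximates the usable free bytes.
    available = vfs.f_bavail * vfs.f_frsize
    if length > available:
        raise OSError(errno.ENOSPC, os.strerror(errno.ENOSPC))
    # The free space can still shrink between the check and this call.
    os.posix_fallocate(fd, offset, length)


with tempfile.TemporaryFile() as f:
    # A small request passes the check and is forwarded to the filesystem.
    fallocate_with_precheck(f.fileno(), 0, 4096)
    small_size = os.fstat(f.fileno()).st_size

    # An absurdly large request is rejected before touching the filesystem.
    try:
        fallocate_with_precheck(f.fileno(), 0, 2 ** 62)
        precheck_error = None
    except OSError as exc:
        precheck_error = exc.errno
```
</pre>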

        <div><br>

        </div>

        <div>There is still a possibility of a parallel write happening

          while this fallocate is in progress, so by the time the fallocate

          system call is attempted on the disk, the available space

          might be less than what was calculated before

          fallocate.</div>

        <div>i.e. following things can happen</div>

        <div><br>

        </div>

        <div> a) statfs ===&gt; get the available space of the backend

          filesystem</div>

        <div> b) a parallel write succeeds and extends the file</div>

        <div> c) fallocate is attempted assuming there is sufficient

          space in the backend</div>

        <div><br>

        </div>

        <div>While the above situation can arise, I think we are still

          fine, because fallocate is attempted from the offset received

          in the fop. So, irrespective of whether the write extended the

          file or not, the fallocate itself will be attempted for that

          many bytes from the offset, which we found to be available

          from the statfs information.</div>

        <div><br>

        </div>

        <div>[1] <a href="https://bugzilla.redhat.com/show_bug.cgi?id=1724754#c3" target="_blank">https://bugzilla.redhat.com/show_bug.cgi?id=1724754#c3</a></div>

        <div>[2] <a href="https://review.gluster.org/#/c/glusterfs/+/22969/" target="_blank">https://review.gluster.org/#/c/glusterfs/+/22969/</a></div></div></blockquote></div></blockquote><div><br></div><div>option 2) will affect performance if we have to serialize all the data operations on the file.</div><div>option 3) can still lead to the same problem we are trying to solve, in a different way.</div><div>         - thread-1: fallocate arrives with a 1MB size; statfs says there is 1MB of space.</div><div>         - thread-2: a 128KB write on a different file is attempted and succeeds.</div><div>         - thread-1: fallocate fails on the file after a partial allocation because 1MB is no longer available.</div><div><br></div></div></div></blockquote><div><br></div><div>I have a question here. Even if a 128K write on the file succeeds, IIUC fallocate will try to reserve 1MB of space relative to the offset received in the fallocate call, which was found to be available.</div><div>So, despite the write succeeding, the region fallocate targeted was 1MB of space from a particular offset. As long as that is available, can posix still go ahead and perform the fallocate operation?</div><div><br></div><div>Regards,</div><div>Raghavendra</div><div><br></div><div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div class="gmail_quote"><div></div><div>So option-1 is what we need to explore and fix so that the behavior is closer to other POSIX filesystems. Maybe start with what Ravi suggested?<br></div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div bgcolor="#FFFFFF"><blockquote type="cite"><div dir="ltr">

        <div><br>

        </div>

        <div>Please provide feedback.</div>

        <div><br>

        </div>

        <div>Regards,</div>

        <div>Raghavendra</div>

      </div>

      <br>

      <fieldset class="gmail-m_-7593982123975099570gmail-m_7510057219184194703gmail-m_-7097077499810939718mimeAttachmentHeader"></fieldset>

      <pre class="gmail-m_-7593982123975099570gmail-m_7510057219184194703gmail-m_-7097077499810939718moz-quote-pre">_______________________________________________

Community Meeting Calendar:

APAC Schedule -

Every 2nd and 4th Tuesday at 11:30 AM IST

Bridge: <a class="gmail-m_-7593982123975099570gmail-m_7510057219184194703gmail-m_-7097077499810939718moz-txt-link-freetext" href="https://bluejeans.com/836554017" target="_blank">https://bluejeans.com/836554017</a>

NA/EMEA Schedule -

Every 1st and 3rd Tuesday at 01:00 PM EDT

Bridge: <a class="gmail-m_-7593982123975099570gmail-m_7510057219184194703gmail-m_-7097077499810939718moz-txt-link-freetext" href="https://bluejeans.com/486278655" target="_blank">https://bluejeans.com/486278655</a>

Gluster-devel mailing list

<a class="gmail-m_-7593982123975099570gmail-m_7510057219184194703gmail-m_-7097077499810939718moz-txt-link-abbreviated" href="mailto:Gluster-devel@gluster.org" target="_blank">Gluster-devel@gluster.org</a>

<a class="gmail-m_-7593982123975099570gmail-m_7510057219184194703gmail-m_-7097077499810939718moz-txt-link-freetext" href="https://lists.gluster.org/mailman/listinfo/gluster-devel" target="_blank">https://lists.gluster.org/mailman/listinfo/gluster-devel</a>

</pre>

    </blockquote>

  </div>


</blockquote></div><br clear="all"><br>-- <br><div dir="ltr" class="gmail-m_-7593982123975099570gmail-m_7510057219184194703gmail_signature"><div dir="ltr">Pranith<br></div></div></div>

</blockquote></div>

</div>