[Gluster-devel] Moratorium on new patch acceptance

Vijaikumar M vmallika at redhat.com
Thu May 21 08:04:14 UTC 2015



On Tuesday 19 May 2015 09:50 PM, Shyam wrote:
> On 05/19/2015 11:23 AM, Vijaikumar M wrote:
>>
>>
>> On Tuesday 19 May 2015 08:36 PM, Shyam wrote:
>>> On 05/19/2015 08:10 AM, Raghavendra G wrote:
>>>> After discussion with Vijaykumar Mallikarjuna and other inputs in this
>>>> thread, we are proposing that all quota tests comply with the following
>>>> criteria (a rough sketch of both options is given below):
>>>>
>>>> * always use dd with oflag=append (to make sure there are no parallel
>>>> writes) and conv=fdatasync (to make sure errors, if any, are delivered
>>>> to the application; turning off flush-behind is optional since
>>>> fdatasync acts as a barrier)
>>>>
>>>> OR
>>>>
>>>> * turn off write-behind in nfs client and glusterfs server.
>>>>
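>>>> Something along these lines (untested sketch; the volume options are
>>>> the ones already mentioned elsewhere in this thread):
>>>>
>>>>     # option 1: serialize and flush from the test itself
>>>>     dd if=/dev/zero of=$N0/$mydir/newfile bs=256k count=24 \
>>>>        oflag=append conv=fdatasync
>>>>
>>>>     # option 2: disable write aggregation on the NFS client and server
>>>>     gluster volume set $V0 performance.nfs.write-behind off
>>>>     gluster volume set $V0 performance.write-behind off
>>>>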
>>>> What do you people think is a better test scenario?
>>>>
>>>> Also, we don't yet have confirmation of the RCA that parallel writes are
>>>> indeed the culprit. We are trying to reproduce the issue locally.
>>>> @Shyam, it would be helpful if you could confirm the hypothesis :).
>>>
>>> Ummm... I thought we acknowledged that quota checks are done during the
>>> WIND and usage is updated during the UNWIND, that io-threads has IOs in
>>> flight (as well as IOs possibly waiting in its queue), and that we have
>>> 256K writes in the case mentioned. Put together, in my head this forms a
>>> good RCA: we write more than needed because of the in-flight IOs on the
>>> brick. The resolution is to control the in-flight IOs from the
>>> application side.
>>>
>>> In terms of actual proof, we would need to instrument the code and
>>> check. When you say it does not fail for you, does the file stop
>>> growing once the quota is reached, or does it end up at some random
>>> size greater than the quota? That by itself may explain or point to
>>> the RCA.
>>>
>>> The basic things needed from the application are (see the one-line
>>> example below):
>>> - Sync IOs, so that there aren't too many in-flight IOs and the
>>> application waits for each IO to complete
>>> - Based on the tests below, keeping the dd block size low and using
>>> oflag=sync achieves this; with higher block sizes it does not
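>>> For example (just a sketch, using the same variables as the tests below):
>>>   dd if=/dev/zero of=$N0/$mydir/newfile bs=256k count=24 oflag=sync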
>>>
>>> Test results:
>>> 1) noac:
>>>   - NFS sends a COMMIT (internally translated to a flush) after each IO
>>> request (NFS WRITEs are still sent with the UNSTABLE flag)
>>>   - Ensures the prior IO is complete before the next IO request is sent
>>> (due to waiting on the COMMIT)
>>>   - Fails if the IO size is large, i.e. in the test case being discussed
>>> I changed the failing dd line to "TEST ! dd if=/dev/zero
>>> of=$N0/$mydir/newfile_2 *bs=10M* count=1 conv=fdatasync" and this fails
>>> at times, as the writes here are sent as 256k chunks to the server and
>>> we still see the same behavior
>>>   - noac + performance.nfs.flush-behind: off +
>>> performance.flush-behind: off + performance.nfs.strict-write-ordering:
>>> on + performance.strict-write-ordering: on +
>>> performance.nfs.write-behind: off + performance.write-behind: off
>>>     - Still see similar failures, i.e. at times the 10MB file is
>>> created successfully by the modified dd command above
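>>>     - (For reference, one way to set that combination; the exact NFS
>>> mount invocation may differ per setup:
>>>         mount -t nfs -o vers=3,noac $H0:/$V0 $N0
>>>         gluster volume set $V0 performance.nfs.flush-behind off
>>>         gluster volume set $V0 performance.flush-behind off
>>>         gluster volume set $V0 performance.nfs.strict-write-ordering on
>>>         gluster volume set $V0 performance.strict-write-ordering on
>>>         gluster volume set $V0 performance.nfs.write-behind off
>>>         gluster volume set $V0 performance.write-behind off)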
>>>
>>> Overall, the switch works, but not always. If we are to use this
>>> variant then we need to state that quota tests using dd must not try
>>> to go beyond the configured quota limit in a single dd IO.
>>>
>>> 2) oflag=sync:
>>>   - Exactly the same behavior as above.
>>>
>>> 3) Added all of the above (and possibly the kitchen sink) to the test
>>> case, as attached, and still see failures,
>>>   - Yes, I have made the test fail intentionally (of sorts) by using
>>> 3M per dd IO and 2 IOs to go beyond the quota limit.
>>>   - The intention is to demonstrate that we still get parallel IOs
>>> from the NFS client
>>>   - The test would work if we reduce the block size per IO (though
>>> reliability is borderline here, and we would need specific rules such
>>> as the block size and how many blocks to write before we assert that
>>> the quota is exceeded, etc.)
>>>   - The test would also work if we first just go beyond the quota, and
>>> then check that a separate dd instance is *not* able to exceed the
>>> quota. Which is why I put up that patch.
>>>
>>> What next?
>>>
>> Hi Shyam,
>>
>> I tried running the test with the dd option 'oflag=append' and didn't
>> see the issue. Can you please try this option and see if it works?
>
> Did that (in the attached script that I sent) and it still failed.
>
> Please note:
> - This dd command passes (or fails with EDQUOT)
>   - dd if=/dev/zero of=$N0/$mydir/newfile_2 bs=512 count=10240 
> oflag=append oflag=sync conv=fdatasync
>   - We can even drop append and fdatasync, as sync sends a commit per 
> block written, which is better for the test and for quota enforcement, 
> whereas fdatasync sends one only at the end and sometimes fails (with 
> larger block sizes, say 1M)
>   - We can change bs to [512 - 256k]
>
> - This dd command fails (or writes all the data)
>   - dd if=/dev/zero of=$N0/$mydir/newfile_2 bs=3M count=2 oflag=append 
> oflag=sync conv=fdatasync
>
> The reasoning is that when we write with a larger block size, NFS sends 
> the write as multiple 256k chunks and only then sends the commit, before 
> the next block. As a result, if we exceed the quota only in the *last 
> block* we write, the test *may* fail. If we exceed the quota in the 
> last-but-one block, the test will pass.
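>
> (Rough arithmetic: with bs=3M each dd write turns into 3M / 256k = 12 NFS 
> WRITEs that can be in flight together before the COMMIT, whereas with 
> bs <= 256k there is at most one chunk outstanding per sync point.)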
>
> Hope this shorter version explains it better.
>
> (VijayM is educating me on quota (over IM), and it looks like the 
> quota update happens as a synctask in the background, so even after the 
> flush (NFS COMMIT) we may still have a race)
>
> Post-education solution:
> - Quota updates the on-disk xattr in a synctask; as a result, even if we 
> exceeded the quota in the (n-1)th block, there is no guarantee that the 
> nth block will fail, as the synctask may not have completed
>
> So I think we need to do the following for the quota-based tests 
> (expanding on the provided patch, http://review.gluster.org/#/c/10811/ ; 
> a rough sketch follows the list)
> - First, a dd that exceeds the quota (with either oflag=sync or 
> conv=fdatasync, so that we do not see any flush-behind or write-behind 
> effects), done without asserting on its result
> - Next, check in an EXPECT_WITHIN that the quota is exceeded (maybe also 
> check that the just created/appended file has at least the minimum size 
> that would make it exceed the quota)
> - Then do a further dd, to a new file or appending to an existing file, 
> to get the EDQUOT error
> - Proceed with whatever the test case needs to do next
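>
> A rough sketch of that sequence (untested; "quota_exceeded" is a 
> hypothetical helper and the timeout is arbitrary):
>
>   # cross the limit first; no assertion on dd's exit status
>   dd if=/dev/zero of=$N0/$mydir/newfile_1 bs=256k count=24 oflag=sync
>
>   # wait for the marker synctask to bring the accounting up to date
>   EXPECT_WITHIN 20 "Yes" quota_exceeded $V0 /$mydir
>
>   # only now assert that a further write fails with EDQUOT
>   TEST ! dd if=/dev/zero of=$N0/$mydir/newfile_2 bs=256k count=1 oflag=sync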
>
> Suggestions?
>

Here is my analysis of the spurious failure in test case 
tests/bugs/distribute/bug-1161156.t.
In release-3.7, marker was refactored to use a synctask to do the 
accounting in the background.
I ran the tests below with different combinations and found that 
parallel writes are causing the spurious failure.
I have filed bug #1223658 to track the parallel-write issue with quota.


1) Parallel writes and Marker background update (Test always fails)
     TEST ! dd if=/dev/zero of=$N0/$mydir/newfile_2 bs=3M count=2 
conv=fdatasync oflag=sync oflag=append

     The NFS client breaks the 3M writes into multiple 256k chunks and 
issues them in parallel

2) Parallel writes and Marker foreground update (Test always fails)
     TEST ! dd if=/dev/zero of=$N0/$mydir/newfile_2 bs=3M count=2 
conv=fdatasync oflag=sync oflag=append

     Made a marker code change to do the quota accounting in the 
foreground (without a synctask)

3) Serial writes and Marker background update (Test passed 100/100 times)
     TEST ! dd if=/dev/zero of=$N0/$mydir/newfile_2 bs=256k count=24 
conv=fdatasync oflag=sync oflag=append

     Using a smaller block size (256k), so that the NFS client issues 
fewer parallel writes

4) Serial writes and Marker foreground update (Test passed 100/100 times)
     TEST ! dd if=/dev/zero of=$N0/$mydir/newfile_2 bs=256k count=24 
conv=fdatasync oflag=sync oflag=append

     Using a smaller block size (256k), so that the NFS client issues 
fewer parallel writes
     Made a marker code change to do the quota accounting in the 
foreground (without a synctask)

5) Parallel writes on release-3.6 (Test always fails)
     TEST ! dd if=/dev/zero of=$N0/$mydir/newfile_2 bs=3M count=2 
conv=fdatasync oflag=sync oflag=append
     Moved marker xlator above IO-Threads in the graph.
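
(Note: all the combinations above write the same total amount of data, 
2 x 3M = 24 x 256k = 6M; only the IO pattern seen by the server differs.)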

Thanks,
Vijay


>>
>> Thanks,
>> Vijay
>>
>>>>
>>>> regards,
>>>> Raghavendra.
>>>>
>>>> On Tue, May 19, 2015 at 5:27 PM, Raghavendra G 
>>>> <raghavendra at gluster.com> wrote:
>>>>
>>>>
>>>>
>>>>     On Tue, May 19, 2015 at 4:26 PM, Jeff Darcy <jdarcy at redhat.com> wrote:
>>>>
>>>>         > No, my suggestion was aimed at not having parallel writes.
>>>>         > In this case quota won't even fail the writes with EDQUOT
>>>>         > because of reasons explained above.
>>>>         > Yes, we need to disable flush-behind along with this so
>>>>         > that errors are delivered to the application.
>>>>
>>>>         Would conv=sync help here?  That should prevent any kind of
>>>>         write parallelism.
>>>>
>>>>
>>>>     An strace of dd shows that
>>>>
>>>>     * fdatasync is issued only once at the end of all writes when
>>>>     conv=fdatasync
>>>>     * for some strange reason no fsync or fdatasync is issued at all
>>>>     when conv=sync
>>>>
>>>>     So, using conv=fdatasync in the test cannot prevent
>>>>     write-parallelism induced by write-behind. Parallelism would've 
>>>> been
>>>>     prevented only if dd had issued fdatasync after each write or 
>>>> opened
>>>>     the file with O_SYNC.
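>>>>
>>>>     (One way to verify this, purely as an illustration:
>>>>      strace -e trace=write,fsync,fdatasync dd if=/dev/zero \
>>>>          of=/mnt/testfile bs=1M count=4 conv=fdatasync)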
>>>>
>>>>         If it doesn't, I'd say that's a true test failure somewhere in
>>>>         our stack.  A
>>>>         similar possibility would be to invoke dd multiple times with
>>>>         oflag=append.
>>>>
>>>>
>>>>     Yes, appending writes curb parallelism (at least in glusterfs,
>>>>     though I am not sure how the NFS client behaves) and hence can be
>>>>     used as an alternative solution.
>>>>
>>>>     On a slightly unrelated note, flush-behind is immaterial in this
>>>>     test since fdatasync anyway acts as a barrier.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>     --
>>>>     Raghavendra G
>>>>
>>>>
>>>>
>>>>
>>>> -- 
>>>> Raghavendra G
>>>>
>>>>
>>>>
>>>
>>>
>>


