[Gluster-Maintainers] [Gluster-devel] Release 5: Master branch health report (Week of 30th July)

Raghavendra Gowdappa rgowdapp at redhat.com
Fri Aug 3 11:41:22 UTC 2018


On Fri, Aug 3, 2018 at 4:01 PM, Kotresh Hiremath Ravishankar <
khiremat at redhat.com> wrote:

> Hi Du/Poornima,
>
> I was analysing the bitrot and geo-rep failures and I suspect a bug in
> some perf xlator was one of the causes. I was seeing the following
> behaviour in a few runs.
>
> 1. Geo-rep synced data to the slave. It creates an empty file and then
>     rsync syncs the data. But the test does "stat --format "%F" <file>"
>     to confirm: if the file is empty, stat returns "regular empty file",
>     else "regular file". I believe the test kept getting "regular empty
>     file" instead of "regular file" until the timeout.
>

https://review.gluster.org/20549 might be relevant.
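For context, the kind of check point 1 describes can be sketched as a small shell helper (the function name and timeout below are illustrative, not the actual test's):

```shell
# Poll until the synced file is a non-empty regular file, or give up.
# On the slave, geo-rep first creates an empty file ("regular empty file"
# per stat), and only after rsync writes data does stat report
# "regular file".
check_file_synced() {
    local file=$1 timeout=${2:-60} i=0
    while [ "$i" -lt "$timeout" ]; do
        if [ "$(stat --format '%F' "$file" 2>/dev/null)" = "regular file" ]; then
            return 0
        fi
        sleep 1
        i=$((i + 1))
    done
    return 1
}
```

If the perf xlator bug keeps returning the cached empty-file attributes, a loop like this only exits via the timeout, which matches the failure mode described.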


> 2. The other behaviour is with bitrot, under brick-mux. A file is deleted
>     on the backend on one brick and then a lookup is done. Which
>     performance xlators need to be disabled to get the lookup/revalidate
>     sent to the brick where the file was deleted? Earlier, only md-cache
>     had to be disabled and it used to work. Now it's failing
>     intermittently.
>

You need to disable readdirplus in the entire stack. Refer to
https://lists.gluster.org/pipermail/gluster-users/2017-March/030148.html
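As a hedged sketch only (the authoritative option list is in the linked post, and the option names below should be verified against it; <volname> and the mount point are placeholders), disabling readdirplus across the stack involves both volume options and the mount itself:

```shell
# Turn off readdirplus-related behaviour in the perf xlators
# (illustrative option names; verify against the linked gluster-users post).
gluster volume set <volname> performance.readdir-ahead off
gluster volume set <volname> performance.parallel-readdir off
gluster volume set <volname> performance.force-readdirp off

# Also disable readdirplus at the FUSE mount itself.
mount -t glusterfs -o use-readdirp=no <server>:/<volname> /mnt/glusterfs
```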


> Are there any pending patches around these areas that need to be merged?
> If there are, then it could be affecting other tests as well.
>
> Thanks,
> Kotresh HR
>
> On Fri, Aug 3, 2018 at 3:07 PM, Karthik Subrahmanya <ksubrahm at redhat.com>
> wrote:
>
>>
>>
>> On Fri, Aug 3, 2018 at 2:12 PM Karthik Subrahmanya <ksubrahm at redhat.com>
>> wrote:
>>
>>>
>>>
>>> On Thu, Aug 2, 2018 at 11:00 PM Karthik Subrahmanya <ksubrahm at redhat.com>
>>> wrote:
>>>
>>>>
>>>>
>>>> On Tue 31 Jul, 2018, 10:17 PM Atin Mukherjee, <amukherj at redhat.com>
>>>> wrote:
>>>>
>>>>> I just went through the nightly regression report of brick mux runs
>>>>> and here's what I can summarize.
>>>>>
>>>>> =====================================================================
>>>>> Fails only with brick-mux
>>>>> =====================================================================
>>>>> tests/bugs/core/bug-1432542-mpx-restart-crash.t - Times out even
>>>>> after 400 secs. Refer
>>>>> https://fstat.gluster.org/failure/209?state=2&start_date=2018-06-30&end_date=2018-07-31&branch=all,
>>>>> specifically the latest report
>>>>> https://build.gluster.org/job/regression-test-burn-in/4051/consoleText .
>>>>> Wasn't timing out as frequently as it was till 12 July, but since 27
>>>>> July it has timed out twice. Beginning to believe commit
>>>>> 9400b6f2c8aa219a493961e0ab9770b7f12e80d2 has added the delay and now
>>>>> 400 secs isn't sufficient (Mohit?)
>>>>>
>>>>> tests/bugs/glusterd/add-brick-and-validate-replicated-volume-options.t
>>>>> (Ref - https://build.gluster.org/job/regression-test-with-multiplex/814/console)
>>>>> - Test fails only in brick-mux mode; AI on Atin to look at it and get
>>>>> back.
>>>>>
>>>>> tests/bugs/replicate/bug-1433571-undo-pending-only-on-up-bricks.t
>>>>> (https://build.gluster.org/job/regression-test-with-multiplex/813/console)
>>>>> - Seems to have failed just twice in the last 30 days as per
>>>>> https://fstat.gluster.org/failure/251?state=2&start_date=2018-06-30&end_date=2018-07-31&branch=all.
>>>>> Need help from the AFR team.
>>>>>
>>>>> tests/bugs/quota/bug-1293601.t
>>>>> (https://build.gluster.org/job/regression-test-with-multiplex/812/console)
>>>>> - Hasn't failed after 26 July; earlier it was failing regularly. Did
>>>>> we fix this test through any patch (Mohit?)
>>>>>
>>>>> tests/bitrot/bug-1373520.t
>>>>> (https://build.gluster.org/job/regression-test-with-multiplex/811/console)
>>>>> - Hasn't failed after 27 July; earlier it was failing regularly. Did
>>>>> we fix this test through any patch (Mohit?)
>>>>>
>>>>> tests/bugs/glusterd/remove-brick-testcases.t - Failed once with a
>>>>> core; not sure whether brick mux is the culprit. Ref -
>>>>> https://build.gluster.org/job/regression-test-with-multiplex/806/console
>>>>> . Seems to be a glustershd crash. Need help from AFR folks.
>>>>>
>>>>> =====================================================================
>>>>> Fails for the non-brick-mux case too
>>>>> =====================================================================
>>>>> tests/bugs/distribute/bug-1122443.t - Seems to be failing at my setup
>>>>> very often, without brick mux as well. Refer
>>>>> https://build.gluster.org/job/regression-test-burn-in/4050/consoleText
>>>>> . There's an email in gluster-devel and a BZ 1610240 for the same.
>>>>>
>>>>> tests/bugs/bug-1368312.t - Seems to be a new failure
>>>>> (https://build.gluster.org/job/regression-test-with-multiplex/815/console);
>>>>> however, it has also been seen in a non-brick-mux case -
>>>>> https://build.gluster.org/job/regression-test-burn-in/4039/consoleText
>>>>> . Need some eyes from AFR folks.
>>>>>
>>>>> tests/00-geo-rep/georep-basic-dr-tarssh.t - This isn't specific to
>>>>> brick mux; it has been seen failing in multiple default regression
>>>>> runs. Refer
>>>>> https://fstat.gluster.org/failure/392?state=2&start_date=2018-06-30&end_date=2018-07-31&branch=all
>>>>> . We need help from the geo-rep devs to root-cause this sooner rather
>>>>> than later.
>>>>>
>>>>> tests/00-geo-rep/georep-basic-dr-rsync.t - This isn't specific to
>>>>> brick mux; it has been seen failing in multiple default regression
>>>>> runs. Refer
>>>>> https://fstat.gluster.org/failure/393?state=2&start_date=2018-06-30&end_date=2018-07-31&branch=all
>>>>> . We need help from the geo-rep devs to root-cause this sooner rather
>>>>> than later.
>>>>>
>>>>> tests/bugs/glusterd/validating-server-quorum.t
>>>>> (https://build.gluster.org/job/regression-test-with-multiplex/810/console)
>>>>> - Fails for non-brick-mux cases too:
>>>>> https://fstat.gluster.org/failure/580?state=2&start_date=2018-06-30&end_date=2018-07-31&branch=all
>>>>> . Atin has a patch, https://review.gluster.org/20584, which resolves
>>>>> it, but the patch is failing regression for a different, unrelated
>>>>> test.
>>>>>
>>>>> tests/bugs/replicate/bug-1586020-mark-dirty-for-entry-txn-on-quorum-failure.t
>>>>> (Ref - https://build.gluster.org/job/regression-test-with-multiplex/809/console)
>>>>> - Fails for the non-brick-mux case too -
>>>>> https://build.gluster.org/job/regression-test-burn-in/4049/consoleText
>>>>> - Need some eyes from AFR folks.
>>>>>
>>>> I am looking at this. It is not reproducible locally. Trying to
>>>> reproduce it on softserve.
>>>>
>>>
>>> On the softserve machine too, it is not failing where the regression
>>> failed. But I found another problem in the script.
>>> I will fix that and add some extra logs so that it is easier to debug
>>> the next time it fails.
>>>
>>
>> RCA for the tests/bugs/replicate/bug-1586020-mark-dirty-for-entry-txn-on-quorum-failure.t
>> failure:
>> This test completely fills 2 out of 3 bricks and provisions one brick
>> with some extra space, so that entry creation succeeds only on one
>> brick and fails on the others.
>> Since 2 of the bricks are full, only the entry creation succeeds on
>> those bricks; the creation of the gfid hard link inside ".glusterfs"
>> fails. This is a bug in the "posix" code with entry transactions.
>> If the gfid link creation fails, we just log an error message and
>> continue. Since we depend on that gfid, the entry should be deleted
>> when this fails.
>> When the shd tries to heal those files, it sees that the gfid link is
>> not present and fails to heal them.
>>
>> I will send a fix for this, which deletes the entry if creating the
>> link inside .glusterfs fails.
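The fix described above amounts to rolling back the entry when the .glusterfs hard link cannot be created. In shell terms (an illustrative helper mirroring the described posix-xlator behaviour, not the actual C code):

```shell
# Create the entry, then hard-link it into the gfid path; if the link
# fails, undo the entry instead of only logging, so self-heal never sees
# an entry without its gfid link.
create_entry_with_gfid_link() {
    local entry=$1 gfid_path=$2
    : > "$entry" || return 1
    if ! ln "$entry" "$gfid_path" 2>/dev/null; then
        rm -f "$entry"      # roll back the entry creation
        return 1
    fi
    return 0
}
```

With the rollback in place, a failed transaction leaves nothing behind, so the shd has no gfid-less file to trip over during heal.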
>>
>> Regards,
>> Karthik
>>
>>>
>>>> Regards,
>>>> Karthik
>>>>
>>>>> _______________________________________________
>>>>> Gluster-devel mailing list
>>>>> Gluster-devel at gluster.org
>>>>> https://lists.gluster.org/mailman/listinfo/gluster-devel
>>>>
>>>>
>> _______________________________________________
>> Gluster-devel mailing list
>> Gluster-devel at gluster.org
>> https://lists.gluster.org/mailman/listinfo/gluster-devel
>>
>
>
>
> --
> Thanks and Regards,
> Kotresh H R
>