[Gluster-devel] test failure reports for last 15 days

Xavi Hernandez jahernan at redhat.com
Thu Apr 11 09:28:48 UTC 2019


On Wed, Apr 10, 2019 at 7:25 PM Xavi Hernandez <jahernan at redhat.com> wrote:

> On Wed, Apr 10, 2019 at 4:01 PM Atin Mukherjee <amukherj at redhat.com>
> wrote:
>
>> And now for last 15 days:
>>
>>
>> https://fstat.gluster.org/summary?start_date=2019-03-25&end_date=2019-04-10
>>
>> ./tests/bitrot/bug-1373520.t     18  ==> Fixed through
>> https://review.gluster.org/#/c/glusterfs/+/22481/, I don't see this
>> failing in brick mux post 5th April
>> ./tests/bugs/ec/bug-1236065.t     17  ==> happens only in brick mux,
>> needs analysis.
>>
>
> I've identified the problem here, but not the cause yet. There's a stale
> inodelk acquired by a process that is already dead, which causes inodelk
> requests from self-heal and other processes to block.
>
> The reason why it seemed to block in random places is that all commands
> are executed with the working directory pointing to a gluster directory
> which needs healing after the initial tests. Because of the stale inodelk,
> when any application tries to open a file in the working directory, it's
> blocked.
>
> I'll investigate what causes this.
>

I think I've found the problem. This is a fragment of the brick log that
includes script steps, connections and disconnections of brick 0, and lock
requests to the problematic lock:

[2019-04-11 08:22:20.381398]:++++++++++ G_LOG:tests/bugs/ec/bug-1236065.t:
TEST: 66 kill_brick patchy jahernan /d/backends/patchy2 ++++++++++
[2019-04-11 08:22:22.532646]:++++++++++ G_LOG:tests/bugs/ec/bug-1236065.t:
TEST: 67 kill_brick patchy jahernan /d/backends/patchy3 ++++++++++
[2019-04-11 08:22:23.709655] I [MSGID: 115029]
[server-handshake.c:550:server_setvolume] 0-patchy-server: accepted client
from
CTX_ID:1c2952c2-e90f-4631-8712-170b8c05aa6e-GRAPH_ID:0-PID:28900-HOST:jahernan-PC_NAME:patchy-client-1-RECON_NO:-2
(version: 7dev) with subvol /d/backends/patchy1
[2019-04-11 08:22:23.792204] I [common.c:234:pl_trace_in] 8-patchy-locks:
[REQUEST] Locker = {Pid=29710, lk-owner=68580998b47f0000,
Client=CTX_ID:1c2952c2-e90f-4631-8712-170b8c05aa6e-GRAPH_ID:0-PID:28900-HOST:jahernan-PC_NAME:patchy-client-1-RECON_NO:-2,
Frame=18676} Lockee = {gfid=35743386-b7c2-41c9-aafd-6b13de216704, fd=(nil),
path=/test} Lock = {lock=INODELK, cmd=SETLK, type=WRITE, domain:
patchy-disperse-0, start=0, len=0, pid=0}
[2019-04-11 08:22:23.792299] I [common.c:285:pl_trace_out] 8-patchy-locks:
[GRANTED] Locker = {Pid=29710, lk-owner=68580998b47f0000,
Client=CTX_ID:1c2952c2-e90f-4631-8712-170b8c05aa6e-GRAPH_ID:0-PID:28900-HOST:jahernan-PC_NAME:patchy-client-1-RECON_NO:-2,
Frame=18676} Lockee = {gfid=35743386-b7c2-41c9-aafd-6b13de216704, fd=(nil),
path=/test} Lock = {lock=INODELK, cmd=SETLK, type=WRITE, domain:
patchy-disperse-0, start=0, len=0, pid=0}
[2019-04-11 08:22:24.628478]:++++++++++ G_LOG:tests/bugs/ec/bug-1236065.t:
TEST: 68 5 online_brick_count ++++++++++
[2019-04-11 08:22:26.097092]:++++++++++ G_LOG:tests/bugs/ec/bug-1236065.t:
TEST: 70 rm -f 0.o 10.o 11.o 12.o 13.o 14.o 15.o 16.o 17.o 18.o 19.o 1.o
2.o 3.o 4.o 5.o 6.o 7.o 8.o 9.o ++++++++++
[2019-04-11 08:22:26.333740]:++++++++++ G_LOG:tests/bugs/ec/bug-1236065.t:
TEST: 71 ec_test_make ++++++++++
[2019-04-11 08:22:27.718963] I [MSGID: 115029]
[server-handshake.c:550:server_setvolume] 0-patchy-server: accepted client
from
CTX_ID:1c2952c2-e90f-4631-8712-170b8c05aa6e-GRAPH_ID:0-PID:28900-HOST:jahernan-PC_NAME:patchy-client-1-RECON_NO:-3
(version: 7dev) with subvol /d/backends/patchy1
[2019-04-11 08:22:27.801416] I [common.c:234:pl_trace_in] 8-patchy-locks:
[REQUEST] Locker = {Pid=29885, lk-owner=68580998b47f0000,
Client=CTX_ID:1c2952c2-e90f-4631-8712-170b8c05aa6e-GRAPH_ID:0-PID:28900-HOST:jahernan-PC_NAME:patchy-client-1-RECON_NO:-3,
Frame=19233} Lockee = {gfid=35743386-b7c2-41c9-aafd-6b13de216704, fd=(nil),
path=/test} Lock = {lock=INODELK, cmd=SETLK, type=UNLOCK, domain:
patchy-disperse-0, start=0, len=0, pid=0}
[2019-04-11 08:22:27.801434] E [inodelk.c:513:__inode_unlock_lock]
8-patchy-locks:  Matching lock not found for unlock 0-9223372036854775807,
by 68580998b47f0000 on 0x7f0ed0029190
[2019-04-11 08:22:27.801446] I [common.c:285:pl_trace_out] 8-patchy-locks:
[Invalid argument] Locker = {Pid=29885, lk-owner=68580998b47f0000,
Client=CTX_ID:1c2952c2-e90f-4631-8712-170b8c05aa6e-GRAPH_ID:0-PID:28900-HOST:jahernan-PC_NAME:patchy-client-1-RECON_NO:-3,
Frame=19233} Lockee = {gfid=35743386-b7c2-41c9-aafd-6b13de216704, fd=(nil),
path=/test} Lock = {lock=INODELK, cmd=SETLK, type=UNLOCK, domain:
patchy-disperse-0, start=0, len=0, pid=0}

This is a fragment of the client log:

[2019-04-11 08:22:20.381398]:++++++++++ G_LOG:tests/bugs/ec/bug-1236065.t:
TEST: 66 kill_brick patchy jahernan /d/backends/patchy2 ++++++++++
[2019-04-11 08:22:20.675938] I [MSGID: 114018]
[client.c:2333:client_rpc_notify] 0-patchy-client-1: disconnected from
patchy-client-1. Client process will keep trying to connect to glusterd
until brick's port is available
[2019-04-11 08:22:21.674772] W [MSGID: 122035]
[ec-common.c:654:ec_child_select] 0-patchy-disperse-0: Executing operation
with some subvolumes unavailable. (6). FOP : 'INODELK' failed on '/test'
with gfid 35743386-b7c2-41c9-aafd-6b13de216704
[2019-04-11 08:22:22.532646]:++++++++++ G_LOG:tests/bugs/ec/bug-1236065.t:
TEST: 67 kill_brick patchy jahernan /d/backends/patchy3 ++++++++++
[2019-04-11 08:22:23.691171] W [MSGID: 122035]
[ec-common.c:654:ec_child_select] 0-patchy-disperse-0: Executing operation
with some subvolumes unavailable. (8). FOP : 'INODELK' failed on '/test'
with gfid 35743386-b7c2-41c9-aafd-6b13de216704
[2019-04-11 08:22:23.710420] I [MSGID: 114046]
[client-handshake.c:1106:client_setvolume_cbk] 0-patchy-client-1: Connected
to patchy-client-1, attached to remote volume '/d/backends/patchy1'.
[2019-04-11 08:22:23.791635] W [MSGID: 122035]
[ec-common.c:654:ec_child_select] 0-patchy-disperse-0: Executing operation
with some subvolumes unavailable. (C). FOP : 'INODELK' failed on '/test'
with gfid 35743386-b7c2-41c9-aafd-6b13de216704
[2019-04-11 08:22:24.460529] I [MSGID: 114018]
[client.c:2333:client_rpc_notify] 0-patchy-client-1: disconnected from
patchy-client-1. Client process will keep trying to connect to glusterd
until brick's port is available
[2019-04-11 08:22:24.628478]:++++++++++ G_LOG:tests/bugs/ec/bug-1236065.t:
TEST: 68 5 online_brick_count ++++++++++
[2019-04-11 08:22:26.097092]:++++++++++ G_LOG:tests/bugs/ec/bug-1236065.t:
TEST: 70 rm -f 0.o 10.o 11.o 12.o 13.o 14.o 15.o 16.o 17.o 18.o 19.o 1.o
2.o 3.o 4.o 5.o 6.o 7.o 8.o 9.o ++++++++++
[2019-04-11 08:22:26.333740]:++++++++++ G_LOG:tests/bugs/ec/bug-1236065.t:
TEST: 71 ec_test_make ++++++++++
[2019-04-11 08:22:27.719299] I [MSGID: 114046]
[client-handshake.c:1106:client_setvolume_cbk] 0-patchy-client-1: Connected
to patchy-client-1, attached to remote volume '/d/backends/patchy1'.
[2019-04-11 08:22:27.840342] W [MSGID: 122035]
[ec-common.c:654:ec_child_select] 0-patchy-disperse-0: Executing operation
with some subvolumes unavailable. (C). FOP : 'INODELK' failed on '/test'
with gfid 35743386-b7c2-41c9-aafd-6b13de216704

The problem happens for two things:

1. Brick 0 gets disconnected randomly (apparently), but the server side is
not aware of these disconnections. This causes that at 08:22:24.460529, the
client has already sent a successful INODELK request to brick 0. At this
point the connection is broken on the client side, but server side doesn't
get any notification, so it doesn't clear the locks.
2. When client reconnects at 08:22:27.719299, a new connection is created,
and the servers does see this new connection (it creates a new client_t
structure). Then the client sends the unlock request, which fails on brick
0 because locks xlators checks if the client is the same by comparing the
pointers, but they are different because of the reconnection. So the lock
is not unlocked and remains there, blocking all future inodelk requests.

The first problem is why the client gets disconnected and the server
doesn't get any notification. The script is stopping bricks 2 and 3 when
this happens. Brick 0 shouldn't fail here. It seems related to the

The second problem is that when we receive a new connection from a client
we already consider connected, we don't cleanup the old connection, which
should take care of the stale locks.

The third problem is that locks xlator is checking if the client is the
same by comparing pointers of client_t structs instead of comparing
client_uid field, which remains the same.

Adding +Raghavendra Gowdappa <rgowdapp at redhat.com>, +Pranith Kumar Karampuri
<pkarampu at redhat.com>, +Krutika Dhananjay <kdhananj at redhat.com>, +Shyam
Ranganathan <srangana at redhat.com> and +Amar Tumballi <atumball at redhat.com> to
help me identify why this is happening and what's the best way to solve it.

Xavi


> Xavi
>
> ./tests/basic/uss.t             15  ==> happens in both brick mux and non
>> brick mux runs, test just simply times out. Needs urgent analysis.
>> ./tests/basic/ec/ec-fix-openfd.t 13  ==> Fixed through
>> https://review.gluster.org/#/c/22508/ , patch merged today.
>> ./tests/basic/volfile-sanity.t      8  ==> Some race, though this
>> succeeds in second attempt every time.
>>
>> There're plenty more with 5 instances of failure from many tests. We need
>> all maintainers/owners to look through these failures and fix them, we
>> certainly don't want to get into a stage where master is unstable and we
>> have to lock down the merges till all these failures are resolved. So
>> please help.
>>
>> (Please note fstat stats show up the retries as failures too which in a
>> way is right)
>>
>>
>> On Tue, Feb 26, 2019 at 5:27 PM Atin Mukherjee <amukherj at redhat.com>
>> wrote:
>>
>>> [1] captures the test failures report since last 30 days and we'd need
>>> volunteers/component owners to see why the number of failures are so high
>>> against few tests.
>>>
>>> [1]
>>> https://fstat.gluster.org/summary?start_date=2019-01-26&end_date=2019-02-25&job=all
>>>
>> _______________________________________________
>> Gluster-devel mailing list
>> Gluster-devel at gluster.org
>> https://lists.gluster.org/mailman/listinfo/gluster-devel
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.gluster.org/pipermail/gluster-devel/attachments/20190411/15d96b2e/attachment.html>


More information about the Gluster-devel mailing list