[Bugs] [Bug 1808256] New: GlusterFS creates a flock lock via an anonymous fd, but can never release it.

bugzilla at redhat.com bugzilla at redhat.com
Fri Feb 28 06:25:04 UTC 2020


https://bugzilla.redhat.com/show_bug.cgi?id=1808256

            Bug ID: 1808256
           Summary: GlusterFS creates a flock lock via an anonymous fd,
                    but can never release it.
           Product: GlusterFS
           Version: 5
          Hardware: x86_64
                OS: Linux
            Status: NEW
         Component: protocol
          Keywords: Triaged
          Severity: high
          Assignee: bugs at gluster.org
          Reporter: moagrawa at redhat.com
                CC: bugs at gluster.org, pkarampu at redhat.com,
                    sasundar at redhat.com, skoduri at redhat.com,
                    xiaoping.wu at nokia.com
        Depends On: 1390914
            Blocks: 1699198
  Target Milestone: ---
    Classification: Community



+++ This bug was initially created as a clone of Bug #1390914 +++

Description of problem:
 - When the TCP connection between server and client is established but AFR
has not yet received the CHILD_UP event, and an application opens a file for a
flock operation, AFR sends the OPEN fop only to the child client that is up.
 - The FLOCK fop, however, is also sent to the other server with an anonymous
fd (-2), and that server grants the lock successfully.
 - When the application releases the flock, the client never sends a RELEASE
request to that server, so the lock there is never released.
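
To make the failure mode concrete, here is a minimal C sketch of the
asymmetry (illustrative only: both helpers are hypothetical, not the actual
client-xlator code):

/* Anonymous fd: the server resolves the file by gfid, so no prior OPEN on
   that brick is needed. */
#define GF_ANON_FD_NO (-2)

/* LK path: if the fd was never opened on this brick, fall back to the
   anonymous fd, so the server still grants the lock. */
static long
lk_effective_remote_fd (long remote_fd)
{
        return (remote_fd >= 0) ? remote_fd : GF_ANON_FD_NO;
}

/* RELEASE path: release is tied to a real open fd; with no fd opened on
   this brick the client sends nothing, so the lock stays granted forever. */
static int
should_send_release (long remote_fd)
{
        return (remote_fd >= 0);
}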

Version-Release number of selected component (if applicable):
3.6.9

How reproducible:
 - In my environment there is a replicate volume: one brick is located on the
sn-0 VM, the other on the sn-1 VM.
 - GlusterFS clients run on the lmn-0, sn-0, and sn-1 VMs.
 - This issue isn't easy to reproduce.

Steps to Reproduce:
1. Restart sn-0 and sn-1 at the same time.
2. The glusterfs servers on sn-0/sn-1 start up.
3. An application on the lmn-0 VM opens a file and takes a flock on it.

Actual results:
1. AFR gets the flock from sn-0 successfully via an actual fd.
2. AFR gets the flock from sn-1 successfully via an anonymous fd (-2).
3. When the application releases the flock, the lock on sn-1 cannot be
released, because the RELEASE request is never sent to that server.


Additional info:

--- Additional comment from xiaopwu on 2016-11-03 02:25:02 UTC ---

sn-0_mnt-bricks-services-brick.1062.dump.1477882903 was dumped from the
glusterfs server on sn-0.
sn-1mnt-bricks-services-brick.1066.dump.1477882904 was dumped from the
glusterfs server on sn-1.

--- Additional comment from xiaopwu on 2016-11-03 03:10:10 UTC ---

Analysis of the attachments:
1. The logs below are copied from
sn-1mnt-bricks-services-brick.1066.dump.1477882904. The granted flock was
never released on sn-1, while the corresponding flock was released on sn-0.

[xlator.features.locks.services-locks.inode]
path=/lightcm/locks/nodes.all
mandatory=0
posixlk-count=2
posixlk.posixlk[0](ACTIVE)=type=WRITE, whence=0, start=0, len=0, pid = 17537,
owner=18a27aa2d1d2944f, client=0x19e6b60, connection-id=(null), granted at
2016-10-31 02:58:07
posixlk.posixlk[1](BLOCKED)=type=WRITE, whence=0, start=0, len=0, pid = 17559,
owner=18764b24e15b0520, client=0x19e6b60, connection-id=(null), blocked at
2016-10-31 02:58:07
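
Note that the dump also shows a second lock (posixlk[1]) BLOCKED behind the
orphaned ACTIVE lock since 2016-10-31 02:58:07: because the ACTIVE lock is
never released, any later flock attempt on this file will block forever.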


2. This flock came from the lmn VM. The logs below are copied from
lmn_mnt-services.log.

// The application opened the "nodes.all" file. The OPEN request was sent to
sn-0, but not to sn-1, because 0-services-client-1 was not yet UP.

[2016-10-31 02:57:50.816359] T [rpc-clnt.c:1381:rpc_clnt_record]
0-services-client-0: Auth Info: pid: 17536, uid: 0, gid: 0, owner:
0000000000000000
[2016-10-31 02:57:50.816371] T [rpc-clnt.c:1238:rpc_clnt_record_build_header]
0-rpc-clnt: Request fraglen 92, payload: 24, rpc hdr: 68
[2016-10-31 02:57:50.816390] T [rpc-clnt.c:1573:rpc_clnt_submit] 0-rpc-clnt:
submitted request (XID: 0xc7b7aa Program: GlusterFS 3.3, ProgVers: 330, Proc:
11) to rpc-transport (services-client-0)

// The application flocked the file. The FLOCK request was sent to sn-0, and
the lock was granted by sn-0 successfully.

[2016-10-31 02:57:50.817424] T [rpc-clnt.c:1573:rpc_clnt_submit] 0-rpc-clnt:
submitted request (XID: 0xc7b7ab Program: GlusterFS 3.3, ProgVers: 330, Proc:
26) to rpc-transport (services-client-0)
[2016-10-31 02:57:51.277349] T [rpc-clnt.c:660:rpc_clnt_reply_init]
0-services-client-0: received rpc message (RPC XID: 0xc7b7ab Program: GlusterFS
3.3, ProgVers: 330, Proc: 26) from rpc-transport (services-client-0)

// The AFR xlator also sent the FLOCK fop to sn-1, although 0-services-client-1
was still not UP. Because AFR had not opened "nodes.all" on sn-1, the client
xlator sent the fop to the server with an anonymous fd (GF_ANON_FD_NO, -2) and
got the lock from that server successfully too. This lock was never released.

[2016-10-31 02:57:51.277397] T [rpc-clnt.c:1573:rpc_clnt_submit] 0-rpc-clnt:
submitted request (XID: 0xbe5479 Program: GlusterFS 3.3, ProgVers: 330, Proc:
26) to rpc-transport (services-client-1)
[2016-10-31 02:57:51.324258] T [rpc-clnt.c:660:rpc_clnt_reply_init]
0-services-client-1: received rpc message (RPC XID: 0xbe5479 Program: GlusterFS
3.3, ProgVers: 330, Proc: 26) from rpc-transport (services-client-1)

// The application released the flock. The RELEASE fop was sent to sn-0 and
the lock was released on sn-0, but the RELEASE fop was never sent to sn-1.

[2016-10-31 02:57:51.404366] T [rpc-clnt.c:1573:rpc_clnt_submit] 0-rpc-clnt:
submitted request (XID: 0xc7b7ec Program: GlusterFS 3.3, ProgVers: 330, Proc:
41) to rpc-transport (services-client-0)
[2016-10-31 02:57:51.406273] T [rpc-clnt.c:660:rpc_clnt_reply_init]
0-services-client-0: received rpc message (RPC XID: 0xc7b7ec Program: GlusterFS
3.3, ProgVers: 330, Proc: 41) from rpc-transport (services-client-0)

// Note when 0-services-client-0 came UP: before the "nodes.all" OPEN.

[2016-10-31 02:57:49.802345] I [client-handshake.c:1052:client_post_handshake]
0-services-client-0: 22 fds open - Delaying child_up until they are re-opened
[2016-10-31 02:57:49.983331] I
[client-handshake.c:674:client_child_up_reopen_done] 0-services-client-0: last
fd open'd/lock-self-heal'd - notifying CHILD-UP

// Note when 0-services-client-1 came UP: after the flock release.

[2016-10-31 02:57:51.244367] I [client-handshake.c:1052:client_post_handshake]
0-services-client-1: 21 fds open - Delaying child_up until they are re-opened
[2016-10-31 02:57:51.731500] I
[client-handshake.c:674:client_child_up_reopen_done] 0-services-client-1: last
fd open'd/lock-self-heal'd - notifying CHILD-UP
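
Putting the timestamps together: 0-services-client-0 came UP at 02:57:49,
the OPEN/FLOCK/RELEASE sequence ran between 02:57:50.81 and 02:57:51.40, and
0-services-client-1 only came UP at 02:57:51.73, after the release. The whole
bug lives inside that sub-second window, which is why it is hard to
reproduce.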

3. Code

// The FLOCK fop is sent to the server with an anonymous fd if the locked
// file was not opened on that brick (body elided):

int32_t
client3_3_lk (call_frame_t *frame, xlator_t *this, void *data)
{
        /* ... */
}

// The RELEASE fop is sent to the server only for a file that was actually
// opened on that brick; otherwise no RELEASE goes out (body elided):

int32_t
client3_3_release (call_frame_t *frame, xlator_t *this,
                   void *data)
{
        /* ... */
}
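
For context, here is a simplified rendering of the fallback that
CLIENT_GET_REMOTE_FD performs for lk (an assumption-laden sketch: the real
macro takes more arguments and jumps to an unwind label on error):

#define GF_ANON_FD_NO     (-2)
#define DEFAULT_REMOTE_FD GF_ANON_FD_NO

typedef struct {
        long remote_fd;  /* set only after a successful OPEN on this brick */
} fd_ctx_stub_t;         /* minimal stand-in for the client fd context */

/* Hypothetical helper mirroring the macro's effect: if the fd was never
   opened on this brick, hand back the fallback (here the anonymous fd)
   instead of an error; this is exactly what lets client3_3_lk take a lock
   that client3_3_release will never clean up. */
static long
client_get_remote_fd (fd_ctx_stub_t *fdctx, long fallback)
{
        if (fdctx && fdctx->remote_fd >= 0)
                return fdctx->remote_fd;
        return fallback;
}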

4. Is there any way to fix this issue?

--- Additional comment from xiaopwu on 2016-11-08 08:33:05 UTC ---

We added a patch for this issue, please check if it is ok.

--- a/old/client-rpc-fops.c
+++ b/new/client-rpc-fops.c
@@ -5260,6 +5260,14 @@ client3_3_lk (call_frame_t *frame, xlator_t *this,
         CLIENT_GET_REMOTE_FD (this, args->fd, DEFAULT_REMOTE_FD,
                               remote_fd, op_errno, unwind);

+       if (remote_fd < 0)
+        {
+            gf_log (this->name, GF_LOG_INFO, "Didn't open remote fd(%ld), but return EBADFD, AFR shall ignore such error. pid: %u ", remote_fd, frame->root->pid);
+            op_errno = EBADFD;
+            CLIENT_STACK_UNWIND (lk, frame, -1, op_errno, NULL, NULL);
+            return 0;
+        }
+
         ret = client_cmd_to_gf_cmd (args->cmd, &gf_cmd);
         if (ret) {
                 op_errno = EINVAL;
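
The choice of EBADFD in the patch above is deliberate: failing the lk fop
fast on a brick where the fd was never opened means no lock is granted there
via the anonymous fd in the first place, and (as the log message says) AFR
can ignore the error on that child instead of leaving an orphaned lock
behind.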

--- Additional comment from Soumya Koduri on 2016-11-08 12:18:15 UTC ---

CCing Pranith (AFR and Posix-locks code maintainer)

--- Additional comment from Worker Ant on 2016-11-08 12:49:01 UTC ---

REVIEW: http://review.gluster.org/15804 (protocol/client: Do not fallback to
anon-fd if fd is not open) posted (#1) for review on master by Pranith Kumar
Karampuri (pkarampu at redhat.com)

--- Additional comment from Pranith Kumar K on 2016-11-08 12:53:27 UTC ---

(In reply to xiaopwu from comment #3)
> We added a patch for this issue, please check if it is ok.
> 
> --- a/old/client-rpc-fops.c
> +++ b/new/client-rpc-fops.c
> @@ -5260,6 +5260,14 @@ client3_3_lk (call_frame_t *frame, xlator_t *this,
>          CLIENT_GET_REMOTE_FD (this, args->fd, DEFAULT_REMOTE_FD,
>                                remote_fd, op_errno, unwind);
> 
> +       if(remote_fd < 0)
> +        {
> +            gf_log (this->name, GF_LOG_INFO, "Didn't open remote fd(%ld), but return EBADFD, AFR shall ignore such error. pid: %u ", remote_fd, frame->root->pid);
> +            op_errno = EBADFD;
> +            CLIENT_STACK_UNWIND (lk, frame, -1, op_errno, NULL, NULL);
> +            return 0;
> +        }
> +
>          ret = client_cmd_to_gf_cmd (args->cmd, &gf_cmd);
>          if (ret) {
>                  op_errno = EINVAL;

Awesome debugging!!
I think the fix can be generic, so I posted a fix for the bug I introduced a
while back. Thanks for raising this bug! I will port the patch to the lower
branches once this one passes regressions.

Pranith

--- Additional comment from Pranith Kumar K on 2016-11-09 11:11:44 UTC ---

Regressions are failing on the change I submitted; I will look into it a bit
more and update.

--- Additional comment from Worker Ant on 2017-09-04 10:57:10 UTC ---

REVIEW: https://review.gluster.org/15804 (protocol/client: Do not fallback to
anon-fd if fd is not open) posted (#2) for review on master by Pranith Kumar
Karampuri (pkarampu at redhat.com)

--- Additional comment from Worker Ant on 2019-03-28 12:28:26 UTC ---

REVIEW: https://review.gluster.org/15804 (protocol/client: Do not fallback to
anon-fd if fd is not open) posted (#3) for review on master by Pranith Kumar
Karampuri

--- Additional comment from Worker Ant on 2019-03-31 02:57:49 UTC ---

REVIEW: https://review.gluster.org/15804 (protocol/client: Do not fallback to
anon-fd if fd is not open) merged (#8) on master by Raghavendra G
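
The merged change takes a more generic route than the per-fop EBADFD patch
above: whether a fop may fall back to the anonymous fd becomes an explicit,
opt-in decision in the client xlator. A hedged sketch of the idea (the
identifiers below are assumptions made for illustration; see the review link
for the actual code):

#include <errno.h>

#define GF_ANON_FD_NO (-2)

enum { FALLBACK_TO_ANON_FD = 1 };  /* assumed flag name for this sketch */

typedef struct {
        long remote_fd;  /* valid only after a successful OPEN on this brick */
} fd_ctx_stub_t;

/* Fops that are safe on an anonymous fd (plain reads and writes, say) pass
   FALLBACK_TO_ANON_FD; fd-bound fops such as lk pass 0 and fail with EBADFD
   instead of silently taking a lock that can never be released. */
static long
client_get_remote_fd (fd_ctx_stub_t *fdctx, int flags, int *op_errno)
{
        if (fdctx && fdctx->remote_fd >= 0)
                return fdctx->remote_fd;
        if (flags & FALLBACK_TO_ANON_FD)
                return GF_ANON_FD_NO;
        *op_errno = EBADFD;
        return -1;
}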


Referenced Bugs:

https://bugzilla.redhat.com/show_bug.cgi?id=1390914
[Bug 1390914] GlusterFS creates a flock lock via an anonymous fd, but can
never release it.
https://bugzilla.redhat.com/show_bug.cgi?id=1699198
[Bug 1699198] GlusterFS creates a flock lock via an anonymous fd, but can
never release it.
-- 
You are receiving this mail because:
You are on the CC list for the bug.
You are the assignee for the bug.

