[Bugs] [Bug 1812849] New: Setting volume option when one of the glusterd is stopped in the cluster, post glusterd restart seeing couldn't find vol info in glusterd logs and shd, brick process offline

bugzilla at redhat.com bugzilla at redhat.com
Thu Mar 12 10:44:32 UTC 2020


https://bugzilla.redhat.com/show_bug.cgi?id=1812849

            Bug ID: 1812849
           Summary: Setting volume option when one of the glusterd is
                    stopped in the cluster, post glusterd restart seeing
                    couldn't find vol info in glusterd logs and shd, brick
                    process offline
           Product: GlusterFS
           Version: 7
          Hardware: x86_64
                OS: Linux
            Status: NEW
         Component: glusterd
          Severity: high
          Priority: high
          Assignee: bugs at gluster.org
          Reporter: srakonde at redhat.com
                CC: amukherj at redhat.com, atumball at redhat.com,
                    bmekala at redhat.com, bugs at gluster.org,
                    nchilaka at redhat.com, rhs-bugs at redhat.com,
                    sankarshan at redhat.com, storage-qa-internal at redhat.com,
                    vbellur at redhat.com, vdas at redhat.com
        Depends On: 1733425
            Blocks: 1727269
  Target Milestone: ---
    Classification: Community



Description of problem:
When glusterd is stopped on one node and performed volume set operation on only
two volumes(one is upgraded and another volume is newly created.). Restarted
glusterd on the node where it stopped on that node seeing "Couldn't find
volinfo" errors and shd and brick process on the two volume where volume option
is set are offline.


Version-Release number of selected component (if applicable):
glusterfs-6.0-7

How reproducible:
1/1

Steps to Reproduce:
1. Create 1000 volumes and start them
2. Performed in-service upgrade on the cluster. 
3. Create 1000 more volumes (1X3 replicate) and start them
4. Stop glusterd on node N1.
5. Perform volume set operation on two volumes from n2
'performance.readdir-ahead on'. One which is being used for upgrade and another
newly created volume. 
6. Start glusterd on node N1.

Actual results:
glusterfsd and shd on these two volumes are offline.
seeing "couldn't find vol info" in glusterd logs. 

Expected results:
glusterfsd and shd should be online and should not see any errors with
"Couldn't find volinfo" in glusterd log.

Additional info:

glusterd log snippet
#########################################
[2019-07-05 06:49:29.078858] E [glusterd-handshake.c:343:build_volfile_path]
0-management: Couldn't find volinfo
[2019-07-05 06:49:29.078892] E [MSGID: 106619]
[glusterd-svc-helper.c:791:__glusterd_send_svc_configure_req] 0-management:
Unable to stat  (No such file or directory)
[2019-07-05 06:49:30.079325] E [glusterd-handshake.c:343:build_volfile_path]
0-management: Couldn't find volinfo
[2019-07-05 06:49:31.079862] E [glusterd-handshake.c:343:build_volfile_path]
0-management: Couldn't find volinfo
[2019-07-05 06:49:32.080355] E [glusterd-handshake.c:343:build_volfile_path]
0-management: Couldn't find volinfo
[2019-07-05 06:49:33.080899] E [glusterd-handshake.c:343:build_volfile_path]
0-management: Couldn't find volinfo
[2019-07-05 06:49:34.081374] E [glusterd-handshake.c:343:build_volfile_path]
0-management: Couldn't find volinfo
The message "E [MSGID: 106619]
[glusterd-svc-helper.c:791:__glusterd_send_svc_configure_req] 0-management:
Unable to stat  (No such file or directory)" repeated 5 times between
[2019-07-05 06:49:29.078892] and [2019-07-05 06:49:34.081406]
[2019-07-05 06:49:35.081508] W [MSGID: 106619]
[glusterd-svc-helper.c:931:glusterd_attach_svc] 0-glusterd: attach failed for
glustershd(volume=ronnie_-999)
[2019-07-05 06:49:35.081583] E [MSGID: 106048]
[glusterd-shd-svc.c:477:glusterd_shdsvc_start] 0-glusterd: Failed to attach shd
svc(volume=ronnie_-999) to pid=-1
[2019-07-05 06:49:44.928310] E [MSGID: 106558]
[glusterd-volgen.c:3289:volgen_graph_build_clients] 0-glusterd: volume
inconsistency: brick count is 0
[2019-07-05 06:49:44.928395] E [MSGID: 106068]
[glusterd-shd-svc.c:191:glusterd_shdsvc_create_volfile] 0-management: Failed to
create volfile
[2019-07-05 06:49:44.941696] I [glusterd-utils.c:6560:glusterd_brick_start]
0-management: discovered already-running brick /bricks/brick9/testvol_-999
[2019-07-05 06:49:44.941768] I [MSGID: 106142]
[glusterd-pmap.c:290:pmap_registry_bind] 0-pmap: adding brick
/bricks/brick9/testvol_-999 on port 49155
[2019-07-05 06:49:44.941824] I [rpc-clnt.c:1014:rpc_clnt_connection_init]
0-management: setting frame-timeout to 600
[2019-07-05 06:49:44.945117] I [MSGID: 106620]
[glusterd-svc-helper.c:900:glusterd_attach_svc] 0-glusterd: adding svc
glustershd (volume=testvol_-999) to existing process with pid 590
[2019-07-05 06:49:44.945285] E [glusterd-handshake.c:343:build_volfile_path]
0-management: Couldn't find volinfo
[2019-07-05 06:49:44.945316] E [MSGID: 106619]
[glusterd-svc-helper.c:791:__glusterd_send_svc_configure_req] 0-management:
Unable to stat  (No such file or directory)
[2019-07-05 06:49:45.018727] I [socket.c:3837:socket_submit_outgoing_msg]
0-socket.management: not connected (priv->connected = -1)
[2019-07-05 06:49:45.018748] E [rpcsvc.c:1583:rpcsvc_submit_generic]
0-rpc-service: failed to submit message (XID: 0x2, Program: GlusterD svc cli,
ProgVers: 2, Proc: 27) to rpc-transport (socket.management)
[2019-07-05 06:49:45.018761] E [MSGID: 106430]
[glusterd-utils.c:600:glusterd_submit_reply] 0-glusterd: Reply submission
failed
[2019-07-05 06:49:45.023858] W [MSGID: 106056]
[glusterd-snapshot-utils.c:376:glusterd_snap_volinfo_find_by_volume_id]
0-management: Snap volume not found
[2019-07-05 06:49:45.945716] E [glusterd-handshake.c:343:build_volfile_path]
0-management: Couldn't find volinfo
[2019-07-05 06:49:45.945750] E [MSGID: 106619]
[glusterd-svc-helper.c:791:__glusterd_send_svc_configure_req] 0-management:
Unable to stat  (No such file or directory)
[2019-07-05 06:49:46.447404] E [glusterd-handshake.c:343:build_volfile_path]
0-management: Couldn't find volinfo
[2019-07-05 06:49:46.946292] E [glusterd-handshake.c:343:build_volfile_path]
0-management: Couldn't find volinfo
[2019-07-05 06:49:47.946837] E [glusterd-handshake.c:343:build_volfile_path]
0-management: Couldn't find volinfo
[2019-07-05 06:49:48.947321] E [glusterd-handshake.c:343:build_volfile_path]
0-management: Couldn't find volinfo
[2019-07-05 06:49:49.947840] E [glusterd-handshake.c:343:build_volfile_path]
0-management: Couldn't find volinfo
[2019-07-05 06:49:50.948339] E [glusterd-handshake.c:343:build_volfile_path]
0-management: Couldn't find volinfo
[2019-07-05 06:49:51.948895] E [glusterd-handshake.c:343:build_volfile_path]
0-management: Couldn't find volinfo
[2019-07-05 06:49:52.949406] E [glusterd-handshake.c:343:build_volfile_path]
0-management: Couldn't find volinfo
[2019-07-05 06:49:53.950006] E [glusterd-handshake.c:343:build_volfile_path]
0-management: Couldn't find volinfo
[2019-07-05 06:49:54.950527] E [glusterd-handshake.c:343:build_volfile_path]
0-management: Couldn't find volinfo
[2019-07-05 06:49:55.951051] E [glusterd-handshake.c:343:build_volfile_path]
0-management: Couldn't find volinfo
[2019-07-05 06:49:56.951573] E [glusterd-handshake.c:343:build_volfile_path]
0-management: Couldn't find volinfo
[2019-07-05 06:49:57.952095] E [glusterd-handshake.c:343:build_volfile_path]
0-management: Couldn't find volinfo
[2019-07-05 06:49:58.952597] E [glusterd-handshake.c:343:build_volfile_path]
0-management: Couldn't find volinfo
The message "E [MSGID: 106619]
[glusterd-svc-helper.c:791:__glusterd_send_svc_configure_req] 0-management:
Unable to stat  (No such file or directory)" repeated 13 times between
[2019-07-05 06:49:45.945750] and [2019-07-05 06:49:58.952676]
[2019-07-05 06:49:59.952756] W [MSGID: 106619]
[glusterd-svc-helper.c:931:glusterd_attach_svc] 0-glusterd: attach failed for
glustershd(volume=testvol_-999)
[2019-07-05 06:49:59.953137] E [MSGID: 106048]
[glusterd-shd-svc.c:477:glusterd_shdsvc_start] 0-glusterd: Failed to attach shd
svc(volume=testvol_-999) to pid=590

--- Additional comment from RHEL Product and Program Management on 2019-07-05
10:08:50 UTC ---

This bug is automatically being proposed for the next minor release of Red Hat
Gluster Storage by setting the release flag 'rhgs‑3.5.0' to '?'. 

If this bug should be proposed for a different release, please manually change
the proposed release flag.

--- Additional comment from Bala Konda Reddy M on 2019-07-05 12:26:34 UTC ---

Sosreports are available here:
http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/bmekala/bug.1727269/

--- Additional comment from Atin Mukherjee on 2019-07-08 06:09:51 UTC ---

Proposing this as blocker.

--- Additional comment from Atin Mukherjee on 2019-07-09 04:06:01 UTC ---

This is reproducible. Based on my testing following is the observation:

1. "Couldn't find volinfo" message comes for the volumes where they have been
modified (i.e. volinfo->version is bumped up in other nodes) and when the
glusterd instance comes up online and do a handshake during shd volfile
building, following error is seen:

[2019-07-09 03:56:24.715943] E [MSGID: 106558]
[glusterd-volgen.c:3289:volgen_graph_build_clients] 0-glusterd: volume
inconsistency: brick count is 0
[2019-07-09 03:56:24.716003] E [MSGID: 106068]
[glusterd-shd-svc.c:191:glusterd_shdsvc_create_volfile] 0-management: Failed to
create volfile
[2019-07-09 03:56:24.802515] I [glusterd-utils.c:6562:glusterd_brick_start]
0-management: discovered already-running brick /tmp/b5
[2019-07-09 03:56:24.802604] I [MSGID: 106142]
[glusterd-pmap.c:290:pmap_registry_bind] 0-pmap: adding brick /tmp/b5 on port
49152
[2019-07-09 03:56:24.802687] I [rpc-clnt.c:1014:rpc_clnt_connection_init]
0-management: setting frame-timeout to 600
[2019-07-09 03:56:24.806137] I [MSGID: 106131]
[glusterd-proc-mgmt.c:86:glusterd_proc_stop] 0-management: glustershd already
stopped
[2019-07-09 03:56:24.820172] I [MSGID: 106620]
[glusterd-svc-helper.c:922:glusterd_attach_svc] 0-glusterd: adding svc
glustershd (volume=test-vol5) to existing process with pid 18713
[2019-07-09 03:56:24.820288] E [MSGID: 106048]
[glusterd-handshake.c:344:build_volfile_path] 0-management: Couldn't find
volinfo for volid=test-vol5
[2019-07-09 03:56:24.820340] E [MSGID: 106619]
[glusterd-svc-helper.c:796:__glusterd_send_svc_configure_req] 0-management:
Unable to stat  (No such file or directory)

And then we see [2019-07-09 03:56:24.820288] E [MSGID: 106048]
[glusterd-handshake.c:344:build_volfile_path] 0-management: Couldn't find
volinfo for volid=test-vol5 (not to get confused, we modified the log to ensure
volume name is also dumped)

2. This happens only when brick mux is on as per my testing, so probably the
race is more exposed with brick mux.

This is definitely a release blocker and need to be fixed.

--- Additional comment from RHEL Product and Program Management on 2019-07-09
16:15:59 UTC ---

This BZ is being approved for the RHGS 3.5.0 release, upon receipt of the 3
ACKs (PM,Devel,QA) for the release flag 'rhgs-3.5.0', and on being attached to
an official RHGS 3.5.0 BZ Tracker

--- Additional comment from Atin Mukherjee on 2019-07-12 06:37:08 UTC ---

This bug needs to be dropped from 3.5.0 as it's originated due to shd
multiplexing and it was agreed upon that this feature will be dropped from
3.5.0 given it still needs some more soak time and stabilization effort.

--- Additional comment from Sunil Kumar Acharya on 2019-07-12 17:46:52 UTC ---

Moving this bug out of RHGS-3.5.0 as SHD-MUX has been dropped from RHGS-3.5.0
based on the discussion carried out during the program call held on 10-07-2019.

--- Additional comment from INVALID USER on 2019-07-26 10:39:14 IST ---

Description of problem:
When glusterd is stopped on one node and performed volume set operation on only
two volumes(one is upgraded and another volume is newly created.). Restarted
glusterd on the node where it stopped on that node seeing "Couldn't find
volinfo" errors and shd and brick process on the two volume where volume option
is set are offline.


Version-Release number of selected component (if applicable):
mainline

How reproducible:
1/1

Steps to Reproduce:
1. Create 1000 volumes and start them
2. Performed in-service upgrade on the cluster. 
3. Create 1000 more volumes (1X3 replicate) and start them
4. Stop glusterd on node N1.
5. Perform volume set operation on two volumes from n2
'performance.readdir-ahead on'. One which is being used for upgrade and another
newly created volume. 
6. Start glusterd on node N1.

Actual results:
glusterfsd and shd on these two volumes are offline.
seeing "couldn't find vol info" in glusterd logs. 

Expected results:
glusterfsd and shd should be online and should not see any errors with
"Couldn't find volinfo" in glusterd log.

Additional info:

glusterd log snippet
#########################################
[2019-07-05 06:49:29.078858] E [glusterd-handshake.c:343:build_volfile_path]
0-management: Couldn't find volinfo
[2019-07-05 06:49:29.078892] E [MSGID: 106619]
[glusterd-svc-helper.c:791:__glusterd_send_svc_configure_req] 0-management:
Unable to stat  (No such file or directory)
[2019-07-05 06:49:30.079325] E [glusterd-handshake.c:343:build_volfile_path]
0-management: Couldn't find volinfo
[2019-07-05 06:49:31.079862] E [glusterd-handshake.c:343:build_volfile_path]
0-management: Couldn't find volinfo
[2019-07-05 06:49:32.080355] E [glusterd-handshake.c:343:build_volfile_path]
0-management: Couldn't find volinfo
[2019-07-05 06:49:33.080899] E [glusterd-handshake.c:343:build_volfile_path]
0-management: Couldn't find volinfo
[2019-07-05 06:49:34.081374] E [glusterd-handshake.c:343:build_volfile_path]
0-management: Couldn't find volinfo
The message "E [MSGID: 106619]
[glusterd-svc-helper.c:791:__glusterd_send_svc_configure_req] 0-management:
Unable to stat  (No such file or directory)" repeated 5 times between
[2019-07-05 06:49:29.078892] and [2019-07-05 06:49:34.081406]
[2019-07-05 06:49:35.081508] W [MSGID: 106619]
[glusterd-svc-helper.c:931:glusterd_attach_svc] 0-glusterd: attach failed for
glustershd(volume=ronnie_-999)
[2019-07-05 06:49:35.081583] E [MSGID: 106048]
[glusterd-shd-svc.c:477:glusterd_shdsvc_start] 0-glusterd: Failed to attach shd
svc(volume=ronnie_-999) to pid=-1
[2019-07-05 06:49:44.928310] E [MSGID: 106558]
[glusterd-volgen.c:3289:volgen_graph_build_clients] 0-glusterd: volume
inconsistency: brick count is 0
[2019-07-05 06:49:44.928395] E [MSGID: 106068]
[glusterd-shd-svc.c:191:glusterd_shdsvc_create_volfile] 0-management: Failed to
create volfile
[2019-07-05 06:49:44.941696] I [glusterd-utils.c:6560:glusterd_brick_start]
0-management: discovered already-running brick /bricks/brick9/testvol_-999
[2019-07-05 06:49:44.941768] I [MSGID: 106142]
[glusterd-pmap.c:290:pmap_registry_bind] 0-pmap: adding brick
/bricks/brick9/testvol_-999 on port 49155
[2019-07-05 06:49:44.941824] I [rpc-clnt.c:1014:rpc_clnt_connection_init]
0-management: setting frame-timeout to 600
[2019-07-05 06:49:44.945117] I [MSGID: 106620]
[glusterd-svc-helper.c:900:glusterd_attach_svc] 0-glusterd: adding svc
glustershd (volume=testvol_-999) to existing process with pid 590
[2019-07-05 06:49:44.945285] E [glusterd-handshake.c:343:build_volfile_path]
0-management: Couldn't find volinfo
[2019-07-05 06:49:44.945316] E [MSGID: 106619]
[glusterd-svc-helper.c:791:__glusterd_send_svc_configure_req] 0-management:
Unable to stat  (No such file or directory)
[2019-07-05 06:49:45.018727] I [socket.c:3837:socket_submit_outgoing_msg]
0-socket.management: not connected (priv->connected = -1)
[2019-07-05 06:49:45.018748] E [rpcsvc.c:1583:rpcsvc_submit_generic]
0-rpc-service: failed to submit message (XID: 0x2, Program: GlusterD svc cli,
ProgVers: 2, Proc: 27) to rpc-transport (socket.management)
[2019-07-05 06:49:45.018761] E [MSGID: 106430]
[glusterd-utils.c:600:glusterd_submit_reply] 0-glusterd: Reply submission
failed
[2019-07-05 06:49:45.023858] W [MSGID: 106056]
[glusterd-snapshot-utils.c:376:glusterd_snap_volinfo_find_by_volume_id]
0-management: Snap volume not found
[2019-07-05 06:49:45.945716] E [glusterd-handshake.c:343:build_volfile_path]
0-management: Couldn't find volinfo
[2019-07-05 06:49:45.945750] E [MSGID: 106619]
[glusterd-svc-helper.c:791:__glusterd_send_svc_configure_req] 0-management:
Unable to stat  (No such file or directory)
[2019-07-05 06:49:46.447404] E [glusterd-handshake.c:343:build_volfile_path]
0-management: Couldn't find volinfo
[2019-07-05 06:49:46.946292] E [glusterd-handshake.c:343:build_volfile_path]
0-management: Couldn't find volinfo
[2019-07-05 06:49:47.946837] E [glusterd-handshake.c:343:build_volfile_path]
0-management: Couldn't find volinfo
[2019-07-05 06:49:48.947321] E [glusterd-handshake.c:343:build_volfile_path]
0-management: Couldn't find volinfo
[2019-07-05 06:49:49.947840] E [glusterd-handshake.c:343:build_volfile_path]
0-management: Couldn't find volinfo
[2019-07-05 06:49:50.948339] E [glusterd-handshake.c:343:build_volfile_path]
0-management: Couldn't find volinfo
[2019-07-05 06:49:51.948895] E [glusterd-handshake.c:343:build_volfile_path]
0-management: Couldn't find volinfo
[2019-07-05 06:49:52.949406] E [glusterd-handshake.c:343:build_volfile_path]
0-management: Couldn't find volinfo
[2019-07-05 06:49:53.950006] E [glusterd-handshake.c:343:build_volfile_path]
0-management: Couldn't find volinfo
[2019-07-05 06:49:54.950527] E [glusterd-handshake.c:343:build_volfile_path]
0-management: Couldn't find volinfo
[2019-07-05 06:49:55.951051] E [glusterd-handshake.c:343:build_volfile_path]
0-management: Couldn't find volinfo
[2019-07-05 06:49:56.951573] E [glusterd-handshake.c:343:build_volfile_path]
0-management: Couldn't find volinfo
[2019-07-05 06:49:57.952095] E [glusterd-handshake.c:343:build_volfile_path]
0-management: Couldn't find volinfo
[2019-07-05 06:49:58.952597] E [glusterd-handshake.c:343:build_volfile_path]
0-management: Couldn't find volinfo
The message "E [MSGID: 106619]
[glusterd-svc-helper.c:791:__glusterd_send_svc_configure_req] 0-management:
Unable to stat  (No such file or directory)" repeated 13 times between
[2019-07-05 06:49:45.945750] and [2019-07-05 06:49:58.952676]
[2019-07-05 06:49:59.952756] W [MSGID: 106619]
[glusterd-svc-helper.c:931:glusterd_attach_svc] 0-glusterd: attach failed for
glustershd(volume=testvol_-999)
[2019-07-05 06:49:59.953137] E [MSGID: 106048]
[glusterd-shd-svc.c:477:glusterd_shdsvc_start] 0-glusterd: Failed to attach shd
svc(volume=testvol_-999) to pid=590

--- Additional comment from Worker Ant on 2019-07-26 10:44:31 IST ---

REVIEW: https://review.gluster.org/23042 (glusterd: stop stale bricks during
handshaking in brick mux mode) posted (#3) for review on master by Atin
Mukherjee

--- Additional comment from Worker Ant on 2019-08-26 11:57:50 IST ---

REVIEW: https://review.gluster.org/23042 (glusterd: stop stale bricks during
handshaking in brick mux mode) merged (#6) on master by Atin Mukherjee


Referenced Bugs:

https://bugzilla.redhat.com/show_bug.cgi?id=1727269
[Bug 1727269] Setting volume option when one of the glusterd is stopped in the
cluster, post glusterd restart seeing couldn't find vol info in glusterd logs
and shd, brick process offline
https://bugzilla.redhat.com/show_bug.cgi?id=1733425
[Bug 1733425] Setting volume option when one of the glusterd is stopped in the
cluster, post glusterd restart seeing couldn't find vol info in glusterd logs
and shd, brick process offline
-- 
You are receiving this mail because:
You are on the CC list for the bug.
You are the assignee for the bug.


More information about the Bugs mailing list