[Bugs] [Bug 1769216] New: glusterfsd fails to come online after rebooting two storage nodes at the same time

bugzilla at redhat.com bugzilla at redhat.com
Wed Nov 6 07:59:31 UTC 2019


https://bugzilla.redhat.com/show_bug.cgi?id=1769216

            Bug ID: 1769216
           Summary: glusterfsd fails to come online after rebooting two
                    storage nodes at the same time
           Product: GlusterFS
           Version: 7
          Hardware: x86_64
                OS: Linux
            Status: NEW
         Component: glusterd
          Severity: urgent
          Assignee: bugs at gluster.org
          Reporter: zz.sh.cynthia at gmail.com
                CC: bugs at gluster.org
  Target Milestone: ---
    Classification: Community



Created attachment 1633224
  --> https://bugzilla.redhat.com/attachment.cgi?id=1633224&action=edit
glusterfsd process log

Description of problem:

During recent testing on GlusterFS 7, I found that after rebooting the
storage nodes, the volume status is often wrong even once glusterd and
glusterfsd are back up: both the glusterd and glusterfsd processes are
alive, yet the "gluster v status" command shows the glusterfsd process as N/A.

Version-Release number of selected component (if applicable):


How reproducible:
Often (see description).

Steps to Reproduce:
1. Reboot all storage nodes at the same time.
2. Wait for all nodes to come back up.
3. Run "gluster v status all" (a rough automation sketch follows below).
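
For reference, a minimal sketch of the steps above; the node names match
this cluster, but the ssh-based reboot/wait loop is an assumption about the
test setup:

#!/bin/bash
# Hypothetical reproduction helper (node names taken from this report).
NODES="mn-0.local mn-1.local dbm-0.local"

# 1. Reboot all storage nodes at (roughly) the same time.
for n in $NODES; do ssh "$n" reboot & done
wait

# 2. Wait until every node answers ssh again.
for n in $NODES; do
    until ssh -o ConnectTimeout=5 "$n" true 2>/dev/null; do sleep 10; done
done

# 3. Check whether any brick failed to come online.
gluster v status all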

Actual results:

Some volumes' glusterfsd bricks fail to come online (shown as N/A).

Expected results:

All glusterfsd processes come online.

Additional info:
# gluster v status ccs
Status of volume: ccs
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick mn-0.local:/mnt/bricks/ccs/brick      N/A       N/A        N       N/A  
Brick mn-1.local:/mnt/bricks/ccs/brick      53952     0          Y       2065 
Brick dbm-0.local:/mnt/bricks/ccs/brick     N/A       N/A        N       N/A  
Self-heal Daemon on localhost               N/A       N/A        Y       4940 
Self-heal Daemon on dbm-0.local             N/A       N/A        N       N/A  
Self-heal Daemon on mn-1.local              N/A       N/A        Y       2537 

Task Status of Volume ccs
------------------------------------------------------------------------------
There are no active volume tasks
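
Note: the offline bricks can also be listed programmatically; a minimal
sketch, assuming the column layout shown above (the "Online" flag is the
second-to-last field on Brick lines):

# print the path of every brick whose Online column is "N"
gluster v status all | awk '/^Brick/ && $(NF-1) == "N" {print $2}'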
# ps -ef | grep glusterfsd| grep ccs
root      1764     1  0 09:10 ?        00:00:07 /usr/sbin/glusterfsd -s
mn-0.local --volfile-id ccs.mn-0.local.mnt-bricks-ccs-brick -p
/var/run/gluster/vols/ccs/mn-0.local-mnt-bricks-ccs-brick.pid -S
/var/run/gluster/7ea87ceb0a781684.socket --brick-name /mnt/bricks/ccs/brick -l
/var/log/glusterfs/bricks/mnt-bricks-ccs-brick.log --log-level TRACE
--xlator-option *-posix.glusterd-uuid=ebaded6d-91d5-4873-a60a-59bbcc813714
--process-name brick --brick-port 53952 --xlator-option
ccs-server.listen-port=53952 --xlator-option
transport.socket.bind-address=mn-0.local
[root at mn-0:/var/log/storageinfo/symptom_log]
# netstat -anlp| grep 1764 
tcp        0      0 192.168.1.6:53952       0.0.0.0:*               LISTEN     
1764/glusterfsd     
tcp        0      0 192.168.1.6:53952       192.168.1.11:49058      ESTABLISHED
1764/glusterfsd     
tcp        0      0 192.168.1.6:53952       192.168.1.6:49069       ESTABLISHED
1764/glusterfsd     
tcp        0      0 192.168.1.6:53952       192.168.1.33:49139      ESTABLISHED
1764/glusterfsd     
tcp        0      0 192.168.1.6:53952       192.168.1.12:49136      ESTABLISHED
1764/glusterfsd     
tcp        0      0 192.168.1.6:53952       192.168.1.16:49139      ESTABLISHED
1764/glusterfsd     
tcp        0      0 192.168.1.6:53952       192.168.1.23:49145      ESTABLISHED
1764/glusterfsd     
tcp        0      0 192.168.1.6:53952       192.168.1.5:49052       ESTABLISHED
1764/glusterfsd     
tcp        0      0 192.168.1.6:53952       192.168.1.8:49113       ESTABLISHED
1764/glusterfsd     
tcp        0      0 192.168.1.6:53952       192.168.1.7:49104       ESTABLISHED
1764/glusterfsd     
tcp        0      0 192.168.1.6:53952       192.168.1.6:49056       ESTABLISHED
1764/glusterfsd     
tcp        0      0 192.168.1.6:53952       192.168.1.6:49082       ESTABLISHED
1764/glusterfsd     
tcp        0      0 192.168.1.6:53952       192.168.1.29:49144      ESTABLISHED
1764/glusterfsd     
tcp        0      0 192.168.1.6:53952       192.168.1.5:49045       ESTABLISHED
1764/glusterfsd     
tcp        0      0 192.168.1.6:53952       192.168.1.11:49100      ESTABLISHED
1764/glusterfsd     
tcp        0      0 192.168.1.6:49149       192.168.1.6:24007       ESTABLISHED
1764/glusterfsd     
unix  2      [ ACC ]     STREAM     LISTENING     25405    1764/glusterfsd     
/var/run/gluster/7ea87ceb0a781684.socket
unix  2      [ ACC ]     STREAM     LISTENING     40159    1764/glusterfsd     
/var/run/gluster/changelog-25ddbf533d927939.sock
unix  3      [ ]         STREAM     CONNECTED     41282    1764/glusterfsd     
/var/run/gluster/7ea87ceb0a781684.socket
unix  2      [ ]         DGRAM                    26910    1764/glusterfsd      
[root at mn-0:/var/log/storageinfo/symptom_log]
# gluster v info ccs

Volume Name: ccs
Type: Replicate
Volume ID: 521261bc-2cba-4e7b-a21a-8486712d7a31
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: mn-0.local:/mnt/bricks/ccs/brick
Brick2: mn-1.local:/mnt/bricks/ccs/brick
Brick3: dbm-0.local:/mnt/bricks/ccs/brick
Options Reconfigured:
diagnostics.brick-log-level: TRACE
cluster.self-heal-daemon: on
nfs.disable: on
storage.fips-mode-rchecksum: on
transport.address-family: inet
cluster.server-quorum-type: none
cluster.quorum-type: auto
cluster.quorum-reads: true
cluster.consistent-metadata: on
server.allow-insecure: on
network.ping-timeout: 42
cluster.favorite-child-policy: mtime
cluster.heal-timeout: 60
performance.client-io-threads: off
cluster.metadata-self-heal: on
cluster.data-self-heal: on
cluster.entry-self-heal: on
cluster.server-quorum-ratio: 51%


[Some analysis based on the attached logs]
From glusterd.log:
[2019-11-06 07:10:42.708849] D [MSGID: 0]
[glusterd-utils.c:6625:glusterd_restart_bricks] 0-management: starting the
volume ccs  --------- glusterd starts the glusterfsd process here
…
[2019-11-06 07:10:43.710937] T [socket.c:226:socket_dump_info] 0-management:
$$$ client: connecting to (af:1,sock:12)
/var/run/gluster/7ea87ceb0a781684.socket non-SSL (errno:0:Success)  -- does
this mean the connection to glusterfsd was successful?
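
This TRACE line is logged around the connect attempt, so on its own it may
not prove the handshake completed. One way to verify independently that the
brick's unix socket accepts connections (a manual diagnostic sketch; assumes
socat is available on the node):

# try connecting to the brick's unix domain socket, then exit immediately
socat - UNIX-CONNECT:/var/run/gluster/7ea87ceb0a781684.socket < /dev/null \
    && echo "socket accepted the connection"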


From glusterfsd.log:
[2019-11-06 07:10:42.779208] T [socket.c:226:socket_dump_info]
0-socket.glusterfsd: $$$ client: listening on (af:1,sock:7)
/var/run/gluster/7ea87ceb0a781684.socket non-SSL (errno:0:Success)  ------ I
think this means the glusterfsd unix domain socket is ready to accept
connections.
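
For what it's worth, when bricks stay stuck like this, forcing a volume
start is a common way to get them re-registered with glusterd; this is a
workaround rather than a root-cause fix, and whether it applies to this
particular failure is an assumption:

# ask glusterd to (re)start any offline bricks of the volume
gluster volume start ccs force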

-- 
You are receiving this mail because:
You are on the CC list for the bug.
You are the assignee for the bug.

