[Gluster-users] brick is down but gluster volume status says it's fine

Alastair Neil ajneil.tech at gmail.com
Tue Oct 24 19:32:12 UTC 2017


It looks like this is down to the stale port issue.

I think it's pretty clear from the output below that volume status reports
the digitalcorpora brick process on gluster-2 as having the same TCP port,
49156, as the public volume brick, while it is actually listening on 49154.
So although the brick process is technically up, nothing is talking to it.
I am surprised I don't see more errors in the brick log for brick8/public.
It also explains the whack-a-mole problem: every time I kill and restart the
daemon it must be grabbing the port of another brick, and then that volume's
brick goes silent.
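
As a sanity check, a loop like the one below flags any local brick whose
advertised port differs from the one its glusterfsd is actually bound to.
It's only a rough sketch: the awk parsing assumes each Brick line in the
status output is unwrapped and ends with the PID, so adjust to taste.

    #!/bin/bash
    # For each brick of the given volume, compare the TCP port advertised
    # by "gluster volume status" with the port its PID really listens on.
    # Remote bricks are skipped because netstat won't find their PIDs here.
    vol=$1
    gluster volume status "$vol" | awk '/^Brick/ {print $(NF-3), $NF}' |
    while read port pid; do
        actual=$(netstat -pant 2>/dev/null | awk -v p="$pid/glusterfsd" \
            '$6 == "LISTEN" && $7 == p {sub(/.*:/, "", $4); print $4; exit}')
        if [ -n "$actual" ] && [ "$actual" != "$port" ]; then
            echo "PID $pid: status says port $port, listening on $actual"
        fi
    done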

I killed all the brick processes and restarted glusterd, and everything came
back up OK.
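
For the record, the recovery on the affected node amounted to the commands
below (assuming a standard systemd-managed install; on a replica-3 volume
the other two bricks keep serving while the local ones restart):

    # Kill every brick process on this node so its ports are freed, then
    # restart glusterd, which respawns each brick and re-registers the
    # port it actually binds.
    pkill glusterfsd
    systemctl restart glusterd
    gluster volume status    # confirm the ports and PIDs now agree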


[root at gluster-2 ~]# glv status digitalcorpora | grep -v ^Self
Status of volume: digitalcorpora
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick gluster-2:/export/brick7/digitalcorpora               49156     0          Y       125708
Brick gluster1.vsnet.gmu.edu:/export/brick7/digitalcorpora  49152     0          Y       12345
Brick gluster0:/export/brick7/digitalcorpora                49152     0          Y       16098

Task Status of Volume digitalcorpora
------------------------------------------------------------------------------
There are no active volume tasks

[root at gluster-2 ~]# glv status public  | grep -v ^Self
Status of volume: public
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick gluster1:/export/brick8/public        49156     0          Y       3519
Brick gluster2:/export/brick8/public        49156     0          Y       8578
Brick gluster0:/export/brick8/public        49156     0          Y       3176

Task Status of Volume public
------------------------------------------------------------------------------
There are no active volume tasks

[root at gluster-2 ~]# netstat -pant | grep 8578 | grep 0.0.0.0
tcp        0      0 0.0.0.0:49156           0.0.0.0:*               LISTEN      8578/glusterfsd
[root at gluster-2 ~]# netstat -pant | grep 125708 | grep 0.0.0.0
tcp        0      0 0.0.0.0:49154           0.0.0.0:*               LISTEN      125708/glusterfsd
[root at gluster-2 ~]# ps -c  --pid  125708 8578
   PID CLS PRI TTY      STAT   TIME COMMAND
  8578 TS   19 ?        Ssl  224:20 /usr/sbin/glusterfsd -s gluster2 --volfile-id public.gluster2.export-brick8-public -p /var/lib/glusterd/vols/public/run/gluster2-export-bric
125708 TS   19 ?        Ssl    0:08 /usr/sbin/glusterfsd -s gluster-2 --volfile-id digitalcorpora.gluster-2.export-brick7-digitalcorpora -p /var/lib/glusterd/vols/digitalcorpor
[root at gluster-2 ~]#


On 24 October 2017 at 13:56, Atin Mukherjee <amukherj at redhat.com> wrote:

>
>
> On Tue, Oct 24, 2017 at 11:13 PM, Alastair Neil <ajneil.tech at gmail.com>
> wrote:
>
>> Gluster version 3.10.6, replica 3 volume; the daemon is present but does
>> not appear to be functioning.
>>
>> Peculiar behaviour: if I kill the glusterfs brick daemon and restart
>> glusterd then the brick becomes available, but then one of my other
>> volumes' bricks on the same server goes down in the same way. It's like
>> whack-a-mole.
>>
>> any ideas?
>>
>
> The subject and the data look contradictory to me. The brick log (what
> you shared) doesn't have a cleanup_and_exit() trigger for a shutdown. Are
> you sure the brick is down? OTOH, I see a port mismatch for
> brick7/digitalcorpora, where the brick process has 49154 but gluster volume
> status shows 49156. There is an issue with stale ports which we're trying
> to address through https://review.gluster.org/18541 . But could you specify
> what exactly the problem is? Is it the stale port, or the conflict between
> the volume status output and actual brick health? If it's the latter, I'd
> need further information, such as the output of the "gluster get-state"
> command from the same node.
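
For anyone following along, collecting that state dump looks roughly like
this (a sketch from memory of the 3.10 syntax; the odir/file arguments are
optional, and with none given the command writes a timestamped file under
/var/run/gluster/ and prints its path):

    # Dump glusterd's local view of volumes, bricks, ports and PIDs.
    gluster get-state glusterd odir /tmp file gluster-2-state.txt
    # Compare the port recorded for each brick with what netstat shows.
    less /tmp/gluster-2-state.txt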
>
>
>>
>> [root at gluster-2 bricks]# glv status digitalcorpora
>>
>>> Status of volume: digitalcorpora
>>> Gluster process                             TCP Port  RDMA Port  Online  Pid
>>> ------------------------------------------------------------------------------
>>> Brick gluster-2:/export/brick7/digitalcorpora               49156  0  Y  125708
>>> Brick gluster1.vsnet.gmu.edu:/export/brick7/digitalcorpora  49152  0  Y  12345
>>> Brick gluster0:/export/brick7/digitalcorpora                49152  0  Y  16098
>>> Self-heal Daemon on localhost                               N/A    N/A Y  126625
>>> Self-heal Daemon on gluster1                                N/A    N/A Y  15405
>>> Self-heal Daemon on gluster0                                N/A    N/A Y  18584
>>>
>>> Task Status of Volume digitalcorpora
>>> ------------------------------------------------------------------------------
>>> There are no active volume tasks
>>>
>>> [root at gluster-2 bricks]# glv heal digitalcorpora info
>>> Brick gluster-2:/export/brick7/digitalcorpora
>>> Status: Transport endpoint is not connected
>>> Number of entries: -
>>>
>>> Brick gluster1.vsnet.gmu.edu:/export/brick7/digitalcorpora
>>> /.trashcan
>>> /DigitalCorpora/hello2.txt
>>> /DigitalCorpora
>>> Status: Connected
>>> Number of entries: 3
>>>
>>> Brick gluster0:/export/brick7/digitalcorpora
>>> /.trashcan
>>> /DigitalCorpora/hello2.txt
>>> /DigitalCorpora
>>> Status: Connected
>>> Number of entries: 3
>>>
>>> [2017-10-24 17:18:48.288505] W [glusterfsd.c:1360:cleanup_and_exit]
>>> (-->/lib64/libpthread.so.0(+0x7e25) [0x7f6f83c9de25]
>>> -->/usr/sbin/glusterfsd(glusterfs_sigwaiter+0xe5) [0x55a148eeb135]
>>> -->/usr/sbin/glusterfsd(cleanup_and_exit+0x6b) [0x55a148eeaf5b] ) 0-:
>>> received signum (15), shutting down
>>> [2017-10-24 17:18:59.270384] I [MSGID: 100030] [glusterfsd.c:2503:main]
>>> 0-/usr/sbin/glusterfsd: Started running /usr/sbin/glusterfsd version 3.10.6
>>> (args: /usr/sbin/glusterfsd -s gluster-2 --volfile-id
>>> digitalcorpora.gluster-2.export-brick7-digitalcorpora -p
>>> /var/lib/glusterd/vols/digitalcorpora/run/gluster-2-export-brick7-digitalcorpora.pid
>>> -S /var/run/gluster/f8e0b3393e47dc51a07c6609f9b40841.socket
>>> --brick-name /export/brick7/digitalcorpora -l /var/log/glusterfs/bricks/export-brick7-digitalcorpora.log
>>> --xlator-option *-posix.glusterd-uuid=032c17f5-8cc9-445f-aa45-897b5a066b43
>>> --brick-port 49154 --xlator-option digitalcorpora-server.listen-port=49154)
>>> [2017-10-24 17:18:59.285279] I [MSGID: 101190]
>>> [event-epoll.c:629:event_dispatch_epoll_worker] 0-epoll: Started thread
>>> with index 1
>>> [2017-10-24 17:19:04.611723] I [rpcsvc.c:2237:rpcsvc_set_outstanding_rpc_limit]
>>> 0-rpc-service: Configured rpc.outstanding-rpc-limit with value 64
>>> [2017-10-24 17:19:04.611815] W [MSGID: 101002]
>>> [options.c:954:xl_opt_validate] 0-digitalcorpora-server: option
>>> 'listen-port' is deprecated, preferred is 'transport.socket.listen-port',
>>> continuing with correction
>>> [2017-10-24 17:19:04.615974] W [MSGID: 101174]
>>> [graph.c:361:_log_if_unknown_option] 0-digitalcorpora-server: option
>>> 'rpc-auth.auth-glusterfs' is not recognized
>>> [2017-10-24 17:19:04.616033] W [MSGID: 101174]
>>> [graph.c:361:_log_if_unknown_option] 0-digitalcorpora-server: option
>>> 'rpc-auth.auth-unix' is not recognized
>>> [2017-10-24 17:19:04.616070] W [MSGID: 101174]
>>> [graph.c:361:_log_if_unknown_option] 0-digitalcorpora-server: option
>>> 'rpc-auth.auth-null' is not recognized
>>> [2017-10-24 17:19:04.616134] W [MSGID: 101174]
>>> [graph.c:361:_log_if_unknown_option] 0-digitalcorpora-server: option
>>> 'auth-path' is not recognized
>>> [2017-10-24 17:19:04.616177] W [MSGID: 101174]
>>> [graph.c:361:_log_if_unknown_option] 0-digitalcorpora-server: option
>>> 'ping-timeout' is not recognized
>>> [2017-10-24 17:19:04.616203] W [MSGID: 101174]
>>> [graph.c:361:_log_if_unknown_option] 0-/export/brick7/digitalcorpora:
>>> option 'rpc-auth-allow-insecure' is not recognized
>>> [2017-10-24 17:19:04.616215] W [MSGID: 101174]
>>> [graph.c:361:_log_if_unknown_option] 0-/export/brick7/digitalcorpora:
>>> option 'auth.addr./export/brick7/digitalcorpora.allow' is not recognized
>>> [2017-10-24 17:19:04.616226] W [MSGID: 101174]
>>> [graph.c:361:_log_if_unknown_option] 0-/export/brick7/digitalcorpora:
>>> option 'auth-path' is not recognized
>>> [2017-10-24 17:19:04.616237] W [MSGID: 101174]
>>> [graph.c:361:_log_if_unknown_option] 0-/export/brick7/digitalcorpora:
>>> option 'auth.login.b17f2513-7d9c-4174-a0c5-de4a752d46ca.password' is
>>> not recognized
>>> [2017-10-24 17:19:04.616248] W [MSGID: 101174]
>>> [graph.c:361:_log_if_unknown_option] 0-/export/brick7/digitalcorpora:
>>> option 'auth.login./export/brick7/digitalcorpora.allow' is not
>>> recognized
>>> [2017-10-24 17:19:04.616283] W [MSGID: 101174]
>>> [graph.c:361:_log_if_unknown_option] 0-digitalcorpora-quota: option
>>> 'timeout' is not recognized
>>> [2017-10-24 17:19:04.616367] W [MSGID: 101174]
>>> [graph.c:361:_log_if_unknown_option] 0-digitalcorpora-trash: option
>>> 'brick-path' is not recognized
>>> Final graph:
>>> +------------------------------------------------------------------------------+
>>>   1: volume digitalcorpora-posix
>>>   2:     type storage/posix
>>>   3:     option glusterd-uuid 032c17f5-8cc9-445f-aa45-897b5a066b43
>>>   4:     option directory /export/brick7/digitalcorpora
>>>   5:     option volume-id 61efe58a-ae5b-4d8b-b9f9-67829867c442
>>>   6:     option brick-uid 36
>>>   7:     option brick-gid 36
>>>   8: end-volume
>>>   9:
>>>  10: volume digitalcorpora-trash
>>>  11:     type features/trash
>>>  12:     option trash-dir .trashcan
>>>  13:     option brick-path /export/brick7/digitalcorpora
>>>  14:     option trash-internal-op off
>>>  15:     subvolumes digitalcorpora-posix
>>>  16: end-volume
>>>  17:
>>>  18: volume digitalcorpora-changetimerecorder
>>>  19:     type features/changetimerecorder
>>>  20:     option db-type sqlite3
>>>  21:     option hot-brick off
>>>  22:     option db-name digitalcorpora.db
>>>  23:     option db-path /export/brick7/digitalcorpora/.glusterfs/
>>>  24:     option record-exit off
>>>  25:     option ctr_link_consistency off
>>>  26:     option ctr_lookupheal_link_timeout 300
>>>  27:     option ctr_lookupheal_inode_timeout 300
>>>  28:     option record-entry on
>>>  29:     option ctr-enabled off
>>>  30:     option record-counters off
>>>  31:     option ctr-record-metadata-heat off
>>>  32:     option sql-db-cachesize 12500
>>>  33:     option sql-db-wal-autocheckpoint 25000
>>>  34:     subvolumes digitalcorpora-trash
>>>  35: end-volume
>>>  36:
>>>  37: volume digitalcorpora-changelog
>>>  38:     type features/changelog
>>>  39:     option changelog-brick /export/brick7/digitalcorpora
>>>  40:     option changelog-dir /export/brick7/digitalcorpora/.glusterfs/changelogs
>>>  41:     option changelog-barrier-timeout 120
>>>  42:     subvolumes digitalcorpora-changetimerecorder
>>>  43: end-volume
>>>  44:
>>>  45: volume digitalcorpora-bitrot-stub
>>>  46:     type features/bitrot-stub
>>>  47:     option export /export/brick7/digitalcorpora
>>>  48:     subvolumes digitalcorpora-changelog
>>>  49: end-volume
>>>  50:
>>>  51: volume digitalcorpora-access-control
>>>  52:     type features/access-control
>>>  53:     subvolumes digitalcorpora-bitrot-stub
>>>  54: end-volume
>>>  55:
>>>  56: volume digitalcorpora-locks
>>>  57:     type features/locks
>>>  58:     subvolumes digitalcorpora-access-control
>>>  59: end-volume
>>>  60:
>>>  61: volume digitalcorpora-worm
>>>  62:     type features/worm
>>>  63:     option worm off
>>>  64:     option worm-file-level off
>>>  65:     subvolumes digitalcorpora-locks
>>>  66: end-volume
>>>  67:
>>>  68: volume digitalcorpora-read-only
>>>  69:     type features/read-only
>>>  70:     option read-only off
>>>  71:     subvolumes digitalcorpora-worm
>>>  72: end-volume
>>>  73:
>>>  74: volume digitalcorpora-leases
>>>  75:     type features/leases
>>>  76:     option leases off
>>>  77:     subvolumes digitalcorpora-read-only
>>>  78: end-volume
>>>  79:
>>>  80: volume digitalcorpora-upcall
>>>  81:     type features/upcall
>>>  82:     option cache-invalidation off
>>>  83:     subvolumes digitalcorpora-leases
>>>  84: end-volume
>>>  85:
>>>  86: volume digitalcorpora-io-threads
>>>  87:     type performance/io-threads
>>>  88:     subvolumes digitalcorpora-upcall
>>>  89: end-volume
>>>  90:
>>>  91: volume digitalcorpora-marker
>>>  92:     type features/marker
>>>  93:     option volume-uuid 61efe58a-ae5b-4d8b-b9f9-67829867c442
>>>  94:     option timestamp-file /var/lib/glusterd/vols/digitalcorpora/marker.tstamp
>>>  95:     option quota-version 0
>>>  96:     option xtime off
>>>  97:     option gsync-force-xtime off
>>>  98:     option quota off
>>>  99:     option inode-quota off
>>> 100:     subvolumes digitalcorpora-io-threads
>>> 101: end-volume
>>> 102:
>>> 103: volume digitalcorpora-barrier
>>> 104:     type features/barrier
>>> 105:     option barrier disable
>>> 106:     option barrier-timeout 120
>>> 107:     subvolumes digitalcorpora-marker
>>> 108: end-volume
>>> 109:
>>> 110: volume digitalcorpora-index
>>> 111:     type features/index
>>> 112:     option index-base /export/brick7/digitalcorpora/.glusterfs/indices
>>> 113:     option xattrop-dirty-watchlist trusted.afr.dirty
>>> 114:     option xattrop-pending-watchlist trusted.afr.digitalcorpora-
>>> 115:     subvolumes digitalcorpora-barrier
>>> 116: end-volume
>>> 117:
>>> 118: volume digitalcorpora-quota
>>> 119:     type features/quota
>>> 120:     option volume-uuid digitalcorpora
>>> 121:     option server-quota off
>>> 122:     option timeout 0
>>> 123:     option deem-statfs off
>>> 124:     subvolumes digitalcorpora-index
>>> 125: end-volume
>>> 126:
>>> 127: volume digitalcorpora-io-stats
>>> 128:     type debug/io-stats
>>> 129:     option unique-id /export/brick7/digitalcorpora
>>> 130:     option log-level WARNING
>>> 131:     option latency-measurement off
>>> 132:     option count-fop-hits off
>>> 133:     subvolumes digitalcorpora-quota
>>> 134: end-volume
>>> 135:
>>> 136: volume /export/brick7/digitalcorpora
>>> 137:     type performance/decompounder
>>> 138:     option rpc-auth-allow-insecure on
>>> 139:     option auth.addr./export/brick7/digitalcorpora.allow 129.174.125.204,129.174.93.204
>>> 140:     option auth-path /export/brick7/digitalcorpora
>>> 141:     option auth.login.b17f2513-7d9c-4174-a0c5-de4a752d46ca.password 6c007ad0-b5a2-4564-8464-300f8317e5c7
>>> 142:     option auth.login./export/brick7/digitalcorpora.allow b17f2513-7d9c-4174-a0c5-de4a752d46ca
>>> 143:     subvolumes digitalcorpora-io-stats
>>> 144: end-volume
>>> 145:
>>> 146: volume digitalcorpora-server
>>> 147:     type protocol/server
>>> 148:     option transport.socket.listen-port 49154
>>> 149:     option rpc-auth.auth-glusterfs on
>>> 150:     option rpc-auth.auth-unix on
>>> 151:     option rpc-auth.auth-null on
>>> 152:     option transport-type tcp
>>> 153:     option transport.address-family inet
>>> 154:     option auth.login./export/brick7/digitalcorpora.allow b17f2513-7d9c-4174-a0c5-de4a752d46ca
>>> 155:     option auth.login.b17f2513-7d9c-4174-a0c5-de4a752d46ca.password 6c007ad0-b5a2-4564-8464-300f8317e5c7
>>> 156:     option auth-path /export/brick7/digitalcorpora
>>> 157:     option auth.addr./export/brick7/digitalcorpora.allow 129.174.125.204,129.174.93.204
>>> 158:     option ping-timeout 42
>>> 159:     option transport.socket.keepalive 1
>>> 160:     option rpc-auth-allow-insecure on
>>> 161:     option transport.tcp-user-timeout 0
>>> 162:     option transport.socket.keepalive-time 20
>>> 163:     option transport.socket.keepalive-interval 2
>>> 164:     option transport.socket.keepalive-count 9
>>> 165:     subvolumes /export/brick7/digitalcorpora
>>> 166: end-volume
>>> 167:
>>> +------------------------------------------------------------------------------+
>>> [2017-10-24 17:22:21.438620] W [socket.c:593:__socket_rwv] 0-glusterfs:
>>> readv on 129.174.126.87:24007 failed (No data available)
>>>
>>
>>
>> _______________________________________________
>> Gluster-users mailing list
>> Gluster-users at gluster.org
>> http://lists.gluster.org/mailman/listinfo/gluster-users
>>
>
>

