[Gluster-users] Gluster long healing process

Mahdi Adnan mahdi.adnan at outlook.com
Wed May 3 07:15:45 UTC 2017


Hi,


I have a 4-node Gluster cluster; each node has 24 SSD bricks (across two volumes) and runs Gluster 3.8.10. I updated one of the nodes to 3.8.11 and rebooted it. After it came back online, the healing process started and it has not finished.

It has been 24 hours and the healing is still going. The command gluster vol heal $VOL info returns the number of entries that need healing, and that number decreases and increases randomly.
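
To make the fluctuation easier to follow, the per-brick counts could be summed into one number and sampled periodically. The following is a minimal sketch, assuming the heal info output contains the usual "Number of entries: N" line per brick. The function names and the sample text are illustrative and are not real output from this cluster.

```python
import re
import subprocess


def total_heal_entries(heal_info_output: str) -> int:
    """Sum the per-brick 'Number of entries: N' counts.

    Bricks that report a non-numeric count (for example '-' when a
    brick is unreachable) are skipped rather than counted as zero.
    """
    counts = re.findall(
        r"^Number of entries:\s*(\d+)\s*$",
        heal_info_output,
        flags=re.MULTILINE,
    )
    return sum(int(c) for c in counts)


def run_heal_info(volume: str) -> str:
    """Run the heal info command and return its text output.

    Assumes the gluster CLI is on PATH and the caller has the
    privileges to run it.
    """
    result = subprocess.run(
        ["gluster", "volume", "heal", volume, "info"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


# Illustrative sample only, shaped like heal info output.
SAMPLE = """\
Brick gluster01:/mnt/ovirt_disk1/ovirt_imgs
Status: Connected
Number of entries: 3

Brick gluster03:/mnt/ovirt_disk1/ovirt_imgs
Status: Connected
Number of entries: 0
"""

if __name__ == "__main__":
    print(total_heal_entries(SAMPLE))
```

Calling total_heal_entries(run_heal_info("ovirt_imgs")) every few minutes and logging the result with a timestamp would show whether the total is trending down or only oscillating.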

The node is writing many gigabytes of data, and I don't know whether this is normal or whether I'm missing something.

Volume details:


Volume Name: ovirt_imgs
Type: Distributed-Replicate
Volume ID: 40d1354b-8e85-4464-8c71-9e2efbe10a63
Status: Started
Snapshot Count: 0
Number of Bricks: 26 x 2 = 52
Transport-type: tcp
Bricks:
Brick1: gluster01:/mnt/ovirt_disk1/ovirt_imgs
Brick2: gluster03:/mnt/ovirt_disk1/ovirt_imgs
Brick3: gluster02:/mnt/ovirt_disk1/ovirt_imgs
Brick4: gluster04:/mnt/ovirt_disk1/ovirt_imgs
Brick5: gluster01:/mnt/ovirt_disk2/ovirt_imgs
Brick6: gluster03:/mnt/ovirt_disk2/ovirt_imgs
Brick7: gluster02:/mnt/ovirt_disk2/ovirt_imgs
Brick8: gluster04:/mnt/ovirt_disk2/ovirt_imgs
Brick9: gluster01:/mnt/ovirt_disk3/ovirt_imgs
Brick10: gluster03:/mnt/ovirt_disk3/ovirt_imgs
Brick11: gluster02:/mnt/ovirt_disk3/ovirt_imgs
Brick12: gluster04:/mnt/ovirt_disk3/ovirt_imgs
Brick13: gluster01:/mnt/ovirt_disk4/ovirt_imgs
Brick14: gluster03:/mnt/ovirt_disk4/ovirt_imgs
Brick15: gluster02:/mnt/ovirt_disk4/ovirt_imgs
Brick16: gluster04:/mnt/ovirt_disk4/ovirt_imgs
Brick17: gluster01:/mnt/ovirt_disk5/ovirt_imgs
Brick18: gluster03:/mnt/ovirt_disk5/ovirt_imgs
Brick19: gluster02:/mnt/ovirt_disk5/ovirt_imgs
Brick20: gluster04:/mnt/ovirt_disk5/ovirt_imgs
Brick21: gluster01:/mnt/ovirt_disk6/ovirt_imgs
Brick22: gluster03:/mnt/ovirt_disk6/ovirt_imgs
Brick23: gluster02:/mnt/ovirt_disk6/ovirt_imgs
Brick24: gluster04:/mnt/ovirt_disk6/ovirt_imgs
Brick25: gluster01:/mnt/ovirt_disk7/ovirt_imgs
Brick26: gluster03:/mnt/ovirt_disk7/ovirt_imgs
Brick27: gluster02:/mnt/ovirt_disk7/ovirt_imgs
Brick28: gluster04:/mnt/ovirt_disk7/ovirt_imgs
Brick29: gluster01:/mnt/ovirt_disk8/ovirt_imgs
Brick30: gluster03:/mnt/ovirt_disk8/ovirt_imgs
Brick31: gluster02:/mnt/ovirt_disk8/ovirt_imgs
Brick32: gluster04:/mnt/ovirt_disk8/ovirt_imgs
Brick33: gluster01:/mnt/ovirt_disk9/ovirt_imgs
Brick34: gluster03:/mnt/ovirt_disk9/ovirt_imgs
Brick35: gluster02:/mnt/ovirt_disk9/ovirt_imgs
Brick36: gluster04:/mnt/ovirt_disk9/ovirt_imgs
Brick37: gluster01:/mnt/ovirt_disk10/ovirt_imgs
Brick38: gluster03:/mnt/ovirt_disk10/ovirt_imgs
Brick39: gluster02:/mnt/ovirt_disk10/ovirt_imgs
Brick40: gluster04:/mnt/ovirt_disk10/ovirt_imgs
Brick41: gluster01:/mnt/ovirt_disk11/ovirt_imgs
Brick42: gluster03:/mnt/ovirt_disk11/ovirt_imgs
Brick43: gluster02:/mnt/ovirt_disk11/ovirt_imgs
Brick44: gluster04:/mnt/ovirt_disk11/ovirt_imgs
Brick45: gluster01:/mnt/ovirt_disk12/ovirt_imgs
Brick46: gluster03:/mnt/ovirt_disk12/ovirt_imgs
Brick47: gluster02:/mnt/ovirt_disk12/ovirt_imgs
Brick48: gluster04:/mnt/ovirt_disk12/ovirt_imgs
Brick49: gluster01:/mnt/ovirt_disk13/ovirt_imgs
Brick50: gluster03:/mnt/ovirt_disk13/ovirt_imgs
Brick51: gluster02:/mnt/ovirt_disk13/ovirt_imgs
Brick52: gluster04:/mnt/ovirt_disk13/ovirt_imgs
Options Reconfigured:
ganesha.enable: off
features.cache-invalidation: off
features.shard-block-size: 256MB
storage.owner-gid: 36
storage.owner-uid: 36
user.cifs: off
features.shard: on
cluster.shd-wait-qlength: 10000
cluster.shd-max-threads: 8
cluster.locking-scheme: granular
cluster.data-self-heal-algorithm: full
cluster.server-quorum-type: server
cluster.quorum-type: none
cluster.eager-lock: enable
network.remote-dio: enable
performance.low-prio-threads: 32
performance.stat-prefetch: off
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
cluster.server-quorum-ratio: 51%
nfs-ganesha: enable
cluster.enable-shared-storage: enable
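
In a Distributed-Replicate volume, consecutive bricks in the listing form one replica set, so the pairs affected by a node reboot can be read off the brick order. The following is a minimal sketch of that grouping, assuming a replica count of 2 as shown in the "26 x 2" line. The helper names are hypothetical, and the sample reuses only the first four bricks from the listing above.

```python
import re


def replica_sets(volume_info: str, replica_count: int = 2):
    """Group 'BrickN: host:/path' lines into replica sets, in listing order."""
    bricks = re.findall(
        r"^Brick\d+:\s*(\S+)\s*$", volume_info, flags=re.MULTILINE
    )
    return [
        tuple(bricks[i:i + replica_count])
        for i in range(0, len(bricks), replica_count)
    ]


def sets_involving(host: str, sets):
    """Return the replica sets that have at least one brick on the given host."""
    return [
        s for s in sets
        if any(brick.split(":", 1)[0] == host for brick in s)
    ]


# First four bricks copied from the volume info above.
SAMPLE = """\
Bricks:
Brick1: gluster01:/mnt/ovirt_disk1/ovirt_imgs
Brick2: gluster03:/mnt/ovirt_disk1/ovirt_imgs
Brick3: gluster02:/mnt/ovirt_disk1/ovirt_imgs
Brick4: gluster04:/mnt/ovirt_disk1/ovirt_imgs
"""

if __name__ == "__main__":
    for pair in sets_involving("gluster01", replica_sets(SAMPLE)):
        print(" <-> ".join(pair))
```

With the full listing, this shows gluster01 paired with gluster03 and gluster02 paired with gluster04 on every disk, so a reboot of one node leaves pending heals only on the sets shared with its partner node.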



OS: CentOS 7.3, latest updates.



Gluster heal log sample:


[2017-05-03 07:01:29.487108] I [rpc-clnt.c:1965:rpc_clnt_reconfig] 0-ovirt_imgs-client-45: changing port to 49571 (from 0)
[2017-05-03 07:01:29.489004] I [MSGID: 114020] [client.c:2356:notify] 0-ovirt_imgs-client-47: parent translators are ready, attempting connect on transport
[2017-05-03 07:01:29.491077] I [MSGID: 114046] [client-handshake.c:1216:client_setvolume_cbk] 0-ovirt_imgs-client-44: Connected to ovirt_imgs-client-44, attached to remote volume '/mnt/ovirt_disk12/ovirt_imgs'.
[2017-05-03 07:01:29.491092] I [MSGID: 114047] [client-handshake.c:1227:client_setvolume_cbk] 0-ovirt_imgs-client-44: Server and Client lk-version numbers are not same, reopening the fds
[2017-05-03 07:01:29.491123] I [MSGID: 108005] [afr-common.c:4387:afr_notify] 0-ovirt_imgs-replicate-22: Subvolume 'ovirt_imgs-client-44' came back up; going online.
[2017-05-03 07:01:29.491173] I [MSGID: 114035] [client-handshake.c:202:client_set_lk_version_cbk] 0-ovirt_imgs-client-44: Server lk version = 1
[2017-05-03 07:01:29.491280] I [MSGID: 114057] [client-handshake.c:1440:select_server_supported_programs] 0-ovirt_imgs-client-45: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2017-05-03 07:01:29.491331] I [rpc-clnt.c:1965:rpc_clnt_reconfig] 0-ovirt_imgs-client-46: changing port to 49521 (from 0)
[2017-05-03 07:01:29.493119] I [MSGID: 114020] [client.c:2356:notify] 0-ovirt_imgs-client-48: parent translators are ready, attempting connect on transport
[2017-05-03 07:01:29.495480] I [MSGID: 114046] [client-handshake.c:1216:client_setvolume_cbk] 0-ovirt_imgs-client-45: Connected to ovirt_imgs-client-45, attached to remote volume '/mnt/ovirt_disk12/ovirt_imgs'.
[2017-05-03 07:01:29.495496] I [MSGID: 114047] [client-handshake.c:1227:client_setvolume_cbk] 0-ovirt_imgs-client-45: Server and Client lk-version numbers are not same, reopening the fds
[2017-05-03 07:01:29.495670] I [MSGID: 114057] [client-handshake.c:1440:select_server_supported_programs] 0-ovirt_imgs-client-46: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2017-05-03 07:01:29.495729] I [MSGID: 114035] [client-handshake.c:202:client_set_lk_version_cbk] 0-ovirt_imgs-client-45: Server lk version = 1
[2017-05-03 07:01:29.495798] I [rpc-clnt.c:1965:rpc_clnt_reconfig] 0-ovirt_imgs-client-47: changing port to 49465 (from 0)
[2017-05-03 07:01:29.497438] I [MSGID: 114020] [client.c:2356:notify] 0-ovirt_imgs-client-49: parent translators are ready, attempting connect on transport
[2017-05-03 07:01:29.499871] I [MSGID: 114046] [client-handshake.c:1216:client_setvolume_cbk] 0-ovirt_imgs-client-46: Connected to ovirt_imgs-client-46, attached to remote volume '/mnt/ovirt_disk12/ovirt_imgs'.
[2017-05-03 07:01:29.499887] I [MSGID: 114047] [client-handshake.c:1227:client_setvolume_cbk] 0-ovirt_imgs-client-46: Server and Client lk-version numbers are not same, reopening the fds
[2017-05-03 07:01:29.499915] I [MSGID: 108005] [afr-common.c:4387:afr_notify] 0-ovirt_imgs-replicate-23: Subvolume 'ovirt_imgs-client-46' came back up; going online.
[2017-05-03 07:01:29.500015] I [MSGID: 114035] [client-handshake.c:202:client_set_lk_version_cbk] 0-ovirt_imgs-client-46: Server lk version = 1
[2017-05-03 07:01:29.500032] I [rpc-clnt.c:1965:rpc_clnt_reconfig] 0-ovirt_imgs-client-48: changing port to 49645 (from 0)
[2017-05-03 07:01:29.500052] I [MSGID: 114057] [client-handshake.c:1440:select_server_supported_programs] 0-ovirt_imgs-client-47: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2017-05-03 07:01:29.501776] I [MSGID: 114020] [client.c:2356:notify] 0-ovirt_imgs-client-50: parent translators are ready, attempting connect on transport
[2017-05-03 07:01:29.504191] I [MSGID: 114046] [client-handshake.c:1216:client_setvolume_cbk] 0-ovirt_imgs-client-47: Connected to ovirt_imgs-client-47, attached to remote volume '/mnt/ovirt_disk12/ovirt_imgs'.
[2017-05-03 07:01:29.504208] I [MSGID: 114047] [client-handshake.c:1227:client_setvolume_cbk] 0-ovirt_imgs-client-47: Server and Client lk-version numbers are not same, reopening the fds
[2017-05-03 07:01:29.504313] I [MSGID: 114035] [client-handshake.c:202:client_set_lk_version_cbk] 0-ovirt_imgs-client-47: Server lk version = 1
[2017-05-03 07:01:29.504330] I [MSGID: 114057] [client-handshake.c:1440:select_server_supported_programs] 0-ovirt_imgs-client-48: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2017-05-03 07:01:29.504462] I [rpc-clnt.c:1965:rpc_clnt_reconfig] 0-ovirt_imgs-client-49: changing port to 49572 (from 0)
[2017-05-03 07:01:29.506374] I [MSGID: 114020] [client.c:2356:notify] 0-ovirt_imgs-client-51: parent translators are ready, attempting connect on transport
[2017-05-03 07:01:29.508431] I [MSGID: 114046] [client-handshake.c:1216:client_setvolume_cbk] 0-ovirt_imgs-client-48: Connected to ovirt_imgs-client-48, attached to remote volume '/mnt/ovirt_disk13/ovirt_imgs'.
[2017-05-03 07:01:29.508456] I [MSGID: 114047] [client-handshake.c:1227:client_setvolume_cbk] 0-ovirt_imgs-client-48: Server and Client lk-version numbers are not same, reopening the fds
[2017-05-03 07:01:29.508498] I [MSGID: 108005] [afr-common.c:4387:afr_notify] 0-ovirt_imgs-replicate-24: Subvolume 'ovirt_imgs-client-48' came back up; going online.
[2017-05-03 07:01:29.508556] I [MSGID: 114035] [client-handshake.c:202:client_set_lk_version_cbk] 0-ovirt_imgs-client-48: Server lk version = 1
[2017-05-03 07:01:29.508603] I [MSGID: 114057] [client-handshake.c:1440:select_server_supported_programs] 0-ovirt_imgs-client-49: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2017-05-03 07:01:29.508725] I [rpc-clnt.c:1965:rpc_clnt_reconfig] 0-ovirt_imgs-client-50: changing port to 49522 (from 0)
[2017-05-03 07:01:29.510779] I [MSGID: 114046] [client-handshake.c:1216:client_setvolume_cbk] 0-ovirt_imgs-client-49: Connected to ovirt_imgs-client-49, attached to remote volume '/mnt/ovirt_disk13/ovirt_imgs'.
[2017-05-03 07:01:29.510796] I [MSGID: 114047] [client-handshake.c:1227:client_setvolume_cbk] 0-ovirt_imgs-client-49: Server and Client lk-version numbers are not same, reopening the fds
[2017-05-03 07:01:29.510903] I [MSGID: 114035] [client-handshake.c:202:client_set_lk_version_cbk] 0-ovirt_imgs-client-49: Server lk version = 1
[2017-05-03 07:01:29.511062] I [rpc-clnt.c:1965:rpc_clnt_reconfig] 0-ovirt_imgs-client-51: changing port to 49466 (from 0)
[2017-05-03 07:01:29.512828] I [MSGID: 114057] [client-handshake.c:1440:select_server_supported_programs] 0-ovirt_imgs-client-50: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2017-05-03 07:01:29.513197] I [MSGID: 114046] [client-handshake.c:1216:client_setvolume_cbk] 0-ovirt_imgs-client-50: Connected to ovirt_imgs-client-50, attached to remote volume '/mnt/ovirt_disk13/ovirt_imgs'.
[2017-05-03 07:01:29.513214] I [MSGID: 114047] [client-handshake.c:1227:client_setvolume_cbk] 0-ovirt_imgs-client-50: Server and Client lk-version numbers are not same, reopening the fds
[2017-05-03 07:01:29.513236] I [MSGID: 108005] [afr-common.c:4387:afr_notify] 0-ovirt_imgs-replicate-25: Subvolume 'ovirt_imgs-client-50' came back up; going online.
[2017-05-03 07:01:29.513314] I [MSGID: 114035] [client-handshake.c:202:client_set_lk_version_cbk] 0-ovirt_imgs-client-50: Server lk version = 1
[2017-05-03 07:01:29.515127] I [MSGID: 114057] [client-handshake.c:1440:select_server_supported_programs] 0-ovirt_imgs-client-51: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2017-05-03 07:01:29.515520] I [MSGID: 114046] [client-handshake.c:1216:client_setvolume_cbk] 0-ovirt_imgs-client-51: Connected to ovirt_imgs-client-51, attached to remote volume '/mnt/ovirt_disk13/ovirt_imgs'.
[2017-05-03 07:01:29.515530] I [MSGID: 114047] [client-handshake.c:1227:client_setvolume_cbk] 0-ovirt_imgs-client-51: Server and Client lk-version numbers are not same, reopening the fds
[2017-05-03 07:01:29.515628] I [MSGID: 114035] [client-handshake.c:202:client_set_lk_version_cbk] 0-ovirt_imgs-client-51: Server lk version = 1
[2017-05-03 07:01:30.009624] I [MSGID: 114046] [client-handshake.c:1216:client_setvolume_cbk] 0-ovirt_imgs-client-40: Connected to ovirt_imgs-client-40, attached to remote volume '/mnt/ovirt_disk11/ovirt_imgs'.
[2017-05-03 07:01:30.009653] I [MSGID: 114047] [client-handshake.c:1227:client_setvolume_cbk] 0-ovirt_imgs-client-40: Server and Client lk-version numbers are not same, reopening the fds
[2017-05-03 07:01:30.234722] I [MSGID: 114035] [client-handshake.c:202:client_set_lk_version_cbk] 0-ovirt_imgs-client-40: Server lk version = 1
[2017-05-03 07:01:30.235633] I [MSGID: 108031] [afr-common.c:2157:afr_local_discovery_cbk] 0-ovirt_imgs-replicate-0: selecting local read_child ovirt_imgs-client-0
[2017-05-03 07:01:30.236983] I [MSGID: 108031] [afr-common.c:2157:afr_local_discovery_cbk] 0-ovirt_imgs-replicate-2: selecting local read_child ovirt_imgs-client-4
[2017-05-03 07:01:30.237492] I [MSGID: 108031] [afr-common.c:2157:afr_local_discovery_cbk] 0-ovirt_imgs-replicate-4: selecting local read_child ovirt_imgs-client-8
[2017-05-03 07:01:30.238310] I [MSGID: 108031] [afr-common.c:2157:afr_local_discovery_cbk] 0-ovirt_imgs-replicate-6: selecting local read_child ovirt_imgs-client-12
[2017-05-03 07:01:30.238553] I [MSGID: 108031] [afr-common.c:2157:afr_local_discovery_cbk] 0-ovirt_imgs-replicate-8: selecting local read_child ovirt_imgs-client-16
[2017-05-03 07:01:30.238670] I [MSGID: 108031] [afr-common.c:2157:afr_local_discovery_cbk] 0-ovirt_imgs-replicate-10: selecting local read_child ovirt_imgs-client-20
[2017-05-03 07:01:30.238791] I [MSGID: 108031] [afr-common.c:2157:afr_local_discovery_cbk] 0-ovirt_imgs-replicate-12: selecting local read_child ovirt_imgs-client-24
[2017-05-03 07:01:30.238881] I [MSGID: 108031] [afr-common.c:2157:afr_local_discovery_cbk] 0-ovirt_imgs-replicate-14: selecting local read_child ovirt_imgs-client-28
[2017-05-03 07:01:30.238961] I [MSGID: 108031] [afr-common.c:2157:afr_local_discovery_cbk] 0-ovirt_imgs-replicate-16: selecting local read_child ovirt_imgs-client-32
[2017-05-03 07:01:30.239014] I [MSGID: 108031] [afr-common.c:2157:afr_local_discovery_cbk] 0-ovirt_imgs-replicate-18: selecting local read_child ovirt_imgs-client-36
[2017-05-03 07:01:30.239100] I [MSGID: 108031] [afr-common.c:2157:afr_local_discovery_cbk] 0-ovirt_imgs-replicate-22: selecting local read_child ovirt_imgs-client-44
[2017-05-03 07:01:30.239140] I [MSGID: 108031] [afr-common.c:2157:afr_local_discovery_cbk] 0-ovirt_imgs-replicate-20: selecting local read_child ovirt_imgs-client-40
[2017-05-03 07:01:30.239150] I [MSGID: 104041] [glfs-resolve.c:885:__glfs_active_subvol] 0-ovirt_imgs: switched to graph 676c7573-7465-7230-312d-31333836322d (0)
[2017-05-03 07:01:30.239200] I [MSGID: 108031] [afr-common.c:2157:afr_local_discovery_cbk] 0-ovirt_imgs-replicate-24: selecting local read_child ovirt_imgs-client-48
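
Since every line in this sample is at the informational (I) level, a quick summary by severity can help confirm that no warnings or errors are hidden in a longer log. The following is a minimal sketch, assuming each line follows the layout seen above: timestamp, level letter, optional MSGID, source location, then "domain: message". The regular expression and function name are my own and are not part of any Gluster tool.

```python
import re
from collections import Counter

# Layout assumed from the sample: [ts] LEVEL [MSGID: n] [file:line:func] domain: msg
LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\s+(?P<level>[A-Z])\s+"
    r"(?:\[MSGID:\s*(?P<msgid>\d+)\]\s+)?"
    r"\[(?P<src>[^\]]+)\]\s+(?P<domain>\S+):\s+(?P<msg>.*)$"
)


def summarize(lines):
    """Count lines per level and list the domains that logged 'came back up'."""
    levels = Counter()
    came_back = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip blank or unrecognized lines
        levels[match.group("level")] += 1
        if "came back up" in match.group("msg"):
            came_back.append(match.group("domain"))
    return levels, came_back


# Three lines copied from the log sample above.
SAMPLE = [
    "[2017-05-03 07:01:29.487108] I [rpc-clnt.c:1965:rpc_clnt_reconfig] 0-ovirt_imgs-client-45: changing port to 49571 (from 0)",
    "[2017-05-03 07:01:29.491123] I [MSGID: 108005] [afr-common.c:4387:afr_notify] 0-ovirt_imgs-replicate-22: Subvolume 'ovirt_imgs-client-44' came back up; going online.",
    "[2017-05-03 07:01:30.235633] I [MSGID: 108031] [afr-common.c:2157:afr_local_discovery_cbk] 0-ovirt_imgs-replicate-0: selecting local read_child ovirt_imgs-client-0",
]

if __name__ == "__main__":
    levels, came_back = summarize(SAMPLE)
    print(dict(levels))
    print(came_back)
```

Any W or E counts in the result would point to the lines worth reading first.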






I appreciate the help.


Thanks

--

Respectfully
Mahdi A. Mahdi
