<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">On 17 October 2017 at 14:48, Stephen Remde <span dir="ltr">&lt;<a href="mailto:stephen.remde@gaist.co.uk" target="_blank">stephen.remde@gaist.co.uk</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div><pre id="m_-3761321276761938441gmail-hterm:copy-to-clipboard-source"><pre id="m_-3761321276761938441gmail-hterm:copy-to-clipboard-source"><pre id="m_-3761321276761938441gmail-hterm:copy-to-clipboard-source"><font color="#000000">Hi,</font></pre><pre id="m_-3761321276761938441gmail-hterm:copy-to-clipboard-source"><font color="#000000">
I have a rebalance that has failed on one peer twice now. Rebalance logs are below (directories anonymised and some irrelevant log lines cut). It looks like the rebalance loses its connection to the brick but immediately stops on that peer instead of waiting for the reconnection, which happens a second or so later.
Is this normal behaviour? So far it has been the same server and the same (remote) brick. </font></pre><pre id="m_-3761321276761938441gmail-hterm:copy-to-clipboard-source"><font color="#000000">
The brick shows a high number of disconnects compared to the other bricks on the same server:</font></pre><pre id="m_-3761321276761938441gmail-hterm:copy-to-clipboard-source"><font color="#000000">
./export-md0-brick.log.1      2      
./export-md1-brick.log.1      2      
./export-md2-brick.log.1    181 
./export-md3-brick.log.1      2      </font></pre><pre id="m_-3761321276761938441gmail-hterm:copy-to-clipboard-source"><font color="#000000">
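(Counts like these can be produced with a quick grep over the rotated brick logs. The following is a minimal self-contained sketch using fabricated sample files; the log directory and the matched string are assumptions, so adjust them for a real server, where brick logs normally live under /var/log/glusterfs/bricks/.)

```shell
# Sketch: count disconnect messages per rotated brick log.
# Uses a temp dir with fabricated sample lines so it runs anywhere;
# point logdir at your real brick log directory instead.
logdir=$(mktemp -d)
printf 'disconnecting connection from client-1\nunrelated line\n' > "$logdir/export-md2-brick.log.1"
printf 'unrelated line\n' > "$logdir/export-md0-brick.log.1"

# One count per file, matching the (assumed) disconnect message text:
for f in "$logdir"/*.log.1; do
  printf '%s\t%s\n' "${f##*/}" "$(grep -c 'disconnecting connection' "$f")"
done

rm -rf "$logdir"
```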
Any clues? There is nothing in the log to indicate what is causing this.</font></pre><pre id="m_-3761321276761938441gmail-hterm:copy-to-clipboard-source"><font color="#000000"></font></pre></pre></pre></div></div></blockquote><div>The rebalance process requires all DHT child subvols to be up for the duration of the operation, because it needs to reapply the directory layouts. As this is a pure distribute volume, even a single brick disconnecting is enough to stop the process.</div><div><br></div><div>You would need to figure out why that brick is disconnecting so often. The brick logs might help with that.</div><div><br></div><div>Regards,</div><div>Nithya</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div><pre id="m_-3761321276761938441gmail-hterm:copy-to-clipboard-source"><pre id="m_-3761321276761938441gmail-hterm:copy-to-clipboard-source"><pre id="m_-3761321276761938441gmail-hterm:copy-to-clipboard-source"><font color="#000000">
Steve


gluster volume info video                                                                                                                                                                                                                     
 
Volume Name: video
Type: Distribute
Volume ID: ccdac37f-9b0e-415f-b62e-<wbr>9071d8168199
Status: Started
Snapshot Count: 0
Number of Bricks: 9
Transport-type: tcp
Bricks:
Brick1: 10.0.0.31:/export/md0/brick
Brick2: 10.0.0.32:/export/md0/brick
Brick3: 10.0.0.31:/export/md1/brick
Brick4: 10.0.0.32:/export/md1/brick
Brick5: 10.0.0.31:/export/md2/brick
Brick6: 10.0.0.32:/export/md2/brick
Brick7: 10.0.0.31:/export/md3/brick
Brick8: 10.0.0.32:/export/md3/brick
Brick9: 10.0.0.33:/export/md0/brick
Options Reconfigured:
network.ping-timeout: 10
cluster.min-free-disk: 1%
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
cluster.rebal-throttle: lazy

[2017-10-12 23:00:55.099153] W [socket.c:590:__socket_rwv] 0-video-client-4: readv on <a href="http://10.0.0.31:49164" target="_blank">10.0.0.31:49164</a> failed (Connection reset by peer)
[2017-10-12 23:00:55.099709] I [MSGID: 114018] [client.c:2280:client_rpc_<wbr>notify] 0-video-client-4: disconnected from video-client-4. Client process will keep trying to connect to glusterd until brick&#39;s port is available
[2017-10-12 23:00:55.099741] W [MSGID: 109073] [dht-common.c:8839:dht_notify] 0-video-dht: Received CHILD_DOWN. Exiting
[2017-10-12 23:00:55.099752] I [MSGID: 109029] [dht-rebalance.c:4195:gf_<wbr>defrag_stop] 0-: Received stop command on rebalance
[2017-10-12 23:01:05.478462] I [rpc-clnt.c:1947:rpc_clnt_<wbr>reconfig] 0-video-client-4: changing port to 49164 (from 0)
[2017-10-12 23:01:05.481180] I [MSGID: 114057] [client-handshake.c:1446:<wbr>select_server_supported_<wbr>programs] 0-video-client-4: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2017-10-12 23:01:05.482630] I [MSGID: 114046] [client-handshake.c:1222:<wbr>client_setvolume_cbk] 0-video-client-4: Connected to video-client-4, attached to remote volume &#39;/export/md2/brick&#39;.
[2017-10-12 23:01:05.482659] I [MSGID: 114047] [client-handshake.c:1233:<wbr>client_setvolume_cbk] 0-video-client-4: Server and Client lk-version numbers are not same, reopening the fds
[2017-10-12 23:01:05.483365] I [MSGID: 114035] [client-handshake.c:201:<wbr>client_set_lk_version_cbk] 0-video-client-4: Server lk version = 1
[2017-10-12 23:01:30.310089] I [dht-rebalance.c:2819:gf_<wbr>defrag_process_dir] 0-DHT: Found critical error from gf_defrag_get_entry
[2017-10-12 23:01:30.310166] E [MSGID: 109111] [dht-rebalance.c:3090:gf_<wbr>defrag_fix_layout] 0-video-dht: gf_defrag_process_dir failed for directory: /y/y/y/y/y
[2017-10-12 23:01:30.380574] E [MSGID: 109016] [dht-rebalance.c:3267:gf_<wbr>defrag_fix_layout] 0-video-dht: Fix layout failed for /y/y/y/y/y
[2017-10-12 23:01:30.380756] E [MSGID: 109016] [dht-rebalance.c:3267:gf_<wbr>defrag_fix_layout] 0-video-dht: Fix layout failed for /y/y/y/y
[2017-10-12 23:01:30.380879] E [MSGID: 109016] [dht-rebalance.c:3267:gf_<wbr>defrag_fix_layout] 0-video-dht: Fix layout failed for /y/y/y
[2017-10-12 23:01:30.380965] E [MSGID: 109016] [dht-rebalance.c:3267:gf_<wbr>defrag_fix_layout] 0-video-dht: Fix layout failed for /y/y
[2017-10-12 23:03:09.285157] W [glusterfsd.c:1327:cleanup_<wbr>and_exit] (--&gt;/lib/x86_64-linux-gnu/<wbr>libpthread.so.0(+0x76ba) [0x7f112b6d16ba] --&gt;/usr/sbin/glusterfs(<wbr>glusterfs_sigwaiter+0xe5) [0x55b325019545] --&gt;/usr/sbin/glusterfs(<wbr>cleanup_and_exit+0x54) [0x55b3250193b4] ) 0-: received signum (15), shutting down

[2017-10-17 03:20:28.921512] W [socket.c:590:__socket_rwv] 0-video-client-4: readv on <a href="http://10.0.0.31:49164" target="_blank">10.0.0.31:49164</a> failed (Connection reset by peer)
[2017-10-17 03:20:28.921554] I [MSGID: 114018] [client.c:2280:client_rpc_<wbr>notify] 0-video-client-4: disconnected from video-client-4. Client process will keep trying to connect to glusterd until brick&#39;s port is available
[2017-10-17 03:20:28.921570] W [MSGID: 109073] [dht-common.c:8839:dht_notify] 0-video-dht: Received CHILD_DOWN. Exiting
[2017-10-17 03:20:28.921578] I [MSGID: 109029] [dht-rebalance.c:4195:gf_<wbr>defrag_stop] 0-: Received stop command on rebalance
[2017-10-17 03:20:39.344417] I [rpc-clnt.c:1947:rpc_clnt_<wbr>reconfig] 0-video-client-4: changing port to 49164 (from 0)
[2017-10-17 03:20:39.347440] I [MSGID: 114057] [client-handshake.c:1446:<wbr>select_server_supported_<wbr>programs] 0-video-client-4: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2017-10-17 03:20:39.349244] I [MSGID: 114046] [client-handshake.c:1222:<wbr>client_setvolume_cbk] 0-video-client-4: Connected to video-client-4, attached to remote volume &#39;/export/md2/brick&#39;.
[2017-10-17 03:20:39.349261] I [MSGID: 114047] [client-handshake.c:1233:<wbr>client_setvolume_cbk] 0-video-client-4: Server and Client lk-version numbers are not same, reopening the fds
[2017-10-17 03:20:39.350611] I [MSGID: 114035] [client-handshake.c:201:<wbr>client_set_lk_version_cbk] 0-video-client-4: Server lk version = 1
[2017-10-17 03:27:17.231133] I [dht-rebalance.c:2819:gf_<wbr>defrag_process_dir] 0-DHT: Found critical error from gf_defrag_get_entry
[2017-10-17 03:27:17.231214] E [MSGID: 109111] [dht-rebalance.c:3090:gf_<wbr>defrag_fix_layout] 0-video-dht: gf_defrag_process_dir failed for directory: /x/x/x/x/x
[2017-10-17 03:27:17.562481] E [MSGID: 109016] [dht-rebalance.c:3267:gf_<wbr>defrag_fix_layout] 0-video-dht: Fix layout failed for /x/x/x/x/x
[2017-10-17 03:27:17.562619] E [MSGID: 109016] [dht-rebalance.c:3267:gf_<wbr>defrag_fix_layout] 0-video-dht: Fix layout failed for /x/x/x/x
[2017-10-17 03:27:17.562726] E [MSGID: 109016] [dht-rebalance.c:3267:gf_<wbr>defrag_fix_layout] 0-video-dht: Fix layout failed for /x/x/x
[2017-10-17 03:27:17.562810] E [MSGID: 109016] [dht-rebalance.c:3267:gf_<wbr>defrag_fix_layout] 0-video-dht: Fix layout failed for /x/x
[2017-10-17 03:27:18.379825] W [glusterfsd.c:1327:cleanup_<wbr>and_exit] (--&gt;/lib/x86_64-linux-gnu/<wbr>libpthread.so.0(+0x76ba) [0x7f700b9696ba] --&gt;/usr/sbin/glusterfs(<wbr>glusterfs_sigwaiter+0xe5) [0x55f9c0022545] --&gt;/usr/sbin/glusterfs(<wbr>cleanup_and_exit+0x54) [0x55f9c00223b4] ) 0-: received signum (15), shutting down

</font></pre></pre></pre></div><div class="m_-3761321276761938441gmail_signature"><div dir="ltr"><div><div dir="ltr"></div></div></div></div>
</div>
<br>______________________________<wbr>_________________<br>
Gluster-users mailing list<br>
<a href="mailto:Gluster-users@gluster.org">Gluster-users@gluster.org</a><br>
<a href="http://lists.gluster.org/mailman/listinfo/gluster-users" rel="noreferrer" target="_blank">http://lists.gluster.org/<wbr>mailman/listinfo/gluster-users</a><br></blockquote></div><br></div></div>