[Gluster-users] gluster peer probe error (v3.6.2)

Andreas andreas.hollaus at ericsson.com
Tue Mar 24 15:05:03 UTC 2015


Hi,

The problem seems to be solved now(!). We discovered that the global options file
(/var/lib/glusterd/options) generated an error in the log file:
> [1970-01-01 00:01:24.423024] E [glusterd-utils.c:5760:glusterd_compare_friend_da
> ta] 0-management: Importing global options failed
For some reason the file had been missing before, but that didn't cause glusterd any major
problems (apart from an error message about the missing file). However, when an empty
options file was created to get rid of that error message, this new message appeared, which
seems to be more serious than the previous one. Once the line 'global-option-version=0' was
added to the options file, all of those error messages disappeared and 'gluster peer probe'
started to work as expected again. Not that obvious, at least not to me.
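
In case anyone else runs into this, here is a rough sketch of the fix (run on the node
where the options file was empty; stop glusterd first and adjust paths to your installation):

# killall glusterd
# echo 'global-option-version=0' > /var/lib/glusterd/options
# /usr/sbin/glusterd -p /var/run/glusterd.pid
# gluster peer probe 10.32.1.144

After that, 'gluster peer status' should show the peer as 'Peer in Cluster (Connected)'.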

Anyway, thanks for your efforts in trying to solve this problem.


Regards
Andreas

On 03/24/15 10:32, Andreas wrote:
> Sure I am. Unfortunately it didn't change the result...
>
> # killall glusterd
> # ps -ef | grep gluster
> root     15755   657  0 18:35 ttyS0    00:00:00 grep gluster
> # rm /var/lib/glusterd/peers/*  
> #  /usr/sbin/glusterd -p /var/run/glusterd.pid
> # gluster peer probe 10.32.1.144
> #
> (I killed glusterd and removed the files on both servers.)
>
> Regards
> Andreas
>
>
> On 03/24/15 05:36, Atin Mukherjee wrote:
>> If you are okay with doing a fresh set-up, I would recommend cleaning up
>> /var/lib/glusterd/peers/* and restarting glusterd on both nodes, and then
>> trying the peer probe again.
>>
>> ~Atin
>>
>> On 03/23/2015 06:44 PM, Andreas wrote:
>>> Hi,
>>>
>>> # gluster peer detach 10.32.1.144
>>> (No output here. Similar to the problem with 'gluster peer probe'.)
>>> # gluster peer detach 10.32.1.144 force
>>> peer detach: failed: Peer is already being detached from cluster.
>>> Check peer status by running gluster peer status
>>> # gluster peer status
>>> Number of Peers: 1
>>>
>>> Hostname: 10.32.1.144
>>> Uuid: 82cdb873-28cc-4ed0-8cfe-2b6275770429
>>> State: Probe Sent to Peer (Connected)
>>>
>>> # ping 10.32.1.144
>>> PING 10.32.1.144 (10.32.1.144): 56 data bytes
>>> 64 bytes from 10.32.1.144: seq=0 ttl=64 time=1.811 ms
>>> 64 bytes from 10.32.1.144: seq=1 ttl=64 time=1.834 ms
>>> ^C
>>> --- 10.32.1.144 ping statistics ---
>>> 2 packets transmitted, 2 packets received, 0% packet loss
>>> round-trip min/avg/max = 1.811/1.822/1.834 ms
>>>
>>>
>>> As previously stated, this problem seems to be similar to what I experienced with
>>> 'gluster peer probe'. I can reboot the server, but the situation stays the same
>>> (I've tried this many times).
>>> Any ideas of which ports to investigate, and how to check them to get the most reliable result?
>>> Anything else that could cause this?
>>>
>>>
>>>
>>> Regards
>>> Andreas
>>>
>>>
>>> On 03/23/15 11:10, Atin Mukherjee wrote:
>>>> On 03/23/2015 03:28 PM, Andreas Hollaus wrote:
>>>>> Hi,
>>>>>
>>>>> This network problem is persistent. However, I can ping the server, so I guess it
>>>>> depends on the port number, right?
>>>>> I tried to telnet to port 24007, but I was not sure how to interpret the result, as I
>>>>> got no response and no timeout (it just seemed to be waiting for something).
>>>>> That's why I decided to install nmap, but according to that tool the port was
>>>>> accessible. Are there any other ports that are vital to gluster peer probe?
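>>>>> For the record, the checks looked roughly like this (exact nmap options may differ;
>>>>> 24007 is the glusterd management port):
>>>>> # telnet 10.32.1.144 24007
>>>>> # nmap -p 24007 10.32.1.144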
>>>>>
>>>>> When you say 'deprobe', I guess you mean 'gluster peer detach'? That command shows
>>>>> similar behaviour to gluster peer probe.
>>>> Yes, I meant peer detach. How about 'gluster peer detach force'?
>>>>> Regards
>>>>> Andreas
>>>>>
>>>>> On 03/23/15 05:34, Atin Mukherjee wrote:
>>>>>> On 03/22/2015 07:11 PM, Andreas Hollaus wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> I hope that these are the logs that you requested.
>>>>>>>
>>>>>>> Logs from 10.32.0.48:
>>>>>>> ------------------------------
>>>>>>> # more /var/log/glusterfs/.cmd_log_history
>>>>>>> [2015-03-19 13:52:03.277438]  : peer probe 10.32.1.144 : FAILED : Probe returned
>>>>>>>  with unknown errno -1
>>>>>>>
>>>>>>> # more /var/log/glusterfs/etc-glusterfs-glusterd.vol.log
>>>>>>> [2015-03-19 13:41:31.241768] I [MSGID: 100030] [glusterfsd.c:2018:main] 0-/usr/s
>>>>>>> bin/glusterd: Started running /usr/sbin/glusterd version 3.6.2 (args: /usr/sbin/
>>>>>>> glusterd -p /var/run/glusterd.pid)
>>>>>>> [2015-03-19 13:41:31.245352] I [glusterd.c:1214:init] 0-management: Maximum allo
>>>>>>> wed open file descriptors set to 65536
>>>>>>> [2015-03-19 13:41:31.245432] I [glusterd.c:1259:init] 0-management: Using /var/l
>>>>>>> ib/glusterd as working directory
>>>>>>> [2015-03-19 13:41:31.247826] I [glusterd-store.c:2063:glusterd_restore_op_versio
>>>>>>> n] 0-management: Detected new install. Setting op-version to maximum : 30600
>>>>>>> [2015-03-19 13:41:31.247902] I [glusterd-store.c:3497:glusterd_store_retrieve_mi
>>>>>>> ssed_snaps_list] 0-management: No missed snaps list.
>>>>>>> Final graph:
>>>>>>> +------------------------------------------------------------------------------+
>>>>>>>   1: volume management
>>>>>>>   2:     type mgmt/glusterd
>>>>>>>   3:     option rpc-auth.auth-glusterfs on
>>>>>>>   4:     option rpc-auth.auth-unix on
>>>>>>>   5:     option rpc-auth.auth-null on
>>>>>>>   6:     option transport.socket.listen-backlog 128
>>>>>>>   7:     option ping-timeout 30
>>>>>>>   8:     option transport.socket.read-fail-log off
>>>>>>>   9:     option transport.socket.keepalive-interval 2
>>>>>>>  10:     option transport.socket.keepalive-time 10
>>>>>>>  11:     option transport-type socket
>>>>>>>  12:     option working-directory /var/lib/glusterd
>>>>>>>  13: end-volume
>>>>>>>  14: 
>>>>>>> +------------------------------------------------------------------------------+
>>>>>>> [2015-03-19 13:42:02.258403] I [glusterd-handler.c:1015:__glusterd_handle_cli_pr
>>>>>>> obe] 0-glusterd: Received CLI probe req 10.32.1.144 24007
>>>>>>> [2015-03-19 13:42:02.259456] I [glusterd-handler.c:3165:glusterd_probe_begin] 0-
>>>>>>> glusterd: Unable to find peerinfo for host: 10.32.1.144 (24007)
>>>>>>> [2015-03-19 13:42:02.259664] I [rpc-clnt.c:969:rpc_clnt_connection_init] 0-manag
>>>>>>> ement: setting frame-timeout to 600
>>>>>>> [2015-03-19 13:42:02.260488] I [glusterd-handler.c:3098:glusterd_friend_add] 0-m
>>>>>>> anagement: connect returned 0
>>>>>>> [2015-03-19 13:42:02.270316] I [glusterd.c:176:glusterd_uuid_generate_save] 0-ma
>>>>>>> nagement: generated UUID: 4441e237-89d6-4cdf-a212-f17ecb953b58
>>>>>>> [2015-03-19 13:42:02.273427] I [glusterd-rpc-ops.c:244:__glusterd_probe_cbk] 0-m
>>>>>>> anagement: Received probe resp from uuid: 82cdb873-28cc-4ed0-8cfe-2b6275770429,
>>>>>>> host: 10.32.1.144
>>>>>>> [2015-03-19 13:42:02.273681] I [glusterd-rpc-ops.c:386:__glusterd_probe_cbk] 0-g
>>>>>>> lusterd: Received resp to probe req
>>>>>>> [2015-03-19 13:42:02.278863] I [glusterd-handshake.c:1119:__glusterd_mgmt_hndsk_
>>>>>>> versions_ack] 0-management: using the op-version 30600
>>>>>>> [2015-03-19 13:52:03.277422] E [rpc-clnt.c:201:call_bail] 0-management: bailing
>>>>>>> out frame type(Peer mgmt) op(--(2)) xid = 0x6 sent = 2015-03-19 13:42:02.273482.
>>>>>>>  timeout = 600 for 10.32.1.144:24007
>>>>>> Here is the issue: there was some problem in the network at the time
>>>>>> the peer probe was issued, which is why the call bail is seen. Could you
>>>>>> try to deprobe and then probe again?
>>>>>>> [2015-03-19 13:52:03.277453] I [socket.c:3366:socket_submit_reply] 0-socket.mana
>>>>>>> gement: not connected (priv->connected = 255)
>>>>>>> [2015-03-19 13:52:03.277468] E [rpcsvc.c:1247:rpcsvc_submit_generic] 0-rpc-servi
>>>>>>> ce: failed to submit message (XID: 0x1, Program: GlusterD svc cli, ProgVers: 2,
>>>>>>> Proc: 1) to rpc-transport (socket.management)
>>>>>>> [2015-03-19 13:52:03.277483] E [glusterd-utils.c:387:glusterd_submit_reply] 0-:
>>>>>>> Reply submission failed
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Logs from 10.32.1.144:
>>>>>>> ---------------------------------
>>>>>>> # more ./.cmd_log_history
>>>>>>>
>>>>>>> # more ./etc-glusterfs-glusterd.vol.log
>>>>>>> [1970-01-01 00:00:53.225739] I [MSGID: 100030] [glusterfsd.c:2018:main] 0-/usr/s
>>>>>>> bin/glusterd: Started running /usr/sbin/glusterd version 3.6.2 (args: /usr/sbin/
>>>>>>> glusterd -p /var/run/glusterd.pid)
>>>>>>> [1970-01-01 00:00:53.229222] I [glusterd.c:1214:init] 0-management: Maximum allo
>>>>>>> wed open file descriptors set to 65536
>>>>>>> [1970-01-01 00:00:53.229301] I [glusterd.c:1259:init] 0-management: Using /var/l
>>>>>>> ib/glusterd as working directory
>>>>>>> [1970-01-01 00:00:53.231653] I [glusterd-store.c:2063:glusterd_restore_op_versio
>>>>>>> n] 0-management: Detected new install. Setting op-version to maximum : 30600
>>>>>>> [1970-01-01 00:00:53.231730] I [glusterd-store.c:3497:glusterd_store_retrieve_mi
>>>>>>> ssed_snaps_list] 0-management: No missed snaps list.
>>>>>>> Final graph:
>>>>>>> +------------------------------------------------------------------------------+
>>>>>>>   1: volume management
>>>>>>>   2:     type mgmt/glusterd
>>>>>>>   3:     option rpc-auth.auth-glusterfs on
>>>>>>>   4:     option rpc-auth.auth-unix on
>>>>>>>   5:     option rpc-auth.auth-null on
>>>>>>>   6:     option transport.socket.listen-backlog 128
>>>>>>>   7:     option ping-timeout 30
>>>>>>>   8:     option transport.socket.read-fail-log off
>>>>>>>   9:     option transport.socket.keepalive-interval 2
>>>>>>>  10:     option transport.socket.keepalive-time 10
>>>>>>>  11:     option transport-type socket
>>>>>>>  12:     option working-directory /var/lib/glusterd
>>>>>>>  13: end-volume
>>>>>>>  14: 
>>>>>>> +------------------------------------------------------------------------------+
>>>>>>> [1970-01-01 00:01:24.417689] I [glusterd-handshake.c:1119:__glusterd_mgmt_hndsk_
>>>>>>> versions_ack] 0-management: using the op-version 30600
>>>>>>> [1970-01-01 00:01:24.417736] I [glusterd.c:176:glusterd_uuid_generate_save] 0-ma
>>>>>>> nagement: generated UUID: 82cdb873-28cc-4ed0-8cfe-2b6275770429
>>>>>>> [1970-01-01 00:01:24.420067] I [glusterd-handler.c:2523:__glusterd_handle_probe_
>>>>>>> query] 0-glusterd: Received probe from uuid: 4441e237-89d6-4cdf-a212-f17ecb953b5
>>>>>>> 8
>>>>>>> [1970-01-01 00:01:24.420158] I [glusterd-handler.c:2551:__glusterd_handle_probe_
>>>>>>> query] 0-glusterd: Unable to find peerinfo for host: 10.32.0.48 (24007)
>>>>>>> [1970-01-01 00:01:24.420379] I [rpc-clnt.c:969:rpc_clnt_connection_init] 0-manag
>>>>>>> ement: setting frame-timeout to 600
>>>>>>> [1970-01-01 00:01:24.421140] I [glusterd-handler.c:3098:glusterd_friend_add] 0-m
>>>>>>> anagement: connect returned 0
>>>>>>> [1970-01-01 00:01:24.421167] I [glusterd-handler.c:2575:__glusterd_handle_probe_
>>>>>>> query] 0-glusterd: Responded to 10.32.0.48, op_ret: 0, op_errno: 0, ret: 0
>>>>>>> [1970-01-01 00:01:24.422991] I [glusterd-handler.c:2216:__glusterd_handle_incomi
>>>>>>> ng_friend_req] 0-glusterd: Received probe from uuid: 4441e237-89d6-4cdf-a212-f17
>>>>>>> ecb953b58
>>>>>>> [1970-01-01 00:01:24.423024] E [glusterd-utils.c:5760:glusterd_compare_friend_da
>>>>>>> ta] 0-management: Importing global options failed
>>>>>>> [1970-01-01 00:01:24.423036] E [glusterd-sm.c:1078:glusterd_friend_sm] 0-gluster
>>>>>>> d: handler returned: -2
>>>>>>>  
>>>>>>>
>>>>>>> Regards
>>>>>>> Andreas
>>>>>>>
>>>>>>>
>>>>>>> On 03/22/15 07:33, Atin Mukherjee wrote:
>>>>>>>> On 03/22/2015 12:09 AM, Andreas Hollaus wrote:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I get a strange result when I execute 'gluster peer probe'. The command hangs and
>>>>>>>>> seems to time out without any message (I can ping the address):
>>>>>>>>> # gluster peer probe 10.32.1.144
>>>>>>>>> # echo $?
>>>>>>>>> 146
>>>>>>>> Could you provide the glusterd log and .cmd_log_history for all the
>>>>>>>> nodes in the cluster?
>>>>>>>>> The status looks promising, but there's a difference between this output and what
>>>>>>>>> you normally get from a successful call:
>>>>>>>>> # gluster peer status
>>>>>>>>> Number of Peers: 1
>>>>>>>>>
>>>>>>>>> Hostname: 10.32.1.144
>>>>>>>>> Uuid: 0b008d3e-c51b-4243-ad19-c79c869ba9f2
>>>>>>>>> State: Probe Sent to Peer (Connected)
>>>>>>>>>
>>>>>>>>> (instead of 'State: Peer in Cluster (Connected)')
>>>>>>>>>
>>>>>>>>> Running the command again will tell you that it is connected:
>>>>>>>>>
>>>>>>>>> # gluster peer probe 10.32.1.144
>>>>>>>>> peer probe: success. Host 10.32.1.144 port 24007 already in peer list
>>>>>>>> This means that the peer was added locally, but the peer handshake was not
>>>>>>>> completed for the previous peer probe transaction. I would be interested to
>>>>>>>> see the logs; then I can comment on what went wrong.
>>>>>>>>> But when you try to add a brick from that server it fails:
>>>>>>>>>
>>>>>>>>> # gluster volume add-brick c_test replica 2 10.32.1.144:/opt/lvmdir/c2 force
>>>>>>>>> volume add-brick: failed: Host 10.32.1.144 is not in 'Peer in Cluster' state
>>>>>>>>>
>>>>>>>>> The volume was previously created using the following commands:
>>>>>>>>> # gluster volume create c_test 10.32.0.48:/opt/lvmdir/c2 force
>>>>>>>>> volume create: c_test: success: please start the volume to access data
>>>>>>>>> # gluster volume start c_test
>>>>>>>>> volume start: c_test: success
>>>>>>>>>
>>>>>>>>> What could be the reason for this problem?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Regards
>>>>>>>>> Andreas
>>>>>>>>>
> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-users


