[Gluster-users] libgfapi failover problem on replica bricks

Mon Sep 1 08:27:27 UTC 2014

On 09/01/2014 12:56 PM, Roman wrote:
> Hmm, I don't know how, but both VMs survived the second server outage 
> :) I still haven't seen any message about healing completion anywhere :)
Healing can be performed by:
1) Mount process (/path/to/mount/log/)
2) Self-heal daemons on either of the bricks 
(/var/log/glusterfs/glustershd.log)

Check whether there are any heal messages in either of these logs.
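
A rough sketch of how both logs could be scanned in one go is below. It assumes the log lines follow the `[timestamp] LEVEL [file:line:function] source: message` layout that GlusterFS 3.5 writes, and it only matches the substring "heal", since the exact wording of heal messages differs between versions. The names `parse_line` and `heal_messages` are made up for this example and are not part of any Gluster tool.

```python
import re
import sys

# Layout assumed from GlusterFS 3.5 log lines, for example:
# [2014-09-01 06:19:38.059383] W [socket.c:522:__socket_rwv] 0-glusterfs: readv ...
LINE_RE = re.compile(
    r"^\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)\]\s+"
    r"(?P<level>[A-Z])\s+"
    r"\[(?P<origin>[^\]]+)\]\s+"
    r"(?P<msg>.*)$"
)


def parse_line(line):
    """Return (timestamp, level, origin, message), or None if the line does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    return m.group("ts"), m.group("level"), m.group("origin"), m.group("msg")


def heal_messages(lines):
    """Yield parsed entries whose origin or message mentions 'heal' (case-insensitive)."""
    for line in lines:
        entry = parse_line(line)
        if entry is None:
            continue
        _, _, origin, msg = entry
        if "heal" in origin.lower() or "heal" in msg.lower():
            yield entry


if __name__ == "__main__":
    # Each argument is a log file, e.g. the mount log and glustershd.log.
    for path in sys.argv[1:]:
        with open(path, errors="replace") as fh:
            for ts, level, origin, msg in heal_messages(fh):
                print(f"{path}: {ts} {level} {msg}")
```

Run it with the mount log and /var/log/glusterfs/glustershd.log as arguments. If it prints nothing, neither log contains a line mentioning heal.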

Pranith
>
>
> 2014-09-01 10:13 GMT+03:00 Roman <romeo.r at gmail.com>:
>
>     The mount is on the proxmox machine.
>
>     here are the logs from disconnection till connection:
>
>
>     [2014-09-01 06:19:38.059383] W [socket.c:522:__socket_rwv]
>     0-glusterfs: readv on 10.250.0.1:24007 failed (Connection timed out)
>     [2014-09-01 06:19:40.338393] W [socket.c:522:__socket_rwv]
>     0-HA-2TB-TT-Proxmox-cluster-client-0: readv on 10.250.0.1:49159
>     failed (Connection timed out)
>     [2014-09-01 06:19:40.338447] I [client.c:2229:client_rpc_notify]
>     0-HA-2TB-TT-Proxmox-cluster-client-0: disconnected from
>     10.250.0.1:49159. Client process will
>     keep trying to connect to glusterd until brick's port is available
>     [2014-09-01 06:19:49.196768] E
>     [socket.c:2161:socket_connect_finish] 0-glusterfs: connection to
>     10.250.0.1:24007 failed (No route to host)
>     [2014-09-01 06:20:05.565444] E
>     [socket.c:2161:socket_connect_finish]
>     0-HA-2TB-TT-Proxmox-cluster-client-0: connection to
>     10.250.0.1:24007 failed (No route to host)
>     [2014-09-01 06:23:26.607180] I [rpc-clnt.c:1729:rpc_clnt_reconfig]
>     0-HA-2TB-TT-Proxmox-cluster-client-0: changing port to 49159 (from 0)
>     [2014-09-01 06:23:26.608032] I
>     [client-handshake.c:1677:select_server_supported_programs]
>     0-HA-2TB-TT-Proxmox-cluster-client-0: Using Program GlusterFS 3.3,
>     Num (1298437), Version (330)
>     [2014-09-01 06:23:26.608395] I
>     [client-handshake.c:1462:client_setvolume_cbk]
>     0-HA-2TB-TT-Proxmox-cluster-client-0: Connected to
>     10.250.0.1:49159, attached to remote
>     volume '/exports/HA-2TB-TT-Proxmox-cluster/2TB'.
>     [2014-09-01 06:23:26.608420] I
>     [client-handshake.c:1474:client_setvolume_cbk]
>     0-HA-2TB-TT-Proxmox-cluster-client-0: Server and Client lk-version
>     numbers are not same, reopening the fds
>     [2014-09-01 06:23:26.608606] I
>     [client-handshake.c:450:client_set_lk_version_cbk]
>     0-HA-2TB-TT-Proxmox-cluster-client-0: Server lk version = 1
>     [2014-09-01 06:23:40.604979] I
>     [glusterfsd-mgmt.c:1307:mgmt_getspec_cbk] 0-glusterfs: No change
>     in volfile, continuing
>
>     Now there is no healing traffic either. I could try disconnecting
>     the second server now to see if it fails over. I don't
>     really believe it will :(
>
>     Here are some logs from the stor1 server (the one I disconnected):
>     root@stor1:~# cat /var/log/glusterfs/bricks/exports-HA-2TB-TT-Proxmox-cluster-2TB.log
>     [2014-09-01 06:19:26.403323] I [server.c:520:server_rpc_notify]
>     0-HA-2TB-TT-Proxmox-cluster-server: disconnecting connectionfrom
>     pve1-298005-2014/08/28-19:41:19:7269-HA-2TB-TT-Proxmox-cluster-client-0-0-0
>     [2014-09-01 06:19:26.403399] I
>     [server-helpers.c:289:do_fd_cleanup]
>     0-HA-2TB-TT-Proxmox-cluster-server: fd cleanup on
>     /images/112/vm-112-disk-1.raw
>     [2014-09-01 06:19:26.403486] I [client_t.c:417:gf_client_unref]
>     0-HA-2TB-TT-Proxmox-cluster-server: Shutting down connection
>     pve1-298005-2014/08/28-19:41:19:7269-HA-2TB-TT-Proxmox-cluster-client-0-0-0
>     [2014-09-01 06:19:29.475318] I [server.c:520:server_rpc_notify]
>     0-HA-2TB-TT-Proxmox-cluster-server: disconnecting connectionfrom
>     stor2-22775-2014/08/28-19:26:34:786262-HA-2TB-TT-Proxmox-cluster-client-0-0-0
>     [2014-09-01 06:19:29.475373] I [client_t.c:417:gf_client_unref]
>     0-HA-2TB-TT-Proxmox-cluster-server: Shutting down connection
>     stor2-22775-2014/08/28-19:26:34:786262-HA-2TB-TT-Proxmox-cluster-client-0-0-0
>     [2014-09-01 06:19:36.963318] I [server.c:520:server_rpc_notify]
>     0-HA-2TB-TT-Proxmox-cluster-server: disconnecting connectionfrom
>     stor2-22777-2014/08/28-19:26:34:791148-HA-2TB-TT-Proxmox-cluster-client-0-0-0
>     [2014-09-01 06:19:36.963373] I [client_t.c:417:gf_client_unref]
>     0-HA-2TB-TT-Proxmox-cluster-server: Shutting down connection
>     stor2-22777-2014/08/28-19:26:34:791148-HA-2TB-TT-Proxmox-cluster-client-0-0-0
>     [2014-09-01 06:19:40.419298] I [server.c:520:server_rpc_notify]
>     0-HA-2TB-TT-Proxmox-cluster-server: disconnecting connectionfrom
>     pve1-289547-2014/08/28-19:27:22:605477-HA-2TB-TT-Proxmox-cluster-client-0-0-0
>     [2014-09-01 06:19:40.419355] I [client_t.c:417:gf_client_unref]
>     0-HA-2TB-TT-Proxmox-cluster-server: Shutting down connection
>     pve1-289547-2014/08/28-19:27:22:605477-HA-2TB-TT-Proxmox-cluster-client-0-0-0
>     [2014-09-01 06:19:42.531310] I [server.c:520:server_rpc_notify]
>     0-HA-2TB-TT-Proxmox-cluster-server: disconnecting connectionfrom
>     sisemon-141844-2014/08/28-19:27:19:824141-HA-2TB-TT-Proxmox-cluster-client-0-0-0
>     [2014-09-01 06:19:42.531368] I [client_t.c:417:gf_client_unref]
>     0-HA-2TB-TT-Proxmox-cluster-server: Shutting down connection
>     sisemon-141844-2014/08/28-19:27:19:824141-HA-2TB-TT-Proxmox-cluster-client-0-0-0
>     [2014-09-01 06:23:25.088518] I
>     [server-handshake.c:575:server_setvolume]
>     0-HA-2TB-TT-Proxmox-cluster-server: accepted client from
>     sisemon-141844-2014/08/28-19:27:19:824141-HA-2TB-TT-Proxmox-cluster-client-0-0-1
>     (version: 3.5.2)
>     [2014-09-01 06:23:25.532734] I
>     [server-handshake.c:575:server_setvolume]
>     0-HA-2TB-TT-Proxmox-cluster-server: accepted client from
>     stor2-22775-2014/08/28-19:26:34:786262-HA-2TB-TT-Proxmox-cluster-client-0-0-1
>     (version: 3.5.2)
>     [2014-09-01 06:23:26.608074] I
>     [server-handshake.c:575:server_setvolume]
>     0-HA-2TB-TT-Proxmox-cluster-server: accepted client from
>     pve1-289547-2014/08/28-19:27:22:605477-HA-2TB-TT-Proxmox-cluster-client-0-0-1
>     (version: 3.5.2)
>     [2014-09-01 06:23:27.187556] I
>     [server-handshake.c:575:server_setvolume]
>     0-HA-2TB-TT-Proxmox-cluster-server: accepted client from
>     pve1-298005-2014/08/28-19:41:19:7269-HA-2TB-TT-Proxmox-cluster-client-0-0-1
>     (version: 3.5.2)
>     [2014-09-01 06:23:27.213890] I
>     [server-handshake.c:575:server_setvolume]
>     0-HA-2TB-TT-Proxmox-cluster-server: accepted client from
>     stor2-22777-2014/08/28-19:26:34:791148-HA-2TB-TT-Proxmox-cluster-client-0-0-1
>     (version: 3.5.2)
>     [2014-09-01 06:23:31.222654] I
>     [server-handshake.c:575:server_setvolume]
>     0-HA-2TB-TT-Proxmox-cluster-server: accepted client from
>     pve1-494566-2014/08/29-01:00:13:257498-HA-2TB-TT-Proxmox-cluster-client-0-0-1
>     (version: 3.5.2)
>     [2014-09-01 06:23:52.591365] I [server.c:520:server_rpc_notify]
>     0-HA-2TB-TT-Proxmox-cluster-server: disconnecting connectionfrom
>     pve1-494566-2014/08/29-01:00:13:257498-HA-2TB-TT-Proxmox-cluster-client-0-0-0
>     [2014-09-01 06:23:52.591447] W
>     [inodelk.c:392:pl_inodelk_log_cleanup]
>     0-HA-2TB-TT-Proxmox-cluster-server: releasing lock on
>     14f70955-5e1e-4499-b66b-52cd50892315 held by
>     {client=0x7f2494001ed0, pid=0 lk-owner=bc3ddbdbae7f0000}
>     [2014-09-01 06:23:52.591568] I
>     [server-helpers.c:289:do_fd_cleanup]
>     0-HA-2TB-TT-Proxmox-cluster-server: fd cleanup on
>     /images/124/vm-124-disk-1.qcow2
>     [2014-09-01 06:23:52.591679] I [client_t.c:417:gf_client_unref]
>     0-HA-2TB-TT-Proxmox-cluster-server: Shutting down connection
>     pve1-494566-2014/08/29-01:00:13:257498-HA-2TB-TT-Proxmox-cluster-client-0-0-0
>     [2014-09-01 06:23:58.709444] I
>     [server-handshake.c:575:server_setvolume]
>     0-HA-2TB-TT-Proxmox-cluster-server: accepted client from
>     stor1-3975-2014/09/01-06:23:58:673930-HA-2TB-TT-Proxmox-cluster-client-0-0-0
>     (version: 3.5.2)
>     [2014-09-01 06:24:00.741542] I [server.c:520:server_rpc_notify]
>     0-HA-2TB-TT-Proxmox-cluster-server: disconnecting connectionfrom
>     stor1-3975-2014/09/01-06:23:58:673930-HA-2TB-TT-Proxmox-cluster-client-0-0-0
>     [2014-09-01 06:24:00.741598] I [client_t.c:417:gf_client_unref]
>     0-HA-2TB-TT-Proxmox-cluster-server: Shutting down connection
>     stor1-3975-2014/09/01-06:23:58:673930-HA-2TB-TT-Proxmox-cluster-client-0-0-0
>     [2014-09-01 06:30:06.010819] I
>     [server-handshake.c:575:server_setvolume]
>     0-HA-2TB-TT-Proxmox-cluster-server: accepted client from
>     stor1-4030-2014/09/01-06:30:05:976735-HA-2TB-TT-Proxmox-cluster-client-0-0-0
>     (version: 3.5.2)
>     [2014-09-01 06:30:08.056059] I [server.c:520:server_rpc_notify]
>     0-HA-2TB-TT-Proxmox-cluster-server: disconnecting connectionfrom
>     stor1-4030-2014/09/01-06:30:05:976735-HA-2TB-TT-Proxmox-cluster-client-0-0-0
>     [2014-09-01 06:30:08.056127] I [client_t.c:417:gf_client_unref]
>     0-HA-2TB-TT-Proxmox-cluster-server: Shutting down connection
>     stor1-4030-2014/09/01-06:30:05:976735-HA-2TB-TT-Proxmox-cluster-client-0-0-0
>     [2014-09-01 06:36:54.307743] I
>     [server-handshake.c:575:server_setvolume]
>     0-HA-2TB-TT-Proxmox-cluster-server: accepted client from
>     stor1-4077-2014/09/01-06:36:54:289911-HA-2TB-TT-Proxmox-cluster-client-0-0-0
>     (version: 3.5.2)
>     [2014-09-01 06:36:56.340078] I [server.c:520:server_rpc_notify]
>     0-HA-2TB-TT-Proxmox-cluster-server: disconnecting connectionfrom
>     stor1-4077-2014/09/01-06:36:54:289911-HA-2TB-TT-Proxmox-cluster-client-0-0-0
>     [2014-09-01 06:36:56.340122] I [client_t.c:417:gf_client_unref]
>     0-HA-2TB-TT-Proxmox-cluster-server: Shutting down connection
>     stor1-4077-2014/09/01-06:36:54:289911-HA-2TB-TT-Proxmox-cluster-client-0-0-0
>     [2014-09-01 06:46:53.601517] I
>     [server-handshake.c:575:server_setvolume]
>     0-HA-2TB-TT-Proxmox-cluster-server: accepted client from
>     stor2-6891-2014/09/01-06:46:53:583529-HA-2TB-TT-Proxmox-cluster-client-0-0-0
>     (version: 3.5.2)
>     [2014-09-01 06:46:55.624705] I [server.c:520:server_rpc_notify]
>     0-HA-2TB-TT-Proxmox-cluster-server: disconnecting connectionfrom
>     stor2-6891-2014/09/01-06:46:53:583529-HA-2TB-TT-Proxmox-cluster-client-0-0-0
>     [2014-09-01 06:46:55.624793] I [client_t.c:417:gf_client_unref]
>     0-HA-2TB-TT-Proxmox-cluster-server: Shutting down connection
>     stor2-6891-2014/09/01-06:46:53:583529-HA-2TB-TT-Proxmox-cluster-client-0-0-0
>
>     The last 2 lines are pretty unclear. Why did it disconnect?
>
>
>
>
>     2014-09-01 9:41 GMT+03:00 Pranith Kumar Karampuri <pkarampu at redhat.com>:
>
>
>         On 09/01/2014 12:08 PM, Roman wrote:
>>         Well, as far as I can tell, the VMs are not much affected by
>>         the healing process. At least the munin server, which runs
>>         with a pretty high load (average rarely goes below 0.9 :) ),
>>         had no problems. To create some more load I copied a 590 MB
>>         file on the VM's disk. It took 22 seconds, which is ca.
>>         27 MB/s or about 214 Mbit/s.
>>
>>         The servers are connected via a 10 Gbit network. The Proxmox
>>         client is connected to the server through a separate 1 Gbps
>>         interface. We are thinking of moving it to 10 Gbps as well.
>>
>>         Here is some heal info, which is pretty confusing.
>>
>>         Right after the 1st server restored its connection, it was
>>         pretty clear:
>>
>>         root@stor1:~# gluster volume heal HA-2TB-TT-Proxmox-cluster info
>>         Brick stor1:/exports/HA-2TB-TT-Proxmox-cluster/2TB/
>>         /images/124/vm-124-disk-1.qcow2 - Possibly undergoing heal
>>         Number of entries: 1
>>
>>         Brick stor2:/exports/HA-2TB-TT-Proxmox-cluster/2TB/
>>         /images/124/vm-124-disk-1.qcow2 - Possibly undergoing heal
>>         /images/112/vm-112-disk-1.raw - Possibly undergoing heal
>>         Number of entries: 2
>>
>>
>>         Some time later it says:
>>         root@stor1:~# gluster volume heal HA-2TB-TT-Proxmox-cluster info
>>         Brick stor1:/exports/HA-2TB-TT-Proxmox-cluster/2TB/
>>         Number of entries: 0
>>
>>         Brick stor2:/exports/HA-2TB-TT-Proxmox-cluster/2TB/
>>         Number of entries: 0
>>
>>         while I can still see traffic between the servers, and there
>>         were still no messages about healing process completion.
>         On which machine do we have the mount?
>
>         Pranith
>
>>
>>
>>
>>         2014-08-29 10:02 GMT+03:00 Pranith Kumar Karampuri <pkarampu at redhat.com>:
>>
>>             Wow, this is great news! Thanks a lot for sharing the
>>             results :-). Did you get a chance to test the performance
>>             of the applications in the vm during self-heal?
>>             May I know more about your use case? i.e. How many vms
>>             and what is the avg size of each vm etc?
>>
>>             Pranith
>>
>>
>>             On 08/28/2014 11:27 PM, Roman wrote:
>>>             Here are the results.
>>>             1. I still have a problem with log rotation. Logs are being
>>>             written to the .log.1 file, not the .log file. Any hints on
>>>             how to fix this?
>>>             2. The healing logs are now much better; I can see the
>>>             success message.
>>>             3. Both volumes, with HD off and on, synced successfully.
>>>             The volume with HD on synced much faster.
>>>             4. Both VMs on the volumes survived the outage, during
>>>             which new files were added and deleted.
>>>
>>>             So replication works well with both HD on and off for
>>>             volumes for VMs. With HD on it is even faster. I need to
>>>             solve the logging issue.
>>>
>>>             It seems we could start production storage from this moment
>>>             :) The whole company will use it. Some distributed and
>>>             some replicated. Thanks for a great product.
>>>
>>>
>>>             2014-08-27 16:03 GMT+03:00 Roman <romeo.r at gmail.com>:
>>>
>>>                 Installed new packages. Will make some tests
>>>                 tomorrow. thanx.
>>>
>>>
>>>                 2014-08-27 14:10 GMT+03:00 Pranith Kumar Karampuri <pkarampu at redhat.com>:
>>>
>>>
>>>                     On 08/27/2014 04:38 PM, Kaleb KEITHLEY wrote:
>>>
>>>                         On 08/27/2014 03:09 AM, Humble Chirammal wrote:
>>>
>>>
>>>
>>>                             ----- Original Message -----
>>>                             | From: "Pranith Kumar Karampuri" <pkarampu at redhat.com>
>>>                             | To: "Humble Chirammal" <hchiramm at redhat.com>
>>>                             | Cc: "Roman" <romeo.r at gmail.com>, gluster-users at gluster.org, "Niels de Vos" <ndevos at redhat.com>
>>>                             | Sent: Wednesday, August 27, 2014 12:34:22 PM
>>>                             | Subject: Re: [Gluster-users] libgfapi failover problem on replica bricks
>>>                             |
>>>                             |
>>>                             | On 08/27/2014 12:24 PM, Roman wrote:
>>>                             | > root@stor1:~# ls -l /usr/sbin/glfsheal
>>>                             | > ls: cannot access /usr/sbin/glfsheal: No such file or directory
>>>                             | > Seems like not.
>>>                             | Humble,
>>>                             |       Seems like the binary is still not packaged?
>>>
>>>                             Checking with Kaleb on this.
>>>
>>>                         ...
>>>
>>>                             | >>>            |
>>>                             | >>>            | Humble/Niels,
>>>                             | >>>            |     Do we have debs available for 3.5.2? In 3.5.1
>>>                             | >>>            | there was a packaging issue where /usr/bin/glfsheal
>>>                             | >>>            | was not packaged along with the deb. I think that
>>>                             | >>>            | should be fixed now as well?
>>>                             | >>>            |
>>>                             | >>>  Pranith,
>>>                             | >>>
>>>                             | >>>            The 3.5.2 packages for Debian are not available yet. We
>>>                             | >>>            are co-ordinating internally to get it processed.
>>>                             | >>>            I will update the list once it's available.
>>>                             | >>>
>>>                             | >>>  --Humble
>>>
>>>
>>>                         glfsheal isn't in our 3.5.2-1 DPKGs either.
>>>                         We (meaning I) started with the 3.5.1
>>>                         packaging bits from Semiosis. Perhaps he
>>>                         fixed 3.5.1 after giving me his bits.
>>>
>>>                         I'll fix it and spin 3.5.2-2 DPKGs.
>>>
>>>                     That is great Kaleb. Please notify semiosis as
>>>                     well in case he is yet to fix it.
>>>
>>>                     Pranith
>>>
>>>
>>>                         -- 
>>>
>>>                         Kaleb
>>>
>>>
>>>
>>>
>>>
>>>                 -- 
>>>                 Best regards,
>>>                 Roman.
