[Gluster-users] libgfapi failover problem on replica bricks

Pranith Kumar Karampuri pkarampu at redhat.com
Thu Aug 7 01:07:07 UTC 2014


Hi Roman,
        Does the md5sum match when the VMs are paused?

Pranith
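
A minimal way to run that check, assuming the guest is managed with Proxmox's qm tool and that VM 125 is the one under test (the brick paths and VM id are taken from the thread below; the qm commands are an assumption about how the guest is managed):

    root at pve1:~# qm suspend 125
    root at stor1:~# md5sum /exports/pve1/1T/images/125/vm-125-disk-1.qcow2
    root at stor2:~# md5sum /exports/pve1/1T/images/125/vm-125-disk-1.qcow2
    root at pve1:~# qm resume 125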
On 08/07/2014 03:11 AM, Roman wrote:
> I don't know if it makes any sense, but I'll add this piece of 
> information:
> after I stop the VM (in the situation where one glusterfs server was down 
> for a while and then came back up) and start it again, glusterfs treats 
> those VM disk files as if they were identical now. Meanwhile they are 
> not: the sizes are different. I think there is some kind of problem 
> with how glusterfs checks these replicated files.
>
>
> root at stor1:~# getfattr -d -m. -e hex 
> /exports/pve1/1T/images/125/vm-125-disk-1.qcow2
> getfattr: Removing leading '/' from absolute path names
> # file: exports/pve1/1T/images/125/vm-125-disk-1.qcow2
> trusted.afr.HA-MED-PVE1-1T-client-0=0x000000000000000000000000
> trusted.afr.HA-MED-PVE1-1T-client-1=0x000000000000000000000000
> trusted.gfid=0x207984df4e6e4ef983f285ed0c4ce8fa
>
>
> root at stor1:~# du -sh /exports/pve1/1T/images/125/
> 1.6G    /exports/pve1/1T/images/125/
>
>
> root at stor2:~# getfattr -d -m. -e hex 
> /exports/pve1/1T/images/125/vm-125-disk-1.qcow2
> getfattr: Removing leading '/' from absolute path names
> # file: exports/pve1/1T/images/125/vm-125-disk-1.qcow2
> trusted.afr.HA-MED-PVE1-1T-client-0=0x000000000000000000000000
> trusted.afr.HA-MED-PVE1-1T-client-1=0x000000000000000000000000
> trusted.gfid=0x207984df4e6e4ef983f285ed0c4ce8fa
>
> root at stor2:~# du -sh /exports/pve1/1T/images/125/
> 2.6G    /exports/pve1/1T/images/125/
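
A caveat worth noting here: du reports allocated blocks, and qcow2 images on a replica can differ in on-disk allocation either because writes are genuinely missing on the stale copy or simply because one copy is stored less sparsely, so comparing apparent sizes and checksums separates those two cases better than plain du. A sketch of that check, using the same brick paths as above:

    root at stor1:~# du -sh --apparent-size /exports/pve1/1T/images/125/vm-125-disk-1.qcow2
    root at stor2:~# du -sh --apparent-size /exports/pve1/1T/images/125/vm-125-disk-1.qcow2
    root at stor1:~# md5sum /exports/pve1/1T/images/125/vm-125-disk-1.qcow2
    root at stor2:~# md5sum /exports/pve1/1T/images/125/vm-125-disk-1.qcow2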
>
>
>
> 2014-08-06 12:49 GMT+03:00 Humble Chirammal <hchiramm at redhat.com>:
>
>
>
>
>     ----- Original Message -----
>     | From: "Pranith Kumar Karampuri" <pkarampu at redhat.com>
>     | To: "Roman" <romeo.r at gmail.com>
>     | Cc: gluster-users at gluster.org, "Niels de Vos" <ndevos at redhat.com>,
>     |     "Humble Chirammal" <hchiramm at redhat.com>
>     | Sent: Wednesday, August 6, 2014 12:09:57 PM
>     | Subject: Re: [Gluster-users] libgfapi failover problem on
>     replica bricks
>     |
>     | Roman,
>     |      The file went into split-brain. I think we should do these tests
>     | with 3.5.2, where monitoring the heals is easier. Let me also come up
>     | with a document about how to do the testing you are trying to do.
>     |
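
For reference, once 3.5.2 is in place the heal state can be watched from the CLI roughly like this; the volume name below is inferred from the client log prefixes further down, and exact subcommand availability varies a little between 3.4.x and 3.5.x:

    gluster volume heal HA-fast-150G-PVE1 info
    gluster volume heal HA-fast-150G-PVE1 info split-brain
    gluster volume heal HA-fast-150G-PVE1 info heal-failed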
>     | Humble/Niels,
>     |      Do we have debs available for 3.5.2? In 3.5.1 there was a packaging
>     | issue where /usr/bin/glfsheal was not packaged along with the deb. I
>     | think that should be fixed now as well?
>     |
>     Pranith,
>
>     The 3.5.2 packages for Debian are not available yet. We are
>     coordinating internally to get them processed.
>     I will update the list once they are available.
>
>     --Humble
>     |
>     | On 08/06/2014 11:52 AM, Roman wrote:
>     | > good morning,
>     | >
>     | > root at stor1:~# getfattr -d -m. -e hex
>     | > /exports/fast-test/150G/images/127/vm-127-disk-1.qcow2
>     | > getfattr: Removing leading '/' from absolute path names
>     | > # file: exports/fast-test/150G/images/127/vm-127-disk-1.qcow2
>     | > trusted.afr.HA-fast-150G-PVE1-client-0=0x000000000000000000000000
>     | > trusted.afr.HA-fast-150G-PVE1-client-1=0x000001320000000000000000
>     | > trusted.gfid=0x23c79523075a4158bea38078da570449
>     | >
>     | > root at stor2:~# getfattr -d -m. -e hex
>     | > /exports/fast-test/150G/images/127/vm-127-disk-1.qcow2
>     | > getfattr: Removing leading '/' from absolute path names
>     | > # file: exports/fast-test/150G/images/127/vm-127-disk-1.qcow2
>     | > trusted.afr.HA-fast-150G-PVE1-client-0=0x000000040000000000000000
>     | > trusted.afr.HA-fast-150G-PVE1-client-1=0x000000000000000000000000
>     | > trusted.gfid=0x23c79523075a4158bea38078da570449
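
How to read those changelog xattrs, as far as I understand the AFR pending-counter format: the first 8 hex digits count pending data operations, the next 8 metadata operations, the last 8 entry operations. So stor1 holds trusted.afr.HA-fast-150G-PVE1-client-1=0x00000132..., i.e. 0x132 = 306 pending data operations blamed on client-1 (stor2), while stor2 holds trusted.afr.HA-fast-150G-PVE1-client-0=0x00000004..., i.e. 4 pending data operations blamed on client-0 (stor1). Each brick accuses the other, which matches the data split-brain Pranith mentions above.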
>     | >
>     | >
>     | >
>     | > 2014-08-06 9:20 GMT+03:00 Pranith Kumar Karampuri
>     | > <pkarampu at redhat.com>:
>     | >
>     | >
>     | >     On 08/06/2014 11:30 AM, Roman wrote:
>     | >>     Also, this time files are not the same!
>     | >>
>     | >>     root at stor1:~# md5sum
>     | >> /exports/fast-test/150G/images/127/vm-127-disk-1.qcow2
>     | >>     32411360c53116b96a059f17306caeda
>     | >>  /exports/fast-test/150G/images/127/vm-127-disk-1.qcow2
>     | >>
>     | >>     root at stor2:~# md5sum
>     | >> /exports/fast-test/150G/images/127/vm-127-disk-1.qcow2
>     | >>     65b8a6031bcb6f5fb3a11cb1e8b1c9c9
>     | >>  /exports/fast-test/150G/images/127/vm-127-disk-1.qcow2
>     | >     What is the getfattr output?
>     | >
>     | >     Pranith
>     | >
>     | >>
>     | >>
>     | >>         2014-08-05 16:33 GMT+03:00 Roman <romeo.r at gmail.com>:
>     | >>
>     | >>         Nope, it is not working. But this time it went a bit
>     | >>         differently.
>     | >>
>     | >>         root at gluster-client:~# dmesg
>     | >>         Segmentation fault
>     | >>
>     | >>
>     | >>         I was not even able to start the VM after I had done the
>     | >>         tests:
>     | >>
>     | >>         Could not read qcow2 header: Operation not permitted
>     | >>
>     | >>         And it seems it never starts to sync the files after the first
>     | >>         disconnect. The VM survives the first disconnect, but not the
>     | >>         second (I waited around 30 minutes). Also, I've
>     | >>         got network.ping-timeout: 2 in the volume settings, but the logs
>     | >>         react to the first disconnect in around 30 seconds. The second was
>     | >>         faster, 2 seconds.
>     | >>
>     | >>         Reaction was different also:
>     | >>
>     | >>         slower one:
>     | >>         [2014-08-05 13:26:19.558435] W
>     [socket.c:514:__socket_rwv]
>     | >>         0-glusterfs: readv failed (Connection timed out)
>     | >>         [2014-08-05 13:26:19.558485] W
>     | >> [socket.c:1962:__socket_proto_state_machine] 0-glusterfs:
>     | >>         reading from socket failed. Error (Connection timed out),
>     | >>         peer (10.250.0.1:24007)
>     | >>         [2014-08-05 13:26:21.281426] W
>     [socket.c:514:__socket_rwv]
>     | >>         0-HA-fast-150G-PVE1-client-0: readv failed
>     (Connection timed out)
>     | >>         [2014-08-05 13:26:21.281474] W
>     | >> [socket.c:1962:__socket_proto_state_machine]
>     | >>         0-HA-fast-150G-PVE1-client-0: reading from socket failed.
>     | >>         Error (Connection timed out), peer (10.250.0.1:49153)
>     | >>         [2014-08-05 13:26:21.281507] I
>     | >>         [client.c:2098:client_rpc_notify]
>     | >>         0-HA-fast-150G-PVE1-client-0: disconnected
>     | >>
>     | >>         the fast one:
>     | >>         2014-08-05 12:52:44.607389] C
>     | >> [client-handshake.c:127:rpc_client_ping_timer_expired]
>     | >>         0-HA-fast-150G-PVE1-client-1: server 10.250.0.2:49153
>     | >>         has not responded in the last 2
>     | >>         seconds, disconnecting.
>     | >>         [2014-08-05 12:52:44.607491] W
>     [socket.c:514:__socket_rwv]
>     | >>         0-HA-fast-150G-PVE1-client-1: readv failed (No data
>     available)
>     | >>         [2014-08-05 12:52:44.607585] E
>     | >>         [rpc-clnt.c:368:saved_frames_unwind]
>     | >> (-->/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_notify+0xf8)
>     | >>         [0x7fcb1b4b0558]
>     | >>
>     (-->/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xc3)
>     | >>         [0x7fcb1b4aea63]
>     | >>
>     (-->/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(saved_frames_destroy+0xe)
>     | >>         [0x7fcb1b4ae97e]))) 0-HA-fast-150G-PVE1-client-1: forced
>     | >>         unwinding frame type(GlusterFS 3.3) op(LOOKUP(27))
>     called at
>     | >>         2014-08-05 12:52:42.463881 (xid=0x381883x)
>     | >>         [2014-08-05 12:52:44.607604] W
>     | >> [client-rpc-fops.c:2624:client3_3_lookup_cbk]
>     | >>         0-HA-fast-150G-PVE1-client-1: remote operation failed:
>     | >>         Transport endpoint is not connected. Path: /
>     | >> (00000000-0000-0000-0000-000000000001)
>     | >>         [2014-08-05 12:52:44.607736] E
>     | >>         [rpc-clnt.c:368:saved_frames_unwind]
>     | >> (-->/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_notify+0xf8)
>     | >>         [0x7fcb1b4b0558]
>     | >>
>     (-->/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xc3)
>     | >>         [0x7fcb1b4aea63]
>     | >>
>     (-->/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(saved_frames_destroy+0xe)
>     | >>         [0x7fcb1b4ae97e]))) 0-HA-fast-150G-PVE1-client-1: forced
>     | >>         unwinding frame type(GlusterFS Handshake) op(PING(3))
>     called
>     | >>         at 2014-08-05 12:52:42.463891 (xid=0x381884x)
>     | >>         [2014-08-05 12:52:44.607753] W
>     | >> [client-handshake.c:276:client_ping_cbk]
>     | >>         0-HA-fast-150G-PVE1-client-1: timer must have expired
>     | >>         [2014-08-05 12:52:44.607776] I
>     | >>         [client.c:2098:client_rpc_notify]
>     | >>         0-HA-fast-150G-PVE1-client-1: disconnected
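
A guess about the timing difference, not a confirmed diagnosis: network.ping-timeout governs the RPC ping timer on an established connection with outstanding pings, which is what fired in about 2 seconds in the second log; the slower ~30-second case looks like detection via the kernel-level TCP read timeout instead. The option itself is set per volume (volume name again inferred from the log prefixes):

    gluster volume set HA-fast-150G-PVE1 network.ping-timeout 2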
>     | >>
>     | >>
>     | >>
>     | >>         I've got SSD disks (just for your information).
>     | >>         Should I go and give 3.5.2 a try?
>     | >>
>     | >>
>     | >>
>     | >>         2014-08-05 13:06 GMT+03:00 Pranith Kumar Karampuri
>     | >>         <pkarampu at redhat.com>:
>     | >>
>     | >>             Reply along with gluster-users please :-). Maybe you are
>     | >>             hitting 'reply' instead of 'reply all'?
>     | >>
>     | >>             Pranith
>     | >>
>     | >>             On 08/05/2014 03:35 PM, Roman wrote:
>     | >>>             To make sure and start clean, I've created another VM
>     | >>>             with raw format and am going to repeat those steps. So now
>     | >>>             I've got two VMs, one with qcow2 format and the other with
>     | >>>             raw format. I will send another e-mail shortly.
>     | >>>
>     | >>>
>     | >>>             2014-08-05 13:01 GMT+03:00 Pranith Kumar Karampuri
>     | >>>             <pkarampu at redhat.com>:
>     | >>>
>     | >>>
>     | >>>                 On 08/05/2014 03:07 PM, Roman wrote:
>     | >>>>                 really, seems like the same file
>     | >>>>
>     | >>>>                 stor1:
>     | >>>> a951641c5230472929836f9fcede6b04
>     | >>>>  /exports/fast-test/150G/images/127/vm-127-disk-1.qcow2
>     | >>>>
>     | >>>>                 stor2:
>     | >>>> a951641c5230472929836f9fcede6b04
>     | >>>>  /exports/fast-test/150G/images/127/vm-127-disk-1.qcow2
>     | >>>>
>     | >>>>
>     | >>>>                 one thing I've seen from the logs is that somehow
>     | >>>>                 Proxmox VE is connecting to the servers with the wrong
>     | >>>>                 version?
>     | >>>>                 [2014-08-05 09:23:45.218550] I
>     | >>>> [client-handshake.c:1659:select_server_supported_programs]
>     | >>>> 0-HA-fast-150G-PVE1-client-0: Using Program
>     | >>>>                 GlusterFS 3.3, Num (1298437), Version (330)
>     | >>>                 It is the rpc (on-the-wire data structures)
>     | >>>                 version, which has not changed at all since 3.3, so
>     | >>>                 that's not a problem. So what is the conclusion? Is
>     | >>>                 your test case working now or not?
>     | >>>
>     | >>>                 Pranith
>     | >>>
>     | >>>>                 but if I issue:
>     | >>>>                 root at pve1:~# glusterfs -V
>     | >>>>                 glusterfs 3.4.4 built on Jun 28 2014 03:44:57
>     | >>>>                 seems ok.
>     | >>>>
>     | >>>>                 the servers use 3.4.4 meanwhile:
>     | >>>>                 [2014-08-05 09:23:45.117875] I
>     | >>>> [server-handshake.c:567:server_setvolume]
>     | >>>> 0-HA-fast-150G-PVE1-server: accepted client from
>     | >>>>
>     stor1-9004-2014/08/05-09:23:45:93538-HA-fast-150G-PVE1-client-1-0
>     | >>>>                 (version: 3.4.4)
>     | >>>>                 [2014-08-05 09:23:49.103035] I
>     | >>>> [server-handshake.c:567:server_setvolume]
>     | >>>> 0-HA-fast-150G-PVE1-server: accepted client from
>     | >>>>
>     stor1-8998-2014/08/05-09:23:45:89883-HA-fast-150G-PVE1-client-0-0
>     | >>>>                 (version: 3.4.4)
>     | >>>>
>     | >>>>                 if this could be the reason, of course.
>     | >>>>                 I did restart the Proxmox VE yesterday (just for
>     | >>>>                 your information).
>     | >>>>
>     | >>>>
>     | >>>>
>     | >>>>
>     | >>>>
>     | >>>>                 2014-08-05 12:30 GMT+03:00 Pranith Kumar
>     Karampuri
>     | >>>>                 <pkarampu at redhat.com>:
>     | >>>>
>     | >>>>
>     | >>>>                     On 08/05/2014 02:33 PM, Roman wrote:
>     | >>>>>                     Waited long enough for now, still
>     different
>     | >>>>>                     sizes and no logs about healing :(
>     | >>>>>
>     | >>>>>                     stor1
>     | >>>>>                     # file:
>     | >>>>> exports/fast-test/150G/images/127/vm-127-disk-1.qcow2
>     | >>>>>
>     trusted.afr.HA-fast-150G-PVE1-client-0=0x000000000000000000000000
>     | >>>>>
>     trusted.afr.HA-fast-150G-PVE1-client-1=0x000000000000000000000000
>     | >>>>> trusted.gfid=0xf10ad81b58484bcd9b385a36a207f921
>     | >>>>>
>     | >>>>>                     root at stor1:~# du -sh
>     | >>>>> /exports/fast-test/150G/images/127/
>     | >>>>>                     1.2G  /exports/fast-test/150G/images/127/
>     | >>>>>
>     | >>>>>
>     | >>>>>                     stor2
>     | >>>>>                     # file:
>     | >>>>> exports/fast-test/150G/images/127/vm-127-disk-1.qcow2
>     | >>>>>
>     trusted.afr.HA-fast-150G-PVE1-client-0=0x000000000000000000000000
>     | >>>>>
>     trusted.afr.HA-fast-150G-PVE1-client-1=0x000000000000000000000000
>     | >>>>> trusted.gfid=0xf10ad81b58484bcd9b385a36a207f921
>     | >>>>>
>     | >>>>>
>     | >>>>>                     root at stor2:~# du -sh
>     | >>>>> /exports/fast-test/150G/images/127/
>     | >>>>>                     1.4G  /exports/fast-test/150G/images/127/
>     | >>>>                     According to the changelogs, the file
>     doesn't
>     | >>>>                     need any healing. Could you stop the
>     operations
>     | >>>>                     on the VMs and take md5sum on both
>     these machines?
>     | >>>>
>     | >>>>                     Pranith
>     | >>>>
>     | >>>>>
>     | >>>>>
>     | >>>>>
>     | >>>>>
>     | >>>>>                     2014-08-05 11:49 GMT+03:00 Pranith Kumar
>     | >>>>>                     Karampuri <pkarampu at redhat.com>:
>     | >>>>>
>     | >>>>>
>     | >>>>>                         On 08/05/2014 02:06 PM, Roman wrote:
>     | >>>>>>                         Well, it seems like it doesn't see that
>     | >>>>>>                         changes were made to the volume? I created
>     | >>>>>>                         two files of 200 and 100 MB (from /dev/zero)
>     | >>>>>>                         after I disconnected the first brick. Then
>     | >>>>>>                         connected it back and got these logs:
>     | >>>>>>
>     | >>>>>> [2014-08-05 08:30:37.830150] I
>     | >>>>>> [glusterfsd-mgmt.c:1584:mgmt_getspec_cbk]
>     | >>>>>> 0-glusterfs: No change in volfile, continuing
>     | >>>>>> [2014-08-05 08:30:37.830207] I
>     | >>>>>> [rpc-clnt.c:1676:rpc_clnt_reconfig]
>     | >>>>>> 0-HA-fast-150G-PVE1-client-0: changing
>     | >>>>>>                         port to 49153 (from 0)
>     | >>>>>> [2014-08-05 08:30:37.830239] W
>     | >>>>>> [socket.c:514:__socket_rwv]
>     | >>>>>> 0-HA-fast-150G-PVE1-client-0: readv
>     | >>>>>> failed (No data available)
>     | >>>>>> [2014-08-05 08:30:37.831024] I
>     | >>>>>> [client-handshake.c:1659:select_server_supported_programs]
>     | >>>>>> 0-HA-fast-150G-PVE1-client-0: Using
>     | >>>>>> Program GlusterFS 3.3, Num (1298437),
>     | >>>>>> Version (330)
>     | >>>>>> [2014-08-05 08:30:37.831375] I
>     | >>>>>> [client-handshake.c:1456:client_setvolume_cbk]
>     | >>>>>> 0-HA-fast-150G-PVE1-client-0: Connected
>     | >>>>>>                         to 10.250.0.1:49153, attached to
>     | >>>>>>                         remote volume '/exports/fast-test/150G'.
>     | >>>>>> [2014-08-05 08:30:37.831394] I
>     | >>>>>> [client-handshake.c:1468:client_setvolume_cbk]
>     | >>>>>> 0-HA-fast-150G-PVE1-client-0: Server and
>     | >>>>>> Client lk-version numbers are not same,
>     | >>>>>> reopening the fds
>     | >>>>>> [2014-08-05 08:30:37.831566] I
>     | >>>>>> [client-handshake.c:450:client_set_lk_version_cbk]
>     | >>>>>> 0-HA-fast-150G-PVE1-client-0: Server lk
>     | >>>>>> version = 1
>     | >>>>>>
>     | >>>>>>
>     | >>>>>> [2014-08-05 08:30:37.830150] I
>     | >>>>>> [glusterfsd-mgmt.c:1584:mgmt_getspec_cbk]
>     | >>>>>> 0-glusterfs: No change in volfile, continuing
>     | >>>>>>                         this line seems weird to me, to be honest.
>     | >>>>>>                         I do not see any traffic on the switch
>     | >>>>>>                         interfaces between the gluster servers, which
>     | >>>>>>                         means there is no syncing between them.
>     | >>>>>>                         I tried to ls -l the files on the client
>     | >>>>>>                         and the servers to trigger the healing, but
>     | >>>>>>                         it seems there was no success. Should I
>     | >>>>>>                         wait more?
>     | >>>>>                         Yes, it should take around 10-15
>     minutes.
>     | >>>>>                         Could you provide 'getfattr -d -m.
>     -e hex
>     | >>>>> <file-on-brick>' on both the bricks.
>     | >>>>>
>     | >>>>>                         Pranith
>     | >>>>>
>     | >>>>>>
>     | >>>>>>
>     | >>>>>> 2014-08-05 11:25 GMT+03:00 Pranith Kumar
>     | >>>>>> Karampuri <pkarampu at redhat.com>:
>     | >>>>>>
>     | >>>>>>
>     | >>>>>> On 08/05/2014 01:10 PM, Roman wrote:
>     | >>>>>>>   Ahha! For some reason I was not able
>     | >>>>>>>   to start the VM anymore; Proxmox VE
>     | >>>>>>>   told me that it is not able to read
>     | >>>>>>>   the qcow2 header because permission
>     | >>>>>>>   was denied for some reason. So I just
>     | >>>>>>>   deleted that file and created a new
>     | >>>>>>>   VM. And the next message I've got was
>     | >>>>>>>   this:
>     | >>>>>> These seem to be the messages from when
>     | >>>>>> you took down the brick before the
>     | >>>>>> self-heal completed. Could you restart the
>     | >>>>>> run, waiting for self-heals to complete
>     | >>>>>> before taking down the next brick?
>     | >>>>>>
>     | >>>>>> Pranith
>     | >>>>>>
>     | >>>>>>>
>     | >>>>>>>
>     | >>>>>>>   [2014-08-05 07:31:25.663412] E
>     | >>>>>>> [afr-self-heal-common.c:197:afr_sh_print_split_brain_log]
>     | >>>>>>>   0-HA-fast-150G-PVE1-replicate-0:
>     | >>>>>>>   Unable to self-heal contents of
>     | >>>>>>>   '/images/124/vm-124-disk-1.qcow2'
>     | >>>>>>>   (possible split-brain). Please
>     | >>>>>>>   delete the file from all but the
>     | >>>>>>>   preferred subvolume.- Pending
>     | >>>>>>>   matrix:  [ [ 0 60 ] [ 11 0 ] ]
>     | >>>>>>>   [2014-08-05 07:31:25.663955] E
>     | >>>>>>> [afr-self-heal-common.c:2262:afr_self_heal_completion_cbk]
>     | >>>>>>>   0-HA-fast-150G-PVE1-replicate-0:
>     | >>>>>>>   background  data self-heal failed on
>     | >>>>>>>   /images/124/vm-124-disk-1.qcow2
>     | >>>>>>>
>     | >>>>>>>
>     | >>>>>>>
>     | >>>>>>>   2014-08-05 10:13 GMT+03:00 Pranith
>     | >>>>>>>   Kumar Karampuri <pkarampu at redhat.com>:
>     | >>>>>>>
>     | >>>>>>>       I just responded to your earlier
>     | >>>>>>>       mail about how the log looks.
>     | >>>>>>>       The log appears in the mount's logfile.
>     | >>>>>>>
>     | >>>>>>>       Pranith
>     | >>>>>>>
>     | >>>>>>>       On 08/05/2014 12:41 PM, Roman wrote:
>     | >>>>>>>>           OK, so I've waited enough, I
>     | >>>>>>>>           think. There was no traffic on the
>     | >>>>>>>>           switch ports between the servers.
>     | >>>>>>>>           Could not find any suitable log
>     | >>>>>>>>           message about a completed
>     | >>>>>>>>           self-heal (waited about 30
>     | >>>>>>>>           minutes). Unplugged the other
>     | >>>>>>>>           server's UTP cable this time
>     | >>>>>>>>           and got into the same situation:
>     | >>>>>>>>           root at gluster-test1:~# cat
>     | >>>>>>>>           /var/log/dmesg
>     | >>>>>>>>           -bash: /bin/cat: Input/output error
>     | >>>>>>>>
>     | >>>>>>>>           brick logs:
>     | >>>>>>>>           [2014-08-05 07:09:03.005474] I
>     | >>>>>>>>           [server.c:762:server_rpc_notify]
>     | >>>>>>>>           0-HA-fast-150G-PVE1-server:
>     | >>>>>>>>           disconnecting connectionfrom
>     | >>>>>>>>
>     pve1-27649-2014/08/04-13:27:54:720789-HA-fast-150G-PVE1-client-0-0
>     | >>>>>>>>           [2014-08-05 07:09:03.005530] I
>     | >>>>>>>>           [server-helpers.c:729:server_connection_put]
>     | >>>>>>>>           0-HA-fast-150G-PVE1-server:
>     | >>>>>>>>           Shutting down connection
>     | >>>>>>>>
>     pve1-27649-2014/08/04-13:27:54:720789-HA-fast-150G-PVE1-client-0-0
>     | >>>>>>>>           [2014-08-05 07:09:03.005560] I
>     | >>>>>>>>           [server-helpers.c:463:do_fd_cleanup]
>     | >>>>>>>>           0-HA-fast-150G-PVE1-server: fd
>     | >>>>>>>>           cleanup on
>     | >>>>>>>>           /images/124/vm-124-disk-1.qcow2
>     | >>>>>>>>           [2014-08-05 07:09:03.005797] I
>     | >>>>>>>> [server-helpers.c:617:server_connection_destroy]
>     | >>>>>>>>           0-HA-fast-150G-PVE1-server:
>     | >>>>>>>>           destroyed connection of
>     | >>>>>>>>
>     pve1-27649-2014/08/04-13:27:54:720789-HA-fast-150G-PVE1-client-0-0
>     | >>>>>>>>
>     | >>>>>>>>
>     | >>>>>>>>
>     | >>>>>>>>
>     | >>>>>>>>
>     | >>>>>>>>           2014-08-05 9:53 GMT+03:00
>     | >>>>>>>>           Pranith Kumar Karampuri
>     | >>>>>>>>           <pkarampu at redhat.com>:
>     | >>>>>>>>
>     | >>>>>>>>               Do you think it is possible for you to do
>     | >>>>>>>>               these tests on the latest version, 3.5.2?
>     | >>>>>>>>               'gluster volume heal <volname> info' would
>     | >>>>>>>>               give you that information in versions > 3.5.1.
>     | >>>>>>>>               Otherwise you will have to check it either
>     | >>>>>>>>               from the logs (there will be a self-heal
>     | >>>>>>>>               completed message in the mount logs) or by
>     | >>>>>>>>               observing 'getfattr -d -m. -e hex
>     | >>>>>>>>               <image-file-on-bricks>'
>     | >>>>>>>>
>     | >>>>>>>>               Pranith
>     | >>>>>>>>
>     | >>>>>>>>
>     | >>>>>>>>               On 08/05/2014 12:09 PM,
>     | >>>>>>>>               Roman wrote:
>     | >>>>>>>>>                   OK, I understand. I will try this shortly.
>     | >>>>>>>>>                   How can I be sure that the healing process
>     | >>>>>>>>>                   is done, if I am not able to see its status?
>     | >>>>>>>>>
>     | >>>>>>>>>
>     | >>>>>>>>>                   2014-08-05 9:30 GMT+03:00
>     | >>>>>>>>>                   Pranith Kumar Karampuri
>     | >>>>>>>>>                   <pkarampu at redhat.com>:
>     | >>>>>>>>>
>     | >>>>>>>>>                       Mounts will do the healing, not the
>     | >>>>>>>>>                       self-heal-daemon. The problem, I feel, is
>     | >>>>>>>>>                       whether the process that does the healing
>     | >>>>>>>>>                       has the latest information about the good
>     | >>>>>>>>>                       bricks in this usecase. Since for the VM
>     | >>>>>>>>>                       usecase the mounts should have the latest
>     | >>>>>>>>>                       information, we should let the mounts do
>     | >>>>>>>>>                       the healing. If the mount accesses the VM
>     | >>>>>>>>>                       image, either by someone doing operations
>     | >>>>>>>>>                       inside the VM or by an explicit stat on
>     | >>>>>>>>>                       the file, it should do the healing.
>     | >>>>>>>>>
>     | >>>>>>>>>                       Pranith.
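
A minimal way to nudge that client-side heal from the Proxmox node would be an explicit stat through the glusterfs mount; the mount point below is an assumption (Proxmox normally mounts storages under /mnt/pve/<storage-id>), so adjust it to the actual one:

    stat /mnt/pve/HA-fast-150G-PVE1/images/127/vm-127-disk-1.qcow2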
>     | >>>>>>>>>
>     | >>>>>>>>>
>     | >>>>>>>>>                       On 08/05/2014 10:39
>     | >>>>>>>>>                       AM, Roman wrote:
>     | >>>>>>>>>>                           Hmmm, you told me to turn it off.
>     | >>>>>>>>>>                           Did I understand something wrong?
>     | >>>>>>>>>>                           After I issued the command you sent
>     | >>>>>>>>>>                           me, I was not able to watch the
>     | >>>>>>>>>>                           healing process; it said it won't be
>     | >>>>>>>>>>                           healed, because it's turned off.
>     | >>>>>>>>>>
>     | >>>>>>>>>>
>     | >>>>>>>>>>                           2014-08-05 5:39 GMT+03:00 Pranith
>     | >>>>>>>>>>                           Kumar Karampuri <pkarampu at redhat.com>:
>     | >>>>>>>>>>
>     | >>>>>>>>>>                               You didn't mention anything about
>     | >>>>>>>>>>                               self-healing. Did you wait until
>     | >>>>>>>>>>                               the self-heal was complete?
>     | >>>>>>>>>>
>     | >>>>>>>>>>                               Pranith
>     | >>>>>>>>>>
>     | >>>>>>>>>>                               On 08/04/2014 05:49 PM, Roman wrote:
>     | >>>>>>>>>>>                                   Hi!
>     | >>>>>>>>>>>                                   The result is pretty much the same. I set the
>     | >>>>>>>>>>>                                   switch port down for the 1st server; it was OK.
>     | >>>>>>>>>>>                                   Then I set it back up and set the other server's
>     | >>>>>>>>>>>                                   port off, and it triggered an IO error on two
>     | >>>>>>>>>>>                                   virtual machines: one with a local root FS but
>     | >>>>>>>>>>>                                   network-mounted storage, and the other with a
>     | >>>>>>>>>>>                                   network root FS. The 1st gave an error on copying
>     | >>>>>>>>>>>                                   to or from the mounted network disk; the other
>     | >>>>>>>>>>>                                   just gave me an error for even reading log files.
>     | >>>>>>>>>>>
>     | >>>>>>>>>>>                                   cat: /var/log/alternatives.log:
>     | >>>>>>>>>>>                                   Input/output error
>     | >>>>>>>>>>>
>     | >>>>>>>>>>>                                   Then I reset the KVM VM and it said there is no
>     | >>>>>>>>>>>                                   boot device. Next I virtually powered it off and
>     | >>>>>>>>>>>                                   then back on, and it booted.
>     | >>>>>>>>>>>
>     | >>>>>>>>>>>                                   By the way, did I have to start/stop the volume?
>     | >>>>>>>>>>>
>     | >>>>>>>>>>>                                   >> Could you do the following and test it again?
>     | >>>>>>>>>>>                                   >> gluster volume set <volname> cluster.self-heal-daemon off
>     | >>>>>>>>>>>
>     | >>>>>>>>>>>                                   >>Pranith
>     | >>>>>>>>>>>
>     | >>>>>>>>>>>
>     | >>>>>>>>>>>
>     | >>>>>>>>>>>
>     | >>>>>>>>>>>                                   2014-08-04 14:10 GMT+03:00 Pranith Kumar
>     | >>>>>>>>>>>                                   Karampuri <pkarampu at redhat.com>:
>     | >>>>>>>>>>>
>     | >>>>>>>>>>>
>     | >>>>>>>>>>>                                       On 08/04/2014 03:33 PM, Roman wrote:
>     | >>>>>>>>>>>>                                           Hello!
>     | >>>>>>>>>>>>
>     | >>>>>>>>>>>>                                           Facing the same problem as mentioned here:
>     | >>>>>>>>>>>>                                           http://supercolony.gluster.org/pipermail/gluster-users/2014-April/039959.html
>     | >>>>>>>>>>>>
>     | >>>>>>>>>>>>                                           My setup is up and running, so I'm ready to
>     | >>>>>>>>>>>>                                           help you back with feedback.
>     | >>>>>>>>>>>>
>     | >>>>>>>>>>>>                                           Setup:
>     | >>>>>>>>>>>>                                           Proxmox server as client
>     | >>>>>>>>>>>>                                           2 physical gluster servers
>     | >>>>>>>>>>>>
>     | >>>>>>>>>>>>                                           Server side and client side are both running
>     | >>>>>>>>>>>>                                           glusterfs 3.4.4 from the gluster repo at the
>     | >>>>>>>>>>>>                                           moment.
>     | >>>>>>>>>>>>
>     | >>>>>>>>>>>>                                           The problem is:
>     | >>>>>>>>>>>>
>     | >>>>>>>>>>>>                                           1. created replica bricks
>     | >>>>>>>>>>>>                                           2. mounted in Proxmox (tried both Proxmox
>     | >>>>>>>>>>>>                                              ways: via GUI and via fstab with a backup
>     | >>>>>>>>>>>>                                              volume line; see the example fstab line
>     | >>>>>>>>>>>>                                              right after this list); by the way, while
>     | >>>>>>>>>>>>                                              mounting via fstab I'm unable to launch a
>     | >>>>>>>>>>>>                                              VM without cache, even though
>     | >>>>>>>>>>>>                                              direct-io-mode is enabled in the fstab line
>     | >>>>>>>>>>>>                                           3. installed a VM
>     | >>>>>>>>>>>>                                           4. bring one volume down - ok
>     | >>>>>>>>>>>>                                           5. bring it up, wait until the sync is done
>     | >>>>>>>>>>>>                                           6. bring the other volume down - getting IO
>     | >>>>>>>>>>>>                                              errors on the VM guest and not able to
>     | >>>>>>>>>>>>                                              restore the VM after I reset the VM via the
>     | >>>>>>>>>>>>                                              host. It says (no bootable media). After I
>     | >>>>>>>>>>>>                                              shut it down (forced) and bring it back up,
>     | >>>>>>>>>>>>                                              it boots.
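
A sketch of the kind of fstab line described in step 2; the host and volume names are taken from elsewhere in the thread, the mount point is an assumption, and backupvolfile-server / direct-io-mode are standard mount.glusterfs options:

    stor1:/HA-MED-PVE1-1T /mnt/pve/gluster glusterfs defaults,_netdev,backupvolfile-server=stor2,direct-io-mode=enable 0 0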
>     | >>>>>>>>>>>                                       Could you do the following and test it
>     | >>>>>>>>>>>                                       again?
>     | >>>>>>>>>>>                                       gluster volume set <volname> cluster.self-heal-daemon off
>     | >>>>>>>>>>>
>     | >>>>>>>>>>>                                       Pranith
>     | >>>>>>>>>>>>
>     | >>>>>>>>>>>>                                           Need help. Tried 3.4.3 and 3.4.4. Still
>     | >>>>>>>>>>>>                                           missing packages for 3.4.5 and 3.5.2 for
>     | >>>>>>>>>>>>                                           Debian (3.5.1 always gives a healing
>     | >>>>>>>>>>>>                                           error for some reason).
>     | >>>>>>>>>>>>
>     | >>>>>>>>>>>>                                           --
>     | >>>>>>>>>>>>                                           Best regards,
>     | >>>>>>>>>>>>                                           Roman.
>     | >>>>>>>>>>>>
>     | >>>>>>>>>>>>                                           _______________________________________________
>     | >>>>>>>>>>>>                                           Gluster-users mailing list
>     | >>>>>>>>>>>>                                           Gluster-users at gluster.org
>     | >>>>>>>>>>>>                                           http://supercolony.gluster.org/mailman/listinfo/gluster-users
>     | >>>>>>>>>>>
>     | >>>>>>>>>>>
>     | >>>>>>>>>>>
>     | >>>>>>>>>>>
>     | >>>>>>>>>>>                                   --
>     | >>>>>>>>>>>                                   Best regards,
>     | >>>>>>>>>>>                                   Roman.
>     | >>>>>>>>>>
>     | >>>>>>>>>>
>     | >>>>>>>>>>
>     | >>>>>>>>>>
>     | >>>>>>>>>>                           --
>     | >>>>>>>>>>                           Best regards,
>     | >>>>>>>>>>                           Roman.
>     | >>>>>>>>>
>     | >>>>>>>>>
>     | >>>>>>>>>
>     | >>>>>>>>>
>     | >>>>>>>>>                   --
>     | >>>>>>>>>                   Best regards,
>     | >>>>>>>>>                   Roman.
>     | >>>>>>>>
>     | >>>>>>>>
>     | >>>>>>>>
>     | >>>>>>>>
>     | >>>>>>>>           --
>     | >>>>>>>>           Best regards,
>     | >>>>>>>>           Roman.
>     | >>>>>>>
>     | >>>>>>>
>     | >>>>>>>
>     | >>>>>>>
>     | >>>>>>>   --
>     | >>>>>>>   Best regards,
>     | >>>>>>>   Roman.
>     | >>>>>>
>     | >>>>>>
>     | >>>>>>
>     | >>>>>>
>     | >>>>>>                         --
>     | >>>>>>                         Best regards,
>     | >>>>>> Roman.
>     | >>>>>
>     | >>>>>
>     | >>>>>
>     | >>>>>
>     | >>>>>                     --
>     | >>>>>                     Best regards,
>     | >>>>>                     Roman.
>     | >>>>
>     | >>>>
>     | >>>>
>     | >>>>
>     | >>>>                 --
>     | >>>>                 Best regards,
>     | >>>>                 Roman.
>     | >>>
>     | >>>
>     | >>>
>     | >>>
>     | >>>             --
>     | >>>             Best regards,
>     | >>>             Roman.
>     | >>
>     | >>
>     | >>
>     | >>
>     | >>         --
>     | >>         Best regards,
>     | >>         Roman.
>     | >>
>     | >>
>     | >>
>     | >>
>     | >>     --
>     | >>     Best regards,
>     | >>     Roman.
>     | >
>     | >
>     | >
>     | >
>     | > --
>     | > Best regards,
>     | > Roman.
>     |
>     |
>
>
>
>
> -- 
> Best regards,
> Roman.
