[Gluster-users] libgfapi failover problem on replica bricks
Pranith Kumar Karampuri
pkarampu at redhat.com
Wed Aug 27 06:50:48 UTC 2014
On 08/27/2014 11:53 AM, Roman wrote:
> Okay,
> so here are the first results:
>
> after I disconnected the first server, I got this:
>
> root at stor2:~# gluster volume heal HA-FAST-PVE1-150G info
> Volume heal failed
Can you check if the following binary is present?
/usr/sbin/glfsheal
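
For example, something like this should confirm it (a quick check; I am assuming the Debian package layout here):

# does the heal helper exist, and which package (if any) owns it?
ls -l /usr/sbin/glfsheal
dpkg -S /usr/sbin/glfsheal
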
Pranith
>
>
> but
> [2014-08-26 11:45:35.315974] I
> [afr-self-heal-common.c:2868:afr_log_self_heal_completion_status]
> 0-HA-FAST-PVE1-150G-replicate-0: foreground data self heal is
> successfully completed, data self heal from
> HA-FAST-PVE1-150G-client-1 to sinks HA-FAST-PVE1-150G-client-0, with
> 16108814336 bytes on HA-FAST-PVE1-150G-client-0, 16108814336 bytes on
> HA-FAST-PVE1-150G-client-1, data - Pending matrix: [ [ 0 0 ] [ 348 0
> ] ] on <gfid:e3ede9c6-28d6-4755-841a-d8329e42ccc4>
>
> Did something go wrong during the upgrade?
>
> I've got two VMs on different volumes: one with HD on and the other
> with HD off.
> Both survived the outage and both seemed to be in sync.
>
> but today I found what looks like a bug in log rotation.
>
> The logs rotated on both the server and client sides, but they are
> still being written to the *.log.1 files :)
>
> /var/log/glusterfs/mnt-pve-HA-MED-PVE1-1T.log.1
> /var/log/glusterfs/glustershd.log.1
>
> This behaviour appeared after the upgrade.
>
> The logrotate.d conf files include the HUP for the gluster PIDs.
>
> client:
> /var/log/glusterfs/*.log {
> daily
> rotate 7
> delaycompress
> compress
> notifempty
> missingok
> postrotate
> [ ! -f /var/run/glusterd.pid ] || kill -HUP `cat
> /var/run/glusterd.pid`
> endscript
> }
>
> but I'm not able to ls the pid file on the client side (should it be there?) :(
>
> and servers:
> /var/log/glusterfs/*.log {
> daily
> rotate 7
> delaycompress
> compress
> notifempty
> missingok
> postrotate
> [ ! -f /var/run/glusterd.pid ] || kill -HUP `cat
> /var/run/glusterd.pid`
> endscript
> }
>
>
> /var/log/glusterfs/*/*.log {
> daily
> rotate 7
> delaycompress
> compress
> notifempty
> missingok
> copytruncate
> postrotate
> [ ! -f /var/run/glusterd.pid ] || kill -HUP `cat
> /var/run/glusterd.pid`
> endscript
> }
>
> I do have /var/run/glusterd.pid on the server side.
>
> Should I change something? Log rotation seems to be broken.
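>
> One option might be to switch both stanzas to copytruncate instead of
> relying on the HUP, roughly like this (just a sketch, assuming
> copytruncate is acceptable for these logs; it lets the running daemons
> keep writing to the same, truncated file):
>
> /var/log/glusterfs/*.log {
>     daily
>     rotate 7
>     delaycompress
>     compress
>     notifempty
>     missingok
>     copytruncate
> }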
>
>
>
>
>
>
> 2014-08-26 9:29 GMT+03:00 Pranith Kumar Karampuri <pkarampu at redhat.com
> <mailto:pkarampu at redhat.com>>:
>
>
> On 08/26/2014 11:55 AM, Roman wrote:
>> Hello all again!
>> I'm back from vacation and I'm pretty happy that 3.5.2 is available
>> for wheezy. Thanks! I've just applied my updates.
>> For 3.5.2, do I still have to set cluster.self-heal-daemon to off?
> Welcome back :-). If you set it to off, the test case you execute
> should work (please validate :-) ). But we also need to test it with
> self-heal-daemon 'on' and fix any bugs if the test case does not work.
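>
> For example (just a sketch; I am assuming the volume in question is
> still HA-FAST-PVE1-150G):
>
> # run the failover test once with the daemon off, then repeat with it on
> gluster volume set HA-FAST-PVE1-150G cluster.self-heal-daemon off
> gluster volume heal HA-FAST-PVE1-150G info   # 3.5.2 lists pending heals here
> gluster volume set HA-FAST-PVE1-150G cluster.self-heal-daemon on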
>
> Pranith.
>
>>
>>
>> 2014-08-06 12:49 GMT+03:00 Humble Chirammal <hchiramm at redhat.com
>> <mailto:hchiramm at redhat.com>>:
>>
>>
>>
>>
>> ----- Original Message -----
>> | From: "Pranith Kumar Karampuri" <pkarampu at redhat.com
>> <mailto:pkarampu at redhat.com>>
>> | To: "Roman" <romeo.r at gmail.com <mailto:romeo.r at gmail.com>>
>> | Cc: gluster-users at gluster.org
>> <mailto:gluster-users at gluster.org>, "Niels de Vos"
>> <ndevos at redhat.com <mailto:ndevos at redhat.com>>, "Humble
>> Chirammal" <hchiramm at redhat.com <mailto:hchiramm at redhat.com>>
>> | Sent: Wednesday, August 6, 2014 12:09:57 PM
>> | Subject: Re: [Gluster-users] libgfapi failover problem on
>> replica bricks
>> |
>> | Roman,
>> |       The file went into split-brain. I think we should do these
>> | tests with 3.5.2, where monitoring the heals is easier. Let me also
>> | come up with a document about how to do the testing you are trying
>> | to do.
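>> |
>> | With 3.5.2 the pending heals should then be visible straight from
>> | the CLI, e.g. (a sketch; I am assuming the volume here is named
>> | HA-fast-150G-PVE1, going by the client translator names in the logs):
>> |
>> | gluster volume heal HA-fast-150G-PVE1 info
>> | gluster volume heal HA-fast-150G-PVE1 info split-brain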
>> |
>> | Humble/Niels,
>> |       Do we have debs available for 3.5.2? In 3.5.1 there was a
>> | packaging issue where /usr/bin/glfsheal was not packaged along with
>> | the deb. I think that should be fixed now as well?
>> |
>> Pranith,
>>
>> The 3.5.2 packages for Debian are not available yet. We are
>> coordinating internally to get them processed.
>> I will update the list once they are available.
>>
>> --Humble
>> |
>> | On 08/06/2014 11:52 AM, Roman wrote:
>> | > good morning,
>> | >
>> | > root at stor1:~# getfattr -d -m. -e hex
>> | > /exports/fast-test/150G/images/127/vm-127-disk-1.qcow2
>> | > getfattr: Removing leading '/' from absolute path names
>> | > # file: exports/fast-test/150G/images/127/vm-127-disk-1.qcow2
>> | >
>> trusted.afr.HA-fast-150G-PVE1-client-0=0x000000000000000000000000
>> | >
>> trusted.afr.HA-fast-150G-PVE1-client-1=0x000001320000000000000000
>> | > trusted.gfid=0x23c79523075a4158bea38078da570449
>> | >
>> | > getfattr: Removing leading '/' from absolute path names
>> | > # file: exports/fast-test/150G/images/127/vm-127-disk-1.qcow2
>> | >
>> trusted.afr.HA-fast-150G-PVE1-client-0=0x000000040000000000000000
>> | >
>> trusted.afr.HA-fast-150G-PVE1-client-1=0x000000000000000000000000
>> | > trusted.gfid=0x23c79523075a4158bea38078da570449
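>> | >
>> | > Reading those changelogs (assuming the usual AFR layout of 4 bytes
>> | > each for the data/metadata/entry pending counters, and that the
>> | > first block is stor1's copy and the second stor2's), each replica
>> | > is blaming the other:
>> | >
>> | > printf '%d\n' 0x00000132   # 306 data ops pending against client-1
>> | > printf '%d\n' 0x00000004   # 4 data ops pending against client-0
>> | >
>> | > Both copies accusing each other matches the split-brain mentioned above.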
>> | >
>> | >
>> | >
>> | > 2014-08-06 9:20 GMT+03:00 Pranith Kumar Karampuri
>> <pkarampu at redhat.com <mailto:pkarampu at redhat.com>
>> | > <mailto:pkarampu at redhat.com <mailto:pkarampu at redhat.com>>>:
>> | >
>> | >
>> | > On 08/06/2014 11:30 AM, Roman wrote:
>> | >> Also, this time files are not the same!
>> | >>
>> | >> root at stor1:~# md5sum
>> | >> /exports/fast-test/150G/images/127/vm-127-disk-1.qcow2
>> | >> 32411360c53116b96a059f17306caeda
>> | >> /exports/fast-test/150G/images/127/vm-127-disk-1.qcow2
>> | >>
>> | >> root at stor2:~# md5sum
>> | >> /exports/fast-test/150G/images/127/vm-127-disk-1.qcow2
>> | >> 65b8a6031bcb6f5fb3a11cb1e8b1c9c9
>> | >> /exports/fast-test/150G/images/127/vm-127-disk-1.qcow2
>> | > What is the getfattr output?
>> | >
>> | > Pranith
>> | >
>> | >>
>> | >>
>> | >> 2014-08-05 16:33 GMT+03:00 Roman <romeo.r at gmail.com
>> <mailto:romeo.r at gmail.com>
>> | >> <mailto:romeo.r at gmail.com <mailto:romeo.r at gmail.com>>>:
>> | >>
>> | >>             Nope, it is not working. But this time it went a
>> | >>             bit differently.
>> | >>
>> | >> root at gluster-client:~# dmesg
>> | >> Segmentation fault
>> | >>
>> | >>
>> | >>             I was not even able to start the VM after I had
>> | >>             done the tests
>> | >>
>> | >> Could not read qcow2 header: Operation not permitted
>> | >>
>> | >>             And it seems it never starts to sync files after
>> | >>             the first disconnect. The VM survives the first
>> | >>             disconnect, but not the second (I waited around 30
>> | >>             minutes). Also, I've got network.ping-timeout: 2 in
>> | >>             the volume settings, but the logs react to the first
>> | >>             disconnect only after around 30 seconds. The second
>> | >>             was faster, 2 seconds.
>> | >>
>> | >> Reaction was different also:
>> | >>
>> | >> slower one:
>> | >> [2014-08-05 13:26:19.558435] W
>> [socket.c:514:__socket_rwv]
>> | >> 0-glusterfs: readv failed (Connection timed out)
>> | >> [2014-08-05 13:26:19.558485] W
>> | >> [socket.c:1962:__socket_proto_state_machine] 0-glusterfs:
>> | >> reading from socket failed. Error (Connection
>> timed out),
>> | >> peer (10.250.0.1:24007 <http://10.250.0.1:24007>
>> <http://10.250.0.1:24007>)
>> | >> [2014-08-05 13:26:21.281426] W
>> [socket.c:514:__socket_rwv]
>> | >> 0-HA-fast-150G-PVE1-client-0: readv failed (Connection
>> timed out)
>> | >> [2014-08-05 13:26:21.281474] W
>> | >> [socket.c:1962:__socket_proto_state_machine]
>> | >> 0-HA-fast-150G-PVE1-client-0: reading from socket failed.
>> | >> Error (Connection timed out), peer
>> (10.250.0.1:49153 <http://10.250.0.1:49153>
>> | >> <http://10.250.0.1:49153>)
>> | >> [2014-08-05 13:26:21.281507] I
>> | >> [client.c:2098:client_rpc_notify]
>> | >> 0-HA-fast-150G-PVE1-client-0: disconnected
>> | >>
>> | >> the fast one:
>> | >> 2014-08-05 12:52:44.607389] C
>> | >> [client-handshake.c:127:rpc_client_ping_timer_expired]
>> | >> 0-HA-fast-150G-PVE1-client-1: server 10.250.0.2:49153
>> <http://10.250.0.2:49153>
>> | >> <http://10.250.0.2:49153> has not responded in
>> the last 2
>> | >> seconds, disconnecting.
>> | >> [2014-08-05 12:52:44.607491] W
>> [socket.c:514:__socket_rwv]
>> | >> 0-HA-fast-150G-PVE1-client-1: readv failed (No data
>> available)
>> | >> [2014-08-05 12:52:44.607585] E
>> | >> [rpc-clnt.c:368:saved_frames_unwind]
>> | >>
>> (-->/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_notify+0xf8)
>> | >> [0x7fcb1b4b0558]
>> | >>
>> (-->/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xc3)
>> | >> [0x7fcb1b4aea63]
>> | >>
>> (-->/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(saved_frames_destroy+0xe)
>> | >> [0x7fcb1b4ae97e])))
>> 0-HA-fast-150G-PVE1-client-1: forced
>> | >> unwinding frame type(GlusterFS 3.3)
>> op(LOOKUP(27)) called at
>> | >> 2014-08-05 12:52:42.463881 (xid=0x381883x)
>> | >> [2014-08-05 12:52:44.607604] W
>> | >> [client-rpc-fops.c:2624:client3_3_lookup_cbk]
>> | >> 0-HA-fast-150G-PVE1-client-1: remote operation failed:
>> | >> Transport endpoint is not connected. Path: /
>> | >> (00000000-0000-0000-0000-000000000001)
>> | >> [2014-08-05 12:52:44.607736] E
>> | >> [rpc-clnt.c:368:saved_frames_unwind]
>> | >>
>> (-->/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_notify+0xf8)
>> | >> [0x7fcb1b4b0558]
>> | >>
>> (-->/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xc3)
>> | >> [0x7fcb1b4aea63]
>> | >>
>> (-->/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(saved_frames_destroy+0xe)
>> | >> [0x7fcb1b4ae97e])))
>> 0-HA-fast-150G-PVE1-client-1: forced
>> | >> unwinding frame type(GlusterFS Handshake)
>> op(PING(3)) called
>> | >> at 2014-08-05 12:52:42.463891 (xid=0x381884x)
>> | >> [2014-08-05 12:52:44.607753] W
>> | >> [client-handshake.c:276:client_ping_cbk]
>> | >> 0-HA-fast-150G-PVE1-client-1: timer must have expired
>> | >> [2014-08-05 12:52:44.607776] I
>> | >> [client.c:2098:client_rpc_notify]
>> | >> 0-HA-fast-150G-PVE1-client-1: disconnected
>> | >>
>> | >>
>> | >>
>> | >>             I've got SSD disks (just for info).
>> | >>             Should I go and give 3.5.2 a try?
>> | >>
>> | >>
>> | >>
>> | >> 2014-08-05 13:06 GMT+03:00 Pranith Kumar Karampuri
>> | >> <pkarampu at redhat.com
>> <mailto:pkarampu at redhat.com> <mailto:pkarampu at redhat.com
>> <mailto:pkarampu at redhat.com>>>:
>> | >>
>> | >>             Reply along with gluster-users please :-). Maybe
>> | >>             you are hitting 'reply' instead of 'reply all'?
>> | >>
>> | >> Pranith
>> | >>
>> | >> On 08/05/2014 03:35 PM, Roman wrote:
>> | >>>             To make sure and start clean, I've created
>> | >>>             another VM with raw format and am going to repeat
>> | >>>             those steps. So now I've got two VMs, one with
>> | >>>             qcow2 format and the other with raw format. I will
>> | >>>             send another e-mail shortly.
>> | >>>
>> | >>>
>> | >>> 2014-08-05 13:01 GMT+03:00 Pranith Kumar
>> Karampuri
>> | >>> <pkarampu at redhat.com
>> <mailto:pkarampu at redhat.com> <mailto:pkarampu at redhat.com
>> <mailto:pkarampu at redhat.com>>>:
>> | >>>
>> | >>>
>> | >>> On 08/05/2014 03:07 PM, Roman wrote:
>> | >>>> really, seems like the same file
>> | >>>>
>> | >>>> stor1:
>> | >>>> a951641c5230472929836f9fcede6b04
>> | >>>> /exports/fast-test/150G/images/127/vm-127-disk-1.qcow2
>> | >>>>
>> | >>>> stor2:
>> | >>>> a951641c5230472929836f9fcede6b04
>> | >>>> /exports/fast-test/150G/images/127/vm-127-disk-1.qcow2
>> | >>>>
>> | >>>>
>> | >>>>             one thing I've seen from the logs is that
>> | >>>>             somehow Proxmox VE is connecting to the servers
>> | >>>>             with the wrong version?
>> | >>>> [2014-08-05 09:23:45.218550] I
>> | >>>>
>> [client-handshake.c:1659:select_server_supported_programs]
>> | >>>> 0-HA-fast-150G-PVE1-client-0: Using Program
>> | >>>> GlusterFS 3.3, Num (1298437), Version (330)
>> | >>>             It is the rpc (on-the-wire data structures)
>> | >>>             version, which has not changed at all since 3.3,
>> | >>>             so that's not a problem. So what is the
>> | >>>             conclusion? Is your test case working now or not?
>> | >>>
>> | >>> Pranith
>> | >>>
>> | >>>> but if I issue:
>> | >>>> root at pve1:~# glusterfs -V
>> | >>>> glusterfs 3.4.4 built on Jun 28 2014 03:44:57
>> | >>>> seems ok.
>> | >>>>
>> | >>>> server use 3.4.4 meanwhile
>> | >>>> [2014-08-05 09:23:45.117875] I
>> | >>>> [server-handshake.c:567:server_setvolume]
>> | >>>> 0-HA-fast-150G-PVE1-server: accepted client from
>> | >>>>
>> stor1-9004-2014/08/05-09:23:45:93538-HA-fast-150G-PVE1-client-1-0
>> | >>>> (version: 3.4.4)
>> | >>>> [2014-08-05 09:23:49.103035] I
>> | >>>> [server-handshake.c:567:server_setvolume]
>> | >>>> 0-HA-fast-150G-PVE1-server: accepted client from
>> | >>>>
>> stor1-8998-2014/08/05-09:23:45:89883-HA-fast-150G-PVE1-client-0-0
>> | >>>> (version: 3.4.4)
>> | >>>>
>> | >>>>             if this could be the reason, of course.
>> | >>>>             I did restart the Proxmox VE yesterday (just
>> | >>>>             for information)
>> | >>>>
>> | >>>>
>> | >>>>
>> | >>>>
>> | >>>>
>> | >>>> 2014-08-05 12:30 GMT+03:00 Pranith Kumar Karampuri
>> | >>>> <pkarampu at redhat.com
>> <mailto:pkarampu at redhat.com> <mailto:pkarampu at redhat.com
>> <mailto:pkarampu at redhat.com>>>:
>> | >>>>
>> | >>>>
>> | >>>> On 08/05/2014 02:33 PM, Roman wrote:
>> | >>>>>             I've waited long enough for now; still different
>> | >>>>>             sizes and no logs about healing :(
>> | >>>>>
>> | >>>>> stor1
>> | >>>>> # file:
>> | >>>>> exports/fast-test/150G/images/127/vm-127-disk-1.qcow2
>> | >>>>>
>> trusted.afr.HA-fast-150G-PVE1-client-0=0x000000000000000000000000
>> | >>>>>
>> trusted.afr.HA-fast-150G-PVE1-client-1=0x000000000000000000000000
>> | >>>>> trusted.gfid=0xf10ad81b58484bcd9b385a36a207f921
>> | >>>>>
>> | >>>>> root at stor1:~# du -sh
>> | >>>>> /exports/fast-test/150G/images/127/
>> | >>>>> 1.2G /exports/fast-test/150G/images/127/
>> | >>>>>
>> | >>>>>
>> | >>>>> stor2
>> | >>>>> # file:
>> | >>>>> exports/fast-test/150G/images/127/vm-127-disk-1.qcow2
>> | >>>>>
>> trusted.afr.HA-fast-150G-PVE1-client-0=0x000000000000000000000000
>> | >>>>>
>> trusted.afr.HA-fast-150G-PVE1-client-1=0x000000000000000000000000
>> | >>>>> trusted.gfid=0xf10ad81b58484bcd9b385a36a207f921
>> | >>>>>
>> | >>>>>
>> | >>>>> root at stor2:~# du -sh
>> | >>>>> /exports/fast-test/150G/images/127/
>> | >>>>> 1.4G /exports/fast-test/150G/images/127/
>> | >>>> According to the changelogs, the file doesn't
>> | >>>> need any healing. Could you stop the operations
>> | >>>> on the VMs and take md5sum on both
>> these machines?
>> | >>>>
>> | >>>> Pranith
>> | >>>>
>> | >>>>>
>> | >>>>>
>> | >>>>>
>> | >>>>>
>> | >>>>> 2014-08-05 11:49 GMT+03:00 Pranith Kumar
>> | >>>>> Karampuri <pkarampu at redhat.com
>> <mailto:pkarampu at redhat.com>
>> | >>>>> <mailto:pkarampu at redhat.com
>> <mailto:pkarampu at redhat.com>>>:
>> | >>>>>
>> | >>>>>
>> | >>>>> On 08/05/2014 02:06 PM, Roman wrote:
>> | >>>>>>             Well, it seems like it doesn't see that
>> | >>>>>>             changes were made to the volume? I created
>> | >>>>>>             two files, 200 and 100 MB (from /dev/zero),
>> | >>>>>>             after I disconnected the first brick. Then I
>> | >>>>>>             connected it back and got these logs:
>> | >>>>>>
>> | >>>>>> [2014-08-05 08:30:37.830150] I
>> | >>>>>> [glusterfsd-mgmt.c:1584:mgmt_getspec_cbk]
>> | >>>>>> 0-glusterfs: No change in volfile, continuing
>> | >>>>>> [2014-08-05 08:30:37.830207] I
>> | >>>>>> [rpc-clnt.c:1676:rpc_clnt_reconfig]
>> | >>>>>> 0-HA-fast-150G-PVE1-client-0: changing
>> | >>>>>> port to 49153 (from 0)
>> | >>>>>> [2014-08-05 08:30:37.830239] W
>> | >>>>>> [socket.c:514:__socket_rwv]
>> | >>>>>> 0-HA-fast-150G-PVE1-client-0: readv
>> | >>>>>> failed (No data available)
>> | >>>>>> [2014-08-05 08:30:37.831024] I
>> | >>>>>>
>> [client-handshake.c:1659:select_server_supported_programs]
>> | >>>>>> 0-HA-fast-150G-PVE1-client-0: Using
>> | >>>>>> Program GlusterFS 3.3, Num (1298437),
>> | >>>>>> Version (330)
>> | >>>>>> [2014-08-05 08:30:37.831375] I
>> | >>>>>> [client-handshake.c:1456:client_setvolume_cbk]
>> | >>>>>> 0-HA-fast-150G-PVE1-client-0: Connected
>> | >>>>>> to 10.250.0.1:49153 <http://10.250.0.1:49153>
>> | >>>>>> <http://10.250.0.1:49153>, attached to
>> | >>>>>> remote volume '/exports/fast-test/150G'.
>> | >>>>>> [2014-08-05 08:30:37.831394] I
>> | >>>>>> [client-handshake.c:1468:client_setvolume_cbk]
>> | >>>>>> 0-HA-fast-150G-PVE1-client-0: Server and
>> | >>>>>> Client lk-version numbers are not same,
>> | >>>>>> reopening the fds
>> | >>>>>> [2014-08-05 08:30:37.831566] I
>> | >>>>>> [client-handshake.c:450:client_set_lk_version_cbk]
>> | >>>>>> 0-HA-fast-150G-PVE1-client-0: Server lk
>> | >>>>>> version = 1
>> | >>>>>>
>> | >>>>>>
>> | >>>>>> [2014-08-05 08:30:37.830150] I
>> | >>>>>> [glusterfsd-mgmt.c:1584:mgmt_getspec_cbk]
>> | >>>>>> 0-glusterfs: No change in volfile, continuing
>> | >>>>>>             this line seems weird to me tbh.
>> | >>>>>>             I do not see any traffic on the switch
>> | >>>>>>             interfaces between the gluster servers, which
>> | >>>>>>             means there is no syncing between them.
>> | >>>>>>             I tried to ls -l the files on the client
>> | >>>>>>             and the servers to trigger the healing, but
>> | >>>>>>             it seems that did not help. Should I wait more?
>> | >>>>> Yes, it should take around 10-15 minutes.
>> | >>>>> Could you provide 'getfattr -d -m. -e hex
>> | >>>>> <file-on-brick>' on both the bricks.
>> | >>>>>
>> | >>>>> Pranith
>> | >>>>>
>> | >>>>>>
>> | >>>>>>
>> | >>>>>> 2014-08-05 11:25 GMT+03:00 Pranith Kumar
>> | >>>>>> Karampuri <pkarampu at redhat.com
>> <mailto:pkarampu at redhat.com>
>> | >>>>>> <mailto:pkarampu at redhat.com
>> <mailto:pkarampu at redhat.com>>>:
>> | >>>>>>
>> | >>>>>>
>> | >>>>>> On 08/05/2014 01:10 PM, Roman wrote:
>> | >>>>>>>             Ahha! For some reason I was not able
>> | >>>>>>>             to start the VM anymore; Proxmox VE
>> | >>>>>>>             told me that it is not able to read
>> | >>>>>>>             the qcow2 header because permission
>> | >>>>>>>             was denied for some reason. So I just
>> | >>>>>>>             deleted that file and created a new
>> | >>>>>>>             VM. And the next message I've got was
>> | >>>>>>>             this:
>> | >>>>>> Seems like these are the messages
>> | >>>>>> where you took down the bricks before
>> | >>>>>> self-heal. Could you restart the run
>> | >>>>>> waiting for self-heals to complete
>> | >>>>>> before taking down the next brick?
>> | >>>>>>
>> | >>>>>> Pranith
>> | >>>>>>
>> | >>>>>>>
>> | >>>>>>>
>> | >>>>>>> [2014-08-05 07:31:25.663412] E
>> | >>>>>>>
>> [afr-self-heal-common.c:197:afr_sh_print_split_brain_log]
>> | >>>>>>> 0-HA-fast-150G-PVE1-replicate-0:
>> | >>>>>>> Unable to self-heal contents of
>> | >>>>>>> '/images/124/vm-124-disk-1.qcow2'
>> | >>>>>>> (possible split-brain). Please
>> | >>>>>>> delete the file from all but the
>> | >>>>>>> preferred subvolume.- Pending
>> | >>>>>>> matrix: [ [ 0 60 ] [ 11 0 ] ]
>> | >>>>>>> [2014-08-05 07:31:25.663955] E
>> | >>>>>>>
>> [afr-self-heal-common.c:2262:afr_self_heal_completion_cbk]
>> | >>>>>>> 0-HA-fast-150G-PVE1-replicate-0:
>> | >>>>>>> background data self-heal failed on
>> | >>>>>>> /images/124/vm-124-disk-1.qcow2
>> | >>>>>>>
>> | >>>>>>>
>> | >>>>>>>
>> | >>>>>>> 2014-08-05 10:13 GMT+03:00 Pranith
>> | >>>>>>> Kumar Karampuri
>> <pkarampu at redhat.com <mailto:pkarampu at redhat.com>
>> | >>>>>>> <mailto:pkarampu at redhat.com
>> <mailto:pkarampu at redhat.com>>>:
>> | >>>>>>>
>> | >>>>>>> I just responded to your earlier
>> | >>>>>>> mail about how the log looks.
>> | >>>>>>> The log comes on the mount's
>> logfile
>> | >>>>>>>
>> | >>>>>>> Pranith
>> | >>>>>>>
>> | >>>>>>> On 08/05/2014 12:41 PM, Roman
>> wrote:
>> | >>>>>>>> Ok, so I've waited long enough,
>> | >>>>>>>> I think. There was no traffic at
>> | >>>>>>>> all on the switch ports between
>> | >>>>>>>> the servers. Could not find any
>> | >>>>>>>> suitable log message about a
>> | >>>>>>>> completed self-heal (waited
>> | >>>>>>>> about 30 minutes). Plugged out
>> | >>>>>>>> the other server's UTP cable
>> | >>>>>>>> this time and got into the same
>> | >>>>>>>> situation:
>> | >>>>>>>> root at gluster-test1:~# cat
>> | >>>>>>>> /var/log/dmesg
>> | >>>>>>>> -bash: /bin/cat:
>> Input/output error
>> | >>>>>>>>
>> | >>>>>>>> brick logs:
>> | >>>>>>>> [2014-08-05
>> 07:09:03.005474] I
>> | >>>>>>>> [server.c:762:server_rpc_notify]
>> | >>>>>>>> 0-HA-fast-150G-PVE1-server:
>> | >>>>>>>> disconnecting connectionfrom
>> | >>>>>>>>
>> pve1-27649-2014/08/04-13:27:54:720789-HA-fast-150G-PVE1-client-0-0
>> | >>>>>>>> [2014-08-05
>> 07:09:03.005530] I
>> | >>>>>>>> [server-helpers.c:729:server_connection_put]
>> | >>>>>>>> 0-HA-fast-150G-PVE1-server:
>> | >>>>>>>> Shutting down connection
>> | >>>>>>>>
>> pve1-27649-2014/08/04-13:27:54:720789-HA-fast-150G-PVE1-client-0-0
>> | >>>>>>>> [2014-08-05
>> 07:09:03.005560] I
>> | >>>>>>>> [server-helpers.c:463:do_fd_cleanup]
>> | >>>>>>>> 0-HA-fast-150G-PVE1-server: fd
>> | >>>>>>>> cleanup on
>> | >>>>>>>> /images/124/vm-124-disk-1.qcow2
>> | >>>>>>>> [2014-08-05
>> 07:09:03.005797] I
>> | >>>>>>>> [server-helpers.c:617:server_connection_destroy]
>> | >>>>>>>> 0-HA-fast-150G-PVE1-server:
>> | >>>>>>>> destroyed connection of
>> | >>>>>>>>
>> pve1-27649-2014/08/04-13:27:54:720789-HA-fast-150G-PVE1-client-0-0
>> | >>>>>>>>
>> | >>>>>>>>
>> | >>>>>>>>
>> | >>>>>>>>
>> | >>>>>>>>
>> | >>>>>>>> 2014-08-05 9:53 GMT+03:00
>> | >>>>>>>> Pranith Kumar Karampuri
>> | >>>>>>>> <pkarampu at redhat.com
>> <mailto:pkarampu at redhat.com>
>> | >>>>>>>> <mailto:pkarampu at redhat.com
>> <mailto:pkarampu at redhat.com>>>:
>> | >>>>>>>>
>> | >>>>>>>> Do you think it is possible
>> | >>>>>>>> for you to do these tests on
>> | >>>>>>>> the latest version, 3.5.2?
>> | >>>>>>>> 'gluster volume heal
>> | >>>>>>>> <volname> info' would give
>> | >>>>>>>> you that information in
>> | >>>>>>>> versions > 3.5.1. Otherwise
>> | >>>>>>>> you will have to check it
>> | >>>>>>>> either from the logs (there
>> | >>>>>>>> will be a self-heal completed
>> | >>>>>>>> message in the mount logs) or
>> | >>>>>>>> by observing 'getfattr -d -m.
>> | >>>>>>>> -e hex <image-file-on-bricks>'
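>> | >>>>>>>>
>> | >>>>>>>> For the log route, a grep
>> | >>>>>>>> along these lines should do
>> | >>>>>>>> (a sketch; the mount log file
>> | >>>>>>>> name depends on your mount
>> | >>>>>>>> point):
>> | >>>>>>>>
>> | >>>>>>>> grep "self heal is successfully completed" /var/log/glusterfs/<mount-log>.log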
>> | >>>>>>>>
>> | >>>>>>>> Pranith
>> | >>>>>>>>
>> | >>>>>>>>
>> | >>>>>>>> On 08/05/2014 12:09 PM,
>> | >>>>>>>> Roman wrote:
>> | >>>>>>>>> Ok, I understand. I
>> | >>>>>>>>> will try this shortly.
>> | >>>>>>>>> How can I be sure that
>> | >>>>>>>>> the healing process is
>> | >>>>>>>>> done, if I am not able
>> | >>>>>>>>> to see its status?
>> | >>>>>>>>>
>> | >>>>>>>>>
>> | >>>>>>>>> 2014-08-05 9:30 GMT+03:00
>> | >>>>>>>>> Pranith Kumar
>> Karampuri
>> | >>>>>>>>>
>> <pkarampu at redhat.com <mailto:pkarampu at redhat.com>
>> | >>>>>>>>>
>> <mailto:pkarampu at redhat.com <mailto:pkarampu at redhat.com>>>:
>> | >>>>>>>>>
>> | >>>>>>>>> Mounts will do the healing,
>> | >>>>>>>>> not the self-heal-daemon.
>> | >>>>>>>>> The problem, I feel, is
>> | >>>>>>>>> making sure that whichever
>> | >>>>>>>>> process does the healing
>> | >>>>>>>>> has the latest information
>> | >>>>>>>>> about the good bricks in
>> | >>>>>>>>> this use case. Since for
>> | >>>>>>>>> the VM use case the mounts
>> | >>>>>>>>> should have the latest
>> | >>>>>>>>> information, we should let
>> | >>>>>>>>> the mounts do the healing.
>> | >>>>>>>>> If the mount accesses the
>> | >>>>>>>>> VM image, either via someone
>> | >>>>>>>>> doing operations inside the
>> | >>>>>>>>> VM or via an explicit stat
>> | >>>>>>>>> on the file, it should do
>> | >>>>>>>>> the healing.
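>> | >>>>>>>>>
>> | >>>>>>>>> An explicit stat from the
>> | >>>>>>>>> client mount should be
>> | >>>>>>>>> enough to kick it off, e.g.
>> | >>>>>>>>> (the path is only an
>> | >>>>>>>>> example; adjust it to your
>> | >>>>>>>>> storage id):
>> | >>>>>>>>>
>> | >>>>>>>>> stat /mnt/pve/<storage>/images/124/vm-124-disk-1.qcow2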
>> | >>>>>>>>>
>> | >>>>>>>>> Pranith.
>> | >>>>>>>>>
>> | >>>>>>>>>
>> | >>>>>>>>> On
>> 08/05/2014 10:39
>> | >>>>>>>>> AM, Roman wrote:
>> | >>>>>>>>>> Hmmm, you told me to
>> | >>>>>>>>>> turn it off. Did I
>> | >>>>>>>>>> understand something
>> | >>>>>>>>>> wrong? After I issued
>> | >>>>>>>>>> the command you sent
>> | >>>>>>>>>> me, I was not able to
>> | >>>>>>>>>> watch the healing
>> | >>>>>>>>>> process; it said it
>> | >>>>>>>>>> won't be healed,
>> | >>>>>>>>>> because it's turned off.
>> | >>>>>>>>>>
>> | >>>>>>>>>>
>> | >>>>>>>>>> 2014-08-05 5:39
>> | >>>>>>>>>> GMT+03:00 Pranith
>> | >>>>>>>>>> Kumar Karampuri
>> | >>>>>>>>>> <pkarampu at redhat.com <mailto:pkarampu at redhat.com>
>> | >>>>>>>>>> <mailto:pkarampu at redhat.com
>> <mailto:pkarampu at redhat.com>>>:
>> | >>>>>>>>>>
>> | >>>>>>>>>> You didn't
>> | >>>>>>>>>> mention anything
>> | >>>>>>>>>> about
>> | >>>>>>>>>> self-healing. Did
>> | >>>>>>>>>> you wait until
>> | >>>>>>>>>> the self-heal is
>> | >>>>>>>>>> complete?
>> | >>>>>>>>>>
>> | >>>>>>>>>> Pranith
>> | >>>>>>>>>>
>> | >>>>>>>>>> On 08/04/2014
>> | >>>>>>>>>> 05:49 PM, Roman
>> | >>>>>>>>>> wrote:
>> | >>>>>>>>>>> Hi!
>> | >>>>>>>>>>> The result is
>> | >>>>>>>>>>> pretty much the
>> | >>>>>>>>>>> same. I set the
>> | >>>>>>>>>>> switch port down
>> | >>>>>>>>>>> for the 1st server
>> | >>>>>>>>>>> and it was ok. Then
>> | >>>>>>>>>>> I set it back up
>> | >>>>>>>>>>> and set the other
>> | >>>>>>>>>>> server's port off,
>> | >>>>>>>>>>> and it triggered an
>> | >>>>>>>>>>> IO error on two
>> | >>>>>>>>>>> virtual machines:
>> | >>>>>>>>>>> one with a local
>> | >>>>>>>>>>> root FS but
>> | >>>>>>>>>>> network-mounted
>> | >>>>>>>>>>> storage, and the
>> | >>>>>>>>>>> other with a
>> | >>>>>>>>>>> network root FS.
>> | >>>>>>>>>>> The 1st gave an
>> | >>>>>>>>>>> error on copying
>> | >>>>>>>>>>> to or from the
>> | >>>>>>>>>>> mounted network
>> | >>>>>>>>>>> disk; the other
>> | >>>>>>>>>>> just gave me an
>> | >>>>>>>>>>> error even for
>> | >>>>>>>>>>> reading log files.
>> | >>>>>>>>>>>
>> | >>>>>>>>>>> cat:
>> | >>>>>>>>>>> /var/log/alternatives.log:
>> | >>>>>>>>>>> Input/output error
>> | >>>>>>>>>>> then I reset the
>> | >>>>>>>>>>> KVM VM and it
>> | >>>>>>>>>>> told me there was
>> | >>>>>>>>>>> no boot device.
>> | >>>>>>>>>>> Next I virtually
>> | >>>>>>>>>>> powered it off
>> | >>>>>>>>>>> and then back on,
>> | >>>>>>>>>>> and it booted.
>> | >>>>>>>>>>>
>> | >>>>>>>>>>> By the way, did
>> | >>>>>>>>>>> I have to
>> | >>>>>>>>>>> start/stop volume?
>> | >>>>>>>>>>>
>> | >>>>>>>>>>> >> Could you do
>> | >>>>>>>>>>> the following
>> | >>>>>>>>>>> and test it again?
>> | >>>>>>>>>>> >> gluster volume
>> | >>>>>>>>>>> set <volname>
>> | >>>>>>>>>>> cluster.self-heal-daemon
>> | >>>>>>>>>>> off
>> | >>>>>>>>>>>
>> | >>>>>>>>>>> >>Pranith
>> | >>>>>>>>>>>
>> | >>>>>>>>>>>
>> | >>>>>>>>>>>
>> | >>>>>>>>>>>
>> | >>>>>>>>>>> 2014-08-04 14:10
>> | >>>>>>>>>>> GMT+03:00
>> | >>>>>>>>>>> Pranith Kumar
>> | >>>>>>>>>>> Karampuri
>> | >>>>>>>>>>> <pkarampu at redhat.com <mailto:pkarampu at redhat.com>
>> | >>>>>>>>>>> <mailto:pkarampu at redhat.com
>> <mailto:pkarampu at redhat.com>>>:
>> | >>>>>>>>>>>
>> | >>>>>>>>>>>
>> | >>>>>>>>>>> On
>> | >>>>>>>>>>> 08/04/2014
>> | >>>>>>>>>>> 03:33 PM,
>> | >>>>>>>>>>> Roman wrote:
>> | >>>>>>>>>>>> Hello!
>> | >>>>>>>>>>>>
>> | >>>>>>>>>>>> Facing the
>> | >>>>>>>>>>>> same
>> | >>>>>>>>>>>> problem as
>> | >>>>>>>>>>>> mentioned
>> | >>>>>>>>>>>> here:
>> | >>>>>>>>>>>>
>> | >>>>>>>>>>>>
>> http://supercolony.gluster.org/pipermail/gluster-users/2014-April/039959.html
>> | >>>>>>>>>>>>
>> | >>>>>>>>>>>> my setup is
>> | >>>>>>>>>>>> up and
>> | >>>>>>>>>>>> running, so
>> | >>>>>>>>>>>> I'm ready
>> | >>>>>>>>>>>> to help you
>> | >>>>>>>>>>>> back with
>> | >>>>>>>>>>>> feedback.
>> | >>>>>>>>>>>>
>> | >>>>>>>>>>>> setup:
>> | >>>>>>>>>>>> proxmox
>> | >>>>>>>>>>>> server as
>> | >>>>>>>>>>>> client
>> | >>>>>>>>>>>> 2 gluster
>> | >>>>>>>>>>>> physical
>> | >>>>>>>>>>>> servers
>> | >>>>>>>>>>>>
>> | >>>>>>>>>>>> server side
>> | >>>>>>>>>>>> and client
>> | >>>>>>>>>>>> side both
>> | >>>>>>>>>>>> running atm
>> | >>>>>>>>>>>> 3.4.4
>> | >>>>>>>>>>>> glusterfs
>> | >>>>>>>>>>>> from
>> | >>>>>>>>>>>> gluster repo.
>> | >>>>>>>>>>>>
>> | >>>>>>>>>>>> the problem is:
>> | >>>>>>>>>>>>
>> | >>>>>>>>>>>> 1. created
>> | >>>>>>>>>>>> replica bricks.
>> | >>>>>>>>>>>> 2. mounted
>> | >>>>>>>>>>>> in proxmox
>> | >>>>>>>>>>>> (tried both
>> | >>>>>>>>>>>> promox
>> | >>>>>>>>>>>> ways: via
>> | >>>>>>>>>>>> GUI and
>> | >>>>>>>>>>>> fstab (with
>> | >>>>>>>>>>>> backup
>> | >>>>>>>>>>>> volume
>> | >>>>>>>>>>>> line), btw
>> | >>>>>>>>>>>> while
>> | >>>>>>>>>>>> mounting
>> | >>>>>>>>>>>> via fstab
>> | >>>>>>>>>>>> I'm unable
>> | >>>>>>>>>>>> to launch a
>> | >>>>>>>>>>>> VM without
>> | >>>>>>>>>>>> cache,
>> | >>>>>>>>>>>> meanwhile
>> | >>>>>>>>>>>> direct-io-mode
>> | >>>>>>>>>>>> is enabled
>> | >>>>>>>>>>>> in fstab line)
>> | >>>>>>>>>>>> 3. installed VM
>> | >>>>>>>>>>>> 4. bring
>> | >>>>>>>>>>>> one volume
>> | >>>>>>>>>>>> down - ok
>> | >>>>>>>>>>>> 5. bringing
>> | >>>>>>>>>>>> up, waiting
>> | >>>>>>>>>>>> for sync is
>> | >>>>>>>>>>>> done.
>> | >>>>>>>>>>>> 6. bring
>> | >>>>>>>>>>>> other
>> | >>>>>>>>>>>> volume down
>> | >>>>>>>>>>>> - getting
>> | >>>>>>>>>>>> IO errors
>> | >>>>>>>>>>>> on VM guest
>> | >>>>>>>>>>>> and not
>> | >>>>>>>>>>>> able to
>> | >>>>>>>>>>>> restore the
>> | >>>>>>>>>>>> VM after I
>> | >>>>>>>>>>>> reset the
>> | >>>>>>>>>>>> VM via
>> | >>>>>>>>>>>> host. It
>> | >>>>>>>>>>>> says (no
>> | >>>>>>>>>>>> bootable
>> | >>>>>>>>>>>> media).
>> | >>>>>>>>>>>> After I
>> | >>>>>>>>>>>> shut it
>> | >>>>>>>>>>>> down
>> | >>>>>>>>>>>> (forced)
>> | >>>>>>>>>>>> and bring
>> | >>>>>>>>>>>> back up, it
>> | >>>>>>>>>>>> boots.
>> | >>>>>>>>>>> Could you do
>> | >>>>>>>>>>> the
>> | >>>>>>>>>>> following
>> | >>>>>>>>>>> and test it
>> | >>>>>>>>>>> again?
>> | >>>>>>>>>>> gluster
>> | >>>>>>>>>>> volume set
>> | >>>>>>>>>>> <volname>
>> | >>>>>>>>>>> cluster.self-heal-daemon
>> | >>>>>>>>>>> off
>> | >>>>>>>>>>>
>> | >>>>>>>>>>> Pranith
>> | >>>>>>>>>>>>
>> | >>>>>>>>>>>> Need help.
>> | >>>>>>>>>>>> Tried
>> | >>>>>>>>>>>> 3.4.3, 3.4.4.
>> | >>>>>>>>>>>> Still
>> | >>>>>>>>>>>> missing
>> | >>>>>>>>>>>> packages for
>> | >>>>>>>>>>>> 3.4.5 for
>> | >>>>>>>>>>>> Debian and
>> | >>>>>>>>>>>> 3.5.2
>> | >>>>>>>>>>>> (3.5.1
>> | >>>>>>>>>>>> always
>> | >>>>>>>>>>>> gives a
>> | >>>>>>>>>>>> healing
>> | >>>>>>>>>>>> error for
>> | >>>>>>>>>>>> some reason)
>> | >>>>>>>>>>>>
>> | >>>>>>>>>>>> --
>> | >>>>>>>>>>>> Best regards,
>> | >>>>>>>>>>>> Roman.
>> | >>>>>>>>>>>>
>> | >>>>>>>>>>>>
>> | >>>>>>>>>>>> _______________________________________________
>> | >>>>>>>>>>>> Gluster-users
>> | >>>>>>>>>>>> mailing list
>> | >>>>>>>>>>>> Gluster-users at gluster.org
>> <mailto:Gluster-users at gluster.org>
>> | >>>>>>>>>>>> <mailto:Gluster-users at gluster.org
>> <mailto:Gluster-users at gluster.org>>
>> | >>>>>>>>>>>>
>> http://supercolony.gluster.org/mailman/listinfo/gluster-users
>> | >>>>>>>>>>>
>> | >>>>>>>>>>>
>> | >>>>>>>>>>>
>> | >>>>>>>>>>>
>> | >>>>>>>>>>> --
>> | >>>>>>>>>>> Best regards,
>> | >>>>>>>>>>> Roman.
>> | >>>>>>>>>>
>> | >>>>>>>>>>
>> | >>>>>>>>>>
>> | >>>>>>>>>>
>> | >>>>>>>>>> --
>> | >>>>>>>>>> Best regards,
>> | >>>>>>>>>> Roman.
>> | >>>>>>>>>
>> | >>>>>>>>>
>> | >>>>>>>>>
>> | >>>>>>>>>
>> | >>>>>>>>> --
>> | >>>>>>>>> Best regards,
>> | >>>>>>>>> Roman.
>> | >>>>>>>>
>> | >>>>>>>>
>> | >>>>>>>>
>> | >>>>>>>>
>> | >>>>>>>> --
>> | >>>>>>>> Best regards,
>> | >>>>>>>> Roman.
>> | >>>>>>>
>> | >>>>>>>
>> | >>>>>>>
>> | >>>>>>>
>> | >>>>>>> --
>> | >>>>>>> Best regards,
>> | >>>>>>> Roman.
>> | >>>>>>
>> | >>>>>>
>> | >>>>>>
>> | >>>>>>
>> | >>>>>> --
>> | >>>>>> Best regards,
>> | >>>>>> Roman.
>> | >>>>>
>> | >>>>>
>> | >>>>>
>> | >>>>>
>> | >>>>> --
>> | >>>>> Best regards,
>> | >>>>> Roman.
>> | >>>>
>> | >>>>
>> | >>>>
>> | >>>>
>> | >>>> --
>> | >>>> Best regards,
>> | >>>> Roman.
>> | >>>
>> | >>>
>> | >>>
>> | >>>
>> | >>> --
>> | >>> Best regards,
>> | >>> Roman.
>> | >>
>> | >>
>> | >>
>> | >>
>> | >> --
>> | >> Best regards,
>> | >> Roman.
>> | >>
>> | >>
>> | >>
>> | >>
>> | >> --
>> | >> Best regards,
>> | >> Roman.
>> | >
>> | >
>> | >
>> | >
>> | > --
>> | > Best regards,
>> | > Roman.
>> |
>> |
>>
>>
>>
>>
>> --
>> Best regards,
>> Roman.
>
>
>
>
> --
> Best regards,
> Roman.