[Gluster-users] libgfapi failover problem on replica bricks
Pranith Kumar Karampuri
pkarampu at redhat.com
Wed Aug 27 07:04:22 UTC 2014
On 08/27/2014 12:24 PM, Roman wrote:
> root at stor1:~# ls -l /usr/sbin/glfsheal
> ls: cannot access /usr/sbin/glfsheal: No such file or directory
> Seems like not.
Humble,
Seems like the binary is still not packaged?
Pranith
>
>
> 2014-08-27 9:50 GMT+03:00 Pranith Kumar Karampuri <pkarampu at redhat.com
> <mailto:pkarampu at redhat.com>>:
>
>
> On 08/27/2014 11:53 AM, Roman wrote:
>> Okay.
>> so here are first results:
>>
>> after I disconnected the first server, I've got this:
>>
>> root at stor2:~# gluster volume heal HA-FAST-PVE1-150G info
>> Volume heal failed
> Can you check if the following binary is present?
> /usr/sbin/glfsheal
>
> Pranith
>>
>>
>> but
>> [2014-08-26 11:45:35.315974] I
>> [afr-self-heal-common.c:2868:afr_log_self_heal_completion_status]
>> 0-HA-FAST-PVE1-150G-replicate-0: foreground data self heal is
>> successfully completed, data self heal from
>> HA-FAST-PVE1-150G-client-1 to sinks HA-FAST-PVE1-150G-client-0,
>> with 16108814336 bytes on HA-FAST-PVE1-150G-client-0, 16108814336
>> bytes on HA-FAST-PVE1-150G-client-1, data - Pending matrix: [ [
>> 0 0 ] [ 348 0 ] ] on <gfid:e3ede9c6-28d6-4755-841a-d8329e42ccc4>
>>
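>> (If I read the pending matrix right, row i is what brick i records about the
>> others, so [ [ 0 0 ] [ 348 0 ] ] means client-1 had 348 pending data operations
>> recorded against client-0 -- i.e. client-0 was the sink, which matches the
>> "from HA-FAST-PVE1-150G-client-1 to sinks ...client-0" part of the log above.
>> Each count would just be the first 32-bit (data) field of the corresponding
>> trusted.afr.* xattr read as decimal, e.g.
>> printf '%d\n' 0x0000015c   # -> 348   (0x15c is only an illustrative value)
>>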
>> Did something go wrong during the upgrade?
>>
>> I've got two VMs on different volumes: one with HD on and the other
>> with HD off.
>> Both survived the outage and both seemed synced.
>>
>> but today I've found kind of a bug with log rotation.
>>
>> logs rotated on both the server and client sides, but entries are still being
>> written to the *.log.1 file :)
>>
>> /var/log/glusterfs/mnt-pve-HA-MED-PVE1-1T.log.1
>> /var/log/glusterfs/glustershd.log.1
>>
>> This behavior appeared after the upgrade.
>>
>> The logrotate.d conf files include the HUP for the gluster pids.
>>
>> client:
>> /var/log/glusterfs/*.log {
>> daily
>> rotate 7
>> delaycompress
>> compress
>> notifempty
>> missingok
>> postrotate
>> [ ! -f /var/run/glusterd.pid ] || kill -HUP `cat
>> /var/run/glusterd.pid`
>> endscript
>> }
>>
>> but I'm not able to find that pid file on the client side (should it be
>> there?) :(
>>
>> and servers:
>> /var/log/glusterfs/*.log {
>> daily
>> rotate 7
>> delaycompress
>> compress
>> notifempty
>> missingok
>> postrotate
>> [ ! -f /var/run/glusterd.pid ] || kill -HUP `cat
>> /var/run/glusterd.pid`
>> endscript
>> }
>>
>>
>> /var/log/glusterfs/*/*.log {
>> daily
>> rotate 7
>> delaycompress
>> compress
>> notifempty
>> missingok
>> copytruncate
>> postrotate
>> [ ! -f /var/run/glusterd.pid ] || kill -HUP `cat
>> /var/run/glusterd.pid`
>> endscript
>> }
>>
>> I do have /var/run/glusterd.pid on server side.
>>
>> Should I change something? Log rotation seems to be broken.
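>>
>> (Guessing here: since glusterd does not run on the client, /var/run/glusterd.pid
>> never exists there, so the HUP in postrotate never reaches the fuse mount
>> process and it keeps writing to the renamed *.log.1 file. Maybe a copytruncate
>> stanza for the client mount logs, like the brick-log one on the servers, would
>> avoid needing a pid at all? Untested sketch:
>>
>> /var/log/glusterfs/*.log {
>> daily
>> rotate 7
>> delaycompress
>> compress
>> notifempty
>> missingok
>> copytruncate
>> }
>>
>> copytruncate copies the live file and truncates it in place, so no signal is
>> needed, at the cost of possibly losing a few lines written during the copy.)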
>>
>>
>>
>>
>>
>>
>> 2014-08-26 9:29 GMT+03:00 Pranith Kumar Karampuri
>> <pkarampu at redhat.com <mailto:pkarampu at redhat.com>>:
>>
>>
>> On 08/26/2014 11:55 AM, Roman wrote:
>>> Hello all again!
>> I'm back from vacation and I'm pretty happy that 3.5.2 is
>> available for wheezy. Thanks! I've just applied my updates.
>> For 3.5.2, do I still have to set cluster.self-heal-daemon to
>> off?
>> Welcome back :-). If you set it to off, the test case you
>> execute will work (please validate :-) ). But we also need to test
>> it with self-heal-daemon 'on' and fix any bugs if that test
>> case does not work.
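>> To flip it back on for that run, it should just be:
>> gluster volume set <volname> cluster.self-heal-daemon on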
>>
>> Pranith.
>>
>>>
>>>
>>> 2014-08-06 12:49 GMT+03:00 Humble Chirammal
>>> <hchiramm at redhat.com <mailto:hchiramm at redhat.com>>:
>>>
>>>
>>>
>>>
>>> ----- Original Message -----
>>> | From: "Pranith Kumar Karampuri" <pkarampu at redhat.com
>>> <mailto:pkarampu at redhat.com>>
>>> | To: "Roman" <romeo.r at gmail.com <mailto:romeo.r at gmail.com>>
>>> | Cc: gluster-users at gluster.org
>>> <mailto:gluster-users at gluster.org>, "Niels de Vos"
>>> <ndevos at redhat.com <mailto:ndevos at redhat.com>>, "Humble
>>> Chirammal" <hchiramm at redhat.com
>>> <mailto:hchiramm at redhat.com>>
>>> | Sent: Wednesday, August 6, 2014 12:09:57 PM
>>> | Subject: Re: [Gluster-users] libgfapi failover problem
>>> on replica bricks
>>> |
>>> | Roman,
>>> | The file went into split-brain. I think we should do these tests
>>> | with 3.5.2, where monitoring the heals is easier. Let me also come up
>>> | with a document about how to do the testing you are trying to do.
>>> |
>>> | Humble/Niels,
>>> | Do we have debs available for 3.5.2? In 3.5.1 there was a packaging
>>> | issue where /usr/bin/glfsheal was not packaged along with the deb. I
>>> | think that should be fixed now as well?
>>> |
>>> Pranith,
>>>
>>> The 3.5.2 packages for Debian are not available yet. We
>>> are coordinating internally to get them processed.
>>> I will update the list once they are available.
>>>
>>> --Humble
>>> |
>>> | On 08/06/2014 11:52 AM, Roman wrote:
>>> | > good morning,
>>> | >
>>> | > root at stor1:~# getfattr -d -m. -e hex
>>> | > /exports/fast-test/150G/images/127/vm-127-disk-1.qcow2
>>> | > getfattr: Removing leading '/' from absolute path names
>>> | > # file:
>>> exports/fast-test/150G/images/127/vm-127-disk-1.qcow2
>>> | >
>>> trusted.afr.HA-fast-150G-PVE1-client-0=0x000000000000000000000000
>>> | >
>>> trusted.afr.HA-fast-150G-PVE1-client-1=0x000001320000000000000000
>>> | > trusted.gfid=0x23c79523075a4158bea38078da570449
>>> | >
>>> | > getfattr: Removing leading '/' from absolute path names
>>> | > # file:
>>> exports/fast-test/150G/images/127/vm-127-disk-1.qcow2
>>> | >
>>> trusted.afr.HA-fast-150G-PVE1-client-0=0x000000040000000000000000
>>> | >
>>> trusted.afr.HA-fast-150G-PVE1-client-1=0x000000000000000000000000
>>> | > trusted.gfid=0x23c79523075a4158bea38078da570449
>>> | >
>>> | >
>>> | >
>>> | > 2014-08-06 9:20 GMT+03:00 Pranith Kumar Karampuri
>>> <pkarampu at redhat.com <mailto:pkarampu at redhat.com>
>>> | > <mailto:pkarampu at redhat.com
>>> <mailto:pkarampu at redhat.com>>>:
>>> | >
>>> | >
>>> | > On 08/06/2014 11:30 AM, Roman wrote:
>>> | >> Also, this time files are not the same!
>>> | >>
>>> | >> root at stor1:~# md5sum
>>> | >> /exports/fast-test/150G/images/127/vm-127-disk-1.qcow2
>>> | >> 32411360c53116b96a059f17306caeda
>>> | >> /exports/fast-test/150G/images/127/vm-127-disk-1.qcow2
>>> | >>
>>> | >> root at stor2:~# md5sum
>>> | >> /exports/fast-test/150G/images/127/vm-127-disk-1.qcow2
>>> | >> 65b8a6031bcb6f5fb3a11cb1e8b1c9c9
>>> | >> /exports/fast-test/150G/images/127/vm-127-disk-1.qcow2
>>> | > What is the getfattr output?
>>> | >
>>> | > Pranith
>>> | >
>>> | >>
>>> | >>
>>> | >> 2014-08-05 16:33 GMT+03:00 Roman
>>> <romeo.r at gmail.com <mailto:romeo.r at gmail.com>
>>> | >> <mailto:romeo.r at gmail.com
>>> <mailto:romeo.r at gmail.com>>>:
>>> | >>
>>> | >> Nope, it is not working. But this time it went a bit differently.
>>> | >>
>>> | >> root at gluster-client:~# dmesg
>>> | >> Segmentation fault
>>> | >>
>>> | >>
>>> | >> I was not even able to start the VM after I had done the tests:
>>> | >>
>>> | >> Could not read qcow2 header: Operation not
>>> permitted
>>> | >>
>>> | >> And it seems it never starts to sync files after the first
>>> | >> disconnect. The VM survives the first disconnect, but not the second
>>> | >> (I waited around 30 minutes). Also, I've got network.ping-timeout: 2
>>> | >> in the volume settings, but the logs only react to the first
>>> | >> disconnect after around 30 seconds. The second was faster, 2 seconds.
>>> | >>
>>> | >> The reaction was also different:
>>> | >>
>>> | >> slower one:
>>> | >> [2014-08-05 13:26:19.558435] W
>>> [socket.c:514:__socket_rwv]
>>> | >> 0-glusterfs: readv failed (Connection timed
>>> out)
>>> | >> [2014-08-05 13:26:19.558485] W
>>> | >> [socket.c:1962:__socket_proto_state_machine]
>>> 0-glusterfs:
>>> | >> reading from socket failed. Error
>>> (Connection timed out),
>>> | >> peer (10.250.0.1:24007)
>>> | >> [2014-08-05 13:26:21.281426] W
>>> [socket.c:514:__socket_rwv]
>>> | >> 0-HA-fast-150G-PVE1-client-0: readv failed
>>> (Connection timed out)
>>> | >> [2014-08-05 13:26:21.281474] W
>>> | >> [socket.c:1962:__socket_proto_state_machine]
>>> | >> 0-HA-fast-150G-PVE1-client-0: reading from socket
>>> failed.
>>> | >> Error (Connection timed out), peer
>>> | >> (10.250.0.1:49153)
>>> | >> [2014-08-05 13:26:21.281507] I
>>> | >> [client.c:2098:client_rpc_notify]
>>> | >> 0-HA-fast-150G-PVE1-client-0: disconnected
>>> | >>
>>> | >> the fast one:
>>> | >> 2014-08-05 12:52:44.607389] C
>>> | >> [client-handshake.c:127:rpc_client_ping_timer_expired]
>>> | >> 0-HA-fast-150G-PVE1-client-1: server
>>> | >> 10.250.0.2:49153 has not responded
>>> in the last 2
>>> | >> seconds, disconnecting.
>>> | >> [2014-08-05 12:52:44.607491] W
>>> [socket.c:514:__socket_rwv]
>>> | >> 0-HA-fast-150G-PVE1-client-1: readv failed (No
>>> data available)
>>> | >> [2014-08-05 12:52:44.607585] E
>>> | >> [rpc-clnt.c:368:saved_frames_unwind]
>>> | >>
>>> (-->/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_notify+0xf8)
>>> | >> [0x7fcb1b4b0558]
>>> | >>
>>> (-->/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xc3)
>>> | >> [0x7fcb1b4aea63]
>>> | >>
>>> (-->/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(saved_frames_destroy+0xe)
>>> | >> [0x7fcb1b4ae97e]))) 0-HA-fast-150G-PVE1-client-1:
>>> forced
>>> | >> unwinding frame type(GlusterFS 3.3)
>>> op(LOOKUP(27)) called at
>>> | >> 2014-08-05 12:52:42.463881 (xid=0x381883x)
>>> | >> [2014-08-05 12:52:44.607604] W
>>> | >> [client-rpc-fops.c:2624:client3_3_lookup_cbk]
>>> | >> 0-HA-fast-150G-PVE1-client-1: remote operation failed:
>>> | >> Transport endpoint is not connected. Path: /
>>> | >> (00000000-0000-0000-0000-000000000001)
>>> | >> [2014-08-05 12:52:44.607736] E
>>> | >> [rpc-clnt.c:368:saved_frames_unwind]
>>> | >>
>>> (-->/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_notify+0xf8)
>>> | >> [0x7fcb1b4b0558]
>>> | >>
>>> (-->/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xc3)
>>> | >> [0x7fcb1b4aea63]
>>> | >>
>>> (-->/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(saved_frames_destroy+0xe)
>>> | >> [0x7fcb1b4ae97e]))) 0-HA-fast-150G-PVE1-client-1:
>>> forced
>>> | >> unwinding frame type(GlusterFS Handshake)
>>> op(PING(3)) called
>>> | >> at 2014-08-05 12:52:42.463891 (xid=0x381884x)
>>> | >> [2014-08-05 12:52:44.607753] W
>>> | >> [client-handshake.c:276:client_ping_cbk]
>>> | >> 0-HA-fast-150G-PVE1-client-1: timer must have expired
>>> | >> [2014-08-05 12:52:44.607776] I
>>> | >> [client.c:2098:client_rpc_notify]
>>> | >> 0-HA-fast-150G-PVE1-client-1: disconnected
>>> | >>
>>> | >>
>>> | >>
>>> | >> I've got SSD disks (just for info).
>>> | >> Should I go ahead and give 3.5.2 a try?
>>> | >>
>>> | >>
>>> | >>
>>> | >> 2014-08-05 13:06 GMT+03:00 Pranith Kumar
>>> Karampuri
>>> | >> <pkarampu at redhat.com
>>> <mailto:pkarampu at redhat.com> <mailto:pkarampu at redhat.com
>>> <mailto:pkarampu at redhat.com>>>:
>>> | >>
>>> | >> Please reply along with gluster-users :-). Maybe you are
>>> | >> hitting 'reply' instead of 'reply all'?
>>> | >>
>>> | >> Pranith
>>> | >>
>>> | >> On 08/05/2014 03:35 PM, Roman wrote:
>>> | >>> To make sure and keep things clean, I've created another VM with
>>> | >>> raw format and am going to repeat those steps. So now I've got two
>>> | >>> VMs: one with qcow2 format and the other with raw format. I will
>>> | >>> send another e-mail shortly.
>>> | >>>
>>> | >>>
>>> | >>> 2014-08-05 13:01 GMT+03:00 Pranith Kumar Karampuri
>>> | >>> <pkarampu at redhat.com
>>> <mailto:pkarampu at redhat.com> <mailto:pkarampu at redhat.com
>>> <mailto:pkarampu at redhat.com>>>:
>>> | >>>
>>> | >>>
>>> | >>> On 08/05/2014 03:07 PM, Roman wrote:
>>> | >>>> really, seems like the same file
>>> | >>>>
>>> | >>>> stor1:
>>> | >>>> a951641c5230472929836f9fcede6b04
>>> | >>>>
>>> /exports/fast-test/150G/images/127/vm-127-disk-1.qcow2
>>> | >>>>
>>> | >>>> stor2:
>>> | >>>> a951641c5230472929836f9fcede6b04
>>> | >>>>
>>> /exports/fast-test/150G/images/127/vm-127-disk-1.qcow2
>>> | >>>>
>>> | >>>>
>>> | >>>> One thing I've seen from the logs is that somehow Proxmox
>>> | >>>> VE is connecting to the servers with the wrong version?
>>> | >>>> [2014-08-05 09:23:45.218550] I
>>> | >>>>
>>> [client-handshake.c:1659:select_server_supported_programs]
>>> | >>>> 0-HA-fast-150G-PVE1-client-0: Using Program
>>> | >>>> GlusterFS 3.3, Num (1298437), Version (330)
>>> | >>> It is the RPC (on-the-wire data structures)
>>> | >>> version, which has not changed at all since 3.3, so
>>> | >>> that's not a problem. So what is the conclusion? Is
>>> | >>> your test case working now or not?
>>> | >>>
>>> | >>> Pranith
>>> | >>>
>>> | >>>> but if I issue:
>>> | >>>> root at pve1:~# glusterfs -V
>>> | >>>> glusterfs 3.4.4 built on Jun 28 2014 03:44:57
>>> | >>>> seems ok.
>>> | >>>>
>>> | >>>> the servers use 3.4.4 meanwhile:
>>> | >>>> [2014-08-05 09:23:45.117875] I
>>> | >>>> [server-handshake.c:567:server_setvolume]
>>> | >>>> 0-HA-fast-150G-PVE1-server: accepted client from
>>> | >>>>
>>> stor1-9004-2014/08/05-09:23:45:93538-HA-fast-150G-PVE1-client-1-0
>>> | >>>> (version: 3.4.4)
>>> | >>>> [2014-08-05 09:23:49.103035] I
>>> | >>>> [server-handshake.c:567:server_setvolume]
>>> | >>>> 0-HA-fast-150G-PVE1-server: accepted client from
>>> | >>>>
>>> stor1-8998-2014/08/05-09:23:45:89883-HA-fast-150G-PVE1-client-0-0
>>> | >>>> (version: 3.4.4)
>>> | >>>>
>>> | >>>> if this could be the reason, of course.
>>> | >>>> I did restart the Proxmox VE yesterday (just for information).
>>> | >>>>
>>> | >>>>
>>> | >>>>
>>> | >>>>
>>> | >>>>
>>> | >>>> 2014-08-05 12:30 GMT+03:00 Pranith Kumar Karampuri
>>> | >>>> <pkarampu at redhat.com
>>> <mailto:pkarampu at redhat.com> <mailto:pkarampu at redhat.com
>>> <mailto:pkarampu at redhat.com>>>:
>>> | >>>>
>>> | >>>>
>>> | >>>> On 08/05/2014 02:33 PM, Roman wrote:
>>> | >>>>> I've waited long enough for now; still different
>>> | >>>>> sizes and no logs about healing :(
>>> | >>>>>
>>> | >>>>> stor1
>>> | >>>>> # file:
>>> | >>>>>
>>> exports/fast-test/150G/images/127/vm-127-disk-1.qcow2
>>> | >>>>>
>>> trusted.afr.HA-fast-150G-PVE1-client-0=0x000000000000000000000000
>>> | >>>>>
>>> trusted.afr.HA-fast-150G-PVE1-client-1=0x000000000000000000000000
>>> | >>>>> trusted.gfid=0xf10ad81b58484bcd9b385a36a207f921
>>> | >>>>>
>>> | >>>>> root at stor1:~# du -sh
>>> | >>>>> /exports/fast-test/150G/images/127/
>>> | >>>>> 1.2G /exports/fast-test/150G/images/127/
>>> | >>>>>
>>> | >>>>>
>>> | >>>>> stor2
>>> | >>>>> # file:
>>> | >>>>>
>>> exports/fast-test/150G/images/127/vm-127-disk-1.qcow2
>>> | >>>>>
>>> trusted.afr.HA-fast-150G-PVE1-client-0=0x000000000000000000000000
>>> | >>>>>
>>> trusted.afr.HA-fast-150G-PVE1-client-1=0x000000000000000000000000
>>> | >>>>> trusted.gfid=0xf10ad81b58484bcd9b385a36a207f921
>>> | >>>>>
>>> | >>>>>
>>> | >>>>> root at stor2:~# du -sh
>>> | >>>>> /exports/fast-test/150G/images/127/
>>> | >>>>> 1.4G /exports/fast-test/150G/images/127/
>>> | >>>> According to the changelogs, the file doesn't
>>> | >>>> need any healing. Could you stop the
>>> operations
>>> | >>>> on the VMs and take md5sum on both these
>>> machines?
>>> | >>>>
>>> | >>>> Pranith
>>> | >>>>
>>> | >>>>>
>>> | >>>>>
>>> | >>>>>
>>> | >>>>>
>>> | >>>>> 2014-08-05 11:49 GMT+03:00 Pranith Kumar
>>> | >>>>> Karampuri <pkarampu at redhat.com
>>> <mailto:pkarampu at redhat.com>
>>> | >>>>> <mailto:pkarampu at redhat.com
>>> <mailto:pkarampu at redhat.com>>>:
>>> | >>>>>
>>> | >>>>>
>>> | >>>>> On 08/05/2014 02:06 PM, Roman wrote:
>>> | >>>>>> Well, it seems like it doesn't see the changes that were made
>>> | >>>>>> to the volume? I created two files, 200 and 100 MB (from
>>> | >>>>>> /dev/zero), after I disconnected the first brick. Then I
>>> | >>>>>> connected it back and got these logs:
>>> | >>>>>>
>>> | >>>>>> [2014-08-05 08:30:37.830150] I
>>> | >>>>>> [glusterfsd-mgmt.c:1584:mgmt_getspec_cbk]
>>> | >>>>>> 0-glusterfs: No change in
>>> volfile, continuing
>>> | >>>>>> [2014-08-05 08:30:37.830207] I
>>> | >>>>>> [rpc-clnt.c:1676:rpc_clnt_reconfig]
>>> | >>>>>> 0-HA-fast-150G-PVE1-client-0: changing
>>> | >>>>>> port to 49153 (from 0)
>>> | >>>>>> [2014-08-05 08:30:37.830239] W
>>> | >>>>>> [socket.c:514:__socket_rwv]
>>> | >>>>>> 0-HA-fast-150G-PVE1-client-0: readv
>>> | >>>>>> failed (No data available)
>>> | >>>>>> [2014-08-05 08:30:37.831024] I
>>> | >>>>>>
>>> [client-handshake.c:1659:select_server_supported_programs]
>>> | >>>>>> 0-HA-fast-150G-PVE1-client-0: Using
>>> | >>>>>> Program GlusterFS 3.3, Num
>>> (1298437),
>>> | >>>>>> Version (330)
>>> | >>>>>> [2014-08-05 08:30:37.831375] I
>>> | >>>>>> [client-handshake.c:1456:client_setvolume_cbk]
>>> | >>>>>> 0-HA-fast-150G-PVE1-client-0: Connected
>>> | >>>>>> to 10.250.0.1:49153, attached to
>>> attached to
>>> | >>>>>> remote volume
>>> '/exports/fast-test/150G'.
>>> | >>>>>> [2014-08-05 08:30:37.831394] I
>>> | >>>>>> [client-handshake.c:1468:client_setvolume_cbk]
>>> | >>>>>> 0-HA-fast-150G-PVE1-client-0: Server and
>>> | >>>>>> Client lk-version numbers
>>> are not same,
>>> | >>>>>> reopening the fds
>>> | >>>>>> [2014-08-05 08:30:37.831566] I
>>> | >>>>>> [client-handshake.c:450:client_set_lk_version_cbk]
>>> | >>>>>> 0-HA-fast-150G-PVE1-client-0: Server lk
>>> | >>>>>> version = 1
>>> | >>>>>>
>>> | >>>>>>
>>> | >>>>>> [2014-08-05 08:30:37.830150] I
>>> | >>>>>> [glusterfsd-mgmt.c:1584:mgmt_getspec_cbk]
>>> | >>>>>> 0-glusterfs: No change in
>>> volfile, continuing
>>> | >>>>>> this line seems weird to me, tbh.
>>> | >>>>>> I do not see any traffic on the switch interfaces between the
>>> | >>>>>> gluster servers, which means there is no syncing between them.
>>> | >>>>>> I tried to ls -l the files on the client and servers to trigger
>>> | >>>>>> the healing, but it seems to have had no success. Should I wait more?
>>> | >>>>> Yes, it should take around 10-15
>>> minutes.
>>> | >>>>> Could you provide 'getfattr -d
>>> -m. -e hex
>>> | >>>>> <file-on-brick>' on both the bricks.
>>> | >>>>>
>>> | >>>>> Pranith
>>> | >>>>>
>>> | >>>>>>
>>> | >>>>>>
>>> | >>>>>> 2014-08-05 11:25 GMT+03:00
>>> Pranith Kumar
>>> | >>>>>> Karampuri
>>> <pkarampu at redhat.com <mailto:pkarampu at redhat.com>
>>> | >>>>>> <mailto:pkarampu at redhat.com
>>> <mailto:pkarampu at redhat.com>>>:
>>> | >>>>>>
>>> | >>>>>>
>>> | >>>>>> On 08/05/2014 01:10 PM,
>>> Roman wrote:
>>> | >>>>>>> Aha! For some reason I was not able to start the VM anymore;
>>> | >>>>>>> Proxmox VE told me that it is not able to read the qcow2 header
>>> | >>>>>>> because permission is denied for some reason. So I just deleted
>>> | >>>>>>> that file and created a new VM. And the next message I got was
>>> | >>>>>>> this:
>>> | >>>>>> Seems like these are the messages from when you took down the
>>> | >>>>>> bricks before self-heal completed. Could you restart the run,
>>> | >>>>>> waiting for self-heals to complete before taking down the next
>>> | >>>>>> brick?
>>> | >>>>>>
>>> | >>>>>> Pranith
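>>> | >>>>>>
>>> | >>>>>> By the way, for the split-brain on vm-124-disk-1.qcow2 you pasted
>>> | >>>>>> below, the usual manual fix is roughly the following (a sketch from
>>> | >>>>>> memory; <bad-brick-path>, <aa>/<bb> and <gfid> are placeholders, and
>>> | >>>>>> do double-check the gfid before deleting anything):
>>> | >>>>>>
>>> | >>>>>> # on the brick whose copy you decide to discard, note the gfid:
>>> | >>>>>> getfattr -d -m. -e hex <bad-brick-path>/images/124/vm-124-disk-1.qcow2
>>> | >>>>>> # remove the file and its gfid hard link under .glusterfs/<aa>/<bb>/<gfid>,
>>> | >>>>>> # where <aa> and <bb> are the first two byte-pairs of the gfid:
>>> | >>>>>> rm <bad-brick-path>/images/124/vm-124-disk-1.qcow2
>>> | >>>>>> rm <bad-brick-path>/.glusterfs/<aa>/<bb>/<gfid>
>>> | >>>>>> # then trigger a fresh heal from the good copy by accessing it on the mount:
>>> | >>>>>> stat <mount-point>/images/124/vm-124-disk-1.qcow2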
>>> | >>>>>>
>>> | >>>>>>>
>>> | >>>>>>>
>>> | >>>>>>> [2014-08-05 07:31:25.663412] E
>>> | >>>>>>>
>>> [afr-self-heal-common.c:197:afr_sh_print_split_brain_log]
>>> | >>>>>>> 0-HA-fast-150G-PVE1-replicate-0:
>>> | >>>>>>> Unable to self-heal contents of
>>> | >>>>>>> '/images/124/vm-124-disk-1.qcow2'
>>> | >>>>>>> (possible split-brain). Please
>>> | >>>>>>> delete the file from all but the
>>> | >>>>>>> preferred subvolume.- Pending
>>> | >>>>>>> matrix: [ [ 0 60 ] [ 11 0 ] ]
>>> | >>>>>>> [2014-08-05 07:31:25.663955] E
>>> | >>>>>>>
>>> [afr-self-heal-common.c:2262:afr_self_heal_completion_cbk]
>>> | >>>>>>> 0-HA-fast-150G-PVE1-replicate-0:
>>> | >>>>>>> background data self-heal failed on
>>> | >>>>>>> /images/124/vm-124-disk-1.qcow2
>>> | >>>>>>>
>>> | >>>>>>>
>>> | >>>>>>>
>>> | >>>>>>> 2014-08-05 10:13 GMT+03:00 Pranith
>>> | >>>>>>> Kumar Karampuri
>>> <pkarampu at redhat.com <mailto:pkarampu at redhat.com>
>>> | >>>>>>>
>>> <mailto:pkarampu at redhat.com <mailto:pkarampu at redhat.com>>>:
>>> | >>>>>>>
>>> | >>>>>>> I just responded to your earlier mail about how the log looks.
>>> | >>>>>>> The log appears in the mount's logfile.
>>> | >>>>>>>
>>> | >>>>>>> Pranith
>>> | >>>>>>>
>>> | >>>>>>> On 08/05/2014 12:41 PM, Roman wrote:
>>> | >>>>>>>> OK, so I've waited enough, I think. There was no traffic on
>>> | >>>>>>>> the switch ports between the servers. I could not find any
>>> | >>>>>>>> suitable log message about a completed self-heal (waited about
>>> | >>>>>>>> 30 minutes). I unplugged the other server's UTP cable this time
>>> | >>>>>>>> and got into the same situation:
>>> | >>>>>>>> root at gluster-test1:~# cat
>>> | >>>>>>>> /var/log/dmesg
>>> | >>>>>>>> -bash: /bin/cat: Input/output error
>>> | >>>>>>>>
>>> | >>>>>>>> brick logs:
>>> | >>>>>>>> [2014-08-05 07:09:03.005474] I
>>> | >>>>>>>> [server.c:762:server_rpc_notify]
>>> | >>>>>>>> 0-HA-fast-150G-PVE1-server:
>>> | >>>>>>>> disconnecting connectionfrom
>>> | >>>>>>>>
>>> pve1-27649-2014/08/04-13:27:54:720789-HA-fast-150G-PVE1-client-0-0
>>> | >>>>>>>> [2014-08-05 07:09:03.005530] I
>>> | >>>>>>>> [server-helpers.c:729:server_connection_put]
>>> | >>>>>>>> 0-HA-fast-150G-PVE1-server:
>>> | >>>>>>>> Shutting down connection
>>> | >>>>>>>>
>>> pve1-27649-2014/08/04-13:27:54:720789-HA-fast-150G-PVE1-client-0-0
>>> | >>>>>>>> [2014-08-05 07:09:03.005560] I
>>> | >>>>>>>> [server-helpers.c:463:do_fd_cleanup]
>>> | >>>>>>>> 0-HA-fast-150G-PVE1-server: fd
>>> | >>>>>>>> cleanup on
>>> | >>>>>>>> /images/124/vm-124-disk-1.qcow2
>>> | >>>>>>>> [2014-08-05 07:09:03.005797] I
>>> | >>>>>>>> [server-helpers.c:617:server_connection_destroy]
>>> | >>>>>>>> 0-HA-fast-150G-PVE1-server:
>>> | >>>>>>>> destroyed connection of
>>> | >>>>>>>>
>>> pve1-27649-2014/08/04-13:27:54:720789-HA-fast-150G-PVE1-client-0-0
>>> | >>>>>>>>
>>> | >>>>>>>>
>>> | >>>>>>>>
>>> | >>>>>>>>
>>> | >>>>>>>>
>>> | >>>>>>>> 2014-08-05 9:53 GMT+03:00
>>> | >>>>>>>> Pranith Kumar Karampuri
>>> | >>>>>>>> <pkarampu at redhat.com
>>> <mailto:pkarampu at redhat.com>
>>> | >>>>>>>> <mailto:pkarampu at redhat.com
>>> <mailto:pkarampu at redhat.com>>>:
>>> | >>>>>>>>
>>> | >>>>>>>> Do you think it is possible for you to do these tests on the
>>> | >>>>>>>> latest version, 3.5.2? 'gluster volume heal <volname> info'
>>> | >>>>>>>> would give you that information in versions > 3.5.1. Otherwise
>>> | >>>>>>>> you will have to check it either from the logs (there will be a
>>> | >>>>>>>> self-heal completed message in the mount logs) or by observing
>>> | >>>>>>>> 'getfattr -d -m. -e hex <image-file-on-bricks>'.
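>>> | >>>>>>>>
>>> | >>>>>>>> When both bricks show all-zero changelog values for the image, e.g.
>>> | >>>>>>>> trusted.afr.<volname>-client-0=0x000000000000000000000000
>>> | >>>>>>>> trusted.afr.<volname>-client-1=0x000000000000000000000000
>>> | >>>>>>>> nothing is pending any more and the self-heal is complete.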
>>> | >>>>>>>>
>>> | >>>>>>>> Pranith
>>> | >>>>>>>>
>>> | >>>>>>>>
>>> | >>>>>>>> On 08/05/2014 12:09 PM,
>>> | >>>>>>>> Roman wrote:
>>> | >>>>>>>>> OK, I understand. I will try this shortly.
>>> | >>>>>>>>> How can I be sure that the healing process is done
>>> | >>>>>>>>> if I am not able to see its status?
>>> | >>>>>>>>>
>>> | >>>>>>>>>
>>> | >>>>>>>>> 2014-08-05 9:30 GMT+03:00
>>> | >>>>>>>>> Pranith Kumar Karampuri
>>> | >>>>>>>>> <pkarampu at redhat.com
>>> <mailto:pkarampu at redhat.com>
>>> | >>>>>>>>> <mailto:pkarampu at redhat.com
>>> <mailto:pkarampu at redhat.com>>>:
>>> | >>>>>>>>>
>>> | >>>>>>>>> Mounts will do the healing, not the self-heal-daemon. The
>>> | >>>>>>>>> point is that whichever process does the healing needs the
>>> | >>>>>>>>> latest information about the good bricks in this use case.
>>> | >>>>>>>>> Since for the VM use case the mounts should have the latest
>>> | >>>>>>>>> information, we should let the mounts do the healing. If the
>>> | >>>>>>>>> mount accesses the VM image, either through someone doing
>>> | >>>>>>>>> operations inside the VM or through an explicit stat on the
>>> | >>>>>>>>> file, it should do the healing.
>>> | >>>>>>>>>
>>> | >>>>>>>>> Pranith.
>>> | >>>>>>>>>
>>> | >>>>>>>>>
>>> | >>>>>>>>> On 08/05/2014 10:39
>>> | >>>>>>>>> AM, Roman wrote:
>>> | >>>>>>>>>> Hmmm, you told me to turn it off. Did I understand something
>>> | >>>>>>>>>> wrong? After I issued the command you sent me, I was not able
>>> | >>>>>>>>>> to watch the healing process; it said it won't be healed,
>>> | >>>>>>>>>> because it's turned off.
>>> | >>>>>>>>>>
>>> | >>>>>>>>>>
>>> | >>>>>>>>>> 2014-08-05 5:39
>>> | >>>>>>>>>> GMT+03:00 Pranith
>>> | >>>>>>>>>> Kumar Karampuri
>>> | >>>>>>>>>> <pkarampu at redhat.com
>>> <mailto:pkarampu at redhat.com>
>>> | >>>>>>>>>> <mailto:pkarampu at redhat.com
>>> <mailto:pkarampu at redhat.com>>>:
>>> | >>>>>>>>>>
>>> | >>>>>>>>>> You didn't mention anything about self-healing. Did you
>>> | >>>>>>>>>> wait until the self-heal was complete?
>>> | >>>>>>>>>>
>>> | >>>>>>>>>> Pranith
>>> | >>>>>>>>>>
>>> | >>>>>>>>>> On 08/04/2014
>>> | >>>>>>>>>> 05:49 PM, Roman
>>> | >>>>>>>>>> wrote:
>>> | >>>>>>>>>>> Hi!
>>> | >>>>>>>>>>> The result is pretty much the same. I set the switch port
>>> | >>>>>>>>>>> down for the 1st server; it was OK. Then I set it back up
>>> | >>>>>>>>>>> and set the other server's port off, and it triggered an IO
>>> | >>>>>>>>>>> error on two virtual machines: one with a local root FS but
>>> | >>>>>>>>>>> network-mounted storage, and the other with a network root
>>> | >>>>>>>>>>> FS. The 1st gave an error on copying to or from the mounted
>>> | >>>>>>>>>>> network disk; the other just gave me an error even for
>>> | >>>>>>>>>>> reading log files.
>>> | >>>>>>>>>>>
>>> | >>>>>>>>>>> cat: /var/log/alternatives.log: Input/output error
>>> | >>>>>>>>>>>
>>> | >>>>>>>>>>> Then I reset the KVM VM and it told me there is no boot
>>> | >>>>>>>>>>> device. Next I virtually powered it off and then back on,
>>> | >>>>>>>>>>> and it booted.
>>> | >>>>>>>>>>>
>>> | >>>>>>>>>>> By the way, did I have to start/stop the volume?
>>> | >>>>>>>>>>>
>>> | >>>>>>>>>>> >> Could you do the following and test it again?
>>> | >>>>>>>>>>> >> gluster volume set <volname> cluster.self-heal-daemon off
>>> | >>>>>>>>>>>
>>> | >>>>>>>>>>> >> Pranith
>>> | >>>>>>>>>>>
>>> | >>>>>>>>>>>
>>> | >>>>>>>>>>>
>>> | >>>>>>>>>>>
>>> | >>>>>>>>>>> 2014-08-04 14:10
>>> | >>>>>>>>>>> GMT+03:00
>>> | >>>>>>>>>>> Pranith Kumar
>>> | >>>>>>>>>>> Karampuri
>>> | >>>>>>>>>>> <pkarampu at redhat.com
>>> <mailto:pkarampu at redhat.com>
>>> | >>>>>>>>>>> <mailto:pkarampu at redhat.com
>>> <mailto:pkarampu at redhat.com>>>:
>>> | >>>>>>>>>>>
>>> | >>>>>>>>>>>
>>> | >>>>>>>>>>> On 08/04/2014 03:33 PM, Roman wrote:
>>> | >>>>>>>>>>>> Hello!
>>> | >>>>>>>>>>>>
>>> | >>>>>>>>>>>> Facing the same problem as mentioned here:
>>> | >>>>>>>>>>>> http://supercolony.gluster.org/pipermail/gluster-users/2014-April/039959.html
>>> | >>>>>>>>>>>>
>>> | >>>>>>>>>>>> My setup is up and running, so I'm ready to help you back
>>> | >>>>>>>>>>>> with feedback.
>>> | >>>>>>>>>>>>
>>> | >>>>>>>>>>>> Setup:
>>> | >>>>>>>>>>>> proxmox server as client
>>> | >>>>>>>>>>>> 2 gluster physical servers
>>> | >>>>>>>>>>>>
>>> | >>>>>>>>>>>> Server side and client side are both running glusterfs 3.4.4
>>> | >>>>>>>>>>>> atm, from the gluster repo.
>>> | >>>>>>>>>>>>
>>> | >>>>>>>>>>>> The problem is:
>>> | >>>>>>>>>>>>
>>> | >>>>>>>>>>>> 1. created replica bricks.
>>> | >>>>>>>>>>>> 2. mounted in proxmox (tried both proxmox ways: via GUI and
>>> | >>>>>>>>>>>> fstab (with a backup volume line); btw, while mounting via
>>> | >>>>>>>>>>>> fstab I'm unable to launch a VM without cache, even though
>>> | >>>>>>>>>>>> direct-io-mode is enabled in the fstab line)
>>> | >>>>>>>>>>>> 3. installed a VM
>>> | >>>>>>>>>>>> 4. brought one volume down - ok
>>> | >>>>>>>>>>>> 5. brought it back up, waited until the sync was done.
>>> | >>>>>>>>>>>> 6. brought the other volume down - got IO errors on the VM
>>> | >>>>>>>>>>>> guest and was not able to restore the VM after I reset the
>>> | >>>>>>>>>>>> VM via the host. It says (no bootable media). After I shut
>>> | >>>>>>>>>>>> it down (forced) and bring it back up, it boots.
>>> | >>>>>>>>>>> Could you do the following and test it again?
>>> | >>>>>>>>>>> gluster volume set <volname> cluster.self-heal-daemon off
>>> | >>>>>>>>>>>
>>> | >>>>>>>>>>> Pranith
>>> | >>>>>>>>>>>>
>>> | >>>>>>>>>>>> Need help. Tried 3.4.3 and 3.4.4. Packages for 3.4.5 and
>>> | >>>>>>>>>>>> 3.5.2 are still missing for Debian (3.5.1 always gives a
>>> | >>>>>>>>>>>> healing error for some reason).
>>> | >>>>>>>>>>>>
>>> | >>>>>>>>>>>> --
>>> | >>>>>>>>>>>> Best regards,
>>> | >>>>>>>>>>>> Roman.
>>> | >>>>>>>>>>>>
>>> | >>>>>>>>>>>>
>>> | >>>>>>>>>>>>
>>> | >>>>>>>>>>>> _______________________________________________
>>> | >>>>>>>>>>>> Gluster-users mailing list
>>> | >>>>>>>>>>>> Gluster-users at gluster.org
>>> | >>>>>>>>>>>> http://supercolony.gluster.org/mailman/listinfo/gluster-users
>>> | >>>>>>>>>>>
>>> | >>>>>>>>>>>
>>> | >>>>>>>>>>>
>>> | >>>>>>>>>>>
>>> | >>>>>>>>>>> --
>>> | >>>>>>>>>>> Best regards,
>>> | >>>>>>>>>>> Roman.
>>> | >>>>>>>>>>
>>> | >>>>>>>>>>
>>> | >>>>>>>>>>
>>> | >>>>>>>>>>
>>> | >>>>>>>>>> --
>>> | >>>>>>>>>> Best regards,
>>> | >>>>>>>>>> Roman.
>>> | >>>>>>>>>
>>> | >>>>>>>>>
>>> | >>>>>>>>>
>>> | >>>>>>>>>
>>> | >>>>>>>>> --
>>> | >>>>>>>>> Best regards,
>>> | >>>>>>>>> Roman.
>>> | >>>>>>>>
>>> | >>>>>>>>
>>> | >>>>>>>>
>>> | >>>>>>>>
>>> | >>>>>>>> --
>>> | >>>>>>>> Best regards,
>>> | >>>>>>>> Roman.
>>> | >>>>>>>
>>> | >>>>>>>
>>> | >>>>>>>
>>> | >>>>>>>
>>> | >>>>>>> --
>>> | >>>>>>> Best regards,
>>> | >>>>>>> Roman.
>>> | >>>>>>
>>> | >>>>>>
>>> | >>>>>>
>>> | >>>>>>
>>> | >>>>>> --
>>> | >>>>>> Best regards,
>>> | >>>>>> Roman.
>>> | >>>>>
>>> | >>>>>
>>> | >>>>>
>>> | >>>>>
>>> | >>>>> --
>>> | >>>>> Best regards,
>>> | >>>>> Roman.
>>> | >>>>
>>> | >>>>
>>> | >>>>
>>> | >>>>
>>> | >>>> --
>>> | >>>> Best regards,
>>> | >>>> Roman.
>>> | >>>
>>> | >>>
>>> | >>>
>>> | >>>
>>> | >>> --
>>> | >>> Best regards,
>>> | >>> Roman.
>>> | >>
>>> | >>
>>> | >>
>>> | >>
>>> | >> --
>>> | >> Best regards,
>>> | >> Roman.
>>> | >>
>>> | >>
>>> | >>
>>> | >>
>>> | >> --
>>> | >> Best regards,
>>> | >> Roman.
>>> | >
>>> | >
>>> | >
>>> | >
>>> | > --
>>> | > Best regards,
>>> | > Roman.
>>> |
>>> |
>>>
>>>
>>>
>>>
>>> --
>>> Best regards,
>>> Roman.
>>
>>
>>
>>
>> --
>> Best regards,
>> Roman.
>
>
>
>
> --
> Best regards,
> Roman.