[Gluster-users] libgfapi failover problem on replica bricks
Roman
romeo.r at gmail.com
Wed Aug 27 06:23:21 UTC 2014
Okay.
So, here are the first results.
After I disconnected the first server, I got this:
root at stor2:~# gluster volume heal HA-FAST-PVE1-150G info
Volume heal failed
but the log says:
[2014-08-26 11:45:35.315974] I
[afr-self-heal-common.c:2868:afr_log_self_heal_completion_status]
0-HA-FAST-PVE1-150G-replicate-0: foreground data self heal is
successfully completed, data self heal from HA-FAST-PVE1-150G-client-1 to
sinks HA-FAST-PVE1-150G-client-0, with 16108814336 bytes on
HA-FAST-PVE1-150G-client-0, 16108814336 bytes on
HA-FAST-PVE1-150G-client-1, data - Pending matrix: [ [ 0 0 ] [ 348 0 ] ]
on <gfid:e3ede9c6-28d6-4755-841a-d8329e42ccc4>
Did something go wrong during the upgrade?
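(Side note, since "heal info" fails here: a rough, untested sketch of how I could double-check the heal state by hand. It uses the /usr/bin/glfsheal helper mentioned further down in this thread, which as far as I understand is what "gluster volume heal ... info" relies on in 3.5.x, plus the getfattr check Pranith suggested earlier; the brick path is just the example one from the older mails, not necessarily the right one here:

# is the heal helper actually shipped by the deb?
ls -l /usr/bin/glfsheal || echo "glfsheal missing, so 'heal info' cannot work"

# inspect the AFR changelog xattrs directly on a brick file;
# all-zero trusted.afr.* values should mean nothing is pending
getfattr -d -m. -e hex /exports/fast-test/150G/images/127/vm-127-disk-1.qcow2
)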
I've got two VMs on different volumes: one with HD on and the other with HD off.
Both survived the outage and both seemed to be in sync.
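(To verify "seemed synced" properly I suppose I should still do the check Pranith suggested earlier in this thread: quiesce the VM and compare the image on both storage servers, something like

# with the VM stopped or paused, run on each storage server and compare
md5sum /exports/fast-test/150G/images/127/vm-127-disk-1.qcow2

with the path replaced by the actual brick path of the volume in question.)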
But today I found what looks like a bug with log rotation.
The logs were rotated on both the server and client sides, but new entries are still
being written to the *.log.1 files :)
/var/log/glusterfs/mnt-pve-HA-MED-PVE1-1T.log.1
/var/log/glusterfs/glustershd.log.1
This behaviour appeared after the upgrade.
The logrotate.d conf files include the HUP for the gluster pids.
Client:
/var/log/glusterfs/*.log {
        daily
        rotate 7
        delaycompress
        compress
        notifempty
        missingok
        postrotate
                [ ! -f /var/run/glusterd.pid ] || kill -HUP `cat /var/run/glusterd.pid`
        endscript
}
But I'm not able to find that pid file on the client side (should it even be there?) :(
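(My guess, and it is only a guess: a plain client has no glusterd running at all, only the glusterfs FUSE mount process(es), so /var/run/glusterd.pid can never exist there and the postrotate above does nothing. A rough, untested sketch of a postrotate that signals the mount processes themselves instead, assuming they reopen their logs on SIGHUP:

postrotate
        # no glusterd.pid on a plain client; HUP the glusterfs mount
        # process(es) directly so they reopen their log files
        /usr/bin/killall -HUP glusterfs > /dev/null 2>&1 || true
endscript
)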
And on the servers:
/var/log/glusterfs/*.log {
        daily
        rotate 7
        delaycompress
        compress
        notifempty
        missingok
        postrotate
                [ ! -f /var/run/glusterd.pid ] || kill -HUP `cat /var/run/glusterd.pid`
        endscript
}
/var/log/glusterfs/*/*.log {
        daily
        rotate 7
        delaycompress
        compress
        notifempty
        missingok
        copytruncate
        postrotate
                [ ! -f /var/run/glusterd.pid ] || kill -HUP `cat /var/run/glusterd.pid`
        endscript
}
I do have /var/run/glusterd.pid on the server side.
Should I change something? Log rotation seems to be broken.
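(In case it helps: a rough, untested sketch of a stanza that does not depend on glusterd.pid at all and instead HUPs every gluster process by name, which I believe is roughly what the upstream packaging does; the killall lines are an assumption on my side, not something I have verified on these boxes. Plain copytruncate, as in the second server stanza above, would be another way around the missing pid file:

/var/log/glusterfs/*.log {
        daily
        rotate 7
        missingok
        notifempty
        compress
        delaycompress
        sharedscripts
        postrotate
                # assumption: glusterfs/glusterfsd/glusterd reopen their logs on SIGHUP
                /usr/bin/killall -HUP glusterfs > /dev/null 2>&1 || true
                /usr/bin/killall -HUP glusterfsd > /dev/null 2>&1 || true
                /usr/bin/killall -HUP glusterd > /dev/null 2>&1 || true
        endscript
}
)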
2014-08-26 9:29 GMT+03:00 Pranith Kumar Karampuri <pkarampu at redhat.com>:
>
> On 08/26/2014 11:55 AM, Roman wrote:
>
> Hello all again!
> I'm back from vacation and I'm pretty happy that 3.5.2 is available for
> wheezy. Thanks! I've just done my updates.
> For 3.5.2 do I still have to set cluster.self-heal-daemon to off?
>
> Welcome back :-). If you set it to off, the test case you execute will
> work (Validate please :-) ). But we need to test it with self-heal-daemon
> 'on' and fix any bugs if the test case does not work.
>
> Pranith.
>
>
>
> 2014-08-06 12:49 GMT+03:00 Humble Chirammal <hchiramm at redhat.com>:
>
>>
>>
>>
>> ----- Original Message -----
>> | From: "Pranith Kumar Karampuri" <pkarampu at redhat.com>
>> | To: "Roman" <romeo.r at gmail.com>
>> | Cc: gluster-users at gluster.org, "Niels de Vos" <ndevos at redhat.com>,
>> "Humble Chirammal" <hchiramm at redhat.com>
>> | Sent: Wednesday, August 6, 2014 12:09:57 PM
>> | Subject: Re: [Gluster-users] libgfapi failover problem on replica bricks
>> |
>> | Roman,
>> | The file went into split-brain. I think we should do these tests
>> | with 3.5.2, where monitoring the heals is easier. Let me also come up
>> | with a document about how to do the testing you are trying to do.
>> |
>> | Humble/Niels,
>> | Do we have debs available for 3.5.2? In 3.5.1 there was a packaging
>> | issue where /usr/bin/glfsheal was not packaged along with the deb. I
>> | think that should be fixed now as well?
>> |
>> Pranith,
>>
>> The 3.5.2 packages for Debian are not available yet. We are coordinating
>> internally to get them processed.
>> I will update the list once they are available.
>>
>> --Humble
>> |
>> | On 08/06/2014 11:52 AM, Roman wrote:
>> | > good morning,
>> | >
>> | > root at stor1:~# getfattr -d -m. -e hex
>> | > /exports/fast-test/150G/images/127/vm-127-disk-1.qcow2
>> | > getfattr: Removing leading '/' from absolute path names
>> | > # file: exports/fast-test/150G/images/127/vm-127-disk-1.qcow2
>> | > trusted.afr.HA-fast-150G-PVE1-client-0=0x000000000000000000000000
>> | > trusted.afr.HA-fast-150G-PVE1-client-1=0x000001320000000000000000
>> | > trusted.gfid=0x23c79523075a4158bea38078da570449
>> | >
>> | > getfattr: Removing leading '/' from absolute path names
>> | > # file: exports/fast-test/150G/images/127/vm-127-disk-1.qcow2
>> | > trusted.afr.HA-fast-150G-PVE1-client-0=0x000000040000000000000000
>> | > trusted.afr.HA-fast-150G-PVE1-client-1=0x000000000000000000000000
>> | > trusted.gfid=0x23c79523075a4158bea38078da570449
>> | >
>> | >
>> | >
>> | > 2014-08-06 9:20 GMT+03:00 Pranith Kumar Karampuri <
>> pkarampu at redhat.com
>> | > <mailto:pkarampu at redhat.com>>:
>> | >
>> | >
>> | > On 08/06/2014 11:30 AM, Roman wrote:
>> | >> Also, this time files are not the same!
>> | >>
>> | >> root at stor1:~# md5sum
>> | >> /exports/fast-test/150G/images/127/vm-127-disk-1.qcow2
>> | >> 32411360c53116b96a059f17306caeda
>> | >> /exports/fast-test/150G/images/127/vm-127-disk-1.qcow2
>> | >>
>> | >> root at stor2:~# md5sum
>> | >> /exports/fast-test/150G/images/127/vm-127-disk-1.qcow2
>> | >> 65b8a6031bcb6f5fb3a11cb1e8b1c9c9
>> | >> /exports/fast-test/150G/images/127/vm-127-disk-1.qcow2
>> | > What is the getfattr output?
>> | >
>> | > Pranith
>> | >
>> | >>
>> | >>
>> | >> 2014-08-05 16:33 GMT+03:00 Roman <romeo.r at gmail.com
>> | >> <mailto:romeo.r at gmail.com>>:
>> | >>
>> | >> Nope, it is not working. But this time things went a slightly different
>> way
>> | >>
>> | >> root at gluster-client:~# dmesg
>> | >> Segmentation fault
>> | >>
>> | >>
>> | >> I was not even able to start the VM after I had done the tests
>> | >>
>> | >> Could not read qcow2 header: Operation not permitted
>> | >>
>> | >> And it seems it never starts to sync files after the first
>> | >> disconnect. The VM survives the first disconnect, but not the second (I
>> | >> waited around 30 minutes). Also, I've
>> | >> got network.ping-timeout: 2 in the volume settings, but the logs
>> | >> react to the first disconnect in around 30 seconds. The second was
>> | >> faster, 2 seconds.
>> | >>
>> | >> Reaction was different also:
>> | >>
>> | >> slower one:
>> | >> [2014-08-05 13:26:19.558435] W [socket.c:514:__socket_rwv]
>> | >> 0-glusterfs: readv failed (Connection timed out)
>> | >> [2014-08-05 13:26:19.558485] W
>> | >> [socket.c:1962:__socket_proto_state_machine] 0-glusterfs:
>> | >> reading from socket failed. Error (Connection timed out),
>> | >> peer (10.250.0.1:24007 <http://10.250.0.1:24007>)
>> | >> [2014-08-05 13:26:21.281426] W [socket.c:514:__socket_rwv]
>> | >> 0-HA-fast-150G-PVE1-client-0: readv failed (Connection timed
>> out)
>> | >> [2014-08-05 13:26:21.281474] W
>> | >> [socket.c:1962:__socket_proto_state_machine]
>> | >> 0-HA-fast-150G-PVE1-client-0: reading from socket failed.
>> | >> Error (Connection timed out), peer (10.250.0.1:49153
>> | >> <http://10.250.0.1:49153>)
>> | >> [2014-08-05 13:26:21.281507] I
>> | >> [client.c:2098:client_rpc_notify]
>> | >> 0-HA-fast-150G-PVE1-client-0: disconnected
>> | >>
>> | >> the fast one:
>> | >> 2014-08-05 12:52:44.607389] C
>> | >> [client-handshake.c:127:rpc_client_ping_timer_expired]
>> | >> 0-HA-fast-150G-PVE1-client-1: server 10.250.0.2:49153
>> | >> <http://10.250.0.2:49153> has not responded in the last 2
>> | >> seconds, disconnecting.
>> | >> [2014-08-05 12:52:44.607491] W [socket.c:514:__socket_rwv]
>> | >> 0-HA-fast-150G-PVE1-client-1: readv failed (No data
>> available)
>> | >> [2014-08-05 12:52:44.607585] E
>> | >> [rpc-clnt.c:368:saved_frames_unwind]
>> | >>
>> (-->/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_notify+0xf8)
>> | >> [0x7fcb1b4b0558]
>> | >>
>> (-->/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xc3)
>> | >> [0x7fcb1b4aea63]
>> | >>
>> (-->/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(saved_frames_destroy+0xe)
>> | >> [0x7fcb1b4ae97e]))) 0-HA-fast-150G-PVE1-client-1: forced
>> | >> unwinding frame type(GlusterFS 3.3) op(LOOKUP(27)) called at
>> | >> 2014-08-05 12:52:42.463881 (xid=0x381883x)
>> | >> [2014-08-05 12:52:44.607604] W
>> | >> [client-rpc-fops.c:2624:client3_3_lookup_cbk]
>> | >> 0-HA-fast-150G-PVE1-client-1: remote operation failed:
>> | >> Transport endpoint is not connected. Path: /
>> | >> (00000000-0000-0000-0000-000000000001)
>> | >> [2014-08-05 12:52:44.607736] E
>> | >> [rpc-clnt.c:368:saved_frames_unwind]
>> | >>
>> (-->/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_notify+0xf8)
>> | >> [0x7fcb1b4b0558]
>> | >>
>> (-->/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xc3)
>> | >> [0x7fcb1b4aea63]
>> | >>
>> (-->/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(saved_frames_destroy+0xe)
>> | >> [0x7fcb1b4ae97e]))) 0-HA-fast-150G-PVE1-client-1: forced
>> | >> unwinding frame type(GlusterFS Handshake) op(PING(3)) called
>> | >> at 2014-08-05 12:52:42.463891 (xid=0x381884x)
>> | >> [2014-08-05 12:52:44.607753] W
>> | >> [client-handshake.c:276:client_ping_cbk]
>> | >> 0-HA-fast-150G-PVE1-client-1: timer must have expired
>> | >> [2014-08-05 12:52:44.607776] I
>> | >> [client.c:2098:client_rpc_notify]
>> | >> 0-HA-fast-150G-PVE1-client-1: disconnected
>> | >>
>> | >>
>> | >>
>> | >> I've got SSD disks (just for info).
>> | >> Should I go and give a try for 3.5.2?
>> | >>
>> | >>
>> | >>
>> | >> 2014-08-05 13:06 GMT+03:00 Pranith Kumar Karampuri
>> | >> <pkarampu at redhat.com <mailto:pkarampu at redhat.com>>:
>> | >>
>> | >> reply along with gluster-users please :-). May be you are
>> | >> hitting 'reply' instead of 'reply all'?
>> | >>
>> | >> Pranith
>> | >>
>> | >> On 08/05/2014 03:35 PM, Roman wrote:
>> | >>> To make sure and be clean, I've created another VM with raw
>> | >>> format and am going to repeat those steps. So now I've got
>> | >>> two VMs: one with qcow2 format and the other with raw
>> | >>> format. I will send another e-mail shortly.
>> | >>>
>> | >>>
>> | >>> 2014-08-05 13:01 GMT+03:00 Pranith Kumar Karampuri
>> | >>> <pkarampu at redhat.com <mailto:pkarampu at redhat.com>>:
>> | >>>
>> | >>>
>> | >>> On 08/05/2014 03:07 PM, Roman wrote:
>> | >>>> really, seems like the same file
>> | >>>>
>> | >>>> stor1:
>> | >>>> a951641c5230472929836f9fcede6b04
>> | >>>>
>> /exports/fast-test/150G/images/127/vm-127-disk-1.qcow2
>> | >>>>
>> | >>>> stor2:
>> | >>>> a951641c5230472929836f9fcede6b04
>> | >>>>
>> /exports/fast-test/150G/images/127/vm-127-disk-1.qcow2
>> | >>>>
>> | >>>>
>> | >>>> One thing I've seen from the logs: it looks like Proxmox
>> | >>>> VE is connecting to the servers with the wrong version?
>> | >>>> [2014-08-05 09:23:45.218550] I
>> | >>>>
>> [client-handshake.c:1659:select_server_supported_programs]
>> | >>>> 0-HA-fast-150G-PVE1-client-0: Using Program
>> | >>>> GlusterFS 3.3, Num (1298437), Version (330)
>> | >>> It is the rpc (over-the-network data structures)
>> | >>> version, which has not changed at all since 3.3, so
>> | >>> that's not a problem. So what is the conclusion? Is
>> | >>> your test case working now or not?
>> | >>>
>> | >>> Pranith
>> | >>>
>> | >>>> but if I issue:
>> | >>>> root at pve1:~# glusterfs -V
>> | >>>> glusterfs 3.4.4 built on Jun 28 2014 03:44:57
>> | >>>> seems ok.
>> | >>>>
>> | >>>> The servers, meanwhile, use 3.4.4:
>> | >>>> [2014-08-05 09:23:45.117875] I
>> | >>>> [server-handshake.c:567:server_setvolume]
>> | >>>> 0-HA-fast-150G-PVE1-server: accepted client from
>> | >>>>
>> stor1-9004-2014/08/05-09:23:45:93538-HA-fast-150G-PVE1-client-1-0
>> | >>>> (version: 3.4.4)
>> | >>>> [2014-08-05 09:23:49.103035] I
>> | >>>> [server-handshake.c:567:server_setvolume]
>> | >>>> 0-HA-fast-150G-PVE1-server: accepted client from
>> | >>>>
>> stor1-8998-2014/08/05-09:23:45:89883-HA-fast-150G-PVE1-client-0-0
>> | >>>> (version: 3.4.4)
>> | >>>>
>> | >>>> if this could be the reason, of course.
>> | >>>> I did restart the Proxmox VE yesterday (just for
>> | >>>> information)
>> | >>>>
>> | >>>>
>> | >>>>
>> | >>>>
>> | >>>>
>> | >>>> 2014-08-05 12:30 GMT+03:00 Pranith Kumar Karampuri
>> | >>>> <pkarampu at redhat.com <mailto:pkarampu at redhat.com
>> >>:
>> | >>>>
>> | >>>>
>> | >>>> On 08/05/2014 02:33 PM, Roman wrote:
>> | >>>>> Waited long enough for now, still different
>> | >>>>> sizes and no logs about healing :(
>> | >>>>>
>> | >>>>> stor1
>> | >>>>> # file:
>> | >>>>>
>> exports/fast-test/150G/images/127/vm-127-disk-1.qcow2
>> | >>>>>
>> trusted.afr.HA-fast-150G-PVE1-client-0=0x000000000000000000000000
>> | >>>>>
>> trusted.afr.HA-fast-150G-PVE1-client-1=0x000000000000000000000000
>> | >>>>>
>> trusted.gfid=0xf10ad81b58484bcd9b385a36a207f921
>> | >>>>>
>> | >>>>> root at stor1:~# du -sh
>> | >>>>> /exports/fast-test/150G/images/127/
>> | >>>>> 1.2G /exports/fast-test/150G/images/127/
>> | >>>>>
>> | >>>>>
>> | >>>>> stor2
>> | >>>>> # file:
>> | >>>>>
>> exports/fast-test/150G/images/127/vm-127-disk-1.qcow2
>> | >>>>>
>> trusted.afr.HA-fast-150G-PVE1-client-0=0x000000000000000000000000
>> | >>>>>
>> trusted.afr.HA-fast-150G-PVE1-client-1=0x000000000000000000000000
>> | >>>>>
>> trusted.gfid=0xf10ad81b58484bcd9b385a36a207f921
>> | >>>>>
>> | >>>>>
>> | >>>>> root at stor2:~# du -sh
>> | >>>>> /exports/fast-test/150G/images/127/
>> | >>>>> 1.4G /exports/fast-test/150G/images/127/
>> | >>>> According to the changelogs, the file doesn't
>> | >>>> need any healing. Could you stop the operations
>> | >>>> on the VMs and take md5sum on both these
>> machines?
>> | >>>>
>> | >>>> Pranith
>> | >>>>
>> | >>>>>
>> | >>>>>
>> | >>>>>
>> | >>>>>
>> | >>>>> 2014-08-05 11:49 GMT+03:00 Pranith Kumar
>> | >>>>> Karampuri <pkarampu at redhat.com
>> | >>>>> <mailto:pkarampu at redhat.com>>:
>> | >>>>>
>> | >>>>>
>> | >>>>> On 08/05/2014 02:06 PM, Roman wrote:
>> | >>>>>> Well, it seems like it doesn't see that
>> | >>>>>> changes were made to the volume? I
>> | >>>>>> created two files, 200 and 100 MB (from
>> | >>>>>> /dev/zero), after I disconnected the first
>> | >>>>>> brick. Then I connected it back and got
>> | >>>>>> these logs:
>> | >>>>>>
>> | >>>>>> [2014-08-05 08:30:37.830150] I
>> | >>>>>> [glusterfsd-mgmt.c:1584:mgmt_getspec_cbk]
>> | >>>>>> 0-glusterfs: No change in volfile,
>> continuing
>> | >>>>>> [2014-08-05 08:30:37.830207] I
>> | >>>>>> [rpc-clnt.c:1676:rpc_clnt_reconfig]
>> | >>>>>> 0-HA-fast-150G-PVE1-client-0: changing
>> | >>>>>> port to 49153 (from 0)
>> | >>>>>> [2014-08-05 08:30:37.830239] W
>> | >>>>>> [socket.c:514:__socket_rwv]
>> | >>>>>> 0-HA-fast-150G-PVE1-client-0: readv
>> | >>>>>> failed (No data available)
>> | >>>>>> [2014-08-05 08:30:37.831024] I
>> | >>>>>>
>> [client-handshake.c:1659:select_server_supported_programs]
>> | >>>>>> 0-HA-fast-150G-PVE1-client-0: Using
>> | >>>>>> Program GlusterFS 3.3, Num (1298437),
>> | >>>>>> Version (330)
>> | >>>>>> [2014-08-05 08:30:37.831375] I
>> | >>>>>>
>> [client-handshake.c:1456:client_setvolume_cbk]
>> | >>>>>> 0-HA-fast-150G-PVE1-client-0: Connected
>> | >>>>>> to 10.250.0.1:49153
>> | >>>>>> <http://10.250.0.1:49153>, attached to
>> | >>>>>> remote volume
>> '/exports/fast-test/150G'.
>> | >>>>>> [2014-08-05 08:30:37.831394] I
>> | >>>>>>
>> [client-handshake.c:1468:client_setvolume_cbk]
>> | >>>>>> 0-HA-fast-150G-PVE1-client-0: Server and
>> | >>>>>> Client lk-version numbers are not same,
>> | >>>>>> reopening the fds
>> | >>>>>> [2014-08-05 08:30:37.831566] I
>> | >>>>>>
>> [client-handshake.c:450:client_set_lk_version_cbk]
>> | >>>>>> 0-HA-fast-150G-PVE1-client-0: Server lk
>> | >>>>>> version = 1
>> | >>>>>>
>> | >>>>>>
>> | >>>>>> [2014-08-05 08:30:37.830150] I
>> | >>>>>> [glusterfsd-mgmt.c:1584:mgmt_getspec_cbk]
>> | >>>>>> 0-glusterfs: No change in volfile,
>> continuing
>> | >>>>>> this line seems weird to me tbh.
>> | >>>>>> I do not see any traffic on the switch
>> | >>>>>> interfaces between the gluster servers, which
>> | >>>>>> means there is no syncing between them.
>> | >>>>>> I tried to ls -l the files on the client
>> | >>>>>> and servers to trigger the healing, but
>> | >>>>>> it seems to have had no success. Should I wait
>> longer?
>> | >>>>> Yes, it should take around 10-15 minutes.
>> | >>>>> Could you provide 'getfattr -d -m. -e hex
>> | >>>>> <file-on-brick>' on both the bricks.
>> | >>>>>
>> | >>>>> Pranith
>> | >>>>>
>> | >>>>>>
>> | >>>>>>
>> | >>>>>> 2014-08-05 11:25 GMT+03:00 Pranith Kumar
>> | >>>>>> Karampuri <pkarampu at redhat.com
>> | >>>>>> <mailto:pkarampu at redhat.com>>:
>> | >>>>>>
>> | >>>>>>
>> | >>>>>> On 08/05/2014 01:10 PM, Roman wrote:
>> | >>>>>>> Ahha! For some reason I was not able
>> | >>>>>>> to start the VM anymore; Proxmox VE
>> | >>>>>>> told me that it is not able to read
>> | >>>>>>> the qcow2 header because permission
>> | >>>>>>> was denied for some reason. So I just
>> | >>>>>>> deleted that file and created a new
>> | >>>>>>> VM. And the next message I got was
>> | >>>>>>> this:
>> | >>>>>> Seems like these are the messages
>> | >>>>>> where you took down the bricks before
>> | >>>>>> self-heal. Could you restart the run
>> | >>>>>> waiting for self-heals to complete
>> | >>>>>> before taking down the next brick?
>> | >>>>>>
>> | >>>>>> Pranith
>> | >>>>>>
>> | >>>>>>>
>> | >>>>>>>
>> | >>>>>>> [2014-08-05 07:31:25.663412] E
>> | >>>>>>>
>> [afr-self-heal-common.c:197:afr_sh_print_split_brain_log]
>> | >>>>>>> 0-HA-fast-150G-PVE1-replicate-0:
>> | >>>>>>> Unable to self-heal contents of
>> | >>>>>>> '/images/124/vm-124-disk-1.qcow2'
>> | >>>>>>> (possible split-brain). Please
>> | >>>>>>> delete the file from all but the
>> | >>>>>>> preferred subvolume.- Pending
>> | >>>>>>> matrix: [ [ 0 60 ] [ 11 0 ] ]
>> | >>>>>>> [2014-08-05 07:31:25.663955] E
>> | >>>>>>>
>> [afr-self-heal-common.c:2262:afr_self_heal_completion_cbk]
>> | >>>>>>> 0-HA-fast-150G-PVE1-replicate-0:
>> | >>>>>>> background data self-heal failed on
>> | >>>>>>> /images/124/vm-124-disk-1.qcow2
>> | >>>>>>>
>> | >>>>>>>
>> | >>>>>>>
>> | >>>>>>> 2014-08-05 10:13 GMT+03:00 Pranith
>> | >>>>>>> Kumar Karampuri <
>> pkarampu at redhat.com
>> | >>>>>>> <mailto:pkarampu at redhat.com>>:
>> | >>>>>>>
>> | >>>>>>> I just responded to your earlier
>> | >>>>>>> mail about how the log looks.
>> | >>>>>>> The log comes on the mount's
>> logfile
>> | >>>>>>>
>> | >>>>>>> Pranith
>> | >>>>>>>
>> | >>>>>>> On 08/05/2014 12:41 PM, Roman
>> wrote:
>> | >>>>>>>> Ok, so I've waited enough, I
>> | >>>>>>>> think. There was no traffic on the
>> | >>>>>>>> switch ports between the servers.
>> | >>>>>>>> Could not find any suitable log
>> | >>>>>>>> message about completed
>> | >>>>>>>> self-heal (waited about 30
>> | >>>>>>>> minutes). Unplugged the other
>> | >>>>>>>> server's UTP cable this time
>> | >>>>>>>> and got into the same situation:
>> | >>>>>>>> root at gluster-test1:~# cat
>> | >>>>>>>> /var/log/dmesg
>> | >>>>>>>> -bash: /bin/cat: Input/output
>> error
>> | >>>>>>>>
>> | >>>>>>>> brick logs:
>> | >>>>>>>> [2014-08-05 07:09:03.005474] I
>> | >>>>>>>>
>> [server.c:762:server_rpc_notify]
>> | >>>>>>>> 0-HA-fast-150G-PVE1-server:
>> | >>>>>>>> disconnecting connectionfrom
>> | >>>>>>>>
>> pve1-27649-2014/08/04-13:27:54:720789-HA-fast-150G-PVE1-client-0-0
>> | >>>>>>>> [2014-08-05 07:09:03.005530] I
>> | >>>>>>>>
>> [server-helpers.c:729:server_connection_put]
>> | >>>>>>>> 0-HA-fast-150G-PVE1-server:
>> | >>>>>>>> Shutting down connection
>> | >>>>>>>>
>> pve1-27649-2014/08/04-13:27:54:720789-HA-fast-150G-PVE1-client-0-0
>> | >>>>>>>> [2014-08-05 07:09:03.005560] I
>> | >>>>>>>>
>> [server-helpers.c:463:do_fd_cleanup]
>> | >>>>>>>> 0-HA-fast-150G-PVE1-server: fd
>> | >>>>>>>> cleanup on
>> | >>>>>>>> /images/124/vm-124-disk-1.qcow2
>> | >>>>>>>> [2014-08-05 07:09:03.005797] I
>> | >>>>>>>>
>> [server-helpers.c:617:server_connection_destroy]
>> | >>>>>>>> 0-HA-fast-150G-PVE1-server:
>> | >>>>>>>> destroyed connection of
>> | >>>>>>>>
>> pve1-27649-2014/08/04-13:27:54:720789-HA-fast-150G-PVE1-client-0-0
>> | >>>>>>>>
>> | >>>>>>>>
>> | >>>>>>>>
>> | >>>>>>>>
>> | >>>>>>>>
>> | >>>>>>>> 2014-08-05 9:53 GMT+03:00
>> | >>>>>>>> Pranith Kumar Karampuri
>> | >>>>>>>> <pkarampu at redhat.com
>> | >>>>>>>> <mailto:pkarampu at redhat.com
>> >>:
>> | >>>>>>>>
>> | >>>>>>>> Do you think it is possible
>> | >>>>>>>> for you to do these tests
>> | >>>>>>>> on the latest version
>> | >>>>>>>> 3.5.2? 'gluster volume heal
>> | >>>>>>>> <volname> info' would give
>> | >>>>>>>> you that information in
>> | >>>>>>>> versions > 3.5.1.
>> | >>>>>>>> Otherwise you will have to
>> | >>>>>>>> check it from either the
>> | >>>>>>>> logs, there will be
>> | >>>>>>>> self-heal completed message
>> | >>>>>>>> on the mount logs (or) by
>> | >>>>>>>> observing 'getfattr -d -m.
>> | >>>>>>>> -e hex
>> <image-file-on-bricks>'
>> | >>>>>>>>
>> | >>>>>>>> Pranith
>> | >>>>>>>>
>> | >>>>>>>>
>> | >>>>>>>> On 08/05/2014 12:09 PM,
>> | >>>>>>>> Roman wrote:
>> | >>>>>>>>> Ok, I understand. I will
>> | >>>>>>>>> try this shortly.
>> | >>>>>>>>> How can I be sure that the
>> | >>>>>>>>> healing process is done
>> | >>>>>>>>> if I am not able to see
>> | >>>>>>>>> its status?
>> | >>>>>>>>>
>> | >>>>>>>>>
>> | >>>>>>>>> 2014-08-05 9:30 GMT+03:00
>> | >>>>>>>>> Pranith Kumar Karampuri
>> | >>>>>>>>> <pkarampu at redhat.com
>> | >>>>>>>>> <mailto:
>> pkarampu at redhat.com>>:
>> | >>>>>>>>>
>> | >>>>>>>>> Mounts will do the
>> | >>>>>>>>> healing, not the
>> | >>>>>>>>> self-heal-daemon. The
>> | >>>>>>>>> problem I feel is that
>> | >>>>>>>>> whichever process does
>> | >>>>>>>>> the healing has the
>> | >>>>>>>>> latest information
>> | >>>>>>>>> about the good bricks
>> | >>>>>>>>> in this usecase. Since
>> | >>>>>>>>> for VM usecase, mounts
>> | >>>>>>>>> should have the latest
>> | >>>>>>>>> information, we should
>> | >>>>>>>>> let the mounts do the
>> | >>>>>>>>> healing. If the mount
>> | >>>>>>>>> accesses the VM image
>> | >>>>>>>>> either by someone
>> | >>>>>>>>> doing operations
>> | >>>>>>>>> inside the VM or
>> | >>>>>>>>> explicit stat on the
>> | >>>>>>>>> file it should do the
>> | >>>>>>>>> healing.
>> | >>>>>>>>>
>> | >>>>>>>>> Pranith.
>> | >>>>>>>>>
>> | >>>>>>>>>
>> | >>>>>>>>> On 08/05/2014 10:39
>> | >>>>>>>>> AM, Roman wrote:
>> | >>>>>>>>>> Hmmm, you told me to
>> | >>>>>>>>>> turn it off. Did I
>> | >>>>>>>>>> understand something
>> | >>>>>>>>>> wrong? After I issued
>> | >>>>>>>>>> the command you
>> | >>>>>>>>>> sent me, I was not
>> | >>>>>>>>>> able to watch the
>> | >>>>>>>>>> healing process; it
>> | >>>>>>>>>> said it won't be
>> | >>>>>>>>>> healed because it's
>> | >>>>>>>>>> turned off.
>> | >>>>>>>>>>
>> | >>>>>>>>>>
>> | >>>>>>>>>> 2014-08-05 5:39
>> | >>>>>>>>>> GMT+03:00 Pranith
>> | >>>>>>>>>> Kumar Karampuri
>> | >>>>>>>>>> <pkarampu at redhat.com
>> | >>>>>>>>>> <mailto:
>> pkarampu at redhat.com>>:
>> | >>>>>>>>>>
>> | >>>>>>>>>> You didn't
>> | >>>>>>>>>> mention anything
>> | >>>>>>>>>> about
>> | >>>>>>>>>> self-healing. Did
>> | >>>>>>>>>> you wait until
>> | >>>>>>>>>> the self-heal is
>> | >>>>>>>>>> complete?
>> | >>>>>>>>>>
>> | >>>>>>>>>> Pranith
>> | >>>>>>>>>>
>> | >>>>>>>>>> On 08/04/2014
>> | >>>>>>>>>> 05:49 PM, Roman
>> | >>>>>>>>>> wrote:
>> | >>>>>>>>>>> Hi!
>> | >>>>>>>>>>> The result is pretty
>> | >>>>>>>>>>> much the same. I set
>> | >>>>>>>>>>> the switch port down
>> | >>>>>>>>>>> for the 1st server;
>> | >>>>>>>>>>> that was ok. Then I
>> | >>>>>>>>>>> set it back up and
>> | >>>>>>>>>>> set the other
>> | >>>>>>>>>>> server's port off,
>> | >>>>>>>>>>> and it triggered an
>> | >>>>>>>>>>> IO error on two
>> | >>>>>>>>>>> virtual machines:
>> | >>>>>>>>>>> one with a local
>> | >>>>>>>>>>> root FS but
>> | >>>>>>>>>>> network-mounted
>> | >>>>>>>>>>> storage, and the
>> | >>>>>>>>>>> other with a network
>> | >>>>>>>>>>> root FS. The 1st
>> | >>>>>>>>>>> gave an error on
>> | >>>>>>>>>>> copying to or from
>> | >>>>>>>>>>> the mounted network
>> | >>>>>>>>>>> disk; the other just
>> | >>>>>>>>>>> gave me an error
>> | >>>>>>>>>>> even for reading
>> | >>>>>>>>>>> log files.
>> | >>>>>>>>>>>
>> | >>>>>>>>>>> cat:
>> | >>>>>>>>>>>
>> /var/log/alternatives.log:
>> | >>>>>>>>>>> Input/output
>> error
>> | >>>>>>>>>>> then I reset the
>> | >>>>>>>>>>> kvm VM and it
>> | >>>>>>>>>>> said me, there
>> | >>>>>>>>>>> is no boot
>> | >>>>>>>>>>> device. Next I
>> | >>>>>>>>>>> virtually
>> | >>>>>>>>>>> powered it off
>> | >>>>>>>>>>> and then back on
>> | >>>>>>>>>>> and it has
>> booted.
>> | >>>>>>>>>>>
>> | >>>>>>>>>>> By the way, did
>> | >>>>>>>>>>> I have to
>> | >>>>>>>>>>> start/stop
>> volume?
>> | >>>>>>>>>>>
>> | >>>>>>>>>>> >> Could you do
>> | >>>>>>>>>>> the following
>> | >>>>>>>>>>> and test it
>> again?
>> | >>>>>>>>>>> >> gluster
>> volume
>> | >>>>>>>>>>> set <volname>
>> | >>>>>>>>>>>
>> cluster.self-heal-daemon
>> | >>>>>>>>>>> off
>> | >>>>>>>>>>>
>> | >>>>>>>>>>> >>Pranith
>> | >>>>>>>>>>>
>> | >>>>>>>>>>>
>> | >>>>>>>>>>>
>> | >>>>>>>>>>>
>> | >>>>>>>>>>> 2014-08-04 14:10
>> | >>>>>>>>>>> GMT+03:00
>> | >>>>>>>>>>> Pranith Kumar
>> | >>>>>>>>>>> Karampuri
>> | >>>>>>>>>>> <
>> pkarampu at redhat.com
>> | >>>>>>>>>>> <mailto:
>> pkarampu at redhat.com>>:
>> | >>>>>>>>>>>
>> | >>>>>>>>>>>
>> | >>>>>>>>>>> On
>> | >>>>>>>>>>> 08/04/2014
>> | >>>>>>>>>>> 03:33 PM,
>> | >>>>>>>>>>> Roman wrote:
>> | >>>>>>>>>>>> Hello!
>> | >>>>>>>>>>>>
>> | >>>>>>>>>>>> Facing the
>> | >>>>>>>>>>>> same
>> | >>>>>>>>>>>> problem as
>> | >>>>>>>>>>>> mentioned
>> | >>>>>>>>>>>> here:
>> | >>>>>>>>>>>>
>> | >>>>>>>>>>>>
>> http://supercolony.gluster.org/pipermail/gluster-users/2014-April/039959.html
>> | >>>>>>>>>>>>
>> | >>>>>>>>>>>> my set up
>> | >>>>>>>>>>>> is up and
>> | >>>>>>>>>>>> running, so
>> | >>>>>>>>>>>> i'm ready
>> | >>>>>>>>>>>> to help you
>> | >>>>>>>>>>>> back with
>> | >>>>>>>>>>>> feedback.
>> | >>>>>>>>>>>>
>> | >>>>>>>>>>>> setup:
>> | >>>>>>>>>>>> proxmox
>> | >>>>>>>>>>>> server as
>> | >>>>>>>>>>>> client
>> | >>>>>>>>>>>> 2 gluster
>> | >>>>>>>>>>>> physical
>> | >>>>>>>>>>>> servers
>> | >>>>>>>>>>>>
>> | >>>>>>>>>>>> server side
>> | >>>>>>>>>>>> and client
>> | >>>>>>>>>>>> side both
>> | >>>>>>>>>>>> running atm
>> | >>>>>>>>>>>> 3.4.4
>> | >>>>>>>>>>>> glusterfs
>> | >>>>>>>>>>>> from
>> | >>>>>>>>>>>> gluster
>> repo.
>> | >>>>>>>>>>>>
>> | >>>>>>>>>>>> the
>> problem is:
>> | >>>>>>>>>>>>
>> | >>>>>>>>>>>> 1. created
>> | >>>>>>>>>>>> replica
>> bricks.
>> | >>>>>>>>>>>> 2. mounted
>> | >>>>>>>>>>>> in proxmox
>> | >>>>>>>>>>>> (tried both
>> | >>>>>>>>>>>> proxmox
>> | >>>>>>>>>>>> ways: via
>> | >>>>>>>>>>>> GUI and
>> | >>>>>>>>>>>> fstab (with
>> | >>>>>>>>>>>> backup
>> | >>>>>>>>>>>> volume
>> | >>>>>>>>>>>> line), btw
>> | >>>>>>>>>>>> while
>> | >>>>>>>>>>>> mounting
>> | >>>>>>>>>>>> via fstab
>> | >>>>>>>>>>>> I'm unable
>> | >>>>>>>>>>>> to launch a
>> | >>>>>>>>>>>> VM without
>> | >>>>>>>>>>>> cache,
>> | >>>>>>>>>>>> meanwhile
>> | >>>>>>>>>>>>
>> direct-io-mode
>> | >>>>>>>>>>>> is enabled
>> | >>>>>>>>>>>> in fstab
>> line)
>> | >>>>>>>>>>>> 3.
>> installed VM
>> | >>>>>>>>>>>> 4. bring
>> | >>>>>>>>>>>> one volume
>> | >>>>>>>>>>>> down - ok
>> | >>>>>>>>>>>> 5. bringing
>> | >>>>>>>>>>>> up, waiting
>> | >>>>>>>>>>>> for sync is
>> | >>>>>>>>>>>> done.
>> | >>>>>>>>>>>> 6. bring
>> | >>>>>>>>>>>> other
>> | >>>>>>>>>>>> volume down
>> | >>>>>>>>>>>> - getting
>> | >>>>>>>>>>>> IO errors
>> | >>>>>>>>>>>> on VM guest
>> | >>>>>>>>>>>> and not
>> | >>>>>>>>>>>> able to
>> | >>>>>>>>>>>> restore the
>> | >>>>>>>>>>>> VM after I
>> | >>>>>>>>>>>> reset the
>> | >>>>>>>>>>>> VM via
>> | >>>>>>>>>>>> host. It
>> | >>>>>>>>>>>> says (no
>> | >>>>>>>>>>>> bootable
>> | >>>>>>>>>>>> media).
>> | >>>>>>>>>>>> After I
>> | >>>>>>>>>>>> shut it
>> | >>>>>>>>>>>> down
>> | >>>>>>>>>>>> (forced)
>> | >>>>>>>>>>>> and bring
>> | >>>>>>>>>>>> back up, it
>> | >>>>>>>>>>>> boots.
>> | >>>>>>>>>>> Could you do
>> | >>>>>>>>>>> the
>> | >>>>>>>>>>> following
>> | >>>>>>>>>>> and test it
>> | >>>>>>>>>>> again?
>> | >>>>>>>>>>> gluster
>> | >>>>>>>>>>> volume set
>> | >>>>>>>>>>> <volname>
>> | >>>>>>>>>>>
>> cluster.self-heal-daemon
>> | >>>>>>>>>>> off
>> | >>>>>>>>>>>
>> | >>>>>>>>>>> Pranith
>> | >>>>>>>>>>>>
>> | >>>>>>>>>>>> Need help.
>> | >>>>>>>>>>>> Tried
>> | >>>>>>>>>>>> 3.4.3,
>> 3.4.4.
>> | >>>>>>>>>>>> Still
>> | >>>>>>>>>>>> missing
>> | >>>>>>>>>>>> pkg-s for
>> | >>>>>>>>>>>> 3.4.5 for
>> | >>>>>>>>>>>> debian and
>> | >>>>>>>>>>>> 3.5.2
>> | >>>>>>>>>>>> (3.5.1
>> | >>>>>>>>>>>> always
>> | >>>>>>>>>>>> gives a
>> | >>>>>>>>>>>> healing
>> | >>>>>>>>>>>> error for
>> | >>>>>>>>>>>> some
>> reason)
>> | >>>>>>>>>>>>
>> | >>>>>>>>>>>> --
>> | >>>>>>>>>>>> Best
>> regards,
>> | >>>>>>>>>>>> Roman.
>> | >>>>>>>>>>>>
>> | >>>>>>>>>>>>
>> | >>>>>>>>>>>>
>> _______________________________________________
>> | >>>>>>>>>>>>
>> Gluster-users
>> | >>>>>>>>>>>> mailing
>> list
>> | >>>>>>>>>>>>
>> Gluster-users at gluster.org
>> | >>>>>>>>>>>> <mailto:
>> Gluster-users at gluster.org>
>> | >>>>>>>>>>>>
>> http://supercolony.gluster.org/mailman/listinfo/gluster-users
>> | >>>>>>>>>>>
>> | >>>>>>>>>>>
>> | >>>>>>>>>>>
>> | >>>>>>>>>>>
>> | >>>>>>>>>>> --
>> | >>>>>>>>>>> Best regards,
>> | >>>>>>>>>>> Roman.
>> | >>>>>>>>>>
>> | >>>>>>>>>>
>> | >>>>>>>>>>
>> | >>>>>>>>>>
>> | >>>>>>>>>> --
>> | >>>>>>>>>> Best regards,
>> | >>>>>>>>>> Roman.
>> | >>>>>>>>>
>> | >>>>>>>>>
>> | >>>>>>>>>
>> | >>>>>>>>>
>> | >>>>>>>>> --
>> | >>>>>>>>> Best regards,
>> | >>>>>>>>> Roman.
>> | >>>>>>>>
>> | >>>>>>>>
>> | >>>>>>>>
>> | >>>>>>>>
>> | >>>>>>>> --
>> | >>>>>>>> Best regards,
>> | >>>>>>>> Roman.
>> | >>>>>>>
>> | >>>>>>>
>> | >>>>>>>
>> | >>>>>>>
>> | >>>>>>> --
>> | >>>>>>> Best regards,
>> | >>>>>>> Roman.
>> | >>>>>>
>> | >>>>>>
>> | >>>>>>
>> | >>>>>>
>> | >>>>>> --
>> | >>>>>> Best regards,
>> | >>>>>> Roman.
>> | >>>>>
>> | >>>>>
>> | >>>>>
>> | >>>>>
>> | >>>>> --
>> | >>>>> Best regards,
>> | >>>>> Roman.
>> | >>>>
>> | >>>>
>> | >>>>
>> | >>>>
>> | >>>> --
>> | >>>> Best regards,
>> | >>>> Roman.
>> | >>>
>> | >>>
>> | >>>
>> | >>>
>> | >>> --
>> | >>> Best regards,
>> | >>> Roman.
>> | >>
>> | >>
>> | >>
>> | >>
>> | >> --
>> | >> Best regards,
>> | >> Roman.
>> | >>
>> | >>
>> | >>
>> | >>
>> | >> --
>> | >> Best regards,
>> | >> Roman.
>> | >
>> | >
>> | >
>> | >
>> | > --
>> | > Best regards,
>> | > Roman.
>> |
>> |
>>
>
>
>
> --
> Best regards,
> Roman.
>
>
>
--
Best regards,
Roman.