[Gluster-users] libgfapi failover problem on replica bricks
Pranith Kumar Karampuri
pkarampu at redhat.com
Wed Aug 6 07:20:50 UTC 2014
On 08/06/2014 12:27 PM, Roman wrote:
> Yesterday I've reproduced this situation two times.
> The setup:
> 1. Hardware and network
> a. Disks INTEL SSDSC2BB240G4
> b1. Network cards: X540-AT2
> b2. Netgear 10g switch
> 2. Software setup:
> a. OS: Debian wheezy
> b. Glusterfs: 3.4.4 (latest 3.4.4 from gluster repository)
> c. Proxmox VE with updated glusterfs from the gluster repository
> 3. Software Configuration
> a. create a replicated volume with the options cluster.self-heal-daemon: off;
> nfs.disable: off; network.ping-timeout: 2 (a minimal sketch of the
> commands follows this list)
> b. mount it on Proxmox VE (via the Proxmox GUI; it mounts with these
> opts: stor1:HA-fast-150G-PVE1 on /mnt/pve/FAST-TESt type
> fuse.glusterfs
> (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072)
> )
> c. install VM with qcow2 or raw disk image.
> d. disable port / remove network cable from one of storage servers
> e. wait and put cable back
> f. keep waiting for sync (pointless, it won't ever start)
> g. disable another port for second server (or remove cable from
> second server)
> h. profit.
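>
> (For reference, a minimal sketch of the commands behind step 3a, hedged:
> this is not the exact history; the volume name and brick paths are the
> ones that appear later in this thread, and the commands are standard
> gluster CLI.)
>
>     # create the 2-way replicated volume across both storage servers
>     gluster volume create HA-fast-150G-PVE1 replica 2 \
>         stor1:/exports/fast-test/150G stor2:/exports/fast-test/150G
>     # apply the options from step 3a and start the volume
>     gluster volume set HA-fast-150G-PVE1 cluster.self-heal-daemon off
>     gluster volume set HA-fast-150G-PVE1 nfs.disable off
>     gluster volume set HA-fast-150G-PVE1 network.ping-timeout 2
>     gluster volume start HA-fast-150G-PVE1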
>
> Maybe I could use 3.5.2 from the Debian sid (testing) repository to test with?
Sure, you can go ahead. I will also write up a document about maintaining
VMs on gluster from the perspective of replication.
Pranith
>
>
> 2014-08-06 9:39 GMT+03:00 Pranith Kumar Karampuri <pkarampu at redhat.com>:
>
> Roman,
> The file went into split-brain. I think we should do these
> tests with 3.5.2, where monitoring the heals is easier. Let me
> also come up with a document about how to do the testing you are
> trying to do.
>
> Humble/Niels,
> Do we have debs available for 3.5.2? In 3.5.1 there was a
> packaging issue where /usr/bin/glfsheal was not packaged along with
> the deb. I think that should be fixed now as well?
>
> Pranith
>
> On 08/06/2014 11:52 AM, Roman wrote:
>> good morning,
>>
>> root at stor1:~# getfattr -d -m. -e hex
>> /exports/fast-test/150G/images/127/vm-127-disk-1.qcow2
>> getfattr: Removing leading '/' from absolute path names
>> # file: exports/fast-test/150G/images/127/vm-127-disk-1.qcow2
>> trusted.afr.HA-fast-150G-PVE1-client-0=0x000000000000000000000000
>> trusted.afr.HA-fast-150G-PVE1-client-1=0x000001320000000000000000
>> trusted.gfid=0x23c79523075a4158bea38078da570449
>>
>> getfattr: Removing leading '/' from absolute path names
>> # file: exports/fast-test/150G/images/127/vm-127-disk-1.qcow2
>> trusted.afr.HA-fast-150G-PVE1-client-0=0x000000040000000000000000
>> trusted.afr.HA-fast-150G-PVE1-client-1=0x000000000000000000000000
>> trusted.gfid=0x23c79523075a4158bea38078da570449
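>>
>> (A note on reading these values; a sketch assuming the usual AFR
>> changelog layout, where each trusted.afr.<vol>-client-N value packs
>> three 32-bit counters: data, metadata, entry, left to right.)
>>
>>     trusted.afr.<vol>-client-N = 0xDDDDDDDDMMMMMMMMEEEEEEEE
>>                                    data    metadata entry
>>
>>     stor1: client-1 = 0x00000132... -> 306 data ops pending against stor2
>>     stor2: client-0 = 0x00000004... -> 4 data ops pending against stor1
>>
>> Each brick blames the other, so neither copy can be chosen as the heal
>> source: split-brain, as Pranith says above.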
>>
>>
>>
>> 2014-08-06 9:20 GMT+03:00 Pranith Kumar Karampuri <pkarampu at redhat.com>:
>>
>>
>> On 08/06/2014 11:30 AM, Roman wrote:
>>> Also, this time the files are not the same!
>>>
>>> root at stor1:~# md5sum
>>> /exports/fast-test/150G/images/127/vm-127-disk-1.qcow2
>>> 32411360c53116b96a059f17306caeda
>>> /exports/fast-test/150G/images/127/vm-127-disk-1.qcow2
>>>
>>> root at stor2:~# md5sum
>>> /exports/fast-test/150G/images/127/vm-127-disk-1.qcow2
>>> 65b8a6031bcb6f5fb3a11cb1e8b1c9c9
>>> /exports/fast-test/150G/images/127/vm-127-disk-1.qcow2
>> What is the getfattr output?
>>
>> Pranith
>>
>>>
>>>
>>> 2014-08-05 16:33 GMT+03:00 Roman <romeo.r at gmail.com>:
>>>
>>> Nope, it is not working. But this time it went a bit
>>> differently:
>>>
>>> root at gluster-client:~# dmesg
>>> Segmentation fault
>>>
>>>
>>> I was not even able to start the VM after I had done the tests:
>>>
>>> Could not read qcow2 header: Operation not permitted
>>>
>>> And it seems it never starts to sync the files after the first
>>> disconnect. The VM survives the first disconnect, but not the
>>> second (I waited around 30 minutes). Also, I've
>>> got network.ping-timeout: 2 in the volume settings, but the logs
>>> reacted to the first disconnect in around 30 seconds. The second
>>> was faster, 2 seconds.
>>>
>>> The reaction was also different:
>>>
>>> slower one:
>>> [2014-08-05 13:26:19.558435] W
>>> [socket.c:514:__socket_rwv] 0-glusterfs: readv failed
>>> (Connection timed out)
>>> [2014-08-05 13:26:19.558485] W
>>> [socket.c:1962:__socket_proto_state_machine]
>>> 0-glusterfs: reading from socket failed. Error
>>> (Connection timed out), peer (10.250.0.1:24007)
>>> [2014-08-05 13:26:21.281426] W
>>> [socket.c:514:__socket_rwv]
>>> 0-HA-fast-150G-PVE1-client-0: readv failed (Connection
>>> timed out)
>>> [2014-08-05 13:26:21.281474] W
>>> [socket.c:1962:__socket_proto_state_machine]
>>> 0-HA-fast-150G-PVE1-client-0: reading from socket
>>> failed. Error (Connection timed out), peer
>>> (10.250.0.1:49153)
>>> [2014-08-05 13:26:21.281507] I
>>> [client.c:2098:client_rpc_notify]
>>> 0-HA-fast-150G-PVE1-client-0: disconnected
>>>
>>> the fast one:
>>> [2014-08-05 12:52:44.607389] C
>>> [client-handshake.c:127:rpc_client_ping_timer_expired]
>>> 0-HA-fast-150G-PVE1-client-1: server 10.250.0.2:49153
>>> has not responded in the last
>>> 2 seconds, disconnecting.
>>> [2014-08-05 12:52:44.607491] W
>>> [socket.c:514:__socket_rwv]
>>> 0-HA-fast-150G-PVE1-client-1: readv failed (No data
>>> available)
>>> [2014-08-05 12:52:44.607585] E
>>> [rpc-clnt.c:368:saved_frames_unwind]
>>> (-->/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_notify+0xf8)
>>> [0x7fcb1b4b0558]
>>> (-->/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xc3)
>>> [0x7fcb1b4aea63]
>>> (-->/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(saved_frames_destroy+0xe)
>>> [0x7fcb1b4ae97e]))) 0-HA-fast-150G-PVE1-client-1: forced
>>> unwinding frame type(GlusterFS 3.3) op(LOOKUP(27))
>>> called at 2014-08-05 12:52:42.463881 (xid=0x381883x)
>>> [2014-08-05 12:52:44.607604] W
>>> [client-rpc-fops.c:2624:client3_3_lookup_cbk]
>>> 0-HA-fast-150G-PVE1-client-1: remote operation failed:
>>> Transport endpoint is not connected. Path: /
>>> (00000000-0000-0000-0000-000000000001)
>>> [2014-08-05 12:52:44.607736] E
>>> [rpc-clnt.c:368:saved_frames_unwind]
>>> (-->/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_notify+0xf8)
>>> [0x7fcb1b4b0558]
>>> (-->/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xc3)
>>> [0x7fcb1b4aea63]
>>> (-->/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(saved_frames_destroy+0xe)
>>> [0x7fcb1b4ae97e]))) 0-HA-fast-150G-PVE1-client-1: forced
>>> unwinding frame type(GlusterFS Handshake) op(PING(3))
>>> called at 2014-08-05 12:52:42.463891 (xid=0x381884x)
>>> [2014-08-05 12:52:44.607753] W
>>> [client-handshake.c:276:client_ping_cbk]
>>> 0-HA-fast-150G-PVE1-client-1: timer must have expired
>>> [2014-08-05 12:52:44.607776] I
>>> [client.c:2098:client_rpc_notify]
>>> 0-HA-fast-150G-PVE1-client-1: disconnected
>>>
>>>
>>>
>>> I've got SSD disks (just for info).
>>> Should I go and give 3.5.2 a try?
>>>
>>>
>>>
>>> 2014-08-05 13:06 GMT+03:00 Pranith Kumar Karampuri <pkarampu at redhat.com>:
>>>
>>> Please reply along with gluster-users :-). Maybe
>>> you are hitting 'reply' instead of 'reply all'?
>>>
>>> Pranith
>>>
>>> On 08/05/2014 03:35 PM, Roman wrote:
>>>> To make sure and start clean, I've created another VM
>>>> with raw format and am going to repeat those steps. So
>>>> now I've got two VMs, one with qcow2 format and the
>>>> other with raw format. I will send another e-mail
>>>> shortly.
>>>>
>>>>
>>>> 2014-08-05 13:01 GMT+03:00 Pranith Kumar Karampuri <pkarampu at redhat.com>:
>>>>
>>>>
>>>> On 08/05/2014 03:07 PM, Roman wrote:
>>>>> really, seems like the same file
>>>>>
>>>>> stor1:
>>>>> a951641c5230472929836f9fcede6b04
>>>>> /exports/fast-test/150G/images/127/vm-127-disk-1.qcow2
>>>>>
>>>>> stor2:
>>>>> a951641c5230472929836f9fcede6b04
>>>>> /exports/fast-test/150G/images/127/vm-127-disk-1.qcow2
>>>>>
>>>>>
>>>>> One thing I've seen from the logs: somehow Proxmox VE is
>>>>> connecting to the servers with the wrong version?
>>>>> [2014-08-05 09:23:45.218550] I
>>>>> [client-handshake.c:1659:select_server_supported_programs]
>>>>> 0-HA-fast-150G-PVE1-client-0: Using Program
>>>>> GlusterFS 3.3, Num (1298437), Version (330)
>>>> It is the rpc (over-the-network data
>>>> structures) version, which has not changed at
>>>> all since 3.3, so that's not a problem. So what is
>>>> the conclusion? Is your test case working now
>>>> or not?
>>>>
>>>> Pranith
>>>>
>>>>> but if I issue:
>>>>> root at pve1:~# glusterfs -V
>>>>> glusterfs 3.4.4 built on Jun 28 2014 03:44:57
>>>>> seems ok.
>>>>>
>>>>> the servers use 3.4.4 meanwhile:
>>>>> [2014-08-05 09:23:45.117875] I
>>>>> [server-handshake.c:567:server_setvolume]
>>>>> 0-HA-fast-150G-PVE1-server: accepted client
>>>>> from
>>>>> stor1-9004-2014/08/05-09:23:45:93538-HA-fast-150G-PVE1-client-1-0
>>>>> (version: 3.4.4)
>>>>> [2014-08-05 09:23:49.103035] I
>>>>> [server-handshake.c:567:server_setvolume]
>>>>> 0-HA-fast-150G-PVE1-server: accepted client
>>>>> from
>>>>> stor1-8998-2014/08/05-09:23:45:89883-HA-fast-150G-PVE1-client-0-0
>>>>> (version: 3.4.4)
>>>>>
>>>>> if this could be the reason, of course.
>>>>> I did restart the Proxmox VE yesterday (just
>>>>> for information).
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> 2014-08-05 12:30 GMT+03:00 Pranith Kumar Karampuri <pkarampu at redhat.com>:
>>>>>
>>>>>
>>>>> On 08/05/2014 02:33 PM, Roman wrote:
>>>>>> Waited long enough for now, still
>>>>>> different sizes and no logs about healing :(
>>>>>>
>>>>>> stor1
>>>>>> # file:
>>>>>> exports/fast-test/150G/images/127/vm-127-disk-1.qcow2
>>>>>> trusted.afr.HA-fast-150G-PVE1-client-0=0x000000000000000000000000
>>>>>> trusted.afr.HA-fast-150G-PVE1-client-1=0x000000000000000000000000
>>>>>> trusted.gfid=0xf10ad81b58484bcd9b385a36a207f921
>>>>>>
>>>>>> root at stor1:~# du -sh
>>>>>> /exports/fast-test/150G/images/127/
>>>>>> 1.2G /exports/fast-test/150G/images/127/
>>>>>>
>>>>>>
>>>>>> stor2
>>>>>> # file:
>>>>>> exports/fast-test/150G/images/127/vm-127-disk-1.qcow2
>>>>>> trusted.afr.HA-fast-150G-PVE1-client-0=0x000000000000000000000000
>>>>>> trusted.afr.HA-fast-150G-PVE1-client-1=0x000000000000000000000000
>>>>>> trusted.gfid=0xf10ad81b58484bcd9b385a36a207f921
>>>>>>
>>>>>>
>>>>>> root at stor2:~# du -sh
>>>>>> /exports/fast-test/150G/images/127/
>>>>>> 1.4G /exports/fast-test/150G/images/127/
>>>>> According to the changelogs, the file doesn't need any
>>>>> healing. Could you stop the operations on the VMs and take
>>>>> an md5sum on both these machines?
>>>>>
>>>>> Pranith
>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> 2014-08-05 11:49 GMT+03:00 Pranith Kumar Karampuri <pkarampu at redhat.com>:
>>>>>>
>>>>>>
>>>>>> On 08/05/2014 02:06 PM, Roman wrote:
>>>>>>> Well, it seems like it doesn't see the changes
>>>>>>> that were made to the volume? I created two
>>>>>>> files, 200 and 100 MB (from /dev/zero), after I
>>>>>>> disconnected the first brick. Then I
>>>>>>> connected it back and got these logs:
>>>>>>>
>>>>>>> [2014-08-05 08:30:37.830150] I
>>>>>>> [glusterfsd-mgmt.c:1584:mgmt_getspec_cbk]
>>>>>>> 0-glusterfs: No change in volfile,
>>>>>>> continuing
>>>>>>> [2014-08-05 08:30:37.830207] I
>>>>>>> [rpc-clnt.c:1676:rpc_clnt_reconfig]
>>>>>>> 0-HA-fast-150G-PVE1-client-0:
>>>>>>> changing port to 49153 (from 0)
>>>>>>> [2014-08-05 08:30:37.830239] W
>>>>>>> [socket.c:514:__socket_rwv]
>>>>>>> 0-HA-fast-150G-PVE1-client-0: readv
>>>>>>> failed (No data available)
>>>>>>> [2014-08-05 08:30:37.831024] I
>>>>>>> [client-handshake.c:1659:select_server_supported_programs]
>>>>>>> 0-HA-fast-150G-PVE1-client-0: Using
>>>>>>> Program GlusterFS 3.3, Num
>>>>>>> (1298437), Version (330)
>>>>>>> [2014-08-05 08:30:37.831375] I
>>>>>>> [client-handshake.c:1456:client_setvolume_cbk]
>>>>>>> 0-HA-fast-150G-PVE1-client-0:
>>>>>>> Connected to 10.250.0.1:49153, attached
>>>>>>> to remote volume
>>>>>>> '/exports/fast-test/150G'.
>>>>>>> [2014-08-05 08:30:37.831394] I
>>>>>>> [client-handshake.c:1468:client_setvolume_cbk]
>>>>>>> 0-HA-fast-150G-PVE1-client-0: Server
>>>>>>> and Client lk-version numbers are
>>>>>>> not same, reopening the fds
>>>>>>> [2014-08-05 08:30:37.831566] I
>>>>>>> [client-handshake.c:450:client_set_lk_version_cbk]
>>>>>>> 0-HA-fast-150G-PVE1-client-0: Server
>>>>>>> lk version = 1
>>>>>>>
>>>>>>>
>>>>>>> [2014-08-05 08:30:37.830150] I
>>>>>>> [glusterfsd-mgmt.c:1584:mgmt_getspec_cbk]
>>>>>>> 0-glusterfs: No change in volfile,
>>>>>>> continuing
>>>>>>> This line seems weird to me, tbh.
>>>>>>> I do not see any traffic on the switch
>>>>>>> interfaces between the gluster servers,
>>>>>>> which means there is no syncing
>>>>>>> between them.
>>>>>>> I tried ls -l on the files on the
>>>>>>> client and servers to trigger the
>>>>>>> healing, but seemingly with no success.
>>>>>>> Should I wait more?
>>>>>> Yes, it should take around 10-15
>>>>>> minutes. Could you provide 'getfattr
>>>>>> -d -m. -e hex <file-on-brick>' on
>>>>>> both the bricks?
>>>>>>
>>>>>> Pranith
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> 2014-08-05 11:25 GMT+03:00 Pranith Kumar Karampuri <pkarampu at redhat.com>:
>>>>>>>
>>>>>>>
>>>>>>> On 08/05/2014 01:10 PM, Roman wrote:
>>>>>>>> Ahha! For some reason I was not
>>>>>>>> able to start the VM anymore;
>>>>>>>> Proxmox VE told me that it is
>>>>>>>> not able to read the qcow2
>>>>>>>> header because permission was
>>>>>>>> denied for some reason. So I
>>>>>>>> just deleted that file and
>>>>>>>> created a new VM. And the next
>>>>>>>> message I've got was this:
>>>>>>> Seems like these are the
>>>>>>> messages from when you took down the
>>>>>>> bricks before self-heal completed. Could
>>>>>>> you restart the run, waiting for
>>>>>>> self-heals to complete before
>>>>>>> taking down the next brick?
>>>>>>>
>>>>>>> Pranith
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> [2014-08-05 07:31:25.663412] E [afr-self-heal-common.c:197:afr_sh_print_split_brain_log]
>>>>>>>> 0-HA-fast-150G-PVE1-replicate-0: Unable to self-heal contents of
>>>>>>>> '/images/124/vm-124-disk-1.qcow2' (possible split-brain). Please
>>>>>>>> delete the file from all but the preferred subvolume. - Pending
>>>>>>>> matrix: [ [ 0 60 ] [ 11 0 ] ]
>>>>>>>> [2014-08-05 07:31:25.663955] E [afr-self-heal-common.c:2262:afr_self_heal_completion_cbk]
>>>>>>>> 0-HA-fast-150G-PVE1-replicate-0: background data self-heal failed
>>>>>>>> on /images/124/vm-124-disk-1.qcow2
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> 2014-08-05 10:13 GMT+03:00 Pranith Kumar Karampuri <pkarampu at redhat.com>:
>>>>>>>>
>>>>>>>> I just responded to your earlier mail about how the
>>>>>>>> log looks. The log appears in the mount's logfile.
>>>>>>>>
>>>>>>>> Pranith
>>>>>>>>
>>>>>>>> On 08/05/2014 12:41 PM, Roman wrote:
>>>>>>>>> Ok, so I've waited long enough, I think. There was no traffic
>>>>>>>>> on the switch ports between the servers. I could not find any
>>>>>>>>> suitable log message about a completed self-heal (waited about
>>>>>>>>> 30 minutes). I plugged out the other server's UTP cable this
>>>>>>>>> time and got into the same situation:
>>>>>>>>> root at gluster-test1:~# cat /var/log/dmesg
>>>>>>>>> -bash: /bin/cat: Input/output error
>>>>>>>>>
>>>>>>>>> brick logs:
>>>>>>>>> [2014-08-05 07:09:03.005474] I [server.c:762:server_rpc_notify]
>>>>>>>>> 0-HA-fast-150G-PVE1-server: disconnecting connectionfrom
>>>>>>>>> pve1-27649-2014/08/04-13:27:54:720789-HA-fast-150G-PVE1-client-0-0
>>>>>>>>> [2014-08-05 07:09:03.005530] I [server-helpers.c:729:server_connection_put]
>>>>>>>>> 0-HA-fast-150G-PVE1-server: Shutting down connection
>>>>>>>>> pve1-27649-2014/08/04-13:27:54:720789-HA-fast-150G-PVE1-client-0-0
>>>>>>>>> [2014-08-05 07:09:03.005560] I [server-helpers.c:463:do_fd_cleanup]
>>>>>>>>> 0-HA-fast-150G-PVE1-server: fd cleanup on /images/124/vm-124-disk-1.qcow2
>>>>>>>>> [2014-08-05 07:09:03.005797] I [server-helpers.c:617:server_connection_destroy]
>>>>>>>>> 0-HA-fast-150G-PVE1-server: destroyed connection of
>>>>>>>>> pve1-27649-2014/08/04-13:27:54:720789-HA-fast-150G-PVE1-client-0-0
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> 2014-08-05 9:53 GMT+03:00 Pranith Kumar Karampuri <pkarampu at redhat.com>:
>>>>>>>>>
>>>>>>>>> Do you think it is possible for you to do these tests on the
>>>>>>>>> latest version, 3.5.2? 'gluster volume heal <volname> info'
>>>>>>>>> would give you that information in versions > 3.5.1. Otherwise
>>>>>>>>> you will have to check it either from the logs (there will be
>>>>>>>>> a self-heal completed message in the mount logs) or by
>>>>>>>>> observing 'getfattr -d -m. -e hex <image-file-on-bricks>'.
>>>>>>>>>
>>>>>>>>> Pranith
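>>>>>>>>>
>>>>>>>>> (A rough sketch of that getfattr check as a wait loop, to be
>>>>>>>>> run on each brick; the image path is the one from earlier in
>>>>>>>>> this thread, and the test assumes a healthy changelog reads
>>>>>>>>> all-zero. Untested:)
>>>>>>>>>
>>>>>>>>>     F=/exports/fast-test/150G/images/127/vm-127-disk-1.qcow2
>>>>>>>>>     # loop while any trusted.afr changelog on this brick is non-zero
>>>>>>>>>     while getfattr -d -m trusted.afr -e hex "$F" 2>/dev/null \
>>>>>>>>>           | grep 'trusted.afr' \
>>>>>>>>>           | grep -qv '=0x000000000000000000000000$'; do
>>>>>>>>>         sleep 10
>>>>>>>>>     done
>>>>>>>>>     echo "no pending changelog entries left on this brick"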
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 08/05/2014 12:09 PM, Roman wrote:
>>>>>>>>>> Ok, I understand. I will try this shortly. How can I be sure
>>>>>>>>>> that the healing process is done, if I am not able to see
>>>>>>>>>> its status?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> 2014-08-05 9:30 GMT+03:00 Pranith Kumar Karampuri <pkarampu at redhat.com>:
>>>>>>>>>>
>>>>>>>>>> Mounts will do the healing, not the self-heal-daemon. The
>>>>>>>>>> problem, I feel, is that whichever process does the healing
>>>>>>>>>> has the latest information about the good bricks in this
>>>>>>>>>> usecase. Since for the VM usecase the mounts should have the
>>>>>>>>>> latest information, we should let the mounts do the healing.
>>>>>>>>>> If the mount accesses the VM image, either by someone doing
>>>>>>>>>> operations inside the VM or by an explicit stat on the file,
>>>>>>>>>> it should do the healing.
>>>>>>>>>>
>>>>>>>>>> Pranith.
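>>>>>>>>>>
>>>>>>>>>> (So, on 3.4, a simple way to nudge the mount into healing a
>>>>>>>>>> given image is an explicit stat through the client mount
>>>>>>>>>> point; the path below is a guess based on the mount output
>>>>>>>>>> earlier in this thread:)
>>>>>>>>>>
>>>>>>>>>>     stat /mnt/pve/FAST-TESt/images/127/vm-127-disk-1.qcow2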
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 08/05/2014 10:39 AM, Roman wrote:
>>>>>>>>>>> Hmmm, you told me to turn it off. Did I understand something
>>>>>>>>>>> wrong? After I issued the command you sent me, I was not
>>>>>>>>>>> able to watch the healing process; it said it won't be
>>>>>>>>>>> healed, because it's turned off.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> 2014-08-05 5:39 GMT+03:00 Pranith Kumar Karampuri <pkarampu at redhat.com>:
>>>>>>>>>>>
>>>>>>>>>>> You didn't mention anything about self-healing. Did you
>>>>>>>>>>> wait until the self-heal was complete?
>>>>>>>>>>>
>>>>>>>>>>> Pranith
>>>>>>>>>>>
>>>>>>>>>>> On 08/04/2014 05:49 PM, Roman wrote:
>>>>>>>>>>>> Hi!
>>>>>>>>>>>> The result is pretty much the same. I set the switch port
>>>>>>>>>>>> down for the 1st server; it was ok. Then I set it back up
>>>>>>>>>>>> and set the other server's port off, and that triggered an
>>>>>>>>>>>> IO error on two virtual machines: one with a local root FS
>>>>>>>>>>>> but network-mounted storage, and the other with a network
>>>>>>>>>>>> root FS. The 1st gave an error on copying to or from the
>>>>>>>>>>>> mounted network disk; the other just gave me an error even
>>>>>>>>>>>> for reading log files.
>>>>>>>>>>>>
>>>>>>>>>>>> cat: /var/log/alternatives.log: Input/output error
>>>>>>>>>>>>
>>>>>>>>>>>> Then I reset the KVM VM and it told me there is no boot
>>>>>>>>>>>> device. Next I virtually powered it off and then back on,
>>>>>>>>>>>> and it booted.
>>>>>>>>>>>>
>>>>>>>>>>>> By the way, did I have to start/stop the volume?
>>>>>>>>>>>>
>>>>>>>>>>>> >> Could you do the following and test it again?
>>>>>>>>>>>> >> gluster volume set <volname> cluster.self-heal-daemon off
>>>>>>>>>>>>
>>>>>>>>>>>> >> Pranith
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> 2014-08-04 14:10 GMT+03:00 Pranith Kumar Karampuri <pkarampu at redhat.com>:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On 08/04/2014 03:33 PM, Roman wrote:
>>>>>>>>>>>>> Hello!
>>>>>>>>>>>>>
>>>>>>>>>>>>> Facing the same problem as mentioned here:
>>>>>>>>>>>>>
>>>>>>>>>>>>> http://supercolony.gluster.org/pipermail/gluster-users/2014-April/039959.html
>>>>>>>>>>>>>
>>>>>>>>>>>>> My setup is up and running, so I'm ready to help you back
>>>>>>>>>>>>> with feedback.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Setup:
>>>>>>>>>>>>> proxmox server as client
>>>>>>>>>>>>> 2 gluster physical servers
>>>>>>>>>>>>>
>>>>>>>>>>>>> Server side and client side are both running 3.4.4
>>>>>>>>>>>>> glusterfs from the gluster repo atm.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The problem is:
>>>>>>>>>>>>>
>>>>>>>>>>>>> 1. created replica bricks.
>>>>>>>>>>>>> 2. mounted in proxmox (tried both proxmox ways: via the
>>>>>>>>>>>>> GUI and via fstab with a backup volume line; see the
>>>>>>>>>>>>> sketch after this list. Btw, while mounting via fstab I'm
>>>>>>>>>>>>> unable to launch a VM without cache, even though
>>>>>>>>>>>>> direct-io-mode is enabled in the fstab line.)
>>>>>>>>>>>>> 3. installed a VM
>>>>>>>>>>>>> 4. brought one volume down - ok
>>>>>>>>>>>>> 5. brought it up, waited until the sync was done.
>>>>>>>>>>>>> 6. brought the other volume down - got IO errors on the
>>>>>>>>>>>>> VM guest and was not able to restore the VM after I reset
>>>>>>>>>>>>> it via the host. It says (no bootable media). After I
>>>>>>>>>>>>> shut it down (forced) and brought it back up, it boots.
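>>>>>>>>>>>>>
>>>>>>>>>>>>> (The fstab variant I mean, sketched from memory;
>>>>>>>>>>>>> backupvolfile-server and direct-io-mode are standard
>>>>>>>>>>>>> mount.glusterfs options, and the volume/mount names are
>>>>>>>>>>>>> the ones used above:)
>>>>>>>>>>>>>
>>>>>>>>>>>>>     # one line in /etc/fstab; stor2 is tried if stor1 is unreachable at mount time
>>>>>>>>>>>>>     stor1:/HA-fast-150G-PVE1 /mnt/pve/FAST-TESt glusterfs defaults,_netdev,backupvolfile-server=stor2,direct-io-mode=enable 0 0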
>>>>>>>>>>>> Could you do the following and test it again?
>>>>>>>>>>>> gluster volume set <volname> cluster.self-heal-daemon off
>>>>>>>>>>>>
>>>>>>>>>>>> Pranith
>>>>>>>>>>>>>
>>>>>>>>>>>>> Need help. Tried 3.4.3 and 3.4.4. Packages for 3.4.5 and
>>>>>>>>>>>>> 3.5.2 are still missing for Debian (3.5.1 always gives a
>>>>>>>>>>>>> healing error for some reason).
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Best
>>>>>>>>>>>>> regards,
>>>>>>>>>>>>> Roman.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>> Gluster-users mailing list
>>>>>>>>>>>>> Gluster-users at gluster.org
>>>>>>>>>>>>> http://supercolony.gluster.org/mailman/listinfo/gluster-users
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Best regards,
>>>>>>>>>>>> Roman.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Best regards,
>>>>>>>>>>> Roman.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Best regards,
>>>>>>>>>> Roman.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Best regards,
>>>>>>>>> Roman.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Best regards,
>>>>>>>> Roman.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Best regards,
>>>>>>> Roman.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Best regards,
>>>>>> Roman.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Best regards,
>>>>> Roman.
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Best regards,
>>>> Roman.
>>>
>>>
>>>
>>>
>>> --
>>> Best regards,
>>> Roman.
>>>
>>>
>>>
>>>
>>> --
>>> Best regards,
>>> Roman.
>>
>>
>>
>>
>> --
>> Best regards,
>> Roman.
>
>
>
>
> --
> Best regards,
> Roman.