[Gluster-users] self-heal stops some vms (virtual machines)
Nick Majeran
nmajeran at gmail.com
Thu Feb 27 13:13:07 UTC 2014
I've had similar issues adding bricks and running a fix-layout as well.
> On Feb 27, 2014, at 3:56 AM, João Pagaime <joao.pagaime at gmail.com> wrote:
>
> Yes, it is a real problem, serious enough to start thinking hard about architecture scenarios.
>
> Sorry, but I can't share any solutions at this time.
>
> One (complicated) workaround would be to put a VM into a "medically induced coma" when the self-heal starts on it, and revive it afterwards.
> I mean something like this:
> $ virsh suspend <vm-id>
> (do self-heal on vm's disks)
> $ virsh resume <vm-id>
> Problems: several, including for the VM users, but still better than a kernel lock-up. Feasibility problem: how to detect efficiently when the self-heal starts on a specific file on the brick.
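
A rough sketch of that suspend/heal/resume idea, with the detection step done by polling `gluster volume heal <VOLNAME> info`. This is untested against a real cluster and rests on assumptions: the volume name, image path and libvirt domain are placeholders passed in by the caller, and the image is assumed to appear by path in the heal-info listing (entries can also be listed as gfids, in which case the match fails). Polling is also not an efficient way to catch the start of a heal, so it does not really solve the feasibility problem.

```shell
# Sketch only: keep a VM suspended while its disk image is listed as
# needing heal. All arguments are placeholders chosen by the caller.

# Exit status 0 if IMAGE appears in the heal-info listing for VOL.
# Assumption: the image is listed by path, not as a <gfid:...> entry.
image_needs_heal() {
    vol=$1
    image=$2
    gluster volume heal "$vol" info | grep -qF -- "$image"
}

# Suspend DOMAIN, wait until IMAGE is no longer listed, then resume.
# Usage: pause_vm_during_heal VOL IMAGE DOMAIN [POLL_SECONDS]
pause_vm_during_heal() {
    vol=$1
    image=$2
    domain=$3
    interval=${4:-10}

    # Nothing pending for this image: leave the VM alone.
    image_needs_heal "$vol" "$image" || return 0

    virsh suspend "$domain" || return 1
    while image_needs_heal "$vol" "$image"; do
        sleep "$interval"
    done
    virsh resume "$domain"
}
```

Example call (hypothetical names): `pause_vm_during_heal VOL /images/vm1.qcow2 vm1 5`. The VM stays paused for the whole heal of its image, which may be a long time for large disks.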
>
> Another related problem may be how to mitigate I/O starvation on the brick when self-healing kicks in, since that process may be an I/O hog. But I think this is a lesser problem.
>
> best regards
> Joao
>
>
> On 27-02-2014 09:08, Fabio Rosati wrote:
>> Hi All,
>>
>> I ran into exactly the same problem encountered by Joao.
>> After rebooting one of the GlusterFS nodes, self-heal starts and some VMs can't access their disk images anymore.
>>
>> Logs from one of the VMs after one gluster node has rebooted:
>>
>> Feb 25 23:35:47 fwrt2 kernel: EXT4-fs error (device dm-2): __ext4_get_inode_loc: unable to read inode block - inode=2145, block=417
>> Feb 25 23:35:47 fwrt2 kernel: end_request: I/O error, dev vda, sector 15032608
>> Feb 25 23:35:47 fwrt2 kernel: end_request: I/O error, dev vda, sector 15307504
>> Feb 25 23:35:47 fwrt2 kernel: end_request: I/O error, dev vda, sector 15307552
>> Feb 25 23:35:47 fwrt2 kernel: end_request: I/O error, dev vda, sector 15307568
>> Feb 25 23:35:47 fwrt2 kernel: end_request: I/O error, dev vda, sector 15307504
>> Feb 25 23:35:47 fwrt2 kernel: end_request: I/O error, dev vda, sector 12972672
>> Feb 25 23:35:47 fwrt2 kernel: EXT4-fs error (device dm-1): ext4_find_entry: reading directory #123 offset 0
>> Feb 25 23:35:47 fwrt2 kernel: Core dump to |/usr/libexec/abrt-hook-ccpp 7 0 2757 0 23 1393367747 e pipe failed
>> Feb 25 23:35:47 fwrt2 kernel: end_request: I/O error, dev vda, sector 9250632
>> Feb 25 23:35:47 fwrt2 kernel: Read-error on swap-device (253:0:30536)
>> Feb 25 23:35:47 fwrt2 kernel: Read-error on swap-device (253:0:30544)
>> [...]
>>
>>
>> A few hours later the VM appeared to be frozen and I had to kill and restart it; there were no more problems after the reboot.
>>
>> This is the volume layout:
>>
>> Volume Name: gv_pri
>> Type: Distributed-Replicate
>> Volume ID: 3d91b91e-4d72-484f-8655-e5ed8d38bb28
>> Status: Started
>> Number of Bricks: 2 x 2 = 4
>> Transport-type: tcp
>> Bricks:
>> Brick1: nw1glus.gem.local:/glustexp/pri1/brick
>> Brick2: nw2glus.gem.local:/glustexp/pri1/brick
>> Brick3: nw3glus.gem.local:/glustexp/pri2/brick
>> Brick4: nw4glus.gem.local:/glustexp/pri2/brick
>> Options Reconfigured:
>> storage.owner-gid: 107
>> storage.owner-uid: 107
>> server.allow-insecure: on
>> network.remote-dio: on
>> performance.write-behind-window-size: 16MB
>> performance.cache-size: 128MB
>>
>> OS: CentOS 6.5
>> GlusterFS version: 3.4.2
>>
>> The qemu-kvm VMs access their qcow2 disk images using the native Gluster support (no FUSE mount).
>> I didn't find anything unusual in the Gluster logs during the self-heal, but I can post them if needed.
>>
>> Does anyone have an idea of what could cause these problems?
>>
>> Thank you
>> Fabio
>>
>>
>> ----- Original Message -----
>> From: "João Pagaime" <joao.pagaime at gmail.com>
>> To: Gluster-users at gluster.org
>> Sent: Friday, 7 February 2014 13:13:59
>> Subject: [Gluster-users] self-heal stops some vms (virtual machines)
>>
>> Hello all,
>>
>> I have a replicated volume that holds KVM VMs (virtual machines).
>>
>> I had to stop one gluster-server for maintenance. That part of the
>> operation went well: no VM problems after the shutdown.
>>
>> The problems started after booting the gluster-server again. Self-healing
>> started as expected, but some VMs locked up with disk problems
>> (time-outs) as the self-heal passed over them.
>> Some VMs did survive the self-healing, I suppose the ones with low I/O
>> activity or those less sensitive to disk problems.
>>
>> Is there some specific gluster configuration that lets running VMs ride
>> through a self-heal? (cluster.data-self-heal-algorithm is
>> already set to diff.)
>>
>> Are there any tweaks recommended for VMs running on top of gluster?
>>
>> current config:
>>
>> gluster: 3.3.0-1.el6.x86_64
>>
>> --------------------- volume:
>> # gluster volume info VOL
>>
>> Volume Name: VOL
>> Type: Distributed-Replicate
>> Volume ID: f44182d9-24eb-4953-9cdd-71464f9517e0
>> Status: Started
>> Number of Bricks: 2 x 2 = 4
>> Transport-type: tcp
>> Bricks:
>> Brick1: one-gluster01:/san02-v2
>> Brick2: one-gluster02:/san02-v2
>> Brick3: one-gluster01:/san03
>> Brick4: one-gluster02:/san04
>> Options Reconfigured:
>> diagnostics.count-fop-hits: on
>> diagnostics.latency-measurement: on
>> nfs.disable: on
>> auth.allow:x
>> performance.flush-behind: off
>> cluster.self-heal-window-size: 1
>> performance.cache-size: 67108864
>> cluster.data-self-heal-algorithm: diff
>> performance.io-thread-count: 32
>> cluster.min-free-disk: 250GB
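
For anyone wanting to reproduce the two self-heal settings from the listing above, a minimal sketch using the standard `gluster volume set <VOLNAME> <option> <value>` form. The option names and values are the ones quoted in the listing; the function name and the volume argument are placeholders, and nothing here is claimed to fix the lock-ups described in this thread.

```shell
# Sketch only: apply the self-heal options shown in the volume info above.
# Usage: set_heal_options VOL   (VOL is a placeholder volume name)
set_heal_options() {
    vol=$1
    # Heal by transferring only the blocks that differ between replicas.
    gluster volume set "$vol" cluster.data-self-heal-algorithm diff &&
    # Window size value as listed in the configuration above.
    gluster volume set "$vol" cluster.self-heal-window-size 1
}
```

The result can be checked afterwards with `gluster volume info VOL`, which lists the values under "Options Reconfigured".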
>>
>> thanks,
>> best regards,
>> joao
>>
>> _______________________________________________
>> Gluster-users mailing list
>> Gluster-users at gluster.org
>> http://supercolony.gluster.org/mailman/listinfo/gluster-users