[Gluster-users] corruption using gluster and iSCSI with LIO
Дмитрий Глушенок
glush at jet.msk.su
Fri Dec 2 20:43:07 UTC 2016
Hi,
It may be that the LIO service starts before /mnt gets mounted. In the absence of the backend file, LIO created a new one on the root filesystem (in the /mnt directory). The Gluster volume was then mounted on top of it, but because LIO kept the backend file open, it continued to use that file instead of the correct one on the Gluster volume. Then, when you turn off the first node, the active path for the iSCSI disk switches to the second node, which serves the empty file sitting on its root filesystem.
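A quick way to confirm this is to look under the mountpoint with the volume unmounted, and the ordering can be enforced with a systemd drop-in. A minimal sketch, assuming the LIO configuration is restored by target.service (as with targetcli on CentOS 7) and the volume is mounted at /mnt from fstab:

# Anything still visible under the mountpoint after unmounting lives on the
# root filesystem, not on the Gluster volume:
umount /mnt && ls -lh /mnt

# Make the LIO restore wait for the Gluster mount, so the backend file can
# never be (re)created on the root filesystem at boot:
mkdir -p /etc/systemd/system/target.service.d
cat > /etc/systemd/system/target.service.d/wait-for-gluster.conf <<'EOF'
[Unit]
# Pulls in mnt.mount and orders target.service after it
RequiresMountsFor=/mnt
EOF
systemctl daemon-reload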
> On 18 Nov 2016, at 19:21, Olivier Lambert <lambert.olivier at gmail.com> wrote:
>
> After Node 1 is DOWN, LIO on Node 2 (iSCSI target) no longer writes to
> the local Gluster mount, but to the root partition.
>
> Even though "df -h" shows the Gluster brick mounted:
>
> /dev/mapper/centos-root 3,1G 3,1G 20K 100% /
> ...
> /dev/xvdb 61G 61G 956M 99% /bricks/brick1
> localhost:/gv0 61G 61G 956M 99% /mnt
>
> If I unmount it, I still see the "block.img" in /mnt, which is filling
> the root space. So it's like FUSE is messing with the local Gluster
> mount, which could lead to data corruption at the client level.
>
> It doesn't make sense to me... What am I missing?
>
> On Fri, Nov 18, 2016 at 5:00 PM, Olivier Lambert
> <lambert.olivier at gmail.com> wrote:
>> Yes, I only did it when the previous heal info showed "Number of
>> entries: 0". But same result: as soon as the second node goes offline
>> (after both were back online and working), everything is corrupted.
>>
>> To recap:
>>
>> * Node 1 UP Node 2 UP -> OK
>> * Node 1 UP Node 2 DOWN -> OK (just a small lag for multipath to see
>> the path down and change if necessary)
>> * Node 1 UP Node 2 UP -> OK (and waiting to have no entries displayed
>> in heal command)
>> * Node 1 DOWN Node 2 UP -> NOT OK (data corruption)
>>
>> On Fri, Nov 18, 2016 at 3:39 PM, David Gossage
>> <dgossage at carouselchecks.com> wrote:
>>> On Fri, Nov 18, 2016 at 3:49 AM, Olivier Lambert <lambert.olivier at gmail.com>
>>> wrote:
>>>>
>>>> Hi David,
>>>>
>>>> What are the exact commands to be sure it's fine?
>>>>
>>>> Right now I got:
>>>>
>>>> # gluster volume heal gv0 info
>>>> Brick 10.0.0.1:/bricks/brick1/gv0
>>>> Status: Connected
>>>> Number of entries: 0
>>>>
>>>> Brick 10.0.0.2:/bricks/brick1/gv0
>>>> Status: Connected
>>>> Number of entries: 0
>>>>
>>>> Brick 10.0.0.3:/bricks/brick1/gv0
>>>> Status: Connected
>>>> Number of entries: 0
>>>>
>>>>
>>> Did you run this before taking down the 2nd node to see if any heals
>>> were ongoing?
>>>
>>> Also I see you have sharding enabled. Are your files being served sharded
>>> already as well?
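Both points can be checked before pulling a node. A small sketch, assuming the volume name gv0 and the brick path /bricks/brick1/gv0 used in this thread:

# Any entries listed here mean heals are still pending:
gluster volume heal gv0 info
# Per-brick count of files that still need healing:
gluster volume heal gv0 statistics heal-count
# With sharding enabled, shards are stored under the hidden .shard
# directory on each brick (run on a brick node, as root):
ls /bricks/brick1/gv0/.shard | head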
>>>
>>>>
>>>> Everything is online and working, but this command gives a strange output:
>>>>
>>>> # gluster volume heal gv0 info heal-failed
>>>> Gathering list of heal failed entries on volume gv0 has been
>>>> unsuccessful on bricks that are down. Please check if all brick
>>>> processes are running.
>>>>
>>>> Is it normal?
>>>
>>>
>>> I don't think that is a valid command anymore; when I run it I get the
>>> same message, and this is in the logs:
>>> [2016-11-18 14:35:02.260503] I [MSGID: 106533]
>>> [glusterd-volume-ops.c:878:__glusterd_handle_cli_heal_volume] 0-management:
>>> Received heal vol req for volume GLUSTER1
>>> [2016-11-18 14:35:02.263341] W [MSGID: 106530]
>>> [glusterd-volume-ops.c:1882:glusterd_handle_heal_cmd] 0-management: Command
>>> not supported. Please use "gluster volume heal GLUSTER1 info" and logs to
>>> find the heal information.
>>> [2016-11-18 14:35:02.263365] E [MSGID: 106301]
>>> [glusterd-syncop.c:1297:gd_stage_op_phase] 0-management: Staging of
>>> operation 'Volume Heal' failed on localhost : Command not supported. Please
>>> use "gluster volume heal GLUSTER1 info" and logs to find the heal
>>> information.
>>>
>>>>
>>>> On Fri, Nov 18, 2016 at 2:51 AM, David Gossage
>>>> <dgossage at carouselchecks.com> wrote:
>>>>>
>>>>> On Thu, Nov 17, 2016 at 6:42 PM, Olivier Lambert
>>>>> <lambert.olivier at gmail.com>
>>>>> wrote:
>>>>>>
>>>>>> Okay, I used the exact same config you provided, and added an arbiter
>>>>>> node (node3).
>>>>>>
>>>>>> After halting node2, the VM continues to work after a small
>>>>>> "lag"/freeze. I restarted node2 and it came back online: OK
>>>>>>
>>>>>> Then, after waiting a few minutes, I halted node1. And **just** at
>>>>>> that moment, the VM got corrupted (segmentation faults, /var/log
>>>>>> folder empty, etc.)
>>>>>>
>>>>> Other than waiting a few minutes, did you make sure heals had completed?
>>>>>
>>>>>>
>>>>>> dmesg of the VM:
>>>>>>
>>>>>> [ 1645.852905] EXT4-fs error (device xvda1):
>>>>>> htree_dirblock_to_tree:988: inode #19: block 8286: comm bash: bad
>>>>>> entry in directory: rec_len is smaller than minimal - offset=0(0),
>>>>>> inode=0, rec_len=0, name_len=0
>>>>>> [ 1645.854509] Aborting journal on device xvda1-8.
>>>>>> [ 1645.855524] EXT4-fs (xvda1): Remounting filesystem read-only
>>>>>>
>>>>>> And then got a lot of "comm bash: bad entry in directory" messages...
>>>>>>
>>>>>> Here is the current config with all nodes back online:
>>>>>>
>>>>>> # gluster volume info
>>>>>>
>>>>>> Volume Name: gv0
>>>>>> Type: Replicate
>>>>>> Volume ID: 5f15c919-57e3-4648-b20a-395d9fe3d7d6
>>>>>> Status: Started
>>>>>> Snapshot Count: 0
>>>>>> Number of Bricks: 1 x (2 + 1) = 3
>>>>>> Transport-type: tcp
>>>>>> Bricks:
>>>>>> Brick1: 10.0.0.1:/bricks/brick1/gv0
>>>>>> Brick2: 10.0.0.2:/bricks/brick1/gv0
>>>>>> Brick3: 10.0.0.3:/bricks/brick1/gv0 (arbiter)
>>>>>> Options Reconfigured:
>>>>>> nfs.disable: on
>>>>>> performance.readdir-ahead: on
>>>>>> transport.address-family: inet
>>>>>> features.shard: on
>>>>>> features.shard-block-size: 16MB
>>>>>> network.remote-dio: enable
>>>>>> cluster.eager-lock: enable
>>>>>> performance.io-cache: off
>>>>>> performance.read-ahead: off
>>>>>> performance.quick-read: off
>>>>>> performance.stat-prefetch: on
>>>>>> performance.strict-write-ordering: off
>>>>>> cluster.server-quorum-type: server
>>>>>> cluster.quorum-type: auto
>>>>>> cluster.data-self-heal: on
>>>>>>
>>>>>>
>>>>>> # gluster volume status
>>>>>> Status of volume: gv0
>>>>>> Gluster process                            TCP Port  RDMA Port  Online  Pid
>>>>>> ------------------------------------------------------------------------------
>>>>>> Brick 10.0.0.1:/bricks/brick1/gv0          49152     0          Y       1331
>>>>>> Brick 10.0.0.2:/bricks/brick1/gv0          49152     0          Y       2274
>>>>>> Brick 10.0.0.3:/bricks/brick1/gv0          49152     0          Y       2355
>>>>>> Self-heal Daemon on localhost              N/A       N/A        Y       2300
>>>>>> Self-heal Daemon on 10.0.0.3               N/A       N/A        Y       10530
>>>>>> Self-heal Daemon on 10.0.0.2               N/A       N/A        Y       2425
>>>>>>
>>>>>> Task Status of Volume gv0
>>>>>> ------------------------------------------------------------------------------
>>>>>> There are no active volume tasks
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Nov 17, 2016 at 11:35 PM, Olivier Lambert
>>>>>> <lambert.olivier at gmail.com> wrote:
>>>>>>> It's planned to have an arbiter soon :) These were just preliminary
>>>>>>> tests.
>>>>>>>
>>>>>>> Thanks for the settings, I'll test them soon and get back to you!
>>>>>>>
>>>>>>> On Thu, Nov 17, 2016 at 11:29 PM, Lindsay Mathieson
>>>>>>> <lindsay.mathieson at gmail.com> wrote:
>>>>>>>> On 18/11/2016 8:17 AM, Olivier Lambert wrote:
>>>>>>>>>
>>>>>>>>> gluster volume info gv0
>>>>>>>>>
>>>>>>>>> Volume Name: gv0
>>>>>>>>> Type: Replicate
>>>>>>>>> Volume ID: 2f8658ed-0d9d-4a6f-a00b-96e9d3470b53
>>>>>>>>> Status: Started
>>>>>>>>> Snapshot Count: 0
>>>>>>>>> Number of Bricks: 1 x 2 = 2
>>>>>>>>> Transport-type: tcp
>>>>>>>>> Bricks:
>>>>>>>>> Brick1: 10.0.0.1:/bricks/brick1/gv0
>>>>>>>>> Brick2: 10.0.0.2:/bricks/brick1/gv0
>>>>>>>>> Options Reconfigured:
>>>>>>>>> nfs.disable: on
>>>>>>>>> performance.readdir-ahead: on
>>>>>>>>> transport.address-family: inet
>>>>>>>>> features.shard: on
>>>>>>>>> features.shard-block-size: 16MB
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> When hosting VMs, it's essential to set these options:
>>>>>>>>
>>>>>>>> network.remote-dio: enable
>>>>>>>> cluster.eager-lock: enable
>>>>>>>> performance.io-cache: off
>>>>>>>> performance.read-ahead: off
>>>>>>>> performance.quick-read: off
>>>>>>>> performance.stat-prefetch: on
>>>>>>>> performance.strict-write-ordering: off
>>>>>>>> cluster.server-quorum-type: server
>>>>>>>> cluster.quorum-type: auto
>>>>>>>> cluster.data-self-heal: on
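These can be applied one by one with "gluster volume set" (a sketch, assuming the volume is named gv0 as in this thread):

gluster volume set gv0 network.remote-dio enable
gluster volume set gv0 cluster.eager-lock enable
gluster volume set gv0 performance.io-cache off
gluster volume set gv0 performance.read-ahead off
gluster volume set gv0 performance.quick-read off
gluster volume set gv0 performance.stat-prefetch on
gluster volume set gv0 performance.strict-write-ordering off
gluster volume set gv0 cluster.server-quorum-type server
gluster volume set gv0 cluster.quorum-type auto
gluster volume set gv0 cluster.data-self-heal on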
>>>>>>>>
>>>>>>>> Also, with replica two and quorum on (required), your volume will become
>>>>>>>> read-only when one node goes down, to prevent the possibility of split-brain
>>>>>>>> - you *really* want to avoid that :)
>>>>>>>>
>>>>>>>> I'd recommend a replica 3 volume; that way one node can go down, but the
>>>>>>>> other two still form a quorum and the volume remains r/w.
>>>>>>>>
>>>>>>>> If the extra disks are not possible, then an Arbiter volume can be set up
>>>>>>>> - basically dummy files on the third node.
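On an existing replica 2 volume, the arbiter brick can be added in place (a sketch, reusing the gv0 volume name and the 10.0.0.3 host that appear elsewhere in this thread; let the resulting heal finish before any failover test):

# Convert replica 2 into replica 2 + arbiter by adding the third brick:
gluster volume add-brick gv0 replica 3 arbiter 1 10.0.0.3:/bricks/brick1/gv0
# Then wait until heal info reports zero entries on every brick:
gluster volume heal gv0 info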
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Lindsay Mathieson
>>>>>>>>
>>>>>
>>>>>
>>>
>>>
--
Дмитрий Глушенок
Инфосистемы Джет
+7-910-453-2568