[Gluster-users] corruption using gluster and iSCSI with LIO
Дмитрий Глушенок
glush at jet.msk.su
Fri Dec 2 20:43:07 UTC 2016
Hi,
It may be that the LIO service starts before /mnt gets mounted. In the absence of the backend file, LIO created a new one on the root filesystem (in the /mnt directory). The Gluster volume was then mounted on top of it, but because LIO kept the backend file open, it continued to use that file instead of the correct one on the Gluster volume. Then, when you turn off the first node, the active path for the iSCSI disk switches to the second node, which serves the empty file sitting on its root filesystem.
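A quick way to confirm this is to look under the mountpoint with the volume unmounted, and the ordering can be enforced with a systemd drop-in. A minimal sketch, assuming the LIO configuration is restored by target.service (as with targetcli on CentOS 7) and the volume is mounted at /mnt from fstab:

# Anything still visible under the mountpoint after unmounting lives on the
# root filesystem, not on the Gluster volume:
umount /mnt && ls -lh /mnt

# Make the LIO restore wait for the Gluster mount, so the backend file can
# never be (re)created on the root filesystem at boot:
mkdir -p /etc/systemd/system/target.service.d
cat > /etc/systemd/system/target.service.d/wait-for-gluster.conf <<'EOF'
[Unit]
# Pulls in mnt.mount and orders target.service after it
RequiresMountsFor=/mnt
EOF
systemctl daemon-reload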
> On 18 Nov 2016, at 19:21, Olivier Lambert <lambert.olivier at gmail.com> wrote:
>
> After Node 1 is DOWN, LIO on Node 2 (iSCSI target) no longer writes to
> the local Gluster mount, but to the root partition.
>
> Even though "df -h" shows the Gluster brick mounted:
>
> /dev/mapper/centos-root 3,1G 3,1G 20K 100% /
> ...
> /dev/xvdb 61G 61G 956M 99% /bricks/brick1
> localhost:/gv0 61G 61G 956M 99% /mnt
>
> If I unmount it, I still see the "block.img" in /mnt, which is filling
> the root space. So it's like FUSE is messing with the local Gluster
> mount, which could lead to data corruption at the client level.
>
> It doesn't make sense to me... What am I missing?
>
> On Fri, Nov 18, 2016 at 5:00 PM, Olivier Lambert
> <lambert.olivier at gmail.com> wrote:
>> Yes, I only did it when the previous heal info showed "Number of
>> entries: 0". But same result: as soon as the second node goes offline
>> (after both were back online and working), everything is corrupted.
>>
>> To recap:
>>
>> * Node 1 UP Node 2 UP -> OK
>> * Node 1 UP Node 2 DOWN -> OK (just a small lag for multipath to see
>> the path down and change if necessary)
>> * Node 1 UP Node 2 UP -> OK (and waiting to have no entries displayed
>> in heal command)
>> * Node 1 DOWN Node 2 UP -> NOT OK (data corruption)
>>
>> On Fri, Nov 18, 2016 at 3:39 PM, David Gossage
>> <dgossage at carouselchecks.com> wrote:
>>> On Fri, Nov 18, 2016 at 3:49 AM, Olivier Lambert <lambert.olivier at gmail.com>
>>> wrote:
>>>>
>>>> Hi David,
>>>>
>>>> What are the exact commands to be sure it's fine?
>>>>
>>>> Right now I got:
>>>>
>>>> # gluster volume heal gv0 info
>>>> Brick 10.0.0.1:/bricks/brick1/gv0
>>>> Status: Connected
>>>> Number of entries: 0
>>>>
>>>> Brick 10.0.0.2:/bricks/brick1/gv0
>>>> Status: Connected
>>>> Number of entries: 0
>>>>
>>>> Brick 10.0.0.3:/bricks/brick1/gv0
>>>> Status: Connected
>>>> Number of entries: 0
>>>>
>>>>
>>> Did you run this before taking down the 2nd node to see if any heals
>>> were ongoing?
>>>
>>> Also I see you have sharding enabled. Are your files being served sharded
>>> already as well?
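Both points can be checked before pulling a node. A small sketch, assuming the volume name gv0 and the brick path /bricks/brick1/gv0 used in this thread:

# Any entries listed here mean heals are still pending:
gluster volume heal gv0 info
# Per-brick count of files that still need healing:
gluster volume heal gv0 statistics heal-count
# With sharding enabled, shards are stored under the hidden .shard
# directory on each brick (run on a brick node, as root):
ls /bricks/brick1/gv0/.shard | head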
>>>
>>>>
>>>> Everything is online and working, but this command gives a strange output:
>>>>
>>>> # gluster volume heal gv0 info heal-failed
>>>> Gathering list of heal failed entries on volume gv0 has been
>>>> unsuccessful on bricks that are down. Please check if all brick
>>>> processes are running.
>>>>
>>>> Is it normal?
>>>
>>>
>>> I don't think that is a valid command anymore; when I run it I get the
>>> same message, and this is in the logs:
>>> [2016-11-18 14:35:02.260503] I [MSGID: 106533]
>>> [glusterd-volume-ops.c:878:__glusterd_handle_cli_heal_volume] 0-management:
>>> Received heal vol req for volume GLUSTER1
>>> [2016-11-18 14:35:02.263341] W [MSGID: 106530]
>>> [glusterd-volume-ops.c:1882:glusterd_handle_heal_cmd] 0-management: Command
>>> not supported. Please use "gluster volume heal GLUSTER1 info" and logs to
>>> find the heal information.
>>> [2016-11-18 14:35:02.263365] E [MSGID: 106301]
>>> [glusterd-syncop.c:1297:gd_stage_op_phase] 0-management: Staging of
>>> operation 'Volume Heal' failed on localhost : Command not supported. Please
>>> use "gluster volume heal GLUSTER1 info" and logs to find the heal
>>> information.
>>>
>>>>
>>>> On Fri, Nov 18, 2016 at 2:51 AM, David Gossage
>>>> <dgossage at carouselchecks.com> wrote:
>>>>>
>>>>> On Thu, Nov 17, 2016 at 6:42 PM, Olivier Lambert
>>>>> <lambert.olivier at gmail.com>
>>>>> wrote:
>>>>>>
>>>>>> Okay, I used the exact same config you provided, and added an arbiter
>>>>>> node (node3).
>>>>>>
>>>>>> After halting node2, the VM continues to work after a small
>>>>>> "lag"/freeze. I restarted node2 and it came back online: OK
>>>>>>
>>>>>> Then, after waiting a few minutes, I halted node1. And **just** at
>>>>>> that moment, the VM got corrupted (segmentation faults, /var/log
>>>>>> folder empty, etc.)
>>>>>>
>>>>> Other than waiting a few minutes, did you make sure heals had completed?
>>>>>
>>>>>>
>>>>>> dmesg of the VM:
>>>>>>
>>>>>> [ 1645.852905] EXT4-fs error (device xvda1):
>>>>>> htree_dirblock_to_tree:988: inode #19: block 8286: comm bash: bad
>>>>>> entry in directory: rec_len is smaller than minimal - offset=0(0),
>>>>>> inode=0, rec_len=0, name_len=0
>>>>>> [ 1645.854509] Aborting journal on device xvda1-8.
>>>>>> [ 1645.855524] EXT4-fs (xvda1): Remounting filesystem read-only
>>>>>>
>>>>>> And then got a lot of "comm bash: bad entry in directory" messages...
>>>>>>
>>>>>> Here is the current config with all nodes back online:
>>>>>>
>>>>>> # gluster volume info
>>>>>>
>>>>>> Volume Name: gv0
>>>>>> Type: Replicate
>>>>>> Volume ID: 5f15c919-57e3-4648-b20a-395d9fe3d7d6
>>>>>> Status: Started
>>>>>> Snapshot Count: 0
>>>>>> Number of Bricks: 1 x (2 + 1) = 3
>>>>>> Transport-type: tcp
>>>>>> Bricks:
>>>>>> Brick1: 10.0.0.1:/bricks/brick1/gv0
>>>>>> Brick2: 10.0.0.2:/bricks/brick1/gv0
>>>>>> Brick3: 10.0.0.3:/bricks/brick1/gv0 (arbiter)
>>>>>> Options Reconfigured:
>>>>>> nfs.disable: on
>>>>>> performance.readdir-ahead: on
>>>>>> transport.address-family: inet
>>>>>> features.shard: on
>>>>>> features.shard-block-size: 16MB
>>>>>> network.remote-dio: enable
>>>>>> cluster.eager-lock: enable
>>>>>> performance.io-cache: off
>>>>>> performance.read-ahead: off
>>>>>> performance.quick-read: off
>>>>>> performance.stat-prefetch: on
>>>>>> performance.strict-write-ordering: off
>>>>>> cluster.server-quorum-type: server
>>>>>> cluster.quorum-type: auto
>>>>>> cluster.data-self-heal: on
>>>>>>
>>>>>>
>>>>>> # gluster volume status
>>>>>> Status of volume: gv0
>>>>>> Gluster process                            TCP Port  RDMA Port  Online  Pid
>>>>>> ------------------------------------------------------------------------------
>>>>>> Brick 10.0.0.1:/bricks/brick1/gv0          49152     0          Y       1331
>>>>>> Brick 10.0.0.2:/bricks/brick1/gv0          49152     0          Y       2274
>>>>>> Brick 10.0.0.3:/bricks/brick1/gv0          49152     0          Y       2355
>>>>>> Self-heal Daemon on localhost              N/A       N/A        Y       2300
>>>>>> Self-heal Daemon on 10.0.0.3               N/A       N/A        Y       10530
>>>>>> Self-heal Daemon on 10.0.0.2               N/A       N/A        Y       2425
>>>>>>
>>>>>> Task Status of Volume gv0
>>>>>> ------------------------------------------------------------------------------
>>>>>> There are no active volume tasks
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Nov 17, 2016 at 11:35 PM, Olivier Lambert
>>>>>> <lambert.olivier at gmail.com> wrote:
>>>>>>> It's planned to have an arbiter soon :) These were just preliminary
>>>>>>> tests.
>>>>>>>
>>>>>>> Thanks for the settings, I'll test them soon and get back to you!
>>>>>>>
>>>>>>> On Thu, Nov 17, 2016 at 11:29 PM, Lindsay Mathieson
>>>>>>> <lindsay.mathieson at gmail.com> wrote:
>>>>>>>> On 18/11/2016 8:17 AM, Olivier Lambert wrote:
>>>>>>>>>
>>>>>>>>> gluster volume info gv0
>>>>>>>>>
>>>>>>>>> Volume Name: gv0
>>>>>>>>> Type: Replicate
>>>>>>>>> Volume ID: 2f8658ed-0d9d-4a6f-a00b-96e9d3470b53
>>>>>>>>> Status: Started
>>>>>>>>> Snapshot Count: 0
>>>>>>>>> Number of Bricks: 1 x 2 = 2
>>>>>>>>> Transport-type: tcp
>>>>>>>>> Bricks:
>>>>>>>>> Brick1: 10.0.0.1:/bricks/brick1/gv0
>>>>>>>>> Brick2: 10.0.0.2:/bricks/brick1/gv0
>>>>>>>>> Options Reconfigured:
>>>>>>>>> nfs.disable: on
>>>>>>>>> performance.readdir-ahead: on
>>>>>>>>> transport.address-family: inet
>>>>>>>>> features.shard: on
>>>>>>>>> features.shard-block-size: 16MB
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> When hosting VMs, it's essential to set these options:
>>>>>>>>
>>>>>>>> network.remote-dio: enable
>>>>>>>> cluster.eager-lock: enable
>>>>>>>> performance.io-cache: off
>>>>>>>> performance.read-ahead: off
>>>>>>>> performance.quick-read: off
>>>>>>>> performance.stat-prefetch: on
>>>>>>>> performance.strict-write-ordering: off
>>>>>>>> cluster.server-quorum-type: server
>>>>>>>> cluster.quorum-type: auto
>>>>>>>> cluster.data-self-heal: on
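These can be applied one by one with "gluster volume set" (a sketch, assuming the volume is named gv0 as in this thread):

gluster volume set gv0 network.remote-dio enable
gluster volume set gv0 cluster.eager-lock enable
gluster volume set gv0 performance.io-cache off
gluster volume set gv0 performance.read-ahead off
gluster volume set gv0 performance.quick-read off
gluster volume set gv0 performance.stat-prefetch on
gluster volume set gv0 performance.strict-write-ordering off
gluster volume set gv0 cluster.server-quorum-type server
gluster volume set gv0 cluster.quorum-type auto
gluster volume set gv0 cluster.data-self-heal on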
>>>>>>>>
>>>>>>>> Also, with replica two and quorum on (required), your volume will become
>>>>>>>> read-only when one node goes down, to prevent the possibility of split-brain
>>>>>>>> - you *really* want to avoid that :)
>>>>>>>>
>>>>>>>> I'd recommend a replica 3 volume; that way one node can go down, but the
>>>>>>>> other two still form a quorum and the volume remains r/w.
>>>>>>>>
>>>>>>>> If the extra disks are not possible, then an Arbiter volume can be set up
>>>>>>>> - basically dummy files on the third node.
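On an existing replica 2 volume, the arbiter brick can be added in place (a sketch, reusing the gv0 volume name and the 10.0.0.3 host that appear elsewhere in this thread; let the resulting heal finish before any failover test):

# Convert replica 2 into replica 2 + arbiter by adding the third brick:
gluster volume add-brick gv0 replica 3 arbiter 1 10.0.0.3:/bricks/brick1/gv0
# Then wait until heal info reports zero entries on every brick:
gluster volume heal gv0 info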
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Lindsay Mathieson
>>>>>>>>
>>>>>
>>>>>
>>>
>>>
--
Дмитрий Глушенок
Инфосистемы Джет
+7-910-453-2568