[Gluster-users] Replicated striped data loss

Krutika Dhananjay kdhananj at redhat.com
Tue Mar 15 12:03:43 UTC 2016


Hmm ok. Could you share the nfs.log content?
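
For instance, a recent tail of it would do (a sketch; adjust the line count as needed):

# tail -n 200 /var/log/glusterfs/nfs.log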

-Krutika

On Tue, Mar 15, 2016 at 1:45 PM, Mahdi Adnan <mahdi.adnan at earthlinktele.com>
wrote:

> Okay, here's what I did:
>
> Volume Name: v
> Type: Distributed-Replicate
> Volume ID: b348fd8e-b117-469d-bcc0-56a56bdfc930
> Status: Started
> Number of Bricks: 3 x 2 = 6
> Transport-type: tcp
> Bricks:
> Brick1: gfs001:/bricks/b001/v
> Brick2: gfs001:/bricks/b002/v
> Brick3: gfs001:/bricks/b003/v
> Brick4: gfs002:/bricks/b004/v
> Brick5: gfs002:/bricks/b005/v
> Brick6: gfs002:/bricks/b006/v
> Options Reconfigured:
> features.shard-block-size: 128MB
> features.shard: enable
> cluster.server-quorum-type: server
> cluster.quorum-type: auto
> network.remote-dio: enable
> cluster.eager-lock: enable
> performance.stat-prefetch: off
> performance.io-cache: off
> performance.read-ahead: off
> performance.quick-read: off
> performance.readdir-ahead: on
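>
> (A hedged reconstruction of the create command implied by the brick list
> above; note that adjacent bricks form the replica pairs, so with this
> ordering two of the three pairs land on a single server:)
>
> # gluster volume create v replica 2 \
>     gfs001:/bricks/b001/v gfs001:/bricks/b002/v \
>     gfs001:/bricks/b003/v gfs002:/bricks/b004/v \
>     gfs002:/bricks/b005/v gfs002:/bricks/b006/v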
>
>
> Same error, and mounting via glusterfs still works just fine.
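>
> For comparison, the fuse mount that works would be along these lines
> (mount point illustrative):
>
> # mount -t glusterfs gfs001:/v /mnt/v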
>
> Respectfully
> *Mahdi A. Mahdi*
>
> On 03/15/2016 11:04 AM, Krutika Dhananjay wrote:
>
> OK but what if you use it with replication? Do you still see the error? I
> think not.
> Could you give it a try and tell me what you find?
>
> -Krutika
>
> On Tue, Mar 15, 2016 at 1:23 PM, Mahdi Adnan <
> mahdi.adnan at earthlinktele.com> wrote:
>
>> Hi,
>>
>> I have created the following volume;
>>
>> Volume Name: v
>> Type: Distribute
>> Volume ID: 90de6430-7f83-4eda-a98f-ad1fabcf1043
>> Status: Started
>> Number of Bricks: 3
>> Transport-type: tcp
>> Bricks:
>> Brick1: gfs001:/bricks/b001/v
>> Brick2: gfs001:/bricks/b002/v
>> Brick3: gfs001:/bricks/b003/v
>> Options Reconfigured:
>> features.shard-block-size: 128MB
>> features.shard: enable
>> cluster.server-quorum-type: server
>> cluster.quorum-type: auto
>> network.remote-dio: enable
>> cluster.eager-lock: enable
>> performance.stat-prefetch: off
>> performance.io-cache: off
>> performance.read-ahead: off
>> performance.quick-read: off
>> performance.readdir-ahead: on
>>
>> After mounting it in ESXi and trying to clone a VM to it, I got the
>> same error.
>>
>>
>> Respectfully
>> *Mahdi A. Mahdi*
>>
>>
>> On 03/15/2016 10:44 AM, Krutika Dhananjay wrote:
>>
>> Hi,
>>
>> Do not use sharding and stripe together in the same volume, because:
>> a) It is not recommended and there is no point in using both; sharding
>> alone on your volume should work fine.
>> b) The combination has not been tested.
>> c) As Niels said, the stripe feature is effectively deprecated.
>>
>> I would suggest that you create an n x 3 volume, where n is the number of
>> distribute subvolumes you prefer, enable the "virt" group options on it,
>> enable sharding on it, set a shard-block-size you feel is appropriate, and
>> then just start off with VM image creation etc.
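>>
>> A minimal sketch of those commands, assuming n=1 and illustrative
>> host/brick names:
>>
>> # gluster volume create v replica 3 \
>>     gfs001:/bricks/b001/v gfs002:/bricks/b001/v gfs003:/bricks/b001/v
>> # gluster volume set v group virt
>> # gluster volume set v features.shard on
>> # gluster volume set v features.shard-block-size 128MB
>> # gluster volume start v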
>> If you run into any issues even after you do this, let us know and we'll
>> help you out.
>>
>> -Krutika
>>
>> On Tue, Mar 15, 2016 at 1:07 PM, Mahdi Adnan <mahdi.adnan at earthlinktele.com> wrote:
>>
>>> Thanks Krutika,
>>>
>>> I have deleted the volume and created a new one.
>>> I found that it may be an issue with NFS itself: I created a new
>>> striped volume, enabled sharding, and mounted it via glusterfs, and it
>>> worked just fine; if I mount it with NFS it fails and gives me the same
>>> errors.
>>>
>>> Respectfully
>>> *Mahdi A. Mahdi*
>>>
>>> On 03/15/2016 06:24 AM, Krutika Dhananjay wrote:
>>>
>>> Hi,
>>>
>>> So could you share the xattrs associated with the file at
>>> <BRICK_PATH>/.glusterfs/c3/e8/c3e88cc1-7e0a-4d46-9685-2d12131a5e1c
>>>
>>> Here's what you need to execute:
>>>
>>> On the first node:
>>> # getfattr -d -m . -e hex /mnt/b1/v/.glusterfs/c3/e8/c3e88cc1-7e0a-4d46-9685-2d12131a5e1c
>>>
>>> On the second:
>>> # getfattr -d -m . -e hex /mnt/b2/v/.glusterfs/c3/e8/c3e88cc1-7e0a-4d46-9685-2d12131a5e1c
>>>
>>>
>>> Also, it is normally advised to use a replica 3 volume, as opposed to a
>>> replica 2 volume, to guard against split-brains.
>>>
>>> -Krutika
>>>
>>> On Mon, Mar 14, 2016 at 3:17 PM, Mahdi Adnan <mahdi.adnan at earthlinktele.com> wrote:
>>>
>>>> Sorry for the serial posting, but I got new logs that might help.
>>>>
>>>> These messages appear during the migration:
>>>>
>>>> /var/log/glusterfs/nfs.log
>>>>
>>>>
>>>> [2016-03-14 09:45:04.573765] I [MSGID: 109036]
>>>> [dht-common.c:8043:dht_log_new_layout_for_dir_selfheal] 0-testv-dht:
>>>> Setting layout of /New Virtual Machine_1 with [Subvol_name: testv-stripe-0,
>>>> Err: -1 , Start: 0 , Stop: 4294967295 , Hash: 1 ],
>>>> [2016-03-14 09:45:04.957499] E
>>>> [shard.c:369:shard_modify_size_and_block_count]
>>>> (-->/usr/lib64/glusterfs/3.7.8/xlator/cluster/distribute.so(dht_file_setattr_cbk+0x14f)
>>>> [0x7f27a13c067f]
>>>> -->/usr/lib64/glusterfs/3.7.8/xlator/features/shard.so(shard_common_setattr_cbk+0xcc)
>>>> [0x7f27a116681c]
>>>> -->/usr/lib64/glusterfs/3.7.8/xlator/features/shard.so(shard_modify_size_and_block_count+0xdd)
>>>> [0x7f27a116584d] ) 0-testv-shard: Failed to get
>>>> trusted.glusterfs.shard.file-size for c3e88cc1-7e0a-4d46-9685-2d12131a5e1c
>>>> [2016-03-14 09:45:04.957577] W [MSGID: 112199]
>>>> [nfs3-helpers.c:3418:nfs3_log_common_res] 0-nfs-nfsv3: /New Virtual
>>>> Machine_1/New Virtual Machine-flat.vmdk => (XID: 3fec5a26, SETATTR: NFS:
>>>> 22(Invalid argument for operation), POSIX: 22(Invalid argument)) [Invalid
>>>> argument]
>>>> [2016-03-14 09:45:05.079657] E [MSGID: 112069]
>>>> [nfs3.c:3649:nfs3_rmdir_resume] 0-nfs-nfsv3: No such file or directory: (
>>>> 192.168.221.52:826) testv : 00000000-0000-0000-0000-000000000001
>>>>
>>>>
>>>>
>>>> Respectfully
>>>>
>>>>
>>>> *Mahdi A. Mahdi*
>>>> On 03/14/2016 11:14 AM, Mahdi Adnan wrote:
>>>>
>>>> So I have deployed a new server ("Cisco UCS C220M4") and created a new
>>>> volume:
>>>>
>>>> Volume Name: testv
>>>> Type: Stripe
>>>> Volume ID: 55cdac79-fe87-4f1f-90c0-15c9100fe00b
>>>> Status: Started
>>>> Number of Bricks: 1 x 2 = 2
>>>> Transport-type: tcp
>>>> Bricks:
>>>> Brick1: 10.70.0.250:/mnt/b1/v
>>>> Brick2: 10.70.0.250:/mnt/b2/v
>>>> Options Reconfigured:
>>>> nfs.disable: off
>>>> features.shard-block-size: 64MB
>>>> features.shard: enable
>>>> cluster.server-quorum-type: server
>>>> cluster.quorum-type: auto
>>>> network.remote-dio: enable
>>>> cluster.eager-lock: enable
>>>> performance.stat-prefetch: off
>>>> performance.io-cache: off
>>>> performance.read-ahead: off
>>>> performance.quick-read: off
>>>> performance.readdir-ahead: off
>>>>
>>>> Same error.
>>>>
>>>> Can anyone share the info of a working striped volume?
>>>>
>>>> On 03/14/2016 09:02 AM, Mahdi Adnan wrote:
>>>>
>>>> I have a pool of two bricks in the same server;
>>>>
>>>> Volume Name: k
>>>> Type: Stripe
>>>> Volume ID: 1e9281ce-2a8b-44e8-a0c6-e3ebf7416b2b
>>>> Status: Started
>>>> Number of Bricks: 1 x 2 = 2
>>>> Transport-type: tcp
>>>> Bricks:
>>>> Brick1: gfs001:/bricks/t1/k
>>>> Brick2: gfs001:/bricks/t2/k
>>>> Options Reconfigured:
>>>> features.shard-block-size: 64MB
>>>> features.shard: on
>>>> cluster.server-quorum-type: server
>>>> cluster.quorum-type: auto
>>>> network.remote-dio: enable
>>>> cluster.eager-lock: enable
>>>> performance.stat-prefetch: off
>>>> performance.io-cache: off
>>>> performance.read-ahead: off
>>>> performance.quick-read: off
>>>> performance.readdir-ahead: off
>>>>
>>>> Same issue.
>>>> glusterfs 3.7.8 built on Mar 10 2016 20:20:45.
>>>>
>>>>
>>>> Respectfully
>>>> *Mahdi A. Mahdi*
>>>>
>>>> Systems Administrator
>>>> IT. Department
>>>> Earthlink Telecommunications <https://www.facebook.com/earthlinktele>
>>>>
>>>> Cell: 07903316180
>>>> Work: 3352
>>>> Skype: mahdi.adnan at outlook.com
>>>> On 03/14/2016 08:11 AM, Niels de Vos wrote:
>>>>
>>>> On Mon, Mar 14, 2016 at 08:12:27AM +0530, Krutika Dhananjay wrote:
>>>>
>>>> It would be better to use sharding over stripe for your VM use case. It
>>>> offers better distribution and utilisation of bricks, better heal
>>>> performance, and it is well tested.
>>>>
>>>> Basically the "striping" feature is deprecated; "sharding" is its
>>>> improved replacement. I expect to see "striping" completely dropped in
>>>> the next major release.
>>>>
>>>> Niels
>>>>
>>>>
>>>>
>>>> A couple of things to note before you do that:
>>>> 1. Most of the bug fixes in sharding have gone into 3.7.8, so it is advised
>>>> that you use 3.7.8 or above.
>>>> 2. When you enable sharding on a volume, already-existing files in the
>>>> volume do not get sharded; only files newly created after sharding is
>>>> enabled will be. If you do want to shard the existing files, you need to
>>>> cp them to a temp name within the volume and then rename them back to the
>>>> original file name (see the sketch below).
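>>>>
>>>> A hedged illustration from a client mount (image name made up):
>>>>
>>>> # cp /mnt/v/vm1-flat.vmdk /mnt/v/vm1-flat.vmdk.tmp
>>>> # mv /mnt/v/vm1-flat.vmdk.tmp /mnt/v/vm1-flat.vmdk
>>>>
>>>> The cp writes a brand-new (and therefore sharded) file; the mv then
>>>> replaces the unsharded original with it.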
>>>>
>>>> HTH,
>>>> Krutika
>>>>
>>>> On Sun, Mar 13, 2016 at 11:49 PM, Mahdi Adnan <mahdi.adnan at earthlinktele.com> wrote:
>>>>
>>>> I couldn't find anything related to cache in the HBAs.
>>>> What logs are useful in my case? I see only the brick logs, which contain
>>>> nothing during the failure.
>>>>
>>>> ###
>>>> [2016-03-13 18:05:19.728614] E [MSGID: 113022] [posix.c:1232:posix_mknod]
>>>> 0-vmware-posix: mknod on
>>>> /bricks/b003/vmware/.shard/17d75e20-16f1-405e-9fa5-99ee7b1bd7f1.511 failed
>>>> [File exists]
>>>> [2016-03-13 18:07:23.337086] E [MSGID: 113022] [posix.c:1232:posix_mknod]
>>>> 0-vmware-posix: mknod on
>>>> /bricks/b003/vmware/.shard/eef2d538-8eee-4e58-bc88-fbf7dc03b263.4095 failed
>>>> [File exists]
>>>> [2016-03-13 18:07:55.027600] W [trash.c:1922:trash_rmdir] 0-vmware-trash:
>>>> rmdir issued on /.trashcan/, which is not permitted
>>>> [2016-03-13 18:07:55.027635] I [MSGID: 115056]
>>>> [server-rpc-fops.c:459:server_rmdir_cbk] 0-vmware-server: 41987: RMDIR
>>>> /.trashcan/internal_op (00000000-0000-0000-0000-000000000005/internal_op)
>>>> ==> (Operation not permitted) [Operation not permitted]
>>>> [2016-03-13 18:11:34.353441] I [login.c:81:gf_auth] 0-auth/login: allowed
>>>> user names: c0c72c37-477a-49a5-a305-3372c1c2f2b4
>>>> [2016-03-13 18:11:34.353463] I [MSGID: 115029]
>>>> [server-handshake.c:612:server_setvolume] 0-vmware-server: accepted client
>>>> from gfs002-2727-2016/03/13-20:17:43:613597-vmware-client-4-0-0 (version:
>>>> 3.7.8)
>>>> [2016-03-13 18:11:34.591139] I [login.c:81:gf_auth] 0-auth/login: allowed
>>>> user names: c0c72c37-477a-49a5-a305-3372c1c2f2b4
>>>> [2016-03-13 18:11:34.591173] I [MSGID: 115029]
>>>> [server-handshake.c:612:server_setvolume] 0-vmware-server: accepted client
>>>> from gfs002-2719-2016/03/13-20:17:42:609388-vmware-client-4-0-0 (version:
>>>> 3.7.8)
>>>> ###
>>>>
>>>> ESXi just keeps telling me "Cannot clone T: The virtual disk is either
>>>> corrupted or not a supported format." The task details are:
>>>>
>>>> error
>>>> 3/13/2016 9:06:20 PM
>>>> Clone virtual machine
>>>> T
>>>> VCENTER.LOCAL\Administrator
>>>>
>>>> My setup is two servers with a floating IP controlled by CTDB, and my
>>>> ESXi server mounts the NFS export via the floating IP.
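>>>>
>>>> For reference, the equivalent mount from a Linux client would look
>>>> something like this (floating IP illustrative; Gluster's built-in NFS
>>>> server speaks NFSv3 only):
>>>>
>>>> # mount -t nfs -o vers=3 192.168.221.50:/vmware /mnt/vmware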
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On 03/13/2016 08:40 PM, pkoelle wrote:
>>>>
>>>>
>>>> Am 13.03.2016 um 18:22 schrieb David Gossage:
>>>>
>>>>
>>>> On Sun, Mar 13, 2016 at 11:07 AM, Mahdi Adnan <mahdi.adnan at earthlinktele.com> wrote:
>>>>
>>>>
>>>> My HBAs are LSISAS1068E, and the filesystem is XFS.
>>>>
>>>> I tried EXT4 and it did not help.
>>>> I have created a striped volume on one server with two bricks: same issue.
>>>> I also tried a replicated volume with just sharding enabled: same issue.
>>>> As soon as I disable sharding it works just fine; neither sharding nor
>>>> striping works for me.
>>>> I followed up on some threads in the mailing list and tried some of the
>>>> fixes that worked for others; none worked for me. :(
>>>>
>>>>
>>>>
>>>> Is it possible the LSI has write-cache enabled?
>>>>
>>>>
>>>> Why is that relevant? Even the backing filesystem has no idea if there is
>>>> a RAID or write cache or whatever. There are blocks and sync(), end of
>>>> story.
>>>> If you lose power and screw up your recovery, OR do funky stuff with SAS
>>>> multipathing, then a controller cache might be an issue. AFAIK that's not
>>>> what we are talking about.
>>>>
>>>> I'm afraid that unless the OP has some logs from the server, a
>>>> reproducible test case, or a backtrace from client or server, this isn't
>>>> getting us anywhere.
>>>>
>>>> cheers
>>>> Paul
>>>>
>>>>
>>>>
>>>> On 03/13/2016 06:54 PM, David Gossage wrote:
>>>>
>>>> On Sun, Mar 13, 2016 at 8:16 AM, Mahdi Adnan <mahdi.adnan at earthlinktele.com> wrote:
>>>>
>>>> Okay, so I enabled shard on my test volume and it did not help.
>>>>
>>>> Stupidly enough, I then enabled it on a production volume
>>>> ("Distributed-Replicate") and it corrupted half of my VMs.
>>>> I have updated Gluster to the latest and nothing seems to have changed in
>>>> my situation.
>>>> Below is the info of my volume:
>>>>
>>>>
>>>>
>>>> I was pointing at the settings in that email as an example of fixing
>>>> corruption. I wouldn't recommend enabling sharding if you haven't gotten
>>>> the base working yet on that cluster. What HBAs are you using, and what
>>>> is the filesystem layout for the bricks?
>>>>
>>>>
>>>> Number of Bricks: 3 x 2 = 6
>>>>
>>>> Transport-type: tcp
>>>> Bricks:
>>>> Brick1: gfs001:/bricks/b001/vmware
>>>> Brick2: gfs002:/bricks/b004/vmware
>>>> Brick3: gfs001:/bricks/b002/vmware
>>>> Brick4: gfs002:/bricks/b005/vmware
>>>> Brick5: gfs001:/bricks/b003/vmware
>>>> Brick6: gfs002:/bricks/b006/vmware
>>>> Options Reconfigured:
>>>> performance.strict-write-ordering: on
>>>> cluster.server-quorum-type: server
>>>> cluster.quorum-type: auto
>>>> network.remote-dio: enable
>>>> performance.stat-prefetch: disable
>>>> performance.io-cache: off
>>>> performance.read-ahead: off
>>>> performance.quick-read: off
>>>> cluster.eager-lock: enable
>>>> features.shard-block-size: 16MB
>>>> features.shard: on
>>>> performance.readdir-ahead: off
>>>>
>>>>
>>>> On 03/12/2016 08:11 PM, David Gossage wrote:
>>>>
>>>>
>>>> On Sat, Mar 12, 2016 at 10:21 AM, Mahdi Adnan <mahdi.adnan at earthlinktele.com> wrote:
>>>>
>>>> Both servers have HBAs, no RAID, and I can set up replicated or dispersed
>>>> volumes without any issues.
>>>> The logs are clean: when I tried to migrate a VM and got the error,
>>>> nothing showed up in them.
>>>> I tried mounting the volume on my laptop and it mounted fine, but if I
>>>> use dd to create a data file it just hangs and I can't cancel it, can't
>>>> unmount, or anything; I just have to reboot (see the example below).
>>>> The same servers have another volume on other bricks in a distributed
>>>> replica, which works fine.
>>>> I have even tried the same setup in a virtual environment (created two
>>>> VMs, installed Gluster, and created a replicated striped volume) and
>>>> again the same thing: data corruption.
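>>>>
>>>> For example, a test write along these lines (path made up) is what hangs:
>>>>
>>>> # dd if=/dev/zero of=/mnt/v/testfile bs=1M count=100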
>>>>
>>>>
>>>>
>>>> I'd look through the mail archives for a topic called "Shard in
>>>> Production", I think. The shard portion may not be relevant, but it does
>>>> discuss certain settings that had to be applied to avoid corruption with
>>>> VMs. You may also want to try disabling performance.readdir-ahead.
>>>>
>>>>
>>>>
>>>>
>>>> On 03/12/2016 07:02 PM, David Gossage wrote:
>>>>
>>>>
>>>>
>>>> On Sat, Mar 12, 2016 at 9:51 AM, Mahdi Adnan <mahdi.adnan at earthlinktele.com> wrote:
>>>>
>>>> Thanks David,
>>>>
>>>> My settings were all defaults; I had just created the pool and started
>>>> it. I have applied the settings you recommended, and it seems to be the
>>>> same issue:
>>>>
>>>> Type: Striped-Replicate
>>>> Volume ID: 44adfd8c-2ed1-4aa5-b256-d12b64f7fc14
>>>> Status: Started
>>>> Number of Bricks: 1 x 2 x 2 = 4
>>>> Transport-type: tcp
>>>> Bricks:
>>>> Brick1: gfs001:/bricks/t1/s
>>>> Brick2: gfs002:/bricks/t1/s
>>>> Brick3: gfs001:/bricks/t2/s
>>>> Brick4: gfs002:/bricks/t2/s
>>>> Options Reconfigured:
>>>> performance.stat-prefetch: off
>>>> network.remote-dio: on
>>>> cluster.eager-lock: enable
>>>> performance.io-cache: off
>>>> performance.read-ahead: off
>>>> performance.quick-read: off
>>>> performance.readdir-ahead: on
>>>>
>>>>
>>>>
>>>> Is there a RAID controller perhaps doing any caching?
>>>>
>>>> Are any errors reported in the gluster logs during the migration
>>>> process?
>>>> Since they aren't in use yet, have you tested making just mirrored bricks
>>>> using different pairings of servers, two at a time, to see if the problem
>>>> follows a certain machine or network port?
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On 03/12/2016 03:25 PM, David Gossage wrote:
>>>>
>>>>
>>>>
>>>> On Sat, Mar 12, 2016 at 1:55 AM, Mahdi Adnan <mahdi.adnan at earthlinktele.com> wrote:
>>>>
>>>> Dear all,
>>>>
>>>> I have created a replicated striped volume with two bricks and two
>>>> servers, but I can't use it: when I mount it in ESXi and try to
>>>> migrate a VM to it, the data gets corrupted.
>>>> Does anyone have any idea why this is happening?
>>>>
>>>> Dell 2950 x2
>>>> Seagate 15k 600GB
>>>> CentOS 7.2
>>>> Gluster 3.7.8
>>>>
>>>> Appreciate your help.
>>>>
>>>>
>>>>
>>>> Most reports of this I have seen end up being settings-related. Post
>>>> your gluster volume info. Below are what I have seen as the most commonly
>>>> recommended settings; I'd hazard a guess you may have the read-ahead
>>>> cache or prefetch on.
>>>>
>>>> quick-read=off
>>>> read-ahead=off
>>>> io-cache=off
>>>> stat-prefetch=off
>>>> eager-lock=enable
>>>> remote-dio=on
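>>>>
>>>> Applied via the CLI, those would look something like this (volume name
>>>> illustrative):
>>>>
>>>> # gluster volume set VOLNAME performance.quick-read off
>>>> # gluster volume set VOLNAME performance.read-ahead off
>>>> # gluster volume set VOLNAME performance.io-cache off
>>>> # gluster volume set VOLNAME performance.stat-prefetch off
>>>> # gluster volume set VOLNAME cluster.eager-lock enable
>>>> # gluster volume set VOLNAME network.remote-dio on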
>>>>
>>>>
>>>>
>>>> Mahdi Adnan
>>>> System Admin
>>>>
>>>>
>>>> _______________________________________________
>>>> Gluster-users mailing list
>>>> Gluster-users at gluster.org
>>>> http://www.gluster.org/mailman/listinfo/gluster-users
>>>>
>>>
>>>
>>>
>>
>>
>
>