[Gluster-devel] Physical HDD unplug test

Krutika Dhananjay kdhananj at redhat.com
Tue Aug 23 15:42:58 UTC 2016


Should be OK. But running the same version on both clients and servers is
always the safest bet.
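You can double-check what each side is running with something like this
(standard commands; adjust for how your packages are installed):

    glusterfs --version    # on the clients (FUSE client package)
    gluster --version      # on the servers (CLI/glusterd package)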

-Krutika

On Mon, Aug 22, 2016 at 10:39 AM, qingwei wei <tchengwee at gmail.com> wrote:

> Hi,
>
> I updated my client to 3.7.14 and I no longer see the split-brain
> error, so it looks like some client-side change fixed it. Please note
> that, for this test, I am still running the 3.7.10 server. Will it be an
> issue to run different versions on the client and the server?
>
> Cw
>
> On Tue, Aug 16, 2016 at 5:46 PM, Krutika Dhananjay <kdhananj at redhat.com> wrote:
> > 3.7.11 had quite a few bugs in afr and sharding+afr interop that were
> > fixed in 3.7.12. Some of them were about files being reported as being
> > in split-brain. Chances are that some of them existed in 3.7.10 as well
> > - which is what you're using.
> >
> > Do you mind trying the same test with 3.7.12 or a later version?
> >
> > -Krutika
> >
> > On Tue, Aug 16, 2016 at 2:46 PM, qingwei wei <tchengwee at gmail.com> wrote:
> >>
> >> Hi Niels,
> >>
> >> My situation is that when I unplug the HDD physically, the FIO
> >> application exits with an Input/Output error. However, when I echo
> >> offline to the disk, the FIO application freezes briefly but still
> >> manages to resume the I/O workload after the freeze.
> >>
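> >> For reference, the FIO workload is along these lines (the mount point
> >> and job parameters here are only illustrative, not my exact job file):
> >>
> >>     # random-write job against a file on the Gluster FUSE mount
> >>     # (the path below is an example mount point)
> >>     fio --name=unplug-test --filename=/mnt/ad17hwssd7/fio.test \
> >>         --rw=randwrite --bs=4k --size=4g --ioengine=libaio \
> >>         --iodepth=16 --time_based --runtime=600
> >>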
> >> From what I can see in the client log, the error is split-brain,
> >> which does not make sense as I still have 2 working replicas.
> >>
> >> [2016-08-12 10:33:41.854283] E [MSGID: 108008]
> >> [afr-transaction.c:1989:afr_transaction] 0-ad17hwssd7-replicate-0:
> >> Failing WRITE on gfid 665a43df-1ece-4c9a-a6ee-fcfa960d95bf:
> >> split-brain observed. [Input/output error]
> >>
> >> So can anyone share their experience with this type of disruptive
> >> test on a sharded volume? Thanks!
> >>
> >> Regards,
> >>
> >> Cheng Wee
> >>
> >> On Tue, Aug 16, 2016 at 4:45 PM, Niels de Vos <ndevos at redhat.com> wrote:
> >> > On Tue, Aug 16, 2016 at 01:34:36PM +0800, qingwei wei wrote:
> >> >> Hi,
> >> >>
> >> >> I am currently trying to test the reliability of a distributed
> >> >> replica (3 replicas) volume when 1 brick is down. I tried both a
> >> >> software unplug, by issuing echo offline > /sys/block/sdx/device/state,
> >> >> and physically unplugging the HDD, and I encountered 2 different
> >> >> outcomes. With the software unplug, the FIO workload continues to
> >> >> run, but with the physical unplug, the FIO workload cannot continue
> >> >> and fails with the following error:
> >> >>
> >> >> [2016-08-12 10:33:41.854283] E [MSGID: 108008]
> >> >> [afr-transaction.c:1989:afr_transaction] 0-ad17hwssd7-replicate-0:
> >> >> Failing WRITE on gfid 665a43df-1ece-4c9a-a6ee-fcfa960d95bf:
> >> >> split-brain observed. [Input/output error]
> >> >>
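> >> >> For clarity, the software unplug I used is the first command below;
> >> >> the second is a closer software emulation of a hot-unplug that I have
> >> >> not used in this test (sdx is a placeholder for the brick disk, run as
> >> >> root):
> >> >>
> >> >>     # mark the device offline (the software unplug used in this test)
> >> >>     echo offline > /sys/block/sdx/device/state
> >> >>     # remove the SCSI device entirely (closer to a physical unplug)
> >> >>     echo 1 > /sys/block/sdx/device/delete
> >> >>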
> >> >> From the server where I unplugged the disk, I can see the following:
> >> >>
> >> >> [2016-08-12 10:33:41.916456] D [MSGID: 0]
> >> >> [io-threads.c:351:iot_schedule] 0-ad17hwssd7-io-threads: LOOKUP
> >> >> scheduled as fast fop
> >> >> [2016-08-12 10:33:41.916666] D [MSGID: 115050]
> >> >> [server-rpc-fops.c:179:server_lookup_cbk] 0-ad17hwssd7-server: 8127:
> >> >> LOOKUP /.shard/150e99ee-ce3b-4b57-8c40-99b4ecdf3822.90
> >> >> (be318638-e8a0-4c6d-977d-7a937aa84806/150e99ee-ce3b-4b57-8c40-99b4ecdf3822.90)
> >> >> ==> (No such file or directory) [No such file or directory]
> >> >> [2016-08-12 10:33:41.916804] D [MSGID: 101171]
> >> >> [client_t.c:417:gf_client_unref] 0-client_t:
> >> >> hp.dctopenstack.org-25780-2016/08/12-10:33:07:589960-ad17hwssd7-client-0-0-0:
> >> >> ref-count 1
> >> >> [2016-08-12 10:33:41.917098] D [MSGID: 101171]
> >> >> [client_t.c:333:gf_client_ref] 0-client_t:
> >> >> hp.dctopenstack.org-25780-2016/08/12-10:33:07:589960-ad17hwssd7-client-0-0-0:
> >> >> ref-count 2
> >> >> [2016-08-12 10:33:41.917145] W [MSGID: 115009]
> >> >> [server-resolve.c:571:server_resolve] 0-ad17hwssd7-server: no
> >> >> resolution type for (null) (LOOKUP)
> >> >> [2016-08-12 10:33:41.917182] E [MSGID: 115050]
> >> >> [server-rpc-fops.c:179:server_lookup_cbk] 0-ad17hwssd7-server: 8128:
> >> >> LOOKUP (null)
> >> >> (00000000-0000-0000-0000-000000000000/150e99ee-ce3b-4b57-8c40-99b4ecdf3822.90)
> >> >> ==> (Invalid argument) [Invalid argument]
> >> >>
> >> >> I am using gluster 3.7.10 and the configuration is as follow:
> >> >>
> >> >> diagnostics.brick-log-level: DEBUG
> >> >> diagnostics.client-log-level: DEBUG
> >> >> performance.io-thread-count: 16
> >> >> client.event-threads: 2
> >> >> server.event-threads: 2
> >> >> features.shard-block-size: 16MB
> >> >> features.shard: on
> >> >> server.allow-insecure: on
> >> >> storage.owner-uid: 165
> >> >> storage.owner-gid: 165
> >> >> nfs.disable: true
> >> >> performance.quick-read: off
> >> >> performance.io-cache: off
> >> >> performance.read-ahead: off
> >> >> performance.stat-prefetch: off
> >> >> cluster.lookup-optimize: on
> >> >> cluster.quorum-type: auto
> >> >> cluster.server-quorum-type: server
> >> >> transport.address-family: inet
> >> >> performance.readdir-ahead: on
> >> >>
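> >> >> These options were applied with the usual volume set commands, for
> >> >> example (invocations reconstructed here, not copied from my history):
> >> >>
> >> >>     gluster volume set ad17hwssd7 features.shard on
> >> >>     gluster volume set ad17hwssd7 features.shard-block-size 16MB
> >> >>     gluster volume set ad17hwssd7 cluster.quorum-type auto
> >> >>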
> >> >> This error only occurs with the sharding configuration. Have you
> >> >> performed this type of test before? Or do you think physically
> >> >> unplugging the HDD is a valid test case?
> >> >
> >> > If you use replica-3, things should settle down again. The kernel and
> >> > the brick process need a little time to find out that the filesystem on
> >> > the disk that you pulled out is not responding anymore. The output of
> >> > "gluster volume status" should show that the brick process is offline.
> >> > As long as you have quorum, things should continue after a small delay
> >> > while waiting to mark the brick offline.
> >> >
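> >> > A quick way to confirm that is to watch the brick and heal state while
> >> > the disk is pulled, for example (volume name taken from your logs):
> >> >
> >> >     gluster volume status ad17hwssd7
> >> >     gluster volume heal ad17hwssd7 info
> >> >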
> >> > People actually should test this scenario; power to disks can fail, or
> >> > even (connections to) RAID controllers. Hot-unplugging is definitely a
> >> > scenario that can emulate real-world problems.
> >> >
> >> > Niels
> >> _______________________________________________
> >> Gluster-devel mailing list
> >> Gluster-devel at gluster.org
> >> http://www.gluster.org/mailman/listinfo/gluster-devel
> >
> >
>