[Gluster-devel] Physical HDD unplug test

Niels de Vos ndevos at redhat.com
Tue Aug 16 08:45:08 UTC 2016


On Tue, Aug 16, 2016 at 01:34:36PM +0800, qingwei wei wrote:
> Hi,
> 
> I am currently trying to test the reliability of a distributed
> replica (3 replicas) volume when 1 brick is down. I tried both the
> software unplug method, by issuing echo offline >
> /sys/block/sdx/device/state, and physically unplugging the HDD, and I
> encountered 2 different outcomes. For the software unplug, the FIO
> workload continues to run, but after physically unplugging the HDD,
> the FIO workload cannot continue and fails with the following error:
> 
> [2016-08-12 10:33:41.854283] E [MSGID: 108008]
> [afr-transaction.c:1989:afr_transaction] 0-ad17hwssd7-replicate-0:
> Failing WRITE on gfid 665a43df-1ece-4c9a-a6ee-fcfa960d95bf:
> split-brain observed. [Input/output error]
> 
> From the server where I unplugged the disk, I can see the following:
> 
> [2016-08-12 10:33:41.916456] D [MSGID: 0]
> [io-threads.c:351:iot_schedule] 0-ad17hwssd7-io-threads: LOOKUP
> scheduled as fast fop
> [2016-08-12 10:33:41.916666] D [MSGID: 115050]
> [server-rpc-fops.c:179:server_lookup_cbk] 0-ad17hwssd7-server: 8127:
> LOOKUP /.shard/150e99ee-ce3b-4b57-8c40-99b4ecdf3822.90
> (be318638-e8a0-4c6d-977d-7a937aa84806/150e99ee-ce3b-4b57-8c40-99b4ecdf3822.90)
> ==> (No such file or directory) [No such file or directory]
> [2016-08-12 10:33:41.916804] D [MSGID: 101171]
> [client_t.c:417:gf_client_unref] 0-client_t:
> hp.dctopenstack.org-25780-2016/08/12-10:33:07:589960-ad17hwssd7-client-0-0-0:
> ref-count 1
> [2016-08-12 10:33:41.917098] D [MSGID: 101171]
> [client_t.c:333:gf_client_ref] 0-client_t:
> hp.dctopenstack.org-25780-2016/08/12-10:33:07:589960-ad17hwssd7-client-0-0-0:
> ref-count 2
> [2016-08-12 10:33:41.917145] W [MSGID: 115009]
> [server-resolve.c:571:server_resolve] 0-ad17hwssd7-server: no
> resolution type for (null) (LOOKUP)
> [2016-08-12 10:33:41.917182] E [MSGID: 115050]
> [server-rpc-fops.c:179:server_lookup_cbk] 0-ad17hwssd7-server: 8128:
> LOOKUP (null) (00000000-0000-0000-0000-000000000000/150e99ee-ce3b-4b57-8c40-99b4ecdf3822.90)
> ==> (Invalid argument) [Invalid argument]
> 
> I am using gluster 3.7.10 and the configuration is as follows:
> 
> diagnostics.brick-log-level: DEBUG
> diagnostics.client-log-level: DEBUG
> performance.io-thread-count: 16
> client.event-threads: 2
> server.event-threads: 2
> features.shard-block-size: 16MB
> features.shard: on
> server.allow-insecure: on
> storage.owner-uid: 165
> storage.owner-gid: 165
> nfs.disable: true
> performance.quick-read: off
> performance.io-cache: off
> performance.read-ahead: off
> performance.stat-prefetch: off
> cluster.lookup-optimize: on
> cluster.quorum-type: auto
> cluster.server-quorum-type: server
> transport.address-family: inet
> performance.readdir-ahead: on
> 
> This error only occurs with the sharding configuration. Have you
> performed this type of test before? Or do you think physically
> unplugging the HDD is a valid test case?
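
For reference, the software-unplug method described above amounts to
something like the sketch below (sdx, the mount point and the fio
options are placeholders, not taken from the report):

  # Take the disk backing the brick offline at the SCSI layer.
  echo offline > /sys/block/sdx/device/state

  # Keep I/O running against the Gluster mount point meanwhile.
  # These fio parameters are illustrative only.
  fio --name=unplug-test --directory=/mnt/glustervol --rw=write \
      --bs=16k --size=1g --time_based --runtime=300

  # Bring the disk back afterwards.
  echo running > /sys/block/sdx/device/state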

If you use replica-3, things should settle down again. The kernel and
the brick process need a little time to find out that the filesystem on
the disk that you pulled out is not responding anymore. The output of
"gluster volume status" should show that the brick process is offline.
As long as you have quorum, things should continue after a small delay
while waiting for the brick to be marked offline.
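
A quick way to watch that happen, using the volume name from the logs
above (the exact output format varies between versions):

  # The unplugged brick should eventually show as offline ("N" in the
  # Online column).
  gluster volume status ad17hwssd7

  # After the disk is back and the brick is running again, list the
  # files that still need to be healed onto it.
  gluster volume heal ad17hwssd7 info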

People actually should test this scenario: power to disks can fail, and
so can RAID controllers or the connections to them. Hot-unplugging is
definitely a scenario that can emulate real-world problems.
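
If physically pulling the disk is not practical, deleting the device
from the SCSI layer is a closer emulation of a hot-unplug than setting
the state to offline, because the block device disappears completely
(sdx and host0 are placeholders for your disk and its SCSI host):

  # Remove the disk from the system; in-flight I/O to it will fail,
  # much like with a physical unplug.
  echo 1 > /sys/block/sdx/device/delete

  # Later, rescan the SCSI host to rediscover the disk.
  echo "- - -" > /sys/class/scsi_host/host0/scan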

Niels