[Gluster-users] "layout is NULL", "Failed to get node-uuid for [...] and other errors during rebalancing in 3.3.1
Pierre-Francois Laquerre
pierre.francois at nec-labs.com
Fri Nov 30 16:47:33 UTC 2012
I started rebalancing my volume after updating from 3.2.7 to 3.3.1.
After a few hours, I noticed a large number of failures in the rebalance
status:
> Node       Rebalanced-files       size     scanned   failures    status
> ---------  ----------------  ---------  ----------  ---------  --------
> localhost                 0     0Bytes     4288805          0   stopped
> ml55                  26275    206.2MB     4277101      14159   stopped
> ml29                      0     0Bytes     4288844          0   stopped
> ml31                      0     0Bytes     4288937          0   stopped
> ml48                      0     0Bytes     4288927          0   stopped
> ml45                  15041     50.8MB     4284304      41999   stopped
> ml40                  40690    413.3MB     4269721       1012   stopped
> ml41                      0     0Bytes     4288898          0   stopped
> ml51                  28558    212.7MB     4277442      32195   stopped
> ml46                      0     0Bytes     4288909          0   stopped
> ml44                      0     0Bytes     4288824          0   stopped
> ml52                      0     0Bytes     4288849          0   stopped
> ml30                  14252    183.7MB     4270711      25336   stopped
> ml53                  31431    354.9MB     4280450      31098   stopped
> ml43                  13773      2.7GB     4285256      28574   stopped
> ml47                  37618    241.3MB     4266889      24916   stopped
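For reference, these are roughly the commands I ran to start the rebalance after the upgrade and to check its progress (standard gluster CLI, run as root on one of the peers):

    # started after updating the servers from 3.2.7 to 3.3.1
    gluster volume rebalance bigdata start
    # prints the per-node table shown above
    gluster volume rebalance bigdata status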
These failures prompted me to look at the rebalance log:
> [2012-11-30 11:06:12.533580] W [client3_1-fops.c:258:client3_1_mknod_cbk] 0-bigdata-client-12: remote operation failed: File exists. Path: /foo/data/onemil/dataset/bar/f8old/baz/85/m_85_269615212_b91ff3077e.t7 (00000000-0000-0000-0000-000000000000)
> [2012-11-30 11:06:12.533657] E [dht-common.c:1911:dht_getxattr] 0-bigdata-dht: layout is NULL
> [2012-11-30 11:06:12.533702] E [dht-rebalance.c:1150:gf_defrag_migrate_data] 0-bigdata-dht: Failed to get node-uuid for /foo/data/onemil/dataset/bar/f8old/baz/85/m_85_269615212_b91ff3077e.t7
> [2012-11-30 11:06:12.545497] W [client3_1-fops.c:258:client3_1_mknod_cbk] 0-bigdata-client-13: remote operation failed: File exists. Path: /foo/data/onemil/dataset/bar/f8old/baz/85/m_85_217961761_965f9f192b.t7 (00000000-0000-0000-0000-000000000000)
> [2012-11-30 11:06:12.546039] W [client3_1-fops.c:258:client3_1_mknod_cbk] 0-bigdata-client-12: remote operation failed: File exists. Path: /foo/data/onemil/dataset/bar/f8old/baz/85/m_85_217961761_965f9f192b.t7 (00000000-0000-0000-0000-000000000000)
> [2012-11-30 11:06:12.546159] E [dht-common.c:1911:dht_getxattr] 0-bigdata-dht: layout is NULL
> [2012-11-30 11:06:12.546199] E [dht-rebalance.c:1150:gf_defrag_migrate_data] 0-bigdata-dht: Failed to get node-uuid for /foo/data/onemil/dataset/bar/f8old/baz/85/m_85_217961761_965f9f192b.t7
> [2012-11-30 11:06:12.617940] W [client3_1-fops.c:258:client3_1_mknod_cbk] 0-bigdata-client-12: remote operation failed: File exists. Path: /foo/data/onemil/dataset/bar/f8old/baz/85/m_85_211665292_59a24211c3.t7 (00000000-0000-0000-0000-000000000000)
> [2012-11-30 11:06:12.618024] W [client3_1-fops.c:258:client3_1_mknod_cbk] 0-bigdata-client-13: remote operation failed: File exists. Path: /foo/data/onemil/dataset/bar/f8old/baz/85/m_85_211665292_59a24211c3.t7 (00000000-0000-0000-0000-000000000000)
> [2012-11-30 11:06:12.618150] E [dht-common.c:1911:dht_getxattr] 0-bigdata-dht: layout is NULL
> [2012-11-30 11:06:12.618189] E [dht-rebalance.c:1150:gf_defrag_migrate_data] 0-bigdata-dht: Failed to get node-uuid for /foo/data/onemil/dataset/bar/f8old/baz/85/m_85_211665292_59a24211c3.t7
> [2012-11-30 11:06:12.620798] I [dht-common.c:954:dht_lookup_everywhere_cbk] 0-bigdata-dht: deleting stale linkfile /foo/data/onemil/dataset/bar/f8old/baz/85/m_85_282643649_15d4108d95.t7 on bigdata-replicate-6
[...] (at this point, I stopped rebalancing, and got the following in
the logs)
> [2012-11-30 11:06:33.152153] E [dht-rebalance.c:1374:gf_defrag_fix_layout] 0-bigdata-dht: Fix layout failed for /foo/data/onemil/dataset/bar/f8old/baz/85
> [2012-11-30 11:06:33.153628] E [dht-rebalance.c:1374:gf_defrag_fix_layout] 0-bigdata-dht: Fix layout failed for /foo/data/onemil/dataset/bar/f8old/baz
> [2012-11-30 11:06:33.154641] E [dht-rebalance.c:1374:gf_defrag_fix_layout] 0-bigdata-dht: Fix layout failed for /foo/data/onemil/dataset/bar/f8old
> [2012-11-30 11:06:33.155602] E [dht-rebalance.c:1374:gf_defrag_fix_layout] 0-bigdata-dht: Fix layout failed for /foo/data/onemil/dataset/bar
> [2012-11-30 11:06:33.156552] E [dht-rebalance.c:1374:gf_defrag_fix_layout] 0-bigdata-dht: Fix layout failed for /foo/data/onemil/dataset
> [2012-11-30 11:06:33.157538] E [dht-rebalance.c:1374:gf_defrag_fix_layout] 0-bigdata-dht: Fix layout failed for /foo/data/onemil
> [2012-11-30 11:06:33.158526] E [dht-rebalance.c:1374:gf_defrag_fix_layout] 0-bigdata-dht: Fix layout failed for /foo/data
> [2012-11-30 11:06:33.159459] E [dht-rebalance.c:1374:gf_defrag_fix_layout] 0-bigdata-dht: Fix layout failed for /foo
> [2012-11-30 11:06:33.160496] I [dht-rebalance.c:1626:gf_defrag_status_get] 0-glusterfs: Rebalance is stopped
> [2012-11-30 11:06:33.160518] I [dht-rebalance.c:1629:gf_defrag_status_get] 0-glusterfs: Files migrated: 14252, size: 192620657, lookups: 4270711, failures: 25336
> [2012-11-30 11:06:33.173344] W [glusterfsd.c:831:cleanup_and_exit] (-->/lib64/libc.so.6(clone+0x6d) [0x3d676e811d] (-->/lib64/libpthread.so.0() [0x3d68207851] (-->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xdd) [0x405d4d]))) 0-: received signum (15), shutting down
Errors like these kept appearing many times per second, so I cancelled the rebalance in case it was doing any damage. There doesn't seem to be anything unusual in the system logs of the bricks that hold the files mentioned in the errors, and the files are still accessible through my mounted volume.
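If it helps with debugging, I can also dump the DHT layout xattr on the brick directories and query the node-uuid xattr through the mount for one of the affected files, along these lines (just a sketch: /mnt/bigdata stands in for my actual FUSE mount point, and the brick path would be whichever bricks hold that directory):

    # on a brick server, as root: the directory layout that DHT reports as NULL
    getfattr -m . -d -e hex /mnt/localb/foo/data/onemil/dataset/bar/f8old/baz/85
    # through the FUSE mount: the virtual xattr the rebalance fails to fetch
    getfattr -n trusted.glusterfs.node-uuid \
        /mnt/bigdata/foo/data/onemil/dataset/bar/f8old/baz/85/m_85_269615212_b91ff3077e.t7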
Any idea what might be wrong?
Thanks,
Pierre
volume info:
> Volume Name: bigdata
> Type: Distributed-Replicate
> Volume ID: 56498956-7b4b-4ee3-9d2b-4c8cfce26051
> Status: Started
> Number of Bricks: 20 x 2 = 40
> Transport-type: tcp
> Bricks:
> Brick1: ml43:/mnt/localb
> Brick2: ml44:/mnt/localb
> Brick3: ml43:/mnt/localc
> Brick4: ml44:/mnt/localc
> Brick5: ml45:/mnt/localb
> Brick6: ml46:/mnt/localb
> Brick7: ml45:/mnt/localc
> Brick8: ml46:/mnt/localc
> Brick9: ml47:/mnt/localb
> Brick10: ml48:/mnt/localb
> Brick11: ml47:/mnt/localc
> Brick12: ml48:/mnt/localc
> Brick13: ml45:/mnt/locald
> Brick14: ml46:/mnt/locald
> Brick15: ml47:/mnt/locald
> Brick16: ml48:/mnt/locald
> Brick17: ml51:/mnt/localb
> Brick18: ml52:/mnt/localb
> Brick19: ml51:/mnt/localc
> Brick20: ml52:/mnt/localc
> Brick21: ml51:/mnt/locald
> Brick22: ml52:/mnt/locald
> Brick23: ml53:/mnt/locald
> Brick24: ml54:/mnt/locald
> Brick25: ml53:/mnt/localc
> Brick26: ml54:/mnt/localc
> Brick27: ml53:/mnt/localb
> Brick28: ml54:/mnt/localb
> Brick29: ml55:/mnt/localb
> Brick30: ml29:/mnt/localb
> Brick31: ml55:/mnt/localc
> Brick32: ml29:/mnt/localc
> Brick33: ml30:/mnt/localc
> Brick34: ml31:/mnt/localc
> Brick35: ml30:/mnt/localb
> Brick36: ml31:/mnt/localb
> Brick37: ml40:/mnt/localb
> Brick38: ml41:/mnt/localb
> Brick39: ml40:/mnt/localc
> Brick40: ml41:/mnt/localc
> Options Reconfigured:
> performance.quick-read: on
> nfs.disable: on
> nfs.register-with-portmap: OFF
volume status:
> Status of volume: bigdata
> Gluster process                                         Port    Online  Pid
> ------------------------------------------------------------------------------
> Brick ml43:/mnt/localb                                  24012   Y       2694
> Brick ml44:/mnt/localb                                  24012   Y       20374
> Brick ml43:/mnt/localc                                  24013   Y       2699
> Brick ml44:/mnt/localc                                  24013   Y       20379
> Brick ml45:/mnt/localb                                  24012   Y       3147
> Brick ml46:/mnt/localb                                  24012   Y       25789
> Brick ml45:/mnt/localc                                  24013   Y       3152
> Brick ml46:/mnt/localc                                  24013   Y       25794
> Brick ml47:/mnt/localb                                  24012   Y       3181
> Brick ml48:/mnt/localb                                  24012   Y       4852
> Brick ml47:/mnt/localc                                  24013   Y       3186
> Brick ml48:/mnt/localc                                  24013   Y       4857
> Brick ml45:/mnt/locald                                  24014   Y       3157
> Brick ml46:/mnt/locald                                  24014   Y       25799
> Brick ml47:/mnt/locald                                  24014   Y       3191
> Brick ml48:/mnt/locald                                  24014   Y       4862
> Brick ml51:/mnt/localb                                  24009   Y       30251
> Brick ml52:/mnt/localb                                  24012   Y       28541
> Brick ml51:/mnt/localc                                  24010   Y       30256
> Brick ml52:/mnt/localc                                  24013   Y       28546
> Brick ml51:/mnt/locald                                  24011   Y       30261
> Brick ml52:/mnt/locald                                  24014   Y       28551
> Brick ml53:/mnt/locald                                  24012   Y       9229
> Brick ml54:/mnt/locald                                  24012   Y       9341
> Brick ml53:/mnt/localc                                  24013   Y       9234
> Brick ml54:/mnt/localc                                  24013   Y       9346
> Brick ml53:/mnt/localb                                  24014   Y       9239
> Brick ml54:/mnt/localb                                  24014   Y       9351
> Brick ml55:/mnt/localb                                  24012   Y       30904
> Brick ml29:/mnt/localb                                  24012   Y       29233
> Brick ml55:/mnt/localc                                  24013   Y       30909
> Brick ml29:/mnt/localc                                  24013   Y       29238
> Brick ml30:/mnt/localc                                  24012   Y       6800
> Brick ml31:/mnt/localc                                  N/A     Y       22000
> Brick ml30:/mnt/localb                                  24013   Y       6805
> Brick ml31:/mnt/localb                                  N/A     Y       22005
> Brick ml40:/mnt/localb                                  24012   Y       26700
> Brick ml41:/mnt/localb                                  24012   Y       25762
> Brick ml40:/mnt/localc                                  24013   Y       26705
> Brick ml41:/mnt/localc                                  24013   Y       25767
> Self-heal Daemon on localhost                           N/A     Y       20392
> Self-heal Daemon on ml55                                N/A     Y       30922
> Self-heal Daemon on ml54                                N/A     Y       9365
> Self-heal Daemon on ml52                                N/A     Y       28565
> Self-heal Daemon on ml29                                N/A     Y       29253
> Self-heal Daemon on ml30                                N/A     Y       6818
> Self-heal Daemon on ml43                                N/A     Y       2712
> Self-heal Daemon on ml47                                N/A     Y       3205
> Self-heal Daemon on ml46                                N/A     Y       25813
> Self-heal Daemon on ml40                                N/A     Y       26717
> Self-heal Daemon on ml31                                N/A     Y       22038
> Self-heal Daemon on ml48                                N/A     Y       4876
> Self-heal Daemon on ml45                                N/A     Y       3171
> Self-heal Daemon on ml51                                N/A     Y       30274
> Self-heal Daemon on ml41                                N/A     Y       25779
> Self-heal Daemon on ml53                                N/A     Y       9253
peer status:
> Number of Peers: 15
>
> Hostname: ml52
> Uuid: 4de42f67-4cca-4d28-8600-9018172563ba
> State: Peer in Cluster (Connected)
>
> Hostname: ml41
> Uuid: b404851f-dfd5-4746-a3bd-81bb0d888009
> State: Peer in Cluster (Connected)
>
> Hostname: ml46
> Uuid: af74d39b-09d6-47ba-9c3b-72d993dca4ce
> State: Peer in Cluster (Connected)
>
> Hostname: ml54
> Uuid: c55580fa-2c9d-493d-b9d1-3bce016c8b29
> State: Peer in Cluster (Connected)
>
> Hostname: ml51
> Uuid: 5491b6dc-0f96-43d9-95d9-a41018a8542c
> State: Peer in Cluster (Connected)
>
> Hostname: ml48
> Uuid: efd79145-bfd9-4eea-b7a7-50be18d9ffe0
> State: Peer in Cluster (Connected)
>
> Hostname: ml43
> Uuid: a9044e9a-39e1-4907-8921-43da870b7f31
> State: Peer in Cluster (Connected)
>
> Hostname: ml45
> Uuid: 0eebbceb-8f62-4c55-8160-41348f90e191
> State: Peer in Cluster (Connected)
>
> Hostname: ml47
> Uuid: e831092d-b196-46ec-947d-a5635e8fbd1e
> State: Peer in Cluster (Connected)
>
> Hostname: ml30
> Uuid: e56b4c57-a058-4464-a1e6-c4676ebf00cc
> State: Peer in Cluster (Connected)
>
> Hostname: ml40
> Uuid: ffcc06ae-100a-4fa2-888e-803a41ae946c
> State: Peer in Cluster (Connected)
>
> Hostname: ml55
> Uuid: 366339ed-52e5-4722-a1b3-e3bb1c49ea4f
> State: Peer in Cluster (Connected)
>
> Hostname: ml31
> Uuid: 699019f6-2f4a-45cb-bfa4-f209745f8a6d
> State: Peer in Cluster (Connected)
>
> Hostname: ml29
> Uuid: 58aa8a16-5d2b-4c06-8f06-2fd0f7fc5a37
> State: Peer in Cluster (Connected)
>
> Hostname: ml53
> Uuid: 1dc6ee08-c606-4755-8756-b553f66efa88
> State: Peer in Cluster (Connected)
gluster version:
> glusterfs 3.3.1 built on Oct 11 2012 21:49:37
rpms:
> glusterfs.x86_64 3.3.1-1.el6 @glusterfs-epel
> glusterfs-debuginfo.x86_64 3.3.1-1.el6 @glusterfs-epel
> glusterfs-fuse.x86_64 3.3.1-1.el6 @glusterfs-epel
> glusterfs-rdma.x86_64 3.3.1-1.el6 @glusterfs-epel
> glusterfs-server.x86_64 3.3.1-1.el6 @glusterfs-epel
kernel:
> Linux 2.6.32-131.17.1.el6.x86_64 #1 SMP Wed Oct 5 17:19:54 CDT 2011
> x86_64 x86_64 x86_64 GNU/Linux
OS: Scientific Linux 6.1 (a RHEL rebuild, similar to CentOS)
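In case it matters, the details above were collected with roughly these commands:

    gluster volume info bigdata
    gluster volume status bigdata
    gluster peer status
    glusterfs --version
    yum list installed 'glusterfs*'
    uname -a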