[Gluster-users] "layout is NULL", "Failed to get node-uuid for [...] and other errors during rebalancing in 3.3.1
Pierre-Francois Laquerre
pierre.francois at nec-labs.com
Fri Nov 30 16:47:33 UTC 2012
I started rebalancing my volume after updating from 3.2.7 to 3.3.1.
After a few hours, I noticed a large number of failures in the rebalance
status:
> Node       Rebalanced-files       size     scanned   failures    status
> ---------  ----------------  ---------  ----------  ---------  --------
> localhost                 0     0Bytes     4288805          0   stopped
> ml55                  26275    206.2MB     4277101      14159   stopped
> ml29                      0     0Bytes     4288844          0   stopped
> ml31                      0     0Bytes     4288937          0   stopped
> ml48                      0     0Bytes     4288927          0   stopped
> ml45                  15041     50.8MB     4284304      41999   stopped
> ml40                  40690    413.3MB     4269721       1012   stopped
> ml41                      0     0Bytes     4288898          0   stopped
> ml51                  28558    212.7MB     4277442      32195   stopped
> ml46                      0     0Bytes     4288909          0   stopped
> ml44                      0     0Bytes     4288824          0   stopped
> ml52                      0     0Bytes     4288849          0   stopped
> ml30                  14252    183.7MB     4270711      25336   stopped
> ml53                  31431    354.9MB     4280450      31098   stopped
> ml43                  13773      2.7GB     4285256      28574   stopped
> ml47                  37618    241.3MB     4266889      24916   stopped
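For reference, these are roughly the commands I ran to start the rebalance after the upgrade and to check its progress (standard gluster CLI, run as root on one of the peers):

    # started after updating the servers from 3.2.7 to 3.3.1
    gluster volume rebalance bigdata start
    # prints the per-node table shown above
    gluster volume rebalance bigdata status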
These failures prompted me to look at the rebalance log:
> [2012-11-30 11:06:12.533580] W [client3_1-fops.c:258:client3_1_mknod_cbk] 0-bigdata-client-12: remote operation failed: File exists. Path: /foo/data/onemil/dataset/bar/f8old/baz/85/m_85_269615212_b91ff3077e.t7 (00000000-0000-0000-0000-000000000000)
> [2012-11-30 11:06:12.533657] E [dht-common.c:1911:dht_getxattr] 0-bigdata-dht: layout is NULL
> [2012-11-30 11:06:12.533702] E [dht-rebalance.c:1150:gf_defrag_migrate_data] 0-bigdata-dht: Failed to get node-uuid for /foo/data/onemil/dataset/bar/f8old/baz/85/m_85_269615212_b91ff3077e.t7
> [2012-11-30 11:06:12.545497] W [client3_1-fops.c:258:client3_1_mknod_cbk] 0-bigdata-client-13: remote operation failed: File exists. Path: /foo/data/onemil/dataset/bar/f8old/baz/85/m_85_217961761_965f9f192b.t7 (00000000-0000-0000-0000-000000000000)
> [2012-11-30 11:06:12.546039] W [client3_1-fops.c:258:client3_1_mknod_cbk] 0-bigdata-client-12: remote operation failed: File exists. Path: /foo/data/onemil/dataset/bar/f8old/baz/85/m_85_217961761_965f9f192b.t7 (00000000-0000-0000-0000-000000000000)
> [2012-11-30 11:06:12.546159] E [dht-common.c:1911:dht_getxattr] 0-bigdata-dht: layout is NULL
> [2012-11-30 11:06:12.546199] E [dht-rebalance.c:1150:gf_defrag_migrate_data] 0-bigdata-dht: Failed to get node-uuid for /foo/data/onemil/dataset/bar/f8old/baz/85/m_85_217961761_965f9f192b.t7
> [2012-11-30 11:06:12.617940] W [client3_1-fops.c:258:client3_1_mknod_cbk] 0-bigdata-client-12: remote operation failed: File exists. Path: /foo/data/onemil/dataset/bar/f8old/baz/85/m_85_211665292_59a24211c3.t7 (00000000-0000-0000-0000-000000000000)
> [2012-11-30 11:06:12.618024] W [client3_1-fops.c:258:client3_1_mknod_cbk] 0-bigdata-client-13: remote operation failed: File exists. Path: /foo/data/onemil/dataset/bar/f8old/baz/85/m_85_211665292_59a24211c3.t7 (00000000-0000-0000-0000-000000000000)
> [2012-11-30 11:06:12.618150] E [dht-common.c:1911:dht_getxattr] 0-bigdata-dht: layout is NULL
> [2012-11-30 11:06:12.618189] E [dht-rebalance.c:1150:gf_defrag_migrate_data] 0-bigdata-dht: Failed to get node-uuid for /foo/data/onemil/dataset/bar/f8old/baz/85/m_85_211665292_59a24211c3.t7
> [2012-11-30 11:06:12.620798] I [dht-common.c:954:dht_lookup_everywhere_cbk] 0-bigdata-dht: deleting stale linkfile /foo/data/onemil/dataset/bar/f8old/baz/85/m_85_282643649_15d4108d95.t7 on bigdata-replicate-6
[...] (at this point, I stopped rebalancing, and got the following in
the logs)
> [2012-11-30 11:06:33.152153] E [dht-rebalance.c:1374:gf_defrag_fix_layout] 0-bigdata-dht: Fix layout failed for /foo/data/onemil/dataset/bar/f8old/baz/85
> [2012-11-30 11:06:33.153628] E [dht-rebalance.c:1374:gf_defrag_fix_layout] 0-bigdata-dht: Fix layout failed for /foo/data/onemil/dataset/bar/f8old/baz
> [2012-11-30 11:06:33.154641] E [dht-rebalance.c:1374:gf_defrag_fix_layout] 0-bigdata-dht: Fix layout failed for /foo/data/onemil/dataset/bar/f8old
> [2012-11-30 11:06:33.155602] E [dht-rebalance.c:1374:gf_defrag_fix_layout] 0-bigdata-dht: Fix layout failed for /foo/data/onemil/dataset/bar
> [2012-11-30 11:06:33.156552] E [dht-rebalance.c:1374:gf_defrag_fix_layout] 0-bigdata-dht: Fix layout failed for /foo/data/onemil/dataset
> [2012-11-30 11:06:33.157538] E [dht-rebalance.c:1374:gf_defrag_fix_layout] 0-bigdata-dht: Fix layout failed for /foo/data/onemil
> [2012-11-30 11:06:33.158526] E [dht-rebalance.c:1374:gf_defrag_fix_layout] 0-bigdata-dht: Fix layout failed for /foo/data
> [2012-11-30 11:06:33.159459] E [dht-rebalance.c:1374:gf_defrag_fix_layout] 0-bigdata-dht: Fix layout failed for /foo
> [2012-11-30 11:06:33.160496] I [dht-rebalance.c:1626:gf_defrag_status_get] 0-glusterfs: Rebalance is stopped
> [2012-11-30 11:06:33.160518] I [dht-rebalance.c:1629:gf_defrag_status_get] 0-glusterfs: Files migrated: 14252, size: 192620657, lookups: 4270711, failures: 25336
> [2012-11-30 11:06:33.173344] W [glusterfsd.c:831:cleanup_and_exit] (-->/lib64/libc.so.6(clone+0x6d) [0x3d676e811d] (-->/lib64/libpthread.so.0() [0x3d68207851] (-->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xdd) [0x405d4d]))) 0-: received signum (15), shutting down
Errors like these kept appearing many times per second, so I cancelled the rebalance in case it was doing any damage. There doesn't seem to be anything unusual in the system logs of the bricks that hold the files mentioned in the errors, and the files are still accessible through my mounted volume.
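If it helps with debugging, I can also dump the DHT layout xattr on the brick directories and query the node-uuid xattr through the mount for one of the affected files, along these lines (just a sketch: /mnt/bigdata stands in for my actual FUSE mount point, and the brick path would be whichever bricks hold that directory):

    # on a brick server, as root: the directory layout that DHT reports as NULL
    getfattr -m . -d -e hex /mnt/localb/foo/data/onemil/dataset/bar/f8old/baz/85
    # through the FUSE mount: the virtual xattr the rebalance fails to fetch
    getfattr -n trusted.glusterfs.node-uuid \
        /mnt/bigdata/foo/data/onemil/dataset/bar/f8old/baz/85/m_85_269615212_b91ff3077e.t7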
Any idea what might be wrong?
Thanks,
Pierre
volume info:
> Volume Name: bigdata
> Type: Distributed-Replicate
> Volume ID: 56498956-7b4b-4ee3-9d2b-4c8cfce26051
> Status: Started
> Number of Bricks: 20 x 2 = 40
> Transport-type: tcp
> Bricks:
> Brick1: ml43:/mnt/localb
> Brick2: ml44:/mnt/localb
> Brick3: ml43:/mnt/localc
> Brick4: ml44:/mnt/localc
> Brick5: ml45:/mnt/localb
> Brick6: ml46:/mnt/localb
> Brick7: ml45:/mnt/localc
> Brick8: ml46:/mnt/localc
> Brick9: ml47:/mnt/localb
> Brick10: ml48:/mnt/localb
> Brick11: ml47:/mnt/localc
> Brick12: ml48:/mnt/localc
> Brick13: ml45:/mnt/locald
> Brick14: ml46:/mnt/locald
> Brick15: ml47:/mnt/locald
> Brick16: ml48:/mnt/locald
> Brick17: ml51:/mnt/localb
> Brick18: ml52:/mnt/localb
> Brick19: ml51:/mnt/localc
> Brick20: ml52:/mnt/localc
> Brick21: ml51:/mnt/locald
> Brick22: ml52:/mnt/locald
> Brick23: ml53:/mnt/locald
> Brick24: ml54:/mnt/locald
> Brick25: ml53:/mnt/localc
> Brick26: ml54:/mnt/localc
> Brick27: ml53:/mnt/localb
> Brick28: ml54:/mnt/localb
> Brick29: ml55:/mnt/localb
> Brick30: ml29:/mnt/localb
> Brick31: ml55:/mnt/localc
> Brick32: ml29:/mnt/localc
> Brick33: ml30:/mnt/localc
> Brick34: ml31:/mnt/localc
> Brick35: ml30:/mnt/localb
> Brick36: ml31:/mnt/localb
> Brick37: ml40:/mnt/localb
> Brick38: ml41:/mnt/localb
> Brick39: ml40:/mnt/localc
> Brick40: ml41:/mnt/localc
> Options Reconfigured:
> performance.quick-read: on
> nfs.disable: on
> nfs.register-with-portmap: OFF
volume status:
> Status of volume: bigdata
> Gluster process                                         Port    Online  Pid
> ------------------------------------------------------------------------------
> Brick ml43:/mnt/localb                                  24012   Y       2694
> Brick ml44:/mnt/localb                                  24012   Y       20374
> Brick ml43:/mnt/localc                                  24013   Y       2699
> Brick ml44:/mnt/localc                                  24013   Y       20379
> Brick ml45:/mnt/localb                                  24012   Y       3147
> Brick ml46:/mnt/localb                                  24012   Y       25789
> Brick ml45:/mnt/localc                                  24013   Y       3152
> Brick ml46:/mnt/localc                                  24013   Y       25794
> Brick ml47:/mnt/localb                                  24012   Y       3181
> Brick ml48:/mnt/localb                                  24012   Y       4852
> Brick ml47:/mnt/localc                                  24013   Y       3186
> Brick ml48:/mnt/localc                                  24013   Y       4857
> Brick ml45:/mnt/locald                                  24014   Y       3157
> Brick ml46:/mnt/locald                                  24014   Y       25799
> Brick ml47:/mnt/locald                                  24014   Y       3191
> Brick ml48:/mnt/locald                                  24014   Y       4862
> Brick ml51:/mnt/localb                                  24009   Y       30251
> Brick ml52:/mnt/localb                                  24012   Y       28541
> Brick ml51:/mnt/localc                                  24010   Y       30256
> Brick ml52:/mnt/localc                                  24013   Y       28546
> Brick ml51:/mnt/locald                                  24011   Y       30261
> Brick ml52:/mnt/locald                                  24014   Y       28551
> Brick ml53:/mnt/locald                                  24012   Y       9229
> Brick ml54:/mnt/locald                                  24012   Y       9341
> Brick ml53:/mnt/localc                                  24013   Y       9234
> Brick ml54:/mnt/localc                                  24013   Y       9346
> Brick ml53:/mnt/localb                                  24014   Y       9239
> Brick ml54:/mnt/localb                                  24014   Y       9351
> Brick ml55:/mnt/localb                                  24012   Y       30904
> Brick ml29:/mnt/localb                                  24012   Y       29233
> Brick ml55:/mnt/localc                                  24013   Y       30909
> Brick ml29:/mnt/localc                                  24013   Y       29238
> Brick ml30:/mnt/localc                                  24012   Y       6800
> Brick ml31:/mnt/localc                                  N/A     Y       22000
> Brick ml30:/mnt/localb                                  24013   Y       6805
> Brick ml31:/mnt/localb                                  N/A     Y       22005
> Brick ml40:/mnt/localb                                  24012   Y       26700
> Brick ml41:/mnt/localb                                  24012   Y       25762
> Brick ml40:/mnt/localc                                  24013   Y       26705
> Brick ml41:/mnt/localc                                  24013   Y       25767
> Self-heal Daemon on localhost                           N/A     Y       20392
> Self-heal Daemon on ml55                                N/A     Y       30922
> Self-heal Daemon on ml54                                N/A     Y       9365
> Self-heal Daemon on ml52                                N/A     Y       28565
> Self-heal Daemon on ml29                                N/A     Y       29253
> Self-heal Daemon on ml30                                N/A     Y       6818
> Self-heal Daemon on ml43                                N/A     Y       2712
> Self-heal Daemon on ml47                                N/A     Y       3205
> Self-heal Daemon on ml46                                N/A     Y       25813
> Self-heal Daemon on ml40                                N/A     Y       26717
> Self-heal Daemon on ml31                                N/A     Y       22038
> Self-heal Daemon on ml48                                N/A     Y       4876
> Self-heal Daemon on ml45                                N/A     Y       3171
> Self-heal Daemon on ml51                                N/A     Y       30274
> Self-heal Daemon on ml41                                N/A     Y       25779
> Self-heal Daemon on ml53                                N/A     Y       9253
peer status:
> Number of Peers: 15
>
> Hostname: ml52
> Uuid: 4de42f67-4cca-4d28-8600-9018172563ba
> State: Peer in Cluster (Connected)
>
> Hostname: ml41
> Uuid: b404851f-dfd5-4746-a3bd-81bb0d888009
> State: Peer in Cluster (Connected)
>
> Hostname: ml46
> Uuid: af74d39b-09d6-47ba-9c3b-72d993dca4ce
> State: Peer in Cluster (Connected)
>
> Hostname: ml54
> Uuid: c55580fa-2c9d-493d-b9d1-3bce016c8b29
> State: Peer in Cluster (Connected)
>
> Hostname: ml51
> Uuid: 5491b6dc-0f96-43d9-95d9-a41018a8542c
> State: Peer in Cluster (Connected)
>
> Hostname: ml48
> Uuid: efd79145-bfd9-4eea-b7a7-50be18d9ffe0
> State: Peer in Cluster (Connected)
>
> Hostname: ml43
> Uuid: a9044e9a-39e1-4907-8921-43da870b7f31
> State: Peer in Cluster (Connected)
>
> Hostname: ml45
> Uuid: 0eebbceb-8f62-4c55-8160-41348f90e191
> State: Peer in Cluster (Connected)
>
> Hostname: ml47
> Uuid: e831092d-b196-46ec-947d-a5635e8fbd1e
> State: Peer in Cluster (Connected)
>
> Hostname: ml30
> Uuid: e56b4c57-a058-4464-a1e6-c4676ebf00cc
> State: Peer in Cluster (Connected)
>
> Hostname: ml40
> Uuid: ffcc06ae-100a-4fa2-888e-803a41ae946c
> State: Peer in Cluster (Connected)
>
> Hostname: ml55
> Uuid: 366339ed-52e5-4722-a1b3-e3bb1c49ea4f
> State: Peer in Cluster (Connected)
>
> Hostname: ml31
> Uuid: 699019f6-2f4a-45cb-bfa4-f209745f8a6d
> State: Peer in Cluster (Connected)
>
> Hostname: ml29
> Uuid: 58aa8a16-5d2b-4c06-8f06-2fd0f7fc5a37
> State: Peer in Cluster (Connected)
>
> Hostname: ml53
> Uuid: 1dc6ee08-c606-4755-8756-b553f66efa88
> State: Peer in Cluster (Connected)
gluster version:
> glusterfs 3.3.1 built on Oct 11 2012 21:49:37
rpms:
> glusterfs.x86_64 3.3.1-1.el6 @glusterfs-epel
> glusterfs-debuginfo.x86_64 3.3.1-1.el6 @glusterfs-epel
> glusterfs-fuse.x86_64 3.3.1-1.el6 @glusterfs-epel
> glusterfs-rdma.x86_64 3.3.1-1.el6 @glusterfs-epel
> glusterfs-server.x86_64 3.3.1-1.el6 @glusterfs-epel
kernel:
> Linux 2.6.32-131.17.1.el6.x86_64 #1 SMP Wed Oct 5 17:19:54 CDT 2011
> x86_64 x86_64 x86_64 GNU/Linux
OS: Scientific Linux 6.1 (a RHEL rebuild, similar to CentOS)
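In case it matters, the details above were collected with roughly these commands:

    gluster volume info bigdata
    gluster volume status bigdata
    gluster peer status
    glusterfs --version
    yum list installed 'glusterfs*'
    uname -a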