[Gluster-users] Problem listing files on geo-replicated volume after upgrade to 3.4.6
Dietmar Putz
putz at 3qmedien.net
Fri Apr 17 16:53:02 UTC 2015
Hello all,
we have a problem on a geo-replicated volume after upgrading from
glusterfs 3.3.2 to 3.4.6 on ubuntu 12.04.5 lts.
e.g. an 'ls -l' on the mounted geo-replicated volume does not show the
entire content, while the same command on the underlying bricks does.
the events in chronological order:
we are running a 6-node distributed-replicated volume (vol1) which is
geo-replicated to a 4-node distributed-replicated volume (vol2).
disk space on vol2 became insufficient, so we needed to add two further
nodes.
vol1 and vol2 were running on ubuntu 12.04 lts / glusterfs 3.3.2.
we stopped the geo-replication, stopped vol2 and updated the nodes of
vol2 to the latest ubuntu 12.04.5 release (dist-upgrade) and to
glusterfs 3.4.6. all gluster clients which make use of vol2 were also
updated from glusterfs-client 3.3.2 to 3.4.6.
then we added two further bricks to vol2 with the same software level
(ubuntu 12.04.5 lts, gfs 3.4.6) as the first four nodes and started
vol2 again.
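roughly, the sequence was the following (hostnames and the slave url are
placeholders, the brick path is /gluster-export as on the existing nodes):

# on the master node of vol1: stop geo-replication (slave url is a placeholder)
gluster volume geo-replication vol1 slave-host::vol2 stop

# on one vol2 node: stop the slave volume before the upgrade
gluster volume stop vol2

# then dist-upgrade + glusterfs 3.4.6 on every vol2 node and on the clients

# after the upgrade: add one new replica pair (hostnames are placeholders)
gluster peer probe node5
gluster peer probe node6
gluster volume add-brick vol2 node5:/gluster-export node6:/gluster-export
gluster volume start vol2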
afterwards we started a rebalance process on vol2 and the
geo-replication on the master node of vol1. a check-script on the
geo-replication master copies/deletes a testfile on vol1 depending on
the existence of that file on vol2. everything seemed to be ok so far...
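i.e. roughly the following (slave url again a placeholder), plus a
simplified sketch of what the check-script does:

gluster volume rebalance vol2 start
gluster volume geo-replication vol1 slave-host::vol2 start

# simplified sketch of the check-script; /sdn is the vol1 mount on the
# master, /sdn-slave is a mount of vol2 used only for checking
# (both paths here are placeholders)
if [ -e /sdn-slave/geo-rep-testfile ]; then
    rm /sdn/geo-rep-testfile      # testfile arrived on vol2 -> delete it on vol1
else
    touch /sdn/geo-rep-testfile   # testfile gone on vol2 -> create it on vol1 again
fi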
after the rebalance process had finished (without errors) we observed an
abnormality on vol2... the data on vol2 is unequally distributed...
the first two pairs show a brick usage of about 80% while the last added
pair shows a brick usage of about 50%. so we restarted the rebalance
process twice, but nothing changed...
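the brick usage above is what df reports for the brick filesystem on each
node; the rebalance runs can be checked with the usual status command:

# on each vol2 node: usage of the brick filesystem
df -h /gluster-export

# progress/result of the rebalance
gluster volume rebalance vol2 status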
however, more critical than that is the fact that since the update and
expansion of vol2 we cannot see/access all files by default on the
mounted vol2, while the files are visible in their brick directories...
example 1:
vol1 contains 446 files/directories, for e.g. directory /sdn/1051
vol1 is mounted to /sdn :
[ 15:54:28 ] - root at vol1 /sdn $ls -l | wc -l
446
[ 15:55:06 ] - root at vol1 /sdn $ls -l | grep 1051
drwxrwxrwx 5 1007 1013 12288 Jan 22 07:42 1051
[ 15:55:46 ] - root at vol1 /sdn $du -ks 1051
5588129 1051
[ 15:56:03 ] - root at vol1 /sdn $
vol2 contains 304 files/directories, but 1051 is not listed. when i run
'du -ks /sdn/1051' or 'ls -l /sdn/1051' on vol2, the directory becomes
visible...
vol2 is mounted to /sdn :
[ 15:54:35 ] - root at vol2 /sdn $ls | wc -l
304
[ 15:56:19 ] - root at vol2 /sdn $ls -l | grep 1051
[ 15:56:28 ] - root at vol2 /sdn $du -ks 1051
5588001 1051
[ 15:56:43 ] - root at vol2 /sdn $ls -l | grep 1051
drwxrwxrwx 5 1007 1013 8255 Apr 17 15:56 1051
[ 15:56:59 ] - root at vol2 /sdn $ls | wc -l
305
example 2:
directory 2098 is visible on the brick but not on the gluster volume.
after listing the directory explicitly it is visible on the gluster
volume again.
[ 16:11:00 ] - root at vol2 /sdn $ls | grep 2098
[ 16:12:21 ] - root at vol2 /sdn $ls -l /gluster-export/ | grep 2098
drwxrwxrwx 4 1015 1013 4096 Jan 18 03:07 2098
[ 16:12:28 ] - root at vol2 /sdn $ls -l /sdn/2098
...
[ 16:13:12 ] - root at vol2 /sdn $ls -l | grep 2098
drwxrwxrwx 4 1015 1013 8237 Apr 17 16:13 2098
[ 16:13:27 ] - root at vol2 /sdn $
[ 16:13:27 ] - root at vol2 /sdn $ls | wc -l
306
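to make the other missing directories show up i could probably repeat
that explicit lookup for every top-level directory, something like this
(untested sketch, /gluster-export is the brick path and /sdn the vol2
mount as above):

# force a named lookup for every top-level directory found on the brick,
# which is what the manual 'ls -l /sdn/<dir>' above does for a single one
for d in /gluster-export/*/; do
    ls -ld "/sdn/$(basename "$d")" > /dev/null
done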
i did not find helpful hints in the gluster logs. currently i'm
frequently seeing the following messages, but the missing directories on
vol2 are not mentioned:
vol2 :
$tail -f sdn.log
[2015-04-17 14:00:14.816730] I [dht-layout.c:726:dht_layout_dir_mismatch] 1-aut-wien-01-dht: /1011 - disk layout missing
[2015-04-17 14:00:14.816745] I [dht-common.c:638:dht_revalidate_cbk] 1-aut-wien-01-dht: mismatching layouts for /1011
[2015-04-17 14:00:14.817590] I [dht-layout.c:726:dht_layout_dir_mismatch] 1-aut-wien-01-dht: /1005 - disk layout missing
[2015-04-17 14:00:14.817602] I [dht-common.c:638:dht_revalidate_cbk] 1-aut-wien-01-dht: mismatching layouts for /1005
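for reference, the layout-only variant of the rebalance, which addresses
exactly these per-directory layouts, would be:

gluster volume rebalance vol2 fix-layout start
gluster volume rebalance vol2 status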
the used space on vol1 is slightly smaller than on vol2. all nodes are
using the same disk configuration and all bricks are xfs-formatted.
df -m :
Filesystem     1M-blocks      Used Available Use% Mounted on
vol1:/vol1      57217563  39230421  17987143  69% /sdn
vol2:/vol2      57217563  40399541  16818023  71% /sdn
currently i'm confused because i don't know the reason for this behaviour...
i guess it was not a good idea to update the geo-replication slave to
3.4.6 while the master is still running 3.3.2, but i'm not sure.
possibly there is an issue with 3.4.6 itself and geo-replication has no
influence on it at all.
for the time being i have stopped the geo-replication.
can somebody point me to the cause or give me helpful hints on what to
do next...?
best regards
dietmar