[Bugs] [Bug 1447392] New: [Brick MUX] : Rebalance fails.

bugzilla at redhat.com
Tue May 2 15:24:33 UTC 2017


https://bugzilla.redhat.com/show_bug.cgi?id=1447392

            Bug ID: 1447392
           Summary: [Brick MUX] : Rebalance fails.
           Product: GlusterFS
           Version: mainline
         Component: core
          Severity: low
          Priority: low
          Assignee: bugs at gluster.org
          Reporter: jthottan at redhat.com
                CC: amukherj at redhat.com, anoopcs at redhat.com,
                    asoman at redhat.com, bturner at redhat.com,
                    bugs at gluster.org, nbalacha at redhat.com,
                    nchilaka at redhat.com, rhinduja at redhat.com,
                    rhs-bugs at redhat.com, skoduri at redhat.com,
                    storage-qa-internal at redhat.com
        Depends On: 1446107



+++ This bug was initially created as a clone of Bug #1446107 +++

Description of problem:
------------------------

Created an EC volume, enabled brick multiplexing, added bricks, and triggered
rebalance.

Rebalance failed.
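The reproduction above can be sketched as gluster CLI commands. This is a hedged sketch, not the exact commands used: hostnames (host1-host3) are placeholders, and the brick paths merely mirror the layout shown further down.

```shell
#!/bin/sh
# Hypothetical repro sketch; host1-host3 and brick paths are placeholders.
# Guard so the script degrades gracefully on a host without gluster.
status="skipped"
if command -v gluster >/dev/null 2>&1; then
    # 1. Enable brick multiplexing cluster-wide.
    gluster volume set all cluster.brick-multiplex on

    # 2. Create and start a 1 x (4 + 2) distributed-disperse (EC) volume.
    gluster volume create butcher disperse 6 redundancy 2 \
        host1:/bricks2/e1 host2:/bricks2/e1 host3:/bricks2/e1 \
        host1:/bricks1/e1 host2:/bricks1/e1 host3:/bricks1/e1 force
    gluster volume start butcher

    # 3. Expand the volume with a second (4 + 2) subvolume.
    gluster volume add-brick butcher \
        host1:/bricks6/A1 host2:/bricks6/A1 host3:/bricks6/A1 \
        host1:/bricks8/A1 host2:/bricks8/A1 host3:/bricks8/A1

    # 4. Trigger rebalance; the failure shows up in its status output.
    gluster volume rebalance butcher start
    gluster volume rebalance butcher status
    status="done"
else
    echo "gluster CLI not found; nothing to do on this host"
fi
```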


# gluster v rebalance testvol status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status  run time in h:m:s
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
      gqas014.sbu.lab.eng.bos.redhat.com                0        0Bytes             0             0             0            completed        0:00:00
volume rebalance: testvol: success
[root@gqas009 glusterfs]# gluster v rebalance butcher status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status  run time in h:m:s
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost                0        0Bytes             0             0             0            completed        0:00:00
      gqas014.sbu.lab.eng.bos.redhat.com                0        0Bytes             0             1             0               failed        0:00:02
      gqas015.sbu.lab.eng.bos.redhat.com                0        0Bytes             0             1             0               failed        0:00:02
volume rebalance: butcher: success
[root@gqas009 glusterfs]#


Version-Release number of selected component (if applicable):
-------------------------------------------------------------

mainline

How reproducible:
-----------------

2/2


Actual results:
--------------

Rebalance fails.

Expected results:
-----------------

Rebalance should not fail.

Additional info:
----------------

# gluster v info

Volume Name: butcher
Type: Distributed-Disperse
Volume ID: 98d7434c-0466-4ff3-879b-3ee8c211c7b2
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x (4 + 2) = 12
Transport-type: tcp
Bricks:
Brick1: gqas009.sbu.lab.eng.bos.redhat.com:/bricks2/e1
Brick2: gqas014.sbu.lab.eng.bos.redhat.com:/bricks2/e1
Brick3: gqas015.sbu.lab.eng.bos.redhat.com:/bricks2/e1
Brick4: gqas009.sbu.lab.eng.bos.redhat.com:/bricks1/e1
Brick5: gqas014.sbu.lab.eng.bos.redhat.com:/bricks1/e1
Brick6: gqas015.sbu.lab.eng.bos.redhat.com:/bricks1/e1
Brick7: gqas009.sbu.lab.eng.bos.redhat.com:/bricks6/A1
Brick8: gqas014.sbu.lab.eng.bos.redhat.com:/bricks6/A1
Brick9: gqas015.sbu.lab.eng.bos.redhat.com:/bricks6/A1
Brick10: gqas009.sbu.lab.eng.bos.redhat.com:/bricks8/A1
Brick11: gqas014.sbu.lab.eng.bos.redhat.com:/bricks8/A1
Brick12: gqas015.sbu.lab.eng.bos.redhat.com:/bricks8/A1
Options Reconfigured:
cluster.lookup-optimize: on
transport.address-family: inet
nfs.disable: on
cluster.brick-multiplex: enable

Volume Name: testvol
Type: Distribute
Volume ID: 2b12b3e7-a167-4538-b55b-9a4e181c622e
Status: Started
Snapshot Count: 0
Number of Bricks: 2
Transport-type: tcp
Bricks:
Brick1: gqas014.sbu.lab.eng.bos.redhat.com:/bricks11/A
Brick2: gqas014.sbu.lab.eng.bos.redhat.com:/bricks5/a
Options Reconfigured:
transport.address-family: inet
nfs.disable: on
cluster.brick-multiplex: enable
[root@gqas009 glusterfs]#


# gluster v status
Status of volume: butcher
Gluster process                                        TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick gqas009.sbu.lab.eng.bos.redhat.com:/bricks2/e1   49153     0          Y       23917
Brick gqas014.sbu.lab.eng.bos.redhat.com:/bricks2/e1   49153     0          Y       23218
Brick gqas015.sbu.lab.eng.bos.redhat.com:/bricks2/e1   49153     0          Y       23687
Brick gqas009.sbu.lab.eng.bos.redhat.com:/bricks1/e1   49153     0          Y       23917
Brick gqas014.sbu.lab.eng.bos.redhat.com:/bricks1/e1   49153     0          Y       23218
Brick gqas015.sbu.lab.eng.bos.redhat.com:/bricks1/e1   49153     0          Y       23687
Brick gqas009.sbu.lab.eng.bos.redhat.com:/bricks6/A1   49153     0          Y       23917
Brick gqas014.sbu.lab.eng.bos.redhat.com:/bricks6/A1   49153     0          Y       23218
Brick gqas015.sbu.lab.eng.bos.redhat.com:/bricks6/A1   49153     0          Y       23687
Brick gqas009.sbu.lab.eng.bos.redhat.com:/bricks8/A1   49153     0          Y       23917
Brick gqas014.sbu.lab.eng.bos.redhat.com:/bricks8/A1   49153     0          Y       23218
Brick gqas015.sbu.lab.eng.bos.redhat.com:/bricks8/A1   49153     0          Y       23687
Self-heal Daemon on localhost                          N/A       N/A        Y       24098
Self-heal Daemon on gqas011.sbu.lab.eng.bos.redhat.com N/A       N/A        Y       14859
Self-heal Daemon on gqas014.sbu.lab.eng.bos.redhat.com N/A       N/A        Y       23367
Self-heal Daemon on gqas015.sbu.lab.eng.bos.redhat.com N/A       N/A        Y       23828

Task Status of Volume butcher
------------------------------------------------------------------------------
Task                 : Rebalance
ID                   : 1314fe4c-0005-476a-b88c-4b52f93ffa62
Status               : failed

Status of volume: testvol
Gluster process                                        TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick gqas014.sbu.lab.eng.bos.redhat.com:/bricks11/A   49153     0          Y       23218
Brick gqas014.sbu.lab.eng.bos.redhat.com:/bricks5/a    49153     0          Y       23218

Task Status of Volume testvol
------------------------------------------------------------------------------
Task                 : Rebalance
ID                   : 2745a8f1-336b-4bf0-baec-efe661a22dde
Status               : completed

--- Additional comment from Ambarish on 2017-04-27 05:47:09 EDT ---



--- Additional comment from Nithya Balachandran on 2017-04-28 04:34:05 EDT ---

Why is lookup-optimize enabled here?

--- Additional comment from Ambarish on 2017-04-28 04:48:43 EDT ---

(In reply to Nithya Balachandran from comment #4)
> Why is lookup-optimize enabled here?

I was planning to run some perf tests - small files, specifically.

We generally enable lookup-optimize and bump up the epoll threads for
smallfile workloads, as mentioned in our admin guide under perf enhancements.

--- Additional comment from Nithya Balachandran on 2017-04-28 05:19:57 EDT ---

From the logs on gqas014,


[2017-04-27 06:27:17.434104] I [dht-rebalance.c:2739:gf_defrag_process_dir]
0-butcher-dht: migrate data called on /
[2017-04-27 06:27:19.305121] I [MSGID: 109063]
[dht-layout.c:713:dht_layout_normalize] 0-butcher-dht: Found anomalies in
/.trashcan (gfid = 00000000-0000-0000-0000-000000000005). Holes=1 overlaps=0
[2017-04-27 06:27:19.306255] W [MSGID: 109065]
[dht-selfheal.c:1410:dht_selfheal_dir_mkdir_lock_cbk] 0-butcher-dht: acquiring
inodelk failed for /.trashcan [Input/output error]
[2017-04-27 06:27:19.306350] E [MSGID: 109118]
[dht-rebalance.c:3365:gf_defrag_fix_layout] 0-butcher-dht: lookup failed
for:/.trashcan [Input/output error]

--- Additional comment from Nithya Balachandran on 2017-04-28 06:01:33 EDT ---

Looks like .trashcan is missing on some bricks:

e-1 returned -1 error: No such file or directory [No such file or directory]
[2017-04-28 09:38:42.138207] D [MSGID: 0] [dht-common.c:841:dht_lookup_dir_cbk]
0-butcher-dht: lookup of /.trashcan on butcher-disperse-1 returned error [No
such file or directory]
[2017-04-28 09:38:42.138237] I [MSGID: 109063]
[dht-layout.c:713:dht_layout_normalize] 0-butcher-dht: Found anomalies in
/.trashcan (gfid = 00000000-0000-0000-0000-000000000005). Holes=1 overlaps=0
[2017-04-28 09:38:42.138251] D [MSGID: 0] [dht-common.c:894:dht_lookup_dir_cbk]
0-butcher-dht: fixing assignment on /.trashcan
[2017-04-28 09:38:42.138277] D [MSGID: 0]
[dht-selfheal.c:1879:dht_selfheal_layout_new_directory] 0-butcher-dht: chunk
size = 0xffffffff / 7164468 = 599.481677
[2017-04-28 09:38:42.138293] D [MSGID: 0]
[dht-selfheal.c:1920:dht_selfheal_layout_new_directory] 0-butcher-dht:
assigning range size 0x7fffffff to butcher-disperse-0
[2017-04-28 09:38:42.138305] D [MSGID: 0]
[dht-selfheal.c:1920:dht_selfheal_layout_new_directory] 0-butcher-dht:
assigning range size 0x7fffffff to butcher-disperse-1
[2017-04-28 09:38:42.138785] D [MSGID: 114031]
[client-rpc-fops.c:1550:client3_3_inodelk_cbk] 0-butcher-client-4: remote
operation failed [Stale file handle]
[2017-04-28 09:38:42.138810] D [MSGID: 114031]
[client-rpc-fops.c:1550:client3_3_inodelk_cbk] 0-butcher-client-3: remote
operation failed [Stale file handle]
[2017-04-28 09:38:42.138812] D [MSGID: 0]
[client-rpc-fops.c:1553:client3_3_inodelk_cbk] 0-stack-trace: stack-address:
0x7fe58c080cb0, butcher-client-4 returned -1 error: Stale file handle [Stale
file handle]
[2017-04-28 09:38:42.138844] D [MSGID: 0]
[client-rpc-fops.c:1553:client3_3_inodelk_cbk] 0-stack-trace: stack-address:
0x7fe58c080cb0, butcher-client-3 returned -1 error: Stale file handle [Stale
file handle]
[2017-04-28 09:38:42.138866] D [MSGID: 0] [ec-combine.c:852:ec_combine_check]
0-butcher-disperse-0: Mismatching return code in answers of 'INODELK': -1 <-> 0
[2017-04-28 09:38:42.138904] D [logging.c:1953:_gf_msg_internal]
0-logging-infra: Buffer overflow of a buffer whose size limit is 5. About to
flush least recently used log message to disk
[2017-04-28 09:38:42.138899] D [MSGID: 0] [ec-combine.c:852:ec_combine_check]
0-butcher-disperse-0: Mismatching return code in answers of 'INODELK': -1 <-> 0
[2017-04-28 09:38:42.138903] D [MSGID: 114031]
[client-rpc-fops.c:1550:client3_3_inodelk_cbk] 0-butcher-client-5: remote
operation failed [Stale file handle]




Hitting a similar issue when trying to access the .trashcan dir from the mount
point.



[root@gqas014 ~]# mount -t glusterfs gqas014.sbu.lab.eng.bos.redhat.com:/butcher /mnt/tests/
[root@gqas014 ~]# cd /mnt/test
-bash: cd: /mnt/test: No such file or directory
[root@gqas014 ~]# cd /mnt/tests/
[root@gqas014 tests]# l
-bash: l: command not found
[root@gqas014 tests]# ll
total 0
[root@gqas014 tests]# ll -a
ls: cannot access .trashcan: Input/output error
total 4
drwxr-xr-x 3 root root 4096 Apr 27 02:26 .
drwxr-xr-x 5 root root   55 Apr 26 09:07 ..
d????????? ? ?    ?       ?            ? .trashcan


I will be looking into this further.

--- Additional comment from Nithya Balachandran on 2017-04-28 06:09:55 EDT ---

Please leave the setup as is.

--- Additional comment from Nithya Balachandran on 2017-04-28 06:52:41 EDT ---

Looks like only one brick on each node has the .trashcan directory.

drwxr-xr-x 2 root root  6 Apr 27 02:25 internal_op
[root@gqas014 ~]# ll -a /bricks2/e1
total 0
drwxr-xr-x 4 root root  41 Apr 27 02:26 .
drwxr-xr-x 5 root root  36 Apr 27 02:24 ..
drw------- 8 root root 165 Apr 27 02:25 .glusterfs
drwxr-xr-x 3 root root  25 Apr 27 02:25 .trashcan
[root@gqas014 ~]# ll -a /bricks1/e1
total 0
drwxr-xr-x 3 root root  24 Apr 27 02:26 .
drwxr-xr-x 5 root root  36 Apr 27 02:24 ..
drw------- 8 root root 165 Apr 27 02:25 .glusterfs
[root@gqas014 ~]# ll -a /bricks6/A1
total 0
drwxr-xr-x 3 root root  24 Apr 27 02:26 .
drwxr-xr-x 3 root root  16 Apr 27 02:26 ..
drw------- 8 root root 165 Apr 27 02:27 .glusterfs
[root@gqas014 ~]# ll -a /bricks8/A1
total 0
drwxr-xr-x 3 root root  24 Apr 27 02:26 .
drwxr-xr-x 3 root root  16 Apr 27 02:26 ..
drw------- 8 root root 165 Apr 27 02:27 .glusterfs
[root@gqas014 ~]#

****************************************************************

[root@gqas015 ~]# ll -a /bricks2/e1
total 0
drwxr-xr-x 4 root root  41 Apr 27 02:26 .
drwxr-xr-x 5 root root  36 Apr 27 02:24 ..
drw------- 8 root root 165 Apr 27 02:25 .glusterfs
drwxr-xr-x 3 root root  25 Apr 27 02:24 .trashcan
[root@gqas015 ~]# ll -a /bricks1/e1
total 0
drwxr-xr-x 3 root root  24 Apr 27 02:26 .
drwxr-xr-x 5 root root  36 Apr 27 02:24 ..
drw------- 8 root root 165 Apr 27 02:25 .glusterfs
[root@gqas015 ~]# ll -a /bricks6/A1
total 0
drwxr-xr-x 3 root root  24 Apr 27 02:26 .
drwxr-xr-x 3 root root  16 Apr 27 02:26 ..
drw------- 8 root root 165 Apr 27 02:27 .glusterfs
[root@gqas015 ~]# ll -a /bricks8/A1
total 0
drwxr-xr-x 3 root root  24 Apr 27 02:26 .
drwxr-xr-x 3 root root  16 Apr 27 02:26 ..
drw------- 8 root root 165 Apr 27 02:27 .glusterfs
[root@gqas015 ~]#

****************************************************************

[root@gqas009 ~]# ll -a /bricks2/e1
total 0
drwxr-xr-x 4 root root  41 Apr 27 02:26 .
drwxr-xr-x 5 root root  36 Apr 27 02:24 ..
drw------- 8 root root 165 Apr 27 02:25 .glusterfs
drwxr-xr-x 3 root root  25 Apr 27 02:24 .trashcan
[root@gqas009 ~]# ll -a /bricks1/e1
total 0
drwxr-xr-x 3 root root  24 Apr 27 02:26 .
drwxr-xr-x 5 root root  36 Apr 27 02:24 ..
drw------- 8 root root 165 Apr 27 02:25 .glusterfs
[root@gqas009 ~]# ll -a /bricks6/A1
total 0
drwxr-xr-x 3 root root  24 Apr 27 02:26 .
drwxr-xr-x 3 root root  16 Apr 27 02:26 ..
drw------- 8 root root 165 Apr 27 02:27 .glusterfs
[root@gqas009 ~]# ll -a /bricks8/A1
total 0
drwxr-xr-x 3 root root  24 Apr 27 02:26 .
drwxr-xr-x 3 root root  16 Apr 27 02:26 ..
drw------- 8 root root 165 Apr 27 02:27 .glusterfs



I'm guessing that, as the dir does not exist on enough bricks of the EC set,
any attempt by DHT to take locks on it fails.
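Given that hypothesis, a quick way to confirm which brick roots are missing the directory is a small shell loop run on each node (a minimal sketch; the brick paths below are the ones from this setup, adjust per node):

```shell
# Flag brick roots that lack the .trashcan directory.
check_trashcan() {
    for brick in "$@"; do
        if [ -d "$brick/.trashcan" ]; then
            echo "$brick: .trashcan present"
        else
            echo "$brick: .trashcan MISSING"
        fi
    done
}

# Brick roots from this node's layout; adjust for each node.
check_trashcan /bricks2/e1 /bricks1/e1 /bricks6/A1 /bricks8/A1
```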


The underlying problem here is the missing .trashcan directory. This sounds
very similar to BZ 1443941. 

Requesting the trash xlator folks to take a look at this.


Referenced Bugs:

https://bugzilla.redhat.com/show_bug.cgi?id=1446107
[Bug 1446107] [Brick MUX] : Rebalance fails.