[Bugs] [Bug 1469041] New: Rebalance hangs on remove-brick if the target volume changes

bugzilla at redhat.com bugzilla at redhat.com
Mon Jul 10 10:25:30 UTC 2017


https://bugzilla.redhat.com/show_bug.cgi?id=1469041

            Bug ID: 1469041
           Summary: Rebalance hangs on remove-brick if the target volume
                    changes
           Product: Red Hat Gluster Storage
           Version: 3.3
         Component: distribute
          Assignee: nbalacha at redhat.com
          Reporter: nbalacha at redhat.com
        QA Contact: tdesala at redhat.com
                CC: bugs at gluster.org, rhs-bugs at redhat.com,
                    storage-qa-internal at redhat.com
        Depends On: 1469029



+++ This bug was initially created as a clone of Bug #1469029 +++

Description of problem:
The rebalance process hangs on a remove-brick operation if the original hashed
subvol fails the min-free-disk check and a different target is selected.

Version-Release number of selected component (if applicable):


How reproducible:
Consistently

Steps to Reproduce:
1. Create a 1x3 distribute volume with 1 GB bricks
2. Create enough 5 MB files on the volume that no two bricks can hold all of
the files.
3. Run a remove-brick to remove one of the bricks

Actual results:

The rebalance hangs.

Expected results:
The rebalance process should terminate once all files are processed.

Additional info:

--- Additional comment from Nithya Balachandran on 2017-07-10 06:13:12 EDT ---

RCA:


From gdb:


Thread 13 (Thread 0x7f7740823700 (LWP 27258)):
#0  0x00007f7748006bdd in nanosleep () from /lib64/libpthread.so.0
#1  0x00007f77491bd8cb in gf_timer_proc (data=0xd0af80) at timer.c:176
#2  0x00007f7747fffdc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f774794473d in clone () from /lib64/libc.so.6

Thread 12 (Thread 0x7f7740022700 (LWP 27259)):
#0  0x00007f7748007101 in sigwait () from /lib64/libpthread.so.0
#1  0x0000000000409e72 in glusterfs_sigwaiter (arg=0x7fff5bfe2da0) at
glusterfsd.c:2069
#2  0x00007f7747fffdc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f774794473d in clone () from /lib64/libc.so.6

Thread 11 (Thread 0x7f773f821700 (LWP 27260)):
#0  0x00007f774790b66d in nanosleep () from /lib64/libc.so.6
#1  0x00007f774790b504 in sleep () from /lib64/libc.so.6
#2  0x00007f77491dff7b in pool_sweeper (arg=0x0) at mem-pool.c:464
#3  0x00007f7747fffdc5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f774794473d in clone () from /lib64/libc.so.6

Thread 10 (Thread 0x7f773f020700 (LWP 27261)):
#0  0x00007f7748003a82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from
/lib64/libpthread.so.0
#1  0x00007f77491f54ab in syncenv_task (proc=0xd0b7d0) at syncop.c:603
#2  0x00007f77491f5746 in syncenv_processor (thdata=0xd0b7d0) at syncop.c:695
#3  0x00007f7747fffdc5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f774794473d in clone () from /lib64/libc.so.6

Thread 9 (Thread 0x7f773e81f700 (LWP 27262)):
#0  0x00007f7748000ef7 in pthread_join () from /lib64/libpthread.so.0
#1  0x00007f773b66410f in gf_defrag_start_crawl (data=0x7f773400dfc0) at
dht-rebalance.c:4479
#2  0x00007f77491f4c7a in synctask_wrap (old_task=0x7f7724001400) at
syncop.c:375
#3  0x00007f7747893cf0 in ?? () from /lib64/libc.so.6
#4  0x0000000000000000 in ?? ()

Thread 8 (Thread 0x7f773c357700 (LWP 27263)):
#0  0x00007f7747944d13 in epoll_wait () from /lib64/libc.so.6
#1  0x00007f774921b1ef in event_dispatch_epoll_worker (data=0xd49290) at
event-epoll.c:638
#2  0x00007f7747fffdc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f774794473d in clone () from /lib64/libc.so.6

Thread 7 (Thread 0x7f773ab14700 (LWP 27264)):
#0  0x00007f7747944d13 in epoll_wait () from /lib64/libc.so.6
#1  0x00007f774921b1ef in event_dispatch_epoll_worker (data=0x7f773401c720) at
event-epoll.c:638
#2  0x00007f7747fffdc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f774794473d in clone () from /lib64/libc.so.6

Thread 6 (Thread 0x7f77395fa700 (LWP 27266)):
#0  0x00007f774790b66d in nanosleep () from /lib64/libc.so.6
#1  0x00007f774790b504 in sleep () from /lib64/libc.so.6
#2  0x00007f773b6632af in dht_file_counter_thread (args=0x7f773401b430) at
dht-rebalance.c:4158
#3  0x00007f7747fffdc5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f774794473d in clone () from /lib64/libc.so.6

Thread 5 (Thread 0x7f7738df9700 (LWP 27267)):
#0  0x00007f77480061bd in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f7748001d02 in _L_lock_791 () from /lib64/libpthread.so.0
#2  0x00007f7748001c08 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007f773b90c765 in client_get_remote_fd (this=0x7f77340089c0,
fd=0x7f7728005a00, flags=0, remote_fd=0x7f7738df6e78) at client-helpers.c:303
#4  0x00007f773b940782 in client_pre_fsync (this=0x7f77340089c0,
req=0x7f7738df6ef0, fd=0x7f7728005a00, flags=0, xdata=0x0) at
client-common.c:459
#5  0x00007f773b92b0b6 in client3_3_fsync (frame=0x7f77280053b0,
this=0x7f77340089c0, data=0x7f7738df6fe0) at client-rpc-fops.c:4472
#6  0x00007f773b901d16 in client_fsync (frame=0x7f77280053b0,
this=0x7f77340089c0, fd=0x7f7728005a00, flags=0, xdata=0x0) at client.c:1091
#7  0x00007f7749200c3e in syncop_fsync (subvol=0x7f77340089c0,
fd=0x7f7728005a00, dataonly=0, xdata_in=0x0, xdata_out=0x0) at syncop.c:2319
#8  0x00007f773b65c2e3 in dht_migrate_file (this=0x7f773400dfc0,
loc=0x7f7738df8da0, from=0x7f773400c4d0, to=0x7f77340089c0, flag=2,
fop_errno=0x7f7738df8d1c)
    at dht-rebalance.c:1750
#9  0x00007f773b65ee64 in gf_defrag_migrate_single_file (opaque=0x7f7710022340)
at dht-rebalance.c:2645
#10 0x00007f773b65f620 in gf_defrag_task (opaque=0x7f773401b430) at
dht-rebalance.c:2812
#11 0x00007f7747fffdc5 in start_thread () from /lib64/libpthread.so.0
#12 0x00007f774794473d in clone () from /lib64/libc.so.6

Thread 4 (Thread 0x7f7723fff700 (LWP 27268)):
#0  0x00007f77480061bd in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f7748001d02 in _L_lock_791 () from /lib64/libpthread.so.0
#2  0x00007f7748001c08 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007f773b90c765 in client_get_remote_fd (this=0x7f77340089c0,
fd=0x7f77140032c0, flags=1, remote_fd=0x7f7723ffcdb8) at client-helpers.c:303
#4  0x00007f773b940233 in client_pre_writev (this=0x7f77340089c0,
req=0x7f7723ffce30, fd=0x7f77140032c0, size=1048576, offset=4194304, flags=0,
xdata=0x7f7723ffd008)
    at client-common.c:375
#5  0x00007f773b92a71e in client3_3_writev (frame=0x7f7714005810,
this=0x7f77340089c0, data=0x7f7723ffcf40) at client-rpc-fops.c:4361
#6  0x00007f773b90167d in client_writev (frame=0x7f7714005810,
this=0x7f77340089c0, fd=0x7f77140032c0, vector=0x7f7730009260, count=1,
off=4194304, flags=0, 
    iobref=0x7f7730002600, xdata=0x0) at client.c:1036
#7  0x00007f77491fd73e in syncop_writev (subvol=0x7f77340089c0,
fd=0x7f77140032c0, vector=0x7f7730009260, count=1, offset=4194304,
iobref=0x7f7730002600, flags=0, 
    xdata_in=0x0, xdata_out=0x0) at syncop.c:1975
#8  0x00007f773b659f39 in __dht_rebalance_migrate_data (from=0x7f773400c4d0,
to=0x7f77340089c0, src=0x7f7714001af0, dst=0x7f77140032c0, ia_size=5242880,
hole_exists=0, 
    fop_errno=0x7f7723ffed1c) at dht-rebalance.c:1028
#9  0x00007f773b65c247 in dht_migrate_file (this=0x7f773400dfc0,
loc=0x7f7723ffeda0, from=0x7f773400c4d0, to=0x7f77340089c0, flag=2,
fop_errno=0x7f7723ffed1c)
    at dht-rebalance.c:1733
#10 0x00007f773b65ee64 in gf_defrag_migrate_single_file (opaque=0x7f77100220b0)
at dht-rebalance.c:2645
#11 0x00007f773b65f620 in gf_defrag_task (opaque=0x7f773401b430) at
dht-rebalance.c:2812
#12 0x00007f7747fffdc5 in start_thread () from /lib64/libpthread.so.0
#13 0x00007f774794473d in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x7f77237fe700 (LWP 27269)):
#0  0x00007f77480061bd in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f7748001d1d in _L_lock_840 () from /lib64/libpthread.so.0
#2  0x00007f7748001c3a in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007f77491dd5d6 in fd_ctx_get (fd=0x7f771c0056c0, xlator=0x7f77340089c0,
value=0x7f77237fbd90) at fd.c:984
#4  0x00007f773b90bd81 in this_fd_get_ctx (file=0x7f771c0056c0,
this=0x7f77340089c0) at client-helpers.c:73
#5  0x00007f773b90c778 in client_get_remote_fd (this=0x7f77340089c0,
fd=0x7f771c0056c0, flags=0, remote_fd=0x7f77237fbe68) at client-helpers.c:305
#6  0x00007f773b941650 in client_pre_ftruncate (this=0x7f77340089c0,
req=0x7f77237fbee0, fd=0x7f771c0056c0, offset=0, xdata=0x0) at
client-common.c:683
#7  0x00007f773b9264a1 in client3_3_ftruncate (frame=0x7f771c005400,
this=0x7f77340089c0, data=0x7f77237fbfe0) at client-rpc-fops.c:3606
#8  0x00007f773b8fe8ba in client_ftruncate (frame=0x7f771c005400,
this=0x7f77340089c0, fd=0x7f771c0056c0, offset=0, xdata=0x0) at client.c:626
#9  0x00007f774920018e in syncop_ftruncate (subvol=0x7f77340089c0,
fd=0x7f771c0056c0, offset=0, xdata_in=0x0, xdata_out=0x0) at syncop.c:2261
#10 0x00007f773b65da30 in dht_migrate_file (this=0x7f773400dfc0,
loc=0x7f77237fdda0, from=0x7f773400c4d0, to=0x7f77340089c0, flag=2,
fop_errno=0x7f77237fdd1c)
    at dht-rebalance.c:2200
#11 0x00007f773b65ee64 in gf_defrag_migrate_single_file (opaque=0x7f7710022af0)
at dht-rebalance.c:2645
#12 0x00007f773b65f620 in gf_defrag_task (opaque=0x7f773401b430) at
dht-rebalance.c:2812
#13 0x00007f7747fffdc5 in start_thread () from /lib64/libpthread.so.0
#14 0x00007f774794473d in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x7f771affd700 (LWP 27270)):
#0  0x00007f77480061bd in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f7748001d02 in _L_lock_791 () from /lib64/libpthread.so.0
#2  0x00007f7748001c08 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007f773b90c765 in client_get_remote_fd (this=0x7f77340089c0,
fd=0x7f770c001050, flags=1, remote_fd=0x7f771affadb8) at client-helpers.c:303
#4  0x00007f773b940233 in client_pre_writev (this=0x7f77340089c0,
req=0x7f771affae30, fd=0x7f770c001050, size=1048576, offset=4194304, flags=0,
xdata=0x7f771affb008)
    at client-common.c:375
#5  0x00007f773b92a71e in client3_3_writev (frame=0x7f770c002c80,
this=0x7f77340089c0, data=0x7f771affaf40) at client-rpc-fops.c:4361
#6  0x00007f773b90167d in client_writev (frame=0x7f770c002c80,
this=0x7f77340089c0, fd=0x7f770c001050, vector=0x7f77300022a0, count=1,
off=4194304, flags=0, 
    iobref=0x7f773000e800, xdata=0x0) at client.c:1036
#7  0x00007f77491fd73e in syncop_writev (subvol=0x7f77340089c0,
fd=0x7f770c001050, vector=0x7f77300022a0, count=1, offset=4194304,
iobref=0x7f773000e800, flags=0, 
    xdata_in=0x0, xdata_out=0x0) at syncop.c:1975
#8  0x00007f773b659f39 in __dht_rebalance_migrate_data (from=0x7f773400c4d0,
to=0x7f77340089c0, src=0x7f770c005820, dst=0x7f770c001050, ia_size=5242880,
hole_exists=0, 
    fop_errno=0x7f771affcd1c) at dht-rebalance.c:1028
#9  0x00007f773b65c247 in dht_migrate_file (this=0x7f773400dfc0,
loc=0x7f771affcda0, from=0x7f773400c4d0, to=0x7f77340089c0, flag=2,
fop_errno=0x7f771affcd1c)
    at dht-rebalance.c:1733
#10 0x00007f773b65ee64 in gf_defrag_migrate_single_file (opaque=0x7f77100225d0)
at dht-rebalance.c:2645
#11 0x00007f773b65f620 in gf_defrag_task (opaque=0x7f773401b430) at
dht-rebalance.c:2812
#12 0x00007f7747fffdc5 in start_thread () from /lib64/libpthread.so.0
#13 0x00007f774794473d in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7f774969a780 (LWP 27257)):
#0  0x00007f7748000ef7 in pthread_join () from /lib64/libpthread.so.0
#1  0x00007f774921b446 in event_dispatch_epoll (event_pool=0xd01f70) at
event-epoll.c:732
#2  0x00007f77491de754 in event_dispatch (event_pool=0xd01f70) at event.c:124
#3  0x000000000040ab6a in main (argc=31, argv=0x7fff5bfe3ff8) at
glusterfsd.c:2479


Four threads (2, 3, 4, and 5) are hung.

(gdb) t 2
[Switching to thread 2 (Thread 0x7f771affd700 (LWP 27270))]
#0  0x00007f77480061bd in __lll_lock_wait () from /lib64/libpthread.so.0
(gdb) bt
#0  0x00007f77480061bd in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f7748001d02 in _L_lock_791 () from /lib64/libpthread.so.0
#2  0x00007f7748001c08 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007f773b90c765 in client_get_remote_fd (this=0x7f77340089c0,
fd=0x7f770c001050, flags=1, remote_fd=0x7f771affadb8) at client-helpers.c:303
#4  0x00007f773b940233 in client_pre_writev (this=0x7f77340089c0,
req=0x7f771affae30, fd=0x7f770c001050, size=1048576, offset=4194304, flags=0,
xdata=0x7f771affb008)
    at client-common.c:375
#5  0x00007f773b92a71e in client3_3_writev (frame=0x7f770c002c80,
this=0x7f77340089c0, data=0x7f771affaf40) at client-rpc-fops.c:4361
#6  0x00007f773b90167d in client_writev (frame=0x7f770c002c80,
this=0x7f77340089c0, fd=0x7f770c001050, vector=0x7f77300022a0, count=1,
off=4194304, flags=0, 
    iobref=0x7f773000e800, xdata=0x0) at client.c:1036
#7  0x00007f77491fd73e in syncop_writev (subvol=0x7f77340089c0,
fd=0x7f770c001050, vector=0x7f77300022a0, count=1, offset=4194304,
iobref=0x7f773000e800, flags=0, 
    xdata_in=0x0, xdata_out=0x0) at syncop.c:1975
#8  0x00007f773b659f39 in __dht_rebalance_migrate_data (from=0x7f773400c4d0,
to=0x7f77340089c0, src=0x7f770c005820, dst=0x7f770c001050, ia_size=5242880,
hole_exists=0, 
    fop_errno=0x7f771affcd1c) at dht-rebalance.c:1028
#9  0x00007f773b65c247 in dht_migrate_file (this=0x7f773400dfc0,
loc=0x7f771affcda0, from=0x7f773400c4d0, to=0x7f77340089c0, flag=2,
fop_errno=0x7f771affcd1c)
    at dht-rebalance.c:1733
#10 0x00007f773b65ee64 in gf_defrag_migrate_single_file (opaque=0x7f77100225d0)
at dht-rebalance.c:2645
#11 0x00007f773b65f620 in gf_defrag_task (opaque=0x7f773401b430) at
dht-rebalance.c:2812
#12 0x00007f7747fffdc5 in start_thread () from /lib64/libpthread.so.0
#13 0x00007f774794473d in clone () from /lib64/libc.so.6
(gdb) f 3
#3  0x00007f773b90c765 in client_get_remote_fd (this=0x7f77340089c0,
fd=0x7f770c001050, flags=1, remote_fd=0x7f771affadb8) at client-helpers.c:303
303            pthread_mutex_lock (&conf->lock);



(gdb) t 3
[Switching to thread 3 (Thread 0x7f77237fe700 (LWP 27269))]
#0  0x00007f77480061bd in __lll_lock_wait () from /lib64/libpthread.so.0
(gdb) bt
#0  0x00007f77480061bd in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f7748001d1d in _L_lock_840 () from /lib64/libpthread.so.0
#2  0x00007f7748001c3a in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007f77491dd5d6 in fd_ctx_get (fd=0x7f771c0056c0, xlator=0x7f77340089c0,
value=0x7f77237fbd90) at fd.c:984
#4  0x00007f773b90bd81 in this_fd_get_ctx (file=0x7f771c0056c0,
this=0x7f77340089c0) at client-helpers.c:73
#5  0x00007f773b90c778 in client_get_remote_fd (this=0x7f77340089c0,
fd=0x7f771c0056c0, flags=0, remote_fd=0x7f77237fbe68) at client-helpers.c:305
#6  0x00007f773b941650 in client_pre_ftruncate (this=0x7f77340089c0,
req=0x7f77237fbee0, fd=0x7f771c0056c0, offset=0, xdata=0x0) at
client-common.c:683
#7  0x00007f773b9264a1 in client3_3_ftruncate (frame=0x7f771c005400,
this=0x7f77340089c0, data=0x7f77237fbfe0) at client-rpc-fops.c:3606
#8  0x00007f773b8fe8ba in client_ftruncate (frame=0x7f771c005400,
this=0x7f77340089c0, fd=0x7f771c0056c0, offset=0, xdata=0x0) at client.c:626
#9  0x00007f774920018e in syncop_ftruncate (subvol=0x7f77340089c0,
fd=0x7f771c0056c0, offset=0, xdata_in=0x0, xdata_out=0x0) at syncop.c:2261
#10 0x00007f773b65da30 in dht_migrate_file (this=0x7f773400dfc0,
loc=0x7f77237fdda0, from=0x7f773400c4d0, to=0x7f77340089c0, flag=2,
fop_errno=0x7f77237fdd1c)
    at dht-rebalance.c:2200
#11 0x00007f773b65ee64 in gf_defrag_migrate_single_file (opaque=0x7f7710022af0)
at dht-rebalance.c:2645
#12 0x00007f773b65f620 in gf_defrag_task (opaque=0x7f773401b430) at
dht-rebalance.c:2812
#13 0x00007f7747fffdc5 in start_thread () from /lib64/libpthread.so.0
#14 0x00007f774794473d in clone () from /lib64/libc.so.6
(gdb) f 3
#3  0x00007f77491dd5d6 in fd_ctx_get (fd=0x7f771c0056c0, xlator=0x7f77340089c0,
value=0x7f77237fbd90) at fd.c:984
984            LOCK (&fd->lock);



(gdb) t 4
[Switching to thread 4 (Thread 0x7f7723fff700 (LWP 27268))]
#0  0x00007f77480061bd in __lll_lock_wait () from /lib64/libpthread.so.0
(gdb) bt
#0  0x00007f77480061bd in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f7748001d02 in _L_lock_791 () from /lib64/libpthread.so.0
#2  0x00007f7748001c08 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007f773b90c765 in client_get_remote_fd (this=0x7f77340089c0,
fd=0x7f77140032c0, flags=1, remote_fd=0x7f7723ffcdb8) at client-helpers.c:303
#4  0x00007f773b940233 in client_pre_writev (this=0x7f77340089c0,
req=0x7f7723ffce30, fd=0x7f77140032c0, size=1048576, offset=4194304, flags=0,
xdata=0x7f7723ffd008)
    at client-common.c:375
#5  0x00007f773b92a71e in client3_3_writev (frame=0x7f7714005810,
this=0x7f77340089c0, data=0x7f7723ffcf40) at client-rpc-fops.c:4361
#6  0x00007f773b90167d in client_writev (frame=0x7f7714005810,
this=0x7f77340089c0, fd=0x7f77140032c0, vector=0x7f7730009260, count=1,
off=4194304, flags=0, 
    iobref=0x7f7730002600, xdata=0x0) at client.c:1036
#7  0x00007f77491fd73e in syncop_writev (subvol=0x7f77340089c0,
fd=0x7f77140032c0, vector=0x7f7730009260, count=1, offset=4194304,
iobref=0x7f7730002600, flags=0, 
    xdata_in=0x0, xdata_out=0x0) at syncop.c:1975
#8  0x00007f773b659f39 in __dht_rebalance_migrate_data (from=0x7f773400c4d0,
to=0x7f77340089c0, src=0x7f7714001af0, dst=0x7f77140032c0, ia_size=5242880,
hole_exists=0, 
    fop_errno=0x7f7723ffed1c) at dht-rebalance.c:1028
#9  0x00007f773b65c247 in dht_migrate_file (this=0x7f773400dfc0,
loc=0x7f7723ffeda0, from=0x7f773400c4d0, to=0x7f77340089c0, flag=2,
fop_errno=0x7f7723ffed1c)
    at dht-rebalance.c:1733
#10 0x00007f773b65ee64 in gf_defrag_migrate_single_file (opaque=0x7f77100220b0)
at dht-rebalance.c:2645
#11 0x00007f773b65f620 in gf_defrag_task (opaque=0x7f773401b430) at
dht-rebalance.c:2812
#12 0x00007f7747fffdc5 in start_thread () from /lib64/libpthread.so.0
#13 0x00007f774794473d in clone () from /lib64/libc.so.6
(gdb) f 3
#3  0x00007f773b90c765 in client_get_remote_fd (this=0x7f77340089c0,
fd=0x7f77140032c0, flags=1, remote_fd=0x7f7723ffcdb8) at client-helpers.c:303
303            pthread_mutex_lock (&conf->lock);



(gdb) t 5
[Switching to thread 5 (Thread 0x7f7738df9700 (LWP 27267))]
#0  0x00007f77480061bd in __lll_lock_wait () from /lib64/libpthread.so.0
(gdb) bt
#0  0x00007f77480061bd in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f7748001d02 in _L_lock_791 () from /lib64/libpthread.so.0
#2  0x00007f7748001c08 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007f773b90c765 in client_get_remote_fd (this=0x7f77340089c0,
fd=0x7f7728005a00, flags=0, remote_fd=0x7f7738df6e78) at client-helpers.c:303
#4  0x00007f773b940782 in client_pre_fsync (this=0x7f77340089c0,
req=0x7f7738df6ef0, fd=0x7f7728005a00, flags=0, xdata=0x0) at
client-common.c:459
#5  0x00007f773b92b0b6 in client3_3_fsync (frame=0x7f77280053b0,
this=0x7f77340089c0, data=0x7f7738df6fe0) at client-rpc-fops.c:4472
#6  0x00007f773b901d16 in client_fsync (frame=0x7f77280053b0,
this=0x7f77340089c0, fd=0x7f7728005a00, flags=0, xdata=0x0) at client.c:1091
#7  0x00007f7749200c3e in syncop_fsync (subvol=0x7f77340089c0,
fd=0x7f7728005a00, dataonly=0, xdata_in=0x0, xdata_out=0x0) at syncop.c:2319
#8  0x00007f773b65c2e3 in dht_migrate_file (this=0x7f773400dfc0,
loc=0x7f7738df8da0, from=0x7f773400c4d0, to=0x7f77340089c0, flag=2,
fop_errno=0x7f7738df8d1c)
    at dht-rebalance.c:1750
#9  0x00007f773b65ee64 in gf_defrag_migrate_single_file (opaque=0x7f7710022340)
at dht-rebalance.c:2645
#10 0x00007f773b65f620 in gf_defrag_task (opaque=0x7f773401b430) at
dht-rebalance.c:2812
#11 0x00007f7747fffdc5 in start_thread () from /lib64/libpthread.so.0
#12 0x00007f774794473d in clone () from /lib64/libc.so.6
(gdb) f 3
#3  0x00007f773b90c765 in client_get_remote_fd (this=0x7f77340089c0,
fd=0x7f7728005a00, flags=0, remote_fd=0x7f7738df6e78) at client-helpers.c:303
303            pthread_mutex_lock (&conf->lock);



Threads 2, 4, and 5 are waiting on conf->lock, which is held by thread 3.
Thread 3 is in turn waiting on fd->lock, but no other thread appears to be
holding fd->lock.



From thread 3:


(gdb) t 3
[Switching to thread 3 (Thread 0x7f77237fe700 (LWP 27269))]
#0  0x00007f77480061bd in __lll_lock_wait () from /lib64/libpthread.so.0
(gdb) bt
#0  0x00007f77480061bd in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f7748001d1d in _L_lock_840 () from /lib64/libpthread.so.0
#2  0x00007f7748001c3a in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007f77491dd5d6 in fd_ctx_get (fd=0x7f771c0056c0, xlator=0x7f77340089c0,
value=0x7f77237fbd90) at fd.c:984
#4  0x00007f773b90bd81 in this_fd_get_ctx (file=0x7f771c0056c0,
this=0x7f77340089c0) at client-helpers.c:73
#5  0x00007f773b90c778 in client_get_remote_fd (this=0x7f77340089c0,
fd=0x7f771c0056c0, flags=0, remote_fd=0x7f77237fbe68) at client-helpers.c:305
#6  0x00007f773b941650 in client_pre_ftruncate (this=0x7f77340089c0,
req=0x7f77237fbee0, fd=0x7f771c0056c0, offset=0, xdata=0x0) at
client-common.c:683
#7  0x00007f773b9264a1 in client3_3_ftruncate (frame=0x7f771c005400,
this=0x7f77340089c0, data=0x7f77237fbfe0) at client-rpc-fops.c:3606
#8  0x00007f773b8fe8ba in client_ftruncate (frame=0x7f771c005400,
this=0x7f77340089c0, fd=0x7f771c0056c0, offset=0, xdata=0x0) at client.c:626
#9  0x00007f774920018e in syncop_ftruncate (subvol=0x7f77340089c0,
fd=0x7f771c0056c0, offset=0, xdata_in=0x0, xdata_out=0x0) at syncop.c:2261
#10 0x00007f773b65da30 in dht_migrate_file (this=0x7f773400dfc0,
loc=0x7f77237fdda0, from=0x7f773400c4d0, to=0x7f77340089c0, flag=2,
fop_errno=0x7f77237fdd1c)
    at dht-rebalance.c:2200
#11 0x00007f773b65ee64 in gf_defrag_migrate_single_file (opaque=0x7f7710022af0)
at dht-rebalance.c:2645
#12 0x00007f773b65f620 in gf_defrag_task (opaque=0x7f773401b430) at
dht-rebalance.c:2812
#13 0x00007f7747fffdc5 in start_thread () from /lib64/libpthread.so.0
#14 0x00007f774794473d in clone () from /lib64/libc.so.6
(gdb) f 10 
#10 0x00007f773b65da30 in dht_migrate_file (this=0x7f773400dfc0,
loc=0x7f77237fdda0, from=0x7f773400c4d0, to=0x7f77340089c0, flag=2,
fop_errno=0x7f77237fdd1c)
    at dht-rebalance.c:2200
2200                    lk_ret = syncop_ftruncate (to, dst_fd, 0, NULL, NULL);
(gdb) p *loc
$17 = {path = 0x7f771c000da0 "/xfile-198", name = 0x7f771c000da1 "xfile-198",
inode = 0x7f771c001050, parent = 0x7f7724003520, 
  gfid = "\261\020w\327\370\251I\205\242,\202\031Z\322#A", pargfid = '\000'
<repeats 15 times>, "\001"}



From the rebalance log:

[2017-07-10 04:45:24.741931] I [dht-rebalance.c:1515:dht_migrate_file]
0-vol1-dht: /xfile-198: attempting to move from vol1-client-2 to vol1-client-1
[2017-07-10 04:45:24.770386] W [MSGID: 0]
[dht-rebalance.c:926:__dht_check_free_space] 0-vol1-dht: Write will cross
min-free-disk for file - /xfile-198 on subvol - vol1-client-1. Looking for
new subvol
[2017-07-10 04:45:24.770428] I [MSGID: 0]
[dht-rebalance.c:985:__dht_check_free_space] 0-vol1-dht: new target found -
vol1-client-0 for file - /xfile-198
[2017-07-10 04:45:24.778057] W [MSGID: 114031]
[client-rpc-fops.c:2004:client3_3_fallocate_cbk] 0-vol1-client-0: remote
operation failed [No space left on device]
[2017-07-10 04:45:24.778094] E [MSGID: 109023]
[dht-rebalance.c:789:__dht_rebalance_create_dst_file] 0-vol1-dht: fallocate
failed for /xfile-198 on vol1-client-0 (No space left on device)
[2017-07-10 04:45:24.778122] E [dht-rebalance.c:1670:dht_migrate_file]
0-vol1-dht: Create dst failed on - vol1-client-0 for file - /xfile-198


These are the last messages logged.

From the dht_migrate_file code:


dht_migrate_file () {

...

        /* create the destination, with required modes/xattr */
        ret = __dht_rebalance_create_dst_file (this, to, from, loc, &stbuf,
                                               &dst_fd, xattr, fop_errno);
        if (ret) {
                gf_msg (this->name, GF_LOG_ERROR, 0, 0, "Create dst failed"
                        " on - %s for file - %s", to->name, loc->path);
                goto out;
        }

        clean_dst = _gf_true;   <-- the dst file will be cleaned up

        ret = __dht_check_free_space (this, to, from, loc, &stbuf, flag, conf,
                                      &target_changed, &new_target,
                                      &ignore_failure, fop_errno);
        if (target_changed) {
                /* Can't handle for hardlinks. Marking this as failure */
                if (flag == GF_DHT_MIGRATE_HARDLINK_IN_PROGRESS ||
                    stbuf.ia_nlink > 1) {
                        gf_msg (this->name, GF_LOG_ERROR, 0,
                                DHT_MSG_SUBVOL_INSUFF_SPACE,
                                "Exiting migration for"
                                " file - %s. flag - %d, stbuf.ia_nlink - %d",
                                loc->path,  flag, stbuf.ia_nlink);
                        ret = -1;
                        goto out;
                }


                ret = syncop_ftruncate (to, dst_fd, 0, NULL, NULL);
                if (ret) {
                        gf_log (this->name, GF_LOG_WARNING,
                                "%s: failed to perform truncate on %s (%s)",
                                loc->path, to->name, strerror (-ret));
                        ret = -1;
                }

                syncop_close (dst_fd);      <-- this is now an invalid fd for the dst cleanup

                old_target = to;
                to = new_target;

                /* if the file migration is successful to this new target, then
                 * update the xattr on the old destination to point the new
                 * destination. We need to do update this only post migration
                 * as in case of failure the linkto needs to point to the source
                 * subvol */
                ret = __dht_rebalance_create_dst_file (this, to, from, loc,
                                                       &stbuf,
                                                       &dst_fd, xattr,
                                                       fop_errno);
                if (ret) {
                        gf_log (this->name, GF_LOG_ERROR, "Create dst failed"
                                " on - %s for file - %s", to->name, loc->path);
                        goto out;

<<< If the create fails here, clean_dst is still set to true but dst_fd is
invalid (it was closed above), causing the hang on fd->lock >>>

                } else {
                        gf_msg (this->name, GF_LOG_INFO, 0, 0,
                                "destination for file "
                                "- %s is changed to - %s", loc->path,
                                to->name);
                }
        }
...
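
The failure path described above can be modelled in a few lines of standalone
C. This is only a sketch under stated assumptions, not the actual patch:
sketch_fd_t, sketch_close, sketch_cleanup_dst, sketch_migrate and
sketch_self_check are hypothetical names, the fd is reduced to a single "open"
flag, and the point where the real process blocks on fd->lock is replaced by
a counter of cleanup attempts on an already closed fd.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for an fd; 'open' models whether it is still
 * valid. The real fd structure and its lock are not reproduced here. */
typedef struct {
        bool open;
} sketch_fd_t;

/* Counts cleanup attempts made on an fd that was already closed. In the
 * real process this is where the thread blocked on fd->lock. */
static int stale_cleanups;

static void
sketch_close (sketch_fd_t *fd)
{
        fd->open = false;
}

/* Models the destination cleanup done at 'out' when clean_dst is set. */
static void
sketch_cleanup_dst (sketch_fd_t *fd)
{
        if (!fd->open) {
                stale_cleanups++;
                return;
        }
        sketch_close (fd);
}

/* Models only the target-changed branch followed by a failure.
 * Returns the number of cleanups attempted on a closed fd. */
int
sketch_migrate (bool second_create_fails, bool clear_flag_on_change)
{
        sketch_fd_t  old_dst   = { .open = true };  /* first create succeeded */
        sketch_fd_t  new_dst   = { .open = false };
        sketch_fd_t *dst_fd    = &old_dst;
        bool         clean_dst = true;

        stale_cleanups = 0;

        /* Target changed: the old destination fd is closed. */
        sketch_close (dst_fd);
        if (clear_flag_on_change)
                clean_dst = false;      /* nothing valid left to clean up */

        if (second_create_fails)
                goto out;               /* dst_fd still points at the closed fd */

        /* Second create succeeded: there is a valid destination again. */
        new_dst.open = true;
        dst_fd = &new_dst;
        clean_dst = true;

        /* A later migration step is assumed to fail here, so the new
         * destination has to be cleaned up. */
out:
        if (clean_dst)
                sketch_cleanup_dst (dst_fd);

        return stale_cleanups;
}

/* Usage: returns 0 when all three expectations hold. */
int
sketch_self_check (void)
{
        assert (sketch_migrate (true, false) == 1);  /* bug: closed fd is cleaned up */
        assert (sketch_migrate (true, true) == 0);   /* flag cleared: cleanup skipped */
        assert (sketch_migrate (false, true) == 0);  /* valid new fd is cleaned up */
        return 0;
}
```

With clear_flag_on_change set to false, the sketch reaches the cleanup with the
closed fd, which corresponds to the hang seen in thread 3. Clearing the flag
when the target changes, and setting it again only after the second create
succeeds, avoids touching the closed fd in this model.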

--- Additional comment from Worker Ant on 2017-07-10 06:17:54 EDT ---

REVIEW: https://review.gluster.org/17735 (cluster/dht: Clear clean_dst flag on
target change) posted (#1) for review on master by N Balachandran
(nbalacha at redhat.com)


Referenced Bugs:

https://bugzilla.redhat.com/show_bug.cgi?id=1469029
[Bug 1469029] Rebalance hangs on remove-brick if the target volume changes