[Bugs] [Bug 1469459] New: Rebalance hangs on remove-brick if the target volume changes

bugzilla at redhat.com bugzilla at redhat.com
Tue Jul 11 10:06:11 UTC 2017


https://bugzilla.redhat.com/show_bug.cgi?id=1469459

            Bug ID: 1469459
           Summary: Rebalance hangs on remove-brick if the target volume
                    changes
           Product: GlusterFS
           Version: 3.11
         Component: distribute
          Assignee: bugs at gluster.org
          Reporter: nbalacha at redhat.com
                CC: bugs at gluster.org
        Depends On: 1469029
            Blocks: 1469041



+++ This bug was initially created as a clone of Bug #1469029 +++

Description of problem:
The rebalance process hangs on a remove-brick operation if the original hashed
subvol fails the min-free-disk check and a different target is selected.

Version-Release number of selected component (if applicable):


How reproducible:
Consistently

Steps to Reproduce:
1. Create a 1x3 distribute volume with 1 GB bricks
2. Create enough 5MB files on the volume such that no 2 bricks can accommodate
all the files.
3. Run a remove-brick to remove one of the bricks

Actual results:

The rebalance hangs.

Expected results:
The rebalance process should terminate once all files are processed.

Additional info:

--- Additional comment from Nithya Balachandran on 2017-07-10 06:13:12 EDT ---

RCA:


>From gdb:


Thread 13 (Thread 0x7f7740823700 (LWP 27258)):
#0  0x00007f7748006bdd in nanosleep () from /lib64/libpthread.so.0
#1  0x00007f77491bd8cb in gf_timer_proc (data=0xd0af80) at timer.c:176
#2  0x00007f7747fffdc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f774794473d in clone () from /lib64/libc.so.6

Thread 12 (Thread 0x7f7740022700 (LWP 27259)):
#0  0x00007f7748007101 in sigwait () from /lib64/libpthread.so.0
#1  0x0000000000409e72 in glusterfs_sigwaiter (arg=0x7fff5bfe2da0) at
glusterfsd.c:2069
#2  0x00007f7747fffdc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f774794473d in clone () from /lib64/libc.so.6

Thread 11 (Thread 0x7f773f821700 (LWP 27260)):
#0  0x00007f774790b66d in nanosleep () from /lib64/libc.so.6
#1  0x00007f774790b504 in sleep () from /lib64/libc.so.6
#2  0x00007f77491dff7b in pool_sweeper (arg=0x0) at mem-pool.c:464
#3  0x00007f7747fffdc5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f774794473d in clone () from /lib64/libc.so.6

Thread 10 (Thread 0x7f773f020700 (LWP 27261)):
#0  0x00007f7748003a82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from
/lib64/libpthread.so.0
#1  0x00007f77491f54ab in syncenv_task (proc=0xd0b7d0) at syncop.c:603
#2  0x00007f77491f5746 in syncenv_processor (thdata=0xd0b7d0) at syncop.c:695
#3  0x00007f7747fffdc5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f774794473d in clone () from /lib64/libc.so.6

Thread 9 (Thread 0x7f773e81f700 (LWP 27262)):
#0  0x00007f7748000ef7 in pthread_join () from /lib64/libpthread.so.0
#1  0x00007f773b66410f in gf_defrag_start_crawl (data=0x7f773400dfc0) at
dht-rebalance.c:4479
#2  0x00007f77491f4c7a in synctask_wrap (old_task=0x7f7724001400) at
syncop.c:375
#3  0x00007f7747893cf0 in ?? () from /lib64/libc.so.6
#4  0x0000000000000000 in ?? ()

Thread 8 (Thread 0x7f773c357700 (LWP 27263)):
#0  0x00007f7747944d13 in epoll_wait () from /lib64/libc.so.6
#1  0x00007f774921b1ef in event_dispatch_epoll_worker (data=0xd49290) at
event-epoll.c:638
#2  0x00007f7747fffdc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f774794473d in clone () from /lib64/libc.so.6

Thread 7 (Thread 0x7f773ab14700 (LWP 27264)):
---Type <return> to continue, or q <return> to quit---
#0  0x00007f7747944d13 in epoll_wait () from /lib64/libc.so.6
#1  0x00007f774921b1ef in event_dispatch_epoll_worker (data=0x7f773401c720) at
event-epoll.c:638
#2  0x00007f7747fffdc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f774794473d in clone () from /lib64/libc.so.6

Thread 6 (Thread 0x7f77395fa700 (LWP 27266)):
#0  0x00007f774790b66d in nanosleep () from /lib64/libc.so.6
#1  0x00007f774790b504 in sleep () from /lib64/libc.so.6
#2  0x00007f773b6632af in dht_file_counter_thread (args=0x7f773401b430) at
dht-rebalance.c:4158
#3  0x00007f7747fffdc5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f774794473d in clone () from /lib64/libc.so.6

Thread 5 (Thread 0x7f7738df9700 (LWP 27267)):
#0  0x00007f77480061bd in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f7748001d02 in _L_lock_791 () from /lib64/libpthread.so.0
#2  0x00007f7748001c08 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007f773b90c765 in client_get_remote_fd (this=0x7f77340089c0,
fd=0x7f7728005a00, flags=0, remote_fd=0x7f7738df6e78) at client-helpers.c:303
#4  0x00007f773b940782 in client_pre_fsync (this=0x7f77340089c0,
req=0x7f7738df6ef0, fd=0x7f7728005a00, flags=0, xdata=0x0) at
client-common.c:459
#5  0x00007f773b92b0b6 in client3_3_fsync (frame=0x7f77280053b0,
this=0x7f77340089c0, data=0x7f7738df6fe0) at client-rpc-fops.c:4472
#6  0x00007f773b901d16 in client_fsync (frame=0x7f77280053b0,
this=0x7f77340089c0, fd=0x7f7728005a00, flags=0, xdata=0x0) at client.c:1091
#7  0x00007f7749200c3e in syncop_fsync (subvol=0x7f77340089c0,
fd=0x7f7728005a00, dataonly=0, xdata_in=0x0, xdata_out=0x0) at syncop.c:2319
#8  0x00007f773b65c2e3 in dht_migrate_file (this=0x7f773400dfc0,
loc=0x7f7738df8da0, from=0x7f773400c4d0, to=0x7f77340089c0, flag=2,
fop_errno=0x7f7738df8d1c)
    at dht-rebalance.c:1750
#9  0x00007f773b65ee64 in gf_defrag_migrate_single_file (opaque=0x7f7710022340)
at dht-rebalance.c:2645
#10 0x00007f773b65f620 in gf_defrag_task (opaque=0x7f773401b430) at
dht-rebalance.c:2812
#11 0x00007f7747fffdc5 in start_thread () from /lib64/libpthread.so.0
#12 0x00007f774794473d in clone () from /lib64/libc.so.6

Thread 4 (Thread 0x7f7723fff700 (LWP 27268)):
#0  0x00007f77480061bd in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f7748001d02 in _L_lock_791 () from /lib64/libpthread.so.0
#2  0x00007f7748001c08 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007f773b90c765 in client_get_remote_fd (this=0x7f77340089c0,
fd=0x7f77140032c0, flags=1, remote_fd=0x7f7723ffcdb8) at client-helpers.c:303
#4  0x00007f773b940233 in client_pre_writev (this=0x7f77340089c0,
req=0x7f7723ffce30, fd=0x7f77140032c0, size=1048576, offset=4194304, flags=0,
xdata=0x7f7723ffd008)
    at client-common.c:375
#5  0x00007f773b92a71e in client3_3_writev (frame=0x7f7714005810,
this=0x7f77340089c0, data=0x7f7723ffcf40) at client-rpc-fops.c:4361
#6  0x00007f773b90167d in client_writev (frame=0x7f7714005810,
this=0x7f77340089c0, fd=0x7f77140032c0, vector=0x7f7730009260, count=1,
off=4194304, flags=0, 
    iobref=0x7f7730002600, xdata=0x0) at client.c:1036
#7  0x00007f77491fd73e in syncop_writev (subvol=0x7f77340089c0,
fd=0x7f77140032c0, vector=0x7f7730009260, count=1, offset=4194304,
iobref=0x7f7730002600, flags=0, 
    xdata_in=0x0, xdata_out=0x0) at syncop.c:1975
#8  0x00007f773b659f39 in __dht_rebalance_migrate_data (from=0x7f773400c4d0,
to=0x7f77340089c0, src=0x7f7714001af0, dst=0x7f77140032c0, ia_size=5242880,
hole_exists=0, 
---Type <return> to continue, or q <return> to quit---
    fop_errno=0x7f7723ffed1c) at dht-rebalance.c:1028
#9  0x00007f773b65c247 in dht_migrate_file (this=0x7f773400dfc0,
loc=0x7f7723ffeda0, from=0x7f773400c4d0, to=0x7f77340089c0, flag=2,
fop_errno=0x7f7723ffed1c)
    at dht-rebalance.c:1733
#10 0x00007f773b65ee64 in gf_defrag_migrate_single_file (opaque=0x7f77100220b0)
at dht-rebalance.c:2645
#11 0x00007f773b65f620 in gf_defrag_task (opaque=0x7f773401b430) at
dht-rebalance.c:2812
#12 0x00007f7747fffdc5 in start_thread () from /lib64/libpthread.so.0
#13 0x00007f774794473d in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x7f77237fe700 (LWP 27269)):
#0  0x00007f77480061bd in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f7748001d1d in _L_lock_840 () from /lib64/libpthread.so.0
#2  0x00007f7748001c3a in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007f77491dd5d6 in fd_ctx_get (fd=0x7f771c0056c0, xlator=0x7f77340089c0,
value=0x7f77237fbd90) at fd.c:984
#4  0x00007f773b90bd81 in this_fd_get_ctx (file=0x7f771c0056c0,
this=0x7f77340089c0) at client-helpers.c:73
#5  0x00007f773b90c778 in client_get_remote_fd (this=0x7f77340089c0,
fd=0x7f771c0056c0, flags=0, remote_fd=0x7f77237fbe68) at client-helpers.c:305
#6  0x00007f773b941650 in client_pre_ftruncate (this=0x7f77340089c0,
req=0x7f77237fbee0, fd=0x7f771c0056c0, offset=0, xdata=0x0) at
client-common.c:683
#7  0x00007f773b9264a1 in client3_3_ftruncate (frame=0x7f771c005400,
this=0x7f77340089c0, data=0x7f77237fbfe0) at client-rpc-fops.c:3606
#8  0x00007f773b8fe8ba in client_ftruncate (frame=0x7f771c005400,
this=0x7f77340089c0, fd=0x7f771c0056c0, offset=0, xdata=0x0) at client.c:626
#9  0x00007f774920018e in syncop_ftruncate (subvol=0x7f77340089c0,
fd=0x7f771c0056c0, offset=0, xdata_in=0x0, xdata_out=0x0) at syncop.c:2261
#10 0x00007f773b65da30 in dht_migrate_file (this=0x7f773400dfc0,
loc=0x7f77237fdda0, from=0x7f773400c4d0, to=0x7f77340089c0, flag=2,
fop_errno=0x7f77237fdd1c)
    at dht-rebalance.c:2200
#11 0x00007f773b65ee64 in gf_defrag_migrate_single_file (opaque=0x7f7710022af0)
at dht-rebalance.c:2645
#12 0x00007f773b65f620 in gf_defrag_task (opaque=0x7f773401b430) at
dht-rebalance.c:2812
#13 0x00007f7747fffdc5 in start_thread () from /lib64/libpthread.so.0
#14 0x00007f774794473d in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x7f771affd700 (LWP 27270)):
#0  0x00007f77480061bd in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f7748001d02 in _L_lock_791 () from /lib64/libpthread.so.0
#2  0x00007f7748001c08 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007f773b90c765 in client_get_remote_fd (this=0x7f77340089c0,
fd=0x7f770c001050, flags=1, remote_fd=0x7f771affadb8) at client-helpers.c:303
#4  0x00007f773b940233 in client_pre_writev (this=0x7f77340089c0,
req=0x7f771affae30, fd=0x7f770c001050, size=1048576, offset=4194304, flags=0,
xdata=0x7f771affb008)
    at client-common.c:375
#5  0x00007f773b92a71e in client3_3_writev (frame=0x7f770c002c80,
this=0x7f77340089c0, data=0x7f771affaf40) at client-rpc-fops.c:4361
#6  0x00007f773b90167d in client_writev (frame=0x7f770c002c80,
this=0x7f77340089c0, fd=0x7f770c001050, vector=0x7f77300022a0, count=1,
off=4194304, flags=0, 
    iobref=0x7f773000e800, xdata=0x0) at client.c:1036
#7  0x00007f77491fd73e in syncop_writev (subvol=0x7f77340089c0,
fd=0x7f770c001050, vector=0x7f77300022a0, count=1, offset=4194304,
iobref=0x7f773000e800, flags=0, 
    xdata_in=0x0, xdata_out=0x0) at syncop.c:1975
#8  0x00007f773b659f39 in __dht_rebalance_migrate_data (from=0x7f773400c4d0,
to=0x7f77340089c0, src=0x7f770c005820, dst=0x7f770c001050, ia_size=5242880,
hole_exists=0, 
    fop_errno=0x7f771affcd1c) at dht-rebalance.c:1028
#9  0x00007f773b65c247 in dht_migrate_file (this=0x7f773400dfc0,
loc=0x7f771affcda0, from=0x7f773400c4d0, to=0x7f77340089c0, flag=2,
fop_errno=0x7f771affcd1c)
---Type <return> to continue, or q <return> to quit---
    at dht-rebalance.c:1733
#10 0x00007f773b65ee64 in gf_defrag_migrate_single_file (opaque=0x7f77100225d0)
at dht-rebalance.c:2645
#11 0x00007f773b65f620 in gf_defrag_task (opaque=0x7f773401b430) at
dht-rebalance.c:2812
#12 0x00007f7747fffdc5 in start_thread () from /lib64/libpthread.so.0
#13 0x00007f774794473d in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7f774969a780 (LWP 27257)):
#0  0x00007f7748000ef7 in pthread_join () from /lib64/libpthread.so.0
#1  0x00007f774921b446 in event_dispatch_epoll (event_pool=0xd01f70) at
event-epoll.c:732
#2  0x00007f77491de754 in event_dispatch (event_pool=0xd01f70) at event.c:124
#3  0x000000000040ab6a in main (argc=31, argv=0x7fff5bfe3ff8) at
glusterfsd.c:2479


There are 4 threads (2,3,4,5) which are hung.

(gdb) t 2
[Switching to thread 2 (Thread 0x7f771affd700 (LWP 27270))]
#0  0x00007f77480061bd in __lll_lock_wait () from /lib64/libpthread.so.0
(gdb) bt
#0  0x00007f77480061bd in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f7748001d02 in _L_lock_791 () from /lib64/libpthread.so.0
#2  0x00007f7748001c08 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007f773b90c765 in client_get_remote_fd (this=0x7f77340089c0,
fd=0x7f770c001050, flags=1, remote_fd=0x7f771affadb8) at client-helpers.c:303
#4  0x00007f773b940233 in client_pre_writev (this=0x7f77340089c0,
req=0x7f771affae30, fd=0x7f770c001050, size=1048576, offset=4194304, flags=0,
xdata=0x7f771affb008)
    at client-common.c:375
#5  0x00007f773b92a71e in client3_3_writev (frame=0x7f770c002c80,
this=0x7f77340089c0, data=0x7f771affaf40) at client-rpc-fops.c:4361
#6  0x00007f773b90167d in client_writev (frame=0x7f770c002c80,
this=0x7f77340089c0, fd=0x7f770c001050, vector=0x7f77300022a0, count=1,
off=4194304, flags=0, 
    iobref=0x7f773000e800, xdata=0x0) at client.c:1036
#7  0x00007f77491fd73e in syncop_writev (subvol=0x7f77340089c0,
fd=0x7f770c001050, vector=0x7f77300022a0, count=1, offset=4194304,
iobref=0x7f773000e800, flags=0, 
    xdata_in=0x0, xdata_out=0x0) at syncop.c:1975
#8  0x00007f773b659f39 in __dht_rebalance_migrate_data (from=0x7f773400c4d0,
to=0x7f77340089c0, src=0x7f770c005820, dst=0x7f770c001050, ia_size=5242880,
hole_exists=0, 
    fop_errno=0x7f771affcd1c) at dht-rebalance.c:1028
#9  0x00007f773b65c247 in dht_migrate_file (this=0x7f773400dfc0,
loc=0x7f771affcda0, from=0x7f773400c4d0, to=0x7f77340089c0, flag=2,
fop_errno=0x7f771affcd1c)
    at dht-rebalance.c:1733
#10 0x00007f773b65ee64 in gf_defrag_migrate_single_file (opaque=0x7f77100225d0)
at dht-rebalance.c:2645
#11 0x00007f773b65f620 in gf_defrag_task (opaque=0x7f773401b430) at
dht-rebalance.c:2812
#12 0x00007f7747fffdc5 in start_thread () from /lib64/libpthread.so.0
#13 0x00007f774794473d in clone () from /lib64/libc.so.6
(gdb) f 3
#3  0x00007f773b90c765 in client_get_remote_fd (this=0x7f77340089c0,
fd=0x7f770c001050, flags=1, remote_fd=0x7f771affadb8) at client-helpers.c:303
303            pthread_mutex_lock (&conf->lock);



(gdb) t 3
[Switching to thread 3 (Thread 0x7f77237fe700 (LWP 27269))]
#0  0x00007f77480061bd in __lll_lock_wait () from /lib64/libpthread.so.0
(gdb) bt
#0  0x00007f77480061bd in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f7748001d1d in _L_lock_840 () from /lib64/libpthread.so.0
#2  0x00007f7748001c3a in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007f77491dd5d6 in fd_ctx_get (fd=0x7f771c0056c0, xlator=0x7f77340089c0,
value=0x7f77237fbd90) at fd.c:984
#4  0x00007f773b90bd81 in this_fd_get_ctx (file=0x7f771c0056c0,
this=0x7f77340089c0) at client-helpers.c:73
#5  0x00007f773b90c778 in client_get_remote_fd (this=0x7f77340089c0,
fd=0x7f771c0056c0, flags=0, remote_fd=0x7f77237fbe68) at client-helpers.c:305
#6  0x00007f773b941650 in client_pre_ftruncate (this=0x7f77340089c0,
req=0x7f77237fbee0, fd=0x7f771c0056c0, offset=0, xdata=0x0) at
client-common.c:683
#7  0x00007f773b9264a1 in client3_3_ftruncate (frame=0x7f771c005400,
this=0x7f77340089c0, data=0x7f77237fbfe0) at client-rpc-fops.c:3606
#8  0x00007f773b8fe8ba in client_ftruncate (frame=0x7f771c005400,
this=0x7f77340089c0, fd=0x7f771c0056c0, offset=0, xdata=0x0) at client.c:626
#9  0x00007f774920018e in syncop_ftruncate (subvol=0x7f77340089c0,
fd=0x7f771c0056c0, offset=0, xdata_in=0x0, xdata_out=0x0) at syncop.c:2261
#10 0x00007f773b65da30 in dht_migrate_file (this=0x7f773400dfc0,
loc=0x7f77237fdda0, from=0x7f773400c4d0, to=0x7f77340089c0, flag=2,
fop_errno=0x7f77237fdd1c)
    at dht-rebalance.c:2200
#11 0x00007f773b65ee64 in gf_defrag_migrate_single_file (opaque=0x7f7710022af0)
at dht-rebalance.c:2645
#12 0x00007f773b65f620 in gf_defrag_task (opaque=0x7f773401b430) at
dht-rebalance.c:2812
#13 0x00007f7747fffdc5 in start_thread () from /lib64/libpthread.so.0
#14 0x00007f774794473d in clone () from /lib64/libc.so.6
(gdb) f 3
#3  0x00007f77491dd5d6 in fd_ctx_get (fd=0x7f771c0056c0, xlator=0x7f77340089c0,
value=0x7f77237fbd90) at fd.c:984
984            LOCK (&fd->lock);



(gdb) t 4
[Switching to thread 4 (Thread 0x7f7723fff700 (LWP 27268))]
#0  0x00007f77480061bd in __lll_lock_wait () from /lib64/libpthread.so.0
(gdb) bt
#0  0x00007f77480061bd in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f7748001d02 in _L_lock_791 () from /lib64/libpthread.so.0
#2  0x00007f7748001c08 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007f773b90c765 in client_get_remote_fd (this=0x7f77340089c0,
fd=0x7f77140032c0, flags=1, remote_fd=0x7f7723ffcdb8) at client-helpers.c:303
#4  0x00007f773b940233 in client_pre_writev (this=0x7f77340089c0,
req=0x7f7723ffce30, fd=0x7f77140032c0, size=1048576, offset=4194304, flags=0,
xdata=0x7f7723ffd008)
    at client-common.c:375
#5  0x00007f773b92a71e in client3_3_writev (frame=0x7f7714005810,
this=0x7f77340089c0, data=0x7f7723ffcf40) at client-rpc-fops.c:4361
#6  0x00007f773b90167d in client_writev (frame=0x7f7714005810,
this=0x7f77340089c0, fd=0x7f77140032c0, vector=0x7f7730009260, count=1,
off=4194304, flags=0, 
    iobref=0x7f7730002600, xdata=0x0) at client.c:1036
#7  0x00007f77491fd73e in syncop_writev (subvol=0x7f77340089c0,
fd=0x7f77140032c0, vector=0x7f7730009260, count=1, offset=4194304,
iobref=0x7f7730002600, flags=0, 
    xdata_in=0x0, xdata_out=0x0) at syncop.c:1975
#8  0x00007f773b659f39 in __dht_rebalance_migrate_data (from=0x7f773400c4d0,
to=0x7f77340089c0, src=0x7f7714001af0, dst=0x7f77140032c0, ia_size=5242880,
hole_exists=0, 
    fop_errno=0x7f7723ffed1c) at dht-rebalance.c:1028
#9  0x00007f773b65c247 in dht_migrate_file (this=0x7f773400dfc0,
loc=0x7f7723ffeda0, from=0x7f773400c4d0, to=0x7f77340089c0, flag=2,
fop_errno=0x7f7723ffed1c)
    at dht-rebalance.c:1733
#10 0x00007f773b65ee64 in gf_defrag_migrate_single_file (opaque=0x7f77100220b0)
at dht-rebalance.c:2645
#11 0x00007f773b65f620 in gf_defrag_task (opaque=0x7f773401b430) at
dht-rebalance.c:2812
#12 0x00007f7747fffdc5 in start_thread () from /lib64/libpthread.so.0
#13 0x00007f774794473d in clone () from /lib64/libc.so.6
(gdb) f 3
#3  0x00007f773b90c765 in client_get_remote_fd (this=0x7f77340089c0,
fd=0x7f77140032c0, flags=1, remote_fd=0x7f7723ffcdb8) at client-helpers.c:303
303            pthread_mutex_lock (&conf->lock);



(gdb) t 5
[Switching to thread 5 (Thread 0x7f7738df9700 (LWP 27267))]
#0  0x00007f77480061bd in __lll_lock_wait () from /lib64/libpthread.so.0
(gdb) bt
#0  0x00007f77480061bd in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f7748001d02 in _L_lock_791 () from /lib64/libpthread.so.0
#2  0x00007f7748001c08 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007f773b90c765 in client_get_remote_fd (this=0x7f77340089c0,
fd=0x7f7728005a00, flags=0, remote_fd=0x7f7738df6e78) at client-helpers.c:303
#4  0x00007f773b940782 in client_pre_fsync (this=0x7f77340089c0,
req=0x7f7738df6ef0, fd=0x7f7728005a00, flags=0, xdata=0x0) at
client-common.c:459
#5  0x00007f773b92b0b6 in client3_3_fsync (frame=0x7f77280053b0,
this=0x7f77340089c0, data=0x7f7738df6fe0) at client-rpc-fops.c:4472
#6  0x00007f773b901d16 in client_fsync (frame=0x7f77280053b0,
this=0x7f77340089c0, fd=0x7f7728005a00, flags=0, xdata=0x0) at client.c:1091
#7  0x00007f7749200c3e in syncop_fsync (subvol=0x7f77340089c0,
fd=0x7f7728005a00, dataonly=0, xdata_in=0x0, xdata_out=0x0) at syncop.c:2319
#8  0x00007f773b65c2e3 in dht_migrate_file (this=0x7f773400dfc0,
loc=0x7f7738df8da0, from=0x7f773400c4d0, to=0x7f77340089c0, flag=2,
fop_errno=0x7f7738df8d1c)
    at dht-rebalance.c:1750
#9  0x00007f773b65ee64 in gf_defrag_migrate_single_file (opaque=0x7f7710022340)
at dht-rebalance.c:2645
#10 0x00007f773b65f620 in gf_defrag_task (opaque=0x7f773401b430) at
dht-rebalance.c:2812
#11 0x00007f7747fffdc5 in start_thread () from /lib64/libpthread.so.0
#12 0x00007f774794473d in clone () from /lib64/libc.so.6
(gdb) f 3
#3  0x00007f773b90c765 in client_get_remote_fd (this=0x7f77340089c0,
fd=0x7f7728005a00, flags=0, remote_fd=0x7f7738df6e78) at client-helpers.c:303
303            pthread_mutex_lock (&conf->lock);



Threads 2, 4 and 5 are waiting on conf->lock which is held by thread 3.
Thread 3 is waiting on fd->lock. However, it does not look like any other
thread is holding fd->lock.



>From thread 3:


(gdb) t 3
[Switching to thread 3 (Thread 0x7f77237fe700 (LWP 27269))]
#0  0x00007f77480061bd in __lll_lock_wait () from /lib64/libpthread.so.0
(gdb) bt
#0  0x00007f77480061bd in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f7748001d1d in _L_lock_840 () from /lib64/libpthread.so.0
#2  0x00007f7748001c3a in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007f77491dd5d6 in fd_ctx_get (fd=0x7f771c0056c0, xlator=0x7f77340089c0,
value=0x7f77237fbd90) at fd.c:984
#4  0x00007f773b90bd81 in this_fd_get_ctx (file=0x7f771c0056c0,
this=0x7f77340089c0) at client-helpers.c:73
#5  0x00007f773b90c778 in client_get_remote_fd (this=0x7f77340089c0,
fd=0x7f771c0056c0, flags=0, remote_fd=0x7f77237fbe68) at client-helpers.c:305
#6  0x00007f773b941650 in client_pre_ftruncate (this=0x7f77340089c0,
req=0x7f77237fbee0, fd=0x7f771c0056c0, offset=0, xdata=0x0) at
client-common.c:683
#7  0x00007f773b9264a1 in client3_3_ftruncate (frame=0x7f771c005400,
this=0x7f77340089c0, data=0x7f77237fbfe0) at client-rpc-fops.c:3606
#8  0x00007f773b8fe8ba in client_ftruncate (frame=0x7f771c005400,
this=0x7f77340089c0, fd=0x7f771c0056c0, offset=0, xdata=0x0) at client.c:626
#9  0x00007f774920018e in syncop_ftruncate (subvol=0x7f77340089c0,
fd=0x7f771c0056c0, offset=0, xdata_in=0x0, xdata_out=0x0) at syncop.c:2261
#10 0x00007f773b65da30 in dht_migrate_file (this=0x7f773400dfc0,
loc=0x7f77237fdda0, from=0x7f773400c4d0, to=0x7f77340089c0, flag=2,
fop_errno=0x7f77237fdd1c)
    at dht-rebalance.c:2200
#11 0x00007f773b65ee64 in gf_defrag_migrate_single_file (opaque=0x7f7710022af0)
at dht-rebalance.c:2645
#12 0x00007f773b65f620 in gf_defrag_task (opaque=0x7f773401b430) at
dht-rebalance.c:2812
#13 0x00007f7747fffdc5 in start_thread () from /lib64/libpthread.so.0
#14 0x00007f774794473d in clone () from /lib64/libc.so.6
(gdb) f 10 
#10 0x00007f773b65da30 in dht_migrate_file (this=0x7f773400dfc0,
loc=0x7f77237fdda0, from=0x7f773400c4d0, to=0x7f77340089c0, flag=2,
fop_errno=0x7f77237fdd1c)
    at dht-rebalance.c:2200
2200                    lk_ret = syncop_ftruncate (to, dst_fd, 0, NULL, NULL);
(gdb) p *loc
$17 = {path = 0x7f771c000da0 "/xfile-198", name = 0x7f771c000da1 "xfile-198",
inode = 0x7f771c001050, parent = 0x7f7724003520, 
  gfid = "\261\020w\327\370\251I\205\242,\202\031Z\322#A", pargfid = '\000'
<repeats 15 times>, "\001"}



>From the rebalance log:

2990 [2017-07-10 04:45:24.741931] I [dht-rebalance.c:1515:dht_migrate_file]
0-vol1-dht: /xfile-198: attempting to move from vol1-client-2 to vol1-client-1
2991 [2017-07-10 04:45:24.770386] W [MSGID: 0]
[dht-rebalance.c:926:__dht_check_free_space] 0-vol1-dht: Write will cross
min-free-disk for file - /xfile-198 on subvol - vol1     -client-1. Looking for
new subvol
2992 [2017-07-10 04:45:24.770428] I [MSGID: 0]
[dht-rebalance.c:985:__dht_check_free_space] 0-vol1-dht: new target found -
vol1-client-0 for file - /xfile-198
2993 [2017-07-10 04:45:24.778057] W [MSGID: 114031]
[client-rpc-fops.c:2004:client3_3_fallocate_cbk] 0-vol1-client-0: remote
operation failed [No space left on device]
2994 [2017-07-10 04:45:24.778094] E [MSGID: 109023]
[dht-rebalance.c:789:__dht_rebalance_create_dst_file] 0-vol1-dht: fallocate
failed for /xfile-198 on vol1-client-0 (No sp     ace left on device)
2995 [2017-07-10 04:45:24.778122] E [dht-rebalance.c:1670:dht_migrate_file]
0-vol1-dht: Create dst failed on - vol1-client-0 for file - /xfile-198


These are the last messages logged.

>From the dht_migrate_file code:


dht_migrate_file () {

...

        /* create the destination, with required modes/xattr */                 
        ret = __dht_rebalance_create_dst_file (this, to, from, loc, &stbuf,     
                                               &dst_fd, xattr, fop_errno);      
        if (ret) {                                                              
                gf_msg (this->name, GF_LOG_ERROR, 0, 0, "Create dst failed"     
                        " on - %s for file - %s", to->name, loc->path);         
                goto out;                                                       
        }                                                                       

        clean_dst = _gf_true;   <-- the dst file will be cleaned up             

        ret = __dht_check_free_space (this, to, from, loc, &stbuf, flag, conf,  
                                      &target_changed, &new_target,
&ignore_failure, fop_errno);
        if (target_changed) {                                                   
                /* Can't handle for hardlinks. Marking this as failure */       
                if (flag == GF_DHT_MIGRATE_HARDLINK_IN_PROGRESS ||
stbuf.ia_nlink > 1) { 
                        gf_msg (this->name, GF_LOG_ERROR, 0,                    
                                DHT_MSG_SUBVOL_INSUFF_SPACE, "Exiting migration
for"
                                " file - %s. flag - %d, stbuf.ia_nlink - %d",   
                               loc->path,  flag, stbuf.ia_nlink);               
                        ret = -1;                                               
                        goto out;                                               
                }                                                               


                ret = syncop_ftruncate (to, dst_fd, 0, NULL, NULL);             
                if (ret) {                                                      
                        gf_log (this->name, GF_LOG_WARNING,                     
                                "%s: failed to perform truncate on %s (%s)",    
                                loc->path, to->name, strerror (-ret));          
                        ret = -1;                                               
                }                                                               

                syncop_close (dst_fd);      <-- this is now an invalid fd for
the dst cleanup                                    

                old_target = to;                                                
                to = new_target;                                                

                /* if the file migration is successful to this new target, then 
                 * update the xattr on the old destination to point the new     
                 * destination. We need to do update this only post migration   
                 * as in case of failure the linkto needs to point to the
source
                 * subvol */                                                    
                ret = __dht_rebalance_create_dst_file (this, to, from, loc, 
                                                       &stbuf,
                                                       &dst_fd, xattr, 
                                                       fop_errno);
                if (ret) {                                                      
                        gf_log (this->name, GF_LOG_ERROR, "Create dst failed"   
                                " on - %s for file - %s", to->name, loc->path); 
                        goto out;                      

<<< If this fails here, clean_dst is set to true but the fd is invalid, causing
the hang on fd->lock>>>                         

                } else {                                                        
                        gf_msg (this->name, GF_LOG_INFO, 0, 0, "destination for
file "
                                "- %s is changed to - %s", loc->path,
to->name);
                }                                                               
        }          
...

--- Additional comment from Worker Ant on 2017-07-10 06:17:54 EDT ---

REVIEW: https://review.gluster.org/17735 (cluster/dht: Clear clean_dst flag on
target change) posted (#1) for review on master by N Balachandran
(nbalacha at redhat.com)

--- Additional comment from Worker Ant on 2017-07-11 01:06:17 EDT ---

COMMIT: https://review.gluster.org/17735 committed in master by Raghavendra G
(rgowdapp at redhat.com) 
------
commit bd71ca4fdf2554dd22c0db70af132a11b966ef38
Author: N Balachandran <nbalacha at redhat.com>
Date:   Mon Jul 10 15:45:04 2017 +0530

    cluster/dht: Clear clean_dst flag on target change

    If the target of a file migration was changed because
    of min-free-disk limits, the dst_fd was closed but the
    clean_dst flag was not set to false. If the file could
    not be created on the new target for some reason, the
    ftruncate call to clean up the dst was sent on the now
    invalid fd causing the process to deadlock.

    Change-Id: I5bfa80f519b04567413d84229cf62d143c6e2f04
    BUG: 1469029
    Signed-off-by: N Balachandran <nbalacha at redhat.com>
    Reviewed-on: https://review.gluster.org/17735
    Smoke: Gluster Build System <jenkins at build.gluster.org>
    CentOS-regression: Gluster Build System <jenkins at build.gluster.org>
    Reviewed-by: Amar Tumballi <amarts at redhat.com>
    Reviewed-by: Raghavendra G <rgowdapp at redhat.com>


Referenced Bugs:

https://bugzilla.redhat.com/show_bug.cgi?id=1469029
[Bug 1469029] Rebalance hangs on remove-brick if the target volume changes
https://bugzilla.redhat.com/show_bug.cgi?id=1469041
[Bug 1469041] Rebalance hangs on remove-brick if the target volume changes
-- 
You are receiving this mail because:
You are on the CC list for the bug.
You are the assignee for the bug.


More information about the Bugs mailing list