[Bugs] [Bug 1475258] New: [Geo-rep]: Geo-rep hangs in changelog mode

Wed Jul 26 10:10:01 UTC 2017

https://bugzilla.redhat.com/show_bug.cgi?id=1475258

            Bug ID: 1475258
           Summary: [Geo-rep]: Geo-rep hangs in changelog mode
           Product: GlusterFS
           Version: 3.12
         Component: geo-replication
          Keywords: Triaged
          Severity: high
          Assignee: ndevos at redhat.com
          Reporter: ndevos at redhat.com
                CC: bugs at gluster.org, khiremat at redhat.com,
                    ndevos at redhat.com
        Depends On: 1475255
            Blocks: 1473826 (glusterfs-3.12.0)

+++ This bug was initially created as a clone of Bug #1475255 +++

Description of problem:
The geo-replication worker hangs and never switches to 'Changelog Crawl', so
no data is synced from master to slave. Geo-rep works fine with xsync as the
change_detector.

Version-Release number of selected component (if applicable):
mainline

How reproducible:
Always

Steps to Reproduce:
1. Set up a geo-replication session between two gluster volumes

Actual results:
Geo-rep hangs and no data is synced

Expected results:
Geo-rep should not hang and should sync data

Additional info:
Analysis:
   The culprit is patch "https://review.gluster.org/#/c/17779/", which was
merged into mainline. It causes a hang in the 'libgfchangelog' library, which
geo-replication uses to get the changelogs from the brick back end.

The backtrace of the hang is below.

Thread 1 (Thread 0x7ffff7fe5700 (LWP 11895)):
#0  pthread_spin_lock () at ../sysdeps/x86_64/nptl/pthread_spin_lock.S:32
#1  0x00007ffff7911af4 in mem_get (mem_pool=0x7ffff7bc5588 <pools+200>) at mem-pool.c:758
#2  0x00007ffff7911791 in mem_get0 (mem_pool=0x7ffff7bc5588 <pools+200>) at mem-pool.c:657
#3  0x00007ffff78dcd80 in log_buf_new () at logging.c:284
#4  0x00007ffff78e0c0a in _gf_msg_internal (domain=0x602b80 "gfchangelog",
    file=0x7ffff7bd50cd "gf-changelog.c",
    function=0x7ffff7bd52a0 <__FUNCTION__.17176> "gf_changelog_register_generic",
    line=552, level=GF_LOG_INFO, errnum=0, msgid=132028,
    appmsgstr=0x7fffffffd018, callstr=0x0, graph_id=0) at logging.c:1961
#5  0x00007ffff78e110c in _gf_msg (domain=0x602b80 "gfchangelog",
    file=0x7ffff7bd50cd "gf-changelog.c",
    function=0x7ffff7bd52a0 <__FUNCTION__.17176> "gf_changelog_register_generic",
    line=552, level=GF_LOG_INFO, errnum=0, trace=0, msgid=132028,
    fmt=0x7ffff7bd51c8 "Registering brick: %s [notify filter: %d]") at logging.c:2077
#6  0x00007ffff7bcd8dd in gf_changelog_register_generic (bricks=0x7fffffffd1c0,
    count=0, ordered=1, logfile=0x400c0d "/tmp/change.log", lvl=9, xl=0x0)
    at gf-changelog.c:549
#7  0x00007ffff7bcda84 in gf_changelog_register (brick_path=0x400c2a "/bricks/brick0/b0",
    scratch_dir=0x400c1d "/tmp/scratch", log_file=0x400c0d "/tmp/change.log",
    log_level=9, max_reconnects=5) at gf-changelog.c:623
#8  0x00000000004009ff in main (argc=1, argv=0x7fffffffe328) at get-changes.c:49

The call flow of the first mem_get is:
mem_get --> mem_get_pool_list --> pthread_getspecific(pool_key)

pthread_getspecific should have returned NULL, because pool_key is not set:
mem_pools_init_early/mem_pools_init_late is not called in this code path. But
it returned some value, so the spin lock was never initialized, which causes
this hang.

According to the man page of pthread_getspecific:

"The effect of calling pthread_getspecific() or pthread_setspecific() with a
key value not obtained from pthread_key_create() or after key has been
deleted with pthread_key_delete() is undefined."

So should we not have the if condition below?

mem_pools_init_early (void)
{
        pthread_mutex_lock (&init_mutex);
        /* Use a pthread_key destructor to clean up when a thread exits.
         *
         * We won't increase init_count here, that is only done when the
         * pool_sweeper thread is started too.
         */
        if (pthread_getspecific (pool_key) == NULL) {
                /* key has not been created yet */
                if (pthread_key_create (&pool_key, pool_destructor) != 0) {
                        gf_log ("mem-pool", GF_LOG_CRITICAL,
                                "failed to initialize mem-pool key");
                }
        }
        pthread_mutex_unlock (&init_mutex);
}
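
As a rough sketch of what dropping that check could look like (this is not
the patch that resolved the bug): pthread_once() can guard
pthread_key_create(), so pthread_getspecific() is only ever called on a key
that exists. The names sketch_pools_init_early, sketch_pool_key_lookup and
create_pool_key are hypothetical, and the destructor is only a placeholder.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

static pthread_key_t  pool_key;          /* valid only once create_pool_key() ran */
static pthread_once_t pool_key_once = PTHREAD_ONCE_INIT;
static int            pool_key_ret = -1; /* result of pthread_key_create() */

/* Placeholder destructor; a real one would release the per-thread pools. */
static void
pool_destructor (void *arg)
{
        (void) arg;
}

/* Runs exactly once per process, however many threads race to call it. */
static void
create_pool_key (void)
{
        pool_key_ret = pthread_key_create (&pool_key, pool_destructor);
}

/* Hypothetical variant of mem_pools_init_early(): it never probes the key
 * with pthread_getspecific() to find out whether the key exists.
 * Returns 0 on success, the pthread_key_create() error code otherwise. */
int
sketch_pools_init_early (void)
{
        pthread_once (&pool_key_once, create_pool_key);
        return pool_key_ret;
}

/* Per-thread lookup; only valid after a successful init. */
void *
sketch_pool_key_lookup (void)
{
        assert (pool_key_ret == 0);
        return pthread_getspecific (pool_key);
}
```

Calling sketch_pools_init_early() more than once is harmless, and a thread
that has not stored anything yet gets NULL back from the lookup, which is the
behaviour POSIX defines for a freshly created key.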

And is it now mandatory to call mem_pools_init_early in all code paths, such
as libgfchangelog?
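
One alternative to requiring every consumer to call the init function,
sketched here under assumptions, is for the lookup itself to create the key
lazily and to initialize the spin lock when a thread's state is first
allocated. All sketch_* names and the struct layout are hypothetical and do
not mirror mem-pool.c.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L /* for pthread_spinlock_t in strict modes */
#endif
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical per-thread state standing in for the real pool list. */
typedef struct {
        pthread_spinlock_t lock;
        unsigned long      allocs;
} sketch_pool_list_t;

static pthread_key_t  sketch_key;
static pthread_once_t sketch_once = PTHREAD_ONCE_INIT;
static int            sketch_key_ret = -1;

/* Called at thread exit for threads that stored a non-NULL value. */
static void
sketch_destructor (void *arg)
{
        sketch_pool_list_t *list = arg;

        assert (list != NULL);
        pthread_spin_destroy (&list->lock);
        free (list);
}

static void
sketch_create_key (void)
{
        sketch_key_ret = pthread_key_create (&sketch_key, sketch_destructor);
}

/* Returns the calling thread's state, creating key and state on demand.
 * Returns NULL on failure. */
sketch_pool_list_t *
sketch_get_pool_list (void)
{
        sketch_pool_list_t *list;

        pthread_once (&sketch_once, sketch_create_key);
        if (sketch_key_ret != 0)
                return NULL;

        list = pthread_getspecific (sketch_key);
        if (list)
                return list;

        list = calloc (1, sizeof (*list));
        if (!list)
                return NULL;
        /* The lock is initialized before the state becomes visible. */
        if (pthread_spin_init (&list->lock, PTHREAD_PROCESS_PRIVATE) != 0) {
                free (list);
                return NULL;
        }
        if (pthread_setspecific (sketch_key, list) != 0) {
                pthread_spin_destroy (&list->lock);
                free (list);
                return NULL;
        }
        return list;
}

/* Example user of the lock. Returns 0 on success, -1 on failure. */
int
sketch_count_alloc (void)
{
        sketch_pool_list_t *list = sketch_get_pool_list ();

        if (!list)
                return -1;
        pthread_spin_lock (&list->lock);
        list->allocs++;
        pthread_spin_unlock (&list->lock);
        return 0;
}
```

With this shape a library entry point such as the one in the backtrace would
not need an explicit init call before its first allocation; the cost is a
pthread_once() check on every lookup.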

Referenced Bugs:

https://bugzilla.redhat.com/show_bug.cgi?id=1473826
[Bug 1473826] GlusterFS 3.12.0 tracker
https://bugzilla.redhat.com/show_bug.cgi?id=1475255
[Bug 1475255] [Geo-rep]: Geo-rep hangs in changelog mode