[Gluster-devel] important change to syncop infra

Wed Dec 11 10:04:42 UTC 2013

Hi Pranith,

it's really a very interesting and hard to find bug, but I'm wondering 
how we can prevent this to happen in the general case. There might be 
other operations based on pointers to thread local storage that will 
suffer this problem. Probably 'errno' is one of the most dangerous, but 
TLS is also used to resolve 'THIS' in a very similar way to errno, and 
there are temporary uuid and lkowner values also stored into TLS. More 
things could appear in the future. These are also potential cases to 
have problems with syncops.

An access to THIS before and after a syncop might trigger this bug, right ?

I think it's very difficult to track all these cases and handle them 
correctly. Another solution could be to tell the compiler to forget 
previous pointer values when a syncop is called, but I don't see any way 
to do so. Since it's storing pointers to values, any barrier or volatile 
seems useless.

Xavi

El 11/12/13 09:51, Pranith Kumar Karampuri ha escrit:
> hi,
>      We found a day-1 bug when syncop_xxx() infra is used inside a synctask with compilation optimization (CFLAGS -O2). This bug has been dormant for at least 2 years.
> There are around ~400(rebalance, replace-brick, bd, self-heal-daemon, quota, fuse lock/fd migration) places where syncop is used in the code base all of which are potential candidates which can take the hit.
>
> I sent first round of patch at http://review.gluster.com/6475 to catch regressions upstream.
> These are the files that are affected by the changes I introduced to fix this:
>
>   api/src/glfs-fops.c                             |  36 ++++++++++++++++++++++++++++++++++
>   api/src/glfs-handleops.c                        |  15 ++++++++++++++
>   api/src/glfs-internal.h                         |   7 +++++++
>   api/src/glfs-resolve.c                          |  10 ++++++++++
>   libglusterfs/src/syncop.c                       | 117 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-------------------------------------
>   xlators/cluster/afr/src/afr-self-heald.c        |  45 +++++++++++++++++++++++++++++-------------
>   xlators/cluster/afr/src/pump.c                  |  12 ++++++++++--
>   xlators/cluster/dht/src/dht-helper.c            |  24 +++++++++++++++--------
>   xlators/cluster/dht/src/dht-rebalance.c         | 168 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++---------------------------------------------------------------
>   xlators/cluster/dht/src/dht-selfheal.c          |   6 ++++--
>   xlators/features/locks/src/posix.c              |   3 ++-
>   xlators/features/qemu-block/src/bdrv-xlator.c   |  15 ++++----------
>   xlators/features/qemu-block/src/qb-coroutines.c |  14 ++++++++++----
>   xlators/mount/fuse/src/fuse-bridge.c            |  16 ++++++++++-----
>
> Please review your respective component for these changes in gerrit.
>
> Thanks
> Pranith.
>
> Detailed explanation of the Root cause:
> We found the bug in 'gf_defrag_migrate_data' in rebalance operation:
>
> Lets look at interesting parts of the function:
>
> int
> gf_defrag_migrate_data (xlator_t *this, gf_defrag_info_t *defrag, loc_t *loc,
>                          dict_t *migrate_data)
> {
> .....
> code section - [ Loop ]
>          while ((ret = syncop_readdirp (this, fd, 131072, offset, NULL,
>                                         &entries)) != 0) {
> .....
> code section - [ ERRNO-1 ] (errno of readdirp is stored in readdir_operrno by a thread)
>                  /* Need to keep track of ENOENT errno, that means, there is no
>                     need to send more readdirp() */
>                  readdir_operrno = errno;
> .....
> code section - [ SYNCOP-1 ] (syncop_getxattr is called by a thread)
>                          ret = syncop_getxattr (this, &entry_loc, &dict,
>                                                 GF_XATTR_LINKINFO_KEY);
> code section - [ ERRNO-2]   (checking for failures of syncop_getxattr(). This may not always be executed in same thread which executed [SYNCOP-1])
>                          if (ret < 0) {
>                                  if (errno != ENODATA) {
>                                          loglevel = GF_LOG_ERROR;
>                                          defrag->total_failures += 1;
> .....
> }
>
> the function above could be executed by thread(t1) till [SYNCOP-1] and code from [ERRNO-2] can be executed by a different thread(t2) because of the way syncop-infra schedules the tasks.
>
> when the code is compiled with -O2 optimization this is the assembly code that is generated:
>   [ERRNO-1]
> 1165                        readdir_operrno = errno; <<---- errno gets expanded as *(__errno_location())
>     0x00007fd149d48b60 <+496>:        callq  0x7fd149d410c0 <__errno_location at plt>
>     0x00007fd149d48b72 <+514>:        mov    %rax,0x50(%rsp) <<------ Address returned by __errno_location() is stored in a special location in stack for later use.
>     0x00007fd149d48b77 <+519>:        mov    (%rax),%eax
>     0x00007fd149d48b79 <+521>:        mov    %eax,0x78(%rsp)
> ....
>   [ERRNO-2]
> 1281                                        if (errno != ENODATA) {
>     0x00007fd149d492ae <+2366>:        mov    0x50(%rsp),%rax <<-----  Because it already stored the address returned by __errno_location(), it just dereferences the address to get the errno value. BUT THIS CODE NEED NOT BE EXECUTED BY SAME THREAD!!!
>     0x00007fd149d492b3 <+2371>:        mov    $0x9,%ebp
>     0x00007fd149d492b8 <+2376>:        mov    (%rax),%edi
>     0x00007fd149d492ba <+2378>:        cmp    $0x3d,%edi
>
> The problem is that __errno_location() value of t1 and t2 are different. So [ERRNO-2] ends up reading errno of t1 instead of errno of t2 even though t2 is executing [ERRNO-2] code section.
>
> When code is compiled without any optimization for [ERRNO-2]:
> 1281                                        if (errno != ENODATA) {
>     0x00007fd58e7a326f <+2237>:        callq  0x7fd58e797300 <__errno_location at plt><<--- As it is calling __errno_location() again it gets the location from t2 so it works as intended.
>     0x00007fd58e7a3274 <+2242>:        mov    (%rax),%eax
>     0x00007fd58e7a3276 <+2244>:        cmp    $0x3d,%eax
>     0x00007fd58e7a3279 <+2247>:        je     0x7fd58e7a32a1 <gf_defrag_migrate_data+2287>
>
> Fix:
> We decided to make syncop_xxx() return (-errno) value as the return value in case of errors and all the functions which make syncop_xxx() will need to use (-ret) to figure out the reason for failure in case of syncop_xxx() failures.
>
> Pranith
>
> _______________________________________________
> Gluster-devel mailing list
> Gluster-devel at nongnu.org
> https://lists.nongnu.org/mailman/listinfo/gluster-devel