[Gluster-devel] important change to syncop infra

Wed Dec 11 11:02:27 UTC 2013

> it's really a very interesting and hard to find bug, but I'm wondering
> how we can prevent this to happen in the general case. There might be
> other operations based on pointers to thread local storage that will
> suffer this problem. Probably 'errno' is one of the most dangerous, but
> TLS is also used to resolve 'THIS' in a very similar way to errno, and
> there are temporary uuid and lkowner values also stored into TLS. More
> things could appear in the future. These are also potential cases to
> have problems with syncops.
 
> An access to THIS before and after a syncop might trigger this bug, right ?

for errno it happens because it has the attribute __attribute__((const)).

# grep errno_location /usr/include/bits/errno.h
extern int *__errno_location (void) __THROW __attribute__ ((__const__));
#   define errno (*__errno_location ())

(The following article explains it best IMO: http://lwn.net/Articles/285332)

But it is not there for lkowner, uuid-buf, THIS. Just to confirm, I added the following code
------------------------------------------------------
+        xlator_t                *tmp_this = NULL;
 
         gf_log (this->name, GF_LOG_INFO, "migrate data called on %s",
                 loc->path);
@@ -1258,8 +1259,10 @@ gf_defrag_migrate_data (xlator_t *this, gf_defrag_info_t *defrag, loc_t *loc,
                          * and also that guarantees that file has to be mostly
                          * migrated */
 
+                        tmp_this = THIS;
                         ret = syncop_getxattr (this, &entry_loc, &dict,
                                                GF_XATTR_LINKINFO_KEY);
+                        tmp_this = THIS;
                         if (ret < 0) {
                                 if (errno != ENODATA) {
                                         loglevel = GF_LOG_ERROR;
@@ -1267,7 +1270,7 @@ gf_defrag_migrate_data (xlator_t *this, gf_defrag_info_t *defrag, loc_t *loc,
                                 } else {
                                         loglevel = GF_LOG_TRACE;
                                 }
-                                gf_log (this->name, loglevel, "%s: failed to "
+                                gf_log (tmp_this->name, loglevel, "%s: failed to "
-----------------------------------------------------
assembly code it generated:
=====================================================
1262	                        tmp_this = THIS;
   0x00007faf17120dbc <+1788>:	xor    %eax,%eax
   0x00007faf17120dbe <+1790>:	callq  0x7faf171191d0 <__glusterfs_this_location at plt> <<----------------first call to THIS

1263	                        ret = syncop_getxattr (this, &entry_loc, &dict,
   0x00007faf17120dc3 <+1795>:	mov    0x60(%rsp),%rdx
   0x00007faf17120dc8 <+1800>:	mov    0x28(%rsp),%rsi
   0x00007faf17120dcd <+1805>:	lea    0x2b550(%rip),%rcx        # 0x7faf1714c324
   0x00007faf17120dd4 <+1812>:	mov    0x18(%rsp),%rdi
   0x00007faf17120dd9 <+1817>:	callq  0x7faf17118a30 <syncop_getxattr at plt>
   0x00007faf17120dde <+1822>:	mov    %eax,%ebp

1264	                                               GF_XATTR_LINKINFO_KEY);
---Type <return> to continue, or q <return> to quit---
1265	                        tmp_this = THIS;
   0x00007faf17120de0 <+1824>:	xor    %eax,%eax
   0x00007faf17120de2 <+1826>:	callq  0x7faf171191d0 <__glusterfs_this_location at plt> <<----------------second call to THIS
   0x00007faf17120de9 <+1833>:	mov    (%rax),%r15

1266	                        if (ret < 0) {
   0x00007faf17120de7 <+1831>:	test   %ebp,%ebp
   0x00007faf17120dec <+1836>:	js     0x7faf1712100e <gf_defrag_migrate_data+2382>

1267	                                if (errno != ENODATA) {
   0x00007faf1712100e <+2382>:	mov    0x50(%rsp),%rax <<<---------------------------------------------check that errno does not callq __errno_location
   0x00007faf17121013 <+2387>:	mov    $0x9,%ebp
   0x00007faf17121018 <+2392>:	mov    (%rax),%edi
   0x00007faf1712101a <+2394>:	cmp    $0x3d,%edi

================================================

> I think it's very difficult to track all these cases and handle them
> correctly. Another solution could be to tell the compiler to forget
> previous pointer values when a syncop is called, but I don't see any way
> to do so. Since it's storing pointers to values, any barrier or volatile
> seems useless.

We can do this by putting __attribute__((noinline)) for such functions. But since errno is not in our control we can't do that.

> Xavi
> 
> 
> El 11/12/13 09:51, Pranith Kumar Karampuri ha escrit:
> > hi,
> >      We found a day-1 bug when syncop_xxx() infra is used inside a synctask
> >      with compilation optimization (CFLAGS -O2). This bug has been dormant
> >      for at least 2 years.
> > There are around ~400(rebalance, replace-brick, bd, self-heal-daemon,
> > quota, fuse lock/fd migration) places where syncop is used in the code
> > base all of which are potential candidates which can take the hit.
> >
> > I sent first round of patch at http://review.gluster.com/6475 to catch
> > regressions upstream.
> > These are the files that are affected by the changes I introduced to fix
> > this:
> >
> >   api/src/glfs-fops.c                             |  36
> >   ++++++++++++++++++++++++++++++++++
> >   api/src/glfs-handleops.c                        |  15 ++++++++++++++
> >   api/src/glfs-internal.h                         |   7 +++++++
> >   api/src/glfs-resolve.c                          |  10 ++++++++++
> >   libglusterfs/src/syncop.c                       | 117
> >   ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-------------------------------------
> >   xlators/cluster/afr/src/afr-self-heald.c        |  45
> >   +++++++++++++++++++++++++++++-------------
> >   xlators/cluster/afr/src/pump.c                  |  12 ++++++++++--
> >   xlators/cluster/dht/src/dht-helper.c            |  24
> >   +++++++++++++++--------
> >   xlators/cluster/dht/src/dht-rebalance.c         | 168
> >   ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++---------------------------------------------------------------
> >   xlators/cluster/dht/src/dht-selfheal.c          |   6 ++++--
> >   xlators/features/locks/src/posix.c              |   3 ++-
> >   xlators/features/qemu-block/src/bdrv-xlator.c   |  15 ++++----------
> >   xlators/features/qemu-block/src/qb-coroutines.c |  14 ++++++++++----
> >   xlators/mount/fuse/src/fuse-bridge.c            |  16 ++++++++++-----
> >
> > Please review your respective component for these changes in gerrit.
> >
> > Thanks
> > Pranith.
> >
> > Detailed explanation of the Root cause:
> > We found the bug in 'gf_defrag_migrate_data' in rebalance operation:
> >
> > Lets look at interesting parts of the function:
> >
> > int
> > gf_defrag_migrate_data (xlator_t *this, gf_defrag_info_t *defrag, loc_t
> > *loc,
> >                          dict_t *migrate_data)
> > {
> > .....
> > code section - [ Loop ]
> >          while ((ret = syncop_readdirp (this, fd, 131072, offset, NULL,
> >                                         &entries)) != 0) {
> > .....
> > code section - [ ERRNO-1 ] (errno of readdirp is stored in readdir_operrno
> > by a thread)
> >                  /* Need to keep track of ENOENT errno, that means, there
> >                  is no
> >                     need to send more readdirp() */
> >                  readdir_operrno = errno;
> > .....
> > code section - [ SYNCOP-1 ] (syncop_getxattr is called by a thread)
> >                          ret = syncop_getxattr (this, &entry_loc, &dict,
> >                                                 GF_XATTR_LINKINFO_KEY);
> > code section - [ ERRNO-2]   (checking for failures of syncop_getxattr().
> > This may not always be executed in same thread which executed [SYNCOP-1])
> >                          if (ret < 0) {
> >                                  if (errno != ENODATA) {
> >                                          loglevel = GF_LOG_ERROR;
> >                                          defrag->total_failures += 1;
> > .....
> > }
> >
> > the function above could be executed by thread(t1) till [SYNCOP-1] and code
> > from [ERRNO-2] can be executed by a different thread(t2) because of the
> > way syncop-infra schedules the tasks.
> >
> > when the code is compiled with -O2 optimization this is the assembly code
> > that is generated:
> >   [ERRNO-1]
> > 1165                        readdir_operrno = errno; <<---- errno gets
> > expanded as *(__errno_location())
> >     0x00007fd149d48b60 <+496>:        callq  0x7fd149d410c0
> >     <__errno_location at plt>
> >     0x00007fd149d48b72 <+514>:        mov    %rax,0x50(%rsp) <<------
> >     Address returned by __errno_location() is stored in a special location
> >     in stack for later use.
> >     0x00007fd149d48b77 <+519>:        mov    (%rax),%eax
> >     0x00007fd149d48b79 <+521>:        mov    %eax,0x78(%rsp)
> > ....
> >   [ERRNO-2]
> > 1281                                        if (errno != ENODATA) {
> >     0x00007fd149d492ae <+2366>:        mov    0x50(%rsp),%rax <<-----
> >     Because it already stored the address returned by __errno_location(),
> >     it just dereferences the address to get the errno value. BUT THIS CODE
> >     NEED NOT BE EXECUTED BY SAME THREAD!!!
> >     0x00007fd149d492b3 <+2371>:        mov    $0x9,%ebp
> >     0x00007fd149d492b8 <+2376>:        mov    (%rax),%edi
> >     0x00007fd149d492ba <+2378>:        cmp    $0x3d,%edi
> >
> > The problem is that __errno_location() value of t1 and t2 are different. So
> > [ERRNO-2] ends up reading errno of t1 instead of errno of t2 even though
> > t2 is executing [ERRNO-2] code section.
> >
> > When code is compiled without any optimization for [ERRNO-2]:
> > 1281                                        if (errno != ENODATA) {
> >     0x00007fd58e7a326f <+2237>:        callq  0x7fd58e797300
> >     <__errno_location at plt><<--- As it is calling __errno_location() again
> >     it gets the location from t2 so it works as intended.
> >     0x00007fd58e7a3274 <+2242>:        mov    (%rax),%eax
> >     0x00007fd58e7a3276 <+2244>:        cmp    $0x3d,%eax
> >     0x00007fd58e7a3279 <+2247>:        je     0x7fd58e7a32a1
> >     <gf_defrag_migrate_data+2287>
> >
> > Fix:
> > We decided to make syncop_xxx() return (-errno) value as the return value
> > in case of errors and all the functions which make syncop_xxx() will need
> > to use (-ret) to figure out the reason for failure in case of syncop_xxx()
> > failures.
> >
> > Pranith
> >
> > _______________________________________________
> > Gluster-devel mailing list
> > Gluster-devel at nongnu.org
> > https://lists.nongnu.org/mailman/listinfo/gluster-devel
> 
>