[Gluster-devel] important change to syncop infra

Wed Dec 11 14:44:32 UTC 2013

So... do we need to spin up new packages for existing releases?

----- Original Message -----
> hi Xavier,
> 
> > The str_uuid pointer used can point to the uuid of another thread that
> > could have been modified. I don't know very well the internals of
> > syncops and synctask, but I think it's even possible that another task
> > running on the same thread while the current one was sleeping could
> > modify the contents of str_uuid.
> > 
> > This case seems very obvious and useless but maybe other combinations
> > are harder to detect.
> > 
> > I think that currently they are only used by uuid_utoa() and
> > lkowner_utoa() to write log messages, which seems safe, but we will have
> > to be careful when using syncops to avoid possible future problems like
> > this.
> 
> Probably the reason why I never considered these functions as a danger is for
> some reason
> I always thought they are supposed to be used in logging functions. After
> your
> response I went through the commits and indeed they are introduced just for
> logging.
> 
> Here is the snippet from the bug
> (https://bugzilla.redhat.com/show_bug.cgi?id=GLUSTER-2308)
>       as part of which this is introduced:
> 
> "This function would take in a uuid_t and return a static char buffer.
> This would be useful in logging the uuid_t. We already make use of
> uuid_unparse() provided by libuuid to convert uuid_t to string but that
> takes an additional out buffer parameter, does not return the out
> buffer  and hence cannot be used directly within logging functions like
> gf_log (). This necessitates uuid_unparse () to be done even if we are
> not going to log the uuid (due to the log message being at a lower log
> level) and hence it does have a performance impact. In all such
> instances, let us get rid of uuid_unparse() and use uuid_utoa() instead.
> 
> Prototype :-> char * uuid_utoa (uuid_t uuid)"
> 
> lkowner_utoa is implemented much later for similar usage.
> 
> Both these functions have re-entrant variants, lkowner_utoa_r() and
> uuid_utoa_r().
> 
> So while the case that you showed for uuid_utoa() exposes the bug, I am not
> sure if it is supposed to be used in that way to begin with.
> Even in gf_log() if we want to log more than 1 uuid we should not be using
> this function because it ends up logging same uuid twice.
> 
> Pranith.
> 
> > Xavi
> > 
> > El 11/12/13 12:02, Pranith Kumar Karampuri ha escrit:
> > >> it's really a very interesting and hard to find bug, but I'm wondering
> > >> how we can prevent this to happen in the general case. There might be
> > >> other operations based on pointers to thread local storage that will
> > >> suffer this problem. Probably 'errno' is one of the most dangerous, but
> > >> TLS is also used to resolve 'THIS' in a very similar way to errno, and
> > >> there are temporary uuid and lkowner values also stored into TLS. More
> > >> things could appear in the future. These are also potential cases to
> > >> have problems with syncops.
> > >   
> > >> An access to THIS before and after a syncop might trigger this bug,
> > >> right
> > >> ?
> > > for errno it happens because it has the attribute __attribute__((const)).
> > >
> > > # grep errno_location /usr/include/bits/errno.h
> > > extern int *__errno_location (void) __THROW __attribute__ ((__const__));
> > > #   define errno (*__errno_location ())
> > >
> > > (The following article explains it best IMO:
> > > http://lwn.net/Articles/285332)
> > >
> > > But it is not there for lkowner, uuid-buf, THIS. Just to confirm, I added
> > > the following code
> > > ------------------------------------------------------
> > > +        xlator_t                *tmp_this = NULL;
> > >   
> > >           gf_log (this->name, GF_LOG_INFO, "migrate data called on %s",
> > >                   loc->path);
> > > @@ -1258,8 +1259,10 @@ gf_defrag_migrate_data (xlator_t *this,
> > > gf_defrag_info_t *defrag, loc_t *loc,
> > >                            * and also that guarantees that file has to be
> > >                            mostly
> > >                            * migrated */
> > >   
> > > +                        tmp_this = THIS;
> > >                           ret = syncop_getxattr (this, &entry_loc, &dict,
> > >                                                  GF_XATTR_LINKINFO_KEY);
> > > +                        tmp_this = THIS;
> > >                           if (ret < 0) {
> > >                                   if (errno != ENODATA) {
> > >                                           loglevel = GF_LOG_ERROR;
> > > @@ -1267,7 +1270,7 @@ gf_defrag_migrate_data (xlator_t *this,
> > > gf_defrag_info_t *defrag, loc_t *loc,
> > >                                   } else {
> > >                                           loglevel = GF_LOG_TRACE;
> > >                                   }
> > > -                                gf_log (this->name, loglevel, "%s:
> > > failed
> > > to "
> > > +                                gf_log (tmp_this->name, loglevel, "%s:
> > > failed to "
> > > -----------------------------------------------------
> > > assembly code it generated:
> > > =====================================================
> > > 1262	                        tmp_this = THIS;
> > >     0x00007faf17120dbc <+1788>:	xor    %eax,%eax
> > >     0x00007faf17120dbe <+1790>:	callq  0x7faf171191d0
> > >     <__glusterfs_this_location at plt> <<----------------first call to THIS
> > >
> > > 1263	                        ret = syncop_getxattr (this, &entry_loc,
> > > &dict,
> > >     0x00007faf17120dc3 <+1795>:	mov    0x60(%rsp),%rdx
> > >     0x00007faf17120dc8 <+1800>:	mov    0x28(%rsp),%rsi
> > >     0x00007faf17120dcd <+1805>:	lea    0x2b550(%rip),%rcx        #
> > >     0x7faf1714c324
> > >     0x00007faf17120dd4 <+1812>:	mov    0x18(%rsp),%rdi
> > >     0x00007faf17120dd9 <+1817>:	callq  0x7faf17118a30
> > >     <syncop_getxattr at plt>
> > >     0x00007faf17120dde <+1822>:	mov    %eax,%ebp
> > >
> > > 1264
> > > 	                                               GF_XATTR_LINKINFO_KEY);
> > > ---Type <return> to continue, or q <return> to quit---
> > > 1265	                        tmp_this = THIS;
> > >     0x00007faf17120de0 <+1824>:	xor    %eax,%eax
> > >     0x00007faf17120de2 <+1826>:	callq  0x7faf171191d0
> > >     <__glusterfs_this_location at plt> <<----------------second call to THIS
> > >     0x00007faf17120de9 <+1833>:	mov    (%rax),%r15
> > >
> > > 1266	                        if (ret < 0) {
> > >     0x00007faf17120de7 <+1831>:	test   %ebp,%ebp
> > >     0x00007faf17120dec <+1836>:	js     0x7faf1712100e
> > >     <gf_defrag_migrate_data+2382>
> > >
> > > 1267	                                if (errno != ENODATA) {
> > >     0x00007faf1712100e <+2382>:	mov    0x50(%rsp),%rax
> > >     <<<---------------------------------------------check that errno does
> > >     not callq __errno_location
> > >     0x00007faf17121013 <+2387>:	mov    $0x9,%ebp
> > >     0x00007faf17121018 <+2392>:	mov    (%rax),%edi
> > >     0x00007faf1712101a <+2394>:	cmp    $0x3d,%edi
> > >
> > > ================================================
> > >
> > >> I think it's very difficult to track all these cases and handle them
> > >> correctly. Another solution could be to tell the compiler to forget
> > >> previous pointer values when a syncop is called, but I don't see any way
> > >> to do so. Since it's storing pointers to values, any barrier or volatile
> > >> seems useless.
> > > We can do this by putting __attribute__((noinline)) for such functions.
> > > But
> > > since errno is not in our control we can't do that.
> > >
> > >> Xavi
> > >>
> > >>
> > >> El 11/12/13 09:51, Pranith Kumar Karampuri ha escrit:
> > >>> hi,
> > >>>       We found a day-1 bug when syncop_xxx() infra is used inside a
> > >>>       synctask
> > >>>       with compilation optimization (CFLAGS -O2). This bug has been
> > >>>       dormant
> > >>>       for at least 2 years.
> > >>> There are around ~400(rebalance, replace-brick, bd, self-heal-daemon,
> > >>> quota, fuse lock/fd migration) places where syncop is used in the code
> > >>> base all of which are potential candidates which can take the hit.
> > >>>
> > >>> I sent first round of patch at http://review.gluster.com/6475 to catch
> > >>> regressions upstream.
> > >>> These are the files that are affected by the changes I introduced to
> > >>> fix
> > >>> this:
> > >>>
> > >>>    api/src/glfs-fops.c                             |  36
> > >>>    ++++++++++++++++++++++++++++++++++
> > >>>    api/src/glfs-handleops.c                        |  15 ++++++++++++++
> > >>>    api/src/glfs-internal.h                         |   7 +++++++
> > >>>    api/src/glfs-resolve.c                          |  10 ++++++++++
> > >>>    libglusterfs/src/syncop.c                       | 117
> > >>>    ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-------------------------------------
> > >>>    xlators/cluster/afr/src/afr-self-heald.c        |  45
> > >>>    +++++++++++++++++++++++++++++-------------
> > >>>    xlators/cluster/afr/src/pump.c                  |  12 ++++++++++--
> > >>>    xlators/cluster/dht/src/dht-helper.c            |  24
> > >>>    +++++++++++++++--------
> > >>>    xlators/cluster/dht/src/dht-rebalance.c         | 168
> > >>>    ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++---------------------------------------------------------------
> > >>>    xlators/cluster/dht/src/dht-selfheal.c          |   6 ++++--
> > >>>    xlators/features/locks/src/posix.c              |   3 ++-
> > >>>    xlators/features/qemu-block/src/bdrv-xlator.c   |  15 ++++----------
> > >>>    xlators/features/qemu-block/src/qb-coroutines.c |  14 ++++++++++----
> > >>>    xlators/mount/fuse/src/fuse-bridge.c            |  16
> > >>>    ++++++++++-----
> > >>>
> > >>> Please review your respective component for these changes in gerrit.
> > >>>
> > >>> Thanks
> > >>> Pranith.
> > >>>
> > >>> Detailed explanation of the Root cause:
> > >>> We found the bug in 'gf_defrag_migrate_data' in rebalance operation:
> > >>>
> > >>> Lets look at interesting parts of the function:
> > >>>
> > >>> int
> > >>> gf_defrag_migrate_data (xlator_t *this, gf_defrag_info_t *defrag, loc_t
> > >>> *loc,
> > >>>                           dict_t *migrate_data)
> > >>> {
> > >>> .....
> > >>> code section - [ Loop ]
> > >>>           while ((ret = syncop_readdirp (this, fd, 131072, offset,
> > >>>           NULL,
> > >>>                                          &entries)) != 0) {
> > >>> .....
> > >>> code section - [ ERRNO-1 ] (errno of readdirp is stored in
> > >>> readdir_operrno
> > >>> by a thread)
> > >>>                   /* Need to keep track of ENOENT errno, that means,
> > >>>                   there
> > >>>                   is no
> > >>>                      need to send more readdirp() */
> > >>>                   readdir_operrno = errno;
> > >>> .....
> > >>> code section - [ SYNCOP-1 ] (syncop_getxattr is called by a thread)
> > >>>                           ret = syncop_getxattr (this, &entry_loc,
> > >>>                           &dict,
> > >>>                                                  GF_XATTR_LINKINFO_KEY);
> > >>> code section - [ ERRNO-2]   (checking for failures of
> > >>> syncop_getxattr().
> > >>> This may not always be executed in same thread which executed
> > >>> [SYNCOP-1])
> > >>>                           if (ret < 0) {
> > >>>                                   if (errno != ENODATA) {
> > >>>                                           loglevel = GF_LOG_ERROR;
> > >>>                                           defrag->total_failures += 1;
> > >>> .....
> > >>> }
> > >>>
> > >>> the function above could be executed by thread(t1) till [SYNCOP-1] and
> > >>> code
> > >>> from [ERRNO-2] can be executed by a different thread(t2) because of the
> > >>> way syncop-infra schedules the tasks.
> > >>>
> > >>> when the code is compiled with -O2 optimization this is the assembly
> > >>> code
> > >>> that is generated:
> > >>>    [ERRNO-1]
> > >>> 1165                        readdir_operrno = errno; <<---- errno gets
> > >>> expanded as *(__errno_location())
> > >>>      0x00007fd149d48b60 <+496>:        callq  0x7fd149d410c0
> > >>>      <__errno_location at plt>
> > >>>      0x00007fd149d48b72 <+514>:        mov    %rax,0x50(%rsp) <<------
> > >>>      Address returned by __errno_location() is stored in a special
> > >>>      location
> > >>>      in stack for later use.
> > >>>      0x00007fd149d48b77 <+519>:        mov    (%rax),%eax
> > >>>      0x00007fd149d48b79 <+521>:        mov    %eax,0x78(%rsp)
> > >>> ....
> > >>>    [ERRNO-2]
> > >>> 1281                                        if (errno != ENODATA) {
> > >>>      0x00007fd149d492ae <+2366>:        mov    0x50(%rsp),%rax <<-----
> > >>>      Because it already stored the address returned by
> > >>>      __errno_location(),
> > >>>      it just dereferences the address to get the errno value. BUT THIS
> > >>>      CODE
> > >>>      NEED NOT BE EXECUTED BY SAME THREAD!!!
> > >>>      0x00007fd149d492b3 <+2371>:        mov    $0x9,%ebp
> > >>>      0x00007fd149d492b8 <+2376>:        mov    (%rax),%edi
> > >>>      0x00007fd149d492ba <+2378>:        cmp    $0x3d,%edi
> > >>>
> > >>> The problem is that __errno_location() value of t1 and t2 are
> > >>> different.
> > >>> So
> > >>> [ERRNO-2] ends up reading errno of t1 instead of errno of t2 even
> > >>> though
> > >>> t2 is executing [ERRNO-2] code section.
> > >>>
> > >>> When code is compiled without any optimization for [ERRNO-2]:
> > >>> 1281                                        if (errno != ENODATA) {
> > >>>      0x00007fd58e7a326f <+2237>:        callq  0x7fd58e797300
> > >>>      <__errno_location at plt><<--- As it is calling __errno_location()
> > >>>      again
> > >>>      it gets the location from t2 so it works as intended.
> > >>>      0x00007fd58e7a3274 <+2242>:        mov    (%rax),%eax
> > >>>      0x00007fd58e7a3276 <+2244>:        cmp    $0x3d,%eax
> > >>>      0x00007fd58e7a3279 <+2247>:        je     0x7fd58e7a32a1
> > >>>      <gf_defrag_migrate_data+2287>
> > >>>
> > >>> Fix:
> > >>> We decided to make syncop_xxx() return (-errno) value as the return
> > >>> value
> > >>> in case of errors and all the functions which make syncop_xxx() will
> > >>> need
> > >>> to use (-ret) to figure out the reason for failure in case of
> > >>> syncop_xxx()
> > >>> failures.
> > >>>
> > >>> Pranith
> > >>>
> > >>> _______________________________________________
> > >>> Gluster-devel mailing list
> > >>> Gluster-devel at nongnu.org
> > >>> https://lists.nongnu.org/mailman/listinfo/gluster-devel
> > >>
> > > _______________________________________________
> > > Gluster-devel mailing list
> > > Gluster-devel at nongnu.org
> > > https://lists.nongnu.org/mailman/listinfo/gluster-devel
> > 
> > 
> 
> _______________________________________________
> Gluster-devel mailing list
> Gluster-devel at nongnu.org
> https://lists.nongnu.org/mailman/listinfo/gluster-devel
>