[Gluster-devel] important change to syncop infra

Wed Dec 11 12:26:53 UTC 2013

I missed the const attribute. Thanks for the explanation. This should 
eliminate the THIS problem, as you proved, however other cases are still 
possible that won't work correctly.

Basically, any TLS pointer passed to user code instead of being 
dereferenced will have problems. In this case, uuid and lkowner have 
this problem. Consider the following code:

@@ -1129,6 +1129,7 @@ gf_defrag_migrate_data (xlator_t *this, 
gf_defrag_info_t *defrag, loc_t *loc,
          struct timeval           start          = {0,};
          int32_t                  err            = 0;
          int                      loglevel       = GF_LOG_TRACE;
+        char *                   str_uuid       = NULL;

          gf_log (this->name, GF_LOG_INFO, "migrate data called on %s",
                  loc->path);
@@ -1275,8 +1276,10 @@ gf_defrag_migrate_data (xlator_t *this, 
gf_defrag_info_t *defrag, loc_t *loc,
                           * and also that guarantees that file has to 
be mostly
                           * migrated */

+                        str_uuid = uuid_utoa(entry_loc.gfid);
                          ret = syncop_getxattr (this, &entry_loc, &dict,
GF_XATTR_LINKINFO_KEY);
+                        gf_log(this->name, GF_LOG_DEBUG, "uuid in tls: 
%s", str_uuid);
                          if (ret < 0) {
                                  if (errno != ENODATA) {
                                          loglevel = GF_LOG_ERROR;


It seems obvious, but this is the generated assembly code anyway:

1279
     2e8f:       48 8b 7c 24 60          mov    0x60(%rsp),%rdi
     2e94:       e8 00 00 00 00          callq  2e99 
<gf_defrag_migrate_data+0x719> <<<=== get pointer to TLS object
                         2e95: R_X86_64_PLT32    uuid_utoa-0x4
1280
     2e99:       4c 8b 74 24 20          mov    0x20(%rsp),%r14
     2e9e:       48 8b 54 24 70          mov    0x70(%rsp),%rdx
     2ea3:       48 8d 0d 00 00 00 00    lea 0x0(%rip),%rcx        # 
2eaa <gf_defrag_migrate_data+0x72a>
                         2ea6: R_X86_64_PC32     .LC82-0x4
     2eaa:       48 8b 74 24 30          mov    0x30(%rsp),%rsi
1279
     2eaf:       49 89 c7                mov 
%rax,%r15                           <<<=== store the TLS pointer
1280
     2eb2:       4c 89 f7                mov    %r14,%rdi
     2eb5:       e8 00 00 00 00          callq  2eba 
<gf_defrag_migrate_data+0x73a> <<<=== possible thread switch
                         2eb6: R_X86_64_PLT32 syncop_getxattr-0x4
1282
     2eba:       4c 89 3c 24             mov 
%r15,(%rsp)                         <<<=== use the TLS pointer
     2ebe:       49 8b 3e                mov    (%r14),%rdi
     2ec1:       4c 8d 0d 00 00 00 00    lea 0x0(%rip),%r9        # 2ec8 
<gf_defrag_migrate_data+0x748>
                         2ec4: R_X86_64_PC32     .LC83-0x4
     2ec8:       48 8d 15 00 00 00 00    lea 0x0(%rip),%rdx        # 
2ecf <gf_defrag_migrate_data+0x74f>
                         2ecb: R_X86_64_PC32     .rodata+0xbc
     2ecf:       48 8d 35 00 00 00 00    lea 0x0(%rip),%rsi        # 
2ed6 <gf_defrag_migrate_data+0x756>
                         2ed2: R_X86_64_PC32     .LC1-0x4
1280
     2ed6:       89 c5                   mov    %eax,%ebp
1282
     2ed8:       41 b8 08 00 00 00       mov    $0x8,%r8d
     2ede:       31 c0                   xor    %eax,%eax
     2ee0:       b9 02 05 00 00          mov    $0x502,%ecx
     2ee5:       e8 00 00 00 00          callq  2eea 
<gf_defrag_migrate_data+0x76a>
                         2ee6: R_X86_64_PLT32    _gf_log-0x4


The str_uuid pointer used can point to the uuid of another thread that 
could have been modified. I don't know very well the internals of 
syncops and synctask, but I think it's even possible that another task 
running on the same thread while the current one was sleeping could 
modify the contents of str_uuid.

This case seems very obvious and useless but maybe other combinations 
are harder to detect.

I think that currently they are only used by uuid_utoa() and 
lkowner_utoa() to write log messages, which seems safe, but we will have 
to be careful when using syncops to avoid possible future problems like 
this.

Xavi

El 11/12/13 12:02, Pranith Kumar Karampuri ha escrit:
>> it's really a very interesting and hard to find bug, but I'm wondering
>> how we can prevent this to happen in the general case. There might be
>> other operations based on pointers to thread local storage that will
>> suffer this problem. Probably 'errno' is one of the most dangerous, but
>> TLS is also used to resolve 'THIS' in a very similar way to errno, and
>> there are temporary uuid and lkowner values also stored into TLS. More
>> things could appear in the future. These are also potential cases to
>> have problems with syncops.
>   
>> An access to THIS before and after a syncop might trigger this bug, right ?
> for errno it happens because it has the attribute __attribute__((const)).
>
> # grep errno_location /usr/include/bits/errno.h
> extern int *__errno_location (void) __THROW __attribute__ ((__const__));
> #   define errno (*__errno_location ())
>
> (The following article explains it best IMO: http://lwn.net/Articles/285332)
>
> But it is not there for lkowner, uuid-buf, THIS. Just to confirm, I added the following code
> ------------------------------------------------------
> +        xlator_t                *tmp_this = NULL;
>   
>           gf_log (this->name, GF_LOG_INFO, "migrate data called on %s",
>                   loc->path);
> @@ -1258,8 +1259,10 @@ gf_defrag_migrate_data (xlator_t *this, gf_defrag_info_t *defrag, loc_t *loc,
>                            * and also that guarantees that file has to be mostly
>                            * migrated */
>   
> +                        tmp_this = THIS;
>                           ret = syncop_getxattr (this, &entry_loc, &dict,
>                                                  GF_XATTR_LINKINFO_KEY);
> +                        tmp_this = THIS;
>                           if (ret < 0) {
>                                   if (errno != ENODATA) {
>                                           loglevel = GF_LOG_ERROR;
> @@ -1267,7 +1270,7 @@ gf_defrag_migrate_data (xlator_t *this, gf_defrag_info_t *defrag, loc_t *loc,
>                                   } else {
>                                           loglevel = GF_LOG_TRACE;
>                                   }
> -                                gf_log (this->name, loglevel, "%s: failed to "
> +                                gf_log (tmp_this->name, loglevel, "%s: failed to "
> -----------------------------------------------------
> assembly code it generated:
> =====================================================
> 1262	                        tmp_this = THIS;
>     0x00007faf17120dbc <+1788>:	xor    %eax,%eax
>     0x00007faf17120dbe <+1790>:	callq  0x7faf171191d0 <__glusterfs_this_location at plt> <<----------------first call to THIS
>
> 1263	                        ret = syncop_getxattr (this, &entry_loc, &dict,
>     0x00007faf17120dc3 <+1795>:	mov    0x60(%rsp),%rdx
>     0x00007faf17120dc8 <+1800>:	mov    0x28(%rsp),%rsi
>     0x00007faf17120dcd <+1805>:	lea    0x2b550(%rip),%rcx        # 0x7faf1714c324
>     0x00007faf17120dd4 <+1812>:	mov    0x18(%rsp),%rdi
>     0x00007faf17120dd9 <+1817>:	callq  0x7faf17118a30 <syncop_getxattr at plt>
>     0x00007faf17120dde <+1822>:	mov    %eax,%ebp
>
> 1264	                                               GF_XATTR_LINKINFO_KEY);
> ---Type <return> to continue, or q <return> to quit---
> 1265	                        tmp_this = THIS;
>     0x00007faf17120de0 <+1824>:	xor    %eax,%eax
>     0x00007faf17120de2 <+1826>:	callq  0x7faf171191d0 <__glusterfs_this_location at plt> <<----------------second call to THIS
>     0x00007faf17120de9 <+1833>:	mov    (%rax),%r15
>
> 1266	                        if (ret < 0) {
>     0x00007faf17120de7 <+1831>:	test   %ebp,%ebp
>     0x00007faf17120dec <+1836>:	js     0x7faf1712100e <gf_defrag_migrate_data+2382>
>
> 1267	                                if (errno != ENODATA) {
>     0x00007faf1712100e <+2382>:	mov    0x50(%rsp),%rax <<<---------------------------------------------check that errno does not callq __errno_location
>     0x00007faf17121013 <+2387>:	mov    $0x9,%ebp
>     0x00007faf17121018 <+2392>:	mov    (%rax),%edi
>     0x00007faf1712101a <+2394>:	cmp    $0x3d,%edi
>
> ================================================
>
>> I think it's very difficult to track all these cases and handle them
>> correctly. Another solution could be to tell the compiler to forget
>> previous pointer values when a syncop is called, but I don't see any way
>> to do so. Since it's storing pointers to values, any barrier or volatile
>> seems useless.
> We can do this by putting __attribute__((noinline)) for such functions. But since errno is not in our control we can't do that.
>
>> Xavi
>>
>>
>> El 11/12/13 09:51, Pranith Kumar Karampuri ha escrit:
>>> hi,
>>>       We found a day-1 bug when syncop_xxx() infra is used inside a synctask
>>>       with compilation optimization (CFLAGS -O2). This bug has been dormant
>>>       for at least 2 years.
>>> There are around ~400(rebalance, replace-brick, bd, self-heal-daemon,
>>> quota, fuse lock/fd migration) places where syncop is used in the code
>>> base all of which are potential candidates which can take the hit.
>>>
>>> I sent first round of patch at http://review.gluster.com/6475 to catch
>>> regressions upstream.
>>> These are the files that are affected by the changes I introduced to fix
>>> this:
>>>
>>>    api/src/glfs-fops.c                             |  36
>>>    ++++++++++++++++++++++++++++++++++
>>>    api/src/glfs-handleops.c                        |  15 ++++++++++++++
>>>    api/src/glfs-internal.h                         |   7 +++++++
>>>    api/src/glfs-resolve.c                          |  10 ++++++++++
>>>    libglusterfs/src/syncop.c                       | 117
>>>    ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-------------------------------------
>>>    xlators/cluster/afr/src/afr-self-heald.c        |  45
>>>    +++++++++++++++++++++++++++++-------------
>>>    xlators/cluster/afr/src/pump.c                  |  12 ++++++++++--
>>>    xlators/cluster/dht/src/dht-helper.c            |  24
>>>    +++++++++++++++--------
>>>    xlators/cluster/dht/src/dht-rebalance.c         | 168
>>>    ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++---------------------------------------------------------------
>>>    xlators/cluster/dht/src/dht-selfheal.c          |   6 ++++--
>>>    xlators/features/locks/src/posix.c              |   3 ++-
>>>    xlators/features/qemu-block/src/bdrv-xlator.c   |  15 ++++----------
>>>    xlators/features/qemu-block/src/qb-coroutines.c |  14 ++++++++++----
>>>    xlators/mount/fuse/src/fuse-bridge.c            |  16 ++++++++++-----
>>>
>>> Please review your respective component for these changes in gerrit.
>>>
>>> Thanks
>>> Pranith.
>>>
>>> Detailed explanation of the Root cause:
>>> We found the bug in 'gf_defrag_migrate_data' in rebalance operation:
>>>
>>> Lets look at interesting parts of the function:
>>>
>>> int
>>> gf_defrag_migrate_data (xlator_t *this, gf_defrag_info_t *defrag, loc_t
>>> *loc,
>>>                           dict_t *migrate_data)
>>> {
>>> .....
>>> code section - [ Loop ]
>>>           while ((ret = syncop_readdirp (this, fd, 131072, offset, NULL,
>>>                                          &entries)) != 0) {
>>> .....
>>> code section - [ ERRNO-1 ] (errno of readdirp is stored in readdir_operrno
>>> by a thread)
>>>                   /* Need to keep track of ENOENT errno, that means, there
>>>                   is no
>>>                      need to send more readdirp() */
>>>                   readdir_operrno = errno;
>>> .....
>>> code section - [ SYNCOP-1 ] (syncop_getxattr is called by a thread)
>>>                           ret = syncop_getxattr (this, &entry_loc, &dict,
>>>                                                  GF_XATTR_LINKINFO_KEY);
>>> code section - [ ERRNO-2]   (checking for failures of syncop_getxattr().
>>> This may not always be executed in same thread which executed [SYNCOP-1])
>>>                           if (ret < 0) {
>>>                                   if (errno != ENODATA) {
>>>                                           loglevel = GF_LOG_ERROR;
>>>                                           defrag->total_failures += 1;
>>> .....
>>> }
>>>
>>> the function above could be executed by thread(t1) till [SYNCOP-1] and code
>>> from [ERRNO-2] can be executed by a different thread(t2) because of the
>>> way syncop-infra schedules the tasks.
>>>
>>> when the code is compiled with -O2 optimization this is the assembly code
>>> that is generated:
>>>    [ERRNO-1]
>>> 1165                        readdir_operrno = errno; <<---- errno gets
>>> expanded as *(__errno_location())
>>>      0x00007fd149d48b60 <+496>:        callq  0x7fd149d410c0
>>>      <__errno_location at plt>
>>>      0x00007fd149d48b72 <+514>:        mov    %rax,0x50(%rsp) <<------
>>>      Address returned by __errno_location() is stored in a special location
>>>      in stack for later use.
>>>      0x00007fd149d48b77 <+519>:        mov    (%rax),%eax
>>>      0x00007fd149d48b79 <+521>:        mov    %eax,0x78(%rsp)
>>> ....
>>>    [ERRNO-2]
>>> 1281                                        if (errno != ENODATA) {
>>>      0x00007fd149d492ae <+2366>:        mov    0x50(%rsp),%rax <<-----
>>>      Because it already stored the address returned by __errno_location(),
>>>      it just dereferences the address to get the errno value. BUT THIS CODE
>>>      NEED NOT BE EXECUTED BY SAME THREAD!!!
>>>      0x00007fd149d492b3 <+2371>:        mov    $0x9,%ebp
>>>      0x00007fd149d492b8 <+2376>:        mov    (%rax),%edi
>>>      0x00007fd149d492ba <+2378>:        cmp    $0x3d,%edi
>>>
>>> The problem is that __errno_location() value of t1 and t2 are different. So
>>> [ERRNO-2] ends up reading errno of t1 instead of errno of t2 even though
>>> t2 is executing [ERRNO-2] code section.
>>>
>>> When code is compiled without any optimization for [ERRNO-2]:
>>> 1281                                        if (errno != ENODATA) {
>>>      0x00007fd58e7a326f <+2237>:        callq  0x7fd58e797300
>>>      <__errno_location at plt><<--- As it is calling __errno_location() again
>>>      it gets the location from t2 so it works as intended.
>>>      0x00007fd58e7a3274 <+2242>:        mov    (%rax),%eax
>>>      0x00007fd58e7a3276 <+2244>:        cmp    $0x3d,%eax
>>>      0x00007fd58e7a3279 <+2247>:        je     0x7fd58e7a32a1
>>>      <gf_defrag_migrate_data+2287>
>>>
>>> Fix:
>>> We decided to make syncop_xxx() return (-errno) value as the return value
>>> in case of errors and all the functions which make syncop_xxx() will need
>>> to use (-ret) to figure out the reason for failure in case of syncop_xxx()
>>> failures.
>>>
>>> Pranith
>>>
>>> _______________________________________________
>>> Gluster-devel mailing list
>>> Gluster-devel at nongnu.org
>>> https://lists.nongnu.org/mailman/listinfo/gluster-devel
>>
> _______________________________________________
> Gluster-devel mailing list
> Gluster-devel at nongnu.org
> https://lists.nongnu.org/mailman/listinfo/gluster-devel