[Gluster-devel] important change to syncop infra
Xavier Hernandez
xhernandez at datalab.es
Wed Dec 11 12:26:53 UTC 2013
I missed the const attribute. Thanks for the explanation. This should
eliminate the THIS problem, as you proved, however other cases are still
possible that won't work correctly.
Basically, any TLS pointer passed to user code instead of being
dereferenced will have problems. In this case, uuid and lkowner have
this problem. Consider the following code:
@@ -1129,6 +1129,7 @@ gf_defrag_migrate_data (xlator_t *this,
gf_defrag_info_t *defrag, loc_t *loc,
struct timeval start = {0,};
int32_t err = 0;
int loglevel = GF_LOG_TRACE;
+ char * str_uuid = NULL;
gf_log (this->name, GF_LOG_INFO, "migrate data called on %s",
loc->path);
@@ -1275,8 +1276,10 @@ gf_defrag_migrate_data (xlator_t *this,
gf_defrag_info_t *defrag, loc_t *loc,
* and also that guarantees that file has to
be mostly
* migrated */
+ str_uuid = uuid_utoa(entry_loc.gfid);
ret = syncop_getxattr (this, &entry_loc, &dict,
GF_XATTR_LINKINFO_KEY);
+ gf_log(this->name, GF_LOG_DEBUG, "uuid in tls:
%s", str_uuid);
if (ret < 0) {
if (errno != ENODATA) {
loglevel = GF_LOG_ERROR;
It seems obvious, but this is the generated assembly code anyway:
1279
2e8f: 48 8b 7c 24 60 mov 0x60(%rsp),%rdi
2e94: e8 00 00 00 00 callq 2e99
<gf_defrag_migrate_data+0x719> <<<=== get pointer to TLS object
2e95: R_X86_64_PLT32 uuid_utoa-0x4
1280
2e99: 4c 8b 74 24 20 mov 0x20(%rsp),%r14
2e9e: 48 8b 54 24 70 mov 0x70(%rsp),%rdx
2ea3: 48 8d 0d 00 00 00 00 lea 0x0(%rip),%rcx #
2eaa <gf_defrag_migrate_data+0x72a>
2ea6: R_X86_64_PC32 .LC82-0x4
2eaa: 48 8b 74 24 30 mov 0x30(%rsp),%rsi
1279
2eaf: 49 89 c7 mov
%rax,%r15 <<<=== store the TLS pointer
1280
2eb2: 4c 89 f7 mov %r14,%rdi
2eb5: e8 00 00 00 00 callq 2eba
<gf_defrag_migrate_data+0x73a> <<<=== possible thread switch
2eb6: R_X86_64_PLT32 syncop_getxattr-0x4
1282
2eba: 4c 89 3c 24 mov
%r15,(%rsp) <<<=== use the TLS pointer
2ebe: 49 8b 3e mov (%r14),%rdi
2ec1: 4c 8d 0d 00 00 00 00 lea 0x0(%rip),%r9 # 2ec8
<gf_defrag_migrate_data+0x748>
2ec4: R_X86_64_PC32 .LC83-0x4
2ec8: 48 8d 15 00 00 00 00 lea 0x0(%rip),%rdx #
2ecf <gf_defrag_migrate_data+0x74f>
2ecb: R_X86_64_PC32 .rodata+0xbc
2ecf: 48 8d 35 00 00 00 00 lea 0x0(%rip),%rsi #
2ed6 <gf_defrag_migrate_data+0x756>
2ed2: R_X86_64_PC32 .LC1-0x4
1280
2ed6: 89 c5 mov %eax,%ebp
1282
2ed8: 41 b8 08 00 00 00 mov $0x8,%r8d
2ede: 31 c0 xor %eax,%eax
2ee0: b9 02 05 00 00 mov $0x502,%ecx
2ee5: e8 00 00 00 00 callq 2eea
<gf_defrag_migrate_data+0x76a>
2ee6: R_X86_64_PLT32 _gf_log-0x4
The str_uuid pointer used can point to the uuid of another thread that
could have been modified. I don't know very well the internals of
syncops and synctask, but I think it's even possible that another task
running on the same thread while the current one was sleeping could
modify the contents of str_uuid.
This case seems very obvious and useless but maybe other combinations
are harder to detect.
I think that currently they are only used by uuid_utoa() and
lkowner_utoa() to write log messages, which seems safe, but we will have
to be careful when using syncops to avoid possible future problems like
this.
Xavi
El 11/12/13 12:02, Pranith Kumar Karampuri ha escrit:
>> it's really a very interesting and hard to find bug, but I'm wondering
>> how we can prevent this to happen in the general case. There might be
>> other operations based on pointers to thread local storage that will
>> suffer this problem. Probably 'errno' is one of the most dangerous, but
>> TLS is also used to resolve 'THIS' in a very similar way to errno, and
>> there are temporary uuid and lkowner values also stored into TLS. More
>> things could appear in the future. These are also potential cases to
>> have problems with syncops.
>
>> An access to THIS before and after a syncop might trigger this bug, right ?
> for errno it happens because it has the attribute __attribute__((const)).
>
> # grep errno_location /usr/include/bits/errno.h
> extern int *__errno_location (void) __THROW __attribute__ ((__const__));
> # define errno (*__errno_location ())
>
> (The following article explains it best IMO: http://lwn.net/Articles/285332)
>
> But it is not there for lkowner, uuid-buf, THIS. Just to confirm, I added the following code
> ------------------------------------------------------
> + xlator_t *tmp_this = NULL;
>
> gf_log (this->name, GF_LOG_INFO, "migrate data called on %s",
> loc->path);
> @@ -1258,8 +1259,10 @@ gf_defrag_migrate_data (xlator_t *this, gf_defrag_info_t *defrag, loc_t *loc,
> * and also that guarantees that file has to be mostly
> * migrated */
>
> + tmp_this = THIS;
> ret = syncop_getxattr (this, &entry_loc, &dict,
> GF_XATTR_LINKINFO_KEY);
> + tmp_this = THIS;
> if (ret < 0) {
> if (errno != ENODATA) {
> loglevel = GF_LOG_ERROR;
> @@ -1267,7 +1270,7 @@ gf_defrag_migrate_data (xlator_t *this, gf_defrag_info_t *defrag, loc_t *loc,
> } else {
> loglevel = GF_LOG_TRACE;
> }
> - gf_log (this->name, loglevel, "%s: failed to "
> + gf_log (tmp_this->name, loglevel, "%s: failed to "
> -----------------------------------------------------
> assembly code it generated:
> =====================================================
> 1262 tmp_this = THIS;
> 0x00007faf17120dbc <+1788>: xor %eax,%eax
> 0x00007faf17120dbe <+1790>: callq 0x7faf171191d0 <__glusterfs_this_location at plt> <<----------------first call to THIS
>
> 1263 ret = syncop_getxattr (this, &entry_loc, &dict,
> 0x00007faf17120dc3 <+1795>: mov 0x60(%rsp),%rdx
> 0x00007faf17120dc8 <+1800>: mov 0x28(%rsp),%rsi
> 0x00007faf17120dcd <+1805>: lea 0x2b550(%rip),%rcx # 0x7faf1714c324
> 0x00007faf17120dd4 <+1812>: mov 0x18(%rsp),%rdi
> 0x00007faf17120dd9 <+1817>: callq 0x7faf17118a30 <syncop_getxattr at plt>
> 0x00007faf17120dde <+1822>: mov %eax,%ebp
>
> 1264 GF_XATTR_LINKINFO_KEY);
> ---Type <return> to continue, or q <return> to quit---
> 1265 tmp_this = THIS;
> 0x00007faf17120de0 <+1824>: xor %eax,%eax
> 0x00007faf17120de2 <+1826>: callq 0x7faf171191d0 <__glusterfs_this_location at plt> <<----------------second call to THIS
> 0x00007faf17120de9 <+1833>: mov (%rax),%r15
>
> 1266 if (ret < 0) {
> 0x00007faf17120de7 <+1831>: test %ebp,%ebp
> 0x00007faf17120dec <+1836>: js 0x7faf1712100e <gf_defrag_migrate_data+2382>
>
> 1267 if (errno != ENODATA) {
> 0x00007faf1712100e <+2382>: mov 0x50(%rsp),%rax <<<---------------------------------------------check that errno does not callq __errno_location
> 0x00007faf17121013 <+2387>: mov $0x9,%ebp
> 0x00007faf17121018 <+2392>: mov (%rax),%edi
> 0x00007faf1712101a <+2394>: cmp $0x3d,%edi
>
> ================================================
>
>> I think it's very difficult to track all these cases and handle them
>> correctly. Another solution could be to tell the compiler to forget
>> previous pointer values when a syncop is called, but I don't see any way
>> to do so. Since it's storing pointers to values, any barrier or volatile
>> seems useless.
> We can do this by putting __attribute__((noinline)) for such functions. But since errno is not in our control we can't do that.
>
>> Xavi
>>
>>
>> El 11/12/13 09:51, Pranith Kumar Karampuri ha escrit:
>>> hi,
>>> We found a day-1 bug when syncop_xxx() infra is used inside a synctask
>>> with compilation optimization (CFLAGS -O2). This bug has been dormant
>>> for at least 2 years.
>>> There are around ~400(rebalance, replace-brick, bd, self-heal-daemon,
>>> quota, fuse lock/fd migration) places where syncop is used in the code
>>> base all of which are potential candidates which can take the hit.
>>>
>>> I sent first round of patch at http://review.gluster.com/6475 to catch
>>> regressions upstream.
>>> These are the files that are affected by the changes I introduced to fix
>>> this:
>>>
>>> api/src/glfs-fops.c | 36
>>> ++++++++++++++++++++++++++++++++++
>>> api/src/glfs-handleops.c | 15 ++++++++++++++
>>> api/src/glfs-internal.h | 7 +++++++
>>> api/src/glfs-resolve.c | 10 ++++++++++
>>> libglusterfs/src/syncop.c | 117
>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-------------------------------------
>>> xlators/cluster/afr/src/afr-self-heald.c | 45
>>> +++++++++++++++++++++++++++++-------------
>>> xlators/cluster/afr/src/pump.c | 12 ++++++++++--
>>> xlators/cluster/dht/src/dht-helper.c | 24
>>> +++++++++++++++--------
>>> xlators/cluster/dht/src/dht-rebalance.c | 168
>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++---------------------------------------------------------------
>>> xlators/cluster/dht/src/dht-selfheal.c | 6 ++++--
>>> xlators/features/locks/src/posix.c | 3 ++-
>>> xlators/features/qemu-block/src/bdrv-xlator.c | 15 ++++----------
>>> xlators/features/qemu-block/src/qb-coroutines.c | 14 ++++++++++----
>>> xlators/mount/fuse/src/fuse-bridge.c | 16 ++++++++++-----
>>>
>>> Please review your respective component for these changes in gerrit.
>>>
>>> Thanks
>>> Pranith.
>>>
>>> Detailed explanation of the Root cause:
>>> We found the bug in 'gf_defrag_migrate_data' in rebalance operation:
>>>
>>> Lets look at interesting parts of the function:
>>>
>>> int
>>> gf_defrag_migrate_data (xlator_t *this, gf_defrag_info_t *defrag, loc_t
>>> *loc,
>>> dict_t *migrate_data)
>>> {
>>> .....
>>> code section - [ Loop ]
>>> while ((ret = syncop_readdirp (this, fd, 131072, offset, NULL,
>>> &entries)) != 0) {
>>> .....
>>> code section - [ ERRNO-1 ] (errno of readdirp is stored in readdir_operrno
>>> by a thread)
>>> /* Need to keep track of ENOENT errno, that means, there
>>> is no
>>> need to send more readdirp() */
>>> readdir_operrno = errno;
>>> .....
>>> code section - [ SYNCOP-1 ] (syncop_getxattr is called by a thread)
>>> ret = syncop_getxattr (this, &entry_loc, &dict,
>>> GF_XATTR_LINKINFO_KEY);
>>> code section - [ ERRNO-2] (checking for failures of syncop_getxattr().
>>> This may not always be executed in same thread which executed [SYNCOP-1])
>>> if (ret < 0) {
>>> if (errno != ENODATA) {
>>> loglevel = GF_LOG_ERROR;
>>> defrag->total_failures += 1;
>>> .....
>>> }
>>>
>>> the function above could be executed by thread(t1) till [SYNCOP-1] and code
>>> from [ERRNO-2] can be executed by a different thread(t2) because of the
>>> way syncop-infra schedules the tasks.
>>>
>>> when the code is compiled with -O2 optimization this is the assembly code
>>> that is generated:
>>> [ERRNO-1]
>>> 1165 readdir_operrno = errno; <<---- errno gets
>>> expanded as *(__errno_location())
>>> 0x00007fd149d48b60 <+496>: callq 0x7fd149d410c0
>>> <__errno_location at plt>
>>> 0x00007fd149d48b72 <+514>: mov %rax,0x50(%rsp) <<------
>>> Address returned by __errno_location() is stored in a special location
>>> in stack for later use.
>>> 0x00007fd149d48b77 <+519>: mov (%rax),%eax
>>> 0x00007fd149d48b79 <+521>: mov %eax,0x78(%rsp)
>>> ....
>>> [ERRNO-2]
>>> 1281 if (errno != ENODATA) {
>>> 0x00007fd149d492ae <+2366>: mov 0x50(%rsp),%rax <<-----
>>> Because it already stored the address returned by __errno_location(),
>>> it just dereferences the address to get the errno value. BUT THIS CODE
>>> NEED NOT BE EXECUTED BY SAME THREAD!!!
>>> 0x00007fd149d492b3 <+2371>: mov $0x9,%ebp
>>> 0x00007fd149d492b8 <+2376>: mov (%rax),%edi
>>> 0x00007fd149d492ba <+2378>: cmp $0x3d,%edi
>>>
>>> The problem is that __errno_location() value of t1 and t2 are different. So
>>> [ERRNO-2] ends up reading errno of t1 instead of errno of t2 even though
>>> t2 is executing [ERRNO-2] code section.
>>>
>>> When code is compiled without any optimization for [ERRNO-2]:
>>> 1281 if (errno != ENODATA) {
>>> 0x00007fd58e7a326f <+2237>: callq 0x7fd58e797300
>>> <__errno_location at plt><<--- As it is calling __errno_location() again
>>> it gets the location from t2 so it works as intended.
>>> 0x00007fd58e7a3274 <+2242>: mov (%rax),%eax
>>> 0x00007fd58e7a3276 <+2244>: cmp $0x3d,%eax
>>> 0x00007fd58e7a3279 <+2247>: je 0x7fd58e7a32a1
>>> <gf_defrag_migrate_data+2287>
>>>
>>> Fix:
>>> We decided to make syncop_xxx() return (-errno) value as the return value
>>> in case of errors and all the functions which make syncop_xxx() will need
>>> to use (-ret) to figure out the reason for failure in case of syncop_xxx()
>>> failures.
>>>
>>> Pranith
>>>
>>> _______________________________________________
>>> Gluster-devel mailing list
>>> Gluster-devel at nongnu.org
>>> https://lists.nongnu.org/mailman/listinfo/gluster-devel
>>
> _______________________________________________
> Gluster-devel mailing list
> Gluster-devel at nongnu.org
> https://lists.nongnu.org/mailman/listinfo/gluster-devel
More information about the Gluster-devel
mailing list