[Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request
Ravishankar N
ravishankar at redhat.com
Thu Apr 16 07:03:39 UTC 2020
The patch by itself is only making changes specific to AFR, so it should
not affect other translators. But I wonder how readdir-ahead is enabled
in your gnfs stack. All performance xlators are turned off in gnfs
except write-behind, and AFAIK there is no way to enable them via the
CLI. Did you custom-edit your gnfs volfile to add readdir-ahead? If so,
does the crash go away if you remove the xlator from the nfs volfile?
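
For reference, a hand-added readdir-ahead stanza in a gnfs volfile would
typically look like the following (volume names here are illustrative,
not taken from your setup):

    volume myvol-readdir-ahead
        type performance/readdir-ahead
        subvolumes myvol-replicate-0
    end-volume

Removing that block (and re-pointing whatever referenced
myvol-readdir-ahead at myvol-replicate-0 instead) would take the xlator
out of the stack.
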
Regards,
Ravi
On 16/04/20 8:47 am, Erik Jacobson wrote:
> It is important to note that our testing has shown zero split-brain
> errors since the patch... and the seg fault is significantly harder
> to hit than the split-brain was. It's still sufficiently frequent
> that we can't let it out the door. My intensive test case (described
> elsewhere in the thread) previously hit the split-brain problem at
> least once on every run with 57 nodes. With the patch: zero split-brain
> errors, but maybe 1 in 4 runs hit the seg fault. We didn't have a seg
> fault problem previously. This is all within the context of 1 of the 3
> servers in the subvolume being down. I hit the seg fault once with just
> 57 nodes booting (using NFS for their root FS) and no other load.
>
>
> Scott was able to take an analysis pass. Any suggestions? His words
> follow:
>
>
> The segfault appears to occur in readdir-ahead functionality. We will
> keep the core in case it needs to be looked at again, being sure to
> copy off all necessary metadata to maintain adequate symbol lookup
> within gdb. It may also be possible to set a breakpoint immediately
> prior to the segfault, but setting the right conditions may prove
> difficult.
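>
> As a sketch, a conditional breakpoint at the unwind site might look
> like the following (the breakpoint number and frame address here are
> from this core and would differ on a new run):
>
> (gdb) break readdir-ahead.c:623
> (gdb) condition 1 frame == (call_frame_t *) 0x7fe5acf18eb8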
>
> A bit of analysis:
>
> Prior to the segfault, the op_errno field in a struct rda_fd_ctx
> instance holds an ENOENT error. That structure is reached from the
> call_frame_t parameter of rda_fill_fd_cbk() (backtrace frame #2).
> The following shows the progression from the call_frame_t parameter
> to the op_errno field of the rda_fd_ctx structure.
>
> (gdb) print {call_frame_t}0x7fe5acf18eb8
> $26 = {root = 0x7fe5ac6d65f8, parent = 0x0, frames = {next =
> 0x7fe5ac6d6cf0, prev = 0x7fe5ac096298}, local = 0x7fe5ac1dbc78,
> this = 0x7fe63c0162b0, ret = 0x0, ref_count = 0, lock = {spinlock =
> 0, mutex = {__data = {__lock = 0, __count = 0, __owner = 0,
> __nusers = 0, __kind = 0, __spins = 0, __elision = 0, __list =
> {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 39 times>,
> __align = 0}}, cookie = 0x0, complete = false, op = GF_FOP_NULL,
> begin = {tv_sec = 4234, tv_nsec = 637078332}, end = {tv_sec = 4234,
> tv_nsec = 803882781}, wind_from = 0x0, wind_to = 0x0, unwind_from =
> 0x0, unwind_to = 0x0}
>
> (gdb) print {struct rda_local}0x7fe5ac1dbc78
> $27 = {ctx = 0x7fe5ace46590, fd = 0x7fe60433d8b8, xattrs = 0x0, inode =
> 0x0, offset = 0, generation = 0, skip_dir = 0}
>
> (gdb) print {struct rda_fd_ctx}0x7fe5ace46590
> $28 = {cur_offset = 0, cur_size = 638, next_offset = 1538, state = 36,
> lock = {spinlock = 0, mutex = {__data = {__lock = 0, __count = 0,
> __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __elision =
> 0, __list = {__prev = 0x0, __next = 0x0}},
> __size = '\000' <repeats 39 times>, __align = 0}}, entries =
> {{list = {next = 0x7fe60cda5f90, prev = 0x7fe60ca08190}, {
> next = 0x7fe60cda5f90, prev = 0x7fe60ca08190}}, d_ino = 0,
> d_off = 0, d_len = 0, d_type = 0, d_stat = {ia_flags = 0, ia_ino = 0,
> ia_dev = 0, ia_rdev = 0, ia_size = 0, ia_nlink = 0, ia_uid = 0,
> ia_gid = 0, ia_blksize = 0, ia_blocks = 0, ia_atime = 0,
> ia_mtime = 0, ia_ctime = 0, ia_btime = 0, ia_atime_nsec = 0,
> ia_mtime_nsec = 0, ia_ctime_nsec = 0, ia_btime_nsec = 0,
> ia_attributes = 0, ia_attributes_mask = 0, ia_gfid = '\000'
> <repeats 15 times>, ia_type = IA_INVAL, ia_prot = {suid = 0 '\000',
> sgid = 0 '\000', sticky = 0 '\000', owner = {read = 0 '\000',
> write = 0 '\000', exec = 0 '\000'}, group = {read = 0 '\000',
> write = 0 '\000', exec = 0 '\000'}, other = {read = 0 '\000',
> write = 0 '\000', exec = 0 '\000'}}}, dict = 0x0, inode = 0x0,
> d_name = 0x7fe5ace466a8 ""}, fill_frame = 0x0, stub = 0x0, op_errno
> = 2, xattrs = 0x0, writes_during_prefetch = 0x0, prefetching = {
> lk = 0x7fe5ace466d0 "", value = 0}}
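>
> As an aside, the same pointer chase can be collapsed into one gdb
> expression (same core and addresses assumed); it should print 2,
> i.e. ENOENT, matching $28 above:
>
> (gdb) print ((struct rda_local *)((call_frame_t *)0x7fe5acf18eb8)->local)->ctx->op_errno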
>
> The segfault occurs at the bottom of rda_fill_fd_cbk(), where the rpc
> call stack frames are being destroyed. The following are what I believe
> to be the three frames intended to be destroyed, but it is unclear
> which frame is causing the problem. If I dig further into this, I will
> use ddd (a graphical debugger). It's been a while since I've done
> low-level debugging like this, so I'm a bit rusty.
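>
> For context, the destruction loop in STACK_DESTROY walks the frame
> list hanging off the stack root and runs FRAME_DESTROY on each entry.
> A simplified sketch of the idea (paraphrased, not the exact stack.h
> macros):
>
> /* FRAME_DESTROY unlinks the frame from the stack's frame list and
>    frees frame->local; if one frame in the list is already corrupt
>    (note $35 below: this = 0x108a and ret = 0x25f90b3c are not
>    plausible pointers), the walk dereferences garbage and faults. */
> while (!list_empty(&stack->myframes)) {
>     call_frame_t *frame =
>         list_first_entry(&stack->myframes, call_frame_t, frames);
>     FRAME_DESTROY(frame);
> }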
>
> (gdb) print {call_frame_t}0x7fe5acf18eb8
> $34 = {root = 0x7fe5ac6d65f8, parent = 0x0, frames = {next =
> 0x7fe5ac6d6cf0, prev = 0x7fe5ac096298}, local = 0x7fe5ac1dbc78,
> this = 0x7fe63c0162b0, ret = 0x0, ref_count = 0, lock = {spinlock =
> 0, mutex = {__data = {__lock = 0, __count = 0, __owner = 0,
> __nusers = 0, __kind = 0, __spins = 0, __elision = 0, __list =
> {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 39 times>,
> __align = 0}}, cookie = 0x0, complete = false, op = GF_FOP_NULL,
> begin = {tv_sec = 4234, tv_nsec = 637078332}, end = {tv_sec = 4234,
> tv_nsec = 803882781}, wind_from = 0x0, wind_to = 0x0, unwind_from =
> 0x0, unwind_to = 0x0}
> (gdb) print {call_frame_t}0x7fe5ac6d6ce0
> $35 = {root = 0x0, parent = 0x563f5a955920, frames = {next =
> 0x7fe5ac096298, prev = 0x7fe5acf18ec8}, local = 0x0, this = 0x108a,
> ret = 0x25f90b3c, ref_count = 0, lock = {spinlock = 0, mutex =
> {__data = {__lock = 0, __count = 0, __owner = 1586972324, __nusers = 0,
> __kind = 210092664, __spins = 0, __elision = 0, __list =
> {__prev = 0x0, __next = 0x0}},
> __size =
> "\000\000\000\000\000\000\000\000\244F\227^\000\000\000\000x\302\205\f",
> '\000' <repeats 19 times>, __align = 0}},
> cookie = 0x0, complete = false, op = GF_FOP_NULL, begin = {tv_sec =
> 0, tv_nsec = 0}, end = {tv_sec = 0, tv_nsec = 0}, wind_from = 0x0,
> wind_to = 0x0, unwind_from = 0x0, unwind_to = 0x0}
> (gdb) print {call_frame_t}0x7fe5ac096288
> $36 = {root = 0x7fe5ac378860, parent = 0x7fe5acf18eb8, frames = {next =
> 0x7fe5acf18ec8, prev = 0x7fe5ac6d6cf0}, local = 0x0,
> this = 0x7fe63c014000, ret = 0x7fe63bb5d350 <rda_fill_fd_cbk>,
> ref_count = 0, lock = {spinlock = 0, mutex = {__data = {__lock = 0,
> __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins =
> 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}},
> __size = '\000' <repeats 39 times>, __align = 0}}, cookie =
> 0x7fe5ac096288, complete = true, op = GF_FOP_READDIRP, begin = {
> tv_sec = 4234, tv_nsec = 637078816}, end = {tv_sec = 4234, tv_nsec
> = 803882755},
> wind_from = 0x7fe63bb5e8c0 <__FUNCTION__.22226> "rda_fill_fd",
> wind_to = 0x7fe63bb5e3f0 "(this->children->xlator)->fops->readdirp",
> unwind_from = 0x7fe63bdd8a80 <__FUNCTION__.20442> "afr_readdir_cbk",
> unwind_to = 0x7fe63bb5dfbb "rda_fill_fd_cbk"}
>
>
>
> On 4/15/20 8:14 AM, Erik Jacobson wrote:
>> Scott - I was going to start with gluster74 since that is what he
>> started with, but it applies well to gluster72 so I'll start there.
>>
>> Getting ready to go. The patch detail is interesting; this is probably
>> why it took him a bit longer. It wasn't a trivial patch.
>
>
> On Wed, Apr 15, 2020 at 12:45:57PM -0500, Erik Jacobson wrote:
>>> The new split-brain issue is much harder to reproduce, but after several
>> (correcting to say new seg fault issue, the split brain is gone!!)
>>
>>> intense runs, it usually hits once.
>>>
>>> We switched to pure gluster74 plus your patch so we're apples to apples
>>> now.
>>>
>>> I'm going to see if Scott can help debug it.
>>>
>>> Here is the backtrace info from the core dump:
>>>
>>> -rw-r----- 1 root root 1.9G Apr 15 12:40 core.glusterfs.0.52467a7e67964553aa9971eb2bb0148c.61084.1586972324000000
>>> -rw-r----- 1 root root 221M Apr 15 12:40 core.glusterfs.0.52467a7e67964553aa9971eb2bb0148c.61084.1586972324000000.lz4
>>> drwxrwxrwt 9 root root 20K Apr 15 12:40 .
>>> [root at leader3 tmp]#
>>> [root at leader3 tmp]#
>>> [root at leader3 tmp]# gdb core.glusterfs.0.52467a7e67964553aa9971eb2bb0148c.61084.1586972324000000
>>> GNU gdb (GDB) Red Hat Enterprise Linux 8.2-5.el8
>>> Copyright (C) 2018 Free Software Foundation, Inc.
>>> License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
>>> This is free software: you are free to change and redistribute it.
>>> There is NO WARRANTY, to the extent permitted by law.
>>> Type "show copying" and "show warranty" for details.
>>> This GDB was configured as "x86_64-redhat-linux-gnu".
>>> Type "show configuration" for configuration details.
>>> For bug reporting instructions, please see:
>>> <http://www.gnu.org/software/gdb/bugs/>.
>>> Find the GDB manual and other documentation resources online at:
>>> <http://www.gnu.org/software/gdb/documentation/>.
>>>
>>> For help, type "help".
>>> Type "apropos word" to search for commands related to "word"...
>>> [New LWP 61102]
>>> [New LWP 61085]
>>> [New LWP 61087]
>>> [New LWP 61117]
>>> [New LWP 61086]
>>> [New LWP 61108]
>>> [New LWP 61089]
>>> [New LWP 61090]
>>> [New LWP 61121]
>>> [New LWP 61088]
>>> [New LWP 61091]
>>> [New LWP 61093]
>>> [New LWP 61095]
>>> [New LWP 61092]
>>> [New LWP 61094]
>>> [New LWP 61098]
>>> [New LWP 61096]
>>> [New LWP 61097]
>>> [New LWP 61084]
>>> [New LWP 61100]
>>> [New LWP 61103]
>>> [New LWP 61104]
>>> [New LWP 61099]
>>> [New LWP 61105]
>>> [New LWP 61101]
>>> [New LWP 61106]
>>> [New LWP 61109]
>>> [New LWP 61107]
>>> [New LWP 61112]
>>> [New LWP 61119]
>>> [New LWP 61110]
>>> [New LWP 61111]
>>> [New LWP 61118]
>>> [New LWP 61123]
>>> [New LWP 61122]
>>> [New LWP 61113]
>>> [New LWP 61114]
>>> [New LWP 61120]
>>> [New LWP 61116]
>>> [New LWP 61115]
>>>
>>> warning: core file may not match specified executable file.
>>> Reading symbols from /usr/sbin/glusterfsd...Reading symbols from /usr/lib/debug/usr/sbin/glusterfsd-7.4-1.el8722.0800.200415T1052.a.rhel8hpeerikj.x86_64.debug...done.
>>> done.
>>>
>>> warning: Ignoring non-absolute filename: <linux-vdso.so.1>
>>> Missing separate debuginfo for linux-vdso.so.1
>>> Try: dnf --enablerepo='*debug*' install /usr/lib/debug/.build-id/06/44254f9cbaa826db070a796046026adba58266
>>>
>>> warning: Loadable section ".note.gnu.property" outside of ELF segments
>>>
>>> warning: Loadable section ".note.gnu.property" outside of ELF segments
>>>
>>> warning: Loadable section ".note.gnu.property" outside of ELF segments
>>> [Thread debugging using libthread_db enabled]
>>> Using host libthread_db library "/lib64/libthread_db.so.1".
>>>
>>> warning: Loadable section ".note.gnu.property" outside of ELF segments
>>>
>>> warning: Loadable section ".note.gnu.property" outside of ELF segments
>>>
>>> warning: Loadable section ".note.gnu.property" outside of ELF segments
>>>
>>> warning: Loadable section ".note.gnu.property" outside of ELF segments
>>>
>>> warning: Loadable section ".note.gnu.property" outside of ELF segments
>>>
>>> warning: Loadable section ".note.gnu.property" outside of ELF segments
>>>
>>> warning: Loadable section ".note.gnu.property" outside of ELF segments
>>>
>>> warning: Loadable section ".note.gnu.property" outside of ELF segments
>>>
>>> warning: Loadable section ".note.gnu.property" outside of ELF segments
>>>
>>> warning: Loadable section ".note.gnu.property" outside of ELF segments
>>>
>>> warning: Loadable section ".note.gnu.property" outside of ELF segments
>>>
>>> warning: Loadable section ".note.gnu.property" outside of ELF segments
>>> Core was generated by `/usr/sbin/glusterfs -s localhost --volfile-id gluster/nfs -p /var/run/gluster/n'.
>>> Program terminated with signal SIGSEGV, Segmentation fault.
>>> #0 0x00007fe63bb5d7bb in FRAME_DESTROY (frame=0x7fe5ac096288)
>>> at ../../../../libglusterfs/src/glusterfs/stack.h:193
>>> 193 FRAME_DESTROY(frame);
>>> [Current thread is 1 (Thread 0x7fe617fff700 (LWP 61102))]
>>> Missing separate debuginfos, use: dnf debuginfo-install glibc-2.28-42.el8.x86_64 keyutils-libs-1.5.10-6.el8.x86_64 krb5-libs-1.16.1-22.el8.x86_64 libacl-2.2.53-1.el8.x86_64 libattr-2.4.48-3.el8.x86_64 libcom_err-1.44.3-2.el8.x86_64 libgcc-8.2.1-3.5.el8.x86_64 libselinux-2.8-6.el8.x86_64 libtirpc-1.1.4-3.el8.x86_64 libuuid-2.32.1-8.el8.x86_64 openssl-libs-1.1.1-8.el8.x86_64 pcre2-10.32-1.el8.x86_64 zlib-1.2.11-10.el8.x86_64
>>> (gdb) bt
>>> #0 0x00007fe63bb5d7bb in FRAME_DESTROY (frame=0x7fe5ac096288)
>>> at ../../../../libglusterfs/src/glusterfs/stack.h:193
>>> #1 STACK_DESTROY (stack=0x7fe5ac6d65f8)
>>> at ../../../../libglusterfs/src/glusterfs/stack.h:193
>>> #2 rda_fill_fd_cbk (frame=0x7fe5acf18eb8, cookie=<optimized out>,
>>> this=0x7fe63c0162b0, op_ret=3, op_errno=0, entries=<optimized out>,
>>> xdata=0x0) at readdir-ahead.c:623
>>> #3 0x00007fe63bd6c3aa in afr_readdir_cbk (frame=<optimized out>,
>>> cookie=<optimized out>, this=<optimized out>, op_ret=<optimized out>,
>>> op_errno=<optimized out>, subvol_entries=<optimized out>, xdata=0x0)
>>> at afr-dir-read.c:234
>>> #4 0x00007fe6400a1e07 in client4_0_readdirp_cbk (req=<optimized out>,
>>> iov=<optimized out>, count=<optimized out>, myframe=0x7fe5ace0eda8)
>>> at client-rpc-fops_v2.c:2338
>>> #5 0x00007fe6479ca115 in rpc_clnt_handle_reply (
>>> clnt=clnt@entry=0x7fe63c0663f0, pollin=pollin@entry=0x7fe60c1737a0)
>>> at rpc-clnt.c:764
>>> #6 0x00007fe6479ca4b3 in rpc_clnt_notify (trans=0x7fe63c066780,
>>> mydata=0x7fe63c066420, event=<optimized out>, data=0x7fe60c1737a0)
>>> at rpc-clnt.c:931
>>> #7 0x00007fe6479c707b in rpc_transport_notify (
>>> this=this@entry=0x7fe63c066780,
>>> event=event@entry=RPC_TRANSPORT_MSG_RECEIVED,
>>> data=data@entry=0x7fe60c1737a0) at rpc-transport.c:545
>>> #8 0x00007fe640da893c in socket_event_poll_in_async (xl=<optimized out>,
>>> async=0x7fe60c1738c8) at socket.c:2601
>>> #9 0x00007fe640db03dc in gf_async (
>>> cbk=0x7fe640da8910 <socket_event_poll_in_async>, xl=<optimized out>,
>>> async=0x7fe60c1738c8) at ../../../../libglusterfs/src/glusterfs/async.h:189
>>> #10 socket_event_poll_in (notify_handled=true, this=0x7fe63c066780)
>>> at socket.c:2642
>>> #11 socket_event_handler (fd=fd@entry=19, idx=idx@entry=10, gen=gen@entry=1,
>>> data=data@entry=0x7fe63c066780, poll_in=<optimized out>,
>>> poll_out=<optimized out>, poll_err=0, event_thread_died=0 '\000')
>>> at socket.c:3040
>>> #12 0x00007fe647c84a5b in event_dispatch_epoll_handler (event=0x7fe617ffe014,
>>> event_pool=0x563f5a98c750) at event-epoll.c:650
>>> #13 event_dispatch_epoll_worker (data=0x7fe63c063b60) at event-epoll.c:763
>>> #14 0x00007fe6467a72de in start_thread () from /lib64/libpthread.so.0
>>> #15 0x00007fe645fffa63 in clone () from /lib64/libc.so.6
>>>
>>>
>>>
>>> On Wed, Apr 15, 2020 at 10:39:34AM -0500, Erik Jacobson wrote:
>>>> After several successful runs of the test case, we thought the
>>>> problem was solved. Indeed, split-brain is gone.
>>>>
>>>> But we're triggering a seg fault now, even in a less loaded case.
>>>>
>>>> We're going to switch to gluster74, which was your intention, and report
>>>> back.
>>>>
>>>> On Wed, Apr 15, 2020 at 10:33:01AM -0500, Erik Jacobson wrote:
>>>>>> Attached the wrong patch by mistake in my previous mail. Sending the correct
>>>>>> one now.
>>>>> Early results look GREAT!!
>>>>>
>>>>> We'll keep beating on it. We applied it to gluster72 as that is what
>>>>> we have to ship with. It applied fine with some line moves.
>>>>>
>>>>> If you would like us to also run a test with gluster74 so that you can
>>>>> say that's tested, we can run that test. I can do a special build.
>>>>>
>>>>> THANK YOU!!
>>>>>
>>>>>>
>>>>>> -Ravi
>>>>>>
>>>>>>
>>>>>> On 15/04/20 2:05 pm, Ravishankar N wrote:
>>>>>>
>>>>>>
>>>>>> On 10/04/20 2:06 am, Erik Jacobson wrote:
>>>>>>
>>>>>> Once again thanks for sticking with us. Here is a reply from Scott
>>>>>> Titus. If you have something for us to try, we'd love it. The code had
>>>>>> your patch applied when gdb was run:
>>>>>>
>>>>>>
>>>>>> Here is the addr2line output for those addresses. Very interesting
>>>>>> command, of
>>>>>> which I was not aware.
>>>>>>
>>>>>> [root at leader3 ~]# addr2line -f -e/usr/lib64/glusterfs/7.2/xlator/cluster/afr.so 0x6f735
>>>>>> afr_lookup_metadata_heal_check
>>>>>> afr-common.c:2803
>>>>>> [root at leader3 ~]# addr2line -f -e/usr/lib64/glusterfs/7.2/xlator/cluster/afr.so 0x6f0b9
>>>>>> afr_lookup_done
>>>>>> afr-common.c:2455
>>>>>> [root at leader3 ~]# addr2line -f -e/usr/lib64/glusterfs/7.2/xlator/cluster/afr.so 0x5c701
>>>>>> afr_inode_event_gen_reset
>>>>>> afr-common.c:755
>>>>>>
>>>>>>
>>>>>> Right, so afr_lookup_done() is resetting the event gen to zero. This looks
>>>>>> like a race between lookup and inode refresh code paths. We made some
>>>>>> changes to the event generation logic in AFR. Can you apply the attached
>>>>>> patch and see if it fixes the split-brain issue? It should apply cleanly on
>>>>>> glusterfs-7.4.
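>>>>>>
>>>>>> Roughly, the suspected interleaving (an illustration of the race,
>>>>>> not a captured trace):
>>>>>>
>>>>>>     read txn thread                   racing lookup thread
>>>>>>     ---------------                   --------------------
>>>>>>     starts inode refresh
>>>>>>                                       afr_lookup_done()
>>>>>>                                           resets event gen to 0
>>>>>>     afr_txn_refresh_done()
>>>>>>         sees event_generation == 0
>>>>>>         fails the read fop with EIO
>>>>>>         -> reported as split-brain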
>>>>>>
>>>>>> Thanks,
>>>>>> Ravi
>>>>>>
>>>>>>
>>>>>> ________
>>>>>>
>>>>>>
>>>>>>
>>>>>> Community Meeting Calendar:
>>>>>>
>>>>>> Schedule -
>>>>>> Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
>>>>>> Bridge: https://bluejeans.com/441850968
>>>>>>
>>>>>> Gluster-users mailing list
>>>>>> Gluster-users at gluster.org
>>>>>> https://lists.gluster.org/mailman/listinfo/gluster-users
>>>>>>
>>>>>> From 11601e709a97ce7c40078866bf5d24b486f39454 Mon Sep 17 00:00:00 2001
>>>>>> From: Ravishankar N <ravishankar at redhat.com>
>>>>>> Date: Wed, 15 Apr 2020 13:53:26 +0530
>>>>>> Subject: [PATCH] afr: event gen changes
>>>>>>
>>>>>> The general idea of the changes is to prevent resetting event generation
>>>>>> to zero in the inode ctx, since event gen is something that should
>>>>>> follow 'causal order'.
>>>>>>
>>>>>> Change #1:
>>>>>> For a read txn, in inode refresh cbk, if event_generation is
>>>>>> found zero, we are failing the read fop. This is not needed
>>>>>> because a change in event gen is only a marker for the next inode
>>>>>> refresh to happen and should not be taken into account by the
>>>>>> current read txn.
>>>>>>
>>>>>> Change #2:
>>>>>> The event gen being zero above can happen if there is a racing
>>>>>> lookup, which resets event gen (in afr_lookup_done) if there are
>>>>>> non-zero afr xattrs. The resetting is done only to trigger an
>>>>>> inode refresh and a possible client-side heal on the next lookup.
>>>>>> That can be achieved by setting the need_refresh flag in the
>>>>>> inode ctx. So replaced all occurrences of resetting event gen to
>>>>>> zero with a call to afr_inode_need_refresh_set().
>>>>>>
>>>>>> Change #3:
>>>>>> In both the lookup and discover paths, we are doing an inode
>>>>>> refresh which is not required since all 3 essentially do the same
>>>>>> thing: update the inode ctx with the good/bad copies from the
>>>>>> brick replies. Inode refresh also triggers background heals, but
>>>>>> I think it is okay to do it when we call refresh during the read
>>>>>> and write txns and not in the lookup path.
>>>>>>
>>>>>> Change-Id: Id0600dd34b144b4ae7a3bf3c397551adf7e402f1
>>>>>> Signed-off-by: Ravishankar N <ravishankar at redhat.com>
>>>>>> ---
>>>>>> ...ismatch-resolution-with-fav-child-policy.t | 8 +-
>>>>>> xlators/cluster/afr/src/afr-common.c | 92 ++++---------------
>>>>>> xlators/cluster/afr/src/afr-dir-write.c | 6 +-
>>>>>> xlators/cluster/afr/src/afr.h | 5 +-
>>>>>> 4 files changed, 29 insertions(+), 82 deletions(-)
>>>>>>
>>>>>> diff --git a/tests/basic/afr/gfid-mismatch-resolution-with-fav-child-policy.t b/tests/basic/afr/gfid-mismatch-resolution-with-fav-child-policy.t
>>>>>> index f4aa351e4..12af0c854 100644
>>>>>> --- a/tests/basic/afr/gfid-mismatch-resolution-with-fav-child-policy.t
>>>>>> +++ b/tests/basic/afr/gfid-mismatch-resolution-with-fav-child-policy.t
>>>>>> @@ -168,8 +168,8 @@ TEST [ "$gfid_1" != "$gfid_2" ]
>>>>>> #We know that second brick has the bigger size file
>>>>>> BIGGER_FILE_MD5=$(md5sum $B0/${V0}1/f3 | cut -d\ -f1)
>>>>>>
>>>>>> -TEST ls $M0/f3
>>>>>> -TEST cat $M0/f3
>>>>>> +TEST ls $M0 #Trigger entry heal via readdir inode refresh
>>>>>> +TEST cat $M0/f3 #Trigger data heal via readv inode refresh
>>>>>> EXPECT_WITHIN $HEAL_TIMEOUT "^0$" get_pending_heal_count $V0
>>>>>>
>>>>>> #gfid split-brain should be resolved
>>>>>> @@ -215,8 +215,8 @@ TEST $CLI volume start $V0 force
>>>>>> EXPECT_WITHIN $PROCESS_UP_TIMEOUT "1" brick_up_status $V0 $H0 $B0/${V0}2
>>>>>> EXPECT_WITHIN $CHILD_UP_TIMEOUT "1" afr_child_up_status $V0 2
>>>>>>
>>>>>> -TEST ls $M0/f4
>>>>>> -TEST cat $M0/f4
>>>>>> +TEST ls $M0 #Trigger entry heal via readdir inode refresh
>>>>>> +TEST cat $M0/f4 #Trigger data heal via readv inode refresh
>>>>>> EXPECT_WITHIN $HEAL_TIMEOUT "^0$" get_pending_heal_count $V0
>>>>>>
>>>>>> #gfid split-brain should be resolved
>>>>>> diff --git a/xlators/cluster/afr/src/afr-common.c b/xlators/cluster/afr/src/afr-common.c
>>>>>> index 61f21795e..319665a14 100644
>>>>>> --- a/xlators/cluster/afr/src/afr-common.c
>>>>>> +++ b/xlators/cluster/afr/src/afr-common.c
>>>>>> @@ -282,7 +282,7 @@ __afr_set_in_flight_sb_status(xlator_t *this, afr_local_t *local,
>>>>>> metadatamap |= (1 << index);
>>>>>> }
>>>>>> if (metadatamap_old != metadatamap) {
>>>>>> - event = 0;
>>>>>> + __afr_inode_need_refresh_set(inode, this);
>>>>>> }
>>>>>> break;
>>>>>>
>>>>>> @@ -295,7 +295,7 @@ __afr_set_in_flight_sb_status(xlator_t *this, afr_local_t *local,
>>>>>> datamap |= (1 << index);
>>>>>> }
>>>>>> if (datamap_old != datamap)
>>>>>> - event = 0;
>>>>>> + __afr_inode_need_refresh_set(inode, this);
>>>>>> break;
>>>>>>
>>>>>> default:
>>>>>> @@ -458,34 +458,6 @@ out:
>>>>>> return ret;
>>>>>> }
>>>>>>
>>>>>> -int
>>>>>> -__afr_inode_event_gen_reset_small(inode_t *inode, xlator_t *this)
>>>>>> -{
>>>>>> - int ret = -1;
>>>>>> - uint16_t datamap = 0;
>>>>>> - uint16_t metadatamap = 0;
>>>>>> - uint32_t event = 0;
>>>>>> - uint64_t val = 0;
>>>>>> - afr_inode_ctx_t *ctx = NULL;
>>>>>> -
>>>>>> - ret = __afr_inode_ctx_get(this, inode, &ctx);
>>>>>> - if (ret)
>>>>>> - return ret;
>>>>>> -
>>>>>> - val = ctx->read_subvol;
>>>>>> -
>>>>>> - metadatamap = (val & 0x000000000000ffff) >> 0;
>>>>>> - datamap = (val & 0x00000000ffff0000) >> 16;
>>>>>> - event = 0;
>>>>>> -
>>>>>> - val = ((uint64_t)metadatamap) | (((uint64_t)datamap) << 16) |
>>>>>> - (((uint64_t)event) << 32);
>>>>>> -
>>>>>> - ctx->read_subvol = val;
>>>>>> -
>>>>>> - return ret;
>>>>>> -}
>>>>>> -
>>>>>> int
>>>>>> __afr_inode_read_subvol_get(inode_t *inode, xlator_t *this, unsigned char *data,
>>>>>> unsigned char *metadata, int *event_p)
>>>>>> @@ -556,22 +528,6 @@ out:
>>>>>> return ret;
>>>>>> }
>>>>>>
>>>>>> -int
>>>>>> -__afr_inode_event_gen_reset(inode_t *inode, xlator_t *this)
>>>>>> -{
>>>>>> - afr_private_t *priv = NULL;
>>>>>> - int ret = -1;
>>>>>> -
>>>>>> - priv = this->private;
>>>>>> -
>>>>>> - if (priv->child_count <= 16)
>>>>>> - ret = __afr_inode_event_gen_reset_small(inode, this);
>>>>>> - else
>>>>>> - ret = -1;
>>>>>> -
>>>>>> - return ret;
>>>>>> -}
>>>>>> -
>>>>>> int
>>>>>> afr_inode_read_subvol_get(inode_t *inode, xlator_t *this, unsigned char *data,
>>>>>> unsigned char *metadata, int *event_p)
>>>>>> @@ -721,30 +677,22 @@ out:
>>>>>> return need_refresh;
>>>>>> }
>>>>>>
>>>>>> -static int
>>>>>> -afr_inode_need_refresh_set(inode_t *inode, xlator_t *this)
>>>>>> +int
>>>>>> +__afr_inode_need_refresh_set(inode_t *inode, xlator_t *this)
>>>>>> {
>>>>>> int ret = -1;
>>>>>> afr_inode_ctx_t *ctx = NULL;
>>>>>>
>>>>>> - GF_VALIDATE_OR_GOTO(this->name, inode, out);
>>>>>> -
>>>>>> - LOCK(&inode->lock);
>>>>>> - {
>>>>>> - ret = __afr_inode_ctx_get(this, inode, &ctx);
>>>>>> - if (ret)
>>>>>> - goto unlock;
>>>>>> -
>>>>>> + ret = __afr_inode_ctx_get(this, inode, &ctx);
>>>>>> + if (ret == 0) {
>>>>>> ctx->need_refresh = _gf_true;
>>>>>> }
>>>>>> -unlock:
>>>>>> - UNLOCK(&inode->lock);
>>>>>> -out:
>>>>>> +
>>>>>> return ret;
>>>>>> }
>>>>>>
>>>>>> int
>>>>>> -afr_inode_event_gen_reset(inode_t *inode, xlator_t *this)
>>>>>> +afr_inode_need_refresh_set(inode_t *inode, xlator_t *this)
>>>>>> {
>>>>>> int ret = -1;
>>>>>>
>>>>>> @@ -754,7 +702,7 @@ afr_inode_event_gen_reset(inode_t *inode, xlator_t *this)
>>>>>> "Resetting event gen for %s", uuid_utoa(inode->gfid));
>>>>>> LOCK(&inode->lock);
>>>>>> {
>>>>>> - ret = __afr_inode_event_gen_reset(inode, this);
>>>>>> + ret = __afr_inode_need_refresh_set(inode, this);
>>>>>> }
>>>>>> UNLOCK(&inode->lock);
>>>>>> out:
>>>>>> @@ -1187,7 +1135,7 @@ afr_txn_refresh_done(call_frame_t *frame, xlator_t *this, int err)
>>>>>> ret = afr_inode_get_readable(frame, inode, this, local->readable,
>>>>>> &event_generation, local->transaction.type);
>>>>>>
>>>>>> - if (ret == -EIO || (local->is_read_txn && !event_generation)) {
>>>>>> + if (ret == -EIO) {
>>>>>> /* No readable subvolume even after refresh ==> splitbrain.*/
>>>>>> if (!priv->fav_child_policy) {
>>>>>> err = EIO;
>>>>>> @@ -2451,7 +2399,7 @@ afr_lookup_done(call_frame_t *frame, xlator_t *this)
>>>>>> if (read_subvol == -1)
>>>>>> goto cant_interpret;
>>>>>> if (ret) {
>>>>>> - afr_inode_event_gen_reset(local->inode, this);
>>>>>> + afr_inode_need_refresh_set(local->inode, this);
>>>>>> dict_del_sizen(local->replies[read_subvol].xdata, GF_CONTENT_KEY);
>>>>>> }
>>>>>> } else {
>>>>>> @@ -3007,6 +2955,7 @@ afr_discover_unwind(call_frame_t *frame, xlator_t *this)
>>>>>> afr_private_t *priv = NULL;
>>>>>> afr_local_t *local = NULL;
>>>>>> int read_subvol = -1;
>>>>>> + int ret = 0;
>>>>>> unsigned char *data_readable = NULL;
>>>>>> unsigned char *success_replies = NULL;
>>>>>>
>>>>>> @@ -3028,7 +2977,10 @@ afr_discover_unwind(call_frame_t *frame, xlator_t *this)
>>>>>> if (!afr_has_quorum(success_replies, this, frame))
>>>>>> goto unwind;
>>>>>>
>>>>>> - afr_replies_interpret(frame, this, local->inode, NULL);
>>>>>> + ret = afr_replies_interpret(frame, this, local->inode, NULL);
>>>>>> + if (ret) {
>>>>>> + afr_inode_need_refresh_set(local->inode, this);
>>>>>> + }
>>>>>>
>>>>>> read_subvol = afr_read_subvol_decide(local->inode, this, NULL,
>>>>>> data_readable);
>>>>>> @@ -3284,11 +3236,7 @@ afr_discover(call_frame_t *frame, xlator_t *this, loc_t *loc, dict_t *xattr_req)
>>>>>> afr_read_subvol_get(loc->inode, this, NULL, NULL, &event,
>>>>>> AFR_DATA_TRANSACTION, NULL);
>>>>>>
>>>>>> - if (afr_is_inode_refresh_reqd(loc->inode, this, event,
>>>>>> - local->event_generation))
>>>>>> - afr_inode_refresh(frame, this, loc->inode, NULL, afr_discover_do);
>>>>>> - else
>>>>>> - afr_discover_do(frame, this, 0);
>>>>>> + afr_discover_do(frame, this, 0);
>>>>>>
>>>>>> return 0;
>>>>>> out:
>>>>>> @@ -3429,11 +3377,7 @@ afr_lookup(call_frame_t *frame, xlator_t *this, loc_t *loc, dict_t *xattr_req)
>>>>>> afr_read_subvol_get(loc->parent, this, NULL, NULL, &event,
>>>>>> AFR_DATA_TRANSACTION, NULL);
>>>>>>
>>>>>> - if (afr_is_inode_refresh_reqd(loc->inode, this, event,
>>>>>> - local->event_generation))
>>>>>> - afr_inode_refresh(frame, this, loc->parent, NULL, afr_lookup_do);
>>>>>> - else
>>>>>> - afr_lookup_do(frame, this, 0);
>>>>>> + afr_lookup_do(frame, this, 0);
>>>>>>
>>>>>> return 0;
>>>>>> out:
>>>>>> diff --git a/xlators/cluster/afr/src/afr-dir-write.c b/xlators/cluster/afr/src/afr-dir-write.c
>>>>>> index 82a72fddd..333085b14 100644
>>>>>> --- a/xlators/cluster/afr/src/afr-dir-write.c
>>>>>> +++ b/xlators/cluster/afr/src/afr-dir-write.c
>>>>>> @@ -119,11 +119,11 @@ __afr_dir_write_finalize(call_frame_t *frame, xlator_t *this)
>>>>>> continue;
>>>>>> if (local->replies[i].op_ret < 0) {
>>>>>> if (local->inode)
>>>>>> - afr_inode_event_gen_reset(local->inode, this);
>>>>>> + afr_inode_need_refresh_set(local->inode, this);
>>>>>> if (local->parent)
>>>>>> - afr_inode_event_gen_reset(local->parent, this);
>>>>>> + afr_inode_need_refresh_set(local->parent, this);
>>>>>> if (local->parent2)
>>>>>> - afr_inode_event_gen_reset(local->parent2, this);
>>>>>> + afr_inode_need_refresh_set(local->parent2, this);
>>>>>> continue;
>>>>>> }
>>>>>>
>>>>>> diff --git a/xlators/cluster/afr/src/afr.h b/xlators/cluster/afr/src/afr.h
>>>>>> index a3f2942b3..ed6d777c1 100644
>>>>>> --- a/xlators/cluster/afr/src/afr.h
>>>>>> +++ b/xlators/cluster/afr/src/afr.h
>>>>>> @@ -958,7 +958,10 @@ afr_inode_read_subvol_set(inode_t *inode, xlator_t *this,
>>>>>> int event_generation);
>>>>>>
>>>>>> int
>>>>>> -afr_inode_event_gen_reset(inode_t *inode, xlator_t *this);
>>>>>> +__afr_inode_need_refresh_set(inode_t *inode, xlator_t *this);
>>>>>> +
>>>>>> +int
>>>>>> +afr_inode_need_refresh_set(inode_t *inode, xlator_t *this);
>>>>>>
>>>>>> int
>>>>>> afr_read_subvol_select_by_policy(inode_t *inode, xlator_t *this,
>>>>>> --
>>>>>> 2.25.2
>>>>>>
>>>>>