[Gluster-devel] about afr

nicolas prochazka prochazka.nicolas at gmail.com
Tue Feb 3 11:48:56 UTC 2009


Ok,
so now I know there are a few bugs:

1 - when I stop and restart a server, I hit the EBADFD bug
2 - when I stop a server:
      - with --disable-direct-io-mode: my big image file becomes corrupt
(missing data ...)
      - without --disable-direct-io-mode: my process hangs and the CPU load
grows a lot (per process)

Any ideas?

Regards,
Nicolas Prochazka

On Tue, Feb 3, 2009 at 5:42 AM, Raghavendra G <raghavendra at zresearch.com> wrote:

> Hi Nicolas,
>
> On Tue, Feb 3, 2009 at 12:01 AM, nicolas prochazka <
> prochazka.nicolas at gmail.com> wrote:
>
>> I inspected the log and found something interesting:
>> everything was ok,
>> then I stopped 10.98.98.2 and restarted it:
>>
>> 2009-02-02 15:00:32 D [client-protocol.c:6498:notify] brick_10.98.98.2:
>> got GF_EVENT_CHILD_UP
>> 2009-02-02 15:00:32 D [socket.c:924:socket_connect] brick_10.98.98.2:
>> connect () called on transport already connected
>> 2009-02-02 15:00:32 N [client-protocol.c:5786:client_setvolume_cbk]
>> brick_10.98.98.2: connection and handshake succeeded
>> 2009-02-02 15:00:40 D [fuse-bridge.c:1945:fuse_statfs] glusterfs-fuse:
>> 17399: STATFS
>> 2009-02-02 15:00:40 D [fuse-bridge.c:368:fuse_entry_cbk] glusterfs-fuse:
>> 17400: LOOKUP() / => 1 (1)
>> 2009-02-02 15:00:42 D [client-protocol.c:5854:client_protocol_reconnect]
>> brick_10.98.98.2: breaking reconnect chain
>>
>> Everything seems to be ok, but now I get this log
>> (many times):
>>
>> 2009-02-02 15:07:05 D [client-protocol.c:2799:client_fstat]
>> brick_10.98.98.2: (2148533016): failed to get remote fd. returning EBADFD
>>
>> then I stop 10.98.98.1 (I thought that 10.98.98.2 was ok, but the EBADFD
>> errors suggest it is not!)
>>
>
> This is a known issue in afr for files which remain open across the time
> frame when a server goes down and comes back. Ideally afr should have
> issued a reopen for those files once the server comes back, but currently
> it does not do so.
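>
> Roughly, the fix needs per-fd bookkeeping on the client side: remember
> every fd the application still holds, and on GF_EVENT_CHILD_UP re-open
> each of them on the child that just came back. The sketch below only
> illustrates that idea; the names and types are made up and it is not the
> actual afr/client-protocol code:
>
> /* Hypothetical sketch, not the real afr code: shows the bookkeeping
>  * needed so that a server coming back up re-opens the fds that were
>  * lost while it was down, instead of leaving them returning EBADFD. */
> #include <stdio.h>
>
> #define N_CHILDREN 2
> #define N_FDS      4
>
> struct tracked_fd {
>     const char *path;              /* path the application has open    */
>     int remote_fd[N_CHILDREN];     /* per-server fd, -1 = not open     */
> };
>
> static struct tracked_fd fds[N_FDS] = {
>     { "/mnt/glusterfs/image.img", { 3, 7 } },  /* e.g. the qemu image  */
> };
>
> static void child_down(int child)
> {
>     /* Server went away: its remote fds are gone. */
>     for (int i = 0; i < N_FDS; i++)
>         if (fds[i].path)
>             fds[i].remote_fd[child] = -1;
> }
>
> static void child_up(int child)
> {
>     /* The step the current code is missing: re-open every fd the
>      * application still holds on the server that just came back, so a
>      * later fstat()/readv() on that child does not fail with EBADFD.
>      * In the real translator this would be a protocol OPEN request;
>      * here we just fake a new descriptor number. */
>     for (int i = 0; i < N_FDS; i++)
>         if (fds[i].path && fds[i].remote_fd[child] == -1) {
>             fds[i].remote_fd[child] = 100 + i;
>             printf("reopened %s on child %d\n", fds[i].path, child);
>         }
> }
>
> int main(void)
> {
>     child_down(1);   /* 10.98.98.2 stops                      */
>     child_up(1);     /* 10.98.98.2 restarts: reopen its fds   */
>     return 0;
> }
>
> Without that reopen step the remote fd stays stale, which is what produces
> the "failed to get remote fd. returning EBADFD" lines in your log.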
>
>>
>> 2009-02-02 15:10:30 D [page.c:644:ioc_frame_return] io-cache: locked
>> local(0x6309d0)
>> 2009-02-02 15:10:30 D [client-protocol.c:2799:client_fstat]
>> brick_10.98.98.2: (2148533016): failed to get remote fd. returning EBADFD
>> 2009-02-02 15:10:30 D [page.c:646:ioc_frame_return] io-cache: unlocked
>> local(0x6309d0)
>> 2009-02-02 15:10:30 D [io-cache.c:798:ioc_need_prune] io-cache: locked
>> table(0x614320)
>> 2009-02-02 15:10:30 D [io-cache.c:802:ioc_need_prune] io-cache: unlocked
>> table(0x614320)
>> 2009-02-02 15:10:30 D [client-protocol.c:2799:client_fstat]
>> brick_10.98.98.1: (2148533016): failed to get remote fd. returning EBADFD
>> 2009-02-02 15:10:30 D [io-cache.c:425:ioc_cache_validate_cbk] io-cache:
>> cache for inode(0x7fdce0002780) is invalid. flushing all pages
>>
>>
>> Now my client has problems with both servers (fd).
>>
>> So perhaps there is a problem: why does the client return EBADFD when
>> 10.98.98.2 is online?
>>
>> Regards,
>> Nicolas
>>
>>
>>
>> On Mon, Feb 2, 2009 at 3:30 PM, nicolas prochazka <
>> prochazka.nicolas at gmail.com> wrote:
>>
>>> Hi again,
>>> one last test and last log from me before I stop:
>>> I made a change: I added "option read-subvolume brick_10.98.98.2" in the
>>> client conf of 10.98.98.48 and "option read-subvolume brick_10.98.98.1"
>>> in the client conf of 10.98.98.44, as sketched below.
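>>>
>>> For reference, the afr part of the client volfile on 10.98.98.48 now
>>> looks roughly like this (the volume name and surrounding layout are
>>> illustrative; only the read-subvolume option is the actual change):
>>>
>>>   volume afr
>>>     type cluster/afr
>>>     option read-subvolume brick_10.98.98.2
>>>     subvolumes brick_10.98.98.1 brick_10.98.98.2
>>>   end-volume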
>>>
>>> run 10.98.98.1 and 10.98.98.2 as servers
>>> run 10.98.98.44 and 10.98.98.48 as clients
>>>
>>> 1 - stop 10.98.98.2
>>>     10.98.98.48 keeps running and switches its reads to 10.98.98.1
>>>     10.98.98.44 keeps running, reading from 10.98.98.1
>>>
>>> 2 - restart 10.98.98.2, wait 5 minutes
>>>
>>> 3 - stop 10.98.98.1
>>>     the processes on 10.98.98.44 / 48 hang
>>>
>>> I think the clients cannot go back to reading from 10.98.98.2; is that
>>> normal? 10.98.98.2 became ready again after the crash.
>>>
>>>
>>> Regards,
>>> Nico
>>>
>>>
>>>
>>> On Mon, Feb 2, 2009 at 2:25 PM, nicolas prochazka <
>>> prochazka.nicolas at gmail.com> wrote:
>>>
>>>> Hello,
>>>> I am still trying to debug my strange blocking problem.
>>>> I ran the client with logging, but the log is huge (100 MB) so I cannot
>>>> send it to you; here is the relevant info:
>>>>
>>>> servers: 10.98.98.1 and 10.98.98.2
>>>> clients: 10.98.98.44 and 10.98.98.48
>>>>
>>>> Test (all tests are performed with a big file, > 10 GB): sometimes the
>>>> test hangs the process, sometimes the big file becomes corrupted (some
>>>> data seems to be missing).
>>>>
>>>> start the whole system: ok
>>>> stop 10.98.98.2: client seems ok
>>>> restart 10.98.98.2: sometimes it blocks
>>>> stop 10.98.98.1: client 10.98.98.44 blocks; the last log is:
>>>>
>>>> 2009-02-02 13:53:59 D [io-cache.c:798:ioc_need_prune] io-cache: locked
>>>> table(0x614320)
>>>> 2009-02-02 13:53:59 D [io-cache.c:802:ioc_need_prune] io-cache: unlocked
>>>> table(0x614320)
>>>> 2009-02-02 13:53:59 D [client-protocol.c:1701:client_readv]
>>>> brick_10.98.98.2: (2148533016): failed to get remote fd, returning EBADFD
>>>>
>>>> and if I restart 10.98.98.1, the client works again (ls works) and logs:
>>>>
>>>> 2009-02-02 14:03:18 D [fuse-bridge.c:1945:fuse_statfs] glusterfs-fuse:
>>>> 40423: STATFS
>>>> 2009-02-02 14:03:18 D [fuse-bridge.c:1945:fuse_statfs] glusterfs-fuse:
>>>> 40424: STATFS
>>>> 2009-02-02 14:03:33 D [fuse-bridge.c:1945:fuse_statfs] glusterfs-fuse:
>>>> 40425: STATFS
>>>>
>>>> Client 10.98.98.48 does not block.
>>>>
>>>>
>>>>
>>>>
>>>> On Fri, Jan 30, 2009 at 10:14 AM, nicolas prochazka <
>>>> prochazka.nicolas at gmail.com> wrote:
>>>>
>>>>> Hello,
>>>>> first, thanks a lot for all your work.
>>>>> Second,
>>>>> your tests are also ok for me, but when I replace echo or tail with
>>>>> opening a file from certain types of programs,
>>>>> such as qemu, there are a lot of problems. The process hangs; I also
>>>>> tried with --disable-direct-io-mode, and then the process does not hang
>>>>> but the file seems to be corrupted.
>>>>> It's a very strange problem.
>>>>>
>>>>> Regards,
>>>>> Nicolas Prochazka.
>>>>>
>>>>> 2009/1/30 Raghavendra G <raghavendra at zresearch.com>
>>>>>
>>>>> nicolas,
>>>>>>
>>>>>> I have two servers, n1 and n2, which are afr'd from the client side. I
>>>>>> am using the same configuration you finalized on, the one with which
>>>>>> you are facing the problem. n1 is the first child of afr.
>>>>>>
>>>>>> on n1:
>>>>>> ifconfig eth0 down (eth0 is the interface I am using for communicating
>>>>>> with server on n1)
>>>>>>
>>>>>> on glusterfs mount:
>>>>>> 1. ls (hangs for transport-timeout seconds but completes successfully
>>>>>> after timeout)
>>>>>> 2. I also had a file opened with tail -f /mnt/glusterfs/file before
>>>>>> bringing down eth0 on n1.
>>>>>> 3. echo "content" >> /mnt/glusterfs/file, appends to file and I was
>>>>>> able to observe the content through tail -f.
>>>>>>
>>>>>> on n1:
>>>>>> bring up eth0
>>>>>>
>>>>>> on glusterfs mount:
>>>>>> 1. ls (completes successfully without any problem).
>>>>>> 2. echo "content-2" >> /mnt/glusterfs/file (also appends content-2 to
>>>>>> file and shown in the output of tail -f)
>>>>>>
>>>>>> From the above tests, it seems the bug is not reproducible in our
>>>>>> setup. Is this similar to the procedure you followed to reproduce the
>>>>>> bug? I am using glusterfs--mainline--3.0--patch-883.
>>>>>>
>>>>>> regards,
>>>>>>
>>>>>>
>>>>>> On Fri, Jan 30, 2009 at 12:05 AM, Anand Avati <avati at zresearch.com> wrote:
>>>>>>
>>>>>>> Raghu/ Krishna,
>>>>>>>  can you guys look into this? It seems like a serious flaw.
>>>>>>>
>>>>>>> avati
>>>>>>>
>>>>>>> On Thu, Jan 29, 2009 at 7:13 PM, nicolas prochazka
>>>>>>> <prochazka.nicolas at gmail.com> wrote:
>>>>>>> > hello again,
>>>>>>> > to be more precise:
>>>>>>> > now I can do 'ls /glustermountpoint' after the timeout in all
>>>>>>> > cases, that's good,
>>>>>>> > but for files which were opened before the crash of the first
>>>>>>> > server, it does not work; the process seems to be blocked.
>>>>>>> >
>>>>>>> > Regards,
>>>>>>> > Nicolas.
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Raghavendra G
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
>
> --
> Raghavendra G
>
>