[Gluster-devel] about afr

Wed Feb 4 08:46:34 UTC 2009

Nicolas,

I do not completely understand the scenario.
But consider this case, where 2 servers are replicated with AFR on a client:
* When both servers are up, you open a file (and don't close it).
* Bring one of the servers down. The open fd now loses its context on
the downed server, i.e. no further operations go there, even if the
server comes back up.
* Now if the second server goes down, further operations on the fd
fail completely and the application gets an error.

This scenario is expected.
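
To make the sequence above concrete, here is a minimal toy model in Python. It is only a sketch of the behaviour described in this mail, not GlusterFS code: the names Replica, ReplicatedFd, stop and start are hypothetical, and the errno raised at the end (EIO) is an assumption, since the mail only says the application gets an error.

```python
import errno


class Replica:
    """Toy stand-in for one server behind AFR (hypothetical name)."""

    def __init__(self, name):
        self.name = name
        self.up = True
        self.open_fds = set()  # ids of open fds that have a context here

    def stop(self):
        # Going down drops the context of every fd opened on this server.
        self.up = False
        self.open_fds.clear()

    def start(self):
        # The server comes back, but without the old fd contexts.
        self.up = True


class ReplicatedFd:
    """Toy model of an fd opened through a two-way replicated volume."""

    def __init__(self, replicas):
        self.replicas = replicas
        for r in replicas:
            if r.up:
                r.open_fds.add(id(self))

    def write(self, data):
        # Only servers that are up AND still hold this fd's context are used.
        targets = [r.name for r in self.replicas
                   if r.up and id(self) in r.open_fds]
        if not targets:
            # Assumption: the real errno is not stated in the mail.
            raise OSError(errno.EIO, "fd has no usable server left")
        return targets


s1, s2 = Replica("server1"), Replica("server2")
fd = ReplicatedFd([s1, s2])   # opened while both servers are up
s1.stop()
s1.start()                    # server1 is back, but the fd context is gone
print(fd.write(b"data"))      # only server2 is used
s2.stop()
try:
    fd.write(b"data")
except OSError as e:
    print("application gets an error:", e.strerror)
```

In this model, restarting the second server afterwards does not help the already-open fd either; only closing and reopening the file would create fresh contexts.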

But under no scenario can glusterfs hang, which is what you are
saying happens sometimes. You also say that glusterfs hangs
without any performance translators. When it hangs, can you attach gdb
to the glusterfs client and give us the backtrace?

    gdb -p <pid of glusterfs>
    (then type "bt" at the gdb command prompt)
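
If typing into an interactive gdb session is awkward while the client is hung, the same backtrace could be captured non-interactively. The following is a minimal sketch, assuming gdb is installed and you have permission to attach to the process; the helper names gdb_backtrace_command and capture_backtrace are hypothetical. It uses gdb's -batch and -ex options with "thread apply all bt", which prints the backtrace of every thread rather than only the current one.

```python
import subprocess


def gdb_backtrace_command(pid):
    """Build a gdb command line that attaches, dumps all threads, and exits."""
    return ["gdb", "-p", str(pid), "-batch", "-ex", "thread apply all bt"]


def capture_backtrace(pid, timeout=60):
    """Run gdb against the given pid and return whatever it printed."""
    result = subprocess.run(
        gdb_backtrace_command(pid),
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.stdout + result.stderr
```

For example, capture_backtrace(pid) with the pid of the glusterfs client returns text that can be pasted into a reply to the list.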

You say that glusterfs hangs only when using qemu on the glusterfs setup.
Does any other application hang?

Can you explain the steps to reproduce the glusterfs hang once again?
Please state clearly in the steps whether the server process is stopped
or the server machine is hard powered off.

Thanks
Krishna

On Tue, Feb 3, 2009 at 8:45 PM, nicolas prochazka
<prochazka.nicolas at gmail.com> wrote:
> Without performance translators, the result is the same.
> I will try with gdb as soon as possible.
> You say EBADFD is fine, AFR will try the operation on the other server. OK,
> I understand, but if I then stop this server, gluster cannot go back to
> the first one, which gives EBADFD.
> A lot of my problems come from here, I think, because with my two servers,
> I stop the first, then restart it, wait, stop the second, restart it, and all is
> KO.
> If I just stop the first and test, then all is OK.
> Nicolas
>
> On Tue, Feb 3, 2009 at 3:50 PM, Krishna Srinivas <krishna at zresearch.com>
> wrote:
>>
>> Nicolas,
>>
>> When you restart the server, logs indicating EBADFD are fine; AFR will
>> try the operation on the other server. When you have the situation
>> where the glusterfs client hangs, can you attach gdb to the glusterfs
>> process and mail us the backtrace?
>>
>> gdb -p <pid of glusterfs>
>> type "bt" at the gdb command prompt.
>>
>> Just want to confirm that glusterfs has not blocked at a system call.
>> (as we have non blocking io now)
>>
>> Can you see if removing the performance translators helps? we can
>> narrow down to the problem translator in such case.
>>
>> Krishna
>>
>> On Tue, Feb 3, 2009 at 5:18 PM, nicolas prochazka
>> <prochazka.nicolas at gmail.com> wrote:
>> > ok,
>> > So now I know there are a few bugs:
>> >
>> > 1 - When I stop and restart a server, I get the EBADFD bug
>> > 2 - When I stop a server:
>> >       - with --disable-direct-io-mode: my big image file becomes
>> >         corrupt (missing data ...)
>> >       - without --disable-direct-io-mode: my process hangs and CPU
>> >         load grows a lot (by process)
>> >
>> > any ideas ?
>> >
>> > Regards,
>> > Nicolas Prochazka
>> >
>> >  On Tue, Feb 3, 2009 at 5:42 AM, Raghavendra G
>> > <raghavendra at zresearch.com>
>> > wrote:
>> >>
>> >> Hi Nicolas,
>> >>
>> >> On Tue, Feb 3, 2009 at 12:01 AM, nicolas prochazka
>> >> <prochazka.nicolas at gmail.com> wrote:
>> >>>
>> >>> I inspected the log and found something interesting:
>> >>> All is ok,
>> >>> I stopped 10.98.98.2 and restarted it:
>> >>>
>> >>> 2009-02-02 15:00:32 D [client-protocol.c:6498:notify]
>> >>> brick_10.98.98.2:
>> >>> got GF_EVENT_CHILD_UP
>> >>> 2009-02-02 15:00:32 D [socket.c:924:socket_connect] brick_10.98.98.2:
>> >>> connect () called on transport already connected
>> >>> 2009-02-02 15:00:32 N [client-protocol.c:5786:client_setvolume_cbk]
>> >>> brick_10.98.98.2: connection and handshake succeeded
>> >>> 2009-02-02 15:00:40 D [fuse-bridge.c:1945:fuse_statfs] glusterfs-fuse:
>> >>> 17399: STATFS
>> >>> 2009-02-02 15:00:40 D [fuse-bridge.c:368:fuse_entry_cbk]
>> >>> glusterfs-fuse:
>
>