[Gluster-devel] Problem with clients that goes down..

Mon Apr 21 13:01:27 UTC 2008

Your description says that you are powering the client down. I will
try to reproduce this bug and get back to you.

Krishna
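
For anyone else trying to reproduce this, here is a minimal sketch of how the
"how long does ls stay blocked" measurement could be taken on the second
client. It is not part of the original report: the helper name time_command,
the 300-second limit and the mount point /mnt/glusterfs are all assumptions
made for illustration. If the command is stuck inside the kernel, the helper
itself may wait longer than the limit.

```python
import subprocess
import sys
import time


def time_command(argv, limit_s=300.0):
    """Run argv and return (elapsed_seconds, outcome).

    outcome is "ok", "exit N" for a non-zero exit status, or "gave up"
    if the command was still running after limit_s seconds.
    """
    start = time.monotonic()
    try:
        proc = subprocess.run(
            argv,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=limit_s,
        )
        outcome = "ok" if proc.returncode == 0 else f"exit {proc.returncode}"
    except subprocess.TimeoutExpired:
        outcome = "gave up"
    return time.monotonic() - start, outcome


if __name__ == "__main__":
    # Hypothetical default: an "ls" on an assumed GlusterFS mount point.
    argv = sys.argv[1:] or ["ls", "/mnt/glusterfs"]
    elapsed, outcome = time_command(argv)
    print(f"{' '.join(argv)}: {outcome} after {elapsed:.1f} s")
```

Running it once with and once without a timeout configured on the client
would give numbers comparable to the ones quoted below.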

On Mon, Apr 21, 2008 at 6:22 PM, Krishna Srinivas <krishna at zresearch.com> wrote:
> One question: are you sure you are not stopping the server that hosts
>  the namespace?
>
>  On Mon, Apr 21, 2008 at 6:00 PM, Antonio González
>  <antonio.gonzalez at libera.net> wrote:
>  > Thanks Krishna, don't worry about the late reply; I think maintaining
>  >  this list is hard work!
>  >
>  >  Well, the main problem is the first one you noted. I have run some tests
>  >  on GlusterFS to check its viability when a client goes down, and I can see
>  >  that sometimes, if a client hangs during an operation (read/write), the
>  >  other clients don't work correctly.
>  >
>  >  I reproduced this issue in several scenarios, and I always see the problem.
>  >  My last test illustrates it. I have 4 machines: two servers and
>  >  two clients.
>  >
>  >  One server exports one brick for storage (posix storage); the other server
>  >  exports a brick for the namespace and a brick for storage. The unify
>  >  translator is placed on the client side.
>  >
>  >  The test is: from one client I cp a file (from local to GlusterFS or vice
>  >  versa), and while the cp is still running I power down that client. Then
>  >  from the other client I try an "ls" command (I also tried sha1sum on a file
>  >  in the Gluster volume, cp, cat ...), and that client stays blocked for a
>  >  long time. Sometimes the command finishes (for example, "ls" after 2 or 3
>  >  minutes) and other times it returns an error message.
>  >
>  >  Note: sometimes the client is not blocked and Gluster works fine. It is
>  >  difficult to predict when the client will be blocked and when it will not.
>  >
>  >  As I mentioned previously, I tested this issue in several scenarios: with
>  >  and without AFR (I think the problem is caused by the unify translator),
>  >  with the unify translator on the client side and on the server side, and
>  >  with one server and two clients, 2 servers and 2 clients, and 3 servers
>  >  and two clients.
>  >
>  >  The timeout issue is related to this problem. I ran the same tests with
>  >  the timeout option set to see its impact. I can see that if I define a
>  >  timeout, when a client tries an ls command (or cp, sha1sum ...) the
>  >  recovery time is shorter than if I do not define one. I don't know how the
>  >  two are related, but it seems that with a timeout set, the client retries
>  >  the command when the timeout expires and this time the command finishes
>  >  successfully. I am not sure about this, though.
>  >
>  >  The config files of this last test:
>  >
>  >  Server1
>  >
>  >  volume brick
>  >       type storage/posix
>  >       option directory /home/pruebaD
>  >  end-volume
>  >
>  >  volume brick-ns
>  >       type storage/posix
>  >       option directory /home/namespace
>  >  end-volume
>  >
>  >  volume server
>  >       type protocol/server
>  >       subvolumes brick brick-ns
>  >       option transport-type tcp/server
>  >       option auth.ip.brick.allow *
>  >       option auth.ip.brick-ns.allow *
>  >       option listen-port 6996                # Default is 6996
>  >       option client-volume-filename etc/glusterfs/pruebaDistribuida/glusterfs-client.vol
>  >  end-volume
>  >
>  >  Server2
>  >
>  >  volume brick
>  >       type storage/posix
>  >       option directory /home/pruebaD
>  >  end-volume
>  >
>  >  volume server
>  >       type protocol/server
>  >       subvolumes brick
>  >       option transport-type tcp/server
>  >       option auth.ip.brick.allow *
>  >  end-volume
>  >
>  >  Clients
>  >
>  >  volume brick1
>  >       type protocol/client
>  >       option transport-type tcp/client
>  >       option remote-host 10.1.0.45
>  >       option remote-subvolume brick
>  >  end-volume
>  >
>  >  volume brick2
>  >       type protocol/client
>  >       option transport-type tcp/client
>  >       option remote-host 10.1.0.40
>  >       option remote-subvolume brick
>  >  end-volume
>  >
>  >  volume ns1
>  >       type protocol/client
>  >       option transport-type tcp/client
>  >       option remote-host 10.1.0.45
>  >       option remote-subvolume brick-ns
>  >  end-volume
>  >
>  >  volume unify
>  >       type cluster/unify
>  >       subvolumes brick1 brick2
>  >       option namespace ns1
>  >       option scheduler rr
>  >  end-volume
>  >
>  >  The GlusterFS version is 1.3.8pre5, with fuse 2.7.2glfs9. The OS is Gentoo,
>  >  kernel 2.6.23-r6.
>  >
>  >  Thanks for the reply,
>  >
>  >  -----Original Message-----
>  >  From: krishna.srinivas at gmail.com [mailto:krishna.srinivas at gmail.com] On behalf
>  >  of Krishna Srinivas
>  >  Sent: Monday, April 21, 2008 13:09
>  >  To: Antonio González
>  >  CC: gluster-devel at nongnu.org
>  >  Subject: Re: [Gluster-devel] Problem with clients that goes down..
>  >
>  >  Hi Antonio,
>  >
>  >  Excuse us, somehow your issue was not responded to.
>  >
>  >  If I understand correctly, you are facing two problems:
>  >  1) unplugging the cable on one client makes other clients hang
>  >  2) the timeout value you specify in the spec file is not reflected
>  >    in the actual timeout you see when you access glusterfs.
>  >
>  >  Is that correct? I have lost track of your setup details. Searching the mail
>  >  archives did not give me the exact picture. Can you give the setup
>  >  details with config files? And also the tests?
>  >
>  >  Surely the problem you are facing should be fixed.
>  >
>  >  Regards
>  >  Krishna
>  >
>  >  On Mon, Apr 21, 2008 at 3:58 PM, Antonio González
>  >  <antonio.gonzalez at libera.net> wrote:
>  >  > Hello all,
>  >  >
>  >  >  I have run a lot of tests on GlusterFS to verify its viability. I wrote
>  >  >  to this list one or two weeks ago asking about an issue where a client
>  >  >  that goes down causes problems for other clients, which then cannot
>  >  >  access the Gluster file system.
>  >  >
>  >  >  Are the GlusterFS developers aware of this issue? I think it is a
>  >  >  serious problem, and I need an answer in order to recommend for or
>  >  >  against using GlusterFS in a project.
>  >  >
>  >  >  I reproduced this issue in several scenarios (AFR/unify on the server
>  >  >  side, on the client side, without AFR…), and I think that the problem is
>  >  >  the unify translator. I ran a test with one server and two clients.
>  >  >  Without the unify translator it works fine: a client that goes down
>  >  >  while reading or copying a file doesn't affect other clients. With the
>  >  >  unify translator, a client that goes down while reading or writing a
>  >  >  file causes the problem (other clients that try an "ls" command are
>  >  >  blocked).
>  >  >
>  >  >  I ran a test with two servers (without AFR, unify on the client side),
>  >  >  with files located on each server. I blocked one server and tried to
>  >  >  access a file on the other server (cp command). I can see that access to
>  >  >  this server (the one that is not blocked) depends on the timeout option.
>  >  >  If I don't set a timeout, the client takes 2 or 3 minutes and does not
>  >  >  finish the command. If I set a timeout of 20 sec, the client takes 32 sec
>  >  >  and finishes the command. For a timeout of 40 sec, the client takes
>  >  >  approximately 60 sec.
>  >  >
>  >  >  I would like to know at least whether this problem is recognized by the
>  >  >  Gluster developers. Do they know what the problem is? Are they working
>  >  >  to solve it?
>  >  >
>  >  >  Thanks,
>  >  >
>  >  >  _______________________________________________
>  >  >  Gluster-devel mailing list
>  >  >  Gluster-devel at nongnu.org
>  >  >  http://lists.nongnu.org/mailman/listinfo/gluster-devel