[Gluster-devel] glusterfs-1.3.8pre1

Dan Podeanu pdan at camcontacts.net
Fri Feb 22 17:52:37 UTC 2008


Hello,

*Q1:*

I've installed glusterfs-1.3.8pre1 on one node of a cluster running
1.3.7, but glusterfsd 1.3.8 drops incoming connections from 1.3.7
clients. Is this by design? Do I need to upgrade everything to
1.3.8pre1 at once?

Here's part of /var/log/glusterfs/glusterfsd.log:

2008-02-22 17:17:33 C [tcp.c:87:tcp_disconnect] server: connection 
disconnected
2008-02-22 17:17:33 E [server-protocol.c:183:generic_reply] server: 
transport_writev failed
2008-02-22 17:17:33 C [tcp.c:87:tcp_disconnect] server: connection 
disconnected
2008-02-22 17:17:33 E [server-protocol.c:183:generic_reply] server: 
transport_writev failed
2008-02-22 17:17:33 C [tcp.c:87:tcp_disconnect] server: connection 
disconnected
2008-02-22 17:17:33 C [tcp.c:87:tcp_disconnect] server: connection 
disconnected


*Q2:*

The reasons for trying to upgrade to 1.3.8 are the following:

Current configuration:

- 16 clients / 16 servers (one client/server on each machine)
- servers are dual Opteron, some of them quad-core, 8 or 12 GB RAM
- kernel 2.6.24-2, Gentoo Linux (can provide Gluster ebuilds)
- fuse 2.7.2glfs8, glusterfs 1.3.7 - see config files - basically a
simple unify with no ra/wt cache

Configs are here: http://gluster.pastebin.com/m7f61927f
All servers are stable, and the problems below occur under normal
running conditions.

Inside the Gluster filesystem we store ~3 million pictures, in a
directory tree that guarantees up to 1k pictures or subdirectories per
directory, with ~30 writes and ~300 reads per second. Files are
relatively small, 4-5 KB per picture.
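For context, a directory layout with that per-directory fan-out guarantee can be derived from the file name alone. A minimal sketch, assuming a hash-based scheme - `shard_path`, the md5 choice, and the two-level layout are illustrative guesses, not the actual scheme behind paths like /tmpfs/small/1/70/92/7092182.jpg:

```python
import hashlib
import os

def shard_path(root, filename, levels=2):
    # Spread files across nested subdirectories by hashing the name.
    # Two hex-pair levels give a fan-out of at most 256 entries per
    # directory, comfortably under the ~1k-per-directory target above.
    digest = hashlib.md5(filename.encode("utf-8")).hexdigest()
    parts = [digest[i * 2:i * 2 + 2] for i in range(levels)]
    return os.path.join(root, *parts, filename)

print(shard_path("/tmpfs/small", "7092182.jpg"))
```

With this shape, a lookup never has to scan a huge directory, which matters for small-file workloads like the one described.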


1. glusterfs (client) appears to leak memory in our configuration -
~300 MB of RAM consumed over 2 days.

2. Frequent files with size 0 and ctime 0 (epoch, 1970), even though
all servers are up and running.

3. Occasional files with correct size/ctime that cannot be read;
sometimes the same files can be read from other servers.

4. Back when I was using AFR for a mirrored namespace (which I gave up
on while trying to alleviate the other errors), glusterfs (client)
crashed in AFR when one of the servers was shutting down.
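To put numbers on a suspected leak like item 1, one option is to sample the client's resident set size over time. A minimal Linux-only sketch reading /proc/<pid>/status; the PID would come from the caller (e.g. `pidof glusterfs`), and the interval/sample counts are arbitrary:

```python
import os
import time

def rss_kb(pid):
    # Read VmRSS (resident set size, in kB) from /proc/<pid>/status.
    with open("/proc/%d/status" % pid) as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])
    return None

def watch(pid, interval=3600, samples=48):
    # Log RSS once per interval; steady growth across days of samples
    # (rather than a plateau) points at a leak instead of caching.
    for _ in range(samples):
        print(time.strftime("%Y-%m-%d %H:%M:%S"), rss_kb(pid), "kB")
        time.sleep(interval)
```

Plotting the samples makes it easy to distinguish unbounded growth from a cache that levels off.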

These errors appear in glusterfs.log when a file cannot be read:

2008-02-22 17:37:36 E [unify.c:837:unify_open] 
nowb-nora-client-stable-www: /tmpfs/small/1/70/92/7092182.jpg: 
entry_count is 4
2008-02-22 17:34:51 E [unify.c:790:unify_open_cbk] 
nowb-nora-client-stable-www: Open success on namespace, failed on child node


*Q3:*

Nice-to-haves:

1. Redundant namespace => no single point of failure.

2. A way to see a diagram of the cluster, its connected nodes, etc. -
useful for anyone running more than 2-3 servers - pulling live data
from one of the servers/clients.


As a side question, our organization could commit a part-time developer
dedicated to helping out with glusterfs; are you interested?

Best regards
Dan


