[Gluster-users] Gluster trouble
Marcus Bointon
marcus at synchromedia.co.uk
Wed Jul 3 14:55:59 UTC 2013
Back in March I posted about some gluster problems:
http://gluster.org/pipermail/gluster-users/2013-March/035737.html
http://gluster.org/pipermail/gluster-users/2013-March/035655.html
I'm still in the same situation: a straightforward 2-node, 2-way AFR setup with each server mounting the single shared volume via NFS, using gluster 3.3.0 (I can't use 3.3.1 due to its NFS issues) on 64-bit Linux (Ubuntu Lucid). Gluster appears to be working, but the volume won't mount on boot by any means I've tried, and it's still logging prodigious amounts of incomprehensible rubbish (to me!).
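For context, the boot-time mount is just a plain NFS entry in /etc/fstab along these lines (mount point here is illustrative; vers=3 over TCP is, as I understand it, required by gluster's built-in NFS server):

192.168.1.10:/shared  /mnt/shared  nfs  defaults,_netdev,vers=3,proto=tcp,mountproto=tcp  0  0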
Gluster says everything is OK:
gluster volume status
Status of volume: shared
Gluster process                         Port    Online  Pid
------------------------------------------------------------------------------
Brick 192.168.1.10:/var/shared          24009   Y       3097
Brick 192.168.1.11:/var/shared          24009   Y       3020
NFS Server on localhost                 38467   Y       3103
Self-heal Daemon on localhost           N/A     Y       3109
NFS Server on 192.168.1.11              38467   Y       3057
Self-heal Daemon on 192.168.1.11        N/A     Y       3096
(other node says the same thing with IPs the other way around)
Yet the logs tell a different story.
In syslog, this happens every second:
Jul 3 00:17:29 web1 init: glusterd main process (14958) terminated with status 255
Jul 3 00:17:29 web1 init: glusterd main process ended, respawning
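If it's useful, the respawn loop is easy to watch from the console; something like this (job name taken from the syslog lines above) should show the job cycling and whether more than one glusterd is alive at once:

status glusterd
ps -ef | grep '[g]lusterd'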
In /var/log/glusterfs/etc-glusterfs-glusterd.vol.log I have lots of this:
[2013-07-03 14:24:08.350429] I [glusterfsd.c:1666:main] 0-/usr/sbin/glusterd: Started running /usr/sbin/glusterd version 3.3.0
[2013-07-03 14:24:08.350592] E [glusterfsd.c:1296:glusterfs_pidfile_setup] 0-glusterfsd: pidfile /var/run/glusterd.pid lock error (Resource temporarily unavailable)
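I gather "Resource temporarily unavailable" (EAGAIN) is what a failed non-blocking lock attempt returns when another process already holds the lock, so presumably something like this would identify the holder (assuming the running glusterd keeps the pidfile open):

sudo fuser -v /var/run/glusterd.pid
sudo lsof /var/run/glusterd.pid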
In /var/log/glusterfs/glustershd.log, every minute I get hundreds of these:
[2013-07-03 14:24:00.792751] I [afr-self-heal-common.c:2159:afr_self_heal_completion_cbk] 0-shared-replicate-0: background meta-data self-heal completed on <gfid:16adce4d-1933-485f-8359-66c47c757cd3>
[2013-07-03 14:24:00.794251] I [afr-common.c:1340:afr_launch_self_heal] 0-shared-replicate-0: background meta-data self-heal triggered. path: <gfid:52bd33a0-0df8-408a-bf6f-c5b6a48c4bd3>, reason: lookup detected pending operations
[2013-07-03 14:24:00.796411] I [afr-self-heal-common.c:2159:afr_self_heal_completion_cbk] 0-shared-replicate-0: background meta-data self-heal completed on <gfid:52bd33a0-0df8-408a-bf6f-c5b6a48c4bd3>
'gluster volume heal shared info' says:
Heal operation on volume shared has been successful
Brick 192.168.1.10:/var/shared
Number of entries: 335
...
I'm not clear whether this means it has 335 entries still to heal, or whether it has already healed them.
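If it helps narrow that down, I understand 3.3 has more specific variants of that command that separate completed from failed heals:

gluster volume heal shared info healed
gluster volume heal shared info heal-failed
gluster volume heal shared info split-brain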
Both servers are logging the same kind of stuff. I'm sure all these are related since they happen at about the same rate.
The lock error looks the most interesting, but I've no idea what's holding the lock or why. As before, I've tried deleting all traces of gluster, reinstalling, reconfiguring, and restoring all the data, but nothing changes.
Here's the command I used to create the volume:
gluster volume create shared replica 2 transport tcp 192.168.1.10:/var/shared 192.168.1.11:/var/shared
Here's the volume file it created:
volume shared-posix
    type storage/posix
    option directory /var/shared
    option volume-id 2600e26c-b6c4-448f-a6f6-ad27c14745a0
end-volume

volume shared-access-control
    type features/access-control
    subvolumes shared-posix
end-volume

volume shared-locks
    type features/locks
    subvolumes shared-access-control
end-volume

volume shared-io-threads
    type performance/io-threads
    subvolumes shared-locks
end-volume

volume shared-index
    type features/index
    option index-base /var/shared/.glusterfs/indices
    subvolumes shared-io-threads
end-volume

volume shared-marker
    type features/marker
    option volume-uuid 2600e26c-b6c4-448f-a6f6-ad27c14745a0
    option timestamp-file /var/lib/glusterd/vols/shared/marker.tstamp
    option xtime off
    option quota off
    subvolumes shared-index
end-volume

volume /var/shared
    type debug/io-stats
    option latency-measurement off
    option count-fop-hits off
    subvolumes shared-marker
end-volume

volume shared-server
    type protocol/server
    option transport-type tcp
    option auth.login./var/shared.allow 94017411-d986-48e4-a7ac-47c1db14fba0
    option auth.login.94017411-d986-48e4-a7ac-47c1db14fba0.password 3929acf9-fcf1-4684-b271-07927d375c9b
    option auth.addr./var/shared.allow *
    subvolumes /var/shared
end-volume
Despite all this, I've not seen gluster do anything visibly wrong: if I create a file on the shared volume it appears on the other node, checksums match, clients can read it, and so on. But I don't want to be running on luck! It's all very troubling, and it's making a right mess of my new distributed logging system...
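(By "checksums match" I mean a crude check like this, with the file name and mount point being illustrative: write through the NFS mount on one node, then compare the bricks directly on both.)

echo test > /mnt/shared/canary.txt    # on web1, via the NFS mount
md5sum /var/shared/canary.txt         # on web1 and web2, directly on each brick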
Any ideas?
Marcus