[Gluster-users] 3.6.6 healing issues?
Osborne, Paul (paul.osborne@canterbury.ac.uk)
paul.osborne at canterbury.ac.uk
Thu Oct 15 15:40:13 UTC 2015
Hi,
I am seeing what I can best describe as an oddity: my monitoring tells me that there is an issue (Nagios touches a file and then removes it, to check that read/write access is available on the client mount point), Gluster says that there is no issue, yet on the server mounting the file store there is allegedly a lack of space. I am not certain where to turn. The installed packages:
# dpkg --list | grep gluster
ii glusterfs-client 3.6.6-1 amd64 clustered file-system (client package)
ii glusterfs-common 3.6.6-1 amd64 GlusterFS common libraries and translator modules
ii glusterfs-server 3.6.6-1 amd64 clustered file-system (server package)
So these are packages straight out of the gluster.org repository.
# gluster volume status kerberos
Status of volume: kerberos
Gluster process Port Online Pid
------------------------------------------------------------------------------
Brick gfsi-rh-01:/srv/hod/kerberos/gfs 49171 Y 12863
Brick gfsi-isr-01:/srv/hod/kerberos/gfs 49169 Y 37115
Brick gfsi-cant-01:/srv/hod/kerberos/gfs 49163 Y 49057
NFS Server on localhost 2049 Y 49069
Self-heal Daemon on localhost N/A Y 49076
NFS Server on gfsi-isr-01.core.canterbury.ac.uk 2049 Y 37127
Self-heal Daemon on gfsi-isr-01.core.canterbury.ac.uk N/A Y 37134
NFS Server on gfsi-rh-01.core.canterbury.ac.uk 2049 Y 12875
Self-heal Daemon on gfsi-rh-01.core.canterbury.ac.uk N/A Y 12882
Task Status of Volume kerberos
------------------------------------------------------------------------------
There are no active volume tasks
# gluster volume info kerberos
Volume Name: kerberos
Type: Replicate
Volume ID: 89d63332-bb1e-4b47-8882-dfdb9af7f97d
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: gfsi-rh-01:/srv/hod/kerberos/gfs
Brick2: gfsi-isr-01:/srv/hod/kerberos/gfs
Brick3: gfsi-cant-01:/srv/hod/kerberos/gfs
Options Reconfigured:
cluster.server-quorum-ratio: 51
# gluster volume heal kerberos statistics | grep No | grep -v 0
which to me looks good - 3 servers as a replica with quorum set.
BUT:
# mount | grep kerberos
gfsi-cant-01:/kerberos on /var/gfs/kerberos type nfs (rw,noatime,nodiratime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,nolock,proto=tcp,timeo=50,retrans=2,sec=sys,mountaddr=194.82.211.115,mountvers=3,mountport=38465,mountproto=tcp,local_lock=all,addr=194.82.211.115)
To be clear, we use autofs with failover, mounting over NFS from our 3 brick servers with a 0.5 second timeout.
# df -h
Filesystem Size Used Avail Use% Mounted on
gfsi-cant-01:/kerberos 97M 31M 61M 34% /var/gfs/kerberos
It's a deliberately small volume, as it only holds Kerberos keytabs that need to be in sync across web servers. Amusingly, nothing like that much data is actually in use:
/var/gfs/kerberos# ls -la
total 8
drwxr-xr-x 7 root root 1024 Oct 15 16:03 .
drwxr-xr-x 5 root root 0 Oct 15 16:03 ..
drwxr-xr-x 2 root root 1024 Sep 16 12:09 HTTP_blog-mgmnt
drwxr-xr-x 2 root root 1024 Jul 31 2014 HTTP_kerbtest
drwxr-xr-x 2 root root 1024 May 14 15:48 HTTP_wiki-dev
-rw-r--r--+ 1 root root 2546 Sep 16 11:24 krb5-keytab-autoupdate
drwxr-xr-x 2 nagios nagios 1024 Oct 15 16:18 .nagios
-rw-r--r-- 1 nagios nagios 47 Apr 23 16:08 .nagioscheck
Rlampint-rh-01:/var/gfs/kerberos# du -hs *
5.0K HTTP_blog-mgmnt
2.5K HTTP_kerbtest
2.5K HTTP_wiki-dev
2.5K krb5-keytab-autoupdate
Rlampint-rh-01:/var/gfs/kerberos# du -hs .nagios
1.0K .nagios
So I can only assume that the rest of the data is masked as Gluster metadata.
Rlampint-rh-01:/var/gfs/kerberos# ls -la >file
bash: file: No space left on device
which suggests that the volume is full - it clearly isn't.
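One thing worth ruling out here: "No space left on device" can also be returned when a filesystem has run out of inodes while blocks are still free, and `df -h` only reports blocks. The following is a minimal sketch for comparing the two; the function name `usage_report` is hypothetical, and the brick path in the usage line is taken from the volume info above. It would need to be run on each brick server rather than on the NFS client.

```shell
#!/bin/sh
# Hypothetical helper: print block and inode usage for the filesystem
# holding the given path. Uses POSIX-format df output (-P) so that each
# filesystem is reported on a single line.
usage_report() {
    path="$1"
    # Column 5 of "df -P" is the block capacity in use (e.g. 34%).
    blocks=$(df -P "$path" | awk 'NR==2 {print $5}')
    # Column 5 of "df -Pi" is the inode usage (IUse%); some filesystems
    # report "-" here because they do not have a fixed inode count.
    inodes=$(df -Pi "$path" | awk 'NR==2 {print $5}')
    echo "blocks=$blocks inodes=$inodes"
}
```

Usage, on a brick server: `usage_report /srv/hod/kerberos/gfs`. If the inode figure is at or near 100% while the block figure is low, that would account for the error despite the free space shown by `df -h`.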
In the logs:
/var/log/glusterfs# less glfsheal-kerberos.log
[2015-10-15 15:29:31.407041] E [glfs-mgmt.c:520:mgmt_getspec_cbk] 0-gfapi: failed to get the 'volume file' from server
[2015-10-15 15:29:31.407090] E [glfs-mgmt.c:599:mgmt_getspec_cbk] 0-glfs-mgmt: failed to fetch volume file (key:kerberos)
<Repeatedly>
I have stopped and restarted the volume, which has made no difference that I can see.
Other volumes configured and provisioned in a similar way on these 3 GFS servers are not reporting issues and they are rather well loaded compared to this one.
The nfs.log shows:
[2015-10-15 15:35:15.222424] W [nfs3.c:2370:nfs3svc_create_cbk] 0-nfs: 9b62284f: /.nagios/.nagiosrwtest.1444923315 => -1 (No space left on device)
[2015-10-15 15:35:28.081437] W [client-rpc-fops.c:2220:client3_3_create_cbk] 0-kerberos-client-2: remote operation failed: No space left on device. Path: /.nagios/.nagiosrwtest.1444923328
[2015-10-15 15:35:28.082000] W [client-rpc-fops.c:2220:client3_3_create_cbk] 0-kerberos-client-0: remote operation failed: No space left on device. Path: /.nagios/.nagiosrwtest.1444923328
[2015-10-15 15:35:28.082064] W [client-rpc-fops.c:2220:client3_3_create_cbk] 0-kerberos-client-1: remote operation failed: No space left on device. Path: /.nagios/.nagiosrwtest.1444923328
[2015-10-15 15:35:28.083019] W [nfs3.c:2370:nfs3svc_create_cbk] 0-nfs: 4b32485d: /.nagios/.nagiosrwtest.1444923328 => -1 (No space left on device)
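For context, the log entries above come from the monitoring check creating its probe file. This is a minimal sketch of what such a touch-and-remove check might look like, not the actual plugin in use: the function name `check_rw` is hypothetical, the probe file name follows the `.nagiosrwtest.<epoch>` pattern visible in the log, and the return codes follow the usual Nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL).

```shell
#!/bin/sh
# Hypothetical sketch of a read/write check: create a probe file in the
# given directory, then remove it again.
check_rw() {
    dir="$1"
    probe="$dir/.nagiosrwtest.$(date +%s)"
    # Creating the file is the step that fails with ENOSPC in the log.
    if ! touch "$probe" 2>/dev/null; then
        echo "CRITICAL: cannot create $probe"
        return 2
    fi
    # Clean up so that probe files do not accumulate on the volume.
    if ! rm -f "$probe"; then
        echo "WARNING: created but could not remove $probe"
        return 1
    fi
    echo "OK: $dir is writable"
    return 0
}
```

Usage: `check_rw /var/gfs/kerberos/.nagios`. Each run creates one new directory entry on every replica, so the create step is the first thing to fail if the bricks cannot allocate a new file.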
At this point I am rather confused and not entirely certain where to turn next, as I can see that there is space although allegedly there isn't. The three bricks each sit on a 100MB volume; in any case, previous testing has shown that Gluster sizes a replicated volume to its smallest brick.
Thoughts/comments are as always welcome.
Thanks
Paul