[Gluster-users] 3.6.6 healing issues?

Thu Oct 15 15:40:13 UTC 2015

Hi,

I am seeing what I can best describe as an oddity. My monitoring tells me that there is an issue (Nagios touches a file and then removes it, to check that read/write access is available on the client mount point), Gluster says that there is no issue, and on the server mounting the file store there is allegedly a lack of space. I am not certain where to turn:

# dpkg --list | grep gluster
ii  glusterfs-client                   3.6.6-1                           amd64        clustered file-system (client package)
ii  glusterfs-common                   3.6.6-1                           amd64        GlusterFS common libraries and translator modules
ii  glusterfs-server                   3.6.6-1                           amd64        clustered file-system (server package)

So these are packages straight out of the gluster.org repository.

# gluster volume status kerberos
Status of volume: kerberos
Gluster process						Port	Online	Pid
------------------------------------------------------------------------------
Brick gfsi-rh-01:/srv/hod/kerberos/gfs			49171	Y	12863
Brick gfsi-isr-01:/srv/hod/kerberos/gfs			49169	Y	37115
Brick gfsi-cant-01:/srv/hod/kerberos/gfs		49163	Y	49057
NFS Server on localhost					2049	Y	49069
Self-heal Daemon on localhost				N/A	Y	49076
NFS Server on gfsi-isr-01.core.canterbury.ac.uk		2049	Y	37127
Self-heal Daemon on gfsi-isr-01.core.canterbury.ac.uk	N/A	Y	37134
NFS Server on gfsi-rh-01.core.canterbury.ac.uk		2049	Y	12875
Self-heal Daemon on gfsi-rh-01.core.canterbury.ac.uk	N/A	Y	12882

Task Status of Volume kerberos
------------------------------------------------------------------------------
There are no active volume tasks

# gluster volume info kerberos

Volume Name: kerberos
Type: Replicate
Volume ID: 89d63332-bb1e-4b47-8882-dfdb9af7f97d
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: gfsi-rh-01:/srv/hod/kerberos/gfs
Brick2: gfsi-isr-01:/srv/hod/kerberos/gfs
Brick3: gfsi-cant-01:/srv/hod/kerberos/gfs
Options Reconfigured:
cluster.server-quorum-ratio: 51

# gluster volume heal kerberos statistics | grep No  | grep -v 0

which to me looks good - 3 servers as a replica with quorum set.
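One caveat on the filter above: `grep -v 0` drops every line containing the digit 0, so a count such as 10 would be hidden as well. A minimal sketch of a stricter filter follows. It assumes the counter lines have the form "No. of <something>: <count>"; the function name is made up for illustration.

```shell
# Print only the heal counter lines whose value is non-zero.
# Assumption: counter lines look like "No. of <something>: <count>".
nonzero_heal_counts() {
    # Split on the colon; the last field is the count.
    awk -F': *' '/No\. of / && $NF + 0 > 0'
}

# Usage (needs the gluster CLI on a server in the pool):
# gluster volume heal kerberos statistics | nonzero_heal_counts
```

Empty output from this filter would mean the same thing as the empty output above, but without the risk of hiding counts like 10 or 20.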

BUT:

# mount | grep kerberos
gfsi-cant-01:/kerberos on /var/gfs/kerberos type nfs (rw,noatime,nodiratime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,nolock,proto=tcp,timeo=50,retrans=2,sec=sys,mountaddr=194.82.211.115,mountvers=3,mountport=38465,mountproto=tcp,local_lock=all,addr=194.82.211.115)

To be clear, we use autofs failover mounting over NFS to our 3 bricks, with a 0.5 second timeout.

# df -h
Filesystem               Size  Used Avail Use% Mounted on
gfsi-cant-01:/kerberos    97M   31M   61M  34% /var/gfs/kerberos

It's a deliberately small volume, as it only holds Kerberos keytabs that need to be in sync across web servers. Amusingly, nowhere near that much data is actually in use:

/var/gfs/kerberos# ls -la
total 8
drwxr-xr-x  7 root   root   1024 Oct 15 16:03 .
drwxr-xr-x  5 root   root      0 Oct 15 16:03 ..
drwxr-xr-x  2 root   root   1024 Sep 16 12:09 HTTP_blog-mgmnt
drwxr-xr-x  2 root   root   1024 Jul 31  2014 HTTP_kerbtest
drwxr-xr-x  2 root   root   1024 May 14 15:48 HTTP_wiki-dev
-rw-r--r--+ 1 root   root   2546 Sep 16 11:24 krb5-keytab-autoupdate
drwxr-xr-x  2 nagios nagios 1024 Oct 15 16:18 .nagios
-rw-r--r--  1 nagios nagios   47 Apr 23 16:08 .nagioscheck

Rlampint-rh-01:/var/gfs/kerberos# du -hs *
5.0K	HTTP_blog-mgmnt
2.5K	HTTP_kerbtest
2.5K	HTTP_wiki-dev
2.5K	krb5-keytab-autoupdate

Rlampint-rh-01:/var/gfs/kerberos# du -hs .nagios
1.0K	.nagios

So I can only assume that the rest of the used space is hidden Gluster metadata.

Rlampint-rh-01:/var/gfs/kerberos# ls -la >file
bash: file: No space left on device

which suggests that the volume is full, when it clearly isn't.
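`df -h` only reports block usage, and "No space left on device" is also returned when a filesystem has no free inodes. I have not confirmed that this is the cause here; the following is only a sketch of how both figures could be compared on each brick. It assumes a `df` that accepts `-P` and `-i` (as GNU coreutils does), and the function name is made up for illustration.

```shell
# Report block and inode usage percentages for the filesystem holding a path.
# Assumption: df accepts -P (POSIX output format) and -i (inode counts).
space_report() {
    path=$1
    # Line 2 of the df output is the filesystem line; column 5 is the Use% figure.
    blocks=$(df -P "$path" | awk 'NR == 2 { print $5 }')
    inodes=$(df -Pi "$path" | awk 'NR == 2 { print $5 }')
    echo "blocks used: $blocks, inodes used: $inodes"
}

# Usage, on each of the three servers, against the brick path:
# space_report /srv/hod/kerberos/gfs
```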

In the logs:
/var/log/glusterfs# less glfsheal-kerberos.log
[2015-10-15 15:29:31.407041] E [glfs-mgmt.c:520:mgmt_getspec_cbk] 0-gfapi: failed to get the 'volume file' from server
[2015-10-15 15:29:31.407090] E [glfs-mgmt.c:599:mgmt_getspec_cbk] 0-glfs-mgmt: failed to fetch volume file (key:kerberos)
(these two lines repeat)

I have stopped and restarted the volume, which has made no difference that I can see.

Other volumes configured and provisioned in a similar way on these 3 GFS servers are not reporting issues and they are rather well loaded compared to this one.

The nfs.log shows:
[2015-10-15 15:35:15.222424] W [nfs3.c:2370:nfs3svc_create_cbk] 0-nfs: 9b62284f: /.nagios/.nagiosrwtest.1444923315 => -1 (No space left on device)
[2015-10-15 15:35:28.081437] W [client-rpc-fops.c:2220:client3_3_create_cbk] 0-kerberos-client-2: remote operation failed: No space left on device. Path: /.nagios/.nagiosrwtest.1444923328
[2015-10-15 15:35:28.082000] W [client-rpc-fops.c:2220:client3_3_create_cbk] 0-kerberos-client-0: remote operation failed: No space left on device. Path: /.nagios/.nagiosrwtest.1444923328
[2015-10-15 15:35:28.082064] W [client-rpc-fops.c:2220:client3_3_create_cbk] 0-kerberos-client-1: remote operation failed: No space left on device. Path: /.nagios/.nagiosrwtest.1444923328
[2015-10-15 15:35:28.083019] W [nfs3.c:2370:nfs3svc_create_cbk] 0-nfs: 4b32485d: /.nagios/.nagiosrwtest.1444923328 => -1 (No space left on device)
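For context, the Nagios check behind these entries works roughly as sketched below. This is a hypothetical re-creation, not the actual plugin: the function name and messages are invented, and only the probe file name (`.nagiosrwtest.` plus the epoch time) is taken from the log lines above. The return values follow the usual Nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL).

```shell
# Hypothetical re-creation of the read/write probe; not the real plugin.
check_rw() {
    dir=$1
    # Probe file named after the current epoch time, as seen in nfs.log.
    probe="$dir/.nagiosrwtest.$(date +%s)"
    if ! touch "$probe" 2>/dev/null; then
        echo "CRITICAL: cannot create $probe"
        return 2
    fi
    if ! rm "$probe" 2>/dev/null; then
        echo "WARNING: cannot remove $probe"
        return 1
    fi
    echo "OK: $dir is writable"
    return 0
}

# Usage:
# check_rw /var/gfs/kerberos/.nagios
```

It is the create step (the `touch`) that fails with "No space left on device" in the log above.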

At this point I am rather confused and not entirely certain where to turn next. I can see that there is space, although allegedly there isn't. The three bricks are three 100MB volumes, and in any case previous testing has shown that Gluster sizes a replicated volume to its smallest brick.

Thoughts/comments are as always welcome.

Thanks

Paul