[Gluster-users] files disappearing and re-appearing
Riccardo Murri
riccardo.murri at uzh.ch
Thu Dec 22 15:49:21 UTC 2016
Dear Mohammed Rafi,
thanks for getting back to me!
> If you have the problem still bugging you, or if you have any previous
> logs that you can share with me, that will help to analyze further.
I have collected the logs from the server and one client; it's a 21MB
archive. How can I provide it? (I'm not sure how complete the
collection is: unfortunately, some time has already passed, so client
nodes have been terminated and their logs have been lost. Also, the
issues were happening at the beginning of November, so some logs have
simply been rotated out of existence by now.)
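In case it matters, what the archive contains is essentially the default
GlusterFS log directory from each node, gathered with something along these
lines (just a sketch, assuming the stock log location)::

    # gather the default GlusterFS log directory from one node
    tar -czf gluster-logs-$(hostname).tar.gz /var/log/glusterfs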
My reply to your questions is inline below.
> On 11/17/2016 07:22 PM, Riccardo Murri wrote:
> > Hello,
> >
> > we are trying out GlusterFS as the working filesystem for a compute cluster;
> > the cluster is comprised of 57 compute nodes (55 cores each), acting as
> > GlusterFS clients, and 25 data server nodes (8 cores each), serving
> > 1 large GlusterFS brick each.
> >
> > We currently have noticed a couple of issues:
> >
> > 1) When compute jobs run, the `glusterfs` client process on the compute nodes
> > goes up to 100% CPU, and filesystem operations start to slow down a lot.
> > Since there are many CPUs available, is it possible to make it use, e.g.,
> > 4 CPUs instead of one to make it more responsive?
>
> Can you briefly describe your computing jobs and workloads, so we can see
> what operations are happening on the cluster?
We built a cluster with 47 compute nodes, each with 56 cores. The
compute nodes were acting as GlusterFS clients (FUSE) to 25 GlusterFS
servers, each with 8 cores and 32 GB of RAM. Each server was serving a
single 10TB brick (ext4 formatted), for a grand total of 250TB.
The compute nodes were running the "rockstar" [1] program, one job per
node, so about 45 jobs were running concurrently [2]. Each job was driven
by a shell script that performed a number of file-existence probes while
the main program was running, e.g. (in Perl)::
sleep 1 while (!(-e "auto-rockstar.cfg")); #wait for server to start
Users of the cluster reported that many jobs failed or stalled because
these existence tests were never succeeding, or files would disappear
after having been created.
[1]: https://bitbucket.org/gfcstanford/rockstar
[2]: although one job could span many processes
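In case it helps to reproduce or narrow down the issue, a probe along these
lines (hypothetical, not what the users actually ran, and the path is just a
placeholder) would record once per second whether the marker file is visible
from a given client mount::

    # hypothetical probe: log once per second whether the marker file is
    # visible from this client's FUSE mount (placeholder path)
    while true; do
        if [ -e /glusterfs/path/to/auto-rockstar.cfg ]; then
            state=present
        else
            state=missing
        fi
        echo "$(date -u +%FT%TZ) $(hostname) auto-rockstar.cfg: $state"
        sleep 1
    done

Running it on two clients at once would show whether the file "disappears" on
one mount while remaining visible on the other.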
> > 2) In addition (but possibly related to 1) we have an issue with files
> > disappearing and re-appearing: from a compute process we test for the existence
> > of a file and e.g. `test -e /glusterfs/file.txt` fails. Then we test from
> > a different process or shell and the file is there. As far as I can see,
> > the servers are basically idle, and none of the peers is disconnected.
> >
> > We are running GlusterFS 3.7.17 on Ubuntu 16.04, installed from the Launchpad PPA.
> > (Details below for the interested.)
> >
> > Can you give any hint about what's going on?
> Is there any rebalance happening? Tell me more about any ongoing
> operations (internal ones like rebalance, shd, etc., or client
> operations).
If any rebalance happened, it was triggered automatically by the system.
It might be relevant that at some point the free space dropped to 0 (too
much output from the jobs); this might have thrown off some internal
healing operation.
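For completeness, these are the commands I would run to check for such
activity (a sketch, with the volume name taken from the `volume info`
output below)::

    # was/is a rebalance running on the volume?
    sudo gluster volume rebalance glusterfs status
    # per-brick disk space and inode usage, to spot bricks that filled up
    sudo gluster volume status glusterfs detail
    # on each server: free space on the brick filesystem itself
    df -h /srv/glusterfs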
Basically, the sequence of operations was like this:
- create the cluster
- fill `/glusterfs` with input data: ~200TB copied with `rsync`, no problems
- start 1000 "rockstar" jobs; issues begin as jobs stall and never complete
- reboot all GlusterFS servers and unmount/remount the filesystem on the clients, attempting to cure the problem
- reduce the number of compute nodes to 10 (= 560 cores); the job failure rate decreases to an acceptable level
I could only get limited reports/data points from the users: they were
in a hurry to process the data because of a deadline and did not want
to sit down and debug the issue down to its roots.
I am still quite interested in sorting this problem out, as the same
issue might resurface if we need to build a large cluster again.
> Also, some insight into your volume configuration will help: volume
> info and volume status.
Here it is::
ubuntu at data001:~$ sudo gluster volume info
Volume Name: glusterfs
Type: Distribute
Volume ID: fdca65bd-313c-47fa-8a09-222f794951ed
Status: Started
Number of Bricks: 25
Transport-type: tcp
Bricks:
Brick1: data001:/srv/glusterfs
Brick2: data002:/srv/glusterfs
Brick3: data003:/srv/glusterfs
Brick4: data004:/srv/glusterfs
Brick5: data005:/srv/glusterfs
Brick6: data006:/srv/glusterfs
Brick7: data007:/srv/glusterfs
Brick8: data008:/srv/glusterfs
Brick9: data009:/srv/glusterfs
Brick10: data010:/srv/glusterfs
Brick11: data011:/srv/glusterfs
Brick12: data012:/srv/glusterfs
Brick13: data013:/srv/glusterfs
Brick14: data014:/srv/glusterfs
Brick15: data015:/srv/glusterfs
Brick16: data016:/srv/glusterfs
Brick17: data017:/srv/glusterfs
Brick18: data018:/srv/glusterfs
Brick19: data019:/srv/glusterfs
Brick20: data020:/srv/glusterfs
Brick21: data021:/srv/glusterfs
Brick22: data022:/srv/glusterfs
Brick23: data023:/srv/glusterfs
Brick24: data024:/srv/glusterfs
Brick25: data025:/srv/glusterfs
Options Reconfigured:
performance.readdir-ahead: on
ubuntu at data001:~$ sudo gluster volume status
Status of volume: glusterfs
Gluster process TCP Port RDMA Port Online Pid
------------------------------------------------------------------------------
Brick data001:/srv/glusterfs 49153 0 Y 1462
Brick data002:/srv/glusterfs 49153 0 Y 1459
Brick data003:/srv/glusterfs 49153 0 Y 1463
Brick data004:/srv/glusterfs 49153 0 Y 1460
Brick data005:/srv/glusterfs 49153 0 Y 1459
Brick data006:/srv/glusterfs 49153 0 Y 1748
Brick data007:/srv/glusterfs 49153 0 Y 1457
Brick data008:/srv/glusterfs 49153 0 Y 1498
Brick data009:/srv/glusterfs 49153 0 Y 1469
Brick data010:/srv/glusterfs 49153 0 Y 1489
Brick data011:/srv/glusterfs 49153 0 Y 1470
Brick data012:/srv/glusterfs 49153 0 Y 1458
Brick data013:/srv/glusterfs 49153 0 Y 1475
Brick data014:/srv/glusterfs 49153 0 Y 1464
Brick data015:/srv/glusterfs 49153 0 Y 1459
Brick data016:/srv/glusterfs 49153 0 Y 1465
Brick data017:/srv/glusterfs 49153 0 Y 1466
Brick data018:/srv/glusterfs 49153 0 Y 1467
Brick data019:/srv/glusterfs 49153 0 Y 1464
Brick data020:/srv/glusterfs 49153 0 Y 1460
Brick data021:/srv/glusterfs 49153 0 Y 1556
Brick data022:/srv/glusterfs 49153 0 Y 1458
Brick data023:/srv/glusterfs 49153 0 Y 1472
Brick data024:/srv/glusterfs 49153 0 Y 1767
Brick data025:/srv/glusterfs 49153 0 Y 1470
NFS Server on localhost 2049 0 Y 17383
NFS Server on data011 2049 0 Y 14638
NFS Server on data022 2049 0 Y 12485
NFS Server on data004 2049 0 Y 15197
NFS Server on data007 2049 0 Y 15006
NFS Server on data021 2049 0 Y 13631
NFS Server on data019 2049 0 Y 14421
NFS Server on data008 2049 0 Y 13506
NFS Server on data013 2049 0 Y 15965
NFS Server on data014 2049 0 Y 13231
NFS Server on data005 2049 0 Y 13370
NFS Server on data017 2049 0 Y 15316
NFS Server on data003 2049 0 Y 15359
NFS Server on data002 2049 0 Y 12681
NFS Server on data024 2049 0 Y 14263
NFS Server on data025 2049 0 Y 12560
NFS Server on data016 2049 0 Y 14761
NFS Server on data023 2049 0 Y 13165
NFS Server on data020 2049 0 Y 12769
NFS Server on data018 2049 0 Y 13789
NFS Server on data006 2049 0 Y 13429
NFS Server on data015 2049 0 Y 13423
NFS Server on data009 2049 0 Y 15343
NFS Server on data010 2049 0 Y 13189
NFS Server on data012 2049 0 Y 12690
Task Status of Volume glusterfs
------------------------------------------------------------------------------
There are no active volume tasks
We build ephemeral clusters of VMs on an OpenStack infrastructure, which
are destroyed once the batch of computations is done.
The GlusterFS server configuration is done by Ansible::
https://github.com/gc3-uzh-ch/elasticluster/blob/master/elasticluster/share/playbooks/roles/glusterfs-server/tasks/export.yml
This is the `/etc/glusterfs/glusterd.vol` generated as a result::
volume management
type mgmt/glusterd
option working-directory /var/lib/glusterd
option transport-type socket,rdma
option transport.socket.keepalive-time 10
option transport.socket.keepalive-interval 2
option transport.socket.read-fail-log off
option ping-timeout 0
option event-threads 1
# option base-port 49152
end-volume
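As far as I understand, the `event-threads 1` above only applies to the
management daemon, not to the client processes. Regarding question 1 (the
single `glusterfs` client process pegged at 100% CPU), my reading of the
documentation is that the per-volume thread settings are what would let the
client use more than one core; something like the following is what I would
try next (a sketch, the values are guesses and we have not tested this)::

    # raise the number of network event threads used by clients and bricks
    sudo gluster volume set glusterfs client.event-threads 4
    sudo gluster volume set glusterfs server.event-threads 4
    # let the FUSE client hand work off to a pool of I/O worker threads
    sudo gluster volume set glusterfs performance.client-io-threads on

Does that sound like the right knob, or is the CPU usage more likely a
symptom of the lookup load caused by the existence probes?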
The GlusterFS clients simply do `mount -t glusterfs data001:/srv/glusterfs /glusterfs`
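We pass no extra mount options. If it would help rule out client-side
caching of (negative) lookups as the cause of the disappearing files,
remounting with the FUSE cache timeouts turned down is something we could
try (a sketch, assuming these mount options are supported by
mount.glusterfs in 3.7.17)::

    # remount with FUSE metadata caching effectively disabled, to test
    # whether the failing existence checks are stale cached lookups
    sudo umount /glusterfs
    sudo mount -t glusterfs \
        -o entry-timeout=0,attribute-timeout=0,negative-timeout=0 \
        data001:/srv/glusterfs /glusterfs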
Thanks for your help!
Riccardo
--
Riccardo Murri
http://www.s3it.uzh.ch/about/team/#Riccardo.Murri
S3IT: Services and Support for Science IT
University of Zurich
Winterthurerstrasse 190, CH-8057 Zürich (Switzerland)
Tel: +41 44 635 4208
Fax: +41 44 635 6888