[Gluster-users] Files Missing on Client Side; Still available on bricks

Jarsulic, Michael [CRI] mjarsulic at bsd.uchicago.edu
Tue Jun 6 14:11:30 UTC 2017


Hello,

I am still working on recovering from a few failed OS hard drives on my Gluster storage and have been removing and re-adding bricks quite a bit (I have sketched the rough command sequence near the end of this message). Last night I noticed that some directories are not visible when I access them through the client, but they are still present on the brick. For example:

Client:

# ls /scratch/dw
Ethiopian_imputation  HGDP  Rolwaling  Tibetan_Alignment

Brick:

# ls /data/brick1/scratch/dw
1000GP_Phase3  Ethiopian_imputation  HGDP  Rolwaling  SGDP  Siberian_imputation  Tibetan_Alignment  mapata


However, the directory is accessible on the client side (just not visible):

# stat /scratch/dw/SGDP
  File: `/scratch/dw/SGDP'
  Size: 212992      Blocks: 416        IO Block: 131072 directory
Device: 21h/33d Inode: 11986142482805280401  Links: 2
Access: (0775/drwxrwxr-x)  Uid: (339748621/dw)   Gid: (339748621/dw)
Access: 2017-06-02 16:00:02.398109000 -0500
Modify: 2017-06-06 06:59:13.004947703 -0500
Change: 2017-06-06 06:59:13.004947703 -0500
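For what it's worth, my next step was going to be comparing the extended attributes of the missing directory against a visible one on the backend brick, to see whether the gfid or DHT layout xattrs look off. I am not certain this is the right diagnostic, but something along these lines:

# getfattr -d -m . -e hex /data/brick1/scratch/dw/SGDP
# getfattr -d -m . -e hex /data/brick1/scratch/dw/HGDP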


The only place I see the directory mentioned in the log files is in the rebalance logs. The following piece may provide a clue as to what is going on:

[2017-06-05 20:46:51.752726] E [MSGID: 109010] [dht-rebalance.c:2259:gf_defrag_get_entry] 0-hpcscratch-dht: /dw/SGDP/HGDP00476_chr6.tped gfid not present
[2017-06-05 20:46:51.752742] E [MSGID: 109010] [dht-rebalance.c:2259:gf_defrag_get_entry] 0-hpcscratch-dht: /dw/SGDP/LP6005441-DNA_B08_chr4.tmp gfid not present
[2017-06-05 20:46:51.752773] E [MSGID: 109010] [dht-rebalance.c:2259:gf_defrag_get_entry] 0-hpcscratch-dht: /dw/SGDP/LP6005441-DNA_B08.geno.tmp gfid not present
[2017-06-05 20:46:51.752789] E [MSGID: 109010] [dht-rebalance.c:2259:gf_defrag_get_entry] 0-hpcscratch-dht: /dw/SGDP/LP6005443-DNA_D02_chr4.out gfid not present

This happened yesterday during a rebalance that failed. However, running a rebalance fix-layout allowed me to clean up these errors and successfully complete the migration to a re-added brick.
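For reference, the fix-layout run was roughly the following (from memory, so take the exact invocation with a grain of salt):

# gluster volume rebalance hpcscratch fix-layout start
# gluster volume rebalance hpcscratch status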


Here is the information for my storage cluster:

# gluster volume info

Volume Name: hpcscratch
Type: Distribute
Volume ID: 80b8eeed-1e72-45b9-8402-e01ae0130105
Status: Started
Number of Bricks: 6
Transport-type: tcp
Bricks:
Brick1: fs001-ib:/data/brick2/scratch
Brick2: fs003-ib:/data/brick5/scratch
Brick3: fs003-ib:/data/brick6/scratch
Brick4: fs004-ib:/data/brick7/scratch
Brick5: fs001-ib:/data/brick1/scratch
Brick6: fs004-ib:/data/brick8/scratch
Options Reconfigured:
server.event-threads: 8
performance.client-io-threads: on
client.event-threads: 8
performance.cache-size: 32MB
performance.readdir-ahead: on
diagnostics.client-log-level: INFO
diagnostics.brick-log-level: INFO


Mount points for the bricks:

/dev/sdb on /data/brick2 type xfs (rw,noatime,nobarrier)
/dev/sda on /data/brick1 type xfs (rw,noatime,nobarrier)


Mount point on the client:

10.xx.xx.xx:/hpcscratch on /scratch type fuse.glusterfs (rw,default_permissions,allow_other,max_read=131072)
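And, as mentioned above, the remove/re-add cycle I have been running for the failed bricks looked roughly like this (the brick path here is illustrative, not my exact history):

# gluster volume remove-brick hpcscratch fs001-ib:/data/brick1/scratch start
# gluster volume remove-brick hpcscratch fs001-ib:/data/brick1/scratch status
# gluster volume remove-brick hpcscratch fs001-ib:/data/brick1/scratch commit
# gluster volume add-brick hpcscratch fs001-ib:/data/brick1/scratch
# gluster volume rebalance hpcscratch start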


My questions are: what are the likely root causes of this issue, and what is the recommended way to recover from it? Let me know if you need any more information.


--
Mike Jarsulic
Sr. HPC Administrator
Center for Research Informatics | University of Chicago
773.702.2066

