[Gluster-users] Problem with files showing zero links

Fri Jan 29 14:42:28 UTC 2016

Venky Shankar wrote on 29/01/2016 12:28:
> On Fri, Jan 29, 2016 at 11:34:37AM +0000, Ronny Adsetts wrote:
>> Venky Shankar wrote on 29/01/2016 11:09:
>>> On Fri, Jan 29, 2016 at 10:46:14AM +0000, Ronny Adsetts wrote:
>>>>
>>>> No idea how it came about in the first place. Suspect it happened
>>>> during either an operating system upgrade or upgrading glusterfs
>>>> packages to those from gluster.org.
>>>
>>> If you're still willing to debug, please send across brick log file
>>> of the concerned node.
>>
>> Of course. Attached is the current brick log.
> 
> The log files do not give any clue as to how some of the files ended up with missing .glusterfs
> linkage, though there are tons of errors from operations trying to access .glusterfs/.././.....
> (e.g. for gfid 7f0fa160-8e5d-44f5-88b3-8187b2397313). Probably an upgrade and restart of services
> in the midst of an operation?

Unfortunately I don't have logs from when the server upgrade was done.

It's possible services were started when they shouldn't have been. The upgrade was from Debian Squeeze to Wheezy, which should have meant Gluster stayed at the same version (3.2.7-3+deb7u1~bpo60+1 -> 3.2.7-3+deb7u1). I then upgraded to the Debian backports version (3.5.2-1~bpo70+1), and after that to the gluster.org 3.6.8-1 version. I also tried to upgrade to the gluster.org 3.7.x version on one of the nodes but it wouldn't start.

The node I'm having problems with is the one that was upgraded second in each of the steps above.

>> My previous comments about having fixed the problem were a little premature. Most of the files that were showing with zero links are now in the 'failed-heal' state:
>>
>> # gluster volume heal software statistics
>> [...]
>> Starting time of crawl: Fri Jan 29 11:15:53 2016
>>
>> Ending time of crawl: Fri Jan 29 11:15:58 2016
>>
>> Type of crawl: INDEX
>> No. of entries healed: 0
>> No. of entries in split-brain: 0
>> No. of heal failed entries: 1976
> 
> My guess would be that as soon as you fixed the backend linkages, self-heal was able to kick in as it could
> "find" these files on the replica. But why did the heal fail, and why did these files need healing in the
> first place?

Yes, the self-heal was showing nothing until I touch'ed all the files.

Looking in the self-heal logs and picking a line at random:

[2016-01-29 14:19:10.008237] W [client-rpc-fops.c:2772:client3_3_lookup_cbk] 0-software-client-1: remote operation failed: No such file or directory. Path: <gfid:3173a4d0-0e6a-420a-b92d-4947a8a9c122> (3173a4d0-0e6a-420a-b92d-4947a8a9c122)

We have no GFID file for it on the troublesome node:

metropolis:/stor/software# ls -al /data/glusterfs/software/brick1/brick/.glusterfs/31/73/
total 8
drwx--S---   2 root Domain Admins    6 Jan 22 15:08 .
drwx--S--- 254 root Domain Admins 4096 Jan 20 22:15 ..

It's there on the other node:

gotham:~# ls -i /data/glusterfs/software/brick1/brick/.glusterfs/31/73/3173a4d0-0e6a-420a-b92d-4947a8a9c122
118049117 /data/glusterfs/software/brick1/brick/.glusterfs/31/73/3173a4d0-0e6a-420a-b92d-4947a8a9c122

gotham:~# find /data/glusterfs/software/brick1/brick/ -inum 118049117
/data/glusterfs/software/brick1/brick/win_patches/IE11-Windows6.1-x86-en-us.exe
/data/glusterfs/software/brick1/brick/.glusterfs/31/73/3173a4d0-0e6a-420a-b92d-4947a8a9c122

The data file is there on the bad node, but nothing else shares its inode:

metropolis:/stor/software# ls -i /data/glusterfs/software/brick1/brick/win_patches/IE11-Windows6.1-x86-en-us.exe
116199485 /data/glusterfs/software/brick1/brick/win_patches/IE11-Windows6.1-x86-en-us.exe

metropolis:/stor/software# find /data/glusterfs/software/brick1/brick/ -inum 116199485
/data/glusterfs/software/brick1/brick/win_patches/IE11-Windows6.1-x86-en-us.exe
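
The same check can be run brick-wide. A regular file on a healthy brick normally has at least two links: the named file and its GFID hard link under .glusterfs. Anything outside .glusterfs with a link count of 1 is therefore a candidate for the missing-linkage problem. A minimal sketch, assuming the brick root is passed as the only argument (find_unlinked is just an illustrative name, not a Gluster tool):

```shell
# List regular files on a brick whose link count is 1, i.e. files that
# lack the second hard link normally kept under <brick>/.glusterfs.
# find_unlinked is a hypothetical helper name used only for this sketch.
find_unlinked() {
    brick=${1%/}                     # tolerate a trailing slash
    # Skip the .glusterfs tree itself, then report single-link regular files.
    find "$brick" -path "$brick/.glusterfs" -prune -o -type f -links 1 -print
}

# Usage (read-only, changes nothing on the brick):
#   find_unlinked /data/glusterfs/software/brick1/brick
```

This only reports candidates; a file listed here still needs its trusted.gfid checked before anything is relinked.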

metropolis:/stor/software# getfattr -m . -d -e hex /data/glusterfs/software/brick1/brick/win_patches/IE11-Windows6.1-x86-en-us.exe
getfattr: Removing leading '/' from absolute path names
# file: data/glusterfs/software/brick1/brick/win_patches/IE11-Windows6.1-x86-en-us.exe
trusted.afr.software-client-0=0x000000000000000000000000
trusted.afr.software-client-1=0x000000000000000000000000
trusted.gfid=0x3173a4d00e6a420ab92d4947a8a9c122
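
For reference, the backend path checked earlier follows directly from the trusted.gfid value: the first two hex bytes become the two directory levels under .glusterfs, and the file name is the GFID in the usual 8-4-4-4-12 UUID form. A small sketch of that mapping (gfid_to_path is an illustrative helper name, not part of GlusterFS):

```shell
# Map a trusted.gfid hex value (as printed by "getfattr -e hex") to the
# hard-link path expected under <brick>/.glusterfs.
# gfid_to_path is a hypothetical helper name used only for this sketch.
gfid_to_path() {
    brick=${1%/}                     # tolerate a trailing slash
    hex=${2#0x}                      # drop the leading 0x
    # Insert dashes to get the 8-4-4-4-12 UUID layout.
    uuid=$(printf '%s\n' "$hex" |
        sed -E 's/^(.{8})(.{4})(.{4})(.{4})(.{12})$/\1-\2-\3-\4-\5/')
    d1=$(printf '%s' "$hex" | cut -c1-2)    # first directory level
    d2=$(printf '%s' "$hex" | cut -c3-4)    # second directory level
    printf '%s/.glusterfs/%s/%s/%s\n' "$brick" "$d1" "$d2" "$uuid"
}

# Usage:
#   gfid_to_path /data/glusterfs/software/brick1/brick \
#       0x3173a4d00e6a420ab92d4947a8a9c122
# prints:
#   /data/glusterfs/software/brick1/brick/.glusterfs/31/73/3173a4d0-0e6a-420a-b92d-4947a8a9c122
```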

On the "good" node, we now have a "trusted.afr.dirty" attribute set:

gotham:~# getfattr -m . -d -e hex /data/glusterfs/software/brick1/brick/win_patches/IE11-Windows6.1-x86-en-us.exe
getfattr: Removing leading '/' from absolute path names
# file: data/glusterfs/software/brick1/brick/win_patches/IE11-Windows6.1-x86-en-us.exe
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.software-client-0=0x000000000000000000000000
trusted.afr.software-client-1=0x000000000000000200000000
trusted.gfid=0x3173a4d00e6a420ab92d4947a8a9c122
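
To read these values: each trusted.afr.* changelog xattr is 12 bytes, holding three big-endian 32-bit counters of pending data, metadata, and entry operations, in that order. The software-client-1 value on gotham therefore decodes to data=0, metadata=2, entry=0. A small sketch that splits a value the same way (decode_afr is an illustrative name):

```shell
# Split a trusted.afr.* changelog value (24 hex digits, as printed by
# "getfattr -e hex") into its three pending-operation counters.
# decode_afr is a hypothetical helper name used only for this sketch.
decode_afr() {
    hex=${1#0x}                                  # drop the leading 0x
    data=$(printf '%s' "$hex" | cut -c1-8)       # pending data operations
    meta=$(printf '%s' "$hex" | cut -c9-16)      # pending metadata operations
    entry=$(printf '%s' "$hex" | cut -c17-24)    # pending entry operations
    # printf converts the 0x-prefixed hex fields to decimal.
    printf 'data=%d metadata=%d entry=%d\n' "0x$data" "0x$meta" "0x$entry"
}

# Usage:
#   decode_afr 0x000000000000000200000000
# prints:
#   data=0 metadata=2 entry=0
```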

Hopefully some useful information there.

Aside from debugging how we got into this state, any pointers for getting the volume back to a sane state?

> Pranith, Ravi?
> 
> Also, please provide self-heal logs.

The self-heal log is (temporarily) available here rather than attached, as it was too big for the list:

http://www.amazinginternet.com/glustershd.log.xz

It is from the node having problems (compressed with xz, as gzip wasn't cutting it for this file). Let me know if you want the log from the other node too.

Ronny
-- 
Ronny Adsetts
Technical Director
Amazing Internet Ltd, London
t: +44 20 8977 8943
w: www.amazinginternet.com

Registered office: 85 Waldegrave Park, Twickenham, TW1 4TJ
Registered in England. Company No. 4042957
