[Gluster-users] Recovery (no active sink)

Robin robinr at miamioh.edu
Sat Jan 5 00:33:14 UTC 2013


Hi,

I have a volume (currently not mounted by any other clients) that 
complains about an "unsynced" entry.

Gluster 3.3.1, set up with replica 2 (on Gluster machines p01 and p02, to 
keep the names short).

# gluster volume heal RedhawkHome info
Gathering Heal info on volume RedhawkHome has been successful

Brick mualglup01:/mnt/gluster/RedhawkHome
Number of entries: 1
<gfid:9ed83644-cae6-4d16-a5b7-7ccb48c41695>

Brick mualglup02:/mnt/gluster/RedhawkHome
Number of entries: 1
<gfid:9ed83644-cae6-4d16-a5b7-7ccb48c41695>

### Usually the output gives me the path of the file,
### but this time it only spits out the gfid

I walked the entire file system and found that the file corresponding to 
that gfid is:
./home/zhouq_shared/T2483spectra/January172010/t2483_17jan2010_s1221e1635_1459
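In case it's useful to others hitting this: rather than walking the whole 
file system, the gfid entry under the brick's .glusterfs directory is (for 
regular files) a hardlink to the data file, so `find -samefile` can resolve 
a gfid to its path. A sketch, assuming the standard .glusterfs layout:

```shell
# gfid_to_path BRICK GFID — print the brick-local path(s) for a regular
# file's gfid, assuming the usual .glusterfs/<xx>/<yy>/<gfid> layout.
gfid_to_path() {
    local brick=$1 gfid=$2
    # The gfid entry is a hardlink to the data file, so -samefile matches
    # every path sharing that inode; prune .glusterfs itself from output.
    find "$brick" -path "$brick/.glusterfs" -prune -o \
         -samefile "$brick/.glusterfs/${gfid:0:2}/${gfid:2:2}/$gfid" -print
}

# e.g., on p01:
# gfid_to_path /mnt/gluster/RedhawkHome 9ed83644-cae6-4d16-a5b7-7ccb48c41695
```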

I confirmed that the gfid is the same on both p01, p02 Gluster machines 
for that file.

### At both p01 and p02, the xattrs match exactly:

# getfattr -d -e hex -m . home/zhouq_shared/T2483spectra/January172010/t2483_17jan2010_s1221e1635_1459
# file: home/zhouq_shared/T2483spectra/January172010/t2483_17jan2010_s1221e1635_1459
trusted.afr.RedhawkHome-client-0=0x000000030000000000000000 (non zero)
trusted.afr.RedhawkHome-client-1=0x000000030000000000000000 (non zero)
trusted.gfid=0x9ed83644cae64d16a5b77ccb48c41695
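For what it's worth, my understanding (an assumption on my part) is that 
each trusted.afr value packs three big-endian 32-bit counters — data, 
metadata, entry — so each brick here records 3 pending data operations 
against the other client, which would explain why neither is picked as a 
clean source. A quick decode:

```shell
# Decode a trusted.afr hex value into its counters
# (assumed layout: three big-endian 32-bit counts: data, metadata, entry).
decode_afr() {
    local hex=${1#0x}
    echo "data=$((16#${hex:0:8})) metadata=$((16#${hex:8:8})) entry=$((16#${hex:16:8}))"
}

decode_afr 0x000000030000000000000000   # → data=3 metadata=0 entry=0
```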

Self-heal is failing with this log entry (repeated many times each time 
self-heal runs):

[2013-01-04 14:47:05.203072] I [afr-self-heal-data.c:712:afr_sh_data_fix] 0-RedhawkHome-replicate-0: no active sinks for performing self-heal on file <gfid:9ed83644-cae6-4d16-a5b7-7ccb48c41695>

At p01,

# stat home/zhouq_shared/T2483spectra/January172010/t2483_17jan2010_s1221e1635_1459
   File: `home/zhouq_shared/T2483spectra/January172010/t2483_17jan2010_s1221e1635_1459'
   Size: 34881536  	Blocks: 68128      IO Block: 4096   regular file
Device: fd02h/64770d	Inode: 47190622    Links: 2
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2012-12-18 10:26:35.276898896 -0500
Modify: 2012-12-18 10:26:36.581912761 -0500
Change: 2013-01-04 19:18:34.495935037 -0500


At p02,
# stat home/zhouq_shared/T2483spectra/January172010/t2483_17jan2010_s1221e1635_1459
   File: `home/zhouq_shared/T2483spectra/January172010/t2483_17jan2010_s1221e1635_1459'
   Size: 34881536  	Blocks: 68128      IO Block: 4096   regular file
Device: fd02h/64770d	Inode: 328602346   Links: 2
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2012-12-18 10:26:35.275947590 -0500
Modify: 2012-12-18 10:26:36.584973268 -0500
Change: 2013-01-04 19:18:34.498314380 -0500

The md5sums at p01 and p02 match exactly.

I don't recall both Gluster machines being down at the same time (though 
that doesn't mean it didn't happen). This is my non-production volume, so 
it could be me being overly aggressive in testing things out. I also don't 
recall the client "cp" process that produced this file reporting any 
errors (again, that doesn't mean it didn't).

What's the best way to recover from this error?

I assume the worst-case scenario is that I mount the volume from a client 
and delete the file (that is, I lose this file).
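In case it helps frame answers: since the md5sums match, I assume the 
alternative to deleting is to clear the pending counters on one brick so 
the other becomes a valid source, then retrigger self-heal. A dry-run 
sketch of what I mean (it only echoes the commands; whether zeroing the 
xattrs like this is the right move is exactly what I'm unsure about):

```shell
# Dry-run sketch: reset the pending AFR counters for this file on ONE
# brick (p02 here, arbitrarily), then retrigger self-heal.
# Commands are echoed, not executed.
reset_cmds() {
    local f=/mnt/gluster/RedhawkHome/home/zhouq_shared/T2483spectra/January172010/t2483_17jan2010_s1221e1635_1459
    echo setfattr -n trusted.afr.RedhawkHome-client-0 -v 0x000000000000000000000000 "$f"
    echo setfattr -n trusted.afr.RedhawkHome-client-1 -v 0x000000000000000000000000 "$f"
    echo gluster volume heal RedhawkHome
}
reset_cmds
```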

Thanks,
Robin


