[Gluster-users] Hundreds of duplicate files

Joe Julian joe at julianfamily.org
Fri Feb 20 20:51:13 UTC 2015


On 02/20/2015 12:21 PM, Olav Peeters wrote:
> Let's take one file (3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd) as an 
> example...
> On the 3 nodes where all bricks are formatted as XFS and mounted in 
> /export and 272b2366-dfbf-ad47-2a0f-5d5cc40863e3 is the mounting point 
> of a NFS shared storage connection from XenServer machines:
Did I just read this correctly? Your bricks are NFS mounts? I.e., 
GlusterFS Client <-> GlusterFS Server <-> NFS <-> XFS
>
> [root at gluster01 ~]# find 
> /export/*/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/ -name '300*' -exec ls 
> -la {} \;
> -rw-r--r--. 2 root root 44332659200 Feb 17 23:55 
> /export/brick13gfs01/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
Presumably, this is the actual file.
> -rw-r--r--. 2 root root 0 Feb 18 00:51 
> /export/brick14gfs01/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
This is not a linkfile. Note it's mode 0644. How it got there with those 
permissions would be a matter of history and would require information 
that's probably lost.
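For reference, a real DHT linkfile is a zero-byte file with mode 1000 
(---------T) *and* a trusted.glusterfs.dht.linkto xattr; a 0644 zero-byte 
file like the one above is neither. A minimal sketch of a checker, 
demonstrated on scratch files rather than your bricks (the function name 
is mine, not a gluster tool):

```shell
# Sketch: classify a brick entry as a DHT linkfile (zero bytes, mode 1000,
# plus a trusted.glusterfs.dht.linkto xattr) vs. an ordinary file.
classify() {
    if [ "$(stat -c %a "$1")" = "1000" ] && \
       getfattr -n trusted.glusterfs.dht.linkto "$1" >/dev/null 2>&1; then
        echo linkfile
    else
        echo regular
    fi
}

# Demo on a scratch file; your 0644 zero-byte .vhd above would come out
# "regular", which is exactly why it is not a linkfile.
tmp=$(mktemp -d)
touch "$tmp/plain" && chmod 0644 "$tmp/plain"
classify "$tmp/plain"     # regular
```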
>
> [root at gluster02 ~]# find 
> /export/*/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/ -name '300*' -exec ls 
> -la {} \;
> -rw-r--r--. 2 root root 44332659200 Feb 17 23:55 
> /export/brick13gfs02/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
>
> [root at gluster03 ~]# find 
> /export/*/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/ -name '300*' -exec ls 
> -la {} \;
> -rw-r--r--. 2 root root 44332659200 Feb 17 23:55 
> /export/brick13gfs03/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
> -rw-r--r--. 2 root root 0 Feb 18 00:51 
> /export/brick14gfs03/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
Same analysis as above.
>
> Three files with data, and two 0-byte files with the same name.
>
> Checking the 0-bit files:
> [root at gluster01 ~]# getfattr -m . -d -e hex 
> /export/brick14gfs01/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
> getfattr: Removing leading '/' from absolute path names
> # file: 
> export/brick14gfs01/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
> security.selinux=0x73797374656d5f753a6f626a6563745f723a66696c655f743a733000
> trusted.afr.dirty=0x000000000000000000000000
> trusted.afr.sr_vol01-client-34=0x000000000000000000000000
> trusted.afr.sr_vol01-client-35=0x000000000000000000000000
> trusted.gfid=0xaefd184508414a8f8408f1ab8aa7a417
>
> [root at gluster03 ~]# getfattr -m . -d -e hex 
> /export/brick14gfs03/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
> getfattr: Removing leading '/' from absolute path names
> # file: 
> export/brick14gfs03/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
> security.selinux=0x73797374656d5f753a6f626a6563745f723a66696c655f743a733000
> trusted.afr.dirty=0x000000000000000000000000
> trusted.afr.sr_vol01-client-34=0x000000000000000000000000
> trusted.afr.sr_vol01-client-35=0x000000000000000000000000
> trusted.gfid=0xaefd184508414a8f8408f1ab8aa7a417
>
> This is not a glusterfs link file since there is no 
> "trusted.glusterfs.dht.linkto", am I correct?
You are correct.
>
> And checking the "good" files:
>
> # file: 
> export/brick13gfs01/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
> security.selinux=0x756e636f6e66696e65645f753a6f626a6563745f723a66696c655f743a733000
> trusted.afr.dirty=0x000000000000000000000000
> trusted.afr.sr_vol01-client-32=0x000000000000000000000000
> trusted.afr.sr_vol01-client-33=0x000000000000000000000000
> trusted.afr.sr_vol01-client-34=0x000000000000000000000000
> trusted.afr.sr_vol01-client-35=0x000000010000000100000000
> trusted.gfid=0xaefd184508414a8f8408f1ab8aa7a417
>
> [root at gluster02 ~]# getfattr -m . -d -e hex 
> /export/brick13gfs02/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
> getfattr: Removing leading '/' from absolute path names
> # file: 
> export/brick13gfs02/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
> security.selinux=0x73797374656d5f753a6f626a6563745f723a66696c655f743a733000
> trusted.afr.dirty=0x000000000000000000000000
> trusted.afr.sr_vol01-client-32=0x000000000000000000000000
> trusted.afr.sr_vol01-client-33=0x000000000000000000000000
> trusted.gfid=0xaefd184508414a8f8408f1ab8aa7a417
>
> [root at gluster03 ~]# getfattr -m . -d -e hex 
> /export/brick13gfs03/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
> getfattr: Removing leading '/' from absolute path names
> # file: 
> export/brick13gfs03/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
> security.selinux=0x73797374656d5f753a6f626a6563745f723a66696c655f743a733000
> trusted.afr.dirty=0x000000000000000000000000
> trusted.afr.sr_vol01-client-40=0x000000000000000000000000
> trusted.afr.sr_vol01-client-41=0x000000000000000000000000
> trusted.gfid=0xaefd184508414a8f8408f1ab8aa7a417
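Note that the trusted.gfid is identical on every copy, so they all 
resolve to the same hard link under each brick's .glusterfs tree. A 
sketch of that standard mapping (first two hex byte-pairs become the two 
directory levels); nothing here is specific to your volume:

```shell
# Sketch: turn a trusted.gfid value (as printed by getfattr -e hex) into
# the brick-relative .glusterfs hard-link path: .glusterfs/<b1>/<b2>/<uuid>.
gfid_path() {
    hex=${1#0x}
    b1=$(printf '%s' "$hex" | cut -c1-2)
    b2=$(printf '%s' "$hex" | cut -c3-4)
    # reinsert the dashes of the canonical UUID form
    uuid=$(printf '%s' "$hex" | \
        sed -E 's/^(.{8})(.{4})(.{4})(.{4})(.{12})$/\1-\2-\3-\4-\5/')
    printf '.glusterfs/%s/%s/%s\n' "$b1" "$b2" "$uuid"
}

gfid_path 0xaefd184508414a8f8408f1ab8aa7a417
# .glusterfs/ae/fd/aefd1845-0841-4a8f-8408-f1ab8aa7a417
```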
>
>
>
> Seen from a client via a glusterfs mount:
> [root at client ~]# ls -al 
> /mnt/glusterfs/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/300*
> -rw-r--r--. 1 root root 0 Feb 18 00:51 
> /mnt/glusterfs/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
> -rw-r--r--. 1 root root 0 Feb 18 00:51 
> /mnt/glusterfs/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
> -rw-r--r--. 1 root root 0 Feb 18 00:51 
> /mnt/glusterfs/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
>
>
>
> Via NFS (just after unmounting and remounting the volume):
> [root at client ~]# ls -al 
> /mnt/nfs/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/300*
> -rw-r--r--. 1 root root 44332659200 Feb 17 23:55 
> /mnt/test/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
> -rw-r--r--. 1 root root 44332659200 Feb 17 23:55 
> /mnt/test/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
> -rw-r--r--. 1 root root 44332659200 Feb 17 23:55 
> /mnt/test/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
>
> Doing the same list a couple of seconds later:
> [root at client ~]# ls -al /mnt/nfs/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/300*
> -rw-r--r--. 1 root root 0 Feb 18 00:51 
> /mnt/test/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
> -rw-r--r--. 1 root root 0 Feb 18 00:51 
> /mnt/test/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
> -rw-r--r--. 1 root root 0 Feb 18 00:51 
> /mnt/test/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
> And again, and again, and again:
> [root at client ~]# ls -al /mnt/nfs/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/300*
> -rw-r--r--. 1 root root 0 Feb 18 00:51 
> /mnt/test/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
> -rw-r--r--. 1 root root 0 Feb 18 00:51 
> /mnt/test/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
> -rw-r--r--. 1 root root 0 Feb 18 00:51 
> /mnt/test/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
>
> This really seems odd. Why do we see the real data files only once?
>
> It seems more and more that this crazy file duplication (and writing 
> of sticky-bit files) was actually triggered by rebooting one of the 
> three nodes while there was still an active NFS connection (even with 
> no data exchange at all), since all 0-byte files (of the non-sticky-bit 
> type) were created at either 00:51 or 00:41, the exact moments at 
> which nodes in the cluster were rebooted. This would mean that 
> replication with GlusterFS currently provides hardly any redundancy. 
> Quite the opposite: if one of the machines goes down, all of your data 
> becomes seriously disorganised. I am busy configuring a test 
> installation to see how this can best be reproduced for a bug 
> report.
>
> Does anyone have a suggestion for how to best get rid of the 
> duplicates, or rather get this mess organised the way it should be?
> This is a cluster with millions of files. A rebalance does not fix the 
> issue, nor does a rebalance fix-layout. Since this is a replicated 
> volume, all files should be there twice, not three times. Can I safely 
> just remove all the 0-byte files outside of the .glusterfs directory, 
> including the sticky-bit files?
>
> The empty 0-byte files outside of .glusterfs on every brick I can 
> probably safely remove like this:
> find /export/* -path '*/.glusterfs' -prune -o -type f -size 0 -perm 1000 
> -exec rm {} \;
> no?
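Before running anything like that with -exec rm, preview the hit list 
with -print. A sketch on a scratch directory (illustrative names, not 
your bricks) showing that the prune protects .glusterfs and that only 
sticky-bit zero-byte files match:

```shell
# Dry-run sketch: build a scratch "brick", then preview exactly what the
# proposed find would delete before swapping -print for -exec rm {} \;
tmp=$(mktemp -d)
mkdir -p "$tmp/brick1/.glusterfs/ae/fd"
touch "$tmp/brick1/.glusterfs/ae/fd/gfid-entry"       # must survive the prune
touch "$tmp/brick1/stale.vhd" && chmod 1000 "$tmp/brick1/stale.vhd"  # sticky, 0-byte
echo data > "$tmp/brick1/real.vhd"                    # non-empty, untouched

# Only stale.vhd should be listed: .glusterfs is pruned, real.vhd has data.
find "$tmp"/* -path '*/.glusterfs' -prune -o \
     -type f -size 0 -perm 1000 -print
```

Note that -perm 1000 without a leading - or / matches the mode exactly, 
which is what you want here: linkfiles are mode 1000 and nothing else.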
>
> Thanks!
>
> Cheers,
> Olav
> On 18/02/15 22:10, Olav Peeters wrote:
>> Thanks Tom and Joe,
>> for the fast response!
>>
>> Before I started my upgrade I stopped all clients using the volume 
>> and stopped all VMs with VHDs on the volume, but I guess, and this 
>> may be the missing piece for reproducing this in a lab, I did not 
>> detach the NFS shared storage mount from the XenServer pool to this 
>> volume, since that is an extremely risky business. I also did not 
>> stop the volume. That, I guess, was a bit stupid, but since I had 
>> done upgrades this way in the past without any issues I skipped this 
>> step (a really bad habit). I'll make amends and file a proper bug 
>> report :-). I agree with you, Joe, this should never happen, even 
>> when someone ignores the advice to stop the volume. If it were also 
>> necessary to detach shared storage NFS connections to a volume, then 
>> frankly, GlusterFS would be unusable in a private cloud. No one can 
>> afford downtime of the whole infrastructure just for a GlusterFS 
>> upgrade. Ideally a replicated gluster volume should even be able to 
>> remain online and in use during (at least a minor version) upgrade.
>>
>> I don't know whether a heal was maybe busy when I started the 
>> upgrade. I forgot to check. I did check the CPU activity on the 
>> gluster nodes, which was very low (in the 0.0X range via top), so I 
>> doubt it. I will add this to the bug report as a suggestion should 
>> they not be able to reproduce it with an open NFS connection.
>>
>> By the way, is it sufficient to do:
>> service glusterd stop
>> service glusterfsd stop
>> and then do a:
>> ps aux | grep gluster
>> to see if everything has stopped, and kill any leftovers should this 
>> be necessary?
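A pgrep-based sketch of that leftover-process check (assumes pgrep from 
procps; exact-name matching so it cannot false-positive on the grep 
itself or on unrelated command lines):

```shell
# Sketch: verify the gluster daemons are really gone after stopping the
# services. pgrep -x matches the process name exactly.
check_gluster_stopped() {
    leftover=0
    for p in glusterd glusterfsd glusterfs; do
        if pgrep -x "$p" >/dev/null 2>&1; then
            echo "$p is still running"
            leftover=1
        fi
    done
    [ "$leftover" -eq 0 ] && echo "all gluster daemons stopped"
}

check_gluster_stopped
```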
>>
>> For the fix, do you agree that if I run e.g.:
>> find /export/* -type f -size 0 -perm 1000 -exec /bin/rm {} \;
>> on every node, if /export is the location of all my bricks, this 
>> will be safe, also in a replicated set-up?
>> No needed 0-byte files will be deleted, e.g. in the .glusterfs 
>> directory of every brick?
>>
>> Thanks for your support!
>>
>> Cheers,
>> Olav
>>
>>
>>
>>
>>
>> On 18/02/15 20:51, Joe Julian wrote:
>>>
>>> On 02/18/2015 11:43 AM, tbenzvi at 3vgeomatics.com wrote:
>>>> Hi Olav,
>>>>
>>>> I have a hunch that our problem was caused by improper unmounting 
>>>> of the gluster volume, and have since found that the proper order 
>>>> should be: kill all jobs using volume -> unmount volume on clients 
>>>> -> gluster volume stop -> stop gluster service (if necessary)
>>>> In my case, I wrote a Python script to find duplicate files on the 
>>>> mounted volume, then delete the corresponding link files on the 
>>>> bricks (making sure to also delete files in the .glusterfs directory)
>>>> However, your find command was also suggested to me and I think 
>>>> it's a simpler solution. I believe removing all link files (even 
>>>> ones that are not causing duplicates) is fine, since on the next 
>>>> file access gluster will do a lookup on all bricks and recreate 
>>>> any link files if necessary. Hopefully a gluster expert can chime 
>>>> in on this point as I'm not completely sure.
>>>
>>> You are correct.
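Incidentally, the duplicate-finding step needs nothing Python-specific: 
names that appear more than once in a single directory listing fall out 
of sort | uniq -d. A sketch on simulated input (on a live system the 
input would be the listing of the glusterfs mount):

```shell
# Sketch: find names that occur more than once in a directory listing.
# The listing is simulated here; feed it real `ls` output in practice.
printf '%s\n' a.vhd b.vhd a.vhd c.vhd b.vhd | sort | uniq -d
# a.vhd
# b.vhd
```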
>>>
>>>> Keep in mind your setup is somewhat different than mine as I have 
>>>> only 5 bricks with no replication.
>>>> Regards,
>>>> Tom
>>>>
>>>>     --------- Original Message ---------
>>>>     Subject: Re: [Gluster-users] Hundreds of duplicate files
>>>>     From: "Olav Peeters" <opeeters at gmail.com>
>>>>     Date: 2/18/15 10:52 am
>>>>     To: gluster-users at gluster.org, tbenzvi at 3vgeomatics.com
>>>>
>>>>     Hi all,
>>>>     I have this problem after upgrading from 3.5.3 to 3.6.2.
>>>>     At the moment I am still waiting for a heal to finish (on a
>>>>     31TB volume with 42 bricks, replicated over three nodes).
>>>>
>>>>     Tom,
>>>>     how did you remove the duplicates?
>>>>     With 42 bricks I will not be able to do this manually...
>>>>     Did a:
>>>>     find $brick_root -type f -size 0 -perm 1000 -exec /bin/rm {} \;
>>>>     work for you?
>>>>
>>>>     Should this type of thing ideally not be checked and mended by
>>>>     a heal?
>>>>
>>>>     Does anyone have an idea yet how this happens in the first
>>>>     place? Can it be connected to upgrading?
>>>>
>>>>     Cheers,
>>>>     Olav
>>>>
>>>>       
>>>>
>>>>     On 01/01/15 03:07, tbenzvi at 3vgeomatics.com wrote:
>>>>
>>>>         No, the files can be read on a newly mounted client! I went
>>>>         ahead and deleted all of the link files associated with
>>>>         these duplicates, and then remounted the volume. The
>>>>         problem is fixed!
>>>>         Thanks again for the help, Joe and Vijay.
>>>>         Tom
>>>>
>>>>             --------- Original Message ---------
>>>>             Subject: Re: [Gluster-users] Hundreds of duplicate files
>>>>             From: "Vijay Bellur" <vbellur at redhat.com>
>>>>             Date: 12/28/14 3:23 am
>>>>             To: tbenzvi at 3vgeomatics.com, gluster-users at gluster.org
>>>>
>>>>             On 12/28/2014 01:20 PM, tbenzvi at 3vgeomatics.com wrote:
>>>>             > Hi Vijay,
>>>>             > Yes the files are still readable from the .glusterfs
>>>>             path.
>>>>             > There is no explicit error. However, trying to read a
>>>>             text file in
>>>>             > python simply gives me null characters:
>>>>             >
>>>>             > >>> open('ott_mf_itab').readlines()
>>>>             >
>>>>             ['\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00']
>>>>             >
>>>>             > And reading binary files does the same
>>>>             >
>>>>
>>>>             Is this behavior seen with a freshly mounted client too?
>>>>
>>>>             -Vijay
>>>>
>>>>             > --------- Original Message ---------
>>>>             > Subject: Re: [Gluster-users] Hundreds of duplicate files
>>>>             > From: "Vijay Bellur" <vbellur at redhat.com>
>>>>             > Date: 12/27/14 9:57 pm
>>>>             > To: tbenzvi at 3vgeomatics.com, gluster-users at gluster.org
>>>>             >
>>>>             > On 12/28/2014 10:13 AM, tbenzvi at 3vgeomatics.com wrote:
>>>>             > > Thanks Joe, I've read your blog post as well as
>>>>             your post
>>>>             > regarding the
>>>>             > > .glusterfs directory.
>>>>             > > I found some unneeded duplicate files which were
>>>>             not being read
>>>>             > > properly. I then deleted the link file from the
>>>>             brick. This always
>>>>             > > removes the duplicate file from the listing, but
>>>>             the file does not
>>>>             > > always become readable. If I also delete the
>>>>             associated file in the
>>>>             > > .glusterfs directory on that brick, then some more
>>>>             files become
>>>>             > > readable. However this solution still doesn't work
>>>>             for all files.
>>>>             > > I know the file on the brick is not corrupt as it
>>>>             can be read
>>>>             > directly
>>>>             > > from the brick directory.
>>>>             >
>>>>             > For files that are not readable from the client, can
>>>>             you check if the
>>>>             > file is readable from the .glusterfs/ path?
>>>>             >
>>>>             > What is the specific error that is seen while trying
>>>>             to read one such
>>>>             > file from the client?
>>>>             >
>>>>             > Thanks,
>>>>             > Vijay
>>>>             >
>>>>             >
>>>>             >
>>>>             > _______________________________________________
>>>>             > Gluster-users mailing list
>>>>             > Gluster-users at gluster.org
>>>>             > http://www.gluster.org/mailman/listinfo/gluster-users
>>>>             >
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>
>


