[Gluster-users] Hundreds of duplicate files
Joe Julian
joe at julianfamily.org
Fri Feb 20 20:51:13 UTC 2015
On 02/20/2015 12:21 PM, Olav Peeters wrote:
> Let's take one file (3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd) as an
> example...
> On the 3 nodes where all bricks are formatted as XFS and mounted in
> /export and 272b2366-dfbf-ad47-2a0f-5d5cc40863e3 is the mounting point
> of a NFS shared storage connection from XenServer machines:
Did I just read this correctly? Your bricks are NFS mounts? i.e.,
GlusterFS Client <-> GlusterFS Server <-> NFS <-> XFS
>
> [root at gluster01 ~]# find
> /export/*/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/ -name '300*' -exec ls
> -la {} \;
> -rw-r--r--. 2 root root 44332659200 Feb 17 23:55
> /export/brick13gfs01/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
Presumably, this is the actual file.
> -rw-r--r--. 2 root root 0 Feb 18 00:51
> /export/brick14gfs01/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
This is not a linkfile. Note it's mode 0644. How it got there with those
permissions would be a matter of history and would require information
that's probably lost.
>
> root at gluster02 ~]# find
> /export/*/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/ -name '300*' -exec ls
> -la {} \;
> -rw-r--r--. 2 root root 44332659200 Feb 17 23:55
> /export/brick13gfs02/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
>
> [root at gluster03 ~]# find
> /export/*/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/ -name '300*' -exec ls
> -la {} \;
> -rw-r--r--. 2 root root 44332659200 Feb 17 23:55
> /export/brick13gfs03/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
> -rw-r--r--. 2 root root 0 Feb 18 00:51
> /export/brick14gfs03/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
Same analysis as above.
>
> 3 files with data, plus 2 x a 0-byte file with the same name
>
> Checking the 0-byte files:
> [root at gluster01 ~]# getfattr -m . -d -e hex
> /export/brick14gfs01/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
> getfattr: Removing leading '/' from absolute path names
> # file:
> export/brick14gfs01/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
> security.selinux=0x73797374656d5f753a6f626a6563745f723a66696c655f743a733000
> trusted.afr.dirty=0x000000000000000000000000
> trusted.afr.sr_vol01-client-34=0x000000000000000000000000
> trusted.afr.sr_vol01-client-35=0x000000000000000000000000
> trusted.gfid=0xaefd184508414a8f8408f1ab8aa7a417
>
> [root at gluster03 ~]# getfattr -m . -d -e hex
> /export/brick14gfs03/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
> getfattr: Removing leading '/' from absolute path names
> # file:
> export/brick14gfs03/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
> security.selinux=0x73797374656d5f753a6f626a6563745f723a66696c655f743a733000
> trusted.afr.dirty=0x000000000000000000000000
> trusted.afr.sr_vol01-client-34=0x000000000000000000000000
> trusted.afr.sr_vol01-client-35=0x000000000000000000000000
> trusted.gfid=0xaefd184508414a8f8408f1ab8aa7a417
>
> This is not a glusterfs link file since there is no
> "trusted.glusterfs.dht.linkto", am I correct?
You are correct.
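For anyone checking their own bricks, here is a quick classification sketch (the helper name and paths are examples, not from Olav's setup). A true DHT linkfile is zero bytes, mode 1000 (---------T), and carries the trusted.glusterfs.dht.linkto xattr; the stale 0-byte files above fail at least one of those checks:

```shell
# Sketch: decide whether a brick-side file is a true DHT linkfile.
# Run against a brick path, never the client mount.
is_linkfile() {
    f="$1"
    # must be exactly mode 1000 (sticky bit only, no rwx)
    [ "$(stat -c '%a' "$f")" = "1000" ] || { echo "not a linkfile: $f"; return 1; }
    # must be zero bytes
    [ "$(stat -c '%s' "$f")" = "0" ] || { echo "not a linkfile: $f"; return 1; }
    # and must carry the dht.linkto xattr
    if getfattr -n trusted.glusterfs.dht.linkto "$f" >/dev/null 2>&1; then
        echo "linkfile: $f"
    else
        echo "not a linkfile: $f"
    fi
}
```

By this test, Olav's mode-0644 zero-byte files are not linkfiles, which matches the missing linkto xattr in his getfattr output.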
>
> And checking the "good" files:
>
> # file:
> export/brick13gfs01/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
> security.selinux=0x756e636f6e66696e65645f753a6f626a6563745f723a66696c655f743a733000
> trusted.afr.dirty=0x000000000000000000000000
> trusted.afr.sr_vol01-client-32=0x000000000000000000000000
> trusted.afr.sr_vol01-client-33=0x000000000000000000000000
> trusted.afr.sr_vol01-client-34=0x000000000000000000000000
> trusted.afr.sr_vol01-client-35=0x000000010000000100000000
> trusted.gfid=0xaefd184508414a8f8408f1ab8aa7a417
>
> [root at gluster02 ~]# getfattr -m . -d -e hex
> /export/brick13gfs02/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
> getfattr: Removing leading '/' from absolute path names
> # file:
> export/brick13gfs02/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
> security.selinux=0x73797374656d5f753a6f626a6563745f723a66696c655f743a733000
> trusted.afr.dirty=0x000000000000000000000000
> trusted.afr.sr_vol01-client-32=0x000000000000000000000000
> trusted.afr.sr_vol01-client-33=0x000000000000000000000000
> trusted.gfid=0xaefd184508414a8f8408f1ab8aa7a417
>
> [root at gluster03 ~]# getfattr -m . -d -e hex
> /export/brick13gfs03/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
> getfattr: Removing leading '/' from absolute path names
> # file:
> export/brick13gfs03/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
> security.selinux=0x73797374656d5f753a6f626a6563745f723a66696c655f743a733000
> trusted.afr.dirty=0x000000000000000000000000
> trusted.afr.sr_vol01-client-40=0x000000000000000000000000
> trusted.afr.sr_vol01-client-41=0x000000000000000000000000
> trusted.gfid=0xaefd184508414a8f8408f1ab8aa7a417
>
>
>
> Seen from a client via a glusterfs mount:
> [root at client ~]# ls -al
> /mnt/glusterfs/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/300*
> -rw-r--r--. 1 root root 0 Feb 18 00:51
> /mnt/glusterfs/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
> -rw-r--r--. 1 root root 0 Feb 18 00:51
> /mnt/glusterfs/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
> -rw-r--r--. 1 root root 0 Feb 18 00:51
> /mnt/glusterfs/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
>
>
>
> Via NFS (just after unmounting and remounting the volume):
> [root at client ~]# ls -al
> /mnt/nfs/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/300*
> -rw-r--r--. 1 root root 44332659200 Feb 17 23:55
> /mnt/test/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
> -rw-r--r--. 1 root root 44332659200 Feb 17 23:55
> /mnt/test/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
> -rw-r--r--. 1 root root 44332659200 Feb 17 23:55
> /mnt/test/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
>
> Doing the same list a couple of seconds later:
> [root at client ~]# ls -al /mnt/nfs/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/300*
> -rw-r--r--. 1 root root 0 Feb 18 00:51
> /mnt/test/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
> -rw-r--r--. 1 root root 0 Feb 18 00:51
> /mnt/test/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
> -rw-r--r--. 1 root root 0 Feb 18 00:51
> /mnt/test/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
> And again, and again, and again:
> [root at client ~]# ls -al /mnt/nfs/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/300*
> -rw-r--r--. 1 root root 0 Feb 18 00:51
> /mnt/test/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
> -rw-r--r--. 1 root root 0 Feb 18 00:51
> /mnt/test/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
> -rw-r--r--. 1 root root 0 Feb 18 00:51
> /mnt/test/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
>
> This really seems odd. Why do we get to see the real data file only once?
>
> It seems more and more that this crazy file duplication (and writing
> of sticky-bit files) was actually triggered by rebooting one of the
> three nodes while there was still an active NFS connection (even with
> no data exchange at all), since all 0-byte files (of the non-sticky-bit
> type) were created at either 00:51 or 00:41, the exact moments when
> one of the three nodes in the cluster was rebooted. This would mean
> that replication with GlusterFS currently creates hardly any
> redundancy. Quite the opposite: if one of the machines goes down, all
> of your data gets seriously disorganised. I am busy configuring a
> test installation to see how this can best be reproduced for a bug
> report..
>
> Does anyone have a suggestion how best to get rid of the duplicates,
> or rather get this mess organised the way it should be?
> This is a cluster with millions of files. A rebalance does not fix the
> issue, and neither does a rebalance fix-layout. Since this is a
> replicated volume, all files should be there 2x, not 3x. Can I safely
> just remove all the 0-byte files outside of the .glusterfs directory,
> including the sticky-bit files?
>
> The empty 0-byte files outside of .glusterfs on every brick I can
> probably safely remove like this:
> find /export/* -path '*/.glusterfs' -prune -o -type f -size 0 -perm 1000
> -exec rm {} \;
> no?
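A safer variant of that command is to list first and delete only after reviewing the candidates; this sketch (the function name and brick path are examples) also quotes the -path glob so the shell cannot expand it:

```shell
# Sketch: list zero-byte, mode-1000 files under a brick root,
# skipping everything inside .glusterfs. Review the output
# before deleting anything.
list_stale_linkfiles() {
    brick="$1"
    find "$brick" -path '*/.glusterfs' -prune -o \
        -type f -size 0 -perm 1000 -print
}

# After reviewing the list, delete -- e.g.:
#   list_stale_linkfiles /export/brick13gfs01 | xargs -d '\n' rm -f
```

Note that -perm 1000 only matches the sticky-bit files; the mode-0644 zero-byte copies shown above would need separate, more careful handling, since they share the data file's gfid.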
>
> Thanks!
>
> Cheers,
> Olav
> On 18/02/15 22:10, Olav Peeters wrote:
>> Thanks Tom and Joe,
>> for the fast response!
>>
>> Before I started my upgrade I stopped all clients using the volume
>> and stopped all VMs with VHDs on the volume, but I guess (and this
>> may be the missing piece needed to reproduce this in a lab) I did not
>> detach the NFS shared storage mount from the XenServer pool to this
>> volume, since this is an extremely risky business. I also did not
>> stop the volume. This, I guess, was a bit stupid, but since I had
>> done upgrades this way in the past without any issues I skipped this
>> step (a really bad habit). I'll make amends and file a proper bug
>> report :-). I agree with you Joe, this should never happen, even when
>> someone ignores the advice to stop the volume. If it were also
>> necessary to detach shared-storage NFS connections to a volume, then
>> frankly, GlusterFS would be unusable in a private cloud. No one can
>> afford downtime of the whole infrastructure just for a GlusterFS
>> upgrade. Ideally a replicated gluster volume should even be able to
>> remain online and in use during (at least a minor-version) upgrade.
>>
>> I don't know whether a heal was perhaps still running when I started
>> the upgrade. I forgot to check. I did check the CPU activity on the
>> gluster nodes, which was very low (in the 0.0X range via top), so I
>> doubt it. I will add this to the bug report as a suggestion in case
>> they cannot reproduce it with an open NFS connection.
>>
>> By the way, is it sufficient to do:
>> service glusterd stop
>> service glusterfsd stop
>> and then a:
>> ps aux | grep gluster
>> to see whether everything has stopped, and kill any leftovers should
>> this be necessary?
>>
>> For the fix, do you agree that if I run e.g.:
>> find /export/* -type f -size 0 -perm 1000 -exec /bin/rm {} \;
>> on every node, where /export is the location of all my bricks, this
>> will be safe, also in a replicated set-up?
>> No needed 0-byte files will be deleted, e.g. in the .glusterfs of
>> every brick?
>>
>> Thanks for your support!
>>
>> Cheers,
>> Olav
>>
>>
>>
>>
>>
>> On 18/02/15 20:51, Joe Julian wrote:
>>>
>>> On 02/18/2015 11:43 AM, tbenzvi at 3vgeomatics.com wrote:
>>>> Hi Olav,
>>>>
>>>> I have a hunch that our problem was caused by improper unmounting
>>>> of the gluster volume, and have since found that the proper order
>>>> should be: kill all jobs using volume -> unmount volume on clients
>>>> -> gluster volume stop -> stop gluster service (if necessary)
>>>> In my case, I wrote a Python script to find duplicate files on the
>>>> mounted volume, then delete the corresponding link files on the
>>>> bricks (making sure to also delete files in the .glusterfs directory)
>>>> However, your find command was also suggested to me and I think
>>>> it's a simpler solution. I believe removing all link files (even
>>>> ones that are not causing duplicates) is fine, since on the next
>>>> file access gluster will do a lookup on all bricks and recreate any
>>>> link files if necessary. Hopefully a gluster expert can chime in on
>>>> this point as I'm not completely sure.
>>>
>>> You are correct.
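The client-side symptom is also easy to spot without a script, since a duplicate shows up as a repeated name within a single directory listing -- something a healthy POSIX directory never produces. A minimal sketch (the mount path in the comment is an example):

```shell
# Sketch: report names that appear more than once in one directory
# listing on the gluster mount. Any output is a duplicate to
# investigate on the bricks.
dup_names() {
    ls -a "$1" | sort | uniq -d
}

# e.g.: dup_names /mnt/glusterfs/272b2366-dfbf-ad47-2a0f-5d5cc40863e3
```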
>>>
>>>> Keep in mind your setup is somewhat different than mine as I have
>>>> only 5 bricks with no replication.
>>>> Regards,
>>>> Tom
>>>>
>>>> --------- Original Message ---------
>>>> Subject: Re: [Gluster-users] Hundreds of duplicate files
>>>> From: "Olav Peeters" <opeeters at gmail.com>
>>>> Date: 2/18/15 10:52 am
>>>> To: gluster-users at gluster.org, tbenzvi at 3vgeomatics.com
>>>>
>>>> Hi all,
>>>> I have this problem after upgrading from 3.5.3 to 3.6.2.
>>>> At the moment I am still waiting for a heal to finish (on a
>>>> 31TB volume with 42 bricks, replicated over three nodes).
>>>>
>>>> Tom,
>>>> how did you remove the duplicates?
>>>> with 42 bricks I will not be able to do this manually..
>>>> Did a:
>>>> find $brick_root -type f -size 0 -perm 1000 -exec /bin/rm {} \;
>>>> work for you?
>>>>
>>>> Should this type of thing ideally not be checked and mended by
>>>> a heal?
>>>>
>>>> Does anyone have an idea yet how this happens in the first
>>>> place? Can it be connected to upgrading?
>>>>
>>>> Cheers,
>>>> Olav
>>>>
>>>>
>>>>
>>>> On 01/01/15 03:07, tbenzvi at 3vgeomatics.com wrote:
>>>>
>>>> No, the files can be read on a newly mounted client! I went
>>>> ahead and deleted all of the link files associated with
>>>> these duplicates, and then remounted the volume. The
>>>> problem is fixed!
>>>> Thanks again for the help, Joe and Vijay.
>>>> Tom
>>>>
>>>> --------- Original Message ---------
>>>> Subject: Re: [Gluster-users] Hundreds of duplicate files
>>>> From: "Vijay Bellur" <vbellur at redhat.com>
>>>> Date: 12/28/14 3:23 am
>>>> To: tbenzvi at 3vgeomatics.com, gluster-users at gluster.org
>>>>
>>>> On 12/28/2014 01:20 PM, tbenzvi at 3vgeomatics.com wrote:
>>>> > Hi Vijay,
>>>> > Yes the files are still readable from the .glusterfs
>>>> path.
>>>> > There is no explicit error. However, trying to read a
>>>> text file in
>>>> > python simply gives me null characters:
>>>> >
>>>> > >>> open('ott_mf_itab').readlines()
>>>> >
>>>> ['\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00']
>>>> >
>>>> > And reading binary files does the same
>>>> >
>>>>
>>>> Is this behavior seen with a freshly mounted client too?
>>>>
>>>> -Vijay
>>>>
>>>> > --------- Original Message ---------
>>>> > Subject: Re: [Gluster-users] Hundreds of duplicate files
>>>> > From: "Vijay Bellur" <vbellur at redhat.com>
>>>> > Date: 12/27/14 9:57 pm
>>>> > To: tbenzvi at 3vgeomatics.com, gluster-users at gluster.org
>>>> >
>>>> > On 12/28/2014 10:13 AM, tbenzvi at 3vgeomatics.com wrote:
>>>> > > Thanks Joe, I've read your blog post as well as
>>>> your post
>>>> > regarding the
>>>> > > .glusterfs directory.
>>>> > > I found some unneeded duplicate files which were
>>>> not being read
>>>> > > properly. I then deleted the link file from the
>>>> brick. This always
>>>> > > removes the duplicate file from the listing, but
>>>> the file does not
>>>> > > always become readable. If I also delete the
>>>> associated file in the
>>>> > > .glusterfs directory on that brick, then some more
>>>> files become
>>>> > > readable. However this solution still doesn't work
>>>> for all files.
>>>> > > I know the file on the brick is not corrupt as it
>>>> can be read
>>>> > directly
>>>> > > from the brick directory.
>>>> >
>>>> > For files that are not readable from the client, can
>>>> you check if the
>>>> > file is readable from the .glusterfs/ path?
>>>> >
>>>> > What is the specific error that is seen while trying
>>>> to read one such
>>>> > file from the client?
>>>> >
>>>> > Thanks,
>>>> > Vijay
>>>> >
>>>> >
>>>> >
>>>> > _______________________________________________
>>>> > Gluster-users mailing list
>>>> > Gluster-users at gluster.org
>>>> > http://www.gluster.org/mailman/listinfo/gluster-users
>>>> >
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>
>