[Gluster-devel] [Gluster-users] missing files

David F. Robinson david.robinson at corvidtec.com
Thu Feb 5 22:03:56 UTC 2015


It was a mix of files, from very small to very large, and many terabytes of data; approximately 20 TB in total.

David  (Sent from mobile)

===============================
David F. Robinson, Ph.D. 
President - Corvid Technologies
704.799.6944 x101 [office]
704.252.1310      [cell]
704.799.7974      [fax]
David.Robinson at corvidtec.com
http://www.corvidtechnologies.com

> On Feb 5, 2015, at 4:55 PM, Ben Turner <bturner at redhat.com> wrote:
> 
> ----- Original Message -----
>> From: "Pranith Kumar Karampuri" <pkarampu at redhat.com>
>> To: "Xavier Hernandez" <xhernandez at datalab.es>, "David F. Robinson" <david.robinson at corvidtec.com>, "Benjamin Turner"
>> <bennyturns at gmail.com>
>> Cc: gluster-users at gluster.org, "Gluster Devel" <gluster-devel at gluster.org>
>> Sent: Thursday, February 5, 2015 5:30:04 AM
>> Subject: Re: [Gluster-users] [Gluster-devel] missing files
>> 
>> 
>>> On 02/05/2015 03:48 PM, Pranith Kumar Karampuri wrote:
>>> I believe David already fixed this. I hope this is the same permissions
>>> issue he told us about.
>> Oops, it is not. I will take a look.
> 
> Yes David, exactly like these:
> 
> data-brick02a-homegfs.log:[2015-02-03 19:09:34.568842] I [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection from gfs02a.corvidtec.com-18563-2015/02/03-19:07:58:519134-homegfs-client-2-0-0
> data-brick02a-homegfs.log:[2015-02-03 19:09:41.286551] I [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection from gfs01a.corvidtec.com-12804-2015/02/03-19:09:38:497808-homegfs-client-2-0-0
> data-brick02a-homegfs.log:[2015-02-03 19:16:35.906412] I [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection from gfs02b.corvidtec.com-27190-2015/02/03-19:15:53:458467-homegfs-client-2-0-0
> data-brick02a-homegfs.log:[2015-02-03 19:51:22.761293] I [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection from gfs01a.corvidtec.com-25926-2015/02/03-19:51:02:89070-homegfs-client-2-0-0
> data-brick02a-homegfs.log:[2015-02-03 20:54:02.772180] I [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection from gfs01b.corvidtec.com-4175-2015/02/02-16:44:31:179119-homegfs-client-2-0-1
> 
> You can 100% verify my theory if you can correlate the times of the disconnects with the times at which the missing files were healed.  Can you have a look at /var/log/glusterfs/glustershd.log?  It records every healed file with a timestamp. If we can see a disconnect during the rsync and a self-heal of a missing file, I think we can safely assume that the disconnects caused this.  I'll try to reproduce it on my test systems.  How much data did you rsync, roughly what size were the files, and what did the directory layout look like?
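> 
> A rough sketch of that correlation, assuming the default /var/log/glusterfs layout and the message formats shown above (the glustershd.log wording varies a bit by version):
> 
>     # timestamps of client disconnects seen by the bricks
>     grep -h 'disconnecting connection' /var/log/glusterfs/bricks/*.log | awk '{print $1, $2}' | sort > /tmp/disconnects.txt
>     # timestamps of self-heal activity
>     grep -hi heal /var/log/glusterfs/glustershd.log | awk '{print $1, $2}' | sort > /tmp/heals.txt
>     # heals landing shortly after a disconnect support the theory
>     sdiff /tmp/disconnects.txt /tmp/heals.txt | less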
> 
> @Pranith - Could bricks flapping up and down during the rsync be the cause here? That is, the files were written to one subvolume but not the other (because it was down), so they were missing on the first ls; that ls triggered self-heal, and that's why the files were there for the second ls.
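> 
> If it helps, a quick way to list anything still pending heal (a sketch; 'homegfs' is the volume name from the info pasted below) is:
> 
>     gluster volume heal homegfs info
> 
> Empty output for every brick would mean the missing files have already been healed.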
> 
> -b
> 
> 
>> Pranith
>>> 
>>> Pranith
>>>> On 02/05/2015 03:44 PM, Xavier Hernandez wrote:
>>>> Is the failure repeatable? With the same directories?
>>>> 
>>>> It's very weird that the directories appear on the volume after you do
>>>> an 'ls' on the bricks. Could it be that you only did a single 'ls'
>>>> on the fuse mount, which did not show the directory? Is it possible
>>>> that this 'ls' triggered a self-heal that repaired the problem,
>>>> whatever it was, so that when you did another 'ls' on the fuse mount
>>>> after the 'ls' on the bricks, the directories were there?
>>>> 
>>>> The first 'ls' could have healed the files, causing the following
>>>> 'ls' on the bricks to show the files as if nothing had been damaged.
>>>> If that's the case, it's possible that there were some disconnections
>>>> during the copy.
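>>>> 
>>>> If it happens again, one way to check (before any 'ls' on the mount)
>>>> would be to look at the AFR changelog xattrs directly on the bricks;
>>>> e.g., for a suspect file (the path here is just an example):
>>>> 
>>>>     getfattr -d -m trusted.afr -e hex /data/brick01a/homegfs/path/to/file
>>>> 
>>>> Non-zero trusted.afr.homegfs-client-* values would mean pending
>>>> operations that self-heal still has to replay on the other replica.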
>>>> 
>>>> Adding Pranith because he knows the replication and self-heal details better.
>>>> 
>>>> Xavi
>>>> 
>>>>> On 02/04/2015 07:23 PM, David F. Robinson wrote:
>>>>> Distributed/replicated
>>>>> 
>>>>> Volume Name: homegfs
>>>>> Type: Distributed-Replicate
>>>>> Volume ID: 1e32672a-f1b7-4b58-ba94-58c085e59071
>>>>> Status: Started
>>>>> Number of Bricks: 4 x 2 = 8
>>>>> Transport-type: tcp
>>>>> Bricks:
>>>>> Brick1: gfsib01a.corvidtec.com:/data/brick01a/homegfs
>>>>> Brick2: gfsib01b.corvidtec.com:/data/brick01b/homegfs
>>>>> Brick3: gfsib01a.corvidtec.com:/data/brick02a/homegfs
>>>>> Brick4: gfsib01b.corvidtec.com:/data/brick02b/homegfs
>>>>> Brick5: gfsib02a.corvidtec.com:/data/brick01a/homegfs
>>>>> Brick6: gfsib02b.corvidtec.com:/data/brick01b/homegfs
>>>>> Brick7: gfsib02a.corvidtec.com:/data/brick02a/homegfs
>>>>> Brick8: gfsib02b.corvidtec.com:/data/brick02b/homegfs
>>>>> Options Reconfigured:
>>>>> performance.io-thread-count: 32
>>>>> performance.cache-size: 128MB
>>>>> performance.write-behind-window-size: 128MB
>>>>> server.allow-insecure: on
>>>>> network.ping-timeout: 10
>>>>> storage.owner-gid: 100
>>>>> geo-replication.indexing: off
>>>>> geo-replication.ignore-pid-check: on
>>>>> changelog.changelog: on
>>>>> changelog.fsync-interval: 3
>>>>> changelog.rollover-time: 15
>>>>> server.manage-gids: on
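>>>>> 
>>>>> (Worth noting: network.ping-timeout is set to 10 here, well below the
>>>>> 42-second default, so short network hiccups are enough for clients to
>>>>> declare a brick down; that is exactly the kind of flapping suspected
>>>>> in this thread. If that theory holds, restoring the default is one
>>>>> command:)
>>>>> 
>>>>>     gluster volume set homegfs network.ping-timeout 42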
>>>>> 
>>>>> 
>>>>> ------ Original Message ------
>>>>> From: "Xavier Hernandez" <xhernandez at datalab.es>
>>>>> To: "David F. Robinson" <david.robinson at corvidtec.com>; "Benjamin
>>>>> Turner" <bennyturns at gmail.com>
>>>>> Cc: "gluster-users at gluster.org" <gluster-users at gluster.org>; "Gluster
>>>>> Devel" <gluster-devel at gluster.org>
>>>>> Sent: 2/4/2015 6:03:45 AM
>>>>> Subject: Re: [Gluster-devel] missing files
>>>>> 
>>>>>>> On 02/04/2015 01:30 AM, David F. Robinson wrote:
>>>>>>> Sorry, I thought about this a little more; I should have been clearer.
>>>>>>> The files were on both bricks of the replica, not just one side. So
>>>>>>> both bricks had to have been up... The files/directories just don't
>>>>>>> show up on the mount.
>>>>>>> I was reading and saw a related bug
>>>>>>> (https://bugzilla.redhat.com/show_bug.cgi?id=1159484) where it was
>>>>>>> suggested to run:
>>>>>>>         find <mount> -d -exec getfattr -h -n trusted.ec.heal {} \;
>>>>>> 
>>>>>> This command is specific for a dispersed volume. It won't do anything
>>>>>> (aside from the error you are seeing) on a replicated volume.
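>>>>>> 
>>>>>> For a replicated volume, the rough equivalent would be to ask the
>>>>>> self-heal daemon for a full sweep, or just to list what is pending:
>>>>>> 
>>>>>>     gluster volume heal homegfs full
>>>>>>     gluster volume heal homegfs info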
>>>>>> 
>>>>>> I think you are using a replicated volume, right?
>>>>>> 
>>>>>> In this case I'm not sure what can be happening. Is your volume a pure
>>>>>> replicated one or a distributed-replicated one? On a pure replicated
>>>>>> volume it doesn't make sense that some entries do not show up in an
>>>>>> 'ls' when the file is in both replicas (at least not without any error
>>>>>> message in the logs). On a distributed-replicated volume it could be
>>>>>> caused by some problem while combining the contents of each replica set.
>>>>>> 
>>>>>> What's the configuration of your volume ?
>>>>>> 
>>>>>> Xavi
>>>>>> 
>>>>>>> 
>>>>>>> I get a bunch of errors for "operation not supported":
>>>>>>> [root at gfs02a homegfs]# find wks_backup -d -exec getfattr -h -n trusted.ec.heal {} \;
>>>>>>> find: warning: the -d option is deprecated; please use -depth instead, because the latter is a POSIX-compliant feature.
>>>>>>> wks_backup/homer_backup/backup: trusted.ec.heal: Operation not supported
>>>>>>> wks_backup/homer_backup/logs/2014_05_20.log: trusted.ec.heal: Operation not supported
>>>>>>> wks_backup/homer_backup/logs/2014_05_21.log: trusted.ec.heal: Operation not supported
>>>>>>> wks_backup/homer_backup/logs/2014_05_18.log: trusted.ec.heal: Operation not supported
>>>>>>> wks_backup/homer_backup/logs/2014_05_19.log: trusted.ec.heal: Operation not supported
>>>>>>> wks_backup/homer_backup/logs/2014_05_22.log: trusted.ec.heal: Operation not supported
>>>>>>> wks_backup/homer_backup/logs: trusted.ec.heal: Operation not supported
>>>>>>> wks_backup/homer_backup: trusted.ec.heal: Operation not supported
>>>>>>> ------ Original Message ------
>>>>>>> From: "Benjamin Turner" <bennyturns at gmail.com>
>>>>>>> To: "David F. Robinson" <david.robinson at corvidtec.com>
>>>>>>> Cc: "Gluster Devel" <gluster-devel at gluster.org>; "gluster-users at gluster.org" <gluster-users at gluster.org>
>>>>>>> Sent: 2/3/2015 7:12:34 PM
>>>>>>> Subject: Re: [Gluster-devel] missing files
>>>>>>>> It sounds to me like the files were only copied to one replica,
>>>>>>>> weren't there for the initial ls (which triggered a self-heal),
>>>>>>>> and were there for the last ls because they had been healed. Is
>>>>>>>> there any chance that one of the replicas was down during the
>>>>>>>> rsync? It could be that you lost a brick during the copy or
>>>>>>>> something like that. To confirm, I would look for disconnects in
>>>>>>>> the brick logs as well as checking glustershd.log to verify that
>>>>>>>> the missing files were actually healed.
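>>>>>>>> 
>>>>>>>> Something along these lines should surface both, assuming the
>>>>>>>> default /var/log/glusterfs layout:
>>>>>>>> 
>>>>>>>>     grep -i disconnect /var/log/glusterfs/bricks/*.log
>>>>>>>>     grep -i heal /var/log/glusterfs/glustershd.log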
>>>>>>>> 
>>>>>>>> -b
>>>>>>>> 
>>>>>>>> On Tue, Feb 3, 2015 at 5:37 PM, David F. Robinson
>>>>>>>> <david.robinson at corvidtec.com> wrote:
>>>>>>>> 
>>>>>>>>    I rsync'd 20 TB over to my gluster system and noticed that I had
>>>>>>>>    some directories missing even though the rsync completed normally.
>>>>>>>>    The rsync logs showed that the missing files were transferred.
>>>>>>>>    I went to the bricks and did an 'ls -al
>>>>>>>>    /data/brick*/homegfs/dir/*', and the files were on the bricks.
>>>>>>>>    After I did this 'ls', the files then showed up on the FUSE mounts.
>>>>>>>>    1) Why are the files hidden on the FUSE mount?
>>>>>>>>    2) Why does the ls make them show up on the FUSE mount?
>>>>>>>>    3) How can I prevent this from happening again?
>>>>>>>>    Note, I also mounted the gluster volume using NFS and saw the same
>>>>>>>>    behavior. The files/directories were not shown until I did the
>>>>>>>>    "ls" on the bricks.
>>>>>>>>    David
>>>>>>>>    ===============================
>>>>>>>>    David F. Robinson, Ph.D.
>>>>>>>>    President - Corvid Technologies
>>>>>>>>    704.799.6944 x101 [office]
>>>>>>>>    704.252.1310 [cell]
>>>>>>>>    704.799.7974 [fax]
>>>>>>>>    David.Robinson at corvidtec.com
>>>>>>>>    http://www.corvidtechnologies.com
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>> 
>> 
>> 

