[Gluster-users] Failed rebalance - lost files, inaccessible files, permission issues
Shawn Heisey
gluster at elyograg.org
Wed Nov 13 01:25:03 UTC 2013
On 11/9/2013 2:39 AM, Shawn Heisey wrote:
> They are from the same log file - the one that I put on my dropbox
> account and linked in the original message. They are consecutive log
> entries.
Further info from our developer who is looking deeper into these problems:
------------
Ouch. I know why the rebalance stopped. The host simply ran out of
memory. From the messages file:
Nov 2 21:55:30 slc01dfs001a kernel: VFS: file-max limit 2438308 reached
Nov 2 21:55:31 slc01dfs001a kernel: automount invoked oom-killer: gfp_mask=0xd0, order=1, oom_adj=0, oom_score_adj=0
Nov 2 21:55:31 slc01dfs001a kernel: automount cpuset=/ mems_allowed=0
Nov 2 21:55:31 slc01dfs001a kernel: Pid: 2810, comm: automount Not tainted 2.6.32-358.2.1.el6.centos.plus.x86_64 #1
That "file-max limit" line actually goes back to the beginning of Nov.
2, and happened on all four hosts. It is caused by a file descriptor
leak that was fixed in 3.3.2:
https://bugzilla.redhat.com/show_bug.cgi?id=928631
This is unconnected to the file corruption/loss, which started much
earlier. I'm still trying to understand this part. I noticed that
three of the hosts reported successful rebalancing on the same day we
started losing files. I am not sure how rebalancing was distributed
among the hosts, or whether the load on the other hosts was enough to
keep things stable until they stopped.
------------
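For anyone who wants to watch for the file-max condition described in
the note above, here is a minimal Python sketch. It assumes a Linux
host where /proc/sys/fs/file-nr holds three numbers (allocated handles,
allocated-but-unused handles, and the fs.file-max limit). The function
names and the 90% warning threshold are my own illustrative choices,
not anything from Gluster.

```python
from pathlib import Path

# Linux-only: the kernel exposes system-wide file handle counters here.
FILE_NR = Path("/proc/sys/fs/file-nr")
WARN_FRACTION = 0.9  # hypothetical threshold, chosen for illustration


def parse_file_nr(text):
    """Return (allocated, unused, maximum) parsed from file-nr contents."""
    allocated, unused, maximum = (int(field) for field in text.split())
    return allocated, unused, maximum


def handle_usage(text):
    """Return the fraction of fs.file-max currently in use."""
    allocated, unused, maximum = parse_file_nr(text)
    return (allocated - unused) / maximum


if __name__ == "__main__" and FILE_NR.exists():
    usage = handle_usage(FILE_NR.read_text())
    print(f"file handles in use: {usage:.1%} of fs.file-max")
    if usage >= WARN_FRACTION:
        print("warning: approaching the file-max limit")
```

Run from cron on each brick host, something like this might have
flagged the leak before the limit was actually reached.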
I gather that we should be on at least 3.3.2, but I suspect that a
number of other bugs might be a problem unless we go to 3.4.1. The
rebalance status output is below. Every host except "localhost" showed
"completed" in this status a very short time after I started the
rebalance. The "localhost" line continued to increment until the
rebalance died four days after starting.
[root@slc01dfs001a ~]# gluster volume rebalance mdfs status
Node          Rebalanced-files    size   scanned  failures     status
------------  ----------------  ------  --------  --------  ---------
localhost              1121514   1.5TB   9020514   1777661     failed
slc01nas1                    0  0Bytes  13638699         0  completed
slc01dfs002a                 0  0Bytes  13638699         1  completed
slc01dfs001b                 0  0Bytes  13638699         0  completed
slc01dfs002b                 0  0Bytes  13638700         0  completed
slc01nas2                    0  0Bytes  13638699         0  completed
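To pick out the problem nodes from status output like this without
reading it by eye, a small Python sketch follows. It assumes one row
per node with six whitespace-separated columns (node, rebalanced files,
size, scanned, failures, status), which is how the output above reads
once unwrapped; other Gluster versions may lay it out differently. The
function names are hypothetical, and the sample text is copied from the
output in this message.

```python
SAMPLE = """\
Node Rebalanced-files size scanned failures status
--------- ----------- ----------- ----------- ----------- ------------
localhost 1121514 1.5TB 9020514 1777661 failed
slc01nas1 0 0Bytes 13638699 0 completed
slc01dfs002a 0 0Bytes 13638699 1 completed
slc01dfs001b 0 0Bytes 13638699 0 completed
slc01dfs002b 0 0Bytes 13638700 0 completed
slc01nas2 0 0Bytes 13638699 0 completed
"""


def parse_rebalance_status(text):
    """Parse status rows into dicts, skipping the header and rule lines."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows have six fields with numeric counters in columns 2, 4, 5.
        if len(fields) != 6:
            continue
        node, files, size, scanned, failures, status = fields
        if not (files.isdigit() and scanned.isdigit() and failures.isdigit()):
            continue
        rows.append({
            "node": node,
            "files": int(files),
            "size": size,
            "scanned": int(scanned),
            "failures": int(failures),
            "status": status,
        })
    return rows


def problem_nodes(rows):
    """Return nodes that reported failures or did not complete."""
    return [r["node"] for r in rows
            if r["failures"] > 0 or r["status"] != "completed"]


if __name__ == "__main__":
    for node in problem_nodes(parse_rebalance_status(SAMPLE)):
        print(node)
```

On the sample above this lists localhost and slc01dfs002a, the two
nodes with a nonzero failure count.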
Thanks,
Shawn