[Gluster-users] Weird file and file truncation problems
Ling Ho
ling at slac.stanford.edu
Thu Jun 14 19:26:18 UTC 2012
I have a file on a brick with a weird permission mode. I thought the "T"
only appears on zero-length pointer files.
-r--r-s--T+ 2 psdatmgr ps-data 98780901596 Jan 18 15:06
e141-r0001-s02-c01.xtc
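If it helps, I can dump the brick-side extended attributes with something
like the following (I'm assuming the DHT link-to marker, if any, would show
up there, but I haven't verified the exact attribute name):

getfattr -d -m . -e hex /brick3/cxi/cxi43212/xtc/e141-r0001-s02-c01.xtc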
lsof shows it being held open read/write by the glusterfs process
for the brick.
glusterfs 11336 root 55u REG 8,48
98780901596 4133 /brick3/cxi/cxi43212/xtc/e141-r0001-s02-c01.xtc
The file was written back in January, and I don't believe any client
process has opened the file in write mode.
And the glusterfsd process is consuming ~600% CPU (out of 8 cores with
hyperthreading).
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
11336 root 20 0 3603m 130m 2512 S 579.1 0.3 8340:01 glusterfsd
I started a rebalance job earlier yesterday, but later stopped it.
Does this have something to do with rebalancing?
I would like to restart glusterd on this machine, but is there a way to
tell if any of the files on this server are open? I ran gluster
volume top, but I can't tell whether the files it shows are currently open. I
don't see this e141-r0001-s02-c01.xtc file in the top output.
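For reference, what I ran was along these lines (the volume name and brick
path are placeholders here, and I'm not sure whether the open counters mean
currently-open fds or cumulative opens), plus a plain lsof against the brick
filesystem, assuming /brick3 is the brick mount point:

gluster volume top <volname> open brick <this-host>:<brick-path> list-cnt 20
lsof /brick3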
A possibly related, and more troubling, problem I am having is with file
truncation.
In the past 6 months running 3.2.5 and 3.2.6, we have seen cases of file
truncation after unplanned power outages to our storage devices. I
suspected the write cache on our RAID controllers, since in those cases it
took us 2-3 days before we could get power restored and bring the disks
back up, and the battery on the RAID controllers would not have lasted that long.
Unfortunately, in the last few days, after proper machine shutdowns and
reboots, we discovered file truncations again. We run a script that records
the sizes of our data files once a day, and the next day we found that the
sizes of some of the files had been reduced. Some of the smaller files
became zero length. These are files we write once and never write again;
they are only opened for reading.
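For what it's worth, the daily check amounts to little more than a size
listing and a next-day comparison, something along these lines (paths here
are illustrative):

find /data/cxi -type f -printf '%s\t%p\n' | sort -k2 > /var/tmp/sizes.$(date +%F)
diff /var/tmp/sizes.$(date -d yesterday +%F) /var/tmp/sizes.$(date +%F)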
The troubling thing is, these files weren't freshly written when the
machines were rebooted; they were 2-3 days old. And yesterday I found that
one file from the same batch as the above-mentioned
e141-r0001-s02-c01.xtc, also written in January, had become truncated.
Yesterday, I looked at about 10 truncated files from the same brick and
examined them using xfs_bmap. They all appeared to use a single extent.
So I looked at the original untruncated files from the source, figured out
the correct length, and simply rebuilt new files based on the location
returned by xfs_bmap and the correct length from the original files
(something like this: dd if=/dev/sdh of=/root/newfile bs=512
skip=39696059392 count=83889655). It turns out that what I extracted was
identical to the original file. So the data was indeed written to disk,
and not merely stored in a cache somewhere. However, the file size had
mysteriously changed.
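For reference, the skip value above is just the first block of the extent
reported by xfs_bmap -v (which prints block ranges in 512-byte sectors),
and the count is the original copy's size in sectors. File names below are
placeholders:

xfs_bmap -v /brick3/cxi/cxi43212/xtc/<truncated-file>   # first block of BLOCK-RANGE -> dd skip=
stat -c %s /path/to/original/<truncated-file>           # size in bytes / 512, rounded up -> dd count=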
I know this could be an XFS problem. I would appreciate any suggestions
on how to reproduce it and what to look for. I have a test machine with a
similar but smaller amount of disk space, but I have not been able to
reproduce the problem there.
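What I have tried there so far amounts to the obvious
write-sync-reboot-and-compare cycle, roughly (mount point and file names
are placeholders):

for i in $(seq 1 200); do dd if=/dev/urandom of=/mnt/testvol/f.$i bs=1M count=512; done
md5sum /mnt/testvol/f.* > /root/md5.before
sync
reboot
# after the machine and volume come back:
md5sum -c /root/md5.before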
I have upgraded to 3.3.0. The latest round of file truncation happened
after I had upgraded, and after I stopped the gluster volume and rebooted
the machine cleanly.
I am running RHEL6.2, kernel version 2.6.32-220.17.1.el6.x86_64.
Thanks,
...
ling