[Gluster-devel] Disconnections and Corruption Under High Load

Thu Jan 7 10:44:43 UTC 2010

Anand Avati wrote:
> On Tue, Jan 5, 2010 at 5:31 PM, Gordan Bobic <gordan at bobich.net> wrote:
>> I've noticed a very high incidence of the problem I reported a while back,
>> that manifests itself in open files getting corrupted on commit, possibly
>> during conditions that involve server disconnections due to timeouts (very
>> high disk load). Specifically, I've noticed that my .viminfo file got
>> corrupted for the 3rd time today. Since this is root's .viminfo, and I'm
>> running glfs as root, I don't have the logs to verify the disconnections,
>> though. From what I can tell, a chunk of a dll somehow ends up in .viminfo,
>> but I'm not sure which one.
> 
> Can you describe the sequence of events? What kind of IO was being
> performed from all clients involved? Was vi opened (on the same file?)
> from multiple clients? Was some other kind of IO (rsync?) being
> performed on another client at the same time?

The client I/O was relatively lightweight - normal desktop use, web 
browser and mail reader open, a bunch of gnome-terminal windows. vi 
wasn't opened on any of the files that were open; the only things I was 
editing at the time were fstab and the gluster volume spec files. There 
is only one actual client machine (the one on my desk), if we don't 
count the AFR servers (which are also each other's clients).

The load/slowness on the system was caused purely by the disks being 
slow to respond due to the RAID check all the nodes were doing.

It's all very heisenbuggy: I've seen it happen multiple times, but there 
doesn't appear to be a reliably reproducible set of circumstances that 
causes it.

>> On a different volume, I'm seeing other weirdness under the same high disk
>> load conditions (software RAID check/resync on all server nodes). This seems
>> to be specifically related to using writebehind+iocache on the client-side
>> on one of the servers, exported via unfsd (the one from the gluster ftp
>> site). What happens is that the /home volume simply seems to disappear
>> underneath unfsd! The attached log indicates a glusterfsd crash.
>>
>> This doesn't happen if I remove the writebehind and io-cache translators.
>>
>> Other notable things about the setup that might help figure out the cause of
>> this:
>>
>> - The other two servers are idle - they are not serving any requests. They
>> are, however, also under the same high disk load.
>>
>> - writebehind and io-cache are only applied on one server, the one being used
>> to export via unfsd. The other servers do not have those translators
>> applied. The volume config is attached. It is called home-cache.vol, but
>> this is the same file the log file refers to even though it is listed there
>> as home.vol.
>>
>> The problem specifically occurs when servers are undergoing high load of the
>> described nature that causes disk latencies to go up massively. I have not
>> observed any instances of a similar crash happening without the writebehind
>> and io-cache translators.
> 
> Can you send us a backtrace of the core from gdb (command: "thread
> apply all bt full")?

Will do.
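
For reference, a minimal sketch of pulling that backtrace non-interactively, 
so the output can be attached as a file. The helper name dump_backtrace and 
the paths in the example call are placeholders of mine, not taken from this 
setup; the only part that comes from the request above is the gdb command 
itself.

```shell
# dump_backtrace BIN CORE OUT
# Runs gdb in batch mode against a core file and writes a full backtrace
# of every thread to OUT. BIN must be the binary that produced the core.
dump_backtrace() {
    bin=$1
    core=$2
    out=$3
    if [ ! -r "$bin" ] || [ ! -r "$core" ]; then
        echo "cannot read $bin or $core" >&2
        return 1
    fi
    # -batch exits after running the -ex command instead of dropping
    # into the interactive prompt.
    gdb -batch -ex "thread apply all bt full" "$bin" "$core" > "$out" 2>&1
}

# Example call; both paths are placeholders and will differ per install:
# dump_backtrace /usr/sbin/glusterfsd /path/to/core glusterfsd-bt.txt
```

The trace is only readable if the binary still has its debugging symbols; 
a stripped build will show mostly addresses.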

Gordan