[Gluster-devel] 1.3.0pre2 (eureka: memory-leak!)

Tue Mar 6 19:29:29 UTC 2007

Thanks Brent.

Others,
The setup has two server machines, each with 16 volumes of
type "storage/posix". Each volume on the first node has an AFR
mirror on the other node. The problem is seen most easily when
performance translators are used.

Krishna

> I did a quick test, converting my storage into one giant stripe, just to
> see what would happen.  It, too, would die after a while.  Just watching it
> with top, I realized that glusterfs's memory consumption was growing
> rapidly (at the same rate it was reading data with dd) until it
> probably couldn't allocate any more RAM and died.
>
> Wondering if this might account for what I was experiencing on my
> mirroring configuration (in this case, with all performance translators
> except for io-threads, which dies immediately), sure enough:
>
>    PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>   7622 root      25   0 1768m 1.7g  776 R  100 44.9   8:37.00 glusterfs
>
> It appears that something triggers a memory leak when reading lots of
> data.  Once it starts, it keeps growing until it can't get any more memory
> and dies.
>
> I don't know if this would account for io-threads, though, which causes
> glusterfs to die instantly...
>
> Thanks,
>
> Brent
>
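
One rough way to confirm the growth described above is to log the resident memory of the glusterfs client while the dd read runs, rather than watching top by hand. The following is only a sketch: sample_rss and its arguments are names made up for this example, and it assumes a ps that accepts "-o rss=" and "-p" (procps on Linux does).

```shell
#!/bin/sh
# Sample the resident set size (RSS, in KiB as reported by ps) of one process.
# Hypothetical helper, not part of GlusterFS.
# Usage: sample_rss PID [INTERVAL_SECONDS] [SAMPLES]
# Example (assumes pidof is available): sample_rss "$(pidof glusterfs)" 5 60
sample_rss() {
    pid=$1
    interval=${2:-5}
    samples=${3:-12}
    i=0
    while [ "$i" -lt "$samples" ]; do
        # ps prints nothing and exits non-zero once the process is gone.
        rss=$(ps -o rss= -p "$pid") || { echo "process $pid is gone" >&2; return 1; }
        # Unquoted on purpose: drops the padding ps puts around the number.
        rss=$(echo $rss)
        echo "$(date +%H:%M:%S) pid=$pid rss_kib=$rss"
        i=$((i + 1))
        # Skip the sleep after the final sample.
        [ "$i" -lt "$samples" ] && sleep "$interval"
    done
    return 0
}
```

If the rss_kib column climbs at roughly the dd read rate until the process disappears, that matches the leak pattern reported here.
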
> On Mon, 5 Mar 2007, Brent A Nelson wrote:
>
>> I should have mentioned that my dds are like this:
>> dd if=/dev/zero of=/phys/blah4 bs=10M count=1024
>> and this:
>> dd if=/phys/blah4 of=/dev/null bs=10M
>>
>> Both nodes are either writing or reading at the same time.
>>
>> When applying all performance translators, the glusterfs process dies
>> quickly with the least bit of activity.  Without io-threads (but with all
>> others), dds that are writing succeed but reading will knock out glusterfs
>> or the dd may hang (and I also got incomplete du output once during the
>> writing).  I tried glusterfs -s thebe /phys -l DEBUG -N, but nothing was
>> reported in either situation before a segfault.  I'll have to recompile
>> with debugging to get more info.
>>
>> Thanks,
>>
>> Brent
>>
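
For anyone who wants to try the same write-then-read pattern at a smaller scale first, the two dd commands above can be wrapped roughly as follows. This is a sketch under assumptions: dd_cycle and its arguments are invented for this example, the defaults simply mirror the quoted commands, and the 10M suffix relies on GNU dd. It drives a single node only; the original test ran the same thing on both nodes at once.

```shell
#!/bin/sh
# Write a file of zeros into DIR, then read it back, as in the quoted dd runs.
# Hypothetical helper; the file name blah4 is taken from the commands above.
# Usage: dd_cycle DIR [BLOCK_SIZE] [COUNT]
dd_cycle() {
    dir=$1
    bs=${2:-10M}
    count=${3:-1024}
    file="$dir/blah4"
    # Write phase: COUNT blocks of BLOCK_SIZE bytes of zeros.
    dd if=/dev/zero of="$file" bs="$bs" count="$count" 2>/dev/null \
        || { echo "write failed" >&2; return 1; }
    # Read phase: stream the whole file back and discard it.
    dd if="$file" of=/dev/null bs="$bs" 2>/dev/null \
        || { echo "read failed" >&2; return 1; }
    echo "ok: wrote and read back $file"
}
```

Running "dd_cycle /phys" repeats the full 10GB test, while something like "dd_cycle /phys 1M 64" gives a quick check that the mount is still alive.
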
>> On Mon, 5 Mar 2007, Brent A Nelson wrote:
>>
>>> Attached are my spec files.  Below are some details to start with;
>>> sorry I don't have something more coherent, yet:
>>>
>>> I have 2 servers, each with 16 disks shared out individually.  The
>>> disks from one node are mirrors of the other node.  My clients are the
>>> same machines.  I test by running a 10GB dd read or write on one node
>>> while doing the same thing on the other node.  Then, if the filesystem
>>> is still running, I may throw in a du of a 200MB copy of /usr that is on
>>> the GlusterFS at the same time the dd processes are running to see how
>>> metadata is handled while things are busy.
>>>
>>> pre2.2 does not help.  I've been tinkering over the weekend, and it
>>> seems that the client stays alive when I don't use any performance
>>> translators, although I still get the error below.  Without performance
>>> translators, the error seems to result in an abnormal "du", where the du
>>> complains that it can't find a few directories (a second du, under the
>>> same circumstances, may work just fine, though), but the dd processes
>>> succeed.  With performance translators, I get other breakage, either the
>>> dd processes hang forever or the glusterfs processes die outright.
>>> Glusterfsd has never died on me.
>>>
>>> Here are some other types of errors I can get with the performance
>>> translators (alas, I can't tell you which translators cause which
>>> errors):
>>>
>>> glusterfs:
>>> [Mar 04 00:53:35] [ERROR/common-utils.c:107/full_rwv()] libglusterfs:full_rwv: 73996 bytes r/w instead of 74151 (Bad address)
>>> [Mar 04 00:53:35] [ERROR/client-protocol.c:183/client_protocol_xfer()] protocol/client: client_protocol_xfer: :transport_submit failed
>>> [Mar 04 17:22:27] [ERROR/tcp.c:38/tcp_recieve()] ERROR:../../../../transport/tcp/tcp.c: tcp_recieve: ((buf) == NULL) is true
>>> [Mar 04 17:22:27] [ERROR/tcp.c:38/tcp_recieve()] ERROR:../../../../transport/tcp/tcp.c: tcp_recieve: ((buf) == NULL) is true
>>>
>>> The errors above correspond to the usual glusterfsd error:
>>>
>>> glusterfsd:
>>> [Mar 04 00:52:33] [ERROR/common-utils.c:52/full_rw()] libglusterfs:full_rw: 0 bytes r/w instead of 113
>>> [Mar 04 00:53:35] [ERROR/common-utils.c:52/full_rw()] libglusterfs:full_rw: 0 bytes r/w instead of 113
>>> [Mar 04 00:53:35] [ERROR/common-utils.c:52/full_rw()] libglusterfs:full_rw: 0 bytes r/w instead of 113
>>> [Mar 04 00:53:35] [ERROR/common-utils.c:52/full_rw()] libglusterfs:full_rw: 0 bytes r/w instead of 113
>>> [Mar 04 00:53:35] [ERROR/common-utils.c:52/full_rw()] libglusterfs:full_rw: 0 bytes r/w instead of 113
>>> [Mar 04 00:53:35] [ERROR/common-utils.c:52/full_rw()] libglusterfs:full_rw: 0 bytes r/w instead of 113
>>> [Mar 04 00:53:35] [ERROR/common-utils.c:52/full_rw()] libglusterfs:full_rw: 0 bytes r/w instead of 113
>>> [Mar 04 00:53:35] [ERROR/common-utils.c:52/full_rw()] libglusterfs:full_rw: 0 bytes r/w instead of 113
>>> [Mar 04 00:53:35] [ERROR/common-utils.c:52/full_rw()] libglusterfs:full_rw: 0 bytes r/w instead of 113
>>> [Mar 04 00:53:35] [ERROR/common-utils.c:52/full_rw()] libglusterfs:full_rw: 0 bytes r/w instead of 113
>>> [Mar 04 00:53:35] [ERROR/common-utils.c:52/full_rw()] libglusterfs:full_rw: 0 bytes r/w instead of 113
>>> [Mar 04 00:53:35] [ERROR/common-utils.c:52/full_rw()] libglusterfs:full_rw: 0 bytes r/w instead of 113
>>> [Mar 04 00:53:35] [ERROR/common-utils.c:52/full_rw()] libglusterfs:full_rw: 0 bytes r/w instead of 113
>>> [Mar 04 00:53:35] [ERROR/common-utils.c:52/full_rw()] libglusterfs:full_rw: 0 bytes r/w instead of 113
>>> [Mar 04 00:53:35] [ERROR/common-utils.c:52/full_rw()] libglusterfs:full_rw: 0 bytes r/w instead of 113
>>> [Mar 04 00:53:35] [ERROR/common-utils.c:52/full_rw()] libglusterfs:full_rw: 0 bytes r/w instead of 113
>>> [Mar 04 00:53:35] [ERROR/common-utils.c:52/full_rw()] libglusterfs:full_rw: 0 bytes r/w instead of 113
>>> [Mar 04 17:22:12] [ERROR/common-utils.c:52/full_rw()] libglusterfs:full_rw: 0 bytes r/w instead of 113
>>> [Mar 04 17:22:27] [ERROR/common-utils.c:52/full_rw()] libglusterfs:full_rw: 0 bytes r/w instead of 113
>>>
>>> Additional errors I've seen from glusterfsd (which are seen in
>>> conjunction with error messages from glusterfs in tcp_recieve, just
>>> like above):
>>>
>>> [Mar 04 18:46:32] [ERROR/common-utils.c:107/full_rwv()] libglusterfs:full_rwv: 65328 bytes r/w instead of 65744 (Connection reset by peer)
>>> [Mar 04 18:46:32] [ERROR/proto-srv.c:117/generic_reply()] protocol/server:generic_reply: transport_writev failed
>>> [Mar 04 20:05:18] [ERROR/common-utils.c:107/full_rwv()] libglusterfs:full_rwv: 28376 bytes r/w instead of 65744 (Connection reset by peer)
>>> [Mar 04 20:05:18] [ERROR/proto-srv.c:117/generic_reply()] protocol/server:generic_reply: transport_writev failed
>>> [Mar 04 20:05:30] [ERROR/common-utils.c:107/full_rwv()] libglusterfs:full_rwv: 65326 bytes r/w instead of 65746 (Connection reset by peer)
>>> [Mar 04 20:05:30] [ERROR/proto-srv.c:117/generic_reply()] protocol/server:generic_reply: transport_writev failed
>>>
>>> Everything seems related to the error I mentioned originally, with or
>>> without performance translators (0 bytes r/w instead of 113), though.
>>>
>>> Thanks,
>>>
>>> Brent
>>>
>>>
>>> On Mon, 5 Mar 2007, Krishna Srinivas wrote:
>>>
>>>> Hi Brent,
>>>> Can you help us get to the root cause of the problem?
>>>> It will be of great help.
>>>> Thanks
>>>> Krishna
>>>>
>>>> On 3/3/07, Anand Avati <avati at zresearch.com> wrote:
>>>>> Brent,
>>>>>   first off, thank you for trying glusterfs. Can you give a few more
>>>>>   details -
>>>>>
>>>>>   * is the log from server or client?
>>>>>   * the log message from the other one as well.
>>>>>   * if possible a backtrace from the core of the one which died.
>>>>>
>>>>>   Can you also tell us what I/O pattern caused the crash? Was
>>>>>   it heavy I/O on a single file? Creation of a lot of files? Metadata
>>>>>   operations? And is it possible to reproduce it consistently with
>>>>>   some steps?
>>>>>
>>>>>   Also, we recently uploaded the pre2-1 release tarball. It had a couple
>>>>>   of bug fixes, but I need your answers to say whether the fixes apply
>>>>>   to you as well.
>>>>>
>>>>>   Please attach your spec files as well.
>>>>>
>>>>>   regards,
>>>>>   avati
>>>>>
>>>>>   On Fri, Mar 02, 2007 at 04:05:17PM -0500, Brent A Nelson wrote:
>>>>> > So, I compiled 1.3.0pre2 as soon as it came out (nice, trouble-free
>>>>> > standard configure and make), and I found it very easy to set up a
>>>>> > GlusterFS with one node mirroring 16 disks to another, all optimizers
>>>>> > loaded.
>>>>> >
>>>>> > However, it isn't stable under load.  I get errors like the following
>>>>> > and glusterfs exits:
>>>>> >
>>>>> > [Mar 02 14:23:29] [ERROR/common-utils.c:52/full_rw()] libglusterfs:full_rw: 0 bytes r/w instead of 113
>>>>> >
>>>>> > I thought it might be because I was using the stock fuse module with my
>>>>> > kernel, but I replaced it with the 2.6.3 fuse module and it still dies
>>>>> > in this way.
>>>>> >
>>>>> > Is this a bug or just that my setup is poor (one node serves 16
>>>>> > individual shares through a single glusterfsd, the mirror node does the
>>>>> > same, and the servers are also acting as my test clients) or that I'm
>>>>> > not using the deadline scheduler (yet) or...?
>>>>> >
>>>>> > Thanks,
>>>>> >
>>>>> > Brent
>>>>> >
>>>>> >
>>>>> > _______________________________________________
>>>>> > Gluster-devel mailing list
>>>>> > Gluster-devel at nongnu.org
>>>>> > http://lists.nongnu.org/mailman/listinfo/gluster-devel
>>>>> >
>>>>>
>>>>> --
>>>>> Shaw's Principle:
>>>>>         Build a system that even a fool can use,
>>>>>         and only a fool will want to use it.
>>>>>
>>>>>