[Gluster-devel] Another segfault on client side (only sporadic)

Amar S. Tumballi amar at zresearch.com
Wed Aug 29 12:32:07 UTC 2007


Hi Bernhard, Krishna,
 There were three issues which caused all these segfaults in afr. One of them
was in the fuse-bridge code, where inode handling was a problem. The other two
were in unify. All of these should be fixed in patch-469.

Bernhard, can you test with the latest tla checkout and confirm that all these
bugs are fixed?

-amar

On 8/24/07, Bernhard J. M. Grün <bernhard.gruen at googlemail.com> wrote:
>
> Hi Krishna,
>
> here is your requested information. The following is the information
> from the first mail:
> #0  0x00002aaaaacbc2bd in afr_stat (frame=0x2aaabce32cb0,
>     this=<value optimized out>, loc=0x2aaaac0fe168) at afr.c:2602
> 2602    afr.c: No such file or directory.
>         in afr.c
> (gdb) p *loc
> $1 = {path = 0x2aaaaf21f000 "/imagecache/galerie/4197/thumbnail/419775.jpg",
>   ino = 1744179, inode = 0x2aaab237c360}
> (gdb) p *loc->inode
> $2 = {lock = 1, table = 0x60c590, nlookup = 1, generation = 0, ref = 2,
>   ino = 1744179, st_mode = 33188, fds = {next = 0x2aaab237c38c,
>     prev = 0x2aaab237c38c}, ctx = 0x0, dentry = {inode_list = {
>       next = 0x2aaab237c3a4, prev = 0x2aaab237c3a4}, name_hash = {
>       next = 0x2aaab93809a4, prev = 0x2aaac4483bf4}, inode = 0x2aaab237c360,
>     name = 0x2aaab1533820 "419775.jpg", parent = 0x2aaab4dc0c90},
>   inode_hash = {next = 0x2aaab65afb5c, prev = 0x2aaaac1a4fdc}, list = {
>     next = 0x2aaabada476c, prev = 0x60c5f0}}
>
>
> Now here is the information from the two crashes of the later mail.
> The first crash:
> Program terminated with signal 11, Segmentation fault.
> #0  0x00002aaaaacbc2bd in afr_stat (frame=0x2aaab0831c80,
>     this=<value optimized out>, loc=0x2aaaed03abf8) at afr.c:2602
> 2602    afr.c: No such file or directory.
>         in afr.c
> (gdb) p *loc
> $1 = {path = 0x2aaabf51f790 "/imagecache/galerie/4482/thumbnail/448221.jpg",
>   ino = 1879050, inode = 0x2aab132f86b0}
> (gdb) p *loc->inode
> $2 = {lock = 1, table = 0x60c590, nlookup = 1, generation = 0, ref = 2,
>   ino = 1879050, st_mode = 33188, fds = {next = 0x2aab132f86dc,
>     prev = 0x2aab132f86dc}, ctx = 0x0, dentry = {inode_list = {
>       next = 0x2aab132f86f4, prev = 0x2aab132f86f4}, name_hash = {
>       next = 0x2aaacb980464, prev = 0x2aaaab567dd0}, inode = 0x2aab132f86b0,
>     name = 0x2aab04c6b030 "448221.jpg", parent = 0x2aaabfe97300},
>   inode_hash = {next = 0x2aaaaca36d0c, prev = 0x2aaaab523fe0}, list = {
>     next = 0x2aaad4ba01dc, prev = 0x60c5f0}}
>
> The second crash:
> Program terminated with signal 11, Segmentation fault.
> #0  0x00002aaaaacbc2bd in afr_stat (frame=0x2aab10d94830,
>     this=<value optimized out>, loc=0x2aab1105bd68) at afr.c:2602
> 2602    afr.c: No such file or directory.
>         in afr.c
> (gdb) p *loc
> $1 = {path = 0x2aab1208aa00 "", ino = 39422758, inode = 0x2aaae69be260}
> (gdb) p *loc->inode
> $2 = {lock = 1, table = 0x60c590, nlookup = 1, generation = 0, ref = 3,
>   ino = 39422758, st_mode = 16877, fds = {next = 0x2aaae69be28c,
>     prev = 0x2aaae69be28c}, ctx = 0x0, dentry = {inode_list = {
>       next = 0x2aaae69be2a4, prev = 0x2aaae69be2a4}, name_hash = {
>       next = 0x2aaae69be2b4, prev = 0x2aaae69be2b4}, inode = 0x2aaae69be260,
>     name = 0x0, parent = 0x0}, inode_hash = {next = 0x15a5c7c,
>     prev = 0x2aaaab51a130}, list = {next = 0x2aaac0fa2fac,
>     prev = 0x2aaab7d5ad3c}}
>
> We can also meet in a chat if you like; I think that would speed up
> debugging. Just let me know when and where we can meet.
>
> Bernhard
>
> 2007/8/24, Krishna Srinivas <krishna at zresearch.com>:
> > Bernhard,
> >
> > Can you do "p *loc" and "p *loc->inode"
> >
> > Thanks
> > Krishna
> >
> > On 8/24/07, Bernhard J. M. Grün <bernhard.gruen at googlemail.com> wrote:
> > > Hi Krishna,
> > >
> > > Unfortunately I can't give you access to our production systems. At
> > > least not at the moment.
> > > What I can do is give you the compiled version of glusterfs, details of
> > > the system (Ubuntu 7.04 x86-64), and the core dumps.
> > >
> > > But I have two new back traces for you. They are from the second
> > > glusterfs client but the binaries of boths clients are the same:
> > > First back trace:
> > > Core was generated by `[glusterfs]
> > >                               '.
> > > Program terminated with signal 11, Segmentation fault.
> > > #0  0x00002aaaaacbc2bd in afr_stat (frame=0x2aaab0831c80,
> > >     this=<value optimized out>, loc=0x2aaaed03abf8) at afr.c:2602
> > > 2602    afr.c: No such file or directory.
> > >         in afr.c
> > > (gdb) bt
> > > #0  0x00002aaaaacbc2bd in afr_stat (frame=0x2aaab0831c80,
> > >     this=<value optimized out>, loc=0x2aaaed03abf8) at afr.c:2602
> > > #1  0x00002aaaaaece1bb in iot_stat (frame=0x2aaab10bfd50, this=0x6126d0,
> > >     loc=0x2aaaed03abf8) at io-threads.c:651
> > > #2  0x00002ab5c53e1382 in default_stat (frame=0x2aaae2a881a0, this=0x612fe0,
> > >     loc=0x2aaaed03abf8) at defaults.c:112
> > > #3  0x00002aaaab2db252 in wb_stat (frame=0x2aaac90c5420, this=0x613930,
> > >     loc=0x2aaaed03abf8) at write-behind.c:236
> > > #4  0x0000000000405fd2 in fuse_getattr (req=<value optimized out>,
> > >     ino=<value optimized out>, fi=<value optimized out>) at fuse-bridge.c:496
> > > #5  0x0000000000407139 in fuse_transport_notify (xl=<value optimized out>,
> > >     event=<value optimized out>, data=<value optimized out>) at fuse-bridge.c:2067
> > > #6  0x00002ab5c53e3632 in sys_epoll_iteration (ctx=<value optimized out>) at epoll.c:53
> > > #7  0x000000000040356b in main (argc=5, argv=0x7fffe58f3348) at glusterfs.c:387
> > >
> > > Second back trace:
> > > Program terminated with signal 11, Segmentation fault.
> > > #0  0x00002aaaaacbc2bd in afr_stat (frame=0x2aab10d94830,
> > >     this=<value optimized out>, loc=0x2aab1105bd68) at afr.c:2602
> > > 2602    afr.c: No such file or directory.
> > >         in afr.c
> > > (gdb) bt
> > > #0  0x00002aaaaacbc2bd in afr_stat (frame=0x2aab10d94830,
> > >     this=<value optimized out>, loc=0x2aab1105bd68) at afr.c:2602
> > > #1  0x00002aaaaaece1bb in iot_stat (frame=0x2aab11aee060, this=0x6126d0,
> > >     loc=0x2aab1105bd68) at io-threads.c:651
> > > #2  0x00002b56305f4382 in default_stat (frame=0x2aab15cb30a0, this=0x612fe0,
> > >     loc=0x2aab1105bd68) at defaults.c:112
> > > #3  0x00002aaaab2db252 in wb_stat (frame=0x2aab1807e2e0, this=0x613930,
> > >     loc=0x2aab1105bd68) at write-behind.c:236
> > > #4  0x0000000000405fd2 in fuse_getattr (req=<value optimized out>,
> > >     ino=<value optimized out>, fi=<value optimized out>) at fuse-bridge.c:496
> > > #5  0x0000000000407139 in fuse_transport_notify (xl=<value optimized out>,
> > >     event=<value optimized out>, data=<value optimized out>) at fuse-bridge.c:2067
> > > #6  0x00002b56305f6632 in sys_epoll_iteration (ctx=<value optimized out>) at epoll.c:53
> > > #7  0x000000000040356b in main (argc=5, argv=0x7fff7a6de138) at glusterfs.c:387
> > >
> > > It seems the error is the same in all three cases.
> > >
> > > Bernhard
> > >
> > > 2007/8/22, Krishna Srinivas <krishna at zresearch.com>:
> > > > Hi Bernhard,
> > > >
> > > > We are not able to figure out the bug's cause. Is it possible for
> > > > you to give us access to your machine for debugging the core?
> > > >
> > > > Thanks
> > > > Krishna
> > > >
> > > > > On 8/20/07, Bernhard J. M. Grün <bernhard.gruen at googlemail.com> wrote:
> > > > > I still have the core dump of the crash I've reported. But I don't
> > > > > know if the backtrace is the same every time. The glusterfs client
> > > > > has now been running perfectly since 2007-08-16, so we have to wait
> > > > > for the next crash to analyse that issue further.
> > > > > Also, "print child_errno" does not output anything useful. It just
> > > > > says that there is no symbol with that name in the current context.
> > > > >
> > > > > 2007/8/20, Krishna Srinivas <krishna at zresearch.com>:
> > > > > > Do you see the same backtrace every time it crashes?
> > > > > > Can you do "print child_errno" at the gdb prompt when you have the core?
> > > > > >
> > > > > > Thanks
> > > > > > Krishna
> > > > > >
> > > > > > On 8/20/07, Bernhard J. M. Grün <bernhard.gruen at googlemail.com> wrote:
> > > > > > > Hi Krishna,
> > > > > > >
> > > > > > > One or sometimes both of our glusterfs clients with that version
> > > > > > > crash every 3 to 5 days, I think. The problem is that there is a
> > > > > > > lot of throughput (about 30 MBit/s on each client, with about
> > > > > > > 99.5% file reads and the rest file writes). This makes it hard
> > > > > > > to debug.
> > > > > > > We also have a core file from that crash (if I did not delete it
> > > > > > > because it was quite big); anyway, when the next crash occurs
> > > > > > > I'll save the core dump for sure.
> > > > > > > Do you have some idea how to work around that crash?
> > > > > > > 2007/8/20, Krishna Srinivas <krishna at zresearch.com>:
> > > > > > > > Hi Bernhard,
> > > > > > > >
> > > > > > > > Sorry for the late response. We are not able to figure out
> > > > > > > > the cause for this bug. Do you have the core file?
> > > > > > > > Is the bug seen regularly?
> > > > > > > >
> > > > > > > > Thanks
> > > > > > > > Krishna
> > > > > > > >
> > > > > > > > On 8/16/07, Bernhard J. M. Grün <bernhard.gruen at googlemail.com> wrote:
> > > > > > > > > Hello developers,
> > > > > > > > >
> > > > > > > > > We just discovered another segfault on the client side. At
> > > > > > > > > the moment we can't give you more information than our
> > > > > > > > > version number, a back trace and our client configuration.
> > > > > > > > >
> > > > > > > > > We use version 1.3.0 with patches up to patch-449.
> > > > > > > > >
> > > > > > > > > The back trace looks as follows:
> > > > > > > > > Core was generated by `[glusterfs]
> > > > > > > > >                               '.
> > > > > > > > > Program terminated with signal 11, Segmentation fault.
> > > > > > > > > #0  0x00002aaaaacbc2bd in afr_stat (frame=0x2aaabce32cb0,
> > > > > > > > >     this=<value optimized out>, loc=0x2aaaac0fe168) at afr.c:2602
> > > > > > > > > 2602    afr.c: No such file or directory.
> > > > > > > > >         in afr.c
> > > > > > > > > (gdb) bt
> > > > > > > > > #0  0x00002aaaaacbc2bd in afr_stat (frame=0x2aaabce32cb0,
> > > > > > > > >     this=<value optimized out>, loc=0x2aaaac0fe168) at afr.c:2602
> > > > > > > > > #1  0x00002aaaaaece1bb in iot_stat (frame=0x2aaabcc00860, this=0x6126d0,
> > > > > > > > >     loc=0x2aaaac0fe168) at io-threads.c:651
> > > > > > > > > #2  0x00002aaaab0d2252 in wb_stat (frame=0x2aaaad05c5e0, this=0x612fe0,
> > > > > > > > >     loc=0x2aaaac0fe168) at write-behind.c:236
> > > > > > > > > #3  0x0000000000405fd2 in fuse_getattr (req=<value optimized out>,
> > > > > > > > >     ino=<value optimized out>, fi=<value optimized out>) at fuse-bridge.c:496
> > > > > > > > > #4  0x0000000000407139 in fuse_transport_notify (xl=<value optimized out>,
> > > > > > > > >     event=<value optimized out>, data=<value optimized out>)
> > > > > > > > >     at fuse-bridge.c:2067
> > > > > > > > > #5  0x00002af562b6a632 in sys_epoll_iteration (ctx=<value optimized out>)
> > > > > > > > >     at epoll.c:53
> > > > > > > > > #6  0x000000000040356b in main (argc=9, argv=0x7fff48169b78) at glusterfs.c:387
> > > > > > > > >
> > > > > > > > > And here is our client configuration for that machine:
> > > > > > > > > ### Add client feature and attach to remote subvolume
> > > > > > > > > volume client1
> > > > > > > > >   type protocol/client
> > > > > > > > >   option transport-type tcp/client   # for TCP/IP transport
> > > > > > > > >   option remote-host 10.1.1.13       # IP address of the remote brick
> > > > > > > > >   option remote-port 9999            # default server port is 6996
> > > > > > > > >   option remote-subvolume iothreads  # name of the remote volume
> > > > > > > > > end-volume
> > > > > > > > >
> > > > > > > > > ### Add client feature and attach to remote subvolume
> > > > > > > > > volume client2
> > > > > > > > >   type protocol/client
> > > > > > > > >   option transport-type tcp/client   # for TCP/IP transport
> > > > > > > > >   option remote-host 10.1.1.14       # IP address of the remote brick
> > > > > > > > >   option remote-port 9999            # default server port is 6996
> > > > > > > > >   option remote-subvolume iothreads  # name of the remote volume
> > > > > > > > > end-volume
> > > > > > > > >
> > > > > > > > > volume afrbricks
> > > > > > > > >   type cluster/afr
> > > > > > > > >   subvolumes client1 client2
> > > > > > > > >   option replicate *:2
> > > > > > > > >   option self-heal off
> > > > > > > > > end-volume
> > > > > > > > >
> > > > > > > > > volume iothreads  # iothreads can give performance a boost
> > > > > > > > >   type performance/io-threads
> > > > > > > > >   option thread-count 16
> > > > > > > > >   subvolumes afrbricks
> > > > > > > > > end-volume
> > > > > > > > >
> > > > > > > > > ### Add writeback feature
> > > > > > > > > volume bricks
> > > > > > > > >   type performance/write-behind
> > > > > > > > >   option aggregate-size 0  # unit in bytes
> > > > > > > > >   subvolumes iothreads
> > > > > > > > > end-volume
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > We hope you can easily find and fix that error. Thank you in advance.
> > > > > > > > >
> > > > > > > > > Bernhard J. M. Grün
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > _______________________________________________
> > > > > > > > > Gluster-devel mailing list
> > > > > > > > > Gluster-devel at nongnu.org
> > > > > > > > > http://lists.nongnu.org/mailman/listinfo/gluster-devel
> > >
> >
>
>
>



-- 
Amar Tumballi
Engineer - Gluster Core Team
[bulde on #gluster/irc.gnu.org]
http://www.zresearch.com - Commoditizing Supercomputing and Superstorage!

