[Gluster-devel] AFR setup with Virtual Servers crashes
Urban Loesch
ul at enas.net
Thu May 10 11:40:21 UTC 2007
Hi Avati,
Many thanks. I will try out the latest code from the source repository.
thanks,
Urban
Anand Avati wrote:
> Urban,
> this bug has already been fixed in the source repository.
> thanks,
> avati
>
> 2007/5/10, Urban Loesch <ul at enas.net>:
>> Hi Avati,
>>
>> thanks for your fast answer.
>>
>> I use the version glusterfs-1.3.0-pre3 downloaded from your server
>> (http://ftp.zresearch.com/pub/gluster/glusterfs/1.3-pre/).
>> I will try the latest version from TLA this afternoon and let you know
>> what happens.
>>
>> Here's the backtrace from the core dump:
>> # gdb glusterfsd -c core.15160
>> ..
>> Core was generated by `glusterfsd --no-daemon --log-file=/dev/stdout
>> --log-level=DEBUG'.
>> Program terminated with signal 11, Segmentation fault.
>> #0 0xb75d8fd3 in posix_locks_flush () from
>> /usr/lib/glusterfs/1.3.0-pre3/xlator/features/posix-locks.so
>> (gdb) bt
>> #0 0xb75d8fd3 in posix_locks_flush () from
>> /usr/lib/glusterfs/1.3.0-pre3/xlator/features/posix-locks.so
>> #1 0xb75d1192 in fop_flush () from
>> /usr/lib/glusterfs/1.3.0-pre3/xlator/protocol/server.so
>> #2 0xb75cded7 in proto_srv_notify () from
>> /usr/lib/glusterfs/1.3.0-pre3/xlator/protocol/server.so
>> #3 0xb7f54ecd in transport_notify (this=0x804b1a0, event=1) at
>> transport.c:148
>> #4 0xb7f55b79 in sys_epoll_iteration (ctx=0xbfbc2ff0) at epoll.c:53
>> #5 0xb7f54f7d in poll_iteration (ctx=0xbfbc2ff0) at transport.c:251
>> #6 0x0804924e in main ()
>>
>> Yes, it is reproducible. It happens every time I try to start my
>> virtual server.
>>
>> Thanks
>> Urban
>>
>>
>> Anand Avati wrote:
>> > Urban,
>> > which version of glusterfs are you using? if it is from a TLA checkout,
>> > what is the patchset number?
>> >
>> > if you have a core dump generated from the segfault, can you please get
>> > a backtrace from it? (run gdb glusterfsd -c core.<pid> or gdb glusterfsd
>> > -c core, type the 'bt' command and paste the output)
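>> >
>> > (a minimal session sketch, assuming core dumps are enabled in the shell
>> > that started glusterfsd; <pid> is the pid of the crashed process:
>> >
>> >   ulimit -c unlimited            # enable core dumps before reproducing
>> >   gdb glusterfsd -c core.<pid>   # load the binary and the core file
>> >   (gdb) bt                       # print the stack trace
>> >   (gdb) quit
>> >
>> > the exact core file name depends on your kernel.core_pattern setting.)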
>> >
>> > is this easily reproducible? have you checked with the latest TLA
>> > checkout?
>> >
>> > thanks,
>> > avati
>> >
>> > 2007/5/10, Urban Loesch <ul at enas.net>:
>> >> Hi,
>> >>
>> >> I'm new to this list.
>> >> First: sorry for my bad English.
>> >>
>> >> I was searching for an easy and transparent cluster filesystem with a
>> >> failover feature and found the GlusterFS project on Wikipedia.
>> >> It's a nice project and I tried it in my test environment. I thought
>> >> that if it works well, I would use it in production too.
>> >>
>> >> A very nice feature for me is the AFR setup, because with it I can
>> >> replicate all the data over 2 servers in RAID-1 mode.
>> >> But it seems that I am doing something wrong, because "glusterfsd"
>> >> crashes on both nodes.
>> >> Let me explain from the beginning.
>> >>
>> >> Here's my setup:
>> >> Hardware:
>> >> 2 different servers for storage
>> >> 1 server as client
>> >> On top of the server I use a virtual server setup (details
>> >> http://linux-vserver.org).
>> >>
>> >> OS:
>> >> Debian Sarge with a self-compiled 2.6.19.2 kernel (uname -r:
>> >> 2.6.19.2-vs2.2.0) and the latest stable virtual server patch.
>> >> glusterfs-1.3.0-pre3.tar.gz
>> >>
>> >> What I'm trying to do:
>> >> - Create an AFR mirror over the 2 servers.
>> >> - Mount the volume on server 3 (client).
>> >> - Install the whole virtual server (Apache, MySQL and so on) on the
>> >> mounted volume.
>> >> That way I have a fully redundant virtual server mirrored over two bricks.
>> >>
>> >> Here is my current configuration:
>> >> - Server config on Server 1 (brick):
>> >>
>> >> ### Export volume "brick" with the contents of the "/gluster" directory.
>> >> volume brick
>> >> type storage/posix # POSIX FS translator
>> >> option directory /gluster # Export this directory
>> >> end-volume
>> >>
>> >> ### File Locking
>> >> volume locks
>> >> type features/posix-locks
>> >> subvolumes brick
>> >> end-volume
>> >>
>> >> ### Add network serving capability to above brick.
>> >> volume server
>> >> type protocol/server
>> >> option transport-type tcp/server # For TCP/IP transport
>> >> option listen-port 6996 # Default is 6996
>> >> subvolumes locks
>> >> option auth.ip.locks.allow * # access to "locks" volume
>> >> end-volume
>> >>
>> >> - Server config on Server 2 (brick-afr):
>> >> ### Export volume "brick-afr" with the contents of the "/gluster-afr" directory.
>> >> volume brick-afr
>> >> type storage/posix # POSIX FS translator
>> >> option directory /gluster-afr # Export this directory
>> >> end-volume
>> >>
>> >> ### File Locking
>> >> volume locks-afr
>> >> type features/posix-locks
>> >> subvolumes brick-afr
>> >> end-volume
>> >>
>> >> ### Add network serving capability to above brick.
>> >> volume server
>> >> type protocol/server
>> >> option transport-type tcp/server # For TCP/IP transport
>> >> option listen-port 6996 # Default is 6996
>> >> subvolumes locks-afr
>> >> option auth.ip.locks-afr.allow * # access to "locks-afr" volume
>> >> end-volume
>> >>
>> >> - Client config on Server 3 (client):
>> >> ### Add client feature and attach to remote subvolume of server1
>> >> volume brick
>> >> type protocol/client
>> >> option transport-type tcp/client # for TCP/IP transport
>> >> option remote-host 192.168.0.1 # IP address of the remote brick
>> >> option remote-port 6996 # default server port is 6996
>> >> option remote-subvolume locks # name of the remote volume
>> >> end-volume
>> >>
>> >> ### Add client feature and attach to remote subvolume of server2
>> >> volume brick-afr
>> >> type protocol/client
>> >> option transport-type tcp/client # for TCP/IP transport
>> >> option remote-host 192.168.0.2 # IP address of the remote brick
>> >> option remote-port 6996 # default server port is 6996
>> >> option remote-subvolume locks-afr # name of the remote volume
>> >> end-volume
>> >>
>> >> ### Add AFR feature to brick
>> >> volume afr
>> >> type cluster/afr
>> >> subvolumes brick brick-afr
>> >> option replicate *:2 # All files 2 copies (RAID-1)
>> >> end-volume
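>> >>
>> >> (If I read the AFR docs right, "option replicate" accepts a
>> >> comma-separated pattern:count list, so a hypothetical sketch like
>> >>
>> >> option replicate *.tmp:1,*:2 # hypothetical: keep temp files single-copy
>> >>
>> >> would replicate selectively; for a full mirror, *:2 as above is what
>> >> I want.)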
>> >>
>> >>
>> ----------------------------------------------------------------------------------------------------------------------
>>
>> >>
>> >> I started the two bricks in debug mode and they start without problems.
>> >>
>> >> - Server1
>> >> glusterfsd --no-daemon --log-file=/dev/stdout --log-level=DEBUG
>> >> ....
>> >> [May 10 11:52:11] [DEBUG/proto-srv.c:2919/init()]
>> >> protocol/server:protocol/server xlator loaded
>> >> [May 10 11:52:11] [DEBUG/transport.c:83/transport_load()]
>> >> libglusterfs/transport:attempt to load type tcp/server
>> >> [May 10 11:52:11] [DEBUG/transport.c:88/transport_load()]
>> >> libglusterfs/transport:attempt to load file
>> >> /usr/lib/glusterfs/1.3.0-pre3/transport/tcp/server.so
>> >>
>> >> - Server2
>> >> glusterfsd --no-daemon --log-file=/dev/stdout --log-level=DEBUG
>> >> ....
>> >> [May 10 11:51:44] [DEBUG/proto-srv.c:2919/init()]
>> >> protocol/server:protocol/server xlator loaded
>> >> [May 10 11:51:44] [DEBUG/transport.c:83/transport_load()]
>> >> libglusterfs/transport:attempt to load type tcp/server
>> >> [May 10 11:51:44] [DEBUG/transport.c:88/transport_load()]
>> >> libglusterfs/transport:attempt to load file
>> >> /usr/lib/glusterfs/1.3.0-pre3/transport/tcp/server.so
>> >>
>> ------------------------------------------------------------------------------------------------------------------------------
>>
>> >>
>> >>
>> >> So far so good.
>> >>
>> >> Then I mounted the volume on server 3 (the client). It mounts without
>> >> any problems.
>> >> glusterfs --no-daemon --log-file=/dev/stdout --log-level=DEBUG
>> >> --spec-file=/etc/glusterfs/glusterfs-client.vol
>> >> /var/lib/vservers/mastersql
>> >> ...
>> >> [May 10 13:59:00] [DEBUG/client-protocol.c:2796/init()]
>> >> protocol/client:defaulting transport-timeout to 120
>> >> [May 10 13:59:00] [DEBUG/transport.c:83/transport_load()]
>> >> libglusterfs/transport:attempt to load type tcp/client
>> >> [May 10 13:59:00] [DEBUG/transport.c:88/transport_load()]
>> >> libglusterfs/transport:attempt to load file
>> >> /usr/lib/glusterfs/1.3.0-pre3/transport/tcp/client.so
>> >> [May 10 13:59:00] [DEBUG/tcp-client.c:174/tcp_connect()] transport: tcp:
>> >> :try_connect: socket fd = 8
>> >> [May 10 13:59:00] [DEBUG/tcp-client.c:196/tcp_connect()] transport: tcp:
>> >> :try_connect: finalized on port `1022'
>> >> [May 10 13:59:00] [DEBUG/tcp-client.c:255/tcp_connect()]
>> >> tcp/client:connect on 8 in progress (non-blocking)
>> >> [May 10 13:59:00] [DEBUG/tcp-client.c:293/tcp_connect()]
>> >> tcp/client:connection on 8 still in progress - try later
>> >>
>> >> OK. Nice.
>> >> A short check on the client:
>> >> df -HT
>> >> Filesystem Type Size Used Avail Use% Mounted on
>> >> /dev/sda1 ext3 13G 2.6G 8.9G 23% /
>> >> tmpfs tmpfs 1.1G 0 1.1G 0% /lib/init/rw
>> >> udev tmpfs 11M 46k 11M 1% /dev
>> >> tmpfs tmpfs 1.1G 0 1.1G 0% /dev/shm
>> >> glusterfs:24914
>> >> fuse 9.9G 2.5G 6.9G 27%
>> >> /var/lib/vservers/mastersql
>> >>
>> >> Wow, it works. Now I can add, remove or edit files and directories
>> >> without problems. The files are written to both bricks without
>> >> problems. Performance is good too.
>> >>
>> >> But then I tried to start my virtual server (called mastersql).
>> >> The virtual server does not start and I get a lot of the following
>> >> debug output on the client:
>> >>
>> >> [May 10 14:04:43] [DEBUG/tcp-client.c:174/tcp_connect()] transport: tcp:
>> >> :try_connect: socket fd = 4
>> >> [May 10 14:04:43] [DEBUG/tcp-client.c:196/tcp_connect()] transport: tcp:
>> >> :try_connect: finalized on port `1023'
>> >> [May 10 14:04:43] [DEBUG/tcp-client.c:255/tcp_connect()]
>> >> tcp/client:connect on 4 in progress (non-blocking)
>> >> [May 10 14:04:43] [DEBUG/tcp-client.c:293/tcp_connect()]
>> >> tcp/client:connection on 4 still in progress - try later
>> >> [May 10 14:04:43] [ERROR/client-protocol.c:204/client_protocol_xfer()]
>> >> protocol/client:transport_submit failed
>> >> [May 10 14:04:43]
>> >> [DEBUG/client-protocol.c:2604/client_protocol_cleanup()]
>> >> protocol/client:cleaning up state in transport object 0x8076cf0
>> >> [May 10 14:04:43] [DEBUG/tcp-client.c:174/tcp_connect()] transport: tcp:
>> >> :try_connect: socket fd = 7
>> >> [May 10 14:04:43] [DEBUG/tcp-client.c:196/tcp_connect()] transport: tcp:
>> >> :try_connect: finalized on port `1022'
>> >> [May 10 14:04:43] [DEBUG/tcp-client.c:255/tcp_connect()]
>> >> tcp/client:connect on 7 in progress (non-blocking)
>> >> [May 10 14:04:43] [DEBUG/tcp-client.c:293/tcp_connect()]
>> >> tcp/client:connection on 7 still in progress - try later
>> >> [May 10 14:04:43] [ERROR/client-protocol.c:204/client_protocol_xfer()]
>> >> protocol/client:transport_submit failed
>> >> [May 10 14:04:43]
>> >> [DEBUG/client-protocol.c:2604/client_protocol_cleanup()]
>> >> protocol/client:cleaning up state in transport object 0x80762d0
>> >>
>> >> The two mirror servers crash with the following debug output:
>> >>
>> >> [May 10 11:54:26] [DEBUG/tcp-server.c:134/tcp_server_notify()]
>> >> tcp/server:Registering socket (5) for new transport object of
>> >> 192.168.0.3
>> >> [May 10 11:55:22] [DEBUG/proto-srv.c:2418/mop_setvolume()]
>> >> server-protocol:mop_setvolume: received port = 1022
>> >> [May 10 11:55:22] [DEBUG/proto-srv.c:2434/mop_setvolume()]
>> >> server-protocol:mop_setvolume: IP addr = *, received ip addr =
>> >> 192.168.0.3
>> >> [May 10 11:55:22] [DEBUG/proto-srv.c:2444/mop_setvolume()]
>> >> server-protocol:mop_setvolume: accepted client from 192.168.0.3
>> >>
>> >> Trying to set: READ  Is grantable: READ  Inserting: READ
>> >> Trying to set: UNLOCK  Is grantable: UNLOCK  Conflict with: READ
>> >> Trying to set: WRITE  Is grantable: WRITE  Inserting: WRITE
>> >> Trying to set: UNLOCK  Is grantable: UNLOCK  Conflict with: WRITE
>> >> Trying to set: WRITE  Is grantable: WRITE  Inserting: WRITE
>> >> Trying to set: UNLOCK  Is grantable: UNLOCK  Conflict with: WRITE
>> >> Trying to set: WRITE  Is grantable: WRITE  Inserting: WRITE
>> >> [May 10 12:00:09] [CRITICAL/common-utils.c:215/gf_print_trace()]
>> >> debug-backtrace:Got signal (11), printing backtrace
>> >> [May 10 12:00:09] [CRITICAL/common-utils.c:217/gf_print_trace()]
>> >> debug-backtrace:/usr/lib/libglusterfs.so.0(gf_print_trace+0x2e)
>> >> [0xb7f53a7e]
>> >> [May 10 12:00:09] [CRITICAL/common-utils.c:217/gf_print_trace()]
>> >> debug-backtrace:[0xb7f60420]
>> >> [May 10 12:00:09] [CRITICAL/common-utils.c:217/gf_print_trace()]
>> >> debug-backtrace:/usr/lib/glusterfs/1.3.0-pre3/xlator/protocol/server.so
>> >> [0xb75d1192]
>> >> [May 10 12:00:09] [CRITICAL/common-utils.c:217/gf_print_trace()]
>> >> debug-backtrace:/usr/lib/glusterfs/1.3.0-pre3/xlator/protocol/server.so
>> >> [0xb75cded7]
>> >> [May 10 12:00:09] [CRITICAL/common-utils.c:217/gf_print_trace()]
>> >> debug-backtrace:/usr/lib/libglusterfs.so.0(transport_notify+0x1d)
>> >> [0xb7f54ecd]
>> >> [May 10 12:00:09] [CRITICAL/common-utils.c:217/gf_print_trace()]
>> >> debug-backtrace:/usr/lib/libglusterfs.so.0(sys_epoll_iteration+0xe9)
>> >> [0xb7f55b79]
>> >> [May 10 12:00:09] [CRITICAL/common-utils.c:217/gf_print_trace()]
>> >> debug-backtrace:/usr/lib/libglusterfs.so.0(poll_iteration+0x1d)
>> >> [0xb7f54f7d]
>> >> [May 10 12:00:09] [CRITICAL/common-utils.c:217/gf_print_trace()]
>> >> debug-backtrace:glusterfsd [0x804924e]
>> >> [May 10 12:00:09] [CRITICAL/common-utils.c:217/gf_print_trace()]
>> >> debug-backtrace:/lib/tls/i686/cmov/libc.so.6(__libc_start_main+0xc8)
>> >> [0xb7e17ea8]
>> >> [May 10 12:00:09] [CRITICAL/common-utils.c:217/gf_print_trace()]
>> >> debug-backtrace:glusterfsd [0x8048c51]
>> >> Segmentation fault (core dumped)
>> >>
>> >> It seems that there are some conflicts with "READ, WRITE, UNLOCK", but
>> >> I'm not an expert on filesystems and locking features.
>> >>
>> >> As you can see, the filesystem is still mounted but no longer connected
>> >> to the two bricks:
>> >> df -HT
>> >> Filesystem Type Size Used Avail Use% Mounted on
>> >> /dev/sda1 ext3 13G 2.6G 8.9G 23% /
>> >> tmpfs tmpfs 1.1G 0 1.1G 0% /lib/init/rw
>> >> udev tmpfs 11M 46k 11M 1% /dev
>> >> tmpfs tmpfs 1.1G 0 1.1G 0% /dev/shm
>> >> df: `/var/lib/vservers/mastersql': Transport endpoint is not connected
>> >>
>> >> I'm not sure if I am doing something wrong (configuration) or if it
>> >> is a bug!
>> >> Can you experts please help me?
>> >>
>> >> If you need any further information or something please let me know.
>> >>
>> >> Thanks and regards
>> >> Urban
>> >>
>> >>
>> >> _______________________________________________
>> >> Gluster-devel mailing list
>> >> Gluster-devel at nongnu.org
>> >> http://lists.nongnu.org/mailman/listinfo/gluster-devel
>> >>
>> >
>> >
>>
>>
>
>