[Bugs] [Bug 1318132] New: Clients return ENOTCONN or EINVAL after restarting brick servers in quick succession

Wed Mar 16 07:23:25 UTC 2016

https://bugzilla.redhat.com/show_bug.cgi?id=1318132

            Bug ID: 1318132
           Summary: Clients return ENOTCONN or EINVAL after restarting
                    brick servers in quick succession
           Product: Red Hat Gluster Storage
           Version: 3.1
         Component: glusterfs
     Sub Component: core
          Severity: high
          Assignee: rhs-bugs at redhat.com
          Reporter: congxueyang at gmail.com
        QA Contact: annair at redhat.com
                CC: bugs at gluster.org, congxueyang at gmail.com,
                    gluster-bugs at redhat.com, jdarcy at redhat.com,
                    jwm at horde.net, rwheeler at redhat.com

+++ This bug was initially created as a clone of Bug #902953 +++

(This comment was longer than 65,535 characters and has been moved to an
attachment by Red Hat Bugzilla).

--- Additional comment from Amar Tumballi on 2013-02-14 04:37:39 EST ---

Thanks for the report. One thing, though: if a node (or a lot of nodes) is
going down and coming back up, isn't it natural for operations to fail, given
that the filesystem is network based?

--- Additional comment from John Morrissey on 2013-02-15 11:04:24 EST ---

Sure, I would expect the operations to fail *while* the Gluster servers are
being restarted, but after the servers are running, I would also expect Gluster
clients to gracefully reconnect.

As the logs above show, they clearly do not do so after several minutes, or (in
our experience) even after several hours.

--- Additional comment from John Morrissey on 2013-04-01 12:28:12 EDT ---

Looks like this isn't limited to native Gluster clients.

Some of our nodes mount a Gluster instance via NFS. We noticed that these
clients can successfully mount the volume, but any I/O to it returns EIO:

    [jwm at elided:pts/13 ~> ls -l /path/to/gluster
    ls: /path/to/gluster: Input/output error

The gluster<->nfs process on the gluster server:

    root     27902 12.1  0.7 406064 179052 ?       Ssl  Jan22 11601:30 /usr/sbin/glusterfs -s localhost --volfile-id gluster/nfs -p /var/lib/glusterd/nfs/run/nfs.pid -l /var/log/glusterfs/nfs.log -S /tmp/bf018af881a58acb0efa7cefadd6fb1d.socket

is spinning on a file descriptor that probably used to be connected to a
gluster brick, but is now open to /etc/services:

    -bash-4.1$ sudo strace -p 27902
    Process 27902 attached - interrupt to quit
    epoll_wait(3, {{EPOLLIN|EPOLLERR|EPOLLHUP, {u32=19, u64=107374182419}}}, 258, 4294967295) = 1
    getsockopt(19, SOL_SOCKET, SO_ERROR, [182050606976860271], [4]) = 0
    shutdown(19, 2 /* send and receive */)  = -1 ENOTCONN (Transport endpoint is not connected)
    readv(19, [{"\0\0\0\0", 4}], 1)         = 0
    epoll_ctl(3, EPOLL_CTL_DEL, 19, NULL)   = 0
    close(19)                               = 0
    epoll_wait(3, {{EPOLLIN|EPOLLERR|EPOLLHUP, {u32=19, u64=107374182419}}}, 258, 4294967295) = 1
    getsockopt(19, SOL_SOCKET, SO_ERROR, [190986337975795823], [4]) = 0
    shutdown(19, 2 /* send and receive */)  = -1 ENOTCONN (Transport endpoint is not connected)
    readv(19, [{"\0\0\0\0", 4}], 1)         = 0
    epoll_ctl(3, EPOLL_CTL_DEL, 19, NULL)   = 0
    close(19)                               = 0
    epoll_wait(3, {{EPOLLIN|EPOLLERR|EPOLLHUP, {u32=19, u64=107374182419}}}, 258, 4294967295) = 1

    -bash-4.1$ sudo lsof -p 27902
    COMMAND     PID USER   FD   TYPE             DEVICE  SIZE/OFF      NODE NAME
    [...]
    glusterfs 27902 root   19u   REG              253,0    640999   3801126 /etc/services
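
The guess above, that descriptor 19 once belonged to a brick connection and
now refers to /etc/services, is consistent with how POSIX hands out descriptor
numbers: open() and socket() return the lowest unused number, so a number
released by close() goes to whatever is opened next. The following is only a
minimal Python sketch of that reuse, not code from the Gluster sources; the
function name is hypothetical, and os.devnull stands in for /etc/services.

```python
import os
import socket


def reuse_closed_descriptor():
    """Close a socket, open a file, and return both descriptor numbers."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    socket_fd = sock.fileno()
    sock.close()  # the descriptor number is now free for reuse

    # POSIX open() returns the lowest unused descriptor. In a
    # single-threaded process that is the number just released.
    file_fd = os.open(os.devnull, os.O_RDONLY)
    try:
        return socket_fd, file_fd
    finally:
        os.close(file_fd)


if __name__ == "__main__":
    socket_fd, file_fd = reuse_closed_descriptor()
    print("socket fd:", socket_fd, "file fd after close:", file_fd)
```

Code that still holds the old number after close() (in an event loop's
bookkeeping, for example) would then be operating on an unrelated file.
Whether that is what happens inside the glusterfs NFS process is not
established by the traces above.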

--- Additional comment from Kaleb KEITHLEY on 2015-10-22 11:46:38 EDT ---

Because of the large number of bugs filed against it, the "mainline" version is
ambiguous and is about to be removed as a choice.

If you believe this is still a bug, please change the status back to NEW and
choose the appropriate, applicable version for it.
