[Gluster-users] VM going down

Thu May 11 12:19:24 UTC 2017

On Wed, May 10, 2017 at 09:08:03PM +0530, Pranith Kumar Karampuri wrote:
> On Wed, May 10, 2017 at 7:11 PM, Niels de Vos <ndevos at redhat.com> wrote:
> 
> > On Wed, May 10, 2017 at 04:08:22PM +0530, Pranith Kumar Karampuri wrote:
> > > On Tue, May 9, 2017 at 7:40 PM, Niels de Vos <ndevos at redhat.com> wrote:
> > >
> > > > ...
> > > > > > client from
> > > > > > srvpve2-162483-2017/05/08-10:01:06:189720-datastore2-client-0-0-0
> > > > > > (version: 3.8.11)
> > > > > > [2017-05-08 10:01:06.237433] E [MSGID: 113107] [posix.c:1079:posix_seek]
> > > > > > 0-datastore2-posix: seek failed on fd 18 length 42957209600 [No such
> > > > > > device or address]
> > > >
> > > > The SEEK procedure translates to lseek() in the posix xlator. This can
> > > > return with "No such device or address" (ENXIO) in only one case:
> > > >
> > > >     ENXIO    whence is SEEK_DATA or SEEK_HOLE, and the file offset is
> > > >              beyond the end of the file.
> > > >
> > > > This means that an lseek() was executed where the requested offset was
> > > > beyond the end of the file. I'm not sure how that could happen...
> > > > Sharding prevents using SEEK at all at the moment.
> > > >
> > > > ...
> > > > > > The strange part is that I cannot seem to find any other error.
> > > > > > If I restart the VM everything works as expected (it stopped at ~9.51
> > > > > > UTC and was started at ~10.01 UTC).
> > > > > >
> > > > > > This is not the first time that this happened, and I do not see any
> > > > > > problems with networking or the hosts.
> > > > > >
> > > > > > Gluster version is 3.8.11
> > > > > > this is the incriminated volume (though it happened on a different one
> > > > > > too)
> > > > > >
> > > > > > Volume Name: datastore2
> > > > > > Type: Replicate
> > > > > > Volume ID: c95ebb5f-6e04-4f09-91b9-bbbe63d83aea
> > > > > > Status: Started
> > > > > > Snapshot Count: 0
> > > > > > Number of Bricks: 1 x (2 + 1) = 3
> > > > > > Transport-type: tcp
> > > > > > Bricks:
> > > > > > Brick1: srvpve2g:/data/brick2/brick
> > > > > > Brick2: srvpve3g:/data/brick2/brick
> > > > > > Brick3: srvpve1g:/data/brick2/brick (arbiter)
> > > > > > Options Reconfigured:
> > > > > > nfs.disable: on
> > > > > > performance.readdir-ahead: on
> > > > > > transport.address-family: inet
> > > > > >
> > > > > > Any hint on how to dig more deeply into the reason would be greatly
> > > > > > appreciated.
> > > >
> > > > Probably the problem is with SEEK support in the arbiter functionality.
> > > > Just like with a READ or a WRITE on the arbiter brick, SEEK can only
> > > > succeed on bricks where the files with content are located. It does not
> > > > look like arbiter handles SEEK, so the offset in lseek() will likely be
> > > > higher than the size of the file on the brick (empty, 0 size file). I
> > > > don't know how the replication xlator responds on an error return from
> > > > SEEK on one of the bricks, but I doubt it likes it.
> > > >
> > >
> > > inode-read fops don't get sent to arbiter brick. So this won't happen.
> >
> > Yes, I see that the arbiter xlator returns on reads without going to the
> > bricks. Should that not be done for seek as well? It's the first time I
> > actually looked at the code of the arbiter xlator, so I might well be
> > misunderstanding how it works :)
> >
> 
> inode-read fops are the fops which read some information from the inode.
> Like stat/getxattr/read. Even seek falls in that category. It is not sent
> on arbiter brick...

What confuses me is that the arbiter xlator defines only the following FOPs
in xlators/features/arbiter/src/arbiter.c, and seek is not among them:

    struct xlator_fops fops = { 
            .lookup = arbiter_lookup,
            .readv  = arbiter_readv,
            .truncate = arbiter_truncate,
            .writev = arbiter_writev,
            .ftruncate = arbiter_ftruncate,
            .fallocate = arbiter_fallocate,
            .discard = arbiter_discard,
            .zerofill = arbiter_zerofill,
    };
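
To illustrate why a missing entry matters, here is a simplified model of a
handler table with a pass-through default. This is not GlusterFS code: the
names layer_fops, arbiter_like, pass_to_child and resolve are made up for the
sketch, and it only assumes (as far as I understand the xlator framework) that
a FOP left unset in the table is forwarded to the next xlator in the graph.

```c
#include <stddef.h>
#include <string.h>

/* Simplified model, not GlusterFS code: a layer with optional handlers. */
typedef const char *(*fop_fn)(void);

struct layer_fops {
    fop_fn readv;
    fop_fn seek;
};

static const char *arbiter_like_readv(void)
{
    return "answered by arbiter layer";
}

static const char *pass_to_child(void)
{
    return "passed down to child layer";
}

/* Only readv is set; seek stays NULL, like the table quoted above. */
static const struct layer_fops arbiter_like = {
    .readv = arbiter_like_readv,
};

/* Assumption: an unset handler falls back to a forwarding default. */
static fop_fn resolve(fop_fn handler)
{
    return handler ? handler : pass_to_child;
}

const char *dispatch_readv(void)
{
    return resolve(arbiter_like.readv)();
}

const char *dispatch_seek(void)
{
    return resolve(arbiter_like.seek)();
}
```

In this model dispatch_readv() is answered by the layer itself, while
dispatch_seek() ends up in the layer below, which is the situation I am
worried about if a SEEK ever reaches the arbiter brick.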

To go back to the error message: 

  [posix.c:1079:posix_seek] 0-datastore2-posix: seek failed on fd 18 length 42957209600 [No such device or address]

We need to know on which brick this occurs to confirm that it was not
sent to the arbiter brick somehow.
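
For reference, the ENXIO case is easy to reproduce outside of Gluster. The
sketch below is only an illustration, not taken from the posix xlator: the
helper name seek_data_errno and the temporary file path are made up, and it
assumes a Linux system where SEEK_DATA is available. It creates a file with
a given amount of data and reports the errno of lseek(fd, offset, SEEK_DATA).

```c
#define _GNU_SOURCE /* exposes SEEK_DATA on glibc */
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef SEEK_DATA
#define SEEK_DATA 3 /* Linux value, used only if the headers hide it */
#endif

/*
 * Hypothetical helper: create a temporary file holding `size` bytes of
 * data, call lseek(fd, offset, SEEK_DATA) and return 0 on success or the
 * errno of the failing call.
 */
int seek_data_errno(size_t size, off_t offset)
{
    char path[] = "/tmp/seek-demo-XXXXXX";
    char buf[4096];
    int err = 0;
    int fd = mkstemp(path);

    if (fd < 0)
        return errno;
    unlink(path); /* the file disappears once fd is closed */

    memset(buf, 'x', sizeof(buf));
    while (size > 0) {
        size_t chunk = size < sizeof(buf) ? size : sizeof(buf);
        ssize_t n = write(fd, buf, chunk);

        if (n < 0) {
            err = errno;
            close(fd);
            return err;
        }
        size -= (size_t)n;
    }

    if (lseek(fd, offset, SEEK_DATA) == (off_t)-1)
        err = errno;

    close(fd);
    return err;
}
```

Calling seek_data_errno(0, 42957209600) mimics a SEEK with the offset from
the log on a 0 size file (what the arbiter brick holds) and should return
ENXIO, whereas seek_data_errno(4096, 0) should return 0.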

Thanks,
Niels