[Gluster-users] issues after botched update

Jorick Astrego jorick at netbulae.eu
Thu Jan 18 11:24:06 UTC 2018


Hi,

A client has a glusterfs cluster that's behaving weirdly after some 
issues during upgrade.

They upgraded a glusterfs 2+1 cluster (replica with arbiter) from 3.10.9 
to 3.12.4 on Centos and now have weird issues and some files maybe being 
corrupted. They also switched from nfs ganesha that crashed every couple 
of days to glusterfs subdirectory mounting. Subdirectory mounting was 
the point of the upgrade.

On a client when I go switch to a user and go to the home folder, I do a 
ls -lart and get a list of files, when I do it the second time I get 
"ls: cannot open directory .: Permission denied" until I logout and 
login again. Then it works mostly one time only, somtimes it fails on 
the first time. Deploying something in this folder works sometimes, 
other times I get write permission is denied on /home/user.

I've enabled the bitrot daemon but it's still running:

    Volume name : home

    State of scrub: Active (In Progress)

    Scrub impact: normal

    Scrub frequency: biweekly

    Bitrot error log location: /var/log/glusterfs/bitd.log

    Scrubber error log location: /var/log/glusterfs/scrub.log


    =========================================================

    Node: localhost

    Number of Scrubbed files: 152451

    Number of Skipped files: 197

    Last completed scrub time: Scrubber pending to complete.

    Duration of last scrub (D:M:H:M:S): 0:0:0:0

    Error count: 0


    =========================================================

    Node: gluster01

    Number of Scrubbed files: 150198

    Number of Skipped files: 190

    Last completed scrub time: Scrubber pending to complete.

    Duration of last scrub (D:M:H:M:S): 0:0:0:0

    Error count: 0


    =========================================================

    Node: gluster02

    Number of Scrubbed files: 0

    Number of Skipped files: 153939

    Last completed scrub time: Scrubber pending to complete.

    Duration of last scrub (D:M:H:M:S): 0:0:0:0

    Error count: 0

    =========================================================

Gluster volume heal has one failed entries.

    Ending time of crawl: Thu Jan 18 11:49:04 2018

    Type of crawl: INDEX
    No. of entries healed: 0
    No. of entries in split-brain: 0
    No. of heal failed entries: 0

    Starting time of crawl: Thu Jan 18 11:59:04 2018

    Ending time of crawl: Thu Jan 18 11:59:06 2018

    Type of crawl: INDEX
    No. of entries healed: 0
    No. of entries in split-brain: 0
    *No. of heal failed entries: 1*

    Starting time of crawl: Thu Jan 18 12:09:06 2018

    Ending time of crawl: Thu Jan 18 12:09:07 2018

    Type of crawl: INDEX
    No. of entries healed: 0
    No. of entries in split-brain: 0
    No. of heal failed entries: 0

but I'm unable to trace which file as "gluster volume heal <volume> info 
heal-failed" doesn't exist anymore


    gluster volume heal home info heal-failed

    Usage:
    volume heal <VOLNAME> [enable | disable | full |statistics
    [heal-count [replica <HOSTNAME:BRICKNAME>]] |info [split-brain]
    |split-brain {bigger-file <FILE> | latest-mtime <FILE> |source-brick
    <HOSTNAME:BRICKNAME> [<FILE>]} |granular-entry-heal {enable | disable}]

The things I'm seeing in the logfiles on the client:

        [2018-01-18 08:59:53.360864] W [MSGID: 114031] 
[client-rpc-fops.c:2151:client3_3_seek_cbk] 0-home-client-0: remote 
operation failed [No such device or address]
        [2018-01-18 09:00:17.512636] W [MSGID: 114031] 
[client-rpc-fops.c:2151:client3_3_seek_cbk] 0-home-client-0: remote 
operation failed [No such device or address]
        [2018-01-18 09:00:27.473702] W [MSGID: 114031] 
[client-rpc-fops.c:2151:client3_3_seek_cbk] 0-home-client-0: remote 
operation failed [No such device or address]
        [2018-01-18 09:00:40.372756] W [MSGID: 114031] 
[client-rpc-fops.c:2151:client3_3_seek_cbk] 0-home-client-0: remote 
operation failed [No such device or address]
        [2018-01-18 09:00:50.344597] W [MSGID: 114031] 
[client-rpc-fops.c:2151:client3_3_seek_cbk] 0-home-client-0: remote 
operation failed [No such device or address]

Thes one worry me, there are multiple but without any gfid:

      [MSGID: 109063] [dht-layout.c:716:dht_layout_normalize]
    5-home-dht: Found anomalies in (null) (gfid =
    00000000-0000-0000-0000-000000000000). Holes=1 overlaps=0

Also it seems to disconnect and

    [2018-01-18 08:38:41.210848] I [MSGID: 114057]
    [client-handshake.c:1478:select_server_supported_programs]
    0-home-client-1: Using Program GlusterFS 3.3, Num (1298437), Version
    (330)
    [2018-01-18 08:38:41.214548] I [MSGID: 114057]
    [client-handshake.c:1478:select_server_supported_programs]
    0-home-client-2: Using Program GlusterFS 3.3, Num (1298437), Version
    (330)
    [2018-01-18 08:38:41.255458] I [MSGID: 114046]
    [client-handshake.c:1231:client_setvolume_cbk] 0-home-client-0:
    Connected to home-client-0, attached to remote volume
    '/data/home/brick1'.
    [2018-01-18 08:38:41.255505] I [MSGID: 114047]
    [client-handshake.c:1242:client_setvolume_cbk] 0-home-client-0:
    Server and Client lk-version numbers are not same, reopening the fds
    [2018-01-18 08:38:41.255643] I [MSGID: 108005]
    [afr-common.c:4929:__afr_handle_child_up_event] 0-home-replicate-0*:
    Subvolume 'home-client-0' came back up; going online.*
    [2018-01-18 08:38:41.255791] I [MSGID: 114035]
    [client-handshake.c:202:client_set_lk_version_cbk] 0-home-client-0:
    Server lk version = 1
    [2018-01-18 08:38:41.285774] I [MSGID: 114046]
    [client-handshake.c:1231:client_setvolume_cbk] 0-home-client-1:
    Connected to home-client-1, attached to remote volume
    '/data/home/brick1'.
    [2018-01-18 08:38:41.285815] I [MSGID: 114047]
    [client-handshake.c:1242:client_setvolume_cbk] 0-home-client-1:
    Server and Client lk-version numbers are not same, reopening the fds
    [2018-01-18 08:38:41.286672] I [MSGID: 114035]
    [client-handshake.c:202:client_set_lk_version_cbk] 0-home-client-1:
    Server lk version = 1
    [2018-01-18 08:38:41.291928] I [MSGID: 114046]
    [client-handshake.c:1231:client_setvolume_cbk] 0-home-client-2:
    Connected to home-client-2, attached to remote volume
    '/data/home/brick1'.
    [2018-01-18 08:38:41.291970] I [MSGID: 114047]
    [client-handshake.c:1242:client_setvolume_cbk] 0-home-client-2:
    Server and Client lk-version numbers are not same, reopening the fds
    [2018-01-18 08:38:41.292750] I [MSGID: 114035]
    [client-handshake.c:202:client_set_lk_version_cbk] 0-home-client-2:
    Server lk version = 1
    [2018-01-18 08:38:41.764822] I [MSGID: 108031]
    [afr-common.c:2376:afr_local_discovery_cbk] 0-home-replicate-0:
    selecting local read_child home-client-0


    [2018-01-18 08:38:39.039054] I [addr.c:55:compare_addr_and_update]
    0-/data/home/brick1: allowed = "*", received addr = "*.*.*.*"
    [2018-01-18 08:38:39.040813] I [login.c:76:gf_auth] 0-auth/login:
    allowed user names: 2d225def-3f34-472a-b8e4-c183acafc151
    [2018-01-18 08:38:39.040853] I [MSGID: 115029]
    [server-handshake.c:793:server_setvolume] 0-home-server: accepted
    client from
    gluster00.cluster.local-21570-2018/01/18-08:38:38:940331-home-client-0-0-0
    (version: 3.12.4)
    [2018-01-18 08:38:39.412234] I [MSGID: 115036]
    [server.c:527:server_rpc_notify] 0-home-server: disconnecting
    connection from
    gluster00.cluster.local-21570-2018/01/18-08:38:38:940331-home-client-0-0-0
    [2018-01-18 08:38:39.435180] I [MSGID: 101055]
    [client_t.c:443:gf_client_unref] 0-home-server: Shutting down
    connection
    storage00.railscluster.local-21570-2018/01/18-08:38:38:940331-home-client-0-0-0
    [2018-01-18 08:38:41.203430] I [addr.c:55:compare_addr_and_update]
    0-/data/home/brick1: allowed = "*", received addr = "*.*.*.*"
    [2018-01-18 08:38:41.203464] I [login.c:76:gf_auth] 0-auth/login:
    allowed user names: 2d225def-3f34-472a-b8e4-c183acafc151
    [2018-01-18 08:38:41.203502] I [MSGID: 115029]
    [server-handshake.c:793:server_setvolume] 0-home-server: accepted
    client from
    gluster00.cluster.locall-21590-2018/01/18-08:38:41:174726-home-client-0-0-0
    (version: 3.12.4)
    [2018-01-18 08:38:41.787891] I [MSGID: 115036]
    [server.c:527:server_rpc_notify] 0-home-server: disconnecting
    connection from
    gluster00.cluster.local-21590-2018/01/18-08:38:41:174726-home-client-0-0-0
    [2018-01-18 08:38:41.790071] I [MSGID: 101055]
    [client_t.c:443:gf_client_unref] 0-home-server: Shutting down
    connection
    gluster00.cluster.local.local-21590-2018/01/18-08:38:41:174726-home-client-0-0-0
    [2018-01-18 08:38:56.298125] I [glusterfsd-mgmt.c:52:mgmt_cbk_spec]
    0-mgmt: Volume file changed
    [2018-01-18 08:38:56.384120] I [glusterfsd-mgmt.c:52:mgmt_cbk_spec]
    0-mgmt: Volume file changed
    [2018-01-18 08:38:56.394284] I
    [glusterfsd-mgmt.c:1821:mgmt_getspec_cbk] 0-glusterfs: No change in
    volfile,continuing
    [2018-01-18 08:38:56.450621] I
    [glusterfsd-mgmt.c:1821:mgmt_getspec_cbk] 0-glusterfs: No change in
    volfile,continuing
    [2018-01-18 08:39:17.606237] E [MSGID: 113107]
    [posix.c:1150:posix_seek] 0-home-posix: seek failed on fd 639 length
    1166 [No such device or address]
    [2018-01-18 08:39:17.606310] E [MSGID: 115089]
    [server-rpc-fops.c:2090:server_seek_cbk] 0-home-server: 1432655:
    SEEK-2 (c20376ae-7db9-4340-950e-dba6aa95e848), client:
    client1.cluster.local-5710-2018/01/17-22:02:07:122452-home-client-0-0-0,
    error-xlator: home-posix [No such device or address]
    [2018-01-18 08:39:17.610233] E [MSGID: 113107]
    [posix.c:1150:posix_seek] 0-home-posix: seek failed on fd 639 length
    1166 [No such device or address]
    [2018-01-18 08:39:17.610285] E [MSGID: 115089]
    [server-rpc-fops.c:2090:server_seek_cbk] 0-home-server: 1432656:
    SEEK-2 (c20376ae-7db9-4340-950e-dba6aa95e848), client:
    client1.cluster.local-5710-2018/01/17-22:02:07:122452-home-client-0-0-0,
    error-xlator: home-posix [No such device or address]
    [2018-01-18 08:37:53.592700] E [MSGID: 113107]
    [posix.c:1150:posix_seek] 0-home-posix: seek failed on fd 1993
    length 2279 [No such device or address]
    [2018-01-18 08:39:53.583741] E [MSGID: 113107]
    [posix.c:1150:posix_seek] 0-home-posix: seek failed on fd 1948
    length 1758 [No such device or address]
    [2018-01-18 08:39:53.583822] E [MSGID: 115089]
    [server-rpc-fops.c:2090:server_seek_cbk] 0-home-server: 641734:
    SEEK-2 (96dc4c82-ab85-4477-a6e6-1116466f68ba), client:
    client1.cluster.local-5772-2018/01/17-22:02:07:201351-home-client-0-0-0,
    error-xlator: home-posix [No such device or address]





Met vriendelijke groet, With kind regards,

Jorick Astrego

Netbulae Virtualization Experts 

----------------

	Tel: 053 20 30 270 	info at netbulae.eu 	Staalsteden 4-3A 	KvK 08198180
 	Fax: 053 20 30 271 	www.netbulae.eu 	7547 TA Enschede 	BTW NL821234584B01

----------------

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.gluster.org/pipermail/gluster-users/attachments/20180118/a21f2e77/attachment.html>


More information about the Gluster-users mailing list