[Gluster-users] setting gfid on .trashcan/... failed - total outage

Anoop C S anoopcs at autistici.org
Fri Jun 30 12:26:39 UTC 2017


On Thu, 2017-06-29 at 17:13 +0200, Dietmar Putz wrote:
> Hello Anoop,
> 
> thank you for your reply....
> 
> answers inside...
> 
> best regards
> 
> Dietmar
> 
> 
> On 29.06.2017 10:48, Anoop C S wrote:
> > On Wed, 2017-06-28 at 14:42 +0200, Dietmar Putz wrote:
> > > Hello,
> > > 
> > > recently we had two times a partial gluster outage followed by a total
> > > outage of all four nodes. Looking into the gluster mailing list i found
> > > a very similar case in
> > > http://lists.gluster.org/pipermail/gluster-users/2016-June/027124.html
> > 
> > If you are talking about a crash happening on bricks, were you able to find any backtraces from
> > any
> > of the brick logs?
> 
> yes, the crash happened on the bricks.
> i followed the hints in the mentioned similar case, but unfortunately i 
> did not find a backtrace in any of the brick logs.

Usually a backtrace is written to the logs just before the brick process dies in case of a SIGSEGV.
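A hedged sketch of how one might scan the brick logs for that block (the helper name and the default Debian/Ubuntu log path are assumptions for illustration, not taken from this thread):

```shell
# Hypothetical helper: print the crash-handler block ("pending frames:",
# "signal received: <n>", followed by the frame list) that glusterfsd
# writes to the brick log just before dying on SIGSEGV.
find_crash_backtrace() {
    grep -B 2 -A 20 'signal received' "$1" 2>/dev/null
}

# Typical use, assuming the default brick log location on Ubuntu:
# find_crash_backtrace /var/log/glusterfs/bricks/brick1-mvol1.log
```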

> 
> > 
> > > but i'm not sure if this issue is fixed...
> > > 
> > > even though this outage happened on glusterfs 3.7.18, which gets no
> > > more updates since ~.20, i would kindly ask whether this issue is
> > > known to be fixed in 3.8 resp. 3.10...?
> > > unfortunately i did not find any corresponding information in the
> > > release notes...
> > > 
> > > best regards
> > > Dietmar
> > > 
> > > 
> > > the partial outage started as shown below, the very first entries
> > > occurred in the brick-logs :
> > > 
> > > gl-master-04, brick1-mvol1.log :
> > > 
> > > [2017-06-23 16:35:11.373471] E [MSGID: 113020]
> > > [posix.c:2839:posix_create] 0-mvol1-posix: setting gfid on
> > > /brick1/mvol1/.trashcan//2290/uploads/170221_Sendung_Lieberum_01_AT.mp4_2017-06-23_163511
> > > failed
> > > [2017-06-23 16:35:11.392540] E [posix.c:3188:_fill_writev_xdata]
> > > (-->/usr/lib/x86_64-linux-
> > > gnu/glusterfs/3.7.18/xlator/features/trash.so(trash_truncate_readv_cbk+0x1ab)
> > > [0x7f4f8c2aaa0b] -->/usr/lib/x86_64-linux-gnu/glusterfs/3.7.18/xlator/
> > > storage/posix.so(posix_writev+0x1ff) [0x7f4f8caec62f]
> > > -->/usr/lib/x86_64-linux-
> > > gnu/glusterfs/3.7.18/xlator/storage/posix.so(_fill_writev_xdata+0x1c6)
> > > [0x7f4f8caec406] ) 0-mvol1-posix: fd: 0x7f4ef434225c inode:
> > > 0x7f4ef430bd6cgfid:00000000-0
> > > 000-0000-0000-000000000000 [Invalid argument]
> > > ...
> > > 
> > > 
> > > gl-master-04 : etc-glusterfs-glusterd.vol.log
> > > 
> > > [2017-06-23 16:35:18.872346] W [rpcsvc.c:270:rpcsvc_program_actor]
> > > 0-rpc-service: RPC program not available (req 1298437 330) for
> > > 10.0.1.203:65533
> > > [2017-06-23 16:35:18.872421] E
> > > [rpcsvc.c:565:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed
> > > to complete successfully
> > > 
> > > gl-master-04 : glustershd.log
> > > 
> > > [2017-06-23 16:35:42.536840] E [MSGID: 108006]
> > > [afr-common.c:4323:afr_notify] 0-mvol1-replicate-1: All subvolumes are
> > > down. Going offline until atleast one of them comes back up.
> > > [2017-06-23 16:35:51.702413] E [socket.c:2292:socket_connect_finish]
> > > 0-mvol1-client-3: connection to 10.0.1.156:49152 failed (Connection refused)
> > > 
> > > 
> > > 
> > > gl-master-03, brick1-movl1.log :
> > > 
> > > [2017-06-23 16:35:11.399769] E [MSGID: 113020]
> > > [posix.c:2839:posix_create] 0-mvol1-posix: setting gfid on
> > > /brick1/mvol1/.trashcan//2290/uploads/170221_Sendung_Lieberum_01_AT.mp4_2017-06-23_163511
> > > failed
> > > [2017-06-23 16:35:11.418559] E [posix.c:3188:_fill_writev_xdata]
> > > (-->/usr/lib/x86_64-linux-
> > > gnu/glusterfs/3.7.18/xlator/features/trash.so(trash_truncate_readv_cbk+0x1ab)
> > > [0x7ff517087a0b] -->/usr/lib/x86_64-linux-gnu/glusterfs/3.7.18/xlator/
> > > storage/posix.so(posix_writev+0x1ff) [0x7ff5178c962f]
> > > -->/usr/lib/x86_64-linux-
> > > gnu/glusterfs/3.7.18/xlator/storage/posix.so(_fill_writev_xdata+0x1c6)
> > > [0x7ff5178c9406] ) 0-mvol1-posix: fd: 0x7ff4c814a43c inode:
> > > 0x7ff4c82e1b5cgfid:00000000-0
> > > 000-0000-0000-000000000000 [Invalid argument]
> > > ...
> > > 
> > > 
> > > gl-master-03 : etc-glusterfs-glusterd.vol.log
> > > 
> > > [2017-06-23 16:35:19.879140] W [rpcsvc.c:270:rpcsvc_program_actor]
> > > 0-rpc-service: RPC program not available (req 1298437 330) for
> > > 10.0.1.203:65530
> > > [2017-06-23 16:35:19.879201] E
> > > [rpcsvc.c:565:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed
> > > to complete successfully
> > > [2017-06-23 16:35:19.879300] W [rpcsvc.c:270:rpcsvc_program_actor]
> > > 0-rpc-service: RPC program not available (req 1298437 330) for
> > > 10.0.1.203:65530
> > > [2017-06-23 16:35:19.879314] E
> > > [rpcsvc.c:565:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed
> > > to complete successfully
> > > [2017-06-23 16:35:19.879845] W [rpcsvc.c:270:rpcsvc_program_actor]
> > > 0-rpc-service: RPC program not available (req 1298437 330) for
> > > 10.0.1.203:65530
> > > [2017-06-23 16:35:19.879859] E
> > > [rpcsvc.c:565:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed
> > > to complete successfully
> > > [2017-06-23 16:35:42.538727] W [socket.c:596:__socket_rwv] 0-management:
> > > readv on /var/run/gluster/5e23d9709b37ac7877720ac3986c48bc.socket failed
> > > (No data available)
> > > [2017-06-23 16:35:42.543486] I [MSGID: 106005]
> > > [glusterd-handler.c:5037:__glusterd_brick_rpc_notify] 0-management:
> > > Brick gl-master-03-int:/brick1/mvol1 has disconnected from glusterd.
> > > 
> > > 
> > > gl-master-03 : glustershd.log
> > > 
> > > [2017-06-23 16:35:42.537752] E [MSGID: 108006]
> > > [afr-common.c:4323:afr_notify] 0-mvol1-replicate-1: All subvolumes are
> > > down. Going offline until atleast one of them comes back up.
> > > [2017-06-23 16:35:52.011016] E [socket.c:2292:socket_connect_finish]
> > > 0-mvol1-client-3: connection to 10.0.1.156:49152 failed (Connection refused)
> > > [2017-06-23 16:35:53.010620] E [socket.c:2292:socket_connect_finish]
> > > 0-mvol1-client-2: connection to 10.0.1.154:49152 failed (Connection refused)
> > > 
> > > 
> > > 
> > > about 73 minutes later the remaining replicated pair was affected by the
> > > outage :
> > > 
> > > gl-master-02, brick1-mvol1.log :
> > > 
> > > [2017-06-23 17:48:30.093526] E [MSGID: 113018]
> > > [posix.c:2766:posix_create] 0-mvol1-posix: pre-operation lstat on parent
> > > /brick1/mvol1/.trashcan//2290/uploads failed [No such file or directory]
> > > [2017-06-23 17:48:30.093591] E [MSGID: 113018]
> > > [posix.c:1447:posix_mkdir] 0-mvol1-posix: pre-operation lstat on parent
> > > /brick1/mvol1/.trashcan//2290 failed [No such file or directory]
> > > [2017-06-23 17:48:30.093636] E [MSGID: 113027]
> > > [posix.c:1538:posix_mkdir] 0-mvol1-posix: mkdir of /brick1/mvol1/ failed
> > > [File exists]
> > > [2017-06-23 17:48:30.093670] E [MSGID: 113027]
> > > [posix.c:1538:posix_mkdir] 0-mvol1-posix: mkdir of
> > > /brick1/mvol1/.trashcan failed [File exists]
> > > [2017-06-23 17:48:30.093701] E [MSGID: 113027]
> > > [posix.c:1538:posix_mkdir] 0-mvol1-posix: mkdir of
> > > /brick1/mvol1/.trashcan/ failed [File exists]
> > > [2017-06-23 17:48:30.113559] E [MSGID: 113001]
> > > [posix.c:1562:posix_mkdir] 0-mvol1-posix: setting xattrs on
> > > /brick1/mvol1/.trashcan//2290 failed [No such file or directory]
> > > [2017-06-23 17:48:30.113630] E [MSGID: 113027]
> > > [posix.c:1538:posix_mkdir] 0-mvol1-posix: mkdir of
> > > /brick1/mvol1/.trashcan//2290 failed [File exists]
> > > [2017-06-23 17:48:30.163155] E [MSGID: 113001]
> > > [posix.c:1562:posix_mkdir] 0-mvol1-posix: setting xattrs on
> > > /brick1/mvol1/.trashcan//2290/uploads failed [No such file or directory]
> > > [2017-06-23 17:48:30.163282] E [MSGID: 113001]
> > > [posix.c:2832:posix_create] 0-mvol1-posix: setting xattrs on
> > > /brick1/mvol1/.trashcan//2290/uploads/170623_TVM_News.mp4_2017-06-23_174830
> > > failed  [No such file or directory]
> > > [2017-06-23 17:48:30.165617] E [posix.c:3188:_fill_writev_xdata]
> > > (-->/usr/lib/x86_64-linux-
> > > gnu/glusterfs/3.7.18/xlator/features/trash.so(trash_truncate_readv_cbk+0x1ab)
> > > [0x7f4ec77d9a0b] -->/usr/lib/x86_64-linux-gnu/glusterfs/3.7.18/xlator/
> > > storage/posix.so(posix_writev+0x1ff) [0x7f4ecc1c162f]
> > > -->/usr/lib/x86_64-linux-
> > > gnu/glusterfs/3.7.18/xlator/storage/posix.so(_fill_writev_xdata+0x1c6)
> > > [0x7f4ecc1c1406] ) 0-mvol1-posix: fd: 0x7f4e70429b6c inode:
> > > 0x7f4e7041f9acgfid:00000000-0
> > > 000-0000-0000-000000000000 [Invalid argument]
> > > 
> > > 
> > > the mentioned file in the brick-log was still available in the origin
> > > directory but not in the corresponding trashcan directory :
> > > 
> > > 
> > > [ 14:29:29 ] - root at gl-master-01  /var/log/glusterfs $ls -lh
> > > /sdn/2290/uploads/170221_Sendung_Lieberum_01_AT*
> > > -rw-r--r-- 1 2001 2001 386M Mar 31 13:00
> > > /sdn/2290/uploads/170221_Sendung_Lieberum_01_AT.mp4
> > > -rw-r--r-- 1 2001 2001 386M Jun  2 13:09
> > > /sdn/2290/uploads/170221_Sendung_Lieberum_01_AT_AT.mp4
> > > [ 15:08:53 ] - root at gl-master-01  /var/log/glusterfs $
> > > 
> > > 
> > > [ 15:11:04 ] - root at gl-master-01  /var/log/glusterfs $ls -lh
> > > /sdn/.trashcan/2290/uploads/170221_Sendung_Lieberum_01_AT*
> > > [ 15:11:10 ] - root at gl-master-01  /var/log/glusterfs $
> > > 
> > > 
> > > some further informations...the OS is ubuntu 16.04.2 lts, volume info
> > > below :
> > > 
> > > [ 11:31:53 ] - root at gl-master-03  ~ $gluster volume info mvol1
> > > 
> > > Volume Name: mvol1
> > > Type: Distributed-Replicate
> > > Volume ID: 2f5de6e4-66de-40a7-9f24-4762aad3ca96
> > > Status: Started
> > > Number of Bricks: 2 x 2 = 4
> > > Transport-type: tcp
> > > Bricks:
> > > Brick1: gl-master-01-int:/brick1/mvol1
> > > Brick2: gl-master-02-int:/brick1/mvol1
> > > Brick3: gl-master-03-int:/brick1/mvol1
> > > Brick4: gl-master-04-int:/brick1/mvol1
> > > Options Reconfigured:
> > > geo-replication.ignore-pid-check: on
> > > geo-replication.indexing: on
> > > nfs.disable: off
> > > diagnostics.client-log-level: ERROR
> > > changelog.changelog: on
> > > performance.cache-refresh-timeout: 32
> > > cluster.min-free-disk: 200GB
> > > network.ping-timeout: 5
> > > performance.io-thread-count: 64
> > > performance.cache-size: 8GB
> > > performance.readdir-ahead: on
> > > features.trash: off
> > 
> > mvol1 has the trash feature disabled, so you should not be seeing the above mentioned errors in
> > the brick logs any further.
> 
> yes, right after the second outage we decided to disable the trash 
> feature...
> 
> > 
> > > features.trash-max-filesize: 1GB
> > > [ 11:31:56 ] - root at gl-master-03  ~ $
> > > 
> > > 
> > > Host : gl-master-01
> > > -rw-r----- 1 root root 232M Jun 23 17:49
> > > /var/crash/_usr_sbin_glusterfsd.0.crash
> > > -----------------------------------------------------
> > > Host : gl-master-02
> > > -rw-r----- 1 root root 226M Jun 23 17:49
> > > /var/crash/_usr_sbin_glusterfsd.0.crash
> > > -----------------------------------------------------
> > > Host : gl-master-03
> > > -rw-r----- 1 root root 254M Jun 23 16:35
> > > /var/crash/_usr_sbin_glusterfsd.0.crash
> > > -----------------------------------------------------
> > > Host : gl-master-04
> > > -rw-r----- 1 root root 239M Jun 23 16:35
> > > /var/crash/_usr_sbin_glusterfsd.0.crash
> > > -----------------------------------------------------
> > 
> > If these are the core files dumped due to brick crash, can you please attach it to gdb as
> > follows
> > and paste the backtrace by executing the `bt` command within it.
> > 
> > $ gdb /usr/sbin/glusterfsd /var/crash/_usr_sbin_glusterfsd.0.crash
> > 
> > (gdb) bt
> 
> unfortunately another problem...even though the filename ends with 
> 'crash' and its creation time matches the time of the error, the file 
> _usr_sbin_glusterfsd.0.crash is not recognized as a core dump.
> currently i don't know how to handle this; i tried several things with no 
> success, so i add the 'head' of the file instead...
> 
> [ 14:47:37 ] - root at gl-master-03  ~ $gdb /usr/sbin/glusterfsd 
> /var/crash/_usr_sbin_glusterfsd.0.crash
> GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.04) 7.11.1
> ...
> "/var/crash/_usr_sbin_glusterfsd.0.crash" is not a core dump: File 
> format not recognised
> (gdb)
> 
> [ 14:48:30 ] - root at gl-master-03  ~ $file 
> /var/crash/_usr_sbin_glusterfsd.0.crash
> /var/crash/_usr_sbin_glusterfsd.0.crash: ASCII text, with very long lines
> [ 14:48:37 ] - root at gl-master-03  ~ $head 
> /var/crash/_usr_sbin_glusterfsd.0.crash
> ProblemType: Crash
> Architecture: amd64
> Date: Fri Jun 23 16:35:13 2017
> DistroRelease: Ubuntu 16.04
> ExecutablePath: /usr/sbin/glusterfsd
> ExecutableTimestamp: 1481112595
> ProcCmdline: /usr/sbin/glusterfsd -s gl-master-03-int --volfile-id 
> mvol1.gl-master-03-int.brick1-mvol1 -p 
> /var/lib/glusterd/vols/mvol1/run/gl-master-03-int-brick1-mvol1.pid -S 
> /var/run/gluster/5e23d9709b37ac7877720ac3986c48bc.socket --brick-name 
> /brick1/mvol1 -l /var/log/glusterfs/bricks/brick1-mvol1.log 
> --xlator-option 
> *-posix.glusterd-uuid=056fb1db-9a49-422d-81fb-94e1881313fd --brick-port 
> 49152 --xlator-option mvol1-server.listen-port=49152
> ProcCwd: /
> ProcEnviron:
>   LANGUAGE=en_GB:en
> [ 14:48:52 ] - root at gl-master-03  ~ $

Is there any information regarding the signal which caused the crash, like Signal: 11 or something
similar? Maybe you can use `apport-unpack` to extract the core dump (found via some searching.. not
at all familiar with it).
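The `file` output above already hints at the cause: apport writes an ASCII report, not a raw core, so gdb cannot read it directly. A small sketch for telling the two apart (the helper name is made up for illustration), followed by the unpack step:

```shell
# An apport .crash report is ASCII text starting with "ProblemType: Crash",
# while a raw core dump is an ELF binary starting with the \x7fELF magic.
# Hypothetical helper: succeeds if the file looks like an apport report.
is_apport_report() {
    head -c 12 "$1" 2>/dev/null | grep -q '^ProblemType'
}

# Recovery path (assumes the apport-unpack tool is installed; it extracts
# the report's fields as files, the core ending up in CoreDump):
# apport-unpack /var/crash/_usr_sbin_glusterfsd.0.crash /tmp/glusterfsd-crash
# gdb /usr/sbin/glusterfsd /tmp/glusterfsd-crash/CoreDump
```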


