[Bugs] [Bug 1219358] New: Disperse volume: client crashed while running iozone

Thu May 7 07:22:32 UTC 2015

https://bugzilla.redhat.com/show_bug.cgi?id=1219358

            Bug ID: 1219358
           Summary: Disperse volume: client crashed while running iozone
           Product: GlusterFS
           Version: 3.7.0
         Component: disperse
          Keywords: Triaged
          Assignee: bugs at gluster.org
          Reporter: aspandey at redhat.com
                CC: bugs at gluster.org, byarlaga at redhat.com,
                    gluster-bugs at redhat.com, iesool at 163.com,
                    pkarampu at redhat.com
        Depends On: 1188242, 1192378
            Blocks: 1186580 (qe_tracker_everglades)

+++ This bug was initially created as a clone of Bug #1188242 +++

Description of problem:
=======================

The volume was FUSE-mounted on the client and iozone was run on 10 files in
parallel using the command below. The gluster client process crashed, and a
subsequent cd to the mount point fails with "Transport endpoint is not
connected".

for i in `seq 1 10`; do /opt/iozone3_430/src/current/iozone -az -i0 -i1 & done

Version-Release number of selected component (if applicable):
=============================================================
glusterfs 3.7dev built on Feb  2 2015 01:04:49

Package Information:
====================
Downloaded from :

http://download.gluster.org/pub/gluster/glusterfs/nightly/glusterfs/epel-6-x86_64/glusterfs-3.7dev-0.555.gite927623.autobuild/

How reproducible:
100%

Steps to Reproduce:
===================

1. Create a fuse mount
2. Run iozone as follows:
   for i in `seq 1 10`; do ./iozone3_430/src/current/iozone -az -i0 -i1 & done

Number of volumes :
===================
1

Volume Names:
=============
testvol

Volume on which the particular issue is seen [ if applicable ] :
================================================================
testvol 

Type of volumes :
=================
Disperse (1x(4+2))

Volume options if available :
=============================

[root at dhcp37-178 ~]# gluster volume get testvol all
Option                                  Value                                   
------                                  -----                                   
cluster.lookup-unhashed                 on                                      
cluster.min-free-disk                   10%                                     
cluster.min-free-inodes                 5%                                      
cluster.rebalance-stats                 off                                     
cluster.subvols-per-directory           (null)                                  
cluster.readdir-optimize                off                                     
cluster.rsync-hash-regex                (null)                                  
cluster.extra-hash-regex                (null)                                  
cluster.dht-xattr-name                  trusted.glusterfs.dht                   
cluster.randomize-hash-range-by-gfid    off                                     
cluster.local-volume-name               (null)                                  
cluster.weighted-rebalance              on                                      
cluster.switch-pattern                  (null)                                  
cluster.entry-change-log                on                                      
cluster.read-subvolume                  (null)                                  
cluster.read-subvolume-index            -1                                      
cluster.read-hash-mode                  1                                       
cluster.background-self-heal-count      16                                      
cluster.metadata-self-heal              on                                      
cluster.data-self-heal                  on                                      
cluster.entry-self-heal                 on                                      
cluster.self-heal-daemon                on                                      
cluster.self-heal-window-size           1                                       
cluster.data-change-log                 on                                      
cluster.metadata-change-log             on                                      
cluster.data-self-heal-algorithm        (null)                                  
cluster.eager-lock                      on                                      
cluster.quorum-type                     none                                    
cluster.quorum-count                    (null)                                  
cluster.choose-local                    true                                    
cluster.self-heal-readdir-size          1KB                                     
cluster.post-op-delay-secs              1                                       
cluster.ensure-durability               on                                      
cluster.stripe-block-size               128KB                                   
cluster.stripe-coalesce                 true                                    
diagnostics.latency-measurement         off                                     
diagnostics.dump-fd-stats               off                                     
diagnostics.count-fop-hits              off                                     
diagnostics.brick-log-level             INFO                                    
diagnostics.client-log-level            INFO                                    
diagnostics.brick-sys-log-level         CRITICAL                                
diagnostics.client-sys-log-level        CRITICAL                                
diagnostics.brick-logger                (null)                                  
diagnostics.client-logger               (null)                                  
diagnostics.brick-log-format            (null)                                  
diagnostics.client-log-format           (null)                                  
diagnostics.brick-log-buf-size          5                                       
diagnostics.client-log-buf-size         5                                       
diagnostics.brick-log-flush-timeout     120                                     
diagnostics.client-log-flush-timeout    120                                     
performance.cache-max-file-size         0                                       
performance.cache-min-file-size         0                                       
performance.cache-refresh-timeout       1                                       
performance.cache-priority                                                      
performance.cache-size                  32MB                                    
performance.io-thread-count             16                                      
performance.high-prio-threads           16                                      
performance.normal-prio-threads         16                                      
performance.low-prio-threads            16                                      
performance.least-prio-threads          1                                       
performance.enable-least-priority       on                                      
performance.least-rate-limit            0                                       
performance.cache-size                  128MB                                   
performance.flush-behind                on                                      
performance.nfs.flush-behind            on                                      
performance.write-behind-window-size    1MB                                     
performance.nfs.write-behind-window-size 1MB
performance.strict-o-direct             off                                     
performance.nfs.strict-o-direct         off                                     
performance.strict-write-ordering       off                                     
performance.nfs.strict-write-ordering   off                                     
performance.lazy-open                   yes                                     
performance.read-after-open             no                                      
performance.read-ahead-page-count       4                                       
performance.md-cache-timeout            1                                       
features.encryption                     off                                     
encryption.master-key                   (null)                                  
encryption.data-key-size                256                                     
encryption.block-size                   4096                                    
network.frame-timeout                   1800                                    
network.ping-timeout                    42                                      
network.tcp-window-size                 (null)                                  
features.lock-heal                      off                                     
features.grace-timeout                  10                                      
network.remote-dio                      disable                                 
network.tcp-window-size                 (null)                                  
network.inode-lru-limit                 16384                                   
auth.allow                              *                                       
auth.reject                             (null)                                  
transport.keepalive                     (null)                                  
server.allow-insecure                   (null)                                  
server.root-squash                      off                                     
server.anonuid                          65534                                   
server.anongid                          65534                                   
server.statedump-path                   /var/run/gluster                        
server.outstanding-rpc-limit            64                                      
features.lock-heal                      off                                     
features.grace-timeout                  (null)                                  
server.ssl                              (null)                                  
auth.ssl-allow                          *                                       
server.manage-gids                      off                                     
client.send-gids                        on                                      
server.gid-timeout                      2                                       
server.own-thread                       (null)                                  
performance.write-behind                on                                      
performance.read-ahead                  on                                      
performance.readdir-ahead               off                                     
performance.io-cache                    on                                      
performance.quick-read                  on                                      
performance.open-behind                 on                                      
performance.stat-prefetch               on                                      
performance.client-io-threads           off                                     
performance.nfs.write-behind            on                                      
performance.nfs.read-ahead              off                                     
performance.nfs.io-cache                off                                     
performance.nfs.quick-read              off                                     
performance.nfs.stat-prefetch           off                                     
performance.nfs.io-threads              off                                     
performance.force-readdirp              true                                    
features.file-snapshot                  off                                     
features.uss                            off                                     
features.snapshot-directory             .snaps                                  
features.show-snapshot-directory        off                                     
network.compression                     off                                     
network.compression.window-size         -15                                     
network.compression.mem-level           8                                       
network.compression.min-size            0                                       
network.compression.compression-level   -1                                      
network.compression.debug               false                                   
features.limit-usage                    (null)                                  
features.quota-timeout                  0                                       
features.default-soft-limit             80%                                     
features.soft-timeout                   60                                      
features.hard-timeout                   5                                       
features.alert-time                     86400                                   
features.quota-deem-statfs              off                                     
geo-replication.indexing                off                                     
geo-replication.indexing                off                                     
geo-replication.ignore-pid-check        off                                     
geo-replication.ignore-pid-check        off                                     
features.quota                          on                                      
debug.trace                             off                                     
debug.log-history                       no                                      
debug.log-file                          no                                      
debug.exclude-ops                       (null)                                  
debug.include-ops                       (null)                                  
debug.error-gen                         off                                     
debug.error-failure                     (null)                                  
debug.error-number                      (null)                                  
debug.random-failure                    off                                     
debug.error-fops                        (null)                                  
nfs.enable-ino32                        no                                      
nfs.mem-factor                          15                                      
nfs.export-dirs                         on                                      
nfs.export-volumes                      on                                      
nfs.addr-namelookup                     off                                     
nfs.dynamic-volumes                     off                                     
nfs.register-with-portmap               on                                      
nfs.outstanding-rpc-limit               16                                      
nfs.port                                2049                                    
nfs.rpc-auth-unix                       on                                      
nfs.rpc-auth-null                       on                                      
nfs.rpc-auth-allow                      all                                     
nfs.rpc-auth-reject                     none                                    
nfs.ports-insecure                      off                                     
nfs.trusted-sync                        off                                     
nfs.trusted-write                       off                                     
nfs.volume-access                       read-write                              
nfs.export-dir                                                                  
nfs.disable                             false                                   
nfs.nlm                                 on                                      
nfs.acl                                 on                                      
nfs.mount-udp                           off                                     
nfs.mount-rmtab                         /var/lib/glusterd/nfs/rmtab             
nfs.rpc-statd                           /sbin/rpc.statd                         
nfs.server-aux-gids                     off                                     
nfs.drc                                 off                                     
nfs.drc-size                            0x20000                                 
nfs.read-size                           (1 * 1048576ULL)                        
nfs.write-size                          (1 * 1048576ULL)                        
nfs.readdir-size                        (1 * 1048576ULL)                        
features.read-only                      off                                     
features.worm                           off                                     
storage.linux-aio                       off                                     
storage.batch-fsync-mode                reverse-fsync                           
storage.batch-fsync-delay-usec          0                                       
storage.owner-uid                       -1                                      
storage.owner-gid                       -1                                      
storage.node-uuid-pathinfo              off                                     
storage.health-check-interval           30                                      
storage.build-pgfid                     off                                     
storage.bd-aio                          off                                     
cluster.server-quorum-type              off                                     
cluster.server-quorum-ratio             0                                       
changelog.changelog                     off                                     
changelog.changelog-dir                 (null)                                  
changelog.encoding                      ascii                                   
changelog.rollover-time                 15                                      
changelog.fsync-interval                5                                       
changelog.changelog-barrier-timeout     120                                     
features.barrier                        disable                                 
features.barrier-timeout                120                                     
locks.trace                             disable                                 
cluster.disperse-self-heal-daemon       enable                                  
[root at dhcp37-178 ~]# 

Output of gluster volume info :
================================
[root at dhcp37-178 ~]# gluster v info

Volume Name: testvol
Type: Disperse
Volume ID: ad1a31fb-2e69-4d5d-9ae0-d057879b8fd5
Status: Started
Number of Bricks: 1 x (4 + 2) = 6
Transport-type: tcp
Bricks:
Brick1:
dhcp37-120:/var/run/gluster/snaps/1e9ced492e2048cf9f906f45a4869238/brick1/b1
Brick2:
dhcp37-208:/var/run/gluster/snaps/1e9ced492e2048cf9f906f45a4869238/brick2/b1
Brick3:
dhcp37-178:/var/run/gluster/snaps/1e9ced492e2048cf9f906f45a4869238/brick3/b1
Brick4:
dhcp37-183:/var/run/gluster/snaps/1e9ced492e2048cf9f906f45a4869238/brick4/b1
Brick5:
dhcp37-120:/var/run/gluster/snaps/1e9ced492e2048cf9f906f45a4869238/brick5/b2
Brick6:
dhcp37-208:/var/run/gluster/snaps/1e9ced492e2048cf9f906f45a4869238/brick6/b2
Options Reconfigured:
features.uss: off
features.quota: on
[root at dhcp37-178 ~]# 

Output of gluster volume status :
=================================
[root at dhcp37-178 ~]# gluster v status
Status of volume: testvol
Gluster process                                                                     Port   Online  Pid
-------------------------------------------------------------------------------------------------------
Brick dhcp37-120:/var/run/gluster/snaps/1e9ced492e2048cf9f906f45a4869238/brick1/b1  49156  Y       3225
Brick dhcp37-208:/var/run/gluster/snaps/1e9ced492e2048cf9f906f45a4869238/brick2/b1  49167  Y       3238
Brick dhcp37-178:/var/run/gluster/snaps/1e9ced492e2048cf9f906f45a4869238/brick3/b1  49166  Y       3192
Brick dhcp37-183:/var/run/gluster/snaps/1e9ced492e2048cf9f906f45a4869238/brick4/b1  49166  Y       3173
Brick dhcp37-120:/var/run/gluster/snaps/1e9ced492e2048cf9f906f45a4869238/brick5/b2  49157  Y       3236
Brick dhcp37-208:/var/run/gluster/snaps/1e9ced492e2048cf9f906f45a4869238/brick6/b2  49168  Y       3249
NFS Server on localhost                                                             2049   Y       3206
Quota Daemon on localhost                                                           N/A    Y       3221
NFS Server on dhcp37-208                                                            2049   Y       3262
Quota Daemon on dhcp37-208                                                          N/A    Y       3276
NFS Server on dhcp37-183                                                            2049   Y       3186
Quota Daemon on dhcp37-183                                                          N/A    Y       3199
NFS Server on 10.70.37.120                                                          2049   Y       3250
Quota Daemon on 10.70.37.120                                                        N/A    Y       3263

Task Status of Volume testvol
------------------------------------------------------------------------------
There are no active volume tasks

[root at dhcp37-178 ~]# 

Actual results:
================

Gluster client crashed

Expected results:
================

The client should not crash.

Additional info:
================
Attaching the client mount log.

--- Additional comment from Bhaskarakiran on 2015-02-24 06:33:12 EST ---

--- Additional comment from Bhaskarakiran on 2015-02-24 06:34:39 EST ---

Log snippet: 
============

pending frames:
frame : type(1) op(LOOKUP)
frame : type(1) op(LOOKUP)
frame : type(1) op(FTRUNCATE)
frame : type(0) op(0)
frame : type(1) op(UNLINK)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(1) op(FLUSH)
frame : type(1) op(STAT)
frame : type(0) op(0)
patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash:
2015-02-24 11:41:47
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.7dev
/usr/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xb6)[0x306ae20aa6]
/usr/lib64/libglusterfs.so.0(gf_print_trace+0x33f)[0x306ae3bdcf]
/lib64/libc.so.6[0x342d4326a0]
/usr/lib64/glusterfs/3.7dev/xlator/cluster/distribute.so(dht_writev_cbk+0x268)[0x7f300993cbf8]
/usr/lib64/libglusterfs.so.0(default_writev_cbk+0xcc)[0x306ae2e5ec]
/usr/lib64/glusterfs/3.7dev/xlator/cluster/disperse.so(ec_manager_writev+0x10d)[0x7f3009b8647d]
/usr/lib64/glusterfs/3.7dev/xlator/cluster/disperse.so(__ec_manager+0x34)[0x7f3009b6a654]
/usr/lib64/glusterfs/3.7dev/xlator/cluster/disperse.so(ec_resume+0x91)[0x7f3009b6a461]
/usr/lib64/glusterfs/3.7dev/xlator/cluster/disperse.so(ec_combine+0x196)[0x7f3009b88fa6]
/usr/lib64/glusterfs/3.7dev/xlator/cluster/disperse.so(ec_writev_cbk+0x27b)[0x7f3009b844bb]
/usr/lib64/glusterfs/3.7dev/xlator/protocol/client.so(client3_3_writev_cbk+0x6cc)[0x7f3009de301c]
/usr/lib64/libgfrpc.so.0(rpc_clnt_handle_reply+0xa5)[0x306aa0ea65]
/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x142)[0x306aa0ff02]
/usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x28)[0x306aa0b5f8]
/usr/lib64/glusterfs/3.7dev/rpc-transport/socket.so(+0x9759)[0x7f30103fc759]
/usr/lib64/glusterfs/3.7dev/rpc-transport/socket.so(+0xb1bd)[0x7f30103fe1bd]
/usr/lib64/libglusterfs.so.0[0x306ae78ffc]
/lib64/libpthread.so.0[0x342d8079d1]
/lib64/libc.so.6(clone+0x6d)[0x342d4e89dd]
---------

--- Additional comment from Ashish Pandey on 2015-03-03 23:59:25 EST ---

dht_fsync_cbk() is being called with op_ret = -1, op_errno = 2 (ENOENT), and
both prebuf and postbuf set to NULL.
Inside dht_fsync_cbk(), the error-handling branch
( if (op_ret == -1 && !dht_inode_missing(op_errno)) )
is skipped when op_errno = ENOENT, so control falls through to
if (IS_DHT_MIGRATION_PHASE1 (postbuf))
The IS_DHT_MIGRATION_PHASE1 macro tries to read the file attributes through
the postbuf pointer, which is NULL. This leads to the crash.
The change made for bug 960843 excluded op_errno = ENOENT from this error
handling.

We need to investigate why the op_errno = ENOENT case is skipped, and also
modify the macro definitions to handle NULL pointers properly.
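
A minimal sketch of the NULL-guard idea described above, not the actual
GlusterFS code. struct attr_sketch, its fields, PHASE1_UNSAFE, PHASE1_SAFE and
check_migration_sketch() are hypothetical stand-ins chosen for illustration,
and the real IS_DHT_MIGRATION_PHASE1 definition may test different fields. It
only shows how a guarded macro avoids dereferencing a NULL postbuf when the
operation has already failed.

```c
#include <stddef.h>

/* Hypothetical stand-in for the attribute buffer handed to the callback.
 * The field names are assumptions, not the GlusterFS struct layout. */
struct attr_sketch {
    unsigned int sgid   : 1;
    unsigned int sticky : 1;
    int          is_dir;
};

/* Unguarded form: dereferences buf unconditionally, so a NULL buf faults. */
#define PHASE1_UNSAFE(buf) \
    ((buf)->sgid == 1 && (buf)->sticky == 1 && !(buf)->is_dir)

/* Guarded form: a NULL buffer is simply "not in migration phase 1".
 * The && short-circuits, so buf is never dereferenced when it is NULL. */
#define PHASE1_SAFE(buf) ((buf) != NULL && PHASE1_UNSAFE(buf))

/* Returns 1 if phase-1 handling would run, 0 if it would not, and -1 when
 * the operation failed or there is no buffer to inspect. */
int check_migration_sketch(int op_ret, const struct attr_sketch *postbuf)
{
    if (op_ret == -1 || postbuf == NULL)
        return -1;              /* bail out before touching postbuf */
    return PHASE1_SAFE(postbuf) ? 1 : 0;
}
```

With this shape, check_migration_sketch(-1, NULL) returns -1 instead of
faulting, which corresponds to the op_ret = -1, postbuf = NULL case reported
in this comment.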

--- Additional comment from Pranith Kumar K on 2015-03-09 02:44:05 EDT ---

Ashish,
     I just realized that on an active fd, fsync should never return
ESTALE/ENOENT, since the fd is already open on the file. Why is EC returning
this error? Could this be an EC bug after all?

Pranith
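
The point about an open fd can be illustrated at the plain POSIX level. This
is an analogy only, not GlusterFS or EC code, and fd_outlives_path() is a
hypothetical helper: once a file is open, operations on the descriptor keep
working after the name is unlinked, while a path-based lookup fails with
ENOENT.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Returns 1 when, after unlink(), fstat() on the still-open descriptor
 * succeeds while stat() on the old path fails with ENOENT.
 * Returns 0 if that is not observed, and -1 if the setup itself fails. */
int fd_outlives_path(void)
{
    char path[] = "/tmp/fd_vs_path_XXXXXX";
    struct stat st;

    int fd = mkstemp(path);          /* create and open a temporary file */
    if (fd < 0)
        return -1;
    if (unlink(path) != 0) {         /* remove the name, keep the fd open */
        close(fd);
        return -1;
    }

    int by_fd = fstat(fd, &st);      /* fd-based: the open file still exists */
    int by_path = stat(path, &st);   /* path-based: the name is gone */
    int path_errno = errno;          /* meaningful only if by_path == -1 */

    close(fd);
    return (by_fd == 0 && by_path == -1 && path_errno == ENOENT) ? 1 : 0;
}
```

On a POSIX system this is expected to return 1, which is why an ENOENT on an
fd-based fsync looks suspicious and why resolving by path (loc) and by
descriptor (fd) can give different answers.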

--- Additional comment from Anand Avati on 2015-04-09 08:21:47 EDT ---

REVIEW: http://review.gluster.org/10176 (cluster/ec: Use fd instead of loc for
get_size_version) posted (#1) for review on master by Ashish Pandey
(aspandey at redhat.com)

--- Additional comment from Anand Avati on 2015-04-13 07:19:29 EDT ---

REVIEW: http://review.gluster.org/10176 (cluster/ec: Use fd instead of loc for
get_size_version) posted (#2) for review on master by Ashish Pandey
(aspandey at redhat.com)

--- Additional comment from Anand Avati on 2015-04-13 07:19:32 EDT ---

REVIEW: http://review.gluster.org/10218 (Comments implemeted) posted (#1) for
review on master by Ashish Pandey (aspandey at redhat.com)

--- Additional comment from Anand Avati on 2015-04-14 05:45:23 EDT ---

REVIEW: http://review.gluster.org/10176 (cluster/ec: Use fd instead of loc for
get_size_version) posted (#3) for review on master by Ashish Pandey
(aspandey at redhat.com)

--- Additional comment from Anand Avati on 2015-04-28 02:06:23 EDT ---

REVIEW: http://review.gluster.org/10176 (cluster/ec: Use fd instead of loc for
get_size_version) posted (#4) for review on master by Ashish Pandey
(aspandey at redhat.com)

--- Additional comment from Anand Avati on 2015-05-01 11:04:55 EDT ---

REVIEW: http://review.gluster.org/10176 (cluster/ec: Use fd instead of loc for
get_size_version) posted (#5) for review on master by Ashish Pandey
(aspandey at redhat.com)

--- Additional comment from Anand Avati on 2015-05-03 07:46:40 EDT ---

REVIEW: http://review.gluster.org/10176 (cluster/ec: Use fd instead of loc for
get_size_version) posted (#6) for review on master by Ashish Pandey
(aspandey at redhat.com)

--- Additional comment from Anand Avati on 2015-05-04 07:37:02 EDT ---

REVIEW: http://review.gluster.org/10176 (cluster/ec: Use fd instead of loc for
get_size_version) posted (#7) for review on master by Ashish Pandey
(aspandey at redhat.com)

--- Additional comment from Anand Avati on 2015-05-04 22:43:51 EDT ---

COMMIT: http://review.gluster.org/10176 committed in master by Pranith Kumar
Karampuri (pkarampu at redhat.com) 
------
commit 582b252e3a418ee332cf3d4b1a415520e242b599
Author: Ashish Pandey <aspandey at redhat.com>
Date:   Thu Apr 9 17:27:46 2015 +0530

    cluster/ec: Use fd instead of loc for get_size_version

    Change-Id: Ia7d43cb3b222db34ecb0e35424f1766715ed8e6a
    BUG: 1188242
    Signed-off-by: Ashish Pandey <aspandey at redhat.com>
    Reviewed-on: http://review.gluster.org/10176
    Reviewed-by: Xavier Hernandez <xhernandez at datalab.es>
    Tested-by: Gluster Build System <jenkins at build.gluster.com>

Referenced Bugs:

https://bugzilla.redhat.com/show_bug.cgi?id=1186580
[Bug 1186580] QE tracker bug for Everglades
https://bugzilla.redhat.com/show_bug.cgi?id=1188242
[Bug 1188242] Disperse volume: client crashed while running iozone
https://bugzilla.redhat.com/show_bug.cgi?id=1192378
[Bug 1192378] Disperse volume: client crashed while running renames with
epoll enabled