[Bugs] [Bug 1219358] New: Disperse volume: client crashed while running iozone
bugzilla at redhat.com
Thu May 7 07:22:32 UTC 2015
https://bugzilla.redhat.com/show_bug.cgi?id=1219358
Bug ID: 1219358
Summary: Disperse volume: client crashed while running iozone
Product: GlusterFS
Version: 3.7.0
Component: disperse
Keywords: Triaged
Assignee: bugs at gluster.org
Reporter: aspandey at redhat.com
CC: bugs at gluster.org, byarlaga at redhat.com,
gluster-bugs at redhat.com, iesool at 163.com,
pkarampu at redhat.com
Depends On: 1188242, 1192378
Blocks: 1186580 (qe_tracker_everglades)
+++ This bug was initially created as a clone of Bug #1188242 +++
Description of problem:
=======================
Mounted the volume on the client over FUSE and ran iozone for 10 files in
parallel using the command below. The gluster client process crashed, and a
subsequent cd to the mount point fails with a "Transport endpoint is not
connected" error.
for i in `seq 1 10`; do /opt/iozone3_430/src/current/iozone -az -i0 -i1 & done
Version-Release number of selected component (if applicable):
=============================================================
glusterfs 3.7dev built on Feb 2 2015 01:04:49
Package Information:
====================
Downloaded from :
http://download.gluster.org/pub/gluster/glusterfs/nightly/glusterfs/epel-6-x86_64/glusterfs-3.7dev-0.555.gite927623.autobuild/
How reproducible:
100%
Steps to Reproduce:
===================
1. Create a fuse mount
2. Run 10 iozone instances in parallel: for i in `seq 1 10`; do
./iozone3_430/src/current/iozone -az -i0 -i1 & done
Number of volumes :
===================
1
Volume Names:
=============
testvol
Volume on which the particular issue is seen [ if applicable ] :
================================================================
testvol
Type of volumes :
=================
Disperse (1x(4+2))
Volume options if available :
=============================
[root at dhcp37-178 ~]# gluster volume get testvol all
Option Value
------ -----
cluster.lookup-unhashed on
cluster.min-free-disk 10%
cluster.min-free-inodes 5%
cluster.rebalance-stats off
cluster.subvols-per-directory (null)
cluster.readdir-optimize off
cluster.rsync-hash-regex (null)
cluster.extra-hash-regex (null)
cluster.dht-xattr-name trusted.glusterfs.dht
cluster.randomize-hash-range-by-gfid off
cluster.local-volume-name (null)
cluster.weighted-rebalance on
cluster.switch-pattern (null)
cluster.entry-change-log on
cluster.read-subvolume (null)
cluster.read-subvolume-index -1
cluster.read-hash-mode 1
cluster.background-self-heal-count 16
cluster.metadata-self-heal on
cluster.data-self-heal on
cluster.entry-self-heal on
cluster.self-heal-daemon on
cluster.self-heal-window-size 1
cluster.data-change-log on
cluster.metadata-change-log on
cluster.data-self-heal-algorithm (null)
cluster.eager-lock on
cluster.quorum-type none
cluster.quorum-count (null)
cluster.choose-local true
cluster.self-heal-readdir-size 1KB
cluster.post-op-delay-secs 1
cluster.ensure-durability on
cluster.stripe-block-size 128KB
cluster.stripe-coalesce true
diagnostics.latency-measurement off
diagnostics.dump-fd-stats off
diagnostics.count-fop-hits off
diagnostics.brick-log-level INFO
diagnostics.client-log-level INFO
diagnostics.brick-sys-log-level CRITICAL
diagnostics.client-sys-log-level CRITICAL
diagnostics.brick-logger (null)
diagnostics.client-logger (null)
diagnostics.brick-log-format (null)
diagnostics.client-log-format (null)
diagnostics.brick-log-buf-size 5
diagnostics.client-log-buf-size 5
diagnostics.brick-log-flush-timeout 120
diagnostics.client-log-flush-timeout 120
performance.cache-max-file-size 0
performance.cache-min-file-size 0
performance.cache-refresh-timeout 1
performance.cache-priority
performance.cache-size 32MB
performance.io-thread-count 16
performance.high-prio-threads 16
performance.normal-prio-threads 16
performance.low-prio-threads 16
performance.least-prio-threads 1
performance.enable-least-priority on
performance.least-rate-limit 0
performance.cache-size 128MB
performance.flush-behind on
performance.nfs.flush-behind on
performance.write-behind-window-size 1MB
performance.nfs.write-behind-window-size 1MB
performance.strict-o-direct off
performance.nfs.strict-o-direct off
performance.strict-write-ordering off
performance.nfs.strict-write-ordering off
performance.lazy-open yes
performance.read-after-open no
performance.read-ahead-page-count 4
performance.md-cache-timeout 1
features.encryption off
encryption.master-key (null)
encryption.data-key-size 256
encryption.block-size 4096
network.frame-timeout 1800
network.ping-timeout 42
network.tcp-window-size (null)
features.lock-heal off
features.grace-timeout 10
network.remote-dio disable
network.tcp-window-size (null)
network.inode-lru-limit 16384
auth.allow *
auth.reject (null)
transport.keepalive (null)
server.allow-insecure (null)
server.root-squash off
server.anonuid 65534
server.anongid 65534
server.statedump-path /var/run/gluster
server.outstanding-rpc-limit 64
features.lock-heal off
features.grace-timeout (null)
server.ssl (null)
auth.ssl-allow *
server.manage-gids off
client.send-gids on
server.gid-timeout 2
server.own-thread (null)
performance.write-behind on
performance.read-ahead on
performance.readdir-ahead off
performance.io-cache on
performance.quick-read on
performance.open-behind on
performance.stat-prefetch on
performance.client-io-threads off
performance.nfs.write-behind on
performance.nfs.read-ahead off
performance.nfs.io-cache off
performance.nfs.quick-read off
performance.nfs.stat-prefetch off
performance.nfs.io-threads off
performance.force-readdirp true
features.file-snapshot off
features.uss off
features.snapshot-directory .snaps
features.show-snapshot-directory off
network.compression off
network.compression.window-size -15
network.compression.mem-level 8
network.compression.min-size 0
network.compression.compression-level -1
network.compression.debug false
features.limit-usage (null)
features.quota-timeout 0
features.default-soft-limit 80%
features.soft-timeout 60
features.hard-timeout 5
features.alert-time 86400
features.quota-deem-statfs off
geo-replication.indexing off
geo-replication.indexing off
geo-replication.ignore-pid-check off
geo-replication.ignore-pid-check off
features.quota on
debug.trace off
debug.log-history no
debug.log-file no
debug.exclude-ops (null)
debug.include-ops (null)
debug.error-gen off
debug.error-failure (null)
debug.error-number (null)
debug.random-failure off
debug.error-fops (null)
nfs.enable-ino32 no
nfs.mem-factor 15
nfs.export-dirs on
nfs.export-volumes on
nfs.addr-namelookup off
nfs.dynamic-volumes off
nfs.register-with-portmap on
nfs.outstanding-rpc-limit 16
nfs.port 2049
nfs.rpc-auth-unix on
nfs.rpc-auth-null on
nfs.rpc-auth-allow all
nfs.rpc-auth-reject none
nfs.ports-insecure off
nfs.trusted-sync off
nfs.trusted-write off
nfs.volume-access read-write
nfs.export-dir
nfs.disable false
nfs.nlm on
nfs.acl on
nfs.mount-udp off
nfs.mount-rmtab /var/lib/glusterd/nfs/rmtab
nfs.rpc-statd /sbin/rpc.statd
nfs.server-aux-gids off
nfs.drc off
nfs.drc-size 0x20000
nfs.read-size (1 * 1048576ULL)
nfs.write-size (1 * 1048576ULL)
nfs.readdir-size (1 * 1048576ULL)
features.read-only off
features.worm off
storage.linux-aio off
storage.batch-fsync-mode reverse-fsync
storage.batch-fsync-delay-usec 0
storage.owner-uid -1
storage.owner-gid -1
storage.node-uuid-pathinfo off
storage.health-check-interval 30
storage.build-pgfid off
storage.bd-aio off
cluster.server-quorum-type off
cluster.server-quorum-ratio 0
changelog.changelog off
changelog.changelog-dir (null)
changelog.encoding ascii
changelog.rollover-time 15
changelog.fsync-interval 5
changelog.changelog-barrier-timeout 120
features.barrier disable
features.barrier-timeout 120
locks.trace disable
cluster.disperse-self-heal-daemon enable
[root at dhcp37-178 ~]#
Output of gluster volume info :
================================
[root at dhcp37-178 ~]# gluster v info
Volume Name: testvol
Type: Disperse
Volume ID: ad1a31fb-2e69-4d5d-9ae0-d057879b8fd5
Status: Started
Number of Bricks: 1 x (4 + 2) = 6
Transport-type: tcp
Bricks:
Brick1: dhcp37-120:/var/run/gluster/snaps/1e9ced492e2048cf9f906f45a4869238/brick1/b1
Brick2: dhcp37-208:/var/run/gluster/snaps/1e9ced492e2048cf9f906f45a4869238/brick2/b1
Brick3: dhcp37-178:/var/run/gluster/snaps/1e9ced492e2048cf9f906f45a4869238/brick3/b1
Brick4: dhcp37-183:/var/run/gluster/snaps/1e9ced492e2048cf9f906f45a4869238/brick4/b1
Brick5: dhcp37-120:/var/run/gluster/snaps/1e9ced492e2048cf9f906f45a4869238/brick5/b2
Brick6: dhcp37-208:/var/run/gluster/snaps/1e9ced492e2048cf9f906f45a4869238/brick6/b2
Options Reconfigured:
features.uss: off
features.quota: on
[root at dhcp37-178 ~]#
Output of gluster volume status :
=================================
[root at dhcp37-178 ~]# gluster v status
Status of volume: testvol
Gluster process Port Online Pid
------------------------------------------------------------------------------
Brick dhcp37-120:/var/run/gluster/snaps/1e9ced492e2048cf9f906f45a4869238/brick1/b1 49156 Y 3225
Brick dhcp37-208:/var/run/gluster/snaps/1e9ced492e2048cf9f906f45a4869238/brick2/b1 49167 Y 3238
Brick dhcp37-178:/var/run/gluster/snaps/1e9ced492e2048cf9f906f45a4869238/brick3/b1 49166 Y 3192
Brick dhcp37-183:/var/run/gluster/snaps/1e9ced492e2048cf9f906f45a4869238/brick4/b1 49166 Y 3173
Brick dhcp37-120:/var/run/gluster/snaps/1e9ced492e2048cf9f906f45a4869238/brick5/b2 49157 Y 3236
Brick dhcp37-208:/var/run/gluster/snaps/1e9ced492e2048cf9f906f45a4869238/brick6/b2 49168 Y 3249
NFS Server on localhost 2049 Y 3206
Quota Daemon on localhost N/A Y 3221
NFS Server on dhcp37-208 2049 Y 3262
Quota Daemon on dhcp37-208 N/A Y 3276
NFS Server on dhcp37-183 2049 Y 3186
Quota Daemon on dhcp37-183 N/A Y 3199
NFS Server on 10.70.37.120 2049 Y 3250
Quota Daemon on 10.70.37.120 N/A Y 3263
Task Status of Volume testvol
------------------------------------------------------------------------------
There are no active volume tasks
[root at dhcp37-178 ~]#
Actual results:
================
Gluster client crashed
Expected results:
================
The client should not crash.
Additional info:
================
Attaching the client mount log.
--- Additional comment from Bhaskarakiran on 2015-02-24 06:33:12 EST ---
--- Additional comment from Bhaskarakiran on 2015-02-24 06:34:39 EST ---
Log snippet:
============
pending frames:
frame : type(1) op(LOOKUP)
frame : type(1) op(LOOKUP)
frame : type(1) op(FTRUNCATE)
frame : type(0) op(0)
frame : type(1) op(UNLINK)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(1) op(FLUSH)
frame : type(1) op(STAT)
frame : type(0) op(0)
patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash:
2015-02-24 11:41:47
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.7dev
/usr/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xb6)[0x306ae20aa6]
/usr/lib64/libglusterfs.so.0(gf_print_trace+0x33f)[0x306ae3bdcf]
/lib64/libc.so.6[0x342d4326a0]
/usr/lib64/glusterfs/3.7dev/xlator/cluster/distribute.so(dht_writev_cbk+0x268)[0x7f300993cbf8]
/usr/lib64/libglusterfs.so.0(default_writev_cbk+0xcc)[0x306ae2e5ec]
/usr/lib64/glusterfs/3.7dev/xlator/cluster/disperse.so(ec_manager_writev+0x10d)[0x7f3009b8647d]
/usr/lib64/glusterfs/3.7dev/xlator/cluster/disperse.so(__ec_manager+0x34)[0x7f3009b6a654]
/usr/lib64/glusterfs/3.7dev/xlator/cluster/disperse.so(ec_resume+0x91)[0x7f3009b6a461]
/usr/lib64/glusterfs/3.7dev/xlator/cluster/disperse.so(ec_combine+0x196)[0x7f3009b88fa6]
/usr/lib64/glusterfs/3.7dev/xlator/cluster/disperse.so(ec_writev_cbk+0x27b)[0x7f3009b844bb]
/usr/lib64/glusterfs/3.7dev/xlator/protocol/client.so(client3_3_writev_cbk+0x6cc)[0x7f3009de301c]
/usr/lib64/libgfrpc.so.0(rpc_clnt_handle_reply+0xa5)[0x306aa0ea65]
/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x142)[0x306aa0ff02]
/usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x28)[0x306aa0b5f8]
/usr/lib64/glusterfs/3.7dev/rpc-transport/socket.so(+0x9759)[0x7f30103fc759]
/usr/lib64/glusterfs/3.7dev/rpc-transport/socket.so(+0xb1bd)[0x7f30103fe1bd]
/usr/lib64/libglusterfs.so.0[0x306ae78ffc]
/lib64/libpthread.so.0[0x342d8079d1]
/lib64/libc.so.6(clone+0x6d)[0x342d4e89dd]
---------
--- Additional comment from Ashish Pandey on 2015-03-03 23:59:25 EST ---
dht_fsync_cbk() is called with op_ret = -1, op_errno = 2 (ENOENT), and both
prebuf and postbuf NULL.
Inside dht_fsync_cbk(), the error-handling check
    if (op_ret == -1 && !dht_inode_missing (op_errno))
skips the op_errno = ENOENT case, so control falls through to
    if (IS_DHT_MIGRATION_PHASE1 (postbuf))
The IS_DHT_MIGRATION_PHASE1 macro accesses the file's attributes through the
postbuf pointer, which is NULL here; this leads to the crash.
Bug 960843 introduced the change that excludes op_errno = ENOENT from the
error handling.
We need to investigate the reason for skipping the op_errno = ENOENT case, and
also modify the macro definitions to handle NULL pointers properly.
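The control flow above can be condensed into a small C sketch. This is a
minimal model, not the actual dht source: the struct, the macro body and
dht_inode_missing() are simplified assumptions that only preserve the shape
of the real checks.

    /* Minimal sketch of the crash path described above; names mirror
     * the real code, bodies are simplified assumptions. */
    #include <errno.h>
    #include <stddef.h>

    struct iatt {
            unsigned int ia_prot_sticky;   /* stand-ins for the mode   */
            unsigned int ia_prot_sgid;     /* bits the real macro uses */
    };

    /* Hypothetical simplification: the point is only that the macro
     * dereferences its argument unconditionally. */
    #define IS_DHT_MIGRATION_PHASE1(buf) \
            ((buf)->ia_prot_sticky == 1 && (buf)->ia_prot_sgid == 0)

    #define dht_inode_missing(op_errno) \
            ((op_errno) == ENOENT || (op_errno) == ESTALE)

    int dht_fsync_cbk_sketch (int op_ret, int op_errno,
                              struct iatt *postbuf)
    {
            /* ENOENT/ESTALE are excluded here (the bug 960843 change),
             * so a reply with op_ret = -1, op_errno = ENOENT is not
             * unwound as an error... */
            if (op_ret == -1 && !dht_inode_missing (op_errno))
                    return op_errno;

            /* ...and reaches the migration check with postbuf == NULL,
             * which the macro dereferences: SIGSEGV. A NULL guard such
             * as (postbuf && IS_DHT_MIGRATION_PHASE1 (postbuf)) would
             * avoid it. */
            if (IS_DHT_MIGRATION_PHASE1 (postbuf))
                    return 1;

            return 0;
    }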
--- Additional comment from Pranith Kumar K on 2015-03-09 02:44:05 EDT ---
Ashish,
I just realized that, on an active fd, fsync should never return
ESTALE/ENOENT, since the fd is already open on the file. Why is EC returning
this error? Could this be an ec bug after all?
Pranith
--- Additional comment from Anand Avati on 2015-04-09 08:21:47 EDT ---
REVIEW: http://review.gluster.org/10176 (cluster/ec: Use fd instead of loc for
get_size_version) posted (#1) for review on master by Ashish Pandey
(aspandey at redhat.com)
--- Additional comment from Anand Avati on 2015-04-13 07:19:29 EDT ---
REVIEW: http://review.gluster.org/10176 (cluster/ec: Use fd instead of loc for
get_size_version) posted (#2) for review on master by Ashish Pandey
(aspandey at redhat.com)
--- Additional comment from Anand Avati on 2015-04-13 07:19:32 EDT ---
REVIEW: http://review.gluster.org/10218 (Comments implemeted) posted (#1) for
review on master by Ashish Pandey (aspandey at redhat.com)
--- Additional comment from Anand Avati on 2015-04-14 05:45:23 EDT ---
REVIEW: http://review.gluster.org/10176 (cluster/ec: Use fd instead of loc for
get_size_version) posted (#3) for review on master by Ashish Pandey
(aspandey at redhat.com)
--- Additional comment from Anand Avati on 2015-04-28 02:06:23 EDT ---
REVIEW: http://review.gluster.org/10176 (cluster/ec: Use fd instead of loc for
get_size_version) posted (#4) for review on master by Ashish Pandey
(aspandey at redhat.com)
--- Additional comment from Anand Avati on 2015-05-01 11:04:55 EDT ---
REVIEW: http://review.gluster.org/10176 (cluster/ec: Use fd instead of loc for
get_size_version) posted (#5) for review on master by Ashish Pandey
(aspandey at redhat.com)
--- Additional comment from Anand Avati on 2015-05-03 07:46:40 EDT ---
REVIEW: http://review.gluster.org/10176 (cluster/ec: Use fd instead of loc for
get_size_version) posted (#6) for review on master by Ashish Pandey
(aspandey at redhat.com)
--- Additional comment from Anand Avati on 2015-05-04 07:37:02 EDT ---
REVIEW: http://review.gluster.org/10176 (cluster/ec: Use fd instead of loc for
get_size_version) posted (#7) for review on master by Ashish Pandey
(aspandey at redhat.com)
--- Additional comment from Anand Avati on 2015-05-04 22:43:51 EDT ---
COMMIT: http://review.gluster.org/10176 committed in master by Pranith Kumar
Karampuri (pkarampu at redhat.com)
------
commit 582b252e3a418ee332cf3d4b1a415520e242b599
Author: Ashish Pandey <aspandey at redhat.com>
Date: Thu Apr 9 17:27:46 2015 +0530
cluster/ec: Use fd instead of loc for get_size_version
Change-Id: Ia7d43cb3b222db34ecb0e35424f1766715ed8e6a
BUG: 1188242
Signed-off-by: Ashish Pandey <aspandey at redhat.com>
Reviewed-on: http://review.gluster.org/10176
Reviewed-by: Xavier Hernandez <xhernandez at datalab.es>
Tested-by: Gluster Build System <jenkins at build.gluster.com>
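The idea behind the fix, sketched below under assumed, heavily simplified
types (the real ec code differs): a path-based (loc) lookup can fail with
ENOENT once the file has been unlinked, while an already-open fd keeps a
reference to the inode, so size/version queries made through the fd cannot
hit the missing-file path. All names in the sketch (lookup_by_path,
get_size_version_loc, get_size_version_fd) are hypothetical.

    /* Conceptual sketch of the fix's direction, not the actual ec code. */
    #include <errno.h>
    #include <stdio.h>

    struct inode { long size; long version; };
    struct loc   { const char *path; };    /* path-based reference        */
    struct fd    { struct inode *inode; }; /* inode pinned while fd open  */

    /* Stub standing in for a path lookup; returns NULL to model the
     * file having been unlinked between operations (the iozone race). */
    static struct inode *lookup_by_path (const char *path)
    {
            (void) path;
            return NULL;
    }

    /* Path-based variant: racy against unlink, can return -ENOENT. */
    static int get_size_version_loc (struct loc *loc, long *size, long *ver)
    {
            struct inode *in = lookup_by_path (loc->path);
            if (in == NULL)
                    return -ENOENT;
            *size = in->size;
            *ver  = in->version;
            return 0;
    }

    /* fd-based variant: the open fd keeps its inode alive, so the
     * query cannot observe a missing file. */
    static int get_size_version_fd (struct fd *fd, long *size, long *ver)
    {
            *size = fd->inode->size;
            *ver  = fd->inode->version;
            return 0;
    }

    int main (void)
    {
            struct inode ino = { .size = 4096, .version = 7 };
            struct fd    fd  = { .inode = &ino };
            struct loc   loc = { .path = "/mnt/testvol/file" };
            long s = 0, v = 0;

            /* The path lookup models the post-unlink race and fails... */
            printf ("loc-based: %d\n", get_size_version_loc (&loc, &s, &v));
            /* ...while the fd-based query still succeeds. */
            printf ("fd-based:  %d (size=%ld version=%ld)\n",
                    get_size_version_fd (&fd, &s, &v), s, v);
            return 0;
    }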
Referenced Bugs:
https://bugzilla.redhat.com/show_bug.cgi?id=1186580
[Bug 1186580] QE tracker bug for Everglades
https://bugzilla.redhat.com/show_bug.cgi?id=1188242
[Bug 1188242] Disperse volume: client crashed while running iozone
https://bugzilla.redhat.com/show_bug.cgi?id=1192378
[Bug 1192378] Disperse volume: client crashed while running renames with
epoll enabled
--