<html><head></head><body><div>Hmm. I have a 3.12.9 volume (several) with 3.12.9 clients that are dropping the mount yet parallel.readdir is off. This is only happening on the RDMA interface. The TCP transport mounts are fine.</div><div><br></div><div>Option Value </div><div>------ ----- </div><div>cluster.lookup-unhashed on </div><div>cluster.lookup-optimize off </div><div>cluster.min-free-disk 10% </div><div>cluster.min-free-inodes 5% </div><div>cluster.rebalance-stats off </div><div>cluster.subvols-per-directory (null) </div><div>cluster.readdir-optimize off </div><div>cluster.rsync-hash-regex (null) </div><div>cluster.extra-hash-regex (null) </div><div>cluster.dht-xattr-name trusted.glusterfs.dht </div><div>cluster.randomize-hash-range-by-gfid off </div><div>cluster.rebal-throttle normal </div><div>cluster.lock-migration off </div><div>cluster.local-volume-name (null) </div><div>cluster.weighted-rebalance on </div><div>cluster.switch-pattern (null) </div><div>cluster.entry-change-log on </div><div>cluster.read-subvolume (null) </div><div>cluster.read-subvolume-index -1 </div><div>cluster.read-hash-mode 1 </div><div>cluster.background-self-heal-count 8 </div><div>cluster.metadata-self-heal on </div><div>cluster.data-self-heal on </div><div>cluster.entry-self-heal on </div><div>cluster.self-heal-daemon enable </div><div>cluster.heal-timeout 600 </div><div>cluster.self-heal-window-size 1 </div><div>cluster.data-change-log on </div><div>cluster.metadata-change-log on </div><div>cluster.data-self-heal-algorithm (null) </div><div>cluster.eager-lock on </div><div>disperse.eager-lock on </div><div>cluster.quorum-type none </div><div>cluster.quorum-count (null) </div><div>cluster.choose-local true </div><div>cluster.self-heal-readdir-size 1KB </div><div>cluster.post-op-delay-secs 1 </div><div>cluster.ensure-durability on </div><div>cluster.consistent-metadata no </div><div>cluster.heal-wait-queue-length 128 </div><div>cluster.favorite-child-policy none </div><div>cluster.stripe-block-size 128KB </div><div>cluster.stripe-coalesce true </div><div>diagnostics.latency-measurement off </div><div>diagnostics.dump-fd-stats off </div><div>diagnostics.count-fop-hits off </div><div>diagnostics.brick-log-level INFO </div><div>diagnostics.client-log-level INFO </div><div>diagnostics.brick-sys-log-level CRITICAL </div><div>diagnostics.client-sys-log-level CRITICAL </div><div>diagnostics.brick-logger (null) </div><div>diagnostics.client-logger (null) </div><div>diagnostics.brick-log-format (null) </div><div>diagnostics.client-log-format (null) </div><div>diagnostics.brick-log-buf-size 5 </div><div>diagnostics.client-log-buf-size 5 </div><div>diagnostics.brick-log-flush-timeout 120 </div><div>diagnostics.client-log-flush-timeout 120 </div><div>diagnostics.stats-dump-interval 0 </div><div>diagnostics.fop-sample-interval 0 </div><div>diagnostics.stats-dump-format json </div><div>diagnostics.fop-sample-buf-size 65535 </div><div>diagnostics.stats-dnscache-ttl-sec 86400 </div><div>performance.cache-max-file-size 0 </div><div>performance.cache-min-file-size 0 </div><div>performance.cache-refresh-timeout 1 </div><div>performance.cache-priority </div><div>performance.cache-size 32MB </div><div>performance.io-thread-count 16 </div><div>performance.high-prio-threads 16 </div><div>performance.normal-prio-threads 16 </div><div>performance.low-prio-threads 16 </div><div>performance.least-prio-threads 1 </div><div>performance.enable-least-priority on </div><div>performance.cache-size 128MB 
</div><div>performance.flush-behind on </div><div>performance.nfs.flush-behind on </div><div>performance.write-behind-window-size 1MB </div><div>performance.resync-failed-syncs-after-fsyncoff </div><div>performance.nfs.write-behind-window-size1MB </div><div>performance.strict-o-direct off </div><div>performance.nfs.strict-o-direct off </div><div>performance.strict-write-ordering off </div><div>performance.nfs.strict-write-ordering off </div><div>performance.lazy-open yes </div><div>performance.read-after-open no </div><div>performance.read-ahead-page-count 4 </div><div>performance.md-cache-timeout 1 </div><div>performance.cache-swift-metadata true </div><div>performance.cache-samba-metadata false </div><div>performance.cache-capability-xattrs true </div><div>performance.cache-ima-xattrs true </div><div>features.encryption off </div><div>encryption.master-key (null) </div><div>encryption.data-key-size 256 </div><div>encryption.block-size 4096 </div><div>network.frame-timeout 1800 </div><div>network.ping-timeout 42 </div><div>network.tcp-window-size (null) </div><div>features.lock-heal off </div><div>features.grace-timeout 10 </div><div>network.remote-dio disable </div><div>client.event-threads 2 </div><div>client.tcp-user-timeout 0 </div><div>client.keepalive-time 20 </div><div>client.keepalive-interval 2 </div><div>client.keepalive-count 9 </div><div>network.tcp-window-size (null) </div><div>network.inode-lru-limit 16384 </div><div>auth.allow * </div><div>auth.reject (null) </div><div>transport.keepalive 1 </div><div>server.allow-insecure (null) </div><div>server.root-squash off </div><div>server.anonuid 65534 </div><div>server.anongid 65534 </div><div>server.statedump-path /var/run/gluster </div><div>server.outstanding-rpc-limit 64 </div><div>features.lock-heal off </div><div>features.grace-timeout 10 </div><div>server.ssl (null) </div><div>auth.ssl-allow * </div><div>server.manage-gids off </div><div>server.dynamic-auth on </div><div>client.send-gids on </div><div>server.gid-timeout 300 </div><div>server.own-thread (null) </div><div>server.event-threads 1 </div><div>server.tcp-user-timeout 0 </div><div>server.keepalive-time 20 </div><div>server.keepalive-interval 2 </div><div>server.keepalive-count 9 </div><div>transport.listen-backlog 10 </div><div>ssl.own-cert (null) </div><div>ssl.private-key (null) </div><div>ssl.ca-list (null) </div><div>ssl.crl-path (null) </div><div>ssl.certificate-depth (null) </div><div>ssl.cipher-list (null) </div><div>ssl.dh-param (null) </div><div>ssl.ec-curve (null) </div><div>performance.write-behind on </div><div>performance.read-ahead on </div><div>performance.readdir-ahead on </div><div>performance.io-cache on </div><div>performance.quick-read on </div><div>performance.open-behind on </div><div>performance.nl-cache off </div><div>performance.stat-prefetch on </div><div>performance.client-io-threads on </div><div>performance.nfs.write-behind on </div><div>performance.nfs.read-ahead off </div><div>performance.nfs.io-cache off </div><div>performance.nfs.quick-read off </div><div>performance.nfs.stat-prefetch off </div><div>performance.nfs.io-threads off </div><div>performance.force-readdirp true </div><div>performance.cache-invalidation false </div><div>features.uss off </div><div>features.snapshot-directory .snaps </div><div>features.show-snapshot-directory off </div><div>network.compression off </div><div>network.compression.window-size -15 </div><div>network.compression.mem-level 8 </div><div>network.compression.min-size 0 
</div><div>network.compression.compression-level -1 </div><div>network.compression.debug false </div><div>features.limit-usage (null) </div><div>features.default-soft-limit 80% </div><div>features.soft-timeout 60 </div><div>features.hard-timeout 5 </div><div>features.alert-time 86400 </div><div>features.quota-deem-statfs off </div><div>geo-replication.indexing off </div><div>geo-replication.indexing off </div><div>geo-replication.ignore-pid-check off </div><div>geo-replication.ignore-pid-check off </div><div>features.quota off </div><div>features.inode-quota off </div><div>features.bitrot disable </div><div>debug.trace off </div><div>debug.log-history no </div><div>debug.log-file no </div><div>debug.exclude-ops (null) </div><div>debug.include-ops (null) </div><div>debug.error-gen off </div><div>debug.error-failure (null) </div><div>debug.error-number (null) </div><div>debug.random-failure off </div><div>debug.error-fops (null) </div><div>nfs.disable off </div><div>features.read-only off </div><div>features.worm off </div><div>features.worm-file-level off </div><div>features.default-retention-period 120 </div><div>features.retention-mode relax </div><div>features.auto-commit-period 180 </div><div>storage.linux-aio off </div><div>storage.batch-fsync-mode reverse-fsync </div><div>storage.batch-fsync-delay-usec 0 </div><div>storage.owner-uid -1 </div><div>storage.owner-gid -1 </div><div>storage.node-uuid-pathinfo off </div><div>storage.health-check-interval 30 </div><div>storage.build-pgfid off </div><div>storage.gfid2path on </div><div>storage.gfid2path-separator : </div><div>storage.bd-aio off </div><div>cluster.server-quorum-type off </div><div>cluster.server-quorum-ratio 0 </div><div>changelog.changelog off </div><div>changelog.changelog-dir (null) </div><div>changelog.encoding ascii </div><div>changelog.rollover-time 15 </div><div>changelog.fsync-interval 5 </div><div>changelog.changelog-barrier-timeout 120 </div><div>changelog.capture-del-path off </div><div>features.barrier disable </div><div>features.barrier-timeout 120 </div><div>features.trash off </div><div>features.trash-dir .trashcan </div><div>features.trash-eliminate-path (null) </div><div>features.trash-max-filesize 5MB </div><div>features.trash-internal-op off </div><div>cluster.enable-shared-storage disable </div><div>cluster.write-freq-threshold 0 </div><div>cluster.read-freq-threshold 0 </div><div>cluster.tier-pause off </div><div>cluster.tier-promote-frequency 120 </div><div>cluster.tier-demote-frequency 3600 </div><div>cluster.watermark-hi 90 </div><div>cluster.watermark-low 75 </div><div>cluster.tier-mode cache </div><div>cluster.tier-max-promote-file-size 0 </div><div>cluster.tier-max-mb 4000 </div><div>cluster.tier-max-files 10000 </div><div>cluster.tier-query-limit 100 </div><div>cluster.tier-compact on </div><div>cluster.tier-hot-compact-frequency 604800 </div><div>cluster.tier-cold-compact-frequency 604800 </div><div>features.ctr-enabled off </div><div>features.record-counters off </div><div>features.ctr-record-metadata-heat off </div><div>features.ctr_link_consistency off </div><div>features.ctr_lookupheal_link_timeout 300 </div><div>features.ctr_lookupheal_inode_timeout 300 </div><div>features.ctr-sql-db-cachesize 12500 </div><div>features.ctr-sql-db-wal-autocheckpoint 25000 </div><div>features.selinux on </div><div>locks.trace off </div><div>locks.mandatory-locking off </div><div>cluster.disperse-self-heal-daemon enable </div><div>cluster.quorum-reads no </div><div>client.bind-insecure (null) 
</div><div>features.shard off </div>
<div>features.shard-block-size 64MB </div>
<div>features.scrub-throttle lazy </div>
<div>features.scrub-freq biweekly </div>
<div>features.scrub false </div>
<div>features.expiry-time 120 </div>
<div>features.cache-invalidation off </div>
<div>features.cache-invalidation-timeout 60 </div>
<div>features.leases off </div>
<div>features.lease-lock-recall-timeout 60 </div>
<div>disperse.background-heals 8 </div>
<div>disperse.heal-wait-qlength 128 </div>
<div>cluster.heal-timeout 600 </div>
<div>dht.force-readdirp on </div>
<div>disperse.read-policy gfid-hash </div>
<div>cluster.shd-max-threads 1 </div>
<div>cluster.shd-wait-qlength 1024 </div>
<div>cluster.locking-scheme full </div>
<div>cluster.granular-entry-heal no </div>
<div>features.locks-revocation-secs 0 </div>
<div>features.locks-revocation-clear-all false </div>
<div>features.locks-revocation-max-blocked 0 </div>
<div>features.locks-monkey-unlocking false </div>
<div>disperse.shd-max-threads 1 </div>
<div>disperse.shd-wait-qlength 1024 </div>
<div>disperse.cpu-extensions auto </div>
<div>disperse.self-heal-window-size 1 </div>
<div>cluster.use-compound-fops off </div>
<div>performance.parallel-readdir off </div>
<div>performance.rda-request-size 131072 </div>
<div>performance.rda-low-wmark 4096 </div>
<div>performance.rda-high-wmark 128KB </div>
<div>performance.rda-cache-limit 10MB </div>
<div>performance.nl-cache-positive-entry false </div>
<div>performance.nl-cache-limit 10MB </div>
<div>performance.nl-cache-timeout 60 </div>
<div>cluster.brick-multiplex off </div>
<div>cluster.max-bricks-per-process 0 </div>
<div>disperse.optimistic-change-log on </div>
<div>cluster.halo-enabled False </div>
<div>cluster.halo-shd-max-latency 99999 </div>
<div>cluster.halo-nfsd-max-latency 5 </div>
<div>cluster.halo-max-latency 5 </div>
<div>cluster.halo-max-replicas 99999 </div>
<div>cluster.halo-min-replicas 2 </div>
<div><br></div>
<div>On Thu, 2018-06-14 at 12:12 +0100, mohammad kashif wrote:</div><blockquote type="cite" style="margin:0 0 0 .8ex; border-left:2px #729fcf solid;padding-left:1ex"><div dir="ltr"><div><div><div><div>Hi Nithya<br><br></div>It seems the problem can be solved either by turning parallel-readdir off or by downgrading the client to 3.10.12-1. Yesterday I downgraded some clients to 3.10.12-1 and that seems to have fixed the problem. Today, after seeing your email, I turned parallel-readdir off and the current 3.12.9-1 client started to work. I upgraded the servers and clients to 3.12.9-1 last month, and since then clients had been unmounting intermittently, about once a week. But during the last three days it started unmounting every few minutes. I don't know what triggered this sudden panic, except that the file system was quite full, around 98%. It is a 480 TB file system with almost 80 million files.<br><br></div>Servers have 64GB RAM and clients have 64GB to 192GB RAM. I tested with a 192GB RAM client and it still had the same issue. 
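<br><br>(A minimal illustrative sketch of how that parallel-readdir toggle can be applied and verified from the gluster CLI; the volume name atlasglust is taken from the volume info below, while the mount point and server used in the remount check are assumed examples only.)<br>
<pre>
# Turn off the parallel-readdir path suspected in the client crashes
gluster volume set atlasglust performance.parallel-readdir off

# Confirm the option is now off for the volume
gluster volume get atlasglust performance.parallel-readdir

# Mounted FUSE clients should pick up the new volume graph on their own;
# remounting one client is a quick extra check (mount point is an example)
umount /mnt/atlas
mount -t glusterfs pplxgluster01.X.Y.Z:/atlasglust /mnt/atlas
</pre>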
<br><br><br>Volume Name: atlasglust<br>Type: Distribute<br>Volume ID: fbf0ebb8-deab-4388-9d8a-f722618a624b<br>Status: Started<br>Snapshot Count: 0<br>Number of Bricks: 7<br>Transport-type: tcp<br>Bricks:<br>Brick1: pplxgluster01.X.Y.Z/glusteratlas/brick001/gv0<br>Brick2: pplxgluster02.X.Y.Z:/glusteratlas/brick002/gv0<br>Brick3: pplxgluster03.X.Y.Z:/glusteratlas/brick003/gv0<br>Brick4: pplxgluster04.X.Y.Z:/glusteratlas/brick004/gv0<br>Brick5: pplxgluster05.X.Y.Z:/glusteratlas/brick005/gv0<br>Brick6: pplxgluster06.X.Y.Z:/glusteratlas/brick006/gv0<br>Brick7: pplxgluster07.X.Y.Z:/glusteratlas/brick007/gv0<br>Options Reconfigured:<br>diagnostics.client-log-level: ERROR<br>diagnostics.brick-log-level: ERROR<br>performance.cache-invalidation: on<br>server.event-threads: 4<br>client.event-threads: 4<br>cluster.lookup-optimize: on<br>performance.client-io-threads: on<br>performance.cache-size: 1GB<br>performance.parallel-readdir: off<br>performance.md-cache-timeout: 600<br>performance.stat-prefetch: on<br>features.cache-invalidation-timeout: 600<br>features.cache-invalidation: on<br>auth.allow: X.Y.Z.*<br>transport.address-family: inet<br>performance.readdir-ahead: on<br>nfs.disable: on<br><br><br></div>Thanks<br><br></div>Kashif<br></div><div class="gmail_extra"><br><div class="gmail_quote">On Thu, Jun 14, 2018 at 5:39 AM, Nithya Balachandran <span dir="ltr"><<a href="mailto:nbalacha@redhat.com" target="_blank">nbalacha@redhat.com</a>></span> wrote:<br><blockquote type="cite" style="margin:0 0 0 .8ex; border-left:2px #729fcf solid;padding-left:1ex"><div dir="ltr">+Poornima who works on parallel-readdir. <div><br></div><div>@Poornima, Have you seen anything like this before?</div></div><div class="HOEnZb"><div class="h5"><div class="gmail_extra"><br><div class="gmail_quote">On 14 June 2018 at 10:07, Nithya Balachandran <span dir="ltr"><<a href="mailto:nbalacha@redhat.com" target="_blank">nbalacha@redhat.com</a>></span> wrote:<br><blockquote type="cite" style="margin:0 0 0 .8ex; border-left:2px #729fcf solid;padding-left:1ex"><div dir="ltr">This is not the same issue as the one you are referring - that was in the RPC layer and caused the bricks to crash. This one is different as it seems to be in the dht and rda layers. It does look like a stack overflow though.<div><br></div><div>@Mohammad,</div><div><br></div><div>Please send the following information:</div><div><br></div><div>1. gluster volume info </div><div>2. The number of entries in the directory being listed</div><div>3. System memory</div><div><br></div><div>Does this still happen if you turn off parallel-readdir?</div><div><br></div><div>Regards,</div><div>Nithya</div><div><div class="m_-7144954943680245677h5"><div><br></div><div><br></div><div><br><div class="gmail_extra"><br><div class="gmail_quote">On 13 June 2018 at 16:40, Milind Changire <span dir="ltr"><<a href="mailto:mchangir@redhat.com" target="_blank">mchangir@redhat.com</a>></span> wrote:<br><blockquote type="cite" style="margin:0 0 0 .8ex; border-left:2px #729fcf solid;padding-left:1ex"><div dir="ltr"><div>+Nithya</div><div><br></div><div>Nithya,</div><div>Do these logs [1] look similar to the recursive readdir() issue that you encountered just a while back ?</div><div>i.e. 
recursive readdir() response definition in the XDR<br></div><div><br></div><div>[1] <a href="http://www-pnp.physics.ox.ac.uk/~mohammad/backtrace.log" target="_blank">http://www-pnp.physics.ox.ac.u<wbr>k/~mohammad/backtrace.log</a><br></div><div><br></div><div class="gmail_extra"><div><div class="m_-7144954943680245677m_-8863048801970484267m_-2956020922779137409h5"><br><div class="gmail_quote">On Wed, Jun 13, 2018 at 4:29 PM, mohammad kashif <span dir="ltr"><<a href="mailto:kashif.alig@gmail.com" target="_blank">kashif.alig@gmail.com</a>></span> wrote:<br><blockquote type="cite" style="margin:0 0 0 .8ex; border-left:2px #729fcf solid;padding-left:1ex"><div dir="ltr"><div>Hi Milind</div><div><br></div><div>Thanks a lot, I manage to run gdb and produced traceback as well. Its here</div><div><br></div><div><a href="http://www-pnp.physics.ox.ac.uk/~mohammad/backtrace.log" target="_blank">http://www-pnp.physics.ox.ac.u<wbr>k/~mohammad/backtrace.log</a> <br></div><div><br></div><div><br></div><div>I am trying to understand but still not able to make sense out of it.<br></div><div><br></div><div>Thanks</div><span class="m_-7144954943680245677m_-8863048801970484267m_-2956020922779137409m_-5775231697309111402HOEnZb"><font color="#888888"><div><br></div><div>Kashif<br></div></font></span></div><div class="m_-7144954943680245677m_-8863048801970484267m_-2956020922779137409m_-5775231697309111402HOEnZb"><div class="m_-7144954943680245677m_-8863048801970484267m_-2956020922779137409m_-5775231697309111402h5"><div class="gmail_extra"><br><div class="gmail_quote">On Wed, Jun 13, 2018 at 11:34 AM, Milind Changire <span dir="ltr"><<a href="mailto:mchangir@redhat.com" target="_blank">mchangir@redhat.com</a>></span> wrote:<br><blockquote type="cite" style="margin:0 0 0 .8ex; border-left:2px #729fcf solid;padding-left:1ex"><div dir="ltr"><div>Kashif,</div><div>FYI: <a href="http://debuginfo.centos.org/centos/6/storage/x86_64/" target="_blank">http://debuginfo.centos.org/ce<wbr>ntos/6/storage/x86_64/</a></div><div><br></div></div><div class="gmail_extra"><div><div class="m_-7144954943680245677m_-8863048801970484267m_-2956020922779137409m_-5775231697309111402m_-5140365810814053404h5"><br><div class="gmail_quote">On Wed, Jun 13, 2018 at 3:21 PM, mohammad kashif <span dir="ltr"><<a href="mailto:kashif.alig@gmail.com" target="_blank">kashif.alig@gmail.com</a>></span> wrote:<br><blockquote type="cite" style="margin:0 0 0 .8ex; border-left:2px #729fcf solid;padding-left:1ex"><div dir="ltr"><div><div>Hi Milind <br></div><div><br></div><div>There is no
glusterfs-debuginfo available for gluster-3.12 from
<a href="http://mirror.centos.org/centos/6/storage/x86_64/gluster-3.12/" target="_blank">http://mirror.centos.org/cento<wbr>s/6/storage/x86_64/gluster-3.1<wbr>2/</a> repo. Do
you know from where I can get it? <br></div><div>Also when I run gdb, it says <br></div><div><br></div><div>Missing separate debuginfos, use: debuginfo-install glusterfs-fuse-3.12.9-1.el6.x8<wbr>6_64 <br></div><div><br></div><div>I can't find debug package for glusterfs-fuse either</div></div><div><br></div><div>Thanks from the pit of despair ;)</div><span class="m_-7144954943680245677m_-8863048801970484267m_-2956020922779137409m_-5775231697309111402m_-5140365810814053404m_5014525246591411878HOEnZb"><font color="#888888"><div><br></div><div>Kashif<br></div><div><br></div></font></span></div><div class="m_-7144954943680245677m_-8863048801970484267m_-2956020922779137409m_-5775231697309111402m_-5140365810814053404m_5014525246591411878HOEnZb"><div class="m_-7144954943680245677m_-8863048801970484267m_-2956020922779137409m_-5775231697309111402m_-5140365810814053404m_5014525246591411878h5"><div class="gmail_extra"><br><div class="gmail_quote">On Tue, Jun 12, 2018 at 5:01 PM, mohammad kashif <span dir="ltr"><<a href="mailto:kashif.alig@gmail.com" target="_blank">kashif.alig@gmail.com</a>></span> wrote:<br><blockquote type="cite" style="margin:0 0 0 .8ex; border-left:2px #729fcf solid;padding-left:1ex"><div dir="ltr"><div>Hi Milind</div><div><br></div><div>I will send you links for logs.</div><div><br></div><div>I collected these core dumps at client and there is no glusterd process running on client.</div><span class="m_-7144954943680245677m_-8863048801970484267m_-2956020922779137409m_-5775231697309111402m_-5140365810814053404m_5014525246591411878m_2285939528783175411HOEnZb"><font color="#888888"><div><br></div><div>Kashif<br></div><div><br></div><div><br></div></font></span></div><div class="m_-7144954943680245677m_-8863048801970484267m_-2956020922779137409m_-5775231697309111402m_-5140365810814053404m_5014525246591411878m_2285939528783175411HOEnZb"><div class="m_-7144954943680245677m_-8863048801970484267m_-2956020922779137409m_-5775231697309111402m_-5140365810814053404m_5014525246591411878m_2285939528783175411h5"><div class="gmail_extra"><br><div class="gmail_quote">On Tue, Jun 12, 2018 at 4:14 PM, Milind Changire <span dir="ltr"><<a href="mailto:mchangir@redhat.com" target="_blank">mchangir@redhat.com</a>></span> wrote:<br><blockquote type="cite" style="margin:0 0 0 .8ex; border-left:2px #729fcf solid;padding-left:1ex"><div dir="ltr"><div>Kashif,</div><div>Could you also send over the client/mount log file as Vijay suggested ?</div><div>Or maybe the lines with the crash backtrace lines<br></div><div><br></div><div>Also, you've mentioned that you straced glusterd, but when you ran gdb, you ran it over /usr/sbin/glusterfs</div><div><br></div></div><div class="gmail_extra"><div><div class="m_-7144954943680245677m_-8863048801970484267m_-2956020922779137409m_-5775231697309111402m_-5140365810814053404m_5014525246591411878m_2285939528783175411m_4375213844299906572h5"><br><div class="gmail_quote">On Tue, Jun 12, 2018 at 8:19 PM, Vijay Bellur <span dir="ltr"><<a href="mailto:vbellur@redhat.com" target="_blank">vbellur@redhat.com</a>></span> wrote:<br><blockquote type="cite" style="margin:0 0 0 .8ex; border-left:2px #729fcf solid;padding-left:1ex"><div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote"><span>On Tue, Jun 12, 2018 at 7:40 AM, mohammad kashif <span dir="ltr"><<a href="mailto:kashif.alig@gmail.com" target="_blank">kashif.alig@gmail.com</a>></span> wrote:<br><blockquote type="cite" style="margin:0 0 0 .8ex; border-left:2px #729fcf solid;padding-left:1ex"><div dir="ltr"><div>Hi 
Milind<br></div><div><br></div><div>The operating system is Scientific Linux 6 which is based on RHEL6. The cpu arch is Intel x86_64.</div><div><br></div><div>I will send you a separate email with link to core dump.</div></div><br></blockquote><div><br></div><div><br></div></span><div>You could also grep for crash in the client log file and the lines following crash would have a backtrace in most cases.</div><div><br></div><div>HTH,</div><div>Vijay</div><div><div class="m_-7144954943680245677m_-8863048801970484267m_-2956020922779137409m_-5775231697309111402m_-5140365810814053404m_5014525246591411878m_2285939528783175411m_4375213844299906572m_4931964523437139568h5"><div> </div><blockquote type="cite" style="margin:0 0 0 .8ex; border-left:2px #729fcf solid;padding-left:1ex"><div dir="ltr"><span><div><br></div><div>Thanks for your help.</div><div><br></div><div>Kashif<br></div><div><br></div></span></div><div class="m_-7144954943680245677m_-8863048801970484267m_-2956020922779137409m_-5775231697309111402m_-5140365810814053404m_5014525246591411878m_2285939528783175411m_4375213844299906572m_4931964523437139568m_4152801743731486923HOEnZb"><div class="m_-7144954943680245677m_-8863048801970484267m_-2956020922779137409m_-5775231697309111402m_-5140365810814053404m_5014525246591411878m_2285939528783175411m_4375213844299906572m_4931964523437139568m_4152801743731486923h5"><div class="gmail_extra"><br><div class="gmail_quote">On Tue, Jun 12, 2018 at 3:16 PM, Milind Changire <span dir="ltr"><<a href="mailto:mchangir@redhat.com" target="_blank">mchangir@redhat.com</a>></span> wrote:<br><blockquote type="cite" style="margin:0 0 0 .8ex; border-left:2px #729fcf solid;padding-left:1ex"><div dir="ltr"><div>Kashif,</div><div>Could you share the core dump via Google Drive or something similar</div><div><br></div><div>Also, let me know the CPU arch and OS Distribution on which you are running gluster.</div><div><br></div><div>If you've installed the glusterfs-debuginfo package, you'll also get the source lines in the backtrace via gdb</div><div><br></div><div><br></div></div><div class="gmail_extra"><div><div class="m_-7144954943680245677m_-8863048801970484267m_-2956020922779137409m_-5775231697309111402m_-5140365810814053404m_5014525246591411878m_2285939528783175411m_4375213844299906572m_4931964523437139568m_4152801743731486923m_-4451941051880752073h5"><br><div class="gmail_quote">On Tue, Jun 12, 2018 at 5:59 PM, mohammad kashif <span dir="ltr"><<a href="mailto:kashif.alig@gmail.com" target="_blank">kashif.alig@gmail.com</a>></span> wrote:<br><blockquote type="cite" style="margin:0 0 0 .8ex; border-left:2px #729fcf solid;padding-left:1ex"><div dir="ltr"><div>Hi Milind, Vijay <br></div><div><br></div><div>Thanks, I have some more information now as I straced glusterd on client</div><div><br></div><div>138544 0.000131 mprotect(0x7f2f70785000, 4096, PROT_READ|PROT_WRITE) = 0 <0.000026><br>138544 0.000128 mprotect(0x7f2f70786000, 4096, PROT_READ|PROT_WRITE) = 0 <0.000027><br>138544 0.000126 mprotect(0x7f2f70787000, 4096, PROT_READ|PROT_WRITE) = 0 <0.000027><br>138544 0.000124 --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_ACCERR, si_addr=0x7f2f7c60ef88} ---<br>138544 0.000051 --- SIGSEGV {si_signo=SIGSEGV, si_code=SI_KERNEL, si_addr=0} ---<br>138551 0.105048 +++ killed by SIGSEGV (core dumped) +++<br>138550 0.000041 +++ killed by SIGSEGV (core dumped) +++<br>138547 0.000008 +++ killed by SIGSEGV (core dumped) +++<br>138546 0.000007 +++ killed by SIGSEGV (core dumped) +++<br>138545 0.000007 +++ killed by SIGSEGV (core 
dumped) +++<br>138544 0.000008 +++ killed by SIGSEGV (core dumped) +++<br>138543 0.000007 +++ killed by SIGSEGV (core dumped) +++</div><div><br></div><div>As for I understand that somehow gluster is trying to access memory in appropriate manner and kernel sends SIGSEGV <br></div><div><br></div><div>I also got the core dump. I am trying gdb first time so I am not sure whether I am using it correctly <br></div><div><br></div><div>gdb /usr/sbin/glusterfs core.138536</div><div><br></div><div>It just tell me that program terminated with signal 11, segmentation fault .</div><div><br></div><div>The problem is not limited to one client but happening to many clients. <br></div><div><br></div><div>I will really appreciate any help as whole file system has become unusable <br></div><div><br></div><div>Thanks</div><span class="m_-7144954943680245677m_-8863048801970484267m_-2956020922779137409m_-5775231697309111402m_-5140365810814053404m_5014525246591411878m_2285939528783175411m_4375213844299906572m_4931964523437139568m_4152801743731486923m_-4451941051880752073m_3059762644582098629HOEnZb"><font color="#888888"><div><br></div><div>Kashif<br></div><div><br></div><div><br></div><div><br></div></font></span></div><div class="m_-7144954943680245677m_-8863048801970484267m_-2956020922779137409m_-5775231697309111402m_-5140365810814053404m_5014525246591411878m_2285939528783175411m_4375213844299906572m_4931964523437139568m_4152801743731486923m_-4451941051880752073m_3059762644582098629HOEnZb"><div class="m_-7144954943680245677m_-8863048801970484267m_-2956020922779137409m_-5775231697309111402m_-5140365810814053404m_5014525246591411878m_2285939528783175411m_4375213844299906572m_4931964523437139568m_4152801743731486923m_-4451941051880752073m_3059762644582098629h5"><div class="gmail_extra"><br><div class="gmail_quote">On Tue, Jun 12, 2018 at 12:26 PM, Milind Changire <span dir="ltr"><<a href="mailto:mchangir@redhat.com" target="_blank">mchangir@redhat.com</a>></span> wrote:<br><blockquote type="cite" style="margin:0 0 0 .8ex; border-left:2px #729fcf solid;padding-left:1ex"><div dir="ltr"><div>Kashif,</div><div>You can change the log level by:</div><div>$ gluster volume set <vol> diagnostics.brick-log-level TRACE</div><div>$ gluster volume set <vol> diagnostics.client-log-level TRACE</div><div><br></div><div>and see how things fare</div><div><br></div><div>If you want fewer logs you can change the log-level to DEBUG instead of TRACE.</div><div><br></div><div><br></div></div><div class="gmail_extra"><br><div class="gmail_quote"><div><div class="m_-7144954943680245677m_-8863048801970484267m_-2956020922779137409m_-5775231697309111402m_-5140365810814053404m_5014525246591411878m_2285939528783175411m_4375213844299906572m_4931964523437139568m_4152801743731486923m_-4451941051880752073m_3059762644582098629m_7147468144477528676h5">On Tue, Jun 12, 2018 at 3:37 PM, mohammad kashif <span dir="ltr"><<a href="mailto:kashif.alig@gmail.com" target="_blank">kashif.alig@gmail.com</a>></span> wrote:<br></div></div><blockquote type="cite" style="margin:0 0 0 .8ex; border-left:2px #729fcf solid;padding-left:1ex"><div><div class="m_-7144954943680245677m_-8863048801970484267m_-2956020922779137409m_-5775231697309111402m_-5140365810814053404m_5014525246591411878m_2285939528783175411m_4375213844299906572m_4931964523437139568m_4152801743731486923m_-4451941051880752073m_3059762644582098629m_7147468144477528676h5"><div dir="ltr"><div>Hi Vijay</div><div><br></div><div>Now it is unmounting every 30 mins ! 
<br></div><div><br></div><div>The server log at /var/log/glusterfs/bricks/glus<wbr>teratlas-brics001-gv0.log have this line only</div><div><br></div><div>2018-06-12 09:53:19.303102] I [MSGID: 115013] [server-helpers.c:289:do_fd_cl<wbr>eanup] 0-atlasglust-server: fd cleanup on /atlas/atlasdata/zgubic/hmumu/<wbr>histograms/v14.3/Signal<br>[2018-06-12 09:53:19.306190] I [MSGID: 101055] [client_t.c:443:gf_client_unre<wbr>f] 0-atlasglust-server: Shutting down connection <server-name> -2224879-2018/06/12-09:51:01:4<wbr>60889-atlasglust-client-0-0-0</div><div><br></div><div>There is no other information. Is there any way to increase log verbosity?</div><div><br></div><div>on the client <br></div><div><br></div><div>2018-06-12 09:51:01.744980] I [MSGID: 114057] [client-handshake.c:1478:selec<wbr>t_server_supported_programs] 0-atlasglust-client-5: Using Program GlusterFS 3.3, Num (1298437), Version (330)<br>[2018-06-12 09:51:01.746508] I [MSGID: 114046] [client-handshake.c:1231:clien<wbr>t_setvolume_cbk] 0-atlasglust-client-5: Connected to atlasglust-client-5, attached to remote volume '/glusteratlas/brick006/gv0'.<br>[2018-06-12 09:51:01.746543] I [MSGID: 114047] [client-handshake.c:1242:clien<wbr>t_setvolume_cbk] 0-atlasglust-client-5: Server and Client lk-version numbers are not same, reopening the fds<br>[2018-06-12 09:51:01.746814] I [MSGID: 114035] [client-handshake.c:202:client<wbr>_set_lk_version_cbk] 0-atlasglust-client-5: Server lk version = 1<br>[2018-06-12 09:51:01.748449] I [MSGID: 114057] [client-handshake.c:1478:selec<wbr>t_server_supported_programs] 0-atlasglust-client-6: Using Program GlusterFS 3.3, Num (1298437), Version (330)<br>[2018-06-12 09:51:01.750219] I [MSGID: 114046] [client-handshake.c:1231:clien<wbr>t_setvolume_cbk] 0-atlasglust-client-6: Connected to atlasglust-client-6, attached to remote volume '/glusteratlas/brick007/gv0'.<br>[2018-06-12 09:51:01.750261] I [MSGID: 114047] [client-handshake.c:1242:clien<wbr>t_setvolume_cbk] 0-atlasglust-client-6: Server and Client lk-version numbers are not same, reopening the fds<br>[2018-06-12 09:51:01.750503] I [MSGID: 114035] [client-handshake.c:202:client<wbr>_set_lk_version_cbk] 0-atlasglust-client-6: Server lk version = 1<br>[2018-06-12 09:51:01.752207] I [fuse-bridge.c:4205:fuse_init] 0-glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.24 kernel 7.14<br>[2018-06-12 09:51:01.752261] I [fuse-bridge.c:4835:fuse_graph<wbr>_sync] 0-fuse: switched to graph 0<br></div><div><br></div><div><br></div><div>is there a problem with server and client 1k version?</div><div><br></div><div>Thanks for your help.</div><span class="m_-7144954943680245677m_-8863048801970484267m_-2956020922779137409m_-5775231697309111402m_-5140365810814053404m_5014525246591411878m_2285939528783175411m_4375213844299906572m_4931964523437139568m_4152801743731486923m_-4451941051880752073m_3059762644582098629m_7147468144477528676m_605157100837625209HOEnZb"><font color="#888888"><div><br></div><div>Kashif<br></div><div><br></div><div><br></div><div><br></div><div> </div></font></span></div><div class="m_-7144954943680245677m_-8863048801970484267m_-2956020922779137409m_-5775231697309111402m_-5140365810814053404m_5014525246591411878m_2285939528783175411m_4375213844299906572m_4931964523437139568m_4152801743731486923m_-4451941051880752073m_3059762644582098629m_7147468144477528676m_605157100837625209HOEnZb"><div 
class="m_-7144954943680245677m_-8863048801970484267m_-2956020922779137409m_-5775231697309111402m_-5140365810814053404m_5014525246591411878m_2285939528783175411m_4375213844299906572m_4931964523437139568m_4152801743731486923m_-4451941051880752073m_3059762644582098629m_7147468144477528676m_605157100837625209h5"><div class="gmail_extra"><br><div class="gmail_quote">On Mon, Jun 11, 2018 at 11:52 PM, Vijay Bellur <span dir="ltr"><<a href="mailto:vbellur@redhat.com" target="_blank">vbellur@redhat.com</a>></span> wrote:<br><blockquote type="cite" style="margin:0 0 0 .8ex; border-left:2px #729fcf solid;padding-left:1ex"><div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote"><span>On Mon, Jun 11, 2018 at 8:50 AM, mohammad kashif <span dir="ltr"><<a href="mailto:kashif.alig@gmail.com" target="_blank">kashif.alig@gmail.com</a>></span> wrote:<br><blockquote type="cite" style="margin:0 0 0 .8ex; border-left:2px #729fcf solid;padding-left:1ex"><div dir="ltr"><div>Hi</div><div><br></div><div>Since I have updated our gluster server and client to latest version 3.12.9-1, I am having this issue of gluster getting unmounted from client very regularly. It was not a problem before update.<br></div><div><br></div><div>Its a distributed file system with no replication. We have seven servers totaling around 480TB data. Its 97% full. <br></div><div><br></div><div>I am using following config on server</div><div><br></div><div><br></div>gluster volume set atlasglust features.cache-invalidation on<br>gluster volume set atlasglust features.cache-invalidation-ti<wbr>meout 600<br>gluster volume set atlasglust performance.stat-prefetch on<br>gluster volume set atlasglust performance.cache-invalidation on<br>gluster volume set atlasglust performance.md-cache-timeout 600<br>gluster volume set atlasglust performance.parallel-readdir on<br>gluster volume set atlasglust performance.cache-size 1GB<br>gluster volume set atlasglust performance.client-io-threads on<br>gluster volume set atlasglust cluster.lookup-optimize on<br>gluster volume set atlasglust performance.stat-prefetch on<br>gluster volume set atlasglust client.event-threads 4<br>gluster volume set atlasglust server.event-threads 4<br><div><br></div><div>clients are mounted with this option</div><div><br></div><div>defaults,direct-io-mode=disabl<wbr>e,attribute-timeout=600,entry-<wbr>timeout=600,negative-timeout=6<wbr>00,fopen-keep-cache,rw,_netdev <br></div><div><br></div><div>I can't see anything in the log file. Can someone suggest that how to troubleshoot this issue?</div><div><br></div><div><br></div></div><br></blockquote><div><br></div><div><br></div></span><div>Can you please share the log file? Checking for messages related to disconnections/crashes in the log file would be a good way to start troubleshooting the problem.</div><div><br></div><div>Thanks,</div><div>Vijay </div></div></div></div>
</blockquote></div><br></div>
</div></div><br></div></div>______________________________<wbr>_________________<br>
Gluster-users mailing list<br>
<a href="mailto:Gluster-users@gluster.org" target="_blank">Gluster-users@gluster.org</a><br>
<a href="http://lists.gluster.org/mailman/listinfo/gluster-users" rel="noreferrer" target="_blank">http://lists.gluster.org/mailm<wbr>an/listinfo/gluster-users</a><span class="m_-7144954943680245677m_-8863048801970484267m_-2956020922779137409m_-5775231697309111402m_-5140365810814053404m_5014525246591411878m_2285939528783175411m_4375213844299906572m_4931964523437139568m_4152801743731486923m_-4451941051880752073m_3059762644582098629m_7147468144477528676HOEnZb"><font color="#888888"><br></font></span></blockquote></div><span class="m_-7144954943680245677m_-8863048801970484267m_-2956020922779137409m_-5775231697309111402m_-5140365810814053404m_5014525246591411878m_2285939528783175411m_4375213844299906572m_4931964523437139568m_4152801743731486923m_-4451941051880752073m_3059762644582098629m_7147468144477528676HOEnZb"><font color="#888888"><br><br clear="all"><br><pre>_______________________________________________</pre><pre>Gluster-users mailing list</pre><pre><a href="mailto:Gluster-users@gluster.org">Gluster-users@gluster.org</a></pre><pre><a href="http://lists.gluster.org/mailman/listinfo/gluster-users">http://lists.gluster.org/mailman/listinfo/gluster-users</a></pre></font></span></div></blockquote></div></div></div></div></blockquote></div></div></div></div></blockquote></div></div></div></div></blockquote></div></div></div></div></div></blockquote></div></div></div></div></blockquote></div></div></div></div></blockquote></div></div></div></div></blockquote></div></div></div></div></blockquote></div></div></div></div></blockquote></div></div></div></div></div></blockquote></div></div></div></div></div></div></blockquote></div></div></div></div></blockquote></div></div></blockquote><div><span><pre><pre>-- <br></pre>James P. Kinney III
Every time you stop a school, you will have to build a jail. What you
gain at one end you lose at the other. It's like feeding a dog on his
own tail. It won't fatten the dog.
- Speech 11/23/1900 Mark Twain
http://heretothereideas.blogspot.com/
</pre></span></div></body></html>