[Gluster-users] Client un-mounting since upgrade to 3.12.9-1 version

Jim Kinney jim.kinney at gmail.com
Thu Jun 14 14:30:51 UTC 2018


Hmm. I have several 3.12.9 volumes with 3.12.9 clients that are
dropping the mount even though parallel-readdir is off. This only happens
on the RDMA transport. The TCP transport mounts are fine.
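
To rule the transport in or out, the same volume can be mounted both ways
side by side; a minimal sketch, assuming the volume was created with
"transport rdma,tcp" and substituting real server/volume names:

  mount -t glusterfs -o transport=rdma server1:/myvol /mnt/myvol-rdma
  mount -t glusterfs -o transport=tcp  server1:/myvol /mnt/myvol-tcp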
Option                                       Value
------                                       -----
cluster.lookup-unhashed                      on
cluster.lookup-optimize                      off
cluster.min-free-disk                        10%
cluster.min-free-inodes                      5%
cluster.rebalance-stats                      off
cluster.subvols-per-directory                (null)
cluster.readdir-optimize                     off
cluster.rsync-hash-regex                     (null)
cluster.extra-hash-regex                     (null)
cluster.dht-xattr-name                       trusted.glusterfs.dht
cluster.randomize-hash-range-by-gfid         off
cluster.rebal-throttle                       normal
cluster.lock-migration                       off
cluster.local-volume-name                    (null)
cluster.weighted-rebalance                   on
cluster.switch-pattern                       (null)
cluster.entry-change-log                     on
cluster.read-subvolume                       (null)
cluster.read-subvolume-index                 -1
cluster.read-hash-mode                       1
cluster.background-self-heal-count           8
cluster.metadata-self-heal                   on
cluster.data-self-heal                       on
cluster.entry-self-heal                      on
cluster.self-heal-daemon                     enable
cluster.heal-timeout                         600
cluster.self-heal-window-size                1
cluster.data-change-log                      on
cluster.metadata-change-log                  on
cluster.data-self-heal-algorithm             (null)
cluster.eager-lock                           on
disperse.eager-lock                          on
cluster.quorum-type                          none
cluster.quorum-count                         (null)
cluster.choose-local                         true
cluster.self-heal-readdir-size               1KB
cluster.post-op-delay-secs                   1
cluster.ensure-durability                    on
cluster.consistent-metadata                  no
cluster.heal-wait-queue-length               128
cluster.favorite-child-policy                none
cluster.stripe-block-size                    128KB
cluster.stripe-coalesce                      true
diagnostics.latency-measurement              off
diagnostics.dump-fd-stats                    off
diagnostics.count-fop-hits                   off
diagnostics.brick-log-level                  INFO
diagnostics.client-log-level                 INFO
diagnostics.brick-sys-log-level              CRITICAL
diagnostics.client-sys-log-level             CRITICAL
diagnostics.brick-logger                     (null)
diagnostics.client-logger                    (null)
diagnostics.brick-log-format                 (null)
diagnostics.client-log-format                (null)
diagnostics.brick-log-buf-size               5
diagnostics.client-log-buf-size              5
diagnostics.brick-log-flush-timeout          120
diagnostics.client-log-flush-timeout         120
diagnostics.stats-dump-interval              0
diagnostics.fop-sample-interval              0
diagnostics.stats-dump-format                json
diagnostics.fop-sample-buf-size              65535
diagnostics.stats-dnscache-ttl-sec           86400
performance.cache-max-file-size              0
performance.cache-min-file-size              0
performance.cache-refresh-timeout            1
performance.cache-priority                   
performance.cache-size                       32MB
performance.io-thread-count                  16
performance.high-prio-threads                16
performance.normal-prio-threads              16
performance.low-prio-threads                 16
performance.least-prio-threads               1
performance.enable-least-priority            on
performance.cache-size                       128MB
performance.flush-behind                     on
performance.nfs.flush-behind                 on
performance.write-behind-window-size         1MB
performance.resync-failed-syncs-after-fsync  off
performance.nfs.write-behind-window-size     1MB
performance.strict-o-direct                  off
performance.nfs.strict-o-direct              off
performance.strict-write-ordering            off
performance.nfs.strict-write-ordering        off
performance.lazy-open                        yes
performance.read-after-open                  no
performance.read-ahead-page-count            4
performance.md-cache-timeout                 1
performance.cache-swift-metadata             true
performance.cache-samba-metadata             false
performance.cache-capability-xattrs          true
performance.cache-ima-xattrs                 true
features.encryption                          off
encryption.master-key                        (null)
encryption.data-key-size                     256
encryption.block-size                        4096
network.frame-timeout                        1800
network.ping-timeout                         42
network.tcp-window-size                      (null)
features.lock-heal                           off
features.grace-timeout                       10
network.remote-dio                           disable
client.event-threads                         2
client.tcp-user-timeout                      0
client.keepalive-time                        20
client.keepalive-interval                    2
client.keepalive-count                       9
network.tcp-window-size                      (null)
network.inode-lru-limit                      16384
auth.allow                                   *
auth.reject                                  (null)
transport.keepalive                          1
server.allow-insecure                        (null)
server.root-squash                           off
server.anonuid                               65534
server.anongid                               65534
server.statedump-path                        /var/run/gluster
server.outstanding-rpc-limit                 64
features.lock-heal                           off
features.grace-timeout                       10
server.ssl                                   (null)
auth.ssl-allow                               *
server.manage-gids                           off
server.dynamic-auth                          on
client.send-gids                             on
server.gid-timeout                           300
server.own-thread                            (null)
server.event-threads                         1
server.tcp-user-timeout                      0
server.keepalive-time                        20
server.keepalive-interval                    2
server.keepalive-count                       9
transport.listen-backlog                     10
ssl.own-cert                                 (null)
ssl.private-key                              (null)
ssl.ca-list                                  (null)
ssl.crl-path                                 (null)
ssl.certificate-depth                        (null)
ssl.cipher-list                              (null)
ssl.dh-param                                 (null)
ssl.ec-curve                                 (null)
performance.write-behind                     on
performance.read-ahead                       on
performance.readdir-ahead                    on
performance.io-cache                         on
performance.quick-read                       on
performance.open-behind                      on
performance.nl-cache                         off
performance.stat-prefetch                    on
performance.client-io-threads                on
performance.nfs.write-behind                 on
performance.nfs.read-ahead                   off
performance.nfs.io-cache                     off
performance.nfs.quick-read                   off
performance.nfs.stat-prefetch                off
performance.nfs.io-threads                   off
performance.force-readdirp                   true
performance.cache-invalidation               false
features.uss                                 off
features.snapshot-directory                  .snaps
features.show-snapshot-directory             off
network.compression                          off
network.compression.window-size              -15
network.compression.mem-level                8
network.compression.min-size                 0
network.compression.compression-level        -1
network.compression.debug                    false
features.limit-usage                         (null)
features.default-soft-limit                  80%
features.soft-timeout                        60
features.hard-timeout                        5
features.alert-time                          86400
features.quota-deem-statfs                   off
geo-replication.indexing                     off
geo-replication.indexing                     off
geo-replication.ignore-pid-check             off
geo-replication.ignore-pid-check             off
features.quota                               off
features.inode-quota                         off
features.bitrot                              disable
debug.trace                                  off
debug.log-history                            no
debug.log-file                               no
debug.exclude-ops                            (null)
debug.include-ops                            (null)
debug.error-gen                              off
debug.error-failure                          (null)
debug.error-number                           (null)
debug.random-failure                         off
debug.error-fops                             (null)
nfs.disable                                  off
features.read-only                           off
features.worm                                off
features.worm-file-level                     off
features.default-retention-period            120
features.retention-mode                      relax
features.auto-commit-period                  180
storage.linux-aio                            off
storage.batch-fsync-mode                     reverse-fsync
storage.batch-fsync-delay-usec               0
storage.owner-uid                            -1
storage.owner-gid                            -1
storage.node-uuid-pathinfo                   off
storage.health-check-interval                30
storage.build-pgfid                          off
storage.gfid2path                            on
storage.gfid2path-separator                  :
storage.bd-aio                               off
cluster.server-quorum-type                   off
cluster.server-quorum-ratio                  0
changelog.changelog                          off
changelog.changelog-dir                      (null)
changelog.encoding                           ascii
changelog.rollover-time                      15
changelog.fsync-interval                     5
changelog.changelog-barrier-timeout          120
changelog.capture-del-path                   off
features.barrier                             disable
features.barrier-timeout                     120
features.trash                               off
features.trash-dir                           .trashcan
features.trash-eliminate-path                (null)
features.trash-max-filesize                  5MB
features.trash-internal-op                   off
cluster.enable-shared-storage                disable
cluster.write-freq-threshold                 0
cluster.read-freq-threshold                  0
cluster.tier-pause                           off
cluster.tier-promote-frequency               120
cluster.tier-demote-frequency                3600
cluster.watermark-hi                         90
cluster.watermark-low                        75
cluster.tier-mode                            cache
cluster.tier-max-promote-file-size           0
cluster.tier-max-mb                          4000
cluster.tier-max-files                       10000
cluster.tier-query-limit                     100
cluster.tier-compact                         on
cluster.tier-hot-compact-frequency           604800
cluster.tier-cold-compact-frequency          604800
features.ctr-enabled                         off
features.record-counters                     off
features.ctr-record-metadata-heat            off
features.ctr_link_consistency                off
features.ctr_lookupheal_link_timeout         300
features.ctr_lookupheal_inode_timeout        300
features.ctr-sql-db-cachesize                12500
features.ctr-sql-db-wal-autocheckpoint       25000
features.selinux                             on
locks.trace                                  off
locks.mandatory-locking                      off
cluster.disperse-self-heal-daemon            enable
cluster.quorum-reads                         no
client.bind-insecure                         (null)
features.shard                               off
features.shard-block-size                    64MB
features.scrub-throttle                      lazy
features.scrub-freq                          biweekly
features.scrub                               false
features.expiry-time                         120
features.cache-invalidation                  off
features.cache-invalidation-timeout          60
features.leases                              off
features.lease-lock-recall-timeout           60
disperse.background-heals                    8
disperse.heal-wait-qlength                   128
cluster.heal-timeout                         600
dht.force-readdirp                           on
disperse.read-policy                         gfid-hash
cluster.shd-max-threads                      1
cluster.shd-wait-qlength                     1024
cluster.locking-scheme                       full
cluster.granular-entry-heal                  no
features.locks-revocation-secs               0
features.locks-revocation-clear-all          false
features.locks-revocation-max-blocked        0
features.locks-monkey-unlocking              false
disperse.shd-max-threads                     1
disperse.shd-wait-qlength                    1024
disperse.cpu-extensions                      auto
disperse.self-heal-window-size               1
cluster.use-compound-fops                    off
performance.parallel-readdir                 off
performance.rda-request-size                 131072
performance.rda-low-wmark                    4096
performance.rda-high-wmark                   128KB
performance.rda-cache-limit                  10MB
performance.nl-cache-positive-entry          false
performance.nl-cache-limit                   10MB
performance.nl-cache-timeout                 60
cluster.brick-multiplex                      off
cluster.max-bricks-per-process               0
disperse.optimistic-change-log               on
cluster.halo-enabled                         False
cluster.halo-shd-max-latency                 99999
cluster.halo-nfsd-max-latency                5
cluster.halo-max-latency                     5
cluster.halo-max-replicas                    99999
cluster.halo-min-replicas                    2
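
(For reference, a full option dump like the one above is what the
volume-get interface prints; presumably something like the following,
with "myvol" standing in for the volume name:)

  gluster volume get myvol all
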
On Thu, 2018-06-14 at 12:12 +0100, mohammad kashif wrote:
> Hi Nithya
> 
> It seems that the problem can be solved by either turning parallel-readdir
> off or downgrading the client to 3.10.12-1. Yesterday I downgraded some
> clients to 3.10.12-1 and that seems to have fixed the problem. Today, when
> I saw your email, I disabled parallel-readdir and the current client,
> 3.12.9-1, started to work. I upgraded the servers and clients to 3.12.9-1
> last month and since then clients had been intermittently unmounting about
> once a week. But during the last three days, it started unmounting every
> few minutes. I don't know what triggered this sudden panic except that the
> file system was quite full; around 98%. It is a 480 TB file system with
> almost 80 million files.
> 
> The servers have 64GB RAM and the clients have 64GB to 192GB RAM. I tested
> with a 192GB RAM client and it still had the same issue.
> 
> 
> Volume Name: atlasglust
> Type: Distribute
> Volume ID: fbf0ebb8-deab-4388-9d8a-f722618a624b
> Status: Started
> Snapshot Count: 0
> Number of Bricks: 7
> Transport-type: tcp
> Bricks:
> Brick1: pplxgluster01.X.Y.Z:/glusteratlas/brick001/gv0
> Brick2: pplxgluster02.X.Y.Z:/glusteratlas/brick002/gv0
> Brick3: pplxgluster03.X.Y.Z:/glusteratlas/brick003/gv0
> Brick4: pplxgluster04.X.Y.Z:/glusteratlas/brick004/gv0
> Brick5: pplxgluster05.X.Y.Z:/glusteratlas/brick005/gv0
> Brick6: pplxgluster06.X.Y.Z:/glusteratlas/brick006/gv0
> Brick7: pplxgluster07.X.Y.Z:/glusteratlas/brick007/gv0
> Options Reconfigured:
> diagnostics.client-log-level: ERROR
> diagnostics.brick-log-level: ERROR
> performance.cache-invalidation: on
> server.event-threads: 4
> client.event-threads: 4
> cluster.lookup-optimize: on
> performance.client-io-threads: on
> performance.cache-size: 1GB
> performance.parallel-readdir: off
> performance.md-cache-timeout: 600
> performance.stat-prefetch: on
> features.cache-invalidation-timeout: 600
> features.cache-invalidation: on
> auth.allow: X.Y.Z.*
> transport.address-family: inet
> performance.readdir-ahead: on
> nfs.disable: on
> 
> 
> Thanks
> 
> Kashif
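
For the record, the workaround described above boils down to something like
this (volume name from the volume info above; the exact client package list
for the downgrade is an assumption and depends on what is installed):

  gluster volume set atlasglust performance.parallel-readdir off
  # and, on an affected EL6 client, roughly:
  yum downgrade glusterfs-3.10.12-1.el6 glusterfs-fuse-3.10.12-1.el6 \
      glusterfs-libs-3.10.12-1.el6 glusterfs-client-xlators-3.10.12-1.el6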
> 
> On Thu, Jun 14, 2018 at 5:39 AM, Nithya Balachandran <nbalacha at redhat.com> wrote:
> > +Poornima who works on parallel-readdir. 
> > @Poornima, Have you seen anything like this before?
> > 
> > On 14 June 2018 at 10:07, Nithya Balachandran <nbalacha at redhat.com>
> > wrote:
> > > This is not the same issue as the one you are referring - that
> > > was in the RPC layer and caused the bricks to crash. This one is
> > > different as it seems to be in the dht and rda layers. It does
> > > look like a stack overflow though.
> > > @Mohammad,
> > > 
> > > Please send the following information:
> > > 
> > > 1. gluster volume info 
> > > 2. The number of entries in the directory being listed
> > > 3. System memory
> > > 
> > > Does this still happen if you turn off parallel-readdir?
> > > 
> > > Regards,
> > > Nithya
> > > 
> > > 
> > > 
> > > 
> > > On 13 June 2018 at 16:40, Milind Changire <mchangir at redhat.com>
> > > wrote:
> > > > +Nithya
> > > > 
> > > > Nithya,
> > > > Do these logs [1] look similar to the recursive readdir()
> > > > issue that you encountered just a while back?
> > > > i.e. the recursive readdir() response definition in the XDR
> > > > 
> > > > [1] http://www-pnp.physics.ox.ac.uk/~mohammad/backtrace.log
> > > > 
> > > > 
> > > > On Wed, Jun 13, 2018 at 4:29 PM, mohammad kashif <kashif.alig at gmail.com> wrote:
> > > > > Hi Milind
> > > > > 
> > > > > Thanks a lot, I managed to run gdb and produced a backtrace as
> > > > > well. It's here:
> > > > > 
> > > > > http://www-pnp.physics.ox.ac.uk/~mohammad/backtrace.log 
> > > > > 
> > > > > 
> > > > > I am trying to understand it but am still not able to make
> > > > > sense of it.
> > > > > 
> > > > > Thanks
> > > > > 
> > > > > Kashif
> > > > > 
> > > > > 
> > > > > On Wed, Jun 13, 2018 at 11:34 AM, Milind Changire <mchangir at redhat.com> wrote:
> > > > > > Kashif,
> > > > > > FYI: http://debuginfo.centos.org/centos/6/storage/x86_64/
> > > > > > 
> > > > > > 
> > > > > > 
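
(A hedged sketch: packages from that debuginfo tree can usually be made
visible to yum with a drop-in repo file; the repo id and file name below
are made up:)

  # /etc/yum.repos.d/storage-debuginfo.repo
  [centos-storage-debuginfo]
  name=CentOS-6 Storage SIG - debuginfo
  baseurl=http://debuginfo.centos.org/centos/6/storage/x86_64/
  enabled=1
  gpgcheck=0
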
> > > > > > On Wed, Jun 13, 2018 at 3:21 PM, mohammad kashif <kashif.alig at gmail.com> wrote:
> > > > > > > Hi Milind 
> > > > > > > 
> > > > > > > There is no glusterfs-debuginfo available for gluster-3.12 from the
> > > > > > > http://mirror.centos.org/centos/6/storage/x86_64/gluster-3.12/
> > > > > > > repo. Do you know where I can get it?
> > > > > > > Also, when I run gdb, it says:
> > > > > > > 
> > > > > > > Missing separate debuginfos, use: debuginfo-install
> > > > > > > glusterfs-fuse-3.12.9-1.el6.x86_64 
> > > > > > > 
> > > > > > > I can't find a debug package for glusterfs-fuse either.
> > > > > > > 
> > > > > > > Thanks from the pit of despair ;)
> > > > > > > 
> > > > > > > Kashif
> > > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > > On Tue, Jun 12, 2018 at 5:01 PM, mohammad kashif <kashif.alig at gmail.com> wrote:
> > > > > > > > Hi Milind
> > > > > > > > 
> > > > > > > > I will send you links for logs.
> > > > > > > > 
> > > > > > > > I collected these core dumps on the client, and there is no
> > > > > > > > glusterd process running on the client.
> > > > > > > > 
> > > > > > > > Kashif
> > > > > > > > 
> > > > > > > > 
> > > > > > > > 
> > > > > > > > 
> > > > > > > > On Tue, Jun 12, 2018 at 4:14 PM, Milind Changire <mchangir at redhat.com> wrote:
> > > > > > > > > Kashif,
> > > > > > > > > Could you also send over the client/mount log file as
> > > > > > > > > Vijay suggested?
> > > > > > > > > Or maybe just the lines around the crash backtrace.
> > > > > > > > > 
> > > > > > > > > Also, you've mentioned that you straced glusterd, but
> > > > > > > > > when you ran gdb, you ran it over /usr/sbin/glusterfs
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > On Tue, Jun 12, 2018 at 8:19 PM, Vijay Bellur <vbellur at redhat.com> wrote:
> > > > > > > > > > On Tue, Jun 12, 2018 at 7:40 AM, mohammad kashif <kashif.alig at gmail.com> wrote:
> > > > > > > > > > > Hi Milind
> > > > > > > > > > > 
> > > > > > > > > > > The operating system is Scientific Linux 6 which
> > > > > > > > > > > is based on RHEL6. The cpu arch is Intel x86_64.
> > > > > > > > > > > 
> > > > > > > > > > > I will send you a separate email with link to
> > > > > > > > > > > core dump.
> > > > > > > > > > 
> > > > > > > > > > You could also grep for crash in the client log
> > > > > > > > > > file and the lines following crash would have a
> > > > > > > > > > backtrace in most cases.
> > > > > > > > > > 
> > > > > > > > > > HTH,
> > > > > > > > > > Vijay
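
(Concretely, something like the following, assuming a typical fuse mount
log path; the actual file name is derived from the mount point:)

  grep -A20 crash /var/log/glusterfs/mnt-atlas.log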
> > > > > > > > > >  
> > > > > > > > > > > Thanks for your help.
> > > > > > > > > > > 
> > > > > > > > > > > Kashif
> > > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > On Tue, Jun 12, 2018 at 3:16 PM, Milind Changire
> > > > > > > > > > > <mchangir at redhat.com> wrote:
> > > > > > > > > > > > Kashif,
> > > > > > > > > > > > Could you share the core dump via Google Drive
> > > > > > > > > > > > or something similar
> > > > > > > > > > > > 
> > > > > > > > > > > > Also, let me know the CPU arch and OS
> > > > > > > > > > > > Distribution on which you are running gluster.
> > > > > > > > > > > > 
> > > > > > > > > > > > If you've installed the glusterfs-debuginfo
> > > > > > > > > > > > package, you'll also get the source lines in
> > > > > > > > > > > > the backtrace via gdb
> > > > > > > > > > > > 
> > > > > > > > > > > > 
> > > > > > > > > > > > 
> > > > > > > > > > > > 
> > > > > > > > > > > > On Tue, Jun 12, 2018 at 5:59 PM, mohammad
> > > > > > > > > > > > kashif <kashif.alig at gmail.com> wrote:
> > > > > > > > > > > > > Hi Milind, Vijay 
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Thanks, I have some more information now, as I
> > > > > > > > > > > > > straced glusterd on the client:
> > > > > > > > > > > > > 
> > > > > > > > > > > > > 138544      0.000131 mprotect(0x7f2f70785000,
> > > > > > > > > > > > > 4096, PROT_READ|PROT_WRITE) = 0 <0.000026>
> > > > > > > > > > > > > 138544      0.000128 mprotect(0x7f2f70786000,
> > > > > > > > > > > > > 4096, PROT_READ|PROT_WRITE) = 0 <0.000027>
> > > > > > > > > > > > > 138544      0.000126 mprotect(0x7f2f70787000,
> > > > > > > > > > > > > 4096, PROT_READ|PROT_WRITE) = 0 <0.000027>
> > > > > > > > > > > > > 138544      0.000124 --- SIGSEGV
> > > > > > > > > > > > > {si_signo=SIGSEGV, si_code=SEGV_ACCERR,
> > > > > > > > > > > > > si_addr=0x7f2f7c60ef88} ---
> > > > > > > > > > > > > 138544      0.000051 --- SIGSEGV
> > > > > > > > > > > > > {si_signo=SIGSEGV, si_code=SI_KERNEL,
> > > > > > > > > > > > > si_addr=0} ---
> > > > > > > > > > > > > 138551      0.105048 +++ killed by SIGSEGV
> > > > > > > > > > > > > (core dumped) +++
> > > > > > > > > > > > > 138550      0.000041 +++ killed by SIGSEGV
> > > > > > > > > > > > > (core dumped) +++
> > > > > > > > > > > > > 138547      0.000008 +++ killed by SIGSEGV
> > > > > > > > > > > > > (core dumped) +++
> > > > > > > > > > > > > 138546      0.000007 +++ killed by SIGSEGV
> > > > > > > > > > > > > (core dumped) +++
> > > > > > > > > > > > > 138545      0.000007 +++ killed by SIGSEGV
> > > > > > > > > > > > > (core dumped) +++
> > > > > > > > > > > > > 138544      0.000008 +++ killed by SIGSEGV
> > > > > > > > > > > > > (core dumped) +++
> > > > > > > > > > > > > 138543      0.000007 +++ killed by SIGSEGV
> > > > > > > > > > > > > (core dumped) +++
> > > > > > > > > > > > > 
> > > > > > > > > > > > > As far as I understand, gluster is somehow trying
> > > > > > > > > > > > > to access memory in an inappropriate manner and
> > > > > > > > > > > > > the kernel sends SIGSEGV.
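
(A trace with per-call timings like the one above can be captured with
something along these lines; the pid and output file are placeholders:)

  strace -f -r -T -o /tmp/glusterfs.strace -p <pid>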
> > > > > > > > > > > > > 
> > > > > > > > > > > > > I also got the core dump. I am trying gdb for the
> > > > > > > > > > > > > first time, so I am not sure whether I am
> > > > > > > > > > > > > using it correctly:
> > > > > > > > > > > > > 
> > > > > > > > > > > > > gdb /usr/sbin/glusterfs core.138536
> > > > > > > > > > > > > 
> > > > > > > > > > > > > It just tells me that the program terminated with
> > > > > > > > > > > > > signal 11, segmentation fault.
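
(To get a usable backtrace out of gdb rather than just the signal, the
usual next step once the debuginfo packages are installed is:)

  gdb /usr/sbin/glusterfs core.138536
  (gdb) bt
  (gdb) thread apply all bt full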
> > > > > > > > > > > > > 
> > > > > > > > > > > > > The problem is not limited to one client but is
> > > > > > > > > > > > > happening on many clients.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > I would really appreciate any help, as the whole
> > > > > > > > > > > > > file system has become unusable.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Thanks
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Kashif
> > > > > > > > > > > > > 
> > > > > > > > > > > > > 
> > > > > > > > > > > > > 
> > > > > > > > > > > > > 
> > > > > > > > > > > > > 
> > > > > > > > > > > > > On Tue, Jun 12, 2018 at 12:26 PM, Milind
> > > > > > > > > > > > > Changire <mchangir at redhat.com> wrote:
> > > > > > > > > > > > > > Kashif,
> > > > > > > > > > > > > > You can change the log level by:
> > > > > > > > > > > > > > $ gluster volume set <vol>
> > > > > > > > > > > > > > diagnostics.brick-log-level TRACE
> > > > > > > > > > > > > > $ gluster volume set <vol>
> > > > > > > > > > > > > > diagnostics.client-log-level TRACE
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > and see how things fare
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > If you want fewer logs you can change the
> > > > > > > > > > > > > > log-level to DEBUG instead of TRACE.
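
(And presumably set back afterwards, since TRACE is extremely verbose;
INFO is the default:)

  gluster volume set <vol> diagnostics.brick-log-level INFO
  gluster volume set <vol> diagnostics.client-log-level INFO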
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > On Tue, Jun 12, 2018 at 3:37 PM, mohammad
> > > > > > > > > > > > > > kashif <kashif.alig at gmail.com> wrote:
> > > > > > > > > > > > > > > Hi Vijay
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > Now it is unmounting every 30 mins!
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > The server log at
> > > > > > > > > > > > > > > /var/log/glusterfs/bricks/glusteratlas-
> > > > > > > > > > > > > > > brics001-gv0.log has only these lines:
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > [2018-06-12 09:53:19.303102] I [MSGID:
> > > > > > > > > > > > > > > 115013] [server-
> > > > > > > > > > > > > > > helpers.c:289:do_fd_cleanup] 0-
> > > > > > > > > > > > > > > atlasglust-server: fd cleanup on
> > > > > > > > > > > > > > > /atlas/atlasdata/zgubic/hmumu/histograms/
> > > > > > > > > > > > > > > v14.3/Signal
> > > > > > > > > > > > > > > [2018-06-12 09:53:19.306190] I [MSGID:
> > > > > > > > > > > > > > > 101055] [client_t.c:443:gf_client_unref]
> > > > > > > > > > > > > > > 0-atlasglust-server: Shutting down
> > > > > > > > > > > > > > > connection <server-name> -2224879-
> > > > > > > > > > > > > > > 2018/06/12-09:51:01:460889-atlasglust-
> > > > > > > > > > > > > > > client-0-0-0
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > There is no other information. Is there
> > > > > > > > > > > > > > > any way to increase log verbosity?
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > On the client:
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > [2018-06-12 09:51:01.744980] I [MSGID:
> > > > > > > > > > > > > > > 114057] [client-
> > > > > > > > > > > > > > > handshake.c:1478:select_server_supported_
> > > > > > > > > > > > > > > programs] 0-atlasglust-client-5: Using
> > > > > > > > > > > > > > > Program GlusterFS 3.3, Num (1298437),
> > > > > > > > > > > > > > > Version (330)
> > > > > > > > > > > > > > > [2018-06-12 09:51:01.746508] I [MSGID:
> > > > > > > > > > > > > > > 114046] [client-
> > > > > > > > > > > > > > > handshake.c:1231:client_setvolume_cbk] 0-
> > > > > > > > > > > > > > > atlasglust-client-5: Connected to
> > > > > > > > > > > > > > > atlasglust-client-5, attached to remote
> > > > > > > > > > > > > > > volume '/glusteratlas/brick006/gv0'.
> > > > > > > > > > > > > > > [2018-06-12 09:51:01.746543] I [MSGID:
> > > > > > > > > > > > > > > 114047] [client-
> > > > > > > > > > > > > > > handshake.c:1242:client_setvolume_cbk] 0-
> > > > > > > > > > > > > > > atlasglust-client-5: Server and Client
> > > > > > > > > > > > > > > lk-version numbers are not same,
> > > > > > > > > > > > > > > reopening the fds
> > > > > > > > > > > > > > > [2018-06-12 09:51:01.746814] I [MSGID:
> > > > > > > > > > > > > > > 114035] [client-
> > > > > > > > > > > > > > > handshake.c:202:client_set_lk_version_cbk
> > > > > > > > > > > > > > > ] 0-atlasglust-client-5: Server lk
> > > > > > > > > > > > > > > version = 1
> > > > > > > > > > > > > > > [2018-06-12 09:51:01.748449] I [MSGID:
> > > > > > > > > > > > > > > 114057] [client-
> > > > > > > > > > > > > > > handshake.c:1478:select_server_supported_
> > > > > > > > > > > > > > > programs] 0-atlasglust-client-6: Using
> > > > > > > > > > > > > > > Program GlusterFS 3.3, Num (1298437),
> > > > > > > > > > > > > > > Version (330)
> > > > > > > > > > > > > > > [2018-06-12 09:51:01.750219] I [MSGID:
> > > > > > > > > > > > > > > 114046] [client-
> > > > > > > > > > > > > > > handshake.c:1231:client_setvolume_cbk] 0-
> > > > > > > > > > > > > > > atlasglust-client-6: Connected to
> > > > > > > > > > > > > > > atlasglust-client-6, attached to remote
> > > > > > > > > > > > > > > volume '/glusteratlas/brick007/gv0'.
> > > > > > > > > > > > > > > [2018-06-12 09:51:01.750261] I [MSGID:
> > > > > > > > > > > > > > > 114047] [client-
> > > > > > > > > > > > > > > handshake.c:1242:client_setvolume_cbk] 0-
> > > > > > > > > > > > > > > atlasglust-client-6: Server and Client
> > > > > > > > > > > > > > > lk-version numbers are not same,
> > > > > > > > > > > > > > > reopening the fds
> > > > > > > > > > > > > > > [2018-06-12 09:51:01.750503] I [MSGID:
> > > > > > > > > > > > > > > 114035] [client-
> > > > > > > > > > > > > > > handshake.c:202:client_set_lk_version_cbk
> > > > > > > > > > > > > > > ] 0-atlasglust-client-6: Server lk
> > > > > > > > > > > > > > > version = 1
> > > > > > > > > > > > > > > [2018-06-12 09:51:01.752207] I [fuse-
> > > > > > > > > > > > > > > bridge.c:4205:fuse_init] 0-glusterfs-
> > > > > > > > > > > > > > > fuse: FUSE inited with protocol versions:
> > > > > > > > > > > > > > > glusterfs 7.24 kernel 7.14
> > > > > > > > > > > > > > > [2018-06-12 09:51:01.752261] I [fuse-
> > > > > > > > > > > > > > > bridge.c:4835:fuse_graph_sync] 0-fuse:
> > > > > > > > > > > > > > > switched to graph 0
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > Is there a problem with the server and client
> > > > > > > > > > > > > > > lk-version?
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > Thanks for your help.
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > Kashif
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > >   
> > > > > > > > > > > > > > > On Mon, Jun 11, 2018 at 11:52 PM, Vijay
> > > > > > > > > > > > > > > Bellur <vbellur at redhat.com> wrote:
> > > > > > > > > > > > > > > > On Mon, Jun 11, 2018 at 8:50 AM,
> > > > > > > > > > > > > > > > mohammad kashif <kashif.alig at gmail.com>
> > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > Hi
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > Since I updated our gluster servers and
> > > > > > > > > > > > > > > > > clients to the latest version, 3.12.9-1, I
> > > > > > > > > > > > > > > > > have been having this issue of gluster
> > > > > > > > > > > > > > > > > getting unmounted from clients very
> > > > > > > > > > > > > > > > > regularly. It was not a problem before the
> > > > > > > > > > > > > > > > > update.
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > It's a distributed file system with no
> > > > > > > > > > > > > > > > > replication. We have seven servers
> > > > > > > > > > > > > > > > > totaling around 480TB of data. It's 97%
> > > > > > > > > > > > > > > > > full.
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > I am using following config on server
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > gluster volume set atlasglust
> > > > > > > > > > > > > > > > > features.cache-invalidation on
> > > > > > > > > > > > > > > > > gluster volume set atlasglust
> > > > > > > > > > > > > > > > > features.cache-invalidation-timeout
> > > > > > > > > > > > > > > > > 600
> > > > > > > > > > > > > > > > > gluster volume set atlasglust
> > > > > > > > > > > > > > > > > performance.stat-prefetch on
> > > > > > > > > > > > > > > > > gluster volume set atlasglust
> > > > > > > > > > > > > > > > > performance.cache-invalidation on
> > > > > > > > > > > > > > > > > gluster volume set atlasglust
> > > > > > > > > > > > > > > > > performance.md-cache-timeout 600
> > > > > > > > > > > > > > > > > gluster volume set atlasglust
> > > > > > > > > > > > > > > > > performance.parallel-readdir on
> > > > > > > > > > > > > > > > > gluster volume set atlasglust
> > > > > > > > > > > > > > > > > performance.cache-size 1GB
> > > > > > > > > > > > > > > > > gluster volume set atlasglust
> > > > > > > > > > > > > > > > > performance.client-io-threads on
> > > > > > > > > > > > > > > > > gluster volume set atlasglust
> > > > > > > > > > > > > > > > > cluster.lookup-optimize on
> > > > > > > > > > > > > > > > > gluster volume set atlasglust
> > > > > > > > > > > > > > > > > performance.stat-prefetch on
> > > > > > > > > > > > > > > > > gluster volume set atlasglust
> > > > > > > > > > > > > > > > > client.event-threads 4
> > > > > > > > > > > > > > > > > gluster volume set atlasglust
> > > > > > > > > > > > > > > > > server.event-threads 4
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > clients are mounted with this option
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > defaults,direct-io-
> > > > > > > > > > > > > > > > > mode=disable,attribute-
> > > > > > > > > > > > > > > > > timeout=600,entry-
> > > > > > > > > > > > > > > > > timeout=600,negative-
> > > > > > > > > > > > > > > > > timeout=600,fopen-keep-
> > > > > > > > > > > > > > > > > cache,rw,_netdev 
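
(Spelled out as a single fstab entry, that would look roughly like the
following; the mount point is a placeholder:)

  pplxgluster01.X.Y.Z:/atlasglust /atlas glusterfs defaults,direct-io-mode=disable,attribute-timeout=600,entry-timeout=600,negative-timeout=600,fopen-keep-cache,rw,_netdev 0 0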
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > I can't see anything in the log file.
> > > > > > > > > > > > > > > > > Can someone suggest how to troubleshoot
> > > > > > > > > > > > > > > > > this issue?
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > Can you please share the log file?
> > > > > > > > > > > > > > > > Checking for messages related to
> > > > > > > > > > > > > > > > disconnections/crashes in the log file
> > > > > > > > > > > > > > > > would be a good way to start
> > > > > > > > > > > > > > > > troubleshooting the problem.
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > > > > Vijay 
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > 
-- 
James P. Kinney III

Every time you stop a school, you will have to build a jail. What you
gain at one end you lose at the other. It's like feeding a dog on his
own tail. It won't fatten the dog.
- Speech 11/23/1900 Mark Twain

http://heretothereideas.blogspot.com/