[Gluster-users] glusterfsd process spinning
Franco Broi
franco.broi at iongeo.com
Wed Jun 18 07:54:24 UTC 2014
On Wed, 2014-06-18 at 13:09 +0530, Lalatendu Mohanty wrote:
> On 06/17/2014 02:25 PM, Susant Palai wrote:
> > Hi Franco:
> > The following patches address the ENOTEMPTY issue.
> >
> > 1. http://review.gluster.org/#/c/7733/
> > 2. http://review.gluster.org/#/c/7599/
> >
> > I think the above patches will be available in 3.5.1, which will be a minor upgrade. (Need ack from Niels de Vos.)
> >
> > Hi Lala,
> > Can you provide the steps to downgrade to 3.4 from 3.5?
> >
> > Thanks :)
>
> If you are using an RPM-based distribution, the "yum downgrade" command
> should work, provided yum has access to both the 3.5 and 3.4 repos. That
> said, I have not tested the downgrade scenario from 3.5 to 3.4 myself. I
> would suggest that you stop your volume and kill the gluster processes
> while downgrading.
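
For reference, a rough sketch of the downgrade sequence Lala describes (stop
the volume once, then downgrade each server); the package names and the init
command are assumptions and may differ on your distribution:

    gluster volume stop data2                 # stop the volume first
    service glusterd stop                     # stop the management daemon
    pkill glusterfsd; pkill glusterfs         # kill any remaining gluster processes
    yum downgrade glusterfs glusterfs-server glusterfs-fuse glusterfs-libs
    service glusterd start
    gluster volume start data2
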
I did try installing 3.4 but the volume wouldn't start:
[2014-06-16 02:53:16.886995] I [glusterfsd.c:1910:main] 0-/usr/sbin/glusterd: Started running /usr/sbin/glusterd version 3.4.3 (/usr/sbin/glusterd --pid-file=/var/run/glusterd.pid)
[2014-06-16 02:53:16.889605] I [glusterd.c:961:init] 0-management: Using /var/lib/glusterd as working directory
[2014-06-16 02:53:16.891580] I [socket.c:3480:socket_init] 0-socket.management: SSL support is NOT enabled
[2014-06-16 02:53:16.891600] I [socket.c:3495:socket_init] 0-socket.management: using system polling thread
[2014-06-16 02:53:16.891675] E [rpc-transport.c:253:rpc_transport_load] 0-rpc-transport: /usr/lib64/glusterfs/3.4.3/rpc-transport/rdma.so: cannot open shared object file: No such file or directory
[2014-06-16 02:53:16.891691] W [rpc-transport.c:257:rpc_transport_load] 0-rpc-transport: volume 'rdma.management': transport-type 'rdma' is not valid or not found on this machine
[2014-06-16 02:53:16.891700] W [rpcsvc.c:1389:rpcsvc_transport_create] 0-rpc-service: cannot create listener, initing the transport failed
[2014-06-16 02:53:16.892457] I [glusterd.c:354:glusterd_check_gsync_present] 0-glusterd: geo-replication module not installed in the system
[2014-06-16 02:53:17.087325] E [glusterd-store.c:1333:glusterd_restore_op_version] 0-management: wrong op-version (3) retreived
[2014-06-16 02:53:17.087352] E [glusterd-store.c:2510:glusterd_restore] 0-management: Failed to restore op_version
[2014-06-16 02:53:17.087365] E [xlator.c:390:xlator_init] 0-management: Initialization of volume 'management' failed, review your volfile again
[2014-06-16 02:53:17.087375] E [graph.c:292:glusterfs_graph_init] 0-management: initializing translator failed
[2014-06-16 02:53:17.087383] E [graph.c:479:glusterfs_graph_activate] 0-graph: init failed
[2014-06-16 02:53:17.087534] W [glusterfsd.c:1002:cleanup_and_exit] (-->/usr/sbin/glusterd(main+0x5d2) [0x406802] (-->/usr/sbin/glusterd(glusterfs_volumes_init+0xb7) [0x4051b7] (-->/usr/sbin/glusterd(glusterfs_process_volfp+0x103) [0x4050c3]))) 0-: received signum (0), shutting down
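
The "wrong op-version (3)" error suggests the cluster op-version was raised to
3 when 3.5 was installed, while 3.4 only understands op-version 2. Assuming the
op-version is kept in /var/lib/glusterd/glusterd.info (worth verifying on your
nodes), a possible - though unsupported - workaround is to lower it on every
server before starting the 3.4 glusterd, along these lines:

    # Assumption: glusterd.info carries a line "operating-version=3";
    # setting it back to 2 (the 3.4 value) may let the 3.4 glusterd start.
    # Back the file up first, and do this on every server.
    cp /var/lib/glusterd/glusterd.info /var/lib/glusterd/glusterd.info.bak
    sed -i 's/^operating-version=3$/operating-version=2/' /var/lib/glusterd/glusterd.info
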
>
> Thanks,
> Lala
> >
> >
> > ----- Original Message -----
> > From: "Franco Broi" <franco.broi at iongeo.com>
> > To: "Susant Palai" <spalai at redhat.com>
> > Cc: "Pranith Kumar Karampuri" <pkarampu at redhat.com>, gluster-users at gluster.org, "Raghavendra Gowdappa" <rgowdapp at redhat.com>, kdhananj at redhat.com, vsomyaju at redhat.com, nbalacha at redhat.com
> > Sent: Monday, 16 June, 2014 5:47:55 AM
> > Subject: Re: [Gluster-users] glusterfsd process spinning
> >
> >
> > Is it possible to downgrade to 3.4 from 3.5? I can't afford to spend any
> > more time testing 3.5 and it doesn't seem to work as well as 3.4.
> >
> > Cheers,
> >
> > On Wed, 2014-06-04 at 01:51 -0400, Susant Palai wrote:
> >> From the logs it seems the files are present on data(21,22,23,24), which are on nas6, while missing on data(17,18,19,20), which are on nas5 (interesting). There is an existing issue where directories do not show up on the mount point if they are not present on the first_up_subvol (longest-living brick), and the current issue looks similar. We will look at the client logs for more information.
> >>
> >> Susant.
> >>
> >> ----- Original Message -----
> >> From: "Franco Broi" <franco.broi at iongeo.com>
> >> To: "Pranith Kumar Karampuri" <pkarampu at redhat.com>
> >> Cc: "Susant Palai" <spalai at redhat.com>, gluster-users at gluster.org, "Raghavendra Gowdappa" <rgowdapp at redhat.com>, kdhananj at redhat.com, vsomyaju at redhat.com, nbalacha at redhat.com
> >> Sent: Wednesday, 4 June, 2014 10:32:37 AM
> >> Subject: Re: [Gluster-users] glusterfsd process spinning
> >>
> >> On Wed, 2014-06-04 at 10:19 +0530, Pranith Kumar Karampuri wrote:
> >>> On 06/04/2014 08:07 AM, Susant Palai wrote:
> >>>> Pranith, can you send the client and brick logs?
> >>> I have the logs. But for this issue of the directory not listing
> >>> entries, I believe it would help more if we had the contents of that
> >>> directory on all the bricks, plus their hash values in the xattrs.
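
A dump like that can be collected directly on the brick servers; a minimal
sketch, assuming getfattr (from the attr package) is installed and using the
directory path that appears later in this thread - the DHT layout/hash range
lives in the hex-encoded trusted.glusterfs.dht xattr:

    # run on nas5 and nas6, against every brick that holds the directory
    ls -la /data*/gvol/franco/dir1226/dir25
    getfattr -d -m . -e hex /data*/gvol/franco/dir1226/dir25
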
> >> The strange thing is, all the invisible files are on one server (nas6);
> >> the other seems OK. I did rm -Rf of /data2/franco/dir* and was left with
> >> this one directory - there were many hundreds which were removed
> >> successfully.
> >>
> >> I've attached listings and xattr dumps.
> >>
> >> Cheers,
> >>
> >> Volume Name: data2
> >> Type: Distribute
> >> Volume ID: d958423f-bd25-49f1-81f8-f12e4edc6823
> >> Status: Started
> >> Number of Bricks: 8
> >> Transport-type: tcp
> >> Bricks:
> >> Brick1: nas5-10g:/data17/gvol
> >> Brick2: nas5-10g:/data18/gvol
> >> Brick3: nas5-10g:/data19/gvol
> >> Brick4: nas5-10g:/data20/gvol
> >> Brick5: nas6-10g:/data21/gvol
> >> Brick6: nas6-10g:/data22/gvol
> >> Brick7: nas6-10g:/data23/gvol
> >> Brick8: nas6-10g:/data24/gvol
> >> Options Reconfigured:
> >> nfs.drc: on
> >> cluster.min-free-disk: 5%
> >> network.frame-timeout: 10800
> >> nfs.export-volumes: on
> >> nfs.disable: on
> >> cluster.readdir-optimize: on
> >>
> >> Gluster process                              Port    Online  Pid
> >> ------------------------------------------------------------------------------
> >> Brick nas5-10g:/data17/gvol                  49152   Y       6553
> >> Brick nas5-10g:/data18/gvol                  49153   Y       6564
> >> Brick nas5-10g:/data19/gvol                  49154   Y       6575
> >> Brick nas5-10g:/data20/gvol                  49155   Y       6586
> >> Brick nas6-10g:/data21/gvol                  49160   Y       20608
> >> Brick nas6-10g:/data22/gvol                  49161   Y       20613
> >> Brick nas6-10g:/data23/gvol                  49162   Y       20614
> >> Brick nas6-10g:/data24/gvol                  49163   Y       20621
> >>
> >> Task Status of Volume data2
> >> ------------------------------------------------------------------------------
> >> There are no active volume tasks
> >>
> >>
> >>
> >>> Pranith
> >>>> Thanks,
> >>>> Susant~
> >>>>
> >>>> ----- Original Message -----
> >>>> From: "Pranith Kumar Karampuri" <pkarampu at redhat.com>
> >>>> To: "Franco Broi" <franco.broi at iongeo.com>
> >>>> Cc: gluster-users at gluster.org, "Raghavendra Gowdappa" <rgowdapp at redhat.com>, spalai at redhat.com, kdhananj at redhat.com, vsomyaju at redhat.com, nbalacha at redhat.com
> >>>> Sent: Wednesday, 4 June, 2014 7:53:41 AM
> >>>> Subject: Re: [Gluster-users] glusterfsd process spinning
> >>>>
> >>>> hi Franco,
> >>>> CC Devs who work on DHT to comment.
> >>>>
> >>>> Pranith
> >>>>
> >>>> On 06/04/2014 07:39 AM, Franco Broi wrote:
> >>>>> On Wed, 2014-06-04 at 07:28 +0530, Pranith Kumar Karampuri wrote:
> >>>>>> Franco,
> >>>>>> Thanks for providing the logs. I just copied over the logs to my
> >>>>>> machine. Most of the logs I see are related to "No such file or
> >>>>>> directory". I wonder what led to this. Do you have any idea?
> >>>>> No, but I'm just looking at my 3.5 Gluster volume and it has a directory
> >>>>> that looks empty but can't be deleted. When I look at the directories on
> >>>>> the servers, there are definitely files in there.
> >>>>>
> >>>>> [franco at charlie1 franco]$ rmdir /data2/franco/dir1226/dir25
> >>>>> rmdir: failed to remove `/data2/franco/dir1226/dir25': Directory not empty
> >>>>> [franco at charlie1 franco]$ ls -la /data2/franco/dir1226/dir25
> >>>>> total 8
> >>>>> drwxrwxr-x 2 franco support 60 May 21 03:58 .
> >>>>> drwxrwxr-x 3 franco support 24 Jun 4 09:37 ..
> >>>>>
> >>>>> [root at nas6 ~]# ls -la /data*/gvol/franco/dir1226/dir25
> >>>>> /data21/gvol/franco/dir1226/dir25:
> >>>>> total 2081
> >>>>> drwxrwxr-x 13 1348 200 13 May 21 03:58 .
> >>>>> drwxrwxr-x 3 1348 200 3 May 21 03:58 ..
> >>>>> drwxrwxr-x 2 1348 200 2 May 16 12:05 dir13017
> >>>>> drwxrwxr-x 2 1348 200 2 May 16 12:05 dir13018
> >>>>> drwxrwxr-x 2 1348 200 3 May 16 12:05 dir13020
> >>>>> drwxrwxr-x 2 1348 200 3 May 16 12:05 dir13021
> >>>>> drwxrwxr-x 2 1348 200 3 May 16 12:05 dir13022
> >>>>> drwxrwxr-x 2 1348 200 2 May 16 12:05 dir13024
> >>>>> drwxrwxr-x 2 1348 200 2 May 16 12:05 dir13027
> >>>>> drwxrwxr-x 2 1348 200 3 May 16 12:05 dir13028
> >>>>> drwxrwxr-x 2 1348 200 2 May 16 12:06 dir13029
> >>>>> drwxrwxr-x 2 1348 200 2 May 16 12:06 dir13031
> >>>>> drwxrwxr-x 2 1348 200 3 May 16 12:06 dir13032
> >>>>>
> >>>>> /data22/gvol/franco/dir1226/dir25:
> >>>>> total 2084
> >>>>> drwxrwxr-x 13 1348 200 13 May 21 03:58 .
> >>>>> drwxrwxr-x 3 1348 200 3 May 21 03:58 ..
> >>>>> drwxrwxr-x 2 1348 200 2 May 16 12:05 dir13017
> >>>>> drwxrwxr-x 2 1348 200 2 May 16 12:05 dir13018
> >>>>> drwxrwxr-x 2 1348 200 2 May 16 12:05 dir13020
> >>>>> drwxrwxr-x 2 1348 200 2 May 16 12:05 dir13021
> >>>>> drwxrwxr-x 2 1348 200 2 May 16 12:05 dir13022
> >>>>> .....
> >>>>>
> >>>>> Maybe Gluster is losing track of the files??
> >>>>>
> >>>>>> Pranith
> >>>>>>
> >>>>>> On 06/02/2014 02:48 PM, Franco Broi wrote:
> >>>>>>> Hi Pranith
> >>>>>>>
> >>>>>>> Here's a listing of the brick logs; it looks very odd, especially the
> >>>>>>> size of the log for data10.
> >>>>>>>
> >>>>>>> [root at nas3 bricks]# ls -ltrh
> >>>>>>> total 2.6G
> >>>>>>> -rw------- 1 root root 381K May 13 12:15 data12-gvol.log-20140511
> >>>>>>> -rw------- 1 root root 430M May 13 12:15 data11-gvol.log-20140511
> >>>>>>> -rw------- 1 root root 328K May 13 12:15 data9-gvol.log-20140511
> >>>>>>> -rw------- 1 root root 2.0M May 13 12:15 data10-gvol.log-20140511
> >>>>>>> -rw------- 1 root root 0 May 18 03:43 data10-gvol.log-20140525
> >>>>>>> -rw------- 1 root root 0 May 18 03:43 data11-gvol.log-20140525
> >>>>>>> -rw------- 1 root root 0 May 18 03:43 data12-gvol.log-20140525
> >>>>>>> -rw------- 1 root root 0 May 18 03:43 data9-gvol.log-20140525
> >>>>>>> -rw------- 1 root root 0 May 25 03:19 data10-gvol.log-20140601
> >>>>>>> -rw------- 1 root root 0 May 25 03:19 data11-gvol.log-20140601
> >>>>>>> -rw------- 1 root root 0 May 25 03:19 data9-gvol.log-20140601
> >>>>>>> -rw------- 1 root root 98M May 26 03:04 data12-gvol.log-20140518
> >>>>>>> -rw------- 1 root root 0 Jun 1 03:37 data10-gvol.log
> >>>>>>> -rw------- 1 root root 0 Jun 1 03:37 data11-gvol.log
> >>>>>>> -rw------- 1 root root 0 Jun 1 03:37 data12-gvol.log
> >>>>>>> -rw------- 1 root root 0 Jun 1 03:37 data9-gvol.log
> >>>>>>> -rw------- 1 root root 1.8G Jun 2 16:35 data10-gvol.log-20140518
> >>>>>>> -rw------- 1 root root 279M Jun 2 16:35 data9-gvol.log-20140518
> >>>>>>> -rw------- 1 root root 328K Jun 2 16:35 data12-gvol.log-20140601
> >>>>>>> -rw------- 1 root root 8.3M Jun 2 16:35 data11-gvol.log-20140518
> >>>>>>>
> >>>>>>> Too big to post everything.
> >>>>>>>
> >>>>>>> Cheers,
> >>>>>>>
> >>>>>>> On Sun, 2014-06-01 at 22:00 -0400, Pranith Kumar Karampuri wrote:
> >>>>>>>> ----- Original Message -----
> >>>>>>>>> From: "Pranith Kumar Karampuri" <pkarampu at redhat.com>
> >>>>>>>>> To: "Franco Broi" <franco.broi at iongeo.com>
> >>>>>>>>> Cc: gluster-users at gluster.org
> >>>>>>>>> Sent: Monday, June 2, 2014 7:01:34 AM
> >>>>>>>>> Subject: Re: [Gluster-users] glusterfsd process spinning
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> ----- Original Message -----
> >>>>>>>>>> From: "Franco Broi" <franco.broi at iongeo.com>
> >>>>>>>>>> To: "Pranith Kumar Karampuri" <pkarampu at redhat.com>
> >>>>>>>>>> Cc: gluster-users at gluster.org
> >>>>>>>>>> Sent: Sunday, June 1, 2014 10:53:51 AM
> >>>>>>>>>> Subject: Re: [Gluster-users] glusterfsd process spinning
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> The volume is almost completely idle now and the CPU for the brick
> >>>>>>>>>> process has returned to normal. I've included the profile and I think it
> >>>>>>>>>> shows the latency for the bad brick (data12) is unusually high, probably
> >>>>>>>>>> indicating the filesystem is at fault after all??
> >>>>>>>>> I am not sure if we can believe the outputs now that you say the brick
> >>>>>>>>> returned to normal. Next time it is acting up, do the same procedure and
> >>>>>>>>> post the result.
> >>>>>>>> On second thought, maybe it's not a bad idea to inspect the log files of the bricks on nas3. Could you post them?
> >>>>>>>>
> >>>>>>>> Pranith
> >>>>>>>>
> >>>>>>>>> Pranith
> >>>>>>>>>> On Sun, 2014-06-01 at 01:01 -0400, Pranith Kumar Karampuri wrote:
> >>>>>>>>>>> Franco,
> >>>>>>>>>>> Could you do the following to get more information:
> >>>>>>>>>>>
> >>>>>>>>>>> "gluster volume profile <volname> start"
> >>>>>>>>>>>
> >>>>>>>>>>> Wait for some time; this will start gathering what operations are
> >>>>>>>>>>> coming to all the bricks.
> >>>>>>>>>>> Now execute "gluster volume profile <volname> info" >
> >>>>>>>>>>> /file/you/should/reply/to/this/mail/with
> >>>>>>>>>>>
> >>>>>>>>>>> Then execute:
> >>>>>>>>>>> gluster volume profile <volname> stop
> >>>>>>>>>>>
> >>>>>>>>>>> Let's see if this throws any light on the problem at hand.
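
Put together as one copy-and-paste sequence, using the data2 volume from this
thread (the output filename is just a placeholder):

    gluster volume profile data2 start
    # ... wait a while so per-brick operation counts accumulate ...
    gluster volume profile data2 info > /tmp/data2-profile.txt
    gluster volume profile data2 stop
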
> >>>>>>>>>>>
> >>>>>>>>>>> Pranith
> >>>>>>>>>>> ----- Original Message -----
> >>>>>>>>>>>> From: "Franco Broi" <franco.broi at iongeo.com>
> >>>>>>>>>>>> To: gluster-users at gluster.org
> >>>>>>>>>>>> Sent: Sunday, June 1, 2014 9:02:48 AM
> >>>>>>>>>>>> Subject: [Gluster-users] glusterfsd process spinning
> >>>>>>>>>>>>
> >>>>>>>>>>>> Hi
> >>>>>>>>>>>>
> >>>>>>>>>>>> I've been suffering from continual problems with my gluster filesystem
> >>>>>>>>>>>> slowing down, which I thought was due to congestion on a single brick
> >>>>>>>>>>>> caused by the underlying filesystem running slow, but I've just noticed
> >>>>>>>>>>>> that the glusterfsd process for that particular brick is running at
> >>>>>>>>>>>> 100%+, even when the filesystem is almost idle.
> >>>>>>>>>>>>
> >>>>>>>>>>>> I've done a couple of straces, one of that brick and one of another brick
> >>>>>>>>>>>> on the same server; does the high number of futex errors give any clue as
> >>>>>>>>>>>> to what might be wrong?
> >>>>>>>>>>>>
> >>>>>>>>>>>> % time     seconds  usecs/call     calls    errors syscall
> >>>>>>>>>>>> ------ ----------- ----------- --------- --------- ----------------
> >>>>>>>>>>>>  45.58    0.027554           0    191665     20772 futex
> >>>>>>>>>>>>  28.26    0.017084           0    137133           readv
> >>>>>>>>>>>>  26.04    0.015743           0     66259           epoll_wait
> >>>>>>>>>>>>   0.13    0.000077           3        23           writev
> >>>>>>>>>>>>   0.00    0.000000           0         1           epoll_ctl
> >>>>>>>>>>>> ------ ----------- ----------- --------- --------- ----------------
> >>>>>>>>>>>> 100.00    0.060458                395081     20772 total
> >>>>>>>>>>>>
> >>>>>>>>>>>> % time     seconds  usecs/call     calls    errors syscall
> >>>>>>>>>>>> ------ ----------- ----------- --------- --------- ----------------
> >>>>>>>>>>>>  99.25    0.334020         133      2516           epoll_wait
> >>>>>>>>>>>>   0.40    0.001347           0      4090        26 futex
> >>>>>>>>>>>>   0.35    0.001192           0      5064           readv
> >>>>>>>>>>>>   0.00    0.000000           0        20           writev
> >>>>>>>>>>>> ------ ----------- ----------- --------- --------- ----------------
> >>>>>>>>>>>> 100.00    0.336559                 11690        26 total
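
For context, a per-syscall summary like the two above can be gathered by
attaching strace in counting mode to the suspect brick process for a short
window (the PID is a placeholder):

    strace -c -f -p 12345    # attach to the glusterfsd PID; Ctrl-C after ~30s prints the table
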
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> Cheers,
> >>>>>>>>>>>>
> >>>>>>>>>>>> _______________________________________________
> >>>>>>>>>>>> Gluster-users mailing list
> >>>>>>>>>>>> Gluster-users at gluster.org
> >>>>>>>>>>>> http://supercolony.gluster.org/mailman/listinfo/gluster-users
> >>>>>>>>>>>>
>