[Gluster-devel] [Gluster-users] heal hanging

Glomski, Patrick patrick.glomski at corvidtec.com
Fri Jan 22 03:21:30 UTC 2016


Samba version is 4.1.17 that you guys maintain at download.gluster.org. The
vfs plugin comes packaged with it.

http://download.gluster.org/pub/gluster/glusterfs/samba/EPEL.repo/epel-6/x86_64/

# smbd --version
Version 4.1.17

# rpm -qa | grep samba-vfs-glusterfs
samba-vfs-glusterfs-4.1.17-4.el6rhs.x86_64

Let us know if there's anything else we can provide,

Patrick


On Thu, Jan 21, 2016 at 10:07 PM, Raghavendra Talur <rtalur at redhat.com>
wrote:

>
> On Jan 22, 2016 7:27 AM, "Pranith Kumar Karampuri" <pkarampu at redhat.com>
> wrote:
> >
> >
> >
> > On 01/22/2016 07:19 AM, Pranith Kumar Karampuri wrote:
> >>
> >>
> >>
> >> On 01/22/2016 07:13 AM, Glomski, Patrick wrote:
> >>>
> >>> We use the samba glusterfs virtual filesystem (the current version
> provided on download.gluster.org), but no windows clients connecting
> directly.
> >>
> >>
> >> Hmm.. Is there a way to disable using this and check if the CPU% still
> increases? What getxattr of "glusterfs.get_real_filename <filanme>" does is
> to scan the entire directory looking for strcasecmp(<filname>,
> <scanned-filename>). If anything matches then it will return the
> <scanned-filename>. But the problem is the scan is costly. So I wonder if
> this is the reason for the CPU spikes.
> >
> > +Raghavendra Talur, +Poornima
> >
> > Raghavendra, Poornima,
> >             When are these getxattrs triggered? Did you guys see any
> brick CPU spikes before? I initially thought it could be because of big
> directory heals. But this is happening even when no self-heals are
> required. So I had to move away from that theory.
>
> These getxattrs are triggered when a SMB client performs a path based
> operation. It is necessary then that some client was connected.
>
> The last fix to go in that code for 3.6 was
> http://review.gluster.org/#/c/10403/.
>
> I am not able to determine which release of 3.6 it made into. Will update.
>
> Also we would need version of Samba installed. Including the vfs plugin
> package.
>
> There is a for loop of strcmp involved here which does take a lot of CPU.
> It should be for short bursts though and is expected and harmless.
>
> >
> > Pranith
> >
> >>
> >> Pranith
> >>>
> >>>
> >>> On Thu, Jan 21, 2016 at 8:37 PM, Pranith Kumar Karampuri <
> pkarampu at redhat.com> wrote:
> >>>>
> >>>> Do you have any windows clients? I see a lot of getxattr calls for
> "glusterfs.get_real_filename" which lead to full readdirs of the
> directories on the brick.
> >>>>
> >>>> Pranith
> >>>>
> >>>> On 01/22/2016 12:51 AM, Glomski, Patrick wrote:
> >>>>>
> >>>>> Pranith, could this kind of behavior be self-inflicted by us
> deleting files directly from the bricks? We have done that in the past to
> clean up an issues where gluster wouldn't allow us to delete from the mount.
> >>>>>
> >>>>> If so, is it feasible to clean them up by running a search on the
> .glusterfs directories directly and removing files with a reference count
> of 1 that are non-zero size (or directly checking the xattrs to be sure
> that it's not a DHT link).
> >>>>>
> >>>>> find /data/brick01a/homegfs/.glusterfs -type f -not -empty -links -2
> -exec rm -f "{}" \;
> >>>>>
> >>>>> Is there anything I'm inherently missing with that approach that
> will further corrupt the system?
> >>>>>
> >>>>>
> >>>>> On Thu, Jan 21, 2016 at 1:02 PM, Glomski, Patrick <
> patrick.glomski at corvidtec.com> wrote:
> >>>>>>
> >>>>>> Load spiked again: ~1200%cpu on gfs02a for glusterfsd. Crawl has
> been running on one of the bricks on gfs02b for 25 min or so and users
> cannot access the volume.
> >>>>>>
> >>>>>> I re-listed the xattrop directories as well as a 'top' entry and
> heal statistics. Then I restarted the gluster services on gfs02a.
> >>>>>>
> >>>>>> =================== top ===================
> >>>>>> PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+
> COMMAND
> >>>>>>  8969 root      20   0 2815m 204m 3588 S 1181.0  0.6 591:06.93
> glusterfsd
> >>>>>>
> >>>>>> =================== xattrop ===================
> >>>>>> /data/brick01a/homegfs/.glusterfs/indices/xattrop:
> >>>>>> xattrop-41f19453-91e4-437c-afa9-3b25614de210
> xattrop-9b815879-2f4d-402b-867c-a6d65087788c
> >>>>>>
> >>>>>> /data/brick02a/homegfs/.glusterfs/indices/xattrop:
> >>>>>> xattrop-70131855-3cfb-49af-abce-9d23f57fb393
> xattrop-dfb77848-a39d-4417-a725-9beca75d78c6
> >>>>>>
> >>>>>> /data/brick01b/homegfs/.glusterfs/indices/xattrop:
> >>>>>> e6e47ed9-309b-42a7-8c44-28c29b9a20f8
> xattrop-5c797a64-bde7-4eac-b4fc-0befc632e125
> >>>>>> xattrop-38ec65a1-00b5-4544-8a6c-bf0f531a1934
> xattrop-ef0980ad-f074-4163-979f-16d5ef85b0a0
> >>>>>>
> >>>>>> /data/brick02b/homegfs/.glusterfs/indices/xattrop:
> >>>>>> xattrop-7402438d-0ee7-4fcf-b9bb-b561236f99bc
> xattrop-8ffbf5f7-ace3-497d-944e-93ac85241413
> >>>>>>
> >>>>>> /data/brick01a/homegfs/.glusterfs/indices/xattrop:
> >>>>>> xattrop-0115acd0-caae-4dfd-b3b4-7cc42a0ff531
> >>>>>>
> >>>>>> /data/brick02a/homegfs/.glusterfs/indices/xattrop:
> >>>>>> xattrop-7e20fdb1-5224-4b9a-be06-568708526d70
> >>>>>>
> >>>>>> /data/brick01b/homegfs/.glusterfs/indices/xattrop:
> >>>>>> 8034bc06-92cd-4fa5-8aaf-09039e79d2c8
> c9ce22ed-6d8b-471b-a111-b39e57f0b512
> >>>>>> 94fa1d60-45ad-4341-b69c-315936b51e8d
> xattrop-9c04623a-64ce-4f66-8b23-dbaba49119c7
> >>>>>>
> >>>>>> /data/brick02b/homegfs/.glusterfs/indices/xattrop:
> >>>>>> xattrop-b8c8f024-d038-49a2-9a53-c54ead09111d
> >>>>>>
> >>>>>>
> >>>>>> =================== heal stats ===================
> >>>>>>
> >>>>>> homegfs [b0-gfsib01a] : Starting time of crawl       : Thu Jan 21
> 12:36:45 2016
> >>>>>> homegfs [b0-gfsib01a] : Ending time of crawl         : Thu Jan 21
> 12:36:45 2016
> >>>>>> homegfs [b0-gfsib01a] : Type of crawl: INDEX
> >>>>>> homegfs [b0-gfsib01a] : No. of entries healed        : 0
> >>>>>> homegfs [b0-gfsib01a] : No. of entries in split-brain: 0
> >>>>>> homegfs [b0-gfsib01a] : No. of heal failed entries   : 0
> >>>>>>
> >>>>>> homegfs [b1-gfsib01b] : Starting time of crawl       : Thu Jan 21
> 12:36:19 2016
> >>>>>> homegfs [b1-gfsib01b] : Ending time of crawl         : Thu Jan 21
> 12:36:19 2016
> >>>>>> homegfs [b1-gfsib01b] : Type of crawl: INDEX
> >>>>>> homegfs [b1-gfsib01b] : No. of entries healed        : 0
> >>>>>> homegfs [b1-gfsib01b] : No. of entries in split-brain: 0
> >>>>>> homegfs [b1-gfsib01b] : No. of heal failed entries   : 1
> >>>>>>
> >>>>>> homegfs [b2-gfsib01a] : Starting time of crawl       : Thu Jan 21
> 12:36:48 2016
> >>>>>> homegfs [b2-gfsib01a] : Ending time of crawl         : Thu Jan 21
> 12:36:48 2016
> >>>>>> homegfs [b2-gfsib01a] : Type of crawl: INDEX
> >>>>>> homegfs [b2-gfsib01a] : No. of entries healed        : 0
> >>>>>> homegfs [b2-gfsib01a] : No. of entries in split-brain: 0
> >>>>>> homegfs [b2-gfsib01a] : No. of heal failed entries   : 0
> >>>>>>
> >>>>>> homegfs [b3-gfsib01b] : Starting time of crawl       : Thu Jan 21
> 12:36:47 2016
> >>>>>> homegfs [b3-gfsib01b] : Ending time of crawl         : Thu Jan 21
> 12:36:47 2016
> >>>>>> homegfs [b3-gfsib01b] : Type of crawl: INDEX
> >>>>>> homegfs [b3-gfsib01b] : No. of entries healed        : 0
> >>>>>> homegfs [b3-gfsib01b] : No. of entries in split-brain: 0
> >>>>>> homegfs [b3-gfsib01b] : No. of heal failed entries   : 0
> >>>>>>
> >>>>>> homegfs [b4-gfsib02a] : Starting time of crawl       : Thu Jan 21
> 12:36:06 2016
> >>>>>> homegfs [b4-gfsib02a] : Ending time of crawl         : Thu Jan 21
> 12:36:06 2016
> >>>>>> homegfs [b4-gfsib02a] : Type of crawl: INDEX
> >>>>>> homegfs [b4-gfsib02a] : No. of entries healed        : 0
> >>>>>> homegfs [b4-gfsib02a] : No. of entries in split-brain: 0
> >>>>>> homegfs [b4-gfsib02a] : No. of heal failed entries   : 0
> >>>>>>
> >>>>>> homegfs [b5-gfsib02b] : Starting time of crawl       : Thu Jan 21
> 12:13:40 2016
> >>>>>> homegfs [b5-gfsib02b] :                                *** Crawl is
> in progress ***
> >>>>>> homegfs [b5-gfsib02b] : Type of crawl: INDEX
> >>>>>> homegfs [b5-gfsib02b] : No. of entries healed        : 0
> >>>>>> homegfs [b5-gfsib02b] : No. of entries in split-brain: 0
> >>>>>> homegfs [b5-gfsib02b] : No. of heal failed entries   : 0
> >>>>>>
> >>>>>> homegfs [b6-gfsib02a] : Starting time of crawl       : Thu Jan 21
> 12:36:58 2016
> >>>>>> homegfs [b6-gfsib02a] : Ending time of crawl         : Thu Jan 21
> 12:36:58 2016
> >>>>>> homegfs [b6-gfsib02a] : Type of crawl: INDEX
> >>>>>> homegfs [b6-gfsib02a] : No. of entries healed        : 0
> >>>>>> homegfs [b6-gfsib02a] : No. of entries in split-brain: 0
> >>>>>> homegfs [b6-gfsib02a] : No. of heal failed entries   : 0
> >>>>>>
> >>>>>> homegfs [b7-gfsib02b] : Starting time of crawl       : Thu Jan 21
> 12:36:50 2016
> >>>>>> homegfs [b7-gfsib02b] : Ending time of crawl         : Thu Jan 21
> 12:36:50 2016
> >>>>>> homegfs [b7-gfsib02b] : Type of crawl: INDEX
> >>>>>> homegfs [b7-gfsib02b] : No. of entries healed        : 0
> >>>>>> homegfs [b7-gfsib02b] : No. of entries in split-brain: 0
> >>>>>> homegfs [b7-gfsib02b] : No. of heal failed entries   : 0
> >>>>>>
> >>>>>>
> >>>>>>
> ========================================================================================
> >>>>>> I waited a few minutes for the heals to finish and ran the heal
> statistics and info again. one file is in split-brain. Aside from the
> split-brain, the load on all systems is down now and they are behaving
> normally. glustershd.log is attached. What is going on???
> >>>>>>
> >>>>>> Thu Jan 21 12:53:50 EST 2016
> >>>>>>
> >>>>>> =================== homegfs ===================
> >>>>>>
> >>>>>> homegfs [b0-gfsib01a] : Starting time of crawl       : Thu Jan 21
> 12:53:02 2016
> >>>>>> homegfs [b0-gfsib01a] : Ending time of crawl         : Thu Jan 21
> 12:53:02 2016
> >>>>>> homegfs [b0-gfsib01a] : Type of crawl: INDEX
> >>>>>> homegfs [b0-gfsib01a] : No. of entries healed        : 0
> >>>>>> homegfs [b0-gfsib01a] : No. of entries in split-brain: 0
> >>>>>> homegfs [b0-gfsib01a] : No. of heal failed entries   : 0
> >>>>>>
> >>>>>> homegfs [b1-gfsib01b] : Starting time of crawl       : Thu Jan 21
> 12:53:38 2016
> >>>>>> homegfs [b1-gfsib01b] : Ending time of crawl         : Thu Jan 21
> 12:53:38 2016
> >>>>>> homegfs [b1-gfsib01b] : Type of crawl: INDEX
> >>>>>> homegfs [b1-gfsib01b] : No. of entries healed        : 0
> >>>>>> homegfs [b1-gfsib01b] : No. of entries in split-brain: 0
> >>>>>> homegfs [b1-gfsib01b] : No. of heal failed entries   : 1
> >>>>>>
> >>>>>> homegfs [b2-gfsib01a] : Starting time of crawl       : Thu Jan 21
> 12:53:04 2016
> >>>>>> homegfs [b2-gfsib01a] : Ending time of crawl         : Thu Jan 21
> 12:53:04 2016
> >>>>>> homegfs [b2-gfsib01a] : Type of crawl: INDEX
> >>>>>> homegfs [b2-gfsib01a] : No. of entries healed        : 0
> >>>>>> homegfs [b2-gfsib01a] : No. of entries in split-brain: 0
> >>>>>> homegfs [b2-gfsib01a] : No. of heal failed entries   : 0
> >>>>>>
> >>>>>> homegfs [b3-gfsib01b] : Starting time of crawl       : Thu Jan 21
> 12:53:04 2016
> >>>>>> homegfs [b3-gfsib01b] : Ending time of crawl         : Thu Jan 21
> 12:53:04 2016
> >>>>>> homegfs [b3-gfsib01b] : Type of crawl: INDEX
> >>>>>> homegfs [b3-gfsib01b] : No. of entries healed        : 0
> >>>>>> homegfs [b3-gfsib01b] : No. of entries in split-brain: 0
> >>>>>> homegfs [b3-gfsib01b] : No. of heal failed entries   : 0
> >>>>>>
> >>>>>> homegfs [b4-gfsib02a] : Starting time of crawl       : Thu Jan 21
> 12:53:33 2016
> >>>>>> homegfs [b4-gfsib02a] : Ending time of crawl         : Thu Jan 21
> 12:53:33 2016
> >>>>>> homegfs [b4-gfsib02a] : Type of crawl: INDEX
> >>>>>> homegfs [b4-gfsib02a] : No. of entries healed        : 0
> >>>>>> homegfs [b4-gfsib02a] : No. of entries in split-brain: 0
> >>>>>> homegfs [b4-gfsib02a] : No. of heal failed entries   : 1
> >>>>>>
> >>>>>> homegfs [b5-gfsib02b] : Starting time of crawl       : Thu Jan 21
> 12:53:14 2016
> >>>>>> homegfs [b5-gfsib02b] : Ending time of crawl         : Thu Jan 21
> 12:53:15 2016
> >>>>>> homegfs [b5-gfsib02b] : Type of crawl: INDEX
> >>>>>> homegfs [b5-gfsib02b] : No. of entries healed        : 0
> >>>>>> homegfs [b5-gfsib02b] : No. of entries in split-brain: 0
> >>>>>> homegfs [b5-gfsib02b] : No. of heal failed entries   : 3
> >>>>>>
> >>>>>> homegfs [b6-gfsib02a] : Starting time of crawl       : Thu Jan 21
> 12:53:04 2016
> >>>>>> homegfs [b6-gfsib02a] : Ending time of crawl         : Thu Jan 21
> 12:53:04 2016
> >>>>>> homegfs [b6-gfsib02a] : Type of crawl: INDEX
> >>>>>> homegfs [b6-gfsib02a] : No. of entries healed        : 0
> >>>>>> homegfs [b6-gfsib02a] : No. of entries in split-brain: 0
> >>>>>> homegfs [b6-gfsib02a] : No. of heal failed entries   : 0
> >>>>>>
> >>>>>> homegfs [b7-gfsib02b] : Starting time of crawl       : Thu Jan 21
> 12:53:09 2016
> >>>>>> homegfs [b7-gfsib02b] : Ending time of crawl         : Thu Jan 21
> 12:53:09 2016
> >>>>>> homegfs [b7-gfsib02b] : Type of crawl: INDEX
> >>>>>> homegfs [b7-gfsib02b] : No. of entries healed        : 0
> >>>>>> homegfs [b7-gfsib02b] : No. of entries in split-brain: 0
> >>>>>> homegfs [b7-gfsib02b] : No. of heal failed entries   : 0
> >>>>>>
> >>>>>> *** gluster bug in 'gluster volume heal homegfs statistics'   ***
> >>>>>> *** Use 'gluster volume heal homegfs info' until bug is fixed ***
> >>>>>>
> >>>>>> Brick gfs01a.corvidtec.com:/data/brick01a/homegfs/
> >>>>>> Number of entries: 0
> >>>>>>
> >>>>>> Brick gfs01b.corvidtec.com:/data/brick01b/homegfs/
> >>>>>> Number of entries: 0
> >>>>>>
> >>>>>> Brick gfs01a.corvidtec.com:/data/brick02a/homegfs/
> >>>>>> Number of entries: 0
> >>>>>>
> >>>>>> Brick gfs01b.corvidtec.com:/data/brick02b/homegfs/
> >>>>>> Number of entries: 0
> >>>>>>
> >>>>>> Brick gfs02a.corvidtec.com:/data/brick01a/homegfs/
> >>>>>> /users/bangell/.gconfd - Is in split-brain
> >>>>>>
> >>>>>> Number of entries: 1
> >>>>>>
> >>>>>> Brick gfs02b.corvidtec.com:/data/brick01b/homegfs/
> >>>>>> /users/bangell/.gconfd - Is in split-brain
> >>>>>>
> >>>>>> /users/bangell/.gconfd/saved_state
> >>>>>> Number of entries: 2
> >>>>>>
> >>>>>> Brick gfs02a.corvidtec.com:/data/brick02a/homegfs/
> >>>>>> Number of entries: 0
> >>>>>>
> >>>>>> Brick gfs02b.corvidtec.com:/data/brick02b/homegfs/
> >>>>>> Number of entries: 0
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On Thu, Jan 21, 2016 at 11:10 AM, Pranith Kumar Karampuri <
> pkarampu at redhat.com> wrote:
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On 01/21/2016 09:26 PM, Glomski, Patrick wrote:
> >>>>>>>>
> >>>>>>>> I should mention that the problem is not currently occurring and
> there are no heals (output appended). By restarting the gluster services,
> we can stop the crawl, which lowers the load for a while. Subsequent crawls
> seem to finish properly. For what it's worth, files/folders that show up in
> the 'volume info' output during a hung crawl don't seem to be anything out
> of the ordinary.
> >>>>>>>>
> >>>>>>>> Over the past four days, the typical time before the problem
> recurs after suppressing it in this manner is an hour. Last night when we
> reached out to you was the last time it happened and the load has been low
> since (a relief).  David believes that recursively listing the files (ls
> -alR or similar) from a client mount can force the issue to happen, but
> obviously I'd rather not unless we have some precise thing we're looking
> for. Let me know if you'd like me to attempt to drive the system unstable
> like that and what I should look for. As it's a production system, I'd
> rather not leave it in this state for long.
> >>>>>>>
> >>>>>>>
> >>>>>>> Will it be possible to send glustershd, mount logs of the past 4
> days? I would like to see if this is because of directory self-heal going
> wild (Ravi is working on throttling feature for 3.8, which will allow to
> put breaks on self-heal traffic)
> >>>>>>>
> >>>>>>> Pranith
> >>>>>>>
> >>>>>>>>
> >>>>>>>> [root at gfs01a xattrop]# gluster volume heal homegfs info
> >>>>>>>> Brick gfs01a.corvidtec.com:/data/brick01a/homegfs/
> >>>>>>>> Number of entries: 0
> >>>>>>>>
> >>>>>>>> Brick gfs01b.corvidtec.com:/data/brick01b/homegfs/
> >>>>>>>> Number of entries: 0
> >>>>>>>>
> >>>>>>>> Brick gfs01a.corvidtec.com:/data/brick02a/homegfs/
> >>>>>>>> Number of entries: 0
> >>>>>>>>
> >>>>>>>> Brick gfs01b.corvidtec.com:/data/brick02b/homegfs/
> >>>>>>>> Number of entries: 0
> >>>>>>>>
> >>>>>>>> Brick gfs02a.corvidtec.com:/data/brick01a/homegfs/
> >>>>>>>> Number of entries: 0
> >>>>>>>>
> >>>>>>>> Brick gfs02b.corvidtec.com:/data/brick01b/homegfs/
> >>>>>>>> Number of entries: 0
> >>>>>>>>
> >>>>>>>> Brick gfs02a.corvidtec.com:/data/brick02a/homegfs/
> >>>>>>>> Number of entries: 0
> >>>>>>>>
> >>>>>>>> Brick gfs02b.corvidtec.com:/data/brick02b/homegfs/
> >>>>>>>> Number of entries: 0
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Thu, Jan 21, 2016 at 10:40 AM, Pranith Kumar Karampuri <
> pkarampu at redhat.com> wrote:
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On 01/21/2016 08:25 PM, Glomski, Patrick wrote:
> >>>>>>>>>>
> >>>>>>>>>> Hello, Pranith. The typical behavior is that the %cpu on a
> glusterfsd process jumps to number of processor cores available (800% or
> 1200%, depending on the pair of nodes involved) and the load average on the
> machine goes very high (~20). The volume's heal statistics output shows
> that it is crawling one of the bricks and trying to heal, but this crawl
> hangs and never seems to finish.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> The number of files in the xattrop directory varies over time,
> so I ran a wc -l as you requested periodically for some time and then
> started including a datestamped list of the files that were in the xattrops
> directory on each brick to see which were persistent. All bricks had files
> in the xattrop folder, so all results are attached.
> >>>>>>>>>
> >>>>>>>>> Thanks this info is helpful. I don't see a lot of files. Could
> you give output of "gluster volume heal <volname> info"? Is there any
> directory in there which is LARGE?
> >>>>>>>>>
> >>>>>>>>> Pranith
> >>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Please let me know if there is anything else I can provide.
> >>>>>>>>>>
> >>>>>>>>>> Patrick
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On Thu, Jan 21, 2016 at 12:01 AM, Pranith Kumar Karampuri <
> pkarampu at redhat.com> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> hey,
> >>>>>>>>>>>        Which process is consuming so much cpu? I went through
> the logs you gave me. I see that the following files are in gfid mismatch
> state:
> >>>>>>>>>>>
> >>>>>>>>>>> <066e4525-8f8b-43aa-b7a1-86bbcecc68b9/safebrowsing-backup>,
> >>>>>>>>>>> <1d48754b-b38c-403d-94e2-0f5c41d5f885/recovery.bak>,
> >>>>>>>>>>> <ddc92637-303a-4059-9c56-ab23b1bb6ae9/patch0008.cnvrg>,
> >>>>>>>>>>>
> >>>>>>>>>>> Could you give me the output of "ls
> <brick-path>/indices/xattrop | wc -l" output on all the bricks which are
> acting this way? This will tell us the number of pending self-heals on the
> system.
> >>>>>>>>>>>
> >>>>>>>>>>> Pranith
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> On 01/20/2016 09:26 PM, David Robinson wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> resending with parsed logs...
> >>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> I am having issues with 3.6.6 where the load will spike up
> to 800% for one of the glusterfsd processes and the users can no longer
> access the system.  If I reboot the node, the heal will finish normally
> after a few minutes and the system will be responsive, but a few hours
> later the issue will start again.  It look like it is hanging in a heal and
> spinning up the load on one of the bricks.  The heal gets stuck and says it
> is crawling and never returns.  After a few minutes of the heal saying it
> is crawling, the load spikes up and the mounts become unresponsive.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Any suggestions on how to fix this?  It has us stopped cold
> as the user can no longer access the systems when the load spikes... Logs
> attached.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> System setup info is:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> [root at gfs01a ~]# gluster volume info homegfs
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Volume Name: homegfs
> >>>>>>>>>>>>>> Type: Distributed-Replicate
> >>>>>>>>>>>>>> Volume ID: 1e32672a-f1b7-4b58-ba94-58c085e59071
> >>>>>>>>>>>>>> Status: Started
> >>>>>>>>>>>>>> Number of Bricks: 4 x 2 = 8
> >>>>>>>>>>>>>> Transport-type: tcp
> >>>>>>>>>>>>>> Bricks:
> >>>>>>>>>>>>>> Brick1: gfsib01a.corvidtec.com:/data/brick01a/homegfs
> >>>>>>>>>>>>>> Brick2: gfsib01b.corvidtec.com:/data/brick01b/homegfs
> >>>>>>>>>>>>>> Brick3: gfsib01a.corvidtec.com:/data/brick02a/homegfs
> >>>>>>>>>>>>>> Brick4: gfsib01b.corvidtec.com:/data/brick02b/homegfs
> >>>>>>>>>>>>>> Brick5: gfsib02a.corvidtec.com:/data/brick01a/homegfs
> >>>>>>>>>>>>>> Brick6: gfsib02b.corvidtec.com:/data/brick01b/homegfs
> >>>>>>>>>>>>>> Brick7: gfsib02a.corvidtec.com:/data/brick02a/homegfs
> >>>>>>>>>>>>>> Brick8: gfsib02b.corvidtec.com:/data/brick02b/homegfs
> >>>>>>>>>>>>>> Options Reconfigured:
> >>>>>>>>>>>>>> performance.io-thread-count: 32
> >>>>>>>>>>>>>> performance.cache-size: 128MB
> >>>>>>>>>>>>>> performance.write-behind-window-size: 128MB
> >>>>>>>>>>>>>> server.allow-insecure: on
> >>>>>>>>>>>>>> network.ping-timeout: 42
> >>>>>>>>>>>>>> storage.owner-gid: 100
> >>>>>>>>>>>>>> geo-replication.indexing: off
> >>>>>>>>>>>>>> geo-replication.ignore-pid-check: on
> >>>>>>>>>>>>>> changelog.changelog: off
> >>>>>>>>>>>>>> changelog.fsync-interval: 3
> >>>>>>>>>>>>>> changelog.rollover-time: 15
> >>>>>>>>>>>>>> server.manage-gids: on
> >>>>>>>>>>>>>> diagnostics.client-log-level: WARNING
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> [root at gfs01a ~]# rpm -qa | grep gluster
> >>>>>>>>>>>>>> gluster-nagios-common-0.1.1-0.el6.noarch
> >>>>>>>>>>>>>> glusterfs-fuse-3.6.6-1.el6.x86_64
> >>>>>>>>>>>>>> glusterfs-debuginfo-3.6.6-1.el6.x86_64
> >>>>>>>>>>>>>> glusterfs-libs-3.6.6-1.el6.x86_64
> >>>>>>>>>>>>>> glusterfs-geo-replication-3.6.6-1.el6.x86_64
> >>>>>>>>>>>>>> glusterfs-api-3.6.6-1.el6.x86_64
> >>>>>>>>>>>>>> glusterfs-devel-3.6.6-1.el6.x86_64
> >>>>>>>>>>>>>> glusterfs-api-devel-3.6.6-1.el6.x86_64
> >>>>>>>>>>>>>> glusterfs-3.6.6-1.el6.x86_64
> >>>>>>>>>>>>>> glusterfs-cli-3.6.6-1.el6.x86_64
> >>>>>>>>>>>>>> glusterfs-rdma-3.6.6-1.el6.x86_64
> >>>>>>>>>>>>>> samba-vfs-glusterfs-4.1.11-2.el6.x86_64
> >>>>>>>>>>>>>> glusterfs-server-3.6.6-1.el6.x86_64
> >>>>>>>>>>>>>> glusterfs-extra-xlators-3.6.6-1.el6.x86_64
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> _______________________________________________
> >>>>>>>>>>>> Gluster-devel mailing list
> >>>>>>>>>>>> Gluster-devel at gluster.org
> >>>>>>>>>>>> http://www.gluster.org/mailman/listinfo/gluster-devel
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> _______________________________________________
> >>>>>>>>>>> Gluster-users mailing list
> >>>>>>>>>>> Gluster-users at gluster.org
> >>>>>>>>>>> http://www.gluster.org/mailman/listinfo/gluster-users
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> >>
> >>
> >> _______________________________________________
> >> Gluster-devel mailing list
> >> Gluster-devel at gluster.org
> >> http://www.gluster.org/mailman/listinfo/gluster-devel
> >
> >
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.gluster.org/pipermail/gluster-devel/attachments/20160121/2aeec14e/attachment-0001.html>


More information about the Gluster-devel mailing list