From jahernan at redhat.com Wed Jan 2 07:03:03 2019 From: jahernan at redhat.com (Xavi Hernandez) Date: Wed, 2 Jan 2019 08:03:03 +0100 Subject: [Gluster-devel] Latency analysis of GlusterFS' network layer for pgbench In-Reply-To: References: Message-ID: On Mon, Dec 24, 2018 at 11:30 AM Sankarshan Mukhopadhyay < sankarshan.mukhopadhyay at gmail.com> wrote: > [pulling the conclusions up to enable better in-line] > > > Conclusions: > > > > We should never have a volume with caching-related xlators disabled. The > price we pay for it is too high. We need to make them work consistently and > aggressively to avoid as many requests as we can. > > Are there current issues in terms of behavior which are known/observed > when these are enabled? > > > We need to analyze client/server xlators deeper to see if we can avoid > some delays. However optimizing something that is already at the > microsecond level can be very hard. > > That is true - are there any significant gains which can be accrued by > putting efforts here or, should this be a lower priority? > I would say that for volumes based on spinning disks this is not a high priority, but if we want to provide good performance for NVME storage, this is something that needs to be done. On NVME, reads and writes can be served in few tens of microseconds, so adding 100 us in the network layer could easily mean a performance reduction of 70% or more. > > We need to determine what causes the fluctuations in brick side and > avoid them. > > This scenario is very similar to a smallfile/metadata workload, so this > is probably one important cause of its bad performance. > > What kind of instrumentation is required to enable the determination? > > On Fri, Dec 21, 2018 at 1:48 PM Xavi Hernandez > wrote: > > > > Hi, > > > > I've done some tracing of the latency that network layer introduces in > gluster. I've made the analysis as part of the pgbench performance issue > (in particulat the initialization and scaling phase), so I decided to look > at READV for this particular workload, but I think the results can be > extrapolated to other operations that also have small latency (cached data > from FS for example). > > > > Note that measuring latencies introduces some latency. It consists in a > call to clock_get_time() for each probe point, so the real latency will be > a bit lower, but still proportional to these numbers. > > > > [snip] > _______________________________________________ > Gluster-devel mailing list > Gluster-devel at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-devel > -------------- next part -------------- An HTML attachment was scrubbed... 
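To make the probe points mentioned above concrete: each probe is a single clock_gettime()-style call, so the instrumentation adds a small, roughly constant overhead per sample, which is why the real latencies are a bit lower than the reported ones but still proportional to them. The NVME arithmetic works the same way: a read served in ~30 us that picks up an extra ~100 us in the network layer completes in ~130 us, roughly a quarter of the original per-request rate, consistent with the "70% or more" reduction mentioned above. Below is only a minimal illustration of such a probe; the probe_ns() helper is made up for this example and is not the instrumentation actually used for the analysis.

#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* Illustrative probe helper (not the instrumentation used above): one
 * clock_gettime() call per probe point. */
static inline uint64_t probe_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

int main(void)
{
    uint64_t t0 = probe_ns();
    /* ... operation being measured, e.g. dispatching a READV ... */
    uint64_t t1 = probe_ns();
    printf("latency: %llu ns\n", (unsigned long long)(t1 - t0));
    return 0;
}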
URL: From rgowdapp at redhat.com Wed Jan 2 08:00:20 2019 From: rgowdapp at redhat.com (Raghavendra Gowdappa) Date: Wed, 2 Jan 2019 13:30:20 +0530 Subject: [Gluster-devel] [Gluster-users] On making ctime generator enabled by default in stack In-Reply-To: References: Message-ID: On Mon, Nov 12, 2018 at 10:48 AM Amar Tumballi wrote: > > > On Mon, Nov 12, 2018 at 10:39 AM Vijay Bellur wrote: > >> >> >> On Sun, Nov 11, 2018 at 8:25 PM Raghavendra Gowdappa >> wrote: >> >>> >>> >>> On Sun, Nov 11, 2018 at 11:41 PM Vijay Bellur >>> wrote: >>> >>>> >>>> >>>> On Mon, Nov 5, 2018 at 8:31 PM Raghavendra Gowdappa < >>>> rgowdapp at redhat.com> wrote: >>>> >>>>> >>>>> >>>>> On Tue, Nov 6, 2018 at 9:58 AM Vijay Bellur >>>>> wrote: >>>>> >>>>>> >>>>>> >>>>>> On Mon, Nov 5, 2018 at 7:56 PM Raghavendra Gowdappa < >>>>>> rgowdapp at redhat.com> wrote: >>>>>> >>>>>>> All, >>>>>>> >>>>>>> There is a patch [1] from Kotresh, which makes ctime generator as >>>>>>> default in stack. Currently ctime generator is being recommended only for >>>>>>> usecases where ctime is important (like for Elasticsearch). However, a >>>>>>> reliable (c)(m)time can fix many consistency issues within glusterfs stack >>>>>>> too. These are issues with caching layers having stale (meta)data >>>>>>> [2][3][4]. Basically just like applications, components within glusterfs >>>>>>> stack too need a time to find out which among racing ops (like write, stat, >>>>>>> etc) has latest (meta)data. >>>>>>> >>>>>>> Also note that a consistent (c)(m)time is not an optional feature, >>>>>>> but instead forms the core of the infrastructure. So, I am proposing to >>>>>>> merge this patch. If you've any objections, please voice out before Nov 13, >>>>>>> 2018 (a week from today). >>>>>>> >>>>>>> As to the existing known issues/limitations with ctime generator, my >>>>>>> conversations with Kotresh, revealed following: >>>>>>> * Potential performance degradation (we don't yet have data to >>>>>>> conclusively prove it, preliminary basic tests from Kotresh didn't indicate >>>>>>> a significant perf drop). >>>>>>> >>>>>> >>>>>> Do we have this data captured somewhere? If not, would it be possible >>>>>> to share that data here? >>>>>> >>>>> >>>>> I misquoted Kotresh. He had measured impact of gfid2path and said both >>>>> features might've similar impact as major perf cost is related to storing >>>>> xattrs on backend fs. I am in the process of getting a fresh set of >>>>> numbers. Will post those numbers when available. >>>>> >>>>> >>>> >>>> I observe that the patch under discussion has been merged now [1]. A >>>> quick search did not yield me any performance data. Do we have the >>>> performance numbers posted somewhere? >>>> >>> >>> No. Perf benchmarking is a task pending on me. >>> >> >> When can we expect this task to be complete? >> >> In any case, I don't think it is ideal for us to merge a patch without >> completing our due diligence on it. How do we want to handle this scenario >> since the patch is already merged? >> >> We could: >> >> 1. Revert the patch now >> 2. Review the performance data and revert the patch if performance >> characterization indicates a significant dip. It would be preferable to >> complete this activity before we branch off for the next release. >> > > I am for option 2. Considering the branch out for next release is another > 2 months, and no one is expected to use the 'release' off a master branch > yet, it makes sense to give that buffer time to get this activity completed. 
> Its unlikely I'll have time for carrying out perf benchmark. Hence I've posted a revert here: https://review.gluster.org/#/c/glusterfs/+/21975/ > Regards, > Amar > > 3. Think of some other option? >> >> Thanks, >> Vijay >> >> >>> _______________________________________________ >> Gluster-users mailing list >> Gluster-users at gluster.org >> https://lists.gluster.org/mailman/listinfo/gluster-users > > > > -- > Amar Tumballi (amarts) > -------------- next part -------------- An HTML attachment was scrubbed... URL: From nigelb at redhat.com Thu Jan 3 14:41:57 2019 From: nigelb at redhat.com (Nigel Babu) Date: Thu, 3 Jan 2019 20:11:57 +0530 Subject: [Gluster-devel] Tests for the GCS stack using the k8s framework Message-ID: Hello, Deepshikha and I have been working on understanding and using the k8s framework for testing the GCS stack. With the help of the folks from sig-storage, we've managed to write a sample test that needs to be run against an already setup k8s gluster with GCS installed on top[1]. This is a temporary location for the tests and we'll move these into gluster-csi-driver repo[2] once some of the dependency issues[3] are sorted out. The upstream storage tests are being split out into a test suite[4] that can be consumed out of tree by folks like us who are implementing a CSI driver interface. When that happens, we should be able to continuously validate against the standards set for the storage interface. [1]: https://github.com/nigelbabu/gcs-test/ [2]: https://github.com/gluster/gluster-csi-driver/ [3]: https://github.com/gluster/gluster-csi-driver/issues/131 [4]: https://github.com/kubernetes/kubernetes/tree/master/test/e2e/storage/testsuites -- nigelb -------------- next part -------------- An HTML attachment was scrubbed... URL: From jenkins at build.gluster.org Mon Jan 7 01:45:03 2019 From: jenkins at build.gluster.org (jenkins at build.gluster.org) Date: Mon, 7 Jan 2019 01:45:03 +0000 (UTC) Subject: [Gluster-devel] Weekly Untriaged Bugs Message-ID: <1605456600.113.1546825503928.JavaMail.jenkins@jenkins-el7.rht.gluster.org> [...truncated 6 lines...] https://bugzilla.redhat.com/1660404 / core: Conditional freeing of string after returning from dict_set_dynstr function https://bugzilla.redhat.com/1657645 / core: [Glusterfs-server-5.1] Gluster storage domain creation fails on MountError https://bugzilla.redhat.com/1658108 / disperse: [disperse] Dump respective itables in EC to statedumps. https://bugzilla.redhat.com/1658472 / disperse: Mountpoint not accessible for few seconds when bricks are brought down to max redundancy after reset brick https://bugzilla.redhat.com/1663337 / doc: Gluster documentation on quorum-reads option is incorrect https://bugzilla.redhat.com/1659334 / fuse: FUSE mount seems to be hung and not accessible https://bugzilla.redhat.com/1663205 / fuse: List dictionary is too slow https://bugzilla.redhat.com/1659824 / fuse: Unable to mount gluster fs on glusterfs client: Transport endpoint is not connected https://bugzilla.redhat.com/1657743 / fuse: Very high memory usage (25GB) on Gluster FUSE mountpoint https://bugzilla.redhat.com/1663583 / geo-replication: Geo-replication fails to open logfile "/var/log/glusterfs/cli.log" on slave. 
https://bugzilla.redhat.com/1662178 / glusterd: Compilation fails for xlators/mgmt/glusterd/src with error "undefined reference to `dlclose'" https://bugzilla.redhat.com/1663247 / glusterd: remove static memory allocations from code https://bugzilla.redhat.com/1663519 / gluster-smb: Memory leak when smb.conf has "store dos attributes = yes" https://bugzilla.redhat.com/1657607 / posix: Convert nr_files to gf_atomic in posix_private structure https://bugzilla.redhat.com/1659371 / posix: posix_janitor_thread_proc has bug that can't go into the janitor_walker if change the system time forward and change back https://bugzilla.redhat.com/1659374 / posix: posix_janitor_thread_proc has bug that can't go into the janitor_walker if change the system time forward and change back https://bugzilla.redhat.com/1659378 / posix: posix_janitor_thread_proc has bug that can't go into the janitor_walker if change the system time forward and change back https://bugzilla.redhat.com/1657860 / project-infrastructure: Archives for ci-results mailinglist are getting wiped (with each mail?) https://bugzilla.redhat.com/1659934 / project-infrastructure: Cannot unsubscribe the review.gluster.org https://bugzilla.redhat.com/1659394 / project-infrastructure: Maintainer permissions on gluster-mixins project for Ankush https://bugzilla.redhat.com/1661895 / replicate: [disperse] Dump respective itables in EC to statedumps. https://bugzilla.redhat.com/1662557 / replicate: glusterfs process crashes, causing "Transport endpoint not connected". https://bugzilla.redhat.com/1658742 / rpc: Inconsistent type for 'remote-port' parameter [...truncated 2 lines...] -------------- next part -------------- A non-text attachment was scrubbed... Name: build.log Type: application/octet-stream Size: 3119 bytes Desc: not available URL: From atumball at redhat.com Mon Jan 7 03:34:47 2019 From: atumball at redhat.com (Amar Tumballi Suryanarayan) Date: Mon, 7 Jan 2019 09:04:47 +0530 Subject: [Gluster-devel] Gluster Maintainer's meeting: 7th Jan, 2019 - Agenda Message-ID: Meeting date: 2019-01-07 18:30 IST, 13:00 UTC, 08:00 EDTBJ Link - Bridge: https://bluejeans.com/217609845 Attendance Agenda - Welcome 2019: Discuss about goals : - https://hackmd.io/OiQId65pStuBa_BPPazcmA - Progress with GCS - Scale testing showing GD2 can scale to 1000s of PVs (each is a gluster volume, in RWX mode) - new CSI for gluster-block showing good scale numbers, which is reaching higher than current 1k RWO PV per cluster, but need to iron out few things. (https://github.com/gluster/gluster-csi-driver/pull/105) - Performance focus: - Any update? What are the patch in progress? - How to measure the perf of a patch, is there any hardware? - Static Analyzers: - glusterfs: - coverity - 63 open - clang-scan - 32 open (with many false-positives). - gluster-block: - coverity: 1 open (66 last week) - GlusterFS-6: - Any priority review needed? - What are the critical areas need focus? - How to make glusto automated tests become blocker for the release? - Upgrade tests, need to start early. - Schedule as called out in the mail NOTE: Working backwards on the schedule, here?s what we have: - Announcement: Week of Mar 4th, 2019 - GA tagging: Mar-01-2019 - RC1: On demand before GA - RC0: Feb-04-2019 - Late features cut-off: Week of Jan-21st, 2018 - Branching (feature cutoff date): Jan-14-2018 (~45 days prior to branching) - Feature/scope proposal for the release (end date): Dec-12-2018 - Round Table? 
================= Feel free to add your topic into : https://hackmd.io/yTC-un5XT6KUB9V37LG6OQ?edit -- Amar Tumballi (amarts) -------------- next part -------------- An HTML attachment was scrubbed... URL: From rkavunga at redhat.com Tue Jan 8 06:53:04 2019 From: rkavunga at redhat.com (RAFI KC) Date: Tue, 8 Jan 2019 12:23:04 +0530 Subject: [Gluster-devel] Implementing multiplexing for self heal client. In-Reply-To: References: <8050438a-10a3-4329-ff58-6eae863c62cd@redhat.com> Message-ID: <63459166-fd5b-5ff0-f1f8-9a966c02f27a@redhat.com> I have completed the patches and pushed for reviews. Please feel free to raise your review concerns/suggestions. https://review.gluster.org/#/c/glusterfs/+/21868 https://review.gluster.org/#/c/glusterfs/+/21907 https://review.gluster.org/#/c/glusterfs/+/21960 https://review.gluster.org/#/c/glusterfs/+/21989/ Regards Rafi KC On 12/24/18 3:58 PM, RAFI KC wrote: > > On 12/21/18 6:56 PM, Sankarshan Mukhopadhyay wrote: >> On Fri, Dec 21, 2018 at 6:30 PM RAFI KC wrote: >>> Hi All, >>> >>> What is the problem? >>> As of now self-heal client is running as one daemon per node, this >>> means >>> even if there are multiple volumes, there will only be one self-heal >>> daemon. So to take effect of each configuration changes in the cluster, >>> the self-heal has to be reconfigured. But it doesn't have ability to >>> dynamically reconfigure. Which means when you have lot of volumes in >>> the >>> cluster, every management operation that involves configurations >>> changes >>> like volume start/stop, add/remove brick etc will result in self-heal >>> daemon restart. If such operation is executed more often, it is not >>> only >>> slow down self-heal for a volume, but also increases the slef-heal logs >>> substantially. >> What is the value of the number of volumes when you write "lot of >> volumes"? 1000 volumes, more etc > > Yes, more than 1000 volumes. It also depends on how often you execute > glusterd management operations (mentioned above). Each time self heal > daemon is restarted, it prints the entire graph. This graph traces in > the log will contribute the majority it's size. > > >> >>> >>> How to fix it? >>> >>> We are planning to follow a similar procedure as attach/detach graphs >>> dynamically which is similar to brick multiplex. The detailed steps is >>> as below, >>> >>> >>> >>> >>> 1) First step is to make shd per volume daemon, to generate/reconfigure >>> volfiles per volume basis . >>> >>> ??? 1.1) This will help to attach the volfiles easily to existing >>> shd daemon >>> >>> ??? 1.2) This will help to send notification to shd daemon as each >>> volinfo keeps the daemon object >>> >>> ??? 1.3) reconfiguring a particular subvolume is easier as we can check >>> the topology better >>> >>> ??? 1.4) With this change the volfiles will be moved to workdir/vols/ >>> directory. >>> >>> 2) Writing new rpc requests like attach/detach_client_graph function to >>> support clients attach/detach >>> >>> ??? 2.1) Also functions like graph reconfigure, mgmt_getspec_cbk has to >>> be modified >>> >>> 3) Safely detaching a subvolume when there are pending frames to >>> unwind. >>> >>> ??? 3.1) We can mark the client disconnected and make all the frames to >>> unwind with ENOTCONN >>> >>> ??? 
3.2) We can wait all the i/o to unwind until the new updated subvol >>> attaches >>> >>> 4) Handle scenarios like glusterd restart, node reboot, etc >>> >>> >>> >>> At the moment we are not planning to limit the number of heal subvolmes >>> per process as, because with the current approach also for every volume >>> heal was doing from a single process. We have not heared any major >>> complains on this? >> Is the plan to not ever limit or, have a throttle set to a default >> high(er) value? How would system resources be impacted if the proposed >> design is implemented? > > The plan is to implement in a way that it can support more than one > multiplexed self-heal daemon. The throttling function as of now > returns the same process to multiplex, but it can be easily modified > to create a new process. > > This multiplexing logic won't utilize any additional resources that it > currently does. > > > Rafi KC > > >> _______________________________________________ >> Gluster-devel mailing list >> Gluster-devel at gluster.org >> https://lists.gluster.org/mailman/listinfo/gluster-devel -------------- next part -------------- An HTML attachment was scrubbed... URL: From nbalacha at redhat.com Tue Jan 8 13:33:13 2019 From: nbalacha at redhat.com (Nithya Balachandran) Date: Tue, 8 Jan 2019 19:03:13 +0530 Subject: [Gluster-devel] https://review.gluster.org/#/c/glusterfs/+/19778/ In-Reply-To: References: Message-ID: Shyam, what is your take on this? An upstream user has tried it out and reported that it seems to fix the issue , however cpu utilization doubles. Regards, Nithya On Fri, 28 Dec 2018 at 09:17, Amar Tumballi wrote: > I feel its good to backport considering glusterfs-6.0 is another 2 months > away. > > On Fri, Dec 28, 2018 at 8:19 AM Nithya Balachandran > wrote: > >> Hi, >> >> Can we backport this to release-5 ? We have several reports of high >> memory usage in fuse clients from users and this is likely to help. >> >> Regards, >> Nithya >> _______________________________________________ >> Gluster-devel mailing list >> Gluster-devel at gluster.org >> https://lists.gluster.org/mailman/listinfo/gluster-devel > > > > -- > Amar Tumballi (amarts) > -------------- next part -------------- An HTML attachment was scrubbed... URL: From srangana at redhat.com Tue Jan 8 14:33:58 2019 From: srangana at redhat.com (Shyam Ranganathan) Date: Tue, 8 Jan 2019 09:33:58 -0500 Subject: [Gluster-devel] https://review.gluster.org/#/c/glusterfs/+/19778/ In-Reply-To: References: Message-ID: On 1/8/19 8:33 AM, Nithya Balachandran wrote: > Shyam, what is your take on this? > An upstream user has tried it out and reported that it seems to fix the > issue , however cpu utilization doubles. We usually do not backport big fixes unless they are critical. My first answer would be, can't this wait for rel-6 which is up next? The change has gone through a good review overall, so from a review thoroughness perspective it looks good. The change has a test case to ensure that the limits are honored, so again a plus. Also, it is a switch, so in the worst case moving back to unlimited should be possible with little adverse effects in case the fix has issues. It hence, comes down to how confident are we that the change is not disruptive to an existing branch? If we can answer this with resonable confidence we can backport it and release it with the next 5.x update release. 
> > Regards, > Nithya > > On Fri, 28 Dec 2018 at 09:17, Amar Tumballi > wrote: > > I feel its good to backport considering glusterfs-6.0 is another 2 > months away. > > On Fri, Dec 28, 2018 at 8:19 AM Nithya Balachandran > > wrote: > > Hi, > > Can we backport this to release-5 ? We have several reports of > high memory usage in fuse clients from users and this is likely > to help. > > Regards, > Nithya > _______________________________________________ > Gluster-devel mailing list > Gluster-devel at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-devel > > > > -- > Amar Tumballi (amarts) > > > _______________________________________________ > Gluster-devel mailing list > Gluster-devel at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-devel > From atumball at redhat.com Wed Jan 9 02:57:03 2019 From: atumball at redhat.com (Amar Tumballi Suryanarayan) Date: Wed, 9 Jan 2019 08:27:03 +0530 Subject: [Gluster-devel] https://review.gluster.org/#/c/glusterfs/+/19778/ In-Reply-To: References: Message-ID: On Tue, Jan 8, 2019 at 8:04 PM Shyam Ranganathan wrote: > On 1/8/19 8:33 AM, Nithya Balachandran wrote: > > Shyam, what is your take on this? > > An upstream user has tried it out and reported that it seems to fix the > > issue , however cpu utilization doubles. > > We usually do not backport big fixes unless they are critical. My first > answer would be, can't this wait for rel-6 which is up next? > > Considering it may take some more time to get adoption, doing a backport may surely benefit users, IMO. > The change has gone through a good review overall, so from a review > thoroughness perspective it looks good. > > The change has a test case to ensure that the limits are honored, so > again a plus. > > Also, it is a switch, so in the worst case moving back to unlimited > should be possible with little adverse effects in case the fix has issues. > > It hence, comes down to how confident are we that the change is not > disruptive to an existing branch? If we can answer this with resonable > confidence we can backport it and release it with the next 5.x update > release. > > Considering the code which the patch changes has changed very little over last few years, I feel it is totally safe to do the backport. Don't see any possible surprises. Will send a patch today on release-5 branch. -Amar > > > > Regards, > > Nithya > > > > On Fri, 28 Dec 2018 at 09:17, Amar Tumballi > > wrote: > > > > I feel its good to backport considering glusterfs-6.0 is another 2 > > months away. > > > > On Fri, Dec 28, 2018 at 8:19 AM Nithya Balachandran > > > wrote: > > > > Hi, > > > > Can we backport this to release-5 ? We have several reports of > > high memory usage in fuse clients from users and this is likely > > to help. > > > > Regards, > > Nithya > > _______________________________________________ > > Gluster-devel mailing list > > Gluster-devel at gluster.org > > https://lists.gluster.org/mailman/listinfo/gluster-devel > > > > > > > > -- > > Amar Tumballi (amarts) > > > > > > _______________________________________________ > > Gluster-devel mailing list > > Gluster-devel at gluster.org > > https://lists.gluster.org/mailman/listinfo/gluster-devel > > > -- Amar Tumballi (amarts) -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From atumball at redhat.com Wed Jan 9 03:05:19 2019 From: atumball at redhat.com (Amar Tumballi Suryanarayan) Date: Wed, 9 Jan 2019 08:35:19 +0530 Subject: [Gluster-devel] Gluster Maintainer's meeting: 7th Jan, 2019 - Meeting minutes In-Reply-To: References: Message-ID: Meeting date: 2019-01-07 18:30 IST, 13:00 UTC, 08:00 EDT BJ Link - Bridge: https://bluejeans.com/217609845 - Watch: https://bluejeans.com/s/sGFpa Attendance Agenda - Welcome 2019: New goals / Discuss: - https://hackmd.io/OiQId65pStuBa_BPPazcmA - Give it a week and take it to mailing list, discuss and agree upon - [Nigel] Some of the above points are threads of its own. May need separate thread. - Progress with GCS - Email about GCS in community. - RWX: - Scale testing showing GD2 can scale to 1000s of PVs (each is a gluster volume) - Bricks with LVM - Some delete issues seen, specially with LV command scale. Patch sent. - Create rate: 500 PVs / 12mins - More details by end of the week, including delete numbers. - RWO: - new CSI for gluster-block showing good scale numbers, which is reaching higher than current 1k RWO PV per cluster, but need to iron out few things. (https://github.com/gluster/gluster-csi-driver/pull/105 ) - 280 pods in 3 hosts, 1-1 Pod->PV ratio: leaner graph. - 1080 PVs with 1-12 ratio on 3 machines - Working on 3000+ PVC on just 3 hosts, will update by another 2 days. - Poornima is coming up with steps and details about the PR/version used etc. - Static Analyzers: - glusterfs: - Coverity - 63 open - https://scan.coverity.com/projects/gluster-glusterfs?tab=overview - clang-scan - 32 open - https://build.gluster.org/job/clang-scan/lastCompletedBuild/clangScanBuildBugs/ - gluster-block: - https://scan.coverity.com/projects/gluster-gluster-block?tab=overview - coverity: 1 open (66 last week) - GlusterFS-6: - Any priority review needed? - Fencing patches - Reducing threads (GH Issue: 475) - glfs-api statx patches [merged] - What are the critical areas need focus? - Asan Build ? Currently not green - Some java errors, machine offline. Need to look into this. - How to make glusto automated tests become blocker for the release? - Upgrade tests, need to start early. - Schedule as called out in the mail NOTE: Working backwards on the schedule, here?s what we have: - Announcement: Week of Mar 4th, 2019 - GA tagging: Mar-01-2019 - RC1: On demand before GA - RC0: Feb-04-2019 - Late features cut-off: Week of Jan-21st, 2018 - Branching (feature cutoff date): Jan-14-2018 (~45 days prior to branching) - Feature/scope proposal for the release (end date): Dec-12-2018 - Round Table? - [Sunny] Meetup in BLR this weekend. Please do come (at least those who are in BLR) - [Susant] Softserve has 4hrs timeout, which can?t get full regression cycle. Can we get at least 2 more hours added, so full regression can be run. ------- On Mon, Jan 7, 2019 at 9:04 AM Amar Tumballi Suryanarayan < atumball at redhat.com> wrote: > > Meeting date: 2019-01-07 18:30 IST, 13:00 UTC, 08:00 EDTBJ Link > > - Bridge: https://bluejeans.com/217609845 > > Attendance > Agenda > > - > > Welcome 2019: Discuss about goals : > - https://hackmd.io/OiQId65pStuBa_BPPazcmA > - > > Progress with GCS > - Scale testing showing GD2 can scale to 1000s of PVs (each is a > gluster volume, in RWX mode) > - new CSI for gluster-block showing good scale numbers, which is > reaching higher than current 1k RWO PV per cluster, but need to iron out > few things. (https://github.com/gluster/gluster-csi-driver/pull/105) > - > > Performance focus: > - Any update? 
What are the patch in progress? > - How to measure the perf of a patch, is there any hardware? > - > > Static Analyzers: > - glusterfs: > - coverity - 63 open > - clang-scan - 32 open (with many false-positives). > - gluster-block: > - coverity: 1 open (66 last week) > - > > GlusterFS-6: > - Any priority review needed? > - What are the critical areas need focus? > - How to make glusto automated tests become blocker for the release? > - Upgrade tests, need to start early. > - Schedule as called out in the mail > > NOTE: Working backwards on the schedule, here?s what we have: > - Announcement: Week of Mar 4th, 2019 > - GA tagging: Mar-01-2019 > - RC1: On demand before GA > - RC0: Feb-04-2019 > - Late features cut-off: Week of Jan-21st, 2018 > - Branching (feature cutoff date): Jan-14-2018 (~45 days prior > to branching) > - Feature/scope proposal for the release (end date): Dec-12-2018 > - > > Round Table? > > ================= > > Feel free to add your topic into : > https://hackmd.io/yTC-un5XT6KUB9V37LG6OQ?edit > > > -- > Amar Tumballi (amarts) > -- Amar Tumballi (amarts) -------------- next part -------------- An HTML attachment was scrubbed... URL: From nbalacha at redhat.com Wed Jan 9 06:23:12 2019 From: nbalacha at redhat.com (Nithya Balachandran) Date: Wed, 9 Jan 2019 11:53:12 +0530 Subject: [Gluster-devel] https://review.gluster.org/#/c/glusterfs/+/19778/ In-Reply-To: References: Message-ID: On Wed, 9 Jan 2019 at 08:28, Amar Tumballi Suryanarayan wrote: > > > On Tue, Jan 8, 2019 at 8:04 PM Shyam Ranganathan > wrote: > >> On 1/8/19 8:33 AM, Nithya Balachandran wrote: >> > Shyam, what is your take on this? >> > An upstream user has tried it out and reported that it seems to fix the >> > issue , however cpu utilization doubles. >> >> We usually do not backport big fixes unless they are critical. My first >> answer would be, can't this wait for rel-6 which is up next? >> >> Considering it may take some more time to get adoption, doing a backport > may surely benefit users, IMO. > > I agree. This is a pain point for several users and I would like to have folks be able to try this out earlier and provide feedback. The change has gone through a good review overall, so from a review >> thoroughness perspective it looks good. >> >> The change has a test case to ensure that the limits are honored, so >> again a plus. >> >> Also, it is a switch, so in the worst case moving back to unlimited >> should be possible with little adverse effects in case the fix has issues. >> >> It hence, comes down to how confident are we that the change is not >> disruptive to an existing branch? If we can answer this with resonable >> confidence we can backport it and release it with the next 5.x update >> release. >> >> > Considering the code which the patch changes has changed very little over > last few years, I feel it is > totally safe to do the backport. Don't see any possible surprises. Will > send a patch today on release-5 branch. > > -Amar > > > >> > >> > Regards, >> > Nithya >> > >> > On Fri, 28 Dec 2018 at 09:17, Amar Tumballi > > > wrote: >> > >> > I feel its good to backport considering glusterfs-6.0 is another 2 >> > months away. >> > >> > On Fri, Dec 28, 2018 at 8:19 AM Nithya Balachandran >> > > wrote: >> > >> > Hi, >> > >> > Can we backport this to release-5 ? We have several reports of >> > high memory usage in fuse clients from users and this is likely >> > to help. 
>> > >> > Regards, >> > Nithya >> > _______________________________________________ >> > Gluster-devel mailing list >> > Gluster-devel at gluster.org >> > https://lists.gluster.org/mailman/listinfo/gluster-devel >> > >> > >> > >> > -- >> > Amar Tumballi (amarts) >> > >> > >> > _______________________________________________ >> > Gluster-devel mailing list >> > Gluster-devel at gluster.org >> > https://lists.gluster.org/mailman/listinfo/gluster-devel >> > >> > > > -- > Amar Tumballi (amarts) > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jstrunk at redhat.com Wed Jan 9 16:21:04 2019 From: jstrunk at redhat.com (John Strunk) Date: Wed, 9 Jan 2019 11:21:04 -0500 Subject: [Gluster-devel] Weekly GCS architecture call Message-ID: We have a weekly 1 hour call to discuss architecture topics related to GCS. This call has been ongoing for several months as an internal meeting. With the new year, we are expanding the invitation for the community to join and hear/contribute to the discussions. Meeting info: - Time/Date: Thursdays at 15:00 UTC (hint: `date -d "15:00 UTC"`) - Location: Bluejeans - https://bluejeans.com/600091070 - Minutes/Agenda/Info: https://hackmd.io/sj9ik9SCTYm81YcQDOOrtw This week's main topic will be a roundtable discussion to highlight the set of remaining tasks for a GCS 1.0 release. We are targeting the 1.0 release for the end of January / early February. See you tomorrow. -John -------------- next part -------------- An HTML attachment was scrubbed... URL: From srangana at redhat.com Wed Jan 9 18:54:13 2019 From: srangana at redhat.com (Shyam Ranganathan) Date: Wed, 9 Jan 2019 13:54:13 -0500 Subject: [Gluster-devel] Regression health for release-5.next and release-6 Message-ID: <89f54d02-8c78-a507-416a-c1ca1d7b4be2@redhat.com> Hi, As part of branching preparation next week for release-6, please find test failures and respective test links here [1]. The top tests that are failing/dumping-core are as below and need attention, - ec/bug-1236065.t - glusterd/add-brick-and-validate-replicated-volume-options.t - readdir-ahead/bug-1390050.t - glusterd/brick-mux-validation.t - bug-1432542-mpx-restart-crash.t Others of interest, - replicate/bug-1341650.t Please file a bug if needed against the test case and report the same here, in case a problem is already addressed, then do send back the patch details that addresses this issue as a response to this mail. Thanks, Shyam [1] Regression failures: https://hackmd.io/wsPgKjfJRWCP8ixHnYGqcA?view From manu at netbsd.org Thu Jan 10 02:40:34 2019 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Thu, 10 Jan 2019 03:40:34 +0100 Subject: [Gluster-devel] FUSE directory filehandle Message-ID: <1o1643u.1ah4ac9mi3e0M%manu@netbsd.org> Hello This is not strictly a GlusterFS question since I came to it porting LTFS to NetBSD, however I would like to make sure I will not break GlusterFS by fixing NetBSD FUSE implementation for LTFS. Current NetBSD FUSE implementation sends the filehandle in any FUSE requests for an open node, regardless of its type (directory or file). I discovered that libfuse low level code manages filehandle differently for opendir/readdir/syncdir/releasedir than for other operations. As a result, when a getattr is done on a directory, setting the filehandle obtained from opendir can cause a crash in libfuse. 
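To make the failure mode concrete, below is a minimal sketch of the same class of problem using a hypothetical high-level libfuse (3.x) getattr handler; the myfs_* names are invented for this example and this is not LTFS, libfuse or glusterfs source. The handler trusts a non-zero fi->fh to be the handle returned by open(); if the kernel side also forwards the opendir() handle for directories, the cast points at an object of a different type and the dereference can crash.

#define FUSE_USE_VERSION 31
#include <fuse.h>
#include <stdint.h>
#include <string.h>
#include <sys/stat.h>

/* Hypothetical per-open file object of an imaginary filesystem. */
struct myfs_file {
    struct stat cached_stat;
};

/* getattr handler that assumes a non-zero fi->fh always came from open().
 * If a directory's opendir() handle is forwarded here as well, the cast
 * below reinterprets a directory handle as a file object. */
static int myfs_getattr(const char *path, struct stat *st,
                        struct fuse_file_info *fi)
{
    (void)path;
    if (fi != NULL && fi->fh != 0) {
        struct myfs_file *f = (struct myfs_file *)(uintptr_t)fi->fh;
        *st = f->cached_stat;   /* wrong object type => undefined behaviour */
        return 0;
    }
    /* Path-based fallback when no filehandle is supplied (stubbed). */
    memset(st, 0, sizeof(*st));
    st->st_mode = S_IFDIR | 0755;
    return 0;
}

With a null (FUSE_UNKNOWN_FH) filehandle for getattr on directories, only the path-based branch runs.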
The fix for NetBSD FUSE implementation is to avoid setting the filehandle for the following FUSE operations on directories: getattr, setattr, poll, getlk, setlk, setlkw, read, write (only the first two ones are likely to be actually used, though) Does anyone forsee a possible problem for GlusterFS with such a behavior? In other words, will it be fine to always have a FUSE_UNKNOWN_FH (aka null) filehandle for getattr/setattr on directories? -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From amukherj at redhat.com Thu Jan 10 10:25:27 2019 From: amukherj at redhat.com (Atin Mukherjee) Date: Thu, 10 Jan 2019 15:55:27 +0530 Subject: [Gluster-devel] Regression health for release-5.next and release-6 In-Reply-To: <89f54d02-8c78-a507-416a-c1ca1d7b4be2@redhat.com> References: <89f54d02-8c78-a507-416a-c1ca1d7b4be2@redhat.com> Message-ID: Mohit, Sanju - request you to investigate the failures related to glusterd and brick-mux and report back to the list. On Thu, Jan 10, 2019 at 12:25 AM Shyam Ranganathan wrote: > Hi, > > As part of branching preparation next week for release-6, please find > test failures and respective test links here [1]. > > The top tests that are failing/dumping-core are as below and need > attention, > - ec/bug-1236065.t > - glusterd/add-brick-and-validate-replicated-volume-options.t > - readdir-ahead/bug-1390050.t > - glusterd/brick-mux-validation.t > - bug-1432542-mpx-restart-crash.t > > Others of interest, > - replicate/bug-1341650.t > > Please file a bug if needed against the test case and report the same > here, in case a problem is already addressed, then do send back the > patch details that addresses this issue as a response to this mail. > > Thanks, > Shyam > > [1] Regression failures: https://hackmd.io/wsPgKjfJRWCP8ixHnYGqcA?view > _______________________________________________ > Gluster-devel mailing list > Gluster-devel at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-devel > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From moagrawa at redhat.com Thu Jan 10 11:20:27 2019 From: moagrawa at redhat.com (Mohit Agrawal) Date: Thu, 10 Jan 2019 16:50:27 +0530 Subject: [Gluster-devel] Regression health for release-5.next and release-6 In-Reply-To: References: <89f54d02-8c78-a507-416a-c1ca1d7b4be2@redhat.com> Message-ID: I think we should consider regression-builds after merged the patch ( https://review.gluster.org/#/c/glusterfs/+/21990/) as we know this patch introduced some delay. Thanks, Mohit Agrawal On Thu, Jan 10, 2019 at 3:55 PM Atin Mukherjee wrote: > Mohit, Sanju - request you to investigate the failures related to glusterd > and brick-mux and report back to the list. > > On Thu, Jan 10, 2019 at 12:25 AM Shyam Ranganathan > wrote: > >> Hi, >> >> As part of branching preparation next week for release-6, please find >> test failures and respective test links here [1]. >> >> The top tests that are failing/dumping-core are as below and need >> attention, >> - ec/bug-1236065.t >> - glusterd/add-brick-and-validate-replicated-volume-options.t >> - readdir-ahead/bug-1390050.t >> - glusterd/brick-mux-validation.t >> - bug-1432542-mpx-restart-crash.t >> >> Others of interest, >> - replicate/bug-1341650.t >> >> Please file a bug if needed against the test case and report the same >> here, in case a problem is already addressed, then do send back the >> patch details that addresses this issue as a response to this mail. 
>> >> Thanks, >> Shyam >> >> [1] Regression failures: https://hackmd.io/wsPgKjfJRWCP8ixHnYGqcA?view >> _______________________________________________ >> Gluster-devel mailing list >> Gluster-devel at gluster.org >> https://lists.gluster.org/mailman/listinfo/gluster-devel >> >> >> -------------- next part -------------- An HTML attachment was scrubbed... URL: From atumball at redhat.com Thu Jan 10 11:56:37 2019 From: atumball at redhat.com (Amar Tumballi Suryanarayan) Date: Thu, 10 Jan 2019 17:26:37 +0530 Subject: [Gluster-devel] Regression health for release-5.next and release-6 In-Reply-To: References: <89f54d02-8c78-a507-416a-c1ca1d7b4be2@redhat.com> Message-ID: That is a good point Mohit, but do we know how many of these tests failed because of 'timeout' ? If most of these are due to timeout, then yes, it may be a valid point. -Amar On Thu, Jan 10, 2019 at 4:51 PM Mohit Agrawal wrote: > I think we should consider regression-builds after merged the patch ( > https://review.gluster.org/#/c/glusterfs/+/21990/) > as we know this patch introduced some delay. > > Thanks, > Mohit Agrawal > > On Thu, Jan 10, 2019 at 3:55 PM Atin Mukherjee > wrote: > >> Mohit, Sanju - request you to investigate the failures related to >> glusterd and brick-mux and report back to the list. >> >> On Thu, Jan 10, 2019 at 12:25 AM Shyam Ranganathan >> wrote: >> >>> Hi, >>> >>> As part of branching preparation next week for release-6, please find >>> test failures and respective test links here [1]. >>> >>> The top tests that are failing/dumping-core are as below and need >>> attention, >>> - ec/bug-1236065.t >>> - glusterd/add-brick-and-validate-replicated-volume-options.t >>> - readdir-ahead/bug-1390050.t >>> - glusterd/brick-mux-validation.t >>> - bug-1432542-mpx-restart-crash.t >>> >>> Others of interest, >>> - replicate/bug-1341650.t >>> >>> Please file a bug if needed against the test case and report the same >>> here, in case a problem is already addressed, then do send back the >>> patch details that addresses this issue as a response to this mail. >>> >>> Thanks, >>> Shyam >>> >>> [1] Regression failures: https://hackmd.io/wsPgKjfJRWCP8ixHnYGqcA?view >>> _______________________________________________ >>> Gluster-devel mailing list >>> Gluster-devel at gluster.org >>> https://lists.gluster.org/mailman/listinfo/gluster-devel >>> >>> >>> _______________________________________________ > Gluster-devel mailing list > Gluster-devel at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-devel -- Amar Tumballi (amarts) -------------- next part -------------- An HTML attachment was scrubbed... URL: From amukherj at redhat.com Fri Jan 11 03:29:06 2019 From: amukherj at redhat.com (Atin Mukherjee) Date: Fri, 11 Jan 2019 08:59:06 +0530 Subject: [Gluster-devel] GCS 0.5 release Message-ID: Today, we are announcing the availability of GCS (Gluster Container Storage) 0.5. Highlights and updates since v0.4: - GCS environment updated to kube 1.13 - CSI deployment moved to 1.0 - Integrated Anthill deployment - Kube & etcd metrics added to prometheus - Tuning of etcd to increase stability - GD2 bug fixes from scale testing effort. 
Included components: - Glusterd2: https://github.com/gluster/glusterd2 - Gluster CSI driver: https://github.com/gluster/gluster-csi-driver - Gluster-prometheus: https://github.com/gluster/gluster-prometheus - Anthill - https://github.com/gluster/anthill/ - Gluster-Mixins - https://github.com/gluster/gluster-mixins/ For more details on the specific content of this release please refer [3]. If you are interested in contributing, please see [4] or contact the gluster-devel mailing list. We?re always interested in any bugs that you find, pull requests for new features and your feedback. Regards, Team GCS [1] https://github.com/gluster/gcs/releases [2] https://github.com/gluster/gcs/tree/master/deploy [3] https://waffle.io/gluster/gcs?label=GCS%2F0.5 - search for ?Done? lane [4] https://github.com/gluster/gcs -------------- next part -------------- An HTML attachment was scrubbed... URL: From atumball at redhat.com Fri Jan 11 08:37:36 2019 From: atumball at redhat.com (Amar Tumballi Suryanarayan) Date: Fri, 11 Jan 2019 14:07:36 +0530 Subject: [Gluster-devel] FUSE directory filehandle In-Reply-To: <1o1643u.1ah4ac9mi3e0M%manu@netbsd.org> References: <1o1643u.1ah4ac9mi3e0M%manu@netbsd.org> Message-ID: On Thu, Jan 10, 2019 at 8:17 AM Emmanuel Dreyfus wrote: > Hello > > This is not strictly a GlusterFS question since I came to it porting > LTFS to NetBSD, however I would like to make sure I will not break > GlusterFS by fixing NetBSD FUSE implementation for LTFS. > > Current NetBSD FUSE implementation sends the filehandle in any FUSE > requests for an open node, regardless of its type (directory or file). > > I discovered that libfuse low level code manages filehandle differently > for opendir/readdir/syncdir/releasedir than for other operations. As a > result, when a getattr is done on a directory, setting the filehandle > obtained from opendir can cause a crash in libfuse. > > The fix for NetBSD FUSE implementation is to avoid setting the > filehandle for the following FUSE operations on directories: getattr, > setattr, poll, getlk, setlk, setlkw, read, write (only the first two > ones are likely to be actually used, though) > > Does anyone forsee a possible problem for GlusterFS with such a > behavior? In other words, will it be fine to always have a > FUSE_UNKNOWN_FH (aka null) filehandle for getattr/setattr on > directories? > > Below is the code snippet from fuse_getattr(). #if FUSE_KERNEL_MINOR_VERSION >= 9 priv = this->private; if (priv->proto_minor >= 9 && fgi->getattr_flags & FUSE_GETATTR_FH) state->fd = fd_ref((fd_t *)(uintptr_t)fgi->fh); #endif Which means, it may crash if we get fd as NULL, when FUSE_GETATTR_FH is set. > > -- > Emmanuel Dreyfus > http://hcpnet.free.fr/pubz > manu at netbsd.org > _______________________________________________ > Gluster-devel mailing list > Gluster-devel at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-devel > > > -- Amar Tumballi (amarts) -------------- next part -------------- An HTML attachment was scrubbed... URL: From rgowdapp at redhat.com Fri Jan 11 14:39:22 2019 From: rgowdapp at redhat.com (Raghavendra Gowdappa) Date: Fri, 11 Jan 2019 20:09:22 +0530 Subject: [Gluster-devel] Latency analysis of GlusterFS' network layer for pgbench In-Reply-To: References: Message-ID: Here is the update of the progress till now: * The client profile attached till now shows the tuple creation is dominated by writes and fstats. Note that fstats are side-effects of writes as writes invalidate attributes of the file from kernel attribute cache. 
* The rest of the init phase (which is marked by msgs "setting primary key" and "vaccuum") is dominated by reads. Next bigger set of operations are writes followed by fstats. So, only writes, reads and fstats are the operations we need to optimize to reduce the init time latency. As mentioned in my previous mail, I did following tunings: * Enabled only write-behind, md-cache and open-behind. - write-behind was configured with a cache-size/window-size of 20MB - open-behind was configured with read-after-open yes - md-cache was loaded as a child of write-behind in xlator graph. As a parent of write-behind, writes responses of writes cached in write-behind would invalidate stats. But when loaded as a child of write-behind this problem won't be there. Note that in both cases fstat would pass through write-behind (In the former case due to no stats in md-cache). However in the latter case fstats can be served by md-cache. - md-cache used to aggressively invalidate inodes. For the purpose of this test, I just commented out inode-invalidate code in md-cache. We need to fine tune the invalidation invocation logic. - set group-metadata-cache to on. But turned off upcall notifications. Note that since this workload basically accesses all its data through single mount point. So, there is no shared files across mounts and hence its safe to turn off invalidations. * Applied fix to https://bugzilla.redhat.com/show_bug.cgi?id=1648781 With the above set of tunings I could reduce the init time of scale 8000 from 16.6 hrs to 11.4 hrs - an improvement in the range 25% to 30% Since the workload is dominated by reads, we think a good read-cache where reads to regions just written are served from cache would greatly improve the performance. Since kernel page-cache already provides that functionality along with read-ahead (which is more intelligent and serves more read patterns than supported by Glusterfs read-ahead), we wanted to try that. But, Manoj found a bug where reads followed by writes are not served from page cache [5]. I am currently waiting for the resolution of this bug. As an alternative, I can modify io-cache to serve reads from the data just written. But, the change involves its challenges and hence would like to get a resolution on [5] (either positive or negative) before proceeding with modifications to io-cache. As to the rpc latency, Krutika had long back identified that reading a single rpc message involves atleast 4 reads to socket. These many number of reads were done to identify the structure of the message on the go. The reason we wanted to discover the rpc message was to identify the part of the rpc message containing read or write payload and make sure that payload is directly read into a buffer different than the one containing rest of the rpc message. This strategy will make sure payloads are not copied again when buffers are moved across caches (read-ahead, io-cache etc) and also the rest of the rpc message can be freed even though the payload outlives the rpc message (when payloads are cached). However, we can experiment an approach where we can either do away with zero-copy requirement or let the entire buffer containing rpc message and payload to live in the cache. >From my observations and discussions with Manoj and Xavi, this workload is very sensitive to latency (than to concurrency). So, I am hopeful the above approaches will give positive results. 
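For context, below is a rough sketch of how a single ONC-RPC record is pulled off a TCP socket when the whole fragment (header plus payload) lands in one buffer; the helper names are invented and this is not glusterfs' rpc/socket-transport code. The additional socket reads described above come from decoding enough of the header incrementally so that a READ/WRITE payload can be placed into a separate buffer for zero-copy handling across the caching layers.

#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>
#include <arpa/inet.h>

/* Read exactly len bytes or fail. */
static ssize_t read_full(int fd, void *buf, size_t len)
{
    size_t done = 0;
    while (done < len) {
        ssize_t n = read(fd, (char *)buf + done, len - done);
        if (n <= 0)
            return -1;              /* error or EOF */
        done += (size_t)n;
    }
    return (ssize_t)done;
}

/* Returns a malloc'd buffer holding one RPC fragment (header and payload
 * together), or NULL on failure. Assumes a single-fragment record. */
static char *read_rpc_fragment(int sock, size_t *out_len)
{
    uint32_t marker;
    if (read_full(sock, &marker, sizeof(marker)) < 0)
        return NULL;
    marker = ntohl(marker);
    uint32_t size = marker & 0x7fffffffu;   /* low 31 bits: fragment size */
                                            /* high bit: last-fragment flag */
    char *buf = malloc(size);
    if (buf == NULL)
        return NULL;
    if (read_full(sock, buf, size) < 0) {
        free(buf);
        return NULL;
    }
    *out_len = size;
    return buf;
}

Keeping the payload in the same buffer as the rest of the rpc message (the second alternative mentioned above) would allow reads this simple, at the cost of holding the whole fragment in memory for as long as the cached payload lives.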
[5] https://bugzilla.redhat.com/show_bug.cgi?id=1664934 regards, Raghavendra On Fri, Dec 28, 2018 at 12:44 PM Raghavendra Gowdappa wrote: > > > On Mon, Dec 24, 2018 at 6:05 PM Raghavendra Gowdappa > wrote: > >> >> >> On Mon, Dec 24, 2018 at 3:40 PM Sankarshan Mukhopadhyay < >> sankarshan.mukhopadhyay at gmail.com> wrote: >> >>> [pulling the conclusions up to enable better in-line] >>> >>> > Conclusions: >>> > >>> > We should never have a volume with caching-related xlators disabled. >>> The price we pay for it is too high. We need to make them work consistently >>> and aggressively to avoid as many requests as we can. >>> >>> Are there current issues in terms of behavior which are known/observed >>> when these are enabled? >>> >> >> We did have issues with pgbench in past. But they've have been fixed. >> Please refer to bz [1] for details. On 5.1, it runs successfully with all >> caching related xlators enabled. Having said that the only performance >> xlators which gave improved performance were open-behind and write-behind >> [2] (write-behind had some issues, which will be fixed by [3] and we'll >> have to measure performance again with fix to [3]). >> > > One quick update. Enabling write-behind and md-cache with fix for [3] > reduced the total time taken for pgbench init phase roughly by 20%-25% > (from 12.5 min to 9.75 min for a scale of 100). Though this is still a huge > time (around 12hrs for a db of scale 8000). I'll follow up with a detailed > report once my experiments are complete. Currently trying to optimize the > read path. > > >> For some reason, read-side caching didn't improve transactions per >> second. I am working on this problem currently. Note that these bugs >> measure transaction phase of pgbench, but what xavi measured in his mail is >> init phase. Nevertheless, evaluation of read caching (metadata/data) will >> still be relevant for init phase too. >> >> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1512691 >> [2] https://bugzilla.redhat.com/show_bug.cgi?id=1629589#c4 >> [3] https://bugzilla.redhat.com/show_bug.cgi?id=1648781 >> >> >>> > We need to analyze client/server xlators deeper to see if we can avoid >>> some delays. However optimizing something that is already at the >>> microsecond level can be very hard. >>> >>> That is true - are there any significant gains which can be accrued by >>> putting efforts here or, should this be a lower priority? >>> >> >> The problem identified by xavi is also the one we (Manoj, Krutika, me and >> Milind) had encountered in the past [4]. The solution we used was to have >> multiple rpc connections between single brick and client. The solution >> indeed fixed the bottleneck. So, there is definitely work involved here - >> either to fix the single connection model or go with multiple connection >> model. Its preferred to improve single connection and resort to multiple >> connections only if bottlenecks in single connection are not fixable. >> Personally I think this is high priority along with having appropriate >> client side caching. >> >> [4] https://bugzilla.redhat.com/show_bug.cgi?id=1467614#c52 >> >> >>> > We need to determine what causes the fluctuations in brick side and >>> avoid them. >>> > This scenario is very similar to a smallfile/metadata workload, so >>> this is probably one important cause of its bad performance. >>> >>> What kind of instrumentation is required to enable the determination? 
>>> >>> On Fri, Dec 21, 2018 at 1:48 PM Xavi Hernandez >>> wrote: >>> > >>> > Hi, >>> > >>> > I've done some tracing of the latency that network layer introduces in >>> gluster. I've made the analysis as part of the pgbench performance issue >>> (in particulat the initialization and scaling phase), so I decided to look >>> at READV for this particular workload, but I think the results can be >>> extrapolated to other operations that also have small latency (cached data >>> from FS for example). >>> > >>> > Note that measuring latencies introduces some latency. It consists in >>> a call to clock_get_time() for each probe point, so the real latency will >>> be a bit lower, but still proportional to these numbers. >>> > >>> >>> [snip] >>> _______________________________________________ >>> Gluster-devel mailing list >>> Gluster-devel at gluster.org >>> https://lists.gluster.org/mailman/listinfo/gluster-devel >>> >> -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: pgbench-init-client-profile.tgz Type: application/x-compressed-tar Size: 8962 bytes Desc: not available URL: From srangana at redhat.com Fri Jan 11 15:50:09 2019 From: srangana at redhat.com (Shyam Ranganathan) Date: Fri, 11 Jan 2019 10:50:09 -0500 Subject: [Gluster-devel] Regression health for release-5.next and release-6 In-Reply-To: References: <89f54d02-8c78-a507-416a-c1ca1d7b4be2@redhat.com> Message-ID: <65f1c892-a3c4-2401-4827-dbe0277875b2@redhat.com> We can check health on master post the patch as stated by Mohit below. Release-5 is causing some concerns as we need to tag the release yesterday, but we have the following 2 tests failing or coredumping pretty regularly, need attention on these. ec/bug-1236065.t glusterd/add-brick-and-validate-replicated-volume-options.t Shyam On 1/10/19 6:20 AM, Mohit Agrawal wrote: > I think we should consider regression-builds after merged the patch > (https://review.gluster.org/#/c/glusterfs/+/21990/)? > as we know this patch introduced some delay. > > Thanks, > Mohit Agrawal > > On Thu, Jan 10, 2019 at 3:55 PM Atin Mukherjee > wrote: > > Mohit, Sanju - request you to investigate the failures related to > glusterd and brick-mux and report back to the list. > > On Thu, Jan 10, 2019 at 12:25 AM Shyam Ranganathan > > wrote: > > Hi, > > As part of branching preparation next week for release-6, please > find > test failures and respective test links here [1]. > > The top tests that are failing/dumping-core are as below and > need attention, > - ec/bug-1236065.t > - glusterd/add-brick-and-validate-replicated-volume-options.t > - readdir-ahead/bug-1390050.t > - glusterd/brick-mux-validation.t > - bug-1432542-mpx-restart-crash.t > > Others of interest, > - replicate/bug-1341650.t > > Please file a bug if needed against the test case and report the > same > here, in case a problem is already addressed, then do send back the > patch details that addresses this issue as a response to this mail. 
> > Thanks, > Shyam > > [1] Regression failures: > https://hackmd.io/wsPgKjfJRWCP8ixHnYGqcA?view > _______________________________________________ > Gluster-devel mailing list > Gluster-devel at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-devel > > From moagrawa at redhat.com Sat Jan 12 12:59:56 2019 From: moagrawa at redhat.com (Mohit Agrawal) Date: Sat, 12 Jan 2019 18:29:56 +0530 Subject: [Gluster-devel] Regression health for release-5.next and release-6 In-Reply-To: <65f1c892-a3c4-2401-4827-dbe0277875b2@redhat.com> References: <89f54d02-8c78-a507-416a-c1ca1d7b4be2@redhat.com> <65f1c892-a3c4-2401-4827-dbe0277875b2@redhat.com> Message-ID: For "add-brick-and-validate-replicated-volume-options.t" I have posted a patch: https://review.gluster.org/22015. For the test case "ec/bug-1236065.t" I think the issue needs to be checked by the EC team. On the brick side, it is showing the logs below:
>>>>>>>>>>>>>>>>>
on wire in the future [Invalid argument]
The message "I [MSGID: 101016] [glusterfs3.h:746:dict_to_xdr] 0-dict: key 'trusted.ec.dirty' would not be sent on wire in the future [Invalid argument]" repeated 3 times between [2019-01-12 12:25:25.902828] and [2019-01-12 12:25:25.902992]
[2019-01-12 12:25:25.903553] W [MSGID: 114031] [client-rpc-fops_v2.c:1614:client4_0_fxattrop_cbk] 0-patchy-client-1: remote operation failed [Bad file descriptor]
[2019-01-12 12:25:25.903998] W [MSGID: 122040] [ec-common.c:1181:ec_prepare_update_cbk] 0-patchy-disperse-0: Failed to get size and version : FOP : 'FXATTROP' failed on gfid d91f6331-d394-479d-ab51-6bcf674ac3e0 [Input/output error]
[2019-01-12 12:25:25.904059] W [fuse-bridge.c:1907:fuse_unlink_cbk] 0-glusterfs-fuse: 3259: UNLINK() /test/0.o => -1 (Input/output error)
>>>>>>>>>>>>>>>>>>>
Thread 8 (Thread 0x7f83777fe700 (LWP 25552)): #0 0x00007f83bb70d945 in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0 #1 0x00007f83bc910e5b in syncop_getxattr (subvol=, loc=loc at entry=0x7f83777fdbb0, dict=dict at entry=0x0, key=key at entry=0x7f83add06a28 "trusted.ec.heal", xdata_in=xdata_in at entry=0x0, xdata_out=xdata_out at entry=0x0) at syncop.c:1680 #2 0x00007f83add02f27 in ec_shd_selfheal (healer=0x7f83a8030880, child=, loc=0x7f83777fdbb0, full=) at ec-heald.c:161 #3 0x00007f83add0325b in ec_shd_full_heal (subvol=0x7f83a80094b0, entry=, parent=0x7f83777fdde0, data=0x7f83a8030880) at ec-heald.c:294 #4 0x00007f83bc930ac2 in syncop_ftw (subvol=0x7f83a80094b0, loc=loc at entry=0x7f83777fdde0, pid=pid at entry=-6, data=data at entry=0x7f83a8030880, fn=fn at entry=0x7f83add03140 ) at syncop-utils.c:125 #5 0x00007f83add03534 in ec_shd_full_sweep (healer=healer at entry=0x7f83a8030880, inode=) at ec-heald.c:311 #6 0x00007f83add0367b in ec_shd_full_healer (data=0x7f83a8030880) at ec-heald.c:372 #7 0x00007f83bb709e25 in start_thread () from /usr/lib64/libpthread.so.0 #8 0x00007f83bafd634d in clone () from /usr/lib64/libc.so.6 Thread 7 (Thread 0x7f8376ffd700 (LWP 25553)): #0 0x00007f83bb70d945 in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0 #1 0x00007f83bc910e5b in syncop_getxattr (subvol=, loc=loc at entry=0x7f8376ffcbb0, dict=dict at entry=0x0, key=key at entry=0x7f83add06a28 "trusted.ec.heal", xdata_in=xdata_in at entry=0x0, xdata_out=xdata_out at entry=0x0) at syncop.c:1680 #2 0x00007f83add02f27 in ec_shd_selfheal (healer=0x7f83a80308f0, child=, loc=0x7f8376ffcbb0, full=) at ec-heald.c:161 #3 0x00007f83add0325b in ec_shd_full_heal (subvol=0x7f83a800d110, entry=, parent=0x7f8376ffcde0, data=0x7f83a80308f0) at ec-heald.c:294 #4 0x00007f83bc930ac2 in syncop_ftw (subvol=0x7f83a800d110, loc=loc at entry=0x7f8376ffcde0, pid=pid at entry=-6, data=data at entry=0x7f83a80308f0, fn=fn at entry=0x7f83add03140 ) at syncop-utils.c:125 #5 0x00007f83add03534 in ec_shd_full_sweep (healer=healer at entry=0x7f83a80308f0, inode=) at ec-heald.c:311 #6 0x00007f83add0367b in ec_shd_full_healer (data=0x7f83a80308f0) at ec-heald.c:372 #7 0x00007f83bb709e25 in start_thread () from /usr/lib64/libpthread.so.0 #8 0x00007f83bafd634d in clone () from /usr/lib64/libc.so.6 Thread 6 (Thread 0x7f83767fc700 (LWP 25554)): #0 0x00007f83bb70d945 in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0 #1 0x00007f83bc910e5b in syncop_getxattr (subvol=, loc=loc at entry=0x7f83767fbbb0, dict=dict at entry=0x0, key=key at entry=0x7f83add06a28 "trusted.ec.heal", xdata_in=xdata_in at entry=0x0, xdata_out=xdata_out at entry=0x0) at syncop.c:1680 #2 0x00007f83add02f27 in ec_shd_selfheal (healer=0x7f83a8030960, child=, loc=0x7f83767fbbb0, full=) at ec-heald.c:161 #3 0x00007f83add0325b in ec_shd_full_heal (subvol=0x7f83a8010af0, entry=, parent=0x7f83767fbde0, data=0x7f83a8030960) at ec-heald.c:294 #4 0x00007f83bc930ac2 in syncop_ftw (subvol=0x7f83a8010af0, loc=loc at entry=0x7f83767fbde0, pid=pid at entry=-6, data=data at entry=0x7f83a8030960, fn=fn at entry=0x7f83add03140 ) at syncop-utils.c:125 #5 0x00007f83add03534 in ec_shd_full_sweep (healer=healer at entry=0x7f83a8030960, inode=) at ec-heald.c:311 #6 0x00007f83add0367b in ec_shd_full_healer (data=0x7f83a8030960) at ec-heald.c:372 #7 0x00007f83bb709e25 in start_thread () from /usr/lib64/libpthread.so.0 #8 0x00007f83bafd634d in clone () from /usr/lib64/libc.so.6 Thread 5 (Thread 0x7f8375ffb700 (LWP 25555)): #0 0x00007f83bb70d945 
in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0 #1 0x00007f83bc910e5b in syncop_getxattr (subvol=, loc=loc at entry=0x7f8375ffabb0, dict=dict at entry=0x0, key=key at entry=0x7f83add06a28 "trusted.ec.heal", xdata_in=xdata_in at entry=0x0, xdata_out=xdata_out at entry=0x0) at syncop.c:1680 #2 0x00007f83add02f27 in ec_shd_selfheal (healer=0x7f83a80309d0, child=, loc=0x7f8375ffabb0, full=) at ec-heald.c:161 #3 0x00007f83add0325b in ec_shd_full_heal (subvol=0x7f83a80144d0, entry=, parent=0x7f8375ffade0, data=0x7f83a80309d0) at ec-heald.c:294 #4 0x00007f83bc930ac2 in syncop_ftw (subvol=0x7f83a80144d0, loc=loc at entry=0x7f8375ffade0, pid=pid at entry=-6, data=data at entry=0x7f83a80309d0, fn=fn at entry=0x7f83add03140 ) at syncop-utils.c:125 #5 0x00007f83add03534 in ec_shd_full_sweep (healer=healer at entry=0x7f83a80309d0, inode=) at ec-heald.c:311 #6 0x00007f83add0367b in ec_shd_full_healer (data=0x7f83a80309d0) at ec-heald.c:372 #7 0x00007f83bb709e25 in start_thread () from /usr/lib64/libpthread.so.0 #8 0x00007f83bafd634d in clone () from /usr/lib64/libc.so.6 Thread 4 (Thread 0x7f83757fa700 (LWP 25556)): #0 0x00007f83bb70d945 in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0 #1 0x00007f83bc910e5b in syncop_getxattr (subvol=, loc=loc at entry=0x7f83757f9bb0, dict=dict at entry=0x0, key=key at entry=0x7f83add06a28 "trusted.ec.heal", xdata_in=xdata_in at entry=0x0, xdata_out=xdata_out at entry=0x0) at syncop.c:1680 #2 0x00007f83add02f27 in ec_shd_selfheal (healer=0x7f83a8030a40, child=, loc=0x7f83757f9bb0, full=) at ec-heald.c:161 #3 0x00007f83add0325b in ec_shd_full_heal (subvol=0x7f83a8017eb0, entry=, parent=0x7f83757f9de0, data=0x7f83a8030a40) at ec-heald.c:294 #4 0x00007f83bc930ac2 in syncop_ftw (subvol=0x7f83a8017eb0, loc=loc at entry=0x7f83757f9de0, pid=pid at entry=-6, data=data at entry=0x7f83a8030a40, fn=fn at entry=0x7f83add03140 ) at syncop-utils.c:125 #5 0x00007f83add03534 in ec_shd_full_sweep (healer=healer at entry=0x7f83a8030a40, inode=) at ec-heald.c:311 #6 0x00007f83add0367b in ec_shd_full_healer (data=0x7f83a8030a40) at ec-heald.c:372 #7 0x00007f83bb709e25 in start_thread () from /usr/lib64/libpthread.so.0 #8 0x00007f83bafd634d in clone () from /usr/lib64/libc.so.6 Thread 3 (Thread 0x7f8374ff9700 (LWP 25557)): #0 0x00007f83bb70d945 in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0 #1 0x00007f83bc910e5b in syncop_getxattr (subvol=, loc=loc at entry=0x7f8374ff8bb0, dict=dict at entry=0x0, key=key at entry=0x7f83add06a28 "trusted.ec.heal", xdata_in=xdata_in at entry=0x0, xdata_out=xdata_out at entry=0x0) at syncop.c:1680 #2 0x00007f83add02f27 in ec_shd_selfheal (healer=0x7f83a8030ab0, child=, loc=0x7f8374ff8bb0, full=) at ec-heald.c:161 #3 0x00007f83add0325b in ec_shd_full_heal (subvol=0x7f83a801b890, entry=, parent=0x7f8374ff8de0, data=0x7f83a8030ab0) at ec-heald.c:294 #4 0x00007f83bc930ac2 in syncop_ftw (subvol=0x7f83a801b890, loc=loc at entry=0x7f8374ff8de0, pid=pid at entry=-6, data=data at entry=0x7f83a8030ab0, fn=fn at entry=0x7f83add03140 ) at syncop-utils.c:125 #5 0x00007f83add03534 in ec_shd_full_sweep (healer=healer at entry=0x7f83a8030ab0, inode=) at ec-heald.c:311 #6 0x00007f83add0367b in ec_shd_full_healer (data=0x7f83a8030ab0) at ec-heald.c:372 #7 0x00007f83bb709e25 in start_thread () from /usr/lib64/libpthread.so.0 #8 0x00007f83bafd634d in clone () from /usr/lib64/libc.so.6 Thread 2 (Thread 0x7f8367fff700 (LWP 25558)): #0 0x00007f83bb70d945 in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0 
#1 0x00007f83bc910e5b in syncop_getxattr (subvol=, loc=loc at entry=0x7f8367ffebb0, dict=dict at entry=0x0, key=key at entry=0x7f83add06a28 "trusted.ec.heal", xdata_in=xdata_in at entry=0x0, xdata_out=xdata_out at entry=0x0) at syncop.c:1680 #2 0x00007f83add02f27 in ec_shd_selfheal (healer=0x7f83a8030b20, child=, loc=0x7f8367ffebb0, full=) at ec-heald.c:161 #3 0x00007f83add0325b in ec_shd_full_heal (subvol=0x7f83a801f270, entry=, parent=0x7f8367ffede0, data=0x7f83a8030b20) at ec-heald.c:294 #4 0x00007f83bc930ac2 in syncop_ftw (subvol=0x7f83a801f270, loc=loc at entry=0x7f8367ffede0, pid=pid at entry=-6, data=data at entry=0x7f83a8030b20, fn=fn at entry=0x7f83add03140 ) at syncop-utils.c:125 #5 0x00007f83add03534 in ec_shd_full_sweep (healer=healer at entry=0x7f83a8030b20, inode=) at ec-heald.c:311 #6 0x00007f83add0367b in ec_shd_full_healer (data=0x7f83a8030b20) at ec-heald.c:372 #7 0x00007f83bb709e25 in start_thread () from /usr/lib64/libpthread.so.0 #8 0x00007f83bafd634d in clone () from /usr/lib64/libc.so.6 Thread 1 (Thread 0x7f83bcdd1780 (LWP 25383)): #0 0x00007f83bb70af57 in pthread_join () from /usr/lib64/libpthread.so.0 #1 0x00007f83bc92eff8 in event_dispatch_epoll (event_pool=0x55af0a6dd560) at event-epoll.c:846 #2 0x000055af0a4116b8 in main (argc=15, argv=0x7fff75610898) at glusterfsd.c:2848 >>>>>>>>>>>>>>>>>>>>>>>>>>. Thanks, Mohit Agrawal On Fri 11 Jan, 2019, 21:20 Shyam Ranganathan We can check health on master post the patch as stated by Mohit below. > > Release-5 is causing some concerns as we need to tag the release > yesterday, but we have the following 2 tests failing or coredumping > pretty regularly, need attention on these. > > ec/bug-1236065.t > glusterd/add-brick-and-validate-replicated-volume-options.t > > Shyam > On 1/10/19 6:20 AM, Mohit Agrawal wrote: > > I think we should consider regression-builds after merged the patch > > (https://review.gluster.org/#/c/glusterfs/+/21990/) > > as we know this patch introduced some delay. > > > > Thanks, > > Mohit Agrawal > > > > On Thu, Jan 10, 2019 at 3:55 PM Atin Mukherjee > > wrote: > > > > Mohit, Sanju - request you to investigate the failures related to > > glusterd and brick-mux and report back to the list. > > > > On Thu, Jan 10, 2019 at 12:25 AM Shyam Ranganathan > > > wrote: > > > > Hi, > > > > As part of branching preparation next week for release-6, please > > find > > test failures and respective test links here [1]. > > > > The top tests that are failing/dumping-core are as below and > > need attention, > > - ec/bug-1236065.t > > - glusterd/add-brick-and-validate-replicated-volume-options.t > > - readdir-ahead/bug-1390050.t > > - glusterd/brick-mux-validation.t > > - bug-1432542-mpx-restart-crash.t > > > > Others of interest, > > - replicate/bug-1341650.t > > > > Please file a bug if needed against the test case and report the > > same > > here, in case a problem is already addressed, then do send back > the > > patch details that addresses this issue as a response to this > mail. > > > > Thanks, > > Shyam > > > > [1] Regression failures: > > https://hackmd.io/wsPgKjfJRWCP8ixHnYGqcA?view > > _______________________________________________ > > Gluster-devel mailing list > > Gluster-devel at gluster.org > > https://lists.gluster.org/mailman/listinfo/gluster-devel > > > > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From moagrawa at redhat.com Sat Jan 12 13:16:20 2019 From: moagrawa at redhat.com (Mohit Agrawal) Date: Sat, 12 Jan 2019 18:46:20 +0530 Subject: [Gluster-devel] Regression health for release-5.next and release-6 In-Reply-To: References: <89f54d02-8c78-a507-416a-c1ca1d7b4be2@redhat.com> <65f1c892-a3c4-2401-4827-dbe0277875b2@redhat.com> Message-ID: Previous logs related to client not bricks, below are the brick logs [2019-01-12 12:25:25.893485]:++++++++++ G_LOG:./tests/bugs/ec/bug-1236065.t: TEST: 68 rm -f 0.o 10.o 11.o 12.o 13.o 14.o 15.o 16.o 17.o 18.o 19.o 1.o 2.o 3.o 4.o 5.o 6.o 7.o 8.o 9.o ++++++++++ The message "I [MSGID: 101016] [glusterfs3.h:746:dict_to_xdr] 0-dict: key 'trusted.ec.size' would not be sent on wire in the future [Invalid argument]" repeated 199 times between [2019-01-12 12:25:25.283989] and [2019-01-12 12:25:25.899532] [2019-01-12 12:25:25.903375] E [MSGID: 113001] [posix-inode-fd-ops.c:4617:_posix_handle_xattr_keyvalue_pair] 8-patchy-posix: fgetxattr failed on gfid=d91f6331-d394-479d-ab51-6bcf674ac3e0 while doing xattrop: Key:trusted.ec.dirty (Bad file descriptor) [Bad file descriptor] [2019-01-12 12:25:25.903468] E [MSGID: 115073] [server-rpc-fops_v2.c:1805:server4_fxattrop_cbk] 0-patchy-server: 1486: FXATTROP 2 (d91f6331-d394-479d-ab51-6bcf674ac3e0), client: CTX_ID:b785c2b0-3453-4a03-b129-19e6ceeb5346-GRAPH_ID:0-PID:24147-HOST:softserve-moagrawa-test.1-PC_NAME:patchy-client-1-RECON_NO:-1, error-xlator: patchy-posix [Bad file descriptor] Thanks, Mohit Agrawal On Sat, Jan 12, 2019 at 6:29 PM Mohit Agrawal wrote: > > For specific to "add-brick-and-validate-replicated-volume-options.t" i > have posted a patch https://review.gluster.org/22015. > For test case "ec/bug-1236065.t" I think the issue needs to be check by ec > team > > On the brick side, it is showing below logs > > >>>>>>>>>>>>>>>>> > > on wire in the future [Invalid argument] > The message "I [MSGID: 101016] [glusterfs3.h:746:dict_to_xdr] 0-dict: key > 'trusted.ec.dirty' would not be sent on wire in the future [Invalid > argument]" repeated 3 times between [2019-01-12 12:25:25.902828] and > [2019-01-12 12:25:25.902992] > [2019-01-12 12:25:25.903553] W [MSGID: 114031] > [client-rpc-fops_v2.c:1614:client4_0_fxattrop_cbk] 0-patchy-client-1: > remote operation failed [Bad file descriptor] > [2019-01-12 12:25:25.903998] W [MSGID: 122040] > [ec-common.c:1181:ec_prepare_update_cbk] 0-patchy-disperse-0: Failed to get > size and version : FOP : 'FXATTROP' failed on gfid > d91f6331-d394-479d-ab51-6bcf674ac3e0 [Input/output error] > [2019-01-12 12:25:25.904059] W [fuse-bridge.c:1907:fuse_unlink_cbk] > 0-glusterfs-fuse: 3259: UNLINK() /test/0.o => -1 (Input/output error) > > >>>>>>>>>>>>>>>>>>> > > Test case is getting timed out because "volume heal $V0 full" command is > stuck, look's like shd is getting stuck at getxattr > > >>>>>>>>>>>>>>. 
> > Thread 8 (Thread 0x7f83777fe700 (LWP 25552)): > #0 0x00007f83bb70d945 in pthread_cond_wait@@GLIBC_2.3.2 () from > /usr/lib64/libpthread.so.0 > #1 0x00007f83bc910e5b in syncop_getxattr (subvol=, > loc=loc at entry=0x7f83777fdbb0, dict=dict at entry=0x0, key=key at entry=0x7f83add06a28 > "trusted.ec.heal", xdata_in=xdata_in at entry=0x0, xdata_out=xdata_out at entry=0x0) > at syncop.c:1680 > #2 0x00007f83add02f27 in ec_shd_selfheal (healer=0x7f83a8030880, > child=, loc=0x7f83777fdbb0, full=) at > ec-heald.c:161 > #3 0x00007f83add0325b in ec_shd_full_heal (subvol=0x7f83a80094b0, > entry=, parent=0x7f83777fdde0, data=0x7f83a8030880) at > ec-heald.c:294 > #4 0x00007f83bc930ac2 in syncop_ftw (subvol=0x7f83a80094b0, loc=loc at entry=0x7f83777fdde0, > pid=pid at entry=-6, data=data at entry=0x7f83a8030880, fn=fn at entry=0x7f83add03140 > ) at syncop-utils.c:125 > #5 0x00007f83add03534 in ec_shd_full_sweep (healer=healer at entry=0x7f83a8030880, > inode=) at ec-heald.c:311 > #6 0x00007f83add0367b in ec_shd_full_healer (data=0x7f83a8030880) at > ec-heald.c:372 > #7 0x00007f83bb709e25 in start_thread () from /usr/lib64/libpthread.so.0 > #8 0x00007f83bafd634d in clone () from /usr/lib64/libc.so.6 > Thread 7 (Thread 0x7f8376ffd700 (LWP 25553)): > #0 0x00007f83bb70d945 in pthread_cond_wait@@GLIBC_2.3.2 () from > /usr/lib64/libpthread.so.0 > #1 0x00007f83bc910e5b in syncop_getxattr (subvol=, > loc=loc at entry=0x7f8376ffcbb0, dict=dict at entry=0x0, key=key at entry=0x7f83add06a28 > "trusted.ec.heal", xdata_in=xdata_in at entry=0x0, xdata_out=xdata_out at entry=0x0) > at syncop.c:1680 > #2 0x00007f83add02f27 in ec_shd_selfheal (healer=0x7f83a80308f0, > child=, loc=0x7f8376ffcbb0, full=) at > ec-heald.c:161 > #3 0x00007f83add0325b in ec_shd_full_heal (subvol=0x7f83a800d110, > entry=, parent=0x7f8376ffcde0, data=0x7f83a80308f0) at > ec-heald.c:294 > #4 0x00007f83bc930ac2 in syncop_ftw (subvol=0x7f83a800d110, loc=loc at entry=0x7f8376ffcde0, > pid=pid at entry=-6, data=data at entry=0x7f83a80308f0, fn=fn at entry=0x7f83add03140 > ) at syncop-utils.c:125 > #5 0x00007f83add03534 in ec_shd_full_sweep (healer=healer at entry=0x7f83a80308f0, > inode=) at ec-heald.c:311 > #6 0x00007f83add0367b in ec_shd_full_healer (data=0x7f83a80308f0) at > ec-heald.c:372 > #7 0x00007f83bb709e25 in start_thread () from /usr/lib64/libpthread.so.0 > #8 0x00007f83bafd634d in clone () from /usr/lib64/libc.so.6 > Thread 6 (Thread 0x7f83767fc700 (LWP 25554)): > #0 0x00007f83bb70d945 in pthread_cond_wait@@GLIBC_2.3.2 () from > /usr/lib64/libpthread.so.0 > #1 0x00007f83bc910e5b in syncop_getxattr (subvol=, > loc=loc at entry=0x7f83767fbbb0, dict=dict at entry=0x0, key=key at entry=0x7f83add06a28 > "trusted.ec.heal", xdata_in=xdata_in at entry=0x0, xdata_out=xdata_out at entry=0x0) > at syncop.c:1680 > #2 0x00007f83add02f27 in ec_shd_selfheal (healer=0x7f83a8030960, > child=, loc=0x7f83767fbbb0, full=) at > ec-heald.c:161 > #3 0x00007f83add0325b in ec_shd_full_heal (subvol=0x7f83a8010af0, > entry=, parent=0x7f83767fbde0, data=0x7f83a8030960) at > ec-heald.c:294 > #4 0x00007f83bc930ac2 in syncop_ftw (subvol=0x7f83a8010af0, loc=loc at entry=0x7f83767fbde0, > pid=pid at entry=-6, data=data at entry=0x7f83a8030960, fn=fn at entry=0x7f83add03140 > ) at syncop-utils.c:125 > #5 0x00007f83add03534 in ec_shd_full_sweep (healer=healer at entry=0x7f83a8030960, > inode=) at ec-heald.c:311 > #6 0x00007f83add0367b in ec_shd_full_healer (data=0x7f83a8030960) at > ec-heald.c:372 > #7 0x00007f83bb709e25 in start_thread () from 
/usr/lib64/libpthread.so.0 > #8 0x00007f83bafd634d in clone () from /usr/lib64/libc.so.6 > Thread 5 (Thread 0x7f8375ffb700 (LWP 25555)): > #0 0x00007f83bb70d945 in pthread_cond_wait@@GLIBC_2.3.2 () from > /usr/lib64/libpthread.so.0 > #1 0x00007f83bc910e5b in syncop_getxattr (subvol=, > loc=loc at entry=0x7f8375ffabb0, dict=dict at entry=0x0, key=key at entry=0x7f83add06a28 > "trusted.ec.heal", xdata_in=xdata_in at entry=0x0, xdata_out=xdata_out at entry=0x0) > at syncop.c:1680 > #2 0x00007f83add02f27 in ec_shd_selfheal (healer=0x7f83a80309d0, > child=, loc=0x7f8375ffabb0, full=) at > ec-heald.c:161 > #3 0x00007f83add0325b in ec_shd_full_heal (subvol=0x7f83a80144d0, > entry=, parent=0x7f8375ffade0, data=0x7f83a80309d0) at > ec-heald.c:294 > #4 0x00007f83bc930ac2 in syncop_ftw (subvol=0x7f83a80144d0, loc=loc at entry=0x7f8375ffade0, > pid=pid at entry=-6, data=data at entry=0x7f83a80309d0, fn=fn at entry=0x7f83add03140 > ) at syncop-utils.c:125 > #5 0x00007f83add03534 in ec_shd_full_sweep (healer=healer at entry=0x7f83a80309d0, > inode=) at ec-heald.c:311 > #6 0x00007f83add0367b in ec_shd_full_healer (data=0x7f83a80309d0) at > ec-heald.c:372 > #7 0x00007f83bb709e25 in start_thread () from /usr/lib64/libpthread.so.0 > #8 0x00007f83bafd634d in clone () from /usr/lib64/libc.so.6 > Thread 4 (Thread 0x7f83757fa700 (LWP 25556)): > #0 0x00007f83bb70d945 in pthread_cond_wait@@GLIBC_2.3.2 () from > /usr/lib64/libpthread.so.0 > #1 0x00007f83bc910e5b in syncop_getxattr (subvol=, > loc=loc at entry=0x7f83757f9bb0, dict=dict at entry=0x0, key=key at entry=0x7f83add06a28 > "trusted.ec.heal", xdata_in=xdata_in at entry=0x0, xdata_out=xdata_out at entry=0x0) > at syncop.c:1680 > #2 0x00007f83add02f27 in ec_shd_selfheal (healer=0x7f83a8030a40, > child=, loc=0x7f83757f9bb0, full=) at > ec-heald.c:161 > #3 0x00007f83add0325b in ec_shd_full_heal (subvol=0x7f83a8017eb0, > entry=, parent=0x7f83757f9de0, data=0x7f83a8030a40) at > ec-heald.c:294 > #4 0x00007f83bc930ac2 in syncop_ftw (subvol=0x7f83a8017eb0, loc=loc at entry=0x7f83757f9de0, > pid=pid at entry=-6, data=data at entry=0x7f83a8030a40, fn=fn at entry=0x7f83add03140 > ) at syncop-utils.c:125 > #5 0x00007f83add03534 in ec_shd_full_sweep (healer=healer at entry=0x7f83a8030a40, > inode=) at ec-heald.c:311 > #6 0x00007f83add0367b in ec_shd_full_healer (data=0x7f83a8030a40) at > ec-heald.c:372 > #7 0x00007f83bb709e25 in start_thread () from /usr/lib64/libpthread.so.0 > #8 0x00007f83bafd634d in clone () from /usr/lib64/libc.so.6 > Thread 3 (Thread 0x7f8374ff9700 (LWP 25557)): > #0 0x00007f83bb70d945 in pthread_cond_wait@@GLIBC_2.3.2 () from > /usr/lib64/libpthread.so.0 > #1 0x00007f83bc910e5b in syncop_getxattr (subvol=, > loc=loc at entry=0x7f8374ff8bb0, dict=dict at entry=0x0, key=key at entry=0x7f83add06a28 > "trusted.ec.heal", xdata_in=xdata_in at entry=0x0, xdata_out=xdata_out at entry=0x0) > at syncop.c:1680 > #2 0x00007f83add02f27 in ec_shd_selfheal (healer=0x7f83a8030ab0, > child=, loc=0x7f8374ff8bb0, full=) at > ec-heald.c:161 > #3 0x00007f83add0325b in ec_shd_full_heal (subvol=0x7f83a801b890, > entry=, parent=0x7f8374ff8de0, data=0x7f83a8030ab0) at > ec-heald.c:294 > #4 0x00007f83bc930ac2 in syncop_ftw (subvol=0x7f83a801b890, loc=loc at entry=0x7f8374ff8de0, > pid=pid at entry=-6, data=data at entry=0x7f83a8030ab0, fn=fn at entry=0x7f83add03140 > ) at syncop-utils.c:125 > #5 0x00007f83add03534 in ec_shd_full_sweep (healer=healer at entry=0x7f83a8030ab0, > inode=) at ec-heald.c:311 > #6 0x00007f83add0367b in ec_shd_full_healer (data=0x7f83a8030ab0) at > 
ec-heald.c:372 > #7 0x00007f83bb709e25 in start_thread () from /usr/lib64/libpthread.so.0 > #8 0x00007f83bafd634d in clone () from /usr/lib64/libc.so.6 > Thread 2 (Thread 0x7f8367fff700 (LWP 25558)): > #0 0x00007f83bb70d945 in pthread_cond_wait@@GLIBC_2.3.2 () from > /usr/lib64/libpthread.so.0 > #1 0x00007f83bc910e5b in syncop_getxattr (subvol=, > loc=loc at entry=0x7f8367ffebb0, dict=dict at entry=0x0, key=key at entry=0x7f83add06a28 > "trusted.ec.heal", xdata_in=xdata_in at entry=0x0, xdata_out=xdata_out at entry=0x0) > at syncop.c:1680 > #2 0x00007f83add02f27 in ec_shd_selfheal (healer=0x7f83a8030b20, > child=, loc=0x7f8367ffebb0, full=) at > ec-heald.c:161 > #3 0x00007f83add0325b in ec_shd_full_heal (subvol=0x7f83a801f270, > entry=, parent=0x7f8367ffede0, data=0x7f83a8030b20) at > ec-heald.c:294 > #4 0x00007f83bc930ac2 in syncop_ftw (subvol=0x7f83a801f270, loc=loc at entry=0x7f8367ffede0, > pid=pid at entry=-6, data=data at entry=0x7f83a8030b20, fn=fn at entry=0x7f83add03140 > ) at syncop-utils.c:125 > #5 0x00007f83add03534 in ec_shd_full_sweep (healer=healer at entry=0x7f83a8030b20, > inode=) at ec-heald.c:311 > #6 0x00007f83add0367b in ec_shd_full_healer (data=0x7f83a8030b20) at > ec-heald.c:372 > #7 0x00007f83bb709e25 in start_thread () from /usr/lib64/libpthread.so.0 > #8 0x00007f83bafd634d in clone () from /usr/lib64/libc.so.6 > Thread 1 (Thread 0x7f83bcdd1780 (LWP 25383)): > #0 0x00007f83bb70af57 in pthread_join () from /usr/lib64/libpthread.so.0 > #1 0x00007f83bc92eff8 in event_dispatch_epoll (event_pool=0x55af0a6dd560) > at event-epoll.c:846 > #2 0x000055af0a4116b8 in main (argc=15, argv=0x7fff75610898) at > glusterfsd.c:2848 > > > >>>>>>>>>>>>>>>>>>>>>>>>>>. > > Thanks, > Mohit Agrawal > > On Fri 11 Jan, 2019, 21:20 Shyam Ranganathan >> We can check health on master post the patch as stated by Mohit below. >> >> Release-5 is causing some concerns as we need to tag the release >> yesterday, but we have the following 2 tests failing or coredumping >> pretty regularly, need attention on these. >> >> ec/bug-1236065.t >> glusterd/add-brick-and-validate-replicated-volume-options.t >> >> Shyam >> On 1/10/19 6:20 AM, Mohit Agrawal wrote: >> > I think we should consider regression-builds after merged the patch >> > (https://review.gluster.org/#/c/glusterfs/+/21990/) >> > as we know this patch introduced some delay. >> > >> > Thanks, >> > Mohit Agrawal >> > >> > On Thu, Jan 10, 2019 at 3:55 PM Atin Mukherjee > > > wrote: >> > >> > Mohit, Sanju - request you to investigate the failures related to >> > glusterd and brick-mux and report back to the list. >> > >> > On Thu, Jan 10, 2019 at 12:25 AM Shyam Ranganathan >> > > wrote: >> > >> > Hi, >> > >> > As part of branching preparation next week for release-6, please >> > find >> > test failures and respective test links here [1]. >> > >> > The top tests that are failing/dumping-core are as below and >> > need attention, >> > - ec/bug-1236065.t >> > - glusterd/add-brick-and-validate-replicated-volume-options.t >> > - readdir-ahead/bug-1390050.t >> > - glusterd/brick-mux-validation.t >> > - bug-1432542-mpx-restart-crash.t >> > >> > Others of interest, >> > - replicate/bug-1341650.t >> > >> > Please file a bug if needed against the test case and report the >> > same >> > here, in case a problem is already addressed, then do send back >> the >> > patch details that addresses this issue as a response to this >> mail. 
>> > >> > Thanks, >> > Shyam >> > >> > [1] Regression failures: >> > https://hackmd.io/wsPgKjfJRWCP8ixHnYGqcA?view >> > _______________________________________________ >> > Gluster-devel mailing list >> > Gluster-devel at gluster.org >> > https://lists.gluster.org/mailman/listinfo/gluster-devel >> > >> > >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jenkins at build.gluster.org Mon Jan 14 01:45:02 2019 From: jenkins at build.gluster.org (jenkins at build.gluster.org) Date: Mon, 14 Jan 2019 01:45:02 +0000 (UTC) Subject: [Gluster-devel] Weekly Untriaged Bugs Message-ID: <172507852.9.1547430303459.JavaMail.jenkins@jenkins-el7.rht.gluster.org> [...truncated 6 lines...] https://bugzilla.redhat.com/1660404 / core: Conditional freeing of string after returning from dict_set_dynstr function https://bugzilla.redhat.com/1665145 / core: Writes on Gluster 5 volumes fail with EIO when "cluster.consistent-metadata" is set https://bugzilla.redhat.com/1663337 / doc: Gluster documentation on quorum-reads option is incorrect https://bugzilla.redhat.com/1663205 / fuse: List dictionary is too slow https://bugzilla.redhat.com/1664524 / geo-replication: Non-root geo-replication session goes to faulty state, when the session is started https://bugzilla.redhat.com/1662178 / glusterd: Compilation fails for xlators/mgmt/glusterd/src with error "undefined reference to `dlclose'" https://bugzilla.redhat.com/1663247 / glusterd: remove static memory allocations from code https://bugzilla.redhat.com/1663519 / gluster-smb: Memory leak when smb.conf has "store dos attributes = yes" https://bugzilla.redhat.com/1665361 / project-infrastructure: Alerts for offline nodes https://bugzilla.redhat.com/1659934 / project-infrastructure: Cannot unsubscribe the review.gluster.org https://bugzilla.redhat.com/1663780 / project-infrastructure: On docs.gluster.org, we should convert spaces in folder or file names to 301 redirects to hypens https://bugzilla.redhat.com/1665677 / rdma: volume create and transport change with rdma failed https://bugzilla.redhat.com/1664215 / read-ahead: Toggling readdir-ahead translator off causes some clients to umount some of its volumes https://bugzilla.redhat.com/1661895 / replicate: [disperse] Dump respective itables in EC to statedumps. https://bugzilla.redhat.com/1662557 / replicate: glusterfs process crashes, causing "Transport endpoint not connected". https://bugzilla.redhat.com/1664398 / tests: ./tests/00-geo-rep/00-georep-verify-setup.t does not work with ./run-tests-in-vagrant.sh [...truncated 2 lines...] -------------- next part -------------- A non-text attachment was scrubbed... Name: build.log Type: application/octet-stream Size: 2220 bytes Desc: not available URL: From aspandey at redhat.com Mon Jan 14 10:06:22 2019 From: aspandey at redhat.com (Ashish Pandey) Date: Mon, 14 Jan 2019 05:06:22 -0500 (EST) Subject: [Gluster-devel] Regression health for release-5.next and release-6 In-Reply-To: References: <89f54d02-8c78-a507-416a-c1ca1d7b4be2@redhat.com> <65f1c892-a3c4-2401-4827-dbe0277875b2@redhat.com> Message-ID: <2134165472.57578088.1547460382588.JavaMail.zimbra@redhat.com> I downloaded logs of regression runs 1077 and 1073 and tried to investigate it. In both regression ec/bug-1236065.t is hanging on TEST 70 which is trying to get the online brick count I can see that in mount/bricks and glusterd logs it has not move forward after this test. 
glusterd.log - [2019-01-06 16:27:51.346408]:++++++++++ G_LOG:./tests/bugs/ec/bug-1236065.t: TEST: 70 5 online_brick_count ++++++++++ [2019-01-06 16:27:51.645014] I [MSGID: 106499] [glusterd-handler.c:4404:__glusterd_handle_status_volume] 0-management: Received status volume req for volume patchy [2019-01-06 16:27:51.646664] I [dict.c:2745:dict_get_str_boolean] (-->/build/install/lib/glusterfs/6dev/xlator/mgmt/glusterd.so(+0x4a6c3) [0x7f4c37fe06c3] -->/build/install/lib/glusterfs/6dev/xlator/mgmt/glusterd.so(+0x43b3a) [0x7f4c37fd9b3a] -->/build/install/lib/libglusterfs.so.0(dict_get_str_boolean+0x170) [0x7f4c433d83fb] ) 0-dict: key nfs.disable, integer type asked, has string type [Invalid argument] [2019-01-06 16:27:51.647177] I [dict.c:2361:dict_get_strn] (-->/build/install/lib/glusterfs/6dev/xlator/mgmt/glusterd.so(+0xffa32) [0x7f4c38095a32] -->/build/install/lib/glusterfs/6dev/xlator/mgmt/glusterd.so(+0x474ac) [0x7f4c37fdd4ac] -->/build/install/lib/libglusterfs.so.0(dict_get_strn+0x179) [0x7f4c433d7673] ) 0-dict: key brick0.rdma_port, string type asked, has integer type [Invalid argument] [2019-01-06 16:27:51.647227] I [dict.c:2361:dict_get_strn] (-->/build/install/lib/glusterfs/6dev/xlator/mgmt/glusterd.so(+0xffa32) [0x7f4c38095a32] -->/build/install/lib/glusterfs/6dev/xlator/mgmt/glusterd.so(+0x474ac) [0x7f4c37fdd4ac] -->/build/install/lib/libglusterfs.so.0(dict_get_strn+0x179) [0x7f4c433d7673] ) 0-dict: key brick1.rdma_port, string type asked, has integer type [Invalid argument] [2019-01-06 16:27:51.647292] I [dict.c:2361:dict_get_strn] (-->/build/install/lib/glusterfs/6dev/xlator/mgmt/glusterd.so(+0xffa32) [0x7f4c38095a32] -->/build/install/lib/glusterfs/6dev/xlator/mgmt/glusterd.so(+0x474ac) [0x7f4c37fdd4ac] -->/build/install/lib/libglusterfs.so.0(dict_get_strn+0x179) [0x7f4c433d7673] ) 0-dict: key brick2.rdma_port, string type asked, has integer type [Invalid argument] [2019-01-06 16:27:51.647333] I [dict.c:2361:dict_get_strn] (-->/build/install/lib/glusterfs/6dev/xlator/mgmt/glusterd.so(+0xffa32) [0x7f4c38095a32] -->/build/install/lib/glusterfs/6dev/xlator/mgmt/glusterd.so(+0x474ac) [0x7f4c37fdd4ac] -->/build/install/lib/libglusterfs.so.0(dict_get_strn+0x179) [0x7f4c433d7673] ) 0-dict: key brick3.rdma_port, string type asked, has integer type [Invalid argument] [2019-01-06 16:27:51.647371] I [dict.c:2361:dict_get_strn] (-->/build/install/lib/glusterfs/6dev/xlator/mgmt/glusterd.so(+0xffa32) [0x7f4c38095a32] -->/build/install/lib/glusterfs/6dev/xlator/mgmt/glusterd.so(+0x474ac) [0x7f4c37fdd4ac] -->/build/install/lib/libglusterfs.so.0(dict_get_strn+0x179) [0x7f4c433d7673] ) 0-dict: key brick4.rdma_port, string type asked, has integer type [Invalid argument] [2019-01-06 16:27:51.647409] I [dict.c:2361:dict_get_strn] (-->/build/install/lib/glusterfs/6dev/xlator/mgmt/glusterd.so(+0xffa32) [0x7f4c38095a32] -->/build/install/lib/glusterfs/6dev/xlator/mgmt/glusterd.so(+0x474ac) [0x7f4c37fdd4ac] -->/build/install/lib/libglusterfs.so.0(dict_get_strn+0x179) [0x7f4c433d7673] ) 0-dict: key brick5.rdma_port, string type asked, has integer type [Invalid argument] [2019-01-06 16:27:51.647447] I [dict.c:2361:dict_get_strn] (-->/build/install/lib/glusterfs/6dev/xlator/mgmt/glusterd.so(+0xffa32) [0x7f4c38095a32] -->/build/install/lib/glusterfs/6dev/xlator/mgmt/glusterd.so(+0x474ac) [0x7f4c37fdd4ac] -->/build/install/lib/libglusterfs.so.0(dict_get_strn+0x179) [0x7f4c433d7673] ) 0-dict: key brick6.rdma_port, string type asked, has integer type [Invalid argument] [2019-01-06 16:27:51.649335] E [MSGID: 
101191] [event-epoll.c:759:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch handler [2019-01-06 16:27:51.932871] I [MSGID: 106499] [glusterd-handler.c:4404:__glusterd_handle_status_volume] 0-management: Received status volume req for volume patchy It is just taking lot of time to get the status at this point. It looks like there could be some issue with connection or the handing of volume status when some bricks are down. --- Ashish ----- Original Message ----- From: "Mohit Agrawal" To: "Shyam Ranganathan" Cc: "Gluster Devel" Sent: Saturday, January 12, 2019 6:46:20 PM Subject: Re: [Gluster-devel] Regression health for release-5.next and release-6 Previous logs related to client not bricks, below are the brick logs [2019-01-12 12:25:25.893485]:++++++++++ G_LOG:./tests/bugs/ec/bug-1236065.t: TEST: 68 rm -f 0.o 10.o 11.o 12.o 13.o 14.o 15.o 16.o 17.o 18.o 19.o 1.o 2.o 3.o 4.o 5.o 6.o 7.o 8.o 9.o ++++++++++ The message "I [MSGID: 101016] [glusterfs3.h:746:dict_to_xdr] 0-dict: key 'trusted.ec.size' would not be sent on wire in the future [Invalid argument]" repeated 199 times between [2019-01-12 12:25:25.283989] and [2019-01-12 12:25:25.899532] [2019-01-12 12:25:25.903375] E [MSGID: 113001] [posix-inode-fd-ops.c:4617:_posix_handle_xattr_keyvalue_pair] 8-patchy-posix: fgetxattr failed on gfid=d91f6331-d394-479d-ab51-6bcf674ac3e0 while doing xattrop: Key:trusted.ec.dirty (Bad file descriptor) [Bad file descriptor] [2019-01-12 12:25:25.903468] E [MSGID: 115073] [server-rpc-fops_v2.c:1805:server4_fxattrop_cbk] 0-patchy-server: 1486: FXATTROP 2 (d91f6331-d394-479d-ab51-6bcf674ac3e0), client: CTX_ID:b785c2b0-3453-4a03-b129-19e6ceeb5346-GRAPH_ID:0-PID:24147-HOST:softserve-moagrawa-test.1-PC_NAME:patchy-client-1-RECON_NO:-1, error-xlator: patchy-posix [Bad file descriptor] Thanks, Mohit Agrawal On Sat, Jan 12, 2019 at 6:29 PM Mohit Agrawal < moagrawa at redhat.com > wrote: For specific to "add-brick-and-validate-replicated-volume-options.t" i have posted a patch https://review.gluster.org/22015 . For test case "ec/bug-1236065.t" I think the issue needs to be check by ec team On the brick side, it is showing below logs >>>>>>>>>>>>>>>>> on wire in the future [Invalid argument] The message "I [MSGID: 101016] [glusterfs3.h:746:dict_to_xdr] 0-dict: key 'trusted.ec.dirty' would not be sent on wire in the future [Invalid argument]" repeated 3 times between [2019-01-12 12:25:25.902828] and [2019-01-12 12:25:25.902992] [2019-01-12 12:25:25.903553] W [MSGID: 114031] [client-rpc-fops_v2.c:1614:client4_0_fxattrop_cbk] 0-patchy-client-1: remote operation failed [Bad file descriptor] [2019-01-12 12:25:25.903998] W [MSGID: 122040] [ec-common.c:1181:ec_prepare_update_cbk] 0-patchy-disperse-0: Failed to get size and version : FOP : 'FXATTROP' failed on gfid d91f6331-d394-479d-ab51-6bcf674ac3e0 [Input/output error] [2019-01-12 12:25:25.904059] W [fuse-bridge.c:1907:fuse_unlink_cbk] 0-glusterfs-fuse: 3259: UNLINK() /test/0.o => -1 (Input/output error) >>>>>>>>>>>>>>>>>>> Test case is getting timed out because "volume heal $V0 full" command is stuck, look's like shd is getting stuck at getxattr >>>>>>>>>>>>>>. 
Thread 8 (Thread 0x7f83777fe700 (LWP 25552)): #0 0x00007f83bb70d945 in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0 #1 0x00007f83bc910e5b in syncop_getxattr (subvol=, loc=loc at entry=0x7f83777fdbb0, dict=dict at entry=0x0, key=key at entry=0x7f83add06a28 "trusted.ec.heal", xdata_in=xdata_in at entry=0x0, xdata_out=xdata_out at entry=0x0) at syncop.c:1680 #2 0x00007f83add02f27 in ec_shd_selfheal (healer=0x7f83a8030880, child=, loc=0x7f83777fdbb0, full=) at ec-heald.c:161 #3 0x00007f83add0325b in ec_shd_full_heal (subvol=0x7f83a80094b0, entry=, parent=0x7f83777fdde0, data=0x7f83a8030880) at ec-heald.c:294 #4 0x00007f83bc930ac2 in syncop_ftw (subvol=0x7f83a80094b0, loc=loc at entry=0x7f83777fdde0, pid=pid at entry=-6, data=data at entry=0x7f83a8030880, fn=fn at entry=0x7f83add03140 ) at syncop-utils.c:125 #5 0x00007f83add03534 in ec_shd_full_sweep (healer=healer at entry=0x7f83a8030880, inode=) at ec-heald.c:311 #6 0x00007f83add0367b in ec_shd_full_healer (data=0x7f83a8030880) at ec-heald.c:372 #7 0x00007f83bb709e25 in start_thread () from /usr/lib64/libpthread.so.0 #8 0x00007f83bafd634d in clone () from /usr/lib64/libc.so.6 Thread 7 (Thread 0x7f8376ffd700 (LWP 25553)): #0 0x00007f83bb70d945 in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0 #1 0x00007f83bc910e5b in syncop_getxattr (subvol=, loc=loc at entry=0x7f8376ffcbb0, dict=dict at entry=0x0, key=key at entry=0x7f83add06a28 "trusted.ec.heal", xdata_in=xdata_in at entry=0x0, xdata_out=xdata_out at entry=0x0) at syncop.c:1680 #2 0x00007f83add02f27 in ec_shd_selfheal (healer=0x7f83a80308f0, child=, loc=0x7f8376ffcbb0, full=) at ec-heald.c:161 #3 0x00007f83add0325b in ec_shd_full_heal (subvol=0x7f83a800d110, entry=, parent=0x7f8376ffcde0, data=0x7f83a80308f0) at ec-heald.c:294 #4 0x00007f83bc930ac2 in syncop_ftw (subvol=0x7f83a800d110, loc=loc at entry=0x7f8376ffcde0, pid=pid at entry=-6, data=data at entry=0x7f83a80308f0, fn=fn at entry=0x7f83add03140 ) at syncop-utils.c:125 #5 0x00007f83add03534 in ec_shd_full_sweep (healer=healer at entry=0x7f83a80308f0, inode=) at ec-heald.c:311 #6 0x00007f83add0367b in ec_shd_full_healer (data=0x7f83a80308f0) at ec-heald.c:372 #7 0x00007f83bb709e25 in start_thread () from /usr/lib64/libpthread.so.0 #8 0x00007f83bafd634d in clone () from /usr/lib64/libc.so.6 Thread 6 (Thread 0x7f83767fc700 (LWP 25554)): #0 0x00007f83bb70d945 in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0 #1 0x00007f83bc910e5b in syncop_getxattr (subvol=, loc=loc at entry=0x7f83767fbbb0, dict=dict at entry=0x0, key=key at entry=0x7f83add06a28 "trusted.ec.heal", xdata_in=xdata_in at entry=0x0, xdata_out=xdata_out at entry=0x0) at syncop.c:1680 #2 0x00007f83add02f27 in ec_shd_selfheal (healer=0x7f83a8030960, child=, loc=0x7f83767fbbb0, full=) at ec-heald.c:161 #3 0x00007f83add0325b in ec_shd_full_heal (subvol=0x7f83a8010af0, entry=, parent=0x7f83767fbde0, data=0x7f83a8030960) at ec-heald.c:294 #4 0x00007f83bc930ac2 in syncop_ftw (subvol=0x7f83a8010af0, loc=loc at entry=0x7f83767fbde0, pid=pid at entry=-6, data=data at entry=0x7f83a8030960, fn=fn at entry=0x7f83add03140 ) at syncop-utils.c:125 #5 0x00007f83add03534 in ec_shd_full_sweep (healer=healer at entry=0x7f83a8030960, inode=) at ec-heald.c:311 #6 0x00007f83add0367b in ec_shd_full_healer (data=0x7f83a8030960) at ec-heald.c:372 #7 0x00007f83bb709e25 in start_thread () from /usr/lib64/libpthread.so.0 #8 0x00007f83bafd634d in clone () from /usr/lib64/libc.so.6 Thread 5 (Thread 0x7f8375ffb700 (LWP 25555)): #0 0x00007f83bb70d945 
in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0 #1 0x00007f83bc910e5b in syncop_getxattr (subvol=, loc=loc at entry=0x7f8375ffabb0, dict=dict at entry=0x0, key=key at entry=0x7f83add06a28 "trusted.ec.heal", xdata_in=xdata_in at entry=0x0, xdata_out=xdata_out at entry=0x0) at syncop.c:1680 #2 0x00007f83add02f27 in ec_shd_selfheal (healer=0x7f83a80309d0, child=, loc=0x7f8375ffabb0, full=) at ec-heald.c:161 #3 0x00007f83add0325b in ec_shd_full_heal (subvol=0x7f83a80144d0, entry=, parent=0x7f8375ffade0, data=0x7f83a80309d0) at ec-heald.c:294 #4 0x00007f83bc930ac2 in syncop_ftw (subvol=0x7f83a80144d0, loc=loc at entry=0x7f8375ffade0, pid=pid at entry=-6, data=data at entry=0x7f83a80309d0, fn=fn at entry=0x7f83add03140 ) at syncop-utils.c:125 #5 0x00007f83add03534 in ec_shd_full_sweep (healer=healer at entry=0x7f83a80309d0, inode=) at ec-heald.c:311 #6 0x00007f83add0367b in ec_shd_full_healer (data=0x7f83a80309d0) at ec-heald.c:372 #7 0x00007f83bb709e25 in start_thread () from /usr/lib64/libpthread.so.0 #8 0x00007f83bafd634d in clone () from /usr/lib64/libc.so.6 Thread 4 (Thread 0x7f83757fa700 (LWP 25556)): #0 0x00007f83bb70d945 in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0 #1 0x00007f83bc910e5b in syncop_getxattr (subvol=, loc=loc at entry=0x7f83757f9bb0, dict=dict at entry=0x0, key=key at entry=0x7f83add06a28 "trusted.ec.heal", xdata_in=xdata_in at entry=0x0, xdata_out=xdata_out at entry=0x0) at syncop.c:1680 #2 0x00007f83add02f27 in ec_shd_selfheal (healer=0x7f83a8030a40, child=, loc=0x7f83757f9bb0, full=) at ec-heald.c:161 #3 0x00007f83add0325b in ec_shd_full_heal (subvol=0x7f83a8017eb0, entry=, parent=0x7f83757f9de0, data=0x7f83a8030a40) at ec-heald.c:294 #4 0x00007f83bc930ac2 in syncop_ftw (subvol=0x7f83a8017eb0, loc=loc at entry=0x7f83757f9de0, pid=pid at entry=-6, data=data at entry=0x7f83a8030a40, fn=fn at entry=0x7f83add03140 ) at syncop-utils.c:125 #5 0x00007f83add03534 in ec_shd_full_sweep (healer=healer at entry=0x7f83a8030a40, inode=) at ec-heald.c:311 #6 0x00007f83add0367b in ec_shd_full_healer (data=0x7f83a8030a40) at ec-heald.c:372 #7 0x00007f83bb709e25 in start_thread () from /usr/lib64/libpthread.so.0 #8 0x00007f83bafd634d in clone () from /usr/lib64/libc.so.6 Thread 3 (Thread 0x7f8374ff9700 (LWP 25557)): #0 0x00007f83bb70d945 in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0 #1 0x00007f83bc910e5b in syncop_getxattr (subvol=, loc=loc at entry=0x7f8374ff8bb0, dict=dict at entry=0x0, key=key at entry=0x7f83add06a28 "trusted.ec.heal", xdata_in=xdata_in at entry=0x0, xdata_out=xdata_out at entry=0x0) at syncop.c:1680 #2 0x00007f83add02f27 in ec_shd_selfheal (healer=0x7f83a8030ab0, child=, loc=0x7f8374ff8bb0, full=) at ec-heald.c:161 #3 0x00007f83add0325b in ec_shd_full_heal (subvol=0x7f83a801b890, entry=, parent=0x7f8374ff8de0, data=0x7f83a8030ab0) at ec-heald.c:294 #4 0x00007f83bc930ac2 in syncop_ftw (subvol=0x7f83a801b890, loc=loc at entry=0x7f8374ff8de0, pid=pid at entry=-6, data=data at entry=0x7f83a8030ab0, fn=fn at entry=0x7f83add03140 ) at syncop-utils.c:125 #5 0x00007f83add03534 in ec_shd_full_sweep (healer=healer at entry=0x7f83a8030ab0, inode=) at ec-heald.c:311 #6 0x00007f83add0367b in ec_shd_full_healer (data=0x7f83a8030ab0) at ec-heald.c:372 #7 0x00007f83bb709e25 in start_thread () from /usr/lib64/libpthread.so.0 #8 0x00007f83bafd634d in clone () from /usr/lib64/libc.so.6 Thread 2 (Thread 0x7f8367fff700 (LWP 25558)): #0 0x00007f83bb70d945 in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0 
#1 0x00007f83bc910e5b in syncop_getxattr (subvol=, loc=loc at entry=0x7f8367ffebb0, dict=dict at entry=0x0, key=key at entry=0x7f83add06a28 "trusted.ec.heal", xdata_in=xdata_in at entry=0x0, xdata_out=xdata_out at entry=0x0) at syncop.c:1680 #2 0x00007f83add02f27 in ec_shd_selfheal (healer=0x7f83a8030b20, child=, loc=0x7f8367ffebb0, full=) at ec-heald.c:161 #3 0x00007f83add0325b in ec_shd_full_heal (subvol=0x7f83a801f270, entry=, parent=0x7f8367ffede0, data=0x7f83a8030b20) at ec-heald.c:294 #4 0x00007f83bc930ac2 in syncop_ftw (subvol=0x7f83a801f270, loc=loc at entry=0x7f8367ffede0, pid=pid at entry=-6, data=data at entry=0x7f83a8030b20, fn=fn at entry=0x7f83add03140 ) at syncop-utils.c:125 #5 0x00007f83add03534 in ec_shd_full_sweep (healer=healer at entry=0x7f83a8030b20, inode=) at ec-heald.c:311 #6 0x00007f83add0367b in ec_shd_full_healer (data=0x7f83a8030b20) at ec-heald.c:372 #7 0x00007f83bb709e25 in start_thread () from /usr/lib64/libpthread.so.0 #8 0x00007f83bafd634d in clone () from /usr/lib64/libc.so.6 Thread 1 (Thread 0x7f83bcdd1780 (LWP 25383)): #0 0x00007f83bb70af57 in pthread_join () from /usr/lib64/libpthread.so.0 #1 0x00007f83bc92eff8 in event_dispatch_epoll (event_pool=0x55af0a6dd560) at event-epoll.c:846 #2 0x000055af0a4116b8 in main (argc=15, argv=0x7fff75610898) at glusterfsd.c:2848 >>>>>>>>>>>>>>>>>>>>>>>>>>. Thanks, Mohit Agrawal On Fri 11 Jan, 2019, 21:20 Shyam Ranganathan < srangana at redhat.com wrote:
We can check health on master post the patch as stated by Mohit below. Release-5 is causing some concerns as we need to tag the release yesterday, but we have the following 2 tests failing or coredumping pretty regularly, need attention on these. ec/bug-1236065.t glusterd/add-brick-and-validate-replicated-volume-options.t Shyam On 1/10/19 6:20 AM, Mohit Agrawal wrote: > I think we should consider regression-builds after merged the patch > ( https://review.gluster.org/#/c/glusterfs/+/21990/ ) > as we know this patch introduced some delay. > > Thanks, > Mohit Agrawal > > On Thu, Jan 10, 2019 at 3:55 PM Atin Mukherjee < amukherj at redhat.com > > wrote: > > Mohit, Sanju - request you to investigate the failures related to > glusterd and brick-mux and report back to the list. > > On Thu, Jan 10, 2019 at 12:25 AM Shyam Ranganathan > < srangana at redhat.com > wrote: > > Hi, > > As part of branching preparation next week for release-6, please > find > test failures and respective test links here [1]. > > The top tests that are failing/dumping-core are as below and > need attention, > - ec/bug-1236065.t > - glusterd/add-brick-and-validate-replicated-volume-options.t > - readdir-ahead/bug-1390050.t > - glusterd/brick-mux-validation.t > - bug-1432542-mpx-restart-crash.t > > Others of interest, > - replicate/bug-1341650.t > > Please file a bug if needed against the test case and report the > same > here, in case a problem is already addressed, then do send back the > patch details that addresses this issue as a response to this mail. > > Thanks, > Shyam > > [1] Regression failures: > https://hackmd.io/wsPgKjfJRWCP8ixHnYGqcA?view > _______________________________________________ > Gluster-devel mailing list > Gluster-devel at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-devel > >
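One more data point that would help with the connection/status theory above: a statedump of glusterd taken while the status command is hanging should show whether glusterd is still waiting on a reply from a brick or is stuck somewhere else. A rough sketch of how to grab one (this assumes the default statedump directory /var/run/gluster, which the regression setup may override, and the usual complete=0 marker for pending frames; <pid> and <timestamp> are placeholders for the new dump file):

# While 'gluster volume status' is hanging on the node:
kill -USR1 $(pidof glusterd)
# The newest glusterdump file is the fresh dump:
ls -lt /var/run/gluster/glusterdump.*
# Count call frames that have not completed yet, if any:
grep -c 'complete=0' /var/run/gluster/glusterdump.<pid>.dump.<timestamp>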
_______________________________________________ Gluster-devel mailing list Gluster-devel at gluster.org https://lists.gluster.org/mailman/listinfo/gluster-devel -------------- next part -------------- An HTML attachment was scrubbed... URL: From jahernan at redhat.com Tue Jan 15 08:35:20 2019 From: jahernan at redhat.com (Xavi Hernandez) Date: Tue, 15 Jan 2019 09:35:20 +0100 Subject: [Gluster-devel] Regression health for release-5.next and release-6 In-Reply-To: <2134165472.57578088.1547460382588.JavaMail.zimbra@redhat.com> References: <89f54d02-8c78-a507-416a-c1ca1d7b4be2@redhat.com> <65f1c892-a3c4-2401-4827-dbe0277875b2@redhat.com> <2134165472.57578088.1547460382588.JavaMail.zimbra@redhat.com> Message-ID: On Mon, Jan 14, 2019 at 11:08 AM Ashish Pandey wrote: > > I downloaded logs of regression runs 1077 and 1073 and tried to > investigate it. > In both regression ec/bug-1236065.t is hanging on TEST 70 which is trying > to get the online brick count > > I can see that in mount/bricks and glusterd logs it has not move forward > after this test. > glusterd.log - > > [2019-01-06 16:27:51.346408]:++++++++++ > G_LOG:./tests/bugs/ec/bug-1236065.t: TEST: 70 5 online_brick_count > ++++++++++ > [2019-01-06 16:27:51.645014] I [MSGID: 106499] > [glusterd-handler.c:4404:__glusterd_handle_status_volume] 0-management: > Received status volume req for volume patchy > [2019-01-06 16:27:51.646664] I [dict.c:2745:dict_get_str_boolean] > (-->/build/install/lib/glusterfs/6dev/xlator/mgmt/glusterd.so(+0x4a6c3) > [0x7f4c37fe06c3] > -->/build/install/lib/glusterfs/6dev/xlator/mgmt/glusterd.so(+0x43b3a) > [0x7f4c37fd9b3a] > -->/build/install/lib/libglusterfs.so.0(dict_get_str_boolean+0x170) > [0x7f4c433d83fb] ) 0-dict: key nfs.disable, integer type asked, has string > type [Invalid argument] > [2019-01-06 16:27:51.647177] I [dict.c:2361:dict_get_strn] > (-->/build/install/lib/glusterfs/6dev/xlator/mgmt/glusterd.so(+0xffa32) > [0x7f4c38095a32] > -->/build/install/lib/glusterfs/6dev/xlator/mgmt/glusterd.so(+0x474ac) > [0x7f4c37fdd4ac] > -->/build/install/lib/libglusterfs.so.0(dict_get_strn+0x179) > [0x7f4c433d7673] ) 0-dict: key brick0.rdma_port, string type asked, has > integer type [Invalid argument] > [2019-01-06 16:27:51.647227] I [dict.c:2361:dict_get_strn] > (-->/build/install/lib/glusterfs/6dev/xlator/mgmt/glusterd.so(+0xffa32) > [0x7f4c38095a32] > -->/build/install/lib/glusterfs/6dev/xlator/mgmt/glusterd.so(+0x474ac) > [0x7f4c37fdd4ac] > -->/build/install/lib/libglusterfs.so.0(dict_get_strn+0x179) > [0x7f4c433d7673] ) 0-dict: key brick1.rdma_port, string type asked, has > integer type [Invalid argument] > [2019-01-06 16:27:51.647292] I [dict.c:2361:dict_get_strn] > (-->/build/install/lib/glusterfs/6dev/xlator/mgmt/glusterd.so(+0xffa32) > [0x7f4c38095a32] > -->/build/install/lib/glusterfs/6dev/xlator/mgmt/glusterd.so(+0x474ac) > [0x7f4c37fdd4ac] > -->/build/install/lib/libglusterfs.so.0(dict_get_strn+0x179) > [0x7f4c433d7673] ) 0-dict: key brick2.rdma_port, string type asked, has > integer type [Invalid argument] > [2019-01-06 16:27:51.647333] I [dict.c:2361:dict_get_strn] > (-->/build/install/lib/glusterfs/6dev/xlator/mgmt/glusterd.so(+0xffa32) > [0x7f4c38095a32] > -->/build/install/lib/glusterfs/6dev/xlator/mgmt/glusterd.so(+0x474ac) > [0x7f4c37fdd4ac] > -->/build/install/lib/libglusterfs.so.0(dict_get_strn+0x179) > [0x7f4c433d7673] ) 0-dict: key brick3.rdma_port, string type asked, has > integer type [Invalid argument] > [2019-01-06 16:27:51.647371] I [dict.c:2361:dict_get_strn] > 
(-->/build/install/lib/glusterfs/6dev/xlator/mgmt/glusterd.so(+0xffa32) > [0x7f4c38095a32] > -->/build/install/lib/glusterfs/6dev/xlator/mgmt/glusterd.so(+0x474ac) > [0x7f4c37fdd4ac] > -->/build/install/lib/libglusterfs.so.0(dict_get_strn+0x179) > [0x7f4c433d7673] ) 0-dict: key brick4.rdma_port, string type asked, has > integer type [Invalid argument] > [2019-01-06 16:27:51.647409] I [dict.c:2361:dict_get_strn] > (-->/build/install/lib/glusterfs/6dev/xlator/mgmt/glusterd.so(+0xffa32) > [0x7f4c38095a32] > -->/build/install/lib/glusterfs/6dev/xlator/mgmt/glusterd.so(+0x474ac) > [0x7f4c37fdd4ac] > -->/build/install/lib/libglusterfs.so.0(dict_get_strn+0x179) > [0x7f4c433d7673] ) 0-dict: key brick5.rdma_port, string type asked, has > integer type [Invalid argument] > [2019-01-06 16:27:51.647447] I [dict.c:2361:dict_get_strn] > (-->/build/install/lib/glusterfs/6dev/xlator/mgmt/glusterd.so(+0xffa32) > [0x7f4c38095a32] > -->/build/install/lib/glusterfs/6dev/xlator/mgmt/glusterd.so(+0x474ac) > [0x7f4c37fdd4ac] > -->/build/install/lib/libglusterfs.so.0(dict_get_strn+0x179) > [0x7f4c433d7673] ) 0-dict: key brick6.rdma_port, string type asked, has > integer type [Invalid argument] > [2019-01-06 16:27:51.649335] E [MSGID: 101191] > [event-epoll.c:759:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch > handler > [2019-01-06 16:27:51.932871] I [MSGID: 106499] > [glusterd-handler.c:4404:__glusterd_handle_status_volume] 0-management: > Received status volume req for volume patchy > > It is just taking lot of time to get the status at this point. > It looks like there could be some issue with connection or the handing of > volume status when some bricks are down. > The 'online_brick_count' check uses 'gluster volume status' to get some information, and it does that several times (currently 7). Looking at cmd_history.log, I see that after the 'online_brick_count' at line 70, only one 'gluster volume status' has completed. Apparently the second 'gluster volume status' is hung. 
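To narrow down which of those invocations hangs and to capture its state before the test times out, something like the script below could be run next to the test. This is only a rough sketch, not the test framework's own helper: it assumes the test's usual 'patchy' volume and uses an arbitrary 120 second ceiling per call.

#!/bin/bash
# Run the same status query the check depends on, with a deadline, and dump
# a backtrace of the CLI process if it is still running when the deadline expires.
VOL=patchy
for i in $(seq 1 7); do
    gluster volume status "$VOL" > /dev/null 2>&1 &
    cli=$!
    for _ in $(seq 1 120); do
        kill -0 "$cli" 2> /dev/null || break
        sleep 1
    done
    if kill -0 "$cli" 2> /dev/null; then
        echo "iteration $i: gluster CLI (pid $cli) still running after 120s"
        gdb -p "$cli" -batch -ex "thread apply all bt" > "/tmp/cli-bt.$i.txt" 2>&1
        kill "$cli"
    fi
done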
In cli.log I see that the second 'gluster volume status' seems to have started, but not finished: Normal run: [2019-01-08 16:36:43.628821] I [cli.c:834:main] 0-cli: Started running gluster with version 6dev [2019-01-08 16:36:43.808182] I [MSGID: 101190] [event-epoll.c:675:event_dispatch_epoll_worker] 0-epoll: Started thread with index 0 [2019-01-08 16:36:43.808287] I [MSGID: 101190] [event-epoll.c:675:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1 [2019-01-08 16:36:43.808432] E [MSGID: 101191] [event-epoll.c:759:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch handler [2019-01-08 16:36:43.816534] I [dict.c:1947:dict_get_uint32] (-->gluster(cli_cmd_process+0x1e4) [0x40db50] -->gluster(cli_cmd_volume_status_cbk+0x90) [0x415bec] -->/build/install/lib/libglusterfs.so.0(dict_get_uint32+0x176) [0x7fefe569456 9] ) 0-dict: key cmd, unsigned integer type asked, has integer type [Invalid argument] [2019-01-08 16:36:43.816716] I [dict.c:1947:dict_get_uint32] (-->gluster(cli_cmd_volume_status_cbk+0x1cb) [0x415d27] -->gluster(gf_cli_status_volume_all+0xc8) [0x42fa94] -->/build/install/lib/libglusterfs.so.0(dict_get_uint32+0x176) [0x7f efe5694569] ) 0-dict: key cmd, unsigned integer type asked, has integer type [Invalid argument] [2019-01-08 16:36:43.824437] I [input.c:31:cli_batch] 0-: Exiting with: 0 Bad run: [2019-01-08 16:36:43.940361] I [cli.c:834:main] 0-cli: Started running gluster with version 6dev [2019-01-08 16:36:44.147364] I [MSGID: 101190] [event-epoll.c:675:event_dispatch_epoll_worker] 0-epoll: Started thread with index 0 [2019-01-08 16:36:44.147477] I [MSGID: 101190] [event-epoll.c:675:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1 [2019-01-08 16:36:44.147583] E [MSGID: 101191] [event-epoll.c:759:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch handler In glusterd.log it seems as if it hasn't received any status request. It looks like the cli has not even connected to glusterd. 
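If someone manages to catch it live, it should be easy to confirm whether the CLI ever attempts the connection at all. A quick sketch (assuming the default glusterd port 24007 and that the pid of the hung gluster CLI is known; adjust both for the regression environment):

# Does the hung CLI hold, or keep retrying, a socket towards glusterd?
ss -tnp | grep 24007
# Or watch the socket/connect calls directly while reproducing the hang:
strace -f -e trace=socket,connect -p <pid-of-hung-gluster-cli>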
Xavi > --- > Ashish > > > > ------------------------------ > *From: *"Mohit Agrawal" > *To: *"Shyam Ranganathan" > *Cc: *"Gluster Devel" > *Sent: *Saturday, January 12, 2019 6:46:20 PM > *Subject: *Re: [Gluster-devel] Regression health for release-5.next > and release-6 > > Previous logs related to client not bricks, below are the brick logs > > [2019-01-12 12:25:25.893485]:++++++++++ > G_LOG:./tests/bugs/ec/bug-1236065.t: TEST: 68 rm -f 0.o 10.o 11.o 12.o 13.o > 14.o 15.o 16.o 17.o 18.o 19.o 1.o 2.o 3.o 4.o 5.o 6.o 7.o 8.o 9.o ++++++++++ > The message "I [MSGID: 101016] [glusterfs3.h:746:dict_to_xdr] 0-dict: key > 'trusted.ec.size' would not be sent on wire in the future [Invalid > argument]" repeated 199 times between [2019-01-12 12:25:25.283989] and > [2019-01-12 12:25:25.899532] > [2019-01-12 12:25:25.903375] E [MSGID: 113001] > [posix-inode-fd-ops.c:4617:_posix_handle_xattr_keyvalue_pair] > 8-patchy-posix: fgetxattr failed on > gfid=d91f6331-d394-479d-ab51-6bcf674ac3e0 while doing xattrop: > Key:trusted.ec.dirty (Bad file descriptor) [Bad file descriptor] > [2019-01-12 12:25:25.903468] E [MSGID: 115073] > [server-rpc-fops_v2.c:1805:server4_fxattrop_cbk] 0-patchy-server: 1486: > FXATTROP 2 (d91f6331-d394-479d-ab51-6bcf674ac3e0), client: > CTX_ID:b785c2b0-3453-4a03-b129-19e6ceeb5346-GRAPH_ID:0-PID:24147-HOST:softserve-moagrawa-test.1-PC_NAME:patchy-client-1-RECON_NO:-1, > error-xlator: patchy-posix [Bad file descriptor] > > > Thanks, > Mohit Agrawal > > On Sat, Jan 12, 2019 at 6:29 PM Mohit Agrawal wrote: > >> >> For specific to "add-brick-and-validate-replicated-volume-options.t" i >> have posted a patch https://review.gluster.org/22015. >> For test case "ec/bug-1236065.t" I think the issue needs to be check by >> ec team >> >> On the brick side, it is showing below logs >> >> >>>>>>>>>>>>>>>>> >> >> on wire in the future [Invalid argument] >> The message "I [MSGID: 101016] [glusterfs3.h:746:dict_to_xdr] 0-dict: key >> 'trusted.ec.dirty' would not be sent on wire in the future [Invalid >> argument]" repeated 3 times between [2019-01-12 12:25:25.902828] and >> [2019-01-12 12:25:25.902992] >> [2019-01-12 12:25:25.903553] W [MSGID: 114031] >> [client-rpc-fops_v2.c:1614:client4_0_fxattrop_cbk] 0-patchy-client-1: >> remote operation failed [Bad file descriptor] >> [2019-01-12 12:25:25.903998] W [MSGID: 122040] >> [ec-common.c:1181:ec_prepare_update_cbk] 0-patchy-disperse-0: Failed to get >> size and version : FOP : 'FXATTROP' failed on gfid >> d91f6331-d394-479d-ab51-6bcf674ac3e0 [Input/output error] >> [2019-01-12 12:25:25.904059] W [fuse-bridge.c:1907:fuse_unlink_cbk] >> 0-glusterfs-fuse: 3259: UNLINK() /test/0.o => -1 (Input/output error) >> >> >>>>>>>>>>>>>>>>>>> >> >> Test case is getting timed out because "volume heal $V0 full" command is >> stuck, look's like shd is getting stuck at getxattr >> >> >>>>>>>>>>>>>>. 
>> >> Thread 8 (Thread 0x7f83777fe700 (LWP 25552)): >> #0 0x00007f83bb70d945 in pthread_cond_wait@@GLIBC_2.3.2 () from >> /usr/lib64/libpthread.so.0 >> #1 0x00007f83bc910e5b in syncop_getxattr (subvol=, >> loc=loc at entry=0x7f83777fdbb0, dict=dict at entry=0x0, key=key at entry=0x7f83add06a28 >> "trusted.ec.heal", xdata_in=xdata_in at entry=0x0, xdata_out=xdata_out at entry=0x0) >> at syncop.c:1680 >> #2 0x00007f83add02f27 in ec_shd_selfheal (healer=0x7f83a8030880, >> child=, loc=0x7f83777fdbb0, full=) at >> ec-heald.c:161 >> #3 0x00007f83add0325b in ec_shd_full_heal (subvol=0x7f83a80094b0, >> entry=, parent=0x7f83777fdde0, data=0x7f83a8030880) at >> ec-heald.c:294 >> #4 0x00007f83bc930ac2 in syncop_ftw (subvol=0x7f83a80094b0, loc=loc at entry=0x7f83777fdde0, >> pid=pid at entry=-6, data=data at entry=0x7f83a8030880, fn=fn at entry=0x7f83add03140 >> ) at syncop-utils.c:125 >> #5 0x00007f83add03534 in ec_shd_full_sweep (healer=healer at entry=0x7f83a8030880, >> inode=) at ec-heald.c:311 >> #6 0x00007f83add0367b in ec_shd_full_healer (data=0x7f83a8030880) at >> ec-heald.c:372 >> #7 0x00007f83bb709e25 in start_thread () from /usr/lib64/libpthread.so.0 >> #8 0x00007f83bafd634d in clone () from /usr/lib64/libc.so.6 >> Thread 7 (Thread 0x7f8376ffd700 (LWP 25553)): >> #0 0x00007f83bb70d945 in pthread_cond_wait@@GLIBC_2.3.2 () from >> /usr/lib64/libpthread.so.0 >> #1 0x00007f83bc910e5b in syncop_getxattr (subvol=, >> loc=loc at entry=0x7f8376ffcbb0, dict=dict at entry=0x0, key=key at entry=0x7f83add06a28 >> "trusted.ec.heal", xdata_in=xdata_in at entry=0x0, xdata_out=xdata_out at entry=0x0) >> at syncop.c:1680 >> #2 0x00007f83add02f27 in ec_shd_selfheal (healer=0x7f83a80308f0, >> child=, loc=0x7f8376ffcbb0, full=) at >> ec-heald.c:161 >> #3 0x00007f83add0325b in ec_shd_full_heal (subvol=0x7f83a800d110, >> entry=, parent=0x7f8376ffcde0, data=0x7f83a80308f0) at >> ec-heald.c:294 >> #4 0x00007f83bc930ac2 in syncop_ftw (subvol=0x7f83a800d110, loc=loc at entry=0x7f8376ffcde0, >> pid=pid at entry=-6, data=data at entry=0x7f83a80308f0, fn=fn at entry=0x7f83add03140 >> ) at syncop-utils.c:125 >> #5 0x00007f83add03534 in ec_shd_full_sweep (healer=healer at entry=0x7f83a80308f0, >> inode=) at ec-heald.c:311 >> #6 0x00007f83add0367b in ec_shd_full_healer (data=0x7f83a80308f0) at >> ec-heald.c:372 >> #7 0x00007f83bb709e25 in start_thread () from /usr/lib64/libpthread.so.0 >> #8 0x00007f83bafd634d in clone () from /usr/lib64/libc.so.6 >> Thread 6 (Thread 0x7f83767fc700 (LWP 25554)): >> #0 0x00007f83bb70d945 in pthread_cond_wait@@GLIBC_2.3.2 () from >> /usr/lib64/libpthread.so.0 >> #1 0x00007f83bc910e5b in syncop_getxattr (subvol=, >> loc=loc at entry=0x7f83767fbbb0, dict=dict at entry=0x0, key=key at entry=0x7f83add06a28 >> "trusted.ec.heal", xdata_in=xdata_in at entry=0x0, xdata_out=xdata_out at entry=0x0) >> at syncop.c:1680 >> #2 0x00007f83add02f27 in ec_shd_selfheal (healer=0x7f83a8030960, >> child=, loc=0x7f83767fbbb0, full=) at >> ec-heald.c:161 >> #3 0x00007f83add0325b in ec_shd_full_heal (subvol=0x7f83a8010af0, >> entry=, parent=0x7f83767fbde0, data=0x7f83a8030960) at >> ec-heald.c:294 >> #4 0x00007f83bc930ac2 in syncop_ftw (subvol=0x7f83a8010af0, loc=loc at entry=0x7f83767fbde0, >> pid=pid at entry=-6, data=data at entry=0x7f83a8030960, fn=fn at entry=0x7f83add03140 >> ) at syncop-utils.c:125 >> #5 0x00007f83add03534 in ec_shd_full_sweep (healer=healer at entry=0x7f83a8030960, >> inode=) at ec-heald.c:311 >> #6 0x00007f83add0367b in ec_shd_full_healer (data=0x7f83a8030960) at >> ec-heald.c:372 >> #7 
0x00007f83bb709e25 in start_thread () from /usr/lib64/libpthread.so.0 >> #8 0x00007f83bafd634d in clone () from /usr/lib64/libc.so.6 >> Thread 5 (Thread 0x7f8375ffb700 (LWP 25555)): >> #0 0x00007f83bb70d945 in pthread_cond_wait@@GLIBC_2.3.2 () from >> /usr/lib64/libpthread.so.0 >> #1 0x00007f83bc910e5b in syncop_getxattr (subvol=, >> loc=loc at entry=0x7f8375ffabb0, dict=dict at entry=0x0, key=key at entry=0x7f83add06a28 >> "trusted.ec.heal", xdata_in=xdata_in at entry=0x0, xdata_out=xdata_out at entry=0x0) >> at syncop.c:1680 >> #2 0x00007f83add02f27 in ec_shd_selfheal (healer=0x7f83a80309d0, >> child=, loc=0x7f8375ffabb0, full=) at >> ec-heald.c:161 >> #3 0x00007f83add0325b in ec_shd_full_heal (subvol=0x7f83a80144d0, >> entry=, parent=0x7f8375ffade0, data=0x7f83a80309d0) at >> ec-heald.c:294 >> #4 0x00007f83bc930ac2 in syncop_ftw (subvol=0x7f83a80144d0, loc=loc at entry=0x7f8375ffade0, >> pid=pid at entry=-6, data=data at entry=0x7f83a80309d0, fn=fn at entry=0x7f83add03140 >> ) at syncop-utils.c:125 >> #5 0x00007f83add03534 in ec_shd_full_sweep (healer=healer at entry=0x7f83a80309d0, >> inode=) at ec-heald.c:311 >> #6 0x00007f83add0367b in ec_shd_full_healer (data=0x7f83a80309d0) at >> ec-heald.c:372 >> #7 0x00007f83bb709e25 in start_thread () from /usr/lib64/libpthread.so.0 >> #8 0x00007f83bafd634d in clone () from /usr/lib64/libc.so.6 >> Thread 4 (Thread 0x7f83757fa700 (LWP 25556)): >> #0 0x00007f83bb70d945 in pthread_cond_wait@@GLIBC_2.3.2 () from >> /usr/lib64/libpthread.so.0 >> #1 0x00007f83bc910e5b in syncop_getxattr (subvol=, >> loc=loc at entry=0x7f83757f9bb0, dict=dict at entry=0x0, key=key at entry=0x7f83add06a28 >> "trusted.ec.heal", xdata_in=xdata_in at entry=0x0, xdata_out=xdata_out at entry=0x0) >> at syncop.c:1680 >> #2 0x00007f83add02f27 in ec_shd_selfheal (healer=0x7f83a8030a40, >> child=, loc=0x7f83757f9bb0, full=) at >> ec-heald.c:161 >> #3 0x00007f83add0325b in ec_shd_full_heal (subvol=0x7f83a8017eb0, >> entry=, parent=0x7f83757f9de0, data=0x7f83a8030a40) at >> ec-heald.c:294 >> #4 0x00007f83bc930ac2 in syncop_ftw (subvol=0x7f83a8017eb0, loc=loc at entry=0x7f83757f9de0, >> pid=pid at entry=-6, data=data at entry=0x7f83a8030a40, fn=fn at entry=0x7f83add03140 >> ) at syncop-utils.c:125 >> #5 0x00007f83add03534 in ec_shd_full_sweep (healer=healer at entry=0x7f83a8030a40, >> inode=) at ec-heald.c:311 >> #6 0x00007f83add0367b in ec_shd_full_healer (data=0x7f83a8030a40) at >> ec-heald.c:372 >> #7 0x00007f83bb709e25 in start_thread () from /usr/lib64/libpthread.so.0 >> #8 0x00007f83bafd634d in clone () from /usr/lib64/libc.so.6 >> Thread 3 (Thread 0x7f8374ff9700 (LWP 25557)): >> #0 0x00007f83bb70d945 in pthread_cond_wait@@GLIBC_2.3.2 () from >> /usr/lib64/libpthread.so.0 >> #1 0x00007f83bc910e5b in syncop_getxattr (subvol=, >> loc=loc at entry=0x7f8374ff8bb0, dict=dict at entry=0x0, key=key at entry=0x7f83add06a28 >> "trusted.ec.heal", xdata_in=xdata_in at entry=0x0, xdata_out=xdata_out at entry=0x0) >> at syncop.c:1680 >> #2 0x00007f83add02f27 in ec_shd_selfheal (healer=0x7f83a8030ab0, >> child=, loc=0x7f8374ff8bb0, full=) at >> ec-heald.c:161 >> #3 0x00007f83add0325b in ec_shd_full_heal (subvol=0x7f83a801b890, >> entry=, parent=0x7f8374ff8de0, data=0x7f83a8030ab0) at >> ec-heald.c:294 >> #4 0x00007f83bc930ac2 in syncop_ftw (subvol=0x7f83a801b890, loc=loc at entry=0x7f8374ff8de0, >> pid=pid at entry=-6, data=data at entry=0x7f83a8030ab0, fn=fn at entry=0x7f83add03140 >> ) at syncop-utils.c:125 >> #5 0x00007f83add03534 in ec_shd_full_sweep (healer=healer at 
entry=0x7f83a8030ab0, >> inode=) at ec-heald.c:311 >> #6 0x00007f83add0367b in ec_shd_full_healer (data=0x7f83a8030ab0) at >> ec-heald.c:372 >> #7 0x00007f83bb709e25 in start_thread () from /usr/lib64/libpthread.so.0 >> #8 0x00007f83bafd634d in clone () from /usr/lib64/libc.so.6 >> Thread 2 (Thread 0x7f8367fff700 (LWP 25558)): >> #0 0x00007f83bb70d945 in pthread_cond_wait@@GLIBC_2.3.2 () from >> /usr/lib64/libpthread.so.0 >> #1 0x00007f83bc910e5b in syncop_getxattr (subvol=, >> loc=loc at entry=0x7f8367ffebb0, dict=dict at entry=0x0, key=key at entry=0x7f83add06a28 >> "trusted.ec.heal", xdata_in=xdata_in at entry=0x0, xdata_out=xdata_out at entry=0x0) >> at syncop.c:1680 >> #2 0x00007f83add02f27 in ec_shd_selfheal (healer=0x7f83a8030b20, >> child=, loc=0x7f8367ffebb0, full=) at >> ec-heald.c:161 >> #3 0x00007f83add0325b in ec_shd_full_heal (subvol=0x7f83a801f270, >> entry=, parent=0x7f8367ffede0, data=0x7f83a8030b20) at >> ec-heald.c:294 >> #4 0x00007f83bc930ac2 in syncop_ftw (subvol=0x7f83a801f270, loc=loc at entry=0x7f8367ffede0, >> pid=pid at entry=-6, data=data at entry=0x7f83a8030b20, fn=fn at entry=0x7f83add03140 >> ) at syncop-utils.c:125 >> #5 0x00007f83add03534 in ec_shd_full_sweep (healer=healer at entry=0x7f83a8030b20, >> inode=) at ec-heald.c:311 >> #6 0x00007f83add0367b in ec_shd_full_healer (data=0x7f83a8030b20) at >> ec-heald.c:372 >> #7 0x00007f83bb709e25 in start_thread () from /usr/lib64/libpthread.so.0 >> #8 0x00007f83bafd634d in clone () from /usr/lib64/libc.so.6 >> Thread 1 (Thread 0x7f83bcdd1780 (LWP 25383)): >> #0 0x00007f83bb70af57 in pthread_join () from /usr/lib64/libpthread.so.0 >> #1 0x00007f83bc92eff8 in event_dispatch_epoll >> (event_pool=0x55af0a6dd560) at event-epoll.c:846 >> #2 0x000055af0a4116b8 in main (argc=15, argv=0x7fff75610898) at >> glusterfsd.c:2848 >> >> >> >>>>>>>>>>>>>>>>>>>>>>>>>>. >> >> Thanks, >> Mohit Agrawal >> >> On Fri 11 Jan, 2019, 21:20 Shyam Ranganathan > >>> We can check health on master post the patch as stated by Mohit below. >>> >>> Release-5 is causing some concerns as we need to tag the release >>> yesterday, but we have the following 2 tests failing or coredumping >>> pretty regularly, need attention on these. >>> >>> ec/bug-1236065.t >>> glusterd/add-brick-and-validate-replicated-volume-options.t >>> >>> Shyam >>> On 1/10/19 6:20 AM, Mohit Agrawal wrote: >>> > I think we should consider regression-builds after merged the patch >>> > (https://review.gluster.org/#/c/glusterfs/+/21990/) >>> > as we know this patch introduced some delay. >>> > >>> > Thanks, >>> > Mohit Agrawal >>> > >>> > On Thu, Jan 10, 2019 at 3:55 PM Atin Mukherjee >> > > wrote: >>> > >>> > Mohit, Sanju - request you to investigate the failures related to >>> > glusterd and brick-mux and report back to the list. >>> > >>> > On Thu, Jan 10, 2019 at 12:25 AM Shyam Ranganathan >>> > > wrote: >>> > >>> > Hi, >>> > >>> > As part of branching preparation next week for release-6, >>> please >>> > find >>> > test failures and respective test links here [1]. 
>>> > >>> > The top tests that are failing/dumping-core are as below and >>> > need attention, >>> > - ec/bug-1236065.t >>> > - glusterd/add-brick-and-validate-replicated-volume-options.t >>> > - readdir-ahead/bug-1390050.t >>> > - glusterd/brick-mux-validation.t >>> > - bug-1432542-mpx-restart-crash.t >>> > >>> > Others of interest, >>> > - replicate/bug-1341650.t >>> > >>> > Please file a bug if needed against the test case and report >>> the >>> > same >>> > here, in case a problem is already addressed, then do send >>> back the >>> > patch details that addresses this issue as a response to this >>> mail. >>> > >>> > Thanks, >>> > Shyam >>> > >>> > [1] Regression failures: >>> > https://hackmd.io/wsPgKjfJRWCP8ixHnYGqcA?view >>> > _______________________________________________ >>> > Gluster-devel mailing list >>> > Gluster-devel at gluster.org >>> > https://lists.gluster.org/mailman/listinfo/gluster-devel >>> > >>> > >>> >> > _______________________________________________ > Gluster-devel mailing list > Gluster-devel at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-devel > > _______________________________________________ > Gluster-devel mailing list > Gluster-devel at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-devel -------------- next part -------------- An HTML attachment was scrubbed... URL: From atin.mukherjee83 at gmail.com Tue Jan 15 08:42:20 2019 From: atin.mukherjee83 at gmail.com (Atin Mukherjee) Date: Tue, 15 Jan 2019 14:12:20 +0530 Subject: [Gluster-devel] Regression health for release-5.next and release-6 In-Reply-To: References: <89f54d02-8c78-a507-416a-c1ca1d7b4be2@redhat.com> <65f1c892-a3c4-2401-4827-dbe0277875b2@redhat.com> <2134165472.57578088.1547460382588.JavaMail.zimbra@redhat.com> Message-ID: Interesting. I?ll do a deep dive at it sometime this week. On Tue, 15 Jan 2019 at 14:05, Xavi Hernandez wrote: > On Mon, Jan 14, 2019 at 11:08 AM Ashish Pandey > wrote: > >> >> I downloaded logs of regression runs 1077 and 1073 and tried to >> investigate it. >> In both regression ec/bug-1236065.t is hanging on TEST 70 which is >> trying to get the online brick count >> >> I can see that in mount/bricks and glusterd logs it has not move forward >> after this test. 
>> glusterd.log - >> >> [2019-01-06 16:27:51.346408]:++++++++++ >> G_LOG:./tests/bugs/ec/bug-1236065.t: TEST: 70 5 online_brick_count >> ++++++++++ >> [2019-01-06 16:27:51.645014] I [MSGID: 106499] >> [glusterd-handler.c:4404:__glusterd_handle_status_volume] 0-management: >> Received status volume req for volume patchy >> [2019-01-06 16:27:51.646664] I [dict.c:2745:dict_get_str_boolean] >> (-->/build/install/lib/glusterfs/6dev/xlator/mgmt/glusterd.so(+0x4a6c3) >> [0x7f4c37fe06c3] >> -->/build/install/lib/glusterfs/6dev/xlator/mgmt/glusterd.so(+0x43b3a) >> [0x7f4c37fd9b3a] >> -->/build/install/lib/libglusterfs.so.0(dict_get_str_boolean+0x170) >> [0x7f4c433d83fb] ) 0-dict: key nfs.disable, integer type asked, has string >> type [Invalid argument] >> [2019-01-06 16:27:51.647177] I [dict.c:2361:dict_get_strn] >> (-->/build/install/lib/glusterfs/6dev/xlator/mgmt/glusterd.so(+0xffa32) >> [0x7f4c38095a32] >> -->/build/install/lib/glusterfs/6dev/xlator/mgmt/glusterd.so(+0x474ac) >> [0x7f4c37fdd4ac] >> -->/build/install/lib/libglusterfs.so.0(dict_get_strn+0x179) >> [0x7f4c433d7673] ) 0-dict: key brick0.rdma_port, string type asked, has >> integer type [Invalid argument] >> [2019-01-06 16:27:51.647227] I [dict.c:2361:dict_get_strn] >> (-->/build/install/lib/glusterfs/6dev/xlator/mgmt/glusterd.so(+0xffa32) >> [0x7f4c38095a32] >> -->/build/install/lib/glusterfs/6dev/xlator/mgmt/glusterd.so(+0x474ac) >> [0x7f4c37fdd4ac] >> -->/build/install/lib/libglusterfs.so.0(dict_get_strn+0x179) >> [0x7f4c433d7673] ) 0-dict: key brick1.rdma_port, string type asked, has >> integer type [Invalid argument] >> [2019-01-06 16:27:51.647292] I [dict.c:2361:dict_get_strn] >> (-->/build/install/lib/glusterfs/6dev/xlator/mgmt/glusterd.so(+0xffa32) >> [0x7f4c38095a32] >> -->/build/install/lib/glusterfs/6dev/xlator/mgmt/glusterd.so(+0x474ac) >> [0x7f4c37fdd4ac] >> -->/build/install/lib/libglusterfs.so.0(dict_get_strn+0x179) >> [0x7f4c433d7673] ) 0-dict: key brick2.rdma_port, string type asked, has >> integer type [Invalid argument] >> [2019-01-06 16:27:51.647333] I [dict.c:2361:dict_get_strn] >> (-->/build/install/lib/glusterfs/6dev/xlator/mgmt/glusterd.so(+0xffa32) >> [0x7f4c38095a32] >> -->/build/install/lib/glusterfs/6dev/xlator/mgmt/glusterd.so(+0x474ac) >> [0x7f4c37fdd4ac] >> -->/build/install/lib/libglusterfs.so.0(dict_get_strn+0x179) >> [0x7f4c433d7673] ) 0-dict: key brick3.rdma_port, string type asked, has >> integer type [Invalid argument] >> [2019-01-06 16:27:51.647371] I [dict.c:2361:dict_get_strn] >> (-->/build/install/lib/glusterfs/6dev/xlator/mgmt/glusterd.so(+0xffa32) >> [0x7f4c38095a32] >> -->/build/install/lib/glusterfs/6dev/xlator/mgmt/glusterd.so(+0x474ac) >> [0x7f4c37fdd4ac] >> -->/build/install/lib/libglusterfs.so.0(dict_get_strn+0x179) >> [0x7f4c433d7673] ) 0-dict: key brick4.rdma_port, string type asked, has >> integer type [Invalid argument] >> [2019-01-06 16:27:51.647409] I [dict.c:2361:dict_get_strn] >> (-->/build/install/lib/glusterfs/6dev/xlator/mgmt/glusterd.so(+0xffa32) >> [0x7f4c38095a32] >> -->/build/install/lib/glusterfs/6dev/xlator/mgmt/glusterd.so(+0x474ac) >> [0x7f4c37fdd4ac] >> -->/build/install/lib/libglusterfs.so.0(dict_get_strn+0x179) >> [0x7f4c433d7673] ) 0-dict: key brick5.rdma_port, string type asked, has >> integer type [Invalid argument] >> [2019-01-06 16:27:51.647447] I [dict.c:2361:dict_get_strn] >> (-->/build/install/lib/glusterfs/6dev/xlator/mgmt/glusterd.so(+0xffa32) >> [0x7f4c38095a32] >> -->/build/install/lib/glusterfs/6dev/xlator/mgmt/glusterd.so(+0x474ac) >> 
[0x7f4c37fdd4ac] >> -->/build/install/lib/libglusterfs.so.0(dict_get_strn+0x179) >> [0x7f4c433d7673] ) 0-dict: key brick6.rdma_port, string type asked, has >> integer type [Invalid argument] >> [2019-01-06 16:27:51.649335] E [MSGID: 101191] >> [event-epoll.c:759:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch >> handler >> [2019-01-06 16:27:51.932871] I [MSGID: 106499] >> [glusterd-handler.c:4404:__glusterd_handle_status_volume] 0-management: >> Received status volume req for volume patchy >> >> It is just taking lot of time to get the status at this point. >> It looks like there could be some issue with connection or the handing of >> volume status when some bricks are down. >> > > The 'online_brick_count' check uses 'gluster volume status' to get some > information, and it does that several times (currently 7). Looking at > cmd_history.log, I see that after the 'online_brick_count' at line 70, only > one 'gluster volume status' has completed. Apparently the second 'gluster > volume status' is hung. > > In cli.log I see that the second 'gluster volume status' seems to have > started, but not finished: > > Normal run: > > [2019-01-08 16:36:43.628821] I [cli.c:834:main] 0-cli: Started running > gluster with version 6dev > [2019-01-08 16:36:43.808182] I [MSGID: 101190] > [event-epoll.c:675:event_dispatch_epoll_worker] 0-epoll: Started thread > with index 0 > [2019-01-08 16:36:43.808287] I [MSGID: 101190] > [event-epoll.c:675:event_dispatch_epoll_worker] 0-epoll: Started thread > with index 1 > [2019-01-08 16:36:43.808432] E [MSGID: 101191] > [event-epoll.c:759:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch > handler > [2019-01-08 16:36:43.816534] I [dict.c:1947:dict_get_uint32] > (-->gluster(cli_cmd_process+0x1e4) [0x40db50] > -->gluster(cli_cmd_volume_status_cbk+0x90) [0x415bec] > -->/build/install/lib/libglusterfs.so.0(dict_get_uint32+0x176) > [0x7fefe569456 > 9] ) 0-dict: key cmd, unsigned integer type asked, has integer type > [Invalid argument] > [2019-01-08 16:36:43.816716] I [dict.c:1947:dict_get_uint32] > (-->gluster(cli_cmd_volume_status_cbk+0x1cb) [0x415d27] > -->gluster(gf_cli_status_volume_all+0xc8) [0x42fa94] > -->/build/install/lib/libglusterfs.so.0(dict_get_uint32+0x176) [0x7f > efe5694569] ) 0-dict: key cmd, unsigned integer type asked, has integer > type [Invalid argument] > [2019-01-08 16:36:43.824437] I [input.c:31:cli_batch] 0-: Exiting with: 0 > > > Bad run: > > [2019-01-08 16:36:43.940361] I [cli.c:834:main] 0-cli: Started running > gluster with version 6dev > [2019-01-08 16:36:44.147364] I [MSGID: 101190] > [event-epoll.c:675:event_dispatch_epoll_worker] 0-epoll: Started thread > with index 0 > [2019-01-08 16:36:44.147477] I [MSGID: 101190] > [event-epoll.c:675:event_dispatch_epoll_worker] 0-epoll: Started thread > with index 1 > [2019-01-08 16:36:44.147583] E [MSGID: 101191] > [event-epoll.c:759:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch > handler > > > In glusterd.log it seems as if it hasn't received any status request. It > looks like the cli has not even connected to glusterd. 
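
To make the failure mode described above concrete, here is a minimal sketch of a
check of this kind. It is NOT the actual helper from the test framework's shared
scripts; the --xml parsing and the per-volume loop are assumptions for
illustration only. The point it shows is that each invocation is a separate
CLI round trip to glusterd, so a single hung 'gluster volume status' stalls
the whole test step:

    #!/bin/bash
    # Illustrative approximation only -- not the real online_brick_count helper.
    # Counts bricks reported online by 'gluster volume status', invoking the
    # CLI once per volume, which is why one hung invocation blocks the test.

    CLI="gluster --mode=script"

    online_brick_count_sketch () {
        local count=0
        local vol
        for vol in $($CLI volume list); do
            # Each iteration is an independent cli -> glusterd request.
            count=$((count + $($CLI --xml volume status "$vol" \
                                  | grep -c '<status>1</status>')))
        done
        echo "$count"
    }

    online_brick_count_sketch
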
> > Xavi > > >> --- >> Ashish >> >> >> >> ------------------------------ >> *From: *"Mohit Agrawal" >> *To: *"Shyam Ranganathan" >> *Cc: *"Gluster Devel" >> *Sent: *Saturday, January 12, 2019 6:46:20 PM >> *Subject: *Re: [Gluster-devel] Regression health for release-5.next >> and release-6 >> >> Previous logs related to client not bricks, below are the brick logs >> >> [2019-01-12 12:25:25.893485]:++++++++++ >> G_LOG:./tests/bugs/ec/bug-1236065.t: TEST: 68 rm -f 0.o 10.o 11.o 12.o 13.o >> 14.o 15.o 16.o 17.o 18.o 19.o 1.o 2.o 3.o 4.o 5.o 6.o 7.o 8.o 9.o ++++++++++ >> The message "I [MSGID: 101016] [glusterfs3.h:746:dict_to_xdr] 0-dict: key >> 'trusted.ec.size' would not be sent on wire in the future [Invalid >> argument]" repeated 199 times between [2019-01-12 12:25:25.283989] and >> [2019-01-12 12:25:25.899532] >> [2019-01-12 12:25:25.903375] E [MSGID: 113001] >> [posix-inode-fd-ops.c:4617:_posix_handle_xattr_keyvalue_pair] >> 8-patchy-posix: fgetxattr failed on >> gfid=d91f6331-d394-479d-ab51-6bcf674ac3e0 while doing xattrop: >> Key:trusted.ec.dirty (Bad file descriptor) [Bad file descriptor] >> [2019-01-12 12:25:25.903468] E [MSGID: 115073] >> [server-rpc-fops_v2.c:1805:server4_fxattrop_cbk] 0-patchy-server: 1486: >> FXATTROP 2 (d91f6331-d394-479d-ab51-6bcf674ac3e0), client: >> CTX_ID:b785c2b0-3453-4a03-b129-19e6ceeb5346-GRAPH_ID:0-PID:24147-HOST:softserve-moagrawa-test.1-PC_NAME:patchy-client-1-RECON_NO:-1, >> error-xlator: patchy-posix [Bad file descriptor] >> >> >> Thanks, >> Mohit Agrawal >> >> On Sat, Jan 12, 2019 at 6:29 PM Mohit Agrawal >> wrote: >> >>> >>> For specific to "add-brick-and-validate-replicated-volume-options.t" i >>> have posted a patch https://review.gluster.org/22015. >>> For test case "ec/bug-1236065.t" I think the issue needs to be check by >>> ec team >>> >>> On the brick side, it is showing below logs >>> >>> >>>>>>>>>>>>>>>>> >>> >>> on wire in the future [Invalid argument] >>> The message "I [MSGID: 101016] [glusterfs3.h:746:dict_to_xdr] 0-dict: >>> key 'trusted.ec.dirty' would not be sent on wire in the future [Invalid >>> argument]" repeated 3 times between [2019-01-12 12:25:25.902828] and >>> [2019-01-12 12:25:25.902992] >>> [2019-01-12 12:25:25.903553] W [MSGID: 114031] >>> [client-rpc-fops_v2.c:1614:client4_0_fxattrop_cbk] 0-patchy-client-1: >>> remote operation failed [Bad file descriptor] >>> [2019-01-12 12:25:25.903998] W [MSGID: 122040] >>> [ec-common.c:1181:ec_prepare_update_cbk] 0-patchy-disperse-0: Failed to get >>> size and version : FOP : 'FXATTROP' failed on gfid >>> d91f6331-d394-479d-ab51-6bcf674ac3e0 [Input/output error] >>> [2019-01-12 12:25:25.904059] W [fuse-bridge.c:1907:fuse_unlink_cbk] >>> 0-glusterfs-fuse: 3259: UNLINK() /test/0.o => -1 (Input/output error) >>> >>> >>>>>>>>>>>>>>>>>>> >>> >>> Test case is getting timed out because "volume heal $V0 full" command is >>> stuck, look's like shd is getting stuck at getxattr >>> >>> >>>>>>>>>>>>>>. 
>>> >>> Thread 8 (Thread 0x7f83777fe700 (LWP 25552)): >>> #0 0x00007f83bb70d945 in pthread_cond_wait@@GLIBC_2.3.2 () from >>> /usr/lib64/libpthread.so.0 >>> #1 0x00007f83bc910e5b in syncop_getxattr (subvol=, >>> loc=loc at entry=0x7f83777fdbb0, dict=dict at entry=0x0, key=key at entry=0x7f83add06a28 >>> "trusted.ec.heal", xdata_in=xdata_in at entry=0x0, >>> xdata_out=xdata_out at entry=0x0) at syncop.c:1680 >>> #2 0x00007f83add02f27 in ec_shd_selfheal (healer=0x7f83a8030880, >>> child=, loc=0x7f83777fdbb0, full=) at >>> ec-heald.c:161 >>> #3 0x00007f83add0325b in ec_shd_full_heal (subvol=0x7f83a80094b0, >>> entry=, parent=0x7f83777fdde0, data=0x7f83a8030880) at >>> ec-heald.c:294 >>> #4 0x00007f83bc930ac2 in syncop_ftw (subvol=0x7f83a80094b0, >>> loc=loc at entry=0x7f83777fdde0, pid=pid at entry=-6, data=data at entry=0x7f83a8030880, >>> fn=fn at entry=0x7f83add03140 ) at syncop-utils.c:125 >>> #5 0x00007f83add03534 in ec_shd_full_sweep (healer=healer at entry=0x7f83a8030880, >>> inode=) at ec-heald.c:311 >>> #6 0x00007f83add0367b in ec_shd_full_healer (data=0x7f83a8030880) at >>> ec-heald.c:372 >>> #7 0x00007f83bb709e25 in start_thread () from /usr/lib64/libpthread.so.0 >>> #8 0x00007f83bafd634d in clone () from /usr/lib64/libc.so.6 >>> Thread 7 (Thread 0x7f8376ffd700 (LWP 25553)): >>> #0 0x00007f83bb70d945 in pthread_cond_wait@@GLIBC_2.3.2 () from >>> /usr/lib64/libpthread.so.0 >>> #1 0x00007f83bc910e5b in syncop_getxattr (subvol=, >>> loc=loc at entry=0x7f8376ffcbb0, dict=dict at entry=0x0, key=key at entry=0x7f83add06a28 >>> "trusted.ec.heal", xdata_in=xdata_in at entry=0x0, >>> xdata_out=xdata_out at entry=0x0) at syncop.c:1680 >>> #2 0x00007f83add02f27 in ec_shd_selfheal (healer=0x7f83a80308f0, >>> child=, loc=0x7f8376ffcbb0, full=) at >>> ec-heald.c:161 >>> #3 0x00007f83add0325b in ec_shd_full_heal (subvol=0x7f83a800d110, >>> entry=, parent=0x7f8376ffcde0, data=0x7f83a80308f0) at >>> ec-heald.c:294 >>> #4 0x00007f83bc930ac2 in syncop_ftw (subvol=0x7f83a800d110, >>> loc=loc at entry=0x7f8376ffcde0, pid=pid at entry=-6, data=data at entry=0x7f83a80308f0, >>> fn=fn at entry=0x7f83add03140 ) at syncop-utils.c:125 >>> #5 0x00007f83add03534 in ec_shd_full_sweep (healer=healer at entry=0x7f83a80308f0, >>> inode=) at ec-heald.c:311 >>> #6 0x00007f83add0367b in ec_shd_full_healer (data=0x7f83a80308f0) at >>> ec-heald.c:372 >>> #7 0x00007f83bb709e25 in start_thread () from /usr/lib64/libpthread.so.0 >>> #8 0x00007f83bafd634d in clone () from /usr/lib64/libc.so.6 >>> Thread 6 (Thread 0x7f83767fc700 (LWP 25554)): >>> #0 0x00007f83bb70d945 in pthread_cond_wait@@GLIBC_2.3.2 () from >>> /usr/lib64/libpthread.so.0 >>> #1 0x00007f83bc910e5b in syncop_getxattr (subvol=, >>> loc=loc at entry=0x7f83767fbbb0, dict=dict at entry=0x0, key=key at entry=0x7f83add06a28 >>> "trusted.ec.heal", xdata_in=xdata_in at entry=0x0, >>> xdata_out=xdata_out at entry=0x0) at syncop.c:1680 >>> #2 0x00007f83add02f27 in ec_shd_selfheal (healer=0x7f83a8030960, >>> child=, loc=0x7f83767fbbb0, full=) at >>> ec-heald.c:161 >>> #3 0x00007f83add0325b in ec_shd_full_heal (subvol=0x7f83a8010af0, >>> entry=, parent=0x7f83767fbde0, data=0x7f83a8030960) at >>> ec-heald.c:294 >>> #4 0x00007f83bc930ac2 in syncop_ftw (subvol=0x7f83a8010af0, >>> loc=loc at entry=0x7f83767fbde0, pid=pid at entry=-6, data=data at entry=0x7f83a8030960, >>> fn=fn at entry=0x7f83add03140 ) at syncop-utils.c:125 >>> #5 0x00007f83add03534 in ec_shd_full_sweep (healer=healer at entry=0x7f83a8030960, >>> inode=) at ec-heald.c:311 >>> #6 0x00007f83add0367b in 
ec_shd_full_healer (data=0x7f83a8030960) at >>> ec-heald.c:372 >>> #7 0x00007f83bb709e25 in start_thread () from /usr/lib64/libpthread.so.0 >>> #8 0x00007f83bafd634d in clone () from /usr/lib64/libc.so.6 >>> Thread 5 (Thread 0x7f8375ffb700 (LWP 25555)): >>> #0 0x00007f83bb70d945 in pthread_cond_wait@@GLIBC_2.3.2 () from >>> /usr/lib64/libpthread.so.0 >>> #1 0x00007f83bc910e5b in syncop_getxattr (subvol=, >>> loc=loc at entry=0x7f8375ffabb0, dict=dict at entry=0x0, key=key at entry=0x7f83add06a28 >>> "trusted.ec.heal", xdata_in=xdata_in at entry=0x0, >>> xdata_out=xdata_out at entry=0x0) at syncop.c:1680 >>> #2 0x00007f83add02f27 in ec_shd_selfheal (healer=0x7f83a80309d0, >>> child=, loc=0x7f8375ffabb0, full=) at >>> ec-heald.c:161 >>> #3 0x00007f83add0325b in ec_shd_full_heal (subvol=0x7f83a80144d0, >>> entry=, parent=0x7f8375ffade0, data=0x7f83a80309d0) at >>> ec-heald.c:294 >>> #4 0x00007f83bc930ac2 in syncop_ftw (subvol=0x7f83a80144d0, >>> loc=loc at entry=0x7f8375ffade0, pid=pid at entry=-6, data=data at entry=0x7f83a80309d0, >>> fn=fn at entry=0x7f83add03140 ) at syncop-utils.c:125 >>> #5 0x00007f83add03534 in ec_shd_full_sweep (healer=healer at entry=0x7f83a80309d0, >>> inode=) at ec-heald.c:311 >>> #6 0x00007f83add0367b in ec_shd_full_healer (data=0x7f83a80309d0) at >>> ec-heald.c:372 >>> #7 0x00007f83bb709e25 in start_thread () from /usr/lib64/libpthread.so.0 >>> #8 0x00007f83bafd634d in clone () from /usr/lib64/libc.so.6 >>> Thread 4 (Thread 0x7f83757fa700 (LWP 25556)): >>> #0 0x00007f83bb70d945 in pthread_cond_wait@@GLIBC_2.3.2 () from >>> /usr/lib64/libpthread.so.0 >>> #1 0x00007f83bc910e5b in syncop_getxattr (subvol=, >>> loc=loc at entry=0x7f83757f9bb0, dict=dict at entry=0x0, key=key at entry=0x7f83add06a28 >>> "trusted.ec.heal", xdata_in=xdata_in at entry=0x0, >>> xdata_out=xdata_out at entry=0x0) at syncop.c:1680 >>> #2 0x00007f83add02f27 in ec_shd_selfheal (healer=0x7f83a8030a40, >>> child=, loc=0x7f83757f9bb0, full=) at >>> ec-heald.c:161 >>> #3 0x00007f83add0325b in ec_shd_full_heal (subvol=0x7f83a8017eb0, >>> entry=, parent=0x7f83757f9de0, data=0x7f83a8030a40) at >>> ec-heald.c:294 >>> #4 0x00007f83bc930ac2 in syncop_ftw (subvol=0x7f83a8017eb0, >>> loc=loc at entry=0x7f83757f9de0, pid=pid at entry=-6, data=data at entry=0x7f83a8030a40, >>> fn=fn at entry=0x7f83add03140 ) at syncop-utils.c:125 >>> #5 0x00007f83add03534 in ec_shd_full_sweep (healer=healer at entry=0x7f83a8030a40, >>> inode=) at ec-heald.c:311 >>> #6 0x00007f83add0367b in ec_shd_full_healer (data=0x7f83a8030a40) at >>> ec-heald.c:372 >>> #7 0x00007f83bb709e25 in start_thread () from /usr/lib64/libpthread.so.0 >>> #8 0x00007f83bafd634d in clone () from /usr/lib64/libc.so.6 >>> Thread 3 (Thread 0x7f8374ff9700 (LWP 25557)): >>> #0 0x00007f83bb70d945 in pthread_cond_wait@@GLIBC_2.3.2 () from >>> /usr/lib64/libpthread.so.0 >>> #1 0x00007f83bc910e5b in syncop_getxattr (subvol=, >>> loc=loc at entry=0x7f8374ff8bb0, dict=dict at entry=0x0, key=key at entry=0x7f83add06a28 >>> "trusted.ec.heal", xdata_in=xdata_in at entry=0x0, >>> xdata_out=xdata_out at entry=0x0) at syncop.c:1680 >>> #2 0x00007f83add02f27 in ec_shd_selfheal (healer=0x7f83a8030ab0, >>> child=, loc=0x7f8374ff8bb0, full=) at >>> ec-heald.c:161 >>> #3 0x00007f83add0325b in ec_shd_full_heal (subvol=0x7f83a801b890, >>> entry=, parent=0x7f8374ff8de0, data=0x7f83a8030ab0) at >>> ec-heald.c:294 >>> #4 0x00007f83bc930ac2 in syncop_ftw (subvol=0x7f83a801b890, >>> loc=loc at entry=0x7f8374ff8de0, pid=pid at entry=-6, data=data at entry=0x7f83a8030ab0, >>> 
fn=fn at entry=0x7f83add03140 ) at syncop-utils.c:125 >>> #5 0x00007f83add03534 in ec_shd_full_sweep (healer=healer at entry=0x7f83a8030ab0, >>> inode=) at ec-heald.c:311 >>> #6 0x00007f83add0367b in ec_shd_full_healer (data=0x7f83a8030ab0) at >>> ec-heald.c:372 >>> #7 0x00007f83bb709e25 in start_thread () from /usr/lib64/libpthread.so.0 >>> #8 0x00007f83bafd634d in clone () from /usr/lib64/libc.so.6 >>> Thread 2 (Thread 0x7f8367fff700 (LWP 25558)): >>> #0 0x00007f83bb70d945 in pthread_cond_wait@@GLIBC_2.3.2 () from >>> /usr/lib64/libpthread.so.0 >>> #1 0x00007f83bc910e5b in syncop_getxattr (subvol=, >>> loc=loc at entry=0x7f8367ffebb0, dict=dict at entry=0x0, key=key at entry=0x7f83add06a28 >>> "trusted.ec.heal", xdata_in=xdata_in at entry=0x0, >>> xdata_out=xdata_out at entry=0x0) at syncop.c:1680 >>> #2 0x00007f83add02f27 in ec_shd_selfheal (healer=0x7f83a8030b20, >>> child=, loc=0x7f8367ffebb0, full=) at >>> ec-heald.c:161 >>> #3 0x00007f83add0325b in ec_shd_full_heal (subvol=0x7f83a801f270, >>> entry=, parent=0x7f8367ffede0, data=0x7f83a8030b20) at >>> ec-heald.c:294 >>> #4 0x00007f83bc930ac2 in syncop_ftw (subvol=0x7f83a801f270, >>> loc=loc at entry=0x7f8367ffede0, pid=pid at entry=-6, data=data at entry=0x7f83a8030b20, >>> fn=fn at entry=0x7f83add03140 ) at syncop-utils.c:125 >>> #5 0x00007f83add03534 in ec_shd_full_sweep (healer=healer at entry=0x7f83a8030b20, >>> inode=) at ec-heald.c:311 >>> #6 0x00007f83add0367b in ec_shd_full_healer (data=0x7f83a8030b20) at >>> ec-heald.c:372 >>> #7 0x00007f83bb709e25 in start_thread () from /usr/lib64/libpthread.so.0 >>> #8 0x00007f83bafd634d in clone () from /usr/lib64/libc.so.6 >>> Thread 1 (Thread 0x7f83bcdd1780 (LWP 25383)): >>> #0 0x00007f83bb70af57 in pthread_join () from /usr/lib64/libpthread.so.0 >>> #1 0x00007f83bc92eff8 in event_dispatch_epoll >>> (event_pool=0x55af0a6dd560) at event-epoll.c:846 >>> #2 0x000055af0a4116b8 in main (argc=15, argv=0x7fff75610898) at >>> glusterfsd.c:2848 >>> >>> >>> >>>>>>>>>>>>>>>>>>>>>>>>>>. >>> >>> Thanks, >>> Mohit Agrawal >>> >>> On Fri 11 Jan, 2019, 21:20 Shyam Ranganathan >> >>>> We can check health on master post the patch as stated by Mohit below. >>>> >>>> Release-5 is causing some concerns as we need to tag the release >>>> yesterday, but we have the following 2 tests failing or coredumping >>>> pretty regularly, need attention on these. >>>> >>>> ec/bug-1236065.t >>>> glusterd/add-brick-and-validate-replicated-volume-options.t >>>> >>>> Shyam >>>> On 1/10/19 6:20 AM, Mohit Agrawal wrote: >>>> > I think we should consider regression-builds after merged the patch >>>> > (https://review.gluster.org/#/c/glusterfs/+/21990/) >>>> > as we know this patch introduced some delay. >>>> > >>>> > Thanks, >>>> > Mohit Agrawal >>>> > >>>> > On Thu, Jan 10, 2019 at 3:55 PM Atin Mukherjee >>> > > wrote: >>>> > >>>> > Mohit, Sanju - request you to investigate the failures related to >>>> > glusterd and brick-mux and report back to the list. >>>> > >>>> > On Thu, Jan 10, 2019 at 12:25 AM Shyam Ranganathan >>>> > > wrote: >>>> > >>>> > Hi, >>>> > >>>> > As part of branching preparation next week for release-6, >>>> please >>>> > find >>>> > test failures and respective test links here [1]. 
>>>> > >>>> > The top tests that are failing/dumping-core are as below and >>>> > need attention, >>>> > - ec/bug-1236065.t >>>> > - glusterd/add-brick-and-validate-replicated-volume-options.t >>>> > - readdir-ahead/bug-1390050.t >>>> > - glusterd/brick-mux-validation.t >>>> > - bug-1432542-mpx-restart-crash.t >>>> > >>>> > Others of interest, >>>> > - replicate/bug-1341650.t >>>> > >>>> > Please file a bug if needed against the test case and report >>>> the >>>> > same >>>> > here, in case a problem is already addressed, then do send >>>> back the >>>> > patch details that addresses this issue as a response to this >>>> mail. >>>> > >>>> > Thanks, >>>> > Shyam >>>> > >>>> > [1] Regression failures: >>>> > https://hackmd.io/wsPgKjfJRWCP8ixHnYGqcA?view >>>> > _______________________________________________ >>>> > Gluster-devel mailing list >>>> > Gluster-devel at gluster.org >>>> > https://lists.gluster.org/mailman/listinfo/gluster-devel >>>> > >>>> > >>>> >>> >> _______________________________________________ >> Gluster-devel mailing list >> Gluster-devel at gluster.org >> https://lists.gluster.org/mailman/listinfo/gluster-devel >> >> _______________________________________________ >> Gluster-devel mailing list >> Gluster-devel at gluster.org >> https://lists.gluster.org/mailman/listinfo/gluster-devel > > _______________________________________________ > Gluster-devel mailing list > Gluster-devel at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-devel -- --Atin -------------- next part -------------- An HTML attachment was scrubbed... URL: From ykaul at redhat.com Tue Jan 15 10:15:51 2019 From: ykaul at redhat.com (Yaniv Kaul) Date: Tue, 15 Jan 2019 12:15:51 +0200 Subject: [Gluster-devel] Regression health for release-5.next and release-6 In-Reply-To: References: <89f54d02-8c78-a507-416a-c1ca1d7b4be2@redhat.com> <65f1c892-a3c4-2401-4827-dbe0277875b2@redhat.com> <2134165472.57578088.1547460382588.JavaMail.zimbra@redhat.com> Message-ID: On Tue, Jan 15, 2019 at 10:35 AM Xavi Hernandez wrote: > On Mon, Jan 14, 2019 at 11:08 AM Ashish Pandey > wrote: > >> >> I downloaded logs of regression runs 1077 and 1073 and tried to >> investigate it. >> In both regression ec/bug-1236065.t is hanging on TEST 70 which is >> trying to get the online brick count >> >> I can see that in mount/bricks and glusterd logs it has not move forward >> after this test. 
>> glusterd.log - >> >> [2019-01-06 16:27:51.346408]:++++++++++ >> G_LOG:./tests/bugs/ec/bug-1236065.t: TEST: 70 5 online_brick_count >> ++++++++++ >> [2019-01-06 16:27:51.645014] I [MSGID: 106499] >> [glusterd-handler.c:4404:__glusterd_handle_status_volume] 0-management: >> Received status volume req for volume patchy >> [2019-01-06 16:27:51.646664] I [dict.c:2745:dict_get_str_boolean] >> (-->/build/install/lib/glusterfs/6dev/xlator/mgmt/glusterd.so(+0x4a6c3) >> [0x7f4c37fe06c3] >> -->/build/install/lib/glusterfs/6dev/xlator/mgmt/glusterd.so(+0x43b3a) >> [0x7f4c37fd9b3a] >> -->/build/install/lib/libglusterfs.so.0(dict_get_str_boolean+0x170) >> [0x7f4c433d83fb] ) 0-dict: key nfs.disable, integer type asked, has string >> type [Invalid argument] >> [2019-01-06 16:27:51.647177] I [dict.c:2361:dict_get_strn] >> (-->/build/install/lib/glusterfs/6dev/xlator/mgmt/glusterd.so(+0xffa32) >> [0x7f4c38095a32] >> -->/build/install/lib/glusterfs/6dev/xlator/mgmt/glusterd.so(+0x474ac) >> [0x7f4c37fdd4ac] >> -->/build/install/lib/libglusterfs.so.0(dict_get_strn+0x179) >> [0x7f4c433d7673] ) 0-dict: key brick0.rdma_port, string type asked, has >> integer type [Invalid argument] >> [2019-01-06 16:27:51.647227] I [dict.c:2361:dict_get_strn] >> (-->/build/install/lib/glusterfs/6dev/xlator/mgmt/glusterd.so(+0xffa32) >> [0x7f4c38095a32] >> -->/build/install/lib/glusterfs/6dev/xlator/mgmt/glusterd.so(+0x474ac) >> [0x7f4c37fdd4ac] >> -->/build/install/lib/libglusterfs.so.0(dict_get_strn+0x179) >> [0x7f4c433d7673] ) 0-dict: key brick1.rdma_port, string type asked, has >> integer type [Invalid argument] >> [2019-01-06 16:27:51.647292] I [dict.c:2361:dict_get_strn] >> (-->/build/install/lib/glusterfs/6dev/xlator/mgmt/glusterd.so(+0xffa32) >> [0x7f4c38095a32] >> -->/build/install/lib/glusterfs/6dev/xlator/mgmt/glusterd.so(+0x474ac) >> [0x7f4c37fdd4ac] >> -->/build/install/lib/libglusterfs.so.0(dict_get_strn+0x179) >> [0x7f4c433d7673] ) 0-dict: key brick2.rdma_port, string type asked, has >> integer type [Invalid argument] >> [2019-01-06 16:27:51.647333] I [dict.c:2361:dict_get_strn] >> (-->/build/install/lib/glusterfs/6dev/xlator/mgmt/glusterd.so(+0xffa32) >> [0x7f4c38095a32] >> -->/build/install/lib/glusterfs/6dev/xlator/mgmt/glusterd.so(+0x474ac) >> [0x7f4c37fdd4ac] >> -->/build/install/lib/libglusterfs.so.0(dict_get_strn+0x179) >> [0x7f4c433d7673] ) 0-dict: key brick3.rdma_port, string type asked, has >> integer type [Invalid argument] >> [2019-01-06 16:27:51.647371] I [dict.c:2361:dict_get_strn] >> (-->/build/install/lib/glusterfs/6dev/xlator/mgmt/glusterd.so(+0xffa32) >> [0x7f4c38095a32] >> -->/build/install/lib/glusterfs/6dev/xlator/mgmt/glusterd.so(+0x474ac) >> [0x7f4c37fdd4ac] >> -->/build/install/lib/libglusterfs.so.0(dict_get_strn+0x179) >> [0x7f4c433d7673] ) 0-dict: key brick4.rdma_port, string type asked, has >> integer type [Invalid argument] >> [2019-01-06 16:27:51.647409] I [dict.c:2361:dict_get_strn] >> (-->/build/install/lib/glusterfs/6dev/xlator/mgmt/glusterd.so(+0xffa32) >> [0x7f4c38095a32] >> -->/build/install/lib/glusterfs/6dev/xlator/mgmt/glusterd.so(+0x474ac) >> [0x7f4c37fdd4ac] >> -->/build/install/lib/libglusterfs.so.0(dict_get_strn+0x179) >> [0x7f4c433d7673] ) 0-dict: key brick5.rdma_port, string type asked, has >> integer type [Invalid argument] >> [2019-01-06 16:27:51.647447] I [dict.c:2361:dict_get_strn] >> (-->/build/install/lib/glusterfs/6dev/xlator/mgmt/glusterd.so(+0xffa32) >> [0x7f4c38095a32] >> -->/build/install/lib/glusterfs/6dev/xlator/mgmt/glusterd.so(+0x474ac) >> 
[0x7f4c37fdd4ac] >> -->/build/install/lib/libglusterfs.so.0(dict_get_strn+0x179) >> [0x7f4c433d7673] ) 0-dict: key brick6.rdma_port, string type asked, has >> integer type [Invalid argument] >> [2019-01-06 16:27:51.649335] E [MSGID: 101191] >> [event-epoll.c:759:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch >> handler >> [2019-01-06 16:27:51.932871] I [MSGID: 106499] >> [glusterd-handler.c:4404:__glusterd_handle_status_volume] 0-management: >> Received status volume req for volume patchy >> >> It is just taking lot of time to get the status at this point. >> It looks like there could be some issue with connection or the handing of >> volume status when some bricks are down. >> > > The 'online_brick_count' check uses 'gluster volume status' to get some > information, and it does that several times (currently 7). Looking at > cmd_history.log, I see that after the 'online_brick_count' at line 70, only > one 'gluster volume status' has completed. Apparently the second 'gluster > volume status' is hung. > > In cli.log I see that the second 'gluster volume status' seems to have > started, but not finished: > > Normal run: > > [2019-01-08 16:36:43.628821] I [cli.c:834:main] 0-cli: Started running > gluster with version 6dev > [2019-01-08 16:36:43.808182] I [MSGID: 101190] > [event-epoll.c:675:event_dispatch_epoll_worker] 0-epoll: Started thread > with index 0 > [2019-01-08 16:36:43.808287] I [MSGID: 101190] > [event-epoll.c:675:event_dispatch_epoll_worker] 0-epoll: Started thread > with index 1 > [2019-01-08 16:36:43.808432] E [MSGID: 101191] > [event-epoll.c:759:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch > handler > [2019-01-08 16:36:43.816534] I [dict.c:1947:dict_get_uint32] > (-->gluster(cli_cmd_process+0x1e4) [0x40db50] > -->gluster(cli_cmd_volume_status_cbk+0x90) [0x415bec] > -->/build/install/lib/libglusterfs.so.0(dict_get_uint32+0x176) > [0x7fefe569456 > 9] ) 0-dict: key cmd, unsigned integer type asked, has integer type > [Invalid argument] > [2019-01-08 16:36:43.816716] I [dict.c:1947:dict_get_uint32] > (-->gluster(cli_cmd_volume_status_cbk+0x1cb) [0x415d27] > -->gluster(gf_cli_status_volume_all+0xc8) [0x42fa94] > -->/build/install/lib/libglusterfs.so.0(dict_get_uint32+0x176) [0x7f > efe5694569] ) 0-dict: key cmd, unsigned integer type asked, has integer > type [Invalid argument] > > While most likely unrelated to this specific issue, we should clean up all those issues. They are adding noise to debugging the real issue(s) and pollute our logs. Y. > [2019-01-08 16:36:43.824437] I [input.c:31:cli_batch] 0-: Exiting with: 0 > > > Bad run: > > [2019-01-08 16:36:43.940361] I [cli.c:834:main] 0-cli: Started running > gluster with version 6dev > [2019-01-08 16:36:44.147364] I [MSGID: 101190] > [event-epoll.c:675:event_dispatch_epoll_worker] 0-epoll: Started thread > with index 0 > [2019-01-08 16:36:44.147477] I [MSGID: 101190] > [event-epoll.c:675:event_dispatch_epoll_worker] 0-epoll: Started thread > with index 1 > [2019-01-08 16:36:44.147583] E [MSGID: 101191] > [event-epoll.c:759:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch > handler > > > In glusterd.log it seems as if it hasn't received any status request. It > looks like the cli has not even connected to glusterd. 
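
Given that the second 'gluster volume status' appears to hang before it even
connects to glusterd, a quick way to gather evidence the next time a run gets
stuck is to grab backtraces of any lingering cli processes plus a glusterd
statedump. The following is only a sketch of that idea; the output paths and
the pgrep pattern are arbitrary choices and none of this is part of the
regression framework today:

    #!/bin/bash
    # Sketch only, for debugging a run that is already stuck on a status call:
    # collect backtraces of gluster cli processes plus a glusterd statedump.

    mkdir -p /tmp/hangdump

    # Backtrace of every gluster cli still running (e.g. the hung
    # 'gluster volume status' spawned by the online_brick_count check).
    for pid in $(pgrep -x gluster); do
        gdb -p "$pid" -batch -ex 'thread apply all bt' \
            > "/tmp/hangdump/cli-bt.$pid.txt" 2>&1
    done

    # SIGUSR1 makes glusterd write a statedump (by default under
    # /var/run/gluster), which should show whether the request reached it.
    pkill -USR1 -x glusterd
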
> > Xavi > > >> --- >> Ashish >> >> >> >> ------------------------------ >> *From: *"Mohit Agrawal" >> *To: *"Shyam Ranganathan" >> *Cc: *"Gluster Devel" >> *Sent: *Saturday, January 12, 2019 6:46:20 PM >> *Subject: *Re: [Gluster-devel] Regression health for release-5.next >> and release-6 >> >> Previous logs related to client not bricks, below are the brick logs >> >> [2019-01-12 12:25:25.893485]:++++++++++ >> G_LOG:./tests/bugs/ec/bug-1236065.t: TEST: 68 rm -f 0.o 10.o 11.o 12.o 13.o >> 14.o 15.o 16.o 17.o 18.o 19.o 1.o 2.o 3.o 4.o 5.o 6.o 7.o 8.o 9.o ++++++++++ >> The message "I [MSGID: 101016] [glusterfs3.h:746:dict_to_xdr] 0-dict: key >> 'trusted.ec.size' would not be sent on wire in the future [Invalid >> argument]" repeated 199 times between [2019-01-12 12:25:25.283989] and >> [2019-01-12 12:25:25.899532] >> [2019-01-12 12:25:25.903375] E [MSGID: 113001] >> [posix-inode-fd-ops.c:4617:_posix_handle_xattr_keyvalue_pair] >> 8-patchy-posix: fgetxattr failed on >> gfid=d91f6331-d394-479d-ab51-6bcf674ac3e0 while doing xattrop: >> Key:trusted.ec.dirty (Bad file descriptor) [Bad file descriptor] >> [2019-01-12 12:25:25.903468] E [MSGID: 115073] >> [server-rpc-fops_v2.c:1805:server4_fxattrop_cbk] 0-patchy-server: 1486: >> FXATTROP 2 (d91f6331-d394-479d-ab51-6bcf674ac3e0), client: >> CTX_ID:b785c2b0-3453-4a03-b129-19e6ceeb5346-GRAPH_ID:0-PID:24147-HOST:softserve-moagrawa-test.1-PC_NAME:patchy-client-1-RECON_NO:-1, >> error-xlator: patchy-posix [Bad file descriptor] >> >> >> Thanks, >> Mohit Agrawal >> >> On Sat, Jan 12, 2019 at 6:29 PM Mohit Agrawal >> wrote: >> >>> >>> For specific to "add-brick-and-validate-replicated-volume-options.t" i >>> have posted a patch https://review.gluster.org/22015. >>> For test case "ec/bug-1236065.t" I think the issue needs to be check by >>> ec team >>> >>> On the brick side, it is showing below logs >>> >>> >>>>>>>>>>>>>>>>> >>> >>> on wire in the future [Invalid argument] >>> The message "I [MSGID: 101016] [glusterfs3.h:746:dict_to_xdr] 0-dict: >>> key 'trusted.ec.dirty' would not be sent on wire in the future [Invalid >>> argument]" repeated 3 times between [2019-01-12 12:25:25.902828] and >>> [2019-01-12 12:25:25.902992] >>> [2019-01-12 12:25:25.903553] W [MSGID: 114031] >>> [client-rpc-fops_v2.c:1614:client4_0_fxattrop_cbk] 0-patchy-client-1: >>> remote operation failed [Bad file descriptor] >>> [2019-01-12 12:25:25.903998] W [MSGID: 122040] >>> [ec-common.c:1181:ec_prepare_update_cbk] 0-patchy-disperse-0: Failed to get >>> size and version : FOP : 'FXATTROP' failed on gfid >>> d91f6331-d394-479d-ab51-6bcf674ac3e0 [Input/output error] >>> [2019-01-12 12:25:25.904059] W [fuse-bridge.c:1907:fuse_unlink_cbk] >>> 0-glusterfs-fuse: 3259: UNLINK() /test/0.o => -1 (Input/output error) >>> >>> >>>>>>>>>>>>>>>>>>> >>> >>> Test case is getting timed out because "volume heal $V0 full" command is >>> stuck, look's like shd is getting stuck at getxattr >>> >>> >>>>>>>>>>>>>>. 
>>> >>> Thread 8 (Thread 0x7f83777fe700 (LWP 25552)): >>> #0 0x00007f83bb70d945 in pthread_cond_wait@@GLIBC_2.3.2 () from >>> /usr/lib64/libpthread.so.0 >>> #1 0x00007f83bc910e5b in syncop_getxattr (subvol=, >>> loc=loc at entry=0x7f83777fdbb0, dict=dict at entry=0x0, key=key at entry=0x7f83add06a28 >>> "trusted.ec.heal", xdata_in=xdata_in at entry=0x0, >>> xdata_out=xdata_out at entry=0x0) at syncop.c:1680 >>> #2 0x00007f83add02f27 in ec_shd_selfheal (healer=0x7f83a8030880, >>> child=, loc=0x7f83777fdbb0, full=) at >>> ec-heald.c:161 >>> #3 0x00007f83add0325b in ec_shd_full_heal (subvol=0x7f83a80094b0, >>> entry=, parent=0x7f83777fdde0, data=0x7f83a8030880) at >>> ec-heald.c:294 >>> #4 0x00007f83bc930ac2 in syncop_ftw (subvol=0x7f83a80094b0, >>> loc=loc at entry=0x7f83777fdde0, pid=pid at entry=-6, data=data at entry=0x7f83a8030880, >>> fn=fn at entry=0x7f83add03140 ) at syncop-utils.c:125 >>> #5 0x00007f83add03534 in ec_shd_full_sweep (healer=healer at entry=0x7f83a8030880, >>> inode=) at ec-heald.c:311 >>> #6 0x00007f83add0367b in ec_shd_full_healer (data=0x7f83a8030880) at >>> ec-heald.c:372 >>> #7 0x00007f83bb709e25 in start_thread () from /usr/lib64/libpthread.so.0 >>> #8 0x00007f83bafd634d in clone () from /usr/lib64/libc.so.6 >>> Thread 7 (Thread 0x7f8376ffd700 (LWP 25553)): >>> #0 0x00007f83bb70d945 in pthread_cond_wait@@GLIBC_2.3.2 () from >>> /usr/lib64/libpthread.so.0 >>> #1 0x00007f83bc910e5b in syncop_getxattr (subvol=, >>> loc=loc at entry=0x7f8376ffcbb0, dict=dict at entry=0x0, key=key at entry=0x7f83add06a28 >>> "trusted.ec.heal", xdata_in=xdata_in at entry=0x0, >>> xdata_out=xdata_out at entry=0x0) at syncop.c:1680 >>> #2 0x00007f83add02f27 in ec_shd_selfheal (healer=0x7f83a80308f0, >>> child=, loc=0x7f8376ffcbb0, full=) at >>> ec-heald.c:161 >>> #3 0x00007f83add0325b in ec_shd_full_heal (subvol=0x7f83a800d110, >>> entry=, parent=0x7f8376ffcde0, data=0x7f83a80308f0) at >>> ec-heald.c:294 >>> #4 0x00007f83bc930ac2 in syncop_ftw (subvol=0x7f83a800d110, >>> loc=loc at entry=0x7f8376ffcde0, pid=pid at entry=-6, data=data at entry=0x7f83a80308f0, >>> fn=fn at entry=0x7f83add03140 ) at syncop-utils.c:125 >>> #5 0x00007f83add03534 in ec_shd_full_sweep (healer=healer at entry=0x7f83a80308f0, >>> inode=) at ec-heald.c:311 >>> #6 0x00007f83add0367b in ec_shd_full_healer (data=0x7f83a80308f0) at >>> ec-heald.c:372 >>> #7 0x00007f83bb709e25 in start_thread () from /usr/lib64/libpthread.so.0 >>> #8 0x00007f83bafd634d in clone () from /usr/lib64/libc.so.6 >>> Thread 6 (Thread 0x7f83767fc700 (LWP 25554)): >>> #0 0x00007f83bb70d945 in pthread_cond_wait@@GLIBC_2.3.2 () from >>> /usr/lib64/libpthread.so.0 >>> #1 0x00007f83bc910e5b in syncop_getxattr (subvol=, >>> loc=loc at entry=0x7f83767fbbb0, dict=dict at entry=0x0, key=key at entry=0x7f83add06a28 >>> "trusted.ec.heal", xdata_in=xdata_in at entry=0x0, >>> xdata_out=xdata_out at entry=0x0) at syncop.c:1680 >>> #2 0x00007f83add02f27 in ec_shd_selfheal (healer=0x7f83a8030960, >>> child=, loc=0x7f83767fbbb0, full=) at >>> ec-heald.c:161 >>> #3 0x00007f83add0325b in ec_shd_full_heal (subvol=0x7f83a8010af0, >>> entry=, parent=0x7f83767fbde0, data=0x7f83a8030960) at >>> ec-heald.c:294 >>> #4 0x00007f83bc930ac2 in syncop_ftw (subvol=0x7f83a8010af0, >>> loc=loc at entry=0x7f83767fbde0, pid=pid at entry=-6, data=data at entry=0x7f83a8030960, >>> fn=fn at entry=0x7f83add03140 ) at syncop-utils.c:125 >>> #5 0x00007f83add03534 in ec_shd_full_sweep (healer=healer at entry=0x7f83a8030960, >>> inode=) at ec-heald.c:311 >>> #6 0x00007f83add0367b in 
ec_shd_full_healer (data=0x7f83a8030960) at >>> ec-heald.c:372 >>> #7 0x00007f83bb709e25 in start_thread () from /usr/lib64/libpthread.so.0 >>> #8 0x00007f83bafd634d in clone () from /usr/lib64/libc.so.6 >>> Thread 5 (Thread 0x7f8375ffb700 (LWP 25555)): >>> #0 0x00007f83bb70d945 in pthread_cond_wait@@GLIBC_2.3.2 () from >>> /usr/lib64/libpthread.so.0 >>> #1 0x00007f83bc910e5b in syncop_getxattr (subvol=, >>> loc=loc at entry=0x7f8375ffabb0, dict=dict at entry=0x0, key=key at entry=0x7f83add06a28 >>> "trusted.ec.heal", xdata_in=xdata_in at entry=0x0, >>> xdata_out=xdata_out at entry=0x0) at syncop.c:1680 >>> #2 0x00007f83add02f27 in ec_shd_selfheal (healer=0x7f83a80309d0, >>> child=, loc=0x7f8375ffabb0, full=) at >>> ec-heald.c:161 >>> #3 0x00007f83add0325b in ec_shd_full_heal (subvol=0x7f83a80144d0, >>> entry=, parent=0x7f8375ffade0, data=0x7f83a80309d0) at >>> ec-heald.c:294 >>> #4 0x00007f83bc930ac2 in syncop_ftw (subvol=0x7f83a80144d0, >>> loc=loc at entry=0x7f8375ffade0, pid=pid at entry=-6, data=data at entry=0x7f83a80309d0, >>> fn=fn at entry=0x7f83add03140 ) at syncop-utils.c:125 >>> #5 0x00007f83add03534 in ec_shd_full_sweep (healer=healer at entry=0x7f83a80309d0, >>> inode=) at ec-heald.c:311 >>> #6 0x00007f83add0367b in ec_shd_full_healer (data=0x7f83a80309d0) at >>> ec-heald.c:372 >>> #7 0x00007f83bb709e25 in start_thread () from /usr/lib64/libpthread.so.0 >>> #8 0x00007f83bafd634d in clone () from /usr/lib64/libc.so.6 >>> Thread 4 (Thread 0x7f83757fa700 (LWP 25556)): >>> #0 0x00007f83bb70d945 in pthread_cond_wait@@GLIBC_2.3.2 () from >>> /usr/lib64/libpthread.so.0 >>> #1 0x00007f83bc910e5b in syncop_getxattr (subvol=, >>> loc=loc at entry=0x7f83757f9bb0, dict=dict at entry=0x0, key=key at entry=0x7f83add06a28 >>> "trusted.ec.heal", xdata_in=xdata_in at entry=0x0, >>> xdata_out=xdata_out at entry=0x0) at syncop.c:1680 >>> #2 0x00007f83add02f27 in ec_shd_selfheal (healer=0x7f83a8030a40, >>> child=, loc=0x7f83757f9bb0, full=) at >>> ec-heald.c:161 >>> #3 0x00007f83add0325b in ec_shd_full_heal (subvol=0x7f83a8017eb0, >>> entry=, parent=0x7f83757f9de0, data=0x7f83a8030a40) at >>> ec-heald.c:294 >>> #4 0x00007f83bc930ac2 in syncop_ftw (subvol=0x7f83a8017eb0, >>> loc=loc at entry=0x7f83757f9de0, pid=pid at entry=-6, data=data at entry=0x7f83a8030a40, >>> fn=fn at entry=0x7f83add03140 ) at syncop-utils.c:125 >>> #5 0x00007f83add03534 in ec_shd_full_sweep (healer=healer at entry=0x7f83a8030a40, >>> inode=) at ec-heald.c:311 >>> #6 0x00007f83add0367b in ec_shd_full_healer (data=0x7f83a8030a40) at >>> ec-heald.c:372 >>> #7 0x00007f83bb709e25 in start_thread () from /usr/lib64/libpthread.so.0 >>> #8 0x00007f83bafd634d in clone () from /usr/lib64/libc.so.6 >>> Thread 3 (Thread 0x7f8374ff9700 (LWP 25557)): >>> #0 0x00007f83bb70d945 in pthread_cond_wait@@GLIBC_2.3.2 () from >>> /usr/lib64/libpthread.so.0 >>> #1 0x00007f83bc910e5b in syncop_getxattr (subvol=, >>> loc=loc at entry=0x7f8374ff8bb0, dict=dict at entry=0x0, key=key at entry=0x7f83add06a28 >>> "trusted.ec.heal", xdata_in=xdata_in at entry=0x0, >>> xdata_out=xdata_out at entry=0x0) at syncop.c:1680 >>> #2 0x00007f83add02f27 in ec_shd_selfheal (healer=0x7f83a8030ab0, >>> child=, loc=0x7f8374ff8bb0, full=) at >>> ec-heald.c:161 >>> #3 0x00007f83add0325b in ec_shd_full_heal (subvol=0x7f83a801b890, >>> entry=, parent=0x7f8374ff8de0, data=0x7f83a8030ab0) at >>> ec-heald.c:294 >>> #4 0x00007f83bc930ac2 in syncop_ftw (subvol=0x7f83a801b890, >>> loc=loc at entry=0x7f8374ff8de0, pid=pid at entry=-6, data=data at entry=0x7f83a8030ab0, >>> 
fn=fn at entry=0x7f83add03140 ) at syncop-utils.c:125 >>> #5 0x00007f83add03534 in ec_shd_full_sweep (healer=healer at entry=0x7f83a8030ab0, >>> inode=) at ec-heald.c:311 >>> #6 0x00007f83add0367b in ec_shd_full_healer (data=0x7f83a8030ab0) at >>> ec-heald.c:372 >>> #7 0x00007f83bb709e25 in start_thread () from /usr/lib64/libpthread.so.0 >>> #8 0x00007f83bafd634d in clone () from /usr/lib64/libc.so.6 >>> Thread 2 (Thread 0x7f8367fff700 (LWP 25558)): >>> #0 0x00007f83bb70d945 in pthread_cond_wait@@GLIBC_2.3.2 () from >>> /usr/lib64/libpthread.so.0 >>> #1 0x00007f83bc910e5b in syncop_getxattr (subvol=, >>> loc=loc at entry=0x7f8367ffebb0, dict=dict at entry=0x0, key=key at entry=0x7f83add06a28 >>> "trusted.ec.heal", xdata_in=xdata_in at entry=0x0, >>> xdata_out=xdata_out at entry=0x0) at syncop.c:1680 >>> #2 0x00007f83add02f27 in ec_shd_selfheal (healer=0x7f83a8030b20, >>> child=, loc=0x7f8367ffebb0, full=) at >>> ec-heald.c:161 >>> #3 0x00007f83add0325b in ec_shd_full_heal (subvol=0x7f83a801f270, >>> entry=, parent=0x7f8367ffede0, data=0x7f83a8030b20) at >>> ec-heald.c:294 >>> #4 0x00007f83bc930ac2 in syncop_ftw (subvol=0x7f83a801f270, >>> loc=loc at entry=0x7f8367ffede0, pid=pid at entry=-6, data=data at entry=0x7f83a8030b20, >>> fn=fn at entry=0x7f83add03140 ) at syncop-utils.c:125 >>> #5 0x00007f83add03534 in ec_shd_full_sweep (healer=healer at entry=0x7f83a8030b20, >>> inode=) at ec-heald.c:311 >>> #6 0x00007f83add0367b in ec_shd_full_healer (data=0x7f83a8030b20) at >>> ec-heald.c:372 >>> #7 0x00007f83bb709e25 in start_thread () from /usr/lib64/libpthread.so.0 >>> #8 0x00007f83bafd634d in clone () from /usr/lib64/libc.so.6 >>> Thread 1 (Thread 0x7f83bcdd1780 (LWP 25383)): >>> #0 0x00007f83bb70af57 in pthread_join () from /usr/lib64/libpthread.so.0 >>> #1 0x00007f83bc92eff8 in event_dispatch_epoll >>> (event_pool=0x55af0a6dd560) at event-epoll.c:846 >>> #2 0x000055af0a4116b8 in main (argc=15, argv=0x7fff75610898) at >>> glusterfsd.c:2848 >>> >>> >>> >>>>>>>>>>>>>>>>>>>>>>>>>>. >>> >>> Thanks, >>> Mohit Agrawal >>> >>> On Fri 11 Jan, 2019, 21:20 Shyam Ranganathan >> >>>> We can check health on master post the patch as stated by Mohit below. >>>> >>>> Release-5 is causing some concerns as we need to tag the release >>>> yesterday, but we have the following 2 tests failing or coredumping >>>> pretty regularly, need attention on these. >>>> >>>> ec/bug-1236065.t >>>> glusterd/add-brick-and-validate-replicated-volume-options.t >>>> >>>> Shyam >>>> On 1/10/19 6:20 AM, Mohit Agrawal wrote: >>>> > I think we should consider regression-builds after merged the patch >>>> > (https://review.gluster.org/#/c/glusterfs/+/21990/) >>>> > as we know this patch introduced some delay. >>>> > >>>> > Thanks, >>>> > Mohit Agrawal >>>> > >>>> > On Thu, Jan 10, 2019 at 3:55 PM Atin Mukherjee >>> > > wrote: >>>> > >>>> > Mohit, Sanju - request you to investigate the failures related to >>>> > glusterd and brick-mux and report back to the list. >>>> > >>>> > On Thu, Jan 10, 2019 at 12:25 AM Shyam Ranganathan >>>> > > wrote: >>>> > >>>> > Hi, >>>> > >>>> > As part of branching preparation next week for release-6, >>>> please >>>> > find >>>> > test failures and respective test links here [1]. 
>>>> > >>>> > The top tests that are failing/dumping-core are as below and >>>> > need attention, >>>> > - ec/bug-1236065.t >>>> > - glusterd/add-brick-and-validate-replicated-volume-options.t >>>> > - readdir-ahead/bug-1390050.t >>>> > - glusterd/brick-mux-validation.t >>>> > - bug-1432542-mpx-restart-crash.t >>>> > >>>> > Others of interest, >>>> > - replicate/bug-1341650.t >>>> > >>>> > Please file a bug if needed against the test case and report >>>> the >>>> > same >>>> > here, in case a problem is already addressed, then do send >>>> back the >>>> > patch details that addresses this issue as a response to this >>>> mail. >>>> > >>>> > Thanks, >>>> > Shyam >>>> > >>>> > [1] Regression failures: >>>> > https://hackmd.io/wsPgKjfJRWCP8ixHnYGqcA?view >>>> > _______________________________________________ >>>> > Gluster-devel mailing list >>>> > Gluster-devel at gluster.org >>>> > https://lists.gluster.org/mailman/listinfo/gluster-devel >>>> > >>>> > >>>> >>> >> _______________________________________________ >> Gluster-devel mailing list >> Gluster-devel at gluster.org >> https://lists.gluster.org/mailman/listinfo/gluster-devel >> >> _______________________________________________ >> Gluster-devel mailing list >> Gluster-devel at gluster.org >> https://lists.gluster.org/mailman/listinfo/gluster-devel > > _______________________________________________ > Gluster-devel mailing list > Gluster-devel at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-devel -------------- next part -------------- An HTML attachment was scrubbed... URL: From amukherj at redhat.com Thu Jan 17 04:28:50 2019 From: amukherj at redhat.com (Atin Mukherjee) Date: Thu, 17 Jan 2019 09:58:50 +0530 Subject: [Gluster-devel] Regression health for release-5.next and release-6 In-Reply-To: References: <89f54d02-8c78-a507-416a-c1ca1d7b4be2@redhat.com> <65f1c892-a3c4-2401-4827-dbe0277875b2@redhat.com> <2134165472.57578088.1547460382588.JavaMail.zimbra@redhat.com> Message-ID: On Tue, Jan 15, 2019 at 2:13 PM Atin Mukherjee wrote: > Interesting. I?ll do a deep dive at it sometime this week. > > On Tue, 15 Jan 2019 at 14:05, Xavi Hernandez wrote: > >> On Mon, Jan 14, 2019 at 11:08 AM Ashish Pandey >> wrote: >> >>> >>> I downloaded logs of regression runs 1077 and 1073 and tried to >>> investigate it. >>> In both regression ec/bug-1236065.t is hanging on TEST 70 which is >>> trying to get the online brick count >>> >>> I can see that in mount/bricks and glusterd logs it has not move forward >>> after this test. 
>>> glusterd.log - >>> >>> [2019-01-06 16:27:51.346408]:++++++++++ >>> G_LOG:./tests/bugs/ec/bug-1236065.t: TEST: 70 5 online_brick_count >>> ++++++++++ >>> [2019-01-06 16:27:51.645014] I [MSGID: 106499] >>> [glusterd-handler.c:4404:__glusterd_handle_status_volume] 0-management: >>> Received status volume req for volume patchy >>> [2019-01-06 16:27:51.646664] I [dict.c:2745:dict_get_str_boolean] >>> (-->/build/install/lib/glusterfs/6dev/xlator/mgmt/glusterd.so(+0x4a6c3) >>> [0x7f4c37fe06c3] >>> -->/build/install/lib/glusterfs/6dev/xlator/mgmt/glusterd.so(+0x43b3a) >>> [0x7f4c37fd9b3a] >>> -->/build/install/lib/libglusterfs.so.0(dict_get_str_boolean+0x170) >>> [0x7f4c433d83fb] ) 0-dict: key nfs.disable, integer type asked, has string >>> type [Invalid argument] >>> [2019-01-06 16:27:51.647177] I [dict.c:2361:dict_get_strn] >>> (-->/build/install/lib/glusterfs/6dev/xlator/mgmt/glusterd.so(+0xffa32) >>> [0x7f4c38095a32] >>> -->/build/install/lib/glusterfs/6dev/xlator/mgmt/glusterd.so(+0x474ac) >>> [0x7f4c37fdd4ac] >>> -->/build/install/lib/libglusterfs.so.0(dict_get_strn+0x179) >>> [0x7f4c433d7673] ) 0-dict: key brick0.rdma_port, string type asked, has >>> integer type [Invalid argument] >>> [2019-01-06 16:27:51.647227] I [dict.c:2361:dict_get_strn] >>> (-->/build/install/lib/glusterfs/6dev/xlator/mgmt/glusterd.so(+0xffa32) >>> [0x7f4c38095a32] >>> -->/build/install/lib/glusterfs/6dev/xlator/mgmt/glusterd.so(+0x474ac) >>> [0x7f4c37fdd4ac] >>> -->/build/install/lib/libglusterfs.so.0(dict_get_strn+0x179) >>> [0x7f4c433d7673] ) 0-dict: key brick1.rdma_port, string type asked, has >>> integer type [Invalid argument] >>> [2019-01-06 16:27:51.647292] I [dict.c:2361:dict_get_strn] >>> (-->/build/install/lib/glusterfs/6dev/xlator/mgmt/glusterd.so(+0xffa32) >>> [0x7f4c38095a32] >>> -->/build/install/lib/glusterfs/6dev/xlator/mgmt/glusterd.so(+0x474ac) >>> [0x7f4c37fdd4ac] >>> -->/build/install/lib/libglusterfs.so.0(dict_get_strn+0x179) >>> [0x7f4c433d7673] ) 0-dict: key brick2.rdma_port, string type asked, has >>> integer type [Invalid argument] >>> [2019-01-06 16:27:51.647333] I [dict.c:2361:dict_get_strn] >>> (-->/build/install/lib/glusterfs/6dev/xlator/mgmt/glusterd.so(+0xffa32) >>> [0x7f4c38095a32] >>> -->/build/install/lib/glusterfs/6dev/xlator/mgmt/glusterd.so(+0x474ac) >>> [0x7f4c37fdd4ac] >>> -->/build/install/lib/libglusterfs.so.0(dict_get_strn+0x179) >>> [0x7f4c433d7673] ) 0-dict: key brick3.rdma_port, string type asked, has >>> integer type [Invalid argument] >>> [2019-01-06 16:27:51.647371] I [dict.c:2361:dict_get_strn] >>> (-->/build/install/lib/glusterfs/6dev/xlator/mgmt/glusterd.so(+0xffa32) >>> [0x7f4c38095a32] >>> -->/build/install/lib/glusterfs/6dev/xlator/mgmt/glusterd.so(+0x474ac) >>> [0x7f4c37fdd4ac] >>> -->/build/install/lib/libglusterfs.so.0(dict_get_strn+0x179) >>> [0x7f4c433d7673] ) 0-dict: key brick4.rdma_port, string type asked, has >>> integer type [Invalid argument] >>> [2019-01-06 16:27:51.647409] I [dict.c:2361:dict_get_strn] >>> (-->/build/install/lib/glusterfs/6dev/xlator/mgmt/glusterd.so(+0xffa32) >>> [0x7f4c38095a32] >>> -->/build/install/lib/glusterfs/6dev/xlator/mgmt/glusterd.so(+0x474ac) >>> [0x7f4c37fdd4ac] >>> -->/build/install/lib/libglusterfs.so.0(dict_get_strn+0x179) >>> [0x7f4c433d7673] ) 0-dict: key brick5.rdma_port, string type asked, has >>> integer type [Invalid argument] >>> [2019-01-06 16:27:51.647447] I [dict.c:2361:dict_get_strn] >>> (-->/build/install/lib/glusterfs/6dev/xlator/mgmt/glusterd.so(+0xffa32) >>> [0x7f4c38095a32] >>> 
-->/build/install/lib/glusterfs/6dev/xlator/mgmt/glusterd.so(+0x474ac) >>> [0x7f4c37fdd4ac] >>> -->/build/install/lib/libglusterfs.so.0(dict_get_strn+0x179) >>> [0x7f4c433d7673] ) 0-dict: key brick6.rdma_port, string type asked, has >>> integer type [Invalid argument] >>> [2019-01-06 16:27:51.649335] E [MSGID: 101191] >>> [event-epoll.c:759:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch >>> handler >>> [2019-01-06 16:27:51.932871] I [MSGID: 106499] >>> [glusterd-handler.c:4404:__glusterd_handle_status_volume] 0-management: >>> Received status volume req for volume patchy >>> >>> It is just taking lot of time to get the status at this point. >>> It looks like there could be some issue with connection or the handing >>> of volume status when some bricks are down. >>> >> >> The 'online_brick_count' check uses 'gluster volume status' to get some >> information, and it does that several times (currently 7). Looking at >> cmd_history.log, I see that after the 'online_brick_count' at line 70, only >> one 'gluster volume status' has completed. Apparently the second 'gluster >> volume status' is hung. >> >> In cli.log I see that the second 'gluster volume status' seems to have >> started, but not finished: >> >> Normal run: >> >> [2019-01-08 16:36:43.628821] I [cli.c:834:main] 0-cli: Started running >> gluster with version 6dev >> [2019-01-08 16:36:43.808182] I [MSGID: 101190] >> [event-epoll.c:675:event_dispatch_epoll_worker] 0-epoll: Started thread >> with index 0 >> [2019-01-08 16:36:43.808287] I [MSGID: 101190] >> [event-epoll.c:675:event_dispatch_epoll_worker] 0-epoll: Started thread >> with index 1 >> [2019-01-08 16:36:43.808432] E [MSGID: 101191] >> [event-epoll.c:759:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch >> handler >> [2019-01-08 16:36:43.816534] I [dict.c:1947:dict_get_uint32] >> (-->gluster(cli_cmd_process+0x1e4) [0x40db50] >> -->gluster(cli_cmd_volume_status_cbk+0x90) [0x415bec] >> -->/build/install/lib/libglusterfs.so.0(dict_get_uint32+0x176) >> [0x7fefe569456 >> 9] ) 0-dict: key cmd, unsigned integer type asked, has integer type >> [Invalid argument] >> [2019-01-08 16:36:43.816716] I [dict.c:1947:dict_get_uint32] >> (-->gluster(cli_cmd_volume_status_cbk+0x1cb) [0x415d27] >> -->gluster(gf_cli_status_volume_all+0xc8) [0x42fa94] >> -->/build/install/lib/libglusterfs.so.0(dict_get_uint32+0x176) [0x7f >> efe5694569] ) 0-dict: key cmd, unsigned integer type asked, has integer >> type [Invalid argument] >> [2019-01-08 16:36:43.824437] I [input.c:31:cli_batch] 0-: Exiting with: 0 >> >> >> Bad run: >> >> [2019-01-08 16:36:43.940361] I [cli.c:834:main] 0-cli: Started running >> gluster with version 6dev >> [2019-01-08 16:36:44.147364] I [MSGID: 101190] >> [event-epoll.c:675:event_dispatch_epoll_worker] 0-epoll: Started thread >> with index 0 >> [2019-01-08 16:36:44.147477] I [MSGID: 101190] >> [event-epoll.c:675:event_dispatch_epoll_worker] 0-epoll: Started thread >> with index 1 >> [2019-01-08 16:36:44.147583] E [MSGID: 101191] >> [event-epoll.c:759:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch >> handler >> >> >> In glusterd.log it seems as if it hasn't received any status request. It >> looks like the cli has not even connected to glusterd. >> > Downloaded the logs for the recent failure from https://build.gluster.org/job/regression-test-with-multiplex/1092/ and based on the log scanning this is what I see: 1. The test executes with out any issues till line no 74 i.e. 
"TEST $CLI volume start $V0 force" and cli.log along with cmd_history.log confirm the same: cli.log ==== [2019-01-16 16:28:46.871877]:++++++++++ G_LOG:./tests/bugs/ec/bug-1236065.t: TEST: 73 gluster --mode=script --wignore volume start patchy force ++++++++++ [2019-01-16 16:28:46.980780] I [cli.c:834:main] 0-cli: Started running gluster with version 6dev [2019-01-16 16:28:47.185996] I [MSGID: 101190] [event-epoll.c:675:event_dispatch_epoll_worker] 0-epoll: Started thread with index 0 [2019-01-16 16:28:47.186113] I [MSGID: 101190] [event-epoll.c:675:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1 [2019-01-16 16:28:47.186234] E [MSGID: 101191] [event-epoll.c:759:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch handler [2019-01-16 16:28:49.223376] I [cli-rpc-ops.c:1448:gf_cli_start_volume_cbk] 0-cli: Received resp to start volume <=== successfully processed the callback [2019-01-16 16:28:49.223668] I [input.c:31:cli_batch] 0-: Exiting with: 0 cmd_history.log ============ [2019-01-16 16:28:49.220491] : volume start patchy force : SUCCESS However, in both cli and cmd_history log files these are the last set of logs I see which indicates either the test script is completely paused. There's no possibility I see that cli receiving this command and dropping it completely as otherwise we should have atleast seen the "Started running gluster with version 6dev" and "Exiting with" log entries. I could manage to reproduce this once locally in my system and then when I ran command from another prompt, volume status and all other gluster basic commands go through. I also inspected the processes and I don't see any suspect of processes being hung. So the mystery continues and we need to see why the test script is not all moving forward. >> Xavi >> >> >>> --- >>> Ashish >>> >>> >>> >>> ------------------------------ >>> *From: *"Mohit Agrawal" >>> *To: *"Shyam Ranganathan" >>> *Cc: *"Gluster Devel" >>> *Sent: *Saturday, January 12, 2019 6:46:20 PM >>> *Subject: *Re: [Gluster-devel] Regression health for release-5.next >>> and release-6 >>> >>> Previous logs related to client not bricks, below are the brick logs >>> >>> [2019-01-12 12:25:25.893485]:++++++++++ >>> G_LOG:./tests/bugs/ec/bug-1236065.t: TEST: 68 rm -f 0.o 10.o 11.o 12.o 13.o >>> 14.o 15.o 16.o 17.o 18.o 19.o 1.o 2.o 3.o 4.o 5.o 6.o 7.o 8.o 9.o ++++++++++ >>> The message "I [MSGID: 101016] [glusterfs3.h:746:dict_to_xdr] 0-dict: >>> key 'trusted.ec.size' would not be sent on wire in the future [Invalid >>> argument]" repeated 199 times between [2019-01-12 12:25:25.283989] and >>> [2019-01-12 12:25:25.899532] >>> [2019-01-12 12:25:25.903375] E [MSGID: 113001] >>> [posix-inode-fd-ops.c:4617:_posix_handle_xattr_keyvalue_pair] >>> 8-patchy-posix: fgetxattr failed on >>> gfid=d91f6331-d394-479d-ab51-6bcf674ac3e0 while doing xattrop: >>> Key:trusted.ec.dirty (Bad file descriptor) [Bad file descriptor] >>> [2019-01-12 12:25:25.903468] E [MSGID: 115073] >>> [server-rpc-fops_v2.c:1805:server4_fxattrop_cbk] 0-patchy-server: 1486: >>> FXATTROP 2 (d91f6331-d394-479d-ab51-6bcf674ac3e0), client: >>> CTX_ID:b785c2b0-3453-4a03-b129-19e6ceeb5346-GRAPH_ID:0-PID:24147-HOST:softserve-moagrawa-test.1-PC_NAME:patchy-client-1-RECON_NO:-1, >>> error-xlator: patchy-posix [Bad file descriptor] >>> >>> >>> Thanks, >>> Mohit Agrawal >>> >>> On Sat, Jan 12, 2019 at 6:29 PM Mohit Agrawal >>> wrote: >>> >>>> >>>> For specific to "add-brick-and-validate-replicated-volume-options.t" i >>>> have posted a patch https://review.gluster.org/22015. 
>>>> For test case "ec/bug-1236065.t" I think the issue needs to be check by >>>> ec team >>>> >>>> On the brick side, it is showing below logs >>>> >>>> >>>>>>>>>>>>>>>>> >>>> >>>> on wire in the future [Invalid argument] >>>> The message "I [MSGID: 101016] [glusterfs3.h:746:dict_to_xdr] 0-dict: >>>> key 'trusted.ec.dirty' would not be sent on wire in the future [Invalid >>>> argument]" repeated 3 times between [2019-01-12 12:25:25.902828] and >>>> [2019-01-12 12:25:25.902992] >>>> [2019-01-12 12:25:25.903553] W [MSGID: 114031] >>>> [client-rpc-fops_v2.c:1614:client4_0_fxattrop_cbk] 0-patchy-client-1: >>>> remote operation failed [Bad file descriptor] >>>> [2019-01-12 12:25:25.903998] W [MSGID: 122040] >>>> [ec-common.c:1181:ec_prepare_update_cbk] 0-patchy-disperse-0: Failed to get >>>> size and version : FOP : 'FXATTROP' failed on gfid >>>> d91f6331-d394-479d-ab51-6bcf674ac3e0 [Input/output error] >>>> [2019-01-12 12:25:25.904059] W [fuse-bridge.c:1907:fuse_unlink_cbk] >>>> 0-glusterfs-fuse: 3259: UNLINK() /test/0.o => -1 (Input/output error) >>>> >>>> >>>>>>>>>>>>>>>>>>> >>>> >>>> Test case is getting timed out because "volume heal $V0 full" command >>>> is stuck, look's like shd is getting stuck at getxattr >>>> >>>> >>>>>>>>>>>>>>. >>>> >>>> Thread 8 (Thread 0x7f83777fe700 (LWP 25552)): >>>> #0 0x00007f83bb70d945 in pthread_cond_wait@@GLIBC_2.3.2 () from >>>> /usr/lib64/libpthread.so.0 >>>> #1 0x00007f83bc910e5b in syncop_getxattr (subvol=, >>>> loc=loc at entry=0x7f83777fdbb0, dict=dict at entry=0x0, key=key at entry=0x7f83add06a28 >>>> "trusted.ec.heal", xdata_in=xdata_in at entry=0x0, >>>> xdata_out=xdata_out at entry=0x0) at syncop.c:1680 >>>> #2 0x00007f83add02f27 in ec_shd_selfheal (healer=0x7f83a8030880, >>>> child=, loc=0x7f83777fdbb0, full=) at >>>> ec-heald.c:161 >>>> #3 0x00007f83add0325b in ec_shd_full_heal (subvol=0x7f83a80094b0, >>>> entry=, parent=0x7f83777fdde0, data=0x7f83a8030880) at >>>> ec-heald.c:294 >>>> #4 0x00007f83bc930ac2 in syncop_ftw (subvol=0x7f83a80094b0, >>>> loc=loc at entry=0x7f83777fdde0, pid=pid at entry=-6, data=data at entry=0x7f83a8030880, >>>> fn=fn at entry=0x7f83add03140 ) at syncop-utils.c:125 >>>> #5 0x00007f83add03534 in ec_shd_full_sweep (healer=healer at entry=0x7f83a8030880, >>>> inode=) at ec-heald.c:311 >>>> #6 0x00007f83add0367b in ec_shd_full_healer (data=0x7f83a8030880) at >>>> ec-heald.c:372 >>>> #7 0x00007f83bb709e25 in start_thread () from >>>> /usr/lib64/libpthread.so.0 >>>> #8 0x00007f83bafd634d in clone () from /usr/lib64/libc.so.6 >>>> Thread 7 (Thread 0x7f8376ffd700 (LWP 25553)): >>>> #0 0x00007f83bb70d945 in pthread_cond_wait@@GLIBC_2.3.2 () from >>>> /usr/lib64/libpthread.so.0 >>>> #1 0x00007f83bc910e5b in syncop_getxattr (subvol=, >>>> loc=loc at entry=0x7f8376ffcbb0, dict=dict at entry=0x0, key=key at entry=0x7f83add06a28 >>>> "trusted.ec.heal", xdata_in=xdata_in at entry=0x0, >>>> xdata_out=xdata_out at entry=0x0) at syncop.c:1680 >>>> #2 0x00007f83add02f27 in ec_shd_selfheal (healer=0x7f83a80308f0, >>>> child=, loc=0x7f8376ffcbb0, full=) at >>>> ec-heald.c:161 >>>> #3 0x00007f83add0325b in ec_shd_full_heal (subvol=0x7f83a800d110, >>>> entry=, parent=0x7f8376ffcde0, data=0x7f83a80308f0) at >>>> ec-heald.c:294 >>>> #4 0x00007f83bc930ac2 in syncop_ftw (subvol=0x7f83a800d110, >>>> loc=loc at entry=0x7f8376ffcde0, pid=pid at entry=-6, data=data at entry=0x7f83a80308f0, >>>> fn=fn at entry=0x7f83add03140 ) at syncop-utils.c:125 >>>> #5 0x00007f83add03534 in ec_shd_full_sweep (healer=healer at entry=0x7f83a80308f0, >>>> 
inode=) at ec-heald.c:311 >>>> #6 0x00007f83add0367b in ec_shd_full_healer (data=0x7f83a80308f0) at >>>> ec-heald.c:372 >>>> #7 0x00007f83bb709e25 in start_thread () from >>>> /usr/lib64/libpthread.so.0 >>>> #8 0x00007f83bafd634d in clone () from /usr/lib64/libc.so.6 >>>> Thread 6 (Thread 0x7f83767fc700 (LWP 25554)): >>>> #0 0x00007f83bb70d945 in pthread_cond_wait@@GLIBC_2.3.2 () from >>>> /usr/lib64/libpthread.so.0 >>>> #1 0x00007f83bc910e5b in syncop_getxattr (subvol=, >>>> loc=loc at entry=0x7f83767fbbb0, dict=dict at entry=0x0, key=key at entry=0x7f83add06a28 >>>> "trusted.ec.heal", xdata_in=xdata_in at entry=0x0, >>>> xdata_out=xdata_out at entry=0x0) at syncop.c:1680 >>>> #2 0x00007f83add02f27 in ec_shd_selfheal (healer=0x7f83a8030960, >>>> child=, loc=0x7f83767fbbb0, full=) at >>>> ec-heald.c:161 >>>> #3 0x00007f83add0325b in ec_shd_full_heal (subvol=0x7f83a8010af0, >>>> entry=, parent=0x7f83767fbde0, data=0x7f83a8030960) at >>>> ec-heald.c:294 >>>> #4 0x00007f83bc930ac2 in syncop_ftw (subvol=0x7f83a8010af0, >>>> loc=loc at entry=0x7f83767fbde0, pid=pid at entry=-6, data=data at entry=0x7f83a8030960, >>>> fn=fn at entry=0x7f83add03140 ) at syncop-utils.c:125 >>>> #5 0x00007f83add03534 in ec_shd_full_sweep (healer=healer at entry=0x7f83a8030960, >>>> inode=) at ec-heald.c:311 >>>> #6 0x00007f83add0367b in ec_shd_full_healer (data=0x7f83a8030960) at >>>> ec-heald.c:372 >>>> #7 0x00007f83bb709e25 in start_thread () from >>>> /usr/lib64/libpthread.so.0 >>>> #8 0x00007f83bafd634d in clone () from /usr/lib64/libc.so.6 >>>> Thread 5 (Thread 0x7f8375ffb700 (LWP 25555)): >>>> #0 0x00007f83bb70d945 in pthread_cond_wait@@GLIBC_2.3.2 () from >>>> /usr/lib64/libpthread.so.0 >>>> #1 0x00007f83bc910e5b in syncop_getxattr (subvol=, >>>> loc=loc at entry=0x7f8375ffabb0, dict=dict at entry=0x0, key=key at entry=0x7f83add06a28 >>>> "trusted.ec.heal", xdata_in=xdata_in at entry=0x0, >>>> xdata_out=xdata_out at entry=0x0) at syncop.c:1680 >>>> #2 0x00007f83add02f27 in ec_shd_selfheal (healer=0x7f83a80309d0, >>>> child=, loc=0x7f8375ffabb0, full=) at >>>> ec-heald.c:161 >>>> #3 0x00007f83add0325b in ec_shd_full_heal (subvol=0x7f83a80144d0, >>>> entry=, parent=0x7f8375ffade0, data=0x7f83a80309d0) at >>>> ec-heald.c:294 >>>> #4 0x00007f83bc930ac2 in syncop_ftw (subvol=0x7f83a80144d0, >>>> loc=loc at entry=0x7f8375ffade0, pid=pid at entry=-6, data=data at entry=0x7f83a80309d0, >>>> fn=fn at entry=0x7f83add03140 ) at syncop-utils.c:125 >>>> #5 0x00007f83add03534 in ec_shd_full_sweep (healer=healer at entry=0x7f83a80309d0, >>>> inode=) at ec-heald.c:311 >>>> #6 0x00007f83add0367b in ec_shd_full_healer (data=0x7f83a80309d0) at >>>> ec-heald.c:372 >>>> #7 0x00007f83bb709e25 in start_thread () from >>>> /usr/lib64/libpthread.so.0 >>>> #8 0x00007f83bafd634d in clone () from /usr/lib64/libc.so.6 >>>> Thread 4 (Thread 0x7f83757fa700 (LWP 25556)): >>>> #0 0x00007f83bb70d945 in pthread_cond_wait@@GLIBC_2.3.2 () from >>>> /usr/lib64/libpthread.so.0 >>>> #1 0x00007f83bc910e5b in syncop_getxattr (subvol=, >>>> loc=loc at entry=0x7f83757f9bb0, dict=dict at entry=0x0, key=key at entry=0x7f83add06a28 >>>> "trusted.ec.heal", xdata_in=xdata_in at entry=0x0, >>>> xdata_out=xdata_out at entry=0x0) at syncop.c:1680 >>>> #2 0x00007f83add02f27 in ec_shd_selfheal (healer=0x7f83a8030a40, >>>> child=, loc=0x7f83757f9bb0, full=) at >>>> ec-heald.c:161 >>>> #3 0x00007f83add0325b in ec_shd_full_heal (subvol=0x7f83a8017eb0, >>>> entry=, parent=0x7f83757f9de0, data=0x7f83a8030a40) at >>>> ec-heald.c:294 >>>> #4 0x00007f83bc930ac2 in 
syncop_ftw (subvol=0x7f83a8017eb0, >>>> loc=loc at entry=0x7f83757f9de0, pid=pid at entry=-6, data=data at entry=0x7f83a8030a40, >>>> fn=fn at entry=0x7f83add03140 ) at syncop-utils.c:125 >>>> #5 0x00007f83add03534 in ec_shd_full_sweep (healer=healer at entry=0x7f83a8030a40, >>>> inode=) at ec-heald.c:311 >>>> #6 0x00007f83add0367b in ec_shd_full_healer (data=0x7f83a8030a40) at >>>> ec-heald.c:372 >>>> #7 0x00007f83bb709e25 in start_thread () from >>>> /usr/lib64/libpthread.so.0 >>>> #8 0x00007f83bafd634d in clone () from /usr/lib64/libc.so.6 >>>> Thread 3 (Thread 0x7f8374ff9700 (LWP 25557)): >>>> #0 0x00007f83bb70d945 in pthread_cond_wait@@GLIBC_2.3.2 () from >>>> /usr/lib64/libpthread.so.0 >>>> #1 0x00007f83bc910e5b in syncop_getxattr (subvol=, >>>> loc=loc at entry=0x7f8374ff8bb0, dict=dict at entry=0x0, key=key at entry=0x7f83add06a28 >>>> "trusted.ec.heal", xdata_in=xdata_in at entry=0x0, >>>> xdata_out=xdata_out at entry=0x0) at syncop.c:1680 >>>> #2 0x00007f83add02f27 in ec_shd_selfheal (healer=0x7f83a8030ab0, >>>> child=, loc=0x7f8374ff8bb0, full=) at >>>> ec-heald.c:161 >>>> #3 0x00007f83add0325b in ec_shd_full_heal (subvol=0x7f83a801b890, >>>> entry=, parent=0x7f8374ff8de0, data=0x7f83a8030ab0) at >>>> ec-heald.c:294 >>>> #4 0x00007f83bc930ac2 in syncop_ftw (subvol=0x7f83a801b890, >>>> loc=loc at entry=0x7f8374ff8de0, pid=pid at entry=-6, data=data at entry=0x7f83a8030ab0, >>>> fn=fn at entry=0x7f83add03140 ) at syncop-utils.c:125 >>>> #5 0x00007f83add03534 in ec_shd_full_sweep (healer=healer at entry=0x7f83a8030ab0, >>>> inode=) at ec-heald.c:311 >>>> #6 0x00007f83add0367b in ec_shd_full_healer (data=0x7f83a8030ab0) at >>>> ec-heald.c:372 >>>> #7 0x00007f83bb709e25 in start_thread () from >>>> /usr/lib64/libpthread.so.0 >>>> #8 0x00007f83bafd634d in clone () from /usr/lib64/libc.so.6 >>>> Thread 2 (Thread 0x7f8367fff700 (LWP 25558)): >>>> #0 0x00007f83bb70d945 in pthread_cond_wait@@GLIBC_2.3.2 () from >>>> /usr/lib64/libpthread.so.0 >>>> #1 0x00007f83bc910e5b in syncop_getxattr (subvol=, >>>> loc=loc at entry=0x7f8367ffebb0, dict=dict at entry=0x0, key=key at entry=0x7f83add06a28 >>>> "trusted.ec.heal", xdata_in=xdata_in at entry=0x0, >>>> xdata_out=xdata_out at entry=0x0) at syncop.c:1680 >>>> #2 0x00007f83add02f27 in ec_shd_selfheal (healer=0x7f83a8030b20, >>>> child=, loc=0x7f8367ffebb0, full=) at >>>> ec-heald.c:161 >>>> #3 0x00007f83add0325b in ec_shd_full_heal (subvol=0x7f83a801f270, >>>> entry=, parent=0x7f8367ffede0, data=0x7f83a8030b20) at >>>> ec-heald.c:294 >>>> #4 0x00007f83bc930ac2 in syncop_ftw (subvol=0x7f83a801f270, >>>> loc=loc at entry=0x7f8367ffede0, pid=pid at entry=-6, data=data at entry=0x7f83a8030b20, >>>> fn=fn at entry=0x7f83add03140 ) at syncop-utils.c:125 >>>> #5 0x00007f83add03534 in ec_shd_full_sweep (healer=healer at entry=0x7f83a8030b20, >>>> inode=) at ec-heald.c:311 >>>> #6 0x00007f83add0367b in ec_shd_full_healer (data=0x7f83a8030b20) at >>>> ec-heald.c:372 >>>> #7 0x00007f83bb709e25 in start_thread () from >>>> /usr/lib64/libpthread.so.0 >>>> #8 0x00007f83bafd634d in clone () from /usr/lib64/libc.so.6 >>>> Thread 1 (Thread 0x7f83bcdd1780 (LWP 25383)): >>>> #0 0x00007f83bb70af57 in pthread_join () from >>>> /usr/lib64/libpthread.so.0 >>>> #1 0x00007f83bc92eff8 in event_dispatch_epoll >>>> (event_pool=0x55af0a6dd560) at event-epoll.c:846 >>>> #2 0x000055af0a4116b8 in main (argc=15, argv=0x7fff75610898) at >>>> glusterfsd.c:2848 >>>> >>>> >>>> >>>>>>>>>>>>>>>>>>>>>>>>>>. 
>>>> >>>> Thanks, >>>> Mohit Agrawal >>>> >>>> On Fri 11 Jan, 2019, 21:20 Shyam Ranganathan >>> wrote: >>>> >>>>> We can check health on master post the patch as stated by Mohit below. >>>>> >>>>> Release-5 is causing some concerns as we need to tag the release >>>>> yesterday, but we have the following 2 tests failing or coredumping >>>>> pretty regularly, need attention on these. >>>>> >>>>> ec/bug-1236065.t >>>>> glusterd/add-brick-and-validate-replicated-volume-options.t >>>>> >>>>> Shyam >>>>> On 1/10/19 6:20 AM, Mohit Agrawal wrote: >>>>> > I think we should consider regression-builds after merged the patch >>>>> > (https://review.gluster.org/#/c/glusterfs/+/21990/) >>>>> > as we know this patch introduced some delay. >>>>> > >>>>> > Thanks, >>>>> > Mohit Agrawal >>>>> > >>>>> > On Thu, Jan 10, 2019 at 3:55 PM Atin Mukherjee >>>> > > wrote: >>>>> > >>>>> > Mohit, Sanju - request you to investigate the failures related to >>>>> > glusterd and brick-mux and report back to the list. >>>>> > >>>>> > On Thu, Jan 10, 2019 at 12:25 AM Shyam Ranganathan >>>>> > > wrote: >>>>> > >>>>> > Hi, >>>>> > >>>>> > As part of branching preparation next week for release-6, >>>>> please >>>>> > find >>>>> > test failures and respective test links here [1]. >>>>> > >>>>> > The top tests that are failing/dumping-core are as below and >>>>> > need attention, >>>>> > - ec/bug-1236065.t >>>>> > - glusterd/add-brick-and-validate-replicated-volume-options.t >>>>> > - readdir-ahead/bug-1390050.t >>>>> > - glusterd/brick-mux-validation.t >>>>> > - bug-1432542-mpx-restart-crash.t >>>>> > >>>>> > Others of interest, >>>>> > - replicate/bug-1341650.t >>>>> > >>>>> > Please file a bug if needed against the test case and report >>>>> the >>>>> > same >>>>> > here, in case a problem is already addressed, then do send >>>>> back the >>>>> > patch details that addresses this issue as a response to >>>>> this mail. >>>>> > >>>>> > Thanks, >>>>> > Shyam >>>>> > >>>>> > [1] Regression failures: >>>>> > https://hackmd.io/wsPgKjfJRWCP8ixHnYGqcA?view >>>>> > _______________________________________________ >>>>> > Gluster-devel mailing list >>>>> > Gluster-devel at gluster.org >>>>> > https://lists.gluster.org/mailman/listinfo/gluster-devel >>>>> > >>>>> > >>>>> >>>> >>> _______________________________________________ >>> Gluster-devel mailing list >>> Gluster-devel at gluster.org >>> https://lists.gluster.org/mailman/listinfo/gluster-devel >>> >>> _______________________________________________ >>> Gluster-devel mailing list >>> Gluster-devel at gluster.org >>> https://lists.gluster.org/mailman/listinfo/gluster-devel >> >> _______________________________________________ >> Gluster-devel mailing list >> Gluster-devel at gluster.org >> https://lists.gluster.org/mailman/listinfo/gluster-devel > > -- > --Atin > _______________________________________________ > Gluster-devel mailing list > Gluster-devel at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-devel -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From jahernan at redhat.com Thu Jan 17 08:21:47 2019 From: jahernan at redhat.com (Xavi Hernandez) Date: Thu, 17 Jan 2019 09:21:47 +0100 Subject: [Gluster-devel] Regression health for release-5.next and release-6 In-Reply-To: References: <89f54d02-8c78-a507-416a-c1ca1d7b4be2@redhat.com> <65f1c892-a3c4-2401-4827-dbe0277875b2@redhat.com> <2134165472.57578088.1547460382588.JavaMail.zimbra@redhat.com> Message-ID: On Thu, Jan 17, 2019 at 5:29 AM Atin Mukherjee wrote: > > > On Tue, Jan 15, 2019 at 2:13 PM Atin Mukherjee > wrote: > >> Interesting. I?ll do a deep dive at it sometime this week. >> >> On Tue, 15 Jan 2019 at 14:05, Xavi Hernandez wrote: >> >>> On Mon, Jan 14, 2019 at 11:08 AM Ashish Pandey >>> wrote: >>> >>>> >>>> I downloaded logs of regression runs 1077 and 1073 and tried to >>>> investigate it. >>>> In both regression ec/bug-1236065.t is hanging on TEST 70 which is >>>> trying to get the online brick count >>>> >>>> I can see that in mount/bricks and glusterd logs it has not move >>>> forward after this test. >>>> glusterd.log - >>>> >>>> [2019-01-06 16:27:51.346408]:++++++++++ >>>> G_LOG:./tests/bugs/ec/bug-1236065.t: TEST: 70 5 online_brick_count >>>> ++++++++++ >>>> [2019-01-06 16:27:51.645014] I [MSGID: 106499] >>>> [glusterd-handler.c:4404:__glusterd_handle_status_volume] 0-management: >>>> Received status volume req for volume patchy >>>> [2019-01-06 16:27:51.646664] I [dict.c:2745:dict_get_str_boolean] >>>> (-->/build/install/lib/glusterfs/6dev/xlator/mgmt/glusterd.so(+0x4a6c3) >>>> [0x7f4c37fe06c3] >>>> -->/build/install/lib/glusterfs/6dev/xlator/mgmt/glusterd.so(+0x43b3a) >>>> [0x7f4c37fd9b3a] >>>> -->/build/install/lib/libglusterfs.so.0(dict_get_str_boolean+0x170) >>>> [0x7f4c433d83fb] ) 0-dict: key nfs.disable, integer type asked, has string >>>> type [Invalid argument] >>>> [2019-01-06 16:27:51.647177] I [dict.c:2361:dict_get_strn] >>>> (-->/build/install/lib/glusterfs/6dev/xlator/mgmt/glusterd.so(+0xffa32) >>>> [0x7f4c38095a32] >>>> -->/build/install/lib/glusterfs/6dev/xlator/mgmt/glusterd.so(+0x474ac) >>>> [0x7f4c37fdd4ac] >>>> -->/build/install/lib/libglusterfs.so.0(dict_get_strn+0x179) >>>> [0x7f4c433d7673] ) 0-dict: key brick0.rdma_port, string type asked, has >>>> integer type [Invalid argument] >>>> [2019-01-06 16:27:51.647227] I [dict.c:2361:dict_get_strn] >>>> (-->/build/install/lib/glusterfs/6dev/xlator/mgmt/glusterd.so(+0xffa32) >>>> [0x7f4c38095a32] >>>> -->/build/install/lib/glusterfs/6dev/xlator/mgmt/glusterd.so(+0x474ac) >>>> [0x7f4c37fdd4ac] >>>> -->/build/install/lib/libglusterfs.so.0(dict_get_strn+0x179) >>>> [0x7f4c433d7673] ) 0-dict: key brick1.rdma_port, string type asked, has >>>> integer type [Invalid argument] >>>> [2019-01-06 16:27:51.647292] I [dict.c:2361:dict_get_strn] >>>> (-->/build/install/lib/glusterfs/6dev/xlator/mgmt/glusterd.so(+0xffa32) >>>> [0x7f4c38095a32] >>>> -->/build/install/lib/glusterfs/6dev/xlator/mgmt/glusterd.so(+0x474ac) >>>> [0x7f4c37fdd4ac] >>>> -->/build/install/lib/libglusterfs.so.0(dict_get_strn+0x179) >>>> [0x7f4c433d7673] ) 0-dict: key brick2.rdma_port, string type asked, has >>>> integer type [Invalid argument] >>>> [2019-01-06 16:27:51.647333] I [dict.c:2361:dict_get_strn] >>>> (-->/build/install/lib/glusterfs/6dev/xlator/mgmt/glusterd.so(+0xffa32) >>>> [0x7f4c38095a32] >>>> -->/build/install/lib/glusterfs/6dev/xlator/mgmt/glusterd.so(+0x474ac) >>>> [0x7f4c37fdd4ac] >>>> -->/build/install/lib/libglusterfs.so.0(dict_get_strn+0x179) >>>> [0x7f4c433d7673] ) 0-dict: key brick3.rdma_port, string type asked, has 
>>>> integer type [Invalid argument] >>>> [2019-01-06 16:27:51.647371] I [dict.c:2361:dict_get_strn] >>>> (-->/build/install/lib/glusterfs/6dev/xlator/mgmt/glusterd.so(+0xffa32) >>>> [0x7f4c38095a32] >>>> -->/build/install/lib/glusterfs/6dev/xlator/mgmt/glusterd.so(+0x474ac) >>>> [0x7f4c37fdd4ac] >>>> -->/build/install/lib/libglusterfs.so.0(dict_get_strn+0x179) >>>> [0x7f4c433d7673] ) 0-dict: key brick4.rdma_port, string type asked, has >>>> integer type [Invalid argument] >>>> [2019-01-06 16:27:51.647409] I [dict.c:2361:dict_get_strn] >>>> (-->/build/install/lib/glusterfs/6dev/xlator/mgmt/glusterd.so(+0xffa32) >>>> [0x7f4c38095a32] >>>> -->/build/install/lib/glusterfs/6dev/xlator/mgmt/glusterd.so(+0x474ac) >>>> [0x7f4c37fdd4ac] >>>> -->/build/install/lib/libglusterfs.so.0(dict_get_strn+0x179) >>>> [0x7f4c433d7673] ) 0-dict: key brick5.rdma_port, string type asked, has >>>> integer type [Invalid argument] >>>> [2019-01-06 16:27:51.647447] I [dict.c:2361:dict_get_strn] >>>> (-->/build/install/lib/glusterfs/6dev/xlator/mgmt/glusterd.so(+0xffa32) >>>> [0x7f4c38095a32] >>>> -->/build/install/lib/glusterfs/6dev/xlator/mgmt/glusterd.so(+0x474ac) >>>> [0x7f4c37fdd4ac] >>>> -->/build/install/lib/libglusterfs.so.0(dict_get_strn+0x179) >>>> [0x7f4c433d7673] ) 0-dict: key brick6.rdma_port, string type asked, has >>>> integer type [Invalid argument] >>>> [2019-01-06 16:27:51.649335] E [MSGID: 101191] >>>> [event-epoll.c:759:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch >>>> handler >>>> [2019-01-06 16:27:51.932871] I [MSGID: 106499] >>>> [glusterd-handler.c:4404:__glusterd_handle_status_volume] 0-management: >>>> Received status volume req for volume patchy >>>> >>>> It is just taking lot of time to get the status at this point. >>>> It looks like there could be some issue with connection or the handing >>>> of volume status when some bricks are down. >>>> >>> >>> The 'online_brick_count' check uses 'gluster volume status' to get some >>> information, and it does that several times (currently 7). Looking at >>> cmd_history.log, I see that after the 'online_brick_count' at line 70, only >>> one 'gluster volume status' has completed. Apparently the second 'gluster >>> volume status' is hung. 
>>> >>> In cli.log I see that the second 'gluster volume status' seems to have >>> started, but not finished: >>> >>> Normal run: >>> >>> [2019-01-08 16:36:43.628821] I [cli.c:834:main] 0-cli: Started running >>> gluster with version 6dev >>> [2019-01-08 16:36:43.808182] I [MSGID: 101190] >>> [event-epoll.c:675:event_dispatch_epoll_worker] 0-epoll: Started thread >>> with index 0 >>> [2019-01-08 16:36:43.808287] I [MSGID: 101190] >>> [event-epoll.c:675:event_dispatch_epoll_worker] 0-epoll: Started thread >>> with index 1 >>> [2019-01-08 16:36:43.808432] E [MSGID: 101191] >>> [event-epoll.c:759:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch >>> handler >>> [2019-01-08 16:36:43.816534] I [dict.c:1947:dict_get_uint32] >>> (-->gluster(cli_cmd_process+0x1e4) [0x40db50] >>> -->gluster(cli_cmd_volume_status_cbk+0x90) [0x415bec] >>> -->/build/install/lib/libglusterfs.so.0(dict_get_uint32+0x176) >>> [0x7fefe569456 >>> 9] ) 0-dict: key cmd, unsigned integer type asked, has integer type >>> [Invalid argument] >>> [2019-01-08 16:36:43.816716] I [dict.c:1947:dict_get_uint32] >>> (-->gluster(cli_cmd_volume_status_cbk+0x1cb) [0x415d27] >>> -->gluster(gf_cli_status_volume_all+0xc8) [0x42fa94] >>> -->/build/install/lib/libglusterfs.so.0(dict_get_uint32+0x176) [0x7f >>> efe5694569] ) 0-dict: key cmd, unsigned integer type asked, has integer >>> type [Invalid argument] >>> [2019-01-08 16:36:43.824437] I [input.c:31:cli_batch] 0-: Exiting with: 0 >>> >>> >>> Bad run: >>> >>> [2019-01-08 16:36:43.940361] I [cli.c:834:main] 0-cli: Started running >>> gluster with version 6dev >>> [2019-01-08 16:36:44.147364] I [MSGID: 101190] >>> [event-epoll.c:675:event_dispatch_epoll_worker] 0-epoll: Started thread >>> with index 0 >>> [2019-01-08 16:36:44.147477] I [MSGID: 101190] >>> [event-epoll.c:675:event_dispatch_epoll_worker] 0-epoll: Started thread >>> with index 1 >>> [2019-01-08 16:36:44.147583] E [MSGID: 101191] >>> [event-epoll.c:759:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch >>> handler >>> >>> >>> In glusterd.log it seems as if it hasn't received any status request. It >>> looks like the cli has not even connected to glusterd. >>> >> > Downloaded the logs for the recent failure from > https://build.gluster.org/job/regression-test-with-multiplex/1092/ and > based on the log scanning this is what I see: > > 1. The test executes with out any issues till line no 74 i.e. 
"TEST $CLI > volume start $V0 force" and cli.log along with cmd_history.log confirm the > same: > > cli.log > ==== > [2019-01-16 16:28:46.871877]:++++++++++ > G_LOG:./tests/bugs/ec/bug-1236065.t: TEST: 73 gluster --mode=script > --wignore volume start patchy force ++++++++++ > [2019-01-16 16:28:46.980780] I [cli.c:834:main] 0-cli: Started running > gluster with version 6dev > [2019-01-16 16:28:47.185996] I [MSGID: 101190] > [event-epoll.c:675:event_dispatch_epoll_worker] 0-epoll: Started thread > with index 0 > [2019-01-16 16:28:47.186113] I [MSGID: 101190] > [event-epoll.c:675:event_dispatch_epoll_worker] 0-epoll: Started thread > with index 1 > [2019-01-16 16:28:47.186234] E [MSGID: 101191] > [event-epoll.c:759:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch > handler > [2019-01-16 16:28:49.223376] I > [cli-rpc-ops.c:1448:gf_cli_start_volume_cbk] 0-cli: Received resp to start > volume <=== successfully processed the callback > [2019-01-16 16:28:49.223668] I [input.c:31:cli_batch] 0-: Exiting with: 0 > > cmd_history.log > ============ > [2019-01-16 16:28:49.220491] : volume start patchy force : SUCCESS > > However, in both cli and cmd_history log files these are the last set of > logs I see which indicates either the test script is completely paused. > There's no possibility I see that cli receiving this command and dropping > it completely as otherwise we should have atleast seen the "Started running > gluster with version 6dev" and "Exiting with" log entries. > > I could manage to reproduce this once locally in my system and then when I > ran command from another prompt, volume status and all other gluster basic > commands go through. I also inspected the processes and I don't see any > suspect of processes being hung. > > So the mystery continues and we need to see why the test script is not all > moving forward. > An additional thing that could be interesting: in all cases I've seen this test to hang, the next test shows an error during cleanup: Aborting. /mnt/nfs/1 could not be deleted, here are the left over items drwxr-xr-x. 2 root root 6 Jan 16 16:41 /d/backends drwxr-xr-x. 2 root root 4096 Jan 16 16:41 /mnt/glusterfs/0 drwxr-xr-x. 2 root root 4096 Jan 16 16:41 /mnt/glusterfs/1 drwxr-xr-x. 2 root root 4096 Jan 16 16:41 /mnt/glusterfs/2 drwxr-xr-x. 2 root root 4096 Jan 16 16:41 /mnt/glusterfs/3 drwxr-xr-x. 2 root root 4096 Jan 16 16:41 /mnt/nfs/0 drwxr-xr-x. 2 root root 4096 Jan 16 16:41 /mnt/nfs/1 Please correct the problem and try again. This is a bit weird, since this only happens after having removed all these directories with an 'rm -rf', and this command doesn't exit on the first error, so at least some of these directories should have been removed, even is the mount process is hung (all nfs mounts and fuse mounts 1, 2 and 3 are not used by the test). The only explanation I have is that the cleanup function is being executed twice concurrently (probably from two different scripts). The first cleanup is blocked (or is taking a lot of time) removing one of the directories. Meantime the other cleanup has completed and recreated the directories, so when the first one finally finishes, it finds all directories still there, writing the above messages. This would also mean that something is not properly killed between tests. Not sure if that's possible. This could match with your findings, since some commands executed on the second script could "unblock" whatever is blocked in the first one, causing it to progress and show the final error. Could this explain something ? 
> > >>> Xavi >>> >>> >>>> --- >>>> Ashish >>>> >>>> >>>> >>>> ------------------------------ >>>> *From: *"Mohit Agrawal" >>>> *To: *"Shyam Ranganathan" >>>> *Cc: *"Gluster Devel" >>>> *Sent: *Saturday, January 12, 2019 6:46:20 PM >>>> *Subject: *Re: [Gluster-devel] Regression health for release-5.next >>>> and release-6 >>>> >>>> Previous logs related to client not bricks, below are the brick logs >>>> >>>> [2019-01-12 12:25:25.893485]:++++++++++ >>>> G_LOG:./tests/bugs/ec/bug-1236065.t: TEST: 68 rm -f 0.o 10.o 11.o 12.o 13.o >>>> 14.o 15.o 16.o 17.o 18.o 19.o 1.o 2.o 3.o 4.o 5.o 6.o 7.o 8.o 9.o ++++++++++ >>>> The message "I [MSGID: 101016] [glusterfs3.h:746:dict_to_xdr] 0-dict: >>>> key 'trusted.ec.size' would not be sent on wire in the future [Invalid >>>> argument]" repeated 199 times between [2019-01-12 12:25:25.283989] and >>>> [2019-01-12 12:25:25.899532] >>>> [2019-01-12 12:25:25.903375] E [MSGID: 113001] >>>> [posix-inode-fd-ops.c:4617:_posix_handle_xattr_keyvalue_pair] >>>> 8-patchy-posix: fgetxattr failed on >>>> gfid=d91f6331-d394-479d-ab51-6bcf674ac3e0 while doing xattrop: >>>> Key:trusted.ec.dirty (Bad file descriptor) [Bad file descriptor] >>>> [2019-01-12 12:25:25.903468] E [MSGID: 115073] >>>> [server-rpc-fops_v2.c:1805:server4_fxattrop_cbk] 0-patchy-server: 1486: >>>> FXATTROP 2 (d91f6331-d394-479d-ab51-6bcf674ac3e0), client: >>>> CTX_ID:b785c2b0-3453-4a03-b129-19e6ceeb5346-GRAPH_ID:0-PID:24147-HOST:softserve-moagrawa-test.1-PC_NAME:patchy-client-1-RECON_NO:-1, >>>> error-xlator: patchy-posix [Bad file descriptor] >>>> >>>> >>>> Thanks, >>>> Mohit Agrawal >>>> >>>> On Sat, Jan 12, 2019 at 6:29 PM Mohit Agrawal >>>> wrote: >>>> >>>>> >>>>> For specific to "add-brick-and-validate-replicated-volume-options.t" i >>>>> have posted a patch https://review.gluster.org/22015. >>>>> For test case "ec/bug-1236065.t" I think the issue needs to be check >>>>> by ec team >>>>> >>>>> On the brick side, it is showing below logs >>>>> >>>>> >>>>>>>>>>>>>>>>> >>>>> >>>>> on wire in the future [Invalid argument] >>>>> The message "I [MSGID: 101016] [glusterfs3.h:746:dict_to_xdr] 0-dict: >>>>> key 'trusted.ec.dirty' would not be sent on wire in the future [Invalid >>>>> argument]" repeated 3 times between [2019-01-12 12:25:25.902828] and >>>>> [2019-01-12 12:25:25.902992] >>>>> [2019-01-12 12:25:25.903553] W [MSGID: 114031] >>>>> [client-rpc-fops_v2.c:1614:client4_0_fxattrop_cbk] 0-patchy-client-1: >>>>> remote operation failed [Bad file descriptor] >>>>> [2019-01-12 12:25:25.903998] W [MSGID: 122040] >>>>> [ec-common.c:1181:ec_prepare_update_cbk] 0-patchy-disperse-0: Failed to get >>>>> size and version : FOP : 'FXATTROP' failed on gfid >>>>> d91f6331-d394-479d-ab51-6bcf674ac3e0 [Input/output error] >>>>> [2019-01-12 12:25:25.904059] W [fuse-bridge.c:1907:fuse_unlink_cbk] >>>>> 0-glusterfs-fuse: 3259: UNLINK() /test/0.o => -1 (Input/output error) >>>>> >>>>> >>>>>>>>>>>>>>>>>>> >>>>> >>>>> Test case is getting timed out because "volume heal $V0 full" command >>>>> is stuck, look's like shd is getting stuck at getxattr >>>>> >>>>> >>>>>>>>>>>>>>. 
>>>>> >>>>> Thread 8 (Thread 0x7f83777fe700 (LWP 25552)): >>>>> #0 0x00007f83bb70d945 in pthread_cond_wait@@GLIBC_2.3.2 () from >>>>> /usr/lib64/libpthread.so.0 >>>>> #1 0x00007f83bc910e5b in syncop_getxattr (subvol=, >>>>> loc=loc at entry=0x7f83777fdbb0, dict=dict at entry=0x0, key=key at entry=0x7f83add06a28 >>>>> "trusted.ec.heal", xdata_in=xdata_in at entry=0x0, >>>>> xdata_out=xdata_out at entry=0x0) at syncop.c:1680 >>>>> #2 0x00007f83add02f27 in ec_shd_selfheal (healer=0x7f83a8030880, >>>>> child=, loc=0x7f83777fdbb0, full=) at >>>>> ec-heald.c:161 >>>>> #3 0x00007f83add0325b in ec_shd_full_heal (subvol=0x7f83a80094b0, >>>>> entry=, parent=0x7f83777fdde0, data=0x7f83a8030880) at >>>>> ec-heald.c:294 >>>>> #4 0x00007f83bc930ac2 in syncop_ftw (subvol=0x7f83a80094b0, >>>>> loc=loc at entry=0x7f83777fdde0, pid=pid at entry=-6, data=data at entry=0x7f83a8030880, >>>>> fn=fn at entry=0x7f83add03140 ) at syncop-utils.c:125 >>>>> #5 0x00007f83add03534 in ec_shd_full_sweep (healer=healer at entry=0x7f83a8030880, >>>>> inode=) at ec-heald.c:311 >>>>> #6 0x00007f83add0367b in ec_shd_full_healer (data=0x7f83a8030880) at >>>>> ec-heald.c:372 >>>>> #7 0x00007f83bb709e25 in start_thread () from >>>>> /usr/lib64/libpthread.so.0 >>>>> #8 0x00007f83bafd634d in clone () from /usr/lib64/libc.so.6 >>>>> Thread 7 (Thread 0x7f8376ffd700 (LWP 25553)): >>>>> #0 0x00007f83bb70d945 in pthread_cond_wait@@GLIBC_2.3.2 () from >>>>> /usr/lib64/libpthread.so.0 >>>>> #1 0x00007f83bc910e5b in syncop_getxattr (subvol=, >>>>> loc=loc at entry=0x7f8376ffcbb0, dict=dict at entry=0x0, key=key at entry=0x7f83add06a28 >>>>> "trusted.ec.heal", xdata_in=xdata_in at entry=0x0, >>>>> xdata_out=xdata_out at entry=0x0) at syncop.c:1680 >>>>> #2 0x00007f83add02f27 in ec_shd_selfheal (healer=0x7f83a80308f0, >>>>> child=, loc=0x7f8376ffcbb0, full=) at >>>>> ec-heald.c:161 >>>>> #3 0x00007f83add0325b in ec_shd_full_heal (subvol=0x7f83a800d110, >>>>> entry=, parent=0x7f8376ffcde0, data=0x7f83a80308f0) at >>>>> ec-heald.c:294 >>>>> #4 0x00007f83bc930ac2 in syncop_ftw (subvol=0x7f83a800d110, >>>>> loc=loc at entry=0x7f8376ffcde0, pid=pid at entry=-6, data=data at entry=0x7f83a80308f0, >>>>> fn=fn at entry=0x7f83add03140 ) at syncop-utils.c:125 >>>>> #5 0x00007f83add03534 in ec_shd_full_sweep (healer=healer at entry=0x7f83a80308f0, >>>>> inode=) at ec-heald.c:311 >>>>> #6 0x00007f83add0367b in ec_shd_full_healer (data=0x7f83a80308f0) at >>>>> ec-heald.c:372 >>>>> #7 0x00007f83bb709e25 in start_thread () from >>>>> /usr/lib64/libpthread.so.0 >>>>> #8 0x00007f83bafd634d in clone () from /usr/lib64/libc.so.6 >>>>> Thread 6 (Thread 0x7f83767fc700 (LWP 25554)): >>>>> #0 0x00007f83bb70d945 in pthread_cond_wait@@GLIBC_2.3.2 () from >>>>> /usr/lib64/libpthread.so.0 >>>>> #1 0x00007f83bc910e5b in syncop_getxattr (subvol=, >>>>> loc=loc at entry=0x7f83767fbbb0, dict=dict at entry=0x0, key=key at entry=0x7f83add06a28 >>>>> "trusted.ec.heal", xdata_in=xdata_in at entry=0x0, >>>>> xdata_out=xdata_out at entry=0x0) at syncop.c:1680 >>>>> #2 0x00007f83add02f27 in ec_shd_selfheal (healer=0x7f83a8030960, >>>>> child=, loc=0x7f83767fbbb0, full=) at >>>>> ec-heald.c:161 >>>>> #3 0x00007f83add0325b in ec_shd_full_heal (subvol=0x7f83a8010af0, >>>>> entry=, parent=0x7f83767fbde0, data=0x7f83a8030960) at >>>>> ec-heald.c:294 >>>>> #4 0x00007f83bc930ac2 in syncop_ftw (subvol=0x7f83a8010af0, >>>>> loc=loc at entry=0x7f83767fbde0, pid=pid at entry=-6, data=data at entry=0x7f83a8030960, >>>>> fn=fn at entry=0x7f83add03140 ) at syncop-utils.c:125 >>>>> #5 
0x00007f83add03534 in ec_shd_full_sweep (healer=healer at entry=0x7f83a8030960, >>>>> inode=) at ec-heald.c:311 >>>>> #6 0x00007f83add0367b in ec_shd_full_healer (data=0x7f83a8030960) at >>>>> ec-heald.c:372 >>>>> #7 0x00007f83bb709e25 in start_thread () from >>>>> /usr/lib64/libpthread.so.0 >>>>> #8 0x00007f83bafd634d in clone () from /usr/lib64/libc.so.6 >>>>> Thread 5 (Thread 0x7f8375ffb700 (LWP 25555)): >>>>> #0 0x00007f83bb70d945 in pthread_cond_wait@@GLIBC_2.3.2 () from >>>>> /usr/lib64/libpthread.so.0 >>>>> #1 0x00007f83bc910e5b in syncop_getxattr (subvol=, >>>>> loc=loc at entry=0x7f8375ffabb0, dict=dict at entry=0x0, key=key at entry=0x7f83add06a28 >>>>> "trusted.ec.heal", xdata_in=xdata_in at entry=0x0, >>>>> xdata_out=xdata_out at entry=0x0) at syncop.c:1680 >>>>> #2 0x00007f83add02f27 in ec_shd_selfheal (healer=0x7f83a80309d0, >>>>> child=, loc=0x7f8375ffabb0, full=) at >>>>> ec-heald.c:161 >>>>> #3 0x00007f83add0325b in ec_shd_full_heal (subvol=0x7f83a80144d0, >>>>> entry=, parent=0x7f8375ffade0, data=0x7f83a80309d0) at >>>>> ec-heald.c:294 >>>>> #4 0x00007f83bc930ac2 in syncop_ftw (subvol=0x7f83a80144d0, >>>>> loc=loc at entry=0x7f8375ffade0, pid=pid at entry=-6, data=data at entry=0x7f83a80309d0, >>>>> fn=fn at entry=0x7f83add03140 ) at syncop-utils.c:125 >>>>> #5 0x00007f83add03534 in ec_shd_full_sweep (healer=healer at entry=0x7f83a80309d0, >>>>> inode=) at ec-heald.c:311 >>>>> #6 0x00007f83add0367b in ec_shd_full_healer (data=0x7f83a80309d0) at >>>>> ec-heald.c:372 >>>>> #7 0x00007f83bb709e25 in start_thread () from >>>>> /usr/lib64/libpthread.so.0 >>>>> #8 0x00007f83bafd634d in clone () from /usr/lib64/libc.so.6 >>>>> Thread 4 (Thread 0x7f83757fa700 (LWP 25556)): >>>>> #0 0x00007f83bb70d945 in pthread_cond_wait@@GLIBC_2.3.2 () from >>>>> /usr/lib64/libpthread.so.0 >>>>> #1 0x00007f83bc910e5b in syncop_getxattr (subvol=, >>>>> loc=loc at entry=0x7f83757f9bb0, dict=dict at entry=0x0, key=key at entry=0x7f83add06a28 >>>>> "trusted.ec.heal", xdata_in=xdata_in at entry=0x0, >>>>> xdata_out=xdata_out at entry=0x0) at syncop.c:1680 >>>>> #2 0x00007f83add02f27 in ec_shd_selfheal (healer=0x7f83a8030a40, >>>>> child=, loc=0x7f83757f9bb0, full=) at >>>>> ec-heald.c:161 >>>>> #3 0x00007f83add0325b in ec_shd_full_heal (subvol=0x7f83a8017eb0, >>>>> entry=, parent=0x7f83757f9de0, data=0x7f83a8030a40) at >>>>> ec-heald.c:294 >>>>> #4 0x00007f83bc930ac2 in syncop_ftw (subvol=0x7f83a8017eb0, >>>>> loc=loc at entry=0x7f83757f9de0, pid=pid at entry=-6, data=data at entry=0x7f83a8030a40, >>>>> fn=fn at entry=0x7f83add03140 ) at syncop-utils.c:125 >>>>> #5 0x00007f83add03534 in ec_shd_full_sweep (healer=healer at entry=0x7f83a8030a40, >>>>> inode=) at ec-heald.c:311 >>>>> #6 0x00007f83add0367b in ec_shd_full_healer (data=0x7f83a8030a40) at >>>>> ec-heald.c:372 >>>>> #7 0x00007f83bb709e25 in start_thread () from >>>>> /usr/lib64/libpthread.so.0 >>>>> #8 0x00007f83bafd634d in clone () from /usr/lib64/libc.so.6 >>>>> Thread 3 (Thread 0x7f8374ff9700 (LWP 25557)): >>>>> #0 0x00007f83bb70d945 in pthread_cond_wait@@GLIBC_2.3.2 () from >>>>> /usr/lib64/libpthread.so.0 >>>>> #1 0x00007f83bc910e5b in syncop_getxattr (subvol=, >>>>> loc=loc at entry=0x7f8374ff8bb0, dict=dict at entry=0x0, key=key at entry=0x7f83add06a28 >>>>> "trusted.ec.heal", xdata_in=xdata_in at entry=0x0, >>>>> xdata_out=xdata_out at entry=0x0) at syncop.c:1680 >>>>> #2 0x00007f83add02f27 in ec_shd_selfheal (healer=0x7f83a8030ab0, >>>>> child=, loc=0x7f8374ff8bb0, full=) at >>>>> ec-heald.c:161 >>>>> #3 0x00007f83add0325b in 
ec_shd_full_heal (subvol=0x7f83a801b890, >>>>> entry=, parent=0x7f8374ff8de0, data=0x7f83a8030ab0) at >>>>> ec-heald.c:294 >>>>> #4 0x00007f83bc930ac2 in syncop_ftw (subvol=0x7f83a801b890, >>>>> loc=loc at entry=0x7f8374ff8de0, pid=pid at entry=-6, data=data at entry=0x7f83a8030ab0, >>>>> fn=fn at entry=0x7f83add03140 ) at syncop-utils.c:125 >>>>> #5 0x00007f83add03534 in ec_shd_full_sweep (healer=healer at entry=0x7f83a8030ab0, >>>>> inode=) at ec-heald.c:311 >>>>> #6 0x00007f83add0367b in ec_shd_full_healer (data=0x7f83a8030ab0) at >>>>> ec-heald.c:372 >>>>> #7 0x00007f83bb709e25 in start_thread () from >>>>> /usr/lib64/libpthread.so.0 >>>>> #8 0x00007f83bafd634d in clone () from /usr/lib64/libc.so.6 >>>>> Thread 2 (Thread 0x7f8367fff700 (LWP 25558)): >>>>> #0 0x00007f83bb70d945 in pthread_cond_wait@@GLIBC_2.3.2 () from >>>>> /usr/lib64/libpthread.so.0 >>>>> #1 0x00007f83bc910e5b in syncop_getxattr (subvol=, >>>>> loc=loc at entry=0x7f8367ffebb0, dict=dict at entry=0x0, key=key at entry=0x7f83add06a28 >>>>> "trusted.ec.heal", xdata_in=xdata_in at entry=0x0, >>>>> xdata_out=xdata_out at entry=0x0) at syncop.c:1680 >>>>> #2 0x00007f83add02f27 in ec_shd_selfheal (healer=0x7f83a8030b20, >>>>> child=, loc=0x7f8367ffebb0, full=) at >>>>> ec-heald.c:161 >>>>> #3 0x00007f83add0325b in ec_shd_full_heal (subvol=0x7f83a801f270, >>>>> entry=, parent=0x7f8367ffede0, data=0x7f83a8030b20) at >>>>> ec-heald.c:294 >>>>> #4 0x00007f83bc930ac2 in syncop_ftw (subvol=0x7f83a801f270, >>>>> loc=loc at entry=0x7f8367ffede0, pid=pid at entry=-6, data=data at entry=0x7f83a8030b20, >>>>> fn=fn at entry=0x7f83add03140 ) at syncop-utils.c:125 >>>>> #5 0x00007f83add03534 in ec_shd_full_sweep (healer=healer at entry=0x7f83a8030b20, >>>>> inode=) at ec-heald.c:311 >>>>> #6 0x00007f83add0367b in ec_shd_full_healer (data=0x7f83a8030b20) at >>>>> ec-heald.c:372 >>>>> #7 0x00007f83bb709e25 in start_thread () from >>>>> /usr/lib64/libpthread.so.0 >>>>> #8 0x00007f83bafd634d in clone () from /usr/lib64/libc.so.6 >>>>> Thread 1 (Thread 0x7f83bcdd1780 (LWP 25383)): >>>>> #0 0x00007f83bb70af57 in pthread_join () from >>>>> /usr/lib64/libpthread.so.0 >>>>> #1 0x00007f83bc92eff8 in event_dispatch_epoll >>>>> (event_pool=0x55af0a6dd560) at event-epoll.c:846 >>>>> #2 0x000055af0a4116b8 in main (argc=15, argv=0x7fff75610898) at >>>>> glusterfsd.c:2848 >>>>> >>>>> >>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>. >>>>> >>>>> Thanks, >>>>> Mohit Agrawal >>>>> >>>>> On Fri 11 Jan, 2019, 21:20 Shyam Ranganathan >>>> wrote: >>>>> >>>>>> We can check health on master post the patch as stated by Mohit below. >>>>>> >>>>>> Release-5 is causing some concerns as we need to tag the release >>>>>> yesterday, but we have the following 2 tests failing or coredumping >>>>>> pretty regularly, need attention on these. >>>>>> >>>>>> ec/bug-1236065.t >>>>>> glusterd/add-brick-and-validate-replicated-volume-options.t >>>>>> >>>>>> Shyam >>>>>> On 1/10/19 6:20 AM, Mohit Agrawal wrote: >>>>>> > I think we should consider regression-builds after merged the patch >>>>>> > (https://review.gluster.org/#/c/glusterfs/+/21990/) >>>>>> > as we know this patch introduced some delay. >>>>>> > >>>>>> > Thanks, >>>>>> > Mohit Agrawal >>>>>> > >>>>>> > On Thu, Jan 10, 2019 at 3:55 PM Atin Mukherjee >>>>> > > wrote: >>>>>> > >>>>>> > Mohit, Sanju - request you to investigate the failures related >>>>>> to >>>>>> > glusterd and brick-mux and report back to the list. 
>>>>>> > >>>>>> > On Thu, Jan 10, 2019 at 12:25 AM Shyam Ranganathan >>>>>> > > wrote: >>>>>> > >>>>>> > Hi, >>>>>> > >>>>>> > As part of branching preparation next week for release-6, >>>>>> please >>>>>> > find >>>>>> > test failures and respective test links here [1]. >>>>>> > >>>>>> > The top tests that are failing/dumping-core are as below and >>>>>> > need attention, >>>>>> > - ec/bug-1236065.t >>>>>> > - >>>>>> glusterd/add-brick-and-validate-replicated-volume-options.t >>>>>> > - readdir-ahead/bug-1390050.t >>>>>> > - glusterd/brick-mux-validation.t >>>>>> > - bug-1432542-mpx-restart-crash.t >>>>>> > >>>>>> > Others of interest, >>>>>> > - replicate/bug-1341650.t >>>>>> > >>>>>> > Please file a bug if needed against the test case and >>>>>> report the >>>>>> > same >>>>>> > here, in case a problem is already addressed, then do send >>>>>> back the >>>>>> > patch details that addresses this issue as a response to >>>>>> this mail. >>>>>> > >>>>>> > Thanks, >>>>>> > Shyam >>>>>> > >>>>>> > [1] Regression failures: >>>>>> > https://hackmd.io/wsPgKjfJRWCP8ixHnYGqcA?view >>>>>> > _______________________________________________ >>>>>> > Gluster-devel mailing list >>>>>> > Gluster-devel at gluster.org >>>>> > >>>>>> > https://lists.gluster.org/mailman/listinfo/gluster-devel >>>>>> > >>>>>> > >>>>>> >>>>> >>>> _______________________________________________ >>>> Gluster-devel mailing list >>>> Gluster-devel at gluster.org >>>> https://lists.gluster.org/mailman/listinfo/gluster-devel >>>> >>>> _______________________________________________ >>>> Gluster-devel mailing list >>>> Gluster-devel at gluster.org >>>> https://lists.gluster.org/mailman/listinfo/gluster-devel >>> >>> _______________________________________________ >>> Gluster-devel mailing list >>> Gluster-devel at gluster.org >>> https://lists.gluster.org/mailman/listinfo/gluster-devel >> >> -- >> --Atin >> _______________________________________________ >> Gluster-devel mailing list >> Gluster-devel at gluster.org >> https://lists.gluster.org/mailman/listinfo/gluster-devel > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From srangana at redhat.com Fri Jan 18 20:21:09 2019 From: srangana at redhat.com (Shyam Ranganathan) Date: Fri, 18 Jan 2019 15:21:09 -0500 Subject: [Gluster-devel] Release 6: Kick off! In-Reply-To: References: <03fb0aaa-66c4-d9b1-1228-3ce527ce5fb9@redhat.com> Message-ID: <19f19863-61c4-9f2b-98d2-19a435b8ec78@redhat.com> On 12/6/18 9:34 AM, Shyam Ranganathan wrote: > On 11/6/18 11:34 AM, Shyam Ranganathan wrote: >> ## Schedule > > We have decided to postpone release-6 by a month, to accommodate for > late enhancements and the drive towards getting what is required for the > GCS project [1] done in core glusterfs. > > This puts the (modified) schedule for Release-6 as below, > > Working backwards on the schedule, here's what we have: > - Announcement: Week of Mar 4th, 2019 > - GA tagging: Mar-01-2019 > - RC1: On demand before GA > - RC0: Feb-04-2019 > - Late features cut-off: Week of Jan-21st, 2018 > - Branching (feature cutoff date): Jan-14-2018 > (~45 days prior to branching) We are slightly past the branching date, I would like to branch early next week, so please respond with a list of patches that need to be part of the release and are still pending a merge, will help address review focus on the same and also help track it down and branch the release. 
Thanks, Shyam From emteeoh at gmail.com Sun Jan 20 18:43:55 2019 From: emteeoh at gmail.com (Richard Betel) Date: Sun, 20 Jan 2019 13:43:55 -0500 Subject: [Gluster-devel] Building Glusterfs-5.3 on armhf Message-ID: I've got some odroid HC2's running debian 9 that I'd like to run gluster on, but I want to run something current, not 3.8! So I'm trying to build 5.3, but I can't get through the ./configure. At first, I forgot to run autogen, so I was using whatever configure I had, and it would error out on sqlite, even though I have the sqlite3 dev libraries installed. Anyhow, I realized my mistake and ran autogen.sh. Now configure dies on libuuid, which is also installed. Before autogen it got well past that check. Here are the last few lines: checking sys/extattr.h usability... no checking sys/extattr.h presence... no checking for sys/extattr.h... no checking openssl/dh.h usability... yes checking openssl/dh.h presence... yes checking for openssl/dh.h... yes checking openssl/ecdh.h usability... yes checking openssl/ecdh.h presence... yes checking for openssl/ecdh.h... yes checking for pow in -lm... yes ./configure: line 13788: syntax error near unexpected token `UUID,' ./configure: line 13788: `PKG_CHECK_MODULES(UUID, uuid,' Here's the config block that fails (with some context): PKG_CHECK_MODULES(UUID, uuid, have_uuid=yes AC_DEFINE(HAVE_LIBUUID, 1, [have libuuid.so]) PKGCONFIG_UUID=uuid, have_uuid=no) if test x$have_uuid = xyes; then HAVE_LIBUUID_TRUE= HAVE_LIBUUID_FALSE='#' else HAVE_LIBUUID_TRUE='#' HAVE_LIBUUID_FALSE= fi I tried putting "echo FOO" before the PKG_CHECK_MODULES and it outputs correctly, so I'm pretty sure the problem isn't a dropped quote or parenthesis. Any suggestions on what to look for to debug this? -------------- next part -------------- An HTML attachment was scrubbed... URL: From jenkins at build.gluster.org Mon Jan 21 01:45:03 2019 From: jenkins at build.gluster.org (jenkins at build.gluster.org) Date: Mon, 21 Jan 2019 01:45:03 +0000 (UTC) Subject: [Gluster-devel] Weekly Untriaged Bugs Message-ID: <1672141306.8.1548035103788.JavaMail.jenkins@jenkins-el7.rht.gluster.org> [...truncated 6 lines...]
https://bugzilla.redhat.com/1667168 / arbiter: Thin Arbiter documentation refers commands don't exist "glustercli' https://bugzilla.redhat.com/1665145 / core: Writes on Gluster 5 volumes fail with EIO when "cluster.consistent-metadata" is set https://bugzilla.redhat.com/1663337 / doc: Gluster documentation on quorum-reads option is incorrect https://bugzilla.redhat.com/1663205 / fuse: List dictionary is too slow https://bugzilla.redhat.com/1664524 / geo-replication: Non-root geo-replication session goes to faulty state, when the session is started https://bugzilla.redhat.com/1662178 / glusterd: Compilation fails for xlators/mgmt/glusterd/src with error "undefined reference to `dlclose'" https://bugzilla.redhat.com/1663247 / glusterd: remove static memory allocations from code https://bugzilla.redhat.com/1663519 / gluster-smb: Memory leak when smb.conf has "store dos attributes = yes" https://bugzilla.redhat.com/1666326 / open-behind: reopening bug 1405147: Failed to dispatch handler: glusterfs seems to check for "write permission" instead for "file owner" during open() when writing to a file https://bugzilla.redhat.com/1665361 / project-infrastructure: Alerts for offline nodes https://bugzilla.redhat.com/1663780 / project-infrastructure: On docs.gluster.org, we should convert spaces in folder or file names to 301 redirects to hypens https://bugzilla.redhat.com/1666634 / protocol: nfs client cannot compile files on dispersed volume https://bugzilla.redhat.com/1665677 / rdma: volume create and transport change with rdma failed https://bugzilla.redhat.com/1664215 / read-ahead: Toggling readdir-ahead translator off causes some clients to umount some of its volumes https://bugzilla.redhat.com/1661895 / replicate: [disperse] Dump respective itables in EC to statedumps. https://bugzilla.redhat.com/1662557 / replicate: glusterfs process crashes, causing "Transport endpoint not connected". https://bugzilla.redhat.com/1664398 / tests: ./tests/00-geo-rep/00-georep-verify-setup.t does not work with ./run-tests-in-vagrant.sh [...truncated 2 lines...] -------------- next part -------------- A non-text attachment was scrubbed... Name: build.log Type: application/octet-stream Size: 2423 bytes Desc: not available URL: From ndevos at redhat.com Mon Jan 21 13:49:00 2019 From: ndevos at redhat.com (Niels de Vos) Date: Mon, 21 Jan 2019 14:49:00 +0100 Subject: [Gluster-devel] Building Glusterfs-5.3 on armhf In-Reply-To: References: Message-ID: <20190121134900.GE2361@ndevos-x270> On Sun, Jan 20, 2019 at 01:43:55PM -0500, Richard Betel wrote: > I've got some odroid HC2's running debian 9 that i'd like to run gluster > on, but I want to run something current, not 3.8! So I'm trying to build > 5.3, but I can't get through the./configure. > > At first, I forgot to run autogen, so i was using whatever configure I had, > and it would error out on sqlite, even though I have the sqlite3 dev > libraries installed. Anyhow, I realized my mistake, and ran autogen.sh . > Now configure dies on libuuid which is also installed. before autogen it > got well past it. here's the last few lines: > checking sys/extattr.h usability... no > checking sys/extattr.h presence... no > checking for sys/extattr.h... no > checking openssl/dh.h usability... yes > checking openssl/dh.h presence... yes > checking for openssl/dh.h... yes > checking openssl/ecdh.h usability... yes > checking openssl/ecdh.h presence... yes > checking for openssl/ecdh.h... yes > checking for pow in -lm... 
yes > ./configure: line 13788: syntax error near unexpected token `UUID,' > ./configure: line 13788: `PKG_CHECK_MODULES(UUID, uuid,' > > Here's the config line that fails (with some: > PKG_CHECK_MODULES(UUID, uuid, > have_uuid=yes > AC_DEFINE(HAVE_LIBUUID, 1, [have libuuid.so]) > PKGCONFIG_UUID=uuid, > have_uuid=no) > if test x$have_uuid = xyes; then > HAVE_LIBUUID_TRUE= > HAVE_LIBUUID_FALSE='#' > else > HAVE_LIBUUID_TRUE='#' > HAVE_LIBUUID_FALSE= > fi > > I tried putting "echo FOO" before the PKG_CHECK_MODULES and it outputs > correctly, so I'm pretty sure the problem isn't a dropped quote or > parenthesis. > > Any suggestions on what to look for to debug this? You might be missing the PKG_CHECK_MODULES macro. Can you make sure you have pkg-config installed? Niels From emteeoh at gmail.com Mon Jan 21 21:30:23 2019 From: emteeoh at gmail.com (Richard Betel) Date: Mon, 21 Jan 2019 16:30:23 -0500 Subject: [Gluster-devel] Building Glusterfs-5.3 on armhf In-Reply-To: <20190121134900.GE2361@ndevos-x270> References: <20190121134900.GE2361@ndevos-x270> Message-ID: On Mon, 21 Jan 2019 at 08:49, Niels de Vos wrote: > > > You might be missing the PKG_CHECK_MODULES macro. Can you make sure you > have pkg-config installed? > > Niels > That did it.Thanks! -------------- next part -------------- An HTML attachment was scrubbed... URL: From atumball at redhat.com Tue Jan 22 09:46:26 2019 From: atumball at redhat.com (Amar Tumballi Suryanarayan) Date: Tue, 22 Jan 2019 15:16:26 +0530 Subject: [Gluster-devel] Maintainer's meeting: Jan 21st, 2019 Message-ID: BJ Link - Bridge: https://bluejeans.com/217609845 - Watch: https://bluejeans.com/s/PAnE5 Attendance - Nigel Babu, Amar, Nithya, Shyam, Sunny, Milind (joined late). Agenda - GlusterFS - v6.0 - Are we ready for branching? - Can we consider getting https://review.gluster.org/20636 (lock free thread pool) as an option in the code, so we can have it? - Lets try to keep it as an option, and backport it, if not ready by end of this week. - Fencing? - Most probable to make it. - python3 support for glusterfind - https://review.gluster.org/#/c/glusterfs/+/21845/ - Self-heal daemon multiplexing? - Reflink? - Any other performance enhancements? - Infra Updates - Moving to new cloud vendor this week. Expect some flakiness. This is on a timeline we do not control and already quite significantly delayed. - Going to delete old master builds from http://artifacts.ci.centos.org/gluster/nightly/ - Not deleting the release branch artifacts. - Performance regression test bed - Have machines, can we get started with bare minimum tests - All we need is the result to be out in public - Basic tests are present. Some more test failures, so resolving that should be good enough. - Will be picked up after above changes. - Round Table - Have a look at website and suggest what more is required. -- Amar Tumballi (amarts) -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From ravishankar at redhat.com Tue Jan 22 10:04:30 2019 From: ravishankar at redhat.com (Ravishankar N) Date: Tue, 22 Jan 2019 15:34:30 +0530 Subject: [Gluster-devel] [Gluster-users] Self/Healing process after node maintenance In-Reply-To: <82F3A23F-7C04-45C4-9685-0904032333A7@gmail.com> References: <82F3A23F-7C04-45C4-9685-0904032333A7@gmail.com> Message-ID: <640b8701-7bb6-9a4e-f713-bd606ffd2f47@redhat.com> On 01/22/2019 02:57 PM, Martin Toth wrote: > Hi all, > > I just want to ensure myself how self-healing process exactly works, because I need to turn one of my nodes down for maintenance. > I have replica 3 setup. Nothing complicated. 3 nodes, 1 volume, 1 brick per node (ZFS pool). All nodes running Qemu VMs and disks of VMs are on Gluster volume. > > I want to turn off node1 for maintenance. If I will migrate all VMs to node2 and node3 and shutdown node1, I suppose everything will be running without downtime. (2 nodes of 3 will be online) Yes it should. Before you `shutdown` a node, kill all the gluster processes on it, i.e. `pkill gluster`. > > My question is if I will start up node1 after maintenance and node1 will be done back online in running state, this will trigger self-healing process on all disk files of all VMs.. will this healing process be only and only on node1? The list of files needing heal on node1 is captured on the other 2 nodes that were up, so the selfheal daemons on those nodes will do the heals. > Can node2 and node3 run VMs without problem while node1 will be healing these files? Yes. You might notice some performance drop if there are a lot of heals happening though. > I want to ensure myself this files (VM disks) will not get 'locked' on node2 and node3 while self-healing will be in process on node1. Heal won't block I/O from clients indefinitely. If both are writing to an overlapping offset, one of them (i.e. either heal or client I/O) will get the lock, do its job and release the lock so that the other can acquire it and continue. HTH, Ravi > > Thanks for clarification in advance. > > BR! > _______________________________________________ > Gluster-users mailing list > Gluster-users at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-users From george.lian at nokia-sbell.com Wed Jan 23 05:53:19 2019 From: george.lian at nokia-sbell.com (Lian, George (NSB - CN/Hangzhou)) Date: Wed, 23 Jan 2019 05:53:19 +0000 Subject: [Gluster-devel] glusterfs coredump Message-ID: <1d67e7634d5c453085fb80f0d7e86ee1@nokia-sbell.com> Hi, GlusterFS experts, We have encountered a coredump of the client process 'glusterfs' recently, and it can be reproduced more easily when the IO load and CPU/memory load are high during stability testing. Our glusterfs release is 3.12.2. I have copied the call trace of the core dump below, and I have some questions; I hope I can get some help from you. 1) Have you encountered a related issue? From the call trace, we can see that the fd variable looks abnormal in its 'refcount' and 'inode' fields. For wb_inode->this, it has the invalid value 0xffffffffffffff00; is the value 0xffffffffffffff00 meaningful in some way? In every coredump that occurred, the value of inode->this was the same, 0xffffffffffffff00. 2) When I checked the source code, I found that in the function wb_enqueue_common, __wb_request_unref is used instead of wb_request_unref, and we can see that although wb_request_unref is defined, it is never used!
Firstly it seems some strange, secondly, in wb_request_unref, there are lock mechanism to avoid race condition, but __wb_request_unref without those mechanism, and we could see there are more occurrence called from __wb_request_unref, will it lead to race issue? [Current thread is 1 (Thread 0x7f54e82a3700 (LWP 6078))] (gdb) bt #0 0x00007f54e623197c in wb_fulfill (wb_inode=0x7f54d4066bd0, liabilities=0x7f54d0824440) at write-behind.c:1155 #1 0x00007f54e6233662 in wb_process_queue (wb_inode=0x7f54d4066bd0) at write-behind.c:1728 #2 0x00007f54e6234039 in wb_writev (frame=0x7f54d406d6c0, this=0x7f54e0014b10, fd=0x7f54d8019d70, vector=0x7f54d0018000, count=1, offset=33554431, flags=32770, iobref=0x7f54d021ec20, xdata=0x0) at write-behind.c:1842 #3 0x00007f54e6026fcb in du_writev_resume (ret=0, frame=0x7f54d0002260, opaque=0x7f54d0002260) at disk-usage.c:490 #4 0x00007f54ece07160 in synctask_wrap () at syncop.c:377 #5 0x00007f54eb3a2660 in ?? () from /lib64/libc.so.6 #6 0x0000000000000000 in ?? () (gdb) p wb_inode $6 = (wb_inode_t *) 0x7f54d4066bd0 (gdb) p wb_inode->this $1 = (xlator_t *) 0xffffffffffffff00 (gdb) frame 1 #1 0x00007f54e6233662 in wb_process_queue (wb_inode=0x7f54d4066bd0) at write-behind.c:1728 1728 in write-behind.c (gdb) p wind_failure $2 = 0 (gdb) p *wb_inode $3 = {window_conf = 35840637416824320, window_current = 35840643167805440, transit = 35839681019027968, all = {next = 0xb000, prev = 0x7f54d4066bd000}, todo = {next = 0x7f54deadc0de00, prev = 0x7f54e00489e000}, liability = {next = 0x7f54000000a200, prev = 0xb000}, temptation = {next = 0x7f54d4066bd000, prev = 0x7f54deadc0de00}, wip = {next = 0x7f54e00489e000, prev = 0x7f54000000a200}, gen = 45056, size = 35840591659782144, lock = {spinlock = 0, mutex = {__data = {__lock = 0, __count = 8344798, __owner = 0, __nusers = 8344799, __kind = 41472, __spins = 21504, __elision = 127, __list = { __prev = 0xb000, __next = 0x7f54d4066bd000}}, __size = "\000\000\000\000\336T\177\000\000\000\000\000\337T\177\000\000\242\000\000\000T\177\000\000\260\000\000\000\000\000\000\000\320k\006\324T\177", __align = 35840634501726208}}, this = 0xffffffffffffff00, dontsync = -1} (gdb) frame 2 #2 0x00007f54e6234039 in wb_writev (frame=0x7f54d406d6c0, this=0x7f54e0014b10, fd=0x7f54d8019d70, vector=0x7f54d0018000, count=1, offset=33554431, flags=32770, iobref=0x7f54d021ec20, xdata=0x0) at write-behind.c:1842 1842 in write-behind.c (gdb) p fd $4 = (fd_t *) 0x7f54d8019d70 (gdb) p *fd $5 = {pid = 140002378149040, flags = -670836240, refcount = 32596, inode_list = {next = 0x7f54d8019d80, prev = 0x7f54d8019d80}, inode = 0x0, lock = {spinlock = -536740032, mutex = {__data = { __lock = -536740032, __count = 32596, __owner = -453505333, __nusers = 32596, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = "@\377\001\340T\177\000\000\313\016\370\344T\177", '\000' , __align = 140002512207680}}, _ctx = 0xffffffff, xl_count = 0, lk_ctx = 0x0, anonymous = (unknown: 3623984496)} (gdb) Thanks & Best Regards, George -------------- next part -------------- An HTML attachment was scrubbed... URL: From rkavunga at redhat.com Wed Jan 23 10:52:42 2019 From: rkavunga at redhat.com (RAFI KC) Date: Wed, 23 Jan 2019 16:22:42 +0530 Subject: [Gluster-devel] Release 6: Kick off! In-Reply-To: <19f19863-61c4-9f2b-98d2-19a435b8ec78@redhat.com> References: <03fb0aaa-66c4-d9b1-1228-3ce527ce5fb9@redhat.com> <19f19863-61c4-9f2b-98d2-19a435b8ec78@redhat.com> Message-ID: There are three patches that I'm working for Gluster-6. 
[1] : https://review.gluster.org/#/c/glusterfs/+/22075/ [2] : https://review.gluster.org/#/c/glusterfs/+/21333/ [3] : https://review.gluster.org/#/c/glusterfs/+/21720/ Regards Rafi KC On 1/19/19 1:51 AM, Shyam Ranganathan wrote: > On 12/6/18 9:34 AM, Shyam Ranganathan wrote: >> On 11/6/18 11:34 AM, Shyam Ranganathan wrote: >>> ## Schedule >> We have decided to postpone release-6 by a month, to accommodate for >> late enhancements and the drive towards getting what is required for the >> GCS project [1] done in core glusterfs. >> >> This puts the (modified) schedule for Release-6 as below, >> >> Working backwards on the schedule, here's what we have: >> - Announcement: Week of Mar 4th, 2019 >> - GA tagging: Mar-01-2019 >> - RC1: On demand before GA >> - RC0: Feb-04-2019 >> - Late features cut-off: Week of Jan-21st, 2018 >> - Branching (feature cutoff date): Jan-14-2018 >> (~45 days prior to branching) > We are slightly past the branching date, I would like to branch early > next week, so please respond with a list of patches that need to be part > of the release and are still pending a merge, will help address review > focus on the same and also help track it down and branch the release. > > Thanks, Shyam > _______________________________________________ > Gluster-devel mailing list > Gluster-devel at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-devel From aspandey at redhat.com Wed Jan 23 11:03:19 2019 From: aspandey at redhat.com (Ashish Pandey) Date: Wed, 23 Jan 2019 06:03:19 -0500 (EST) Subject: [Gluster-devel] Release 6: Kick off! In-Reply-To: References: <03fb0aaa-66c4-d9b1-1228-3ce527ce5fb9@redhat.com> <19f19863-61c4-9f2b-98d2-19a435b8ec78@redhat.com> Message-ID: <1959334041.59187316.1548241399641.JavaMail.zimbra@redhat.com> Following is the patch I am working and targeting - https://review.gluster.org/#/c/glusterfs/+/21933/ It is under review phase and yet to be merged. -- Ashish ----- Original Message ----- From: "RAFI KC" To: "Shyam Ranganathan" , "GlusterFS Maintainers" , "Gluster Devel" Sent: Wednesday, January 23, 2019 4:22:42 PM Subject: Re: [Gluster-devel] Release 6: Kick off! There are three patches that I'm working for Gluster-6. [1] : https://review.gluster.org/#/c/glusterfs/+/22075/ [2] : https://review.gluster.org/#/c/glusterfs/+/21333/ [3] : https://review.gluster.org/#/c/glusterfs/+/21720/ Regards Rafi KC On 1/19/19 1:51 AM, Shyam Ranganathan wrote: > On 12/6/18 9:34 AM, Shyam Ranganathan wrote: >> On 11/6/18 11:34 AM, Shyam Ranganathan wrote: >>> ## Schedule >> We have decided to postpone release-6 by a month, to accommodate for >> late enhancements and the drive towards getting what is required for the >> GCS project [1] done in core glusterfs. >> >> This puts the (modified) schedule for Release-6 as below, >> >> Working backwards on the schedule, here's what we have: >> - Announcement: Week of Mar 4th, 2019 >> - GA tagging: Mar-01-2019 >> - RC1: On demand before GA >> - RC0: Feb-04-2019 >> - Late features cut-off: Week of Jan-21st, 2018 >> - Branching (feature cutoff date): Jan-14-2018 >> (~45 days prior to branching) > We are slightly past the branching date, I would like to branch early > next week, so please respond with a list of patches that need to be part > of the release and are still pending a merge, will help address review > focus on the same and also help track it down and branch the release. 
> > Thanks, Shyam > _______________________________________________ > Gluster-devel mailing list > Gluster-devel at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-devel _______________________________________________ Gluster-devel mailing list Gluster-devel at gluster.org https://lists.gluster.org/mailman/listinfo/gluster-devel -------------- next part -------------- An HTML attachment was scrubbed... URL: From srangana at redhat.com Wed Jan 23 15:12:29 2019 From: srangana at redhat.com (Shyam Ranganathan) Date: Wed, 23 Jan 2019 10:12:29 -0500 Subject: [Gluster-devel] Release 6: Kick off! In-Reply-To: References: <03fb0aaa-66c4-d9b1-1228-3ce527ce5fb9@redhat.com> <19f19863-61c4-9f2b-98d2-19a435b8ec78@redhat.com> Message-ID: On 1/23/19 5:52 AM, RAFI KC wrote: > There are three patches that I'm working for Gluster-6. > > [1] : https://review.gluster.org/#/c/glusterfs/+/22075/ We discussed mux for shd in the maintainers meeting, and decided that this would be for the next release, as the patchset is not ready (branching is today, if I get the time to get it done). > > [2] : https://review.gluster.org/#/c/glusterfs/+/21333/ Ack! in case this is not in by branching we can backport the same > > [3] : https://review.gluster.org/#/c/glusterfs/+/21720/ Bug fix, can be backported post branching as well, so again ack! Thanks for responding. From srangana at redhat.com Wed Jan 23 15:13:32 2019 From: srangana at redhat.com (Shyam Ranganathan) Date: Wed, 23 Jan 2019 10:13:32 -0500 Subject: [Gluster-devel] Release 6: Kick off! In-Reply-To: <1959334041.59187316.1548241399641.JavaMail.zimbra@redhat.com> References: <03fb0aaa-66c4-d9b1-1228-3ce527ce5fb9@redhat.com> <19f19863-61c4-9f2b-98d2-19a435b8ec78@redhat.com> <1959334041.59187316.1548241399641.JavaMail.zimbra@redhat.com> Message-ID: <75cb1e52-fa8b-2a4e-67b3-6c9deb6d9909@redhat.com> On 1/23/19 6:03 AM, Ashish Pandey wrote: > > Following is the patch I am working and targeting -? > https://review.gluster.org/#/c/glusterfs/+/21933/ This is a bug fix, and the patch size at the moment is also small in lines changed. Hence, even if it misses branching the fix can be backported. Thanks for the heads up! From skoduri at redhat.com Thu Jan 24 08:23:07 2019 From: skoduri at redhat.com (Soumya Koduri) Date: Thu, 24 Jan 2019 13:53:07 +0530 Subject: [Gluster-devel] Release 6: Kick off! In-Reply-To: <75cb1e52-fa8b-2a4e-67b3-6c9deb6d9909@redhat.com> References: <03fb0aaa-66c4-d9b1-1228-3ce527ce5fb9@redhat.com> <19f19863-61c4-9f2b-98d2-19a435b8ec78@redhat.com> <1959334041.59187316.1548241399641.JavaMail.zimbra@redhat.com> <75cb1e52-fa8b-2a4e-67b3-6c9deb6d9909@redhat.com> Message-ID: <9d9f14cd-fa1b-de3d-ac48-7720d02b99b2@redhat.com> Hi Shyam, Sorry for the late response. I just realized that we had two more new APIs glfs_setattr/fsetattr which uses 'struct stat' made public [1]. As mentioned in one of the patchset review comments, since the goal is to move to glfs_stat in release-6, do we need to update these APIs as well to use the new struct? Or shall we retain them in FUTURE for now and address in next minor release? Please suggest. Thanks, Soumya [1] https://review.gluster.org/#/c/glusterfs/+/21734/ On 1/23/19 8:43 PM, Shyam Ranganathan wrote: > On 1/23/19 6:03 AM, Ashish Pandey wrote: >> >> Following is the patch I am working and targeting - >> https://review.gluster.org/#/c/glusterfs/+/21933/ > > This is a bug fix, and the patch size at the moment is also small in > lines changed. Hence, even if it misses branching the fix can be backported. 
> > Thanks for the heads up! > _______________________________________________ > Gluster-devel mailing list > Gluster-devel at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-devel > From srangana at redhat.com Thu Jan 24 14:43:27 2019 From: srangana at redhat.com (Shyam Ranganathan) Date: Thu, 24 Jan 2019 09:43:27 -0500 Subject: [Gluster-devel] Release 6: Kick off! In-Reply-To: <9d9f14cd-fa1b-de3d-ac48-7720d02b99b2@redhat.com> References: <03fb0aaa-66c4-d9b1-1228-3ce527ce5fb9@redhat.com> <19f19863-61c4-9f2b-98d2-19a435b8ec78@redhat.com> <1959334041.59187316.1548241399641.JavaMail.zimbra@redhat.com> <75cb1e52-fa8b-2a4e-67b3-6c9deb6d9909@redhat.com> <9d9f14cd-fa1b-de3d-ac48-7720d02b99b2@redhat.com> Message-ID: <86e0b84b-cfd3-54e4-22c7-848faac6f487@redhat.com> On 1/24/19 3:23 AM, Soumya Koduri wrote: > Hi Shyam, > > Sorry for the late response. I just realized that we had two more new > APIs glfs_setattr/fsetattr which uses 'struct stat' made public [1]. As > mentioned in one of the patchset review comments, since the goal is to > move to glfs_stat in release-6, do we need to update these APIs as well > to use the new struct? Or shall we retain them in FUTURE for now and > address in next minor release? Please suggest. So the goal in 6 is to not return stat but glfs_stat in the modified pre/post stat return APIs (instead of making this a 2-step for application consumers). To reach glfs_stat everywhere, we have a few more things to do. I had this patch in my radar, but just like pub_glfs_stat returns stat (hence we made glfs_statx as private), I am seeing this as "fine for now". In the future we only want to return glfs_stat. So for now, we let this API be. The next round of converting stat to glfs_stat would take into account clearing up all such instances. So that all application consumers will need to modify code as required in one shot. Does this answer the concern? and, thanks for bringing this to notice. > > Thanks, > Soumya > > [1] https://review.gluster.org/#/c/glusterfs/+/21734/ > > > On 1/23/19 8:43 PM, Shyam Ranganathan wrote: >> On 1/23/19 6:03 AM, Ashish Pandey wrote: >>> >>> Following is the patch I am working and targeting - >>> https://review.gluster.org/#/c/glusterfs/+/21933/ >> >> This is a bug fix, and the patch size at the moment is also small in >> lines changed. Hence, even if it misses branching the fix can be >> backported. >> >> Thanks for the heads up! >> _______________________________________________ >> Gluster-devel mailing list >> Gluster-devel at gluster.org >> https://lists.gluster.org/mailman/listinfo/gluster-devel >> From xhernandez at redhat.com Thu Jan 24 15:47:26 2019 From: xhernandez at redhat.com (Xavi Hernandez) Date: Thu, 24 Jan 2019 16:47:26 +0100 Subject: [Gluster-devel] Performance improvements Message-ID: Hi all, I've just updated a patch [1] that implements a new thread pool based on a wait-free queue provided by userspace-rcu library. The patch also includes an auto scaling mechanism that only keeps running the needed amount of threads for the current workload. This new approach has some advantages: - It's provided globally inside libglusterfs instead of inside an xlator This makes it possible that fuse thread and epoll threads transfer the received request to another thread sooner, wating less CPU and reacting sooner to other incoming requests. - Adding jobs to the queue used by the thread pool only requires an atomic operation This makes the producer side of the queue really fast, almost with no delay. 
- Contention is reduced The producer side has negligible contention thanks to the wait-free enqueue operation based on an atomic access. The consumer side requires a mutex, but the duration is very small and the scaling mechanism makes sure that there are no more threads than needed contending for the mutex. This change disables io-threads, since it replaces part of its functionality. However there are two things that could be needed from io-threads: - Prioritization of fops Currently, io-threads assigns priorities to each fop, so that some fops are handled before than others. - Fair distribution of execution slots between clients Currently, io-threads processes requests from each client in round-robin. These features are not implemented right now. If they are needed, probably the best thing to do would be to keep them inside io-threads, but change its implementation so that it uses the global threads from the thread pool instead of its own threads. If this change proves it's performing better and is merged, I have some more ideas to improve other areas of gluster: - Integrate synctask threads into the new thread pool I think there is some contention in these threads because during some tests I've seen they were consuming most of the CPU. Probably they suffer from the same problem than io-threads, so replacing them could improve things. - Integrate timers into the new thread pool My idea is to create a per-thread timer where code executed in one thread will create timer events in the same thread. This makes it possible to use structures that don't require any mutex to be modified. Since the thread pool is basically executing computing tasks, which are fast, I think it's feasible to implement a timer in the main loop of each worker thread with a resolution of few millisecond, which I think is good enough for gluster needs. - Integrate with userspace-rcu library in QSBR mode This will make it possible to use some RCU-based structures for anything gluster uses (inodes, fd's, ...). These structures have very fast read operations, which should reduce contention and improve performance in many places. - Integrate I/O threads into the thread pool and reduce context switches The idea here is a bit more complex. Basically I would like to have a function that does an I/O on some device (for example reading fuse requests or waiting for epoll events). We could send a request to the thread pool to execute that function, so it would be executed inside one of the working threads. When the I/O terminates (i.e. it has received a request), the idea is that a call to the same function is added to the thread pool, so that another thread could continue waiting for requests, but the current thread will start processing the received request without a context switch. Note that with all these changes, all dedicated threads that we currently have in gluster could be replaced by the features provided by this new thread pool, so these would be the only threads present in gluster. This is specially important when brick-multiplex is used. I've done some simple tests using a replica 3 volume and a diserse 4+2 volume. These tests are executed on a single machine using an HDD for each brick (not the best scenario, but it should be fine for comparison). The machine is quite powerful (dual Intel Xeon Silver 4114 @2.2 GHz, with 128 GiB RAM). These tests have shown that the limiting factor has been the disk in most cases, so it's hard to tell if the change has really improved things. 
There is only one clear exception: self-heal on a dispersed volume completes 12.7% faster. The utilization of CPU has also dropped drastically: Old implementation: 12.30 user, 41.78 sys, 43.16 idle, 0.73 wait New implementation: 4.91 user, 5.52 sys, 81.60 idle, 5.91 wait Now I'm running some more tests on NVMe to try to see the effects of the change when disk is not limiting performance. I'll update once I've more data. Xavi [1] https://review.gluster.org/c/glusterfs/+/20636 -------------- next part -------------- An HTML attachment was scrubbed... URL: From vbellur at redhat.com Fri Jan 25 07:53:12 2019 From: vbellur at redhat.com (Vijay Bellur) Date: Thu, 24 Jan 2019 23:53:12 -0800 Subject: [Gluster-devel] Performance improvements In-Reply-To: References: Message-ID: Thank you for the detailed update, Xavi! This looks very interesting. On Thu, Jan 24, 2019 at 7:50 AM Xavi Hernandez wrote: > Hi all, > > I've just updated a patch [1] that implements a new thread pool based on a > wait-free queue provided by userspace-rcu library. The patch also includes > an auto scaling mechanism that only keeps running the needed amount of > threads for the current workload. > > This new approach has some advantages: > > - It's provided globally inside libglusterfs instead of inside an > xlator > > This makes it possible that fuse thread and epoll threads transfer the > received request to another thread sooner, wating less CPU and reacting > sooner to other incoming requests. > > > - Adding jobs to the queue used by the thread pool only requires an > atomic operation > > This makes the producer side of the queue really fast, almost with no > delay. > > > - Contention is reduced > > The producer side has negligible contention thanks to the wait-free > enqueue operation based on an atomic access. The consumer side requires a > mutex, but the duration is very small and the scaling mechanism makes sure > that there are no more threads than needed contending for the mutex. > > > This change disables io-threads, since it replaces part of its > functionality. However there are two things that could be needed from > io-threads: > > - Prioritization of fops > > Currently, io-threads assigns priorities to each fop, so that some fops > are handled before than others. > > > - Fair distribution of execution slots between clients > > Currently, io-threads processes requests from each client in round-robin. > > > These features are not implemented right now. If they are needed, probably > the best thing to do would be to keep them inside io-threads, but change > its implementation so that it uses the global threads from the thread pool > instead of its own threads. > These features are indeed useful to have and hence modifying the implementation of io-threads to provide this behavior would be welcome. > > > These tests have shown that the limiting factor has been the disk in most > cases, so it's hard to tell if the change has really improved things. There > is only one clear exception: self-heal on a dispersed volume completes > 12.7% faster. The utilization of CPU has also dropped drastically: > > Old implementation: 12.30 user, 41.78 sys, 43.16 idle, 0.73 wait > > New implementation: 4.91 user, 5.52 sys, 81.60 idle, 5.91 wait > > > Now I'm running some more tests on NVMe to try to see the effects of the > change when disk is not limiting performance. I'll update once I've more > data. > > Will look forward to these numbers. Regards, Vijay -------------- next part -------------- An HTML attachment was scrubbed... 
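A minimal sketch of the producer/consumer pattern discussed in this thread, based on liburcu's wfcqueue (illustrative names only, not the code in https://review.gluster.org/c/glusterfs/+/20636): producers enqueue with a single wait-free operation, while consumers take a short-lived mutex and sleep on a condition variable when idle.

/* Illustrative sketch only -- simplified, without the auto-scaling logic.
 * Link against liburcu (the wfcqueue symbols live in its common/cds
 * library, depending on the liburcu version). */
#include <pthread.h>
#include <stdlib.h>
#include <urcu/wfcqueue.h>

struct job {
        struct cds_wfcq_node node;   /* first member, so we can cast node -> job */
        void (*fn)(void *);
        void *data;
};

static struct cds_wfcq_head job_head;
static struct cds_wfcq_tail job_tail;
static pthread_mutex_t pool_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t pool_cond = PTHREAD_COND_INITIALIZER;

void pool_init(void)
{
        cds_wfcq_init(&job_head, &job_tail);
}

/* Producer side: one wait-free enqueue plus a wakeup. */
void pool_add(void (*fn)(void *), void *data)
{
        struct job *job = calloc(1, sizeof(*job));

        job->fn = fn;
        job->data = data;
        cds_wfcq_node_init(&job->node);
        cds_wfcq_enqueue(&job_head, &job_tail, &job->node);

        pthread_mutex_lock(&pool_lock);
        pthread_cond_signal(&pool_cond);        /* wake one idle worker */
        pthread_mutex_unlock(&pool_lock);
}

/* Consumer side: dequeue under a short-lived mutex, run the job outside it. */
void *pool_worker(void *arg)
{
        struct cds_wfcq_node *node;
        struct job *job;

        for (;;) {
                pthread_mutex_lock(&pool_lock);
                while ((node = __cds_wfcq_dequeue_blocking(&job_head,
                                                           &job_tail)) == NULL)
                        pthread_cond_wait(&pool_cond, &pool_lock);
                pthread_mutex_unlock(&pool_lock);

                job = (struct job *)node;
                job->fn(job->data);
                free(job);
        }
        return NULL;
}

The real patch additionally scales the number of worker threads up and down with the queue load, which is where most of its complexity lies.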
URL: From rgowdapp at redhat.com Sat Jan 26 02:33:06 2019 From: rgowdapp at redhat.com (Raghavendra Gowdappa) Date: Sat, 26 Jan 2019 08:03:06 +0530 Subject: [Gluster-devel] Latency analysis of GlusterFS' network layer for pgbench In-Reply-To: References: Message-ID: On Fri, Jan 11, 2019 at 8:09 PM Raghavendra Gowdappa wrote: > Here is the update of the progress till now: > * The client profile attached till now shows the tuple creation is > dominated by writes and fstats. Note that fstats are side-effects of writes > as writes invalidate attributes of the file from kernel attribute cache. > * The rest of the init phase (which is marked by msgs "setting primary > key" and "vaccuum") is dominated by reads. Next bigger set of operations > are writes followed by fstats. > > So, only writes, reads and fstats are the operations we need to optimize > to reduce the init time latency. As mentioned in my previous mail, I did > following tunings: > * Enabled only write-behind, md-cache and open-behind. > - write-behind was configured with a cache-size/window-size of 20MB > - open-behind was configured with read-after-open yes > - md-cache was loaded as a child of write-behind in xlator graph. As a > parent of write-behind, writes responses of writes cached in write-behind > would invalidate stats. But when loaded as a child of write-behind this > problem won't be there. Note that in both cases fstat would pass through > write-behind (In the former case due to no stats in md-cache). However in > the latter case fstats can be served by md-cache. > - md-cache used to aggressively invalidate inodes. For the purpose of > this test, I just commented out inode-invalidate code in md-cache. We need > to fine tune the invalidation invocation logic. > - set group-metadata-cache to on. But turned off upcall notifications. > Note that since this workload basically accesses all its data through > single mount point. So, there is no shared files across mounts and hence > its safe to turn off invalidations. > * Applied fix to https://bugzilla.redhat.com/show_bug.cgi?id=1648781 > > With the above set of tunings I could reduce the init time of scale 8000 > from 16.6 hrs to 11.4 hrs - an improvement in the range 25% to 30% > > Since the workload is dominated by reads, we think a good read-cache where > reads to regions just written are served from cache would greatly improve > the performance. Since kernel page-cache already provides that > functionality along with read-ahead (which is more intelligent and serves > more read patterns than supported by Glusterfs read-ahead), we wanted to > try that. But, Manoj found a bug where reads followed by writes are not > served from page cache [5]. I am currently waiting for the resolution of > this bug. As an alternative, I can modify io-cache to serve reads from the > data just written. But, the change involves its challenges and hence would > like to get a resolution on [5] (either positive or negative) before > proceeding with modifications to io-cache. > > As to the rpc latency, Krutika had long back identified that reading a > single rpc message involves atleast 4 reads to socket. These many number of > reads were done to identify the structure of the message on the go. The > reason we wanted to discover the rpc message was to identify the part of > the rpc message containing read or write payload and make sure that payload > is directly read into a buffer different than the one containing rest of > the rpc message. 
This strategy will make sure payloads are not copied again > when buffers are moved across caches (read-ahead, io-cache etc) and also > the rest of the rpc message can be freed even though the payload outlives > the rpc message (when payloads are cached). However, we can experiment an > approach where we can either do away with zero-copy requirement or let the > entire buffer containing rpc message and payload to live in the cache. > > From my observations and discussions with Manoj and Xavi, this workload is > very sensitive to latency (than to concurrency). So, I am hopeful the above > approaches will give positive results. > Me, Manoj and Csaba figured out that invalidations by md-cache and Fuse auto-invalidations were dropping the kernel page-cache (more details on [5]). Changes to stats by writes from same client (local writes) were triggering both these codepaths dropping the cache. Since all the I/O done by this workload goes through the caches of single client, the invalidations are not necessary and I made code changes to fuse-bridge to disable auto-invalidations completely and commented out inode-invalidations in md-cache. Note that this doesn't regress the consistency/coherency of data seen in the caches as its a single client use-case. With these two changes coupled with earlier optimizations (client-io-threads=on, server/client-event-threads=4, md-cache as a child of write-behind in xlator graph, performance.md-cache-timeout=600), pgbench init of scale 8000 on a volume with NVMe backend completed in 54m25s. This is a whopping 94% improvement to the time we started out with (59280s vs 3360s). [root at shakthi4 ~]# gluster volume info Volume Name: nvme-r3 Type: Replicate Volume ID: d1490bcc-bcf1-4e09-91e8-ab01d9781263 Status: Started Snapshot Count: 0 Number of Bricks: 1 x 3 = 3 Transport-type: tcp Bricks: Brick1: shakthi4:/gluster/nvme0n1/bricks/nvme-r3-1 Brick2: shakthi4:/gluster/nvme0n1/bricks/nvme-r3-2 Brick3: shakthi4:/gluster/nvme0n1/bricks/nvme-r3-3 Options Reconfigured: server.event-threads: 4 client.event-threads: 4 diagnostics.client-log-level: INFO performance.md-cache-timeout: 600 performance.io-cache: off performance.read-ahead: off diagnostics.count-fop-hits: on diagnostics.latency-measurement: on transport.address-family: inet nfs.disable: on performance.client-io-threads: on performance.stat-prefetch: on I'll be concentrating on how to disable fuse-auto-invalidations without regressing on the consistency model we've been providing till now. The consistency model Glusterfs has been providing till now is close to open consistency similar to what NFS provides [6][7]. But the initial thoughts are, at least for the pgbench test-case there is no harm in totally disabling fuse-auto-invalidations and md-cache invalidations as this workload totally runs on single mount point and hence invalidations itself are not necessary as all I/O goes through caches and hence caches are in sync with the state of the file on backend. 
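For reference, the option-level part of the tuning above can be applied with something like the following volume-set commands (using the nvme-r3 volume shown above; the md-cache graph reordering, the commented-out md-cache invalidation and the fuse auto-invalidation change are code modifications, not settable options):

  gluster volume set nvme-r3 performance.client-io-threads on
  gluster volume set nvme-r3 client.event-threads 4
  gluster volume set nvme-r3 server.event-threads 4
  gluster volume set nvme-r3 performance.stat-prefetch on      # md-cache
  gluster volume set nvme-r3 performance.md-cache-timeout 600
  gluster volume set nvme-r3 performance.read-ahead off
  gluster volume set nvme-r3 performance.io-cache off
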
[6] http://nfs.sourceforge.net/#faq_a8 [7] https://lists.gluster.org/pipermail/gluster-users/2013-March/012805.html > [5] https://bugzilla.redhat.com/show_bug.cgi?id=1664934 > > regards, > Raghavendra > > On Fri, Dec 28, 2018 at 12:44 PM Raghavendra Gowdappa > wrote: > >> >> >> On Mon, Dec 24, 2018 at 6:05 PM Raghavendra Gowdappa >> wrote: >> >>> >>> >>> On Mon, Dec 24, 2018 at 3:40 PM Sankarshan Mukhopadhyay < >>> sankarshan.mukhopadhyay at gmail.com> wrote: >>> >>>> [pulling the conclusions up to enable better in-line] >>>> >>>> > Conclusions: >>>> > >>>> > We should never have a volume with caching-related xlators disabled. >>>> The price we pay for it is too high. We need to make them work consistently >>>> and aggressively to avoid as many requests as we can. >>>> >>>> Are there current issues in terms of behavior which are known/observed >>>> when these are enabled? >>>> >>> >>> We did have issues with pgbench in past. But they've have been fixed. >>> Please refer to bz [1] for details. On 5.1, it runs successfully with all >>> caching related xlators enabled. Having said that the only performance >>> xlators which gave improved performance were open-behind and write-behind >>> [2] (write-behind had some issues, which will be fixed by [3] and we'll >>> have to measure performance again with fix to [3]). >>> >> >> One quick update. Enabling write-behind and md-cache with fix for [3] >> reduced the total time taken for pgbench init phase roughly by 20%-25% >> (from 12.5 min to 9.75 min for a scale of 100). Though this is still a huge >> time (around 12hrs for a db of scale 8000). I'll follow up with a detailed >> report once my experiments are complete. Currently trying to optimize the >> read path. >> >> >>> For some reason, read-side caching didn't improve transactions per >>> second. I am working on this problem currently. Note that these bugs >>> measure transaction phase of pgbench, but what xavi measured in his mail is >>> init phase. Nevertheless, evaluation of read caching (metadata/data) will >>> still be relevant for init phase too. >>> >>> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1512691 >>> [2] https://bugzilla.redhat.com/show_bug.cgi?id=1629589#c4 >>> [3] https://bugzilla.redhat.com/show_bug.cgi?id=1648781 >>> >>> >>>> > We need to analyze client/server xlators deeper to see if we can >>>> avoid some delays. However optimizing something that is already at the >>>> microsecond level can be very hard. >>>> >>>> That is true - are there any significant gains which can be accrued by >>>> putting efforts here or, should this be a lower priority? >>>> >>> >>> The problem identified by xavi is also the one we (Manoj, Krutika, me >>> and Milind) had encountered in the past [4]. The solution we used was to >>> have multiple rpc connections between single brick and client. The solution >>> indeed fixed the bottleneck. So, there is definitely work involved here - >>> either to fix the single connection model or go with multiple connection >>> model. Its preferred to improve single connection and resort to multiple >>> connections only if bottlenecks in single connection are not fixable. >>> Personally I think this is high priority along with having appropriate >>> client side caching. >>> >>> [4] https://bugzilla.redhat.com/show_bug.cgi?id=1467614#c52 >>> >>> >>>> > We need to determine what causes the fluctuations in brick side and >>>> avoid them. 
>>>> > This scenario is very similar to a smallfile/metadata workload, so >>>> this is probably one important cause of its bad performance. >>>> >>>> What kind of instrumentation is required to enable the determination? >>>> >>>> On Fri, Dec 21, 2018 at 1:48 PM Xavi Hernandez >>>> wrote: >>>> > >>>> > Hi, >>>> > >>>> > I've done some tracing of the latency that network layer introduces >>>> in gluster. I've made the analysis as part of the pgbench performance issue >>>> (in particulat the initialization and scaling phase), so I decided to look >>>> at READV for this particular workload, but I think the results can be >>>> extrapolated to other operations that also have small latency (cached data >>>> from FS for example). >>>> > >>>> > Note that measuring latencies introduces some latency. It consists in >>>> a call to clock_get_time() for each probe point, so the real latency will >>>> be a bit lower, but still proportional to these numbers. >>>> > >>>> >>>> [snip] >>>> _______________________________________________ >>>> Gluster-devel mailing list >>>> Gluster-devel at gluster.org >>>> https://lists.gluster.org/mailman/listinfo/gluster-devel >>>> >>> -------------- next part -------------- An HTML attachment was scrubbed... URL: From rgowdapp at redhat.com Sat Jan 26 02:36:24 2019 From: rgowdapp at redhat.com (Raghavendra Gowdappa) Date: Sat, 26 Jan 2019 08:06:24 +0530 Subject: [Gluster-devel] Latency analysis of GlusterFS' network layer for pgbench In-Reply-To: References: Message-ID: On Sat, Jan 26, 2019 at 8:03 AM Raghavendra Gowdappa wrote: > > > On Fri, Jan 11, 2019 at 8:09 PM Raghavendra Gowdappa > wrote: > >> Here is the update of the progress till now: >> * The client profile attached till now shows the tuple creation is >> dominated by writes and fstats. Note that fstats are side-effects of writes >> as writes invalidate attributes of the file from kernel attribute cache. >> * The rest of the init phase (which is marked by msgs "setting primary >> key" and "vaccuum") is dominated by reads. Next bigger set of operations >> are writes followed by fstats. >> >> So, only writes, reads and fstats are the operations we need to optimize >> to reduce the init time latency. As mentioned in my previous mail, I did >> following tunings: >> * Enabled only write-behind, md-cache and open-behind. >> - write-behind was configured with a cache-size/window-size of 20MB >> - open-behind was configured with read-after-open yes >> - md-cache was loaded as a child of write-behind in xlator graph. As >> a parent of write-behind, writes responses of writes cached in write-behind >> would invalidate stats. But when loaded as a child of write-behind this >> problem won't be there. Note that in both cases fstat would pass through >> write-behind (In the former case due to no stats in md-cache). However in >> the latter case fstats can be served by md-cache. >> - md-cache used to aggressively invalidate inodes. For the purpose of >> this test, I just commented out inode-invalidate code in md-cache. We need >> to fine tune the invalidation invocation logic. >> - set group-metadata-cache to on. But turned off upcall >> notifications. Note that since this workload basically accesses all its >> data through single mount point. So, there is no shared files across mounts >> and hence its safe to turn off invalidations. 
>> * Applied fix to https://bugzilla.redhat.com/show_bug.cgi?id=1648781 >> >> With the above set of tunings I could reduce the init time of scale 8000 >> from 16.6 hrs to 11.4 hrs - an improvement in the range 25% to 30% >> >> Since the workload is dominated by reads, we think a good read-cache >> where reads to regions just written are served from cache would greatly >> improve the performance. Since kernel page-cache already provides that >> functionality along with read-ahead (which is more intelligent and serves >> more read patterns than supported by Glusterfs read-ahead), we wanted to >> try that. But, Manoj found a bug where reads followed by writes are not >> served from page cache [5]. I am currently waiting for the resolution of >> this bug. As an alternative, I can modify io-cache to serve reads from the >> data just written. But, the change involves its challenges and hence would >> like to get a resolution on [5] (either positive or negative) before >> proceeding with modifications to io-cache. >> >> As to the rpc latency, Krutika had long back identified that reading a >> single rpc message involves atleast 4 reads to socket. These many number of >> reads were done to identify the structure of the message on the go. The >> reason we wanted to discover the rpc message was to identify the part of >> the rpc message containing read or write payload and make sure that payload >> is directly read into a buffer different than the one containing rest of >> the rpc message. This strategy will make sure payloads are not copied again >> when buffers are moved across caches (read-ahead, io-cache etc) and also >> the rest of the rpc message can be freed even though the payload outlives >> the rpc message (when payloads are cached). However, we can experiment an >> approach where we can either do away with zero-copy requirement or let the >> entire buffer containing rpc message and payload to live in the cache. >> >> From my observations and discussions with Manoj and Xavi, this workload >> is very sensitive to latency (than to concurrency). So, I am hopeful the >> above approaches will give positive results. >> > > Me, Manoj and Csaba figured out that invalidations by md-cache and Fuse > auto-invalidations were dropping the kernel page-cache (more details on > [5]). > Thanks to Miklos for the pointer on auto-invalidations. > Changes to stats by writes from same client (local writes) were triggering > both these codepaths dropping the cache. Since all the I/O done by this > workload goes through the caches of single client, the invalidations are > not necessary and I made code changes to fuse-bridge to disable > auto-invalidations completely and commented out inode-invalidations in > md-cache. Note that this doesn't regress the consistency/coherency of data > seen in the caches as its a single client use-case. With these two changes > coupled with earlier optimizations (client-io-threads=on, > server/client-event-threads=4, md-cache as a child of write-behind in > xlator graph, performance.md-cache-timeout=600), pgbench init of scale 8000 > on a volume with NVMe backend completed in 54m25s. This is a whopping 94% > improvement to the time we started out with (59280s vs 3360s). 
> > [root at shakthi4 ~]# gluster volume info > > Volume Name: nvme-r3 > Type: Replicate > Volume ID: d1490bcc-bcf1-4e09-91e8-ab01d9781263 > Status: Started > Snapshot Count: 0 > Number of Bricks: 1 x 3 = 3 > Transport-type: tcp > Bricks: > Brick1: shakthi4:/gluster/nvme0n1/bricks/nvme-r3-1 > Brick2: shakthi4:/gluster/nvme0n1/bricks/nvme-r3-2 > Brick3: shakthi4:/gluster/nvme0n1/bricks/nvme-r3-3 > Options Reconfigured: > server.event-threads: 4 > client.event-threads: 4 > diagnostics.client-log-level: INFO > performance.md-cache-timeout: 600 > performance.io-cache: off > performance.read-ahead: off > diagnostics.count-fop-hits: on > diagnostics.latency-measurement: on > transport.address-family: inet > nfs.disable: on > performance.client-io-threads: on > performance.stat-prefetch: on > > I'll be concentrating on how to disable fuse-auto-invalidations without > regressing on the consistency model we've been providing till now. The > consistency model Glusterfs has been providing till now is close to open > consistency similar to what NFS provides [6][7]. > > But the initial thoughts are, at least for the pgbench test-case there is > no harm in totally disabling fuse-auto-invalidations and md-cache > invalidations as this workload totally runs on single mount point and hence > invalidations itself are not necessary as all I/O goes through caches and > hence caches are in sync with the state of the file on backend. > > [6] http://nfs.sourceforge.net/#faq_a8 > [7] > https://lists.gluster.org/pipermail/gluster-users/2013-March/012805.html > > >> [5] https://bugzilla.redhat.com/show_bug.cgi?id=1664934 >> >> regards, >> Raghavendra >> >> On Fri, Dec 28, 2018 at 12:44 PM Raghavendra Gowdappa < >> rgowdapp at redhat.com> wrote: >> >>> >>> >>> On Mon, Dec 24, 2018 at 6:05 PM Raghavendra Gowdappa < >>> rgowdapp at redhat.com> wrote: >>> >>>> >>>> >>>> On Mon, Dec 24, 2018 at 3:40 PM Sankarshan Mukhopadhyay < >>>> sankarshan.mukhopadhyay at gmail.com> wrote: >>>> >>>>> [pulling the conclusions up to enable better in-line] >>>>> >>>>> > Conclusions: >>>>> > >>>>> > We should never have a volume with caching-related xlators disabled. >>>>> The price we pay for it is too high. We need to make them work consistently >>>>> and aggressively to avoid as many requests as we can. >>>>> >>>>> Are there current issues in terms of behavior which are known/observed >>>>> when these are enabled? >>>>> >>>> >>>> We did have issues with pgbench in past. But they've have been fixed. >>>> Please refer to bz [1] for details. On 5.1, it runs successfully with all >>>> caching related xlators enabled. Having said that the only performance >>>> xlators which gave improved performance were open-behind and write-behind >>>> [2] (write-behind had some issues, which will be fixed by [3] and we'll >>>> have to measure performance again with fix to [3]). >>>> >>> >>> One quick update. Enabling write-behind and md-cache with fix for [3] >>> reduced the total time taken for pgbench init phase roughly by 20%-25% >>> (from 12.5 min to 9.75 min for a scale of 100). Though this is still a huge >>> time (around 12hrs for a db of scale 8000). I'll follow up with a detailed >>> report once my experiments are complete. Currently trying to optimize the >>> read path. >>> >>> >>>> For some reason, read-side caching didn't improve transactions per >>>> second. I am working on this problem currently. Note that these bugs >>>> measure transaction phase of pgbench, but what xavi measured in his mail is >>>> init phase. 
Nevertheless, evaluation of read caching (metadata/data) will >>>> still be relevant for init phase too. >>>> >>>> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1512691 >>>> [2] https://bugzilla.redhat.com/show_bug.cgi?id=1629589#c4 >>>> [3] https://bugzilla.redhat.com/show_bug.cgi?id=1648781 >>>> >>>> >>>>> > We need to analyze client/server xlators deeper to see if we can >>>>> avoid some delays. However optimizing something that is already at the >>>>> microsecond level can be very hard. >>>>> >>>>> That is true - are there any significant gains which can be accrued by >>>>> putting efforts here or, should this be a lower priority? >>>>> >>>> >>>> The problem identified by xavi is also the one we (Manoj, Krutika, me >>>> and Milind) had encountered in the past [4]. The solution we used was to >>>> have multiple rpc connections between single brick and client. The solution >>>> indeed fixed the bottleneck. So, there is definitely work involved here - >>>> either to fix the single connection model or go with multiple connection >>>> model. Its preferred to improve single connection and resort to multiple >>>> connections only if bottlenecks in single connection are not fixable. >>>> Personally I think this is high priority along with having appropriate >>>> client side caching. >>>> >>>> [4] https://bugzilla.redhat.com/show_bug.cgi?id=1467614#c52 >>>> >>>> >>>>> > We need to determine what causes the fluctuations in brick side and >>>>> avoid them. >>>>> > This scenario is very similar to a smallfile/metadata workload, so >>>>> this is probably one important cause of its bad performance. >>>>> >>>>> What kind of instrumentation is required to enable the determination? >>>>> >>>>> On Fri, Dec 21, 2018 at 1:48 PM Xavi Hernandez >>>>> wrote: >>>>> > >>>>> > Hi, >>>>> > >>>>> > I've done some tracing of the latency that network layer introduces >>>>> in gluster. I've made the analysis as part of the pgbench performance issue >>>>> (in particulat the initialization and scaling phase), so I decided to look >>>>> at READV for this particular workload, but I think the results can be >>>>> extrapolated to other operations that also have small latency (cached data >>>>> from FS for example). >>>>> > >>>>> > Note that measuring latencies introduces some latency. It consists >>>>> in a call to clock_get_time() for each probe point, so the real latency >>>>> will be a bit lower, but still proportional to these numbers. >>>>> > >>>>> >>>>> [snip] >>>>> _______________________________________________ >>>>> Gluster-devel mailing list >>>>> Gluster-devel at gluster.org >>>>> https://lists.gluster.org/mailman/listinfo/gluster-devel >>>>> >>>> -------------- next part -------------- An HTML attachment was scrubbed... URL: From rgowdapp at redhat.com Sat Jan 26 03:38:43 2019 From: rgowdapp at redhat.com (Raghavendra Gowdappa) Date: Sat, 26 Jan 2019 09:08:43 +0530 Subject: [Gluster-devel] Latency analysis of GlusterFS' network layer for pgbench In-Reply-To: References: Message-ID: On Sat, Jan 26, 2019 at 8:03 AM Raghavendra Gowdappa wrote: > > > On Fri, Jan 11, 2019 at 8:09 PM Raghavendra Gowdappa > wrote: > >> Here is the update of the progress till now: >> * The client profile attached till now shows the tuple creation is >> dominated by writes and fstats. Note that fstats are side-effects of writes >> as writes invalidate attributes of the file from kernel attribute cache. >> * The rest of the init phase (which is marked by msgs "setting primary >> key" and "vaccuum") is dominated by reads. 
Next bigger set of operations >> are writes followed by fstats. >> >> So, only writes, reads and fstats are the operations we need to optimize >> to reduce the init time latency. As mentioned in my previous mail, I did >> following tunings: >> * Enabled only write-behind, md-cache and open-behind. >> - write-behind was configured with a cache-size/window-size of 20MB >> - open-behind was configured with read-after-open yes >> - md-cache was loaded as a child of write-behind in xlator graph. As >> a parent of write-behind, writes responses of writes cached in write-behind >> would invalidate stats. But when loaded as a child of write-behind this >> problem won't be there. Note that in both cases fstat would pass through >> write-behind (In the former case due to no stats in md-cache). However in >> the latter case fstats can be served by md-cache. >> - md-cache used to aggressively invalidate inodes. For the purpose of >> this test, I just commented out inode-invalidate code in md-cache. We need >> to fine tune the invalidation invocation logic. >> - set group-metadata-cache to on. But turned off upcall >> notifications. Note that since this workload basically accesses all its >> data through single mount point. So, there is no shared files across mounts >> and hence its safe to turn off invalidations. >> * Applied fix to https://bugzilla.redhat.com/show_bug.cgi?id=1648781 >> >> With the above set of tunings I could reduce the init time of scale 8000 >> from 16.6 hrs to 11.4 hrs - an improvement in the range 25% to 30% >> >> Since the workload is dominated by reads, we think a good read-cache >> where reads to regions just written are served from cache would greatly >> improve the performance. Since kernel page-cache already provides that >> functionality along with read-ahead (which is more intelligent and serves >> more read patterns than supported by Glusterfs read-ahead), we wanted to >> try that. But, Manoj found a bug where reads followed by writes are not >> served from page cache [5]. I am currently waiting for the resolution of >> this bug. As an alternative, I can modify io-cache to serve reads from the >> data just written. But, the change involves its challenges and hence would >> like to get a resolution on [5] (either positive or negative) before >> proceeding with modifications to io-cache. >> >> As to the rpc latency, Krutika had long back identified that reading a >> single rpc message involves atleast 4 reads to socket. These many number of >> reads were done to identify the structure of the message on the go. The >> reason we wanted to discover the rpc message was to identify the part of >> the rpc message containing read or write payload and make sure that payload >> is directly read into a buffer different than the one containing rest of >> the rpc message. This strategy will make sure payloads are not copied again >> when buffers are moved across caches (read-ahead, io-cache etc) and also >> the rest of the rpc message can be freed even though the payload outlives >> the rpc message (when payloads are cached). However, we can experiment an >> approach where we can either do away with zero-copy requirement or let the >> entire buffer containing rpc message and payload to live in the cache. >> >> From my observations and discussions with Manoj and Xavi, this workload >> is very sensitive to latency (than to concurrency). So, I am hopeful the >> above approaches will give positive results. 
>> > > Me, Manoj and Csaba figured out that invalidations by md-cache and Fuse > auto-invalidations were dropping the kernel page-cache (more details on > [5]). Changes to stats by writes from same client (local writes) were > triggering both these codepaths dropping the cache. Since all the I/O done > by this workload goes through the caches of single client, the > invalidations are not necessary and I made code changes to fuse-bridge to > disable auto-invalidations completely and commented out inode-invalidations > in md-cache. Note that this doesn't regress the consistency/coherency of > data seen in the caches as its a single client use-case. With these two > changes coupled with earlier optimizations (client-io-threads=on, > server/client-event-threads=4, md-cache as a child of write-behind in > xlator graph, performance.md-cache-timeout=600), pgbench init of scale 8000 > on a volume with NVMe backend completed in 54m25s. This is a whopping 94% > improvement to the time we started out with (59280s vs 3360s). > These numbers were taken from the latest run I had scheduled. However, I didn't notice that the test had failed midway. From another test that had completed successfully, the numbers are 139m7s. That will be an improvement of 86% (59280s vs 8340s). I've scheduled another run just to be sure. The improvement is 86% and not 94%. > [root at shakthi4 ~]# gluster volume info > > Volume Name: nvme-r3 > Type: Replicate > Volume ID: d1490bcc-bcf1-4e09-91e8-ab01d9781263 > Status: Started > Snapshot Count: 0 > Number of Bricks: 1 x 3 = 3 > Transport-type: tcp > Bricks: > Brick1: shakthi4:/gluster/nvme0n1/bricks/nvme-r3-1 > Brick2: shakthi4:/gluster/nvme0n1/bricks/nvme-r3-2 > Brick3: shakthi4:/gluster/nvme0n1/bricks/nvme-r3-3 > Options Reconfigured: > server.event-threads: 4 > client.event-threads: 4 > diagnostics.client-log-level: INFO > performance.md-cache-timeout: 600 > performance.io-cache: off > performance.read-ahead: off > diagnostics.count-fop-hits: on > diagnostics.latency-measurement: on > transport.address-family: inet > nfs.disable: on > performance.client-io-threads: on > performance.stat-prefetch: on > > I'll be concentrating on how to disable fuse-auto-invalidations without > regressing on the consistency model we've been providing till now. The > consistency model Glusterfs has been providing till now is close to open > consistency similar to what NFS provides [6][7]. > > But the initial thoughts are, at least for the pgbench test-case there is > no harm in totally disabling fuse-auto-invalidations and md-cache > invalidations as this workload totally runs on single mount point and hence > invalidations itself are not necessary as all I/O goes through caches and > hence caches are in sync with the state of the file on backend. > > [6] http://nfs.sourceforge.net/#faq_a8 > [7] > https://lists.gluster.org/pipermail/gluster-users/2013-March/012805.html > > >> [5] https://bugzilla.redhat.com/show_bug.cgi?id=1664934 >> >> regards, >> Raghavendra >> >> On Fri, Dec 28, 2018 at 12:44 PM Raghavendra Gowdappa < >> rgowdapp at redhat.com> wrote: >> >>> >>> >>> On Mon, Dec 24, 2018 at 6:05 PM Raghavendra Gowdappa < >>> rgowdapp at redhat.com> wrote: >>> >>>> >>>> >>>> On Mon, Dec 24, 2018 at 3:40 PM Sankarshan Mukhopadhyay < >>>> sankarshan.mukhopadhyay at gmail.com> wrote: >>>> >>>>> [pulling the conclusions up to enable better in-line] >>>>> >>>>> > Conclusions: >>>>> > >>>>> > We should never have a volume with caching-related xlators disabled. 
>>>>> The price we pay for it is too high. We need to make them work consistently >>>>> and aggressively to avoid as many requests as we can. >>>>> >>>>> Are there current issues in terms of behavior which are known/observed >>>>> when these are enabled? >>>>> >>>> >>>> We did have issues with pgbench in past. But they've have been fixed. >>>> Please refer to bz [1] for details. On 5.1, it runs successfully with all >>>> caching related xlators enabled. Having said that the only performance >>>> xlators which gave improved performance were open-behind and write-behind >>>> [2] (write-behind had some issues, which will be fixed by [3] and we'll >>>> have to measure performance again with fix to [3]). >>>> >>> >>> One quick update. Enabling write-behind and md-cache with fix for [3] >>> reduced the total time taken for pgbench init phase roughly by 20%-25% >>> (from 12.5 min to 9.75 min for a scale of 100). Though this is still a huge >>> time (around 12hrs for a db of scale 8000). I'll follow up with a detailed >>> report once my experiments are complete. Currently trying to optimize the >>> read path. >>> >>> >>>> For some reason, read-side caching didn't improve transactions per >>>> second. I am working on this problem currently. Note that these bugs >>>> measure transaction phase of pgbench, but what xavi measured in his mail is >>>> init phase. Nevertheless, evaluation of read caching (metadata/data) will >>>> still be relevant for init phase too. >>>> >>>> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1512691 >>>> [2] https://bugzilla.redhat.com/show_bug.cgi?id=1629589#c4 >>>> [3] https://bugzilla.redhat.com/show_bug.cgi?id=1648781 >>>> >>>> >>>>> > We need to analyze client/server xlators deeper to see if we can >>>>> avoid some delays. However optimizing something that is already at the >>>>> microsecond level can be very hard. >>>>> >>>>> That is true - are there any significant gains which can be accrued by >>>>> putting efforts here or, should this be a lower priority? >>>>> >>>> >>>> The problem identified by xavi is also the one we (Manoj, Krutika, me >>>> and Milind) had encountered in the past [4]. The solution we used was to >>>> have multiple rpc connections between single brick and client. The solution >>>> indeed fixed the bottleneck. So, there is definitely work involved here - >>>> either to fix the single connection model or go with multiple connection >>>> model. Its preferred to improve single connection and resort to multiple >>>> connections only if bottlenecks in single connection are not fixable. >>>> Personally I think this is high priority along with having appropriate >>>> client side caching. >>>> >>>> [4] https://bugzilla.redhat.com/show_bug.cgi?id=1467614#c52 >>>> >>>> >>>>> > We need to determine what causes the fluctuations in brick side and >>>>> avoid them. >>>>> > This scenario is very similar to a smallfile/metadata workload, so >>>>> this is probably one important cause of its bad performance. >>>>> >>>>> What kind of instrumentation is required to enable the determination? >>>>> >>>>> On Fri, Dec 21, 2018 at 1:48 PM Xavi Hernandez >>>>> wrote: >>>>> > >>>>> > Hi, >>>>> > >>>>> > I've done some tracing of the latency that network layer introduces >>>>> in gluster. 
I've made the analysis as part of the pgbench performance issue >>>>> (in particulat the initialization and scaling phase), so I decided to look >>>>> at READV for this particular workload, but I think the results can be >>>>> extrapolated to other operations that also have small latency (cached data >>>>> from FS for example). >>>>> > >>>>> > Note that measuring latencies introduces some latency. It consists >>>>> in a call to clock_get_time() for each probe point, so the real latency >>>>> will be a bit lower, but still proportional to these numbers. >>>>> > >>>>> >>>>> [snip] >>>>> _______________________________________________ >>>>> Gluster-devel mailing list >>>>> Gluster-devel at gluster.org >>>>> https://lists.gluster.org/mailman/listinfo/gluster-devel >>>>> >>>> -------------- next part -------------- An HTML attachment was scrubbed... URL: From xhernandez at redhat.com Sun Jan 27 07:03:16 2019 From: xhernandez at redhat.com (Xavi Hernandez) Date: Sun, 27 Jan 2019 08:03:16 +0100 Subject: [Gluster-devel] Performance improvements In-Reply-To: References: Message-ID: On Fri, 25 Jan 2019, 08:53 Vijay Bellur Thank you for the detailed update, Xavi! This looks very interesting. > > On Thu, Jan 24, 2019 at 7:50 AM Xavi Hernandez > wrote: > >> Hi all, >> >> I've just updated a patch [1] that implements a new thread pool based on >> a wait-free queue provided by userspace-rcu library. The patch also >> includes an auto scaling mechanism that only keeps running the needed >> amount of threads for the current workload. >> >> This new approach has some advantages: >> >> - It's provided globally inside libglusterfs instead of inside an >> xlator >> >> This makes it possible that fuse thread and epoll threads transfer the >> received request to another thread sooner, wating less CPU and reacting >> sooner to other incoming requests. >> >> >> - Adding jobs to the queue used by the thread pool only requires an >> atomic operation >> >> This makes the producer side of the queue really fast, almost with no >> delay. >> >> >> - Contention is reduced >> >> The producer side has negligible contention thanks to the wait-free >> enqueue operation based on an atomic access. The consumer side requires a >> mutex, but the duration is very small and the scaling mechanism makes sure >> that there are no more threads than needed contending for the mutex. >> >> >> This change disables io-threads, since it replaces part of its >> functionality. However there are two things that could be needed from >> io-threads: >> >> - Prioritization of fops >> >> Currently, io-threads assigns priorities to each fop, so that some fops >> are handled before than others. >> >> >> - Fair distribution of execution slots between clients >> >> Currently, io-threads processes requests from each client in round-robin. >> >> >> These features are not implemented right now. If they are needed, >> probably the best thing to do would be to keep them inside io-threads, but >> change its implementation so that it uses the global threads from the >> thread pool instead of its own threads. >> > > > These features are indeed useful to have and hence modifying the > implementation of io-threads to provide this behavior would be welcome. > > > >> >> >> These tests have shown that the limiting factor has been the disk in most >> cases, so it's hard to tell if the change has really improved things. There >> is only one clear exception: self-heal on a dispersed volume completes >> 12.7% faster. 
From jenkins at build.gluster.org Mon Jan 28 01:45:03 2019
From: jenkins at build.gluster.org (jenkins at build.gluster.org)
Date: Mon, 28 Jan 2019 01:45:03 +0000 (UTC)
Subject: [Gluster-devel] Weekly Untriaged Bugs
Message-ID: <693909255.16.1548639904287.JavaMail.jenkins@jenkins-el7.rht.gluster.org>

[...truncated 6 lines...]
https://bugzilla.redhat.com/1667168 / arbiter: Thin Arbiter documentation refers commands don't exist "glustercli'
https://bugzilla.redhat.com/1668227 / core: gluster(8) - Add SELinux context glusterd_brick_t to man page
https://bugzilla.redhat.com/1665145 / core: Writes on Gluster 5 volumes fail with EIO when "cluster.consistent-metadata" is set
https://bugzilla.redhat.com/1668239 / disperse: [man page] Gluster(8) - Missing disperse-data parameter Gluster Console Manager man page
https://bugzilla.redhat.com/1663337 / doc: Gluster documentation on quorum-reads option is incorrect
https://bugzilla.redhat.com/1663205 / fuse: List dictionary is too slow
https://bugzilla.redhat.com/1668118 / geo-replication: Failure to start geo-replication for tiered volume.
https://bugzilla.redhat.com/1664524 / geo-replication: Non-root geo-replication session goes to faulty state, when the session is started
https://bugzilla.redhat.com/1668245 / glusterd: gluster(8) - Man page - create gluster example session
https://bugzilla.redhat.com/1663247 / glusterd: remove static memory allocations from code
https://bugzilla.redhat.com/1663519 / gluster-smb: Memory leak when smb.conf has "store dos attributes = yes"
https://bugzilla.redhat.com/1666326 / open-behind: reopening bug 1405147: Failed to dispatch handler: glusterfs seems to check for "write permission" instead for "file owner" during open() when writing to a file
https://bugzilla.redhat.com/1668259 / packaging: Glusterfs 5.3 RPMs can't be build on rhel7
https://bugzilla.redhat.com/1665361 / project-infrastructure: Alerts for offline nodes
https://bugzilla.redhat.com/1663780 / project-infrastructure: On docs.gluster.org, we should convert spaces in folder or file names to 301 redirects to hypens
https://bugzilla.redhat.com/1666634 / protocol: nfs client cannot compile files on dispersed volume
https://bugzilla.redhat.com/1665677 / rdma: volume create and transport change with rdma failed
https://bugzilla.redhat.com/1668286 / read-ahead: READDIRP incorrectly updates posix-acl inode ctx
https://bugzilla.redhat.com/1664215 / read-ahead: Toggling readdir-ahead translator off causes some clients to umount some of its volumes
https://bugzilla.redhat.com/1664398 / tests: ./tests/00-geo-rep/00-georep-verify-setup.t does not work with ./run-tests-in-vagrant.sh
[...truncated 2 lines...]
-------------- next part --------------
A non-text attachment was scrubbed...
Name: build.log
Type: application/octet-stream
Size: 2699 bytes
Desc: not available
URL: 

From sunkumar at redhat.com Tue Jan 29 06:05:49 2019
From: sunkumar at redhat.com (Sunny Kumar)
Date: Tue, 29 Jan 2019 11:35:49 +0530
Subject: [Gluster-devel] Infer results - Glusterfs
Message-ID: 

Hello folks,

As many of you already know, Coverity has been down for nearly 2 months, and during that period we were not able to perform any static analysis on our code base. Coverity is still in read-only mode. So I tried another tool, Infer [1], on the latest master; here is a summary of the report:

DEAD_STORE: 2994
MEMORY_LEAK: 1926
NULL_DEREFERENCE: 552
UNINITIALIZED_VALUE: 33
RESOURCE_LEAK: 6
USE_AFTER_FREE: 1
***************

Infer will be very useful to us, as it is inter-procedural and each procedure gets analyzed independently, so it will uncover some deep inter-procedural bugs.

I am analysing the results now; at first scan they look promising, and it will be good to fix these reported issues. Soon I will make the report public and automate Infer to run as a daily job so that we can track fixes.

[1] https://github.com/facebook/infer

- Sunny
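
For readers unfamiliar with Infer's report categories, the hypothetical C snippet below shows the kind of defects the three largest buckets above refer to; it is purely illustrative and not taken from the GlusterFS code base:

    #include <stdlib.h>
    #include <string.h>

    int example(int flag)
    {
        int rc = 0;                /* DEAD_STORE: this value is never read */
        char *buf = malloc(64);    /* MEMORY_LEAK: not freed on the early-return path */

        rc = flag ? 1 : 2;
        if (flag)
            return rc;             /* buf leaks here */

        memset(buf, 0, 64);        /* NULL_DEREFERENCE: malloc() may have returned NULL */
        free(buf);
        return rc;
    }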
From sheggodu at redhat.com Tue Jan 29 07:06:06 2019
From: sheggodu at redhat.com (Sunil Kumar Heggodu Gopala Acharya)
Date: Tue, 29 Jan 2019 12:36:06 +0530
Subject: [Gluster-devel] Improvements to Gluster upstream documentation
Message-ID: 

Hi,

As part of our continuous effort to improve the Gluster upstream documentation, we are proposing a change to the documentation theme that we are currently using, through glusterdocs pull request 454. A preview of the proposed changes can be viewed through this temporary website. Request you to review and share comments/concerns/feedback.

Regards,

Sunil kumar AcharYa
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From anuradha.stalur at gmail.com Thu Jan 31 00:27:38 2019
From: anuradha.stalur at gmail.com (Anuradha Talur)
Date: Wed, 30 Jan 2019 16:27:38 -0800
Subject: [Gluster-devel] Release 6: Kick off!
In-Reply-To: <03fb0aaa-66c4-d9b1-1228-3ce527ce5fb9@redhat.com>
References: <03fb0aaa-66c4-d9b1-1228-3ce527ce5fb9@redhat.com>
Message-ID: 

Patches being worked on for Gluster-6. These haven't been refreshed in a while; that will be done sometime this week or early next week.

1) https://review.gluster.org/#/c/glusterfs/+/21585/
2) https://review.gluster.org/#/c/glusterfs/+/21756/
3) https://review.gluster.org/#/c/glusterfs/+/21681/
4) https://review.gluster.org/#/c/glusterfs/+/21694/
5) https://review.gluster.org/#/c/glusterfs/+/21757/
6) https://review.gluster.org/#/c/glusterfs/+/21771/

--
Thanks,
Anuradha Talur.

On Tue, Nov 6, 2018 at 8:34 AM Shyam Ranganathan wrote:
>
> Hi,
>
> With release-5 out of the door, it is time to start some activities for release-6.
>
> ## Scope
> It is time to collect and determine scope for the release, so as usual, please send in features/enhancements that you are working towards reaching maturity for this release to the devel list, and mark/open the github issue with the required milestone [1].
>
> At a broader scale, in the maintainers meeting we discussed the enhancement wish list as in [2].
>
> Other than the above, we are continuing with our quality focus and would want to see a downward trend (or near-zero) in the following areas:
> - Coverity
> - clang
> - ASAN
>
> We would also like to tighten our nightly testing health, and would ideally not want to have tests retry and pass on the second attempt in the testing runs. Towards this, we would send in reports of retried and failed tests that need attention and fixes as required.
>
> ## Schedule
> NOTE: The schedule is going to get heavily impacted by the end-of-year holidays, but we will try to keep it up as much as possible.
>
> Working backwards on the schedule, here's what we have:
> - Announcement: Week of Feb 4th, 2019
> - GA tagging: Feb-01-2019
> - RC1: On demand before GA
> - RC0: Jan-02-2019
> - Late features cut-off: Week of Dec-24th, 2018
> - Branching (feature cutoff date): Dec-17-2018
>   (~45 days prior to branching)
> - Feature/scope proposal for the release (end date): Nov-21-2018
>
> ## Volunteers
> This is my usual call for volunteers to run the release with me or otherwise, but please do consider. We need more hands this time, and possibly some time sharing during the end of the year owing to the holidays.
>
> Thanks,
> Shyam
>
> [1] Release-6 github milestone: https://github.com/gluster/glusterfs/milestone/8
>
> [2] Release-6 enhancement wishlist: https://hackmd.io/sP5GsZ-uQpqnmGZmFKuWIg#
> _______________________________________________
> Gluster-devel mailing list
> Gluster-devel at gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-devel

From xhernandez at redhat.com Thu Jan 31 17:08:37 2019
From: xhernandez at redhat.com (Xavi Hernandez)
Date: Thu, 31 Jan 2019 18:08:37 +0100
Subject: [Gluster-devel] Performance improvements
In-Reply-To: 
References: 
Message-ID: 

On Sun, Jan 27, 2019 at 8:03 AM Xavi Hernandez wrote:

> On Fri, 25 Jan 2019, 08:53 Vijay Bellur
>> Thank you for the detailed update, Xavi! This looks very interesting.
>>
>> On Thu, Jan 24, 2019 at 7:50 AM Xavi Hernandez wrote:
>>
>>> Hi all,
>>>
>>> I've just updated a patch [1] that implements a new thread pool based on a wait-free queue provided by the userspace-rcu library. The patch also includes an auto-scaling mechanism that only keeps running the needed number of threads for the current workload.
>>>
>>> This new approach has some advantages:
>>>
>>> - It's provided globally inside libglusterfs instead of inside an xlator
>>>
>>> This makes it possible for the fuse thread and epoll threads to transfer the received request to another thread sooner, wasting less CPU and reacting sooner to other incoming requests.
>>>
>>> - Adding jobs to the queue used by the thread pool only requires an atomic operation
>>>
>>> This makes the producer side of the queue really fast, with almost no delay.
>>>
>>> - Contention is reduced
>>>
>>> The producer side has negligible contention thanks to the wait-free enqueue operation based on an atomic access. The consumer side requires a mutex, but the duration is very small and the scaling mechanism makes sure that there are no more threads than needed contending for the mutex.
>>>
>>> This change disables io-threads, since it replaces part of its functionality. However, there are two things that could be needed from io-threads:
>>>
>>> - Prioritization of fops
>>>
>>> Currently, io-threads assigns priorities to each fop, so that some fops are handled before others.
>>>
>>> - Fair distribution of execution slots between clients
>>>
>>> Currently, io-threads processes requests from each client in round-robin order.
>>>
>>> These features are not implemented right now.
>>> If they are needed, probably the best thing to do would be to keep them inside io-threads, but change its implementation so that it uses the global threads from the thread pool instead of its own threads.
>>>
>>
>> These features are indeed useful to have and hence modifying the implementation of io-threads to provide this behavior would be welcome.
>>
>>> These tests have shown that the limiting factor has been the disk in most cases, so it's hard to tell if the change has really improved things. There is only one clear exception: self-heal on a dispersed volume completes 12.7% faster. The utilization of CPU has also dropped drastically:
>>>
>>> Old implementation: 12.30 user, 41.78 sys, 43.16 idle, 0.73 wait
>>> New implementation: 4.91 user, 5.52 sys, 81.60 idle, 5.91 wait
>>>
>>> Now I'm running some more tests on NVMe to try to see the effects of the change when the disk is not limiting performance. I'll update once I have more data.
>>>
>> Will look forward to these numbers.
>>
> I have identified an issue that limits the number of active threads when load is high, causing some regressions. I'll fix it and rerun the tests on Monday.
>
Once the issue was solved, it caused high load averages for some workloads, which actually resulted in a regression (too much I/O, I guess) instead of an improvement. So I added a configurable maximum number of threads and made the whole implementation optional, so that it can be safely used when required. I did some tests and was able to get at least the same performance we had before this patch in all cases, and in some cases even better. But each test needed manual configuration of the number of threads. I need to work on a way to automatically compute the maximum so that it can be used easily in any workload (or even combined workloads). I uploaded the latest version of the patch.

Xavi

> Xavi
>
>> Regards,
>> Vijay
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From xhernandez at redhat.com Thu Jan 31 18:01:04 2019
From: xhernandez at redhat.com (Xavi Hernandez)
Date: Thu, 31 Jan 2019 19:01:04 +0100
Subject: [Gluster-devel] I/O performance
Message-ID: 

Hi,

I've been doing some tests with the global thread pool [1], and I've observed one important thing: since this new thread pool has very low contention (apparently), it exposes other problems when the number of threads grows. What I've seen is that some workloads use all available threads on bricks to do I/O, causing avgload to grow rapidly and saturating the machine (or so it seems), which really makes everything slower. Reducing the maximum number of threads actually improves performance.

Other workloads, though, do little I/O (probably most of it is locking or smallfile operations). In this case, limiting the number of threads to a small value causes a performance reduction; to increase performance we need more threads.

So this is making me think that maybe we should implement some sort of I/O queue with a maximum I/O depth for each brick (or disk, if bricks share the same disk). This way we can limit the number of requests physically accessing the underlying FS concurrently, without actually limiting the number of threads that can be doing other things on each brick. I think this could improve performance.

Maybe this approach could also be useful on the client side, but I think it's not so critical there.

What do you think?
Xavi

[1] https://review.gluster.org/c/glusterfs/+/20636
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
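
As a rough sketch of the per-brick I/O depth limit proposed in the message above, a counting semaphore can bound how many requests touch the backend filesystem at once while leaving the number of worker threads unconstrained. MAX_IO_DEPTH, brick_read() and the use of a plain POSIX semaphore are illustrative assumptions here, not part of the actual proposal or the patch:

    /* Illustrative I/O-depth limiter; not GlusterFS code. */
    #include <semaphore.h>
    #include <unistd.h>

    #define MAX_IO_DEPTH 16      /* hypothetical per-brick queue depth */

    static sem_t io_slots;

    void brick_io_init(void)
    {
        sem_init(&io_slots, 0, MAX_IO_DEPTH);
    }

    ssize_t brick_read(int fd, void *buf, size_t count, off_t offset)
    {
        ssize_t ret;

        sem_wait(&io_slots);     /* blocks once MAX_IO_DEPTH reads are in flight */
        ret = pread(fd, buf, count, offset);
        sem_post(&io_slots);
        return ret;
    }

A real implementation would presumably keep one such limit per brick (or per underlying disk, as suggested above) and make the depth configurable, so that threads doing locking or metadata work are never throttled by it.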