From biholcomb at l1049h.com  Tue Jan  1 16:58:31 2019
From: biholcomb at l1049h.com (Brett Holcomb)
Date: Tue, 1 Jan 2019 11:58:31 -0500
Subject: [Gluster-users] [External] Re: Self Heal Confusion
In-Reply-To:
References: <24cfe8a5-dadb-6271-9b7f-af8670f43fce@l1049h.com>
 <14cbebd2-44d0-8558-1e26-944e1dec15a7@l1049h.com>
 <1851464190.54195617.1545898152060.JavaMail.zimbra@redhat.com>
 <988970243.54246776.1545976827971.JavaMail.zimbra@redhat.com>
 <9d548f7b-1859-f438-2cb9-9ca1cb3baa86@l1049h.com>
 <3c2edc47-2cdc-0b90-a708-c59cc8a51937@l1049h.com>
 <7fe3f846-7289-5885-9905-3e7812964970@l1049h.com>
Message-ID:

Healing time set to 120 seconds for now.

Just to make sure I understand: I need to take the result of "gluster
volume heal projects info" and put it in a file. Then try to find each
gfid listed in that file in the .glusterfs directory of each brick listed
in the output as having unhealed files, and delete that file if it
exists. If it doesn't exist, don't worry about it.

So these bricks have unhealed entries listed:

/srv/gfs01/Projects/.glusterfs - 85 files
/srv/gfs05/Projects/.glusterfs - 58854 files
/srv/gfs06/Projects/.glusterfs - 58854 files

Script time!

On 12/31/18 4:39 AM, Davide Obbi wrote:
> cluster.quorum-type auto
> cluster.quorum-count (null)
> cluster.server-quorum-type off
> cluster.server-quorum-ratio 0
> cluster.quorum-reads no
>
> Where exactly do I remove the gfid entries from - the .glusterfs
> directory? --> yes, can't remember exactly where, but try to do a find
> in the brick paths with the gfid; it should return something
>
> Where do I put the cluster.heal-timeout option - which file? -->
> gluster volume set volumename option value
>
> On Mon, Dec 31, 2018 at 10:34 AM Brett Holcomb wrote:
>
> That is probably the case as a lot of files were deleted some time
> ago.
>
> I'm on version 5.2 but was on 3.12 until about a week ago.
>
> Here is the quorum info. I'm running distributed replicated volumes
> in 2 x 3 = 6
>
> cluster.quorum-type auto
> cluster.quorum-count (null)
> cluster.server-quorum-type off
> cluster.server-quorum-ratio 0
> cluster.quorum-reads no
>
> Where exactly do I remove the gfid entries from - the .glusterfs
> directory? Do I just delete all the directories and files under this
> directory?
>
> Where do I put the cluster.heal-timeout option - which file?
>
> I think you've hit on the cause of the issue. Thinking back, we've had
> some extended power outages, and due to a misconfiguration in the swap
> file device name a couple of the nodes did not come up and I didn't
> catch it for a while, so maybe the deletes occurred then.
>
> Thank you.
>
> On 12/31/18 2:58 AM, Davide Obbi wrote:
> > if the long GFID does not correspond to any file it could mean the
> > file has been deleted by the client mounting the volume. I think this
> > is caused when the delete was issued and the number of active bricks
> > were not reaching quorum majority or a second brick was taken down
> > while another was down or did not finish the selfheal, the latter more
> > likely.
> > It would be interesting to see:
> > - what version of glusterfs you are running, it happened to me with 3.12
> > - volume quorum rules: "gluster volume get vol all | grep quorum"
> >
> > To clean it up, if I remember correctly, it should be possible to delete
> > the gfid entries from the brick mounts on the glusterfs server nodes
> > reporting the files to heal.
> >
> > As a side note you might want to consider changing the selfheal
> > timeout to a more aggressive schedule in the cluster.heal-timeout option
>
> --
> Davide Obbi
> System Administrator

From rgowdapp at redhat.com  Wed Jan  2 08:00:20 2019
From: rgowdapp at redhat.com (Raghavendra Gowdappa)
Date: Wed, 2 Jan 2019 13:30:20 +0530
Subject: [Gluster-users] On making ctime generator enabled by default in stack
In-Reply-To:
References:
Message-ID:

On Mon, Nov 12, 2018 at 10:48 AM Amar Tumballi wrote:

> On Mon, Nov 12, 2018 at 10:39 AM Vijay Bellur wrote:
>
>> On Sun, Nov 11, 2018 at 8:25 PM Raghavendra Gowdappa wrote:
>>
>>> On Sun, Nov 11, 2018 at 11:41 PM Vijay Bellur wrote:
>>>
>>>> On Mon, Nov 5, 2018 at 8:31 PM Raghavendra Gowdappa
>>>> <rgowdapp at redhat.com> wrote:
>>>>
>>>>> On Tue, Nov 6, 2018 at 9:58 AM Vijay Bellur wrote:
>>>>>
>>>>>> On Mon, Nov 5, 2018 at 7:56 PM Raghavendra Gowdappa
>>>>>> <rgowdapp at redhat.com> wrote:
>>>>>>
>>>>>>> All,
>>>>>>>
>>>>>>> There is a patch [1] from Kotresh, which makes ctime generator as
>>>>>>> default in stack. Currently ctime generator is being recommended only for
>>>>>>> usecases where ctime is important (like for Elasticsearch). However, a
>>>>>>> reliable (c)(m)time can fix many consistency issues within glusterfs stack
>>>>>>> too. These are issues with caching layers having stale (meta)data
>>>>>>> [2][3][4]. Basically just like applications, components within glusterfs
>>>>>>> stack too need a time to find out which among racing ops (like write, stat,
>>>>>>> etc) has latest (meta)data.
>>>>>>>
>>>>>>> Also note that a consistent (c)(m)time is not an optional feature,
>>>>>>> but instead forms the core of the infrastructure. So, I am proposing to
>>>>>>> merge this patch. If you've any objections, please voice out before Nov 13,
>>>>>>> 2018 (a week from today).
>>>>>>>
>>>>>>> As to the existing known issues/limitations with ctime generator, my
>>>>>>> conversations with Kotresh revealed the following:
>>>>>>> * Potential performance degradation (we don't yet have data to
>>>>>>> conclusively prove it, preliminary basic tests from Kotresh didn't indicate
>>>>>>> a significant perf drop).
>>>>>>>
>>>>>>
>>>>>> Do we have this data captured somewhere? If not, would it be possible
>>>>>> to share that data here?
>>>>>>
>>>>>
>>>>> I misquoted Kotresh. He had measured impact of gfid2path and said both
>>>>> features might've similar impact as major perf cost is related to storing
>>>>> xattrs on backend fs. I am in the process of getting a fresh set of
>>>>> numbers. Will post those numbers when available.
>>>>>
>>>>
>>>> I observe that the patch under discussion has been merged now [1]. A
>>>> quick search did not yield me any performance data. Do we have the
>>>> performance numbers posted somewhere?
>>>>
>>>
>>> No. Perf benchmarking is a task pending on me.
>>>
>>
>> When can we expect this task to be complete?
>>
>> In any case, I don't think it is ideal for us to merge a patch without
>> completing our due diligence on it. How do we want to handle this scenario
>> since the patch is already merged?
>>
>> We could:
>>
>> 1. Revert the patch now
>> 2. Review the performance data and revert the patch if performance
>> characterization indicates a significant dip. It would be preferable to
>> complete this activity before we branch off for the next release.
>>
>
> I am for option 2. Considering the branch out for next release is another
> 2 months, and no one is expected to use the 'release' off a master branch
> yet, it makes sense to give that buffer time to get this activity completed.
>

It's unlikely I'll have time to carry out the perf benchmark. Hence I've
posted a revert here: https://review.gluster.org/#/c/glusterfs/+/21975/

> Regards,
> Amar
>
>> 3. Think of some other option?
>>
>> Thanks,
>> Vijay
>
> --
> Amar Tumballi (amarts)

From olaf.buitelaar at gmail.com  Wed Jan  2 10:42:58 2019
From: olaf.buitelaar at gmail.com (Olaf Buitelaar)
Date: Wed, 2 Jan 2019 11:42:58 +0100
Subject: [Gluster-users] [Stale file handle] in shard volume
In-Reply-To:
References:
Message-ID:

Dear All,

The bash file I'm planning to run can be found here:
https://gist.github.com/olafbuitelaar/ff6fe9d4ab39696d9ad6ca689cc89986
It would be nice to receive some feedback from the community before I
actually run the clean-up of all stale file handles.

Thanks Olaf

On Sun, 30 Dec 2018 at 20:56, Olaf Buitelaar wrote:

> Dear All,
>
> Till now a selected group of VMs still seem to produce new stale files
> and get paused due to this.
> I've not updated gluster recently, however I did change the op version
> from 31200 to 31202 about a week before this issue arose.
> Looking at the .shard directory, I've 100,000+ files sharing the same
> characteristics as the stale files found till now:
> they all have the sticky bit set (file permissions ---------T), are
> 0kb in size, and have the trusted.glusterfs.dht.linkto attribute.
> These files range from long ago (beginning of the year) till now, which
> makes me suspect this was lying dormant for some time now and somehow
> recently surfaced.
> Checking other sub-volumes, they also contain 0kb files in the .shard
> directory, but those don't have the sticky bit and the linkto attribute.
>
> Does anybody else experience this issue? Could this be a bug or an
> environmental issue?
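The three characteristics listed above (only the sticky bit set, zero size, and a trusted.glusterfs.dht.linkto xattr) can be written down as a small check. The following is a minimal Python sketch for reviewing candidates, not for deleting them: the function name and its arguments are made up for illustration, it only classifies values already collected with stat and getfattr on a brick, and a match does not by itself prove that a file is stale.

```python
import stat

# xattr name taken from the getfattr output quoted in this thread
LINKTO_XATTR = "trusted.glusterfs.dht.linkto"


def looks_like_dht_linkto(mode, size, xattr_names):
    """Return True if the file matches the pattern described above.

    mode        -- st_mode as returned by os.lstat()
    size        -- st_size in bytes
    xattr_names -- names of the extended attributes set on the file
    (hypothetical helper, not part of any gluster tooling)
    """
    is_regular = stat.S_ISREG(mode)
    # Permission bits are exactly ---------T, i.e. only the sticky bit (1000).
    sticky_only = stat.S_IMODE(mode) == stat.S_ISVTX
    return is_regular and sticky_only and size == 0 and LINKTO_XATTR in xattr_names
```

On a brick, run as root, the inputs could come from os.lstat(path) and os.listxattr(path); keeping the check itself free of filesystem access makes it easy to try against values copied from stat and getfattr output first.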
>
> Also I wonder if there is any tool or gluster command to clean all stale
> file handles?
> Otherwise I'm planning to make a simple bash script, which iterates over
> the .shard dir, checks each file for the above mentioned criteria, and
> (re)moves the file and the corresponding .glusterfs file.
> If there are other criteria needed to identify a stale file handle, I
> would like to hear that.
> If this is a viable and safe operation to do, of course.
>
> Thanks Olaf
>
>
>
> On Thu, 20 Dec 2018 at 13:43, Olaf Buitelaar
> <olaf.buitelaar at gmail.com> wrote:
>
>> Dear All,
>>
>> I figured it out, it appeared to be the exact same issue as described
>> here;
>> https://lists.gluster.org/pipermail/gluster-users/2018-March/033785.html
>> Another subvolume also had the shard file, only they were all 0 bytes and had
>> the dht.linkto
>>
>> for reference;
>> [root@lease-04 ovirt-backbone-2]# getfattr -d -m . -e hex .shard/f28cabcb-d169-41fc-a633-9bef4c4a8e40.51500
>> # file: .shard/f28cabcb-d169-41fc-a633-9bef4c4a8e40.51500
>> security.selinux=0x73797374656d5f753a6f626a6563745f723a6574635f72756e74696d655f743a733000
>> trusted.gfid=0x298147e49f9748b2baf1c8fff897244d
>> trusted.gfid2path.b48064c78d7a85c9=0x62653331383633382d653861302d346336642d393737642d3761393337616138343830362f66323863616263622d643136392d343166632d613633332d3962656634633461386534302e3531353030
>> trusted.glusterfs.dht.linkto=0x6f766972742d6261636b626f6e652d322d7265706c69636174652d3100
>>
>> [root@lease-04 ovirt-backbone-2]# getfattr -d -m . -e hex .glusterfs/29/81/298147e4-9f97-48b2-baf1-c8fff897244d
>> # file: .glusterfs/29/81/298147e4-9f97-48b2-baf1-c8fff897244d
>> security.selinux=0x73797374656d5f753a6f626a6563745f723a6574635f72756e74696d655f743a733000
>> trusted.gfid=0x298147e49f9748b2baf1c8fff897244d
>> trusted.gfid2path.b48064c78d7a85c9=0x62653331383633382d653861302d346336642d393737642d3761393337616138343830362f66323863616263622d643136392d343166632d613633332d3962656634633461386534302e3531353030
>> trusted.glusterfs.dht.linkto=0x6f766972742d6261636b626f6e652d322d7265706c69636174652d3100
>>
>> [root@lease-04 ovirt-backbone-2]# stat .glusterfs/29/81/298147e4-9f97-48b2-baf1-c8fff897244d
>>   File: '.glusterfs/29/81/298147e4-9f97-48b2-baf1-c8fff897244d'
>>   Size: 0          Blocks: 0          IO Block: 4096   regular empty file
>> Device: fd01h/64769d    Inode: 1918631406  Links: 2
>> Access: (1000/---------T)  Uid: (    0/    root)   Gid: (    0/    root)
>> Context: system_u:object_r:etc_runtime_t:s0
>> Access: 2018-12-17 21:43:36.405735296 +0000
>> Modify: 2018-12-17 21:43:36.405735296 +0000
>> Change: 2018-12-17 21:43:36.405735296 +0000
>>  Birth: -
>>
>> Removing the shard file and glusterfs file from each node resolved the
>> issue.
>>
>> I also found this thread;
>> https://lists.gluster.org/pipermail/gluster-users/2018-December/035460.html
>> Maybe he suffers from the same issue.
>>
>> Best Olaf
>>
>>
>> On Wed, 19 Dec 2018 at 21:56, Olaf Buitelaar
>> <olaf.buitelaar at gmail.com> wrote:
>>
>>> Dear All,
>>>
>>> It appears I've a stale file in one of the volumes, on 2 files. These
>>> files are qemu images (1 raw and 1 qcow2).
>>> I'll just focus on 1 file since the situation on the other seems the
>>> same.
>>>
>>> The VM gets paused more or less directly after being booted with error;
>>> [2018-12-18 14:05:05.275713] E [MSGID: 133010]
>>> [shard.c:1724:shard_common_lookup_shards_cbk] 0-ovirt-backbone-2-shard:
>>> Lookup on shard 51500 failed. Base file gfid =
>>> f28cabcb-d169-41fc-a633-9bef4c4a8e40 [Stale file handle]
>>>
>>> investigating the shard;
>>>
>>> #on the arbiter node:
>>>
>>> [root@lease-05 ovirt-backbone-2]# getfattr -n glusterfs.gfid.string /mnt/ovirt-backbone-2/b1c2c949-aef4-4aec-999b-b179efeef732/images/f6ac9660-a84e-469e-a17c-c6dbc538af4b/d6b09501-5b79-4c92-bf10-2ed3bda1b425
>>> getfattr: Removing leading '/' from absolute path names
>>> # file: mnt/ovirt-backbone-2/b1c2c949-aef4-4aec-999b-b179efeef732/images/f6ac9660-a84e-469e-a17c-c6dbc538af4b/d6b09501-5b79-4c92-bf10-2ed3bda1b425
>>> glusterfs.gfid.string="f28cabcb-d169-41fc-a633-9bef4c4a8e40"
>>>
>>> [root@lease-05 ovirt-backbone-2]# getfattr -d -m . -e hex .shard/f28cabcb-d169-41fc-a633-9bef4c4a8e40.51500
>>> # file: .shard/f28cabcb-d169-41fc-a633-9bef4c4a8e40.51500
>>> security.selinux=0x73797374656d5f753a6f626a6563745f723a6574635f72756e74696d655f743a733000
>>> trusted.afr.dirty=0x000000000000000000000000
>>> trusted.gfid=0x1f86b4328ec6424699aa48cc6d7b5da0
>>> trusted.gfid2path.b48064c78d7a85c9=0x62653331383633382d653861302d346336642d393737642d3761393337616138343830362f66323863616263622d643136392d343166632d613633332d3962656634633461386534302e3531353030
>>>
>>> [root@lease-05 ovirt-backbone-2]# getfattr -d -m . -e hex .glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0
>>> # file: .glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0
>>> security.selinux=0x73797374656d5f753a6f626a6563745f723a6574635f72756e74696d655f743a733000
>>> trusted.afr.dirty=0x000000000000000000000000
>>> trusted.gfid=0x1f86b4328ec6424699aa48cc6d7b5da0
>>> trusted.gfid2path.b48064c78d7a85c9=0x62653331383633382d653861302d346336642d393737642d3761393337616138343830362f66323863616263622d643136392d343166632d613633332d3962656634633461386534302e3531353030
>>>
>>> [root@lease-05 ovirt-backbone-2]# stat .glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0
>>>   File: '.glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0'
>>>   Size: 0          Blocks: 0          IO Block: 4096   regular empty file
>>> Device: fd01h/64769d    Inode: 537277306   Links: 2
>>> Access: (0660/-rw-rw----)  Uid: (    0/    root)   Gid: (    0/    root)
>>> Context: system_u:object_r:etc_runtime_t:s0
>>> Access: 2018-12-17 21:43:36.361984810 +0000
>>> Modify: 2018-12-17 21:43:36.361984810 +0000
>>> Change: 2018-12-18 20:55:29.908647417 +0000
>>>  Birth: -
>>>
>>> [root@lease-05 ovirt-backbone-2]# find . -inum 537277306
>>> ./.glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0
>>> ./.shard/f28cabcb-d169-41fc-a633-9bef4c4a8e40.51500
>>>
>>> #on the data nodes:
>>>
>>> [root@lease-08 ~]# getfattr -n glusterfs.gfid.string /mnt/ovirt-backbone-2/b1c2c949-aef4-4aec-999b-b179efeef732/images/f6ac9660-a84e-469e-a17c-c6dbc538af4b/d6b09501-5b79-4c92-bf10-2ed3bda1b425
>>> getfattr: Removing leading '/' from absolute path names
>>> # file: mnt/ovirt-backbone-2/b1c2c949-aef4-4aec-999b-b179efeef732/images/f6ac9660-a84e-469e-a17c-c6dbc538af4b/d6b09501-5b79-4c92-bf10-2ed3bda1b425
>>> glusterfs.gfid.string="f28cabcb-d169-41fc-a633-9bef4c4a8e40"
>>>
>>> [root@lease-08 ovirt-backbone-2]# getfattr -d -m . -e hex .shard/f28cabcb-d169-41fc-a633-9bef4c4a8e40.51500
>>> # file: .shard/f28cabcb-d169-41fc-a633-9bef4c4a8e40.51500
>>> security.selinux=0x73797374656d5f753a6f626a6563745f723a6574635f72756e74696d655f743a733000
>>> trusted.afr.dirty=0x000000000000000000000000
>>> trusted.gfid=0x1f86b4328ec6424699aa48cc6d7b5da0
>>> trusted.gfid2path.b48064c78d7a85c9=0x62653331383633382d653861302d346336642d393737642d3761393337616138343830362f66323863616263622d643136392d343166632d613633332d3962656634633461386534302e3531353030
>>>
>>> [root@lease-08 ovirt-backbone-2]# getfattr -d -m . -e hex .glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0
>>> # file: .glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0
>>> security.selinux=0x73797374656d5f753a6f626a6563745f723a6574635f72756e74696d655f743a733000
>>> trusted.afr.dirty=0x000000000000000000000000
>>> trusted.gfid=0x1f86b4328ec6424699aa48cc6d7b5da0
>>> trusted.gfid2path.b48064c78d7a85c9=0x62653331383633382d653861302d346336642d393737642d3761393337616138343830362f66323863616263622d643136392d343166632d613633332d3962656634633461386534302e3531353030
>>>
>>> [root@lease-08 ovirt-backbone-2]# stat .glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0
>>>   File: '.glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0'
>>>   Size: 2166784    Blocks: 4128       IO Block: 4096   regular file
>>> Device: fd03h/64771d    Inode: 12893624759  Links: 3
>>> Access: (0660/-rw-rw----)  Uid: (    0/    root)   Gid: (    0/    root)
>>> Context: system_u:object_r:etc_runtime_t:s0
>>> Access: 2018-12-18 18:52:38.070776585 +0000
>>> Modify: 2018-12-17 21:43:36.388054443 +0000
>>> Change: 2018-12-18 21:01:47.810506528 +0000
>>>  Birth: -
>>>
>>> [root@lease-08 ovirt-backbone-2]# find . -inum 12893624759
>>> ./.glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0
>>> ./.shard/f28cabcb-d169-41fc-a633-9bef4c4a8e40.51500
>>>
>>> ========================
>>>
>>> [root@lease-11 ovirt-backbone-2]# getfattr -n glusterfs.gfid.string /mnt/ovirt-backbone-2/b1c2c949-aef4-4aec-999b-b179efeef732/images/f6ac9660-a84e-469e-a17c-c6dbc538af4b/d6b09501-5b79-4c92-bf10-2ed3bda1b425
>>> getfattr: Removing leading '/' from absolute path names
>>> # file: mnt/ovirt-backbone-2/b1c2c949-aef4-4aec-999b-b179efeef732/images/f6ac9660-a84e-469e-a17c-c6dbc538af4b/d6b09501-5b79-4c92-bf10-2ed3bda1b425
>>> glusterfs.gfid.string="f28cabcb-d169-41fc-a633-9bef4c4a8e40"
>>>
>>> [root@lease-11 ovirt-backbone-2]# getfattr -d -m . -e hex .shard/f28cabcb-d169-41fc-a633-9bef4c4a8e40.51500
>>> # file: .shard/f28cabcb-d169-41fc-a633-9bef4c4a8e40.51500
>>> security.selinux=0x73797374656d5f753a6f626a6563745f723a6574635f72756e74696d655f743a733000
>>> trusted.afr.dirty=0x000000000000000000000000
>>> trusted.gfid=0x1f86b4328ec6424699aa48cc6d7b5da0
>>> trusted.gfid2path.b48064c78d7a85c9=0x62653331383633382d653861302d346336642d393737642d3761393337616138343830362f66323863616263622d643136392d343166632d613633332d3962656634633461386534302e3531353030
>>>
>>> [root@lease-11 ovirt-backbone-2]# getfattr -d -m . -e hex .glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0
>>> # file: .glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0
>>> security.selinux=0x73797374656d5f753a6f626a6563745f723a6574635f72756e74696d655f743a733000
>>> trusted.afr.dirty=0x000000000000000000000000
>>> trusted.gfid=0x1f86b4328ec6424699aa48cc6d7b5da0
>>> trusted.gfid2path.b48064c78d7a85c9=0x62653331383633382d653861302d346336642d393737642d3761393337616138343830362f66323863616263622d643136392d343166632d613633332d3962656634633461386534302e3531353030
>>>
>>> [root@lease-11 ovirt-backbone-2]# stat .glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0
>>>   File: '.glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0'
>>>   Size: 2166784    Blocks: 4128       IO Block: 4096   regular file
>>> Device: fd03h/64771d    Inode: 12956094809  Links: 3
>>> Access: (0660/-rw-rw----)  Uid: (    0/    root)   Gid: (    0/    root)
>>> Context: system_u:object_r:etc_runtime_t:s0
>>> Access: 2018-12-18 20:11:53.595208449 +0000
>>> Modify: 2018-12-17 21:43:36.391580259 +0000
>>> Change: 2018-12-18 19:19:25.888055392 +0000
>>>  Birth: -
>>>
>>> [root@lease-11 ovirt-backbone-2]# find . -inum 12956094809
>>> ./.glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0
>>> ./.shard/f28cabcb-d169-41fc-a633-9bef4c4a8e40.51500
>>>
>>> ================
>>>
>>> I don't really see any inconsistencies, except the dates on the stat.
>>> However this is only after I tried moving the file out of the volumes to
>>> force a heal, which does happen on the data nodes, but not on the arbiter
>>> node. Before that they were also the same.
>>> I've also compared the file
>>> ./.shard/f28cabcb-d169-41fc-a633-9bef4c4a8e40.51500 on the 2 nodes and they
>>> are exactly the same.
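The hex values printed by getfattr -e hex above are easier to compare once decoded, and the trusted.gfid value maps directly onto the .glusterfs path that the same output shows (.glusterfs/<first two hex digits>/<next two>/<gfid>). The Python sketch below illustrates both steps; the two function names are invented for this example and it reads nothing from a brick, it only converts strings copied from the output.

```python
import uuid


def decode_xattr_hex(value):
    """Decode a '0x...' value as printed by `getfattr -e hex` into text.

    Trailing NUL bytes, which the values in this thread carry, are dropped.
    (hypothetical helper name)
    """
    digits = value[2:] if value.startswith("0x") else value
    return bytes.fromhex(digits).rstrip(b"\x00").decode("utf-8", errors="replace")


def gfid_backend_path(gfid_hex):
    """Map a trusted.gfid hex value to its path below the brick root.

    Follows the layout visible in the output above:
    .glusterfs/<hex digits 1-2>/<hex digits 3-4>/<gfid with dashes>
    (hypothetical helper name)
    """
    digits = gfid_hex[2:] if gfid_hex.startswith("0x") else gfid_hex
    gfid = str(uuid.UUID(hex=digits))
    return ".glusterfs/{}/{}/{}".format(gfid[0:2], gfid[2:4], gfid)
```

For instance, the trusted.glusterfs.dht.linkto value shown for lease-04 decodes to ovirt-backbone-2-replicate-1, which names the replicate subvolume the link file points at, and the trusted.gfid2path value decodes to a parent gfid followed by the shard's file name.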
>>>
>>> Things I've further tried;
>>> - gluster v heal ovirt-backbone-2 full => gluster v heal
>>> ovirt-backbone-2 info reports 0 entries on all nodes
>>>
>>> - stop each glusterd and glusterfsd, pause around 40sec and start them
>>> again on each node, 1 at a time, waiting for the heal to recover before
>>> moving to the next node
>>>
>>> - force a heal by stopping glusterd on a node and performing these steps;
>>> mkdir /mnt/ovirt-backbone-2/trigger
>>> rmdir /mnt/ovirt-backbone-2/trigger
>>> setfattr -n trusted.non-existent-key -v abc /mnt/ovirt-backbone-2/
>>> setfattr -x trusted.non-existent-key /mnt/ovirt-backbone-2/
>>> start glusterd
>>>
>>> - gluster volume rebalance ovirt-backbone-2 start => success
>>>
>>> What's further interesting is that according to the mount log, the volume is
>>> in split-brain;
>>> [2018-12-18 10:06:04.606870] E [MSGID: 108008] [afr-read-txn.c:90:afr_read_txn_refresh_done] 0-ovirt-backbone-2-replicate-2: Failing FSTAT on gfid 2a57d87d-fe49-4034-919b-fdb79531bf68: split-brain observed. [Input/output error]
>>> [2018-12-18 10:06:04.606908] E [MSGID: 133014] [shard.c:1248:shard_common_stat_cbk] 0-ovirt-backbone-2-shard: stat failed: 2a57d87d-fe49-4034-919b-fdb79531bf68 [Input/output error]
>>> [2018-12-18 10:06:04.606927] W [fuse-bridge.c:871:fuse_attr_cbk] 0-glusterfs-fuse: 428090: FSTAT() /b1c2c949-aef4-4aec-999b-b179efeef732/dom_md/ids => -1 (Input/output error)
>>> [2018-12-18 10:06:05.107729] E [MSGID: 108008] [afr-read-txn.c:90:afr_read_txn_refresh_done] 0-ovirt-backbone-2-replicate-2: Failing FSTAT on gfid 2a57d87d-fe49-4034-919b-fdb79531bf68: split-brain observed. [Input/output error]
>>> [2018-12-18 10:06:05.107770] E [MSGID: 133014] [shard.c:1248:shard_common_stat_cbk] 0-ovirt-backbone-2-shard: stat failed: 2a57d87d-fe49-4034-919b-fdb79531bf68 [Input/output error]
>>> [2018-12-18 10:06:05.107791] W [fuse-bridge.c:871:fuse_attr_cbk] 0-glusterfs-fuse: 428091: FSTAT() /b1c2c949-aef4-4aec-999b-b179efeef732/dom_md/ids => -1 (Input/output error)
>>> [2018-12-18 10:06:05.537244] I [MSGID: 108006] [afr-common.c:5494:afr_local_init] 0-ovirt-backbone-2-replicate-1: no subvolumes up
>>> [2018-12-18 10:06:05.538523] E [MSGID: 108008] [afr-read-txn.c:90:afr_read_txn_refresh_done] 0-ovirt-backbone-2-replicate-2: Failing STAT on gfid 00000000-0000-0000-0000-000000000001: split-brain observed. [Input/output error]
>>> [2018-12-18 10:06:05.538685] I [MSGID: 108006] [afr-common.c:5494:afr_local_init] 0-ovirt-backbone-2-replicate-1: no subvolumes up
>>> [2018-12-18 10:06:05.538794] I [MSGID: 108006] [afr-common.c:5494:afr_local_init] 0-ovirt-backbone-2-replicate-1: no subvolumes up
>>> [2018-12-18 10:06:05.539342] I [MSGID: 109063] [dht-layout.c:716:dht_layout_normalize] 0-ovirt-backbone-2-dht: Found anomalies in /b1c2c949-aef4-4aec-999b-b179efeef732 (gfid = 8c8598ce-1a52-418e-a7b4-435fee34bae8). Holes=2 overlaps=0
>>> [2018-12-18 10:06:05.539372] W [MSGID: 109005] [dht-selfheal.c:2158:dht_selfheal_directory] 0-ovirt-backbone-2-dht: Directory selfheal failed: 2 subvolumes down.Not fixing. path = /b1c2c949-aef4-4aec-999b-b179efeef732, gfid = 8c8598ce-1a52-418e-a7b4-435fee34bae8
>>> [2018-12-18 10:06:05.539694] I [MSGID: 108006] [afr-common.c:5494:afr_local_init] 0-ovirt-backbone-2-replicate-1: no subvolumes up
>>> [2018-12-18 10:06:05.540652] I [MSGID: 108006] [afr-common.c:5494:afr_local_init] 0-ovirt-backbone-2-replicate-1: no subvolumes up
>>> [2018-12-18 10:06:05.608612] E [MSGID: 108008] [afr-read-txn.c:90:afr_read_txn_refresh_done] 0-ovirt-backbone-2-replicate-2: Failing FSTAT on gfid 2a57d87d-fe49-4034-919b-fdb79531bf68: split-brain observed. [Input/output error]
>>> [2018-12-18 10:06:05.608657] E [MSGID: 133014] [shard.c:1248:shard_common_stat_cbk] 0-ovirt-backbone-2-shard: stat failed: 2a57d87d-fe49-4034-919b-fdb79531bf68 [Input/output error]
>>> [2018-12-18 10:06:05.608672] W [fuse-bridge.c:871:fuse_attr_cbk] 0-glusterfs-fuse: 428096: FSTAT() /b1c2c949-aef4-4aec-999b-b179efeef732/dom_md/ids => -1 (Input/output error)
>>> [2018-12-18 10:06:06.109339] E [MSGID: 108008] [afr-read-txn.c:90:afr_read_txn_refresh_done] 0-ovirt-backbone-2-replicate-2: Failing FSTAT on gfid 2a57d87d-fe49-4034-919b-fdb79531bf68: split-brain observed. [Input/output error]
>>> [2018-12-18 10:06:06.109378] E [MSGID: 133014] [shard.c:1248:shard_common_stat_cbk] 0-ovirt-backbone-2-shard: stat failed: 2a57d87d-fe49-4034-919b-fdb79531bf68 [Input/output error]
>>> [2018-12-18 10:06:06.109399] W [fuse-bridge.c:871:fuse_attr_cbk] 0-glusterfs-fuse: 428097: FSTAT() /b1c2c949-aef4-4aec-999b-b179efeef732/dom_md/ids => -1 (Input/output error)
>>>
>>> #note I'm able to see /b1c2c949-aef4-4aec-999b-b179efeef732/dom_md/ids
>>> [root@lease-11 ovirt-backbone-2]# stat /mnt/ovirt-backbone-2/b1c2c949-aef4-4aec-999b-b179efeef732/dom_md/ids
>>>   File: '/mnt/ovirt-backbone-2/b1c2c949-aef4-4aec-999b-b179efeef732/dom_md/ids'
>>>   Size: 1048576    Blocks: 2048       IO Block: 131072 regular file
>>> Device: 41h/65d    Inode: 10492258721813610344  Links: 1
>>> Access: (0660/-rw-rw----)  Uid: (   36/    vdsm)   Gid: (   36/     kvm)
>>> Context: system_u:object_r:fusefs_t:s0
>>> Access: 2018-12-19 20:07:39.917573869 +0000
>>> Modify: 2018-12-19 20:07:39.928573917 +0000
>>> Change: 2018-12-19 20:07:39.929573921 +0000
>>>  Birth: -
>>>
>>> however checking: gluster v heal ovirt-backbone-2 info split-brain
>>> reports no entries.
>>>
>>> I've also tried mounting the qemu image, and this works fine, I'm able
>>> to see all contents;
>>> losetup /dev/loop0 /mnt/ovirt-backbone-2/b1c2c949-aef4-4aec-999b-b179efeef732/images/f6ac9660-a84e-469e-a17c-c6dbc538af4b/d6b09501-5b79-4c92-bf10-2ed3bda1b425
>>> kpartx -a /dev/loop0
>>> vgscan
>>> vgchange -ay slave-data
>>> mkdir /mnt/slv01
>>> mount /dev/mapper/slave--data-lvol0 /mnt/slv01/
>>>
>>> Possible causes for this issue;
>>> 1. the machine "lease-11" suffered from a faulty RAM module (ECC), which
>>> halted the machine and caused an invalid state. (this machine also hosts
>>> other volumes, with similar configurations, which report no issue)
>>> 2. after the RAM module was replaced, the VM using the backing qemu
>>> image was restored from a backup (the backup was file based within the VM
>>> on a different directory). This is because some files were corrupted. The
>>> backup/recovery obviously causes extra IO, possibly introducing race
>>> conditions? The machine did run for about 12h without issues, and in total
>>> for about 36h.
>>> 3. since only the client (maybe only gfapi?) reports errors, something
>>> is broken there?
>>>
>>> The volume info;
>>> root@lease-06 ~# gluster v info ovirt-backbone-2
>>>
>>> Volume Name: ovirt-backbone-2
>>> Type: Distributed-Replicate
>>> Volume ID: 85702d35-62c8-4c8c-930d-46f455a8af28
>>> Status: Started
>>> Snapshot Count: 0
>>> Number of Bricks: 3 x (2 + 1) = 9
>>> Transport-type: tcp
>>> Bricks:
>>> Brick1: 10.32.9.7:/data/gfs/bricks/brick1/ovirt-backbone-2
>>> Brick2: 10.32.9.3:/data/gfs/bricks/brick1/ovirt-backbone-2
>>> Brick3: 10.32.9.4:/data/gfs/bricks/bricka/ovirt-backbone-2 (arbiter)
>>> Brick4: 10.32.9.8:/data0/gfs/bricks/brick1/ovirt-backbone-2
>>> Brick5: 10.32.9.21:/data0/gfs/bricks/brick1/ovirt-backbone-2
>>> Brick6: 10.32.9.5:/data/gfs/bricks/bricka/ovirt-backbone-2 (arbiter)
>>> Brick7: 10.32.9.9:/data0/gfs/bricks/brick1/ovirt-backbone-2
>>> Brick8: 10.32.9.20:/data0/gfs/bricks/brick1/ovirt-backbone-2
>>> Brick9: 10.32.9.6:/data/gfs/bricks/bricka/ovirt-backbone-2 (arbiter)
>>> Options Reconfigured:
>>> nfs.disable: on
>>> transport.address-family: inet
>>> performance.quick-read: off
>>> performance.read-ahead: off
>>> performance.io-cache: off
>>> performance.low-prio-threads: 32
>>> network.remote-dio: enable
>>> cluster.eager-lock: enable
>>> cluster.quorum-type: auto
>>> cluster.server-quorum-type: server
>>> cluster.data-self-heal-algorithm: full
>>> cluster.locking-scheme: granular
>>> cluster.shd-max-threads: 8
>>> cluster.shd-wait-qlength: 10000
>>> features.shard: on
>>> user.cifs: off
>>> storage.owner-uid: 36
>>> storage.owner-gid: 36
>>> features.shard-block-size: 64MB
>>> performance.write-behind-window-size: 512MB
>>> performance.cache-size: 384MB
>>> cluster.brick-multiplex: on
>>>
>>> The volume status;
>>> root@lease-06 ~# gluster v status ovirt-backbone-2
>>> Status of volume: ovirt-backbone-2
>>> Gluster process                                            TCP Port  RDMA Port  Online  Pid
>>> ------------------------------------------------------------------------------
>>> Brick 10.32.9.7:/data/gfs/bricks/brick1/ovirt-backbone-2   49152     0          Y       7727
>>> Brick 10.32.9.3:/data/gfs/bricks/brick1/ovirt-backbone-2   49152     0          Y       12620
>>> Brick 10.32.9.4:/data/gfs/bricks/bricka/ovirt-backbone-2   49152     0          Y       8794
>>> Brick 10.32.9.8:/data0/gfs/bricks/brick1/ovirt-backbone-2  49161     0          Y       22333
>>> Brick 10.32.9.21:/data0/gfs/bricks/brick1/ovirt-backbone-2 49152     0          Y       15030
>>> Brick 10.32.9.5:/data/gfs/bricks/bricka/ovirt-backbone-2   49166     0          Y       24592
>>> Brick 10.32.9.9:/data0/gfs/bricks/brick1/ovirt-backbone-2  49153     0          Y       20148
>>> Brick 10.32.9.20:/data0/gfs/bricks/brick1/ovirt-backbone-2 49154     0          Y       15413
>>> Brick 10.32.9.6:/data/gfs/bricks/bricka/ovirt-backbone-2   49152     0          Y       43120
>>> Self-heal Daemon on localhost                              N/A       N/A        Y       44587
>>> Self-heal Daemon on 10.201.0.2                             N/A       N/A        Y       8401
>>> Self-heal Daemon on 10.201.0.5                             N/A       N/A        Y       11038
>>> Self-heal Daemon on 10.201.0.8                             N/A       N/A        Y       9513
>>> Self-heal Daemon on 10.32.9.4                              N/A       N/A        Y       23736
>>> Self-heal Daemon on 10.32.9.20                             N/A       N/A        Y       2738
>>> Self-heal Daemon on 10.32.9.3                              N/A       N/A        Y       25598
>>> Self-heal Daemon on 10.32.9.5                              N/A       N/A        Y       511
>>> Self-heal Daemon on 10.32.9.9                              N/A       N/A        Y       23357
>>> Self-heal Daemon on 10.32.9.8                              N/A       N/A        Y       15225
>>> Self-heal Daemon on 10.32.9.7                              N/A       N/A        Y       25781
>>> Self-heal Daemon on 10.32.9.21                             N/A       N/A        Y       5034
>>>
>>> Task Status of Volume ovirt-backbone-2
>>> ------------------------------------------------------------------------------
>>> Task   : Rebalance
>>> ID     : 6dfbac43-0125-4568-9ac3-a2c453faaa3d
>>> Status : completed
>>>
>>> gluster version is @3.12.15 and cluster.op-version=31202
>>>
>>> ========================
>>>
>>> It would be nice to know if it's possible to mark the files as not stale
>>> or if i should investigate other things?
>>> Or should we consider this volume lost?
>>> Also checking the code at; >>> https://github.com/gluster/glusterfs/blob/master/xlators/features/shard/src/shard.c >>> it seems the functions shifted quite some (line 1724 vs. 2243), so maybe >>> it's fixed in a future version? >>> Any thoughts are welcome. >>> >>> Thanks Olaf >>> >>> >>> >>> >>> >>> >>> >>> >>> -------------- next part -------------- An HTML attachment was scrubbed... URL: From nbalacha at redhat.com Wed Jan 2 13:20:02 2019 From: nbalacha at redhat.com (Nithya Balachandran) Date: Wed, 2 Jan 2019 18:50:02 +0530 Subject: [Gluster-users] [Stale file handle] in shard volume In-Reply-To: References: Message-ID: On Mon, 31 Dec 2018 at 01:27, Olaf Buitelaar wrote: > Dear All, > > till now a selected group of VM's still seem to produce new stale file's > and getting paused due to this. > I've not updated gluster recently, however i did change the op version > from 31200 to 31202 about a week before this issue arose. > Looking at the .shard directory, i've 100.000+ files sharing the same > characteristics as a stale file. which are found till now, > they all have the sticky bit set, e.g. file permissions; ---------T. are > 0kb in size, and have the trusted.glusterfs.dht.linkto attribute. > These are internal files used by gluster and do not necessarily mean they are stale. They "point" to data files which may be on different bricks (same name, gfid etc but no linkto xattr and no ----T permissions). > These files range from long a go (beginning of the year) till now. Which > makes me suspect this was laying dormant for some time now..and somehow > recently surfaced. > Checking other sub-volumes they contain also 0kb files in the .shard > directory, but don't have the sticky bit and the linkto attribute. > > Does anybody else experience this issue? Could this be a bug or an > environmental issue? > These are most likely valid files- please do not delete them without double-checking. 
Stale file handle errors show up when a file with a specified gfid is not found. You will need to debug the files for which you see this error by checking the bricks to see if they actually exist.

From olaf.buitelaar at gmail.com Wed Jan 2 15:25:46 2019
From: olaf.buitelaar at gmail.com (Olaf Buitelaar)
Date: Wed, 2 Jan 2019 16:25:46 +0100
Subject: [Gluster-users] [Stale file handle] in shard volume
In-Reply-To:
References:
Message-ID:

Hi Nithya,

Thank you for your reply.

the VM's using the gluster volumes keeps on getting paused/stopped on errors like these;
[2019-01-02 02:33:44.469132] E [MSGID: 133010] [shard.c:1724:shard_common_lookup_shards_cbk] 0-ovirt-kube-shard: Lookup on shard 101487 failed. Base file gfid = a38d64bc-a28b-4ee1-a0bb-f919e7a1022c [Stale file handle]
[2019-01-02 02:33:44.563288] E [MSGID: 133010] [shard.c:1724:shard_common_lookup_shards_cbk] 0-ovirt-kube-shard: Lookup on shard 101488 failed. Base file gfid = a38d64bc-a28b-4ee1-a0bb-f919e7a1022c [Stale file handle]

What i'm trying to find out, if i can purge all gluster volumes from all possible stale file handles (and hopefully find a method to prevent this in the future), so the VM's can start running stable again.
For this i need to know when the "shard_common_lookup_shards_cbk" function considers a file as stale.
The statement; "Stale file handle errors show up when a file with a specified gfid is not found." doesn't seem to cover it all, as i've shown in earlier mails the shard file and glusterfs/xx/xx/uuid file do both exist, and have the same inode.
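The per-brick check done by hand in the earlier mails (shard file under .shard, its trusted.gfid, the matching .glusterfs/xx/xx/uuid entry, and a shared inode) can be sketched as below. This is an illustration under assumptions, not the logic of shard_common_lookup_shards_cbk: the function names are made up, it assumes Linux, Python 3 and root on the brick, and it only reports whether one brick is internally consistent for a given shard.

```python
import os
import uuid


def gfid_backend_path(gfid_hex: str) -> str:
    """Map a trusted.gfid value (hex, optional 0x prefix) to the
    .glusterfs/xx/xx/uuid path relative to the brick root."""
    digits = gfid_hex[2:] if gfid_hex.lower().startswith("0x") else gfid_hex
    gfid = str(uuid.UUID(hex=digits))
    return f".glusterfs/{gfid[0:2]}/{gfid[2:4]}/{gfid}"


def shard_path(base_gfid: str, block: int) -> str:
    """Relative path of shard <block> of the base file, as named in the log."""
    return f".shard/{base_gfid}.{block}"


def check_shard_on_brick(brick_root: str, base_gfid: str, block: int) -> str:
    """Hypothetical helper: report what one brick holds for a shard."""
    shard = os.path.join(brick_root, shard_path(base_gfid, block))
    if not os.path.lexists(shard):
        return "shard file missing on this brick"
    # trusted.gfid is stored as 16 raw bytes; reading it needs root.
    raw_gfid = os.getxattr(shard, "trusted.gfid")
    backend = os.path.join(brick_root, gfid_backend_path(raw_gfid.hex()))
    if not os.path.lexists(backend):
        return "shard present, .glusterfs entry missing"
    if not os.path.samefile(shard, backend):
        return "shard and .glusterfs entry are different inodes"
    return "shard and .glusterfs entry share one inode"


if __name__ == "__main__":
    # gfid taken from the getfattr output on lease-05 quoted in this thread.
    print(gfid_backend_path("0x1f86b4328ec6424699aa48cc6d7b5da0"))
    print(shard_path("f28cabcb-d169-41fc-a633-9bef4c4a8e40", 51500))
```

Running the check against every brick and comparing the trusted.gfid values between bricks would show cases like the one above, where lease-04 held a different gfid for the same shard name than lease-05, lease-08 and lease-11.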
If the criteria i'm using aren't correct, could you please tell me which criteria i should use to determine if a file is stale or not?
These criteria are just based on observations i made while moving the stale files manually. After removing them i was able to start the VM again, until some time later it hung on another stale shard file, unfortunately.

Thanks Olaf

On Wed, 2 Jan 2019 at 14:20, Nithya Balachandran wrote:

>
>
> On Mon, 31 Dec 2018 at 01:27, Olaf Buitelaar
> wrote:
>
>> Dear All,
>>
>> till now a selected group of VM's still seem to produce new stale file's
>> and getting paused due to this.
>> I've not updated gluster recently, however i did change the op version
>> from 31200 to 31202 about a week before this issue arose.
>> Looking at the .shard directory, i've 100.000+ files sharing the same
>> characteristics as a stale file. which are found till now,
>> they all have the sticky bit set, e.g. file permissions; ---------T. are
>> 0kb in size, and have the trusted.glusterfs.dht.linkto attribute.
>>
>
> These are internal files used by gluster and do not necessarily mean they
> are stale. They "point" to data files which may be on different bricks
> (same name, gfid etc but no linkto xattr and no ----T permissions).
>
>
>> These files range from long a go (beginning of the year) till now. Which
>> makes me suspect this was laying dormant for some time now..and somehow
>> recently surfaced.
>> Checking other sub-volumes they contain also 0kb files in the .shard
>> directory, but don't have the sticky bit and the linkto attribute.
>>
>> Does anybody else experience this issue? Could this be a bug or an
>> environmental issue?
>>
> These are most likely valid files- please do not delete them without
> double-checking.
>
> Stale file handle errors show up when a file with a specified gfid is not
> found. You will need to debug the files for which you see this error by
> checking the bricks to see if they actually exist.
> >> >> Also i wonder if there is any tool or gluster command to clean all stale >> file handles? >> Otherwise i'm planning to make a simple bash script, which iterates over >> the .shard dir, checks each file for the above mentioned criteria, and >> (re)moves the file and the corresponding .glusterfs file. >> If there are other criteria needed to identify a stale file handle, i >> would like to hear that. >> If this is a viable and safe operation to do of course. >> >> Thanks Olaf >> >> >> >> Op do 20 dec. 2018 om 13:43 schreef Olaf Buitelaar < >> olaf.buitelaar at gmail.com>: >> >>> Dear All, >>> >>> I figured it out, it appeared to be the exact same issue as described >>> here; >>> https://lists.gluster.org/pipermail/gluster-users/2018-March/033785.html >>> Another subvolume also had the shard file, only were all 0 bytes and had >>> the dht.linkto >>> >>> for reference; >>> [root at lease-04 ovirt-backbone-2]# getfattr -d -m . -e hex >>> .shard/f28cabcb-d169-41fc-a633-9bef4c4a8e40.51500 >>> # file: .shard/f28cabcb-d169-41fc-a633-9bef4c4a8e40.51500 >>> >>> security.selinux=0x73797374656d5f753a6f626a6563745f723a6574635f72756e74696d655f743a733000 >>> trusted.gfid=0x298147e49f9748b2baf1c8fff897244d >>> >>> trusted.gfid2path.b48064c78d7a85c9=0x62653331383633382d653861302d346336642d393737642d3761393337616138343830362f66323863616263622d643136392d343166632d613633332d3962656634633461386534302e3531353030 >>> >>> trusted.glusterfs.dht.linkto=0x6f766972742d6261636b626f6e652d322d7265706c69636174652d3100 >>> >>> [root at lease-04 ovirt-backbone-2]# getfattr -d -m . 
-e hex >>> .glusterfs/29/81/298147e4-9f97-48b2-baf1-c8fff897244d >>> # file: .glusterfs/29/81/298147e4-9f97-48b2-baf1-c8fff897244d >>> >>> security.selinux=0x73797374656d5f753a6f626a6563745f723a6574635f72756e74696d655f743a733000 >>> trusted.gfid=0x298147e49f9748b2baf1c8fff897244d >>> >>> trusted.gfid2path.b48064c78d7a85c9=0x62653331383633382d653861302d346336642d393737642d3761393337616138343830362f66323863616263622d643136392d343166632d613633332d3962656634633461386534302e3531353030 >>> >>> trusted.glusterfs.dht.linkto=0x6f766972742d6261636b626f6e652d322d7265706c69636174652d3100 >>> >>> [root at lease-04 ovirt-backbone-2]# stat >>> .glusterfs/29/81/298147e4-9f97-48b2-baf1-c8fff897244d >>> File: ?.glusterfs/29/81/298147e4-9f97-48b2-baf1-c8fff897244d? >>> Size: 0 Blocks: 0 IO Block: 4096 regular >>> empty file >>> Device: fd01h/64769d Inode: 1918631406 Links: 2 >>> Access: (1000/---------T) Uid: ( 0/ root) Gid: ( 0/ root) >>> Context: system_u:object_r:etc_runtime_t:s0 >>> Access: 2018-12-17 21:43:36.405735296 +0000 >>> Modify: 2018-12-17 21:43:36.405735296 +0000 >>> Change: 2018-12-17 21:43:36.405735296 +0000 >>> Birth: - >>> >>> removing the shard file and glusterfs file from each node resolved the >>> issue. >>> >>> I also found this thread; >>> https://lists.gluster.org/pipermail/gluster-users/2018-December/035460.html >>> Maybe he suffers from the same issue. >>> >>> Best Olaf >>> >>> >>> Op wo 19 dec. 2018 om 21:56 schreef Olaf Buitelaar < >>> olaf.buitelaar at gmail.com>: >>> >>>> Dear All, >>>> >>>> It appears i've a stale file in one of the volumes, on 2 files. These >>>> files are qemu images (1 raw and 1 qcow2). >>>> I'll just focus on 1 file since the situation on the other seems the >>>> same. >>>> >>>> The VM get's paused more or less directly after being booted with error; >>>> [2018-12-18 14:05:05.275713] E [MSGID: 133010] >>>> [shard.c:1724:shard_common_lookup_shards_cbk] 0-ovirt-backbone-2-shard: >>>> Lookup on shard 51500 failed. 
Base file gfid = >>>> f28cabcb-d169-41fc-a633-9bef4c4a8e40 [Stale file handle] >>>> >>>> investigating the shard; >>>> >>>> #on the arbiter node: >>>> >>>> [root at lease-05 ovirt-backbone-2]# getfattr -n glusterfs.gfid.string >>>> /mnt/ovirt-backbone-2/b1c2c949-aef4-4aec-999b-b179efeef732/images/f6ac9660-a84e-469e-a17c-c6dbc538af4b/d6b09501-5b79-4c92-bf10-2ed3bda1b425 >>>> getfattr: Removing leading '/' from absolute path names >>>> # file: >>>> mnt/ovirt-backbone-2/b1c2c949-aef4-4aec-999b-b179efeef732/images/f6ac9660-a84e-469e-a17c-c6dbc538af4b/d6b09501-5b79-4c92-bf10-2ed3bda1b425 >>>> glusterfs.gfid.string="f28cabcb-d169-41fc-a633-9bef4c4a8e40" >>>> >>>> [root at lease-05 ovirt-backbone-2]# getfattr -d -m . -e hex >>>> .shard/f28cabcb-d169-41fc-a633-9bef4c4a8e40.51500 >>>> # file: .shard/f28cabcb-d169-41fc-a633-9bef4c4a8e40.51500 >>>> >>>> security.selinux=0x73797374656d5f753a6f626a6563745f723a6574635f72756e74696d655f743a733000 >>>> trusted.afr.dirty=0x000000000000000000000000 >>>> trusted.gfid=0x1f86b4328ec6424699aa48cc6d7b5da0 >>>> >>>> trusted.gfid2path.b48064c78d7a85c9=0x62653331383633382d653861302d346336642d393737642d3761393337616138343830362f66323863616263622d643136392d343166632d613633332d3962656634633461386534302e3531353030 >>>> >>>> [root at lease-05 ovirt-backbone-2]# getfattr -d -m . 
-e hex >>>> .glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0 >>>> # file: .glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0 >>>> >>>> security.selinux=0x73797374656d5f753a6f626a6563745f723a6574635f72756e74696d655f743a733000 >>>> trusted.afr.dirty=0x000000000000000000000000 >>>> trusted.gfid=0x1f86b4328ec6424699aa48cc6d7b5da0 >>>> >>>> trusted.gfid2path.b48064c78d7a85c9=0x62653331383633382d653861302d346336642d393737642d3761393337616138343830362f66323863616263622d643136392d343166632d613633332d3962656634633461386534302e3531353030 >>>> >>>> [root at lease-05 ovirt-backbone-2]# stat >>>> .glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0 >>>> File: ?.glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0? >>>> Size: 0 Blocks: 0 IO Block: 4096 regular >>>> empty file >>>> Device: fd01h/64769d Inode: 537277306 Links: 2 >>>> Access: (0660/-rw-rw----) Uid: ( 0/ root) Gid: ( 0/ root) >>>> Context: system_u:object_r:etc_runtime_t:s0 >>>> Access: 2018-12-17 21:43:36.361984810 +0000 >>>> Modify: 2018-12-17 21:43:36.361984810 +0000 >>>> Change: 2018-12-18 20:55:29.908647417 +0000 >>>> Birth: - >>>> >>>> [root at lease-05 ovirt-backbone-2]# find . -inum 537277306 >>>> ./.glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0 >>>> ./.shard/f28cabcb-d169-41fc-a633-9bef4c4a8e40.51500 >>>> >>>> #on the data nodes: >>>> >>>> [root at lease-08 ~]# getfattr -n glusterfs.gfid.string >>>> /mnt/ovirt-backbone-2/b1c2c949-aef4-4aec-999b-b179efeef732/images/f6ac9660-a84e-469e-a17c-c6dbc538af4b/d6b09501-5b79-4c92-bf10-2ed3bda1b425 >>>> getfattr: Removing leading '/' from absolute path names >>>> # file: >>>> mnt/ovirt-backbone-2/b1c2c949-aef4-4aec-999b-b179efeef732/images/f6ac9660-a84e-469e-a17c-c6dbc538af4b/d6b09501-5b79-4c92-bf10-2ed3bda1b425 >>>> glusterfs.gfid.string="f28cabcb-d169-41fc-a633-9bef4c4a8e40" >>>> >>>> [root at lease-08 ovirt-backbone-2]# getfattr -d -m . 
-e hex >>>> .shard/f28cabcb-d169-41fc-a633-9bef4c4a8e40.51500 >>>> # file: .shard/f28cabcb-d169-41fc-a633-9bef4c4a8e40.51500 >>>> >>>> security.selinux=0x73797374656d5f753a6f626a6563745f723a6574635f72756e74696d655f743a733000 >>>> trusted.afr.dirty=0x000000000000000000000000 >>>> trusted.gfid=0x1f86b4328ec6424699aa48cc6d7b5da0 >>>> >>>> trusted.gfid2path.b48064c78d7a85c9=0x62653331383633382d653861302d346336642d393737642d3761393337616138343830362f66323863616263622d643136392d343166632d613633332d3962656634633461386534302e3531353030 >>>> >>>> [root at lease-08 ovirt-backbone-2]# getfattr -d -m . -e hex >>>> .glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0 >>>> # file: .glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0 >>>> >>>> security.selinux=0x73797374656d5f753a6f626a6563745f723a6574635f72756e74696d655f743a733000 >>>> trusted.afr.dirty=0x000000000000000000000000 >>>> trusted.gfid=0x1f86b4328ec6424699aa48cc6d7b5da0 >>>> >>>> trusted.gfid2path.b48064c78d7a85c9=0x62653331383633382d653861302d346336642d393737642d3761393337616138343830362f66323863616263622d643136392d343166632d613633332d3962656634633461386534302e3531353030 >>>> >>>> [root at lease-08 ovirt-backbone-2]# stat >>>> .glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0 >>>> File: ?.glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0? >>>> Size: 2166784 Blocks: 4128 IO Block: 4096 regular file >>>> Device: fd03h/64771d Inode: 12893624759 Links: 3 >>>> Access: (0660/-rw-rw----) Uid: ( 0/ root) Gid: ( 0/ root) >>>> Context: system_u:object_r:etc_runtime_t:s0 >>>> Access: 2018-12-18 18:52:38.070776585 +0000 >>>> Modify: 2018-12-17 21:43:36.388054443 +0000 >>>> Change: 2018-12-18 21:01:47.810506528 +0000 >>>> Birth: - >>>> >>>> [root at lease-08 ovirt-backbone-2]# find . 
-inum 12893624759 >>>> ./.glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0 >>>> ./.shard/f28cabcb-d169-41fc-a633-9bef4c4a8e40.51500 >>>> >>>> ======================== >>>> >>>> [root at lease-11 ovirt-backbone-2]# getfattr -n glusterfs.gfid.string >>>> /mnt/ovirt-backbone-2/b1c2c949-aef4-4aec-999b-b179efeef732/images/f6ac9660-a84e-469e-a17c-c6dbc538af4b/d6b09501-5b79-4c92-bf10-2ed3bda1b425 >>>> getfattr: Removing leading '/' from absolute path names >>>> # file: >>>> mnt/ovirt-backbone-2/b1c2c949-aef4-4aec-999b-b179efeef732/images/f6ac9660-a84e-469e-a17c-c6dbc538af4b/d6b09501-5b79-4c92-bf10-2ed3bda1b425 >>>> glusterfs.gfid.string="f28cabcb-d169-41fc-a633-9bef4c4a8e40" >>>> >>>> [root at lease-11 ovirt-backbone-2]# getfattr -d -m . -e hex >>>> .shard/f28cabcb-d169-41fc-a633-9bef4c4a8e40.51500 >>>> # file: .shard/f28cabcb-d169-41fc-a633-9bef4c4a8e40.51500 >>>> >>>> security.selinux=0x73797374656d5f753a6f626a6563745f723a6574635f72756e74696d655f743a733000 >>>> trusted.afr.dirty=0x000000000000000000000000 >>>> trusted.gfid=0x1f86b4328ec6424699aa48cc6d7b5da0 >>>> >>>> trusted.gfid2path.b48064c78d7a85c9=0x62653331383633382d653861302d346336642d393737642d3761393337616138343830362f66323863616263622d643136392d343166632d613633332d3962656634633461386534302e3531353030 >>>> >>>> [root at lease-11 ovirt-backbone-2]# getfattr -d -m . 
-e hex >>>> .glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0 >>>> # file: .glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0 >>>> >>>> security.selinux=0x73797374656d5f753a6f626a6563745f723a6574635f72756e74696d655f743a733000 >>>> trusted.afr.dirty=0x000000000000000000000000 >>>> trusted.gfid=0x1f86b4328ec6424699aa48cc6d7b5da0 >>>> >>>> trusted.gfid2path.b48064c78d7a85c9=0x62653331383633382d653861302d346336642d393737642d3761393337616138343830362f66323863616263622d643136392d343166632d613633332d3962656634633461386534302e3531353030 >>>> >>>> [root at lease-11 ovirt-backbone-2]# stat >>>> .glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0 >>>> File: ?.glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0? >>>> Size: 2166784 Blocks: 4128 IO Block: 4096 regular file >>>> Device: fd03h/64771d Inode: 12956094809 Links: 3 >>>> Access: (0660/-rw-rw----) Uid: ( 0/ root) Gid: ( 0/ root) >>>> Context: system_u:object_r:etc_runtime_t:s0 >>>> Access: 2018-12-18 20:11:53.595208449 +0000 >>>> Modify: 2018-12-17 21:43:36.391580259 +0000 >>>> Change: 2018-12-18 19:19:25.888055392 +0000 >>>> Birth: - >>>> >>>> [root at lease-11 ovirt-backbone-2]# find . -inum 12956094809 >>>> ./.glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0 >>>> ./.shard/f28cabcb-d169-41fc-a633-9bef4c4a8e40.51500 >>>> >>>> ================ >>>> >>>> I don't really see any inconsistencies, except the dates on the stat. >>>> However this is only after i tried moving the file out of the volumes to >>>> force a heal, which does happen on the data nodes, but not on the arbiter >>>> node. Before that they were also the same. >>>> I've also compared the file >>>> ./.shard/f28cabcb-d169-41fc-a633-9bef4c4a8e40.51500 on the 2 nodes and they >>>> are exactly the same. 
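For reference, the relation between the trusted.gfid xattr and the .glusterfs hardlink inspected above can be sketched in shell. This is only a minimal sketch and not part of the original report: the function name gfid_to_path is made up for illustration, and it assumes nothing beyond the layout visible in the output above (the first two hex bytes become two directory levels, followed by the dashed UUID).

```shell
# Hypothetical helper: turn a gfid (hex xattr value or dashed UUID)
# into the brick-relative .glusterfs path seen in the output above.
gfid_to_path() {
    # Drop an optional 0x prefix and any dashes, then lower-case.
    hex=$(printf '%s' "$1" | sed -e 's/^0x//' -e 's/-//g' | tr 'A-F' 'a-f')
    # Re-insert dashes in the 8-4-4-4-12 UUID layout.
    uuid=$(printf '%s' "$hex" |
        sed -E 's/^(.{8})(.{4})(.{4})(.{4})(.{12})$/\1-\2-\3-\4-\5/')
    # The first two bytes of the gfid name the two directory levels.
    d1=$(printf '%s' "$hex" | cut -c1-2)
    d2=$(printf '%s' "$hex" | cut -c3-4)
    printf '.glusterfs/%s/%s/%s\n' "$d1" "$d2" "$uuid"
}

gfid_to_path 0x1f86b4328ec6424699aa48cc6d7b5da0
# prints .glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0
```

Run from the brick root, the printed path can be passed straight to stat or getfattr, as was done by hand on each node above.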
>>>> >>>> Things I've further tried: >>>> - gluster v heal ovirt-backbone-2 full => gluster v heal >>>> ovirt-backbone-2 info reports 0 entries on all nodes >>>> >>>> - stop each glusterd and glusterfsd, pause around 40sec and start them >>>> again on each node, 1 at a time, waiting for the heal to recover before >>>> moving to the next node >>>> >>>> - force a heal by stopping glusterd on a node and performing these steps: >>>> mkdir /mnt/ovirt-backbone-2/trigger >>>> rmdir /mnt/ovirt-backbone-2/trigger >>>> setfattr -n trusted.non-existent-key -v abc /mnt/ovirt-backbone-2/ >>>> setfattr -x trusted.non-existent-key /mnt/ovirt-backbone-2/ >>>> start glusterd >>>> >>>> - gluster volume rebalance ovirt-backbone-2 start => success >>>> >>>> What's also interesting is that, according to the mount log, the volume >>>> is in split-brain: >>>> [2018-12-18 10:06:04.606870] E [MSGID: 108008] >>>> [afr-read-txn.c:90:afr_read_txn_refresh_done] >>>> 0-ovirt-backbone-2-replicate-2: Failing FSTAT on gfid >>>> 2a57d87d-fe49-4034-919b-fdb79531bf68: split-brain observed. [Input/output >>>> error] >>>> [2018-12-18 10:06:04.606908] E [MSGID: 133014] >>>> [shard.c:1248:shard_common_stat_cbk] 0-ovirt-backbone-2-shard: stat failed: >>>> 2a57d87d-fe49-4034-919b-fdb79531bf68 [Input/output error] >>>> [2018-12-18 10:06:04.606927] W [fuse-bridge.c:871:fuse_attr_cbk] >>>> 0-glusterfs-fuse: 428090: FSTAT() >>>> /b1c2c949-aef4-4aec-999b-b179efeef732/dom_md/ids => -1 (Input/output error) >>>> [2018-12-18 10:06:05.107729] E [MSGID: 108008] >>>> [afr-read-txn.c:90:afr_read_txn_refresh_done] >>>> 0-ovirt-backbone-2-replicate-2: Failing FSTAT on gfid >>>> 2a57d87d-fe49-4034-919b-fdb79531bf68: split-brain observed.
[Input/output >>>> error] >>>> [2018-12-18 10:06:05.107770] E [MSGID: 133014] >>>> [shard.c:1248:shard_common_stat_cbk] 0-ovirt-backbone-2-shard: stat failed: >>>> 2a57d87d-fe49-4034-919b-fdb79531bf68 [Input/output error] >>>> [2018-12-18 10:06:05.107791] W [fuse-bridge.c:871:fuse_attr_cbk] >>>> 0-glusterfs-fuse: 428091: FSTAT() >>>> /b1c2c949-aef4-4aec-999b-b179efeef732/dom_md/ids => -1 (Input/output error) >>>> [2018-12-18 10:06:05.537244] I [MSGID: 108006] >>>> [afr-common.c:5494:afr_local_init] 0-ovirt-backbone-2-replicate-1: no >>>> subvolumes up >>>> [2018-12-18 10:06:05.538523] E [MSGID: 108008] >>>> [afr-read-txn.c:90:afr_read_txn_refresh_done] >>>> 0-ovirt-backbone-2-replicate-2: Failing STAT on gfid >>>> 00000000-0000-0000-0000-000000000001: split-brain observed. [Input/output >>>> error] >>>> [2018-12-18 10:06:05.538685] I [MSGID: 108006] >>>> [afr-common.c:5494:afr_local_init] 0-ovirt-backbone-2-replicate-1: no >>>> subvolumes up >>>> [2018-12-18 10:06:05.538794] I [MSGID: 108006] >>>> [afr-common.c:5494:afr_local_init] 0-ovirt-backbone-2-replicate-1: no >>>> subvolumes up >>>> [2018-12-18 10:06:05.539342] I [MSGID: 109063] >>>> [dht-layout.c:716:dht_layout_normalize] 0-ovirt-backbone-2-dht: Found >>>> anomalies in /b1c2c949-aef4-4aec-999b-b179efeef732 (gfid = >>>> 8c8598ce-1a52-418e-a7b4-435fee34bae8). Holes=2 overlaps=0 >>>> [2018-12-18 10:06:05.539372] W [MSGID: 109005] >>>> [dht-selfheal.c:2158:dht_selfheal_directory] 0-ovirt-backbone-2-dht: >>>> Directory selfheal failed: 2 subvolumes down.Not fixing. 
path = >>>> /b1c2c949-aef4-4aec-999b-b179efeef732, gfid = >>>> 8c8598ce-1a52-418e-a7b4-435fee34bae8 >>>> [2018-12-18 10:06:05.539694] I [MSGID: 108006] >>>> [afr-common.c:5494:afr_local_init] 0-ovirt-backbone-2-replicate-1: no >>>> subvolumes up >>>> [2018-12-18 10:06:05.540652] I [MSGID: 108006] >>>> [afr-common.c:5494:afr_local_init] 0-ovirt-backbone-2-replicate-1: no >>>> subvolumes up >>>> [2018-12-18 10:06:05.608612] E [MSGID: 108008] >>>> [afr-read-txn.c:90:afr_read_txn_refresh_done] >>>> 0-ovirt-backbone-2-replicate-2: Failing FSTAT on gfid >>>> 2a57d87d-fe49-4034-919b-fdb79531bf68: split-brain observed. [Input/output >>>> error] >>>> [2018-12-18 10:06:05.608657] E [MSGID: 133014] >>>> [shard.c:1248:shard_common_stat_cbk] 0-ovirt-backbone-2-shard: stat failed: >>>> 2a57d87d-fe49-4034-919b-fdb79531bf68 [Input/output error] >>>> [2018-12-18 10:06:05.608672] W [fuse-bridge.c:871:fuse_attr_cbk] >>>> 0-glusterfs-fuse: 428096: FSTAT() >>>> /b1c2c949-aef4-4aec-999b-b179efeef732/dom_md/ids => -1 (Input/output error) >>>> [2018-12-18 10:06:06.109339] E [MSGID: 108008] >>>> [afr-read-txn.c:90:afr_read_txn_refresh_done] >>>> 0-ovirt-backbone-2-replicate-2: Failing FSTAT on gfid >>>> 2a57d87d-fe49-4034-919b-fdb79531bf68: split-brain observed. [Input/output >>>> error] >>>> [2018-12-18 10:06:06.109378] E [MSGID: 133014] >>>> [shard.c:1248:shard_common_stat_cbk] 0-ovirt-backbone-2-shard: stat failed: >>>> 2a57d87d-fe49-4034-919b-fdb79531bf68 [Input/output error] >>>> [2018-12-18 10:06:06.109399] W [fuse-bridge.c:871:fuse_attr_cbk] >>>> 0-glusterfs-fuse: 428097: FSTAT() >>>> /b1c2c949-aef4-4aec-999b-b179efeef732/dom_md/ids => -1 (Input/output error) >>>> >>>> #note i'm able to see ; >>>> /b1c2c949-aef4-4aec-999b-b179efeef732/dom_md/ids >>>> [root at lease-11 ovirt-backbone-2]# stat >>>> /mnt/ovirt-backbone-2/b1c2c949-aef4-4aec-999b-b179efeef732/dom_md/ids >>>> File: >>>> ?/mnt/ovirt-backbone-2/b1c2c949-aef4-4aec-999b-b179efeef732/dom_md/ids? 
>>>> Size: 1048576 Blocks: 2048 IO Block: 131072 regular file >>>> Device: 41h/65d Inode: 10492258721813610344 Links: 1 >>>> Access: (0660/-rw-rw----) Uid: ( 36/ vdsm) Gid: ( 36/ kvm) >>>> Context: system_u:object_r:fusefs_t:s0 >>>> Access: 2018-12-19 20:07:39.917573869 +0000 >>>> Modify: 2018-12-19 20:07:39.928573917 +0000 >>>> Change: 2018-12-19 20:07:39.929573921 +0000 >>>> Birth: - >>>> >>>> However, checking: gluster v heal ovirt-backbone-2 info split-brain >>>> reports no entries. >>>> >>>> I've also tried mounting the qemu image, and this works fine, I'm able >>>> to see all contents: >>>> losetup /dev/loop0 >>>> /mnt/ovirt-backbone-2/b1c2c949-aef4-4aec-999b-b179efeef732/images/f6ac9660-a84e-469e-a17c-c6dbc538af4b/d6b09501-5b79-4c92-bf10-2ed3bda1b425 >>>> kpartx -a /dev/loop0 >>>> vgscan >>>> vgchange -ay slave-data >>>> mkdir /mnt/slv01 >>>> mount /dev/mapper/slave--data-lvol0 /mnt/slv01/ >>>> >>>> Possible causes for this issue: >>>> 1. the machine "lease-11" suffered from a faulty RAM module (ECC), >>>> which halted the machine and caused an invalid state. (this machine also >>>> hosts other volumes, with similar configurations, which report no issue) >>>> 2. after the RAM module was replaced, the VM using the backing qemu >>>> image was restored from a backup (the backup was file based within the VM >>>> on a different directory). This is because some files were corrupted. The >>>> backup/recovery obviously causes extra IO, possibly introducing race >>>> conditions? The machine did run for about 12h without issues, and in total >>>> for about 36h. >>>> 3. since only the client (maybe only gfapi?) reports errors, something >>>> is broken there?
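The mount-log excerpt above repeats the same few gfids many times. A small sketch for tallying which gfids the client reports as split-brain, assuming the message format shown above with one entry per line, as in the log file itself rather than the mail-wrapped copy. The function name split_brain_gfids is made up for illustration, and any log path passed to it is an assumption about the local setup.

```shell
# Hypothetical helper: count how often each gfid appears in
# "split-brain observed" messages.
# Reads the files given as arguments, or stdin when none are given.
split_brain_gfids() {
    grep -oh 'gfid [0-9a-f-]\{36\}: split-brain observed' "$@" |
        awk '{ sub(/:$/, "", $2); print $2 }' |
        sort | uniq -c | sort -rn
}
```

The result is one line per gfid with its count, most frequent first. Each gfid can then be looked up on the bricks under .glusterfs and compared with what gluster v heal ovirt-backbone-2 info split-brain reports.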
>>>> >>>> The volume info; >>>> root at lease-06 ~# gluster v info ovirt-backbone-2 >>>> >>>> Volume Name: ovirt-backbone-2 >>>> Type: Distributed-Replicate >>>> Volume ID: 85702d35-62c8-4c8c-930d-46f455a8af28 >>>> Status: Started >>>> Snapshot Count: 0 >>>> Number of Bricks: 3 x (2 + 1) = 9 >>>> Transport-type: tcp >>>> Bricks: >>>> Brick1: 10.32.9.7:/data/gfs/bricks/brick1/ovirt-backbone-2 >>>> Brick2: 10.32.9.3:/data/gfs/bricks/brick1/ovirt-backbone-2 >>>> Brick3: 10.32.9.4:/data/gfs/bricks/bricka/ovirt-backbone-2 (arbiter) >>>> Brick4: 10.32.9.8:/data0/gfs/bricks/brick1/ovirt-backbone-2 >>>> Brick5: 10.32.9.21:/data0/gfs/bricks/brick1/ovirt-backbone-2 >>>> Brick6: 10.32.9.5:/data/gfs/bricks/bricka/ovirt-backbone-2 (arbiter) >>>> Brick7: 10.32.9.9:/data0/gfs/bricks/brick1/ovirt-backbone-2 >>>> Brick8: 10.32.9.20:/data0/gfs/bricks/brick1/ovirt-backbone-2 >>>> Brick9: 10.32.9.6:/data/gfs/bricks/bricka/ovirt-backbone-2 (arbiter) >>>> Options Reconfigured: >>>> nfs.disable: on >>>> transport.address-family: inet >>>> performance.quick-read: off >>>> performance.read-ahead: off >>>> performance.io-cache: off >>>> performance.low-prio-threads: 32 >>>> network.remote-dio: enable >>>> cluster.eager-lock: enable >>>> cluster.quorum-type: auto >>>> cluster.server-quorum-type: server >>>> cluster.data-self-heal-algorithm: full >>>> cluster.locking-scheme: granular >>>> cluster.shd-max-threads: 8 >>>> cluster.shd-wait-qlength: 10000 >>>> features.shard: on >>>> user.cifs: off >>>> storage.owner-uid: 36 >>>> storage.owner-gid: 36 >>>> features.shard-block-size: 64MB >>>> performance.write-behind-window-size: 512MB >>>> performance.cache-size: 384MB >>>> cluster.brick-multiplex: on >>>> >>>> The volume status; >>>> root at lease-06 ~# gluster v status ovirt-backbone-2 >>>> Status of volume: ovirt-backbone-2 >>>> Gluster process TCP Port RDMA Port >>>> Online Pid >>>> >>>> ------------------------------------------------------------------------------ >>>> Brick 
10.32.9.7:/data/gfs/bricks/brick1/ovi >>>> rt-backbone-2 49152 0 >>>> Y 7727 >>>> Brick 10.32.9.3:/data/gfs/bricks/brick1/ovi >>>> rt-backbone-2 49152 0 >>>> Y 12620 >>>> Brick 10.32.9.4:/data/gfs/bricks/bricka/ovi >>>> rt-backbone-2 49152 0 >>>> Y 8794 >>>> Brick 10.32.9.8:/data0/gfs/bricks/brick1/ov >>>> irt-backbone-2 49161 0 >>>> Y 22333 >>>> Brick 10.32.9.21:/data0/gfs/bricks/brick1/o >>>> virt-backbone-2 49152 0 >>>> Y 15030 >>>> Brick 10.32.9.5:/data/gfs/bricks/bricka/ovi >>>> rt-backbone-2 49166 0 >>>> Y 24592 >>>> Brick 10.32.9.9:/data0/gfs/bricks/brick1/ov >>>> irt-backbone-2 49153 0 >>>> Y 20148 >>>> Brick 10.32.9.20:/data0/gfs/bricks/brick1/o >>>> virt-backbone-2 49154 0 >>>> Y 15413 >>>> Brick 10.32.9.6:/data/gfs/bricks/bricka/ovi >>>> rt-backbone-2 49152 0 >>>> Y 43120 >>>> Self-heal Daemon on localhost N/A N/A >>>> Y 44587 >>>> Self-heal Daemon on 10.201.0.2 N/A N/A >>>> Y 8401 >>>> Self-heal Daemon on 10.201.0.5 N/A N/A >>>> Y 11038 >>>> Self-heal Daemon on 10.201.0.8 N/A N/A >>>> Y 9513 >>>> Self-heal Daemon on 10.32.9.4 N/A N/A >>>> Y 23736 >>>> Self-heal Daemon on 10.32.9.20 N/A N/A >>>> Y 2738 >>>> Self-heal Daemon on 10.32.9.3 N/A N/A >>>> Y 25598 >>>> Self-heal Daemon on 10.32.9.5 N/A N/A >>>> Y 511 >>>> Self-heal Daemon on 10.32.9.9 N/A N/A >>>> Y 23357 >>>> Self-heal Daemon on 10.32.9.8 N/A N/A >>>> Y 15225 >>>> Self-heal Daemon on 10.32.9.7 N/A N/A >>>> Y 25781 >>>> Self-heal Daemon on 10.32.9.21 N/A N/A >>>> Y 5034 >>>> >>>> Task Status of Volume ovirt-backbone-2 >>>> >>>> ------------------------------------------------------------------------------ >>>> Task : Rebalance >>>> ID : 6dfbac43-0125-4568-9ac3-a2c453faaa3d >>>> Status : completed >>>> >>>> gluster version is @3.12.15 and cluster.op-version=31202 >>>> >>>> ======================== >>>> >>>> It would be nice to know if it's possible to mark the files as not >>>> stale or if i should investigate other things? >>>> Or should we consider this volume lost? 
>>>> Also, checking the code at >>>> https://github.com/gluster/glusterfs/blob/master/xlators/features/shard/src/shard.c >>>> it seems the functions have shifted quite a bit (line 1724 vs. 2243), so maybe >>>> it's fixed in a future version? >>>> Any thoughts are welcome. >>>> >>>> Thanks Olaf >>>> From raphael at badfile.net Tue Jan 1 16:45:53 2019 From: raphael at badfile.net (Raphaël Yancey) Date: Tue, 1 Jan 2019 17:45:53 +0100 Subject: [Gluster-users] Multiple versions on the same machine, errors on glusterd startup Message-ID: <3aadefcc-56ba-b89f-478f-b41797c94177@badfile.net> Hi, I'm trying to host several GlusterFS versions on the same machine (3.8.8, 4.1.6 and 5.2), not to be run together of course. I built them with the following procedure (examples with 3.8.8): > git clone https://github.com/gluster/glusterfs . > git checkout v3.8.8 > ./autogen > ./configure --program-suffix="-3.8.8" > make > sudo make install > sudo cp -a extras/systemd/glusterd.service > /etc/systemd/system/glusterd-3.8.8.service > sudo systemctl load glusterd-3.8.8 I had to edit the service for it to execute the right version of glusterd: > ExecStart=/usr/local/sbin/glusterd*-3.8.8* -p /var/run/glusterd.pid > --log-level $LOG_LEVEL $GLUSTERD_OPTIONS And I had to create symlinks for glusterd: > cd /usr/local/sbin > ln -s glusterd-3.8.8 glusterfsd-3.8.8 I also ran ldconfig for good measure...
> sudo ldconfig When I run glusterd in the foreground (not even with systemd) I'm left with some errors and the process exits (errors emphasized): > user at host0:~/glusterfs-3.8.8 on e5f3a990c [!?]# sudo glusterd-3.8.8 > --debug > [2019-01-01 16:23:37.120684] I [MSGID: 100030] > [glusterfsd.c:2454:main] 0-glusterd-3.8.8: Started running > glusterd-3.8.8 version 3.8.8 (args: glusterd-3.8.8 --debug) > [2019-01-01 16:23:37.120765] D > [logging.c:1791:__gf_log_inject_timer_event] 0-logging-infra: Starting > timer now. Timeout = 120, current buf size = 5 > [2019-01-01 16:23:37.121187] D [MSGID: 0] [glusterfsd.c:660:get_volfp] > 0-glusterfsd: loading volume file /usr/local/etc/glusterfs/glusterd.vol > [2019-01-01 16:23:37.137003] I [MSGID: 106478] [glusterd.c:1379:init] > 0-management: Maximum allowed open file descriptors set to 65536 > [2019-01-01 16:23:37.137064] I [MSGID: 106479] [glusterd.c:1428:init] > 0-management: Using /var/lib/glusterd as working directory > [2019-01-01 16:23:37.137262] D [MSGID: 0] > [glusterd.c:406:glusterd_rpcsvc_options_build] 0-glusterd: > listen-backlog value: 128 > [2019-01-01 16:23:37.137683] D [rpcsvc.c:2316:rpcsvc_init] > 0-rpc-service: RPC service inited. 
> [2019-01-01 16:23:37.137723] D [rpcsvc.c:1866:rpcsvc_program_register] > 0-rpc-service: New program registered: GF-DUMP, Num: 123451501, Ver: > 1, Port: 0 > [2019-01-01 16:23:37.137798] D > [rpc-transport.c:283:rpc_transport_load] 0-rpc-transport: attempt to > load file /usr/local/lib/glusterfs/3.8.8/rpc-transport/socket.so > [2019-01-01 16:23:37.151778] D [socket.c:3938:socket_init] > 0-socket.management: Configued transport.tcp-user-timeout=0 > [2019-01-01 16:23:37.151823] D [socket.c:4021:socket_init] > 0-socket.management: SSL support on the I/O path is NOT enabled > [2019-01-01 16:23:37.151862] D [socket.c:4024:socket_init] > 0-socket.management: SSL support for glusterd is NOT enabled > [2019-01-01 16:23:37.151890] D [socket.c:4041:socket_init] > 0-socket.management: using system polling thread > [2019-01-01 16:23:37.151927] D [name.c:584:server_fill_address_family] > 0-socket.management: option address-family not specified, defaulting > to inet > [2019-01-01 16:23:37.152173] D > [rpc-transport.c:283:rpc_transport_load] 0-rpc-transport: attempt to > load file /usr/local/lib/glusterfs/3.8.8/rpc-transport/rdma.so > [2019-01-01 16:23:37.155510] D > [rpc-transport.c:321:rpc_transport_load] 0-rpc-transport: dlsym > (gf_rpc_transport_reconfigure) on > /usr/local/lib/glusterfs/3.8.8/rpc-transport/rdma.so: undefined > symbol: reconfigure > [2019-01-01 16:23:37.155830] W [MSGID: 103071] > [rdma.c:4589:__gf_rdma_ctx_create] 0-rpc-transport/rdma: rdma_cm event > channel creation failed [No such device] > [2019-01-01 16:23:37.155884] W [MSGID: 103055] [rdma.c:4896:init] > 0-rdma.management: Failed to initialize IB Device > [2019-01-01 16:23:37.155920] W > [rpc-transport.c:354:rpc_transport_load] 0-rpc-transport: 'rdma' > initialization failed > [2019-01-01 16:23:37.156224] W [rpcsvc.c:1638:rpcsvc_create_listener] > 0-rpc-service: cannot create listener, initing the transport failed > *[2019-01-01 16:23:37.156258] E [MSGID: 106243] [glusterd.c:1652:init] > 0-management: 
creation of 1 listeners failed, continuing with > succeeded transport* > [2019-01-01 16:23:37.156300] D [rpcsvc.c:1866:rpcsvc_program_register] > 0-rpc-service: New program registered: GlusterD svc peer, Num: > 1238437, Ver: 2, Port: 0 > [2019-01-01 16:23:37.156332] D [rpcsvc.c:1866:rpcsvc_program_register] > 0-rpc-service: New program registered: GlusterD svc cli read-only, > Num: 1238463, Ver: 2, Port: 0 > [2019-01-01 16:23:37.156356] D [rpcsvc.c:1866:rpcsvc_program_register] > 0-rpc-service: New program registered: GlusterD svc mgmt, Num: > 1238433, Ver: 2, Port: 0 > [2019-01-01 16:23:37.156384] D [rpcsvc.c:1866:rpcsvc_program_register] > 0-rpc-service: New program registered: GlusterD svc mgmt v3, Num: > 1238433, Ver: 3, Port: 0 > [2019-01-01 16:23:37.156414] D [rpcsvc.c:1866:rpcsvc_program_register] > 0-rpc-service: New program registered: Gluster Portmap, Num: 34123456, > Ver: 1, Port: 0 > [2019-01-01 16:23:37.156438] D [rpcsvc.c:1866:rpcsvc_program_register] > 0-rpc-service: New program registered: Gluster Handshake, Num: > 14398633, Ver: 2, Port: 0 > [2019-01-01 16:23:37.156468] D [rpcsvc.c:1866:rpcsvc_program_register] > 0-rpc-service: New program registered: Gluster MGMT Handshake, Num: > 1239873, Ver: 1, Port: 0 > [2019-01-01 16:23:37.156591] D [rpcsvc.c:2316:rpcsvc_init] > 0-rpc-service: RPC service inited. 
> [2019-01-01 16:23:37.156619] D [rpcsvc.c:1866:rpcsvc_program_register] > 0-rpc-service: New program registered: GF-DUMP, Num: 123451501, Ver: > 1, Port: 0 > [2019-01-01 16:23:37.156665] D > [rpc-transport.c:283:rpc_transport_load] 0-rpc-transport: attempt to > load file /usr/local/lib/glusterfs/3.8.8/rpc-transport/socket.so > [2019-01-01 16:23:37.156854] D [socket.c:3887:socket_init] > 0-socket.management: disabling nodelay > [2019-01-01 16:23:37.156882] D [socket.c:3938:socket_init] > 0-socket.management: Configued transport.tcp-user-timeout=0 > [2019-01-01 16:23:37.156912] D [socket.c:4021:socket_init] > 0-socket.management: SSL support on the I/O path is NOT enabled > [2019-01-01 16:23:37.156933] D [socket.c:4024:socket_init] > 0-socket.management: SSL support for glusterd is NOT enabled > [2019-01-01 16:23:37.156961] D [socket.c:4041:socket_init] > 0-socket.management: using system polling thread > [2019-01-01 16:23:37.157095] D [rpcsvc.c:1866:rpcsvc_program_register] > 0-rpc-service: New program registered: GlusterD svc cli, Num: 1238463, > Ver: 2, Port: 0 > [2019-01-01 16:23:37.157125] D [rpcsvc.c:1866:rpcsvc_program_register] > 0-rpc-service: New program registered: Gluster Handshake (CLI > Getspec), Num: 14398633, Ver: 2, Port: 0 > [2019-01-01 16:23:37.157282] D [MSGID: 0] > [glusterd-utils.c:6379:glusterd_sm_tr_log_init] 0-glusterd: returning 0 > [2019-01-01 16:23:37.157318] D [MSGID: 0] [glusterd.c:1720:init] > 0-management: cannot get run-with-valgrind value > *[2019-01-01 16:23:37.170633] E [MSGID: 106229] > [glusterd.c:455:glusterd_check_gsync_present] 0-glusterd: > geo-replication module not working as desired* > [2019-01-01 16:23:37.171476] D [MSGID: 0] > [glusterd.c:465:glusterd_check_gsync_present] 0-glusterd: Returning -1 > *[2019-01-01 16:23:37.171572] E [MSGID: 101019] > [xlator.c:433:xlator_init] 0-management: Initialization of volume > 'management' failed, review your volfile again* > *[2019-01-01 16:23:37.171613] E [MSGID: 101066] > 
[graph.c:324:glusterfs_graph_init] 0-management: initializing > translator failed* > *[2019-01-01 16:23:37.171649] E [MSGID: 101176] > [graph.c:673:glusterfs_graph_activate] 0-graph: init failed* > [2019-01-01 16:23:37.173130] D > [logging.c:1765:gf_log_flush_extra_msgs] 0-logging-infra: Log buffer > size reduced. About to flush 5 extra log messages > [2019-01-01 16:23:37.173187] D > [logging.c:1768:gf_log_flush_extra_msgs] 0-logging-infra: Just flushed > 5 extra log messages > [2019-01-01 16:23:37.173280] W [MSGID: 100032] > [glusterfsd.c:1327:cleanup_and_exit] 0-: received signum (1), shutting > down > [2019-01-01 16:23:37.173335] D > [glusterfsd-mgmt.c:2385:glusterfs_mgmt_pmap_signout] 0-fsd-mgmt: > portmapper signout arguments not given I can't pin down which particular error caused the process to exit, and I failed to find any information on the 'management' volume failure. Any hints? Thanks, Raphael. From isakdim at gmail.com Wed Jan 2 16:28:42 2019 From: isakdim at gmail.com (Dmitry Isakbayev) Date: Wed, 2 Jan 2019 11:28:42 -0500 Subject: [Gluster-users] java application crushes while reading a zip file In-Reply-To: References: Message-ID: Still no JVM crashes. Is it possible that running glusterfs with performance options turned off for a couple of days cleared out the "stale metadata issue"? On Mon, Dec 31, 2018 at 1:38 PM Dmitry Isakbayev wrote: > The software ran with all of the options turned off over the weekend > without any problems. > I will try to collect the debug info for you. I have re-enabled the > three options, but have yet to see the problem recur. > > > On Sat, Dec 29, 2018 at 6:46 PM Raghavendra Gowdappa > wrote: > >> Thanks Dmitry. Can you provide the following debug info I asked for earlier: >> >> * strace -ff -v ... of java application >> * dump of the I/O traffic seen by the mountpoint (use --dump-fuse while >> mounting).
>> >> regards, >> Raghavendra >> >> On Sat, Dec 29, 2018 at 2:08 AM Dmitry Isakbayev >> wrote: >> >>> These 3 options seem to trigger both (reading zip file and renaming >>> files) problems. >>> >>> Options Reconfigured: >>> performance.io-cache: off >>> performance.stat-prefetch: off >>> performance.quick-read: off >>> performance.parallel-readdir: off >>> *performance.readdir-ahead: on* >>> *performance.write-behind: on* >>> *performance.read-ahead: on* >>> performance.client-io-threads: off >>> nfs.disable: on >>> transport.address-family: inet >>> >>> >>> On Fri, Dec 28, 2018 at 10:24 AM Dmitry Isakbayev >>> wrote: >>> >>>> Turning a single option on at a time still worked fine. I will keep >>>> trying. >>>> >>>> We had used 4.1.5 on KVM/CentOS7.5 at AWS without these issues or log >>>> messages. Do you suppose these issues are triggered by the new environment >>>> or did not exist in 4.1.5? >>>> >>>> [root at node1 ~]# glusterfs --version >>>> glusterfs 4.1.5 >>>> >>>> On AWS using >>>> [root at node1 ~]# hostnamectl >>>> Static hostname: node1 >>>> Icon name: computer-vm >>>> Chassis: vm >>>> Machine ID: b30d0f2110ac3807b210c19ede3ce88f >>>> Boot ID: 52bb159a0aa94043a40e7c7651967bd9 >>>> Virtualization: kvm >>>> Operating System: CentOS Linux 7 (Core) >>>> CPE OS Name: cpe:/o:centos:centos:7 >>>> Kernel: Linux 3.10.0-862.3.2.el7.x86_64 >>>> Architecture: x86-64 >>>> >>>> >>>> >>>> >>>> On Fri, Dec 28, 2018 at 8:56 AM Raghavendra Gowdappa < >>>> rgowdapp at redhat.com> wrote: >>>> >>>>> >>>>> >>>>> On Fri, Dec 28, 2018 at 7:23 PM Dmitry Isakbayev >>>>> wrote: >>>>> >>>>>> Ok. I will try different options. >>>>>> >>>>>> This system is scheduled to go into production soon. What version >>>>>> would you recommend to roll back to? >>>>>> >>>>> >>>>> These are long standing issues. So, rolling back may not make these >>>>> issues go away. Instead if you think performance is agreeable to you, >>>>> please keep these xlators off in production. 
>>>>> >>>>> >>>>>> On Thu, Dec 27, 2018 at 10:55 PM Raghavendra Gowdappa < >>>>>> rgowdapp at redhat.com> wrote: >>>>>> >>>>>>> >>>>>>> >>>>>>> On Fri, Dec 28, 2018 at 3:13 AM Dmitry Isakbayev >>>>>>> wrote: >>>>>>> >>>>>>>> Raghavendra, >>>>>>>> >>>>>>>> Thanks for the suggestion. >>>>>>>> >>>>>>>> >>>>>>>> I am using >>>>>>>> >>>>>>>> [root at jl-fanexoss1p glusterfs]# gluster --version >>>>>>>> glusterfs 5.0 >>>>>>>> >>>>>>>> On >>>>>>>> [root at jl-fanexoss1p glusterfs]# hostnamectl >>>>>>>> Icon name: computer-vm >>>>>>>> Chassis: vm >>>>>>>> Machine ID: e44b8478ef7a467d98363614f4e50535 >>>>>>>> Boot ID: eed98992fdda4c88bdd459a89101766b >>>>>>>> Virtualization: vmware >>>>>>>> Operating System: Red Hat Enterprise Linux Server 7.5 (Maipo) >>>>>>>> CPE OS Name: cpe:/o:redhat:enterprise_linux:7.5:GA:server >>>>>>>> Kernel: Linux 3.10.0-862.14.4.el7.x86_64 >>>>>>>> Architecture: x86-64 >>>>>>>> >>>>>>>> >>>>>>>> I have configured the following options >>>>>>>> >>>>>>>> [root at jl-fanexoss1p glusterfs]# gluster volume info >>>>>>>> Volume Name: gv0 >>>>>>>> Type: Replicate >>>>>>>> Volume ID: 5ffbda09-c5e2-4abc-b89e-79b5d8a40824 >>>>>>>> Status: Started >>>>>>>> Snapshot Count: 0 >>>>>>>> Number of Bricks: 1 x 3 = 3 >>>>>>>> Transport-type: tcp >>>>>>>> Bricks: >>>>>>>> Brick1: jl-fanexoss1p.cspire.net:/data/brick1/gv0 >>>>>>>> Brick2: sl-fanexoss2p.cspire.net:/data/brick1/gv0 >>>>>>>> Brick3: nxquorum1p.cspire.net:/data/brick1/gv0 >>>>>>>> Options Reconfigured: >>>>>>>> performance.io-cache: off >>>>>>>> performance.stat-prefetch: off >>>>>>>> performance.quick-read: off >>>>>>>> performance.parallel-readdir: off >>>>>>>> performance.readdir-ahead: off >>>>>>>> performance.write-behind: off >>>>>>>> performance.read-ahead: off >>>>>>>> performance.client-io-threads: off >>>>>>>> nfs.disable: on >>>>>>>> transport.address-family: inet >>>>>>>> >>>>>>>> I don't know if it is related, but I am seeing a lot of >>>>>>>> [2018-12-27 20:19:23.776080] W [MSGID: 114031]
>>>>>>>> [client-rpc-fops_v2.c:1932:client4_0_seek_cbk] 2-gv0-client-0: remote >>>>>>>> operation failed [No such device or address] >>>>>>>> [2018-12-27 20:19:47.735190] E [MSGID: 101191] >>>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch >>>>>>>> handler >>>>>>>> >>>>>>> >>>>>>> These msgs were introduced by patch [1]. To the best of my knowledge >>>>>>> they are benign. We'll be sending a patch to fix these msgs though. >>>>>>> >>>>>>> +Mohit Agrawal +Milind Changire: can you try to identify why we are seeing >>>>>>> these messages? If possible please send a patch to fix this. >>>>>>> >>>>>>> [1] >>>>>>> https://review.gluster.org/r/I578c3fc67713f4234bd3abbec5d3fbba19059ea5 >>>>>>> >>>>>>> >>>>>>>> And java.io exceptions trying to rename files. >>>>>>>> >>>>>>> >>>>>>> When you see the errors, is it possible to collect: >>>>>>> * strace of the java application (strace -ff -v ...) >>>>>>> * fuse-dump of the glusterfs mount (use option --dump-fuse while >>>>>>> mounting)? >>>>>>> >>>>>>> I also need another favour from you. By trial and error, can you >>>>>>> point out which of the many performance xlators you've turned off is >>>>>>> causing the issue? >>>>>>> >>>>>>> The above two data-points will help us to fix the problem. >>>>>>> >>>>>>> >>>>>>>> Thank You, >>>>>>>> Dmitry >>>>>>>> >>>>>>>> >>>>>>>> On Thu, Dec 27, 2018 at 3:48 PM Raghavendra Gowdappa < >>>>>>>> rgowdapp at redhat.com> wrote: >>>>>>>> >>>>>>>>> What version of glusterfs are you using? It might be either >>>>>>>>> * a stale metadata issue. >>>>>>>>> * inconsistent ctime issue. >>>>>>>>> >>>>>>>>> Can you try turning off all performance xlators? If it is the first of >>>>>>>>> these (stale metadata), that should help.
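Turning the performance xlators off, as suggested above, comes down to one gluster volume set call per option. A minimal dry-run sketch, assuming the volume name gv0 and the eight performance.* options from the "Options Reconfigured" list in this thread; the function name perf_xlator_cmds is made up for illustration, and it only prints the commands rather than running them.

```shell
# Hypothetical helper: print (do not run) the commands that switch the
# performance xlators listed in this thread off or on for a volume.
perf_xlator_cmds() {
    vol=$1
    state=$2   # "off" or "on"
    for opt in io-cache stat-prefetch quick-read parallel-readdir \
               readdir-ahead write-behind read-ahead client-io-threads; do
        printf 'gluster volume set %s performance.%s %s\n' \
            "$vol" "$opt" "$state"
    done
}

perf_xlator_cmds gv0 off
```

Printing first makes it easy to review the list, or to re-enable options one at a time when narrowing down which xlator triggers the problem. Once reviewed, the output can be piped to sh on a server node.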
>>>>>>>>> >>>>>>>>> On Fri, Dec 28, 2018 at 1:51 AM Dmitry Isakbayev < >>>>>>>>> isakdim at gmail.com> wrote: >>>>>>>>> >>>>>>>>>> Attempted to set 'performance.read-ahead off' according to >>>>>>>>>> https://jira.apache.org/jira/browse/AMQ-7041 >>>>>>>>>> That did not help. >>>>>>>>>> >>>>>>>>>> On Mon, Dec 24, 2018 at 2:11 PM Dmitry Isakbayev < >>>>>>>>>> isakdim at gmail.com> wrote: >>>>>>>>>> >>>>>>>>>>> The core file generated by JVM suggests that it happens because >>>>>>>>>>> the file is changing while it is being read - >>>>>>>>>>> https://bugs.java.com/bugdatabase/view_bug.do?bug_id=8186557. >>>>>>>>>>> The application reads in the zipfile and goes through the zip >>>>>>>>>>> entries, then reloads the file and goes through the zip entries again. It does so >>>>>>>>>>> 3 times. The application never crashes on the 1st cycle but sometimes >>>>>>>>>>> crashes on the 2nd or 3rd cycle. >>>>>>>>>>> The zip file is generated about 20 seconds prior to it being >>>>>>>>>>> used and is not updated or even used by any other application. I have >>>>>>>>>>> never seen this problem on a plain file system. >>>>>>>>>>> >>>>>>>>>>> I would appreciate any suggestions on how to go about debugging this >>>>>>>>>>> issue. I can change the source code of the java application. >>>>>>>>>>> >>>>>>>>>>> Regards, >>>>>>>>>>> Dmitry >>>>>>>>>>> From rgowdapp at redhat.com Thu Jan 3 02:25:36 2019 From: rgowdapp at redhat.com (Raghavendra Gowdappa) Date: Thu, 3 Jan 2019 07:55:36 +0530 Subject: [Gluster-users] java application crushes while reading a zip file In-Reply-To: References: Message-ID: On Wed, Jan 2, 2019 at 9:59 PM Dmitry Isakbayev wrote: > Still no JVM crashes.
Is it possible that running glusterfs with > performance options turned off for a couple of days cleared out the "stale > metadata issue"? > Restarting these options would have cleared the existing cache, and with it any previously cached stale metadata. Hitting stale metadata again depends on races, which might be the reason you are still not seeing the issue. Can you try enabling all perf xlators (the default configuration)? > > On Mon, Dec 31, 2018 at 1:38 PM Dmitry Isakbayev > wrote: > >> The software ran with all of the options turned off over the weekend >> without any problems. >> I will try to collect the debug info for you. I have re-enabled the >> three options, but have yet to see the problem recur. >> >> >> On Sat, Dec 29, 2018 at 6:46 PM Raghavendra Gowdappa >> wrote: >> >>> Thanks Dmitry. Can you provide the following debug info I asked for earlier: >>> >>> * strace -ff -v ... of java application >>> * dump of the I/O traffic seen by the mountpoint (use --dump-fuse while >>> mounting). >>> >>> regards, >>> Raghavendra >>> >>> On Sat, Dec 29, 2018 at 2:08 AM Dmitry Isakbayev >>> wrote: >>> >>>> These 3 options seem to trigger both (reading zip file and renaming >>>> files) problems. >>>> >>>> Options Reconfigured: >>>> performance.io-cache: off >>>> performance.stat-prefetch: off >>>> performance.quick-read: off >>>> performance.parallel-readdir: off >>>> *performance.readdir-ahead: on* >>>> *performance.write-behind: on* >>>> *performance.read-ahead: on* >>>> performance.client-io-threads: off >>>> nfs.disable: on >>>> transport.address-family: inet >>>> >>>> >>>> On Fri, Dec 28, 2018 at 10:24 AM Dmitry Isakbayev >>>> wrote: >>>> >>>>> Turning a single option on at a time still worked fine. I will keep >>>>> trying. >>>>> >>>>> We had used 4.1.5 on KVM/CentOS7.5 at AWS without these issues or log >>>>> messages. Do you suppose these issues are triggered by the new environment >>>>> or did not exist in 4.1.5?
>>>>> >>>>> [root at node1 ~]# glusterfs --version >>>>> glusterfs 4.1.5 >>>>> >>>>> On AWS using >>>>> [root at node1 ~]# hostnamectl >>>>> Static hostname: node1 >>>>> Icon name: computer-vm >>>>> Chassis: vm >>>>> Machine ID: b30d0f2110ac3807b210c19ede3ce88f >>>>> Boot ID: 52bb159a0aa94043a40e7c7651967bd9 >>>>> Virtualization: kvm >>>>> Operating System: CentOS Linux 7 (Core) >>>>> CPE OS Name: cpe:/o:centos:centos:7 >>>>> Kernel: Linux 3.10.0-862.3.2.el7.x86_64 >>>>> Architecture: x86-64 >>>>> >>>>> >>>>> >>>>> >>>>> On Fri, Dec 28, 2018 at 8:56 AM Raghavendra Gowdappa < >>>>> rgowdapp at redhat.com> wrote: >>>>> >>>>>> >>>>>> >>>>>> On Fri, Dec 28, 2018 at 7:23 PM Dmitry Isakbayev >>>>>> wrote: >>>>>> >>>>>>> Ok. I will try different options. >>>>>>> >>>>>>> This system is scheduled to go into production soon. What version >>>>>>> would you recommend rolling back to? >>>>>>> >>>>>> >>>>>> These are long-standing issues. So, rolling back may not make these >>>>>> issues go away. Instead, if the performance is acceptable to you, >>>>>> please keep these xlators off in production. >>>>>> >>>>>> >>>>>>> On Thu, Dec 27, 2018 at 10:55 PM Raghavendra Gowdappa < >>>>>>> rgowdapp at redhat.com> wrote: >>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On Fri, Dec 28, 2018 at 3:13 AM Dmitry Isakbayev >>>>>>>> wrote: >>>>>>>> >>>>>>>>> Raghavendra, >>>>>>>>> >>>>>>>>> Thanks for the suggestion.
>>>>>>>>> >>>>>>>>> >>>>>>>>> I am using >>>>>>>>> >>>>>>>>> [root at jl-fanexoss1p glusterfs]# gluster --version >>>>>>>>> glusterfs 5.0 >>>>>>>>> >>>>>>>>> On >>>>>>>>> [root at jl-fanexoss1p glusterfs]# hostnamectl >>>>>>>>> Icon name: computer-vm >>>>>>>>> Chassis: vm >>>>>>>>> Machine ID: e44b8478ef7a467d98363614f4e50535 >>>>>>>>> Boot ID: eed98992fdda4c88bdd459a89101766b >>>>>>>>> Virtualization: vmware >>>>>>>>> Operating System: Red Hat Enterprise Linux Server 7.5 (Maipo) >>>>>>>>> CPE OS Name: cpe:/o:redhat:enterprise_linux:7.5:GA:server >>>>>>>>> Kernel: Linux 3.10.0-862.14.4.el7.x86_64 >>>>>>>>> Architecture: x86-64 >>>>>>>>> >>>>>>>>> >>>>>>>>> I have configured the following options >>>>>>>>> >>>>>>>>> [root at jl-fanexoss1p glusterfs]# gluster volume info >>>>>>>>> Volume Name: gv0 >>>>>>>>> Type: Replicate >>>>>>>>> Volume ID: 5ffbda09-c5e2-4abc-b89e-79b5d8a40824 >>>>>>>>> Status: Started >>>>>>>>> Snapshot Count: 0 >>>>>>>>> Number of Bricks: 1 x 3 = 3 >>>>>>>>> Transport-type: tcp >>>>>>>>> Bricks: >>>>>>>>> Brick1: jl-fanexoss1p.cspire.net:/data/brick1/gv0 >>>>>>>>> Brick2: sl-fanexoss2p.cspire.net:/data/brick1/gv0 >>>>>>>>> Brick3: nxquorum1p.cspire.net:/data/brick1/gv0 >>>>>>>>> Options Reconfigured: >>>>>>>>> performance.io-cache: off >>>>>>>>> performance.stat-prefetch: off >>>>>>>>> performance.quick-read: off >>>>>>>>> performance.parallel-readdir: off >>>>>>>>> performance.readdir-ahead: off >>>>>>>>> performance.write-behind: off >>>>>>>>> performance.read-ahead: off >>>>>>>>> performance.client-io-threads: off >>>>>>>>> nfs.disable: on >>>>>>>>> transport.address-family: inet >>>>>>>>> >>>>>>>>> I don't know if it is related, but I am seeing a lot of >>>>>>>>> [2018-12-27 20:19:23.776080] W [MSGID: 114031] >>>>>>>>> [client-rpc-fops_v2.c:1932:client4_0_seek_cbk] 2-gv0-client-0: remote >>>>>>>>> operation failed [No such device or address] >>>>>>>>> [2018-12-27 20:19:47.735190] E [MSGID: 101191] >>>>>>>>>
[event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch >>>>>>>>> handler >>>>>>>>> >>>>>>>> >>>>>>>> These msgs were introduced by patch [1]. To the best of my >>>>>>>> knowledge they are benign. We'll be sending a patch to fix these msgs >>>>>>>> though. >>>>>>>> >>>>>>>> +Mohit Agrawal +Milind Changire. >>>>>>>> Can you try to identify why we are seeing >>>>>>>> these messages? If possible please send a patch to fix this. >>>>>>>> >>>>>>>> [1] >>>>>>>> https://review.gluster.org/r/I578c3fc67713f4234bd3abbec5d3fbba19059ea5 >>>>>>>> >>>>>>>> >>>>>>>>> And java.io exceptions trying to rename files. >>>>>>>>> >>>>>>>> >>>>>>>> When you see the errors, is it possible to collect: >>>>>>>> * strace of the java application (strace -ff -v ...) >>>>>>>> * fuse-dump of the glusterfs mount (use option --dump-fuse while >>>>>>>> mounting)? >>>>>>>> >>>>>>>> I also need another favour from you. By trial and error, can you >>>>>>>> point out which of the many performance xlators you've turned off is >>>>>>>> causing the issue? >>>>>>>> >>>>>>>> The above two data-points will help us to fix the problem. >>>>>>>> >>>>>>>> >>>>>>>>> Thank You, >>>>>>>>> Dmitry >>>>>>>>> >>>>>>>>> >>>>>>>>> On Thu, Dec 27, 2018 at 3:48 PM Raghavendra Gowdappa < >>>>>>>>> rgowdapp at redhat.com> wrote: >>>>>>>>> >>>>>>>>>> What version of glusterfs are you using? It might be either >>>>>>>>>> * a stale metadata issue. >>>>>>>>>> * inconsistent ctime issue. >>>>>>>>>> >>>>>>>>>> Can you try turning off all performance xlators? If the issue is >>>>>>>>>> the first one, that should help. >>>>>>>>>> >>>>>>>>>> On Fri, Dec 28, 2018 at 1:51 AM Dmitry Isakbayev < >>>>>>>>>> isakdim at gmail.com> wrote: >>>>>>>>>> >>>>>>>>>>> Attempted to set `performance.read-ahead off` according to >>>>>>>>>>> https://jira.apache.org/jira/browse/AMQ-7041 >>>>>>>>>>> That did not help.
>>>>>>>>>>> >>>>>>>>>>> On Mon, Dec 24, 2018 at 2:11 PM Dmitry Isakbayev < >>>>>>>>>>> isakdim at gmail.com> wrote: >>>>>>>>>>> >>>>>>>>>>>> The core file generated by the JVM suggests that it happens because >>>>>>>>>>>> the file is changing while it is being read - >>>>>>>>>>>> https://bugs.java.com/bugdatabase/view_bug.do?bug_id=8186557. >>>>>>>>>>>> The application reads in the zipfile and goes through the zip >>>>>>>>>>>> entries, then reloads the file and goes through the zip entries again. It does so >>>>>>>>>>>> 3 times. The application never crashes on the 1st cycle but sometimes >>>>>>>>>>>> crashes on the 2nd or 3rd cycle. >>>>>>>>>>>> The zip file is generated about 20 seconds prior to it being >>>>>>>>>>>> used and is not updated or even used by any other application. I have >>>>>>>>>>>> never seen this problem on a plain file system. >>>>>>>>>>>> >>>>>>>>>>>> I would appreciate any suggestions on how to go about debugging this >>>>>>>>>>>> issue. I can change the source code of the java application. >>>>>>>>>>>> >>>>>>>>>>>> Regards, >>>>>>>>>>>> Dmitry >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> _______________________________________________ >>>>>>>>>>> Gluster-users mailing list >>>>>>>>>>> Gluster-users at gluster.org >>>>>>>>>>> https://lists.gluster.org/mailman/listinfo/gluster-users >>>>>>>>>> >>>>>>>>>> -------------- next part -------------- An HTML attachment was scrubbed... URL: From amudhan83 at gmail.com Thu Jan 3 07:50:10 2019 From: amudhan83 at gmail.com (Amudhan P) Date: Thu, 3 Jan 2019 13:20:10 +0530 Subject: [Gluster-users] Error in Installing Glusterfs-4.1.6 from tar In-Reply-To: References: <0ee0afec-38a7-5261-badf-0e15803e5e6d@redhat.com> Message-ID: Can I ignore the warning messages quoted at the end of this mail and continue with the installation? On Thu, Dec 27, 2018 at 5:11 PM Amudhan P wrote: > Thanks, Ravishankar, it worked. > Also, I am getting the following warning messages when running `make`. Is it > safe to ignore them?
> > dht-layout.c: In function ‘dht_layout_new’: > dht-layout.c:51:9: warning: dereferencing type-punned pointer will break > strict-aliasing rules [-Wstrict-aliasing] > GF_ATOMIC_INIT (layout->ref, 1); > ^ > dht-layout.c:51:9: warning: dereferencing type-punned pointer will break > strict-aliasing rules [-Wstrict-aliasing] > CC dht-helper.lo > > > CC ec.lo > ec.c: In function ‘ec_statistics_init’: > ec.c:637:9: warning: dereferencing type-punned pointer will break > strict-aliasing rules [-Wstrict-aliasing] > GF_ATOMIC_INIT(ec->stats.stripe_cache.hits, 0); > ^ > ec.c:637:9: warning: dereferencing type-punned pointer will break > strict-aliasing rules [-Wstrict-aliasing] > ec.c:638:9: warning: dereferencing type-punned pointer will break > strict-aliasing rules [-Wstrict-aliasing] > GF_ATOMIC_INIT(ec->stats.stripe_cache.misses, 0); > ^ > ec.c:638:9: warning: dereferencing type-punned pointer will break > strict-aliasing rules [-Wstrict-aliasing] > ec.c:639:9: warning: dereferencing type-punned pointer will break > strict-aliasing rules [-Wstrict-aliasing] > GF_ATOMIC_INIT(ec->stats.stripe_cache.updates, 0); > ^ > ec.c:639:9: warning: dereferencing type-punned pointer will break > strict-aliasing rules [-Wstrict-aliasing] > ec.c:640:9: warning: dereferencing type-punned pointer will break > strict-aliasing rules [-Wstrict-aliasing] > GF_ATOMIC_INIT(ec->stats.stripe_cache.invals, 0); > ^ > ec.c:640:9: warning: dereferencing type-punned pointer will break > strict-aliasing rules [-Wstrict-aliasing] > ec.c:641:9: warning: dereferencing type-punned pointer will break > strict-aliasing rules [-Wstrict-aliasing] > GF_ATOMIC_INIT(ec->stats.stripe_cache.evicts, 0); > ^ > ec.c:641:9: warning: dereferencing type-punned pointer will break > strict-aliasing rules [-Wstrict-aliasing] > ec.c:642:9: warning: dereferencing type-punned pointer will break > strict-aliasing rules [-Wstrict-aliasing] > GF_ATOMIC_INIT(ec->stats.stripe_cache.allocs, 0); > ^ > ec.c:642:9: warning:
dereferencing type-punned pointer will break > strict-aliasing rules [-Wstrict-aliasing] > ec.c:643:9: warning: dereferencing type-punned pointer will break > strict-aliasing rules [-Wstrict-aliasing] > GF_ATOMIC_INIT(ec->stats.stripe_cache.errors, 0); > ^ > ec.c:643:9: warning: dereferencing type-punned pointer will break > strict-aliasing rules [-Wstrict-aliasing] > CC ec-data.lo > > > CCLD posix.la > .libs/posix-inode-fd-ops.o: In function `posix_do_chmod': > /home/qubevaultadmin/gluster-tar/glusterfs-4.1.6/xlators/storage/posix/src/posix-inode-fd-ops.c:203: > warning: lchmod is not implemented and will always fail > make[5]: Nothing to be done for 'all-am'. > > > CC client-handshake.lo > client-handshake.c: In function ‘clnt_fd_lk_local_create’: > client-handshake.c:150:9: warning: dereferencing type-punned pointer will > break strict-aliasing rules [-Wstrict-aliasing] > GF_ATOMIC_INIT (local->ref, 1); > ^ > client-handshake.c:150:9: warning: dereferencing type-punned pointer will > break strict-aliasing rules [-Wstrict-aliasing] > CC client-callback.lo > > CC readdir-ahead.lo > readdir-ahead.c: In function ‘init’: > readdir-ahead.c:637:9: warning: dereferencing type-punned pointer will > break strict-aliasing rules [-Wstrict-aliasing] > GF_ATOMIC_INIT (priv->rda_cache_size, 0); > ^ > readdir-ahead.c:637:9: warning: dereferencing type-punned pointer will > break strict-aliasing rules [-Wstrict-aliasing] > CCLD readdir-ahead.la > > Making all in src > CC md-cache.lo > md-cache.c: In function ‘mdc_init’: > md-cache.c:3431:9: warning: dereferencing type-punned pointer will break > strict-aliasing rules [-Wstrict-aliasing] > GF_ATOMIC_INIT (conf->mdc_counter.stat_hit, 0); > ^ > md-cache.c:3431:9: warning: dereferencing type-punned pointer will break > strict-aliasing rules [-Wstrict-aliasing] > md-cache.c:3432:9: warning: dereferencing type-punned pointer will break > strict-aliasing rules [-Wstrict-aliasing] > GF_ATOMIC_INIT (conf->mdc_counter.stat_miss, 0); > ^ >
md-cache.c:3432:9: warning: dereferencing type-punned pointer will break > strict-aliasing rules [-Wstrict-aliasing] > md-cache.c:3433:9: warning: dereferencing type-punned pointer will break > strict-aliasing rules [-Wstrict-aliasing] > GF_ATOMIC_INIT (conf->mdc_counter.xattr_hit, 0); > ^ > md-cache.c:3433:9: warning: dereferencing type-punned pointer will break > strict-aliasing rules [-Wstrict-aliasing] > md-cache.c:3434:9: warning: dereferencing type-punned pointer will break > strict-aliasing rules [-Wstrict-aliasing] > GF_ATOMIC_INIT (conf->mdc_counter.xattr_miss, 0); > ^ > md-cache.c:3434:9: warning: dereferencing type-punned pointer will break > strict-aliasing rules [-Wstrict-aliasing] > md-cache.c:3435:9: warning: dereferencing type-punned pointer will break > strict-aliasing rules [-Wstrict-aliasing] > GF_ATOMIC_INIT (conf->mdc_counter.negative_lookup, 0); > ^ > md-cache.c:3435:9: warning: dereferencing type-punned pointer will break > strict-aliasing rules [-Wstrict-aliasing] > md-cache.c:3436:9: warning: dereferencing type-punned pointer will break > strict-aliasing rules [-Wstrict-aliasing] > GF_ATOMIC_INIT (conf->mdc_counter.nameless_lookup, 0); > ^ > md-cache.c:3436:9: warning: dereferencing type-punned pointer will break > strict-aliasing rules [-Wstrict-aliasing] > md-cache.c:3437:9: warning: dereferencing type-punned pointer will break > strict-aliasing rules [-Wstrict-aliasing] > GF_ATOMIC_INIT (conf->mdc_counter.stat_invals, 0); > ^ > md-cache.c:3437:9: warning: dereferencing type-punned pointer will break > strict-aliasing rules [-Wstrict-aliasing] > md-cache.c:3438:9: warning: dereferencing type-punned pointer will break > strict-aliasing rules [-Wstrict-aliasing] > GF_ATOMIC_INIT (conf->mdc_counter.xattr_invals, 0); > ^ > md-cache.c:3438:9: warning: dereferencing type-punned pointer will break > strict-aliasing rules [-Wstrict-aliasing] > md-cache.c:3439:9: warning: dereferencing type-punned pointer will break > strict-aliasing rules 
[-Wstrict-aliasing] > GF_ATOMIC_INIT (conf->mdc_counter.need_lookup, 0); > ^ > md-cache.c:3439:9: warning: dereferencing type-punned pointer will break > strict-aliasing rules [-Wstrict-aliasing] > CCLD md-cache.la > > > dht-layout.c: In function ‘dht_layout_new’: > dht-layout.c:51:9: warning: dereferencing type-punned pointer will break > strict-aliasing rules [-Wstrict-aliasing] > GF_ATOMIC_INIT (layout->ref, 1); > > CC io-stats.lo > io-stats.c: In function ‘ios_init_iosstat’: > io-stats.c:1973:17: warning: dereferencing type-punned pointer will break > strict-aliasing rules [-Wstrict-aliasing] > GF_ATOMIC_INIT (iosstat->counters[i], 0); > ^ > io-stats.c:1973:17: warning: dereferencing type-punned pointer will break > strict-aliasing rules [-Wstrict-aliasing] > io-stats.c: In function ‘io_stats_open_cbk’: > io-stats.c:2066:9: warning: dereferencing type-punned pointer will break > strict-aliasing rules [-Wstrict-aliasing] > GF_ATOMIC_INIT (iosfd->data_read, 0); > ^ > io-stats.c:2066:9: warning: dereferencing type-punned pointer will break > strict-aliasing rules [-Wstrict-aliasing] > io-stats.c:2067:9: warning: dereferencing type-punned pointer will break > strict-aliasing rules [-Wstrict-aliasing] > GF_ATOMIC_INIT (iosfd->data_written, 0); > ^ > io-stats.c:2067:9: warning: dereferencing type-punned pointer will break > strict-aliasing rules [-Wstrict-aliasing] > io-stats.c:2069:17: warning: dereferencing type-punned pointer will break > strict-aliasing rules [-Wstrict-aliasing] > GF_ATOMIC_INIT (iosfd->block_count_write[i], 0); > ^ > io-stats.c:2069:17: warning: dereferencing type-punned pointer will break > strict-aliasing rules [-Wstrict-aliasing] > io-stats.c:2070:17: warning: dereferencing type-punned pointer will break > strict-aliasing rules [-Wstrict-aliasing] > GF_ATOMIC_INIT (iosfd->block_count_read[i], 0); > ^ > io-stats.c:2070:17: warning: dereferencing type-punned pointer will break > strict-aliasing rules [-Wstrict-aliasing] > io-stats.c: In
function ‘ios_init_stats’: > io-stats.c:4006:9: warning: dereferencing type-punned pointer will break > strict-aliasing rules [-Wstrict-aliasing] > GF_ATOMIC_INIT (stats->data_read, 0); > ^ > io-stats.c:4006:9: warning: dereferencing type-punned pointer will break > strict-aliasing rules [-Wstrict-aliasing] > io-stats.c:4007:9: warning: dereferencing type-punned pointer will break > strict-aliasing rules [-Wstrict-aliasing] > GF_ATOMIC_INIT (stats->data_written, 0); > ^ > io-stats.c:4007:9: warning: dereferencing type-punned pointer will break > strict-aliasing rules [-Wstrict-aliasing] > io-stats.c:4010:17: warning: dereferencing type-punned pointer will break > strict-aliasing rules [-Wstrict-aliasing] > GF_ATOMIC_INIT (stats->block_count_write[i], 0); > ^ > io-stats.c:4010:17: warning: dereferencing type-punned pointer will break > strict-aliasing rules [-Wstrict-aliasing] > io-stats.c:4011:17: warning: dereferencing type-punned pointer will break > strict-aliasing rules [-Wstrict-aliasing] > GF_ATOMIC_INIT (stats->block_count_read[i], 0); > ^ > io-stats.c:4011:17: warning: dereferencing type-punned pointer will break > strict-aliasing rules [-Wstrict-aliasing] > io-stats.c:4015:17: warning: dereferencing type-punned pointer will break > strict-aliasing rules [-Wstrict-aliasing] > GF_ATOMIC_INIT (stats->fop_hits[i], 0); > ^ > io-stats.c:4015:17: warning: dereferencing type-punned pointer will break > strict-aliasing rules [-Wstrict-aliasing] > io-stats.c:4018:17: warning: dereferencing type-punned pointer will break > strict-aliasing rules [-Wstrict-aliasing] > GF_ATOMIC_INIT (stats->upcall_hits[i], 0); > ^ > io-stats.c:4018:17: warning: dereferencing type-punned pointer will break > strict-aliasing rules [-Wstrict-aliasing] > CCLD io-stats.la > > CC barrier.lo > barrier.c: In function ‘notify’: > barrier.c:499:33: warning: switch condition has boolean value > [-Wswitch-bool] > switch (past) { > ^ > barrier.c: In function ‘reconfigure’: > barrier.c:565:25:
warning: switch condition has boolean value > [-Wswitch-bool] > switch (past) { > ^ > CCLD barrier.la > > > CC client-handshake.lo > client-handshake.c: In function ‘clnt_fd_lk_local_create’: > client-handshake.c:150:9: warning: dereferencing type-punned pointer will > break strict-aliasing rules [-Wstrict-aliasing] > GF_ATOMIC_INIT (local->ref, 1); > ^ > client-handshake.c:150:9: warning: dereferencing type-punned pointer will > break strict-aliasing rules [-Wstrict-aliasing] > > > CC changelog-rpc.lo > changelog-rpc.c: In function ‘changelog_rpc_clnt_init’: > changelog-rpc.c:217:9: warning: dereferencing type-punned pointer will > break strict-aliasing rules [-Wstrict-aliasing] > GF_ATOMIC_INIT (crpc->ref, 1); > ^ > changelog-rpc.c:217:9: warning: dereferencing type-punned pointer will > break strict-aliasing rules [-Wstrict-aliasing] > CC changelog-barrier.lo > CC changelog-rpc-common.lo > > > > CC mount.lo > ../../../../contrib/fuse-lib/mount.c: In function ‘gf_fuse_unmount_daemon’: > ../../../../contrib/fuse-lib/mount.c:106:17: warning: ignoring return > value of ‘chdir’, declared with attribute warn_unused_result > [-Wunused-result] > (void)chdir("/"); > ^ > ../../../../contrib/fuse-lib/mount.c:110:17: warning: ignoring return > value of ‘read’, declared with attribute warn_unused_result > [-Wunused-result] > read (ump[0], &c, 1); > ^ > ../../../../contrib/fuse-lib/mount.c: In function ‘gf_fuse_mount’: > ../../../../contrib/fuse-lib/mount.c:507:25: warning: ignoring return > value of ‘write’, declared with attribute warn_unused_result > [-Wunused-result] > (void)write (status_fd, &ret, sizeof (ret)); > ^ > CC mount-common.lo > ../../../../contrib/fuse-lib/mount-common.c: In function > ‘mtab_needs_update’: > ../../../../contrib/fuse-lib/mount-common.c:59:25: warning: ignoring > return value of ‘setreuid’, declared with attribute warn_unused_result > [-Wunused-result] > setreuid (0, -1); > ^ > ../../../../contrib/fuse-lib/mount-common.c:64:25: warning: ignoring
> return value of ‘setreuid’, declared with attribute warn_unused_result > [-Wunused-result] > setreuid (ruid, -1); > ^ > CCLD fuse.la > > > > glusterd-pmap.c: In function ‘pmap_registry_bind’: > glusterd-pmap.c:287:17: warning: ignoring return value of ‘asprintf’, > declared with attribute warn_unused_result [-Wunused-result] > asprintf (&pmap->ports[p].brickname, "%s %s", tmp, > brickname); > ^ > glusterd-pmap.c: In function ‘pmap_registry_extend’: > glusterd-pmap.c:346:17: warning: ignoring return value of ‘asprintf’, > declared with attribute warn_unused_result [-Wunused-result] > asprintf (&new_bn, "%s %s", old_bn, brickname); > > > > CC mount-common.o > ../../contrib/fuse-lib/mount-common.c: In function ‘mtab_needs_update’: > ../../contrib/fuse-lib/mount-common.c:59:25: warning: ignoring return > value of ‘setreuid’, declared with attribute warn_unused_result > [-Wunused-result] > setreuid (0, -1); > ^ > ../../contrib/fuse-lib/mount-common.c:64:25: warning: ignoring return > value of ‘setreuid’, declared with attribute warn_unused_result > [-Wunused-result] > setreuid (ruid, -1); > ^ > CCLD fusermount-glusterfs > > > Amudhan > > On Thu, Dec 27, 2018 at 4:38 PM Ravishankar N > wrote: > >> >> >> On 12/27/2018 04:26 PM, Amudhan P wrote: >> >> Hi, >> >> I am trying to compile & install Glusterfs-4.1.6 using tar file and I am >> getting this error message when running `make`. >> ``` >> CC afr-self-heal-name.lo >> CC afr.lo >> In file included from afr.c:18:0: >> afr-common.c: In function ‘afr_lookup_entry_heal’: >> afr-common.c:2892:29: error: implicit declaration of function >> ‘uuid_is_null’
[-Werror=implicit-function-declaration] >> if (uuid_is_null (gfid)) { >> ^ >> cc1: some warnings being treated as errors >> Makefile:585: recipe for target 'afr.lo' failed >> make[5]: *** [afr.lo] Error 1 >> Makefile:467: recipe for target 'all-recursive' failed >> make[4]: *** [all-recursive] Error 1 >> Makefile:467: recipe for target 'all-recursive' failed >> make[3]: *** [all-recursive] Error 1 >> Makefile:473: recipe for target 'all-recursive' failed >> make[2]: *** [all-recursive] Error 1 >> Makefile:606: recipe for target 'all-recursive' failed >> make[1]: *** [all-recursive] Error 1 >> Makefile:497: recipe for target 'all' failed >> make: *** [all] Error 2 >> ``` >> OS : Ubuntu 16.04 >> file used : glusterfs-4.1.6.tar.gz >> >> How do I fix this issue? >> >> Try this fix: https://review.gluster.org/#/c/glusterfs/+/21571/ >> -Ravi >> >> >> regards >> Amudhan >> >> >> _______________________________________________ >> Gluster-users mailing list >> Gluster-users at gluster.org >> https://lists.gluster.org/mailman/listinfo/gluster-users >> >> >> _______________________________________________ >> Gluster-users mailing list >> Gluster-users at gluster.org >> https://lists.gluster.org/mailman/listinfo/gluster-users > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ravishankar at redhat.com Thu Jan 3 08:38:43 2019 From: ravishankar at redhat.com (Ravishankar N) Date: Thu, 3 Jan 2019 14:08:43 +0530 Subject: [Gluster-users] Error in Installing Glusterfs-4.1.6 from tar In-Reply-To: References: <0ee0afec-38a7-5261-badf-0e15803e5e6d@redhat.com> Message-ID: <41a541aa-f552-7f18-57ea-e096865ad5d5@redhat.com> I don't get these warnings when compiling 4.1.6 on Fedora 28 with gcc (GCC) 8.1.1. Perhaps it is a gcc issue: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80593. On 01/03/2019 01:20 PM, Amudhan P wrote: > Can I ignore the warning messages quoted at the end of this mail and continue with the > installation?
> > On Thu, Dec 27, 2018 at 5:11 PM Amudhan P > wrote: > > Thanks, Ravishankar, it worked. > Also, I am getting the following warning messages when running > `make`. Is it safe to ignore them? > > dht-layout.c: In function ‘dht_layout_new’: > dht-layout.c:51:9: warning: dereferencing type-punned pointer will > break strict-aliasing rules [-Wstrict-aliasing] > GF_ATOMIC_INIT (layout->ref, 1); > ^ > dht-layout.c:51:9: warning: dereferencing type-punned pointer will > break strict-aliasing rules [-Wstrict-aliasing] > CC dht-helper.lo > > > CC ec.lo > ec.c: In function ‘ec_statistics_init’: > ec.c:637:9: warning: dereferencing type-punned pointer will break > strict-aliasing rules [-Wstrict-aliasing] > GF_ATOMIC_INIT(ec->stats.stripe_cache.hits, 0); > ^ > ec.c:637:9: warning: dereferencing type-punned pointer will break > strict-aliasing rules [-Wstrict-aliasing] > ec.c:638:9: warning: dereferencing type-punned pointer will break > strict-aliasing rules [-Wstrict-aliasing] > GF_ATOMIC_INIT(ec->stats.stripe_cache.misses, 0); > ^ > ec.c:638:9: warning: dereferencing type-punned pointer will break > strict-aliasing rules [-Wstrict-aliasing] > ec.c:639:9: warning: dereferencing type-punned pointer will break > strict-aliasing rules [-Wstrict-aliasing] > GF_ATOMIC_INIT(ec->stats.stripe_cache.updates, 0); > ^ > ec.c:639:9: warning: dereferencing type-punned pointer will break > strict-aliasing rules [-Wstrict-aliasing] > ec.c:640:9: warning: dereferencing type-punned pointer will break > strict-aliasing rules [-Wstrict-aliasing] > GF_ATOMIC_INIT(ec->stats.stripe_cache.invals, 0); > ^ > ec.c:640:9: warning: dereferencing type-punned pointer will break > strict-aliasing rules [-Wstrict-aliasing] > ec.c:641:9: warning: dereferencing type-punned pointer will break > strict-aliasing rules [-Wstrict-aliasing] > GF_ATOMIC_INIT(ec->stats.stripe_cache.evicts, 0); >
^ > ec.c:641:9: warning: dereferencing type-punned pointer will break > strict-aliasing rules [-Wstrict-aliasing] > ec.c:642:9: warning: dereferencing type-punned pointer will break > strict-aliasing rules [-Wstrict-aliasing] > GF_ATOMIC_INIT(ec->stats.stripe_cache.allocs, 0); > ^ > ec.c:642:9: warning: dereferencing type-punned pointer will break > strict-aliasing rules [-Wstrict-aliasing] > ec.c:643:9: warning: dereferencing type-punned pointer will break > strict-aliasing rules [-Wstrict-aliasing] > GF_ATOMIC_INIT(ec->stats.stripe_cache.errors, 0); > ^ > ec.c:643:9: warning: dereferencing type-punned pointer will break > strict-aliasing rules [-Wstrict-aliasing] > CC ec-data.lo > > > CCLD posix.la > .libs/posix-inode-fd-ops.o: In function `posix_do_chmod': > /home/qubevaultadmin/gluster-tar/glusterfs-4.1.6/xlators/storage/posix/src/posix-inode-fd-ops.c:203: > warning: lchmod is not implemented and will always fail > make[5]: Nothing to be done for 'all-am'. > > > CC client-handshake.lo > client-handshake.c: In function ‘clnt_fd_lk_local_create’: > client-handshake.c:150:9: warning: dereferencing type-punned > pointer will break strict-aliasing rules [-Wstrict-aliasing] > GF_ATOMIC_INIT (local->ref, 1); > ^ > client-handshake.c:150:9: warning: dereferencing type-punned > pointer will break strict-aliasing rules [-Wstrict-aliasing] > CC client-callback.lo > > CC readdir-ahead.lo > readdir-ahead.c: In function ‘init’: > readdir-ahead.c:637:9: warning: dereferencing type-punned pointer > will break strict-aliasing rules [-Wstrict-aliasing] > GF_ATOMIC_INIT (priv->rda_cache_size, 0); > ^ > readdir-ahead.c:637:9: warning: dereferencing type-punned pointer > will break strict-aliasing rules [-Wstrict-aliasing] > CCLD readdir-ahead.la > > Making all in src > CC
md-cache.lo > md-cache.c: In function ‘mdc_init’: > md-cache.c:3431:9: warning: dereferencing type-punned pointer will > break strict-aliasing rules [-Wstrict-aliasing] > GF_ATOMIC_INIT (conf->mdc_counter.stat_hit, 0); > ^ > md-cache.c:3431:9: warning: dereferencing type-punned pointer will > break strict-aliasing rules [-Wstrict-aliasing] > md-cache.c:3432:9: warning: dereferencing type-punned pointer will > break strict-aliasing rules [-Wstrict-aliasing] > GF_ATOMIC_INIT (conf->mdc_counter.stat_miss, 0); > ^ > md-cache.c:3432:9: warning: dereferencing type-punned pointer will > break strict-aliasing rules [-Wstrict-aliasing] > md-cache.c:3433:9: warning: dereferencing type-punned pointer will > break strict-aliasing rules [-Wstrict-aliasing] > GF_ATOMIC_INIT (conf->mdc_counter.xattr_hit, 0); > ^ > md-cache.c:3433:9: warning: dereferencing type-punned pointer will > break strict-aliasing rules [-Wstrict-aliasing] > md-cache.c:3434:9: warning: dereferencing type-punned pointer will > break strict-aliasing rules [-Wstrict-aliasing] > GF_ATOMIC_INIT (conf->mdc_counter.xattr_miss, 0); > ^ > md-cache.c:3434:9: warning: dereferencing type-punned pointer will > break strict-aliasing rules [-Wstrict-aliasing] > md-cache.c:3435:9: warning: dereferencing type-punned pointer will > break strict-aliasing rules [-Wstrict-aliasing] > GF_ATOMIC_INIT (conf->mdc_counter.negative_lookup, 0); > ^ > md-cache.c:3435:9: warning: dereferencing type-punned pointer will > break strict-aliasing rules [-Wstrict-aliasing] > md-cache.c:3436:9: warning: dereferencing type-punned pointer will > break strict-aliasing rules [-Wstrict-aliasing] > GF_ATOMIC_INIT (conf->mdc_counter.nameless_lookup, 0); >
^ > md-cache.c:3436:9: warning: dereferencing type-punned pointer will > break strict-aliasing rules [-Wstrict-aliasing] > md-cache.c:3437:9: warning: dereferencing type-punned pointer will > break strict-aliasing rules [-Wstrict-aliasing] > GF_ATOMIC_INIT (conf->mdc_counter.stat_invals, 0); > ^ > md-cache.c:3437:9: warning: dereferencing type-punned pointer will > break strict-aliasing rules [-Wstrict-aliasing] > md-cache.c:3438:9: warning: dereferencing type-punned pointer will > break strict-aliasing rules [-Wstrict-aliasing] > GF_ATOMIC_INIT (conf->mdc_counter.xattr_invals, 0); > ^ > md-cache.c:3438:9: warning: dereferencing type-punned pointer will > break strict-aliasing rules [-Wstrict-aliasing] > md-cache.c:3439:9: warning: dereferencing type-punned pointer will > break strict-aliasing rules [-Wstrict-aliasing] > GF_ATOMIC_INIT (conf->mdc_counter.need_lookup, 0); > ^ > md-cache.c:3439:9: warning: dereferencing type-punned pointer will > break strict-aliasing rules [-Wstrict-aliasing] > CCLD md-cache.la > > > dht-layout.c: In function ‘dht_layout_new’: > dht-layout.c:51:9: warning: dereferencing type-punned pointer will > break strict-aliasing rules [-Wstrict-aliasing] > GF_ATOMIC_INIT (layout->ref, 1); > > CC io-stats.lo > io-stats.c: In function ‘ios_init_iosstat’: > io-stats.c:1973:17: warning: dereferencing type-punned pointer > will break strict-aliasing rules [-Wstrict-aliasing] > GF_ATOMIC_INIT (iosstat->counters[i], 0); > ^ > io-stats.c:1973:17: warning: dereferencing type-punned pointer > will break strict-aliasing rules [-Wstrict-aliasing] > io-stats.c: In function ‘io_stats_open_cbk’: > io-stats.c:2066:9: warning: dereferencing type-punned pointer will > break strict-aliasing rules [-Wstrict-aliasing] > GF_ATOMIC_INIT (iosfd->data_read, 0); >
 ^
> io-stats.c:2066:9: warning: dereferencing type-punned pointer will
> break strict-aliasing rules [-Wstrict-aliasing]
> io-stats.c:2067:9: warning: dereferencing type-punned pointer will
> break strict-aliasing rules [-Wstrict-aliasing]
>          GF_ATOMIC_INIT (iosfd->data_written, 0);
>          ^
> io-stats.c:2067:9: warning: dereferencing type-punned pointer will
> break strict-aliasing rules [-Wstrict-aliasing]
> io-stats.c:2069:17: warning: dereferencing type-punned pointer
> will break strict-aliasing rules [-Wstrict-aliasing]
>                  GF_ATOMIC_INIT (iosfd->block_count_write[i], 0);
>                  ^
> io-stats.c:2069:17: warning: dereferencing type-punned pointer
> will break strict-aliasing rules [-Wstrict-aliasing]
> io-stats.c:2070:17: warning: dereferencing type-punned pointer
> will break strict-aliasing rules [-Wstrict-aliasing]
>                  GF_ATOMIC_INIT (iosfd->block_count_read[i], 0);
>                  ^
> io-stats.c:2070:17: warning: dereferencing type-punned pointer
> will break strict-aliasing rules [-Wstrict-aliasing]
> io-stats.c: In function ‘ios_init_stats’:
> io-stats.c:4006:9: warning: dereferencing type-punned pointer will
> break strict-aliasing rules [-Wstrict-aliasing]
>          GF_ATOMIC_INIT (stats->data_read, 0);
>          ^
> io-stats.c:4006:9: warning: dereferencing type-punned pointer will
> break strict-aliasing rules [-Wstrict-aliasing]
> io-stats.c:4007:9: warning: dereferencing type-punned pointer will
> break strict-aliasing rules [-Wstrict-aliasing]
>          GF_ATOMIC_INIT (stats->data_written, 0);
>          ^
> io-stats.c:4007:9: warning: dereferencing type-punned pointer will
> break strict-aliasing rules [-Wstrict-aliasing]
> io-stats.c:4010:17: warning: dereferencing type-punned pointer
> will break strict-aliasing rules [-Wstrict-aliasing]
>                  GF_ATOMIC_INIT (stats->block_count_write[i], 0);
>                 
 ^
> io-stats.c:4010:17: warning: dereferencing type-punned pointer
> will break strict-aliasing rules [-Wstrict-aliasing]
> io-stats.c:4011:17: warning: dereferencing type-punned pointer
> will break strict-aliasing rules [-Wstrict-aliasing]
>                  GF_ATOMIC_INIT (stats->block_count_read[i], 0);
>                  ^
> io-stats.c:4011:17: warning: dereferencing type-punned pointer
> will break strict-aliasing rules [-Wstrict-aliasing]
> io-stats.c:4015:17: warning: dereferencing type-punned pointer
> will break strict-aliasing rules [-Wstrict-aliasing]
>                  GF_ATOMIC_INIT (stats->fop_hits[i], 0);
>                  ^
> io-stats.c:4015:17: warning: dereferencing type-punned pointer
> will break strict-aliasing rules [-Wstrict-aliasing]
> io-stats.c:4018:17: warning: dereferencing type-punned pointer
> will break strict-aliasing rules [-Wstrict-aliasing]
>                  GF_ATOMIC_INIT (stats->upcall_hits[i], 0);
>                  ^
> io-stats.c:4018:17: warning: dereferencing type-punned pointer
> will break strict-aliasing rules [-Wstrict-aliasing]
>   CCLD io-stats.la
>
>   CC       barrier.lo
> barrier.c: In function ‘notify’:
> barrier.c:499:33: warning: switch condition has boolean value
> [-Wswitch-bool]
>                          switch (past) {
>                                  ^
> barrier.c: In function ‘reconfigure’:
> barrier.c:565:25: warning: switch condition has boolean value
> [-Wswitch-bool]
>                  switch (past) {
>                          ^
>   CCLD barrier.la
>
>
>   CC       client-handshake.lo
> client-handshake.c: In function ‘clnt_fd_lk_local_create’:
> client-handshake.c:150:9: warning: dereferencing type-punned
> pointer will break strict-aliasing rules [-Wstrict-aliasing]
>          GF_ATOMIC_INIT (local->ref, 1);
>          ^
> client-handshake.c:150:9: warning: dereferencing type-punned
> pointer will break strict-aliasing rules [-Wstrict-aliasing]
>
>
>   CC      
 changelog-rpc.lo
> changelog-rpc.c: In function ‘changelog_rpc_clnt_init’:
> changelog-rpc.c:217:9: warning: dereferencing type-punned pointer
> will break strict-aliasing rules [-Wstrict-aliasing]
>          GF_ATOMIC_INIT (crpc->ref, 1);
>          ^
> changelog-rpc.c:217:9: warning: dereferencing type-punned pointer
> will break strict-aliasing rules [-Wstrict-aliasing]
>   CC       changelog-barrier.lo
>   CC       changelog-rpc-common.lo
>
>
>
>   CC       mount.lo
> ../../../../contrib/fuse-lib/mount.c: In function
> ‘gf_fuse_unmount_daemon’:
> ../../../../contrib/fuse-lib/mount.c:106:17: warning: ignoring
> return value of ‘chdir’, declared with attribute
> warn_unused_result [-Wunused-result]
>                  (void)chdir("/");
>                  ^
> ../../../../contrib/fuse-lib/mount.c:110:17: warning: ignoring
> return value of ‘read’, declared with attribute warn_unused_result
> [-Wunused-result]
>                  read (ump[0], &c, 1);
>                  ^
> ../../../../contrib/fuse-lib/mount.c: In function ‘gf_fuse_mount’:
> ../../../../contrib/fuse-lib/mount.c:507:25: warning: ignoring
> return value of ‘write’, declared with attribute
> warn_unused_result [-Wunused-result]
>                          (void)write (status_fd, &ret, sizeof (ret));
>                          ^
>   CC       mount-common.lo
> ../../../../contrib/fuse-lib/mount-common.c: In function
> ‘mtab_needs_update’:
> ../../../../contrib/fuse-lib/mount-common.c:59:25: warning:
> ignoring return value of ‘setreuid’, declared with attribute
> warn_unused_result [-Wunused-result]
>                          setreuid (0, -1);
>                          ^
> ../../../../contrib/fuse-lib/mount-common.c:64:25: warning:
> ignoring return value of ‘setreuid’, declared with attribute
> warn_unused_result [-Wunused-result]
>                          setreuid (ruid, -1);
>                          ^
>   
CCLD fuse.la
>
>
>
> glusterd-pmap.c: In function ‘pmap_registry_bind’:
> glusterd-pmap.c:287:17: warning: ignoring return value of
> ‘asprintf’, declared with attribute warn_unused_result
> [-Wunused-result]
>                  asprintf (&pmap->ports[p].brickname, "%s %s",
> tmp, brickname);
>                  ^
> glusterd-pmap.c: In function ‘pmap_registry_extend’:
> glusterd-pmap.c:346:17: warning: ignoring return value of
> ‘asprintf’, declared with attribute warn_unused_result
> [-Wunused-result]
>                  asprintf (&new_bn, "%s %s", old_bn, brickname);
>
>
>
>   CC       mount-common.o
> ../../contrib/fuse-lib/mount-common.c: In function
> ‘mtab_needs_update’:
> ../../contrib/fuse-lib/mount-common.c:59:25: warning: ignoring
> return value of ‘setreuid’, declared with attribute
> warn_unused_result [-Wunused-result]
>                          setreuid (0, -1);
>                          ^
> ../../contrib/fuse-lib/mount-common.c:64:25: warning: ignoring
> return value of ‘setreuid’, declared with attribute
> warn_unused_result [-Wunused-result]
>                          setreuid (ruid, -1);
>                          ^
>   CCLD     fusermount-glusterfs
>
>
> Amudhan
>
> On Thu, Dec 27, 2018 at 4:38 PM Ravishankar N wrote:
>
>
>
> On 12/27/2018 04:26 PM, Amudhan P wrote:
>> Hi,
>>
>> I am trying to compile & install Glusterfs-4.1.6 using tar
>> file and I am getting this error message when running `make`.
>> ```
>> CC       afr-self-heal-name.lo
>> CC       afr.lo
>> In file included from afr.c:18:0:
>> afr-common.c: In function ‘afr_lookup_entry_heal’:
>> afr-common.c:2892:29: error: implicit declaration of function
>> ‘uuid_is_null’ [-Werror=implicit-function-declaration]
>>                          if (uuid_is_null (gfid)) {
>>                             
 ^
>> cc1: some warnings being treated as errors
>> Makefile:585: recipe for target 'afr.lo' failed
>> make[5]: *** [afr.lo] Error 1
>> Makefile:467: recipe for target 'all-recursive' failed
>> make[4]: *** [all-recursive] Error 1
>> Makefile:467: recipe for target 'all-recursive' failed
>> make[3]: *** [all-recursive] Error 1
>> Makefile:473: recipe for target 'all-recursive' failed
>> make[2]: *** [all-recursive] Error 1
>> Makefile:606: recipe for target 'all-recursive' failed
>> make[1]: *** [all-recursive] Error 1
>> Makefile:497: recipe for target 'all' failed
>> make: *** [all] Error 2
>> ```
>> OS : Ubuntu 16.04
>> file used : glusterfs-4.1.6.tar.gz
>>
>> How to fix this issue?
> Try this fix: https://review.gluster.org/#/c/glusterfs/+/21571/
> -Ravi
>>
>> regards
>> Amudhan
>>
>>
>> _______________________________________________
>> Gluster-users mailing list
>> Gluster-users at gluster.org
>> https://lists.gluster.org/mailman/listinfo/gluster-users
>
> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-users
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From amudhan83 at gmail.com Thu Jan 3 10:55:58 2019
From: amudhan83 at gmail.com (Amudhan P)
Date: Thu, 3 Jan 2019 16:25:58 +0530
Subject: [Gluster-users] Glusterfs 4.1.6
Message-ID:

Hi,

I am working on Glusterfs 4.1.6 on a test machine. I am trying to replace a
faulty disk and below are the steps I did but wasn't successful with that.

3 Nodes, 2 disks per node, Disperse Volume 4+2 :-
Step 1 :- kill pid of the faulty brick in node
Step 2 :- running volume status, shows "N/A" under 'pid' & 'TCP port'
Step 3 :- replace disk and mount new disk in same mount point where the old disk was mounted
Step 4 :- run command "gluster v start volname force"
Step 5 :- running volume status, shows "N/A" under 'pid' & 'TCP port'

expected behavior was a new brick process & heal should have started.

following above said steps 3.10.1 works perfectly, starting a new brick
process and heal begins.
But the same step not working in 4.1.6, Did I miss any steps? what should
be done?

Amudhan
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From aspandey at redhat.com Thu Jan 3 11:38:31 2019
From: aspandey at redhat.com (Ashish Pandey)
Date: Thu, 3 Jan 2019 06:38:31 -0500 (EST)
Subject: [Gluster-users] Glusterfs 4.1.6
In-Reply-To:
References:
Message-ID: <365618439.54755929.1546515511041.JavaMail.zimbra@redhat.com>

Hi,

Some of the steps provided by you are not correct.
You should have used reset-brick command which was introduced for the same
task you wanted to do.

https://docs.gluster.org/en/v3/release-notes/3.9.0/

Although your thinking was correct but replacing a faulty disk requires
some of the additional task which this command will do automatically.

Step 1 :- kill pid of the faulty brick in node >>>>>> This should be done using "reset-brick start" command. follow the steps provided in link.
Step 2 :- running volume status, shows "N/A" under 'pid' & 'TCP port'
Step 3 :- replace disk and mount new disk in same mount point where the old disk was mounted
Step 4 :- run command "gluster v start volname force" >>>>>>>>>>>> This should be done using "reset-brick commit force" command. This will trigger the heal. Follow the link.
Step 5 :- running volume status, shows "N/A" under 'pid' & 'TCP port'

---
Ashish

----- Original Message -----
From: "Amudhan P"
To: "Gluster Users"
Sent: Thursday, January 3, 2019 4:25:58 PM
Subject: [Gluster-users] Glusterfs 4.1.6

Hi,

I am working on Glusterfs 4.1.6 on a test machine. I am trying to replace a
faulty disk and below are the steps I did but wasn't successful with that.

3 Nodes, 2 disks per node, Disperse Volume 4+2 :-
Step 1 :- kill pid of the faulty brick in node
Step 2 :- running volume status, shows "N/A" under 'pid' & 'TCP port'
Step 3 :- replace disk and mount new disk in same mount point where the old disk was mounted
Step 4 :- run command "gluster v start volname force"
Step 5 :- running volume status, shows "N/A" under 'pid' & 'TCP port'

expected behavior was a new brick process & heal should have started.

following above said steps 3.10.1 works perfectly, starting a new brick
process and heal begins.
But the same step not working in 4.1.6, Did I miss any steps? what should
be done?

Amudhan

_______________________________________________
Gluster-users mailing list
Gluster-users at gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From amudhan83 at gmail.com Thu Jan 3 14:21:37 2019
From: amudhan83 at gmail.com (Amudhan P)
Date: Thu, 3 Jan 2019 19:51:37 +0530
Subject: [Gluster-users] Glusterfs 4.1.6
In-Reply-To: <365618439.54755929.1546515511041.JavaMail.zimbra@redhat.com>
References: <365618439.54755929.1546515511041.JavaMail.zimbra@redhat.com>
Message-ID:

Thank you, it works as expected.

On Thu, Jan 3, 2019 at 5:08 PM Ashish Pandey wrote:

> Hi,
>
> Some of the steps provided by you are not correct.
> You should have used reset-brick command which was introduced for the same
> task you wanted to do.
>
>
> https://docs.gluster.org/en/v3/release-notes/3.9.0/
>
> Although your thinking was correct but replacing a faulty disk requires
> some of the additional task which this command
> will do automatically.
>
> Step 1 :- kill pid of the faulty brick in node >>>>>> This should be done
> using "reset-brick start" command. follow the steps provided in link.
> Step 2 :- running volume status, shows "N/A" under 'pid' & 'TCP port'
> Step 3 :- replace disk and mount new disk in same mount point where the
> old disk was mounted
> Step 4 :- run command "gluster v start volname force" >>>>>>>>>>>> This
> should be done using "reset-brick commit force" command. This will trigger
> the heal. Follow the link.
> Step 5 :- running volume status, shows "N/A" under 'pid' & 'TCP port'
>
> ---
> Ashish
>
> ------------------------------
> *From: *"Amudhan P"
> *To: *"Gluster Users"
> *Sent: *Thursday, January 3, 2019 4:25:58 PM
> *Subject: *[Gluster-users] Glusterfs 4.1.6
>
> Hi,
>
> I am working on Glusterfs 4.1.6 on a test machine. I am trying to replace
> a faulty disk and below are the steps I did but wasn't successful with that.
>
> 3 Nodes, 2 disks per node, Disperse Volume 4+2 :-
> Step 1 :- kill pid of the faulty brick in node
> Step 2 :- running volume status, shows "N/A" under 'pid' & 'TCP port'
> Step 3 :- replace disk and mount new disk in same mount point where the
> old disk was mounted
> Step 4 :- run command "gluster v start volname force"
> Step 5 :- running volume status, shows "N/A" under 'pid' & 'TCP port'
>
> expected behavior was a new brick process & heal should have started.
>
> following above said steps 3.10.1 works perfectly, starting a new brick
> process and heal begins.
> But the same step not working in 4.1.6, Did I miss any steps? what should
> be done?
>
> Amudhan
>
> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-users
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From nbalacha at redhat.com Fri Jan 4 06:50:36 2019
From: nbalacha at redhat.com (Nithya Balachandran)
Date: Fri, 4 Jan 2019 12:20:36 +0530
Subject: [Gluster-users] [Stale file handle] in shard volume
In-Reply-To:
References:
Message-ID:

Adding Krutika.

On Wed, 2 Jan 2019 at 20:56, Olaf Buitelaar wrote:

> Hi Nithya,
>
> Thank you for your reply.
>
> the VM's using the gluster volumes keeps on getting paused/stopped on
> errors like these;
> [2019-01-02 02:33:44.469132] E [MSGID: 133010]
> [shard.c:1724:shard_common_lookup_shards_cbk] 0-ovirt-kube-shard: Lookup on
> shard 101487 failed. Base file gfid = a38d64bc-a28b-4ee1-a0bb-f919e7a1022c
> [Stale file handle]
> [2019-01-02 02:33:44.563288] E [MSGID: 133010]
> [shard.c:1724:shard_common_lookup_shards_cbk] 0-ovirt-kube-shard: Lookup on
> shard 101488 failed. Base file gfid = a38d64bc-a28b-4ee1-a0bb-f919e7a1022c
> [Stale file handle]
>
>

Krutika, Can you take a look at this?

>
> What i'm trying to find out, if i can purge all gluster volumes from all
> possible stale file handles (and hopefully find a method to prevent this in
> the future), so the VM's can start running stable again.
> For this i need to know when the "shard_common_lookup_shards_cbk" function
> considers a file as stale.
> The statement; "Stale file handle errors show up when a file with a
> specified gfid is not found." doesn't seem to cover it all, as i've shown
> in earlier mails the shard file and glusterfs/xx/xx/uuid file do both
> exist, and have the same inode.
> If the criteria i'm using aren't correct, could you please tell me which
> criteria i should use to determine if a file is stale or not?
> these criteria are just based observations i made, moving the stale files
> manually. After removing them i was able to start the VM again..until some
> time later it hangs on another stale shard file unfortunate.
>
> Thanks Olaf
>
> On Wed, 2 Jan 2019 at 14:20, Nithya Balachandran <
> nbalacha at redhat.com> wrote:
>
>>
>>
>> On Mon, 31 Dec 2018 at 01:27, Olaf Buitelaar
>> wrote:
>>
>>> Dear All,
>>>
>>> till now a selected group of VM's still seem to produce new stale file's
>>> and getting paused due to this.
>>> I've not updated gluster recently, however i did change the op version
>>> from 31200 to 31202 about a week before this issue arose.
>>> Looking at the .shard directory, i've 100.000+ files sharing the same
>>> characteristics as a stale file. which are found till now,
>>> they all have the sticky bit set, e.g. file permissions; ---------T. are
>>> 0kb in size, and have the trusted.glusterfs.dht.linkto attribute.
>>>
>>
>> These are internal files used by gluster and do not necessarily mean they
>> are stale. They "point" to data files which may be on different bricks
>> (same name, gfid etc but no linkto xattr and no ----T permissions).
>>
>>
>>> These files range from long a go (beginning of the year) till now. Which
>>> makes me suspect this was laying dormant for some time now..and somehow
>>> recently surfaced.
>>> Checking other sub-volumes they contain also 0kb files in the .shard
>>> directory, but don't have the sticky bit and the linkto attribute.
>>>
>>> Does anybody else experience this issue? Could this be a bug or an
>>> environmental issue?
>>>
>> These are most likely valid files- please do not delete them without
>> double-checking.
>>
>> Stale file handle errors show up when a file with a specified gfid is not
>> found. You will need to debug the files for which you see this error by
>> checking the bricks to see if they actually exist.
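
The brick-side check suggested above (does an entry for a given gfid actually exist on a brick?) can be sketched in shell. This is only an illustration and not a Gluster tool: `gfid_to_path` is a hypothetical helper name, and the `.glusterfs/<first two hex digits>/<next two>/<gfid>` layout is an assumption inferred from the `getfattr` and `stat` output quoted elsewhere in this thread.

```shell
#!/bin/sh
# Hypothetical helper: turn a gfid, given either as the hex value printed by
# `getfattr -e hex` (0x...) or in dashed UUID form, into the relative path of
# its entry under a brick's .glusterfs directory.
# Assumed layout (taken from the paths quoted in this thread):
#   .glusterfs/<hex digits 1-2>/<hex digits 3-4>/<gfid in UUID form>
gfid_to_path() {
    # Drop an optional 0x prefix and any dashes, and lowercase the digits.
    hex=$(printf '%s' "$1" | sed -e 's/^0x//' -e 's/-//g' | tr 'A-F' 'a-f')

    # Reject anything that is not exactly 32 hex digits (128 bits).
    case "$hex" in
        *[!0-9a-f]*) return 1 ;;
    esac
    [ "${#hex}" -eq 32 ] || return 1

    # Re-insert the dashes in the usual 8-4-4-4-12 grouping.
    uuid=$(printf '%s' "$hex" |
        sed -e 's/^\(.\{8\}\)\(.\{4\}\)\(.\{4\}\)\(.\{4\}\)\(.\{12\}\)$/\1-\2-\3-\4-\5/')

    d1=$(printf '%s' "$hex" | cut -c1-2)
    d2=$(printf '%s' "$hex" | cut -c3-4)
    printf '.glusterfs/%s/%s/%s\n' "$d1" "$d2" "$uuid"
}
```

Run from a brick root, something like `stat "$(gfid_to_path 0x1f86b4328ec6424699aa48cc6d7b5da0)"` would then show whether the entry exists on that brick; which bricks to check, and what to do with a missing or unexpected entry, is left open here.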
>>
>>>
>>> Also i wonder if there is any tool or gluster command to clean all stale
>>> file handles?
>>> Otherwise i'm planning to make a simple bash script, which iterates over
>>> the .shard dir, checks each file for the above mentioned criteria, and
>>> (re)moves the file and the corresponding .glusterfs file.
>>> If there are other criteria needed to identify a stale file handle, i
>>> would like to hear that.
>>> If this is a viable and safe operation to do of course.
>>>
>>> Thanks Olaf
>>>
>>>
>>>
>>> On Thu, 20 Dec 2018 at 13:43, Olaf Buitelaar <
>>> olaf.buitelaar at gmail.com> wrote:
>>>
>>>> Dear All,
>>>>
>>>> I figured it out, it appeared to be the exact same issue as described
>>>> here;
>>>> https://lists.gluster.org/pipermail/gluster-users/2018-March/033785.html
>>>> Another subvolume also had the shard file, only were all 0 bytes and
>>>> had the dht.linkto
>>>>
>>>> for reference;
>>>> [root at lease-04 ovirt-backbone-2]# getfattr -d -m . -e hex
>>>> .shard/f28cabcb-d169-41fc-a633-9bef4c4a8e40.51500
>>>> # file: .shard/f28cabcb-d169-41fc-a633-9bef4c4a8e40.51500
>>>>
>>>> security.selinux=0x73797374656d5f753a6f626a6563745f723a6574635f72756e74696d655f743a733000
>>>> trusted.gfid=0x298147e49f9748b2baf1c8fff897244d
>>>>
>>>> trusted.gfid2path.b48064c78d7a85c9=0x62653331383633382d653861302d346336642d393737642d3761393337616138343830362f66323863616263622d643136392d343166632d613633332d3962656634633461386534302e3531353030
>>>>
>>>> trusted.glusterfs.dht.linkto=0x6f766972742d6261636b626f6e652d322d7265706c69636174652d3100
>>>>
>>>> [root at lease-04 ovirt-backbone-2]# getfattr -d -m .
-e hex
>>>> .glusterfs/29/81/298147e4-9f97-48b2-baf1-c8fff897244d
>>>> # file: .glusterfs/29/81/298147e4-9f97-48b2-baf1-c8fff897244d
>>>>
>>>> security.selinux=0x73797374656d5f753a6f626a6563745f723a6574635f72756e74696d655f743a733000
>>>> trusted.gfid=0x298147e49f9748b2baf1c8fff897244d
>>>>
>>>> trusted.gfid2path.b48064c78d7a85c9=0x62653331383633382d653861302d346336642d393737642d3761393337616138343830362f66323863616263622d643136392d343166632d613633332d3962656634633461386534302e3531353030
>>>>
>>>> trusted.glusterfs.dht.linkto=0x6f766972742d6261636b626f6e652d322d7265706c69636174652d3100
>>>>
>>>> [root at lease-04 ovirt-backbone-2]# stat
>>>> .glusterfs/29/81/298147e4-9f97-48b2-baf1-c8fff897244d
>>>> File: ‘.glusterfs/29/81/298147e4-9f97-48b2-baf1-c8fff897244d’
>>>> Size: 0 Blocks: 0 IO Block: 4096 regular
>>>> empty file
>>>> Device: fd01h/64769d Inode: 1918631406 Links: 2
>>>> Access: (1000/---------T) Uid: ( 0/ root) Gid: ( 0/ root)
>>>> Context: system_u:object_r:etc_runtime_t:s0
>>>> Access: 2018-12-17 21:43:36.405735296 +0000
>>>> Modify: 2018-12-17 21:43:36.405735296 +0000
>>>> Change: 2018-12-17 21:43:36.405735296 +0000
>>>> Birth: -
>>>>
>>>> removing the shard file and glusterfs file from each node resolved the
>>>> issue.
>>>>
>>>> I also found this thread;
>>>> https://lists.gluster.org/pipermail/gluster-users/2018-December/035460.html
>>>> Maybe he suffers from the same issue.
>>>>
>>>> Best Olaf
>>>>
>>>>
>>>> On Wed, 19 Dec 2018 at 21:56, Olaf Buitelaar <
>>>> olaf.buitelaar at gmail.com> wrote:
>>>>
>>>>> Dear All,
>>>>>
>>>>> It appears i've a stale file in one of the volumes, on 2 files. These
>>>>> files are qemu images (1 raw and 1 qcow2).
>>>>> I'll just focus on 1 file since the situation on the other seems the
>>>>> same.
>>>>>
>>>>> The VM get's paused more or less directly after being booted with
>>>>> error;
>>>>> [2018-12-18 14:05:05.275713] E [MSGID: 133010]
>>>>> [shard.c:1724:shard_common_lookup_shards_cbk] 0-ovirt-backbone-2-shard:
>>>>> Lookup on shard 51500 failed. Base file gfid =
>>>>> f28cabcb-d169-41fc-a633-9bef4c4a8e40 [Stale file handle]
>>>>>
>>>>> investigating the shard;
>>>>>
>>>>> #on the arbiter node:
>>>>>
>>>>> [root at lease-05 ovirt-backbone-2]# getfattr -n glusterfs.gfid.string
>>>>> /mnt/ovirt-backbone-2/b1c2c949-aef4-4aec-999b-b179efeef732/images/f6ac9660-a84e-469e-a17c-c6dbc538af4b/d6b09501-5b79-4c92-bf10-2ed3bda1b425
>>>>> getfattr: Removing leading '/' from absolute path names
>>>>> # file:
>>>>> mnt/ovirt-backbone-2/b1c2c949-aef4-4aec-999b-b179efeef732/images/f6ac9660-a84e-469e-a17c-c6dbc538af4b/d6b09501-5b79-4c92-bf10-2ed3bda1b425
>>>>> glusterfs.gfid.string="f28cabcb-d169-41fc-a633-9bef4c4a8e40"
>>>>>
>>>>> [root at lease-05 ovirt-backbone-2]# getfattr -d -m . -e hex
>>>>> .shard/f28cabcb-d169-41fc-a633-9bef4c4a8e40.51500
>>>>> # file: .shard/f28cabcb-d169-41fc-a633-9bef4c4a8e40.51500
>>>>>
>>>>> security.selinux=0x73797374656d5f753a6f626a6563745f723a6574635f72756e74696d655f743a733000
>>>>> trusted.afr.dirty=0x000000000000000000000000
>>>>> trusted.gfid=0x1f86b4328ec6424699aa48cc6d7b5da0
>>>>>
>>>>> trusted.gfid2path.b48064c78d7a85c9=0x62653331383633382d653861302d346336642d393737642d3761393337616138343830362f66323863616263622d643136392d343166632d613633332d3962656634633461386534302e3531353030
>>>>>
>>>>> [root at lease-05 ovirt-backbone-2]# getfattr -d -m .
-e hex
>>>>> .glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0
>>>>> # file: .glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0
>>>>>
>>>>> security.selinux=0x73797374656d5f753a6f626a6563745f723a6574635f72756e74696d655f743a733000
>>>>> trusted.afr.dirty=0x000000000000000000000000
>>>>> trusted.gfid=0x1f86b4328ec6424699aa48cc6d7b5da0
>>>>>
>>>>> trusted.gfid2path.b48064c78d7a85c9=0x62653331383633382d653861302d346336642d393737642d3761393337616138343830362f66323863616263622d643136392d343166632d613633332d3962656634633461386534302e3531353030
>>>>>
>>>>> [root at lease-05 ovirt-backbone-2]# stat
>>>>> .glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0
>>>>> File: ‘.glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0’
>>>>> Size: 0 Blocks: 0 IO Block: 4096 regular
>>>>> empty file
>>>>> Device: fd01h/64769d Inode: 537277306 Links: 2
>>>>> Access: (0660/-rw-rw----) Uid: ( 0/ root) Gid: ( 0/
>>>>> root)
>>>>> Context: system_u:object_r:etc_runtime_t:s0
>>>>> Access: 2018-12-17 21:43:36.361984810 +0000
>>>>> Modify: 2018-12-17 21:43:36.361984810 +0000
>>>>> Change: 2018-12-18 20:55:29.908647417 +0000
>>>>> Birth: -
>>>>>
>>>>> [root at lease-05 ovirt-backbone-2]# find . -inum 537277306
>>>>> ./.glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0
>>>>> ./.shard/f28cabcb-d169-41fc-a633-9bef4c4a8e40.51500
>>>>>
>>>>> #on the data nodes:
>>>>>
>>>>> [root at lease-08 ~]# getfattr -n glusterfs.gfid.string
>>>>> /mnt/ovirt-backbone-2/b1c2c949-aef4-4aec-999b-b179efeef732/images/f6ac9660-a84e-469e-a17c-c6dbc538af4b/d6b09501-5b79-4c92-bf10-2ed3bda1b425
>>>>> getfattr: Removing leading '/' from absolute path names
>>>>> # file:
>>>>> mnt/ovirt-backbone-2/b1c2c949-aef4-4aec-999b-b179efeef732/images/f6ac9660-a84e-469e-a17c-c6dbc538af4b/d6b09501-5b79-4c92-bf10-2ed3bda1b425
>>>>> glusterfs.gfid.string="f28cabcb-d169-41fc-a633-9bef4c4a8e40"
>>>>>
>>>>> [root at lease-08 ovirt-backbone-2]# getfattr -d -m .
-e hex
>>>>> .shard/f28cabcb-d169-41fc-a633-9bef4c4a8e40.51500
>>>>> # file: .shard/f28cabcb-d169-41fc-a633-9bef4c4a8e40.51500
>>>>>
>>>>> security.selinux=0x73797374656d5f753a6f626a6563745f723a6574635f72756e74696d655f743a733000
>>>>> trusted.afr.dirty=0x000000000000000000000000
>>>>> trusted.gfid=0x1f86b4328ec6424699aa48cc6d7b5da0
>>>>>
>>>>> trusted.gfid2path.b48064c78d7a85c9=0x62653331383633382d653861302d346336642d393737642d3761393337616138343830362f66323863616263622d643136392d343166632d613633332d3962656634633461386534302e3531353030
>>>>>
>>>>> [root at lease-08 ovirt-backbone-2]# getfattr -d -m . -e hex
>>>>> .glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0
>>>>> # file: .glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0
>>>>>
>>>>> security.selinux=0x73797374656d5f753a6f626a6563745f723a6574635f72756e74696d655f743a733000
>>>>> trusted.afr.dirty=0x000000000000000000000000
>>>>> trusted.gfid=0x1f86b4328ec6424699aa48cc6d7b5da0
>>>>>
>>>>> trusted.gfid2path.b48064c78d7a85c9=0x62653331383633382d653861302d346336642d393737642d3761393337616138343830362f66323863616263622d643136392d343166632d613633332d3962656634633461386534302e3531353030
>>>>>
>>>>> [root at lease-08 ovirt-backbone-2]# stat
>>>>> .glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0
>>>>> File: ‘.glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0’
>>>>> Size: 2166784 Blocks: 4128 IO Block: 4096 regular
>>>>> file
>>>>> Device: fd03h/64771d Inode: 12893624759 Links: 3
>>>>> Access: (0660/-rw-rw----) Uid: ( 0/ root) Gid: ( 0/
>>>>> root)
>>>>> Context: system_u:object_r:etc_runtime_t:s0
>>>>> Access: 2018-12-18 18:52:38.070776585 +0000
>>>>> Modify: 2018-12-17 21:43:36.388054443 +0000
>>>>> Change: 2018-12-18 21:01:47.810506528 +0000
>>>>> Birth: -
>>>>>
>>>>> [root at lease-08 ovirt-backbone-2]# find .
-inum 12893624759
>>>>> ./.glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0
>>>>> ./.shard/f28cabcb-d169-41fc-a633-9bef4c4a8e40.51500
>>>>>
>>>>> ========================
>>>>>
>>>>> [root at lease-11 ovirt-backbone-2]# getfattr -n glusterfs.gfid.string
>>>>> /mnt/ovirt-backbone-2/b1c2c949-aef4-4aec-999b-b179efeef732/images/f6ac9660-a84e-469e-a17c-c6dbc538af4b/d6b09501-5b79-4c92-bf10-2ed3bda1b425
>>>>> getfattr: Removing leading '/' from absolute path names
>>>>> # file:
>>>>> mnt/ovirt-backbone-2/b1c2c949-aef4-4aec-999b-b179efeef732/images/f6ac9660-a84e-469e-a17c-c6dbc538af4b/d6b09501-5b79-4c92-bf10-2ed3bda1b425
>>>>> glusterfs.gfid.string="f28cabcb-d169-41fc-a633-9bef4c4a8e40"
>>>>>
>>>>> [root at lease-11 ovirt-backbone-2]# getfattr -d -m . -e hex
>>>>> .shard/f28cabcb-d169-41fc-a633-9bef4c4a8e40.51500
>>>>> # file: .shard/f28cabcb-d169-41fc-a633-9bef4c4a8e40.51500
>>>>>
>>>>> security.selinux=0x73797374656d5f753a6f626a6563745f723a6574635f72756e74696d655f743a733000
>>>>> trusted.afr.dirty=0x000000000000000000000000
>>>>> trusted.gfid=0x1f86b4328ec6424699aa48cc6d7b5da0
>>>>>
>>>>> trusted.gfid2path.b48064c78d7a85c9=0x62653331383633382d653861302d346336642d393737642d3761393337616138343830362f66323863616263622d643136392d343166632d613633332d3962656634633461386534302e3531353030
>>>>>
>>>>> [root at lease-11 ovirt-backbone-2]# getfattr -d -m .
-e hex
>>>>> .glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0
>>>>> # file: .glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0
>>>>>
>>>>> security.selinux=0x73797374656d5f753a6f626a6563745f723a6574635f72756e74696d655f743a733000
>>>>> trusted.afr.dirty=0x000000000000000000000000
>>>>> trusted.gfid=0x1f86b4328ec6424699aa48cc6d7b5da0
>>>>>
>>>>> trusted.gfid2path.b48064c78d7a85c9=0x62653331383633382d653861302d346336642d393737642d3761393337616138343830362f66323863616263622d643136392d343166632d613633332d3962656634633461386534302e3531353030
>>>>>
>>>>> [root at lease-11 ovirt-backbone-2]# stat
>>>>> .glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0
>>>>> File: ‘.glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0’
>>>>> Size: 2166784 Blocks: 4128 IO Block: 4096 regular
>>>>> file
>>>>> Device: fd03h/64771d Inode: 12956094809 Links: 3
>>>>> Access: (0660/-rw-rw----) Uid: ( 0/ root) Gid: ( 0/
>>>>> root)
>>>>> Context: system_u:object_r:etc_runtime_t:s0
>>>>> Access: 2018-12-18 20:11:53.595208449 +0000
>>>>> Modify: 2018-12-17 21:43:36.391580259 +0000
>>>>> Change: 2018-12-18 19:19:25.888055392 +0000
>>>>> Birth: -
>>>>>
>>>>> [root at lease-11 ovirt-backbone-2]# find . -inum 12956094809
>>>>> ./.glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0
>>>>> ./.shard/f28cabcb-d169-41fc-a633-9bef4c4a8e40.51500
>>>>>
>>>>> ================
>>>>>
>>>>> I don't really see any inconsistencies, except the dates on the stat.
>>>>> However this is only after i tried moving the file out of the volumes to
>>>>> force a heal, which does happen on the data nodes, but not on the arbiter
>>>>> node. Before that they were also the same.
>>>>> I've also compared the file
>>>>> ./.shard/f28cabcb-d169-41fc-a633-9bef4c4a8e40.51500 on the 2 nodes and they
>>>>> are exactly the same.
>>>>>
>>>>> Things i've further tried;
>>>>> - gluster v heal ovirt-backbone-2 full => gluster v heal
>>>>> ovirt-backbone-2 info reports 0 entries on all nodes
>>>>>
>>>>> - stop each glusterd and glusterfsd, pause around 40sec and start them
>>>>> again on each node, 1 at a time, waiting for the heal to recover before
>>>>> moving to the next node
>>>>>
>>>>> - force a heal by stopping glusterd on a node and perform these steps;
>>>>> mkdir /mnt/ovirt-backbone-2/trigger
>>>>> rmdir /mnt/ovirt-backbone-2/trigger
>>>>> setfattr -n trusted.non-existent-key -v abc /mnt/ovirt-backbone-2/
>>>>> setfattr -x trusted.non-existent-key /mnt/ovirt-backbone-2/
>>>>> start glusterd
>>>>>
>>>>> - gluster volume rebalance ovirt-backbone-2 start => success
>>>>>
>>>>> Whats further interesting is that according the mount log, the volume
>>>>> is in split-brain;
>>>>> [2018-12-18 10:06:04.606870] E [MSGID: 108008]
>>>>> [afr-read-txn.c:90:afr_read_txn_refresh_done]
>>>>> 0-ovirt-backbone-2-replicate-2: Failing FSTAT on gfid
>>>>> 2a57d87d-fe49-4034-919b-fdb79531bf68: split-brain observed. [Input/output
>>>>> error]
>>>>> [2018-12-18 10:06:04.606908] E [MSGID: 133014]
>>>>> [shard.c:1248:shard_common_stat_cbk] 0-ovirt-backbone-2-shard: stat failed:
>>>>> 2a57d87d-fe49-4034-919b-fdb79531bf68 [Input/output error]
>>>>> [2018-12-18 10:06:04.606927] W [fuse-bridge.c:871:fuse_attr_cbk]
>>>>> 0-glusterfs-fuse: 428090: FSTAT()
>>>>> /b1c2c949-aef4-4aec-999b-b179efeef732/dom_md/ids => -1 (Input/output error)
>>>>> [2018-12-18 10:06:05.107729] E [MSGID: 108008]
>>>>> [afr-read-txn.c:90:afr_read_txn_refresh_done]
>>>>> 0-ovirt-backbone-2-replicate-2: Failing FSTAT on gfid
>>>>> 2a57d87d-fe49-4034-919b-fdb79531bf68: split-brain observed.
[Input/output >>>>> error] >>>>> [2018-12-18 10:06:05.107770] E [MSGID: 133014] >>>>> [shard.c:1248:shard_common_stat_cbk] 0-ovirt-backbone-2-shard: stat failed: >>>>> 2a57d87d-fe49-4034-919b-fdb79531bf68 [Input/output error] >>>>> [2018-12-18 10:06:05.107791] W [fuse-bridge.c:871:fuse_attr_cbk] >>>>> 0-glusterfs-fuse: 428091: FSTAT() >>>>> /b1c2c949-aef4-4aec-999b-b179efeef732/dom_md/ids => -1 (Input/output error) >>>>> [2018-12-18 10:06:05.537244] I [MSGID: 108006] >>>>> [afr-common.c:5494:afr_local_init] 0-ovirt-backbone-2-replicate-1: no >>>>> subvolumes up >>>>> [2018-12-18 10:06:05.538523] E [MSGID: 108008] >>>>> [afr-read-txn.c:90:afr_read_txn_refresh_done] >>>>> 0-ovirt-backbone-2-replicate-2: Failing STAT on gfid >>>>> 00000000-0000-0000-0000-000000000001: split-brain observed. [Input/output >>>>> error] >>>>> [2018-12-18 10:06:05.538685] I [MSGID: 108006] >>>>> [afr-common.c:5494:afr_local_init] 0-ovirt-backbone-2-replicate-1: no >>>>> subvolumes up >>>>> [2018-12-18 10:06:05.538794] I [MSGID: 108006] >>>>> [afr-common.c:5494:afr_local_init] 0-ovirt-backbone-2-replicate-1: no >>>>> subvolumes up >>>>> [2018-12-18 10:06:05.539342] I [MSGID: 109063] >>>>> [dht-layout.c:716:dht_layout_normalize] 0-ovirt-backbone-2-dht: Found >>>>> anomalies in /b1c2c949-aef4-4aec-999b-b179efeef732 (gfid = >>>>> 8c8598ce-1a52-418e-a7b4-435fee34bae8). Holes=2 overlaps=0 >>>>> [2018-12-18 10:06:05.539372] W [MSGID: 109005] >>>>> [dht-selfheal.c:2158:dht_selfheal_directory] 0-ovirt-backbone-2-dht: >>>>> Directory selfheal failed: 2 subvolumes down.Not fixing. 
path = >>>>> /b1c2c949-aef4-4aec-999b-b179efeef732, gfid = >>>>> 8c8598ce-1a52-418e-a7b4-435fee34bae8 >>>>> [2018-12-18 10:06:05.539694] I [MSGID: 108006] >>>>> [afr-common.c:5494:afr_local_init] 0-ovirt-backbone-2-replicate-1: no >>>>> subvolumes up >>>>> [2018-12-18 10:06:05.540652] I [MSGID: 108006] >>>>> [afr-common.c:5494:afr_local_init] 0-ovirt-backbone-2-replicate-1: no >>>>> subvolumes up >>>>> [2018-12-18 10:06:05.608612] E [MSGID: 108008] >>>>> [afr-read-txn.c:90:afr_read_txn_refresh_done] >>>>> 0-ovirt-backbone-2-replicate-2: Failing FSTAT on gfid >>>>> 2a57d87d-fe49-4034-919b-fdb79531bf68: split-brain observed. [Input/output >>>>> error] >>>>> [2018-12-18 10:06:05.608657] E [MSGID: 133014] >>>>> [shard.c:1248:shard_common_stat_cbk] 0-ovirt-backbone-2-shard: stat failed: >>>>> 2a57d87d-fe49-4034-919b-fdb79531bf68 [Input/output error] >>>>> [2018-12-18 10:06:05.608672] W [fuse-bridge.c:871:fuse_attr_cbk] >>>>> 0-glusterfs-fuse: 428096: FSTAT() >>>>> /b1c2c949-aef4-4aec-999b-b179efeef732/dom_md/ids => -1 (Input/output error) >>>>> [2018-12-18 10:06:06.109339] E [MSGID: 108008] >>>>> [afr-read-txn.c:90:afr_read_txn_refresh_done] >>>>> 0-ovirt-backbone-2-replicate-2: Failing FSTAT on gfid >>>>> 2a57d87d-fe49-4034-919b-fdb79531bf68: split-brain observed. 
[Input/output >>>>> error] >>>>> [2018-12-18 10:06:06.109378] E [MSGID: 133014] >>>>> [shard.c:1248:shard_common_stat_cbk] 0-ovirt-backbone-2-shard: stat failed: >>>>> 2a57d87d-fe49-4034-919b-fdb79531bf68 [Input/output error] >>>>> [2018-12-18 10:06:06.109399] W [fuse-bridge.c:871:fuse_attr_cbk] >>>>> 0-glusterfs-fuse: 428097: FSTAT() >>>>> /b1c2c949-aef4-4aec-999b-b179efeef732/dom_md/ids => -1 (Input/output error) >>>>> >>>>> #note i'm able to see ; >>>>> /b1c2c949-aef4-4aec-999b-b179efeef732/dom_md/ids >>>>> [root at lease-11 ovirt-backbone-2]# stat >>>>> /mnt/ovirt-backbone-2/b1c2c949-aef4-4aec-999b-b179efeef732/dom_md/ids >>>>> File: >>>>> ?/mnt/ovirt-backbone-2/b1c2c949-aef4-4aec-999b-b179efeef732/dom_md/ids? >>>>> Size: 1048576 Blocks: 2048 IO Block: 131072 regular >>>>> file >>>>> Device: 41h/65d Inode: 10492258721813610344 Links: 1 >>>>> Access: (0660/-rw-rw----) Uid: ( 36/ vdsm) Gid: ( 36/ >>>>> kvm) >>>>> Context: system_u:object_r:fusefs_t:s0 >>>>> Access: 2018-12-19 20:07:39.917573869 +0000 >>>>> Modify: 2018-12-19 20:07:39.928573917 +0000 >>>>> Change: 2018-12-19 20:07:39.929573921 +0000 >>>>> Birth: - >>>>> >>>>> however checking: gluster v heal ovirt-backbone-2 info split-brain >>>>> reports no entries. >>>>> >>>>> I've also tried mounting the qemu image, and this works fine, i'm able >>>>> to see all contents; >>>>> losetup /dev/loop0 >>>>> /mnt/ovirt-backbone-2/b1c2c949-aef4-4aec-999b-b179efeef732/images/f6ac9660-a84e-469e-a17c-c6dbc538af4b/d6b09501-5b79-4c92-bf10-2ed3bda1b425 >>>>> kpartx -a /dev/loop0 >>>>> vgscan >>>>> vgchange -ay slave-data >>>>> mkdir /mnt/slv01 >>>>> mount /dev/mapper/slave--data-lvol0 /mnt/slv01/ >>>>> >>>>> Possible causes for this issue; >>>>> 1. the machine "lease-11" suffered from a faulty RAM module (ECC), >>>>> which halted the machine and causes an invalid state. (this machine also >>>>> hosts other volumes, with similar configurations, which report no issue) >>>>> 2. 
after the RAM module was replaced, the VM using the backing qemu >>>>> image, was restored from a backup (the backup was file based within the VM >>>>> on a different directory). This is because some files were corrupted. The >>>>> backup/recovery obviously causes extra IO, possible introducing race >>>>> conditions? The machine did run for about 12h without issues, and in total >>>>> for about 36h. >>>>> 3. since only the client (maybe only gfapi?) reports errors, something >>>>> is broken there? >>>>> >>>>> The volume info; >>>>> root at lease-06 ~# gluster v info ovirt-backbone-2 >>>>> >>>>> Volume Name: ovirt-backbone-2 >>>>> Type: Distributed-Replicate >>>>> Volume ID: 85702d35-62c8-4c8c-930d-46f455a8af28 >>>>> Status: Started >>>>> Snapshot Count: 0 >>>>> Number of Bricks: 3 x (2 + 1) = 9 >>>>> Transport-type: tcp >>>>> Bricks: >>>>> Brick1: 10.32.9.7:/data/gfs/bricks/brick1/ovirt-backbone-2 >>>>> Brick2: 10.32.9.3:/data/gfs/bricks/brick1/ovirt-backbone-2 >>>>> Brick3: 10.32.9.4:/data/gfs/bricks/bricka/ovirt-backbone-2 (arbiter) >>>>> Brick4: 10.32.9.8:/data0/gfs/bricks/brick1/ovirt-backbone-2 >>>>> Brick5: 10.32.9.21:/data0/gfs/bricks/brick1/ovirt-backbone-2 >>>>> Brick6: 10.32.9.5:/data/gfs/bricks/bricka/ovirt-backbone-2 (arbiter) >>>>> Brick7: 10.32.9.9:/data0/gfs/bricks/brick1/ovirt-backbone-2 >>>>> Brick8: 10.32.9.20:/data0/gfs/bricks/brick1/ovirt-backbone-2 >>>>> Brick9: 10.32.9.6:/data/gfs/bricks/bricka/ovirt-backbone-2 (arbiter) >>>>> Options Reconfigured: >>>>> nfs.disable: on >>>>> transport.address-family: inet >>>>> performance.quick-read: off >>>>> performance.read-ahead: off >>>>> performance.io-cache: off >>>>> performance.low-prio-threads: 32 >>>>> network.remote-dio: enable >>>>> cluster.eager-lock: enable >>>>> cluster.quorum-type: auto >>>>> cluster.server-quorum-type: server >>>>> cluster.data-self-heal-algorithm: full >>>>> cluster.locking-scheme: granular >>>>> cluster.shd-max-threads: 8 >>>>> cluster.shd-wait-qlength: 10000 >>>>> 
features.shard: on >>>>> user.cifs: off >>>>> storage.owner-uid: 36 >>>>> storage.owner-gid: 36 >>>>> features.shard-block-size: 64MB >>>>> performance.write-behind-window-size: 512MB >>>>> performance.cache-size: 384MB >>>>> cluster.brick-multiplex: on >>>>> >>>>> The volume status; >>>>> root at lease-06 ~# gluster v status ovirt-backbone-2 >>>>> Status of volume: ovirt-backbone-2 >>>>> Gluster process TCP Port RDMA Port >>>>> Online Pid >>>>> >>>>> ------------------------------------------------------------------------------ >>>>> Brick 10.32.9.7:/data/gfs/bricks/brick1/ovi >>>>> rt-backbone-2 49152 0 >>>>> Y 7727 >>>>> Brick 10.32.9.3:/data/gfs/bricks/brick1/ovi >>>>> rt-backbone-2 49152 0 >>>>> Y 12620 >>>>> Brick 10.32.9.4:/data/gfs/bricks/bricka/ovi >>>>> rt-backbone-2 49152 0 >>>>> Y 8794 >>>>> Brick 10.32.9.8:/data0/gfs/bricks/brick1/ov >>>>> irt-backbone-2 49161 0 >>>>> Y 22333 >>>>> Brick 10.32.9.21:/data0/gfs/bricks/brick1/o >>>>> virt-backbone-2 49152 0 >>>>> Y 15030 >>>>> Brick 10.32.9.5:/data/gfs/bricks/bricka/ovi >>>>> rt-backbone-2 49166 0 >>>>> Y 24592 >>>>> Brick 10.32.9.9:/data0/gfs/bricks/brick1/ov >>>>> irt-backbone-2 49153 0 >>>>> Y 20148 >>>>> Brick 10.32.9.20:/data0/gfs/bricks/brick1/o >>>>> virt-backbone-2 49154 0 >>>>> Y 15413 >>>>> Brick 10.32.9.6:/data/gfs/bricks/bricka/ovi >>>>> rt-backbone-2 49152 0 >>>>> Y 43120 >>>>> Self-heal Daemon on localhost N/A N/A >>>>> Y 44587 >>>>> Self-heal Daemon on 10.201.0.2 N/A N/A >>>>> Y 8401 >>>>> Self-heal Daemon on 10.201.0.5 N/A N/A >>>>> Y 11038 >>>>> Self-heal Daemon on 10.201.0.8 N/A N/A >>>>> Y 9513 >>>>> Self-heal Daemon on 10.32.9.4 N/A N/A >>>>> Y 23736 >>>>> Self-heal Daemon on 10.32.9.20 N/A N/A >>>>> Y 2738 >>>>> Self-heal Daemon on 10.32.9.3 N/A N/A >>>>> Y 25598 >>>>> Self-heal Daemon on 10.32.9.5 N/A N/A >>>>> Y 511 >>>>> Self-heal Daemon on 10.32.9.9 N/A N/A >>>>> Y 23357 >>>>> Self-heal Daemon on 10.32.9.8 N/A N/A >>>>> Y 15225 >>>>> Self-heal Daemon on 10.32.9.7 N/A N/A >>>>> Y 
25781 >>>>> Self-heal Daemon on 10.32.9.21 N/A N/A >>>>> Y 5034 >>>>> >>>>> Task Status of Volume ovirt-backbone-2 >>>>> >>>>> ------------------------------------------------------------------------------ >>>>> Task : Rebalance >>>>> ID : 6dfbac43-0125-4568-9ac3-a2c453faaa3d >>>>> Status : completed >>>>> >>>>> gluster version is @3.12.15 and cluster.op-version=31202 >>>>> >>>>> ======================== >>>>> >>>>> It would be nice to know if it's possible to mark the files as not >>>>> stale or if i should investigate other things? >>>>> Or should we consider this volume lost? >>>>> Also checking the code at; >>>>> https://github.com/gluster/glusterfs/blob/master/xlators/features/shard/src/shard.c >>>>> it seems the functions shifted quite some (line 1724 vs. 2243), so maybe >>>>> it's fixed in a future version? >>>>> Any thoughts are welcome. >>>>> >>>>> Thanks Olaf >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> _______________________________________________ >>> Gluster-users mailing list >>> Gluster-users at gluster.org >>> https://lists.gluster.org/mailman/listinfo/gluster-users >> >> -------------- next part -------------- An HTML attachment was scrubbed... URL: From kashif.alig at gmail.com Fri Jan 4 10:17:44 2019 From: kashif.alig at gmail.com (mohammad kashif) Date: Fri, 4 Jan 2019 10:17:44 +0000 Subject: [Gluster-users] update to 4.1.6-1 and fix-layout failing Message-ID: Hi I have updated our distributed gluster storage from 3.12.9-1 to 4.1.6-1. The existing cluster had seven servers totalling in around 450 TB. OS is Centos7. The update went OK and I could access files. Then I added two more servers of 90TB each to cluster and started fix-layout gluster volume rebalance atlasglust fix-layout start Some directories were created at new servers and then stopped although rebalance status was showing that it is still running. 
I think it stopped creating new directories after this error:

E [MSGID: 106061] [glusterd-utils.c:10697:glusterd_volume_rebalance_use_rsp_dict] 0-glusterd: failed to get index
The message "E [MSGID: 106061] [glusterd-utils.c:10697:glusterd_volume_rebalance_use_rsp_dict] 0-glusterd: failed to get index" repeated 7 times between [2019-01-03 13:16:31.146779] and [2019-01-03 13:16:31.158612]

There are also many warnings like this:

[2019-01-03 16:04:34.120777] I [MSGID: 106499] [glusterd-handler.c:4314:__glusterd_handle_status_volume] 0-management: Received status volume req for volume atlasglust
[2019-01-03 17:04:28.541805] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-management: error returned while attempting to connect to host:(null), port:0

I waited for around 12 hours, then stopped fix-layout and started it again. I can see the same error again:

[2019-01-04 09:59:20.825930] E [MSGID: 106061] [glusterd-utils.c:10697:glusterd_volume_rebalance_use_rsp_dict] 0-glusterd: failed to get index
The message "E [MSGID: 106061] [glusterd-utils.c:10697:glusterd_volume_rebalance_use_rsp_dict] 0-glusterd: failed to get index" repeated 7 times between [2019-01-04 09:59:20.825930] and [2019-01-04 09:59:20.837068]

Please advise, as this is our production service.

At the moment, I have stopped clients from using the file system. Would it be OK to allow clients to access the file system while fix-layout is still running?

Thanks

Kashif
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From nbalacha at redhat.com Fri Jan 4 10:41:50 2019
From: nbalacha at redhat.com (Nithya Balachandran)
Date: Fri, 4 Jan 2019 16:11:50 +0530
Subject: [Gluster-users] update to 4.1.6-1 and fix-layout failing
In-Reply-To:
References:
Message-ID:

On Fri, 4 Jan 2019 at 15:48, mohammad kashif wrote:

> Hi
>
> I have updated our distributed gluster storage from 3.12.9-1 to 4.1.6-1.
> The existing cluster had seven servers totalling in around 450 TB. OS is
> Centos7.
The update went OK and I could access files. > Then I added two more servers of 90TB each to cluster and started > fix-layout > > gluster volume rebalance atlasglust fix-layout start > > Some directories were created at new servers and then stopped although > rebalance status was showing that it is still running. I think it stopped > creating new directories after this error > > E [MSGID: 106061] > [glusterd-utils.c:10697:glusterd_volume_rebalance_use_rsp_dict] 0-glusterd: > failed to get index The message "E [MSGID: 106061] > [glusterd-utils.c:10697:glusterd_volume_rebalance_use_rsp_dict] 0-glusterd: > failed to get index" repeated 7 times between [2019-01-03 13:16:31.146779] > and [2019-01-03 13:16:31.158612] > > There are also many warning like this > [2019-01-03 16:04:34.120777] I [MSGID: 106499] > [glusterd-handler.c:4314:__glusterd_handle_status_volume] 0-management: > Received status volume req for volume atlasglust [2019-01-03 > 17:04:28.541805] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-management: error > returned while attempting to connect to host:(null), port:0 > > These are the glusterd logs. Do you see any errors in the rebalance logs for this volume? > I waited for around 12 hours and then stopped fix-layout and started again > I can see the same error again > > [2019-01-04 09:59:20.825930] E [MSGID: 106061] > [glusterd-utils.c:10697:glusterd_volume_rebalance_use_rsp_dict] 0-glusterd: > failed to get index The message "E [MSGID: 106061] > [glusterd-utils.c:10697:glusterd_volume_rebalance_use_rsp_dict] 0-glusterd: > failed to get index" repeated 7 times between [2019-01-04 09:59:20.825930] > and [2019-01-04 09:59:20.837068] > > Please suggest as it is our production service. > > At the moment, I have stopped clients from using file system. Would it be > OK if I allow clients to access file system while fix-layout is still going. 
>
> Thanks
>
> Kashif
>
>
>
> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-users
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From kashif.alig at gmail.com Fri Jan 4 11:40:33 2019
From: kashif.alig at gmail.com (mohammad kashif)
Date: Fri, 4 Jan 2019 11:40:33 +0000
Subject: [Gluster-users] update to 4.1.6-1 and fix-layout failing
In-Reply-To:
References:
Message-ID:

Hi Nithya

The rebalance logs have only these warnings:

[2019-01-04 09:59:20.826261] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-atlasglust-client-5: error returned while attempting to connect to host:(null), port:0
[2019-01-04 09:59:20.828113] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-atlasglust-client-6: error returned while attempting to connect to host:(null), port:0
[2019-01-04 09:59:20.832017] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-atlasglust-client-4: error returned while attempting to connect to host:(null), port:0

gluster volume rebalance atlasglust status

Node                             status                   run time in h:m:s
---------                        -----------              ------------
localhost                        fix-layout in progress   1:0:59
pplxgluster02.physics.ox.ac.uk   fix-layout in progress   1:0:59
pplxgluster03.physics.ox.ac.uk   fix-layout in progress   1:0:59
pplxgluster04.physics.ox.ac.uk   fix-layout in progress   1:0:59
pplxgluster05.physics.ox.ac.uk   fix-layout in progress   1:0:59
pplxgluster06.physics.ox.ac.uk   fix-layout in progress   1:0:59
pplxgluster07.physics.ox.ac.uk   fix-layout in progress   1:0:59
pplxgluster08.physics.ox.ac.uk   fix-layout in progress   1:0:59
pplxgluster09.physics.ox.ac.uk   fix-layout in progress   1:0:59

But there has been no new entry in the logs for the last hour, and I can't see any new directories being created.

Thanks

Kashif

On Fri, Jan 4, 2019 at 10:42 AM Nithya Balachandran wrote:

>
>
> On Fri, 4 Jan 2019 at 15:48, mohammad kashif
> wrote:
>
>> Hi
>>
>> I have updated our distributed gluster storage from 3.12.9-1 to 4.1.6-1.
>> The existing cluster had seven servers totalling in around 450 TB. OS is >> Centos7. The update went OK and I could access files. >> Then I added two more servers of 90TB each to cluster and started >> fix-layout >> >> gluster volume rebalance atlasglust fix-layout start >> >> Some directories were created at new servers and then stopped although >> rebalance status was showing that it is still running. I think it stopped >> creating new directories after this error >> >> E [MSGID: 106061] >> [glusterd-utils.c:10697:glusterd_volume_rebalance_use_rsp_dict] 0-glusterd: >> failed to get index The message "E [MSGID: 106061] >> [glusterd-utils.c:10697:glusterd_volume_rebalance_use_rsp_dict] 0-glusterd: >> failed to get index" repeated 7 times between [2019-01-03 13:16:31.146779] >> and [2019-01-03 13:16:31.158612] >> >> > There are also many warning like this >> [2019-01-03 16:04:34.120777] I [MSGID: 106499] >> [glusterd-handler.c:4314:__glusterd_handle_status_volume] 0-management: >> Received status volume req for volume atlasglust [2019-01-03 >> 17:04:28.541805] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-management: error >> returned while attempting to connect to host:(null), port:0 >> >> These are the glusterd logs. Do you see any errors in the rebalance logs > for this volume? > > >> I waited for around 12 hours and then stopped fix-layout and started again >> I can see the same error again >> >> [2019-01-04 09:59:20.825930] E [MSGID: 106061] >> [glusterd-utils.c:10697:glusterd_volume_rebalance_use_rsp_dict] 0-glusterd: >> failed to get index The message "E [MSGID: 106061] >> [glusterd-utils.c:10697:glusterd_volume_rebalance_use_rsp_dict] 0-glusterd: >> failed to get index" repeated 7 times between [2019-01-04 09:59:20.825930] >> and [2019-01-04 09:59:20.837068] >> >> Please suggest as it is our production service. >> >> At the moment, I have stopped clients from using file system. 
Would it be
>> OK if I allow clients to access file system while fix-layout is still going.
>>
>> Thanks
>>
>> Kashif
>>
>>
>>
>> _______________________________________________
>> Gluster-users mailing list
>> Gluster-users at gluster.org
>> https://lists.gluster.org/mailman/listinfo/gluster-users
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From biholcomb at l1049h.com Fri Jan 4 22:24:07 2019
From: biholcomb at l1049h.com (Brett Holcomb)
Date: Fri, 4 Jan 2019 17:24:07 -0500
Subject: [Gluster-users] [External] Re: Self Heal Confusion
In-Reply-To:
References: <24cfe8a5-dadb-6271-9b7f-af8670f43fce@l1049h.com>
 <14cbebd2-44d0-8558-1e26-944e1dec15a7@l1049h.com>
 <1851464190.54195617.1545898152060.JavaMail.zimbra@redhat.com>
 <988970243.54246776.1545976827971.JavaMail.zimbra@redhat.com>
 <9d548f7b-1859-f438-2cb9-9ca1cb3baa86@l1049h.com>
 <3c2edc47-2cdc-0b90-a708-c59cc8a51937@l1049h.com>
 <7fe3f846-7289-5885-9905-3e7812964970@l1049h.com>
Message-ID:

I wrote a script that searches the output of gluster volume heal projects info, picks out the brick I give it, and then deletes any of the listed files that actually exist in .glusterfs/dir1/dir2. I did this on the first host, which had 85 pending entries, and that cleared them up, so I'll do it via ssh on the other two servers. Hopefully that will clear it up and glusterfs will be happy again.

Thanks everyone for the help.

On 12/31/18 4:39 AM, Davide Obbi wrote:
> cluster.quorum-type auto
> cluster.quorum-count (null)
> cluster.server-quorum-type off
> cluster.server-quorum-ratio 0
> cluster.quorum-reads no
>
> Where exactly do I remove the gfid entries from - the .glusterfs
> directory? --> yes can't remember exactly where but try to do a find
> in the brick paths with the gfid, it should return something
>
> Where do I put the cluster.heal-timeout option - which file?
--> > gluster volume set volumename option value > > On Mon, Dec 31, 2018 at 10:34 AM Brett Holcomb > wrote: > > That is probably the case as a lot of files were deleted some time > ago. > > I'm on version 5.2 but was on 3.12 until about a week ago. > > Here is the quorum info.? I'm running a distributed replicated > volumes > in 2 x 3 = 6 > > cluster.quorum-type auto > cluster.quorum-count (null) > cluster.server-quorum-type off > cluster.server-quorum-ratio 0 > cluster.quorum-reads??????????????????? no > > Where exacty do I remove the gfid entries from - the .glusterfs > directory?? Do I just delete all the directories can files under this > directory? > > Where do I put the cluster.heal-timeout option - which file? > > I think you've hit on the cause of the issue.? Thinking back we've > had > some extended power outages and due to a misconfiguration in the swap > file device name a couple of the nodes did not come up and I didn't > catch it for a while so maybe the deletes occured then. > > Thank you. > > On 12/31/18 2:58 AM, Davide Obbi wrote: > > if the long GFID does not correspond to any file it could mean the > > file has been deleted by the client mounting the volume. I think > this > > is caused when the delete was issued and the number of active > bricks > > were not reaching quorum majority or a second brick was taken down > > while another was down or did not finish the selfheal, the > latter more > > likely. > > It would be interesting to see: > > - what version of glusterfs you running, it happened to me with 3.12 > > - volume quorum rules: "gluster volume get vol all | grep quorum" > > > > To clean it up if i remember correctly it should be possible to > delete > > the gfid entries from the brick mounts on the glusterfs server > nodes > > reporting the files to heal. 
> > > > As a side note you might want to consider changing the selfheal > > timeout to more agressive schedule in cluster.heal-timeout option > _______________________________________________ > Gluster-users mailing list > Gluster-users at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-users > > > > -- > Davide Obbi > System Administrator > > Booking.com B.V. > Vijzelstraat 66-80 Amsterdam 1017HL Netherlands > Direct +31207031558 > Booking.com > Empowering people to experience the world since 1996 > 43 languages, 214+ offices worldwide, 141,000+ global destinations, 29 > million reported listings > Subsidiary of Booking Holdings Inc. (NASDAQ: BKNG) -------------- next part -------------- An HTML attachment was scrubbed... URL: From mwaymack at nsgdv.com Sat Jan 5 22:49:04 2019 From: mwaymack at nsgdv.com (Matt Waymack) Date: Sat, 5 Jan 2019 22:49:04 +0000 Subject: [Gluster-users] Input/output error on FUSE log Message-ID: Hi all, I'm having a problem writing to our volume. When writing files larger than about 2GB, I get an intermittent issue where the write will fail and return Input/Output error. This is also shown in the FUSE log of the client (this is affecting all clients). 
A snip of a client log is below:

[2019-01-05 22:39:44.581371] W [fuse-bridge.c:2474:fuse_writev_cbk] 0-glusterfs-fuse: 51040978: WRITE => -1 gfid=82a0b5c4-7ef3-43c2-ad86-41e16673d7c2 fd=0x7f949839a368 (Input/output error)
[2019-01-05 22:39:44.598392] W [fuse-bridge.c:1441:fuse_err_cbk] 0-glusterfs-fuse: 51040979: FLUSH() ERR => -1 (Input/output error)
[2019-01-05 22:39:47.420920] W [fuse-bridge.c:2474:fuse_writev_cbk] 0-glusterfs-fuse: 51041266: WRITE => -1 gfid=0e8e1e13-97a5-478a-bc58-e81ddf3698a3 fd=0x7f949809b7f8 (Input/output error)
[2019-01-05 22:39:47.433377] W [fuse-bridge.c:1441:fuse_err_cbk] 0-glusterfs-fuse: 51041267: FLUSH() ERR => -1 (Input/output error)
[2019-01-05 22:39:50.441531] W [fuse-bridge.c:2474:fuse_writev_cbk] 0-glusterfs-fuse: 51041548: WRITE => -1 gfid=0e8e1e13-97a5-478a-bc58-e81ddf3698a3 fd=0x7f949839a368 (Input/output error)
[2019-01-05 22:39:50.451914] W [fuse-bridge.c:1441:fuse_err_cbk] 0-glusterfs-fuse: 51041549: FLUSH() ERR => -1 (Input/output error)
The message "W [MSGID: 109011] [dht-layout.c:163:dht_layout_search] 0-gv1-dht: no subvolume for hash (value) = 1311504267" repeated 1721 times between [2019-01-05 22:39:33.906241] and [2019-01-05 22:39:44.598371]
The message "E [MSGID: 101046] [dht-common.c:1502:dht_lookup_dir_cbk] 0-gv1-dht: dict is null" repeated 1714 times between [2019-01-05 22:39:33.925981] and [2019-01-05 22:39:50.451862]
The message "W [MSGID: 109011] [dht-layout.c:163:dht_layout_search] 0-gv1-dht: no subvolume for hash (value) = 1137142622" repeated 1707 times between [2019-01-05 22:39:39.636552] and [2019-01-05 22:39:50.451895]

This is intermittent for most files, but eventually if a file is large enough it will not write. The workflow is SFTP to the client, which then writes to the volume over FUSE. When files get to a certain point, we can no longer write to them. The file sizes are different as well, so it's not like they all get to the same size and just stop either.
I've ruled out a free space issue: our files at their largest are only a few hundred GB and we have tens of terabytes free on each brick. We are also sharding at 1GB.

I'm not sure where to go from here as the error seems vague and I can only see it on the client log. I'm not seeing these errors on the nodes themselves. This is also seen if I mount the volume via FUSE on any of the nodes as well and it is only reflected in the FUSE log.

Here is the volume info:

Volume Name: gv1
Type: Distributed-Replicate
Volume ID: 1472cc78-e2a0-4c3f-9571-dab840239b3c
Status: Started
Snapshot Count: 0
Number of Bricks: 8 x (2 + 1) = 24
Transport-type: tcp
Bricks:
Brick1: tpc-glus4:/exp/b1/gv1
Brick2: tpc-glus2:/exp/b1/gv1
Brick3: tpc-arbiter1:/exp/b1/gv1 (arbiter)
Brick4: tpc-glus2:/exp/b2/gv1
Brick5: tpc-glus4:/exp/b2/gv1
Brick6: tpc-arbiter1:/exp/b2/gv1 (arbiter)
Brick7: tpc-glus4:/exp/b3/gv1
Brick8: tpc-glus2:/exp/b3/gv1
Brick9: tpc-arbiter1:/exp/b3/gv1 (arbiter)
Brick10: tpc-glus4:/exp/b4/gv1
Brick11: tpc-glus2:/exp/b4/gv1
Brick12: tpc-arbiter1:/exp/b4/gv1 (arbiter)
Brick13: tpc-glus1:/exp/b5/gv1
Brick14: tpc-glus3:/exp/b5/gv1
Brick15: tpc-arbiter2:/exp/b5/gv1 (arbiter)
Brick16: tpc-glus1:/exp/b6/gv1
Brick17: tpc-glus3:/exp/b6/gv1
Brick18: tpc-arbiter2:/exp/b6/gv1 (arbiter)
Brick19: tpc-glus1:/exp/b7/gv1
Brick20: tpc-glus3:/exp/b7/gv1
Brick21: tpc-arbiter2:/exp/b7/gv1 (arbiter)
Brick22: tpc-glus1:/exp/b8/gv1
Brick23: tpc-glus3:/exp/b8/gv1
Brick24: tpc-arbiter2:/exp/b8/gv1 (arbiter)
Options Reconfigured:
performance.cache-samba-metadata: on
performance.cache-invalidation: off
features.shard-block-size: 1000MB
features.shard: on
transport.address-family: inet
nfs.disable: on
cluster.lookup-optimize: on

I'm a bit stumped on this; any help is appreciated. Thank you!

-------------- next part --------------
An HTML attachment was scrubbed...
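Since the volume above is sharded (features.shard-block-size: 1000MB) and the failures show up on large files, it can help to know which shard a given byte offset belongs to when looking for it on the bricks. The following is only a rough sketch and not a diagnosis of the failure. It assumes that "1000MB" means 1000 * 1024 * 1024 bytes, and that block 0 lives in the base file while later blocks are stored as .shard/<gfid>.<index> under the brick root (the same naming as the .shard/f28cabcb-d169-41fc-a633-9bef4c4a8e40.51500 path quoted earlier in this archive). The function names shard_index and shard_path are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: map a byte offset in a sharded file to the shard that holds it.
# Assumption: "1000MB" is interpreted as 1000 * 1024 * 1024 bytes.
SHARD_BLOCK_SIZE=$((1000 * 1024 * 1024))

# shard_index OFFSET -> prints the zero-based block number for that offset
shard_index() {
    echo $(( $1 / SHARD_BLOCK_SIZE ))
}

# shard_path GFID OFFSET -> prints where that block is expected to live,
# relative to the brick root. Block 0 stays in the base file itself.
shard_path() {
    local gfid=$1 idx
    idx=$(shard_index "$2")
    if [ "$idx" -eq 0 ]; then
        echo "(base file)"
    else
        echo ".shard/${gfid}.${idx}"
    fi
}

# Example: the block containing the 2 GiB offset of the file whose gfid
# appears in the first WRITE failure of the client log.
shard_path 82a0b5c4-7ef3-43c2-ad86-41e16673d7c2 $((2 * 1024 * 1024 * 1024))
# prints .shard/82a0b5c4-7ef3-43c2-ad86-41e16673d7c2.2
```

The printed path could then be looked for with ls or stat on the bricks of each replica set; whether the shard actually exists there is exactly what would need checking.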
URL: From rgowdapp at redhat.com Sun Jan 6 02:28:33 2019 From: rgowdapp at redhat.com (Raghavendra Gowdappa) Date: Sun, 6 Jan 2019 07:58:33 +0530 Subject: [Gluster-users] Input/output error on FUSE log In-Reply-To: References: Message-ID: On Sun, Jan 6, 2019 at 4:19 AM Matt Waymack wrote: > Hi all, > > > I'm having a problem writing to our volume. When writing files larger > than about 2GB, I get an intermittent issue where the write will fail and > return Input/Output error. This is also shown in the FUSE log of the > client (this is affecting all clients). A snip of a client log is below: > > [2019-01-05 22:39:44.581371] W [fuse-bridge.c:2474:fuse_writev_cbk] > 0-glusterfs-fuse: 51040978: WRITE => -1 > gfid=82a0b5c4-7ef3-43c2-ad86-41e16673d7c2 fd=0x7f949839a368 (Input/output > error) > > [2019-01-05 22:39:44.598392] W [fuse-bridge.c:1441:fuse_err_cbk] > 0-glusterfs-fuse: 51040979: FLUSH() ERR => -1 (Input/output error) > > [2019-01-05 22:39:47.420920] W [fuse-bridge.c:2474:fuse_writev_cbk] > 0-glusterfs-fuse: 51041266: WRITE => -1 > gfid=0e8e1e13-97a5-478a-bc58-e81ddf3698a3 fd=0x7f949809b7f8 (Input/output > error) > > [2019-01-05 22:39:47.433377] W [fuse-bridge.c:1441:fuse_err_cbk] > 0-glusterfs-fuse: 51041267: FLUSH() ERR => -1 (Input/output error) > > [2019-01-05 22:39:50.441531] W [fuse-bridge.c:2474:fuse_writev_cbk] > 0-glusterfs-fuse: 51041548: WRITE => -1 > gfid=0e8e1e13-97a5-478a-bc58-e81ddf3698a3 fd=0x7f949839a368 (Input/output > error) > > [2019-01-05 22:39:50.451914] W [fuse-bridge.c:1441:fuse_err_cbk] > 0-glusterfs-fuse: 51041549: FLUSH() ERR => -1 (Input/output error) > > The message "W [MSGID: 109011] [dht-layout.c:163:dht_layout_search] > 0-gv1-dht: no subvolume for hash (value) = 1311504267" repeated 1721 times > between [2019-01-05 22:39:33.906241] and [2019-01-05 22:39:44.598371] > > The message "E [MSGID: 101046] [dht-common.c:1502:dht_lookup_dir_cbk] > 0-gv1-dht: dict is null" repeated 1714 times between [2019-01-05 > 22:39:33.925981] and 
[2019-01-05 22:39:50.451862]
>
> The message "W [MSGID: 109011] [dht-layout.c:163:dht_layout_search]
> 0-gv1-dht: no subvolume for hash (value) = 1137142622" repeated 1707 times
> between [2019-01-05 22:39:39.636552] and [2019-01-05 22:39:50.451895]
>

This looks to be a DHT issue. Some questions:

* Are all subvolumes of DHT up, and is the client connected to them? Particularly the subvolume which contains the file in question.
* Can you get all extended attributes of the parent directory of the file from all bricks?
* Set diagnostics.client-log-level to TRACE, capture these errors again, and attach the client log file.

> This is intermittent for most files, but eventually if a file is large
> enough it will not write. The workflow is SFTP to the client which then
> writes to the volume over FUSE. When files get to a certain point, we can
> no longer write to them. The file sizes are different as well, so it's not
> like they all get to the same size and just stop either. I've ruled out a
> free space issue, our files at their largest are only a few hundred GB and
> we have tens of terabytes free on each brick. We are also sharding at 1GB.
>
> I'm not sure where to go from here as the error seems vague and I can only
> see it on the client log. I'm not seeing these errors on the nodes
> themselves. This is also seen if I mount the volume via FUSE on any of the
> nodes as well and it is only reflected in the FUSE log.
> > Here is the volume info: > Volume Name: gv1 > Type: Distributed-Replicate > Volume ID: 1472cc78-e2a0-4c3f-9571-dab840239b3c > Status: Started > Snapshot Count: 0 > Number of Bricks: 8 x (2 + 1) = 24 > Transport-type: tcp > Bricks: > Brick1: tpc-glus4:/exp/b1/gv1 > Brick2: tpc-glus2:/exp/b1/gv1 > Brick3: tpc-arbiter1:/exp/b1/gv1 (arbiter) > Brick4: tpc-glus2:/exp/b2/gv1 > Brick5: tpc-glus4:/exp/b2/gv1 > Brick6: tpc-arbiter1:/exp/b2/gv1 (arbiter) > Brick7: tpc-glus4:/exp/b3/gv1 > Brick8: tpc-glus2:/exp/b3/gv1 > Brick9: tpc-arbiter1:/exp/b3/gv1 (arbiter) > Brick10: tpc-glus4:/exp/b4/gv1 > Brick11: tpc-glus2:/exp/b4/gv1 > Brick12: tpc-arbiter1:/exp/b4/gv1 (arbiter) > Brick13: tpc-glus1:/exp/b5/gv1 > Brick14: tpc-glus3:/exp/b5/gv1 > Brick15: tpc-arbiter2:/exp/b5/gv1 (arbiter) > Brick16: tpc-glus1:/exp/b6/gv1 > Brick17: tpc-glus3:/exp/b6/gv1 > Brick18: tpc-arbiter2:/exp/b6/gv1 (arbiter) > Brick19: tpc-glus1:/exp/b7/gv1 > Brick20: tpc-glus3:/exp/b7/gv1 > Brick21: tpc-arbiter2:/exp/b7/gv1 (arbiter) > Brick22: tpc-glus1:/exp/b8/gv1 > Brick23: tpc-glus3:/exp/b8/gv1 > Brick24: tpc-arbiter2:/exp/b8/gv1 (arbiter) > Options Reconfigured: > performance.cache-samba-metadata: on > performance.cache-invalidation: off > features.shard-block-size: 1000MB > features.shard: on > transport.address-family: inet > nfs.disable: on > cluster.lookup-optimize: on > > I'm a bit stumped on this, any help is appreciated. Thank you! > > > _______________________________________________ > Gluster-users mailing list > Gluster-users at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-users -------------- next part -------------- An HTML attachment was scrubbed... 
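The three checks requested in the reply above could be carried out roughly as follows. This is a dry-run sketch that only prints the commands instead of running them. The parent directory path is a placeholder, the use of ssh to reach the brick hosts is an assumption, only the first replica set from the volume info is listed, and the function names are made up. Note also that gluster volume status shows whether bricks are online; whether the client is connected to them still has to be confirmed from the client log.

```shell
#!/usr/bin/env bash
# Dry-run sketch: print the diagnostic commands rather than executing them.
VOLUME=gv1
# Hypothetical placeholder: parent directory of the failing file,
# relative to the brick root.
PARENT_DIR="path/to/parent"

# Bricks of the first replica set, copied from the volume info (host:path).
BRICKS=(
    "tpc-glus4:/exp/b1/gv1"
    "tpc-glus2:/exp/b1/gv1"
    "tpc-arbiter1:/exp/b1/gv1"
)

# print_xattr_cmds DIR -> one getfattr command per brick, dumping every
# extended attribute of DIR in hex
print_xattr_cmds() {
    local brick host path
    for brick in "${BRICKS[@]}"; do
        host=${brick%%:*}   # text before the first colon
        path=${brick#*:}    # text after the first colon
        echo "ssh ${host} getfattr -d -m . -e hex ${path}/$1"
    done
}

# print_trace_cmd -> the volume option that raises client logging to TRACE
print_trace_cmd() {
    echo "gluster volume set ${VOLUME} diagnostics.client-log-level TRACE"
}

echo "gluster volume status ${VOLUME}"
print_xattr_cmds "$PARENT_DIR"
print_trace_cmd
```

TRACE logging is very verbose, so the option would normally be set back to its previous level once the errors have been captured.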
URL: From rgowdapp at redhat.com Sun Jan 6 02:32:19 2019 From: rgowdapp at redhat.com (Raghavendra Gowdappa) Date: Sun, 6 Jan 2019 08:02:19 +0530 Subject: [Gluster-users] Input/output error on FUSE log In-Reply-To: References: Message-ID: On Sun, Jan 6, 2019 at 7:58 AM Raghavendra Gowdappa wrote: > > > On Sun, Jan 6, 2019 at 4:19 AM Matt Waymack wrote: > >> Hi all, >> >> >> I'm having a problem writing to our volume. When writing files larger >> than about 2GB, I get an intermittent issue where the write will fail and >> return Input/Output error. This is also shown in the FUSE log of the >> client (this is affecting all clients). A snip of a client log is below: >> >> [2019-01-05 22:39:44.581371] W [fuse-bridge.c:2474:fuse_writev_cbk] >> 0-glusterfs-fuse: 51040978: WRITE => -1 >> gfid=82a0b5c4-7ef3-43c2-ad86-41e16673d7c2 fd=0x7f949839a368 (Input/output >> error) >> >> [2019-01-05 22:39:44.598392] W [fuse-bridge.c:1441:fuse_err_cbk] >> 0-glusterfs-fuse: 51040979: FLUSH() ERR => -1 (Input/output error) >> >> [2019-01-05 22:39:47.420920] W [fuse-bridge.c:2474:fuse_writev_cbk] >> 0-glusterfs-fuse: 51041266: WRITE => -1 >> gfid=0e8e1e13-97a5-478a-bc58-e81ddf3698a3 fd=0x7f949809b7f8 (Input/output >> error) >> >> [2019-01-05 22:39:47.433377] W [fuse-bridge.c:1441:fuse_err_cbk] >> 0-glusterfs-fuse: 51041267: FLUSH() ERR => -1 (Input/output error) >> >> [2019-01-05 22:39:50.441531] W [fuse-bridge.c:2474:fuse_writev_cbk] >> 0-glusterfs-fuse: 51041548: WRITE => -1 >> gfid=0e8e1e13-97a5-478a-bc58-e81ddf3698a3 fd=0x7f949839a368 (Input/output >> error) >> >> [2019-01-05 22:39:50.451914] W [fuse-bridge.c:1441:fuse_err_cbk] >> 0-glusterfs-fuse: 51041549: FLUSH() ERR => -1 (Input/output error) >> >> The message "W [MSGID: 109011] [dht-layout.c:163:dht_layout_search] >> 0-gv1-dht: no subvolume for hash (value) = 1311504267" repeated 1721 times >> between [2019-01-05 22:39:33.906241] and [2019-01-05 22:39:44.598371] >> >> The message "E [MSGID: 101046] 
[dht-common.c:1502:dht_lookup_dir_cbk] >> 0-gv1-dht: dict is null" repeated 1714 times between [2019-01-05 >> 22:39:33.925981] and [2019-01-05 22:39:50.451862] >> >> The message "W [MSGID: 109011] [dht-layout.c:163:dht_layout_search] >> 0-gv1-dht: no subvolume for hash (value) = 1137142622" repeated 1707 times >> between [2019-01-05 22:39:39.636552] and [2019-01-05 22:39:50.451895] >> > > This looks to be a DHT issue. Some questions: > * Are all subvolumes of DHT up, and is the client connected to them? > Particularly the subvolume which contains the file in question. > * Can you get all extended attributes of the parent directory of the file from > all bricks? > * Set diagnostics.client-log-level to TRACE, capture these errors again, > and attach the client log file. >

I spoke a bit too early. dht_writev doesn't search for the hashed subvolume, as it has already been looked up during lookup. So these messages look to belong to a different issue, not the writev failure.

> >> This is intermittent for most files, but eventually if a file is large >> enough it will not write. The workflow is SFTP to the client which then >> writes to the volume over FUSE. When files get to a certain point, we can >> no longer write to them. The file sizes are different as well, so it's not >> like they all get to the same size and just stop either. I've ruled out a >> free space issue, our files at their largest are only a few hundred GB and >> we have tens of terabytes free on each brick. We are also sharding at 1GB. >> >> I'm not sure where to go from here as the error seems vague and I can >> only see it on the client log. I'm not seeing these errors on the nodes >> themselves. This is also seen if I mount the volume via FUSE on any of the >> nodes as well and it is only reflected in the FUSE log.
>> >> Here is the volume info: >> Volume Name: gv1 >> Type: Distributed-Replicate >> Volume ID: 1472cc78-e2a0-4c3f-9571-dab840239b3c >> Status: Started >> Snapshot Count: 0 >> Number of Bricks: 8 x (2 + 1) = 24 >> Transport-type: tcp >> Bricks: >> Brick1: tpc-glus4:/exp/b1/gv1 >> Brick2: tpc-glus2:/exp/b1/gv1 >> Brick3: tpc-arbiter1:/exp/b1/gv1 (arbiter) >> Brick4: tpc-glus2:/exp/b2/gv1 >> Brick5: tpc-glus4:/exp/b2/gv1 >> Brick6: tpc-arbiter1:/exp/b2/gv1 (arbiter) >> Brick7: tpc-glus4:/exp/b3/gv1 >> Brick8: tpc-glus2:/exp/b3/gv1 >> Brick9: tpc-arbiter1:/exp/b3/gv1 (arbiter) >> Brick10: tpc-glus4:/exp/b4/gv1 >> Brick11: tpc-glus2:/exp/b4/gv1 >> Brick12: tpc-arbiter1:/exp/b4/gv1 (arbiter) >> Brick13: tpc-glus1:/exp/b5/gv1 >> Brick14: tpc-glus3:/exp/b5/gv1 >> Brick15: tpc-arbiter2:/exp/b5/gv1 (arbiter) >> Brick16: tpc-glus1:/exp/b6/gv1 >> Brick17: tpc-glus3:/exp/b6/gv1 >> Brick18: tpc-arbiter2:/exp/b6/gv1 (arbiter) >> Brick19: tpc-glus1:/exp/b7/gv1 >> Brick20: tpc-glus3:/exp/b7/gv1 >> Brick21: tpc-arbiter2:/exp/b7/gv1 (arbiter) >> Brick22: tpc-glus1:/exp/b8/gv1 >> Brick23: tpc-glus3:/exp/b8/gv1 >> Brick24: tpc-arbiter2:/exp/b8/gv1 (arbiter) >> Options Reconfigured: >> performance.cache-samba-metadata: on >> performance.cache-invalidation: off >> features.shard-block-size: 1000MB >> features.shard: on >> transport.address-family: inet >> nfs.disable: on >> cluster.lookup-optimize: on >> >> I'm a bit stumped on this, any help is appreciated. Thank you! >> >> >> _______________________________________________ >> Gluster-users mailing list >> Gluster-users at gluster.org >> https://lists.gluster.org/mailman/listinfo/gluster-users > > -------------- next part -------------- An HTML attachment was scrubbed... 
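
Editor's sketch of the diagnostics requested in this thread (subvolume status, extended attributes of the parent directory on every brick, TRACE client logging). It is a minimal dry-run illustration, not something posted by the participants: the function name `dht_debug_cmds` and the directory `path/to/parent` are hypothetical placeholders, while the volume name and brick roots are taken from the volume info above. It only prints the commands; `getfattr` would have to be run as root on each server against that server's own brick paths.

```shell
#!/bin/sh
# Dry-run sketch: prints the diagnostic commands instead of running them.
# The directory argument is a placeholder for the parent directory of a
# failing file, relative to the volume root.

dht_debug_cmds() {
    volname=$1   # volume name, e.g. gv1
    reldir=$2    # parent directory of the failing file (hypothetical path)
    shift 2      # remaining arguments: brick roots on this server

    # 1. Are all bricks (and therefore all DHT subvolumes) up?
    echo "gluster volume status $volname"

    # 2. Dump all extended attributes of the parent directory on each brick.
    for brick in "$@"; do
        echo "getfattr -d -m . -e hex $brick/$reldir"
    done

    # 3. Raise the client log level, then reproduce the error.
    echo "gluster volume set $volname diagnostics.client-log-level TRACE"
}

dht_debug_cmds gv1 path/to/parent /exp/b1/gv1 /exp/b2/gv1
```

Printing rather than executing keeps the sketch safe to try anywhere; remember to set the log level back once the TRACE log has been captured, since it is very verbose.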
URL: From davide.obbi at booking.com Sun Jan 6 18:26:10 2019 From: davide.obbi at booking.com (Davide Obbi) Date: Sun, 6 Jan 2019 19:26:10 +0100 Subject: [Gluster-users] [External] Re: Input/output error on FUSE log In-Reply-To: References: Message-ID:

Hi, I would start with some basic checks. The "(Input/output error)" seems to be returned by the operating system; this happens, for instance, when trying to access a file system on a device that is not available. So I would check the network connectivity from the client to the servers, and from server to server, during the reported time.

Regards Davide

On Sun, Jan 6, 2019 at 3:32 AM Raghavendra Gowdappa wrote: > > > On Sun, Jan 6, 2019 at 7:58 AM Raghavendra Gowdappa > wrote: > >> >> >> On Sun, Jan 6, 2019 at 4:19 AM Matt Waymack wrote: >> >>> Hi all, >>> >>> >>> I'm having a problem writing to our volume. When writing files larger >>> than about 2GB, I get an intermittent issue where the write will fail and >>> return Input/Output error. This is also shown in the FUSE log of the >>> client (this is affecting all clients).
A snip of a client log is below: >>> >>> [2019-01-05 22:39:44.581371] W [fuse-bridge.c:2474:fuse_writev_cbk] >>> 0-glusterfs-fuse: 51040978: WRITE => -1 >>> gfid=82a0b5c4-7ef3-43c2-ad86-41e16673d7c2 fd=0x7f949839a368 (Input/output >>> error) >>> >>> [2019-01-05 22:39:44.598392] W [fuse-bridge.c:1441:fuse_err_cbk] >>> 0-glusterfs-fuse: 51040979: FLUSH() ERR => -1 (Input/output error) >>> >>> [2019-01-05 22:39:47.420920] W [fuse-bridge.c:2474:fuse_writev_cbk] >>> 0-glusterfs-fuse: 51041266: WRITE => -1 >>> gfid=0e8e1e13-97a5-478a-bc58-e81ddf3698a3 fd=0x7f949809b7f8 (Input/output >>> error) >>> >>> [2019-01-05 22:39:47.433377] W [fuse-bridge.c:1441:fuse_err_cbk] >>> 0-glusterfs-fuse: 51041267: FLUSH() ERR => -1 (Input/output error) >>> >>> [2019-01-05 22:39:50.441531] W [fuse-bridge.c:2474:fuse_writev_cbk] >>> 0-glusterfs-fuse: 51041548: WRITE => -1 >>> gfid=0e8e1e13-97a5-478a-bc58-e81ddf3698a3 fd=0x7f949839a368 (Input/output >>> error) >>> >>> [2019-01-05 22:39:50.451914] W [fuse-bridge.c:1441:fuse_err_cbk] >>> 0-glusterfs-fuse: 51041549: FLUSH() ERR => -1 (Input/output error) >>> >>> The message "W [MSGID: 109011] [dht-layout.c:163:dht_layout_search] >>> 0-gv1-dht: no subvolume for hash (value) = 1311504267" repeated 1721 times >>> between [2019-01-05 22:39:33.906241] and [2019-01-05 22:39:44.598371] >>> >>> The message "E [MSGID: 101046] [dht-common.c:1502:dht_lookup_dir_cbk] >>> 0-gv1-dht: dict is null" repeated 1714 times between [2019-01-05 >>> 22:39:33.925981] and [2019-01-05 22:39:50.451862] >>> >>> The message "W [MSGID: 109011] [dht-layout.c:163:dht_layout_search] >>> 0-gv1-dht: no subvolume for hash (value) = 1137142622" repeated 1707 times >>> between [2019-01-05 22:39:39.636552] and [2019-01-05 22:39:50.451895] >>> >> >> This looks to be a DHT issue. Some questions: >> * Are all subvolumes of DHT up and client is connected to them? >> Particularly the subvolume which contains the file in question. 
>> * Can you get all extended attributes of the parent directory of the file >> from all bricks? >> * Set diagnostics.client-log-level to TRACE, capture these errors again, >> and attach the client log file. >> > > I spoke a bit too early. dht_writev doesn't search for the hashed subvolume, as it has > already been looked up during lookup. So these messages look to belong to a different > issue, not the writev failure. > > >> >>> This is intermittent for most files, but eventually if a file is large >>> enough it will not write. The workflow is SFTP to the client which then >>> writes to the volume over FUSE. When files get to a certain point, we can >>> no longer write to them. The file sizes are different as well, so it's not >>> like they all get to the same size and just stop either. I've ruled out a >>> free space issue, our files at their largest are only a few hundred GB and >>> we have tens of terabytes free on each brick. We are also sharding at 1GB. >>> >>> I'm not sure where to go from here as the error seems vague and I can >>> only see it on the client log. I'm not seeing these errors on the nodes >>> themselves. This is also seen if I mount the volume via FUSE on any of the >>> nodes as well and it is only reflected in the FUSE log.
>>> >>> Here is the volume info: >>> Volume Name: gv1 >>> Type: Distributed-Replicate >>> Volume ID: 1472cc78-e2a0-4c3f-9571-dab840239b3c >>> Status: Started >>> Snapshot Count: 0 >>> Number of Bricks: 8 x (2 + 1) = 24 >>> Transport-type: tcp >>> Bricks: >>> Brick1: tpc-glus4:/exp/b1/gv1 >>> Brick2: tpc-glus2:/exp/b1/gv1 >>> Brick3: tpc-arbiter1:/exp/b1/gv1 (arbiter) >>> Brick4: tpc-glus2:/exp/b2/gv1 >>> Brick5: tpc-glus4:/exp/b2/gv1 >>> Brick6: tpc-arbiter1:/exp/b2/gv1 (arbiter) >>> Brick7: tpc-glus4:/exp/b3/gv1 >>> Brick8: tpc-glus2:/exp/b3/gv1 >>> Brick9: tpc-arbiter1:/exp/b3/gv1 (arbiter) >>> Brick10: tpc-glus4:/exp/b4/gv1 >>> Brick11: tpc-glus2:/exp/b4/gv1 >>> Brick12: tpc-arbiter1:/exp/b4/gv1 (arbiter) >>> Brick13: tpc-glus1:/exp/b5/gv1 >>> Brick14: tpc-glus3:/exp/b5/gv1 >>> Brick15: tpc-arbiter2:/exp/b5/gv1 (arbiter) >>> Brick16: tpc-glus1:/exp/b6/gv1 >>> Brick17: tpc-glus3:/exp/b6/gv1 >>> Brick18: tpc-arbiter2:/exp/b6/gv1 (arbiter) >>> Brick19: tpc-glus1:/exp/b7/gv1 >>> Brick20: tpc-glus3:/exp/b7/gv1 >>> Brick21: tpc-arbiter2:/exp/b7/gv1 (arbiter) >>> Brick22: tpc-glus1:/exp/b8/gv1 >>> Brick23: tpc-glus3:/exp/b8/gv1 >>> Brick24: tpc-arbiter2:/exp/b8/gv1 (arbiter) >>> Options Reconfigured: >>> performance.cache-samba-metadata: on >>> performance.cache-invalidation: off >>> features.shard-block-size: 1000MB >>> features.shard: on >>> transport.address-family: inet >>> nfs.disable: on >>> cluster.lookup-optimize: on >>> >>> I'm a bit stumped on this, any help is appreciated. Thank you! >>> >>> >>> _______________________________________________ >>> Gluster-users mailing list >>> Gluster-users at gluster.org >>> https://lists.gluster.org/mailman/listinfo/gluster-users >> >> _______________________________________________ > Gluster-users mailing list > Gluster-users at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-users -- Davide Obbi Senior System Administrator Booking.com B.V. 
Vijzelstraat 66-80 Amsterdam 1017HL Netherlands Direct +31207031558 -------------- next part -------------- An HTML attachment was scrubbed... URL:

From revirii at googlemail.com Mon Jan 7 07:11:29 2019 From: revirii at googlemail.com (Hu Bert) Date: Mon, 7 Jan 2019 08:11:29 +0100 Subject: [Gluster-users] Glusterfs 4.1.6 In-Reply-To: <365618439.54755929.1546515511041.JavaMail.zimbra@redhat.com> References: <365618439.54755929.1546515511041.JavaMail.zimbra@redhat.com> Message-ID:

Hi Ashish & all others,

if I may jump in... I have a little question, if that's ok? Are replace-brick and reset-brick different commands for 2 distinct problems? I once had a faulty disk (=brick); it got replaced (hot-swap) and received the same identifier (/dev/sdd again). I followed this guide:

https://docs.gluster.org/en/v3/Administrator%20Guide/Managing%20Volumes/ -->> "Replacing bricks in Replicate/Distributed Replicate volumes"

If I understand it correctly:

- replace-brick is for "I have an additional disk and want to move data from the existing brick to a new brick"; the old brick gets removed from the volume and the new brick gets added to the volume.
- reset-brick is for "one of my HDDs crashed and it will be replaced by a new one"; the brick name stays the same.

Did I get that right? If so: holy smokes... then I misunderstood this completely (sorry @Pranith&Xavi). The wording is a bit strange here...

Thx Hubert

On Thu, Jan 3, 2019 at 12:38, Ashish Pandey wrote: > > Hi, > > Some of the steps you provided are not correct. > You should have used the reset-brick command, which was introduced for exactly the task you wanted to do.
> > https://docs.gluster.org/en/v3/release-notes/3.9.0/ > > Although your thinking was correct but replacing a faulty disk requires some of the additional task which this command > will do automatically. > > Step 1 :- kill pid of the faulty brick in node >>>>>> This should be done using "reset-brick start" command. follow the steps provided in link. > Step 2 :- running volume status, shows "N/A" under 'pid' & 'TCP port' > Step 3 :- replace disk and mount new disk in same mount point where the old disk was mounted > Step 4 :- run command "gluster v start volname force" >>>>>>>>>>>> This should be done using "reset-brick commit force" command. This will trigger the heal. Follow the link. > Step 5 :- running volume status, shows "N/A" under 'pid' & 'TCP port' > > --- > Ashish > > ________________________________ > From: "Amudhan P" > To: "Gluster Users" > Sent: Thursday, January 3, 2019 4:25:58 PM > Subject: [Gluster-users] Glusterfs 4.1.6 > > Hi, > > I am working on Glusterfs 4.1.6 on a test machine. I am trying to replace a faulty disk and below are the steps I did but wasn't successful with that. > > 3 Nodes, 2 disks per node, Disperse Volume 4+2 :- > Step 1 :- kill pid of the faulty brick in node > Step 2 :- running volume status, shows "N/A" under 'pid' & 'TCP port' > Step 3 :- replace disk and mount new disk in same mount point where the old disk was mounted > Step 4 :- run command "gluster v start volname force" > Step 5 :- running volume status, shows "N/A" under 'pid' & 'TCP port' > > expected behavior was a new brick process & heal should have started. > > following above said steps 3.10.1 works perfectly, starting a new brick process and heal begins. > But the same step not working in 4.1.6, Did I miss any steps? what should be done? 
> > Amudhan > > _______________________________________________ > Gluster-users mailing list > Gluster-users at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-users > > _______________________________________________ > Gluster-users mailing list > Gluster-users at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-users

From aspandey at redhat.com Mon Jan 7 07:21:54 2019 From: aspandey at redhat.com (Ashish Pandey) Date: Mon, 7 Jan 2019 02:21:54 -0500 (EST) Subject: [Gluster-users] Glusterfs 4.1.6 In-Reply-To: References: <365618439.54755929.1546515511041.JavaMail.zimbra@redhat.com> Message-ID: <2128880662.55600195.1546845714853.JavaMail.zimbra@redhat.com>

Comments inline.

----- Original Message ----- From: "Hu Bert" To: "Ashish Pandey" Cc: "Gluster Users" Sent: Monday, January 7, 2019 12:41:29 PM Subject: Re: [Gluster-users] Glusterfs 4.1.6

Hi Ashish & all others,

if I may jump in... I have a little question, if that's ok? Are replace-brick and reset-brick different commands for 2 distinct problems? I once had a faulty disk (=brick); it got replaced (hot-swap) and received the same identifier (/dev/sdd again). I followed this guide:

https://docs.gluster.org/en/v3/Administrator%20Guide/Managing%20Volumes/ -->> "Replacing bricks in Replicate/Distributed Replicate volumes"

If I understand it correctly:

- replace-brick is for "I have an additional disk and want to move data from the existing brick to a new brick"; the old brick gets removed from the volume and the new brick gets added to the volume.
- reset-brick is for "one of my HDDs crashed and it will be replaced by a new one"; the brick name stays the same.

Did I get that right? If so: holy smokes... then I misunderstood this completely (sorry @Pranith&Xavi). The wording is a bit strange here...

>>>>>>>>>>>>>>>>>>>>>>>>> Yes, your understanding is correct.
In addition to the above, there is one more use of reset-brick: if you want to change the hostname of your server and the bricks are defined by hostname, you can use reset-brick to switch the bricks from hostname to IP address and then change the hostname of the server. In short, whenever you want to change something on one of the bricks while its location and mount point stay the same, you should use reset-brick. >>>>>>>>>>>>>>>>>>>>>>>>>

Thx Hubert

On Thu, Jan 3, 2019 at 12:38, Ashish Pandey wrote:
>
> Hi,
>
> Some of the steps you provided are not correct.
> You should have used the reset-brick command, which was introduced for exactly the task you wanted to do.
>
> https://docs.gluster.org/en/v3/release-notes/3.9.0/
>
> Although your thinking was correct, replacing a faulty disk requires some additional tasks, which this command
> will do automatically.
>
> Step 1 :- kill pid of the faulty brick in node >>>>>> This should be done using the "reset-brick start" command. Follow the steps provided in the link.
> Step 2 :- running volume status, shows "N/A" under 'pid' & 'TCP port'
> Step 3 :- replace disk and mount new disk in same mount point where the old disk was mounted
> Step 4 :- run command "gluster v start volname force" >>>>>>>>>>>> This should be done using the "reset-brick commit force" command. This will trigger the heal. Follow the link.
> Step 5 :- running volume status, shows "N/A" under 'pid' & 'TCP port'
>
> ---
> Ashish
>
> ________________________________
> From: "Amudhan P"
> To: "Gluster Users"
> Sent: Thursday, January 3, 2019 4:25:58 PM
> Subject: [Gluster-users] Glusterfs 4.1.6
>
> Hi,
>
> I am working on Glusterfs 4.1.6 on a test machine. I am trying to replace a faulty disk and below are the steps I did but wasn't successful with that.
> > 3 Nodes, 2 disks per node, Disperse Volume 4+2 :- > Step 1 :- kill pid of the faulty brick in node > Step 2 :- running volume status, shows "N/A" under 'pid' & 'TCP port' > Step 3 :- replace disk and mount new disk in same mount point where the old disk was mounted > Step 4 :- run command "gluster v start volname force" > Step 5 :- running volume status, shows "N/A" under 'pid' & 'TCP port' > > expected behavior was a new brick process & heal should have started. > > following above said steps 3.10.1 works perfectly, starting a new brick process and heal begins. > But the same step not working in 4.1.6, Did I miss any steps? what should be done? > > Amudhan > > _______________________________________________ > Gluster-users mailing list > Gluster-users at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-users > > _______________________________________________ > Gluster-users mailing list > Gluster-users at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-users _______________________________________________ Gluster-users mailing list Gluster-users at gluster.org https://lists.gluster.org/mailman/listinfo/gluster-users -------------- next part -------------- An HTML attachment was scrubbed... URL: From revirii at googlemail.com Mon Jan 7 07:58:22 2019 From: revirii at googlemail.com (Hu Bert) Date: Mon, 7 Jan 2019 08:58:22 +0100 Subject: [Gluster-users] Glusterfs 4.1.6 In-Reply-To: <2128880662.55600195.1546845714853.JavaMail.zimbra@redhat.com> References: <365618439.54755929.1546515511041.JavaMail.zimbra@redhat.com> <2128880662.55600195.1546845714853.JavaMail.zimbra@redhat.com> Message-ID: Hi, thx Ashish for the clarification. Just another question... 
so the commands in case of an HDD (let's say sdd) failure and identical brick paths (mount: /gluster/bricksdd1) should look like this:

gluster volume reset-brick $volname $host:/gluster/bricksdd1 start
>> change hdd, create partition & filesystem, mount <<
gluster volume reset-brick $volname $host:/gluster/bricksdd1 $host:/gluster/bricksdd1 commit force

Is it possible to change the mountpoint/brick name with this command? In my case:

old: /gluster/bricksdd1_new
new: /gluster/bricksdd1

i.e. only the mount point is different.

gluster volume reset-brick $volname $host:/gluster/bricksdd1_new $host:/gluster/bricksdd1 commit force

I would try to:

- gluster volume reset-brick $volname $host:/gluster/bricksdd1_new start
- reformat sdd etc.
- gluster volume reset-brick $volname $host:/gluster/bricksdd1_new $host:/gluster/bricksdd1 commit force

thx Hubert

On Mon, Jan 7, 2019 at 08:21, Ashish Pandey wrote: > > comments inline > > ________________________________ > From: "Hu Bert" > To: "Ashish Pandey" > Cc: "Gluster Users" > Sent: Monday, January 7, 2019 12:41:29 PM > Subject: Re: [Gluster-users] Glusterfs 4.1.6 > > Hi Ashish & all others, > > if i may jump in... i have a little question if that's ok? > replace-brick and reset-brick are different commands for 2 distinct > problems? I once had a faulty disk (=brick), it got replaced > (hot-swap) and received the same identifier (/dev/sdd again); i > followed this guide: > > https://docs.gluster.org/en/v3/Administrator%20Guide/Managing%20Volumes/ > -->> "Replacing bricks in Replicate/Distributed Replicate volumes" > > If i understand it correctly: > > - using replace-brick is for "i have an additional disk and want to > move data from existing brick to new brick", old brick gets removed > from volume and new brick gets added to the volume. > - reset-brick is for "one of my hdds crashed and it will be replaced > by a new one", the brick name stays the same. > > did i get that right? If so: holy smokes...
then i misunderstood this > completly (sorry @Pranith&Xavi). The wording is a bit strange here... > > >>>>>>>>>>>>>>>>>>>>>>>>> > Yes, your understanding is correct. In addition to above, one more use of reset-brick - > If you want to change hostname of your server and bricks are having hostname, then you can use reset-brick to change from hostname to Ip address and then change the > hostname of the server. > In short, whenever you want to change something on one of the brick while location and mount point are same, you should use reset-brick > >>>>>>>>>>>>>>>>>>>>>>>>> > > > > Thx > Hubert > > Am Do., 3. Jan. 2019 um 12:38 Uhr schrieb Ashish Pandey : > > > > Hi, > > > > Some of the the steps provided by you are not correct. > > You should have used reset-brick command which was introduced for the same task you wanted to do. > > > > https://docs.gluster.org/en/v3/release-notes/3.9.0/ > > > > Although your thinking was correct but replacing a faulty disk requires some of the additional task which this command > > will do automatically. > > > > Step 1 :- kill pid of the faulty brick in node >>>>>> This should be done using "reset-brick start" command. follow the steps provided in link. > > Step 2 :- running volume status, shows "N/A" under 'pid' & 'TCP port' > > Step 3 :- replace disk and mount new disk in same mount point where the old disk was mounted > > Step 4 :- run command "gluster v start volname force" >>>>>>>>>>>> This should be done using "reset-brick commit force" command. This will trigger the heal. Follow the link. > > Step 5 :- running volume status, shows "N/A" under 'pid' & 'TCP port' > > > > --- > > Ashish > > > > ________________________________ > > From: "Amudhan P" > > To: "Gluster Users" > > Sent: Thursday, January 3, 2019 4:25:58 PM > > Subject: [Gluster-users] Glusterfs 4.1.6 > > > > Hi, > > > > I am working on Glusterfs 4.1.6 on a test machine. I am trying to replace a faulty disk and below are the steps I did but wasn't successful with that. 
> > > > 3 Nodes, 2 disks per node, Disperse Volume 4+2 :- > > Step 1 :- kill pid of the faulty brick in node > > Step 2 :- running volume status, shows "N/A" under 'pid' & 'TCP port' > > Step 3 :- replace disk and mount new disk in same mount point where the old disk was mounted > > Step 4 :- run command "gluster v start volname force" > > Step 5 :- running volume status, shows "N/A" under 'pid' & 'TCP port' > > > > expected behavior was a new brick process & heal should have started. > > > > following above said steps 3.10.1 works perfectly, starting a new brick process and heal begins. > > But the same step not working in 4.1.6, Did I miss any steps? what should be done? > > > > Amudhan > > > > _______________________________________________ > > Gluster-users mailing list > > Gluster-users at gluster.org > > https://lists.gluster.org/mailman/listinfo/gluster-users > > > > _______________________________________________ > > Gluster-users mailing list > > Gluster-users at gluster.org > > https://lists.gluster.org/mailman/listinfo/gluster-users > _______________________________________________ > Gluster-users mailing list > Gluster-users at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-users > From aspandey at redhat.com Mon Jan 7 08:30:19 2019 From: aspandey at redhat.com (Ashish Pandey) Date: Mon, 7 Jan 2019 03:30:19 -0500 (EST) Subject: [Gluster-users] Glusterfs 4.1.6 In-Reply-To: References: <365618439.54755929.1546515511041.JavaMail.zimbra@redhat.com> <2128880662.55600195.1546845714853.JavaMail.zimbra@redhat.com> Message-ID: <53235749.55605402.1546849819878.JavaMail.zimbra@redhat.com> ----- Original Message ----- From: "Hu Bert" To: "Ashish Pandey" Cc: "Gluster Users" Sent: Monday, January 7, 2019 1:28:22 PM Subject: Re: [Gluster-users] Glusterfs 4.1.6 Hi, thx Ashish for the clarification. Just another question... 
so the commands in case of an HDD (let's say sdd) failure and identical brick paths (mount: /gluster/bricksdd1) should look like this:

gluster volume reset-brick $volname $host:/gluster/bricksdd1 start
>> change hdd, create partition & filesystem, mount <<
gluster volume reset-brick $volname $host:/gluster/bricksdd1 $host:/gluster/bricksdd1 commit force

>>> Correct.

Is it possible to change the mountpoint/brick name with this command? In my case:

old: /gluster/bricksdd1_new
new: /gluster/bricksdd1

i.e. only the mount point is different.

gluster volume reset-brick $volname $host:/gluster/bricksdd1_new $host:/gluster/bricksdd1 commit force

I would try to:

- gluster volume reset-brick $volname $host:/gluster/bricksdd1_new start
- reformat sdd etc.
- gluster volume reset-brick $volname $host:/gluster/bricksdd1_new $host:/gluster/bricksdd1 commit force

>>> I think that is not possible; at least, this is what we found when testing during and after development. We would consider the above case a replace-brick and not a reset-brick.

thx Hubert

On Mon, Jan 7, 2019 at 08:21, Ashish Pandey wrote: > > comments inline > > ________________________________ > From: "Hu Bert" > To: "Ashish Pandey" > Cc: "Gluster Users" > Sent: Monday, January 7, 2019 12:41:29 PM > Subject: Re: [Gluster-users] Glusterfs 4.1.6 > > Hi Ashish & all others, > > if i may jump in... i have a little question if that's ok? > replace-brick and reset-brick are different commands for 2 distinct > problems? I once had a faulty disk (=brick), it got replaced > (hot-swap) and received the same identifier (/dev/sdd again); i > followed this guide: > > https://docs.gluster.org/en/v3/Administrator%20Guide/Managing%20Volumes/ > -->> "Replacing bricks in Replicate/Distributed Replicate volumes" > > If i understand it correctly: > > - using replace-brick is for "i have an additional disk and want to > move data from existing brick to new brick", old brick gets removed > from volume and new brick gets added to the volume.
> - reset-brick is for "one of my hdds crashed and it will be replaced > by a new one", the brick name stays the same. > > did i get that right? If so: holy smokes... then i misunderstood this > completly (sorry @Pranith&Xavi). The wording is a bit strange here... > > >>>>>>>>>>>>>>>>>>>>>>>>> > Yes, your understanding is correct. In addition to above, one more use of reset-brick - > If you want to change hostname of your server and bricks are having hostname, then you can use reset-brick to change from hostname to Ip address and then change the > hostname of the server. > In short, whenever you want to change something on one of the brick while location and mount point are same, you should use reset-brick > >>>>>>>>>>>>>>>>>>>>>>>>> > > > > Thx > Hubert > > Am Do., 3. Jan. 2019 um 12:38 Uhr schrieb Ashish Pandey : > > > > Hi, > > > > Some of the the steps provided by you are not correct. > > You should have used reset-brick command which was introduced for the same task you wanted to do. > > > > https://docs.gluster.org/en/v3/release-notes/3.9.0/ > > > > Although your thinking was correct but replacing a faulty disk requires some of the additional task which this command > > will do automatically. > > > > Step 1 :- kill pid of the faulty brick in node >>>>>> This should be done using "reset-brick start" command. follow the steps provided in link. > > Step 2 :- running volume status, shows "N/A" under 'pid' & 'TCP port' > > Step 3 :- replace disk and mount new disk in same mount point where the old disk was mounted > > Step 4 :- run command "gluster v start volname force" >>>>>>>>>>>> This should be done using "reset-brick commit force" command. This will trigger the heal. Follow the link. 
> > Step 5 :- running volume status, shows "N/A" under 'pid' & 'TCP port' > > > > --- > > Ashish > > > > ________________________________ > > From: "Amudhan P" > > To: "Gluster Users" > > Sent: Thursday, January 3, 2019 4:25:58 PM > > Subject: [Gluster-users] Glusterfs 4.1.6 > > > > Hi, > > > > I am working on Glusterfs 4.1.6 on a test machine. I am trying to replace a faulty disk and below are the steps I did but wasn't successful with that. > > > > 3 Nodes, 2 disks per node, Disperse Volume 4+2 :- > > Step 1 :- kill pid of the faulty brick in node > > Step 2 :- running volume status, shows "N/A" under 'pid' & 'TCP port' > > Step 3 :- replace disk and mount new disk in same mount point where the old disk was mounted > > Step 4 :- run command "gluster v start volname force" > > Step 5 :- running volume status, shows "N/A" under 'pid' & 'TCP port' > > > > expected behavior was a new brick process & heal should have started. > > > > following above said steps 3.10.1 works perfectly, starting a new brick process and heal begins. > > But the same step not working in 4.1.6, Did I miss any steps? what should be done? > > > > Amudhan > > > > _______________________________________________ > > Gluster-users mailing list > > Gluster-users at gluster.org > > https://lists.gluster.org/mailman/listinfo/gluster-users > > > > _______________________________________________ > > Gluster-users mailing list > > Gluster-users at gluster.org > > https://lists.gluster.org/mailman/listinfo/gluster-users > _______________________________________________ > Gluster-users mailing list > Gluster-users at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-users > _______________________________________________ Gluster-users mailing list Gluster-users at gluster.org https://lists.gluster.org/mailman/listinfo/gluster-users -------------- next part -------------- An HTML attachment was scrubbed... 
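
Editor's sketch summarizing the outcome of this thread: reset-brick when the host and brick path stay the same, replace-brick when the path changes. This is a minimal dry-run illustration under stated assumptions, not the participants' own script: the function names `reset_brick_cmds` and `replace_brick_cmd`, the volume name `myvol` and the host `server1` are hypothetical, and the commands are only printed, never executed.

```shell
#!/bin/sh
# Dry-run sketch: prints the gluster commands, does not run them.
# Volume, host and brick names below are placeholders.

# Case 1: failed disk swapped in place, same host and same brick path.
reset_brick_cmds() {
    volname=$1; host=$2; brick=$3
    echo "gluster volume reset-brick $volname $host:$brick start"
    echo "# now replace the disk, create the filesystem and mount it at $brick"
    echo "gluster volume reset-brick $volname $host:$brick $host:$brick commit force"
    # Afterwards, check that self-heal is repopulating the new brick.
    echo "gluster volume heal $volname info"
}

# Case 2: the brick path changes; per the thread this is a replace-brick case.
replace_brick_cmd() {
    volname=$1; host=$2; old=$3; new=$4
    echo "gluster volume replace-brick $volname $host:$old $host:$new commit force"
}

reset_brick_cmds myvol server1 /gluster/bricksdd1
replace_brick_cmd myvol server1 /gluster/bricksdd1_new /gluster/bricksdd1
```

Keeping the two cases in separate helpers mirrors the distinction drawn in the replies: reset-brick is only expected to work when source and destination brick are identical.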
URL:

From nbalacha at redhat.com Mon Jan 7 15:18:31 2019
From: nbalacha at redhat.com (Nithya Balachandran)
Date: Mon, 7 Jan 2019 20:48:31 +0530
Subject: [Gluster-users] update to 4.1.6-1 and fix-layout failing
In-Reply-To:
References:
Message-ID:

On Fri, 4 Jan 2019 at 17:10, mohammad kashif wrote:

> Hi Nithya
>
> The rebalance logs have only these warnings:
> [2019-01-04 09:59:20.826261] W [rpc-clnt.c:1753:rpc_clnt_submit]
> 0-atlasglust-client-5: error returned while attempting to connect to
> host:(null), port:0
> [2019-01-04 09:59:20.828113] W [rpc-clnt.c:1753:rpc_clnt_submit]
> 0-atlasglust-client-6: error returned while attempting to connect to
> host:(null), port:0
> [2019-01-04 09:59:20.832017] W [rpc-clnt.c:1753:rpc_clnt_submit]
> 0-atlasglust-client-4: error returned while attempting to connect to
> host:(null), port:0
>

Please send me the rebalance logs if possible. Are 08 and 09 the newly added
nodes? Are no directories being created on those?

>
> gluster volume rebalance atlasglust status
>                           Node                  status  run time in h:m:s
>                      ---------             -----------       ------------
>                      localhost  fix-layout in progress             1:0:59
> pplxgluster02.physics.ox.ac.uk  fix-layout in progress             1:0:59
> pplxgluster03.physics.ox.ac.uk  fix-layout in progress             1:0:59
> pplxgluster04.physics.ox.ac.uk  fix-layout in progress             1:0:59
> pplxgluster05.physics.ox.ac.uk  fix-layout in progress             1:0:59
> pplxgluster06.physics.ox.ac.uk  fix-layout in progress             1:0:59
> pplxgluster07.physics.ox.ac.uk  fix-layout in progress             1:0:59
> pplxgluster08.physics.ox.ac.uk  fix-layout in progress             1:0:59
> pplxgluster09.physics.ox.ac.uk  fix-layout in progress             1:0:59
>
> But there has been no new entry in the logs for the last hour and I can't see any
> new directories being created.
>
> Thanks
>
> Kashif
>
>
> On Fri, Jan 4, 2019 at 10:42 AM Nithya Balachandran wrote:
>
>>
>>
>> On Fri, 4 Jan 2019 at 15:48, mohammad kashif wrote:
>>
>>> Hi
>>>
>>> I have updated our distributed gluster storage from 3.12.9-1 to 4.1.6-1.
>>> The existing cluster had seven servers totalling around 450 TB. OS is
>>> Centos7. The update went OK and I could access files.
>>> Then I added two more servers of 90TB each to the cluster and started
>>> fix-layout
>>>
>>> gluster volume rebalance atlasglust fix-layout start
>>>
>>> Some directories were created on the new servers and then it stopped, although
>>> rebalance status was showing that it is still running. I think it stopped
>>> creating new directories after this error
>>>
>>> E [MSGID: 106061]
>>> [glusterd-utils.c:10697:glusterd_volume_rebalance_use_rsp_dict] 0-glusterd:
>>> failed to get index
>>> The message "E [MSGID: 106061]
>>> [glusterd-utils.c:10697:glusterd_volume_rebalance_use_rsp_dict] 0-glusterd:
>>> failed to get index" repeated 7 times between [2019-01-03 13:16:31.146779]
>>> and [2019-01-03 13:16:31.158612]
>>>
>>> There are also many warnings like this
>>> [2019-01-03 16:04:34.120777] I [MSGID: 106499]
>>> [glusterd-handler.c:4314:__glusterd_handle_status_volume] 0-management:
>>> Received status volume req for volume atlasglust
>>> [2019-01-03 17:04:28.541805] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-management: error
>>> returned while attempting to connect to host:(null), port:0
>>>
>>
>> These are the glusterd logs. Do you see any errors in the rebalance logs
>> for this volume?
>>
>>
>>> I waited for around 12 hours and then stopped fix-layout and started
>>> again
>>> I can see the same error again
>>>
>>> [2019-01-04 09:59:20.825930] E [MSGID: 106061]
>>> [glusterd-utils.c:10697:glusterd_volume_rebalance_use_rsp_dict] 0-glusterd:
>>> failed to get index
>>> The message "E [MSGID: 106061]
>>> [glusterd-utils.c:10697:glusterd_volume_rebalance_use_rsp_dict] 0-glusterd:
>>> failed to get index" repeated 7 times between [2019-01-04 09:59:20.825930]
>>> and [2019-01-04 09:59:20.837068]
>>>
>>> Please suggest as it is our production service.
>>>
>>> At the moment, I have stopped clients from using the file system. Would it
>>> be OK if I allow clients to access the file system while fix-layout is still
>>> going?
>>>
>>> Thanks
>>>
>>> Kashif
>>>

From mwaymack at nsgdv.com Mon Jan 7 16:35:35 2019
From: mwaymack at nsgdv.com (Matt Waymack)
Date: Mon, 7 Jan 2019 16:35:35 +0000
Subject: [Gluster-users] [External] Re: Input/output error on FUSE log
In-Reply-To:
References:
Message-ID: <860b5dd2c4c1493887cea657f85e3c0d@nsgdv.com>

I think that I can rule out network as I have multiple volumes on the same nodes and not all volumes are affected. Additionally, access via SMB using samba-vfs-glusterfs is not affected, even on the same volumes. This is seemingly only affecting the FUSE clients.

From: Davide Obbi
Sent: Sunday, January 6, 2019 12:26 PM
To: Raghavendra Gowdappa
Cc: Matt Waymack; gluster-users at gluster.org List
Subject: Re: [External] Re: [Gluster-users] Input/output error on FUSE log

Hi,

i would start by doing some checks: "(Input/output error)" seems to be returned by the operating system; this happens, for instance, when trying to access a file system on a device that is not available. So i would check the network connectivity between the client and the servers, and server to server, during the reported time.

Regards
Davide

On Sun, Jan 6, 2019 at 3:32 AM Raghavendra Gowdappa wrote:

On Sun, Jan 6, 2019 at 7:58 AM Raghavendra Gowdappa wrote:

On Sun, Jan 6, 2019 at 4:19 AM Matt Waymack wrote:

Hi all,

I'm having a problem writing to our volume. When writing files larger than about 2GB, I get an intermittent issue where the write will fail and return Input/Output error. This is also shown in the FUSE log of the client (this is affecting all clients).
A snip of a client log is below:

[2019-01-05 22:39:44.581371] W [fuse-bridge.c:2474:fuse_writev_cbk] 0-glusterfs-fuse: 51040978: WRITE => -1 gfid=82a0b5c4-7ef3-43c2-ad86-41e16673d7c2 fd=0x7f949839a368 (Input/output error)
[2019-01-05 22:39:44.598392] W [fuse-bridge.c:1441:fuse_err_cbk] 0-glusterfs-fuse: 51040979: FLUSH() ERR => -1 (Input/output error)
[2019-01-05 22:39:47.420920] W [fuse-bridge.c:2474:fuse_writev_cbk] 0-glusterfs-fuse: 51041266: WRITE => -1 gfid=0e8e1e13-97a5-478a-bc58-e81ddf3698a3 fd=0x7f949809b7f8 (Input/output error)
[2019-01-05 22:39:47.433377] W [fuse-bridge.c:1441:fuse_err_cbk] 0-glusterfs-fuse: 51041267: FLUSH() ERR => -1 (Input/output error)
[2019-01-05 22:39:50.441531] W [fuse-bridge.c:2474:fuse_writev_cbk] 0-glusterfs-fuse: 51041548: WRITE => -1 gfid=0e8e1e13-97a5-478a-bc58-e81ddf3698a3 fd=0x7f949839a368 (Input/output error)
[2019-01-05 22:39:50.451914] W [fuse-bridge.c:1441:fuse_err_cbk] 0-glusterfs-fuse: 51041549: FLUSH() ERR => -1 (Input/output error)
The message "W [MSGID: 109011] [dht-layout.c:163:dht_layout_search] 0-gv1-dht: no subvolume for hash (value) = 1311504267" repeated 1721 times between [2019-01-05 22:39:33.906241] and [2019-01-05 22:39:44.598371]
The message "E [MSGID: 101046] [dht-common.c:1502:dht_lookup_dir_cbk] 0-gv1-dht: dict is null" repeated 1714 times between [2019-01-05 22:39:33.925981] and [2019-01-05 22:39:50.451862]
The message "W [MSGID: 109011] [dht-layout.c:163:dht_layout_search] 0-gv1-dht: no subvolume for hash (value) = 1137142622" repeated 1707 times between [2019-01-05 22:39:39.636552] and [2019-01-05 22:39:50.451895]

This looks to be a DHT issue. Some questions:
* Are all subvolumes of DHT up and is the client connected to them? Particularly the subvolume which contains the file in question.
* Can you get all extended attributes of the parent directory of the file from all bricks?
* Set diagnostics.client-log-level to TRACE, capture these errors again and attach the client log file.

I spoke a bit early. dht_writev doesn't search the hashed subvolume as it has already been looked up in lookup. So these msgs look to be from a different issue - not the writev failure.

This is intermittent for most files, but eventually if a file is large enough it will not write. The workflow is SFTP to the client, which then writes to the volume over FUSE. When files get to a certain point, we can no longer write to them. The file sizes are different as well, so it's not like they all get to the same size and just stop either. I've ruled out a free space issue: our files at their largest are only a few hundred GB and we have tens of terabytes free on each brick. We are also sharding at 1GB.

I'm not sure where to go from here as the error seems vague and I can only see it in the client log. I'm not seeing these errors on the nodes themselves. This is also seen if I mount the volume via FUSE on any of the nodes, and it is only reflected in the FUSE log.

Here is the volume info:
Volume Name: gv1
Type: Distributed-Replicate
Volume ID: 1472cc78-e2a0-4c3f-9571-dab840239b3c
Status: Started
Snapshot Count: 0
Number of Bricks: 8 x (2 + 1) = 24
Transport-type: tcp
Bricks:
Brick1: tpc-glus4:/exp/b1/gv1
Brick2: tpc-glus2:/exp/b1/gv1
Brick3: tpc-arbiter1:/exp/b1/gv1 (arbiter)
Brick4: tpc-glus2:/exp/b2/gv1
Brick5: tpc-glus4:/exp/b2/gv1
Brick6: tpc-arbiter1:/exp/b2/gv1 (arbiter)
Brick7: tpc-glus4:/exp/b3/gv1
Brick8: tpc-glus2:/exp/b3/gv1
Brick9: tpc-arbiter1:/exp/b3/gv1 (arbiter)
Brick10: tpc-glus4:/exp/b4/gv1
Brick11: tpc-glus2:/exp/b4/gv1
Brick12: tpc-arbiter1:/exp/b4/gv1 (arbiter)
Brick13: tpc-glus1:/exp/b5/gv1
Brick14: tpc-glus3:/exp/b5/gv1
Brick15: tpc-arbiter2:/exp/b5/gv1 (arbiter)
Brick16: tpc-glus1:/exp/b6/gv1
Brick17: tpc-glus3:/exp/b6/gv1
Brick18: tpc-arbiter2:/exp/b6/gv1 (arbiter)
Brick19: tpc-glus1:/exp/b7/gv1
Brick20: tpc-glus3:/exp/b7/gv1
Brick21: tpc-arbiter2:/exp/b7/gv1 (arbiter)
Brick22: tpc-glus1:/exp/b8/gv1
Brick23: tpc-glus3:/exp/b8/gv1
Brick24: tpc-arbiter2:/exp/b8/gv1 (arbiter)
Options Reconfigured:
performance.cache-samba-metadata: on
performance.cache-invalidation: off
features.shard-block-size: 1000MB
features.shard: on
transport.address-family: inet
nfs.disable: on
cluster.lookup-optimize: on

I'm a bit stumped on this, any help is appreciated. Thank you!

_______________________________________________
Gluster-users mailing list
Gluster-users at gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users

--
Davide Obbi
Senior System Administrator
Booking.com B.V.

From isakdim at gmail.com Mon Jan 7 18:11:12 2019
From: isakdim at gmail.com (Dmitry Isakbayev)
Date: Mon, 7 Jan 2019 13:11:12 -0500
Subject: [Gluster-users] java application crushes while reading a zip file
In-Reply-To:
References:
Message-ID:

This system is going into production. I will try to replicate this problem
on the next installation.

On Wed, Jan 2, 2019 at 9:25 PM Raghavendra Gowdappa wrote:

>
>
> On Wed, Jan 2, 2019 at 9:59 PM Dmitry Isakbayev wrote:
>
>> Still no JVM crashes. Is it possible that running glusterfs with
>> performance options turned off for a couple of days cleared out the "stale
>> metadata issue"?
>>
>
> Restarting these options would've cleared the existing cache, and hence
> previous stale metadata would've been cleared. Hitting stale metadata
> again depends on races. That might be the reason you are still not seeing
> the issue. Can you try with enabling all perf xlators (default
> configuration)?
>
>
>>
>> On Mon, Dec 31, 2018 at 1:38 PM Dmitry Isakbayev wrote:
>>
>>> The software ran with all of the options turned off over the weekend
>>> without any problems.
>>> I will try to collect the debug info for you. I have re-enabled the
>>> three options, but have yet to see the problem recur.
>>>
>>>
>>> On Sat, Dec 29, 2018 at 6:46 PM Raghavendra Gowdappa <rgowdapp at redhat.com> wrote:
>>>
>>>> Thanks Dmitry. Can you provide the following debug info I asked for earlier:
>>>>
>>>> * strace -ff -v ... of java application
>>>> * dump of the I/O traffic seen by the mountpoint (use --dump-fuse while
>>>> mounting).
>>>>
>>>> regards,
>>>> Raghavendra
>>>>
>>>> On Sat, Dec 29, 2018 at 2:08 AM Dmitry Isakbayev wrote:
>>>>
>>>>> These 3 options seem to trigger both (reading zip file and renaming
>>>>> files) problems.
>>>>>
>>>>> Options Reconfigured:
>>>>> performance.io-cache: off
>>>>> performance.stat-prefetch: off
>>>>> performance.quick-read: off
>>>>> performance.parallel-readdir: off
>>>>> *performance.readdir-ahead: on*
>>>>> *performance.write-behind: on*
>>>>> *performance.read-ahead: on*
>>>>> performance.client-io-threads: off
>>>>> nfs.disable: on
>>>>> transport.address-family: inet
>>>>>
>>>>>
>>>>> On Fri, Dec 28, 2018 at 10:24 AM Dmitry Isakbayev wrote:
>>>>>
>>>>>> Turning a single option on at a time still worked fine. I will keep
>>>>>> trying.
>>>>>>
>>>>>> We had used 4.1.5 on KVM/CentOS7.5 at AWS without these issues or log
>>>>>> messages. Do you suppose these issues are triggered by the new environment
>>>>>> or did not exist in 4.1.5?
>>>>>>
>>>>>> [root at node1 ~]# glusterfs --version
>>>>>> glusterfs 4.1.5
>>>>>>
>>>>>> On AWS using
>>>>>> [root at node1 ~]# hostnamectl
>>>>>> Static hostname: node1
>>>>>> Icon name: computer-vm
>>>>>> Chassis: vm
>>>>>> Machine ID: b30d0f2110ac3807b210c19ede3ce88f
>>>>>> Boot ID: 52bb159a0aa94043a40e7c7651967bd9
>>>>>> Virtualization: kvm
>>>>>> Operating System: CentOS Linux 7 (Core)
>>>>>> CPE OS Name: cpe:/o:centos:centos:7
>>>>>> Kernel: Linux 3.10.0-862.3.2.el7.x86_64
>>>>>> Architecture: x86-64
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, Dec 28, 2018 at 8:56 AM Raghavendra Gowdappa <rgowdapp at redhat.com> wrote:
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Dec 28, 2018 at 7:23 PM Dmitry Isakbayev wrote:
>>>>>>>
>>>>>>>> Ok. I will try different options.
>>>>>>>>
>>>>>>>> This system is scheduled to go into production soon. What version
>>>>>>>> would you recommend to roll back to?
>>>>>>>>
>>>>>>>
>>>>>>> These are long standing issues. So, rolling back may not make these
>>>>>>> issues go away. Instead, if the performance is acceptable to you,
>>>>>>> please keep these xlators off in production.
>>>>>>>
>>>>>>>
>>>>>>>> On Thu, Dec 27, 2018 at 10:55 PM Raghavendra Gowdappa <rgowdapp at redhat.com> wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Fri, Dec 28, 2018 at 3:13 AM Dmitry Isakbayev <isakdim at gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Raghavendra,
>>>>>>>>>>
>>>>>>>>>> Thanks for the suggestion.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I am using
>>>>>>>>>>
>>>>>>>>>> [root at jl-fanexoss1p glusterfs]# gluster --version
>>>>>>>>>> glusterfs 5.0
>>>>>>>>>>
>>>>>>>>>> On
>>>>>>>>>> [root at jl-fanexoss1p glusterfs]# hostnamectl
>>>>>>>>>> Icon name: computer-vm
>>>>>>>>>> Chassis: vm
>>>>>>>>>> Machine ID: e44b8478ef7a467d98363614f4e50535
>>>>>>>>>> Boot ID: eed98992fdda4c88bdd459a89101766b
>>>>>>>>>> Virtualization: vmware
>>>>>>>>>> Operating System: Red Hat Enterprise Linux Server 7.5 (Maipo)
>>>>>>>>>> CPE OS Name: cpe:/o:redhat:enterprise_linux:7.5:GA:server
>>>>>>>>>> Kernel: Linux 3.10.0-862.14.4.el7.x86_64
>>>>>>>>>> Architecture: x86-64
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I have configured the following options
>>>>>>>>>>
>>>>>>>>>> [root at jl-fanexoss1p glusterfs]# gluster volume info
>>>>>>>>>> Volume Name: gv0
>>>>>>>>>> Type: Replicate
>>>>>>>>>> Volume ID: 5ffbda09-c5e2-4abc-b89e-79b5d8a40824
>>>>>>>>>> Status: Started
>>>>>>>>>> Snapshot Count: 0
>>>>>>>>>> Number of Bricks: 1 x 3 = 3
>>>>>>>>>> Transport-type: tcp
>>>>>>>>>> Bricks:
>>>>>>>>>> Brick1: jl-fanexoss1p.cspire.net:/data/brick1/gv0
>>>>>>>>>> Brick2: sl-fanexoss2p.cspire.net:/data/brick1/gv0
>>>>>>>>>> Brick3: nxquorum1p.cspire.net:/data/brick1/gv0
>>>>>>>>>> Options Reconfigured:
>>>>>>>>>> performance.io-cache: off
>>>>>>>>>> performance.stat-prefetch: off
>>>>>>>>>> performance.quick-read: off
>>>>>>>>>> performance.parallel-readdir: off
>>>>>>>>>> performance.readdir-ahead: off
>>>>>>>>>> performance.write-behind: off
>>>>>>>>>> performance.read-ahead: off
>>>>>>>>>> performance.client-io-threads: off
>>>>>>>>>> nfs.disable: on
>>>>>>>>>> transport.address-family: inet
>>>>>>>>>>
>>>>>>>>>> I don't know if it is related, but I am seeing a lot of
>>>>>>>>>> [2018-12-27 20:19:23.776080] W [MSGID: 114031]
>>>>>>>>>> [client-rpc-fops_v2.c:1932:client4_0_seek_cbk] 2-gv0-client-0: remote
>>>>>>>>>> operation failed [No such device or address]
>>>>>>>>>> [2018-12-27 20:19:47.735190] E [MSGID: 101191]
>>>>>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch
>>>>>>>>>> handler
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> These msgs were introduced by patch [1]. To the best of my
>>>>>>>>> knowledge they are benign. We'll be sending a patch to fix these msgs
>>>>>>>>> though.
>>>>>>>>>
>>>>>>>>> +Mohit Agrawal +Milind Changire. Can you try to identify why we are seeing
>>>>>>>>> these messages? If possible please send a patch to fix this.
>>>>>>>>>
>>>>>>>>> [1]
>>>>>>>>> https://review.gluster.org/r/I578c3fc67713f4234bd3abbec5d3fbba19059ea5
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> And java.io exceptions trying to rename files.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> When you see the errors, is it possible to collect
>>>>>>>>> * strace of the java application (strace -ff -v ...)
>>>>>>>>> * fuse-dump of the glusterfs mount (use option --dump-fuse while
>>>>>>>>> mounting)?
>>>>>>>>>
>>>>>>>>> I also need another favour from you. By trial and error, can you
>>>>>>>>> point out which of the many performance xlators you've turned off is
>>>>>>>>> causing the issue?
>>>>>>>>>
>>>>>>>>> The above two data-points will help us to fix the problem.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> Thank You,
>>>>>>>>>> Dmitry
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Thu, Dec 27, 2018 at 3:48 PM Raghavendra Gowdappa <rgowdapp at redhat.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> What version of glusterfs are you using? It might be either
>>>>>>>>>>> * a stale metadata issue.
>>>>>>>>>>> * inconsistent ctime issue.
>>>>>>>>>>>
>>>>>>>>>>> Can you try turning off all performance xlators? If it is the
>>>>>>>>>>> first issue, that should help.
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Dec 28, 2018 at 1:51 AM Dmitry Isakbayev <isakdim at gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Attempted to set 'performance.read-ahead off' according to
>>>>>>>>>>>> https://jira.apache.org/jira/browse/AMQ-7041
>>>>>>>>>>>> That did not help.
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Dec 24, 2018 at 2:11 PM Dmitry Isakbayev <isakdim at gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> The core file generated by JVM suggests that it happens
>>>>>>>>>>>>> because the file is changing while it is being read -
>>>>>>>>>>>>> https://bugs.java.com/bugdatabase/view_bug.do?bug_id=8186557.
>>>>>>>>>>>>> The application reads in the zipfile and goes through the zip
>>>>>>>>>>>>> entries, then reloads the file and goes through the zip entries again. It does so
>>>>>>>>>>>>> 3 times. The application never crashes on the 1st cycle but sometimes
>>>>>>>>>>>>> crashes on the 2nd or 3rd cycle.
>>>>>>>>>>>>> The zip file is generated about 20 seconds prior to it being
>>>>>>>>>>>>> used and is not updated or even used by any other application. I have
>>>>>>>>>>>>> never seen this problem on a plain file system.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I would appreciate any suggestions on how to go about debugging this
>>>>>>>>>>>>> issue. I can change the source code of the java application.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>> Dmitry
>>>>>>>>>>>>>

From mwaymack at nsgdv.com Mon Jan 7 18:44:17 2019
From: mwaymack at nsgdv.com (Matt Waymack)
Date: Mon, 7 Jan 2019 18:44:17 +0000
Subject: [Gluster-users] [External] Re: Input/output error on FUSE log
In-Reply-To:
References: <860b5dd2c4c1493887cea657f85e3c0d@nsgdv.com>
Message-ID: <2b864f36f88541a9932ae4de9dc2f26f@nsgdv.com>

Yes, all volumes use sharding.
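One way to double-check which volumes actually have sharding enabled is to scan saved `gluster volume info` output for `features.shard: on`. The following is only a minimal sketch, assuming the output has the "Volume Name:" / "Options Reconfigured:" layout quoted in this thread; the function name `sharded_volumes` and the `sample` text are hypothetical and exist purely for illustration.

```python
def sharded_volumes(volume_info_text):
    """Return the names of volumes whose options include 'features.shard: on'.

    Assumes the plain-text layout of `gluster volume info` as quoted in this
    thread: a 'Volume Name: <name>' line followed by 'key: value' option lines.
    """
    sharded = []
    current = None  # name of the volume whose block is being read
    for raw in volume_info_text.splitlines():
        line = raw.strip()
        if line.startswith("Volume Name:"):
            current = line.split(":", 1)[1].strip()
        # Match the exact key so 'features.shard-block-size' is not counted.
        elif line.startswith("features.shard:") and current is not None:
            if line.split(":", 1)[1].strip() == "on":
                sharded.append(current)
    return sharded


# Illustrative excerpt only (hypothetical combination of two volumes).
sample = """\
Volume Name: gv1
Type: Distributed-Replicate
Options Reconfigured:
features.shard-block-size: 1000MB
features.shard: on

Volume Name: gv0
Type: Replicate
Options Reconfigured:
performance.read-ahead: off
"""

print(sharded_volumes(sample))
```

A volume that never had the option set will simply not list `features.shard` at all, so it is treated the same as `off` here.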
From: Davide Obbi Sent: Monday, January 7, 2019 12:43 PM To: Matt Waymack Cc: Raghavendra Gowdappa ; gluster-users at gluster.org List Subject: Re: [External] Re: [Gluster-users] Input/output error on FUSE log are all the volumes being configured with sharding? On Mon, Jan 7, 2019 at 5:35 PM Matt Waymack > wrote: I think that I can rule out network as I have multiple volumes on the same nodes and not all volumes are affected. Additionally, access via SMB using samba-vfs-glusterfs is not affected, even on the same volumes. This is seemingly only affecting the FUSE clients. From: Davide Obbi > Sent: Sunday, January 6, 2019 12:26 PM To: Raghavendra Gowdappa > Cc: Matt Waymack >; gluster-users at gluster.org List > Subject: Re: [External] Re: [Gluster-users] Input/output error on FUSE log Hi, i would start doing some checks like: "(Input/output error)" seems returned by the operating system, this happens for instance trying to access a file system which is on a device not available so i would check the network connectivity between the client to servers and server to server during the reported time. Regards Davide On Sun, Jan 6, 2019 at 3:32 AM Raghavendra Gowdappa > wrote: On Sun, Jan 6, 2019 at 7:58 AM Raghavendra Gowdappa > wrote: On Sun, Jan 6, 2019 at 4:19 AM Matt Waymack > wrote: Hi all, I'm having a problem writing to our volume. When writing files larger than about 2GB, I get an intermittent issue where the write will fail and return Input/Output error. This is also shown in the FUSE log of the client (this is affecting all clients). 
A snip of a client log is below: [2019-01-05 22:39:44.581371] W [fuse-bridge.c:2474:fuse_writev_cbk] 0-glusterfs-fuse: 51040978: WRITE => -1 gfid=82a0b5c4-7ef3-43c2-ad86-41e16673d7c2 fd=0x7f949839a368 (Input/output error) [2019-01-05 22:39:44.598392] W [fuse-bridge.c:1441:fuse_err_cbk] 0-glusterfs-fuse: 51040979: FLUSH() ERR => -1 (Input/output error) [2019-01-05 22:39:47.420920] W [fuse-bridge.c:2474:fuse_writev_cbk] 0-glusterfs-fuse: 51041266: WRITE => -1 gfid=0e8e1e13-97a5-478a-bc58-e81ddf3698a3 fd=0x7f949809b7f8 (Input/output error) [2019-01-05 22:39:47.433377] W [fuse-bridge.c:1441:fuse_err_cbk] 0-glusterfs-fuse: 51041267: FLUSH() ERR => -1 (Input/output error) [2019-01-05 22:39:50.441531] W [fuse-bridge.c:2474:fuse_writev_cbk] 0-glusterfs-fuse: 51041548: WRITE => -1 gfid=0e8e1e13-97a5-478a-bc58-e81ddf3698a3 fd=0x7f949839a368 (Input/output error) [2019-01-05 22:39:50.451914] W [fuse-bridge.c:1441:fuse_err_cbk] 0-glusterfs-fuse: 51041549: FLUSH() ERR => -1 (Input/output error) The message "W [MSGID: 109011] [dht-layout.c:163:dht_layout_search] 0-gv1-dht: no subvolume for hash (value) = 1311504267" repeated 1721 times between [2019-01-05 22:39:33.906241] and [2019-01-05 22:39:44.598371] The message "E [MSGID: 101046] [dht-common.c:1502:dht_lookup_dir_cbk] 0-gv1-dht: dict is null" repeated 1714 times between [2019-01-05 22:39:33.925981] and [2019-01-05 22:39:50.451862] The message "W [MSGID: 109011] [dht-layout.c:163:dht_layout_search] 0-gv1-dht: no subvolume for hash (value) = 1137142622" repeated 1707 times between [2019-01-05 22:39:39.636552] and [2019-01-05 22:39:50.451895] This looks to be a DHT issue. Some questions: * Are all subvolumes of DHT up and client is connected to them? Particularly the subvolume which contains the file in question. * Can you get all extended attributes of parent directory of the file from all bricks? * set diagnostics.client-log-level to TRACE, capture these errors again and attach the client log file. I spoke a bit early. 
dht_writev doesn't search hashed subvolume as its already been looked up in lookup. So, these msgs looks to be of a different issue - not writev failure. This is intermittent for most files, but eventually if a file is large enough it will not write. The workflow is SFTP tot he client which then writes to the volume over FUSE. When files get to a certain point,w e can no longer write to them. The file sizes are different as well, so it's not like they all get to the same size and just stop either. I've ruled out a free space issue, our files at their largest are only a few hundred GB and we have tens of terrabytes free on each brick. We are also sharding at 1GB. I'm not sure where to go from here as the error seems vague and I can only see it on the client log. I'm not seeing these errors on the nodes themselves. This is also seen if I mount the volume via FUSE on any of the nodes as well and it is only reflected in the FUSE log. Here is the volume info: Volume Name: gv1 Type: Distributed-Replicate Volume ID: 1472cc78-e2a0-4c3f-9571-dab840239b3c Status: Started Snapshot Count: 0 Number of Bricks: 8 x (2 + 1) = 24 Transport-type: tcp Bricks: Brick1: tpc-glus4:/exp/b1/gv1 Brick2: tpc-glus2:/exp/b1/gv1 Brick3: tpc-arbiter1:/exp/b1/gv1 (arbiter) Brick4: tpc-glus2:/exp/b2/gv1 Brick5: tpc-glus4:/exp/b2/gv1 Brick6: tpc-arbiter1:/exp/b2/gv1 (arbiter) Brick7: tpc-glus4:/exp/b3/gv1 Brick8: tpc-glus2:/exp/b3/gv1 Brick9: tpc-arbiter1:/exp/b3/gv1 (arbiter) Brick10: tpc-glus4:/exp/b4/gv1 Brick11: tpc-glus2:/exp/b4/gv1 Brick12: tpc-arbiter1:/exp/b4/gv1 (arbiter) Brick13: tpc-glus1:/exp/b5/gv1 Brick14: tpc-glus3:/exp/b5/gv1 Brick15: tpc-arbiter2:/exp/b5/gv1 (arbiter) Brick16: tpc-glus1:/exp/b6/gv1 Brick17: tpc-glus3:/exp/b6/gv1 Brick18: tpc-arbiter2:/exp/b6/gv1 (arbiter) Brick19: tpc-glus1:/exp/b7/gv1 Brick20: tpc-glus3:/exp/b7/gv1 Brick21: tpc-arbiter2:/exp/b7/gv1 (arbiter) Brick22: tpc-glus1:/exp/b8/gv1 Brick23: tpc-glus3:/exp/b8/gv1 Brick24: tpc-arbiter2:/exp/b8/gv1 (arbiter) 
Options Reconfigured: performance.cache-samba-metadata: on performance.cache-invalidation: off features.shard-block-size: 1000MB features.shard: on transport.address-family: inet nfs.disable: on cluster.lookup-optimize: on I'm a bit stumped on this, any help is appreciated. Thank you! _______________________________________________ Gluster-users mailing list Gluster-users at gluster.org https://lists.gluster.org/mailman/listinfo/gluster-users _______________________________________________ Gluster-users mailing list Gluster-users at gluster.org https://lists.gluster.org/mailman/listinfo/gluster-users -- Davide Obbi Senior System Administrator Booking.com B.V. Vijzelstraat 66-80 Amsterdam 1017HL Netherlands Direct +31207031558 [Booking.com] Empowering people to experience the world since 1996 43 languages, 214+ offices worldwide, 141,000+ global destinations, 29 million reported listings Subsidiary of Booking Holdings Inc. (NASDAQ: BKNG) -- Davide Obbi Senior System Administrator Booking.com B.V. Vijzelstraat 66-80 Amsterdam 1017HL Netherlands Direct +31207031558 [Booking.com] Empowering people to experience the world since 1996 43 languages, 214+ offices worldwide, 141,000+ global destinations, 29 million reported listings Subsidiary of Booking Holdings Inc. (NASDAQ: BKNG) -------------- next part -------------- An HTML attachment was scrubbed... URL: From davide.obbi at booking.com Mon Jan 7 18:46:53 2019 From: davide.obbi at booking.com (Davide Obbi) Date: Mon, 7 Jan 2019 19:46:53 +0100 Subject: [Gluster-users] [External] Re: Input/output error on FUSE log In-Reply-To: <2b864f36f88541a9932ae4de9dc2f26f@nsgdv.com> References: <860b5dd2c4c1493887cea657f85e3c0d@nsgdv.com> <2b864f36f88541a9932ae4de9dc2f26f@nsgdv.com> Message-ID: i guess you tried already unmounting, stop/star and mounting? On Mon, Jan 7, 2019 at 7:44 PM Matt Waymack wrote: > Yes, all volumes use sharding. 
> > > > *From:* Davide Obbi > *Sent:* Monday, January 7, 2019 12:43 PM > *To:* Matt Waymack > *Cc:* Raghavendra Gowdappa ; > gluster-users at gluster.org List > *Subject:* Re: [External] Re: [Gluster-users] Input/output error on FUSE > log > > > > are all the volumes being configured with sharding? > > > > On Mon, Jan 7, 2019 at 5:35 PM Matt Waymack wrote: > > I think that I can rule out network as I have multiple volumes on the same > nodes and not all volumes are affected. Additionally, access via SMB using > samba-vfs-glusterfs is not affected, even on the same volumes. This is > seemingly only affecting the FUSE clients. > > > > *From:* Davide Obbi > *Sent:* Sunday, January 6, 2019 12:26 PM > *To:* Raghavendra Gowdappa > *Cc:* Matt Waymack ; gluster-users at gluster.org List < > gluster-users at gluster.org> > *Subject:* Re: [External] Re: [Gluster-users] Input/output error on FUSE > log > > > > Hi, > > > > i would start doing some checks like: "(Input/output error)" seems > returned by the operating system, this happens for instance trying to > access a file system which is on a device not available so i would check > the network connectivity between the client to servers and server to > server during the reported time. > > > > Regards > > Davide > > > > On Sun, Jan 6, 2019 at 3:32 AM Raghavendra Gowdappa > wrote: > > > > > > On Sun, Jan 6, 2019 at 7:58 AM Raghavendra Gowdappa > wrote: > > > > > > On Sun, Jan 6, 2019 at 4:19 AM Matt Waymack wrote: > > Hi all, > > > > I'm having a problem writing to our volume. When writing files larger > than about 2GB, I get an intermittent issue where the write will fail and > return Input/Output error. This is also shown in the FUSE log of the > client (this is affecting all clients). 
A snip of a client log is below: > > [2019-01-05 22:39:44.581371] W [fuse-bridge.c:2474:fuse_writev_cbk] > 0-glusterfs-fuse: 51040978: WRITE => -1 > gfid=82a0b5c4-7ef3-43c2-ad86-41e16673d7c2 fd=0x7f949839a368 (Input/output > error) > > [2019-01-05 22:39:44.598392] W [fuse-bridge.c:1441:fuse_err_cbk] > 0-glusterfs-fuse: 51040979: FLUSH() ERR => -1 (Input/output error) > > [2019-01-05 22:39:47.420920] W [fuse-bridge.c:2474:fuse_writev_cbk] > 0-glusterfs-fuse: 51041266: WRITE => -1 > gfid=0e8e1e13-97a5-478a-bc58-e81ddf3698a3 fd=0x7f949809b7f8 (Input/output > error) > > [2019-01-05 22:39:47.433377] W [fuse-bridge.c:1441:fuse_err_cbk] > 0-glusterfs-fuse: 51041267: FLUSH() ERR => -1 (Input/output error) > > [2019-01-05 22:39:50.441531] W [fuse-bridge.c:2474:fuse_writev_cbk] > 0-glusterfs-fuse: 51041548: WRITE => -1 > gfid=0e8e1e13-97a5-478a-bc58-e81ddf3698a3 fd=0x7f949839a368 (Input/output > error) > > [2019-01-05 22:39:50.451914] W [fuse-bridge.c:1441:fuse_err_cbk] > 0-glusterfs-fuse: 51041549: FLUSH() ERR => -1 (Input/output error) > > The message "W [MSGID: 109011] [dht-layout.c:163:dht_layout_search] > 0-gv1-dht: no subvolume for hash (value) = 1311504267" repeated 1721 times > between [2019-01-05 22:39:33.906241] and [2019-01-05 22:39:44.598371] > > The message "E [MSGID: 101046] [dht-common.c:1502:dht_lookup_dir_cbk] > 0-gv1-dht: dict is null" repeated 1714 times between [2019-01-05 > 22:39:33.925981] and [2019-01-05 22:39:50.451862] > > The message "W [MSGID: 109011] [dht-layout.c:163:dht_layout_search] > 0-gv1-dht: no subvolume for hash (value) = 1137142622" repeated 1707 times > between [2019-01-05 22:39:39.636552] and [2019-01-05 22:39:50.451895] > > > > This looks to be a DHT issue. Some questions: > > * Are all subvolumes of DHT up and client is connected to them? > Particularly the subvolume which contains the file in question. > > * Can you get all extended attributes of parent directory of the file from > all bricks? 
> > * set diagnostics.client-log-level to TRACE, capture these errors again > and attach the client log file. > > > > I spoke a bit early. dht_writev doesn't search hashed subvolume as its > already been looked up in lookup. So, these msgs looks to be of a different > issue - not writev failure. > > > > > > This is intermittent for most files, but eventually if a file is large > enough it will not write. The workflow is SFTP tot he client which then > writes to the volume over FUSE. When files get to a certain point,w e can > no longer write to them. The file sizes are different as well, so it's not > like they all get to the same size and just stop either. I've ruled out a > free space issue, our files at their largest are only a few hundred GB and > we have tens of terrabytes free on each brick. We are also sharding at 1GB. > > > > I'm not sure where to go from here as the error seems vague and I can only > see it on the client log. I'm not seeing these errors on the nodes > themselves. This is also seen if I mount the volume via FUSE on any of the > nodes as well and it is only reflected in the FUSE log. 
>
> Here is the volume info:
>
> Volume Name: gv1
> Type: Distributed-Replicate
> Volume ID: 1472cc78-e2a0-4c3f-9571-dab840239b3c
> Status: Started
> Snapshot Count: 0
> Number of Bricks: 8 x (2 + 1) = 24
> Transport-type: tcp
> Bricks:
> Brick1: tpc-glus4:/exp/b1/gv1
> Brick2: tpc-glus2:/exp/b1/gv1
> Brick3: tpc-arbiter1:/exp/b1/gv1 (arbiter)
> Brick4: tpc-glus2:/exp/b2/gv1
> Brick5: tpc-glus4:/exp/b2/gv1
> Brick6: tpc-arbiter1:/exp/b2/gv1 (arbiter)
> Brick7: tpc-glus4:/exp/b3/gv1
> Brick8: tpc-glus2:/exp/b3/gv1
> Brick9: tpc-arbiter1:/exp/b3/gv1 (arbiter)
> Brick10: tpc-glus4:/exp/b4/gv1
> Brick11: tpc-glus2:/exp/b4/gv1
> Brick12: tpc-arbiter1:/exp/b4/gv1 (arbiter)
> Brick13: tpc-glus1:/exp/b5/gv1
> Brick14: tpc-glus3:/exp/b5/gv1
> Brick15: tpc-arbiter2:/exp/b5/gv1 (arbiter)
> Brick16: tpc-glus1:/exp/b6/gv1
> Brick17: tpc-glus3:/exp/b6/gv1
> Brick18: tpc-arbiter2:/exp/b6/gv1 (arbiter)
> Brick19: tpc-glus1:/exp/b7/gv1
> Brick20: tpc-glus3:/exp/b7/gv1
> Brick21: tpc-arbiter2:/exp/b7/gv1 (arbiter)
> Brick22: tpc-glus1:/exp/b8/gv1
> Brick23: tpc-glus3:/exp/b8/gv1
> Brick24: tpc-arbiter2:/exp/b8/gv1 (arbiter)
> Options Reconfigured:
> performance.cache-samba-metadata: on
> performance.cache-invalidation: off
> features.shard-block-size: 1000MB
> features.shard: on
> transport.address-family: inet
> nfs.disable: on
> cluster.lookup-optimize: on
>
> I'm a bit stumped on this, any help is appreciated. Thank you!
>
> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-users
>
> --
> *Davide Obbi*
> Senior System Administrator
> Booking.com B.V.
> Vijzelstraat 66-80 Amsterdam 1017HL Netherlands
> Direct +31207031558
> Empowering people to experience the world since 1996
> 43 languages, 214+ offices worldwide, 141,000+ global destinations, 29 million reported listings
> Subsidiary of Booking Holdings Inc. (NASDAQ: BKNG)

--
Davide Obbi
Senior System Administrator
Booking.com B.V.
Vijzelstraat 66-80 Amsterdam 1017HL Netherlands
Direct +31207031558
Empowering people to experience the world since 1996
43 languages, 214+ offices worldwide, 141,000+ global destinations, 29 million reported listings
Subsidiary of Booking Holdings Inc. (NASDAQ: BKNG)

From davide.obbi at booking.com Mon Jan 7 18:42:32 2019
From: davide.obbi at booking.com (Davide Obbi)
Date: Mon, 7 Jan 2019 19:42:32 +0100
Subject: [Gluster-users] [External] Re: Input/output error on FUSE log
In-Reply-To: <860b5dd2c4c1493887cea657f85e3c0d@nsgdv.com>
References: <860b5dd2c4c1493887cea657f85e3c0d@nsgdv.com>
Message-ID:

are all the volumes being configured with sharding?

On Mon, Jan 7, 2019 at 5:35 PM Matt Waymack wrote:

> I think that I can rule out network as I have multiple volumes on the same nodes and not all volumes are affected. Additionally, access via SMB using samba-vfs-glusterfs is not affected, even on the same volumes. This is seemingly only affecting the FUSE clients.
>
> *From:* Davide Obbi
> *Sent:* Sunday, January 6, 2019 12:26 PM
> *To:* Raghavendra Gowdappa
> *Cc:* Matt Waymack; gluster-users at gluster.org List <gluster-users at gluster.org>
> *Subject:* Re: [External] Re: [Gluster-users] Input/output error on FUSE log
>
> Hi,
>
> I would start with some checks: "(Input/output error)" seems to be returned by the operating system. This happens, for instance, when trying to access a file system on a device that is not available, so I would check the network connectivity from client to servers and from server to server during the reported time.
>
> Regards
> Davide
>
> On Sun, Jan 6, 2019 at 3:32 AM Raghavendra Gowdappa wrote:
>
> On Sun, Jan 6, 2019 at 7:58 AM Raghavendra Gowdappa wrote:
>
> On Sun, Jan 6, 2019 at 4:19 AM Matt Waymack wrote:
>
> Hi all,
>
> I'm having a problem writing to our volume. When writing files larger than about 2GB, I get an intermittent issue where the write will fail and return Input/Output error. This is also shown in the FUSE log of the client (this is affecting all clients).
From mwaymack at nsgdv.com Mon Jan 7 18:52:30 2019
From: mwaymack at nsgdv.com (Matt Waymack)
Date: Mon, 7 Jan 2019 18:52:30 +0000
Subject: [Gluster-users] [External] Re: Input/output error on FUSE log
In-Reply-To:
References: <860b5dd2c4c1493887cea657f85e3c0d@nsgdv.com> <2b864f36f88541a9932ae4de9dc2f26f@nsgdv.com>
Message-ID: <65d9fe57aafb48a88e81cb93596bafef@nsgdv.com>

Yep, first unmounted/remounted, then rebooted clients. Stopped/started the volumes, and rebooted all nodes.

From: Davide Obbi
Sent: Monday, January 7, 2019 12:47 PM
To: Matt Waymack
Cc: Raghavendra Gowdappa; gluster-users at gluster.org List
Subject: Re: [External] Re: [Gluster-users] Input/output error on FUSE log

I guess you already tried unmounting, stop/start and mounting?

On Mon, Jan 7, 2019 at 7:44 PM Matt Waymack wrote:

Yes, all volumes use sharding.

From: Davide Obbi
Sent: Monday, January 7, 2019 12:43 PM
To: Matt Waymack
Cc: Raghavendra Gowdappa; gluster-users at gluster.org List
Subject: Re: [External] Re: [Gluster-users] Input/output error on FUSE log

are all the volumes being configured with sharding?
From davide.obbi at booking.com Mon Jan 7 18:55:29 2019
From: davide.obbi at booking.com (Davide Obbi)
Date: Mon, 7 Jan 2019 19:55:29 +0100
Subject: [Gluster-users] [External] Re: Input/output error on FUSE log
In-Reply-To: <65d9fe57aafb48a88e81cb93596bafef@nsgdv.com>
References: <860b5dd2c4c1493887cea657f85e3c0d@nsgdv.com> <2b864f36f88541a9932ae4de9dc2f26f@nsgdv.com> <65d9fe57aafb48a88e81cb93596bafef@nsgdv.com>
Message-ID:

Then my last idea would be to try creating the same files, or running the application, on the other volumes. Sorry, but I will be interested in the solution!

On Mon, Jan 7, 2019 at 7:52 PM Matt Waymack wrote:

> Yep, first unmounted/remounted, then rebooted clients. Stopped/started the volumes, and rebooted all nodes.
From mwaymack at nsgdv.com Mon Jan 7 19:18:31 2019
From: mwaymack at nsgdv.com (Matt Waymack)
Date: Mon, 7 Jan 2019 19:18:31 +0000
Subject: [Gluster-users] Input/output error on FUSE log
In-Reply-To:
References:
Message-ID: <07ad5733d55f4f0ba019dd0fba606b3e@nsgdv.com>

Attached are the logs from when a failure occurred with diagnostics set to trace. Thank you!
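
For anyone working through a client log like the one in this thread, the following is a minimal Python sketch of one way to summarise the failed WRITE calls per gfid. It assumes each log entry sits on a single line in the fuse_writev_cbk format shown in the excerpt quoted in this thread. The names tally_write_failures and SAMPLE are made up for illustration and are not part of Gluster.

```python
import re
from collections import Counter

# Pattern for the fuse_writev_cbk warning lines quoted in this thread.
# Assumption: one log entry per line, not wrapped by a mail client.
WRITE_FAILURE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] W \[fuse-bridge\.c:\d+:fuse_writev_cbk\] .*?"
    r"WRITE => -1 gfid=(?P<gfid>[0-9a-f-]{36}) fd=(?P<fd>0x[0-9a-f]+) "
    r"\((?P<err>[^)]*)\)"
)


def tally_write_failures(lines):
    """Count failed WRITE calls per gfid in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = WRITE_FAILURE.match(line.strip())
        if match:
            counts[match.group("gfid")] += 1
    return counts


# Entries copied from the log excerpt in this thread (unwrapped).
SAMPLE = [
    "[2019-01-05 22:39:44.581371] W [fuse-bridge.c:2474:fuse_writev_cbk] "
    "0-glusterfs-fuse: 51040978: WRITE => -1 "
    "gfid=82a0b5c4-7ef3-43c2-ad86-41e16673d7c2 fd=0x7f949839a368 "
    "(Input/output error)",
    "[2019-01-05 22:39:44.598392] W [fuse-bridge.c:1441:fuse_err_cbk] "
    "0-glusterfs-fuse: 51040979: FLUSH() ERR => -1 (Input/output error)",
    "[2019-01-05 22:39:47.420920] W [fuse-bridge.c:2474:fuse_writev_cbk] "
    "0-glusterfs-fuse: 51041266: WRITE => -1 "
    "gfid=0e8e1e13-97a5-478a-bc58-e81ddf3698a3 fd=0x7f949809b7f8 "
    "(Input/output error)",
    "[2019-01-05 22:39:50.441531] W [fuse-bridge.c:2474:fuse_writev_cbk] "
    "0-glusterfs-fuse: 51041548: WRITE => -1 "
    "gfid=0e8e1e13-97a5-478a-bc58-e81ddf3698a3 fd=0x7f949839a368 "
    "(Input/output error)",
]

if __name__ == "__main__":
    # Print the gfids with the most failed writes first.
    for gfid, failures in tally_write_failures(SAMPLE).most_common():
        print(gfid, failures)
```

With a real log, an open file object can be passed in place of SAMPLE. A gfid that keeps reappearing suggests one file is failing repeatedly, which may help narrow down what to inspect on the bricks.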
Here is the volume info:

Volume Name: gv1
Type: Distributed-Replicate
Volume ID: 1472cc78-e2a0-4c3f-9571-dab840239b3c
Status: Started
Snapshot Count: 0
Number of Bricks: 8 x (2 + 1) = 24
Transport-type: tcp
Bricks:
Brick1: tpc-glus4:/exp/b1/gv1
Brick2: tpc-glus2:/exp/b1/gv1
Brick3: tpc-arbiter1:/exp/b1/gv1 (arbiter)
Brick4: tpc-glus2:/exp/b2/gv1
Brick5: tpc-glus4:/exp/b2/gv1
Brick6: tpc-arbiter1:/exp/b2/gv1 (arbiter)
Brick7: tpc-glus4:/exp/b3/gv1
Brick8: tpc-glus2:/exp/b3/gv1
Brick9: tpc-arbiter1:/exp/b3/gv1 (arbiter)
Brick10: tpc-glus4:/exp/b4/gv1
Brick11: tpc-glus2:/exp/b4/gv1
Brick12: tpc-arbiter1:/exp/b4/gv1 (arbiter)
Brick13: tpc-glus1:/exp/b5/gv1
Brick14: tpc-glus3:/exp/b5/gv1
Brick15: tpc-arbiter2:/exp/b5/gv1 (arbiter)
Brick16: tpc-glus1:/exp/b6/gv1
Brick17: tpc-glus3:/exp/b6/gv1
Brick18: tpc-arbiter2:/exp/b6/gv1 (arbiter)
Brick19: tpc-glus1:/exp/b7/gv1
Brick20: tpc-glus3:/exp/b7/gv1
Brick21: tpc-arbiter2:/exp/b7/gv1 (arbiter)
Brick22: tpc-glus1:/exp/b8/gv1
Brick23: tpc-glus3:/exp/b8/gv1
Brick24: tpc-arbiter2:/exp/b8/gv1 (arbiter)
Options Reconfigured:
performance.cache-samba-metadata: on
performance.cache-invalidation: off
features.shard-block-size: 1000MB
features.shard: on
transport.address-family: inet
nfs.disable: on
cluster.lookup-optimize: on

I'm a bit stumped on this, any help is appreciated. Thank you!

_______________________________________________
Gluster-users mailing list
Gluster-users at gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: client1.txt
URL:

From daimh at umich.edu Mon Jan 7 20:46:00 2019
From: daimh at umich.edu (Manhong Dai)
Date: Mon, 7 Jan 2019 15:46:00 -0500
Subject: [Gluster-users] 'dirfingerprint' to get glusterfs directory stats
Message-ID: <11132dc6-1d5f-490b-5cfe-f6427be75024@umich.edu>

Hi,
I released a Python program, 'dirfingerprint', at https://github.com/daimh/dirfingerprint/ . We have been using this program to get directory stats recursively from each brick node of a glusterfs filesystem, as it is always slower to access file metadata indirectly through the gluster filesystem than directly on the brick nodes.

In our environment, I did the steps below before accessing the brick nodes.

1. Generate an ssh key, and put it on all brick nodes.
2. ssh to each brick node so '.ssh/known_hosts' has an entry for each node.
3. As all our brick nodes have the actual data storage mounted under /brick, the dirfingerprint command I used is something like

dirfingerprint --gluster-brick=node1:/brick --gluster-brick=node2:/brick /home

Feel free to let me know if you have any questions or suggestions.

Best,
Manhong

From kannanv06 at gmail.com Mon Jan 7 10:18:20 2019
From: kannanv06 at gmail.com (Kannan V)
Date: Mon, 7 Jan 2019 15:48:20 +0530
Subject: [Gluster-users] Glusterfs backup and restore
Message-ID:

Hi,

I am able to take a glusterfs snapshot and activate it. Now I want to send the snapshot to another machine for backup (preferably as a tar file). When there is a problem, I want to take the backed-up data from the other machine and restore it.

I could not compress the data. I mean, the snapshot has been created at "/var/lib/glusterd/snaps/". Now if I compress the snapshot, the actual data is not present. Where exactly do I have to compress the data and restore it back?

Kindly provide your suggestions.

Thanks,
Kannan V

From amye at redhat.com Tue Jan 8 02:51:56 2019
From: amye at redhat.com (Amye Scavarda)
Date: Mon, 7 Jan 2019 18:51:56 -0800
Subject: [Gluster-users] Gluster Monthly Newsletter, December 2018
Message-ID:

Gluster Monthly Newsletter, December 2018

See you at FOSDEM!
We have a jampacked Software Defined Storage day on Sunday, Feb 3rd (with a few sessions on the previous day): https://fosdem.org/2019/schedule/track/software_defined_storage/
We also have a shared stand with Ceph, come find us!

Gluster 6 - We're in planning for our Gluster 6 release, currently scheduled for Feb 2019. More details on the mailing lists at https://lists.gluster.org/pipermail/gluster-devel/2018-November/055672.html

Want swag for your meetup? https://www.gluster.org/events/ has a contact form for us to let us know about your Gluster meetup! We'd love to hear about Gluster presentations coming up, conference talks and gatherings. Let us know!

Contributors
Top Contributing Companies: Red Hat, Comcast, DataLab, Gentoo Linux, Facebook, BioDec, Samsung, Etersoft
Top Contributors in December: Sunny Kumar, Amar Tumballi, Sheetal Pamecha, Harpreet Kaur Lalwani, Sanju Rakonde

Noteworthy Threads:
[Gluster-users] Update from GlusterFS project (November -2018)
https://lists.gluster.org/pipermail/gluster-users/2018-December/035446.html
[Gluster-users] Glusterd2 project updates (github.com/gluster/glusterd2)
https://lists.gluster.org/pipermail/gluster-users/2018-December/035448.html
[Gluster-users] GCS 0.4 release
https://lists.gluster.org/pipermail/gluster-users/2018-December/035457.html
[Gluster-users] Announcing Gluster release 5.2
https://lists.gluster.org/pipermail/gluster-users/2018-December/035461.html
[Gluster-users] Gluster meetup: India
https://lists.gluster.org/pipermail/gluster-users/2018-December/035476.html
[Gluster-users] Update on GCS 0.5 release
https://lists.gluster.org/pipermail/gluster-users/2018-December/035505.html
[Gluster-devel] Gluster Weekly Report : Static Analyser
https://lists.gluster.org/pipermail/gluster-devel/2018-December/055711.html
[Gluster-devel] FOSDEM stand - February 2 & 3, 2019
https://lists.gluster.org/pipermail/gluster-devel/2018-December/055715.html
[Gluster-devel] Infra Update for Nov and Dec
https://lists.gluster.org/pipermail/gluster-devel/2018-December/055735.html
[Gluster-devel] Latency analysis of GlusterFS' network layer for pgbench
https://lists.gluster.org/pipermail/gluster-devel/2018-December/055741.html
[Gluster-devel] Implementing multiplexing for self heal client.
https://lists.gluster.org/pipermail/gluster-devel/2018-December/055742.html
[Gluster-devel] include-what-you-use run on Gluster
https://lists.gluster.org/pipermail/gluster-devel/2018-December/055750.html
[Gluster-devel] [DHT] serialized readdir(p) across subvols and effect on performance
https://lists.gluster.org/pipermail/gluster-devel/2018-December/055762.html

Events:
FOSDEM, Feb 2-3 2019 in Brussels, Belgium - https://fosdem.org/2019/
Vault: February 25-26, 2019 - https://www.usenix.org/conference/vault19/

Open CFPs:
KubeCon EU - Barcelona: May 19-21 - CFP closes Jan 19!
https://events.linuxfoundation.org/events/kubecon-cloudnativecon-europe-2019/
CFP: https://events.linuxfoundation.org/events/kubecon-cloudnativecon-europe-2019/cfp/

--
Amye Scavarda | amye at redhat.com | Gluster Community Lead

From amalagi at commvault.com Wed Jan 9 13:23:40 2019
From: amalagi at commvault.com (Anand Malagi)
Date: Wed, 9 Jan 2019 13:23:40 +0000
Subject: [Gluster-users] replace-brick operation issue...
In-Reply-To: <9d978fd98e7440a6ac823858dc571e5a@POST-3.gp.cv.commvault.com>
References: <9d978fd98e7440a6ac823858dc571e5a@POST-3.gp.cv.commvault.com>
Message-ID: <115c2d2b5cb64426ae782c79793bbbbf@POST-3.gp.cv.commvault.com>

Can I please get some help in understanding the issue mentioned ?

From: Anand Malagi
Sent: Monday, December 31, 2018 1:39 PM
To: 'Anand Malagi' ; gluster-users at gluster.org
Subject: RE: replace-brick operation issue...

Can someone please help here ??
From: gluster-users-bounces at gluster.org > On Behalf Of Anand Malagi Sent: Friday, December 21, 2018 3:44 PM To: gluster-users at gluster.org Subject: [Gluster-users] replace-brick operation issue... Hi Friends, Please note that, when replace-brick operation was tried for one of the bad brick present in distributed disperse EC volume, the command actually failed but the brick daemon of new replaced brick came online. Please help to understand in what situations this issue may arise and proposed solution if possible ? : glusterd.log : [2018-12-11 11:04:43.774120] I [MSGID: 106503] [glusterd-replace-brick.c:147:__glusterd_handle_replace_brick] 0-management: Received replace-brick commit force request. [2018-12-11 11:04:44.784578] I [MSGID: 106504] [glusterd-utils.c:13079:rb_update_dstbrick_port] 0-glusterd: adding dst-brick port no 0 ... [2018-12-11 11:04:46.457537] E [MSGID: 106029] [glusterd-utils.c:7981:glusterd_brick_signal] 0-glusterd: Unable to open pidfile: /var/run/gluster/vols/AM6_HyperScale/am6sv0004sds.saipemnet.saipem.intranet-ws-disk3-ws_brick.pid [No such file or directory] [2018-12-11 11:04:53.089810] I [glusterd-utils.c:5876:glusterd_brick_start] 0-management: starting a fresh brick process for brick /ws/disk15/ws_brick ... [2018-12-11 11:04:53.117935] W [socket.c:595:__socket_rwv] 0-socket.management: writev on 127.0.0.1:864 failed (Broken pipe) [2018-12-11 11:04:54.014023] I [socket.c:2465:socket_event_handler] 0-transport: EPOLLERR - disconnecting now [2018-12-11 11:04:54.273190] I [MSGID: 106005] [glusterd-handler.c:6120:__glusterd_brick_rpc_notify] 0-management: Brick am6sv0004sds.saipemnet.saipem.intranet:/ws/disk15/ws_brick has disconnected from glusterd. [2018-12-11 11:04:54.297603] E [MSGID: 106116] [glusterd-mgmt.c:135:gd_mgmt_v3_collate_errors] 0-management: Commit failed on am6sv0006sds.saipemnet.saipem.intranet. Please check log file for details. 
[2018-12-11 11:04:54.350666] I [MSGID: 106143] [glusterd-pmap.c:278:pmap_registry_bind] 0-pmap: adding brick /ws/disk15/ws_brick on port 49164 [2018-12-11 11:05:01.137449] E [MSGID: 106123] [glusterd-mgmt.c:1519:glusterd_mgmt_v3_commit] 0-management: Commit failed on peers [2018-12-11 11:05:01.137496] E [MSGID: 106123] [glusterd-replace-brick.c:660:glusterd_mgmt_v3_initiate_replace_brick_cmd_phases] 0-management: Commit Op Failed [2018-12-11 11:06:12.275867] I [MSGID: 106499] [glusterd-handler.c:4370:__glusterd_handle_status_volume] 0-management: Received status volume req for volume AM6_HyperScale [2018-12-11 13:35:51.529365] I [MSGID: 106499] [glusterd-handler.c:4370:__glusterd_handle_status_volume] 0-management: Received status volume req for volume AM6_HyperScale gluster volume replace-brick AM6_HyperScale am6sv0004sds.saipemnet.saipem.intranet:/ws/disk3/ws_brick am6sv0004sds.saipemnet.saipem.intranet:/ws/disk15/ws_brick commit force Replace brick failure, brick [/ws/disk3], volume [AM6_HyperScale] "gluster volume status" now shows a new disk active /ws/disk15 The replacement appears to be successful, looks like healing started [cid:image001.png at 01D4A84C.A15F8680] Thanks and Regards, --Anand ********************************Legal Disclaimer******************************** "This communication may contain confidential and privileged material for the sole use of the intended recipient. Any unauthorized review, use or distribution by others is strictly prohibited. If you have received the message by mistake, please advise the sender by reply email and delete the message. We may process information in the email header of business emails sent and received by us (including the names of recipient and sender, date and time of the email) for the purposes of evaluating our existing or prospective business relationship. The lawful basis we rely on for this processing is our legitimate interests. 
For more information about how we use personal information please read our privacy policy https://www.commvault.com/privacy-policy. Thank you." ******************************************************************************** -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.png Type: image/png Size: 2727 bytes Desc: image001.png URL: From revirii at googlemail.com Wed Jan 9 13:38:25 2019 From: revirii at googlemail.com (Hu Bert) Date: Wed, 9 Jan 2019 14:38:25 +0100 Subject: [Gluster-users] usage of harddisks: each hdd a brick? raid? Message-ID: Hi @all, we have 3 servers, 4 disks (10TB) each, in a replicate 3 setup. We're having some problems after a disk failed; the restore via reset-brick takes way too long (way over a month), disk utilization is at 100%, it doesn't get any faster, some params have already been tweaked. Only about 50GB per day are copied, and for 2.5TB this takes loooong...
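As a rough cross-check of the figures just quoted, here is a minimal sketch of the arithmetic. It assumes decimal units (1 TB = 1000 GB), that the full 2.5TB still has to be copied, and that the copy rate stays constant; none of these assumptions come from the message itself, and the function name is only illustrative.

```python
# Back-of-the-envelope estimate of the restore time quoted in the message.
# Assumptions (not from the original message): decimal units (1 TB = 1000 GB),
# the full 2.5 TB still has to be copied, and the daily rate stays constant.

def days_to_copy(total_gb: float, rate_gb_per_day: float) -> float:
    """Return the number of days needed to copy total_gb at a fixed daily rate."""
    if rate_gb_per_day <= 0:
        raise ValueError("rate_gb_per_day must be positive")
    return total_gb / rate_gb_per_day

remaining_gb = 2.5 * 1000  # "2.5TB" from the message
rate = 50                  # "about 50GB per day" from the message
estimate_days = days_to_copy(remaining_gb, rate)
print(f"{estimate_days:.0f} days")  # prints "50 days"
```

Under those assumptions the estimate is about 50 days, which matches the "way over a month" reported for the reset-brick restore.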
We were thinking about migrating to 3 servers with a RAID10 (HW or SW), again in a replicate 3 setup. We would waste a lot of space, but the idea is that, if a hdd fails:

- the data are still available on the hdd copy
- performance is better than with a failed/restoring hdd
- the restore via SW/HW RAID is faster than the restore via glusterfs

Any opinions on that? Maybe it would be better to use more servers and smaller disks, but this isn't possible at the moment.

thx
Hubert

From isakdim at gmail.com Wed Jan 9 14:17:19 2019
From: isakdim at gmail.com (Dmitry Isakbayev)
Date: Wed, 9 Jan 2019 09:17:19 -0500
Subject: [Gluster-users] A broken file that can not be deleted
Message-ID:

I am seeing a broken file that exists on 2 out of 3 nodes. The application trying to use the file throws file permissions error. ls, rm, mv, touch all throw "Input/output error"

$ ls -la
ls: cannot access .download_suspensions.memo: Input/output error
drwxrwxr-x. 2 ossadmin ossadmin 4096 Jan 9 08:06 .
drwxrwxr-x. 5 ossadmin ossadmin 4096 Jan 3 11:36 ..
-?????????? ? ? ? ? ? .download_suspensions.memo

$ rm ".download_suspensions.memo"
rm: cannot remove '.download_suspensions.memo': Input/output error

From combr at ya.ru Wed Jan 9 19:52:02 2019
From: combr at ya.ru (Mike)
Date: Wed, 9 Jan 2019 23:52:02 +0400
Subject: [Gluster-users] usage of harddisks: each hdd a brick? raid?
In-Reply-To:
References:
Message-ID: <22faba73-4e60-9a55-75bf-e52ce59858b3@ya.ru>

09.01.2019 17:38, Hu Bert wrote:
> Hi @all,
>
> we have 3 servers, 4 disks (10TB) each, in a replicate 3 setup. We're
> having some problems after a disk failed; the restore via reset-brick
> takes way too long (way over a month)

terrible. We have similar setup, and I do not test restoring...
How many volumes do you have - one volume on one (*3) disk 10 TB in size - then 4 volumes?
> We were thinking about migrating to 3 servers with a RAID10 (HW or > SW), again in a replicate 3 setup. We would waste a lot of space, but > the idea is that, if a hdd fails: > > - the data are still available on the hdd copy > - performance is better than with a failed/restoring hdd > - the restore via SW/HW RAID is faster than the restore via glusterfs Our setup is worse in wasted space - we have a 3 10Tb disks in each server + 1 SSD "for raid controller cache". there is no ability to create raid 10, raid 5 is a no-no on HDDS, and only viable variant is a 1ADM (3*RAID1 + 1 SSD cache) (for using all disks). Or 2*RAID1 + 1 HDD for gluster ... > Any opinions on that? Maybe it would be better to use more servers and > smaller disks, but this isn't possible at the moment. Also interested. We can swap SSDs to HDDs for RAID10, but is it worthless? And what about right way to restore for configs like replica 3 *10Tb volumes? > _______________________________________________ > Gluster-users mailing list > Gluster-users at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-users > From mwaymack at nsgdv.com Wed Jan 9 20:54:45 2019 From: mwaymack at nsgdv.com (Matt Waymack) Date: Wed, 9 Jan 2019 20:54:45 +0000 Subject: [Gluster-users] Input/output error on FUSE log In-Reply-To: <07ad5733d55f4f0ba019dd0fba606b3e@nsgdv.com> References: <07ad5733d55f4f0ba019dd0fba606b3e@nsgdv.com> Message-ID: <4ae6494cbe324596831833b8858c8228@nsgdv.com> Has anyone any other ideas where to look? This is only affecting FUSE clients. SMB clients are unaffected by this problem. Thanks! From: gluster-users-bounces at gluster.org On Behalf Of Matt Waymack Sent: Monday, January 7, 2019 1:19 PM To: Raghavendra Gowdappa Cc: gluster-users at gluster.org List Subject: Re: [Gluster-users] Input/output error on FUSE log Attached are the logs from when a failure occurred with diagnostics set to trace. Thank you! 
From revirii at googlemail.com Thu Jan 10 06:53:19 2019
From: revirii at googlemail.com (Hu Bert)
Date: Thu, 10 Jan 2019 07:53:19 +0100
Subject: [Gluster-users] usage of harddisks: each hdd a brick? raid?
In-Reply-To: <22faba73-4e60-9a55-75bf-e52ce59858b3@ya.ru>
References: <22faba73-4e60-9a55-75bf-e52ce59858b3@ya.ru>
Message-ID:

Hi Mike,

> We have similar setup, and I do not test restoring...
> How many volumes do you have - one volume on one (*3) disk 10 TB in size
> - then 4 volumes?

Testing could be quite easy: reset-brick start, then delete&re-create partition/fs/etc., reset-brick commit force - and then watch.

We only have 1 big volume over all bricks. Details:

Volume Name: shared
Type: Distributed-Replicate
Number of Bricks: 4 x 3 = 12
Brick1: gluster11:/gluster/bricksda1/shared
Brick2: gluster12:/gluster/bricksda1/shared
Brick3: gluster13:/gluster/bricksda1/shared
Brick4: gluster11:/gluster/bricksdb1/shared
Brick5: gluster12:/gluster/bricksdb1/shared
Brick6: gluster13:/gluster/bricksdb1/shared
Brick7: gluster11:/gluster/bricksdc1/shared
Brick8: gluster12:/gluster/bricksdc1/shared
Brick9: gluster13:/gluster/bricksdc1/shared
Brick10: gluster11:/gluster/bricksdd1/shared
Brick11: gluster12:/gluster/bricksdd1_new/shared
Brick12: gluster13:/gluster/bricksdd1_new/shared

Didn't think about creating more volumes (in order to split data), e.g. 4 volumes with 3*10TB each, or 2 volumes with 6*10TB each.

Just curious: after splitting into 2 or more volumes - would that make the volume with the healthy/non-restoring disks more accessible? And only the volume with the once faulty and now restoring disk would be in a "bad mood"?

> > Any opinions on that? Maybe it would be better to use more servers and
> > smaller disks, but this isn't possible at the moment.
> Also interested. We can swap SSDs to HDDs for RAID10, but is it worthless?

Yeah, would be interested in how the glusterfs professionals deal with faulty disks, especially when these are as big as ours.

Thx
Hubert

From cobanserkan at gmail.com Thu Jan 10 07:26:55 2019
From: cobanserkan at gmail.com (Serkan Çoban)
Date: Thu, 10 Jan 2019 10:26:55 +0300
Subject: [Gluster-users] usage of harddisks: each hdd a brick? raid?
In-Reply-To:
References: <22faba73-4e60-9a55-75bf-e52ce59858b3@ya.ru>
Message-ID:

We are also using 10TB disks, heal takes 7-8 days.
You can play with "cluster.shd-max-threads" setting. It is default 1 I think. I am using it with 4. Below you can find more info: https://access.redhat.com/solutions/882233 On Thu, Jan 10, 2019 at 9:53 AM Hu Bert wrote: > > Hi Mike, > > > We have similar setup, and I do not test restoring... > > How many volumes do you have - one volume on one (*3) disk 10 TB in size > > - then 4 volumes? > > Testing could be quite easy: reset-brick start, then delete&re-create > partition/fs/etc., reset-brick commit force - and then watch. > > We only have 1 big volume over all bricks. Details: > > Volume Name: shared > Type: Distributed-Replicate > Number of Bricks: 4 x 3 = 12 > Brick1: gluster11:/gluster/bricksda1/shared > Brick2: gluster12:/gluster/bricksda1/shared > Brick3: gluster13:/gluster/bricksda1/shared > Brick4: gluster11:/gluster/bricksdb1/shared > Brick5: gluster12:/gluster/bricksdb1/shared > Brick6: gluster13:/gluster/bricksdb1/shared > Brick7: gluster11:/gluster/bricksdc1/shared > Brick8: gluster12:/gluster/bricksdc1/shared > Brick9: gluster13:/gluster/bricksdc1/shared > Brick10: gluster11:/gluster/bricksdd1/shared > Brick11: gluster12:/gluster/bricksdd1_new/shared > Brick12: gluster13:/gluster/bricksdd1_new/shared > > Didn't think about creating more volumes (in order to split data), > e.g. 4 volumes with 3*10TB each, or 2 volumes with 6*10TB each. > > Just curious: after splitting into 2 or more volumes - would that make > the volume with the healthy/non-restoring disks better accessable? And > only the volume with the once faulty and now restoring disk would be > in a "bad mood"? > > > > Any opinions on that? Maybe it would be better to use more servers and > > > smaller disks, but this isn't possible at the moment. > > Also interested. We can swap SSDs to HDDs for RAID10, but is it worthless? > > Yeah, would be interested in how the glusterfs professionsals deal > with faulty disks, especially when these are as big as our ones. 
> > > Thx > Hubert > _______________________________________________ > Gluster-users mailing list > Gluster-users at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-users From revirii at googlemail.com Thu Jan 10 08:25:32 2019 From: revirii at googlemail.com (Hu Bert) Date: Thu, 10 Jan 2019 09:25:32 +0100 Subject: [Gluster-users] usage of harddisks: each hdd a brick? raid? In-Reply-To: <258abbde-5a3a-0df2-988a-cb4d1b8b5347@ya.ru> References: <22faba73-4e60-9a55-75bf-e52ce59858b3@ya.ru> <258abbde-5a3a-0df2-988a-cb4d1b8b5347@ya.ru> Message-ID: Hi, > > We ara also using 10TB disks, heal takes 7-8 days. > > You can play with "cluster.shd-max-threads" setting. It is default 1 I > > think. I am using it with 4. > > Below you can find more info: > > https://access.redhat.com/solutions/882233 > cluster.shd-max-threads: 8 > cluster.shd-wait-qlength: 10000 Our setup: cluster.shd-max-threads: 2 cluster.shd-wait-qlength: 10000 > >> Volume Name: shared > >> Type: Distributed-Replicate > A, you have distributed-replicated volume, but I choose only replicated > (for beginning simplicity :) > May be replicated volume are healing faster? Well, maybe our setup with 3 servers and 4 disks=bricks == 12 bricks, resulting in a distributed-replicate volume (all /dev/sd{a,b,c,d} identical) , isn't optimal? And it would be better to create a replicate 3 volume with only 1 (big) brick per server (with 4 disks: either a logical volume or sw/hw raid)? But it would be interesting to know if a replicate volume is healing faster than a distributed-replicate volume - even if there was only 1 faulty brick. Thx Hubert From nbalacha at redhat.com Thu Jan 10 08:41:12 2019 From: nbalacha at redhat.com (Nithya Balachandran) Date: Thu, 10 Jan 2019 14:11:12 +0530 Subject: [Gluster-users] A broken file that can not be deleted In-Reply-To: References: Message-ID: On Wed, 9 Jan 2019 at 19:49, Dmitry Isakbayev wrote: > I am seeing a broken file that exists on 2 out of 3 nodes. 
The
> application trying to use the file throws file permissions error. ls, rm,
> mv, touch all throw "Input/output error"
>
> $ ls -la
> ls: cannot access .download_suspensions.memo: Input/output error
> drwxrwxr-x. 2 ossadmin ossadmin 4096 Jan 9 08:06 .
> drwxrwxr-x. 5 ossadmin ossadmin 4096 Jan 3 11:36 ..
> -?????????? ? ? ? ? ?
> .download_suspensions.memo
>
> $ rm ".download_suspensions.memo"
> rm: cannot remove '.download_suspensions.memo': Input/output error
>
>

Do you see any errors in the mount log?

Regards,
Nithya

> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-users

From combr at ya.ru Thu Jan 10 07:54:54 2019
From: combr at ya.ru (Mike Lykov)
Date: Thu, 10 Jan 2019 11:54:54 +0400
Subject: [Gluster-users] usage of harddisks: each hdd a brick? raid?
In-Reply-To:
References: <22faba73-4e60-9a55-75bf-e52ce59858b3@ya.ru>
Message-ID: <258abbde-5a3a-0df2-988a-cb4d1b8b5347@ya.ru>

10.01.2019 11:26, Serkan Çoban wrote:
> We ara also using 10TB disks, heal takes 7-8 days.
> You can play with "cluster.shd-max-threads" setting. It is default 1 I
> think. I am using it with 4.
> Below you can find more info:
> https://access.redhat.com/solutions/882233

I'm using oVirt; its setup script sets these values by default:

cluster.shd-max-threads: 8
cluster.shd-wait-qlength: 10000

>> Testing could be quite easy: reset-brick start, then delete&re-create
>> partition/fs/etc., reset-brick commit force - and then watch.
>>
>> We only have 1 big volume over all bricks. Details:
>>
>> Volume Name: shared
>> Type: Distributed-Replicate

Ah, you have a distributed-replicated volume, but I chose a replicated-only one (for simplicity, to begin with :)

>> Brick12: gluster13:/gluster/bricksdd1_new/shared
>>
>> Didn't think about creating more volumes (in order to split data),
>> e.g.
4 volumes with 3*10TB each, or 2 volumes with 6*10TB each. May be replicated volume are healing faster? >> Yeah, would be interested in how the glusterfs professionsals deal >> with faulty disks, especially when these are as big as our ones. >> From rgowdapp at redhat.com Thu Jan 10 09:00:07 2019 From: rgowdapp at redhat.com (Raghavendra Gowdappa) Date: Thu, 10 Jan 2019 14:30:07 +0530 Subject: [Gluster-users] A broken file that can not be deleted In-Reply-To: References: Message-ID: On Wed, Jan 9, 2019 at 7:48 PM Dmitry Isakbayev wrote: > I am seeing a broken file that exists on 2 out of 3 nodes. > Wondering whether its a case of split brain. > The application trying to use the file throws file permissions error. ls, > rm, mv, touch all throw "Input/output error" > > $ ls -la > ls: cannot access .download_suspensions.memo: Input/output error > drwxrwxr-x. 2 ossadmin ossadmin 4096 Jan 9 08:06 . > drwxrwxr-x. 5 ossadmin ossadmin 4096 Jan 3 11:36 .. > -?????????? ? ? ? ? ? > .download_suspensions.memo > > $ rm ".download_suspensions.memo" > rm: cannot remove ?.download_suspensions.memo?: Input/output error > > > > > > _______________________________________________ > Gluster-users mailing list > Gluster-users at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-users -------------- next part -------------- An HTML attachment was scrubbed... URL: From davide.obbi at booking.com Thu Jan 10 09:02:40 2019 From: davide.obbi at booking.com (Davide Obbi) Date: Thu, 10 Jan 2019 10:02:40 +0100 Subject: [Gluster-users] [External] Re: A broken file that can not be deleted In-Reply-To: References: Message-ID: does selfheal reports anything? did you re-mount on the client side? how the permissions are displayed for the file on the servers? On Thu, Jan 10, 2019 at 10:00 AM Raghavendra Gowdappa wrote: > > > On Wed, Jan 9, 2019 at 7:48 PM Dmitry Isakbayev wrote: > >> I am seeing a broken file that exists on 2 out of 3 nodes. 
>>
>
> Wondering whether it's a case of split brain.
>
>
>> The application trying to use the file throws file permissions error.
>> ls, rm, mv, touch all throw "Input/output error"
>>
>> $ ls -la
>> ls: cannot access .download_suspensions.memo: Input/output error
>> drwxrwxr-x. 2 ossadmin ossadmin 4096 Jan 9 08:06 .
>> drwxrwxr-x. 5 ossadmin ossadmin 4096 Jan 3 11:36 ..
>> -?????????? ? ? ? ? ? .download_suspensions.memo
>>
>> $ rm ".download_suspensions.memo"
>> rm: cannot remove ‘.download_suspensions.memo’: Input/output error
>>

--
Davide Obbi
Senior System Administrator

Booking.com B.V.
Vijzelstraat 66-80 Amsterdam 1017HL Netherlands
Direct +31207031558
Empowering people to experience the world since 1996
43 languages, 214+ offices worldwide, 141,000+ global destinations, 29 million reported listings
Subsidiary of Booking Holdings Inc. (NASDAQ: BKNG)

From nbalacha at redhat.com Thu Jan 10 10:47:19 2019
From: nbalacha at redhat.com (Nithya Balachandran)
Date: Thu, 10 Jan 2019 16:17:19 +0530
Subject: [Gluster-users] Input/output error on FUSE log
In-Reply-To: <4ae6494cbe324596831833b8858c8228@nsgdv.com>
References: <07ad5733d55f4f0ba019dd0fba606b3e@nsgdv.com> <4ae6494cbe324596831833b8858c8228@nsgdv.com>
Message-ID:

I don't see write failures in the log but I do see fallocate failing with EIO.
[2019-01-07 19:16:44.846187] W [MSGID: 109011] [dht-layout.c:163:dht_layout_search] 0-gv1-dht: no subvolume for hash (value) = 1285124113 [2019-01-07 19:16:44.846194] D [MSGID: 0] [dht-helper.c:969:dht_subvol_get_hashed] 0-gv1-dht: No hashed subvolume for path=/.shard/aa3ef10e-95e0-40d3-9464-133d72fa8a95.185 [2019-01-07 19:16:44.846200] D [MSGID: 0] [dht-common.c:7631:dht_mknod] 0-gv1-dht: no subvolume in layout for path=/.shard/aa3ef10e-95e0-40d3-9464-133d72fa8a95.185 <--- *** *DHT failed to find a hashed subvol* *** [2019-01-07 19:16:44.846207] D [MSGID: 0] [dht-common.c:7712:dht_mknod] 0-stack-trace: stack-address: 0x7f6748006778, gv1-dht returned -1 error: Input/output error [Input/output error] [2019-01-07 19:16:44.846215] D [MSGID: 0] [shard.c:3645:shard_common_mknod_cbk] 0-gv1-shard: mknod of shard 185 failed: Input/output error [2019-01-07 19:16:44.846223] D [MSGID: 0] [shard.c:720:shard_common_inode_write_failure_unwind] 0-stack-trace: stack-address: 0x7f6748006778, gv1-shard returned -1 error: Input/output error [Input/output error] [2019-01-07 19:16:44.846234] D [MSGID: 0] [defaults.c:1352:default_fallocate_cbk] 0-stack-trace: stack-address: 0x7f6748006778, gv1-quick-read returned -1 error: Input/output error [Input/output error] [2019-01-07 19:16:44.846244] D [MSGID: 0] [defaults.c:1352:default_fallocate_cbk] 0-stack-trace: stack-address: 0x7f6748006778, gv1-open-behind returned -1 error: Input/output error [Input/output error] [2019-01-07 19:16:44.846254] D [MSGID: 0] [md-cache.c:2715:mdc_fallocate_cbk] 0-stack-trace: stack-address: 0x7f6748006778, gv1-md-cache returned -1 error: Input/output error [Input/output error] [2019-01-07 19:16:44.846264] D [MSGID: 0] [defaults.c:1352:default_fallocate_cbk] 0-stack-trace: stack-address: 0x7f6748006778, gv1-io-threads returned -1 error: Input/output error [Input/output error] [2019-01-07 19:16:44.846274] D [MSGID: 0] [io-stats.c:2528:io_stats_fallocate_cbk] 0-stack-trace: stack-address: 0x7f6748006778, gv1 
returned -1 error: Input/output error [Input/output error] [2019-01-07 19:16:44.846284] W [fuse-bridge.c:1441:fuse_err_cbk] 0-glusterfs-fuse: 1373: FALLOCATE() ERR => -1 (Input/output error) [2019-01-07 19:16:44.846298] T [fuse-bridge.c:278:send_fuse_iov] 0-glusterfs-fuse: writev() result 16/16 Please get the xattrs on the .shard directory on each brick of the volume so we can check if the layout is complete: getfattr -e hex -m . -d /.shard Thanks, Nithya On Thu, 10 Jan 2019 at 02:25, Matt Waymack wrote: > Has anyone any other ideas where to look? This is only affecting FUSE > clients. SMB clients are unaffected by this problem. > > > > Thanks! > > > > *From:* gluster-users-bounces at gluster.org < > gluster-users-bounces at gluster.org> *On Behalf Of *Matt Waymack > *Sent:* Monday, January 7, 2019 1:19 PM > *To:* Raghavendra Gowdappa > *Cc:* gluster-users at gluster.org List > *Subject:* Re: [Gluster-users] Input/output error on FUSE log > > > > Attached are the logs from when a failure occurred with diagnostics set to > trace. > > > > Thank you! > > > > *From:* Raghavendra Gowdappa > *Sent:* Saturday, January 5, 2019 8:32 PM > *To:* Matt Waymack > *Cc:* gluster-users at gluster.org List > *Subject:* Re: [Gluster-users] Input/output error on FUSE log > > > > > > > > On Sun, Jan 6, 2019 at 7:58 AM Raghavendra Gowdappa > wrote: > > > > > > On Sun, Jan 6, 2019 at 4:19 AM Matt Waymack wrote: > > Hi all, > > > > I'm having a problem writing to our volume. When writing files larger > than about 2GB, I get an intermittent issue where the write will fail and > return Input/Output error. This is also shown in the FUSE log of the > client (this is affecting all clients). 
A snip of a client log is below: > > [2019-01-05 22:39:44.581371] W [fuse-bridge.c:2474:fuse_writev_cbk] > 0-glusterfs-fuse: 51040978: WRITE => -1 > gfid=82a0b5c4-7ef3-43c2-ad86-41e16673d7c2 fd=0x7f949839a368 (Input/output > error) > > [2019-01-05 22:39:44.598392] W [fuse-bridge.c:1441:fuse_err_cbk] > 0-glusterfs-fuse: 51040979: FLUSH() ERR => -1 (Input/output error) > > [2019-01-05 22:39:47.420920] W [fuse-bridge.c:2474:fuse_writev_cbk] > 0-glusterfs-fuse: 51041266: WRITE => -1 > gfid=0e8e1e13-97a5-478a-bc58-e81ddf3698a3 fd=0x7f949809b7f8 (Input/output > error) > > [2019-01-05 22:39:47.433377] W [fuse-bridge.c:1441:fuse_err_cbk] > 0-glusterfs-fuse: 51041267: FLUSH() ERR => -1 (Input/output error) > > [2019-01-05 22:39:50.441531] W [fuse-bridge.c:2474:fuse_writev_cbk] > 0-glusterfs-fuse: 51041548: WRITE => -1 > gfid=0e8e1e13-97a5-478a-bc58-e81ddf3698a3 fd=0x7f949839a368 (Input/output > error) > > [2019-01-05 22:39:50.451914] W [fuse-bridge.c:1441:fuse_err_cbk] > 0-glusterfs-fuse: 51041549: FLUSH() ERR => -1 (Input/output error) > > The message "W [MSGID: 109011] [dht-layout.c:163:dht_layout_search] > 0-gv1-dht: no subvolume for hash (value) = 1311504267" repeated 1721 times > between [2019-01-05 22:39:33.906241] and [2019-01-05 22:39:44.598371] > > The message "E [MSGID: 101046] [dht-common.c:1502:dht_lookup_dir_cbk] > 0-gv1-dht: dict is null" repeated 1714 times between [2019-01-05 > 22:39:33.925981] and [2019-01-05 22:39:50.451862] > > The message "W [MSGID: 109011] [dht-layout.c:163:dht_layout_search] > 0-gv1-dht: no subvolume for hash (value) = 1137142622" repeated 1707 times > between [2019-01-05 22:39:39.636552] and [2019-01-05 22:39:50.451895] > > > > This looks to be a DHT issue. Some questions: > > * Are all subvolumes of DHT up and client is connected to them? > Particularly the subvolume which contains the file in question. > > * Can you get all extended attributes of parent directory of the file from > all bricks? 
>
> * set diagnostics.client-log-level to TRACE, capture these errors again
> and attach the client log file.
>
>
>
> I spoke a bit early. dht_writev doesn't search the hashed subvolume as it's
> already been looked up in lookup. So, these msgs look to be of a different
> issue - not writev failure.
>
>
>
> This is intermittent for most files, but eventually if a file is large
> enough it will not write. The workflow is SFTP to the client which then
> writes to the volume over FUSE. When files get to a certain point, we can
> no longer write to them. The file sizes are different as well, so it's not
> like they all get to the same size and just stop either. I've ruled out a
> free space issue, our files at their largest are only a few hundred GB and
> we have tens of terabytes free on each brick. We are also sharding at 1GB.
>
>
>
> I'm not sure where to go from here as the error seems vague and I can only
> see it on the client log. I'm not seeing these errors on the nodes
> themselves. This is also seen if I mount the volume via FUSE on any of the
> nodes as well and it is only reflected in the FUSE log.
>
> Here is the volume info:
>
> Volume Name: gv1
> Type: Distributed-Replicate
> Volume ID: 1472cc78-e2a0-4c3f-9571-dab840239b3c
> Status: Started
> Snapshot Count: 0
> Number of Bricks: 8 x (2 + 1) = 24
> Transport-type: tcp
> Bricks:
> Brick1: tpc-glus4:/exp/b1/gv1
> Brick2: tpc-glus2:/exp/b1/gv1
> Brick3: tpc-arbiter1:/exp/b1/gv1 (arbiter)
> Brick4: tpc-glus2:/exp/b2/gv1
> Brick5: tpc-glus4:/exp/b2/gv1
> Brick6: tpc-arbiter1:/exp/b2/gv1 (arbiter)
> Brick7: tpc-glus4:/exp/b3/gv1
> Brick8: tpc-glus2:/exp/b3/gv1
> Brick9: tpc-arbiter1:/exp/b3/gv1 (arbiter)
> Brick10: tpc-glus4:/exp/b4/gv1
> Brick11: tpc-glus2:/exp/b4/gv1
> Brick12: tpc-arbiter1:/exp/b4/gv1 (arbiter)
> Brick13: tpc-glus1:/exp/b5/gv1
> Brick14: tpc-glus3:/exp/b5/gv1
> Brick15: tpc-arbiter2:/exp/b5/gv1 (arbiter)
> Brick16: tpc-glus1:/exp/b6/gv1
> Brick17: tpc-glus3:/exp/b6/gv1
> Brick18: tpc-arbiter2:/exp/b6/gv1 (arbiter)
> Brick19: tpc-glus1:/exp/b7/gv1
> Brick20: tpc-glus3:/exp/b7/gv1
> Brick21: tpc-arbiter2:/exp/b7/gv1 (arbiter)
> Brick22: tpc-glus1:/exp/b8/gv1
> Brick23: tpc-glus3:/exp/b8/gv1
> Brick24: tpc-arbiter2:/exp/b8/gv1 (arbiter)
> Options Reconfigured:
> performance.cache-samba-metadata: on
> performance.cache-invalidation: off
> features.shard-block-size: 1000MB
> features.shard: on
> transport.address-family: inet
> nfs.disable: on
> cluster.lookup-optimize: on
>
> I'm a bit stumped on this, any help is appreciated. Thank you!
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From isakdim at gmail.com Thu Jan 10 12:32:23 2019
From: isakdim at gmail.com (Dmitry Isakbayev)
Date: Thu, 10 Jan 2019 07:32:23 -0500
Subject: [Gluster-users] [External] Re: A broken file that can not be deleted
In-Reply-To:
References:
Message-ID:

Nithya, Raghavendra, Davide,

Thank you for your help.

*>how the permissions are displayed for the file on the servers?*

Trying to answer this question is what fixed the problem. It looked just fine on all 3 servers. And it looks like running "ls" on the servers fixed it on the clients. I had to repeat it on all 3 servers. It fixed file permissions on 2 clients and made the file show up on the 3rd client.

Even though it fixed the file permissions and I could now view the contents of the file, the software was still having issues with renaming ".download_suspensions.memo.writing" to ".download_suspensions.memo".

When I tried to replace the file manually, I got

$ mv .download_suspensions.memo.writing .download_suspensions.memo
mv: ‘.download_suspensions.memo.writing’ and ‘.download_suspensions.memo’ are the same file

I ended up removing both files and having the software rebuild them.

> *Wondering whether its a case of split brain.*

Very possible. All 3 servers were rebooted. It brought down the Linux cluster running on the same 3 servers as well.

> On Thu, Jan 10, 2019 at 10:00 AM Raghavendra Gowdappa
> wrote:
>
>>
>>
>> On Wed, Jan 9, 2019 at 7:48 PM Dmitry Isakbayev
>> wrote:
>>
>>> I am seeing a broken file that exists on 2 out of 3 nodes.
>>>
>>
>> Wondering whether its a case of split brain.
>>
>>
>>> The application trying to use the file throws file permissions error.
>>> ls, rm, mv, touch all throw "Input/output error"
>>>
>>> $ ls -la
>>> ls: cannot access .download_suspensions.memo: Input/output error
>>> drwxrwxr-x. 2 ossadmin ossadmin 4096 Jan 9 08:06 .
>>> drwxrwxr-x. 5 ossadmin ossadmin 4096 Jan 3 11:36 ..
>>> -?????????? ? ? ? ? ?
>>> .download_suspensions.memo
>>>
>>> $ rm ".download_suspensions.memo"
>>> rm: cannot remove ‘.download_suspensions.memo’: Input/output error
>>>
>
> --
> Davide Obbi
> Senior System Administrator
>
> Booking.com B.V.
> Vijzelstraat 66-80 Amsterdam 1017HL Netherlands
> Direct +31207031558
> Empowering people to experience the world since 1996
> 43 languages, 214+ offices worldwide, 141,000+ global destinations, 29
> million reported listings
> Subsidiary of Booking Holdings Inc. (NASDAQ: BKNG)
>

From amukherj at redhat.com Fri Jan 11 03:29:06 2019
From: amukherj at redhat.com (Atin Mukherjee)
Date: Fri, 11 Jan 2019 08:59:06 +0530
Subject: [Gluster-users] GCS 0.5 release
Message-ID:

Today, we are announcing the availability of GCS (Gluster Container Storage) 0.5.

Highlights and updates since v0.4:

- GCS environment updated to kube 1.13
- CSI deployment moved to 1.0
- Integrated Anthill deployment
- Kube & etcd metrics added to prometheus
- Tuning of etcd to increase stability
- GD2 bug fixes from scale testing effort.

Included components:

- Glusterd2: https://github.com/gluster/glusterd2
- Gluster CSI driver: https://github.com/gluster/gluster-csi-driver
- Gluster-prometheus: https://github.com/gluster/gluster-prometheus
- Anthill - https://github.com/gluster/anthill/
- Gluster-Mixins - https://github.com/gluster/gluster-mixins/

For more details on the specific content of this release please refer to [3].

If you are interested in contributing, please see [4] or contact the gluster-devel mailing list.
We're always interested in any bugs that you find, pull requests for new features and your feedback.

Regards,
Team GCS

[1] https://github.com/gluster/gcs/releases
[2] https://github.com/gluster/gcs/tree/master/deploy
[3] https://waffle.io/gluster/gcs?label=GCS%2F0.5 - search for "Done" lane
[4] https://github.com/gluster/gcs

From amudhan83 at gmail.com Fri Jan 11 06:32:46 2019
From: amudhan83 at gmail.com (Amudhan P)
Date: Fri, 11 Jan 2019 12:02:46 +0530
Subject: [Gluster-users] Increasing Bitrot speed glusterfs 4.1.6
Message-ID:

Hi,

How do I increase the speed of the bitrot file signature process in glusterfs 4.1.6?
Currently, it's processing 250 KB/s. Is there any way to make the changes through the gluster CLI?

regards
Amudhan

From olaf.buitelaar at gmail.com Sun Jan 13 16:40:59 2019
From: olaf.buitelaar at gmail.com (Olaf Buitelaar)
Date: Sun, 13 Jan 2019 17:40:59 +0100
Subject: [Gluster-users] [Stale file handle] in shard volume
In-Reply-To:
References:
Message-ID:

@Krutika if you need any further information, please let me know.

Thanks Olaf

On Fri, 4 Jan 2019 at 07:51, Nithya Balachandran wrote:

> Adding Krutika.
>
> On Wed, 2 Jan 2019 at 20:56, Olaf Buitelaar
> wrote:
>
>> Hi Nithya,
>>
>> Thank you for your reply.
>>
>> The VMs using the gluster volumes keep getting paused/stopped on
>> errors like these:
>> [2019-01-02 02:33:44.469132] E [MSGID: 133010]
>> [shard.c:1724:shard_common_lookup_shards_cbk] 0-ovirt-kube-shard: Lookup on
>> shard 101487 failed. Base file gfid = a38d64bc-a28b-4ee1-a0bb-f919e7a1022c
>> [Stale file handle]
>> [2019-01-02 02:33:44.563288] E [MSGID: 133010]
>> [shard.c:1724:shard_common_lookup_shards_cbk] 0-ovirt-kube-shard: Lookup on
>> shard 101488 failed. Base file gfid = a38d64bc-a28b-4ee1-a0bb-f919e7a1022c
>> [Stale file handle]
>>
>> Krutika, Can you take a look at this?
>
>
>>
>> What I'm trying to find out is whether I can purge all gluster volumes of all
>> possible stale file handles (and hopefully find a method to prevent this in
>> the future), so the VMs can start running stably again.
>> For this I need to know when the "shard_common_lookup_shards_cbk"
>> function considers a file as stale.
>> The statement "Stale file handle errors show up when a file with a
>> specified gfid is not found." doesn't seem to cover it all, as I've shown
>> in earlier mails that the shard file and the glusterfs/xx/xx/uuid file both
>> exist and have the same inode.
>> If the criteria I'm using aren't correct, could you please tell me which
>> criteria I should use to determine if a file is stale or not?
>> These criteria are just based on observations I made while moving the stale files
>> manually. After removing them I was able to start the VM again, until some
>> time later it unfortunately hangs on another stale shard file.
>>
>> Thanks Olaf
>>
>> On Wed, 2 Jan 2019 at 14:20, Nithya Balachandran <
>> nbalacha at redhat.com> wrote:
>>
>>>
>>>
>>> On Mon, 31 Dec 2018 at 01:27, Olaf Buitelaar
>>> wrote:
>>>
>>>> Dear All,
>>>>
>>>> Until now a select group of VMs still seems to produce new stale
>>>> files and get paused due to this.
>>>> I've not updated gluster recently, however I did change the op version
>>>> from 31200 to 31202 about a week before this issue arose.
>>>> Looking at the .shard directory, I have 100,000+ files sharing the same
>>>> characteristics as the stale files found so far:
>>>> they all have the sticky bit set (file permissions ---------T),
>>>> are 0kb in size, and have the trusted.glusterfs.dht.linkto attribute.
>>>>
>>>
>>> These are internal files used by gluster and do not necessarily mean
>>> they are stale. They "point" to data files which may be on different bricks
>>> (same name, gfid etc but no linkto xattr and no ----T permissions).
>>>
>>>
>>>> These files range from long ago (beginning of the year) till now,
>>>> which makes me suspect this was lying dormant for some time and
>>>> somehow recently surfaced.
>>>> Checking other sub-volumes, they also contain 0kb files in the .shard
>>>> directory, but these don't have the sticky bit and the linkto attribute.
>>>>
>>>> Does anybody else experience this issue? Could this be a bug or an
>>>> environmental issue?
>>>>
>>> These are most likely valid files - please do not delete them without
>>> double-checking.
>>>
>>> Stale file handle errors show up when a file with a specified gfid is
>>> not found. You will need to debug the files for which you see this error by
>>> checking the bricks to see if they actually exist.
>>>
>>>>
>>>> Also I wonder if there is any tool or gluster command to clean all
>>>> stale file handles?
>>>> Otherwise I'm planning to make a simple bash script, which iterates
>>>> over the .shard dir, checks each file for the above mentioned criteria, and
>>>> (re)moves the file and the corresponding .glusterfs file.
>>>> If there are other criteria needed to identify a stale file handle, I
>>>> would like to hear that.
>>>> If this is a viable and safe operation to do, of course.
>>>>
>>>> Thanks Olaf
>>>>
>>>>
>>>>
>>>> On Thu, 20 Dec 2018 at 13:43, Olaf Buitelaar <
>>>> olaf.buitelaar at gmail.com> wrote:
>>>>
>>>>> Dear All,
>>>>>
>>>>> I figured it out, it appeared to be the exact same issue as described
>>>>> here:
>>>>> https://lists.gluster.org/pipermail/gluster-users/2018-March/033785.html
>>>>> Another subvolume also had the shard file, only it was 0 bytes and
>>>>> had the dht.linkto attribute.
>>>>>
>>>>> For reference:
>>>>> [root at lease-04 ovirt-backbone-2]# getfattr -d -m .
-e hex >>>>> .shard/f28cabcb-d169-41fc-a633-9bef4c4a8e40.51500 >>>>> # file: .shard/f28cabcb-d169-41fc-a633-9bef4c4a8e40.51500 >>>>> >>>>> security.selinux=0x73797374656d5f753a6f626a6563745f723a6574635f72756e74696d655f743a733000 >>>>> trusted.gfid=0x298147e49f9748b2baf1c8fff897244d >>>>> >>>>> trusted.gfid2path.b48064c78d7a85c9=0x62653331383633382d653861302d346336642d393737642d3761393337616138343830362f66323863616263622d643136392d343166632d613633332d3962656634633461386534302e3531353030 >>>>> >>>>> trusted.glusterfs.dht.linkto=0x6f766972742d6261636b626f6e652d322d7265706c69636174652d3100 >>>>> >>>>> [root at lease-04 ovirt-backbone-2]# getfattr -d -m . -e hex >>>>> .glusterfs/29/81/298147e4-9f97-48b2-baf1-c8fff897244d >>>>> # file: .glusterfs/29/81/298147e4-9f97-48b2-baf1-c8fff897244d >>>>> >>>>> security.selinux=0x73797374656d5f753a6f626a6563745f723a6574635f72756e74696d655f743a733000 >>>>> trusted.gfid=0x298147e49f9748b2baf1c8fff897244d >>>>> >>>>> trusted.gfid2path.b48064c78d7a85c9=0x62653331383633382d653861302d346336642d393737642d3761393337616138343830362f66323863616263622d643136392d343166632d613633332d3962656634633461386534302e3531353030 >>>>> >>>>> trusted.glusterfs.dht.linkto=0x6f766972742d6261636b626f6e652d322d7265706c69636174652d3100 >>>>> >>>>> [root at lease-04 ovirt-backbone-2]# stat >>>>> .glusterfs/29/81/298147e4-9f97-48b2-baf1-c8fff897244d >>>>> File: ?.glusterfs/29/81/298147e4-9f97-48b2-baf1-c8fff897244d? >>>>> Size: 0 Blocks: 0 IO Block: 4096 regular >>>>> empty file >>>>> Device: fd01h/64769d Inode: 1918631406 Links: 2 >>>>> Access: (1000/---------T) Uid: ( 0/ root) Gid: ( 0/ >>>>> root) >>>>> Context: system_u:object_r:etc_runtime_t:s0 >>>>> Access: 2018-12-17 21:43:36.405735296 +0000 >>>>> Modify: 2018-12-17 21:43:36.405735296 +0000 >>>>> Change: 2018-12-17 21:43:36.405735296 +0000 >>>>> Birth: - >>>>> >>>>> removing the shard file and glusterfs file from each node resolved the >>>>> issue. 
>>>>> >>>>> I also found this thread; >>>>> https://lists.gluster.org/pipermail/gluster-users/2018-December/035460.html >>>>> Maybe he suffers from the same issue. >>>>> >>>>> Best Olaf >>>>> >>>>> >>>>> Op wo 19 dec. 2018 om 21:56 schreef Olaf Buitelaar < >>>>> olaf.buitelaar at gmail.com>: >>>>> >>>>>> Dear All, >>>>>> >>>>>> It appears i've a stale file in one of the volumes, on 2 files. These >>>>>> files are qemu images (1 raw and 1 qcow2). >>>>>> I'll just focus on 1 file since the situation on the other seems the >>>>>> same. >>>>>> >>>>>> The VM get's paused more or less directly after being booted with >>>>>> error; >>>>>> [2018-12-18 14:05:05.275713] E [MSGID: 133010] >>>>>> [shard.c:1724:shard_common_lookup_shards_cbk] 0-ovirt-backbone-2-shard: >>>>>> Lookup on shard 51500 failed. Base file gfid = >>>>>> f28cabcb-d169-41fc-a633-9bef4c4a8e40 [Stale file handle] >>>>>> >>>>>> investigating the shard; >>>>>> >>>>>> #on the arbiter node: >>>>>> >>>>>> [root at lease-05 ovirt-backbone-2]# getfattr -n glusterfs.gfid.string >>>>>> /mnt/ovirt-backbone-2/b1c2c949-aef4-4aec-999b-b179efeef732/images/f6ac9660-a84e-469e-a17c-c6dbc538af4b/d6b09501-5b79-4c92-bf10-2ed3bda1b425 >>>>>> getfattr: Removing leading '/' from absolute path names >>>>>> # file: >>>>>> mnt/ovirt-backbone-2/b1c2c949-aef4-4aec-999b-b179efeef732/images/f6ac9660-a84e-469e-a17c-c6dbc538af4b/d6b09501-5b79-4c92-bf10-2ed3bda1b425 >>>>>> glusterfs.gfid.string="f28cabcb-d169-41fc-a633-9bef4c4a8e40" >>>>>> >>>>>> [root at lease-05 ovirt-backbone-2]# getfattr -d -m . 
-e hex >>>>>> .shard/f28cabcb-d169-41fc-a633-9bef4c4a8e40.51500 >>>>>> # file: .shard/f28cabcb-d169-41fc-a633-9bef4c4a8e40.51500 >>>>>> >>>>>> security.selinux=0x73797374656d5f753a6f626a6563745f723a6574635f72756e74696d655f743a733000 >>>>>> trusted.afr.dirty=0x000000000000000000000000 >>>>>> trusted.gfid=0x1f86b4328ec6424699aa48cc6d7b5da0 >>>>>> >>>>>> trusted.gfid2path.b48064c78d7a85c9=0x62653331383633382d653861302d346336642d393737642d3761393337616138343830362f66323863616263622d643136392d343166632d613633332d3962656634633461386534302e3531353030 >>>>>> >>>>>> [root at lease-05 ovirt-backbone-2]# getfattr -d -m . -e hex >>>>>> .glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0 >>>>>> # file: .glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0 >>>>>> >>>>>> security.selinux=0x73797374656d5f753a6f626a6563745f723a6574635f72756e74696d655f743a733000 >>>>>> trusted.afr.dirty=0x000000000000000000000000 >>>>>> trusted.gfid=0x1f86b4328ec6424699aa48cc6d7b5da0 >>>>>> >>>>>> trusted.gfid2path.b48064c78d7a85c9=0x62653331383633382d653861302d346336642d393737642d3761393337616138343830362f66323863616263622d643136392d343166632d613633332d3962656634633461386534302e3531353030 >>>>>> >>>>>> [root at lease-05 ovirt-backbone-2]# stat >>>>>> .glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0 >>>>>> File: ?.glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0? >>>>>> Size: 0 Blocks: 0 IO Block: 4096 regular >>>>>> empty file >>>>>> Device: fd01h/64769d Inode: 537277306 Links: 2 >>>>>> Access: (0660/-rw-rw----) Uid: ( 0/ root) Gid: ( 0/ >>>>>> root) >>>>>> Context: system_u:object_r:etc_runtime_t:s0 >>>>>> Access: 2018-12-17 21:43:36.361984810 +0000 >>>>>> Modify: 2018-12-17 21:43:36.361984810 +0000 >>>>>> Change: 2018-12-18 20:55:29.908647417 +0000 >>>>>> Birth: - >>>>>> >>>>>> [root at lease-05 ovirt-backbone-2]# find . 
-inum 537277306 >>>>>> ./.glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0 >>>>>> ./.shard/f28cabcb-d169-41fc-a633-9bef4c4a8e40.51500 >>>>>> >>>>>> #on the data nodes: >>>>>> >>>>>> [root at lease-08 ~]# getfattr -n glusterfs.gfid.string >>>>>> /mnt/ovirt-backbone-2/b1c2c949-aef4-4aec-999b-b179efeef732/images/f6ac9660-a84e-469e-a17c-c6dbc538af4b/d6b09501-5b79-4c92-bf10-2ed3bda1b425 >>>>>> getfattr: Removing leading '/' from absolute path names >>>>>> # file: >>>>>> mnt/ovirt-backbone-2/b1c2c949-aef4-4aec-999b-b179efeef732/images/f6ac9660-a84e-469e-a17c-c6dbc538af4b/d6b09501-5b79-4c92-bf10-2ed3bda1b425 >>>>>> glusterfs.gfid.string="f28cabcb-d169-41fc-a633-9bef4c4a8e40" >>>>>> >>>>>> [root at lease-08 ovirt-backbone-2]# getfattr -d -m . -e hex >>>>>> .shard/f28cabcb-d169-41fc-a633-9bef4c4a8e40.51500 >>>>>> # file: .shard/f28cabcb-d169-41fc-a633-9bef4c4a8e40.51500 >>>>>> >>>>>> security.selinux=0x73797374656d5f753a6f626a6563745f723a6574635f72756e74696d655f743a733000 >>>>>> trusted.afr.dirty=0x000000000000000000000000 >>>>>> trusted.gfid=0x1f86b4328ec6424699aa48cc6d7b5da0 >>>>>> >>>>>> trusted.gfid2path.b48064c78d7a85c9=0x62653331383633382d653861302d346336642d393737642d3761393337616138343830362f66323863616263622d643136392d343166632d613633332d3962656634633461386534302e3531353030 >>>>>> >>>>>> [root at lease-08 ovirt-backbone-2]# getfattr -d -m . 
-e hex >>>>>> .glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0 >>>>>> # file: .glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0 >>>>>> >>>>>> security.selinux=0x73797374656d5f753a6f626a6563745f723a6574635f72756e74696d655f743a733000 >>>>>> trusted.afr.dirty=0x000000000000000000000000 >>>>>> trusted.gfid=0x1f86b4328ec6424699aa48cc6d7b5da0 >>>>>> >>>>>> trusted.gfid2path.b48064c78d7a85c9=0x62653331383633382d653861302d346336642d393737642d3761393337616138343830362f66323863616263622d643136392d343166632d613633332d3962656634633461386534302e3531353030 >>>>>> >>>>>> [root at lease-08 ovirt-backbone-2]# stat >>>>>> .glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0 >>>>>> File: ?.glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0? >>>>>> Size: 2166784 Blocks: 4128 IO Block: 4096 regular >>>>>> file >>>>>> Device: fd03h/64771d Inode: 12893624759 Links: 3 >>>>>> Access: (0660/-rw-rw----) Uid: ( 0/ root) Gid: ( 0/ >>>>>> root) >>>>>> Context: system_u:object_r:etc_runtime_t:s0 >>>>>> Access: 2018-12-18 18:52:38.070776585 +0000 >>>>>> Modify: 2018-12-17 21:43:36.388054443 +0000 >>>>>> Change: 2018-12-18 21:01:47.810506528 +0000 >>>>>> Birth: - >>>>>> >>>>>> [root at lease-08 ovirt-backbone-2]# find . -inum 12893624759 >>>>>> ./.glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0 >>>>>> ./.shard/f28cabcb-d169-41fc-a633-9bef4c4a8e40.51500 >>>>>> >>>>>> ======================== >>>>>> >>>>>> [root at lease-11 ovirt-backbone-2]# getfattr -n glusterfs.gfid.string >>>>>> /mnt/ovirt-backbone-2/b1c2c949-aef4-4aec-999b-b179efeef732/images/f6ac9660-a84e-469e-a17c-c6dbc538af4b/d6b09501-5b79-4c92-bf10-2ed3bda1b425 >>>>>> getfattr: Removing leading '/' from absolute path names >>>>>> # file: >>>>>> mnt/ovirt-backbone-2/b1c2c949-aef4-4aec-999b-b179efeef732/images/f6ac9660-a84e-469e-a17c-c6dbc538af4b/d6b09501-5b79-4c92-bf10-2ed3bda1b425 >>>>>> glusterfs.gfid.string="f28cabcb-d169-41fc-a633-9bef4c4a8e40" >>>>>> >>>>>> [root at lease-11 ovirt-backbone-2]# getfattr -d -m . 
-e hex >>>>>> .shard/f28cabcb-d169-41fc-a633-9bef4c4a8e40.51500 >>>>>> # file: .shard/f28cabcb-d169-41fc-a633-9bef4c4a8e40.51500 >>>>>> >>>>>> security.selinux=0x73797374656d5f753a6f626a6563745f723a6574635f72756e74696d655f743a733000 >>>>>> trusted.afr.dirty=0x000000000000000000000000 >>>>>> trusted.gfid=0x1f86b4328ec6424699aa48cc6d7b5da0 >>>>>> >>>>>> trusted.gfid2path.b48064c78d7a85c9=0x62653331383633382d653861302d346336642d393737642d3761393337616138343830362f66323863616263622d643136392d343166632d613633332d3962656634633461386534302e3531353030 >>>>>> >>>>>> [root at lease-11 ovirt-backbone-2]# getfattr -d -m . -e hex >>>>>> .glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0 >>>>>> # file: .glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0 >>>>>> >>>>>> security.selinux=0x73797374656d5f753a6f626a6563745f723a6574635f72756e74696d655f743a733000 >>>>>> trusted.afr.dirty=0x000000000000000000000000 >>>>>> trusted.gfid=0x1f86b4328ec6424699aa48cc6d7b5da0 >>>>>> >>>>>> trusted.gfid2path.b48064c78d7a85c9=0x62653331383633382d653861302d346336642d393737642d3761393337616138343830362f66323863616263622d643136392d343166632d613633332d3962656634633461386534302e3531353030 >>>>>> >>>>>> [root at lease-11 ovirt-backbone-2]# stat >>>>>> .glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0 >>>>>> File: ?.glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0? >>>>>> Size: 2166784 Blocks: 4128 IO Block: 4096 regular >>>>>> file >>>>>> Device: fd03h/64771d Inode: 12956094809 Links: 3 >>>>>> Access: (0660/-rw-rw----) Uid: ( 0/ root) Gid: ( 0/ >>>>>> root) >>>>>> Context: system_u:object_r:etc_runtime_t:s0 >>>>>> Access: 2018-12-18 20:11:53.595208449 +0000 >>>>>> Modify: 2018-12-17 21:43:36.391580259 +0000 >>>>>> Change: 2018-12-18 19:19:25.888055392 +0000 >>>>>> Birth: - >>>>>> >>>>>> [root at lease-11 ovirt-backbone-2]# find . 
-inum 12956094809 >>>>>> ./.glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0 >>>>>> ./.shard/f28cabcb-d169-41fc-a633-9bef4c4a8e40.51500 >>>>>> >>>>>> ================ >>>>>> >>>>>> I don't really see any inconsistencies, except the dates on the stat. >>>>>> However this is only after i tried moving the file out of the volumes to >>>>>> force a heal, which does happen on the data nodes, but not on the arbiter >>>>>> node. Before that they were also the same. >>>>>> I've also compared the file >>>>>> ./.shard/f28cabcb-d169-41fc-a633-9bef4c4a8e40.51500 on the 2 nodes and they >>>>>> are exactly the same. >>>>>> >>>>>> Things i've further tried; >>>>>> - gluster v heal ovirt-backbone-2 full => gluster v heal >>>>>> ovirt-backbone-2 info reports 0 entries on all nodes >>>>>> >>>>>> - stop each glusterd and glusterfsd, pause around 40sec and start >>>>>> them again on each node, 1 at a time, waiting for the heal to recover >>>>>> before moving to the next node >>>>>> >>>>>> - force a heal by stopping glusterd on a node and perform these steps; >>>>>> mkdir /mnt/ovirt-backbone-2/trigger >>>>>> rmdir /mnt/ovirt-backbone-2/trigger >>>>>> setfattr -n trusted.non-existent-key -v abc /mnt/ovirt-backbone-2/ >>>>>> setfattr -x trusted.non-existent-key /mnt/ovirt-backbone-2/ >>>>>> start glusterd >>>>>> >>>>>> - gluster volume rebalance ovirt-backbone-2 start => success >>>>>> >>>>>> Whats further interesting is that according the mount log, the volume >>>>>> is in split-brain; >>>>>> [2018-12-18 10:06:04.606870] E [MSGID: 108008] >>>>>> [afr-read-txn.c:90:afr_read_txn_refresh_done] >>>>>> 0-ovirt-backbone-2-replicate-2: Failing FSTAT on gfid >>>>>> 2a57d87d-fe49-4034-919b-fdb79531bf68: split-brain observed. 
[Input/output >>>>>> error] >>>>>> [2018-12-18 10:06:04.606908] E [MSGID: 133014] >>>>>> [shard.c:1248:shard_common_stat_cbk] 0-ovirt-backbone-2-shard: stat failed: >>>>>> 2a57d87d-fe49-4034-919b-fdb79531bf68 [Input/output error] >>>>>> [2018-12-18 10:06:04.606927] W [fuse-bridge.c:871:fuse_attr_cbk] >>>>>> 0-glusterfs-fuse: 428090: FSTAT() >>>>>> /b1c2c949-aef4-4aec-999b-b179efeef732/dom_md/ids => -1 (Input/output error) >>>>>> [2018-12-18 10:06:05.107729] E [MSGID: 108008] >>>>>> [afr-read-txn.c:90:afr_read_txn_refresh_done] >>>>>> 0-ovirt-backbone-2-replicate-2: Failing FSTAT on gfid >>>>>> 2a57d87d-fe49-4034-919b-fdb79531bf68: split-brain observed. [Input/output >>>>>> error] >>>>>> [2018-12-18 10:06:05.107770] E [MSGID: 133014] >>>>>> [shard.c:1248:shard_common_stat_cbk] 0-ovirt-backbone-2-shard: stat failed: >>>>>> 2a57d87d-fe49-4034-919b-fdb79531bf68 [Input/output error] >>>>>> [2018-12-18 10:06:05.107791] W [fuse-bridge.c:871:fuse_attr_cbk] >>>>>> 0-glusterfs-fuse: 428091: FSTAT() >>>>>> /b1c2c949-aef4-4aec-999b-b179efeef732/dom_md/ids => -1 (Input/output error) >>>>>> [2018-12-18 10:06:05.537244] I [MSGID: 108006] >>>>>> [afr-common.c:5494:afr_local_init] 0-ovirt-backbone-2-replicate-1: no >>>>>> subvolumes up >>>>>> [2018-12-18 10:06:05.538523] E [MSGID: 108008] >>>>>> [afr-read-txn.c:90:afr_read_txn_refresh_done] >>>>>> 0-ovirt-backbone-2-replicate-2: Failing STAT on gfid >>>>>> 00000000-0000-0000-0000-000000000001: split-brain observed. 
[Input/output >>>>>> error] >>>>>> [2018-12-18 10:06:05.538685] I [MSGID: 108006] >>>>>> [afr-common.c:5494:afr_local_init] 0-ovirt-backbone-2-replicate-1: no >>>>>> subvolumes up >>>>>> [2018-12-18 10:06:05.538794] I [MSGID: 108006] >>>>>> [afr-common.c:5494:afr_local_init] 0-ovirt-backbone-2-replicate-1: no >>>>>> subvolumes up >>>>>> [2018-12-18 10:06:05.539342] I [MSGID: 109063] >>>>>> [dht-layout.c:716:dht_layout_normalize] 0-ovirt-backbone-2-dht: Found >>>>>> anomalies in /b1c2c949-aef4-4aec-999b-b179efeef732 (gfid = >>>>>> 8c8598ce-1a52-418e-a7b4-435fee34bae8). Holes=2 overlaps=0 >>>>>> [2018-12-18 10:06:05.539372] W [MSGID: 109005] >>>>>> [dht-selfheal.c:2158:dht_selfheal_directory] 0-ovirt-backbone-2-dht: >>>>>> Directory selfheal failed: 2 subvolumes down.Not fixing. path = >>>>>> /b1c2c949-aef4-4aec-999b-b179efeef732, gfid = >>>>>> 8c8598ce-1a52-418e-a7b4-435fee34bae8 >>>>>> [2018-12-18 10:06:05.539694] I [MSGID: 108006] >>>>>> [afr-common.c:5494:afr_local_init] 0-ovirt-backbone-2-replicate-1: no >>>>>> subvolumes up >>>>>> [2018-12-18 10:06:05.540652] I [MSGID: 108006] >>>>>> [afr-common.c:5494:afr_local_init] 0-ovirt-backbone-2-replicate-1: no >>>>>> subvolumes up >>>>>> [2018-12-18 10:06:05.608612] E [MSGID: 108008] >>>>>> [afr-read-txn.c:90:afr_read_txn_refresh_done] >>>>>> 0-ovirt-backbone-2-replicate-2: Failing FSTAT on gfid >>>>>> 2a57d87d-fe49-4034-919b-fdb79531bf68: split-brain observed. 
[Input/output >>>>>> error] >>>>>> [2018-12-18 10:06:05.608657] E [MSGID: 133014] >>>>>> [shard.c:1248:shard_common_stat_cbk] 0-ovirt-backbone-2-shard: stat failed: >>>>>> 2a57d87d-fe49-4034-919b-fdb79531bf68 [Input/output error] >>>>>> [2018-12-18 10:06:05.608672] W [fuse-bridge.c:871:fuse_attr_cbk] >>>>>> 0-glusterfs-fuse: 428096: FSTAT() >>>>>> /b1c2c949-aef4-4aec-999b-b179efeef732/dom_md/ids => -1 (Input/output error) >>>>>> [2018-12-18 10:06:06.109339] E [MSGID: 108008] >>>>>> [afr-read-txn.c:90:afr_read_txn_refresh_done] >>>>>> 0-ovirt-backbone-2-replicate-2: Failing FSTAT on gfid >>>>>> 2a57d87d-fe49-4034-919b-fdb79531bf68: split-brain observed. [Input/output >>>>>> error] >>>>>> [2018-12-18 10:06:06.109378] E [MSGID: 133014] >>>>>> [shard.c:1248:shard_common_stat_cbk] 0-ovirt-backbone-2-shard: stat failed: >>>>>> 2a57d87d-fe49-4034-919b-fdb79531bf68 [Input/output error] >>>>>> [2018-12-18 10:06:06.109399] W [fuse-bridge.c:871:fuse_attr_cbk] >>>>>> 0-glusterfs-fuse: 428097: FSTAT() >>>>>> /b1c2c949-aef4-4aec-999b-b179efeef732/dom_md/ids => -1 (Input/output error) >>>>>> >>>>>> #note i'm able to see ; >>>>>> /b1c2c949-aef4-4aec-999b-b179efeef732/dom_md/ids >>>>>> [root at lease-11 ovirt-backbone-2]# stat >>>>>> /mnt/ovirt-backbone-2/b1c2c949-aef4-4aec-999b-b179efeef732/dom_md/ids >>>>>> File: >>>>>> ?/mnt/ovirt-backbone-2/b1c2c949-aef4-4aec-999b-b179efeef732/dom_md/ids? >>>>>> Size: 1048576 Blocks: 2048 IO Block: 131072 regular >>>>>> file >>>>>> Device: 41h/65d Inode: 10492258721813610344 Links: 1 >>>>>> Access: (0660/-rw-rw----) Uid: ( 36/ vdsm) Gid: ( 36/ >>>>>> kvm) >>>>>> Context: system_u:object_r:fusefs_t:s0 >>>>>> Access: 2018-12-19 20:07:39.917573869 +0000 >>>>>> Modify: 2018-12-19 20:07:39.928573917 +0000 >>>>>> Change: 2018-12-19 20:07:39.929573921 +0000 >>>>>> Birth: - >>>>>> >>>>>> however checking: gluster v heal ovirt-backbone-2 info split-brain >>>>>> reports no entries. 
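The mount-log excerpt above repeats the same few gfids many times. A minimal Python sketch for tallying which gfids the client flags as split-brain is given here. It assumes each log entry sits on a single line (as in the real mount log, not wrapped as in this mail) and that the message wording matches the 3.12.15 excerpt above. The names SPLIT_BRAIN_RE and split_brain_gfids are made up for illustration and are not part of gluster.

```python
import re
from collections import Counter

# Assumption: wording taken from the afr_read_txn_refresh_done lines quoted above.
SPLIT_BRAIN_RE = re.compile(
    r"Failing (?P<fop>\w+) on gfid (?P<gfid>[0-9a-f-]{36}): split-brain observed"
)


def split_brain_gfids(log_lines):
    """Count how often each gfid is reported as split-brain in the given lines."""
    counts = Counter()
    for line in log_lines:
        match = SPLIT_BRAIN_RE.search(line)
        if match:
            counts[match.group("gfid")] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "[2018-12-18 10:06:04.606870] E [MSGID: 108008] "
        "[afr-read-txn.c:90:afr_read_txn_refresh_done] "
        "0-ovirt-backbone-2-replicate-2: Failing FSTAT on gfid "
        "2a57d87d-fe49-4034-919b-fdb79531bf68: split-brain observed. "
        "[Input/output error]",
        "[2018-12-18 10:06:05.538523] E [MSGID: 108008] "
        "[afr-read-txn.c:90:afr_read_txn_refresh_done] "
        "0-ovirt-backbone-2-replicate-2: Failing STAT on gfid "
        "00000000-0000-0000-0000-000000000001: split-brain observed. "
        "[Input/output error]",
    ]
    for gfid, hits in split_brain_gfids(sample).items():
        print(gfid, hits)
```

Run against the full mount log, this gives a short list of gfids to compare with the output of "gluster v heal ovirt-backbone-2 info split-brain", which here reports no entries.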
>>>>>> >>>>>> I've also tried mounting the qemu image, and this works fine, i'm >>>>>> able to see all contents; >>>>>> losetup /dev/loop0 >>>>>> /mnt/ovirt-backbone-2/b1c2c949-aef4-4aec-999b-b179efeef732/images/f6ac9660-a84e-469e-a17c-c6dbc538af4b/d6b09501-5b79-4c92-bf10-2ed3bda1b425 >>>>>> kpartx -a /dev/loop0 >>>>>> vgscan >>>>>> vgchange -ay slave-data >>>>>> mkdir /mnt/slv01 >>>>>> mount /dev/mapper/slave--data-lvol0 /mnt/slv01/ >>>>>> >>>>>> Possible causes for this issue; >>>>>> 1. the machine "lease-11" suffered from a faulty RAM module (ECC), >>>>>> which halted the machine and causes an invalid state. (this machine also >>>>>> hosts other volumes, with similar configurations, which report no issue) >>>>>> 2. after the RAM module was replaced, the VM using the backing qemu >>>>>> image, was restored from a backup (the backup was file based within the VM >>>>>> on a different directory). This is because some files were corrupted. The >>>>>> backup/recovery obviously causes extra IO, possible introducing race >>>>>> conditions? The machine did run for about 12h without issues, and in total >>>>>> for about 36h. >>>>>> 3. since only the client (maybe only gfapi?) reports errors, >>>>>> something is broken there? 
>>>>>> >>>>>> The volume info; >>>>>> root at lease-06 ~# gluster v info ovirt-backbone-2 >>>>>> >>>>>> Volume Name: ovirt-backbone-2 >>>>>> Type: Distributed-Replicate >>>>>> Volume ID: 85702d35-62c8-4c8c-930d-46f455a8af28 >>>>>> Status: Started >>>>>> Snapshot Count: 0 >>>>>> Number of Bricks: 3 x (2 + 1) = 9 >>>>>> Transport-type: tcp >>>>>> Bricks: >>>>>> Brick1: 10.32.9.7:/data/gfs/bricks/brick1/ovirt-backbone-2 >>>>>> Brick2: 10.32.9.3:/data/gfs/bricks/brick1/ovirt-backbone-2 >>>>>> Brick3: 10.32.9.4:/data/gfs/bricks/bricka/ovirt-backbone-2 (arbiter) >>>>>> Brick4: 10.32.9.8:/data0/gfs/bricks/brick1/ovirt-backbone-2 >>>>>> Brick5: 10.32.9.21:/data0/gfs/bricks/brick1/ovirt-backbone-2 >>>>>> Brick6: 10.32.9.5:/data/gfs/bricks/bricka/ovirt-backbone-2 (arbiter) >>>>>> Brick7: 10.32.9.9:/data0/gfs/bricks/brick1/ovirt-backbone-2 >>>>>> Brick8: 10.32.9.20:/data0/gfs/bricks/brick1/ovirt-backbone-2 >>>>>> Brick9: 10.32.9.6:/data/gfs/bricks/bricka/ovirt-backbone-2 (arbiter) >>>>>> Options Reconfigured: >>>>>> nfs.disable: on >>>>>> transport.address-family: inet >>>>>> performance.quick-read: off >>>>>> performance.read-ahead: off >>>>>> performance.io-cache: off >>>>>> performance.low-prio-threads: 32 >>>>>> network.remote-dio: enable >>>>>> cluster.eager-lock: enable >>>>>> cluster.quorum-type: auto >>>>>> cluster.server-quorum-type: server >>>>>> cluster.data-self-heal-algorithm: full >>>>>> cluster.locking-scheme: granular >>>>>> cluster.shd-max-threads: 8 >>>>>> cluster.shd-wait-qlength: 10000 >>>>>> features.shard: on >>>>>> user.cifs: off >>>>>> storage.owner-uid: 36 >>>>>> storage.owner-gid: 36 >>>>>> features.shard-block-size: 64MB >>>>>> performance.write-behind-window-size: 512MB >>>>>> performance.cache-size: 384MB >>>>>> cluster.brick-multiplex: on >>>>>> >>>>>> The volume status; >>>>>> root at lease-06 ~# gluster v status ovirt-backbone-2 >>>>>> Status of volume: ovirt-backbone-2 >>>>>> Gluster process TCP Port RDMA Port >>>>>> Online Pid >>>>>> >>>>>> 
------------------------------------------------------------------------------ >>>>>> Brick 10.32.9.7:/data/gfs/bricks/brick1/ovi >>>>>> rt-backbone-2 49152 0 >>>>>> Y 7727 >>>>>> Brick 10.32.9.3:/data/gfs/bricks/brick1/ovi >>>>>> rt-backbone-2 49152 0 >>>>>> Y 12620 >>>>>> Brick 10.32.9.4:/data/gfs/bricks/bricka/ovi >>>>>> rt-backbone-2 49152 0 >>>>>> Y 8794 >>>>>> Brick 10.32.9.8:/data0/gfs/bricks/brick1/ov >>>>>> irt-backbone-2 49161 0 >>>>>> Y 22333 >>>>>> Brick 10.32.9.21:/data0/gfs/bricks/brick1/o >>>>>> virt-backbone-2 49152 0 >>>>>> Y 15030 >>>>>> Brick 10.32.9.5:/data/gfs/bricks/bricka/ovi >>>>>> rt-backbone-2 49166 0 >>>>>> Y 24592 >>>>>> Brick 10.32.9.9:/data0/gfs/bricks/brick1/ov >>>>>> irt-backbone-2 49153 0 >>>>>> Y 20148 >>>>>> Brick 10.32.9.20:/data0/gfs/bricks/brick1/o >>>>>> virt-backbone-2 49154 0 >>>>>> Y 15413 >>>>>> Brick 10.32.9.6:/data/gfs/bricks/bricka/ovi >>>>>> rt-backbone-2 49152 0 >>>>>> Y 43120 >>>>>> Self-heal Daemon on localhost N/A N/A >>>>>> Y 44587 >>>>>> Self-heal Daemon on 10.201.0.2 N/A N/A >>>>>> Y 8401 >>>>>> Self-heal Daemon on 10.201.0.5 N/A N/A >>>>>> Y 11038 >>>>>> Self-heal Daemon on 10.201.0.8 N/A N/A >>>>>> Y 9513 >>>>>> Self-heal Daemon on 10.32.9.4 N/A N/A >>>>>> Y 23736 >>>>>> Self-heal Daemon on 10.32.9.20 N/A N/A >>>>>> Y 2738 >>>>>> Self-heal Daemon on 10.32.9.3 N/A N/A >>>>>> Y 25598 >>>>>> Self-heal Daemon on 10.32.9.5 N/A N/A >>>>>> Y 511 >>>>>> Self-heal Daemon on 10.32.9.9 N/A N/A >>>>>> Y 23357 >>>>>> Self-heal Daemon on 10.32.9.8 N/A N/A >>>>>> Y 15225 >>>>>> Self-heal Daemon on 10.32.9.7 N/A N/A >>>>>> Y 25781 >>>>>> Self-heal Daemon on 10.32.9.21 N/A N/A >>>>>> Y 5034 >>>>>> >>>>>> Task Status of Volume ovirt-backbone-2 >>>>>> >>>>>> ------------------------------------------------------------------------------ >>>>>> Task : Rebalance >>>>>> ID : 6dfbac43-0125-4568-9ac3-a2c453faaa3d >>>>>> Status : completed >>>>>> >>>>>> gluster version is @3.12.15 and cluster.op-version=31202 >>>>>> >>>>>> 
========================
>>>>>>
>>>>>> It would be nice to know if it's possible to mark the files as not
>>>>>> stale or if i should investigate other things?
>>>>>> Or should we consider this volume lost?
>>>>>> Also checking the code at;
>>>>>> https://github.com/gluster/glusterfs/blob/master/xlators/features/shard/src/shard.c
>>>>>> it seems the functions shifted quite some (line 1724 vs. 2243), so maybe
>>>>>> it's fixed in a future version?
>>>>>> Any thoughts are welcome.
>>>>>>
>>>>>> Thanks Olaf
>>>>>>

From kdhananj at redhat.com Mon Jan 14 07:15:52 2019
From: kdhananj at redhat.com (Krutika Dhananjay)
Date: Mon, 14 Jan 2019 12:45:52 +0530
Subject: [Gluster-users] [Stale file handle] in shard volume
In-Reply-To:
References:
Message-ID:

Hi,

So the main issue is that certain VMs seem to be pausing? Did I understand that right?
Could you share the gluster-mount logs around the time the pause was seen? And the brick logs too, please?

As for the ESTALE errors, the real cause of the pauses can be determined from the errors/warnings logged by fuse. The mere occurrence of ESTALE errors against a shard function in the logs doesn't necessarily indicate that it is the reason for the pause. Also, in this instance, the ESTALE errors seem to be propagated by the lower translators (DHT? protocol/client? Or even the bricks?) and shard is merely logging them.

-Krutika

On Sun, Jan 13, 2019 at 10:11 PM Olaf Buitelaar wrote:
> @Krutika if you need any further information, please let me know.
>
> Thanks Olaf
>
> On Fri, 4 Jan 2019 at 07:51, Nithya Balachandran <nbalacha at redhat.com> wrote:
>
>> Adding Krutika.
>>
>> On Wed, 2 Jan 2019 at 20:56, Olaf Buitelaar wrote:
>>
>>> Hi Nithya,
>>>
>>> Thank you for your reply.
>>>
>>> The VMs using the gluster volumes keep getting paused/stopped on
>>> errors like these:
>>> [2019-01-02 02:33:44.469132] E [MSGID: 133010] [shard.c:1724:shard_common_lookup_shards_cbk] 0-ovirt-kube-shard: Lookup on shard 101487 failed. Base file gfid = a38d64bc-a28b-4ee1-a0bb-f919e7a1022c [Stale file handle]
>>> [2019-01-02 02:33:44.563288] E [MSGID: 133010] [shard.c:1724:shard_common_lookup_shards_cbk] 0-ovirt-kube-shard: Lookup on shard 101488 failed. Base file gfid = a38d64bc-a28b-4ee1-a0bb-f919e7a1022c [Stale file handle]
>>>
>>> Krutika, can you take a look at this?
>>
>>>
>>> What I'm trying to find out is whether I can purge all gluster volumes of all
>>> possible stale file handles (and hopefully find a method to prevent this in
>>> the future), so the VMs can start running stably again.
>>> For this I need to know when the "shard_common_lookup_shards_cbk"
>>> function considers a file stale.
>>> The statement "Stale file handle errors show up when a file with a
>>> specified gfid is not found." doesn't seem to cover it all: as I've shown
>>> in earlier mails, the shard file and the .glusterfs/xx/xx/uuid file both
>>> exist and have the same inode.
>>> If the criteria I'm using aren't correct, could you please tell me which
>>> criteria I should use to determine whether a file is stale or not?
>>> These criteria are just based on observations I made while moving the stale
>>> files manually. After removing them I was able to start the VM again, until
>>> some time later it hung on another stale shard file, unfortunately.
>>>
>>> Thanks Olaf
>>>
>>> On Wed, 2 Jan 2019 at 14:20, Nithya Balachandran <nbalacha at redhat.com> wrote:
>>>
>>>>
>>>> On Mon, 31 Dec 2018 at 01:27, Olaf Buitelaar wrote:
>>>>
>>>>> Dear All,
>>>>>
>>>>> Until now a select group of VMs still seems to produce new stale
>>>>> files and gets paused because of this.
>>>>> I've not updated gluster recently; however, I did change the op version
>>>>> from 31200 to 31202 about a week before this issue arose.
>>>>> Looking at the .shard directory, I have 100,000+ files sharing the same
>>>>> characteristics as the stale files found so far:
>>>>> they all have the sticky bit set (file permissions ---------T),
>>>>> are 0 KB in size, and have the trusted.glusterfs.dht.linkto attribute.
>>>>>
>>>>
>>>> These are internal files used by gluster and do not necessarily mean
>>>> they are stale. They "point" to data files which may be on different bricks
>>>> (same name, gfid etc but no linkto xattr and no ----T permissions).
>>>>
>>>>
>>>>> These files range from long ago (beginning of the year) until now,
>>>>> which makes me suspect this was lying dormant for some time and
>>>>> somehow recently surfaced.
>>>>> The other sub-volumes also contain 0 KB files in the .shard
>>>>> directory, but those don't have the sticky bit or the linkto attribute.
>>>>>
>>>>> Does anybody else experience this issue? Could this be a bug or an
>>>>> environmental issue?
>>>>>
>>>> These are most likely valid files; please do not delete them without
>>>> double-checking.
>>>>
>>>> Stale file handle errors show up when a file with a specified gfid is
>>>> not found. You will need to debug the files for which you see this error by
>>>> checking the bricks to see if they actually exist.
>>>>
>>>>>
>>>>> Also I wonder if there is any tool or gluster command to clean all
>>>>> stale file handles?
>>>>> Otherwise i'm planning to make a simple bash script, which iterates >>>>> over the .shard dir, checks each file for the above mentioned criteria, and >>>>> (re)moves the file and the corresponding .glusterfs file. >>>>> If there are other criteria needed to identify a stale file handle, i >>>>> would like to hear that. >>>>> If this is a viable and safe operation to do of course. >>>>> >>>>> Thanks Olaf >>>>> >>>>> >>>>> >>>>> Op do 20 dec. 2018 om 13:43 schreef Olaf Buitelaar < >>>>> olaf.buitelaar at gmail.com>: >>>>> >>>>>> Dear All, >>>>>> >>>>>> I figured it out, it appeared to be the exact same issue as described >>>>>> here; >>>>>> https://lists.gluster.org/pipermail/gluster-users/2018-March/033785.html >>>>>> Another subvolume also had the shard file, only were all 0 bytes and >>>>>> had the dht.linkto >>>>>> >>>>>> for reference; >>>>>> [root at lease-04 ovirt-backbone-2]# getfattr -d -m . -e hex >>>>>> .shard/f28cabcb-d169-41fc-a633-9bef4c4a8e40.51500 >>>>>> # file: .shard/f28cabcb-d169-41fc-a633-9bef4c4a8e40.51500 >>>>>> >>>>>> security.selinux=0x73797374656d5f753a6f626a6563745f723a6574635f72756e74696d655f743a733000 >>>>>> trusted.gfid=0x298147e49f9748b2baf1c8fff897244d >>>>>> >>>>>> trusted.gfid2path.b48064c78d7a85c9=0x62653331383633382d653861302d346336642d393737642d3761393337616138343830362f66323863616263622d643136392d343166632d613633332d3962656634633461386534302e3531353030 >>>>>> >>>>>> trusted.glusterfs.dht.linkto=0x6f766972742d6261636b626f6e652d322d7265706c69636174652d3100 >>>>>> >>>>>> [root at lease-04 ovirt-backbone-2]# getfattr -d -m . 
-e hex >>>>>> .glusterfs/29/81/298147e4-9f97-48b2-baf1-c8fff897244d >>>>>> # file: .glusterfs/29/81/298147e4-9f97-48b2-baf1-c8fff897244d >>>>>> >>>>>> security.selinux=0x73797374656d5f753a6f626a6563745f723a6574635f72756e74696d655f743a733000 >>>>>> trusted.gfid=0x298147e49f9748b2baf1c8fff897244d >>>>>> >>>>>> trusted.gfid2path.b48064c78d7a85c9=0x62653331383633382d653861302d346336642d393737642d3761393337616138343830362f66323863616263622d643136392d343166632d613633332d3962656634633461386534302e3531353030 >>>>>> >>>>>> trusted.glusterfs.dht.linkto=0x6f766972742d6261636b626f6e652d322d7265706c69636174652d3100 >>>>>> >>>>>> [root at lease-04 ovirt-backbone-2]# stat >>>>>> .glusterfs/29/81/298147e4-9f97-48b2-baf1-c8fff897244d >>>>>> File: ?.glusterfs/29/81/298147e4-9f97-48b2-baf1-c8fff897244d? >>>>>> Size: 0 Blocks: 0 IO Block: 4096 regular >>>>>> empty file >>>>>> Device: fd01h/64769d Inode: 1918631406 Links: 2 >>>>>> Access: (1000/---------T) Uid: ( 0/ root) Gid: ( 0/ >>>>>> root) >>>>>> Context: system_u:object_r:etc_runtime_t:s0 >>>>>> Access: 2018-12-17 21:43:36.405735296 +0000 >>>>>> Modify: 2018-12-17 21:43:36.405735296 +0000 >>>>>> Change: 2018-12-17 21:43:36.405735296 +0000 >>>>>> Birth: - >>>>>> >>>>>> removing the shard file and glusterfs file from each node resolved >>>>>> the issue. >>>>>> >>>>>> I also found this thread; >>>>>> https://lists.gluster.org/pipermail/gluster-users/2018-December/035460.html >>>>>> Maybe he suffers from the same issue. >>>>>> >>>>>> Best Olaf >>>>>> >>>>>> >>>>>> Op wo 19 dec. 2018 om 21:56 schreef Olaf Buitelaar < >>>>>> olaf.buitelaar at gmail.com>: >>>>>> >>>>>>> Dear All, >>>>>>> >>>>>>> It appears i've a stale file in one of the volumes, on 2 files. >>>>>>> These files are qemu images (1 raw and 1 qcow2). >>>>>>> I'll just focus on 1 file since the situation on the other seems the >>>>>>> same. 
>>>>>>> >>>>>>> The VM get's paused more or less directly after being booted with >>>>>>> error; >>>>>>> [2018-12-18 14:05:05.275713] E [MSGID: 133010] >>>>>>> [shard.c:1724:shard_common_lookup_shards_cbk] 0-ovirt-backbone-2-shard: >>>>>>> Lookup on shard 51500 failed. Base file gfid = >>>>>>> f28cabcb-d169-41fc-a633-9bef4c4a8e40 [Stale file handle] >>>>>>> >>>>>>> investigating the shard; >>>>>>> >>>>>>> #on the arbiter node: >>>>>>> >>>>>>> [root at lease-05 ovirt-backbone-2]# getfattr -n glusterfs.gfid.string >>>>>>> /mnt/ovirt-backbone-2/b1c2c949-aef4-4aec-999b-b179efeef732/images/f6ac9660-a84e-469e-a17c-c6dbc538af4b/d6b09501-5b79-4c92-bf10-2ed3bda1b425 >>>>>>> getfattr: Removing leading '/' from absolute path names >>>>>>> # file: >>>>>>> mnt/ovirt-backbone-2/b1c2c949-aef4-4aec-999b-b179efeef732/images/f6ac9660-a84e-469e-a17c-c6dbc538af4b/d6b09501-5b79-4c92-bf10-2ed3bda1b425 >>>>>>> glusterfs.gfid.string="f28cabcb-d169-41fc-a633-9bef4c4a8e40" >>>>>>> >>>>>>> [root at lease-05 ovirt-backbone-2]# getfattr -d -m . -e hex >>>>>>> .shard/f28cabcb-d169-41fc-a633-9bef4c4a8e40.51500 >>>>>>> # file: .shard/f28cabcb-d169-41fc-a633-9bef4c4a8e40.51500 >>>>>>> >>>>>>> security.selinux=0x73797374656d5f753a6f626a6563745f723a6574635f72756e74696d655f743a733000 >>>>>>> trusted.afr.dirty=0x000000000000000000000000 >>>>>>> trusted.gfid=0x1f86b4328ec6424699aa48cc6d7b5da0 >>>>>>> >>>>>>> trusted.gfid2path.b48064c78d7a85c9=0x62653331383633382d653861302d346336642d393737642d3761393337616138343830362f66323863616263622d643136392d343166632d613633332d3962656634633461386534302e3531353030 >>>>>>> >>>>>>> [root at lease-05 ovirt-backbone-2]# getfattr -d -m . 
-e hex >>>>>>> .glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0 >>>>>>> # file: .glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0 >>>>>>> >>>>>>> security.selinux=0x73797374656d5f753a6f626a6563745f723a6574635f72756e74696d655f743a733000 >>>>>>> trusted.afr.dirty=0x000000000000000000000000 >>>>>>> trusted.gfid=0x1f86b4328ec6424699aa48cc6d7b5da0 >>>>>>> >>>>>>> trusted.gfid2path.b48064c78d7a85c9=0x62653331383633382d653861302d346336642d393737642d3761393337616138343830362f66323863616263622d643136392d343166632d613633332d3962656634633461386534302e3531353030 >>>>>>> >>>>>>> [root at lease-05 ovirt-backbone-2]# stat >>>>>>> .glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0 >>>>>>> File: ?.glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0? >>>>>>> Size: 0 Blocks: 0 IO Block: 4096 regular >>>>>>> empty file >>>>>>> Device: fd01h/64769d Inode: 537277306 Links: 2 >>>>>>> Access: (0660/-rw-rw----) Uid: ( 0/ root) Gid: ( 0/ >>>>>>> root) >>>>>>> Context: system_u:object_r:etc_runtime_t:s0 >>>>>>> Access: 2018-12-17 21:43:36.361984810 +0000 >>>>>>> Modify: 2018-12-17 21:43:36.361984810 +0000 >>>>>>> Change: 2018-12-18 20:55:29.908647417 +0000 >>>>>>> Birth: - >>>>>>> >>>>>>> [root at lease-05 ovirt-backbone-2]# find . -inum 537277306 >>>>>>> ./.glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0 >>>>>>> ./.shard/f28cabcb-d169-41fc-a633-9bef4c4a8e40.51500 >>>>>>> >>>>>>> #on the data nodes: >>>>>>> >>>>>>> [root at lease-08 ~]# getfattr -n glusterfs.gfid.string >>>>>>> /mnt/ovirt-backbone-2/b1c2c949-aef4-4aec-999b-b179efeef732/images/f6ac9660-a84e-469e-a17c-c6dbc538af4b/d6b09501-5b79-4c92-bf10-2ed3bda1b425 >>>>>>> getfattr: Removing leading '/' from absolute path names >>>>>>> # file: >>>>>>> mnt/ovirt-backbone-2/b1c2c949-aef4-4aec-999b-b179efeef732/images/f6ac9660-a84e-469e-a17c-c6dbc538af4b/d6b09501-5b79-4c92-bf10-2ed3bda1b425 >>>>>>> glusterfs.gfid.string="f28cabcb-d169-41fc-a633-9bef4c4a8e40" >>>>>>> >>>>>>> [root at lease-08 ovirt-backbone-2]# getfattr -d -m . 
-e hex >>>>>>> .shard/f28cabcb-d169-41fc-a633-9bef4c4a8e40.51500 >>>>>>> # file: .shard/f28cabcb-d169-41fc-a633-9bef4c4a8e40.51500 >>>>>>> >>>>>>> security.selinux=0x73797374656d5f753a6f626a6563745f723a6574635f72756e74696d655f743a733000 >>>>>>> trusted.afr.dirty=0x000000000000000000000000 >>>>>>> trusted.gfid=0x1f86b4328ec6424699aa48cc6d7b5da0 >>>>>>> >>>>>>> trusted.gfid2path.b48064c78d7a85c9=0x62653331383633382d653861302d346336642d393737642d3761393337616138343830362f66323863616263622d643136392d343166632d613633332d3962656634633461386534302e3531353030 >>>>>>> >>>>>>> [root at lease-08 ovirt-backbone-2]# getfattr -d -m . -e hex >>>>>>> .glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0 >>>>>>> # file: .glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0 >>>>>>> >>>>>>> security.selinux=0x73797374656d5f753a6f626a6563745f723a6574635f72756e74696d655f743a733000 >>>>>>> trusted.afr.dirty=0x000000000000000000000000 >>>>>>> trusted.gfid=0x1f86b4328ec6424699aa48cc6d7b5da0 >>>>>>> >>>>>>> trusted.gfid2path.b48064c78d7a85c9=0x62653331383633382d653861302d346336642d393737642d3761393337616138343830362f66323863616263622d643136392d343166632d613633332d3962656634633461386534302e3531353030 >>>>>>> >>>>>>> [root at lease-08 ovirt-backbone-2]# stat >>>>>>> .glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0 >>>>>>> File: ?.glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0? >>>>>>> Size: 2166784 Blocks: 4128 IO Block: 4096 regular >>>>>>> file >>>>>>> Device: fd03h/64771d Inode: 12893624759 Links: 3 >>>>>>> Access: (0660/-rw-rw----) Uid: ( 0/ root) Gid: ( 0/ >>>>>>> root) >>>>>>> Context: system_u:object_r:etc_runtime_t:s0 >>>>>>> Access: 2018-12-18 18:52:38.070776585 +0000 >>>>>>> Modify: 2018-12-17 21:43:36.388054443 +0000 >>>>>>> Change: 2018-12-18 21:01:47.810506528 +0000 >>>>>>> Birth: - >>>>>>> >>>>>>> [root at lease-08 ovirt-backbone-2]# find . 
-inum 12893624759 >>>>>>> ./.glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0 >>>>>>> ./.shard/f28cabcb-d169-41fc-a633-9bef4c4a8e40.51500 >>>>>>> >>>>>>> ======================== >>>>>>> >>>>>>> [root at lease-11 ovirt-backbone-2]# getfattr -n glusterfs.gfid.string >>>>>>> /mnt/ovirt-backbone-2/b1c2c949-aef4-4aec-999b-b179efeef732/images/f6ac9660-a84e-469e-a17c-c6dbc538af4b/d6b09501-5b79-4c92-bf10-2ed3bda1b425 >>>>>>> getfattr: Removing leading '/' from absolute path names >>>>>>> # file: >>>>>>> mnt/ovirt-backbone-2/b1c2c949-aef4-4aec-999b-b179efeef732/images/f6ac9660-a84e-469e-a17c-c6dbc538af4b/d6b09501-5b79-4c92-bf10-2ed3bda1b425 >>>>>>> glusterfs.gfid.string="f28cabcb-d169-41fc-a633-9bef4c4a8e40" >>>>>>> >>>>>>> [root at lease-11 ovirt-backbone-2]# getfattr -d -m . -e hex >>>>>>> .shard/f28cabcb-d169-41fc-a633-9bef4c4a8e40.51500 >>>>>>> # file: .shard/f28cabcb-d169-41fc-a633-9bef4c4a8e40.51500 >>>>>>> >>>>>>> security.selinux=0x73797374656d5f753a6f626a6563745f723a6574635f72756e74696d655f743a733000 >>>>>>> trusted.afr.dirty=0x000000000000000000000000 >>>>>>> trusted.gfid=0x1f86b4328ec6424699aa48cc6d7b5da0 >>>>>>> >>>>>>> trusted.gfid2path.b48064c78d7a85c9=0x62653331383633382d653861302d346336642d393737642d3761393337616138343830362f66323863616263622d643136392d343166632d613633332d3962656634633461386534302e3531353030 >>>>>>> >>>>>>> [root at lease-11 ovirt-backbone-2]# getfattr -d -m . 
-e hex >>>>>>> .glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0 >>>>>>> # file: .glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0 >>>>>>> >>>>>>> security.selinux=0x73797374656d5f753a6f626a6563745f723a6574635f72756e74696d655f743a733000 >>>>>>> trusted.afr.dirty=0x000000000000000000000000 >>>>>>> trusted.gfid=0x1f86b4328ec6424699aa48cc6d7b5da0 >>>>>>> >>>>>>> trusted.gfid2path.b48064c78d7a85c9=0x62653331383633382d653861302d346336642d393737642d3761393337616138343830362f66323863616263622d643136392d343166632d613633332d3962656634633461386534302e3531353030 >>>>>>> >>>>>>> [root at lease-11 ovirt-backbone-2]# stat >>>>>>> .glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0 >>>>>>> File: ?.glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0? >>>>>>> Size: 2166784 Blocks: 4128 IO Block: 4096 regular >>>>>>> file >>>>>>> Device: fd03h/64771d Inode: 12956094809 Links: 3 >>>>>>> Access: (0660/-rw-rw----) Uid: ( 0/ root) Gid: ( 0/ >>>>>>> root) >>>>>>> Context: system_u:object_r:etc_runtime_t:s0 >>>>>>> Access: 2018-12-18 20:11:53.595208449 +0000 >>>>>>> Modify: 2018-12-17 21:43:36.391580259 +0000 >>>>>>> Change: 2018-12-18 19:19:25.888055392 +0000 >>>>>>> Birth: - >>>>>>> >>>>>>> [root at lease-11 ovirt-backbone-2]# find . -inum 12956094809 >>>>>>> ./.glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0 >>>>>>> ./.shard/f28cabcb-d169-41fc-a633-9bef4c4a8e40.51500 >>>>>>> >>>>>>> ================ >>>>>>> >>>>>>> I don't really see any inconsistencies, except the dates on the >>>>>>> stat. However this is only after i tried moving the file out of the volumes >>>>>>> to force a heal, which does happen on the data nodes, but not on the >>>>>>> arbiter node. Before that they were also the same. >>>>>>> I've also compared the file >>>>>>> ./.shard/f28cabcb-d169-41fc-a633-9bef4c4a8e40.51500 on the 2 nodes and they >>>>>>> are exactly the same. 
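The getfattr dumps above print every xattr as hex, which makes them hard to compare by eye. The following is a minimal Python sketch for decoding them, assuming values in the "0x..." form produced by "getfattr -e hex". The helper names are illustrative only and not part of any gluster tooling. It shows how a trusted.gfid value maps to the .glusterfs/<aa>/<bb>/<uuid> path seen in the output, and how string-valued xattrs such as trusted.gfid2path.* turn back into text.

```python
import uuid


def decode_xattr_hex(value: str) -> bytes:
    """Turn a `getfattr -e hex` value such as '0x1f86...' into raw bytes."""
    if value.startswith(("0x", "0X")):
        value = value[2:]
    return bytes.fromhex(value)


def gfid_to_backend_path(gfid_hex: str) -> str:
    """Map a 16-byte trusted.gfid value to its entry under .glusterfs on the brick."""
    gfid = str(uuid.UUID(bytes=decode_xattr_hex(gfid_hex)))
    return f".glusterfs/{gfid[0:2]}/{gfid[2:4]}/{gfid}"


def decode_text_xattr(value: str) -> str:
    """Decode a string-valued xattr, dropping any trailing NUL byte."""
    return decode_xattr_hex(value).rstrip(b"\x00").decode("ascii")


if __name__ == "__main__":
    # trusted.gfid of the shard, as printed on lease-05, lease-08 and lease-11
    print(gfid_to_backend_path("0x1f86b4328ec6424699aa48cc6d7b5da0"))
    # -> .glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0

    # trusted.gfid2path.b48064c78d7a85c9 of the same shard
    print(decode_text_xattr(
        "0x62653331383633382d653861302d346336642d393737642d"
        "3761393337616138343830362f66323863616263622d643136392d"
        "343166632d613633332d3962656634633461386534302e3531353030"
    ))
    # -> be318638-e8a0-4c6d-977d-7a937aa84806/f28cabcb-d169-41fc-a633-9bef4c4a8e40.51500
```

The decoded gfid2path value has the form <parent gfid>/<basename>, so it confirms that the xattr on each node refers to the same shard name, f28cabcb-d169-41fc-a633-9bef4c4a8e40.51500.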
>>>>>>> >>>>>>> Things i've further tried; >>>>>>> - gluster v heal ovirt-backbone-2 full => gluster v heal >>>>>>> ovirt-backbone-2 info reports 0 entries on all nodes >>>>>>> >>>>>>> - stop each glusterd and glusterfsd, pause around 40sec and start >>>>>>> them again on each node, 1 at a time, waiting for the heal to recover >>>>>>> before moving to the next node >>>>>>> >>>>>>> - force a heal by stopping glusterd on a node and perform these >>>>>>> steps; >>>>>>> mkdir /mnt/ovirt-backbone-2/trigger >>>>>>> rmdir /mnt/ovirt-backbone-2/trigger >>>>>>> setfattr -n trusted.non-existent-key -v abc /mnt/ovirt-backbone-2/ >>>>>>> setfattr -x trusted.non-existent-key /mnt/ovirt-backbone-2/ >>>>>>> start glusterd >>>>>>> >>>>>>> - gluster volume rebalance ovirt-backbone-2 start => success >>>>>>> >>>>>>> Whats further interesting is that according the mount log, the >>>>>>> volume is in split-brain; >>>>>>> [2018-12-18 10:06:04.606870] E [MSGID: 108008] >>>>>>> [afr-read-txn.c:90:afr_read_txn_refresh_done] >>>>>>> 0-ovirt-backbone-2-replicate-2: Failing FSTAT on gfid >>>>>>> 2a57d87d-fe49-4034-919b-fdb79531bf68: split-brain observed. [Input/output >>>>>>> error] >>>>>>> [2018-12-18 10:06:04.606908] E [MSGID: 133014] >>>>>>> [shard.c:1248:shard_common_stat_cbk] 0-ovirt-backbone-2-shard: stat failed: >>>>>>> 2a57d87d-fe49-4034-919b-fdb79531bf68 [Input/output error] >>>>>>> [2018-12-18 10:06:04.606927] W [fuse-bridge.c:871:fuse_attr_cbk] >>>>>>> 0-glusterfs-fuse: 428090: FSTAT() >>>>>>> /b1c2c949-aef4-4aec-999b-b179efeef732/dom_md/ids => -1 (Input/output error) >>>>>>> [2018-12-18 10:06:05.107729] E [MSGID: 108008] >>>>>>> [afr-read-txn.c:90:afr_read_txn_refresh_done] >>>>>>> 0-ovirt-backbone-2-replicate-2: Failing FSTAT on gfid >>>>>>> 2a57d87d-fe49-4034-919b-fdb79531bf68: split-brain observed. 
[Input/output >>>>>>> error] >>>>>>> [2018-12-18 10:06:05.107770] E [MSGID: 133014] >>>>>>> [shard.c:1248:shard_common_stat_cbk] 0-ovirt-backbone-2-shard: stat failed: >>>>>>> 2a57d87d-fe49-4034-919b-fdb79531bf68 [Input/output error] >>>>>>> [2018-12-18 10:06:05.107791] W [fuse-bridge.c:871:fuse_attr_cbk] >>>>>>> 0-glusterfs-fuse: 428091: FSTAT() >>>>>>> /b1c2c949-aef4-4aec-999b-b179efeef732/dom_md/ids => -1 (Input/output error) >>>>>>> [2018-12-18 10:06:05.537244] I [MSGID: 108006] >>>>>>> [afr-common.c:5494:afr_local_init] 0-ovirt-backbone-2-replicate-1: no >>>>>>> subvolumes up >>>>>>> [2018-12-18 10:06:05.538523] E [MSGID: 108008] >>>>>>> [afr-read-txn.c:90:afr_read_txn_refresh_done] >>>>>>> 0-ovirt-backbone-2-replicate-2: Failing STAT on gfid >>>>>>> 00000000-0000-0000-0000-000000000001: split-brain observed. [Input/output >>>>>>> error] >>>>>>> [2018-12-18 10:06:05.538685] I [MSGID: 108006] >>>>>>> [afr-common.c:5494:afr_local_init] 0-ovirt-backbone-2-replicate-1: no >>>>>>> subvolumes up >>>>>>> [2018-12-18 10:06:05.538794] I [MSGID: 108006] >>>>>>> [afr-common.c:5494:afr_local_init] 0-ovirt-backbone-2-replicate-1: no >>>>>>> subvolumes up >>>>>>> [2018-12-18 10:06:05.539342] I [MSGID: 109063] >>>>>>> [dht-layout.c:716:dht_layout_normalize] 0-ovirt-backbone-2-dht: Found >>>>>>> anomalies in /b1c2c949-aef4-4aec-999b-b179efeef732 (gfid = >>>>>>> 8c8598ce-1a52-418e-a7b4-435fee34bae8). Holes=2 overlaps=0 >>>>>>> [2018-12-18 10:06:05.539372] W [MSGID: 109005] >>>>>>> [dht-selfheal.c:2158:dht_selfheal_directory] 0-ovirt-backbone-2-dht: >>>>>>> Directory selfheal failed: 2 subvolumes down.Not fixing. 
path = >>>>>>> /b1c2c949-aef4-4aec-999b-b179efeef732, gfid = >>>>>>> 8c8598ce-1a52-418e-a7b4-435fee34bae8 >>>>>>> [2018-12-18 10:06:05.539694] I [MSGID: 108006] >>>>>>> [afr-common.c:5494:afr_local_init] 0-ovirt-backbone-2-replicate-1: no >>>>>>> subvolumes up >>>>>>> [2018-12-18 10:06:05.540652] I [MSGID: 108006] >>>>>>> [afr-common.c:5494:afr_local_init] 0-ovirt-backbone-2-replicate-1: no >>>>>>> subvolumes up >>>>>>> [2018-12-18 10:06:05.608612] E [MSGID: 108008] >>>>>>> [afr-read-txn.c:90:afr_read_txn_refresh_done] >>>>>>> 0-ovirt-backbone-2-replicate-2: Failing FSTAT on gfid >>>>>>> 2a57d87d-fe49-4034-919b-fdb79531bf68: split-brain observed. [Input/output >>>>>>> error] >>>>>>> [2018-12-18 10:06:05.608657] E [MSGID: 133014] >>>>>>> [shard.c:1248:shard_common_stat_cbk] 0-ovirt-backbone-2-shard: stat failed: >>>>>>> 2a57d87d-fe49-4034-919b-fdb79531bf68 [Input/output error] >>>>>>> [2018-12-18 10:06:05.608672] W [fuse-bridge.c:871:fuse_attr_cbk] >>>>>>> 0-glusterfs-fuse: 428096: FSTAT() >>>>>>> /b1c2c949-aef4-4aec-999b-b179efeef732/dom_md/ids => -1 (Input/output error) >>>>>>> [2018-12-18 10:06:06.109339] E [MSGID: 108008] >>>>>>> [afr-read-txn.c:90:afr_read_txn_refresh_done] >>>>>>> 0-ovirt-backbone-2-replicate-2: Failing FSTAT on gfid >>>>>>> 2a57d87d-fe49-4034-919b-fdb79531bf68: split-brain observed. 
[Input/output >>>>>>> error] >>>>>>> [2018-12-18 10:06:06.109378] E [MSGID: 133014] >>>>>>> [shard.c:1248:shard_common_stat_cbk] 0-ovirt-backbone-2-shard: stat failed: >>>>>>> 2a57d87d-fe49-4034-919b-fdb79531bf68 [Input/output error] >>>>>>> [2018-12-18 10:06:06.109399] W [fuse-bridge.c:871:fuse_attr_cbk] >>>>>>> 0-glusterfs-fuse: 428097: FSTAT() >>>>>>> /b1c2c949-aef4-4aec-999b-b179efeef732/dom_md/ids => -1 (Input/output error) >>>>>>> >>>>>>> #note i'm able to see ; >>>>>>> /b1c2c949-aef4-4aec-999b-b179efeef732/dom_md/ids >>>>>>> [root at lease-11 ovirt-backbone-2]# stat >>>>>>> /mnt/ovirt-backbone-2/b1c2c949-aef4-4aec-999b-b179efeef732/dom_md/ids >>>>>>> File: >>>>>>> ?/mnt/ovirt-backbone-2/b1c2c949-aef4-4aec-999b-b179efeef732/dom_md/ids? >>>>>>> Size: 1048576 Blocks: 2048 IO Block: 131072 regular >>>>>>> file >>>>>>> Device: 41h/65d Inode: 10492258721813610344 Links: 1 >>>>>>> Access: (0660/-rw-rw----) Uid: ( 36/ vdsm) Gid: ( 36/ >>>>>>> kvm) >>>>>>> Context: system_u:object_r:fusefs_t:s0 >>>>>>> Access: 2018-12-19 20:07:39.917573869 +0000 >>>>>>> Modify: 2018-12-19 20:07:39.928573917 +0000 >>>>>>> Change: 2018-12-19 20:07:39.929573921 +0000 >>>>>>> Birth: - >>>>>>> >>>>>>> however checking: gluster v heal ovirt-backbone-2 info split-brain >>>>>>> reports no entries. >>>>>>> >>>>>>> I've also tried mounting the qemu image, and this works fine, i'm >>>>>>> able to see all contents; >>>>>>> losetup /dev/loop0 >>>>>>> /mnt/ovirt-backbone-2/b1c2c949-aef4-4aec-999b-b179efeef732/images/f6ac9660-a84e-469e-a17c-c6dbc538af4b/d6b09501-5b79-4c92-bf10-2ed3bda1b425 >>>>>>> kpartx -a /dev/loop0 >>>>>>> vgscan >>>>>>> vgchange -ay slave-data >>>>>>> mkdir /mnt/slv01 >>>>>>> mount /dev/mapper/slave--data-lvol0 /mnt/slv01/ >>>>>>> >>>>>>> Possible causes for this issue; >>>>>>> 1. the machine "lease-11" suffered from a faulty RAM module (ECC), >>>>>>> which halted the machine and causes an invalid state. 
(this machine also >>>>>>> hosts other volumes, with similar configurations, which report no issue) >>>>>>> 2. after the RAM module was replaced, the VM using the backing qemu >>>>>>> image, was restored from a backup (the backup was file based within the VM >>>>>>> on a different directory). This is because some files were corrupted. The >>>>>>> backup/recovery obviously causes extra IO, possible introducing race >>>>>>> conditions? The machine did run for about 12h without issues, and in total >>>>>>> for about 36h. >>>>>>> 3. since only the client (maybe only gfapi?) reports errors, >>>>>>> something is broken there? >>>>>>> >>>>>>> The volume info; >>>>>>> root at lease-06 ~# gluster v info ovirt-backbone-2 >>>>>>> >>>>>>> Volume Name: ovirt-backbone-2 >>>>>>> Type: Distributed-Replicate >>>>>>> Volume ID: 85702d35-62c8-4c8c-930d-46f455a8af28 >>>>>>> Status: Started >>>>>>> Snapshot Count: 0 >>>>>>> Number of Bricks: 3 x (2 + 1) = 9 >>>>>>> Transport-type: tcp >>>>>>> Bricks: >>>>>>> Brick1: 10.32.9.7:/data/gfs/bricks/brick1/ovirt-backbone-2 >>>>>>> Brick2: 10.32.9.3:/data/gfs/bricks/brick1/ovirt-backbone-2 >>>>>>> Brick3: 10.32.9.4:/data/gfs/bricks/bricka/ovirt-backbone-2 (arbiter) >>>>>>> Brick4: 10.32.9.8:/data0/gfs/bricks/brick1/ovirt-backbone-2 >>>>>>> Brick5: 10.32.9.21:/data0/gfs/bricks/brick1/ovirt-backbone-2 >>>>>>> Brick6: 10.32.9.5:/data/gfs/bricks/bricka/ovirt-backbone-2 (arbiter) >>>>>>> Brick7: 10.32.9.9:/data0/gfs/bricks/brick1/ovirt-backbone-2 >>>>>>> Brick8: 10.32.9.20:/data0/gfs/bricks/brick1/ovirt-backbone-2 >>>>>>> Brick9: 10.32.9.6:/data/gfs/bricks/bricka/ovirt-backbone-2 (arbiter) >>>>>>> Options Reconfigured: >>>>>>> nfs.disable: on >>>>>>> transport.address-family: inet >>>>>>> performance.quick-read: off >>>>>>> performance.read-ahead: off >>>>>>> performance.io-cache: off >>>>>>> performance.low-prio-threads: 32 >>>>>>> network.remote-dio: enable >>>>>>> cluster.eager-lock: enable >>>>>>> cluster.quorum-type: auto >>>>>>> 
cluster.server-quorum-type: server >>>>>>> cluster.data-self-heal-algorithm: full >>>>>>> cluster.locking-scheme: granular >>>>>>> cluster.shd-max-threads: 8 >>>>>>> cluster.shd-wait-qlength: 10000 >>>>>>> features.shard: on >>>>>>> user.cifs: off >>>>>>> storage.owner-uid: 36 >>>>>>> storage.owner-gid: 36 >>>>>>> features.shard-block-size: 64MB >>>>>>> performance.write-behind-window-size: 512MB >>>>>>> performance.cache-size: 384MB >>>>>>> cluster.brick-multiplex: on >>>>>>> >>>>>>> The volume status; >>>>>>> root at lease-06 ~# gluster v status ovirt-backbone-2 >>>>>>> Status of volume: ovirt-backbone-2 >>>>>>> Gluster process TCP Port RDMA Port >>>>>>> Online Pid >>>>>>> >>>>>>> ------------------------------------------------------------------------------ >>>>>>> Brick 10.32.9.7:/data/gfs/bricks/brick1/ovi >>>>>>> rt-backbone-2 49152 0 >>>>>>> Y 7727 >>>>>>> Brick 10.32.9.3:/data/gfs/bricks/brick1/ovi >>>>>>> rt-backbone-2 49152 0 >>>>>>> Y 12620 >>>>>>> Brick 10.32.9.4:/data/gfs/bricks/bricka/ovi >>>>>>> rt-backbone-2 49152 0 >>>>>>> Y 8794 >>>>>>> Brick 10.32.9.8:/data0/gfs/bricks/brick1/ov >>>>>>> irt-backbone-2 49161 0 >>>>>>> Y 22333 >>>>>>> Brick 10.32.9.21:/data0/gfs/bricks/brick1/o >>>>>>> virt-backbone-2 49152 0 >>>>>>> Y 15030 >>>>>>> Brick 10.32.9.5:/data/gfs/bricks/bricka/ovi >>>>>>> rt-backbone-2 49166 0 >>>>>>> Y 24592 >>>>>>> Brick 10.32.9.9:/data0/gfs/bricks/brick1/ov >>>>>>> irt-backbone-2 49153 0 >>>>>>> Y 20148 >>>>>>> Brick 10.32.9.20:/data0/gfs/bricks/brick1/o >>>>>>> virt-backbone-2 49154 0 >>>>>>> Y 15413 >>>>>>> Brick 10.32.9.6:/data/gfs/bricks/bricka/ovi >>>>>>> rt-backbone-2 49152 0 >>>>>>> Y 43120 >>>>>>> Self-heal Daemon on localhost N/A N/A >>>>>>> Y 44587 >>>>>>> Self-heal Daemon on 10.201.0.2 N/A N/A >>>>>>> Y 8401 >>>>>>> Self-heal Daemon on 10.201.0.5 N/A N/A >>>>>>> Y 11038 >>>>>>> Self-heal Daemon on 10.201.0.8 N/A N/A >>>>>>> Y 9513 >>>>>>> Self-heal Daemon on 10.32.9.4 N/A N/A >>>>>>> Y 23736 >>>>>>> Self-heal Daemon on 
10.32.9.20 N/A N/A >>>>>>> Y 2738 >>>>>>> Self-heal Daemon on 10.32.9.3 N/A N/A >>>>>>> Y 25598 >>>>>>> Self-heal Daemon on 10.32.9.5 N/A N/A >>>>>>> Y 511 >>>>>>> Self-heal Daemon on 10.32.9.9 N/A N/A >>>>>>> Y 23357 >>>>>>> Self-heal Daemon on 10.32.9.8 N/A N/A >>>>>>> Y 15225 >>>>>>> Self-heal Daemon on 10.32.9.7 N/A N/A >>>>>>> Y 25781 >>>>>>> Self-heal Daemon on 10.32.9.21 N/A N/A >>>>>>> Y 5034 >>>>>>> >>>>>>> Task Status of Volume ovirt-backbone-2 >>>>>>> >>>>>>> ------------------------------------------------------------------------------ >>>>>>> Task : Rebalance >>>>>>> ID : 6dfbac43-0125-4568-9ac3-a2c453faaa3d >>>>>>> Status : completed >>>>>>> >>>>>>> gluster version is @3.12.15 and cluster.op-version=31202 >>>>>>> >>>>>>> ======================== >>>>>>> >>>>>>> It would be nice to know if it's possible to mark the files as not >>>>>>> stale or if i should investigate other things? >>>>>>> Or should we consider this volume lost? >>>>>>> Also checking the code at; >>>>>>> https://github.com/gluster/glusterfs/blob/master/xlators/features/shard/src/shard.c >>>>>>> it seems the functions shifted quite some (line 1724 vs. 2243), so maybe >>>>>>> it's fixed in a future version? >>>>>>> Any thoughts are welcome. >>>>>>> >>>>>>> Thanks Olaf >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> _______________________________________________ >>>>> Gluster-users mailing list >>>>> Gluster-users at gluster.org >>>>> https://lists.gluster.org/mailman/listinfo/gluster-users >>>> >>>> -------------- next part -------------- An HTML attachment was scrubbed... URL: From amudhan83 at gmail.com Mon Jan 14 07:16:50 2019 From: amudhan83 at gmail.com (Amudhan P) Date: Mon, 14 Jan 2019 12:46:50 +0530 Subject: [Gluster-users] Increasing Bitrot speed glusterfs 4.1.6 In-Reply-To: References: Message-ID: Resending mail. I have a total size of 50GB files per node and it has crossed 5 days but till now not completed bitrot signature process? 
yet 20GB+ of files are still pending completion.

On Fri, Jan 11, 2019 at 12:02 PM Amudhan P wrote:
> Hi,
>
> How do I increase the speed of the bitrot file signature process in
> glusterfs 4.1.6?
> Currently, it's processing 250 KB/s. Is there any way to make the
> changes through the gluster CLI?
>
> regards
> Amudhan

From olaf.buitelaar at gmail.com Mon Jan 14 08:11:19 2019
From: olaf.buitelaar at gmail.com (Olaf Buitelaar)
Date: Mon, 14 Jan 2019 09:11:19 +0100
Subject: [Gluster-users] [Stale file handle] in shard volume
In-Reply-To:
References:
Message-ID:

Hi Krutika,

I think the main problem is that the shard files exist in 2 sub-volumes, 1 being valid and 1 being stale. For example:

sub-volume-1:
node-1: a38d64bc-a28b-4ee1-a0bb-f919e7a1022c.101487 [stale]
node-2: a38d64bc-a28b-4ee1-a0bb-f919e7a1022c.101487 [stale]
node-3: a38d64bc-a28b-4ee1-a0bb-f919e7a1022c.101487 [stale]
sub-volume-2:
node-4: a38d64bc-a28b-4ee1-a0bb-f919e7a1022c.101487 [good]
node-5: a38d64bc-a28b-4ee1-a0bb-f919e7a1022c.101487 [good]
node-6: a38d64bc-a28b-4ee1-a0bb-f919e7a1022c.101487 [good]

More or less exactly as you described here:
https://lists.gluster.org/pipermail/gluster-users/2018-March/033785.html

The VMs getting paused is, I think, purely a side effect.
The issue seems to surface only on volumes with an arbiter brick and sharding enabled, so I suspect something goes wrong, or went wrong, in the shard translator layer.
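For reference, the checks discussed elsewhere in this thread can be sketched in Python. This is only a read-only illustration under stated assumptions, not a gluster tool: the helper names (decode_hex_xattr, gfid_backend_path, looks_like_linkto) are made up for this example, and the sample values are copied from the getfattr/stat output quoted further down. It decodes a hex xattr value as printed by getfattr -e hex, maps a trusted.gfid value to its .glusterfs/xx/yy/ path, and flags entries that match the three traits observed so far (mode ---------T, size 0, trusted.glusterfs.dht.linkto present).

```python
import stat
import uuid

# xattr name taken from the getfattr output quoted in this thread
LINKTO_XATTR = "trusted.glusterfs.dht.linkto"


def _strip_0x(value: str) -> str:
    """Drop the leading 0x that getfattr -e hex prints."""
    return value[2:] if value.startswith("0x") else value


def decode_hex_xattr(value: str) -> str:
    """Decode a hex-encoded xattr value into text, dropping the trailing NUL."""
    raw = bytes.fromhex(_strip_0x(value))
    return raw.rstrip(b"\x00").decode("utf-8")


def gfid_backend_path(gfid_hex: str) -> str:
    """Map a trusted.gfid value to the .glusterfs/xx/yy/<uuid> path on a brick."""
    gfid = str(uuid.UUID(_strip_0x(gfid_hex)))
    return f".glusterfs/{gfid[0:2]}/{gfid[2:4]}/{gfid}"


def looks_like_linkto(mode: int, size: int, xattrs: dict) -> bool:
    """True if an entry has the three traits seen on the suspect shard files.

    mode and size are st_mode and st_size from a stat call; xattrs is a
    name -> value mapping collected by whatever means is available.
    """
    only_sticky_bit = stat.S_IMODE(mode) == stat.S_ISVTX  # ---------T
    return only_sticky_bit and size == 0 and LINKTO_XATTR in xattrs


if __name__ == "__main__":
    linkto = "0x6f766972742d6261636b626f6e652d322d7265706c69636174652d3100"
    print(decode_hex_xattr(linkto))
    print(gfid_backend_path("0x298147e49f9748b2baf1c8fff897244d"))
    print(looks_like_linkto(stat.S_IFREG | 0o1000, 0, {LINKTO_XATTR: linkto}))
```

A match only means the entry looks like a DHT link file. As Nithya points out in this thread, such files are normally valid, so a match on its own is no reason to delete anything.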
I think the log you're interested in is this:

[2019-01-02 02:33:44.433169] I [MSGID: 113030] [posix.c:2171:posix_unlink] 0-ovirt-kube-posix: open-fd-key-status: 0 for /data/gfs/bricks/bricka/ovirt-kube/.shard/a38d64bc-a28b-4ee1-a0bb-f919e7a1022c.101487
[2019-01-02 02:33:44.433188] I [MSGID: 113031] [posix.c:2084:posix_skip_non_linkto_unlink] 0-posix: linkto_xattr status: 0 for /data/gfs/bricks/bricka/ovirt-kube/.shard/a38d64bc-a28b-4ee1-a0bb-f919e7a1022c.101487
[2019-01-02 02:33:44.475027] I [MSGID: 113030] [posix.c:2171:posix_unlink] 0-ovirt-kube-posix: open-fd-key-status: 0 for /data/gfs/bricks/bricka/ovirt-kube/.shard/a38d64bc-a28b-4ee1-a0bb-f919e7a1022c.101488
[2019-01-02 02:33:44.475059] I [MSGID: 113031] [posix.c:2084:posix_skip_non_linkto_unlink] 0-posix: linkto_xattr status: 0 for /data/gfs/bricks/bricka/ovirt-kube/.shard/a38d64bc-a28b-4ee1-a0bb-f919e7a1022c.101488
[2019-01-02 02:35:36.394536] I [MSGID: 115036] [server.c:535:server_rpc_notify] 0-ovirt-kube-server: disconnecting connection from lease-10.dc01.adsolutions-22506-2018/12/24-04:03:32:698336-ovirt-kube-client-2-0-0
[2019-01-02 02:35:36.394800] I [MSGID: 101055] [client_t.c:443:gf_client_unref] 0-ovirt-kube-server: Shutting down connection lease-10.dc01.adsolutions-22506-2018/12/24-04:03:32:698336-ovirt-kube-client-2-0-0

This is from the time the aforementioned machine paused.
I've also attached the other logs. Unfortunately I cannot access the logs of 1 machine, but if you need those I can gather them later.
If you need more samples or info, please let me know.

Thanks Olaf

On Mon, 14 Jan 2019 at 08:16, Krutika Dhananjay wrote:

> Hi,
>
> So the main issue is that certain vms seem to be pausing? Did I understand
> that right?
> Could you share the gluster-mount logs around the time the pause was seen?
> And the brick logs too please?
>
> As for ESTALE errors, the real cause of pauses can be determined from
> errors/warnings logged by fuse.
> Mere occurrence of ESTALE errors against the
> shard function in logs doesn't necessarily indicate that is the reason for
> the pause. Also, in this instance, the ESTALE errors, it seems, are
> propagated by the lower translators (DHT? protocol/client? Or even bricks?)
> and shard is merely logging the same.
>
> -Krutika
>
>
> On Sun, Jan 13, 2019 at 10:11 PM Olaf Buitelaar
> wrote:
>
>> @Krutika if you need any further information, please let me know.
>>
>> Thanks Olaf
>>
>> On Fri, 4 Jan 2019 at 07:51, Nithya Balachandran <nbalacha at redhat.com> wrote:
>>
>>> Adding Krutika.
>>>
>>> On Wed, 2 Jan 2019 at 20:56, Olaf Buitelaar
>>> wrote:
>>>
>>>> Hi Nithya,
>>>>
>>>> Thank you for your reply.
>>>>
>>>> The VMs using the gluster volumes keep getting paused/stopped on
>>>> errors like these:
>>>> [2019-01-02 02:33:44.469132] E [MSGID: 133010] [shard.c:1724:shard_common_lookup_shards_cbk] 0-ovirt-kube-shard: Lookup on shard 101487 failed. Base file gfid = a38d64bc-a28b-4ee1-a0bb-f919e7a1022c [Stale file handle]
>>>> [2019-01-02 02:33:44.563288] E [MSGID: 133010] [shard.c:1724:shard_common_lookup_shards_cbk] 0-ovirt-kube-shard: Lookup on shard 101488 failed. Base file gfid = a38d64bc-a28b-4ee1-a0bb-f919e7a1022c [Stale file handle]
>>>>
>>>> Krutika, Can you take a look at this?
>>>
>>>
>>>>
>>>> What I'm trying to find out is whether I can purge all possible stale
>>>> file handles from all gluster volumes (and hopefully find a method to
>>>> prevent this in the future), so the VMs can start running stably again.
>>>> For this I need to know when the "shard_common_lookup_shards_cbk"
>>>> function considers a file stale.
>>>> The statement "Stale file handle errors show up when a file with a
>>>> specified gfid is not found." doesn't seem to cover it all: as I've shown
>>>> in earlier mails, the shard file and the .glusterfs/xx/xx/uuid file both
>>>> exist, and have the same inode.
>>>> If the criteria I'm using aren't correct, could you please tell me
>>>> which criteria I should use to determine whether a file is stale?
>>>> These criteria are based only on observations I made while moving the
>>>> stale files manually. After removing them I was able to start the VM
>>>> again, until some time later it hung on another stale shard file,
>>>> unfortunately.
>>>>
>>>> Thanks Olaf
>>>>
>>>> On Wed, 2 Jan 2019 at 14:20, Nithya Balachandran <nbalacha at redhat.com> wrote:
>>>>
>>>>>
>>>>>
>>>>> On Mon, 31 Dec 2018 at 01:27, Olaf Buitelaar
>>>>> wrote:
>>>>>
>>>>>> Dear All,
>>>>>>
>>>>>> Until now a selected group of VMs still seems to produce new stale
>>>>>> files and gets paused due to this.
>>>>>> I've not updated gluster recently; however, I did change the op
>>>>>> version from 31200 to 31202 about a week before this issue arose.
>>>>>> Looking at the .shard directory, I have 100,000+ files sharing the same
>>>>>> characteristics as the stale files found so far:
>>>>>> they all have the sticky bit set (file permissions ---------T),
>>>>>> are 0 KB in size, and have the trusted.glusterfs.dht.linkto attribute.
>>>>>>
>>>>>
>>>>> These are internal files used by gluster and do not necessarily mean
>>>>> they are stale. They "point" to data files which may be on different bricks
>>>>> (same name, gfid etc but no linkto xattr and no ----T permissions).
>>>>>
>>>>>
>>>>>> These files range from long ago (beginning of the year) until now,
>>>>>> which makes me suspect this was lying dormant for some time and
>>>>>> somehow surfaced recently.
>>>>>> Checking other sub-volumes, they also contain 0 KB files in the .shard
>>>>>> directory, but those don't have the sticky bit or the linkto attribute.
>>>>>>
>>>>>> Does anybody else experience this issue? Could this be a bug or an
>>>>>> environmental issue?
>>>>>>
>>>>> These are most likely valid files; please do not delete them without
>>>>> double-checking.
>>>>> >>>>> Stale file handle errors show up when a file with a specified gfid is >>>>> not found. You will need to debug the files for which you see this error by >>>>> checking the bricks to see if they actually exist. >>>>> >>>>>> >>>>>> Also i wonder if there is any tool or gluster command to clean all >>>>>> stale file handles? >>>>>> Otherwise i'm planning to make a simple bash script, which iterates >>>>>> over the .shard dir, checks each file for the above mentioned criteria, and >>>>>> (re)moves the file and the corresponding .glusterfs file. >>>>>> If there are other criteria needed to identify a stale file handle, i >>>>>> would like to hear that. >>>>>> If this is a viable and safe operation to do of course. >>>>>> >>>>>> Thanks Olaf >>>>>> >>>>>> >>>>>> >>>>>> Op do 20 dec. 2018 om 13:43 schreef Olaf Buitelaar < >>>>>> olaf.buitelaar at gmail.com>: >>>>>> >>>>>>> Dear All, >>>>>>> >>>>>>> I figured it out, it appeared to be the exact same issue as >>>>>>> described here; >>>>>>> https://lists.gluster.org/pipermail/gluster-users/2018-March/033785.html >>>>>>> Another subvolume also had the shard file, only were all 0 bytes and >>>>>>> had the dht.linkto >>>>>>> >>>>>>> for reference; >>>>>>> [root at lease-04 ovirt-backbone-2]# getfattr -d -m . -e hex >>>>>>> .shard/f28cabcb-d169-41fc-a633-9bef4c4a8e40.51500 >>>>>>> # file: .shard/f28cabcb-d169-41fc-a633-9bef4c4a8e40.51500 >>>>>>> >>>>>>> security.selinux=0x73797374656d5f753a6f626a6563745f723a6574635f72756e74696d655f743a733000 >>>>>>> trusted.gfid=0x298147e49f9748b2baf1c8fff897244d >>>>>>> >>>>>>> trusted.gfid2path.b48064c78d7a85c9=0x62653331383633382d653861302d346336642d393737642d3761393337616138343830362f66323863616263622d643136392d343166632d613633332d3962656634633461386534302e3531353030 >>>>>>> >>>>>>> trusted.glusterfs.dht.linkto=0x6f766972742d6261636b626f6e652d322d7265706c69636174652d3100 >>>>>>> >>>>>>> [root at lease-04 ovirt-backbone-2]# getfattr -d -m . 
-e hex >>>>>>> .glusterfs/29/81/298147e4-9f97-48b2-baf1-c8fff897244d >>>>>>> # file: .glusterfs/29/81/298147e4-9f97-48b2-baf1-c8fff897244d >>>>>>> >>>>>>> security.selinux=0x73797374656d5f753a6f626a6563745f723a6574635f72756e74696d655f743a733000 >>>>>>> trusted.gfid=0x298147e49f9748b2baf1c8fff897244d >>>>>>> >>>>>>> trusted.gfid2path.b48064c78d7a85c9=0x62653331383633382d653861302d346336642d393737642d3761393337616138343830362f66323863616263622d643136392d343166632d613633332d3962656634633461386534302e3531353030 >>>>>>> >>>>>>> trusted.glusterfs.dht.linkto=0x6f766972742d6261636b626f6e652d322d7265706c69636174652d3100 >>>>>>> >>>>>>> [root at lease-04 ovirt-backbone-2]# stat >>>>>>> .glusterfs/29/81/298147e4-9f97-48b2-baf1-c8fff897244d >>>>>>> File: ?.glusterfs/29/81/298147e4-9f97-48b2-baf1-c8fff897244d? >>>>>>> Size: 0 Blocks: 0 IO Block: 4096 regular >>>>>>> empty file >>>>>>> Device: fd01h/64769d Inode: 1918631406 Links: 2 >>>>>>> Access: (1000/---------T) Uid: ( 0/ root) Gid: ( 0/ >>>>>>> root) >>>>>>> Context: system_u:object_r:etc_runtime_t:s0 >>>>>>> Access: 2018-12-17 21:43:36.405735296 +0000 >>>>>>> Modify: 2018-12-17 21:43:36.405735296 +0000 >>>>>>> Change: 2018-12-17 21:43:36.405735296 +0000 >>>>>>> Birth: - >>>>>>> >>>>>>> removing the shard file and glusterfs file from each node resolved >>>>>>> the issue. >>>>>>> >>>>>>> I also found this thread; >>>>>>> https://lists.gluster.org/pipermail/gluster-users/2018-December/035460.html >>>>>>> Maybe he suffers from the same issue. >>>>>>> >>>>>>> Best Olaf >>>>>>> >>>>>>> >>>>>>> Op wo 19 dec. 2018 om 21:56 schreef Olaf Buitelaar < >>>>>>> olaf.buitelaar at gmail.com>: >>>>>>> >>>>>>>> Dear All, >>>>>>>> >>>>>>>> It appears i've a stale file in one of the volumes, on 2 files. >>>>>>>> These files are qemu images (1 raw and 1 qcow2). >>>>>>>> I'll just focus on 1 file since the situation on the other seems >>>>>>>> the same. 
>>>>>>>> >>>>>>>> The VM get's paused more or less directly after being booted with >>>>>>>> error; >>>>>>>> [2018-12-18 14:05:05.275713] E [MSGID: 133010] >>>>>>>> [shard.c:1724:shard_common_lookup_shards_cbk] 0-ovirt-backbone-2-shard: >>>>>>>> Lookup on shard 51500 failed. Base file gfid = >>>>>>>> f28cabcb-d169-41fc-a633-9bef4c4a8e40 [Stale file handle] >>>>>>>> >>>>>>>> investigating the shard; >>>>>>>> >>>>>>>> #on the arbiter node: >>>>>>>> >>>>>>>> [root at lease-05 ovirt-backbone-2]# getfattr -n >>>>>>>> glusterfs.gfid.string >>>>>>>> /mnt/ovirt-backbone-2/b1c2c949-aef4-4aec-999b-b179efeef732/images/f6ac9660-a84e-469e-a17c-c6dbc538af4b/d6b09501-5b79-4c92-bf10-2ed3bda1b425 >>>>>>>> getfattr: Removing leading '/' from absolute path names >>>>>>>> # file: >>>>>>>> mnt/ovirt-backbone-2/b1c2c949-aef4-4aec-999b-b179efeef732/images/f6ac9660-a84e-469e-a17c-c6dbc538af4b/d6b09501-5b79-4c92-bf10-2ed3bda1b425 >>>>>>>> glusterfs.gfid.string="f28cabcb-d169-41fc-a633-9bef4c4a8e40" >>>>>>>> >>>>>>>> [root at lease-05 ovirt-backbone-2]# getfattr -d -m . -e hex >>>>>>>> .shard/f28cabcb-d169-41fc-a633-9bef4c4a8e40.51500 >>>>>>>> # file: .shard/f28cabcb-d169-41fc-a633-9bef4c4a8e40.51500 >>>>>>>> >>>>>>>> security.selinux=0x73797374656d5f753a6f626a6563745f723a6574635f72756e74696d655f743a733000 >>>>>>>> trusted.afr.dirty=0x000000000000000000000000 >>>>>>>> trusted.gfid=0x1f86b4328ec6424699aa48cc6d7b5da0 >>>>>>>> >>>>>>>> trusted.gfid2path.b48064c78d7a85c9=0x62653331383633382d653861302d346336642d393737642d3761393337616138343830362f66323863616263622d643136392d343166632d613633332d3962656634633461386534302e3531353030 >>>>>>>> >>>>>>>> [root at lease-05 ovirt-backbone-2]# getfattr -d -m . 
-e hex >>>>>>>> .glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0 >>>>>>>> # file: .glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0 >>>>>>>> >>>>>>>> security.selinux=0x73797374656d5f753a6f626a6563745f723a6574635f72756e74696d655f743a733000 >>>>>>>> trusted.afr.dirty=0x000000000000000000000000 >>>>>>>> trusted.gfid=0x1f86b4328ec6424699aa48cc6d7b5da0 >>>>>>>> >>>>>>>> trusted.gfid2path.b48064c78d7a85c9=0x62653331383633382d653861302d346336642d393737642d3761393337616138343830362f66323863616263622d643136392d343166632d613633332d3962656634633461386534302e3531353030 >>>>>>>> >>>>>>>> [root at lease-05 ovirt-backbone-2]# stat >>>>>>>> .glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0 >>>>>>>> File: ?.glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0? >>>>>>>> Size: 0 Blocks: 0 IO Block: 4096 regular >>>>>>>> empty file >>>>>>>> Device: fd01h/64769d Inode: 537277306 Links: 2 >>>>>>>> Access: (0660/-rw-rw----) Uid: ( 0/ root) Gid: ( 0/ >>>>>>>> root) >>>>>>>> Context: system_u:object_r:etc_runtime_t:s0 >>>>>>>> Access: 2018-12-17 21:43:36.361984810 +0000 >>>>>>>> Modify: 2018-12-17 21:43:36.361984810 +0000 >>>>>>>> Change: 2018-12-18 20:55:29.908647417 +0000 >>>>>>>> Birth: - >>>>>>>> >>>>>>>> [root at lease-05 ovirt-backbone-2]# find . 
-inum 537277306 >>>>>>>> ./.glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0 >>>>>>>> ./.shard/f28cabcb-d169-41fc-a633-9bef4c4a8e40.51500 >>>>>>>> >>>>>>>> #on the data nodes: >>>>>>>> >>>>>>>> [root at lease-08 ~]# getfattr -n glusterfs.gfid.string >>>>>>>> /mnt/ovirt-backbone-2/b1c2c949-aef4-4aec-999b-b179efeef732/images/f6ac9660-a84e-469e-a17c-c6dbc538af4b/d6b09501-5b79-4c92-bf10-2ed3bda1b425 >>>>>>>> getfattr: Removing leading '/' from absolute path names >>>>>>>> # file: >>>>>>>> mnt/ovirt-backbone-2/b1c2c949-aef4-4aec-999b-b179efeef732/images/f6ac9660-a84e-469e-a17c-c6dbc538af4b/d6b09501-5b79-4c92-bf10-2ed3bda1b425 >>>>>>>> glusterfs.gfid.string="f28cabcb-d169-41fc-a633-9bef4c4a8e40" >>>>>>>> >>>>>>>> [root at lease-08 ovirt-backbone-2]# getfattr -d -m . -e hex >>>>>>>> .shard/f28cabcb-d169-41fc-a633-9bef4c4a8e40.51500 >>>>>>>> # file: .shard/f28cabcb-d169-41fc-a633-9bef4c4a8e40.51500 >>>>>>>> >>>>>>>> security.selinux=0x73797374656d5f753a6f626a6563745f723a6574635f72756e74696d655f743a733000 >>>>>>>> trusted.afr.dirty=0x000000000000000000000000 >>>>>>>> trusted.gfid=0x1f86b4328ec6424699aa48cc6d7b5da0 >>>>>>>> >>>>>>>> trusted.gfid2path.b48064c78d7a85c9=0x62653331383633382d653861302d346336642d393737642d3761393337616138343830362f66323863616263622d643136392d343166632d613633332d3962656634633461386534302e3531353030 >>>>>>>> >>>>>>>> [root at lease-08 ovirt-backbone-2]# getfattr -d -m . 
-e hex >>>>>>>> .glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0 >>>>>>>> # file: .glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0 >>>>>>>> >>>>>>>> security.selinux=0x73797374656d5f753a6f626a6563745f723a6574635f72756e74696d655f743a733000 >>>>>>>> trusted.afr.dirty=0x000000000000000000000000 >>>>>>>> trusted.gfid=0x1f86b4328ec6424699aa48cc6d7b5da0 >>>>>>>> >>>>>>>> trusted.gfid2path.b48064c78d7a85c9=0x62653331383633382d653861302d346336642d393737642d3761393337616138343830362f66323863616263622d643136392d343166632d613633332d3962656634633461386534302e3531353030 >>>>>>>> >>>>>>>> [root at lease-08 ovirt-backbone-2]# stat >>>>>>>> .glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0 >>>>>>>> File: ?.glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0? >>>>>>>> Size: 2166784 Blocks: 4128 IO Block: 4096 regular >>>>>>>> file >>>>>>>> Device: fd03h/64771d Inode: 12893624759 Links: 3 >>>>>>>> Access: (0660/-rw-rw----) Uid: ( 0/ root) Gid: ( 0/ >>>>>>>> root) >>>>>>>> Context: system_u:object_r:etc_runtime_t:s0 >>>>>>>> Access: 2018-12-18 18:52:38.070776585 +0000 >>>>>>>> Modify: 2018-12-17 21:43:36.388054443 +0000 >>>>>>>> Change: 2018-12-18 21:01:47.810506528 +0000 >>>>>>>> Birth: - >>>>>>>> >>>>>>>> [root at lease-08 ovirt-backbone-2]# find . 
-inum 12893624759 >>>>>>>> ./.glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0 >>>>>>>> ./.shard/f28cabcb-d169-41fc-a633-9bef4c4a8e40.51500 >>>>>>>> >>>>>>>> ======================== >>>>>>>> >>>>>>>> [root at lease-11 ovirt-backbone-2]# getfattr -n >>>>>>>> glusterfs.gfid.string >>>>>>>> /mnt/ovirt-backbone-2/b1c2c949-aef4-4aec-999b-b179efeef732/images/f6ac9660-a84e-469e-a17c-c6dbc538af4b/d6b09501-5b79-4c92-bf10-2ed3bda1b425 >>>>>>>> getfattr: Removing leading '/' from absolute path names >>>>>>>> # file: >>>>>>>> mnt/ovirt-backbone-2/b1c2c949-aef4-4aec-999b-b179efeef732/images/f6ac9660-a84e-469e-a17c-c6dbc538af4b/d6b09501-5b79-4c92-bf10-2ed3bda1b425 >>>>>>>> glusterfs.gfid.string="f28cabcb-d169-41fc-a633-9bef4c4a8e40" >>>>>>>> >>>>>>>> [root at lease-11 ovirt-backbone-2]# getfattr -d -m . -e hex >>>>>>>> .shard/f28cabcb-d169-41fc-a633-9bef4c4a8e40.51500 >>>>>>>> # file: .shard/f28cabcb-d169-41fc-a633-9bef4c4a8e40.51500 >>>>>>>> >>>>>>>> security.selinux=0x73797374656d5f753a6f626a6563745f723a6574635f72756e74696d655f743a733000 >>>>>>>> trusted.afr.dirty=0x000000000000000000000000 >>>>>>>> trusted.gfid=0x1f86b4328ec6424699aa48cc6d7b5da0 >>>>>>>> >>>>>>>> trusted.gfid2path.b48064c78d7a85c9=0x62653331383633382d653861302d346336642d393737642d3761393337616138343830362f66323863616263622d643136392d343166632d613633332d3962656634633461386534302e3531353030 >>>>>>>> >>>>>>>> [root at lease-11 ovirt-backbone-2]# getfattr -d -m . 
-e hex >>>>>>>> .glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0 >>>>>>>> # file: .glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0 >>>>>>>> >>>>>>>> security.selinux=0x73797374656d5f753a6f626a6563745f723a6574635f72756e74696d655f743a733000 >>>>>>>> trusted.afr.dirty=0x000000000000000000000000 >>>>>>>> trusted.gfid=0x1f86b4328ec6424699aa48cc6d7b5da0 >>>>>>>> >>>>>>>> trusted.gfid2path.b48064c78d7a85c9=0x62653331383633382d653861302d346336642d393737642d3761393337616138343830362f66323863616263622d643136392d343166632d613633332d3962656634633461386534302e3531353030 >>>>>>>> >>>>>>>> [root at lease-11 ovirt-backbone-2]# stat >>>>>>>> .glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0 >>>>>>>> File: ?.glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0? >>>>>>>> Size: 2166784 Blocks: 4128 IO Block: 4096 regular >>>>>>>> file >>>>>>>> Device: fd03h/64771d Inode: 12956094809 Links: 3 >>>>>>>> Access: (0660/-rw-rw----) Uid: ( 0/ root) Gid: ( 0/ >>>>>>>> root) >>>>>>>> Context: system_u:object_r:etc_runtime_t:s0 >>>>>>>> Access: 2018-12-18 20:11:53.595208449 +0000 >>>>>>>> Modify: 2018-12-17 21:43:36.391580259 +0000 >>>>>>>> Change: 2018-12-18 19:19:25.888055392 +0000 >>>>>>>> Birth: - >>>>>>>> >>>>>>>> [root at lease-11 ovirt-backbone-2]# find . -inum 12956094809 >>>>>>>> ./.glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0 >>>>>>>> ./.shard/f28cabcb-d169-41fc-a633-9bef4c4a8e40.51500 >>>>>>>> >>>>>>>> ================ >>>>>>>> >>>>>>>> I don't really see any inconsistencies, except the dates on the >>>>>>>> stat. However this is only after i tried moving the file out of the volumes >>>>>>>> to force a heal, which does happen on the data nodes, but not on the >>>>>>>> arbiter node. Before that they were also the same. >>>>>>>> I've also compared the file >>>>>>>> ./.shard/f28cabcb-d169-41fc-a633-9bef4c4a8e40.51500 on the 2 nodes and they >>>>>>>> are exactly the same. 
>>>>>>>> >>>>>>>> Things i've further tried; >>>>>>>> - gluster v heal ovirt-backbone-2 full => gluster v heal >>>>>>>> ovirt-backbone-2 info reports 0 entries on all nodes >>>>>>>> >>>>>>>> - stop each glusterd and glusterfsd, pause around 40sec and start >>>>>>>> them again on each node, 1 at a time, waiting for the heal to recover >>>>>>>> before moving to the next node >>>>>>>> >>>>>>>> - force a heal by stopping glusterd on a node and perform these >>>>>>>> steps; >>>>>>>> mkdir /mnt/ovirt-backbone-2/trigger >>>>>>>> rmdir /mnt/ovirt-backbone-2/trigger >>>>>>>> setfattr -n trusted.non-existent-key -v abc /mnt/ovirt-backbone-2/ >>>>>>>> setfattr -x trusted.non-existent-key /mnt/ovirt-backbone-2/ >>>>>>>> start glusterd >>>>>>>> >>>>>>>> - gluster volume rebalance ovirt-backbone-2 start => success >>>>>>>> >>>>>>>> Whats further interesting is that according the mount log, the >>>>>>>> volume is in split-brain; >>>>>>>> [2018-12-18 10:06:04.606870] E [MSGID: 108008] >>>>>>>> [afr-read-txn.c:90:afr_read_txn_refresh_done] >>>>>>>> 0-ovirt-backbone-2-replicate-2: Failing FSTAT on gfid >>>>>>>> 2a57d87d-fe49-4034-919b-fdb79531bf68: split-brain observed. [Input/output >>>>>>>> error] >>>>>>>> [2018-12-18 10:06:04.606908] E [MSGID: 133014] >>>>>>>> [shard.c:1248:shard_common_stat_cbk] 0-ovirt-backbone-2-shard: stat failed: >>>>>>>> 2a57d87d-fe49-4034-919b-fdb79531bf68 [Input/output error] >>>>>>>> [2018-12-18 10:06:04.606927] W [fuse-bridge.c:871:fuse_attr_cbk] >>>>>>>> 0-glusterfs-fuse: 428090: FSTAT() >>>>>>>> /b1c2c949-aef4-4aec-999b-b179efeef732/dom_md/ids => -1 (Input/output error) >>>>>>>> [2018-12-18 10:06:05.107729] E [MSGID: 108008] >>>>>>>> [afr-read-txn.c:90:afr_read_txn_refresh_done] >>>>>>>> 0-ovirt-backbone-2-replicate-2: Failing FSTAT on gfid >>>>>>>> 2a57d87d-fe49-4034-919b-fdb79531bf68: split-brain observed. 
[Input/output >>>>>>>> error] >>>>>>>> [2018-12-18 10:06:05.107770] E [MSGID: 133014] >>>>>>>> [shard.c:1248:shard_common_stat_cbk] 0-ovirt-backbone-2-shard: stat failed: >>>>>>>> 2a57d87d-fe49-4034-919b-fdb79531bf68 [Input/output error] >>>>>>>> [2018-12-18 10:06:05.107791] W [fuse-bridge.c:871:fuse_attr_cbk] >>>>>>>> 0-glusterfs-fuse: 428091: FSTAT() >>>>>>>> /b1c2c949-aef4-4aec-999b-b179efeef732/dom_md/ids => -1 (Input/output error) >>>>>>>> [2018-12-18 10:06:05.537244] I [MSGID: 108006] >>>>>>>> [afr-common.c:5494:afr_local_init] 0-ovirt-backbone-2-replicate-1: no >>>>>>>> subvolumes up >>>>>>>> [2018-12-18 10:06:05.538523] E [MSGID: 108008] >>>>>>>> [afr-read-txn.c:90:afr_read_txn_refresh_done] >>>>>>>> 0-ovirt-backbone-2-replicate-2: Failing STAT on gfid >>>>>>>> 00000000-0000-0000-0000-000000000001: split-brain observed. [Input/output >>>>>>>> error] >>>>>>>> [2018-12-18 10:06:05.538685] I [MSGID: 108006] >>>>>>>> [afr-common.c:5494:afr_local_init] 0-ovirt-backbone-2-replicate-1: no >>>>>>>> subvolumes up >>>>>>>> [2018-12-18 10:06:05.538794] I [MSGID: 108006] >>>>>>>> [afr-common.c:5494:afr_local_init] 0-ovirt-backbone-2-replicate-1: no >>>>>>>> subvolumes up >>>>>>>> [2018-12-18 10:06:05.539342] I [MSGID: 109063] >>>>>>>> [dht-layout.c:716:dht_layout_normalize] 0-ovirt-backbone-2-dht: Found >>>>>>>> anomalies in /b1c2c949-aef4-4aec-999b-b179efeef732 (gfid = >>>>>>>> 8c8598ce-1a52-418e-a7b4-435fee34bae8). Holes=2 overlaps=0 >>>>>>>> [2018-12-18 10:06:05.539372] W [MSGID: 109005] >>>>>>>> [dht-selfheal.c:2158:dht_selfheal_directory] 0-ovirt-backbone-2-dht: >>>>>>>> Directory selfheal failed: 2 subvolumes down.Not fixing. 
path = >>>>>>>> /b1c2c949-aef4-4aec-999b-b179efeef732, gfid = >>>>>>>> 8c8598ce-1a52-418e-a7b4-435fee34bae8 >>>>>>>> [2018-12-18 10:06:05.539694] I [MSGID: 108006] >>>>>>>> [afr-common.c:5494:afr_local_init] 0-ovirt-backbone-2-replicate-1: no >>>>>>>> subvolumes up >>>>>>>> [2018-12-18 10:06:05.540652] I [MSGID: 108006] >>>>>>>> [afr-common.c:5494:afr_local_init] 0-ovirt-backbone-2-replicate-1: no >>>>>>>> subvolumes up >>>>>>>> [2018-12-18 10:06:05.608612] E [MSGID: 108008] >>>>>>>> [afr-read-txn.c:90:afr_read_txn_refresh_done] >>>>>>>> 0-ovirt-backbone-2-replicate-2: Failing FSTAT on gfid >>>>>>>> 2a57d87d-fe49-4034-919b-fdb79531bf68: split-brain observed. [Input/output >>>>>>>> error] >>>>>>>> [2018-12-18 10:06:05.608657] E [MSGID: 133014] >>>>>>>> [shard.c:1248:shard_common_stat_cbk] 0-ovirt-backbone-2-shard: stat failed: >>>>>>>> 2a57d87d-fe49-4034-919b-fdb79531bf68 [Input/output error] >>>>>>>> [2018-12-18 10:06:05.608672] W [fuse-bridge.c:871:fuse_attr_cbk] >>>>>>>> 0-glusterfs-fuse: 428096: FSTAT() >>>>>>>> /b1c2c949-aef4-4aec-999b-b179efeef732/dom_md/ids => -1 (Input/output error) >>>>>>>> [2018-12-18 10:06:06.109339] E [MSGID: 108008] >>>>>>>> [afr-read-txn.c:90:afr_read_txn_refresh_done] >>>>>>>> 0-ovirt-backbone-2-replicate-2: Failing FSTAT on gfid >>>>>>>> 2a57d87d-fe49-4034-919b-fdb79531bf68: split-brain observed. 
[Input/output >>>>>>>> error] >>>>>>>> [2018-12-18 10:06:06.109378] E [MSGID: 133014] >>>>>>>> [shard.c:1248:shard_common_stat_cbk] 0-ovirt-backbone-2-shard: stat failed: >>>>>>>> 2a57d87d-fe49-4034-919b-fdb79531bf68 [Input/output error] >>>>>>>> [2018-12-18 10:06:06.109399] W [fuse-bridge.c:871:fuse_attr_cbk] >>>>>>>> 0-glusterfs-fuse: 428097: FSTAT() >>>>>>>> /b1c2c949-aef4-4aec-999b-b179efeef732/dom_md/ids => -1 (Input/output error) >>>>>>>> >>>>>>>> #note i'm able to see ; >>>>>>>> /b1c2c949-aef4-4aec-999b-b179efeef732/dom_md/ids >>>>>>>> [root at lease-11 ovirt-backbone-2]# stat >>>>>>>> /mnt/ovirt-backbone-2/b1c2c949-aef4-4aec-999b-b179efeef732/dom_md/ids >>>>>>>> File: >>>>>>>> ?/mnt/ovirt-backbone-2/b1c2c949-aef4-4aec-999b-b179efeef732/dom_md/ids? >>>>>>>> Size: 1048576 Blocks: 2048 IO Block: 131072 regular >>>>>>>> file >>>>>>>> Device: 41h/65d Inode: 10492258721813610344 Links: 1 >>>>>>>> Access: (0660/-rw-rw----) Uid: ( 36/ vdsm) Gid: ( 36/ >>>>>>>> kvm) >>>>>>>> Context: system_u:object_r:fusefs_t:s0 >>>>>>>> Access: 2018-12-19 20:07:39.917573869 +0000 >>>>>>>> Modify: 2018-12-19 20:07:39.928573917 +0000 >>>>>>>> Change: 2018-12-19 20:07:39.929573921 +0000 >>>>>>>> Birth: - >>>>>>>> >>>>>>>> however checking: gluster v heal ovirt-backbone-2 info split-brain >>>>>>>> reports no entries. >>>>>>>> >>>>>>>> I've also tried mounting the qemu image, and this works fine, i'm >>>>>>>> able to see all contents; >>>>>>>> losetup /dev/loop0 >>>>>>>> /mnt/ovirt-backbone-2/b1c2c949-aef4-4aec-999b-b179efeef732/images/f6ac9660-a84e-469e-a17c-c6dbc538af4b/d6b09501-5b79-4c92-bf10-2ed3bda1b425 >>>>>>>> kpartx -a /dev/loop0 >>>>>>>> vgscan >>>>>>>> vgchange -ay slave-data >>>>>>>> mkdir /mnt/slv01 >>>>>>>> mount /dev/mapper/slave--data-lvol0 /mnt/slv01/ >>>>>>>> >>>>>>>> Possible causes for this issue; >>>>>>>> 1. the machine "lease-11" suffered from a faulty RAM module (ECC), >>>>>>>> which halted the machine and causes an invalid state. 
(this machine also >>>>>>>> hosts other volumes, with similar configurations, which report no issue) >>>>>>>> 2. after the RAM module was replaced, the VM using the backing qemu >>>>>>>> image, was restored from a backup (the backup was file based within the VM >>>>>>>> on a different directory). This is because some files were corrupted. The >>>>>>>> backup/recovery obviously causes extra IO, possible introducing race >>>>>>>> conditions? The machine did run for about 12h without issues, and in total >>>>>>>> for about 36h. >>>>>>>> 3. since only the client (maybe only gfapi?) reports errors, >>>>>>>> something is broken there? >>>>>>>> >>>>>>>> The volume info; >>>>>>>> root at lease-06 ~# gluster v info ovirt-backbone-2 >>>>>>>> >>>>>>>> Volume Name: ovirt-backbone-2 >>>>>>>> Type: Distributed-Replicate >>>>>>>> Volume ID: 85702d35-62c8-4c8c-930d-46f455a8af28 >>>>>>>> Status: Started >>>>>>>> Snapshot Count: 0 >>>>>>>> Number of Bricks: 3 x (2 + 1) = 9 >>>>>>>> Transport-type: tcp >>>>>>>> Bricks: >>>>>>>> Brick1: 10.32.9.7:/data/gfs/bricks/brick1/ovirt-backbone-2 >>>>>>>> Brick2: 10.32.9.3:/data/gfs/bricks/brick1/ovirt-backbone-2 >>>>>>>> Brick3: 10.32.9.4:/data/gfs/bricks/bricka/ovirt-backbone-2 >>>>>>>> (arbiter) >>>>>>>> Brick4: 10.32.9.8:/data0/gfs/bricks/brick1/ovirt-backbone-2 >>>>>>>> Brick5: 10.32.9.21:/data0/gfs/bricks/brick1/ovirt-backbone-2 >>>>>>>> Brick6: 10.32.9.5:/data/gfs/bricks/bricka/ovirt-backbone-2 >>>>>>>> (arbiter) >>>>>>>> Brick7: 10.32.9.9:/data0/gfs/bricks/brick1/ovirt-backbone-2 >>>>>>>> Brick8: 10.32.9.20:/data0/gfs/bricks/brick1/ovirt-backbone-2 >>>>>>>> Brick9: 10.32.9.6:/data/gfs/bricks/bricka/ovirt-backbone-2 >>>>>>>> (arbiter) >>>>>>>> Options Reconfigured: >>>>>>>> nfs.disable: on >>>>>>>> transport.address-family: inet >>>>>>>> performance.quick-read: off >>>>>>>> performance.read-ahead: off >>>>>>>> performance.io-cache: off >>>>>>>> performance.low-prio-threads: 32 >>>>>>>> network.remote-dio: enable >>>>>>>> 
cluster.eager-lock: enable >>>>>>>> cluster.quorum-type: auto >>>>>>>> cluster.server-quorum-type: server >>>>>>>> cluster.data-self-heal-algorithm: full >>>>>>>> cluster.locking-scheme: granular >>>>>>>> cluster.shd-max-threads: 8 >>>>>>>> cluster.shd-wait-qlength: 10000 >>>>>>>> features.shard: on >>>>>>>> user.cifs: off >>>>>>>> storage.owner-uid: 36 >>>>>>>> storage.owner-gid: 36 >>>>>>>> features.shard-block-size: 64MB >>>>>>>> performance.write-behind-window-size: 512MB >>>>>>>> performance.cache-size: 384MB >>>>>>>> cluster.brick-multiplex: on >>>>>>>> >>>>>>>> The volume status; >>>>>>>> root at lease-06 ~# gluster v status ovirt-backbone-2 >>>>>>>> Status of volume: ovirt-backbone-2 >>>>>>>> Gluster process TCP Port RDMA Port >>>>>>>> Online Pid >>>>>>>> >>>>>>>> ------------------------------------------------------------------------------ >>>>>>>> Brick 10.32.9.7:/data/gfs/bricks/brick1/ovi >>>>>>>> rt-backbone-2 49152 0 >>>>>>>> Y 7727 >>>>>>>> Brick 10.32.9.3:/data/gfs/bricks/brick1/ovi >>>>>>>> rt-backbone-2 49152 0 >>>>>>>> Y 12620 >>>>>>>> Brick 10.32.9.4:/data/gfs/bricks/bricka/ovi >>>>>>>> rt-backbone-2 49152 0 >>>>>>>> Y 8794 >>>>>>>> Brick 10.32.9.8:/data0/gfs/bricks/brick1/ov >>>>>>>> irt-backbone-2 49161 0 >>>>>>>> Y 22333 >>>>>>>> Brick 10.32.9.21:/data0/gfs/bricks/brick1/o >>>>>>>> virt-backbone-2 49152 0 >>>>>>>> Y 15030 >>>>>>>> Brick 10.32.9.5:/data/gfs/bricks/bricka/ovi >>>>>>>> rt-backbone-2 49166 0 >>>>>>>> Y 24592 >>>>>>>> Brick 10.32.9.9:/data0/gfs/bricks/brick1/ov >>>>>>>> irt-backbone-2 49153 0 >>>>>>>> Y 20148 >>>>>>>> Brick 10.32.9.20:/data0/gfs/bricks/brick1/o >>>>>>>> virt-backbone-2 49154 0 >>>>>>>> Y 15413 >>>>>>>> Brick 10.32.9.6:/data/gfs/bricks/bricka/ovi >>>>>>>> rt-backbone-2 49152 0 >>>>>>>> Y 43120 >>>>>>>> Self-heal Daemon on localhost N/A N/A >>>>>>>> Y 44587 >>>>>>>> Self-heal Daemon on 10.201.0.2 N/A N/A >>>>>>>> Y 8401 >>>>>>>> Self-heal Daemon on 10.201.0.5 N/A N/A >>>>>>>> Y 11038 >>>>>>>> Self-heal Daemon on 
10.201.0.8 N/A N/A >>>>>>>> Y 9513 >>>>>>>> Self-heal Daemon on 10.32.9.4 N/A N/A >>>>>>>> Y 23736 >>>>>>>> Self-heal Daemon on 10.32.9.20 N/A N/A >>>>>>>> Y 2738 >>>>>>>> Self-heal Daemon on 10.32.9.3 N/A N/A >>>>>>>> Y 25598 >>>>>>>> Self-heal Daemon on 10.32.9.5 N/A N/A >>>>>>>> Y 511 >>>>>>>> Self-heal Daemon on 10.32.9.9 N/A N/A >>>>>>>> Y 23357 >>>>>>>> Self-heal Daemon on 10.32.9.8 N/A N/A >>>>>>>> Y 15225 >>>>>>>> Self-heal Daemon on 10.32.9.7 N/A N/A >>>>>>>> Y 25781 >>>>>>>> Self-heal Daemon on 10.32.9.21 N/A N/A >>>>>>>> Y 5034 >>>>>>>> >>>>>>>> Task Status of Volume ovirt-backbone-2 >>>>>>>> >>>>>>>> ------------------------------------------------------------------------------ >>>>>>>> Task : Rebalance >>>>>>>> ID : 6dfbac43-0125-4568-9ac3-a2c453faaa3d >>>>>>>> Status : completed >>>>>>>> >>>>>>>> gluster version is @3.12.15 and cluster.op-version=31202 >>>>>>>> >>>>>>>> ======================== >>>>>>>> >>>>>>>> It would be nice to know if it's possible to mark the files as not >>>>>>>> stale or if i should investigate other things? >>>>>>>> Or should we consider this volume lost? >>>>>>>> Also checking the code at; >>>>>>>> https://github.com/gluster/glusterfs/blob/master/xlators/features/shard/src/shard.c >>>>>>>> it seems the functions shifted quite some (line 1724 vs. 2243), so maybe >>>>>>>> it's fixed in a future version? >>>>>>>> Any thoughts are welcome. >>>>>>>> >>>>>>>> Thanks Olaf >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> _______________________________________________ >>>>>> Gluster-users mailing list >>>>>> Gluster-users at gluster.org >>>>>> https://lists.gluster.org/mailman/listinfo/gluster-users >>>>> >>>>> -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: data-gfs-bricks-bricka-ovirt-kube.log-20190106.gz Type: application/x-gzip Size: 144647 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: l7-data-gfs-bricks-brick1-ovirt-kube.log-20190106.gz Type: application/x-gzip Size: 146483 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: l10-data-gfs-bricks-brick1-ovirt-kube.log-20190106.gz Type: application/x-gzip Size: 146605 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: l5-data-gfs-bricks-bricka-ovirt-kube.log-20190106.gz Type: application/x-gzip Size: 144387 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: l8-data-gfs-bricks-brick1-ovirt-kube.log-20190106.gz Type: application/x-gzip Size: 302224 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: rhev-data-center-mnt-glusterSD-10.32.9.20?_ovirt-kube.log-20190106.gz Type: application/x-gzip Size: 646 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: l11-data-gfs-bricks-brick1-ovirt-kube.log-20190106.gz Type: application/x-gzip Size: 146474 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: l11-data-gfs-bricks-bricka-ovirt-kube.log-20190106.gz Type: application/x-gzip Size: 303508 bytes Desc: not available URL: From mauro.list at yahoo.com Mon Jan 14 11:19:44 2019 From: mauro.list at yahoo.com (Mauro Gatti) Date: Mon, 14 Jan 2019 11:19:44 +0000 (UTC) Subject: [Gluster-users] HELP: Commit failed on localhost. 
Please check the log file for more details
References: <132267624.831700.1547464784997.ref@mail.yahoo.com>
Message-ID: <132267624.831700.1547464784997@mail.yahoo.com>

Hello,

I've just installed Gluster on my Raspberry Pi. I am trying to run some tests on a USB stick I mounted. Unfortunately I'm stuck on an error that says:

root at europa:/var/log/glusterfs# gluster volume create prova transport tcp europa:/mnt/usb1/prova
volume create: prova: failed: Commit failed on localhost. Please check the log file for more details.

I looked in /var/log/glusterfs for something that could help, but I didn't find anything special.
I know I'm missing something that would be obvious to an experienced user, but I cannot figure out what. Could you help me troubleshoot this issue?

Thank You

From dijuremo at gmail.com Tue Jan 15 01:18:11 2019
From: dijuremo at gmail.com (Diego Remolina)
Date: Mon, 14 Jan 2019 20:18:11 -0500
Subject: [Gluster-users] To good to be truth speed improvements?
Message-ID:

Dear all,

I was running gluster 3.10.12 on a pair of servers and recently upgraded to 4.1.6. There is a cron job that runs nightly on one machine, which rsyncs the data on the servers over to another machine for backup purposes. The rsync operation runs on one of the gluster servers, which mounts the gluster volume via fuse on /export.

When using 3.10.12, this process would start at 8:00PM nightly, and usually finish at around 4:30AM when the servers had been freshly rebooted. From that point on, runs would start taking a bit longer and stabilize, ending at around 7-9AM depending on actual file changes. At some point the servers would start eating up so much RAM (up to 30GB) that I would have to reboot them to bring things back to normal, as the file system would become extremely slow (perhaps this is the memory leak I have read was present in 3.10.x).

After upgrading to 4.1.6 over the weekend, I was shocked to see the rsync process finish in about 1 hour and 26 minutes, compared to 8 hours 30 minutes with the older version. This is a nice speed-up; however, I can only ask myself what has changed so drastically that this process is now so fast. Have there really been improvements in 4.1.6 that could speed this up so dramatically? In both of my test cases there would not really have been a lot to copy via rsync, given that the fresh reboots are done on Saturday after the sync from the day before has finished.

In general, the servers (which are accessed via samba by windows clients) are much faster and more responsive since the update to 4.1.6. Tonight I will have the first rsync run which will actually have to copy the day's changes, and I will have another point of comparison.

I am still using fuse mounts for samba, due to prior problems with vfs = gluster. These problems are still present in Samba 4.8.3-4 and are already documented in bugs for which patches exist, but no official updated samba packages have been released yet. Since I was going from 3.10.12 to 4.1.6, I also did not want to change other things, so that I could attribute any issues to the change in gluster versions and eliminate other complexity.

The file system currently has about 16TB of data in 5142816 files and 696544 directories.

I've just run the following code to count files and dirs, and it took 67 mins 38.957 secs to complete on this gluster volume:
https://github.com/ChristopherSchultz/fast-file-count

# time ( /root/sbin/dircnt /export )
/export contains 5142816 files and 696544 directories

real 67m38.957s
user 0m6.225s
sys 0m48.939s

The gluster options set on the volume are:
https://termbin.com/yxtd

# gluster v status export
Status of volume: export
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.0.1.7:/bricks/hdds/brick           49157     0          Y       13986
Brick 10.0.1.6:/bricks/hdds/brick           49153     0          Y       9953
Self-heal Daemon on localhost               N/A       N/A        Y       21934
Self-heal Daemon on 10.0.1.5                N/A       N/A        Y       4598
Self-heal Daemon on 10.0.1.6                N/A       N/A        Y       14485

Task Status of Volume export
------------------------------------------------------------------------------
There are no active volume tasks

In truth, there is a 3rd server here, but no bricks on it.

Thoughts?

Diego

From davide.obbi at booking.com Tue Jan 15 10:02:56 2019
From: davide.obbi at booking.com (Davide Obbi)
Date: Tue, 15 Jan 2019 11:02:56 +0100
Subject: [Gluster-users] [External] To good to be truth speed improvements?
In-Reply-To:
References:
Message-ID:

On Tue, Jan 15, 2019 at 2:18 AM Diego Remolina wrote:

> Dear all,
>
> I was running gluster 3.10.12 on a pair of servers and recently upgraded
> to 4.1.6. There is a cron job that runs nightly in one machine, which
> rsyncs the data on the servers over to another machine for backup purposes.
> The rsync operation runs on one of the gluster servers, which mounts the
> gluster volume via fuse on /export.
> > When using 3.10.12, this process would start at 8:00PM nightly, and > usually end up at around 4:30AM when the servers had been freshly rebooted. > From this point, things would start taking a bit longer and stabilize > ending at around 7-9AM depending on actual file changes and at some point > the servers would start eating up so much ram (up to 30GB) and I would have > to reboot them to bring things back to normal as the file system would > become extremely slow (perhaps the memory leak I have read was present on > 3.10.x). > > After upgrading to 4.1.6 over the weekend, I was shocked to see the rsync > process finish in about 1 hour and 26 minutes. This is compared to 8 hours > 30 mins with the older version. This is a nice speed up, however, I can > only ask myself what has changed so drastically that this process is now so > fast. Have there really been improvements in 4.1.6 that could speed this up > so dramatically? In both of my test cases, there would had not really been > a lot to copy via rsync given the fresh reboots are done on Saturday after > the sync has finished from the day before. > > In general, the servers (which are accessed via samba for windows clients) > are much faster and responsive since the update to 4.1.6. Tonight I will > have the first rsync run which will actually have to copy the day's changes > and will have another point of comparison. > > I am still using fuse mounts for samba, due to prior problems with vsf > =gluster, which are currently present in Samba 4.8.3-4, and already > documented in bugs, for which patches exist, but no official updated samba > packages have been released yet. Since I was going from 3.10.12 to 4.1.6 I > also did not want to change other things to make sure I could track any > issues just related to the change in gluster versions and eliminate other > complexity. 
>
> The file system currently has about 16TB of data in
> 5142816 files and 696544 directories
>
> I've just ran the following code to count files and dirs and it took
> 67mins 38.957 secs to complete in this gluster volume:
> https://github.com/ChristopherSchultz/fast-file-count
>
> # time ( /root/sbin/dircnt /export )
> /export contains 5142816 files and 696544 directories
>
> real 67m38.957s
> user 0m6.225s
> sys 0m48.939s
>
> The gluster options set on the volume are:
> https://termbin.com/yxtd
>
> # gluster v status export
> Status of volume: export
> Gluster process                             TCP Port  RDMA Port  Online  Pid
> ------------------------------------------------------------------------------
> Brick 10.0.1.7:/bricks/hdds/brick           49157     0          Y       13986
> Brick 10.0.1.6:/bricks/hdds/brick           49153     0          Y       9953
> Self-heal Daemon on localhost               N/A       N/A        Y       21934
> Self-heal Daemon on 10.0.1.5                N/A       N/A        Y       4598
> Self-heal Daemon on 10.0.1.6                N/A       N/A        Y       14485
>
> Task Status of Volume export
> ------------------------------------------------------------------------------
> There are no active volume tasks
>
> Truth, there is a 3rd server here, but no bricks on it.
>
> Thoughts?
>
> Diego

Hi Diego,

Besides the actual improvements made in the code, I think new releases might enable by default volume options that previously had a different setting. It would have been interesting to diff the output of "gluster volume get <volname> all" before and after the upgrade. Just out of curiosity, and because I am trying to figure out volume options for rsync-style workloads, can you share the command output anyway, along with gluster volume info?
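
The before/after comparison suggested here could be sketched roughly as follows, assuming the output of "gluster volume get <volname> all" had been saved as text on both versions. The function names and the sample values are illustrative only and are not part of gluster; the parser assumes the usual two-column "Option / Value" layout of that command.

```python
def parse_volume_get(text):
    """Parse saved `gluster volume get <volname> all` output into {option: value}.

    Assumes two whitespace-separated columns: the option name, then its value.
    """
    options = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        # Skip blank lines, the "Option  Value" header and the dashed rule.
        if not parts or parts[0] == "Option" or set(parts[0]) == {"-"}:
            continue
        options[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return options


def diff_options(before, after):
    """Return {option: (old, new)} for every option whose value differs.

    An option missing on one side is reported with None on that side.
    """
    changed = {}
    for key in sorted(set(before) | set(after)):
        old, new = before.get(key), after.get(key)
        if old != new:
            changed[key] = (old, new)
    return changed


# Illustrative sample data, not taken from the volume discussed in this thread.
OLD_OUTPUT = """Option                 Value
------                 -----
cluster.quorum-type    auto
cluster.quorum-count   (null)
"""
NEW_OUTPUT = """Option                 Value
------                 -----
cluster.quorum-type    none
cluster.quorum-count   (null)
features.selinux       on
"""

if __name__ == "__main__":
    changes = diff_options(parse_volume_get(OLD_OUTPUT), parse_volume_get(NEW_OUTPUT))
    for name, (old, new) in changes.items():
        print(f"{name}: {old} -> {new}")
```

Comparing parsed option/value pairs rather than raw text means column-width or ordering changes between releases do not show up as differences.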

thanks

From dijuremo at gmail.com Tue Jan 15 13:28:17 2019
From: dijuremo at gmail.com (Diego Remolina)
Date: Tue, 15 Jan 2019 08:28:17 -0500
Subject: [Gluster-users] [External] To good to be truth speed improvements?
In-Reply-To:
References:
Message-ID:

Hi Davide,

The options information was already provided in my prior e-mail; see the termbin.com link for the options of the volume after the 4.1.6 upgrade.

The gluster options set on the volume are:
https://termbin.com/yxtd

This is the other piece:

# gluster v info export

Volume Name: export
Type: Replicate
Volume ID: b4353b3f-6ef6-4813-819a-8e85e5a95cff
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: 10.0.1.7:/bricks/hdds/brick
Brick2: 10.0.1.6:/bricks/hdds/brick
Options Reconfigured:
performance.stat-prefetch: on
performance.cache-min-file-size: 0
network.inode-lru-limit: 65536
performance.cache-invalidation: on
features.cache-invalidation: on
performance.md-cache-timeout: 600
features.cache-invalidation-timeout: 600
performance.cache-samba-metadata: on
transport.address-family: inet
server.allow-insecure: on
performance.cache-size: 10GB
cluster.server-quorum-type: server
nfs.disable: on
performance.io-thread-count: 64
performance.io-cache: on
cluster.lookup-optimize: on
cluster.readdir-optimize: on
server.event-threads: 5
client.event-threads: 5
performance.cache-max-file-size: 256MB
diagnostics.client-log-level: INFO
diagnostics.brick-log-level: INFO
cluster.server-quorum-ratio: 51%

I did create a backup of /var/lib/glusterd, so if you tell me how to pull information from there to compare, I can do it.

I compared the file /var/lib/glusterd/vols/export/info from both versions and the contents are the same, though the entries are in a different order.
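
An order-insensitive comparison like the one described for the info file could be sketched as below. This is only a sketch under the assumption that the file consists of plain key=value lines; the helper names and the sample contents are hypothetical.

```python
def load_info(text):
    """Parse key=value lines into a dict, ignoring blank or malformed lines."""
    pairs = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        pairs[key] = value
    return pairs


def same_settings(text_a, text_b):
    """True when both texts hold the same key/value pairs, in any order."""
    return load_info(text_a) == load_info(text_b)


# Hypothetical sample contents, for illustration only.
CURRENT = "type=2\ncount=2\nnfs.disable=on\n"
BACKUP = "nfs.disable=on\ntype=2\ncount=2\n"

if __name__ == "__main__":
    print(same_settings(CURRENT, BACKUP))
```

With real files, the two texts would be read from the live directory and from the backup copy before being passed to same_settings.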
Diego On Tue, Jan 15, 2019 at 5:03 AM Davide Obbi wrote: > > > On Tue, Jan 15, 2019 at 2:18 AM Diego Remolina wrote: > >> Dear all, >> >> I was running gluster 3.10.12 on a pair of servers and recently upgraded >> to 4.1.6. There is a cron job that runs nightly in one machine, which >> rsyncs the data on the servers over to another machine for backup purposes. >> The rsync operation runs on one of the gluster servers, which mounts the >> gluster volume via fuse on /export. >> >> When using 3.10.12, this process would start at 8:00PM nightly, and >> usually end up at around 4:30AM when the servers had been freshly rebooted. >> From this point, things would start taking a bit longer and stabilize >> ending at around 7-9AM depending on actual file changes and at some point >> the servers would start eating up so much ram (up to 30GB) and I would have >> to reboot them to bring things back to normal as the file system would >> become extremely slow (perhaps the memory leak I have read was present on >> 3.10.x). >> >> After upgrading to 4.1.6 over the weekend, I was shocked to see the rsync >> process finish in about 1 hour and 26 minutes. This is compared to 8 hours >> 30 mins with the older version. This is a nice speed up, however, I can >> only ask myself what has changed so drastically that this process is now so >> fast. Have there really been improvements in 4.1.6 that could speed this up >> so dramatically? In both of my test cases, there would had not really been >> a lot to copy via rsync given the fresh reboots are done on Saturday after >> the sync has finished from the day before. >> >> In general, the servers (which are accessed via samba for windows >> clients) are much faster and responsive since the update to 4.1.6. Tonight >> I will have the first rsync run which will actually have to copy the day's >> changes and will have another point of comparison. 
>> >> I am still using fuse mounts for samba, due to prior problems with vsf >> =gluster, which are currently present in Samba 4.8.3-4, and already >> documented in bugs, for which patches exist, but no official updated samba >> packages have been released yet. Since I was going from 3.10.12 to 4.1.6 I >> also did not want to change other things to make sure I could track any >> issues just related to the change in gluster versions and eliminate other >> complexity. >> >> The file system currently has about 16TB of data in >> 5142816 files and 696544 directories >> >> I've just ran the following code to count files and dirs and it took >> 67mins 38.957 secs to complete in this gluster volume: >> https://github.com/ChristopherSchultz/fast-file-count >> >> # time ( /root/sbin/dircnt /export ) >> /export contains 5142816 files and 696544 directories >> >> real 67m38.957s >> user 0m6.225s >> sys 0m48.939s >> >> The gluster options set on the volume are: >> https://termbin.com/yxtd >> >> # gluster v status export >> Status of volume: export >> Gluster process TCP Port RDMA Port Online >> Pid >> >> ------------------------------------------------------------------------------ >> Brick 10.0.1.7:/bricks/hdds/brick 49157 0 Y >> 13986 >> Brick 10.0.1.6:/bricks/hdds/brick 49153 0 Y >> 9953 >> Self-heal Daemon on localhost N/A N/A Y >> 21934 >> Self-heal Daemon on 10.0.1.5 N/A N/A Y >> 4598 >> Self-heal Daemon on 10.0.1.6 N/A N/A Y >> 14485 >> >> Task Status of Volume export >> >> ------------------------------------------------------------------------------ >> There are no active volume tasks >> >> Truth, there is a 3rd server here, but no bricks on it. >> >> Thoughts? >> >> Diego >> >> >> Virus-free. 
>> www.avast.com >> >> <#m_-4021393732076721680_m_8084651329793795211_m_7462352325940458688_m_-6479459361629161759_DAB4FAD8-2DD7-40BB-A1B8-4E2AA1F9FDF2> >> _______________________________________________ >> Gluster-users mailing list >> Gluster-users at gluster.org >> https://lists.gluster.org/mailman/listinfo/gluster-users > > > Hi Diego, > > Besides the actual improvements made in the code i think new releases > might implement volume options by default that before might have had > different setting. I would have been interesting to diff "gluster volume > get all" befor and after the upgrade. Just for curiosity and i am > trying to figure out volume options for rsync kind of workloads can you > share the command output anyway along with gluster volume info ? > > thanks > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From davide.obbi at booking.com Tue Jan 15 19:03:57 2019 From: davide.obbi at booking.com (Davide Obbi) Date: Tue, 15 Jan 2019 20:03:57 +0100 Subject: [Gluster-users] [External] To good to be truth speed improvements? In-Reply-To: References: Message-ID: i think you can find the volume options doing a grep -R option /var/lib/glusterd/vols/ and the .vol files show the options On Tue, Jan 15, 2019 at 2:28 PM Diego Remolina wrote: > Hi Davide, > > The options information is already provided in prior e-mail, see the > termbin.con link for the options of the volume after the 4.1.6 upgrade. 
> > The gluster options set on the volume are: > https://termbin.com/yxtd > > This is the other piece: > > # gluster v info export > > Volume Name: export > Type: Replicate > Volume ID: b4353b3f-6ef6-4813-819a-8e85e5a95cff > Status: Started > Snapshot Count: 0 > Number of Bricks: 1 x 2 = 2 > Transport-type: tcp > Bricks: > Brick1: 10.0.1.7:/bricks/hdds/brick > Brick2: 10.0.1.6:/bricks/hdds/brick > Options Reconfigured: > performance.stat-prefetch: on > performance.cache-min-file-size: 0 > network.inode-lru-limit: 65536 > performance.cache-invalidation: on > features.cache-invalidation: on > performance.md-cache-timeout: 600 > features.cache-invalidation-timeout: 600 > performance.cache-samba-metadata: on > transport.address-family: inet > server.allow-insecure: on > performance.cache-size: 10GB > cluster.server-quorum-type: server > nfs.disable: on > performance.io-thread-count: 64 > performance.io-cache: on > cluster.lookup-optimize: on > cluster.readdir-optimize: on > server.event-threads: 5 > client.event-threads: 5 > performance.cache-max-file-size: 256MB > diagnostics.client-log-level: INFO > diagnostics.brick-log-level: INFO > cluster.server-quorum-ratio: 51% > > Now I did create a backup of /var/lib/glusterd so if you tell me how to > pull information from there to compare I can do it. > > I compared the file /var/lib/glusterd/vols/export/info and it is the same > in both, though entries are in different order. > > Diego > > > > > On Tue, Jan 15, 2019 at 5:03 AM Davide Obbi > wrote: > >> >> >> On Tue, Jan 15, 2019 at 2:18 AM Diego Remolina >> wrote: >> >>> Dear all, >>> >>> I was running gluster 3.10.12 on a pair of servers and recently upgraded >>> to 4.1.6. There is a cron job that runs nightly in one machine, which >>> rsyncs the data on the servers over to another machine for backup purposes. >>> The rsync operation runs on one of the gluster servers, which mounts the >>> gluster volume via fuse on /export. 
>>> >>> When using 3.10.12, this process would start at 8:00PM nightly, and >>> usually end up at around 4:30AM when the servers had been freshly rebooted. >>> From this point, things would start taking a bit longer and stabilize >>> ending at around 7-9AM depending on actual file changes and at some point >>> the servers would start eating up so much ram (up to 30GB) and I would have >>> to reboot them to bring things back to normal as the file system would >>> become extremely slow (perhaps the memory leak I have read was present on >>> 3.10.x). >>> >>> After upgrading to 4.1.6 over the weekend, I was shocked to see the >>> rsync process finish in about 1 hour and 26 minutes. This is compared to 8 >>> hours 30 mins with the older version. This is a nice speed up, however, I >>> can only ask myself what has changed so drastically that this process is >>> now so fast. Have there really been improvements in 4.1.6 that could speed >>> this up so dramatically? In both of my test cases, there would had not >>> really been a lot to copy via rsync given the fresh reboots are done on >>> Saturday after the sync has finished from the day before. >>> >>> In general, the servers (which are accessed via samba for windows >>> clients) are much faster and responsive since the update to 4.1.6. Tonight >>> I will have the first rsync run which will actually have to copy the day's >>> changes and will have another point of comparison. >>> >>> I am still using fuse mounts for samba, due to prior problems with vsf >>> =gluster, which are currently present in Samba 4.8.3-4, and already >>> documented in bugs, for which patches exist, but no official updated samba >>> packages have been released yet. Since I was going from 3.10.12 to 4.1.6 I >>> also did not want to change other things to make sure I could track any >>> issues just related to the change in gluster versions and eliminate other >>> complexity. 
>>> >>> The file system currently has about 16TB of data in >>> 5142816 files and 696544 directories >>> >>> I've just ran the following code to count files and dirs and it took >>> 67mins 38.957 secs to complete in this gluster volume: >>> https://github.com/ChristopherSchultz/fast-file-count >>> >>> # time ( /root/sbin/dircnt /export ) >>> /export contains 5142816 files and 696544 directories >>> >>> real 67m38.957s >>> user 0m6.225s >>> sys 0m48.939s >>> >>> The gluster options set on the volume are: >>> https://termbin.com/yxtd >>> >>> # gluster v status export >>> Status of volume: export >>> Gluster process TCP Port RDMA Port Online >>> Pid >>> >>> ------------------------------------------------------------------------------ >>> Brick 10.0.1.7:/bricks/hdds/brick 49157 0 Y >>> 13986 >>> Brick 10.0.1.6:/bricks/hdds/brick 49153 0 Y >>> 9953 >>> Self-heal Daemon on localhost N/A N/A Y >>> 21934 >>> Self-heal Daemon on 10.0.1.5 N/A N/A Y >>> 4598 >>> Self-heal Daemon on 10.0.1.6 N/A N/A Y >>> 14485 >>> >>> Task Status of Volume export >>> >>> ------------------------------------------------------------------------------ >>> There are no active volume tasks >>> >>> Truth, there is a 3rd server here, but no bricks on it. >>> >>> Thoughts? >>> >>> Diego >>> >>> >>> Virus-free. >>> www.avast.com >>> >>> <#m_-657708050556748564_m_-2130281720557425520_m_-4021393732076721680_m_8084651329793795211_m_7462352325940458688_m_-6479459361629161759_DAB4FAD8-2DD7-40BB-A1B8-4E2AA1F9FDF2> >>> _______________________________________________ >>> Gluster-users mailing list >>> Gluster-users at gluster.org >>> https://lists.gluster.org/mailman/listinfo/gluster-users >> >> >> Hi Diego, >> >> Besides the actual improvements made in the code i think new releases >> might implement volume options by default that before might have had >> different setting. I would have been interesting to diff "gluster volume >> get all" befor and after the upgrade. 
Just for curiosity and i am
>> trying to figure out volume options for rsync kind of workloads can you
>> share the command output anyway along with gluster volume info ?
>>
>> thanks
>>
>>
--
Davide Obbi
Senior System Administrator
Booking.com B.V.
Vijzelstraat 66-80 Amsterdam 1017HL Netherlands
Direct +31207031558
[image: Booking.com]
Empowering people to experience the world since 1996
43 languages, 214+ offices worldwide, 141,000+ global destinations, 29 million reported listings
Subsidiary of Booking Holdings Inc. (NASDAQ: BKNG)
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From dijuremo at gmail.com Tue Jan 15 19:41:55 2019
From: dijuremo at gmail.com (Diego Remolina)
Date: Tue, 15 Jan 2019 14:41:55 -0500
Subject: [Gluster-users] [External] To good to be truth speed improvements?
In-Reply-To:
References:
Message-ID:

This is what I came up with:

< Corresponds to currently running 4.1.6
> Corresponds to old 3.10.12

+ diff /var/lib/glusterd/vols/export/export.10.0.1.6.bricks-hdds-brick.vol /var/lib/glusterd-20190112/vols/export/export.10.0.1.6.bricks-hdds-brick.vol
3d2
< option shared-brick-count 0
45d43
< option bitrot disable
62d59
< option worm-files-deletable on
93,98d89
< volume export-selinux
< type features/selinux
< option selinux on
< subvolumes export-io-threads
< end-volume
<
108c99
< subvolumes export-selinux
---
> subvolumes export-io-threads
128a120
> option timeout 0
150d141
< option transport.listen-backlog 1024
158a150
> option ping-timeout 42

+ diff /var/lib/glusterd/vols/export/export.10.0.1.7.bricks-hdds-brick.vol /var/lib/glusterd-20190112/vols/export/export.10.0.1.7.bricks-hdds-brick.vol
3d2
< option shared-brick-count 1
45d43
< option bitrot disable
62d59
< option worm-files-deletable on
93,98d89
< volume export-selinux
< type features/selinux
< option selinux on
< subvolumes export-io-threads
< end-volume
<
108c99
< subvolumes export-selinux
---
> subvolumes export-io-threads
128a120
> option timeout 0
150d141
< option transport.listen-backlog 1024
158a150
> option ping-timeout 42

+ diff /var/lib/glusterd/vols/export/export.tcp-fuse.vol /var/lib/glusterd-20190112/vols/export/export.tcp-fuse.vol
40d39
< option force-migration off
75d73
< option cache-invalidation on

+ diff /var/lib/glusterd/vols/export/trusted-export.tcp-fuse.vol /var/lib/glusterd-20190112/vols/export/trusted-export.tcp-fuse.vol
44d43
< option force-migration off
79d77
< option cache-invalidation on

Any other volume files were the same.

HTH,

Diego

On Tue, Jan 15, 2019 at 2:04 PM Davide Obbi wrote:
> i think you can find the volume options doing a grep -R option
> /var/lib/glusterd/vols/ and the .vol files show the options
>
> On Tue, Jan 15, 2019 at 2:28 PM Diego Remolina wrote:
>
>> Hi Davide,
>>
>> The options information is already provided in prior e-mail, see the
>> termbin.con link for the options of the volume after the 4.1.6 upgrade.
>>
>> The gluster options set on the volume are:
>> https://termbin.com/yxtd
>>
>> This is the other piece:
>>
>> # gluster v info export
>>
>> Volume Name: export
>> Type: Replicate
>> Volume ID: b4353b3f-6ef6-4813-819a-8e85e5a95cff
>> Status: Started
>> Snapshot Count: 0
>> Number of Bricks: 1 x 2 = 2
>> Transport-type: tcp
>> Bricks:
>> Brick1: 10.0.1.7:/bricks/hdds/brick
>> Brick2: 10.0.1.6:/bricks/hdds/brick
>> Options Reconfigured:
>> performance.stat-prefetch: on
>> performance.cache-min-file-size: 0
>> network.inode-lru-limit: 65536
>> performance.cache-invalidation: on
>> features.cache-invalidation: on
>> performance.md-cache-timeout: 600
>> features.cache-invalidation-timeout: 600
>> performance.cache-samba-metadata: on
>> transport.address-family: inet
>> server.allow-insecure: on
>> performance.cache-size: 10GB
>> cluster.server-quorum-type: server
>> nfs.disable: on
>> performance.io-thread-count: 64
>> performance.io-cache: on
>> cluster.lookup-optimize: on
>> cluster.readdir-optimize: on
>> server.event-threads: 5
>> client.event-threads: 5
>>
performance.cache-max-file-size: 256MB >> diagnostics.client-log-level: INFO >> diagnostics.brick-log-level: INFO >> cluster.server-quorum-ratio: 51% >> >> Now I did create a backup of /var/lib/glusterd so if you tell me how to >> pull information from there to compare I can do it. >> >> I compared the file /var/lib/glusterd/vols/export/info and it is the same >> in both, though entries are in different order. >> >> Diego >> >> >> >> >> On Tue, Jan 15, 2019 at 5:03 AM Davide Obbi >> wrote: >> >>> >>> >>> On Tue, Jan 15, 2019 at 2:18 AM Diego Remolina >>> wrote: >>> >>>> Dear all, >>>> >>>> I was running gluster 3.10.12 on a pair of servers and recently >>>> upgraded to 4.1.6. There is a cron job that runs nightly in one machine, >>>> which rsyncs the data on the servers over to another machine for backup >>>> purposes. The rsync operation runs on one of the gluster servers, which >>>> mounts the gluster volume via fuse on /export. >>>> >>>> When using 3.10.12, this process would start at 8:00PM nightly, and >>>> usually end up at around 4:30AM when the servers had been freshly rebooted. >>>> From this point, things would start taking a bit longer and stabilize >>>> ending at around 7-9AM depending on actual file changes and at some point >>>> the servers would start eating up so much ram (up to 30GB) and I would have >>>> to reboot them to bring things back to normal as the file system would >>>> become extremely slow (perhaps the memory leak I have read was present on >>>> 3.10.x). >>>> >>>> After upgrading to 4.1.6 over the weekend, I was shocked to see the >>>> rsync process finish in about 1 hour and 26 minutes. This is compared to 8 >>>> hours 30 mins with the older version. This is a nice speed up, however, I >>>> can only ask myself what has changed so drastically that this process is >>>> now so fast. Have there really been improvements in 4.1.6 that could speed >>>> this up so dramatically? 
In both of my test cases, there would had not >>>> really been a lot to copy via rsync given the fresh reboots are done on >>>> Saturday after the sync has finished from the day before. >>>> >>>> In general, the servers (which are accessed via samba for windows >>>> clients) are much faster and responsive since the update to 4.1.6. Tonight >>>> I will have the first rsync run which will actually have to copy the day's >>>> changes and will have another point of comparison. >>>> >>>> I am still using fuse mounts for samba, due to prior problems with vsf >>>> =gluster, which are currently present in Samba 4.8.3-4, and already >>>> documented in bugs, for which patches exist, but no official updated samba >>>> packages have been released yet. Since I was going from 3.10.12 to 4.1.6 I >>>> also did not want to change other things to make sure I could track any >>>> issues just related to the change in gluster versions and eliminate other >>>> complexity. >>>> >>>> The file system currently has about 16TB of data in >>>> 5142816 files and 696544 directories >>>> >>>> I've just ran the following code to count files and dirs and it took >>>> 67mins 38.957 secs to complete in this gluster volume: >>>> https://github.com/ChristopherSchultz/fast-file-count >>>> >>>> # time ( /root/sbin/dircnt /export ) >>>> /export contains 5142816 files and 696544 directories >>>> >>>> real 67m38.957s >>>> user 0m6.225s >>>> sys 0m48.939s >>>> >>>> The gluster options set on the volume are: >>>> https://termbin.com/yxtd >>>> >>>> # gluster v status export >>>> Status of volume: export >>>> Gluster process TCP Port RDMA Port >>>> Online Pid >>>> >>>> ------------------------------------------------------------------------------ >>>> Brick 10.0.1.7:/bricks/hdds/brick 49157 0 Y >>>> 13986 >>>> Brick 10.0.1.6:/bricks/hdds/brick 49153 0 Y >>>> 9953 >>>> Self-heal Daemon on localhost N/A N/A Y >>>> 21934 >>>> Self-heal Daemon on 10.0.1.5 N/A N/A Y >>>> 4598 >>>> Self-heal Daemon on 10.0.1.6 N/A 
N/A Y >>>> 14485 >>>> >>>> Task Status of Volume export >>>> >>>> ------------------------------------------------------------------------------ >>>> There are no active volume tasks >>>> >>>> Truth, there is a 3rd server here, but no bricks on it. >>>> >>>> Thoughts? >>>> >>>> Diego >>>> >>>> >>>> Virus-free. >>>> www.avast.com >>>> >>>> <#m_1235711297687490073_m_-657708050556748564_m_-2130281720557425520_m_-4021393732076721680_m_8084651329793795211_m_7462352325940458688_m_-6479459361629161759_DAB4FAD8-2DD7-40BB-A1B8-4E2AA1F9FDF2> >>>> _______________________________________________ >>>> Gluster-users mailing list >>>> Gluster-users at gluster.org >>>> https://lists.gluster.org/mailman/listinfo/gluster-users >>> >>> >>> Hi Diego, >>> >>> Besides the actual improvements made in the code i think new releases >>> might implement volume options by default that before might have had >>> different setting. I would have been interesting to diff "gluster volume >>> get all" befor and after the upgrade. Just for curiosity and i am >>> trying to figure out volume options for rsync kind of workloads can you >>> share the command output anyway along with gluster volume info ? >>> >>> thanks >>> >>> > > -- > Davide Obbi > Senior System Administrator > > Booking.com B.V. > Vijzelstraat 66-80 Amsterdam 1017HL Netherlands > Direct +31207031558 > [image: Booking.com] > Empowering people to experience the world since 1996 > 43 languages, 214+ offices worldwide, 141,000+ global destinations, 29 > million reported listings > Subsidiary of Booking Holdings Inc. (NASDAQ: BKNG) > -------------- next part -------------- An HTML attachment was scrubbed... URL: From amye at redhat.com Tue Jan 15 21:09:18 2019 From: amye at redhat.com (Amye Scavarda) Date: Tue, 15 Jan 2019 11:09:18 -1000 Subject: [Gluster-users] Community Meeting Host Needed, 16 Jan at 15:00 UTC Message-ID: Your friendly neighborhood community meeting host has a conflict for tomorrow's meeting, anyone want to take it on? 
https://bit.ly/gluster-community-meetings has the agenda.

Time:
- 15:00 UTC to 15:30 UTC
- or in your local shell/terminal: `date -d "15:00 UTC"`

Thanks!
- amye

--
Amye Scavarda | amye at redhat.com | Gluster Community Lead
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From amudhan83 at gmail.com Wed Jan 16 10:19:01 2019
From: amudhan83 at gmail.com (Amudhan P)
Date: Wed, 16 Jan 2019 15:49:01 +0530
Subject: [Gluster-users] glusterfs 4.1.6 error in starting glusterd service
Message-ID:

Hi,

In short, when I started glusterd service I am getting following error msg in the glusterd.log file in one server.
what needs to be done?

error logged in glusterd.log

[2019-01-15 17:50:13.956053] I [MSGID: 100030] [glusterfsd.c:2741:main] 0-/usr/local/sbin/glusterd: Started running /usr/local/sbin/glusterd version 4.1.6 (args: /usr/local/sbin/glusterd -p /var/run/glusterd.pid)
[2019-01-15 17:50:13.960131] I [MSGID: 106478] [glusterd.c:1423:init] 0-management: Maximum allowed open file descriptors set to 65536
[2019-01-15 17:50:13.960193] I [MSGID: 106479] [glusterd.c:1481:init] 0-management: Using /var/lib/glusterd as working directory
[2019-01-15 17:50:13.960212] I [MSGID: 106479] [glusterd.c:1486:init] 0-management: Using /var/run/gluster as pid file working directory
[2019-01-15 17:50:13.964437] W [MSGID: 103071] [rdma.c:4629:__gf_rdma_ctx_create] 0-rpc-transport/rdma: rdma_cm event channel creation failed [No such device]
[2019-01-15 17:50:13.964474] W [MSGID: 103055] [rdma.c:4938:init] 0-rdma.management: Failed to initialize IB Device
[2019-01-15 17:50:13.964491] W [rpc-transport.c:351:rpc_transport_load] 0-rpc-transport: 'rdma' initialization failed
[2019-01-15 17:50:13.964560] W [rpcsvc.c:1781:rpcsvc_create_listener] 0-rpc-service: cannot create listener, initing the transport failed
[2019-01-15 17:50:13.964579] E [MSGID: 106244] [glusterd.c:1764:init] 0-management: creation of 1 listeners failed, continuing with succeeded transport
[2019-01-15 17:50:14.967681] I [MSGID: 106513] [glusterd-store.c:2240:glusterd_restore_op_version] 0-glusterd: retrieved op-version: 40100
[2019-01-15 17:50:14.973931] I [MSGID: 106544] [glusterd.c:158:glusterd_uuid_init] 0-management: retrieved UUID: d6bf51a7-c296-492f-8dac-e81efa9dd22d
[2019-01-15 17:50:15.046620] E [MSGID: 101032] [store.c:441:gf_store_handle_retrieve] 0-: Path corresponding to /var/lib/glusterd/vols/gfs-tst/bricks/IP.3:-media-disk3-brick3. [No such file or directory]
[2019-01-15 17:50:15.046685] E [MSGID: 106201] [glusterd-store.c:3384:glusterd_store_retrieve_volumes] 0-management: Unable to restore volume: gfs-tst
[2019-01-15 17:50:15.046718] E [MSGID: 101019] [xlator.c:720:xlator_init] 0-management: Initialization of volume 'management' failed, review your volfile again
[2019-01-15 17:50:15.046732] E [MSGID: 101066] [graph.c:367:glusterfs_graph_init] 0-management: initializing translator failed
[2019-01-15 17:50:15.046741] E [MSGID: 101176] [graph.c:738:glusterfs_graph_activate] 0-graph: init failed
[2019-01-15 17:50:15.047171] W [glusterfsd.c:1514:cleanup_and_exit] (-->/usr/local/sbin/glusterd(glusterfs_volumes

In long, I am trying to simulate a situation where the volume stopped abnormally and the entire cluster restarted with some missing disks.

My test cluster is set up with 3 nodes and each has four disks, I have setup a volume with disperse 4+2.
In Node-3 2 disks have failed, to replace I have shutdown all system

below are the steps done.

1. umount from client machine
2. shutdown all system by running `shutdown -h now` command ( without stopping volume and stop service)
3. replace faulty disk in Node-3
4. powered ON all system
5. format replaced drives, and mount all drives
6. start glusterd service in all node (success)
7. Now running `volume status` command from node-3
output : [2019-01-15 16:52:17.718422] : v status : FAILED : Staging failed on 0083ec0c-40bf-472a-a128-458924e56c96. Please check log file for details.
8. running `volume start gfs-tst` command from node-3
output : [2019-01-15 16:53:19.410252] : v start gfs-tst : FAILED : Volume gfs-tst already started

9. running `gluster v status` in other node. showing all brick available but 'self-heal daemon' not running
@gfstst-node2:~$ sudo gluster v status
Status of volume: gfs-tst
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick IP.2:/media/disk1/brick1              49152     0          Y       1517
Brick IP.4:/media/disk1/brick1              49152     0          Y       1668
Brick IP.2:/media/disk2/brick2              49153     0          Y       1522
Brick IP.4:/media/disk2/brick2              49153     0          Y       1678
Brick IP.2:/media/disk3/brick3              49154     0          Y       1527
Brick IP.4:/media/disk3/brick3              49154     0          Y       1677
Brick IP.2:/media/disk4/brick4              49155     0          Y       1541
Brick IP.4:/media/disk4/brick4              49155     0          Y       1683
Self-heal Daemon on localhost               N/A       N/A        Y       2662
Self-heal Daemon on IP.4                    N/A       N/A        Y       2786

10. in the above output 'volume already started'. so, running `reset-brick` command
v reset-brick gfs-tst IP.3:/media/disk3/brick3 IP.3:/media/disk3/brick3 commit force

output : [2019-01-15 16:57:37.916942] : v reset-brick gfs-tst IP.3:/media/disk3/brick3 IP.3:/media/disk3/brick3 commit force : FAILED : /media/disk3/brick3 is already part of a volume

11. reset-brick command was not working, so, tried stopping volume and start with force command
output : [2019-01-15 17:01:04.570794] : v start gfs-tst force : FAILED : Pre-validation failed on localhost. Please check log file for details

12. now stopped service in all node and tried starting again. except node-3 other nodes service started successfully without any issues.

in node-3 receiving following message.

sudo service glusterd start
 * Starting glusterd service glusterd                                    [fail]
/usr/local/sbin/glusterd: option requires an argument -- 'f'
Try `glusterd --help' or `glusterd --usage' for more information.

13. checking glusterd log file found that OS drive was running out of space
output : [2019-01-15 16:51:37.210792] W [MSGID: 101012] [store.c:372:gf_store_save_value] 0-management: fflush failed. [No space left on device]
[2019-01-15 16:51:37.210874] E [MSGID: 106190] [glusterd-store.c:1058:glusterd_volume_exclude_options_write] 0-management: Unable to write volume values for gfs-tst

14. cleared some space in OS drive but still, service is not running. below is the error logged in glusterd.log

[2019-01-15 17:50:13.956053] I [MSGID: 100030] [glusterfsd.c:2741:main] 0-/usr/local/sbin/glusterd: Started running /usr/local/sbin/glusterd version 4.1.6 (args: /usr/local/sbin/glusterd -p /var/run/glusterd.pid)
[2019-01-15 17:50:13.960131] I [MSGID: 106478] [glusterd.c:1423:init] 0-management: Maximum allowed open file descriptors set to 65536
[2019-01-15 17:50:13.960193] I [MSGID: 106479] [glusterd.c:1481:init] 0-management: Using /var/lib/glusterd as working directory
[2019-01-15 17:50:13.960212] I [MSGID: 106479] [glusterd.c:1486:init] 0-management: Using /var/run/gluster as pid file working directory
[2019-01-15 17:50:13.964437] W [MSGID: 103071] [rdma.c:4629:__gf_rdma_ctx_create] 0-rpc-transport/rdma: rdma_cm event channel creation failed [No such device]
[2019-01-15 17:50:13.964474] W [MSGID: 103055] [rdma.c:4938:init] 0-rdma.management: Failed to initialize IB Device
[2019-01-15 17:50:13.964491] W [rpc-transport.c:351:rpc_transport_load] 0-rpc-transport: 'rdma' initialization failed
[2019-01-15 17:50:13.964560] W [rpcsvc.c:1781:rpcsvc_create_listener] 0-rpc-service: cannot create listener, initing the transport failed
[2019-01-15 17:50:13.964579] E [MSGID: 106244] [glusterd.c:1764:init] 0-management: creation of 1 listeners failed, continuing with succeeded transport
[2019-01-15 17:50:14.967681] I [MSGID: 106513] [glusterd-store.c:2240:glusterd_restore_op_version] 0-glusterd: retrieved op-version: 40100
[2019-01-15 17:50:14.973931] I [MSGID: 106544] [glusterd.c:158:glusterd_uuid_init] 0-management: retrieved UUID: d6bf51a7-c296-492f-8dac-e81efa9dd22d
[2019-01-15 17:50:15.046620] E [MSGID: 101032] [store.c:441:gf_store_handle_retrieve] 0-: Path corresponding to /var/lib/glusterd/vols/gfs-tst/bricks/IP.3:-media-disk3-brick3. [No such file or directory]
[2019-01-15 17:50:15.046685] E [MSGID: 106201] [glusterd-store.c:3384:glusterd_store_retrieve_volumes] 0-management: Unable to restore volume: gfs-tst
[2019-01-15 17:50:15.046718] E [MSGID: 101019] [xlator.c:720:xlator_init] 0-management: Initialization of volume 'management' failed, review your volfile again
[2019-01-15 17:50:15.046732] E [MSGID: 101066] [graph.c:367:glusterfs_graph_init] 0-management: initializing translator failed
[2019-01-15 17:50:15.046741] E [MSGID: 101176] [graph.c:738:glusterfs_graph_activate] 0-graph: init failed
[2019-01-15 17:50:15.047171] W [glusterfsd.c:1514:cleanup_and_exit] (-->/usr/local/sbin/glusterd(glusterfs_volumes_init+0xc2) [0x409f52] -->/usr/local/sbin/glusterd(glusterfs_process_volfp+0x151) [0x409e41] -->/usr/local/sbin/glusterd(cleanup_and_exit+0x5f) [0x40942f] ) 0-: received signum (-1), shutting down

15. In other node running `volume status` still shows bricks node3 is live but `peer status` showing node-3 disconnected

@gfstst-node2:~$ sudo gluster v status
Status of volume: gfs-tst
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick IP.2:/media/disk1/brick1              49152     0          Y       1517
Brick IP.4:/media/disk1/brick1              49152     0          Y       1668
Brick IP.2:/media/disk2/brick2              49153     0          Y       1522
Brick IP.4:/media/disk2/brick2              49153     0          Y       1678
Brick IP.2:/media/disk3/brick3              49154     0          Y       1527
Brick IP.4:/media/disk3/brick3              49154     0          Y       1677
Brick IP.2:/media/disk4/brick4              49155     0          Y       1541
Brick IP.4:/media/disk4/brick4              49155     0          Y       1683
Self-heal Daemon on localhost               N/A       N/A        Y       2662
Self-heal Daemon on IP.4                    N/A       N/A        Y       2786

Task Status of Volume gfs-tst
------------------------------------------------------------------------------
There are no active volume tasks

root at gfstst-node2:~$ sudo gluster pool list
UUID                                    Hostname    State
d6bf51a7-c296-492f-8dac-e81efa9dd22d    IP.3        Disconnected
c1cbb58e-3ceb-4637-9ba3-3d28ef20b143    IP.4        Connected
0083ec0c-40bf-472a-a128-458924e56c96    localhost   Connected

root at gfstst-node2:~$ sudo gluster peer status
Number of Peers: 2

Hostname: IP.3
Uuid: d6bf51a7-c296-492f-8dac-e81efa9dd22d
State: Peer in Cluster (Disconnected)

Hostname: IP.4
Uuid: c1cbb58e-3ceb-4637-9ba3-3d28ef20b143
State: Peer in Cluster (Connected)

regards
Amudhan
-------------- next part --------------
An HTML attachment was scrubbed...
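
Step 13 in the message above traces the failed write to a full OS drive, which is where glusterd keeps its working directory (/var/lib/glusterd, per the log). A minimal sketch of a free-space check that could be run before starting glusterd follows. The function names and the 100 MiB threshold are illustrative assumptions, not anything glusterd itself provides or requires.

```shell
#!/bin/sh
# Sketch only: avail_kb, has_space and the 100 MiB threshold are
# hypothetical choices for illustration, not part of GlusterFS.

# Print the available space, in KiB, on the filesystem holding path $1.
avail_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Succeed if the filesystem holding $1 has at least $2 KiB available.
has_space() {
    [ "$(avail_kb "$1")" -ge "$2" ]
}

workdir=/var/lib/glusterd
[ -d "$workdir" ] || workdir=/   # fall back to / so the sketch runs on any host

if has_space "$workdir" 102400; then
    echo "OK: $(avail_kb "$workdir") KiB available under $workdir"
else
    echo "WARNING: less than 100 MiB available under $workdir" >&2
fi
```

A check like this only reports the condition. It does not repair configuration that was already written partially while the disk was full.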
URL:
From amukherj at redhat.com Wed Jan 16 11:04:18 2019
From: amukherj at redhat.com (Atin Mukherjee)
Date: Wed, 16 Jan 2019 16:34:18 +0530
Subject: [Gluster-users] glusterfs 4.1.6 error in starting glusterd service
In-Reply-To:
References:
Message-ID:

This is a case of a partially written transaction. The host ran out of space on the root partition, where all the glusterd related configuration is persisted, so the transaction couldn't be written and the new (replaced) brick's information wasn't persisted in the configuration.

The workaround is to copy the content of /var/lib/glusterd/vols/gfs-tst/ from one of the other nodes in the trusted storage pool to the node where the glusterd service fails to come up, and then restart the glusterd service on that node. After that, peer status should report all nodes as healthy and connected.

On Wed, Jan 16, 2019 at 3:49 PM Amudhan P wrote:

> Hi,
>
> In short, when I started glusterd service I am getting following error msg
> in the glusterd.log file in one server.
> what needs to be done?
-------------- next part --------------
An HTML attachment was scrubbed...
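
The workaround described in the reply above (copy /var/lib/glusterd/vols/gfs-tst/ from a good node, then restart glusterd) could be sketched roughly as below. This is an assumption-laden illustration: it presumes root ssh access and rsync between nodes, uses IP.2 as the source node only because that node shows as healthy in the thread, and adds a local backup copy that the reply does not mention. It prints the commands by default and executes them only when DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch only: restore_volume_config and run are hypothetical helpers.
# Assumes rsync over ssh as root; the backup step is an added precaution.

# Print the command when DRY_RUN is 1 (the default), otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# $1: a node whose glusterd is healthy, $2: the volume name.
restore_volume_config() {
    healthy_node=$1
    volume=$2
    voldir=/var/lib/glusterd/vols/$volume

    run cp -a "$voldir" "$voldir.bak"                      # keep the broken copy
    run rsync -a "root@$healthy_node:$voldir/" "$voldir/"  # pull the good config
    run service glusterd start
    run gluster peer status
}

# To be run on the node where glusterd fails to start.
restore_volume_config IP.2 gfs-tst
```

Reviewing the printed commands first, then rerunning with DRY_RUN=0, keeps the destructive part of the procedure explicit.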
URL:
From amudhan83 at gmail.com Wed Jan 16 11:32:54 2019
From: amudhan83 at gmail.com (Amudhan P)
Date: Wed, 16 Jan 2019 17:02:54 +0530
Subject: [Gluster-users] glusterfs 4.1.6 error in starting glusterd service
In-Reply-To:
References:
Message-ID:

Atin,
I have copied the content of 'gfs-tst' from the vol folder on another node. Starting the service still fails, with the following error messages in the glusterd.log file.

[2019-01-15 20:16:59.513023] I [MSGID: 100030] [glusterfsd.c:2741:main] 0-/usr/local/sbin/glusterd: Started running /usr/local/sbin/glusterd version 4.1.6 (args: /usr/local/sbin/glusterd -p /var/run/glusterd.pid)
[2019-01-15 20:16:59.517164] I [MSGID: 106478] [glusterd.c:1423:init] 0-management: Maximum allowed open file descriptors set to 65536
[2019-01-15 20:16:59.517264] I [MSGID: 106479] [glusterd.c:1481:init] 0-management: Using /var/lib/glusterd as working directory
[2019-01-15 20:16:59.517283] I [MSGID: 106479] [glusterd.c:1486:init] 0-management: Using /var/run/gluster as pid file working directory
[2019-01-15 20:16:59.521508] W [MSGID: 103071] [rdma.c:4629:__gf_rdma_ctx_create] 0-rpc-transport/rdma: rdma_cm event channel creation failed [No such device]
[2019-01-15 20:16:59.521544] W [MSGID: 103055] [rdma.c:4938:init] 0-rdma.management: Failed to initialize IB Device
[2019-01-15 20:16:59.521562] W [rpc-transport.c:351:rpc_transport_load] 0-rpc-transport: 'rdma' initialization failed
[2019-01-15 20:16:59.521629] W [rpcsvc.c:1781:rpcsvc_create_listener] 0-rpc-service: cannot create listener, initing the transport failed
[2019-01-15 20:16:59.521648] E [MSGID: 106244] [glusterd.c:1764:init] 0-management: creation of 1 listeners failed, continuing with succeeded transport
[2019-01-15 20:17:00.529390] I [MSGID: 106513] [glusterd-store.c:2240:glusterd_restore_op_version] 0-glusterd: retrieved op-version: 40100
[2019-01-15 20:17:00.608354] I [MSGID: 106544] [glusterd.c:158:glusterd_uuid_init] 0-management: retrieved UUID: d6bf51a7-c296-492f-8dac-e81efa9dd22d
[2019-01-15 20:17:00.650911] W [MSGID: 106425] [glusterd-store.c:2643:glusterd_store_retrieve_bricks] 0-management: failed to get statfs() call on brick /media/disk4/brick4 [No such file or directory]
[2019-01-15 20:17:00.691240] I [MSGID: 106498] [glusterd-handler.c:3614:glusterd_friend_add_from_peerinfo] 0-management: connect returned 0
[2019-01-15 20:17:00.691307] W [MSGID: 106061] [glusterd-handler.c:3408:glusterd_transport_inet_options_build] 0-glusterd: Failed to get tcp-user-timeout
[2019-01-15 20:17:00.691331] I [rpc-clnt.c:1059:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2019-01-15 20:17:00.692547] E [MSGID: 106187] [glusterd-store.c:4662:glusterd_resolve_all_bricks] 0-glusterd: resolve brick failed in restore
[2019-01-15 20:17:00.692582] E [MSGID: 101019] [xlator.c:720:xlator_init] 0-management: Initialization of volume 'management' failed, review your volfile again
[2019-01-15 20:17:00.692597] E [MSGID: 101066] [graph.c:367:glusterfs_graph_init] 0-management: initializing translator failed
[2019-01-15 20:17:00.692607] E [MSGID: 101176] [graph.c:738:glusterfs_graph_activate] 0-graph: init failed
[2019-01-15 20:17:00.693004] W [glusterfsd.c:1514:cleanup_and_exit] (-->/usr/local/sbin/glusterd(glusterfs_volumes_init+0xc2) [0x409f52] -->/usr/local/sbin/glusterd(glusterfs_process_volfp+0x151) [0x409e41] -->/usr/local/sbin/glusterd(cleanup_and_exit+0x5f) [0x40942f] ) 0-: received signum (-1), shutting down

On Wed, Jan 16, 2019 at 4:34 PM Atin Mukherjee wrote:

> This is a case of partial write of a transaction and as the host ran out
> of space for the root partition where all the glusterd related
> configurations are persisted, the transaction couldn't be written and hence
> the new (replaced) brick's information wasn't persisted in the
> configuration.
The workaround for this is to copy the content of > /var/lib/glusterd/vols/gfs-tst/ from one of the nodes in the trusted > storage pool to the node where glusterd service fails to come up and post > that restarting the glusterd service should be able to make peer status > reporting all nodes healthy and connected. > > On Wed, Jan 16, 2019 at 3:49 PM Amudhan P wrote: > >> Hi, >> >> In short, when I started glusterd service I am getting following error >> msg in the glusterd.log file in one server. >> what needs to be done? >> >> error logged in glusterd.log >> >> [2019-01-15 17:50:13.956053] I [MSGID: 100030] [glusterfsd.c:2741:main] >> 0-/usr/local/sbin/glusterd: Started running /usr/local/sbin/glusterd >> version 4.1.6 (args: /usr/local/sbin/glusterd -p /var/run/glusterd.pid) >> [2019-01-15 17:50:13.960131] I [MSGID: 106478] [glusterd.c:1423:init] >> 0-management: Maximum allowed open file descriptors set to 65536 >> [2019-01-15 17:50:13.960193] I [MSGID: 106479] [glusterd.c:1481:init] >> 0-management: Using /var/lib/glusterd as working directory >> [2019-01-15 17:50:13.960212] I [MSGID: 106479] [glusterd.c:1486:init] >> 0-management: Using /var/run/gluster as pid file working directory >> [2019-01-15 17:50:13.964437] W [MSGID: 103071] >> [rdma.c:4629:__gf_rdma_ctx_create] 0-rpc-transport/rdma: rdma_cm event >> channel creation failed [No such device] >> [2019-01-15 17:50:13.964474] W [MSGID: 103055] [rdma.c:4938:init] >> 0-rdma.management: Failed to initialize IB Device >> [2019-01-15 17:50:13.964491] W [rpc-transport.c:351:rpc_transport_load] >> 0-rpc-transport: 'rdma' initialization failed >> [2019-01-15 17:50:13.964560] W [rpcsvc.c:1781:rpcsvc_create_listener] >> 0-rpc-service: cannot create listener, initing the transport failed >> [2019-01-15 17:50:13.964579] E [MSGID: 106244] [glusterd.c:1764:init] >> 0-management: creation of 1 listeners failed, continuing with succeeded >> transport >> [2019-01-15 17:50:14.967681] I [MSGID: 106513] >> 
[glusterd-store.c:2240:glusterd_restore_op_version] 0-glusterd: retrieved >> op-version: 40100 >> [2019-01-15 17:50:14.973931] I [MSGID: 106544] >> [glusterd.c:158:glusterd_uuid_init] 0-management: retrieved UUID: >> d6bf51a7-c296-492f-8dac-e81efa9dd22d >> [2019-01-15 17:50:15.046620] E [MSGID: 101032] >> [store.c:441:gf_store_handle_retrieve] 0-: Path corresponding to >> /var/lib/glusterd/vols/gfs-tst/bricks/IP.3:-media-disk3-brick3. [No such >> file or directory] >> [2019-01-15 17:50:15.046685] E [MSGID: 106201] >> [glusterd-store.c:3384:glusterd_store_retrieve_volumes] 0-management: >> Unable to restore volume: gfs-tst >> [2019-01-15 17:50:15.046718] E [MSGID: 101019] [xlator.c:720:xlator_init] >> 0-management: Initialization of volume 'management' failed, review your >> volfile again >> [2019-01-15 17:50:15.046732] E [MSGID: 101066] >> [graph.c:367:glusterfs_graph_init] 0-management: initializing translator >> failed >> [2019-01-15 17:50:15.046741] E [MSGID: 101176] >> [graph.c:738:glusterfs_graph_activate] 0-graph: init failed >> [2019-01-15 17:50:15.047171] W [glusterfsd.c:1514:cleanup_and_exit] >> (-->/usr/local/sbin/glusterd(glusterfs_volumes >> >> >> >> In long, I am trying to simulate a situation. where volume stoped >> abnormally and >> entire cluster restarted with some missing disks. >> >> My test cluster is set up with 3 nodes and each has four disks, I have >> setup a volume with disperse 4+2. >> In Node-3 2 disks have failed, to replace I have shutdown all system >> >> below are the steps done. >> >> 1. umount from client machine >> 2. shutdown all system by running `shutdown -h now` command ( without >> stopping volume and stop service) >> 3. replace faulty disk in Node-3 >> 4. powered ON all system >> 5. format replaced drives, and mount all drives >> 6. start glusterd service in all node (success) >> 7. 
Now running the `volume status` command from node-3
>> output : [2019-01-15 16:52:17.718422] : v status : FAILED : Staging failed on 0083ec0c-40bf-472a-a128-458924e56c96. Please check log file for details.
>> 8. Ran the `volume start gfs-tst` command from node-3
>> output : [2019-01-15 16:53:19.410252] : v start gfs-tst : FAILED : Volume gfs-tst already started
>>
>> 9. Ran `gluster v status` on another node. It shows all bricks available but 'self-heal daemon' not running
>> @gfstst-node2:~$ sudo gluster v status
>> Status of volume: gfs-tst
>> Gluster process                             TCP Port  RDMA Port  Online  Pid
>> ------------------------------------------------------------------------------
>> Brick IP.2:/media/disk1/brick1              49152     0          Y       1517
>> Brick IP.4:/media/disk1/brick1              49152     0          Y       1668
>> Brick IP.2:/media/disk2/brick2              49153     0          Y       1522
>> Brick IP.4:/media/disk2/brick2              49153     0          Y       1678
>> Brick IP.2:/media/disk3/brick3              49154     0          Y       1527
>> Brick IP.4:/media/disk3/brick3              49154     0          Y       1677
>> Brick IP.2:/media/disk4/brick4              49155     0          Y       1541
>> Brick IP.4:/media/disk4/brick4              49155     0          Y       1683
>> Self-heal Daemon on localhost               N/A       N/A        Y       2662
>> Self-heal Daemon on IP.4                    N/A       N/A        Y       2786
>>
>> 10. The above output says 'volume already started', so I ran the `reset-brick` command
>> v reset-brick gfs-tst IP.3:/media/disk3/brick3 IP.3:/media/disk3/brick3 commit force
>>
>> output : [2019-01-15 16:57:37.916942] : v reset-brick gfs-tst IP.3:/media/disk3/brick3 IP.3:/media/disk3/brick3 commit force : FAILED : /media/disk3/brick3 is already part of a volume
>>
>> 11. The reset-brick command was not working, so I tried stopping the volume and starting it with the force option
>> output : [2019-01-15 17:01:04.570794] : v start gfs-tst force : FAILED : Pre-validation failed on localhost. Please check log file for details
>>
>> 12. Then I stopped the service on all nodes and tried starting it again. Except on
>> node-3, the service started successfully on all nodes without any issues.
>> >> in node-3 receiving following message. >> >> sudo service glusterd start >> * Starting glusterd service glusterd >> >> [fail] >> /usr/local/sbin/glusterd: option requires an argument -- 'f' >> Try `glusterd --help' or `glusterd --usage' for more information. >> >> 13. checking glusterd log file found that OS drive was running out of >> space >> output : [2019-01-15 16:51:37.210792] W [MSGID: 101012] >> [store.c:372:gf_store_save_value] 0-management: fflush failed. [No space >> left on device] >> [2019-01-15 16:51:37.210874] E [MSGID: 106190] >> [glusterd-store.c:1058:glusterd_volume_exclude_options_write] 0-management: >> Unable to write volume values for gfs-tst >> >> 14. cleared some space in OS drive but still, service is not running. >> below is the error logged in glusterd.log >> >> [2019-01-15 17:50:13.956053] I [MSGID: 100030] [glusterfsd.c:2741:main] >> 0-/usr/local/sbin/glusterd: Started running /usr/local/sbin/glusterd >> version 4.1.6 (args: /usr/local/sbin/glusterd -p /var/run/glusterd.pid) >> [2019-01-15 17:50:13.960131] I [MSGID: 106478] [glusterd.c:1423:init] >> 0-management: Maximum allowed open file descriptors set to 65536 >> [2019-01-15 17:50:13.960193] I [MSGID: 106479] [glusterd.c:1481:init] >> 0-management: Using /var/lib/glusterd as working directory >> [2019-01-15 17:50:13.960212] I [MSGID: 106479] [glusterd.c:1486:init] >> 0-management: Using /var/run/gluster as pid file working directory >> [2019-01-15 17:50:13.964437] W [MSGID: 103071] >> [rdma.c:4629:__gf_rdma_ctx_create] 0-rpc-transport/rdma: rdma_cm event >> channel creation failed [No such device] >> [2019-01-15 17:50:13.964474] W [MSGID: 103055] [rdma.c:4938:init] >> 0-rdma.management: Failed to initialize IB Device >> [2019-01-15 17:50:13.964491] W [rpc-transport.c:351:rpc_transport_load] >> 0-rpc-transport: 'rdma' initialization failed >> [2019-01-15 17:50:13.964560] W [rpcsvc.c:1781:rpcsvc_create_listener] >> 0-rpc-service: cannot create listener, initing the transport 
failed >> [2019-01-15 17:50:13.964579] E [MSGID: 106244] [glusterd.c:1764:init] >> 0-management: creation of 1 listeners failed, continuing with succeeded >> transport >> [2019-01-15 17:50:14.967681] I [MSGID: 106513] >> [glusterd-store.c:2240:glusterd_restore_op_version] 0-glusterd: retrieved >> op-version: 40100 >> [2019-01-15 17:50:14.973931] I [MSGID: 106544] >> [glusterd.c:158:glusterd_uuid_init] 0-management: retrieved UUID: >> d6bf51a7-c296-492f-8dac-e81efa9dd22d >> [2019-01-15 17:50:15.046620] E [MSGID: 101032] >> [store.c:441:gf_store_handle_retrieve] 0-: Path corresponding to >> /var/lib/glusterd/vols/gfs-tst/bricks/IP.3:-media-disk3-brick3. [No such >> file or directory] >> [2019-01-15 17:50:15.046685] E [MSGID: 106201] >> [glusterd-store.c:3384:glusterd_store_retrieve_volumes] 0-management: >> Unable to restore volume: gfs-tst >> [2019-01-15 17:50:15.046718] E [MSGID: 101019] [xlator.c:720:xlator_init] >> 0-management: Initialization of volume 'management' failed, review your >> volfile again >> [2019-01-15 17:50:15.046732] E [MSGID: 101066] >> [graph.c:367:glusterfs_graph_init] 0-management: initializing translator >> failed >> [2019-01-15 17:50:15.046741] E [MSGID: 101176] >> [graph.c:738:glusterfs_graph_activate] 0-graph: init failed >> [2019-01-15 17:50:15.047171] W [glusterfsd.c:1514:cleanup_and_exit] >> (-->/usr/local/sbin/glusterd(glusterfs_volumes_init+0xc2) [0x409f52] >> -->/usr/local/sbin/glusterd(glusterfs_process_volfp+0x151) [0x409e41] >> -->/usr/local/sbin/glusterd(cleanup_and_exit+0x5f) [0x40942f] ) 0-: >> received signum (-1), shutting down >> >> >> 15. 
On another node, running `volume status` still shows the bricks of node3 as live,
>> but `peer status` shows node-3 as disconnected
>>
>> @gfstst-node2:~$ sudo gluster v status
>> Status of volume: gfs-tst
>> Gluster process                             TCP Port  RDMA Port  Online  Pid
>> ------------------------------------------------------------------------------
>> Brick IP.2:/media/disk1/brick1              49152     0          Y       1517
>> Brick IP.4:/media/disk1/brick1              49152     0          Y       1668
>> Brick IP.2:/media/disk2/brick2              49153     0          Y       1522
>> Brick IP.4:/media/disk2/brick2              49153     0          Y       1678
>> Brick IP.2:/media/disk3/brick3              49154     0          Y       1527
>> Brick IP.4:/media/disk3/brick3              49154     0          Y       1677
>> Brick IP.2:/media/disk4/brick4              49155     0          Y       1541
>> Brick IP.4:/media/disk4/brick4              49155     0          Y       1683
>> Self-heal Daemon on localhost               N/A       N/A        Y       2662
>> Self-heal Daemon on IP.4                    N/A       N/A        Y       2786
>>
>> Task Status of Volume gfs-tst
>> ------------------------------------------------------------------------------
>> There are no active volume tasks
>>
>>
>> root@gfstst-node2:~$ sudo gluster pool list
>> UUID                                    Hostname    State
>> d6bf51a7-c296-492f-8dac-e81efa9dd22d    IP.3        Disconnected
>> c1cbb58e-3ceb-4637-9ba3-3d28ef20b143    IP.4        Connected
>> 0083ec0c-40bf-472a-a128-458924e56c96    localhost   Connected
>>
>> root@gfstst-node2:~$ sudo gluster peer status
>> Number of Peers: 2
>>
>> Hostname: IP.3
>> Uuid: d6bf51a7-c296-492f-8dac-e81efa9dd22d
>> State: Peer in Cluster (Disconnected)
>>
>> Hostname: IP.4
>> Uuid: c1cbb58e-3ceb-4637-9ba3-3d28ef20b143
>> State: Peer in Cluster (Connected)
>>
>>
>> regards
>> Amudhan
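
For readers following the thread: the workaround quoted in this message (copy /var/lib/glusterd/vols/gfs-tst/ from a healthy peer to the node where glusterd fails, then restart glusterd) might be sketched roughly as below. This is only a dry-run outline that prints commands instead of running them. The peer address IP.2, the use of rsync over ssh, and the step that moves the incomplete copy aside as .bak are assumptions for illustration, not something stated in the thread.

```shell
#!/bin/sh
# Dry-run sketch: print (do not execute) the commands for restoring a
# volume's glusterd configuration from a healthy peer.
# Assumptions (not from the thread): rsync over ssh is available, and the
# incomplete local copy is moved aside as <dir>.bak before copying.
plan_vol_restore() {
    vol="$1"        # volume name, e.g. gfs-tst
    healthy="$2"    # hypothetical address of a peer whose glusterd starts cleanly
    dir="/var/lib/glusterd/vols/$vol"

    echo "df -h /var/lib/glusterd"          # confirm the root partition has free space again
    echo "mv $dir $dir.bak"                 # keep the partially written config for reference
    echo "rsync -a $healthy:$dir/ $dir/"    # copy the volume config from the healthy peer
    echo "service glusterd start"           # start glusterd on the failing node
    echo "gluster peer status"              # check that all peers report Connected
}

plan_vol_restore gfs-tst IP.2
```

Review the printed commands before running any of them by hand on the failing node; nothing here touches the bricks themselves.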
From amukherj at redhat.com Wed Jan 16 11:34:56 2019
From: amukherj at redhat.com (Atin Mukherjee)
Date: Wed, 16 Jan 2019 17:04:56 +0530
Subject: [Gluster-users] glusterfs 4.1.6 error in starting glusterd service
In-Reply-To:
References:
Message-ID:

On Wed, Jan 16, 2019 at 5:02 PM Amudhan P wrote:

> Atin,
> I have copied the content of 'gfs-tst' from vol folder in another node.
> when starting service again fails with error msg in glusterd.log file.
>
> [2019-01-15 20:16:59.513023] I [MSGID: 100030] [glusterfsd.c:2741:main] 0-/usr/local/sbin/glusterd: Started running /usr/local/sbin/glusterd version 4.1.6 (args: /usr/local/sbin/glusterd -p /var/run/glusterd.pid)
> [2019-01-15 20:16:59.517164] I [MSGID: 106478] [glusterd.c:1423:init] 0-management: Maximum allowed open file descriptors set to 65536
> [2019-01-15 20:16:59.517264] I [MSGID: 106479] [glusterd.c:1481:init] 0-management: Using /var/lib/glusterd as working directory
> [2019-01-15 20:16:59.517283] I [MSGID: 106479] [glusterd.c:1486:init] 0-management: Using /var/run/gluster as pid file working directory
> [2019-01-15 20:16:59.521508] W [MSGID: 103071] [rdma.c:4629:__gf_rdma_ctx_create] 0-rpc-transport/rdma: rdma_cm event channel creation failed [No such device]
> [2019-01-15 20:16:59.521544] W [MSGID: 103055] [rdma.c:4938:init] 0-rdma.management: Failed to initialize IB Device
> [2019-01-15 20:16:59.521562] W [rpc-transport.c:351:rpc_transport_load] 0-rpc-transport: 'rdma' initialization failed
> [2019-01-15 20:16:59.521629] W [rpcsvc.c:1781:rpcsvc_create_listener] 0-rpc-service: cannot create listener, initing the transport failed
> [2019-01-15 20:16:59.521648] E [MSGID: 106244] [glusterd.c:1764:init] 0-management: creation of 1 listeners failed, continuing with succeeded transport
> [2019-01-15 20:17:00.529390] I [MSGID: 106513] [glusterd-store.c:2240:glusterd_restore_op_version] 0-glusterd: retrieved op-version: 40100
> [2019-01-15 20:17:00.608354] I [MSGID: 106544] [glusterd.c:158:glusterd_uuid_init] 0-management: retrieved UUID: d6bf51a7-c296-492f-8dac-e81efa9dd22d
> [2019-01-15 20:17:00.650911] W [MSGID: 106425] [glusterd-store.c:2643:glusterd_store_retrieve_bricks] 0-management: failed to get statfs() call on brick /media/disk4/brick4 [No such file or directory]
>

This means that the underlying brick /media/disk4/brick4 doesn't exist. You already mentioned that you had replaced the faulty disk, but have you not mounted it yet?

> [2019-01-15 20:17:00.691240] I [MSGID: 106498] [glusterd-handler.c:3614:glusterd_friend_add_from_peerinfo] 0-management: connect returned 0
> [2019-01-15 20:17:00.691307] W [MSGID: 106061] [glusterd-handler.c:3408:glusterd_transport_inet_options_build] 0-glusterd: Failed to get tcp-user-timeout
> [2019-01-15 20:17:00.691331] I [rpc-clnt.c:1059:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
> [2019-01-15 20:17:00.692547] E [MSGID: 106187] [glusterd-store.c:4662:glusterd_resolve_all_bricks] 0-glusterd: resolve brick failed in restore
> [2019-01-15 20:17:00.692582] E [MSGID: 101019] [xlator.c:720:xlator_init] 0-management: Initialization of volume 'management' failed, review your volfile again
> [2019-01-15 20:17:00.692597] E [MSGID: 101066] [graph.c:367:glusterfs_graph_init] 0-management: initializing translator failed
> [2019-01-15 20:17:00.692607] E [MSGID: 101176] [graph.c:738:glusterfs_graph_activate] 0-graph: init failed
> [2019-01-15 20:17:00.693004] W [glusterfsd.c:1514:cleanup_and_exit] (-->/usr/local/sbin/glusterd(glusterfs_volumes_init+0xc2) [0x409f52] -->/usr/local/sbin/glusterd(glusterfs_process_volfp+0x151) [0x409e41] -->/usr/local/sbin/glusterd(cleanup_and_exit+0x5f) [0x40942f] ) 0-: received signum (-1), shutting down
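
The statfs() warning discussed in this message can be checked by hand before starting glusterd. A minimal sketch follows, assuming the bricks use the /media/diskN/brickN layout seen in this thread. The helper names are hypothetical, and the store-file naming (slashes replaced by dashes) is only inferred from the log line ending in "bricks/IP.3:-media-disk3-brick3", so treat it as an assumption. The script only reports state and creates nothing.

```shell
#!/bin/sh
# Report whether a brick directory exists on this node.
brick_state() {
    if [ -d "$1" ]; then echo present; else echo missing; fi
}

# Map host:/path to the file name glusterd appears to look for under
# /var/lib/glusterd/vols/<vol>/bricks/ (inferred from the log, an assumption).
brick_store_name() {
    printf '%s\n' "$1" | tr '/' '-'
}

# Check the four brick directories used in the thread.
for n in 1 2 3 4; do
    brick="/media/disk$n/brick$n"
    echo "$brick: $(brick_state "$brick")"
done

# Show the store file name expected for one brick of node-3.
brick_store_name "IP.3:/media/disk3/brick3"
```

A brick reported as missing on a freshly formatted disk matches the "No such file or directory" in the log; the sketch does not settle whether the directory should be created by hand.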
From amudhan83 at gmail.com Wed Jan 16 11:54:52 2019
From: amudhan83 at gmail.com (Amudhan P)
Date: Wed, 16 Jan 2019 17:24:52 +0530
Subject: [Gluster-users] glusterfs 4.1.6 error in starting glusterd service
In-Reply-To:
References:
Message-ID:

Yes, I did mount the bricks, but the folder 'brick4' had still not been created inside the mounted brick. Do I need to create this folder myself? When I run replace-brick, it creates the folder inside the brick. I have seen this behavior before, when running replace-brick or when a heal begins.

On Wed, Jan 16, 2019 at 5:05 PM Atin Mukherjee wrote:

>
>
> On Wed, Jan 16, 2019 at 5:02 PM Amudhan P wrote:
>
>> Atin,
>> I have copied the content of 'gfs-tst' from vol folder in another node.
>> when starting service again fails with error msg in glusterd.log file.
>>
>> [2019-01-15 20:16:59.513023] I [MSGID: 100030] [glusterfsd.c:2741:main] 0-/usr/local/sbin/glusterd: Started running /usr/local/sbin/glusterd version 4.1.6 (args: /usr/local/sbin/glusterd -p /var/run/glusterd.pid)
>> [2019-01-15 20:16:59.517164] I [MSGID: 106478] [glusterd.c:1423:init] 0-management: Maximum allowed open file descriptors set to 65536
>> [2019-01-15 20:16:59.517264] I [MSGID: 106479] [glusterd.c:1481:init] 0-management: Using /var/lib/glusterd as working directory
>> [2019-01-15 20:16:59.517283] I [MSGID: 106479] [glusterd.c:1486:init] 0-management: Using /var/run/gluster as pid file working directory
>> [2019-01-15 20:16:59.521508] W [MSGID: 103071] [rdma.c:4629:__gf_rdma_ctx_create] 0-rpc-transport/rdma: rdma_cm event channel creation failed [No such device]
>> [2019-01-15 20:16:59.521544] W [MSGID: 103055] [rdma.c:4938:init] 0-rdma.management: Failed to initialize IB Device
>> [2019-01-15 20:16:59.521562] W [rpc-transport.c:351:rpc_transport_load] 0-rpc-transport: 'rdma' initialization failed
>> [2019-01-15 20:16:59.521629] W [rpcsvc.c:1781:rpcsvc_create_listener] 0-rpc-service: cannot create listener, initing the transport failed >>
[2019-01-15 20:16:59.521648] E [MSGID: 106244] [glusterd.c:1764:init] >> 0-management: creation of 1 listeners failed, continuing with succeeded >> transport >> [2019-01-15 20:17:00.529390] I [MSGID: 106513] >> [glusterd-store.c:2240:glusterd_restore_op_version] 0-glusterd: retrieved >> op-version: 40100 >> [2019-01-15 20:17:00.608354] I [MSGID: 106544] >> [glusterd.c:158:glusterd_uuid_init] 0-management: retrieved UUID: >> d6bf51a7-c296-492f-8dac-e81efa9dd22d >> [2019-01-15 20:17:00.650911] W [MSGID: 106425] >> [glusterd-store.c:2643:glusterd_store_retrieve_bricks] 0-management: failed >> to get statfs() call on brick /media/disk4/brick4 [No such file or >> directory] >> > > This means that underlying brick /media/disk4/brick4 doesn't exist. You > already mentioned that you had replaced the faulty disk, but have you not > mounted it yet? > > >> [2019-01-15 20:17:00.691240] I [MSGID: 106498] >> [glusterd-handler.c:3614:glusterd_friend_add_from_peerinfo] 0-management: >> connect returned 0 >> [2019-01-15 20:17:00.691307] W [MSGID: 106061] >> [glusterd-handler.c:3408:glusterd_transport_inet_options_build] 0-glusterd: >> Failed to get tcp-user-timeout >> [2019-01-15 20:17:00.691331] I [rpc-clnt.c:1059:rpc_clnt_connection_init] >> 0-management: setting frame-timeout to 600 >> [2019-01-15 20:17:00.692547] E [MSGID: 106187] >> [glusterd-store.c:4662:glusterd_resolve_all_bricks] 0-glusterd: resolve >> brick failed in restore >> [2019-01-15 20:17:00.692582] E [MSGID: 101019] [xlator.c:720:xlator_init] >> 0-management: Initialization of volume 'management' failed, review your >> volfile again >> [2019-01-15 20:17:00.692597] E [MSGID: 101066] >> [graph.c:367:glusterfs_graph_init] 0-management: initializing translator >> failed >> [2019-01-15 20:17:00.692607] E [MSGID: 101176] >> [graph.c:738:glusterfs_graph_activate] 0-graph: init failed >> [2019-01-15 20:17:00.693004] W [glusterfsd.c:1514:cleanup_and_exit] >> (-->/usr/local/sbin/glusterd(glusterfs_volumes_init+0xc2) 
[0x409f52] >> -->/usr/local/sbin/glusterd(glusterfs_process_volfp+0x151) [0x409e41] >> -->/usr/local/sbin/glusterd(cleanup_and_exit+0x5f) [0x40942f] ) 0-: >> received signum (-1), shutting down >> >> >> On Wed, Jan 16, 2019 at 4:34 PM Atin Mukherjee >> wrote: >> >>> This is a case of partial write of a transaction and as the host ran out >>> of space for the root partition where all the glusterd related >>> configurations are persisted, the transaction couldn't be written and hence >>> the new (replaced) brick's information wasn't persisted in the >>> configuration. The workaround for this is to copy the content of >>> /var/lib/glusterd/vols/gfs-tst/ from one of the nodes in the trusted >>> storage pool to the node where glusterd service fails to come up and post >>> that restarting the glusterd service should be able to make peer status >>> reporting all nodes healthy and connected. >>> >>> On Wed, Jan 16, 2019 at 3:49 PM Amudhan P wrote: >>> >>>> Hi, >>>> >>>> In short, when I started glusterd service I am getting following error >>>> msg in the glusterd.log file in one server. >>>> what needs to be done? 
>>>> >>>> error logged in glusterd.log >>>> >>>> [2019-01-15 17:50:13.956053] I [MSGID: 100030] [glusterfsd.c:2741:main] >>>> 0-/usr/local/sbin/glusterd: Started running /usr/local/sbin/glusterd >>>> version 4.1.6 (args: /usr/local/sbin/glusterd -p /var/run/glusterd.pid) >>>> [2019-01-15 17:50:13.960131] I [MSGID: 106478] [glusterd.c:1423:init] >>>> 0-management: Maximum allowed open file descriptors set to 65536 >>>> [2019-01-15 17:50:13.960193] I [MSGID: 106479] [glusterd.c:1481:init] >>>> 0-management: Using /var/lib/glusterd as working directory >>>> [2019-01-15 17:50:13.960212] I [MSGID: 106479] [glusterd.c:1486:init] >>>> 0-management: Using /var/run/gluster as pid file working directory >>>> [2019-01-15 17:50:13.964437] W [MSGID: 103071] >>>> [rdma.c:4629:__gf_rdma_ctx_create] 0-rpc-transport/rdma: rdma_cm event >>>> channel creation failed [No such device] >>>> [2019-01-15 17:50:13.964474] W [MSGID: 103055] [rdma.c:4938:init] >>>> 0-rdma.management: Failed to initialize IB Device >>>> [2019-01-15 17:50:13.964491] W [rpc-transport.c:351:rpc_transport_load] >>>> 0-rpc-transport: 'rdma' initialization failed >>>> [2019-01-15 17:50:13.964560] W [rpcsvc.c:1781:rpcsvc_create_listener] >>>> 0-rpc-service: cannot create listener, initing the transport failed >>>> [2019-01-15 17:50:13.964579] E [MSGID: 106244] [glusterd.c:1764:init] >>>> 0-management: creation of 1 listeners failed, continuing with succeeded >>>> transport >>>> [2019-01-15 17:50:14.967681] I [MSGID: 106513] >>>> [glusterd-store.c:2240:glusterd_restore_op_version] 0-glusterd: retrieved >>>> op-version: 40100 >>>> [2019-01-15 17:50:14.973931] I [MSGID: 106544] >>>> [glusterd.c:158:glusterd_uuid_init] 0-management: retrieved UUID: >>>> d6bf51a7-c296-492f-8dac-e81efa9dd22d >>>> [2019-01-15 17:50:15.046620] E [MSGID: 101032] >>>> [store.c:441:gf_store_handle_retrieve] 0-: Path corresponding to >>>> /var/lib/glusterd/vols/gfs-tst/bricks/IP.3:-media-disk3-brick3. 
[No such >>>> file or directory] >>>> [2019-01-15 17:50:15.046685] E [MSGID: 106201] >>>> [glusterd-store.c:3384:glusterd_store_retrieve_volumes] 0-management: >>>> Unable to restore volume: gfs-tst >>>> [2019-01-15 17:50:15.046718] E [MSGID: 101019] >>>> [xlator.c:720:xlator_init] 0-management: Initialization of volume >>>> 'management' failed, review your volfile again >>>> [2019-01-15 17:50:15.046732] E [MSGID: 101066] >>>> [graph.c:367:glusterfs_graph_init] 0-management: initializing translator >>>> failed >>>> [2019-01-15 17:50:15.046741] E [MSGID: 101176] >>>> [graph.c:738:glusterfs_graph_activate] 0-graph: init failed >>>> [2019-01-15 17:50:15.047171] W [glusterfsd.c:1514:cleanup_and_exit] >>>> (-->/usr/local/sbin/glusterd(glusterfs_volumes >>>> >>>> >>>> >>>> In long, I am trying to simulate a situation. where volume stoped >>>> abnormally and >>>> entire cluster restarted with some missing disks. >>>> >>>> My test cluster is set up with 3 nodes and each has four disks, I have >>>> setup a volume with disperse 4+2. >>>> In Node-3 2 disks have failed, to replace I have shutdown all system >>>> >>>> below are the steps done. >>>> >>>> 1. umount from client machine >>>> 2. shutdown all system by running `shutdown -h now` command ( without >>>> stopping volume and stop service) >>>> 3. replace faulty disk in Node-3 >>>> 4. powered ON all system >>>> 5. format replaced drives, and mount all drives >>>> 6. start glusterd service in all node (success) >>>> 7. Now running `voulume status` command from node-3 >>>> output : [2019-01-15 16:52:17.718422] : v status : FAILED : Staging >>>> failed on 0083ec0c-40bf-472a-a128-458924e56c96. Please check log file for >>>> details. >>>> 8. running `voulume start gfs-tst` command from node-3 >>>> output : [2019-01-15 16:53:19.410252] : v start gfs-tst : FAILED : >>>> Volume gfs-tst already started >>>> >>>> 9. running `gluster v status` in other node. 
showing all brick >>>> available but 'self-heal daemon' not running >>>> @gfstst-node2:~$ sudo gluster v status >>>> Status of volume: gfs-tst >>>> Gluster process TCP Port RDMA Port >>>> Online Pid >>>> >>>> ------------------------------------------------------------------------------ >>>> Brick IP.2:/media/disk1/brick1 49152 0 Y >>>> 1517 >>>> Brick IP.4:/media/disk1/brick1 49152 0 Y >>>> 1668 >>>> Brick IP.2:/media/disk2/brick2 49153 0 Y >>>> 1522 >>>> Brick IP.4:/media/disk2/brick2 49153 0 Y >>>> 1678 >>>> Brick IP.2:/media/disk3/brick3 49154 0 Y >>>> 1527 >>>> Brick IP.4:/media/disk3/brick3 49154 0 Y >>>> 1677 >>>> Brick IP.2:/media/disk4/brick4 49155 0 Y >>>> 1541 >>>> Brick IP.4:/media/disk4/brick4 49155 0 Y >>>> 1683 >>>> Self-heal Daemon on localhost N/A N/A Y >>>> 2662 >>>> Self-heal Daemon on IP.4 N/A N/A Y >>>> 2786 >>>> >>>> 10. in the above output 'volume already started'. so, running >>>> `reset-brick` command >>>> v reset-brick gfs-tst IP.3:/media/disk3/brick3 >>>> IP.3:/media/disk3/brick3 commit force >>>> >>>> output : [2019-01-15 16:57:37.916942] : v reset-brick gfs-tst >>>> IP.3:/media/disk3/brick3 IP.3:/media/disk3/brick3 commit force : FAILED : >>>> /media/disk3/brick3 is already part of a volume >>>> >>>> 11. reset-brick command was not working, so, tried stopping volume and >>>> start with force command >>>> output : [2019-01-15 17:01:04.570794] : v start gfs-tst force : FAILED >>>> : Pre-validation failed on localhost. Please check log file for details >>>> >>>> 12. now stopped service in all node and tried starting again. except >>>> node-3 other nodes service started successfully without any issues. >>>> >>>> in node-3 receiving following message. >>>> >>>> sudo service glusterd start >>>> * Starting glusterd service glusterd >>>> >>>> [fail] >>>> /usr/local/sbin/glusterd: option requires an argument -- 'f' >>>> Try `glusterd --help' or `glusterd --usage' for more information. >>>> >>>> 13. 
checking glusterd log file found that OS drive was running out of >>>> space >>>> output : [2019-01-15 16:51:37.210792] W [MSGID: 101012] >>>> [store.c:372:gf_store_save_value] 0-management: fflush failed. [No space >>>> left on device] >>>> [2019-01-15 16:51:37.210874] E [MSGID: 106190] >>>> [glusterd-store.c:1058:glusterd_volume_exclude_options_write] 0-management: >>>> Unable to write volume values for gfs-tst >>>> >>>> 14. cleared some space in OS drive but still, service is not running. >>>> below is the error logged in glusterd.log >>>> >>>> [2019-01-15 17:50:13.956053] I [MSGID: 100030] [glusterfsd.c:2741:main] >>>> 0-/usr/local/sbin/glusterd: Started running /usr/local/sbin/glusterd >>>> version 4.1.6 (args: /usr/local/sbin/glusterd -p /var/run/glusterd.pid) >>>> [2019-01-15 17:50:13.960131] I [MSGID: 106478] [glusterd.c:1423:init] >>>> 0-management: Maximum allowed open file descriptors set to 65536 >>>> [2019-01-15 17:50:13.960193] I [MSGID: 106479] [glusterd.c:1481:init] >>>> 0-management: Using /var/lib/glusterd as working directory >>>> [2019-01-15 17:50:13.960212] I [MSGID: 106479] [glusterd.c:1486:init] >>>> 0-management: Using /var/run/gluster as pid file working directory >>>> [2019-01-15 17:50:13.964437] W [MSGID: 103071] >>>> [rdma.c:4629:__gf_rdma_ctx_create] 0-rpc-transport/rdma: rdma_cm event >>>> channel creation failed [No such device] >>>> [2019-01-15 17:50:13.964474] W [MSGID: 103055] [rdma.c:4938:init] >>>> 0-rdma.management: Failed to initialize IB Device >>>> [2019-01-15 17:50:13.964491] W [rpc-transport.c:351:rpc_transport_load] >>>> 0-rpc-transport: 'rdma' initialization failed >>>> [2019-01-15 17:50:13.964560] W [rpcsvc.c:1781:rpcsvc_create_listener] >>>> 0-rpc-service: cannot create listener, initing the transport failed >>>> [2019-01-15 17:50:13.964579] E [MSGID: 106244] [glusterd.c:1764:init] >>>> 0-management: creation of 1 listeners failed, continuing with succeeded >>>> transport >>>> [2019-01-15 17:50:14.967681] I [MSGID: 
106513] >>>> [glusterd-store.c:2240:glusterd_restore_op_version] 0-glusterd: retrieved >>>> op-version: 40100 >>>> [2019-01-15 17:50:14.973931] I [MSGID: 106544] >>>> [glusterd.c:158:glusterd_uuid_init] 0-management: retrieved UUID: >>>> d6bf51a7-c296-492f-8dac-e81efa9dd22d >>>> [2019-01-15 17:50:15.046620] E [MSGID: 101032] >>>> [store.c:441:gf_store_handle_retrieve] 0-: Path corresponding to >>>> /var/lib/glusterd/vols/gfs-tst/bricks/IP.3:-media-disk3-brick3. [No such >>>> file or directory] >>>> [2019-01-15 17:50:15.046685] E [MSGID: 106201] >>>> [glusterd-store.c:3384:glusterd_store_retrieve_volumes] 0-management: >>>> Unable to restore volume: gfs-tst >>>> [2019-01-15 17:50:15.046718] E [MSGID: 101019] >>>> [xlator.c:720:xlator_init] 0-management: Initialization of volume >>>> 'management' failed, review your volfile again >>>> [2019-01-15 17:50:15.046732] E [MSGID: 101066] >>>> [graph.c:367:glusterfs_graph_init] 0-management: initializing translator >>>> failed >>>> [2019-01-15 17:50:15.046741] E [MSGID: 101176] >>>> [graph.c:738:glusterfs_graph_activate] 0-graph: init failed >>>> [2019-01-15 17:50:15.047171] W [glusterfsd.c:1514:cleanup_and_exit] >>>> (-->/usr/local/sbin/glusterd(glusterfs_volumes_init+0xc2) [0x409f52] >>>> -->/usr/local/sbin/glusterd(glusterfs_process_volfp+0x151) [0x409e41] >>>> -->/usr/local/sbin/glusterd(cleanup_and_exit+0x5f) [0x40942f] ) 0-: >>>> received signum (-1), shutting down >>>> >>>> >>>> 15. 
In other node running `volume status' still shows bricks node3 is >>>> live >>>> but 'peer status' showing node-3 disconnected >>>> >>>> @gfstst-node2:~$ sudo gluster v status >>>> Status of volume: gfs-tst >>>> Gluster process TCP Port RDMA Port >>>> Online Pid >>>> >>>> ------------------------------------------------------------------------------ >>>> Brick IP.2:/media/disk1/brick1 49152 0 Y >>>> 1517 >>>> Brick IP.4:/media/disk1/brick1 49152 0 Y >>>> 1668 >>>> Brick IP.2:/media/disk2/brick2 49153 0 Y >>>> 1522 >>>> Brick IP.4:/media/disk2/brick2 49153 0 Y >>>> 1678 >>>> Brick IP.2:/media/disk3/brick3 49154 0 Y >>>> 1527 >>>> Brick IP.4:/media/disk3/brick3 49154 0 Y >>>> 1677 >>>> Brick IP.2:/media/disk4/brick4 49155 0 Y >>>> 1541 >>>> Brick IP.4:/media/disk4/brick4 49155 0 Y >>>> 1683 >>>> Self-heal Daemon on localhost N/A N/A Y >>>> 2662 >>>> Self-heal Daemon on IP.4 N/A N/A Y >>>> 2786 >>>> >>>> Task Status of Volume gfs-tst >>>> >>>> ------------------------------------------------------------------------------ >>>> There are no active volume tasks >>>> >>>> >>>> root at gfstst-node2:~$ sudo gluster pool list >>>> UUID Hostname State >>>> d6bf51a7-c296-492f-8dac-e81efa9dd22d IP.3 Disconnected >>>> c1cbb58e-3ceb-4637-9ba3-3d28ef20b143 IP.4 Connected >>>> 0083ec0c-40bf-472a-a128-458924e56c96 localhost Connected >>>> >>>> root at gfstst-node2:~$ sudo gluster peer status >>>> Number of Peers: 2 >>>> >>>> Hostname: IP.3 >>>> Uuid: d6bf51a7-c296-492f-8dac-e81efa9dd22d >>>> State: Peer in Cluster (Disconnected) >>>> >>>> Hostname: IP.4 >>>> Uuid: c1cbb58e-3ceb-4637-9ba3-3d28ef20b143 >>>> State: Peer in Cluster (Connected) >>>> >>>> >>>> regards >>>> Amudhan >>>> _______________________________________________ >>>> Gluster-users mailing list >>>> Gluster-users at gluster.org >>>> https://lists.gluster.org/mailman/listinfo/gluster-users >>> >>> -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From spisla80 at gmail.com Wed Jan 16 16:17:20 2019 From: spisla80 at gmail.com (David Spisla) Date: Wed, 16 Jan 2019 17:17:20 +0100 Subject: [Gluster-users] VolumeOpt Set fails of a freshly created volume Message-ID: Dear Gluster Community, i created a replica 4 volume from gluster-node1 on a 4-Node Cluster with SSL/TLS network encryption . During setting the 'cluster.use-compound-fops' option, i got the error: $ volume set: failed: Commit failed on gluster-node2. Please check log file for details. Here is the glusterd.log from gluster-node1: *[2019-01-15 15:18:36.813034] I [run.c:242:runner_log] (-->/usr/lib64/glusterfs/5.2/xlator/mgmt/glusterd.so(+0xdad2a) [0x7fc24d91cd2a] -->/usr/lib64/glusterfs/5.2/xlator/mgmt/glusterd.so(+0xda81c) [0x7fc24d91c81c] -->/usr/lib64/libglusterfs.so.0(runner_log+0x105) [0x7fc253dce0b5] ) 0-management: Ran script: /var/lib/glusterd/hooks/1/set/post/S30samba-set.sh --volname=integration-archive1 -o cluster.use-compound-fops=on --gd-workdir=/var/lib/glusterd* [2019-01-15 15:18:36.821193] I [run.c:242:runner_log] (-->/usr/lib64/glusterfs/5.2/xlator/mgmt/glusterd.so(+0xdad2a) [0x7fc24d91cd2a] -->/usr/lib64/glusterfs/5.2/xlator/mgmt/glusterd.so(+0xda81c) [0x7fc24d91c81c] -->/usr/lib64/libglusterfs.so.0(runner_log+0x105) [0x7fc253dce0b5] ) 0-management: Ran script: /var/lib/glusterd/hooks/1/set/post/S32gluster_enable_shared_storage.sh --volname=integration-archive1 -o cluster.use-compound-fops=on --gd-workdir=/var/lib/glusterd [2019-01-15 15:18:36.842383] W [socket.c:719:__socket_rwv] 0-management: readv on 10.10.12.42:24007 failed (Input/output error) *[2019-01-15 15:18:36.842415] E [socket.c:246:ssl_dump_error_stack] 0-management: error:140943F2:SSL routines:ssl3_read_bytes:sslv3 alert unexpected message* The message "E [MSGID: 101191] [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch handler" repeated 81 times between [2019-01-15 15:18:30.735508] and [2019-01-15 15:18:36.808994] [2019-01-15 
15:18:36.842439] I [MSGID: 106004] [glusterd-handler.c:6430:__glusterd_peer_rpc_notify] 0-management: Peer < gluster-node2> (<02724bb6-cb34-4ec3-8306-c2950e0acf9b>), in state , has disconnected from glusterd. [2019-01-15 15:18:36.842638] W [glusterd-locks.c:795:glusterd_mgmt_v3_unlock] (-->/usr/lib64/glusterfs/5.2/xlator/mgmt/glusterd.so(+0x24349) [0x7fc24d866349] -->/usr/lib64/glusterfs/5.2/xlator/mgmt/glusterd.so(+0x2d950) [0x7fc24d86f950] -->/usr/lib64/glusterfs/5.2/xlator/mgmt/glusterd.so(+0xe0239) [0x7fc24d922239] ) 0-management: Lock for vol archive1 not held [2019-01-15 15:18:36.842656] W [MSGID: 106117] [glusterd-handler.c:6451:__glusterd_peer_rpc_notify] 0-management: Lock not released for archive1 [2019-01-15 15:18:36.842674] W [glusterd-locks.c:795:glusterd_mgmt_v3_unlock] (-->/usr/lib64/glusterfs/5.2/xlator/mgmt/glusterd.so(+0x24349) [0x7fc24d866349] -->/usr/lib64/glusterfs/5.2/xlator/mgmt/glusterd.so(+0x2d950) [0x7fc24d86f950] -->/usr/lib64/glusterfs/5.2/xlator/mgmt/glusterd.so(+0xe0239) [0x7fc24d922239] ) 0-management: Lock for vol archive2 not held [2019-01-15 15:18:36.842680] W [MSGID: 106117] [glusterd-handler.c:6451:__glusterd_peer_rpc_notify] 0-management: Lock not released for archive2 [2019-01-15 15:18:36.842694] W [glusterd-locks.c:795:glusterd_mgmt_v3_unlock] (-->/usr/lib64/glusterfs/5.2/xlator/mgmt/glusterd.so(+0x24349) [0x7fc24d866349] -->/usr/lib64/glusterfs/5.2/xlator/mgmt/glusterd.so(+0x2d950) [0x7fc24d86f950] -->/usr/lib64/glusterfs/5.2/xlator/mgmt/glusterd.so(+0xe0239) [0x7fc24d922239] ) 0-management: Lock for vol gluster_shared_storage not held [2019-01-15 15:18:36.842702] W [MSGID: 106117] [glusterd-handler.c:6451:__glusterd_peer_rpc_notify] 0-management: Lock not released for gluster_shared_storage [2019-01-15 15:18:36.842719] W [glusterd-locks.c:806:glusterd_mgmt_v3_unlock] (-->/usr/lib64/glusterfs/5.2/xlator/mgmt/glusterd.so(+0x24349) [0x7fc24d866349] -->/usr/lib64/glusterfs/5.2/xlator/mgmt/glusterd.so(+0x2d950) [0x7fc24d86f950] 
-->/usr/lib64/glusterfs/5.2/xlator/mgmt/glusterd.so(+0xe0074) [0x7fc24d922074] ) 0-management: Lock owner mismatch. Lock for vol integration-archive1 held by ffdaa400-82cc-4ada-8ea7-144bf3714269 [2019-01-15 15:18:36.842727] W [MSGID: 106117] [glusterd-handler.c:6451:__glusterd_peer_rpc_notify] 0-management: Lock not released for integration-archive1 [2019-01-15 15:18:36.842970] E [rpc-clnt.c:346:saved_frames_unwind] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x17d)[0x7fc253d7f18d] (--> /usr/lib64/libgfrpc.so.0(+0xca3d)[0x7fc253b46a3d] (--> /usr/lib64/libgfrpc.so.0(+0xcb5e)[0x7fc253b46b5e] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x8b)[0x7fc253b480bb] (--> /usr/lib64/libgfrpc.so.0(+0xec68)[0x7fc253b48c68] ))))) 0-management: forced unwinding frame type(glusterd mgmt) op(--(4)) called at 2019-01-15 15:18:36.802613 (xid=0x6da) [2019-01-15 15:18:36.842994] E [MSGID: 106152] [glusterd-syncop.c:104:gd_collate_errors] 0-glusterd: Commit failed on gluster-node2. Please check log file for details. 
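The node1 log above mixes informational (I), warning (W) and error (E) entries, and the two entries that matter for the failed commit are easy to miss. A minimal sketch for pulling out only the E-level entries, assuming the `[timestamp] LEVEL [MSGID: n] ...` layout seen in these excerpts (the MSGID part is optional). The names `LOG_LINE`, `Entry`, `parse_entries` and `errors_only` are hypothetical and not part of any Gluster tooling; the sample lines are copied from the log above.

```python
import re
from typing import List, NamedTuple, Optional

# Assumed layout, inferred from the excerpts in this thread:
#   [YYYY-MM-DD HH:MM:SS.ffffff] LEVEL [MSGID: n] rest-of-line
# The MSGID part is optional; a leading "*" (bold marker left over from
# the HTML mail) is tolerated.
LOG_LINE = re.compile(
    r"^\*?\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)\] "
    r"(?P<level>[A-Z]) "
    r"(?:\[MSGID: (?P<msgid>\d+)\] )?"
    r"(?P<rest>.*)$"
)


class Entry(NamedTuple):
    timestamp: str
    level: str
    msgid: Optional[str]
    rest: str


def parse_entries(text: str) -> List[Entry]:
    """Return one Entry per line that matches the assumed layout."""
    entries = []
    for line in text.splitlines():
        m = LOG_LINE.match(line.strip())
        if m:
            entries.append(Entry(m["ts"], m["level"], m["msgid"], m["rest"]))
    return entries


def errors_only(text: str) -> List[Entry]:
    """Keep only error-level (E) entries."""
    return [e for e in parse_entries(text) if e.level == "E"]


SAMPLE = """\
[2019-01-15 15:18:36.842383] W [socket.c:719:__socket_rwv] 0-management: readv on 10.10.12.42:24007 failed (Input/output error)
[2019-01-15 15:18:36.842415] E [socket.c:246:ssl_dump_error_stack] 0-management: error:140943F2:SSL routines:ssl3_read_bytes:sslv3 alert unexpected message
[2019-01-15 15:18:36.842994] E [MSGID: 106152] [glusterd-syncop.c:104:gd_collate_errors] 0-glusterd: Commit failed on gluster-node2. Please check log file for details.
"""

if __name__ == "__main__":
    for e in errors_only(SAMPLE):
        print(e.timestamp, e.msgid or "-", e.rest)
```

Lines that do not start with a timestamp, such as the `The message "..." repeated N times` summaries, are skipped by this sketch rather than parsed.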
And here glusterd.log from gluster-node2: *[2019-01-15 15:18:36.901788] I [run.c:242:runner_log] (-->/usr/lib64/glusterfs/5.2/xlator/mgmt/glusterd.so(+0xdad2a) [0x7f9fba02cd2a] -->/usr/lib64/glusterfs/5.2/xlator/mgmt/glusterd.so(+0xda81c) [0x7f9fba02c81c] -->/usr/lib64/libglusterfs.so.0(runner_log+0x105) [0x7f9fc04de0b5] ) 0-management: Ran script: /var/lib/glusterd/hooks/1/set/post/S32gluster_enable_shared_storage.sh --volname=integration-archive1 -o cluster.use-compound-fops=on --gd-workdir=/var/lib/glusterd* The message "E [MSGID: 101191] [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch handler" repeated 35 times between [2019-01-15 15:18:24.832023] and [2019-01-15 15:18:47.049407] [2019-01-15 15:18:47.049443] I [MSGID: 106163] [glusterd-handshake.c:1389:__glusterd_mgmt_hndsk_versions_ack] 0-management: using the op-version 50000 [2019-01-15 15:18:47.053439] E [MSGID: 101191] [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch handler [2019-01-15 15:18:47.053479] E [MSGID: 101191] [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch handler [2019-01-15 15:18:47.059899] I [MSGID: 106490] [glusterd-handler.c:2586:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: ffdaa400-82cc-4ada-8ea7-144bf3714269 [2019-01-15 15:18:47.063471] I [MSGID: 106493] [glusterd-handler.c:3843:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to fs-lrunning-c1-n1 (0), ret: 0, op_ret: 0 [2019-01-15 15:18:47.066148] I [MSGID: 106492] [glusterd-handler.c:2771:__glusterd_handle_friend_update] 0-glusterd: Received friend update from uuid: ffdaa400-82cc-4ada-8ea7-144bf3714269 [2019-01-15 15:18:47.067264] I [MSGID: 106502] [glusterd-handler.c:2812:__glusterd_handle_friend_update] 0-management: Received my uuid as Friend [2019-01-15 15:18:47.078696] I [MSGID: 106493] [glusterd-rpc-ops.c:681:__glusterd_friend_update_cbk] 0-management: Received ACC from uuid: ffdaa400-82cc-4ada-8ea7-144bf3714269 
[2019-01-15 15:19:05.377216] E [MSGID: 101191] [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch handler
The message "E [MSGID: 101191] [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch handler" repeated 3 times between [2019-01-15 15:19:05.377216] and [2019-01-15 15:19:06.124297]

Maybe there was only a temporary network interruption, but on the other hand there is an SSL error message in the log file from gluster-node1.

Any ideas?

Regards
David Spisla

From amukherj at redhat.com Thu Jan 17 02:36:40 2019
From: amukherj at redhat.com (Atin Mukherjee)
Date: Thu, 17 Jan 2019 08:06:40 +0530
Subject: [Gluster-users] glusterfs 4.1.6 error in starting glusterd service
In-Reply-To:
References:
Message-ID:

If gluster volume info/status shows the brick to be /media/disk4/brick4 then you'd need to mount the same path and hence you'd need to create the brick4 directory explicitly. I fail to understand the rationale how only /media/disk4 can be used as the mount path for the brick.

On Wed, Jan 16, 2019 at 5:24 PM Amudhan P wrote:

> Yes, I did mount bricks but the folder 'brick4' was still not created
> inside the brick.
> Do I need to create this folder because when I run replace-brick it will
> create folder inside the brick. I have seen this behavior before when
> running replace-brick or heal begins.
>
> On Wed, Jan 16, 2019 at 5:05 PM Atin Mukherjee wrote:
>
>>
>>
>> On Wed, Jan 16, 2019 at 5:02 PM Amudhan P wrote:
>>
>>> Atin,
>>> I have copied the content of 'gfs-tst' from vol folder in another node.
>>> when starting service again fails with error msg in glusterd.log file.
>>> >>> [2019-01-15 20:16:59.513023] I [MSGID: 100030] [glusterfsd.c:2741:main] >>> 0-/usr/local/sbin/glusterd: Started running /usr/local/sbin/glusterd >>> version 4.1.6 (args: /usr/local/sbin/glusterd -p /var/run/glusterd.pid) >>> [2019-01-15 20:16:59.517164] I [MSGID: 106478] [glusterd.c:1423:init] >>> 0-management: Maximum allowed open file descriptors set to 65536 >>> [2019-01-15 20:16:59.517264] I [MSGID: 106479] [glusterd.c:1481:init] >>> 0-management: Using /var/lib/glusterd as working directory >>> [2019-01-15 20:16:59.517283] I [MSGID: 106479] [glusterd.c:1486:init] >>> 0-management: Using /var/run/gluster as pid file working directory >>> [2019-01-15 20:16:59.521508] W [MSGID: 103071] >>> [rdma.c:4629:__gf_rdma_ctx_create] 0-rpc-transport/rdma: rdma_cm event >>> channel creation failed [No such device] >>> [2019-01-15 20:16:59.521544] W [MSGID: 103055] [rdma.c:4938:init] >>> 0-rdma.management: Failed to initialize IB Device >>> [2019-01-15 20:16:59.521562] W [rpc-transport.c:351:rpc_transport_load] >>> 0-rpc-transport: 'rdma' initialization failed >>> [2019-01-15 20:16:59.521629] W [rpcsvc.c:1781:rpcsvc_create_listener] >>> 0-rpc-service: cannot create listener, initing the transport failed >>> [2019-01-15 20:16:59.521648] E [MSGID: 106244] [glusterd.c:1764:init] >>> 0-management: creation of 1 listeners failed, continuing with succeeded >>> transport >>> [2019-01-15 20:17:00.529390] I [MSGID: 106513] >>> [glusterd-store.c:2240:glusterd_restore_op_version] 0-glusterd: retrieved >>> op-version: 40100 >>> [2019-01-15 20:17:00.608354] I [MSGID: 106544] >>> [glusterd.c:158:glusterd_uuid_init] 0-management: retrieved UUID: >>> d6bf51a7-c296-492f-8dac-e81efa9dd22d >>> [2019-01-15 20:17:00.650911] W [MSGID: 106425] >>> [glusterd-store.c:2643:glusterd_store_retrieve_bricks] 0-management: failed >>> to get statfs() call on brick /media/disk4/brick4 [No such file or >>> directory] >>> >> >> This means that underlying brick /media/disk4/brick4 doesn't exist. 
You >> already mentioned that you had replaced the faulty disk, but have you not >> mounted it yet? >> >> >>> [2019-01-15 20:17:00.691240] I [MSGID: 106498] >>> [glusterd-handler.c:3614:glusterd_friend_add_from_peerinfo] 0-management: >>> connect returned 0 >>> [2019-01-15 20:17:00.691307] W [MSGID: 106061] >>> [glusterd-handler.c:3408:glusterd_transport_inet_options_build] 0-glusterd: >>> Failed to get tcp-user-timeout >>> [2019-01-15 20:17:00.691331] I >>> [rpc-clnt.c:1059:rpc_clnt_connection_init] 0-management: setting >>> frame-timeout to 600 >>> [2019-01-15 20:17:00.692547] E [MSGID: 106187] >>> [glusterd-store.c:4662:glusterd_resolve_all_bricks] 0-glusterd: resolve >>> brick failed in restore >>> [2019-01-15 20:17:00.692582] E [MSGID: 101019] >>> [xlator.c:720:xlator_init] 0-management: Initialization of volume >>> 'management' failed, review your volfile again >>> [2019-01-15 20:17:00.692597] E [MSGID: 101066] >>> [graph.c:367:glusterfs_graph_init] 0-management: initializing translator >>> failed >>> [2019-01-15 20:17:00.692607] E [MSGID: 101176] >>> [graph.c:738:glusterfs_graph_activate] 0-graph: init failed >>> [2019-01-15 20:17:00.693004] W [glusterfsd.c:1514:cleanup_and_exit] >>> (-->/usr/local/sbin/glusterd(glusterfs_volumes_init+0xc2) [0x409f52] >>> -->/usr/local/sbin/glusterd(glusterfs_process_volfp+0x151) [0x409e41] >>> -->/usr/local/sbin/glusterd(cleanup_and_exit+0x5f) [0x40942f] ) 0-: >>> received signum (-1), shutting down >>> >>> >>> On Wed, Jan 16, 2019 at 4:34 PM Atin Mukherjee >>> wrote: >>> >>>> This is a case of partial write of a transaction and as the host ran >>>> out of space for the root partition where all the glusterd related >>>> configurations are persisted, the transaction couldn't be written and hence >>>> the new (replaced) brick's information wasn't persisted in the >>>> configuration. 
The workaround for this is to copy the content of >>>> /var/lib/glusterd/vols/gfs-tst/ from one of the nodes in the trusted >>>> storage pool to the node where glusterd service fails to come up and post >>>> that restarting the glusterd service should be able to make peer status >>>> reporting all nodes healthy and connected. >>>> >>>> On Wed, Jan 16, 2019 at 3:49 PM Amudhan P wrote: >>>> >>>>> Hi, >>>>> >>>>> In short, when I started glusterd service I am getting following error >>>>> msg in the glusterd.log file in one server. >>>>> what needs to be done? >>>>> >>>>> error logged in glusterd.log >>>>> >>>>> [2019-01-15 17:50:13.956053] I [MSGID: 100030] >>>>> [glusterfsd.c:2741:main] 0-/usr/local/sbin/glusterd: Started running >>>>> /usr/local/sbin/glusterd version 4.1.6 (args: /usr/local/sbin/glusterd -p >>>>> /var/run/glusterd.pid) >>>>> [2019-01-15 17:50:13.960131] I [MSGID: 106478] [glusterd.c:1423:init] >>>>> 0-management: Maximum allowed open file descriptors set to 65536 >>>>> [2019-01-15 17:50:13.960193] I [MSGID: 106479] [glusterd.c:1481:init] >>>>> 0-management: Using /var/lib/glusterd as working directory >>>>> [2019-01-15 17:50:13.960212] I [MSGID: 106479] [glusterd.c:1486:init] >>>>> 0-management: Using /var/run/gluster as pid file working directory >>>>> [2019-01-15 17:50:13.964437] W [MSGID: 103071] >>>>> [rdma.c:4629:__gf_rdma_ctx_create] 0-rpc-transport/rdma: rdma_cm event >>>>> channel creation failed [No such device] >>>>> [2019-01-15 17:50:13.964474] W [MSGID: 103055] [rdma.c:4938:init] >>>>> 0-rdma.management: Failed to initialize IB Device >>>>> [2019-01-15 17:50:13.964491] W >>>>> [rpc-transport.c:351:rpc_transport_load] 0-rpc-transport: 'rdma' >>>>> initialization failed >>>>> [2019-01-15 17:50:13.964560] W [rpcsvc.c:1781:rpcsvc_create_listener] >>>>> 0-rpc-service: cannot create listener, initing the transport failed >>>>> [2019-01-15 17:50:13.964579] E [MSGID: 106244] [glusterd.c:1764:init] >>>>> 0-management: creation of 1 listeners 
failed, continuing with succeeded >>>>> transport >>>>> [2019-01-15 17:50:14.967681] I [MSGID: 106513] >>>>> [glusterd-store.c:2240:glusterd_restore_op_version] 0-glusterd: retrieved >>>>> op-version: 40100 >>>>> [2019-01-15 17:50:14.973931] I [MSGID: 106544] >>>>> [glusterd.c:158:glusterd_uuid_init] 0-management: retrieved UUID: >>>>> d6bf51a7-c296-492f-8dac-e81efa9dd22d >>>>> [2019-01-15 17:50:15.046620] E [MSGID: 101032] >>>>> [store.c:441:gf_store_handle_retrieve] 0-: Path corresponding to >>>>> /var/lib/glusterd/vols/gfs-tst/bricks/IP.3:-media-disk3-brick3. [No such >>>>> file or directory] >>>>> [2019-01-15 17:50:15.046685] E [MSGID: 106201] >>>>> [glusterd-store.c:3384:glusterd_store_retrieve_volumes] 0-management: >>>>> Unable to restore volume: gfs-tst >>>>> [2019-01-15 17:50:15.046718] E [MSGID: 101019] >>>>> [xlator.c:720:xlator_init] 0-management: Initialization of volume >>>>> 'management' failed, review your volfile again >>>>> [2019-01-15 17:50:15.046732] E [MSGID: 101066] >>>>> [graph.c:367:glusterfs_graph_init] 0-management: initializing translator >>>>> failed >>>>> [2019-01-15 17:50:15.046741] E [MSGID: 101176] >>>>> [graph.c:738:glusterfs_graph_activate] 0-graph: init failed >>>>> [2019-01-15 17:50:15.047171] W [glusterfsd.c:1514:cleanup_and_exit] >>>>> (-->/usr/local/sbin/glusterd(glusterfs_volumes >>>>> >>>>> >>>>> >>>>> In long, I am trying to simulate a situation. where volume stoped >>>>> abnormally and >>>>> entire cluster restarted with some missing disks. >>>>> >>>>> My test cluster is set up with 3 nodes and each has four disks, I have >>>>> setup a volume with disperse 4+2. >>>>> In Node-3 2 disks have failed, to replace I have shutdown all system >>>>> >>>>> below are the steps done. >>>>> >>>>> 1. umount from client machine >>>>> 2. shutdown all system by running `shutdown -h now` command ( without >>>>> stopping volume and stop service) >>>>> 3. replace faulty disk in Node-3 >>>>> 4. powered ON all system >>>>> 5. 
format replaced drives, and mount all drives >>>>> 6. start glusterd service in all node (success) >>>>> 7. Now running `voulume status` command from node-3 >>>>> output : [2019-01-15 16:52:17.718422] : v status : FAILED : Staging >>>>> failed on 0083ec0c-40bf-472a-a128-458924e56c96. Please check log file for >>>>> details. >>>>> 8. running `voulume start gfs-tst` command from node-3 >>>>> output : [2019-01-15 16:53:19.410252] : v start gfs-tst : FAILED : >>>>> Volume gfs-tst already started >>>>> >>>>> 9. running `gluster v status` in other node. showing all brick >>>>> available but 'self-heal daemon' not running >>>>> @gfstst-node2:~$ sudo gluster v status >>>>> Status of volume: gfs-tst >>>>> Gluster process TCP Port RDMA Port >>>>> Online Pid >>>>> >>>>> ------------------------------------------------------------------------------ >>>>> Brick IP.2:/media/disk1/brick1 49152 0 Y >>>>> 1517 >>>>> Brick IP.4:/media/disk1/brick1 49152 0 Y >>>>> 1668 >>>>> Brick IP.2:/media/disk2/brick2 49153 0 Y >>>>> 1522 >>>>> Brick IP.4:/media/disk2/brick2 49153 0 Y >>>>> 1678 >>>>> Brick IP.2:/media/disk3/brick3 49154 0 Y >>>>> 1527 >>>>> Brick IP.4:/media/disk3/brick3 49154 0 Y >>>>> 1677 >>>>> Brick IP.2:/media/disk4/brick4 49155 0 Y >>>>> 1541 >>>>> Brick IP.4:/media/disk4/brick4 49155 0 Y >>>>> 1683 >>>>> Self-heal Daemon on localhost N/A N/A Y >>>>> 2662 >>>>> Self-heal Daemon on IP.4 N/A N/A Y >>>>> 2786 >>>>> >>>>> 10. in the above output 'volume already started'. so, running >>>>> `reset-brick` command >>>>> v reset-brick gfs-tst IP.3:/media/disk3/brick3 >>>>> IP.3:/media/disk3/brick3 commit force >>>>> >>>>> output : [2019-01-15 16:57:37.916942] : v reset-brick gfs-tst >>>>> IP.3:/media/disk3/brick3 IP.3:/media/disk3/brick3 commit force : FAILED : >>>>> /media/disk3/brick3 is already part of a volume >>>>> >>>>> 11. 
reset-brick command was not working, so, tried stopping volume and >>>>> start with force command >>>>> output : [2019-01-15 17:01:04.570794] : v start gfs-tst force : >>>>> FAILED : Pre-validation failed on localhost. Please check log file for >>>>> details >>>>> >>>>> 12. now stopped service in all node and tried starting again. except >>>>> node-3 other nodes service started successfully without any issues. >>>>> >>>>> in node-3 receiving following message. >>>>> >>>>> sudo service glusterd start >>>>> * Starting glusterd service glusterd >>>>> >>>>> [fail] >>>>> /usr/local/sbin/glusterd: option requires an argument -- 'f' >>>>> Try `glusterd --help' or `glusterd --usage' for more information. >>>>> >>>>> 13. checking glusterd log file found that OS drive was running out of >>>>> space >>>>> output : [2019-01-15 16:51:37.210792] W [MSGID: 101012] >>>>> [store.c:372:gf_store_save_value] 0-management: fflush failed. [No space >>>>> left on device] >>>>> [2019-01-15 16:51:37.210874] E [MSGID: 106190] >>>>> [glusterd-store.c:1058:glusterd_volume_exclude_options_write] 0-management: >>>>> Unable to write volume values for gfs-tst >>>>> >>>>> 14. cleared some space in OS drive but still, service is not running. 
>>>>> below is the error logged in glusterd.log >>>>> >>>>> [2019-01-15 17:50:13.956053] I [MSGID: 100030] >>>>> [glusterfsd.c:2741:main] 0-/usr/local/sbin/glusterd: Started running >>>>> /usr/local/sbin/glusterd version 4.1.6 (args: /usr/local/sbin/glusterd -p >>>>> /var/run/glusterd.pid) >>>>> [2019-01-15 17:50:13.960131] I [MSGID: 106478] [glusterd.c:1423:init] >>>>> 0-management: Maximum allowed open file descriptors set to 65536 >>>>> [2019-01-15 17:50:13.960193] I [MSGID: 106479] [glusterd.c:1481:init] >>>>> 0-management: Using /var/lib/glusterd as working directory >>>>> [2019-01-15 17:50:13.960212] I [MSGID: 106479] [glusterd.c:1486:init] >>>>> 0-management: Using /var/run/gluster as pid file working directory >>>>> [2019-01-15 17:50:13.964437] W [MSGID: 103071] >>>>> [rdma.c:4629:__gf_rdma_ctx_create] 0-rpc-transport/rdma: rdma_cm event >>>>> channel creation failed [No such device] >>>>> [2019-01-15 17:50:13.964474] W [MSGID: 103055] [rdma.c:4938:init] >>>>> 0-rdma.management: Failed to initialize IB Device >>>>> [2019-01-15 17:50:13.964491] W >>>>> [rpc-transport.c:351:rpc_transport_load] 0-rpc-transport: 'rdma' >>>>> initialization failed >>>>> [2019-01-15 17:50:13.964560] W [rpcsvc.c:1781:rpcsvc_create_listener] >>>>> 0-rpc-service: cannot create listener, initing the transport failed >>>>> [2019-01-15 17:50:13.964579] E [MSGID: 106244] [glusterd.c:1764:init] >>>>> 0-management: creation of 1 listeners failed, continuing with succeeded >>>>> transport >>>>> [2019-01-15 17:50:14.967681] I [MSGID: 106513] >>>>> [glusterd-store.c:2240:glusterd_restore_op_version] 0-glusterd: retrieved >>>>> op-version: 40100 >>>>> [2019-01-15 17:50:14.973931] I [MSGID: 106544] >>>>> [glusterd.c:158:glusterd_uuid_init] 0-management: retrieved UUID: >>>>> d6bf51a7-c296-492f-8dac-e81efa9dd22d >>>>> [2019-01-15 17:50:15.046620] E [MSGID: 101032] >>>>> [store.c:441:gf_store_handle_retrieve] 0-: Path corresponding to >>>>> 
/var/lib/glusterd/vols/gfs-tst/bricks/IP.3:-media-disk3-brick3. [No such >>>>> file or directory] >>>>> [2019-01-15 17:50:15.046685] E [MSGID: 106201] >>>>> [glusterd-store.c:3384:glusterd_store_retrieve_volumes] 0-management: >>>>> Unable to restore volume: gfs-tst >>>>> [2019-01-15 17:50:15.046718] E [MSGID: 101019] >>>>> [xlator.c:720:xlator_init] 0-management: Initialization of volume >>>>> 'management' failed, review your volfile again >>>>> [2019-01-15 17:50:15.046732] E [MSGID: 101066] >>>>> [graph.c:367:glusterfs_graph_init] 0-management: initializing translator >>>>> failed >>>>> [2019-01-15 17:50:15.046741] E [MSGID: 101176] >>>>> [graph.c:738:glusterfs_graph_activate] 0-graph: init failed >>>>> [2019-01-15 17:50:15.047171] W [glusterfsd.c:1514:cleanup_and_exit] >>>>> (-->/usr/local/sbin/glusterd(glusterfs_volumes_init+0xc2) [0x409f52] >>>>> -->/usr/local/sbin/glusterd(glusterfs_process_volfp+0x151) [0x409e41] >>>>> -->/usr/local/sbin/glusterd(cleanup_and_exit+0x5f) [0x40942f] ) 0-: >>>>> received signum (-1), shutting down >>>>> >>>>> >>>>> 15. 
In other node running `volume status' still shows bricks node3 is >>>>> live >>>>> but 'peer status' showing node-3 disconnected >>>>> >>>>> @gfstst-node2:~$ sudo gluster v status >>>>> Status of volume: gfs-tst >>>>> Gluster process TCP Port RDMA Port >>>>> Online Pid >>>>> >>>>> ------------------------------------------------------------------------------ >>>>> Brick IP.2:/media/disk1/brick1 49152 0 Y >>>>> 1517 >>>>> Brick IP.4:/media/disk1/brick1 49152 0 Y >>>>> 1668 >>>>> Brick IP.2:/media/disk2/brick2 49153 0 Y >>>>> 1522 >>>>> Brick IP.4:/media/disk2/brick2 49153 0 Y >>>>> 1678 >>>>> Brick IP.2:/media/disk3/brick3 49154 0 Y >>>>> 1527 >>>>> Brick IP.4:/media/disk3/brick3 49154 0 Y >>>>> 1677 >>>>> Brick IP.2:/media/disk4/brick4 49155 0 Y >>>>> 1541 >>>>> Brick IP.4:/media/disk4/brick4 49155 0 Y >>>>> 1683 >>>>> Self-heal Daemon on localhost N/A N/A Y >>>>> 2662 >>>>> Self-heal Daemon on IP.4 N/A N/A Y >>>>> 2786 >>>>> >>>>> Task Status of Volume gfs-tst >>>>> >>>>> ------------------------------------------------------------------------------ >>>>> There are no active volume tasks >>>>> >>>>> >>>>> root at gfstst-node2:~$ sudo gluster pool list >>>>> UUID Hostname State >>>>> d6bf51a7-c296-492f-8dac-e81efa9dd22d IP.3 Disconnected >>>>> c1cbb58e-3ceb-4637-9ba3-3d28ef20b143 IP.4 Connected >>>>> 0083ec0c-40bf-472a-a128-458924e56c96 localhost Connected >>>>> >>>>> root at gfstst-node2:~$ sudo gluster peer status >>>>> Number of Peers: 2 >>>>> >>>>> Hostname: IP.3 >>>>> Uuid: d6bf51a7-c296-492f-8dac-e81efa9dd22d >>>>> State: Peer in Cluster (Disconnected) >>>>> >>>>> Hostname: IP.4 >>>>> Uuid: c1cbb58e-3ceb-4637-9ba3-3d28ef20b143 >>>>> State: Peer in Cluster (Connected) >>>>> >>>>> >>>>> regards >>>>> Amudhan >>>>> _______________________________________________ >>>>> Gluster-users mailing list >>>>> Gluster-users at gluster.org >>>>> https://lists.gluster.org/mailman/listinfo/gluster-users >>>> >>>> -------------- next part -------------- An HTML attachment was 
scrubbed... URL: 

From amukherj at redhat.com Thu Jan 17 02:42:18 2019
From: amukherj at redhat.com (Atin Mukherjee)
Date: Thu, 17 Jan 2019 08:12:18 +0530
Subject: [Gluster-users] VolumeOpt Set fails of a freshly created volume
In-Reply-To: 
References: 
Message-ID: 

On Wed, Jan 16, 2019 at 9:48 PM David Spisla wrote:

> Dear Gluster Community,
>
> I created a replica 4 volume from gluster-node1 on a 4-node cluster with
> SSL/TLS network encryption. While setting the 'cluster.use-compound-fops'
> option, I got the error:
>
> $ volume set: failed: Commit failed on gluster-node2. Please check log
> file for details.
>
> Here is the glusterd.log from gluster-node1:
>
> *[2019-01-15 15:18:36.813034] I [run.c:242:runner_log]
> (-->/usr/lib64/glusterfs/5.2/xlator/mgmt/glusterd.so(+0xdad2a)
> [0x7fc24d91cd2a]
> -->/usr/lib64/glusterfs/5.2/xlator/mgmt/glusterd.so(+0xda81c)
> [0x7fc24d91c81c] -->/usr/lib64/libglusterfs.so.0(runner_log+0x105)
> [0x7fc253dce0b5] ) 0-management: Ran script:
> /var/lib/glusterd/hooks/1/set/post/S30samba-set.sh
> --volname=integration-archive1 -o cluster.use-compound-fops=on
> --gd-workdir=/var/lib/glusterd*
> [2019-01-15 15:18:36.821193] I [run.c:242:runner_log]
> (-->/usr/lib64/glusterfs/5.2/xlator/mgmt/glusterd.so(+0xdad2a)
> [0x7fc24d91cd2a]
> -->/usr/lib64/glusterfs/5.2/xlator/mgmt/glusterd.so(+0xda81c)
> [0x7fc24d91c81c] -->/usr/lib64/libglusterfs.so.0(runner_log+0x105)
> [0x7fc253dce0b5] ) 0-management: Ran script:
> /var/lib/glusterd/hooks/1/set/post/S32gluster_enable_shared_storage.sh
> --volname=integration-archive1 -o cluster.use-compound-fops=on
> --gd-workdir=/var/lib/glusterd
> [2019-01-15 15:18:36.842383] W [socket.c:719:__socket_rwv] 0-management:
> readv on 10.10.12.42:24007 failed (Input/output error)
> *[2019-01-15 15:18:36.842415] E [socket.c:246:ssl_dump_error_stack]
> 0-management: error:140943F2:SSL routines:ssl3_read_bytes:sslv3 alert
> unexpected message*
> The message "E [MSGID: 101191]
> [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch
> handler" repeated 81 times between [2019-01-15 15:18:30.735508] and
> [2019-01-15 15:18:36.808994]
> [2019-01-15 15:18:36.842439] I [MSGID: 106004]
> [glusterd-handler.c:6430:__glusterd_peer_rpc_notify] 0-management: Peer
> <gluster-node2> (<02724bb6-cb34-4ec3-8306-c2950e0acf9b>), in state <Peer in
> Cluster>, has disconnected from glusterd.
>

The above shows that a peer disconnect event was received from
gluster-node2. This probably happened while the commit operation was in
flight, which is why the volume set failed on gluster-node2. Regarding the
SSL error, I'd request Milind to comment.

> [2019-01-15 15:18:36.842638] W
> [glusterd-locks.c:795:glusterd_mgmt_v3_unlock]
> (-->/usr/lib64/glusterfs/5.2/xlator/mgmt/glusterd.so(+0x24349)
> [0x7fc24d866349]
> -->/usr/lib64/glusterfs/5.2/xlator/mgmt/glusterd.so(+0x2d950)
> [0x7fc24d86f950]
> -->/usr/lib64/glusterfs/5.2/xlator/mgmt/glusterd.so(+0xe0239)
> [0x7fc24d922239] ) 0-management: Lock for vol archive1 not held
> [2019-01-15 15:18:36.842656] W [MSGID: 106117]
> [glusterd-handler.c:6451:__glusterd_peer_rpc_notify] 0-management: Lock not
> released for archive1
> [2019-01-15 15:18:36.842674] W
> [glusterd-locks.c:795:glusterd_mgmt_v3_unlock]
> (-->/usr/lib64/glusterfs/5.2/xlator/mgmt/glusterd.so(+0x24349)
> [0x7fc24d866349]
> -->/usr/lib64/glusterfs/5.2/xlator/mgmt/glusterd.so(+0x2d950)
> [0x7fc24d86f950]
> -->/usr/lib64/glusterfs/5.2/xlator/mgmt/glusterd.so(+0xe0239)
> [0x7fc24d922239] ) 0-management: Lock for vol archive2 not held
> [2019-01-15 15:18:36.842680] W [MSGID: 106117]
> [glusterd-handler.c:6451:__glusterd_peer_rpc_notify] 0-management: Lock not
> released for archive2
> [2019-01-15 15:18:36.842694] W
> [glusterd-locks.c:795:glusterd_mgmt_v3_unlock]
> (-->/usr/lib64/glusterfs/5.2/xlator/mgmt/glusterd.so(+0x24349)
> [0x7fc24d866349]
> -->/usr/lib64/glusterfs/5.2/xlator/mgmt/glusterd.so(+0x2d950)
> [0x7fc24d86f950]
> -->/usr/lib64/glusterfs/5.2/xlator/mgmt/glusterd.so(+0xe0239)
> [0x7fc24d922239] ) 0-management: Lock for vol gluster_shared_storage not
> held
> [2019-01-15 15:18:36.842702] W [MSGID: 106117]
> [glusterd-handler.c:6451:__glusterd_peer_rpc_notify] 0-management: Lock not
> released for gluster_shared_storage
> [2019-01-15 15:18:36.842719] W
> [glusterd-locks.c:806:glusterd_mgmt_v3_unlock]
> (-->/usr/lib64/glusterfs/5.2/xlator/mgmt/glusterd.so(+0x24349)
> [0x7fc24d866349]
> -->/usr/lib64/glusterfs/5.2/xlator/mgmt/glusterd.so(+0x2d950)
> [0x7fc24d86f950]
> -->/usr/lib64/glusterfs/5.2/xlator/mgmt/glusterd.so(+0xe0074)
> [0x7fc24d922074] ) 0-management: Lock owner mismatch. Lock for vol
> integration-archive1 held by ffdaa400-82cc-4ada-8ea7-144bf3714269
> [2019-01-15 15:18:36.842727] W [MSGID: 106117]
> [glusterd-handler.c:6451:__glusterd_peer_rpc_notify] 0-management: Lock not
> released for integration-archive1
> [2019-01-15 15:18:36.842970] E [rpc-clnt.c:346:saved_frames_unwind] (-->
> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x17d)[0x7fc253d7f18d] (-->
> /usr/lib64/libgfrpc.so.0(+0xca3d)[0x7fc253b46a3d] (-->
> /usr/lib64/libgfrpc.so.0(+0xcb5e)[0x7fc253b46b5e] (-->
> /usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x8b)[0x7fc253b480bb]
> (--> /usr/lib64/libgfrpc.so.0(+0xec68)[0x7fc253b48c68] ))))) 0-management:
> forced unwinding frame type(glusterd mgmt) op(--(4)) called at 2019-01-15
> 15:18:36.802613 (xid=0x6da)
> [2019-01-15 15:18:36.842994] E [MSGID: 106152]
> [glusterd-syncop.c:104:gd_collate_errors] 0-glusterd: Commit failed on
> gluster-node2. Please check log file for details.
>
> And here is the glusterd.log from gluster-node2:
>
> *[2019-01-15 15:18:36.901788] I [run.c:242:runner_log]
> (-->/usr/lib64/glusterfs/5.2/xlator/mgmt/glusterd.so(+0xdad2a)
> [0x7f9fba02cd2a]
> -->/usr/lib64/glusterfs/5.2/xlator/mgmt/glusterd.so(+0xda81c)
> [0x7f9fba02c81c] -->/usr/lib64/libglusterfs.so.0(runner_log+0x105)
> [0x7f9fc04de0b5] ) 0-management: Ran script:
> /var/lib/glusterd/hooks/1/set/post/S32gluster_enable_shared_storage.sh
> --volname=integration-archive1 -o cluster.use-compound-fops=on
> --gd-workdir=/var/lib/glusterd*
> The message "E [MSGID: 101191]
> [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch
> handler" repeated 35 times between [2019-01-15 15:18:24.832023] and
> [2019-01-15 15:18:47.049407]
> [2019-01-15 15:18:47.049443] I [MSGID: 106163]
> [glusterd-handshake.c:1389:__glusterd_mgmt_hndsk_versions_ack]
> 0-management: using the op-version 50000
> [2019-01-15 15:18:47.053439] E [MSGID: 101191]
> [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch
> handler
> [2019-01-15 15:18:47.053479] E [MSGID: 101191]
> [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch
> handler
> [2019-01-15 15:18:47.059899] I [MSGID: 106490]
> [glusterd-handler.c:2586:__glusterd_handle_incoming_friend_req] 0-glusterd:
> Received probe from uuid: ffdaa400-82cc-4ada-8ea7-144bf3714269
> [2019-01-15 15:18:47.063471] I [MSGID: 106493]
> [glusterd-handler.c:3843:glusterd_xfer_friend_add_resp] 0-glusterd:
> Responded to fs-lrunning-c1-n1 (0), ret: 0, op_ret: 0
> [2019-01-15 15:18:47.066148] I [MSGID: 106492]
> [glusterd-handler.c:2771:__glusterd_handle_friend_update] 0-glusterd:
> Received friend update from uuid: ffdaa400-82cc-4ada-8ea7-144bf3714269
> [2019-01-15 15:18:47.067264] I [MSGID: 106502]
> [glusterd-handler.c:2812:__glusterd_handle_friend_update] 0-management:
> Received my uuid as Friend
> [2019-01-15 15:18:47.078696] I [MSGID: 106493]
> [glusterd-rpc-ops.c:681:__glusterd_friend_update_cbk] 0-management:
> Received ACC from uuid: ffdaa400-82cc-4ada-8ea7-144bf3714269
> [2019-01-15 15:19:05.377216] E [MSGID: 101191]
> [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch
> handler
> The message "E [MSGID: 101191]
> [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch
> handler" repeated 3 times between [2019-01-15 15:19:05.377216] and
> [2019-01-15 15:19:06.124297]
>
> Maybe it was only a temporary network interruption, but on the other
> hand there is an SSL error message in the log file from gluster-node1.
> Any ideas?
>
> Regards
> David Spisla
>
> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-users

From amudhan83 at gmail.com Thu Jan 17 06:04:40 2019
From: amudhan83 at gmail.com (Amudhan P)
Date: Thu, 17 Jan 2019 11:34:40 +0530
Subject: [Gluster-users] glusterfs 4.1.6 error in starting glusterd service
In-Reply-To: 
References: 
Message-ID: 

I have created the folder in the path as suggested, but the service still
failed to start. Below is the error message in glusterd.log:

[2019-01-16 14:50:14.555742] I [MSGID: 100030] [glusterfsd.c:2741:main] 0-/usr/local/sbin/glusterd: Started running /usr/local/sbin/glusterd version 4.1.6 (args: /usr/local/sbin/glusterd -p /var/run/glusterd.pid)
[2019-01-16 14:50:14.559835] I [MSGID: 106478] [glusterd.c:1423:init] 0-management: Maximum allowed open file descriptors set to 65536
[2019-01-16 14:50:14.559894] I [MSGID: 106479] [glusterd.c:1481:init] 0-management: Using /var/lib/glusterd as working directory
[2019-01-16 14:50:14.559912] I [MSGID: 106479] [glusterd.c:1486:init] 0-management: Using /var/run/gluster as pid file working directory
[2019-01-16 14:50:14.563834] W [MSGID: 103071] [rdma.c:4629:__gf_rdma_ctx_create]
0-rpc-transport/rdma: rdma_cm event channel creation failed [No such device]
[2019-01-16 14:50:14.563867] W [MSGID: 103055] [rdma.c:4938:init] 0-rdma.management: Failed to initialize IB Device
[2019-01-16 14:50:14.563882] W [rpc-transport.c:351:rpc_transport_load] 0-rpc-transport: 'rdma' initialization failed
[2019-01-16 14:50:14.563957] W [rpcsvc.c:1781:rpcsvc_create_listener] 0-rpc-service: cannot create listener, initing the transport failed
[2019-01-16 14:50:14.563974] E [MSGID: 106244] [glusterd.c:1764:init] 0-management: creation of 1 listeners failed, continuing with succeeded transport
[2019-01-16 14:50:15.565868] I [MSGID: 106513] [glusterd-store.c:2240:glusterd_restore_op_version] 0-glusterd: retrieved op-version: 40100
[2019-01-16 14:50:15.642532] I [MSGID: 106544] [glusterd.c:158:glusterd_uuid_init] 0-management: retrieved UUID: d6bf51a7-c296-492f-8dac-e81efa9dd22d
[2019-01-16 14:50:15.675333] I [MSGID: 106498] [glusterd-handler.c:3614:glusterd_friend_add_from_peerinfo] 0-management: connect returned 0
[2019-01-16 14:50:15.675421] W [MSGID: 106061] [glusterd-handler.c:3408:glusterd_transport_inet_options_build] 0-glusterd: Failed to get tcp-user-timeout
[2019-01-16 14:50:15.675451] I [rpc-clnt.c:1059:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
*[2019-01-16 14:50:15.676912] E [MSGID: 106187] [glusterd-store.c:4662:glusterd_resolve_all_bricks] 0-glusterd: resolve brick failed in restore*
*[2019-01-16 14:50:15.676956] E [MSGID: 101019] [xlator.c:720:xlator_init] 0-management: Initialization of volume 'management' failed, review your volfile again*
[2019-01-16 14:50:15.676973] E [MSGID: 101066] [graph.c:367:glusterfs_graph_init] 0-management: initializing translator failed
[2019-01-16 14:50:15.676986] E [MSGID: 101176] [graph.c:738:glusterfs_graph_activate] 0-graph: init failed
[2019-01-16 14:50:15.677479] W [glusterfsd.c:1514:cleanup_and_exit] (-->/usr/local/sbin/glusterd(glusterfs_volumes_init+0xc2) [0x409f52] -->/usr/local/sbin/glusterd(glusterfs_process_volfp+0x151) [0x409e41] -->/usr/local/sbin/glusterd(cleanup_and_exit+0x5f) [0x40942f] ) 0-: received signum (-1), shutting down

On Thu, Jan 17, 2019 at 8:06 AM Atin Mukherjee wrote:

> If gluster volume info/status shows the brick to be /media/disk4/brick4,
> then you need to mount the same path, and hence you need to create the
> brick4 directory explicitly. I don't understand the rationale for using
> only /media/disk4 as the mount path for the brick.
>
> On Wed, Jan 16, 2019 at 5:24 PM Amudhan P wrote:
>
>> Yes, I did mount the bricks, but the folder 'brick4' was still not
>> created inside the brick.
>> Do I need to create this folder? When I run replace-brick, it creates
>> the folder inside the brick. I have seen this behavior before when
>> running replace-brick or when heal begins.
>>
>> On Wed, Jan 16, 2019 at 5:05 PM Atin Mukherjee
>> wrote:
>>
>>>
>>>
>>> On Wed, Jan 16, 2019 at 5:02 PM Amudhan P wrote:
>>>
>>>> Atin,
>>>> I have copied the content of 'gfs-tst' from the vol folder on another
>>>> node. Starting the service again fails with this error message in the
>>>> glusterd.log file:
>>>> >>>> [2019-01-15 20:16:59.513023] I [MSGID: 100030] [glusterfsd.c:2741:main] >>>> 0-/usr/local/sbin/glusterd: Started running /usr/local/sbin/glusterd >>>> version 4.1.6 (args: /usr/local/sbin/glusterd -p /var/run/glusterd.pid) >>>> [2019-01-15 20:16:59.517164] I [MSGID: 106478] [glusterd.c:1423:init] >>>> 0-management: Maximum allowed open file descriptors set to 65536 >>>> [2019-01-15 20:16:59.517264] I [MSGID: 106479] [glusterd.c:1481:init] >>>> 0-management: Using /var/lib/glusterd as working directory >>>> [2019-01-15 20:16:59.517283] I [MSGID: 106479] [glusterd.c:1486:init] >>>> 0-management: Using /var/run/gluster as pid file working directory >>>> [2019-01-15 20:16:59.521508] W [MSGID: 103071] >>>> [rdma.c:4629:__gf_rdma_ctx_create] 0-rpc-transport/rdma: rdma_cm event >>>> channel creation failed [No such device] >>>> [2019-01-15 20:16:59.521544] W [MSGID: 103055] [rdma.c:4938:init] >>>> 0-rdma.management: Failed to initialize IB Device >>>> [2019-01-15 20:16:59.521562] W [rpc-transport.c:351:rpc_transport_load] >>>> 0-rpc-transport: 'rdma' initialization failed >>>> [2019-01-15 20:16:59.521629] W [rpcsvc.c:1781:rpcsvc_create_listener] >>>> 0-rpc-service: cannot create listener, initing the transport failed >>>> [2019-01-15 20:16:59.521648] E [MSGID: 106244] [glusterd.c:1764:init] >>>> 0-management: creation of 1 listeners failed, continuing with succeeded >>>> transport >>>> [2019-01-15 20:17:00.529390] I [MSGID: 106513] >>>> [glusterd-store.c:2240:glusterd_restore_op_version] 0-glusterd: retrieved >>>> op-version: 40100 >>>> [2019-01-15 20:17:00.608354] I [MSGID: 106544] >>>> [glusterd.c:158:glusterd_uuid_init] 0-management: retrieved UUID: >>>> d6bf51a7-c296-492f-8dac-e81efa9dd22d >>>> [2019-01-15 20:17:00.650911] W [MSGID: 106425] >>>> [glusterd-store.c:2643:glusterd_store_retrieve_bricks] 0-management: failed >>>> to get statfs() call on brick /media/disk4/brick4 [No such file or >>>> directory] >>>> >>> >>> This means that underlying brick 
/media/disk4/brick4 doesn't exist. You >>> already mentioned that you had replaced the faulty disk, but have you not >>> mounted it yet? >>> >>> >>>> [2019-01-15 20:17:00.691240] I [MSGID: 106498] >>>> [glusterd-handler.c:3614:glusterd_friend_add_from_peerinfo] 0-management: >>>> connect returned 0 >>>> [2019-01-15 20:17:00.691307] W [MSGID: 106061] >>>> [glusterd-handler.c:3408:glusterd_transport_inet_options_build] 0-glusterd: >>>> Failed to get tcp-user-timeout >>>> [2019-01-15 20:17:00.691331] I >>>> [rpc-clnt.c:1059:rpc_clnt_connection_init] 0-management: setting >>>> frame-timeout to 600 >>>> [2019-01-15 20:17:00.692547] E [MSGID: 106187] >>>> [glusterd-store.c:4662:glusterd_resolve_all_bricks] 0-glusterd: resolve >>>> brick failed in restore >>>> [2019-01-15 20:17:00.692582] E [MSGID: 101019] >>>> [xlator.c:720:xlator_init] 0-management: Initialization of volume >>>> 'management' failed, review your volfile again >>>> [2019-01-15 20:17:00.692597] E [MSGID: 101066] >>>> [graph.c:367:glusterfs_graph_init] 0-management: initializing translator >>>> failed >>>> [2019-01-15 20:17:00.692607] E [MSGID: 101176] >>>> [graph.c:738:glusterfs_graph_activate] 0-graph: init failed >>>> [2019-01-15 20:17:00.693004] W [glusterfsd.c:1514:cleanup_and_exit] >>>> (-->/usr/local/sbin/glusterd(glusterfs_volumes_init+0xc2) [0x409f52] >>>> -->/usr/local/sbin/glusterd(glusterfs_process_volfp+0x151) [0x409e41] >>>> -->/usr/local/sbin/glusterd(cleanup_and_exit+0x5f) [0x40942f] ) 0-: >>>> received signum (-1), shutting down >>>> >>>> >>>> On Wed, Jan 16, 2019 at 4:34 PM Atin Mukherjee >>>> wrote: >>>> >>>>> This is a case of partial write of a transaction and as the host ran >>>>> out of space for the root partition where all the glusterd related >>>>> configurations are persisted, the transaction couldn't be written and hence >>>>> the new (replaced) brick's information wasn't persisted in the >>>>> configuration. 
The workaround for this is to copy the content of >>>>> /var/lib/glusterd/vols/gfs-tst/ from one of the nodes in the trusted >>>>> storage pool to the node where glusterd service fails to come up and post >>>>> that restarting the glusterd service should be able to make peer status >>>>> reporting all nodes healthy and connected. >>>>> >>>>> On Wed, Jan 16, 2019 at 3:49 PM Amudhan P wrote: >>>>> >>>>>> Hi, >>>>>> >>>>>> In short, when I started glusterd service I am getting following >>>>>> error msg in the glusterd.log file in one server. >>>>>> what needs to be done? >>>>>> >>>>>> error logged in glusterd.log >>>>>> >>>>>> [2019-01-15 17:50:13.956053] I [MSGID: 100030] >>>>>> [glusterfsd.c:2741:main] 0-/usr/local/sbin/glusterd: Started running >>>>>> /usr/local/sbin/glusterd version 4.1.6 (args: /usr/local/sbin/glusterd -p >>>>>> /var/run/glusterd.pid) >>>>>> [2019-01-15 17:50:13.960131] I [MSGID: 106478] [glusterd.c:1423:init] >>>>>> 0-management: Maximum allowed open file descriptors set to 65536 >>>>>> [2019-01-15 17:50:13.960193] I [MSGID: 106479] [glusterd.c:1481:init] >>>>>> 0-management: Using /var/lib/glusterd as working directory >>>>>> [2019-01-15 17:50:13.960212] I [MSGID: 106479] [glusterd.c:1486:init] >>>>>> 0-management: Using /var/run/gluster as pid file working directory >>>>>> [2019-01-15 17:50:13.964437] W [MSGID: 103071] >>>>>> [rdma.c:4629:__gf_rdma_ctx_create] 0-rpc-transport/rdma: rdma_cm event >>>>>> channel creation failed [No such device] >>>>>> [2019-01-15 17:50:13.964474] W [MSGID: 103055] [rdma.c:4938:init] >>>>>> 0-rdma.management: Failed to initialize IB Device >>>>>> [2019-01-15 17:50:13.964491] W >>>>>> [rpc-transport.c:351:rpc_transport_load] 0-rpc-transport: 'rdma' >>>>>> initialization failed >>>>>> [2019-01-15 17:50:13.964560] W [rpcsvc.c:1781:rpcsvc_create_listener] >>>>>> 0-rpc-service: cannot create listener, initing the transport failed >>>>>> [2019-01-15 17:50:13.964579] E [MSGID: 106244] [glusterd.c:1764:init] >>>>>> 
0-management: creation of 1 listeners failed, continuing with succeeded >>>>>> transport >>>>>> [2019-01-15 17:50:14.967681] I [MSGID: 106513] >>>>>> [glusterd-store.c:2240:glusterd_restore_op_version] 0-glusterd: retrieved >>>>>> op-version: 40100 >>>>>> [2019-01-15 17:50:14.973931] I [MSGID: 106544] >>>>>> [glusterd.c:158:glusterd_uuid_init] 0-management: retrieved UUID: >>>>>> d6bf51a7-c296-492f-8dac-e81efa9dd22d >>>>>> [2019-01-15 17:50:15.046620] E [MSGID: 101032] >>>>>> [store.c:441:gf_store_handle_retrieve] 0-: Path corresponding to >>>>>> /var/lib/glusterd/vols/gfs-tst/bricks/IP.3:-media-disk3-brick3. [No such >>>>>> file or directory] >>>>>> [2019-01-15 17:50:15.046685] E [MSGID: 106201] >>>>>> [glusterd-store.c:3384:glusterd_store_retrieve_volumes] 0-management: >>>>>> Unable to restore volume: gfs-tst >>>>>> [2019-01-15 17:50:15.046718] E [MSGID: 101019] >>>>>> [xlator.c:720:xlator_init] 0-management: Initialization of volume >>>>>> 'management' failed, review your volfile again >>>>>> [2019-01-15 17:50:15.046732] E [MSGID: 101066] >>>>>> [graph.c:367:glusterfs_graph_init] 0-management: initializing translator >>>>>> failed >>>>>> [2019-01-15 17:50:15.046741] E [MSGID: 101176] >>>>>> [graph.c:738:glusterfs_graph_activate] 0-graph: init failed >>>>>> [2019-01-15 17:50:15.047171] W [glusterfsd.c:1514:cleanup_and_exit] >>>>>> (-->/usr/local/sbin/glusterd(glusterfs_volumes >>>>>> >>>>>> >>>>>> >>>>>> In long, I am trying to simulate a situation. where volume stoped >>>>>> abnormally and >>>>>> entire cluster restarted with some missing disks. >>>>>> >>>>>> My test cluster is set up with 3 nodes and each has four disks, I >>>>>> have setup a volume with disperse 4+2. >>>>>> In Node-3 2 disks have failed, to replace I have shutdown all system >>>>>> >>>>>> below are the steps done. >>>>>> >>>>>> 1. umount from client machine >>>>>> 2. shutdown all system by running `shutdown -h now` command ( without >>>>>> stopping volume and stop service) >>>>>> 3. 
replace faulty disk in Node-3 >>>>>> 4. powered ON all system >>>>>> 5. format replaced drives, and mount all drives >>>>>> 6. start glusterd service in all node (success) >>>>>> 7. Now running `voulume status` command from node-3 >>>>>> output : [2019-01-15 16:52:17.718422] : v status : FAILED : Staging >>>>>> failed on 0083ec0c-40bf-472a-a128-458924e56c96. Please check log file for >>>>>> details. >>>>>> 8. running `voulume start gfs-tst` command from node-3 >>>>>> output : [2019-01-15 16:53:19.410252] : v start gfs-tst : FAILED : >>>>>> Volume gfs-tst already started >>>>>> >>>>>> 9. running `gluster v status` in other node. showing all brick >>>>>> available but 'self-heal daemon' not running >>>>>> @gfstst-node2:~$ sudo gluster v status >>>>>> Status of volume: gfs-tst >>>>>> Gluster process TCP Port RDMA Port >>>>>> Online Pid >>>>>> >>>>>> ------------------------------------------------------------------------------ >>>>>> Brick IP.2:/media/disk1/brick1 49152 0 Y >>>>>> 1517 >>>>>> Brick IP.4:/media/disk1/brick1 49152 0 Y >>>>>> 1668 >>>>>> Brick IP.2:/media/disk2/brick2 49153 0 Y >>>>>> 1522 >>>>>> Brick IP.4:/media/disk2/brick2 49153 0 Y >>>>>> 1678 >>>>>> Brick IP.2:/media/disk3/brick3 49154 0 Y >>>>>> 1527 >>>>>> Brick IP.4:/media/disk3/brick3 49154 0 Y >>>>>> 1677 >>>>>> Brick IP.2:/media/disk4/brick4 49155 0 Y >>>>>> 1541 >>>>>> Brick IP.4:/media/disk4/brick4 49155 0 Y >>>>>> 1683 >>>>>> Self-heal Daemon on localhost N/A N/A Y >>>>>> 2662 >>>>>> Self-heal Daemon on IP.4 N/A N/A Y >>>>>> 2786 >>>>>> >>>>>> 10. in the above output 'volume already started'. so, running >>>>>> `reset-brick` command >>>>>> v reset-brick gfs-tst IP.3:/media/disk3/brick3 >>>>>> IP.3:/media/disk3/brick3 commit force >>>>>> >>>>>> output : [2019-01-15 16:57:37.916942] : v reset-brick gfs-tst >>>>>> IP.3:/media/disk3/brick3 IP.3:/media/disk3/brick3 commit force : FAILED : >>>>>> /media/disk3/brick3 is already part of a volume >>>>>> >>>>>> 11. 
reset-brick command was not working, so, tried stopping volume >>>>>> and start with force command >>>>>> output : [2019-01-15 17:01:04.570794] : v start gfs-tst force : >>>>>> FAILED : Pre-validation failed on localhost. Please check log file for >>>>>> details >>>>>> >>>>>> 12. now stopped service in all node and tried starting again. except >>>>>> node-3 other nodes service started successfully without any issues. >>>>>> >>>>>> in node-3 receiving following message. >>>>>> >>>>>> sudo service glusterd start >>>>>> * Starting glusterd service glusterd >>>>>> >>>>>> [fail] >>>>>> /usr/local/sbin/glusterd: option requires an argument -- 'f' >>>>>> Try `glusterd --help' or `glusterd --usage' for more information. >>>>>> >>>>>> 13. checking glusterd log file found that OS drive was running out of >>>>>> space >>>>>> output : [2019-01-15 16:51:37.210792] W [MSGID: 101012] >>>>>> [store.c:372:gf_store_save_value] 0-management: fflush failed. [No space >>>>>> left on device] >>>>>> [2019-01-15 16:51:37.210874] E [MSGID: 106190] >>>>>> [glusterd-store.c:1058:glusterd_volume_exclude_options_write] 0-management: >>>>>> Unable to write volume values for gfs-tst >>>>>> >>>>>> 14. cleared some space in OS drive but still, service is not running. 
>>>>>> below is the error logged in glusterd.log >>>>>> >>>>>> [2019-01-15 17:50:13.956053] I [MSGID: 100030] >>>>>> [glusterfsd.c:2741:main] 0-/usr/local/sbin/glusterd: Started running >>>>>> /usr/local/sbin/glusterd version 4.1.6 (args: /usr/local/sbin/glusterd -p >>>>>> /var/run/glusterd.pid) >>>>>> [2019-01-15 17:50:13.960131] I [MSGID: 106478] [glusterd.c:1423:init] >>>>>> 0-management: Maximum allowed open file descriptors set to 65536 >>>>>> [2019-01-15 17:50:13.960193] I [MSGID: 106479] [glusterd.c:1481:init] >>>>>> 0-management: Using /var/lib/glusterd as working directory >>>>>> [2019-01-15 17:50:13.960212] I [MSGID: 106479] [glusterd.c:1486:init] >>>>>> 0-management: Using /var/run/gluster as pid file working directory >>>>>> [2019-01-15 17:50:13.964437] W [MSGID: 103071] >>>>>> [rdma.c:4629:__gf_rdma_ctx_create] 0-rpc-transport/rdma: rdma_cm event >>>>>> channel creation failed [No such device] >>>>>> [2019-01-15 17:50:13.964474] W [MSGID: 103055] [rdma.c:4938:init] >>>>>> 0-rdma.management: Failed to initialize IB Device >>>>>> [2019-01-15 17:50:13.964491] W >>>>>> [rpc-transport.c:351:rpc_transport_load] 0-rpc-transport: 'rdma' >>>>>> initialization failed >>>>>> [2019-01-15 17:50:13.964560] W [rpcsvc.c:1781:rpcsvc_create_listener] >>>>>> 0-rpc-service: cannot create listener, initing the transport failed >>>>>> [2019-01-15 17:50:13.964579] E [MSGID: 106244] [glusterd.c:1764:init] >>>>>> 0-management: creation of 1 listeners failed, continuing with succeeded >>>>>> transport >>>>>> [2019-01-15 17:50:14.967681] I [MSGID: 106513] >>>>>> [glusterd-store.c:2240:glusterd_restore_op_version] 0-glusterd: retrieved >>>>>> op-version: 40100 >>>>>> [2019-01-15 17:50:14.973931] I [MSGID: 106544] >>>>>> [glusterd.c:158:glusterd_uuid_init] 0-management: retrieved UUID: >>>>>> d6bf51a7-c296-492f-8dac-e81efa9dd22d >>>>>> [2019-01-15 17:50:15.046620] E [MSGID: 101032] >>>>>> [store.c:441:gf_store_handle_retrieve] 0-: Path corresponding to >>>>>> 
/var/lib/glusterd/vols/gfs-tst/bricks/IP.3:-media-disk3-brick3. [No such >>>>>> file or directory] >>>>>> [2019-01-15 17:50:15.046685] E [MSGID: 106201] >>>>>> [glusterd-store.c:3384:glusterd_store_retrieve_volumes] 0-management: >>>>>> Unable to restore volume: gfs-tst >>>>>> [2019-01-15 17:50:15.046718] E [MSGID: 101019] >>>>>> [xlator.c:720:xlator_init] 0-management: Initialization of volume >>>>>> 'management' failed, review your volfile again >>>>>> [2019-01-15 17:50:15.046732] E [MSGID: 101066] >>>>>> [graph.c:367:glusterfs_graph_init] 0-management: initializing translator >>>>>> failed >>>>>> [2019-01-15 17:50:15.046741] E [MSGID: 101176] >>>>>> [graph.c:738:glusterfs_graph_activate] 0-graph: init failed >>>>>> [2019-01-15 17:50:15.047171] W [glusterfsd.c:1514:cleanup_and_exit] >>>>>> (-->/usr/local/sbin/glusterd(glusterfs_volumes_init+0xc2) [0x409f52] >>>>>> -->/usr/local/sbin/glusterd(glusterfs_process_volfp+0x151) [0x409e41] >>>>>> -->/usr/local/sbin/glusterd(cleanup_and_exit+0x5f) [0x40942f] ) 0-: >>>>>> received signum (-1), shutting down >>>>>> >>>>>> >>>>>> 15. 
In other node running `volume status' still shows bricks node3 is >>>>>> live >>>>>> but 'peer status' showing node-3 disconnected >>>>>> >>>>>> @gfstst-node2:~$ sudo gluster v status >>>>>> Status of volume: gfs-tst >>>>>> Gluster process TCP Port RDMA Port >>>>>> Online Pid >>>>>> >>>>>> ------------------------------------------------------------------------------ >>>>>> Brick IP.2:/media/disk1/brick1 49152 0 Y >>>>>> 1517 >>>>>> Brick IP.4:/media/disk1/brick1 49152 0 Y >>>>>> 1668 >>>>>> Brick IP.2:/media/disk2/brick2 49153 0 Y >>>>>> 1522 >>>>>> Brick IP.4:/media/disk2/brick2 49153 0 Y >>>>>> 1678 >>>>>> Brick IP.2:/media/disk3/brick3 49154 0 Y >>>>>> 1527 >>>>>> Brick IP.4:/media/disk3/brick3 49154 0 Y >>>>>> 1677 >>>>>> Brick IP.2:/media/disk4/brick4 49155 0 Y >>>>>> 1541 >>>>>> Brick IP.4:/media/disk4/brick4 49155 0 Y >>>>>> 1683 >>>>>> Self-heal Daemon on localhost N/A N/A Y >>>>>> 2662 >>>>>> Self-heal Daemon on IP.4 N/A N/A Y >>>>>> 2786 >>>>>> >>>>>> Task Status of Volume gfs-tst >>>>>> >>>>>> ------------------------------------------------------------------------------ >>>>>> There are no active volume tasks >>>>>> >>>>>> >>>>>> root at gfstst-node2:~$ sudo gluster pool list >>>>>> UUID Hostname State >>>>>> d6bf51a7-c296-492f-8dac-e81efa9dd22d IP.3 Disconnected >>>>>> c1cbb58e-3ceb-4637-9ba3-3d28ef20b143 IP.4 Connected >>>>>> 0083ec0c-40bf-472a-a128-458924e56c96 localhost Connected >>>>>> >>>>>> root at gfstst-node2:~$ sudo gluster peer status >>>>>> Number of Peers: 2 >>>>>> >>>>>> Hostname: IP.3 >>>>>> Uuid: d6bf51a7-c296-492f-8dac-e81efa9dd22d >>>>>> State: Peer in Cluster (Disconnected) >>>>>> >>>>>> Hostname: IP.4 >>>>>> Uuid: c1cbb58e-3ceb-4637-9ba3-3d28ef20b143 >>>>>> State: Peer in Cluster (Connected) >>>>>> >>>>>> >>>>>> regards >>>>>> Amudhan >>>>>> _______________________________________________ >>>>>> Gluster-users mailing list >>>>>> Gluster-users at gluster.org >>>>>> https://lists.gluster.org/mailman/listinfo/gluster-users >>>>> >>>>> 
From amukherj at redhat.com Thu Jan 17 10:13:19 2019
From: amukherj at redhat.com (Atin Mukherjee)
Date: Thu, 17 Jan 2019 15:43:19 +0530
Subject: [Gluster-users] glusterfs 4.1.6 error in starting glusterd service
In-Reply-To: 
References: 
Message-ID: 

Can you please run 'glusterd -LDEBUG' and share the resulting glusterd.log?
To avoid too much back and forth, I suggest you also share the content of
/var/lib/glusterd from all the nodes. Please also mention on which
particular node the glusterd service is unable to come up.

On Thu, Jan 17, 2019 at 11:34 AM Amudhan P wrote:

> I have created the folder in the path as suggested, but the service still
> failed to start. Below is the error message in glusterd.log:
>
> [2019-01-16 14:50:14.555742] I [MSGID: 100030] [glusterfsd.c:2741:main]
> 0-/usr/local/sbin/glusterd: Started running /usr/local/sbin/glusterd
> version 4.1.6 (args: /usr/local/sbin/glusterd -p /var/run/glusterd.pid)
> [2019-01-16 14:50:14.559835] I [MSGID: 106478] [glusterd.c:1423:init]
> 0-management: Maximum allowed open file descriptors set to 65536
> [2019-01-16 14:50:14.559894] I [MSGID: 106479] [glusterd.c:1481:init]
> 0-management: Using /var/lib/glusterd as working directory
> [2019-01-16 14:50:14.559912] I [MSGID: 106479] [glusterd.c:1486:init]
> 0-management: Using /var/run/gluster as pid file working directory
> [2019-01-16 14:50:14.563834] W [MSGID: 103071]
> [rdma.c:4629:__gf_rdma_ctx_create] 0-rpc-transport/rdma: rdma_cm event
> channel creation failed [No such device]
> [2019-01-16 14:50:14.563867] W [MSGID: 103055] [rdma.c:4938:init]
> 0-rdma.management: Failed to initialize IB Device
> [2019-01-16 14:50:14.563882] W [rpc-transport.c:351:rpc_transport_load]
> 0-rpc-transport: 'rdma' initialization failed
> [2019-01-16 14:50:14.563957] W [rpcsvc.c:1781:rpcsvc_create_listener]
> 0-rpc-service: cannot create listener, initing the transport failed
> [2019-01-16 14:50:14.563974] E [MSGID:
106244] [glusterd.c:1764:init] > 0-management: creation of 1 listeners failed, continuing with succeeded > transport > [2019-01-16 14:50:15.565868] I [MSGID: 106513] > [glusterd-store.c:2240:glusterd_restore_op_version] 0-glusterd: retrieved > op-version: 40100 > [2019-01-16 14:50:15.642532] I [MSGID: 106544] > [glusterd.c:158:glusterd_uuid_init] 0-management: retrieved UUID: > d6bf51a7-c296-492f-8dac-e81efa9dd22d > [2019-01-16 14:50:15.675333] I [MSGID: 106498] > [glusterd-handler.c:3614:glusterd_friend_add_from_peerinfo] 0-management: > connect returned 0 > [2019-01-16 14:50:15.675421] W [MSGID: 106061] > [glusterd-handler.c:3408:glusterd_transport_inet_options_build] 0-glusterd: > Failed to get tcp-user-timeout > [2019-01-16 14:50:15.675451] I [rpc-clnt.c:1059:rpc_clnt_connection_init] > 0-management: setting frame-timeout to 600 > *[2019-01-16 14:50:15.676912] E [MSGID: 106187] > [glusterd-store.c:4662:glusterd_resolve_all_bricks] 0-glusterd: resolve > brick failed in restore* > *[2019-01-16 14:50:15.676956] E [MSGID: 101019] [xlator.c:720:xlator_init] > 0-management: Initialization of volume 'management' failed, review your > volfile again* > [2019-01-16 14:50:15.676973] E [MSGID: 101066] > [graph.c:367:glusterfs_graph_init] 0-management: initializing translator > failed > [2019-01-16 14:50:15.676986] E [MSGID: 101176] > [graph.c:738:glusterfs_graph_activate] 0-graph: init failed > [2019-01-16 14:50:15.677479] W [glusterfsd.c:1514:cleanup_and_exit] > (-->/usr/local/sbin/glusterd(glusterfs_volumes_init+0xc2) [0x409f52] > -->/usr/local/sbin/glusterd(glusterfs_process_volfp+0x151) [0x409e41] > -->/usr/local/sbin/glusterd(cleanup_and_exit+0x5f) [0x40942f] ) 0-: > received signum (-1), shutting down > > > On Thu, Jan 17, 2019 at 8:06 AM Atin Mukherjee > wrote: > >> If gluster volume info/status shows the brick to be /media/disk4/brick4 >> then you'd need to mount the same path and hence you'd need to create the >> brick4 directory explicitly. 
I fail to understand the rationale how only >> /media/disk4 can be used as the mount path for the brick. >> >> On Wed, Jan 16, 2019 at 5:24 PM Amudhan P wrote: >> >>> Yes, I did mount bricks but the folder 'brick4' was still not created >>> inside the brick. >>> Do I need to create this folder because when I run replace-brick it will >>> create folder inside the brick. I have seen this behavior before when >>> running replace-brick or heal begins. >>> >>> On Wed, Jan 16, 2019 at 5:05 PM Atin Mukherjee >>> wrote: >>> >>>> >>>> >>>> On Wed, Jan 16, 2019 at 5:02 PM Amudhan P wrote: >>>> >>>>> Atin, >>>>> I have copied the content of 'gfs-tst' from vol folder in another >>>>> node. when starting service again fails with error msg in glusterd.log file. >>>>> >>>>> [2019-01-15 20:16:59.513023] I [MSGID: 100030] >>>>> [glusterfsd.c:2741:main] 0-/usr/local/sbin/glusterd: Started running >>>>> /usr/local/sbin/glusterd version 4.1.6 (args: /usr/local/sbin/glusterd -p >>>>> /var/run/glusterd.pid) >>>>> [2019-01-15 20:16:59.517164] I [MSGID: 106478] [glusterd.c:1423:init] >>>>> 0-management: Maximum allowed open file descriptors set to 65536 >>>>> [2019-01-15 20:16:59.517264] I [MSGID: 106479] [glusterd.c:1481:init] >>>>> 0-management: Using /var/lib/glusterd as working directory >>>>> [2019-01-15 20:16:59.517283] I [MSGID: 106479] [glusterd.c:1486:init] >>>>> 0-management: Using /var/run/gluster as pid file working directory >>>>> [2019-01-15 20:16:59.521508] W [MSGID: 103071] >>>>> [rdma.c:4629:__gf_rdma_ctx_create] 0-rpc-transport/rdma: rdma_cm event >>>>> channel creation failed [No such device] >>>>> [2019-01-15 20:16:59.521544] W [MSGID: 103055] [rdma.c:4938:init] >>>>> 0-rdma.management: Failed to initialize IB Device >>>>> [2019-01-15 20:16:59.521562] W >>>>> [rpc-transport.c:351:rpc_transport_load] 0-rpc-transport: 'rdma' >>>>> initialization failed >>>>> [2019-01-15 20:16:59.521629] W [rpcsvc.c:1781:rpcsvc_create_listener] >>>>> 0-rpc-service: cannot create listener, 
initing the transport failed >>>>> [2019-01-15 20:16:59.521648] E [MSGID: 106244] [glusterd.c:1764:init] >>>>> 0-management: creation of 1 listeners failed, continuing with succeeded >>>>> transport >>>>> [2019-01-15 20:17:00.529390] I [MSGID: 106513] >>>>> [glusterd-store.c:2240:glusterd_restore_op_version] 0-glusterd: retrieved >>>>> op-version: 40100 >>>>> [2019-01-15 20:17:00.608354] I [MSGID: 106544] >>>>> [glusterd.c:158:glusterd_uuid_init] 0-management: retrieved UUID: >>>>> d6bf51a7-c296-492f-8dac-e81efa9dd22d >>>>> [2019-01-15 20:17:00.650911] W [MSGID: 106425] >>>>> [glusterd-store.c:2643:glusterd_store_retrieve_bricks] 0-management: failed >>>>> to get statfs() call on brick /media/disk4/brick4 [No such file or >>>>> directory] >>>>> >>>> >>>> This means that underlying brick /media/disk4/brick4 doesn't exist. You >>>> already mentioned that you had replaced the faulty disk, but have you not >>>> mounted it yet? >>>> >>>> >>>>> [2019-01-15 20:17:00.691240] I [MSGID: 106498] >>>>> [glusterd-handler.c:3614:glusterd_friend_add_from_peerinfo] 0-management: >>>>> connect returned 0 >>>>> [2019-01-15 20:17:00.691307] W [MSGID: 106061] >>>>> [glusterd-handler.c:3408:glusterd_transport_inet_options_build] 0-glusterd: >>>>> Failed to get tcp-user-timeout >>>>> [2019-01-15 20:17:00.691331] I >>>>> [rpc-clnt.c:1059:rpc_clnt_connection_init] 0-management: setting >>>>> frame-timeout to 600 >>>>> [2019-01-15 20:17:00.692547] E [MSGID: 106187] >>>>> [glusterd-store.c:4662:glusterd_resolve_all_bricks] 0-glusterd: resolve >>>>> brick failed in restore >>>>> [2019-01-15 20:17:00.692582] E [MSGID: 101019] >>>>> [xlator.c:720:xlator_init] 0-management: Initialization of volume >>>>> 'management' failed, review your volfile again >>>>> [2019-01-15 20:17:00.692597] E [MSGID: 101066] >>>>> [graph.c:367:glusterfs_graph_init] 0-management: initializing translator >>>>> failed >>>>> [2019-01-15 20:17:00.692607] E [MSGID: 101176] >>>>> [graph.c:738:glusterfs_graph_activate] 
0-graph: init failed >>>>> [2019-01-15 20:17:00.693004] W [glusterfsd.c:1514:cleanup_and_exit] >>>>> (-->/usr/local/sbin/glusterd(glusterfs_volumes_init+0xc2) [0x409f52] >>>>> -->/usr/local/sbin/glusterd(glusterfs_process_volfp+0x151) [0x409e41] >>>>> -->/usr/local/sbin/glusterd(cleanup_and_exit+0x5f) [0x40942f] ) 0-: >>>>> received signum (-1), shutting down >>>>> >>>>> >>>>> On Wed, Jan 16, 2019 at 4:34 PM Atin Mukherjee >>>>> wrote: >>>>> >>>>>> This is a case of partial write of a transaction and as the host ran >>>>>> out of space for the root partition where all the glusterd related >>>>>> configurations are persisted, the transaction couldn't be written and hence >>>>>> the new (replaced) brick's information wasn't persisted in the >>>>>> configuration. The workaround for this is to copy the content of >>>>>> /var/lib/glusterd/vols/gfs-tst/ from one of the nodes in the trusted >>>>>> storage pool to the node where glusterd service fails to come up and post >>>>>> that restarting the glusterd service should be able to make peer status >>>>>> reporting all nodes healthy and connected. >>>>>> >>>>>> On Wed, Jan 16, 2019 at 3:49 PM Amudhan P >>>>>> wrote: >>>>>> >>>>>>> Hi, >>>>>>> >>>>>>> In short, when I started glusterd service I am getting following >>>>>>> error msg in the glusterd.log file in one server. >>>>>>> what needs to be done? 
>>>>>>> >>>>>>> error logged in glusterd.log >>>>>>> >>>>>>> [2019-01-15 17:50:13.956053] I [MSGID: 100030] >>>>>>> [glusterfsd.c:2741:main] 0-/usr/local/sbin/glusterd: Started running >>>>>>> /usr/local/sbin/glusterd version 4.1.6 (args: /usr/local/sbin/glusterd -p >>>>>>> /var/run/glusterd.pid) >>>>>>> [2019-01-15 17:50:13.960131] I [MSGID: 106478] >>>>>>> [glusterd.c:1423:init] 0-management: Maximum allowed open file descriptors >>>>>>> set to 65536 >>>>>>> [2019-01-15 17:50:13.960193] I [MSGID: 106479] >>>>>>> [glusterd.c:1481:init] 0-management: Using /var/lib/glusterd as working >>>>>>> directory >>>>>>> [2019-01-15 17:50:13.960212] I [MSGID: 106479] >>>>>>> [glusterd.c:1486:init] 0-management: Using /var/run/gluster as pid file >>>>>>> working directory >>>>>>> [2019-01-15 17:50:13.964437] W [MSGID: 103071] >>>>>>> [rdma.c:4629:__gf_rdma_ctx_create] 0-rpc-transport/rdma: rdma_cm event >>>>>>> channel creation failed [No such device] >>>>>>> [2019-01-15 17:50:13.964474] W [MSGID: 103055] [rdma.c:4938:init] >>>>>>> 0-rdma.management: Failed to initialize IB Device >>>>>>> [2019-01-15 17:50:13.964491] W >>>>>>> [rpc-transport.c:351:rpc_transport_load] 0-rpc-transport: 'rdma' >>>>>>> initialization failed >>>>>>> [2019-01-15 17:50:13.964560] W >>>>>>> [rpcsvc.c:1781:rpcsvc_create_listener] 0-rpc-service: cannot create >>>>>>> listener, initing the transport failed >>>>>>> [2019-01-15 17:50:13.964579] E [MSGID: 106244] >>>>>>> [glusterd.c:1764:init] 0-management: creation of 1 listeners failed, >>>>>>> continuing with succeeded transport >>>>>>> [2019-01-15 17:50:14.967681] I [MSGID: 106513] >>>>>>> [glusterd-store.c:2240:glusterd_restore_op_version] 0-glusterd: retrieved >>>>>>> op-version: 40100 >>>>>>> [2019-01-15 17:50:14.973931] I [MSGID: 106544] >>>>>>> [glusterd.c:158:glusterd_uuid_init] 0-management: retrieved UUID: >>>>>>> d6bf51a7-c296-492f-8dac-e81efa9dd22d >>>>>>> [2019-01-15 17:50:15.046620] E [MSGID: 101032] >>>>>>> 
[store.c:441:gf_store_handle_retrieve] 0-: Path corresponding to >>>>>>> /var/lib/glusterd/vols/gfs-tst/bricks/IP.3:-media-disk3-brick3. [No such >>>>>>> file or directory] >>>>>>> [2019-01-15 17:50:15.046685] E [MSGID: 106201] >>>>>>> [glusterd-store.c:3384:glusterd_store_retrieve_volumes] 0-management: >>>>>>> Unable to restore volume: gfs-tst >>>>>>> [2019-01-15 17:50:15.046718] E [MSGID: 101019] >>>>>>> [xlator.c:720:xlator_init] 0-management: Initialization of volume >>>>>>> 'management' failed, review your volfile again >>>>>>> [2019-01-15 17:50:15.046732] E [MSGID: 101066] >>>>>>> [graph.c:367:glusterfs_graph_init] 0-management: initializing translator >>>>>>> failed >>>>>>> [2019-01-15 17:50:15.046741] E [MSGID: 101176] >>>>>>> [graph.c:738:glusterfs_graph_activate] 0-graph: init failed >>>>>>> [2019-01-15 17:50:15.047171] W [glusterfsd.c:1514:cleanup_and_exit] >>>>>>> (-->/usr/local/sbin/glusterd(glusterfs_volumes >>>>>>> >>>>>>> >>>>>>> >>>>>>> In long, I am trying to simulate a situation. where volume stoped >>>>>>> abnormally and >>>>>>> entire cluster restarted with some missing disks. >>>>>>> >>>>>>> My test cluster is set up with 3 nodes and each has four disks, I >>>>>>> have setup a volume with disperse 4+2. >>>>>>> In Node-3 2 disks have failed, to replace I have shutdown all system >>>>>>> >>>>>>> below are the steps done. >>>>>>> >>>>>>> 1. umount from client machine >>>>>>> 2. shutdown all system by running `shutdown -h now` command ( >>>>>>> without stopping volume and stop service) >>>>>>> 3. replace faulty disk in Node-3 >>>>>>> 4. powered ON all system >>>>>>> 5. format replaced drives, and mount all drives >>>>>>> 6. start glusterd service in all node (success) >>>>>>> 7. Now running `voulume status` command from node-3 >>>>>>> output : [2019-01-15 16:52:17.718422] : v status : FAILED : Staging >>>>>>> failed on 0083ec0c-40bf-472a-a128-458924e56c96. Please check log file for >>>>>>> details. >>>>>>> 8. 
running `voulume start gfs-tst` command from node-3 >>>>>>> output : [2019-01-15 16:53:19.410252] : v start gfs-tst : FAILED : >>>>>>> Volume gfs-tst already started >>>>>>> >>>>>>> 9. running `gluster v status` in other node. showing all brick >>>>>>> available but 'self-heal daemon' not running >>>>>>> @gfstst-node2:~$ sudo gluster v status >>>>>>> Status of volume: gfs-tst >>>>>>> Gluster process TCP Port RDMA Port >>>>>>> Online Pid >>>>>>> >>>>>>> ------------------------------------------------------------------------------ >>>>>>> Brick IP.2:/media/disk1/brick1 49152 0 Y >>>>>>> 1517 >>>>>>> Brick IP.4:/media/disk1/brick1 49152 0 Y >>>>>>> 1668 >>>>>>> Brick IP.2:/media/disk2/brick2 49153 0 Y >>>>>>> 1522 >>>>>>> Brick IP.4:/media/disk2/brick2 49153 0 Y >>>>>>> 1678 >>>>>>> Brick IP.2:/media/disk3/brick3 49154 0 Y >>>>>>> 1527 >>>>>>> Brick IP.4:/media/disk3/brick3 49154 0 Y >>>>>>> 1677 >>>>>>> Brick IP.2:/media/disk4/brick4 49155 0 Y >>>>>>> 1541 >>>>>>> Brick IP.4:/media/disk4/brick4 49155 0 Y >>>>>>> 1683 >>>>>>> Self-heal Daemon on localhost N/A N/A Y >>>>>>> 2662 >>>>>>> Self-heal Daemon on IP.4 N/A N/A Y >>>>>>> 2786 >>>>>>> >>>>>>> 10. in the above output 'volume already started'. so, running >>>>>>> `reset-brick` command >>>>>>> v reset-brick gfs-tst IP.3:/media/disk3/brick3 >>>>>>> IP.3:/media/disk3/brick3 commit force >>>>>>> >>>>>>> output : [2019-01-15 16:57:37.916942] : v reset-brick gfs-tst >>>>>>> IP.3:/media/disk3/brick3 IP.3:/media/disk3/brick3 commit force : FAILED : >>>>>>> /media/disk3/brick3 is already part of a volume >>>>>>> >>>>>>> 11. reset-brick command was not working, so, tried stopping volume >>>>>>> and start with force command >>>>>>> output : [2019-01-15 17:01:04.570794] : v start gfs-tst force : >>>>>>> FAILED : Pre-validation failed on localhost. Please check log file for >>>>>>> details >>>>>>> >>>>>>> 12. now stopped service in all node and tried starting again. 
except >>>>>>> node-3 other nodes service started successfully without any issues. >>>>>>> >>>>>>> in node-3 receiving following message. >>>>>>> >>>>>>> sudo service glusterd start >>>>>>> * Starting glusterd service glusterd >>>>>>> >>>>>>> [fail] >>>>>>> /usr/local/sbin/glusterd: option requires an argument -- 'f' >>>>>>> Try `glusterd --help' or `glusterd --usage' for more information. >>>>>>> >>>>>>> 13. checking glusterd log file found that OS drive was running out >>>>>>> of space >>>>>>> output : [2019-01-15 16:51:37.210792] W [MSGID: 101012] >>>>>>> [store.c:372:gf_store_save_value] 0-management: fflush failed. [No space >>>>>>> left on device] >>>>>>> [2019-01-15 16:51:37.210874] E [MSGID: 106190] >>>>>>> [glusterd-store.c:1058:glusterd_volume_exclude_options_write] 0-management: >>>>>>> Unable to write volume values for gfs-tst >>>>>>> >>>>>>> 14. cleared some space in OS drive but still, service is not >>>>>>> running. below is the error logged in glusterd.log >>>>>>> >>>>>>> [2019-01-15 17:50:13.956053] I [MSGID: 100030] >>>>>>> [glusterfsd.c:2741:main] 0-/usr/local/sbin/glusterd: Started running >>>>>>> /usr/local/sbin/glusterd version 4.1.6 (args: /usr/local/sbin/glusterd -p >>>>>>> /var/run/glusterd.pid) >>>>>>> [2019-01-15 17:50:13.960131] I [MSGID: 106478] >>>>>>> [glusterd.c:1423:init] 0-management: Maximum allowed open file descriptors >>>>>>> set to 65536 >>>>>>> [2019-01-15 17:50:13.960193] I [MSGID: 106479] >>>>>>> [glusterd.c:1481:init] 0-management: Using /var/lib/glusterd as working >>>>>>> directory >>>>>>> [2019-01-15 17:50:13.960212] I [MSGID: 106479] >>>>>>> [glusterd.c:1486:init] 0-management: Using /var/run/gluster as pid file >>>>>>> working directory >>>>>>> [2019-01-15 17:50:13.964437] W [MSGID: 103071] >>>>>>> [rdma.c:4629:__gf_rdma_ctx_create] 0-rpc-transport/rdma: rdma_cm event >>>>>>> channel creation failed [No such device] >>>>>>> [2019-01-15 17:50:13.964474] W [MSGID: 103055] [rdma.c:4938:init] >>>>>>> 0-rdma.management: 
Failed to initialize IB Device >>>>>>> [2019-01-15 17:50:13.964491] W >>>>>>> [rpc-transport.c:351:rpc_transport_load] 0-rpc-transport: 'rdma' >>>>>>> initialization failed >>>>>>> [2019-01-15 17:50:13.964560] W >>>>>>> [rpcsvc.c:1781:rpcsvc_create_listener] 0-rpc-service: cannot create >>>>>>> listener, initing the transport failed >>>>>>> [2019-01-15 17:50:13.964579] E [MSGID: 106244] >>>>>>> [glusterd.c:1764:init] 0-management: creation of 1 listeners failed, >>>>>>> continuing with succeeded transport >>>>>>> [2019-01-15 17:50:14.967681] I [MSGID: 106513] >>>>>>> [glusterd-store.c:2240:glusterd_restore_op_version] 0-glusterd: retrieved >>>>>>> op-version: 40100 >>>>>>> [2019-01-15 17:50:14.973931] I [MSGID: 106544] >>>>>>> [glusterd.c:158:glusterd_uuid_init] 0-management: retrieved UUID: >>>>>>> d6bf51a7-c296-492f-8dac-e81efa9dd22d >>>>>>> [2019-01-15 17:50:15.046620] E [MSGID: 101032] >>>>>>> [store.c:441:gf_store_handle_retrieve] 0-: Path corresponding to >>>>>>> /var/lib/glusterd/vols/gfs-tst/bricks/IP.3:-media-disk3-brick3. 
[No such >>>>>>> file or directory] >>>>>>> [2019-01-15 17:50:15.046685] E [MSGID: 106201] >>>>>>> [glusterd-store.c:3384:glusterd_store_retrieve_volumes] 0-management: >>>>>>> Unable to restore volume: gfs-tst >>>>>>> [2019-01-15 17:50:15.046718] E [MSGID: 101019] >>>>>>> [xlator.c:720:xlator_init] 0-management: Initialization of volume >>>>>>> 'management' failed, review your volfile again >>>>>>> [2019-01-15 17:50:15.046732] E [MSGID: 101066] >>>>>>> [graph.c:367:glusterfs_graph_init] 0-management: initializing translator >>>>>>> failed >>>>>>> [2019-01-15 17:50:15.046741] E [MSGID: 101176] >>>>>>> [graph.c:738:glusterfs_graph_activate] 0-graph: init failed >>>>>>> [2019-01-15 17:50:15.047171] W [glusterfsd.c:1514:cleanup_and_exit] >>>>>>> (-->/usr/local/sbin/glusterd(glusterfs_volumes_init+0xc2) [0x409f52] >>>>>>> -->/usr/local/sbin/glusterd(glusterfs_process_volfp+0x151) [0x409e41] >>>>>>> -->/usr/local/sbin/glusterd(cleanup_and_exit+0x5f) [0x40942f] ) 0-: >>>>>>> received signum (-1), shutting down >>>>>>> >>>>>>> >>>>>>> 15. 
In other node running `volume status' still shows bricks node3 >>>>>>> is live >>>>>>> but 'peer status' showing node-3 disconnected >>>>>>> >>>>>>> @gfstst-node2:~$ sudo gluster v status >>>>>>> Status of volume: gfs-tst >>>>>>> Gluster process TCP Port RDMA Port >>>>>>> Online Pid >>>>>>> >>>>>>> ------------------------------------------------------------------------------ >>>>>>> Brick IP.2:/media/disk1/brick1 49152 0 Y >>>>>>> 1517 >>>>>>> Brick IP.4:/media/disk1/brick1 49152 0 Y >>>>>>> 1668 >>>>>>> Brick IP.2:/media/disk2/brick2 49153 0 Y >>>>>>> 1522 >>>>>>> Brick IP.4:/media/disk2/brick2 49153 0 Y >>>>>>> 1678 >>>>>>> Brick IP.2:/media/disk3/brick3 49154 0 Y >>>>>>> 1527 >>>>>>> Brick IP.4:/media/disk3/brick3 49154 0 Y >>>>>>> 1677 >>>>>>> Brick IP.2:/media/disk4/brick4 49155 0 Y >>>>>>> 1541 >>>>>>> Brick IP.4:/media/disk4/brick4 49155 0 Y >>>>>>> 1683 >>>>>>> Self-heal Daemon on localhost N/A N/A Y >>>>>>> 2662 >>>>>>> Self-heal Daemon on IP.4 N/A N/A Y >>>>>>> 2786 >>>>>>> >>>>>>> Task Status of Volume gfs-tst >>>>>>> >>>>>>> ------------------------------------------------------------------------------ >>>>>>> There are no active volume tasks >>>>>>> >>>>>>> >>>>>>> root at gfstst-node2:~$ sudo gluster pool list >>>>>>> UUID Hostname State >>>>>>> d6bf51a7-c296-492f-8dac-e81efa9dd22d IP.3 Disconnected >>>>>>> c1cbb58e-3ceb-4637-9ba3-3d28ef20b143 IP.4 Connected >>>>>>> 0083ec0c-40bf-472a-a128-458924e56c96 localhost Connected >>>>>>> >>>>>>> root at gfstst-node2:~$ sudo gluster peer status >>>>>>> Number of Peers: 2 >>>>>>> >>>>>>> Hostname: IP.3 >>>>>>> Uuid: d6bf51a7-c296-492f-8dac-e81efa9dd22d >>>>>>> State: Peer in Cluster (Disconnected) >>>>>>> >>>>>>> Hostname: IP.4 >>>>>>> Uuid: c1cbb58e-3ceb-4637-9ba3-3d28ef20b143 >>>>>>> State: Peer in Cluster (Connected) >>>>>>> >>>>>>> >>>>>>> regards >>>>>>> Amudhan >>>>>>> _______________________________________________ >>>>>>> Gluster-users mailing list >>>>>>> Gluster-users at gluster.org >>>>>>> 
https://lists.gluster.org/mailman/listinfo/gluster-users >>>>>> >>>>>> -------------- next part -------------- An HTML attachment was scrubbed... URL: From mwaymack at nsgdv.com Thu Jan 17 19:58:23 2019 From: mwaymack at nsgdv.com (Matt Waymack) Date: Thu, 17 Jan 2019 19:58:23 +0000 Subject: [Gluster-users] Unable to create new files or folders using samba and vfs_glusterfs In-Reply-To: References: <4b12886d1d03eeac4e113d5b218525297f1f8b14.camel@cryptolab.net> <4a648d22-ccad-44c7-b1c4-1b1abdc6bc69@email.android.com> Message-ID: I've been using these for a few weeks now without any issues, thank you! -----Original Message----- From: gluster-users-bounces at gluster.org On Behalf Of Matt Waymack Sent: Thursday, December 27, 2018 10:56 AM To: Diego Remolina Cc: gluster-users at gluster.org List Subject: Re: [Gluster-users] Unable to create new files or folders using samba and vfs_glusterfs OK, I'm back from the holiday and updated using the following packages: libsmbclient-4.8.3-4.el7.0.1.x86_64.rpm libwbclient-4.8.3-4.el7.0.1.x86_64.rpm samba-4.8.3-4.el7.0.1.x86_64.rpm samba-client-4.8.3-4.el7.0.1.x86_64.rpm samba-client-libs-4.8.3-4.el7.0.1.x86_64.rpm samba-common-4.8.3-4.el7.0.1.noarch.rpm samba-common-libs-4.8.3-4.el7.0.1.x86_64.rpm samba-common-tools-4.8.3-4.el7.0.1.x86_64.rpm samba-libs-4.8.3-4.el7.0.1.x86_64.rpm samba-vfs-glusterfs-4.8.3-4.el7.0.1.x86_64.rpm First impressions are good! We're able to create files/folders. I'll keep you updated with stability. Thank you! -----Original Message----- From: Diego Remolina Sent: Thursday, December 20, 2018 1:36 PM To: Matt Waymack Cc: gluster-users at gluster.org List Subject: Re: [Gluster-users] Unable to create new files or folders using samba and vfs_glusterfs Hi Matt, The update is slightly different, has the .1 at the end: Fast-track -> samba-4.8.3-4.el7.0.1.x86_64.rpm vs general -> samba-4.8.3-4.el7.x86_64 I think these are built, but not pushed to fasttrack repo until they get feedback the packages are good. 
So you may need to use wget to download them and update your packages with these for the test. Diego On Thu, Dec 20, 2018 at 1:06 PM Matt Waymack wrote: > > Hi all, > > > > I'm looking to update Samba from fasttrack, but I only still see 4.8.3 and yum is not wanting to update. The test build is also showing 4.8.3. > > > > Thank you! > > > > > > From: gluster-users-bounces at gluster.org > On Behalf Of Matt Waymack > Sent: Sunday, December 16, 2018 1:55 PM > To: Diego Remolina > Cc: gluster-users at gluster.org List > Subject: Re: [Gluster-users] Unable to create new files or folders > using samba and vfs_glusterfs > > > > Hi all, sorry for the delayed response. > > > > I can test this out and will report back. It may be as late as Tuesday before I can test the build. > > > > Thank you! > > > > On Dec 15, 2018 7:46 AM, Diego Remolina wrote: > > Matt, > > > > Can you test the updated samba packages that the CentOS team has built for FasTrack? > > > > A NOTE has been added to this issue. > > ---------------------------------------------------------------------- > (0033351) pgreco (developer) - 2018-12-15 13:43 > https://bugs.centos.org/view.php?id=15586#c33351 > ---------------------------------------------------------------------- > @dijuremo at gmail.com > Here's the link for the test build > https://buildlogs.centos.org/c7-fasttrack.x86_64/samba/20181214164659/ > 4.8.3-4.el7.0.1.x86_64/ > . > Please let us know how it goes. Thanks for testing! > Pablo. > ---------------------------------------------------------------------- > > Diego > > > > > On Fri, Dec 14, 2018 at 12:52 AM Anoop C S wrote: > > > > On Thu, 2018-12-13 at 15:31 +0000, Matt Waymack wrote: > > > Hi all, > > > > > > I'm having an issue on Windows clients accessing shares via smb > > > when using vfs_glusterfs. They are unable to create any file or > > > folders at the root of the share and get the error "The file is > > > too large for the destination file system." 
When I change from > > > vfs_glusterfs to just using a filesystem path to the same > > > location, it works fine (except for the performance hit). All my > > > searches have led to bug 1619108, and that seems to be the > > > symptom, but there doesn't appear to be any clear resolution. > > > > You figured out the right bug and following is the upstream Samba bug: > > > > https://bugzilla.samba.org/show_bug.cgi?id=13585 > > > > Unfortunately it is only available with v4.8.6 and higher. If > > required I can patch it up and provide a build. > > > > > I'm on the latest version of samba available on CentOS 7 (4.8.3) > > > and I'm on the latest available glusterfs 4.1 (4.1.6). Is there > > > something simple I'm missing to get this going? > > > > > > Thank you! > > > _______________________________________________ > > > Gluster-users mailing list > > > Gluster-users at gluster.org > > > https://lists.gluster.org/mailman/listinfo/gluster-users > > > > _______________________________________________ > > Gluster-users mailing list > > Gluster-users at gluster.org > > https://lists.gluster.org/mailman/listinfo/gluster-users _______________________________________________ Gluster-users mailing list Gluster-users at gluster.org https://lists.gluster.org/mailman/listinfo/gluster-users From archon810 at gmail.com Thu Jan 17 20:18:03 2019 From: archon810 at gmail.com (Artem Russakovskii) Date: Thu, 17 Jan 2019 12:18:03 -0800 Subject: [Gluster-users] To good to be truth speed improvements? Message-ID: When we first started with glusterfs and version 3 last year, we also had a ton of performance issues, especially with small files. I've made several reports at the time, hopefully some of them helped. However, at some point, possibly after updating to v4 (currently using 4.0.2), the performance issues went away. Poof. I imagine when we upgrade further, performance may improve even more. When we were running v3, I was desperately considering alternatives/competitors. 
Not anymore, gluster has performed perfectly and without a single durability issue for almost a year now. Sincerely, Artem -- Founder, Android Police, APK Mirror, Illogical Robot LLC beerpla.net | +ArtemRussakovskii | @ArtemR From mauro.tridici at cmcc.it Fri Jan 18 08:54:04 2019 From: mauro.tridici at cmcc.it (Mauro Tridici) Date: Fri, 18 Jan 2019 09:54:04 +0100 Subject: [Gluster-users] invisible files in some directory Message-ID: <96B07283-D8AB-4F06-909D-E00424625528@cmcc.it> Dear Users, I'm facing with a new problem on our gluster volume (v. 3.12.14). Sometime it happen that "ls" command execution, in a specified directory, return empty output. "ls" command output is empty, but I know that the involved directory contains some files and subdirectories. In fact, if I try to execute "ls" command against a specified file (in the same folder) I can see that the file is there. In a few words: "ls" command output executed in a particular folder is empty; "ls filename" command output executed in the same folder is ok. There is something I can do in order to identify the cause of this issue? You can find below some information about the volume. 
Thank you in advance, Mauro Tridici [root at s01 ~]# gluster volume info Volume Name: tier2 Type: Distributed-Disperse Volume ID: a28d88c5-3295-4e35-98d4-210b3af9358c Status: Started Snapshot Count: 0 Number of Bricks: 12 x (4 + 2) = 72 Transport-type: tcp Bricks: Brick1: s01-stg:/gluster/mnt1/brick Brick2: s02-stg:/gluster/mnt1/brick Brick3: s03-stg:/gluster/mnt1/brick Brick4: s01-stg:/gluster/mnt2/brick Brick5: s02-stg:/gluster/mnt2/brick Brick6: s03-stg:/gluster/mnt2/brick Brick7: s01-stg:/gluster/mnt3/brick Brick8: s02-stg:/gluster/mnt3/brick Brick9: s03-stg:/gluster/mnt3/brick Brick10: s01-stg:/gluster/mnt4/brick Brick11: s02-stg:/gluster/mnt4/brick Brick12: s03-stg:/gluster/mnt4/brick Brick13: s01-stg:/gluster/mnt5/brick Brick14: s02-stg:/gluster/mnt5/brick Brick15: s03-stg:/gluster/mnt5/brick Brick16: s01-stg:/gluster/mnt6/brick Brick17: s02-stg:/gluster/mnt6/brick Brick18: s03-stg:/gluster/mnt6/brick Brick19: s01-stg:/gluster/mnt7/brick Brick20: s02-stg:/gluster/mnt7/brick Brick21: s03-stg:/gluster/mnt7/brick Brick22: s01-stg:/gluster/mnt8/brick Brick23: s02-stg:/gluster/mnt8/brick Brick24: s03-stg:/gluster/mnt8/brick Brick25: s01-stg:/gluster/mnt9/brick Brick26: s02-stg:/gluster/mnt9/brick Brick27: s03-stg:/gluster/mnt9/brick Brick28: s01-stg:/gluster/mnt10/brick Brick29: s02-stg:/gluster/mnt10/brick Brick30: s03-stg:/gluster/mnt10/brick Brick31: s01-stg:/gluster/mnt11/brick Brick32: s02-stg:/gluster/mnt11/brick Brick33: s03-stg:/gluster/mnt11/brick Brick34: s01-stg:/gluster/mnt12/brick Brick35: s02-stg:/gluster/mnt12/brick Brick36: s03-stg:/gluster/mnt12/brick Brick37: s04-stg:/gluster/mnt1/brick Brick38: s05-stg:/gluster/mnt1/brick Brick39: s06-stg:/gluster/mnt1/brick Brick40: s04-stg:/gluster/mnt2/brick Brick41: s05-stg:/gluster/mnt2/brick Brick42: s06-stg:/gluster/mnt2/brick Brick43: s04-stg:/gluster/mnt3/brick Brick44: s05-stg:/gluster/mnt3/brick Brick45: s06-stg:/gluster/mnt3/brick Brick46: s04-stg:/gluster/mnt4/brick Brick47: 
s05-stg:/gluster/mnt4/brick Brick48: s06-stg:/gluster/mnt4/brick Brick49: s04-stg:/gluster/mnt5/brick Brick50: s05-stg:/gluster/mnt5/brick Brick51: s06-stg:/gluster/mnt5/brick Brick52: s04-stg:/gluster/mnt6/brick Brick53: s05-stg:/gluster/mnt6/brick Brick54: s06-stg:/gluster/mnt6/brick Brick55: s04-stg:/gluster/mnt7/brick Brick56: s05-stg:/gluster/mnt7/brick Brick57: s06-stg:/gluster/mnt7/brick Brick58: s04-stg:/gluster/mnt8/brick Brick59: s05-stg:/gluster/mnt8/brick Brick60: s06-stg:/gluster/mnt8/brick Brick61: s04-stg:/gluster/mnt9/brick Brick62: s05-stg:/gluster/mnt9/brick Brick63: s06-stg:/gluster/mnt9/brick Brick64: s04-stg:/gluster/mnt10/brick Brick65: s05-stg:/gluster/mnt10/brick Brick66: s06-stg:/gluster/mnt10/brick Brick67: s04-stg:/gluster/mnt11/brick Brick68: s05-stg:/gluster/mnt11/brick Brick69: s06-stg:/gluster/mnt11/brick Brick70: s04-stg:/gluster/mnt12/brick Brick71: s05-stg:/gluster/mnt12/brick Brick72: s06-stg:/gluster/mnt12/brick Options Reconfigured: disperse.eager-lock: off diagnostics.count-fop-hits: on diagnostics.latency-measurement: on cluster.server-quorum-type: server features.default-soft-limit: 90 features.quota-deem-statfs: on performance.io-thread-count: 16 disperse.cpu-extensions: auto performance.io-cache: off network.inode-lru-limit: 50000 performance.md-cache-timeout: 600 performance.cache-invalidation: on performance.stat-prefetch: on features.cache-invalidation-timeout: 600 features.cache-invalidation: on cluster.readdir-optimize: on performance.parallel-readdir: off performance.readdir-ahead: on cluster.lookup-optimize: on client.event-threads: 4 server.event-threads: 4 nfs.disable: on transport.address-family: inet cluster.quorum-type: none cluster.min-free-disk: 10 performance.client-io-threads: on features.quota: on features.inode-quota: on features.bitrot: on features.scrub: Active network.ping-timeout: 0 cluster.brick-multiplex: off cluster.server-quorum-ratio: 51 -------------- next part -------------- An HTML attachment 
was scrubbed... URL: From nbalacha at redhat.com Fri Jan 18 09:14:25 2019 From: nbalacha at redhat.com (Nithya Balachandran) Date: Fri, 18 Jan 2019 14:44:25 +0530 Subject: [Gluster-users] invisible files in some directory In-Reply-To: <96B07283-D8AB-4F06-909D-E00424625528@cmcc.it> References: <96B07283-D8AB-4F06-909D-E00424625528@cmcc.it> Message-ID: On Fri, 18 Jan 2019 at 14:25, Mauro Tridici wrote: > Dear Users, > > I'm facing with a new problem on our gluster volume (v. 3.12.14). > Sometime it happen that "ls" command execution, in a specified directory, > return empty output. > "ls" command output is empty, but I know that the involved directory > contains some files and subdirectories. > In fact, if I try to execute "ls" command against a specified file (in the > same folder) I can see that the file is there. > > In a few words: > > "ls" command output executed in a particular folder is empty; > "ls filename" command output executed in the same folder is ok. > > There is something I can do in order to identify the cause of this issue? > > Yes, please take a tcpdump of the client when running the ls on the problematic directory and send it across. tcpdump -i any -s 0 -w /var/tmp/dirls.pcap tcp and not port 22 We have seen such issues when the gfid handle for the directory is missing on the bricks. Regards, Nithya > You can find below some information about the volume. 
> Thank you in advance, > Mauro Tridici > > [root at s01 ~]# gluster volume info > > > Volume Name: tier2 > Type: Distributed-Disperse > Volume ID: a28d88c5-3295-4e35-98d4-210b3af9358c > Status: Started > Snapshot Count: 0 > Number of Bricks: 12 x (4 + 2) = 72 > Transport-type: tcp > Bricks: > Brick1: s01-stg:/gluster/mnt1/brick > Brick2: s02-stg:/gluster/mnt1/brick > Brick3: s03-stg:/gluster/mnt1/brick > Brick4: s01-stg:/gluster/mnt2/brick > Brick5: s02-stg:/gluster/mnt2/brick > Brick6: s03-stg:/gluster/mnt2/brick > Brick7: s01-stg:/gluster/mnt3/brick > Brick8: s02-stg:/gluster/mnt3/brick > Brick9: s03-stg:/gluster/mnt3/brick > Brick10: s01-stg:/gluster/mnt4/brick > Brick11: s02-stg:/gluster/mnt4/brick > Brick12: s03-stg:/gluster/mnt4/brick > Brick13: s01-stg:/gluster/mnt5/brick > Brick14: s02-stg:/gluster/mnt5/brick > Brick15: s03-stg:/gluster/mnt5/brick > Brick16: s01-stg:/gluster/mnt6/brick > Brick17: s02-stg:/gluster/mnt6/brick > Brick18: s03-stg:/gluster/mnt6/brick > Brick19: s01-stg:/gluster/mnt7/brick > Brick20: s02-stg:/gluster/mnt7/brick > Brick21: s03-stg:/gluster/mnt7/brick > Brick22: s01-stg:/gluster/mnt8/brick > Brick23: s02-stg:/gluster/mnt8/brick > Brick24: s03-stg:/gluster/mnt8/brick > Brick25: s01-stg:/gluster/mnt9/brick > Brick26: s02-stg:/gluster/mnt9/brick > Brick27: s03-stg:/gluster/mnt9/brick > Brick28: s01-stg:/gluster/mnt10/brick > Brick29: s02-stg:/gluster/mnt10/brick > Brick30: s03-stg:/gluster/mnt10/brick > Brick31: s01-stg:/gluster/mnt11/brick > Brick32: s02-stg:/gluster/mnt11/brick > Brick33: s03-stg:/gluster/mnt11/brick > Brick34: s01-stg:/gluster/mnt12/brick > Brick35: s02-stg:/gluster/mnt12/brick > Brick36: s03-stg:/gluster/mnt12/brick > Brick37: s04-stg:/gluster/mnt1/brick > Brick38: s05-stg:/gluster/mnt1/brick > Brick39: s06-stg:/gluster/mnt1/brick > Brick40: s04-stg:/gluster/mnt2/brick > Brick41: s05-stg:/gluster/mnt2/brick > Brick42: s06-stg:/gluster/mnt2/brick > Brick43: s04-stg:/gluster/mnt3/brick > Brick44: 
s05-stg:/gluster/mnt3/brick > Brick45: s06-stg:/gluster/mnt3/brick > Brick46: s04-stg:/gluster/mnt4/brick > Brick47: s05-stg:/gluster/mnt4/brick > Brick48: s06-stg:/gluster/mnt4/brick > Brick49: s04-stg:/gluster/mnt5/brick > Brick50: s05-stg:/gluster/mnt5/brick > Brick51: s06-stg:/gluster/mnt5/brick > Brick52: s04-stg:/gluster/mnt6/brick > Brick53: s05-stg:/gluster/mnt6/brick > Brick54: s06-stg:/gluster/mnt6/brick > Brick55: s04-stg:/gluster/mnt7/brick > Brick56: s05-stg:/gluster/mnt7/brick > Brick57: s06-stg:/gluster/mnt7/brick > Brick58: s04-stg:/gluster/mnt8/brick > Brick59: s05-stg:/gluster/mnt8/brick > Brick60: s06-stg:/gluster/mnt8/brick > Brick61: s04-stg:/gluster/mnt9/brick > Brick62: s05-stg:/gluster/mnt9/brick > Brick63: s06-stg:/gluster/mnt9/brick > Brick64: s04-stg:/gluster/mnt10/brick > Brick65: s05-stg:/gluster/mnt10/brick > Brick66: s06-stg:/gluster/mnt10/brick > Brick67: s04-stg:/gluster/mnt11/brick > Brick68: s05-stg:/gluster/mnt11/brick > Brick69: s06-stg:/gluster/mnt11/brick > Brick70: s04-stg:/gluster/mnt12/brick > Brick71: s05-stg:/gluster/mnt12/brick > Brick72: s06-stg:/gluster/mnt12/brick > Options Reconfigured: > disperse.eager-lock: off > diagnostics.count-fop-hits: on > diagnostics.latency-measurement: on > cluster.server-quorum-type: server > features.default-soft-limit: 90 > features.quota-deem-statfs: on > performance.io-thread-count: 16 > disperse.cpu-extensions: auto > performance.io-cache: off > network.inode-lru-limit: 50000 > performance.md-cache-timeout: 600 > performance.cache-invalidation: on > performance.stat-prefetch: on > features.cache-invalidation-timeout: 600 > features.cache-invalidation: on > cluster.readdir-optimize: on > performance.parallel-readdir: off > performance.readdir-ahead: on > cluster.lookup-optimize: on > client.event-threads: 4 > server.event-threads: 4 > nfs.disable: on > transport.address-family: inet > cluster.quorum-type: none > cluster.min-free-disk: 10 > performance.client-io-threads: on > 
features.quota: on
> features.inode-quota: on
> features.bitrot: on
> features.scrub: Active
> network.ping-timeout: 0
> cluster.brick-multiplex: off
> cluster.server-quorum-ratio: 51

From mauro.tridici at cmcc.it Fri Jan 18 09:36:27 2019
From: mauro.tridici at cmcc.it (Mauro Tridici)
Date: Fri, 18 Jan 2019 10:36:27 +0100
Subject: [Gluster-users] invisible files in some directory
In-Reply-To:
References: <96B07283-D8AB-4F06-909D-E00424625528@cmcc.it>
Message-ID:

Hi Nithya,

I just ran the command you suggested for a while.
You can find the log file attached.

Thank you,
Mauro

> On 18 Jan 2019, at 10:14, Nithya Balachandran wrote:
>
> On Fri, 18 Jan 2019 at 14:25, Mauro Tridici wrote:
> Dear Users,
>
> I'm facing a new problem on our gluster volume (v. 3.12.14).
> Sometimes the "ls" command, executed in a specific directory, returns empty output.
> The "ls" output is empty, but I know that the directory involved contains some files and subdirectories.
> In fact, if I execute "ls" against a specific file (in the same folder) I can see that the file is there.
>
> In a few words:
>
> "ls" executed in a particular folder returns empty output;
> "ls filename" executed in the same folder works.
>
> Is there something I can do in order to identify the cause of this issue?
>
>
> Yes, please take a tcpdump of the client when running the ls on the problematic directory and send it across.
>
> tcpdump -i any -s 0 -w /var/tmp/dirls.pcap tcp and not port 22
>
>
> We have seen such issues when the gfid handle for the directory is missing on the bricks.
> Regards,
> Nithya
>
>
> You can find below some information about the volume.
> Thank you in advance, > Mauro Tridici > > [root at s01 ~]# gluster volume info > > Volume Name: tier2 > Type: Distributed-Disperse > Volume ID: a28d88c5-3295-4e35-98d4-210b3af9358c > Status: Started > Snapshot Count: 0 > Number of Bricks: 12 x (4 + 2) = 72 > Transport-type: tcp > Bricks: > Brick1: s01-stg:/gluster/mnt1/brick > Brick2: s02-stg:/gluster/mnt1/brick > Brick3: s03-stg:/gluster/mnt1/brick > Brick4: s01-stg:/gluster/mnt2/brick > Brick5: s02-stg:/gluster/mnt2/brick > Brick6: s03-stg:/gluster/mnt2/brick > Brick7: s01-stg:/gluster/mnt3/brick > Brick8: s02-stg:/gluster/mnt3/brick > Brick9: s03-stg:/gluster/mnt3/brick > Brick10: s01-stg:/gluster/mnt4/brick > Brick11: s02-stg:/gluster/mnt4/brick > Brick12: s03-stg:/gluster/mnt4/brick > Brick13: s01-stg:/gluster/mnt5/brick > Brick14: s02-stg:/gluster/mnt5/brick > Brick15: s03-stg:/gluster/mnt5/brick > Brick16: s01-stg:/gluster/mnt6/brick > Brick17: s02-stg:/gluster/mnt6/brick > Brick18: s03-stg:/gluster/mnt6/brick > Brick19: s01-stg:/gluster/mnt7/brick > Brick20: s02-stg:/gluster/mnt7/brick > Brick21: s03-stg:/gluster/mnt7/brick > Brick22: s01-stg:/gluster/mnt8/brick > Brick23: s02-stg:/gluster/mnt8/brick > Brick24: s03-stg:/gluster/mnt8/brick > Brick25: s01-stg:/gluster/mnt9/brick > Brick26: s02-stg:/gluster/mnt9/brick > Brick27: s03-stg:/gluster/mnt9/brick > Brick28: s01-stg:/gluster/mnt10/brick > Brick29: s02-stg:/gluster/mnt10/brick > Brick30: s03-stg:/gluster/mnt10/brick > Brick31: s01-stg:/gluster/mnt11/brick > Brick32: s02-stg:/gluster/mnt11/brick > Brick33: s03-stg:/gluster/mnt11/brick > Brick34: s01-stg:/gluster/mnt12/brick > Brick35: s02-stg:/gluster/mnt12/brick > Brick36: s03-stg:/gluster/mnt12/brick > Brick37: s04-stg:/gluster/mnt1/brick > Brick38: s05-stg:/gluster/mnt1/brick > Brick39: s06-stg:/gluster/mnt1/brick > Brick40: s04-stg:/gluster/mnt2/brick > Brick41: s05-stg:/gluster/mnt2/brick > Brick42: s06-stg:/gluster/mnt2/brick > Brick43: s04-stg:/gluster/mnt3/brick > Brick44: 
s05-stg:/gluster/mnt3/brick > Brick45: s06-stg:/gluster/mnt3/brick > Brick46: s04-stg:/gluster/mnt4/brick > Brick47: s05-stg:/gluster/mnt4/brick > Brick48: s06-stg:/gluster/mnt4/brick > Brick49: s04-stg:/gluster/mnt5/brick > Brick50: s05-stg:/gluster/mnt5/brick > Brick51: s06-stg:/gluster/mnt5/brick > Brick52: s04-stg:/gluster/mnt6/brick > Brick53: s05-stg:/gluster/mnt6/brick > Brick54: s06-stg:/gluster/mnt6/brick > Brick55: s04-stg:/gluster/mnt7/brick > Brick56: s05-stg:/gluster/mnt7/brick > Brick57: s06-stg:/gluster/mnt7/brick > Brick58: s04-stg:/gluster/mnt8/brick > Brick59: s05-stg:/gluster/mnt8/brick > Brick60: s06-stg:/gluster/mnt8/brick > Brick61: s04-stg:/gluster/mnt9/brick > Brick62: s05-stg:/gluster/mnt9/brick > Brick63: s06-stg:/gluster/mnt9/brick > Brick64: s04-stg:/gluster/mnt10/brick > Brick65: s05-stg:/gluster/mnt10/brick > Brick66: s06-stg:/gluster/mnt10/brick > Brick67: s04-stg:/gluster/mnt11/brick > Brick68: s05-stg:/gluster/mnt11/brick > Brick69: s06-stg:/gluster/mnt11/brick > Brick70: s04-stg:/gluster/mnt12/brick > Brick71: s05-stg:/gluster/mnt12/brick > Brick72: s06-stg:/gluster/mnt12/brick > Options Reconfigured: > disperse.eager-lock: off > diagnostics.count-fop-hits: on > diagnostics.latency-measurement: on > cluster.server-quorum-type: server > features.default-soft-limit: 90 > features.quota-deem-statfs: on > performance.io-thread-count: 16 > disperse.cpu-extensions: auto > performance.io-cache: off > network.inode-lru-limit: 50000 > performance.md-cache-timeout: 600 > performance.cache-invalidation: on > performance.stat-prefetch: on > features.cache-invalidation-timeout: 600 > features.cache-invalidation: on > cluster.readdir-optimize: on > performance.parallel-readdir: off > performance.readdir-ahead: on > cluster.lookup-optimize: on > client.event-threads: 4 > server.event-threads: 4 > nfs.disable: on > transport.address-family: inet > cluster.quorum-type: none > cluster.min-free-disk: 10 > performance.client-io-threads: on > 
features.quota: on
> features.inode-quota: on
> features.bitrot: on
> features.scrub: Active
> network.ping-timeout: 0
> cluster.brick-multiplex: off
> cluster.server-quorum-ratio: 51

-------------- next part --------------
A non-text attachment was scrubbed...
Name: dirls.pcap.gz
Type: application/x-gzip
Size: 1018101 bytes
Desc: not available
URL:

From amudhan83 at gmail.com Fri Jan 18 11:22:40 2019
From: amudhan83 at gmail.com (Amudhan P)
Date: Fri, 18 Jan 2019 16:52:40 +0530
Subject: [Gluster-users] glusterfs 4.1.6 error in starting glusterd service
In-Reply-To:
References:
Message-ID:

Hi Atin,

I have sent the files to your email directly in a separate mail. I hope you
have received them.

regards
Amudhan

On Thu, Jan 17, 2019 at 3:43 PM Atin Mukherjee wrote:

> Can you please run 'glusterd -LDEBUG' and share back the glusterd.log?
> Instead of too much back and forth, I suggest you share the content
> of /var/lib/glusterd from all the nodes. Also do mention on which particular
> node the glusterd service is unable to come up.
> > On Thu, Jan 17, 2019 at 11:34 AM Amudhan P wrote: > >> I have created the folder in the path as said but still, service failed >> to start below is the error msg in glusterd.log >> >> [2019-01-16 14:50:14.555742] I [MSGID: 100030] [glusterfsd.c:2741:main] >> 0-/usr/local/sbin/glusterd: Started running /usr/local/sbin/glusterd >> version 4.1.6 (args: /usr/local/sbin/glusterd -p /var/run/glusterd.pid) >> [2019-01-16 14:50:14.559835] I [MSGID: 106478] [glusterd.c:1423:init] >> 0-management: Maximum allowed open file descriptors set to 65536 >> [2019-01-16 14:50:14.559894] I [MSGID: 106479] [glusterd.c:1481:init] >> 0-management: Using /var/lib/glusterd as working directory >> [2019-01-16 14:50:14.559912] I [MSGID: 106479] [glusterd.c:1486:init] >> 0-management: Using /var/run/gluster as pid file working directory >> [2019-01-16 14:50:14.563834] W [MSGID: 103071] >> [rdma.c:4629:__gf_rdma_ctx_create] 0-rpc-transport/rdma: rdma_cm event >> channel creation failed [No such device] >> [2019-01-16 14:50:14.563867] W [MSGID: 103055] [rdma.c:4938:init] >> 0-rdma.management: Failed to initialize IB Device >> [2019-01-16 14:50:14.563882] W [rpc-transport.c:351:rpc_transport_load] >> 0-rpc-transport: 'rdma' initialization failed >> [2019-01-16 14:50:14.563957] W [rpcsvc.c:1781:rpcsvc_create_listener] >> 0-rpc-service: cannot create listener, initing the transport failed >> [2019-01-16 14:50:14.563974] E [MSGID: 106244] [glusterd.c:1764:init] >> 0-management: creation of 1 listeners failed, continuing with succeeded >> transport >> [2019-01-16 14:50:15.565868] I [MSGID: 106513] >> [glusterd-store.c:2240:glusterd_restore_op_version] 0-glusterd: retrieved >> op-version: 40100 >> [2019-01-16 14:50:15.642532] I [MSGID: 106544] >> [glusterd.c:158:glusterd_uuid_init] 0-management: retrieved UUID: >> d6bf51a7-c296-492f-8dac-e81efa9dd22d >> [2019-01-16 14:50:15.675333] I [MSGID: 106498] >> [glusterd-handler.c:3614:glusterd_friend_add_from_peerinfo] 0-management: >> connect returned 
0 >> [2019-01-16 14:50:15.675421] W [MSGID: 106061] >> [glusterd-handler.c:3408:glusterd_transport_inet_options_build] 0-glusterd: >> Failed to get tcp-user-timeout >> [2019-01-16 14:50:15.675451] I [rpc-clnt.c:1059:rpc_clnt_connection_init] >> 0-management: setting frame-timeout to 600 >> *[2019-01-16 14:50:15.676912] E [MSGID: 106187] >> [glusterd-store.c:4662:glusterd_resolve_all_bricks] 0-glusterd: resolve >> brick failed in restore* >> *[2019-01-16 14:50:15.676956] E [MSGID: 101019] >> [xlator.c:720:xlator_init] 0-management: Initialization of volume >> 'management' failed, review your volfile again* >> [2019-01-16 14:50:15.676973] E [MSGID: 101066] >> [graph.c:367:glusterfs_graph_init] 0-management: initializing translator >> failed >> [2019-01-16 14:50:15.676986] E [MSGID: 101176] >> [graph.c:738:glusterfs_graph_activate] 0-graph: init failed >> [2019-01-16 14:50:15.677479] W [glusterfsd.c:1514:cleanup_and_exit] >> (-->/usr/local/sbin/glusterd(glusterfs_volumes_init+0xc2) [0x409f52] >> -->/usr/local/sbin/glusterd(glusterfs_process_volfp+0x151) [0x409e41] >> -->/usr/local/sbin/glusterd(cleanup_and_exit+0x5f) [0x40942f] ) 0-: >> received signum (-1), shutting down >> >> >> On Thu, Jan 17, 2019 at 8:06 AM Atin Mukherjee >> wrote: >> >>> If gluster volume info/status shows the brick to be /media/disk4/brick4 >>> then you'd need to mount the same path and hence you'd need to create the >>> brick4 directory explicitly. I fail to understand the rationale how only >>> /media/disk4 can be used as the mount path for the brick. >>> >>> On Wed, Jan 16, 2019 at 5:24 PM Amudhan P wrote: >>> >>>> Yes, I did mount bricks but the folder 'brick4' was still not created >>>> inside the brick. >>>> Do I need to create this folder because when I run replace-brick it >>>> will create folder inside the brick. I have seen this behavior before when >>>> running replace-brick or heal begins. 
>>>> >>>> On Wed, Jan 16, 2019 at 5:05 PM Atin Mukherjee >>>> wrote: >>>> >>>>> >>>>> >>>>> On Wed, Jan 16, 2019 at 5:02 PM Amudhan P wrote: >>>>> >>>>>> Atin, >>>>>> I have copied the content of 'gfs-tst' from vol folder in another >>>>>> node. when starting service again fails with error msg in glusterd.log file. >>>>>> >>>>>> [2019-01-15 20:16:59.513023] I [MSGID: 100030] >>>>>> [glusterfsd.c:2741:main] 0-/usr/local/sbin/glusterd: Started running >>>>>> /usr/local/sbin/glusterd version 4.1.6 (args: /usr/local/sbin/glusterd -p >>>>>> /var/run/glusterd.pid) >>>>>> [2019-01-15 20:16:59.517164] I [MSGID: 106478] [glusterd.c:1423:init] >>>>>> 0-management: Maximum allowed open file descriptors set to 65536 >>>>>> [2019-01-15 20:16:59.517264] I [MSGID: 106479] [glusterd.c:1481:init] >>>>>> 0-management: Using /var/lib/glusterd as working directory >>>>>> [2019-01-15 20:16:59.517283] I [MSGID: 106479] [glusterd.c:1486:init] >>>>>> 0-management: Using /var/run/gluster as pid file working directory >>>>>> [2019-01-15 20:16:59.521508] W [MSGID: 103071] >>>>>> [rdma.c:4629:__gf_rdma_ctx_create] 0-rpc-transport/rdma: rdma_cm event >>>>>> channel creation failed [No such device] >>>>>> [2019-01-15 20:16:59.521544] W [MSGID: 103055] [rdma.c:4938:init] >>>>>> 0-rdma.management: Failed to initialize IB Device >>>>>> [2019-01-15 20:16:59.521562] W >>>>>> [rpc-transport.c:351:rpc_transport_load] 0-rpc-transport: 'rdma' >>>>>> initialization failed >>>>>> [2019-01-15 20:16:59.521629] W [rpcsvc.c:1781:rpcsvc_create_listener] >>>>>> 0-rpc-service: cannot create listener, initing the transport failed >>>>>> [2019-01-15 20:16:59.521648] E [MSGID: 106244] [glusterd.c:1764:init] >>>>>> 0-management: creation of 1 listeners failed, continuing with succeeded >>>>>> transport >>>>>> [2019-01-15 20:17:00.529390] I [MSGID: 106513] >>>>>> [glusterd-store.c:2240:glusterd_restore_op_version] 0-glusterd: retrieved >>>>>> op-version: 40100 >>>>>> [2019-01-15 20:17:00.608354] I [MSGID: 106544] 
>>>>>> [glusterd.c:158:glusterd_uuid_init] 0-management: retrieved UUID: >>>>>> d6bf51a7-c296-492f-8dac-e81efa9dd22d >>>>>> [2019-01-15 20:17:00.650911] W [MSGID: 106425] >>>>>> [glusterd-store.c:2643:glusterd_store_retrieve_bricks] 0-management: failed >>>>>> to get statfs() call on brick /media/disk4/brick4 [No such file or >>>>>> directory] >>>>>> >>>>> >>>>> This means that underlying brick /media/disk4/brick4 doesn't exist. >>>>> You already mentioned that you had replaced the faulty disk, but have you >>>>> not mounted it yet? >>>>> >>>>> >>>>>> [2019-01-15 20:17:00.691240] I [MSGID: 106498] >>>>>> [glusterd-handler.c:3614:glusterd_friend_add_from_peerinfo] 0-management: >>>>>> connect returned 0 >>>>>> [2019-01-15 20:17:00.691307] W [MSGID: 106061] >>>>>> [glusterd-handler.c:3408:glusterd_transport_inet_options_build] 0-glusterd: >>>>>> Failed to get tcp-user-timeout >>>>>> [2019-01-15 20:17:00.691331] I >>>>>> [rpc-clnt.c:1059:rpc_clnt_connection_init] 0-management: setting >>>>>> frame-timeout to 600 >>>>>> [2019-01-15 20:17:00.692547] E [MSGID: 106187] >>>>>> [glusterd-store.c:4662:glusterd_resolve_all_bricks] 0-glusterd: resolve >>>>>> brick failed in restore >>>>>> [2019-01-15 20:17:00.692582] E [MSGID: 101019] >>>>>> [xlator.c:720:xlator_init] 0-management: Initialization of volume >>>>>> 'management' failed, review your volfile again >>>>>> [2019-01-15 20:17:00.692597] E [MSGID: 101066] >>>>>> [graph.c:367:glusterfs_graph_init] 0-management: initializing translator >>>>>> failed >>>>>> [2019-01-15 20:17:00.692607] E [MSGID: 101176] >>>>>> [graph.c:738:glusterfs_graph_activate] 0-graph: init failed >>>>>> [2019-01-15 20:17:00.693004] W [glusterfsd.c:1514:cleanup_and_exit] >>>>>> (-->/usr/local/sbin/glusterd(glusterfs_volumes_init+0xc2) [0x409f52] >>>>>> -->/usr/local/sbin/glusterd(glusterfs_process_volfp+0x151) [0x409e41] >>>>>> -->/usr/local/sbin/glusterd(cleanup_and_exit+0x5f) [0x40942f] ) 0-: >>>>>> received signum (-1), shutting down >>>>>> >>>>>> 
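Since the useful information in these glusterd.log excerpts sits in the few "E" (error) entries among many "I" and "W" lines, here is a small Python sketch that pulls out only the error entries. It assumes each entry is on a single line in the real log file (the wrapping seen in this thread comes from the mail client) and that entries follow the layout visible above: timestamp, level letter, optional MSGID, source location, message. The names LOG_LINE and error_entries are invented for this example.

```python
import re

# Layout as seen in the log excerpts in this thread:
# [timestamp] LEVEL [MSGID: n] [file:line:function] message
# The MSGID part is optional because some entries do not carry one.
LOG_LINE = re.compile(
    r"^\[(?P<timestamp>[^\]]+)\] "
    r"(?P<level>[A-Z]) "
    r"(?:\[MSGID: (?P<msgid>\d+)\] )?"
    r"\[(?P<source>[^\]]+)\] "
    r"(?P<message>.*)$"
)


def error_entries(lines):
    """Return the parsed fields of every error-level ('E') log entry."""
    entries = []
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match and match.group("level") == "E":
            entries.append(match.groupdict())
    return entries


# Sample entries copied from the thread, unwrapped to one entry per line.
sample = [
    "[2019-01-15 20:16:59.521562] W [rpc-transport.c:351:rpc_transport_load] "
    "0-rpc-transport: 'rdma' initialization failed",
    "[2019-01-15 20:17:00.692547] E [MSGID: 106187] "
    "[glusterd-store.c:4662:glusterd_resolve_all_bricks] "
    "0-glusterd: resolve brick failed in restore",
    "[2019-01-15 20:17:00.692607] E [MSGID: 101176] "
    "[graph.c:738:glusterfs_graph_activate] 0-graph: init failed",
]

if __name__ == "__main__":
    for entry in error_entries(sample):
        print(entry["timestamp"], entry["msgid"], entry["message"])
```

Reading a real file would be a matter of passing the open file object, for example error_entries(open("/var/log/glusterfs/glusterd.log")); that log path is an assumption and depends on how glusterd was installed.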
>>>>>> On Wed, Jan 16, 2019 at 4:34 PM Atin Mukherjee >>>>>> wrote: >>>>>> >>>>>>> This is a case of partial write of a transaction and as the host ran >>>>>>> out of space for the root partition where all the glusterd related >>>>>>> configurations are persisted, the transaction couldn't be written and hence >>>>>>> the new (replaced) brick's information wasn't persisted in the >>>>>>> configuration. The workaround for this is to copy the content of >>>>>>> /var/lib/glusterd/vols/gfs-tst/ from one of the nodes in the trusted >>>>>>> storage pool to the node where glusterd service fails to come up and post >>>>>>> that restarting the glusterd service should be able to make peer status >>>>>>> reporting all nodes healthy and connected. >>>>>>> >>>>>>> On Wed, Jan 16, 2019 at 3:49 PM Amudhan P >>>>>>> wrote: >>>>>>> >>>>>>>> Hi, >>>>>>>> >>>>>>>> In short, when I started glusterd service I am getting following >>>>>>>> error msg in the glusterd.log file in one server. >>>>>>>> what needs to be done? 
>>>>>>>> >>>>>>>> error logged in glusterd.log >>>>>>>> >>>>>>>> [2019-01-15 17:50:13.956053] I [MSGID: 100030] >>>>>>>> [glusterfsd.c:2741:main] 0-/usr/local/sbin/glusterd: Started running >>>>>>>> /usr/local/sbin/glusterd version 4.1.6 (args: /usr/local/sbin/glusterd -p >>>>>>>> /var/run/glusterd.pid) >>>>>>>> [2019-01-15 17:50:13.960131] I [MSGID: 106478] >>>>>>>> [glusterd.c:1423:init] 0-management: Maximum allowed open file descriptors >>>>>>>> set to 65536 >>>>>>>> [2019-01-15 17:50:13.960193] I [MSGID: 106479] >>>>>>>> [glusterd.c:1481:init] 0-management: Using /var/lib/glusterd as working >>>>>>>> directory >>>>>>>> [2019-01-15 17:50:13.960212] I [MSGID: 106479] >>>>>>>> [glusterd.c:1486:init] 0-management: Using /var/run/gluster as pid file >>>>>>>> working directory >>>>>>>> [2019-01-15 17:50:13.964437] W [MSGID: 103071] >>>>>>>> [rdma.c:4629:__gf_rdma_ctx_create] 0-rpc-transport/rdma: rdma_cm event >>>>>>>> channel creation failed [No such device] >>>>>>>> [2019-01-15 17:50:13.964474] W [MSGID: 103055] [rdma.c:4938:init] >>>>>>>> 0-rdma.management: Failed to initialize IB Device >>>>>>>> [2019-01-15 17:50:13.964491] W >>>>>>>> [rpc-transport.c:351:rpc_transport_load] 0-rpc-transport: 'rdma' >>>>>>>> initialization failed >>>>>>>> [2019-01-15 17:50:13.964560] W >>>>>>>> [rpcsvc.c:1781:rpcsvc_create_listener] 0-rpc-service: cannot create >>>>>>>> listener, initing the transport failed >>>>>>>> [2019-01-15 17:50:13.964579] E [MSGID: 106244] >>>>>>>> [glusterd.c:1764:init] 0-management: creation of 1 listeners failed, >>>>>>>> continuing with succeeded transport >>>>>>>> [2019-01-15 17:50:14.967681] I [MSGID: 106513] >>>>>>>> [glusterd-store.c:2240:glusterd_restore_op_version] 0-glusterd: retrieved >>>>>>>> op-version: 40100 >>>>>>>> [2019-01-15 17:50:14.973931] I [MSGID: 106544] >>>>>>>> [glusterd.c:158:glusterd_uuid_init] 0-management: retrieved UUID: >>>>>>>> d6bf51a7-c296-492f-8dac-e81efa9dd22d >>>>>>>> [2019-01-15 17:50:15.046620] E [MSGID: 101032] 
>>>>>>>> [store.c:441:gf_store_handle_retrieve] 0-: Path corresponding to >>>>>>>> /var/lib/glusterd/vols/gfs-tst/bricks/IP.3:-media-disk3-brick3. [No such >>>>>>>> file or directory] >>>>>>>> [2019-01-15 17:50:15.046685] E [MSGID: 106201] >>>>>>>> [glusterd-store.c:3384:glusterd_store_retrieve_volumes] 0-management: >>>>>>>> Unable to restore volume: gfs-tst >>>>>>>> [2019-01-15 17:50:15.046718] E [MSGID: 101019] >>>>>>>> [xlator.c:720:xlator_init] 0-management: Initialization of volume >>>>>>>> 'management' failed, review your volfile again >>>>>>>> [2019-01-15 17:50:15.046732] E [MSGID: 101066] >>>>>>>> [graph.c:367:glusterfs_graph_init] 0-management: initializing translator >>>>>>>> failed >>>>>>>> [2019-01-15 17:50:15.046741] E [MSGID: 101176] >>>>>>>> [graph.c:738:glusterfs_graph_activate] 0-graph: init failed >>>>>>>> [2019-01-15 17:50:15.047171] W [glusterfsd.c:1514:cleanup_and_exit] >>>>>>>> (-->/usr/local/sbin/glusterd(glusterfs_volumes >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> In long, I am trying to simulate a situation. where volume stoped >>>>>>>> abnormally and >>>>>>>> entire cluster restarted with some missing disks. >>>>>>>> >>>>>>>> My test cluster is set up with 3 nodes and each has four disks, I >>>>>>>> have setup a volume with disperse 4+2. >>>>>>>> In Node-3 2 disks have failed, to replace I have shutdown all system >>>>>>>> >>>>>>>> below are the steps done. >>>>>>>> >>>>>>>> 1. umount from client machine >>>>>>>> 2. shutdown all system by running `shutdown -h now` command ( >>>>>>>> without stopping volume and stop service) >>>>>>>> 3. replace faulty disk in Node-3 >>>>>>>> 4. powered ON all system >>>>>>>> 5. format replaced drives, and mount all drives >>>>>>>> 6. start glusterd service in all node (success) >>>>>>>> 7. Now running `voulume status` command from node-3 >>>>>>>> output : [2019-01-15 16:52:17.718422] : v status : FAILED : >>>>>>>> Staging failed on 0083ec0c-40bf-472a-a128-458924e56c96. Please check log >>>>>>>> file for details. 
>>>>>>>> 8. running `voulume start gfs-tst` command from node-3 >>>>>>>> output : [2019-01-15 16:53:19.410252] : v start gfs-tst : FAILED : >>>>>>>> Volume gfs-tst already started >>>>>>>> >>>>>>>> 9. running `gluster v status` in other node. showing all brick >>>>>>>> available but 'self-heal daemon' not running >>>>>>>> @gfstst-node2:~$ sudo gluster v status >>>>>>>> Status of volume: gfs-tst >>>>>>>> Gluster process TCP Port RDMA Port >>>>>>>> Online Pid >>>>>>>> >>>>>>>> ------------------------------------------------------------------------------ >>>>>>>> Brick IP.2:/media/disk1/brick1 49152 0 Y >>>>>>>> 1517 >>>>>>>> Brick IP.4:/media/disk1/brick1 49152 0 Y >>>>>>>> 1668 >>>>>>>> Brick IP.2:/media/disk2/brick2 49153 0 Y >>>>>>>> 1522 >>>>>>>> Brick IP.4:/media/disk2/brick2 49153 0 Y >>>>>>>> 1678 >>>>>>>> Brick IP.2:/media/disk3/brick3 49154 0 Y >>>>>>>> 1527 >>>>>>>> Brick IP.4:/media/disk3/brick3 49154 0 Y >>>>>>>> 1677 >>>>>>>> Brick IP.2:/media/disk4/brick4 49155 0 Y >>>>>>>> 1541 >>>>>>>> Brick IP.4:/media/disk4/brick4 49155 0 Y >>>>>>>> 1683 >>>>>>>> Self-heal Daemon on localhost N/A N/A Y >>>>>>>> 2662 >>>>>>>> Self-heal Daemon on IP.4 N/A N/A Y >>>>>>>> 2786 >>>>>>>> >>>>>>>> 10. in the above output 'volume already started'. so, running >>>>>>>> `reset-brick` command >>>>>>>> v reset-brick gfs-tst IP.3:/media/disk3/brick3 >>>>>>>> IP.3:/media/disk3/brick3 commit force >>>>>>>> >>>>>>>> output : [2019-01-15 16:57:37.916942] : v reset-brick gfs-tst >>>>>>>> IP.3:/media/disk3/brick3 IP.3:/media/disk3/brick3 commit force : FAILED : >>>>>>>> /media/disk3/brick3 is already part of a volume >>>>>>>> >>>>>>>> 11. reset-brick command was not working, so, tried stopping volume >>>>>>>> and start with force command >>>>>>>> output : [2019-01-15 17:01:04.570794] : v start gfs-tst force : >>>>>>>> FAILED : Pre-validation failed on localhost. Please check log file for >>>>>>>> details >>>>>>>> >>>>>>>> 12. now stopped service in all node and tried starting again. 
>>>>>>>> except node-3 other nodes service started successfully without any issues. >>>>>>>> >>>>>>>> in node-3 receiving following message. >>>>>>>> >>>>>>>> sudo service glusterd start >>>>>>>> * Starting glusterd service glusterd >>>>>>>> >>>>>>>> [fail] >>>>>>>> /usr/local/sbin/glusterd: option requires an argument -- 'f' >>>>>>>> Try `glusterd --help' or `glusterd --usage' for more information. >>>>>>>> >>>>>>>> 13. checking glusterd log file found that OS drive was running out >>>>>>>> of space >>>>>>>> output : [2019-01-15 16:51:37.210792] W [MSGID: 101012] >>>>>>>> [store.c:372:gf_store_save_value] 0-management: fflush failed. [No space >>>>>>>> left on device] >>>>>>>> [2019-01-15 16:51:37.210874] E [MSGID: 106190] >>>>>>>> [glusterd-store.c:1058:glusterd_volume_exclude_options_write] 0-management: >>>>>>>> Unable to write volume values for gfs-tst >>>>>>>> >>>>>>>> 14. cleared some space in OS drive but still, service is not >>>>>>>> running. below is the error logged in glusterd.log >>>>>>>> >>>>>>>> [2019-01-15 17:50:13.956053] I [MSGID: 100030] >>>>>>>> [glusterfsd.c:2741:main] 0-/usr/local/sbin/glusterd: Started running >>>>>>>> /usr/local/sbin/glusterd version 4.1.6 (args: /usr/local/sbin/glusterd -p >>>>>>>> /var/run/glusterd.pid) >>>>>>>> [2019-01-15 17:50:13.960131] I [MSGID: 106478] >>>>>>>> [glusterd.c:1423:init] 0-management: Maximum allowed open file descriptors >>>>>>>> set to 65536 >>>>>>>> [2019-01-15 17:50:13.960193] I [MSGID: 106479] >>>>>>>> [glusterd.c:1481:init] 0-management: Using /var/lib/glusterd as working >>>>>>>> directory >>>>>>>> [2019-01-15 17:50:13.960212] I [MSGID: 106479] >>>>>>>> [glusterd.c:1486:init] 0-management: Using /var/run/gluster as pid file >>>>>>>> working directory >>>>>>>> [2019-01-15 17:50:13.964437] W [MSGID: 103071] >>>>>>>> [rdma.c:4629:__gf_rdma_ctx_create] 0-rpc-transport/rdma: rdma_cm event >>>>>>>> channel creation failed [No such device] >>>>>>>> [2019-01-15 17:50:13.964474] W [MSGID: 103055] 
[rdma.c:4938:init] >>>>>>>> 0-rdma.management: Failed to initialize IB Device >>>>>>>> [2019-01-15 17:50:13.964491] W >>>>>>>> [rpc-transport.c:351:rpc_transport_load] 0-rpc-transport: 'rdma' >>>>>>>> initialization failed >>>>>>>> [2019-01-15 17:50:13.964560] W >>>>>>>> [rpcsvc.c:1781:rpcsvc_create_listener] 0-rpc-service: cannot create >>>>>>>> listener, initing the transport failed >>>>>>>> [2019-01-15 17:50:13.964579] E [MSGID: 106244] >>>>>>>> [glusterd.c:1764:init] 0-management: creation of 1 listeners failed, >>>>>>>> continuing with succeeded transport >>>>>>>> [2019-01-15 17:50:14.967681] I [MSGID: 106513] >>>>>>>> [glusterd-store.c:2240:glusterd_restore_op_version] 0-glusterd: retrieved >>>>>>>> op-version: 40100 >>>>>>>> [2019-01-15 17:50:14.973931] I [MSGID: 106544] >>>>>>>> [glusterd.c:158:glusterd_uuid_init] 0-management: retrieved UUID: >>>>>>>> d6bf51a7-c296-492f-8dac-e81efa9dd22d >>>>>>>> [2019-01-15 17:50:15.046620] E [MSGID: 101032] >>>>>>>> [store.c:441:gf_store_handle_retrieve] 0-: Path corresponding to >>>>>>>> /var/lib/glusterd/vols/gfs-tst/bricks/IP.3:-media-disk3-brick3. 
[No such >>>>>>>> file or directory] >>>>>>>> [2019-01-15 17:50:15.046685] E [MSGID: 106201] >>>>>>>> [glusterd-store.c:3384:glusterd_store_retrieve_volumes] 0-management: >>>>>>>> Unable to restore volume: gfs-tst >>>>>>>> [2019-01-15 17:50:15.046718] E [MSGID: 101019] >>>>>>>> [xlator.c:720:xlator_init] 0-management: Initialization of volume >>>>>>>> 'management' failed, review your volfile again >>>>>>>> [2019-01-15 17:50:15.046732] E [MSGID: 101066] >>>>>>>> [graph.c:367:glusterfs_graph_init] 0-management: initializing translator >>>>>>>> failed >>>>>>>> [2019-01-15 17:50:15.046741] E [MSGID: 101176] >>>>>>>> [graph.c:738:glusterfs_graph_activate] 0-graph: init failed >>>>>>>> [2019-01-15 17:50:15.047171] W [glusterfsd.c:1514:cleanup_and_exit] >>>>>>>> (-->/usr/local/sbin/glusterd(glusterfs_volumes_init+0xc2) [0x409f52] >>>>>>>> -->/usr/local/sbin/glusterd(glusterfs_process_volfp+0x151) [0x409e41] >>>>>>>> -->/usr/local/sbin/glusterd(cleanup_and_exit+0x5f) [0x40942f] ) 0-: >>>>>>>> received signum (-1), shutting down >>>>>>>> >>>>>>>> >>>>>>>> 15. 
In other node running `volume status' still shows bricks node3 >>>>>>>> is live >>>>>>>> but 'peer status' showing node-3 disconnected >>>>>>>> >>>>>>>> @gfstst-node2:~$ sudo gluster v status >>>>>>>> Status of volume: gfs-tst >>>>>>>> Gluster process TCP Port RDMA Port >>>>>>>> Online Pid >>>>>>>> >>>>>>>> ------------------------------------------------------------------------------ >>>>>>>> Brick IP.2:/media/disk1/brick1 49152 0 Y >>>>>>>> 1517 >>>>>>>> Brick IP.4:/media/disk1/brick1 49152 0 Y >>>>>>>> 1668 >>>>>>>> Brick IP.2:/media/disk2/brick2 49153 0 Y >>>>>>>> 1522 >>>>>>>> Brick IP.4:/media/disk2/brick2 49153 0 Y >>>>>>>> 1678 >>>>>>>> Brick IP.2:/media/disk3/brick3 49154 0 Y >>>>>>>> 1527 >>>>>>>> Brick IP.4:/media/disk3/brick3 49154 0 Y >>>>>>>> 1677 >>>>>>>> Brick IP.2:/media/disk4/brick4 49155 0 Y >>>>>>>> 1541 >>>>>>>> Brick IP.4:/media/disk4/brick4 49155 0 Y >>>>>>>> 1683 >>>>>>>> Self-heal Daemon on localhost N/A N/A Y >>>>>>>> 2662 >>>>>>>> Self-heal Daemon on IP.4 N/A N/A Y >>>>>>>> 2786 >>>>>>>> >>>>>>>> Task Status of Volume gfs-tst >>>>>>>> >>>>>>>> ------------------------------------------------------------------------------ >>>>>>>> There are no active volume tasks >>>>>>>> >>>>>>>> >>>>>>>> root at gfstst-node2:~$ sudo gluster pool list >>>>>>>> UUID Hostname State >>>>>>>> d6bf51a7-c296-492f-8dac-e81efa9dd22d IP.3 Disconnected >>>>>>>> c1cbb58e-3ceb-4637-9ba3-3d28ef20b143 IP.4 Connected >>>>>>>> 0083ec0c-40bf-472a-a128-458924e56c96 localhost Connected >>>>>>>> >>>>>>>> root at gfstst-node2:~$ sudo gluster peer status >>>>>>>> Number of Peers: 2 >>>>>>>> >>>>>>>> Hostname: IP.3 >>>>>>>> Uuid: d6bf51a7-c296-492f-8dac-e81efa9dd22d >>>>>>>> State: Peer in Cluster (Disconnected) >>>>>>>> >>>>>>>> Hostname: IP.4 >>>>>>>> Uuid: c1cbb58e-3ceb-4637-9ba3-3d28ef20b143 >>>>>>>> State: Peer in Cluster (Connected) >>>>>>>> >>>>>>>> >>>>>>>> regards >>>>>>>> Amudhan >>>>>>>> _______________________________________________ >>>>>>>> Gluster-users mailing 
list
>>>>>>>> Gluster-users at gluster.org
>>>>>>>> https://lists.gluster.org/mailman/listinfo/gluster-users
>>>>>>>
>>>>>>>

From dijuremo at gmail.com Fri Jan 18 13:35:12 2019
From: dijuremo at gmail.com (Diego Remolina)
Date: Fri, 18 Jan 2019 08:35:12 -0500
Subject: [Gluster-users] To good to be truth speed improvements?
In-Reply-To:
References:
Message-ID:

The OP (me) has a two-node setup. I am not sure how many nodes are in Artem's
configuration (he is running 4.0.2).

It can make sense that the more bricks you have, the higher the performance
hit in certain conditions, given that supposedly one of the issues of
gluster with many small files is that gluster has to stat the files in all
the bricks (I would assume where they reside), so this is what creates the
high latencies which lead to bad performance with many small files.

I am no expert on the internals of how it works, so I am not 100% sure
though.

Diego

On Fri, Jan 18, 2019 at 8:22 AM Andreas Davour wrote:

> On Thu, 17 Jan 2019, Artem Russakovskii wrote:
>
> > When we first started with glusterfs and version 3 last year, we also had a
> > ton of performance issues, especially with small files. I've made several
> > reports at the time, hopefully some of them helped.
> >
> > However, at some point, possibly after updating to v4 (currently using
> > 4.0.2), the performance issues went away. Poof. I imagine when we upgrade
> > further, performance may improve even more.
> >
> > When we were running v3, I was desperately considering
> > alternatives/competitors. Not anymore, gluster has performed perfectly and
> > without a single durability issue for almost a year now.
>
> If I understood correctly you only have two bricks, is that correct? I'm
> working at a site with gazillions of bricks and the performance is
> terrible, and I suspect there's a sweetspot where you don't want too
> many bricks compared to nodes. It would be interesting to investigate
> that further.
>
> /andreas
>
> --
> "economics is a pseudoscience; the astrology of our time"
> Kim Stanley Robinson

From archon810 at gmail.com Fri Jan 18 15:33:51 2019
From: archon810 at gmail.com (Artem Russakovskii)
Date: Fri, 18 Jan 2019 07:33:51 -0800
Subject: [Gluster-users] To good to be truth speed improvements?
In-Reply-To:
References:
Message-ID:

I actually have 4 bricks with no arbiters. A fixed quorum count of 1 ensures
the files will be accessible even if all but 1 brick go down.

Performance is good enough, though it can always be better, of course.

On Fri, Jan 18, 2019, 5:37 AM Andreas Davour wrote:

> On Fri, 18 Jan 2019, Diego Remolina wrote:
>
> > The OP (me) has a two-node setup. I am not sure how many nodes are in Artem's
> > configuration (he is running 4.0.2).
> >
> > It can make sense that the more bricks you have, the higher the performance
> > hit in certain conditions, given that supposedly one of the issues of
> > gluster with many small files is that gluster has to stat the files in all
> > the bricks (I would assume where they reside), so this is what creates the
> > high latencies which lead to bad performance with many small files.
> >
> > I am no expert on the internals of how it works, so I am not 100% sure
> > though.
>
> Ah! Look at that. I mixed up you and Artem. Sorry about that!
>
> /andreas
>
> --
> "economics is a pseudoscience; the astrology of our time"
> Kim Stanley Robinson
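To make the quorum remark in Artem's message above concrete, here is a minimal Python sketch of how client quorum is decided for a replica set under cluster.quorum-type "fixed" and "auto", following the documented option descriptions (fixed: at least quorum-count bricks up; auto: more than half of the bricks up, or exactly half including the first brick). It assumes the four bricks form a single replica set, which the message does not state, and the function name quorum_met is made up for this illustration; it is not gluster code.

```python
def quorum_met(bricks_up, quorum_type, quorum_count=None):
    """Decide whether client quorum is met for one replica set.

    bricks_up    -- list of booleans in brick order (True = brick is up)
    quorum_type  -- "none", "fixed" or "auto"
    quorum_count -- required number of live bricks when type is "fixed"
    """
    alive = sum(bricks_up)
    total = len(bricks_up)
    if quorum_type == "none":
        return True
    if quorum_type == "fixed":
        return alive >= quorum_count
    if quorum_type == "auto":
        if 2 * alive > total:
            return True
        # Exactly half of the bricks up: the first brick breaks the tie.
        return 2 * alive == total and bricks_up[0]
    raise ValueError("unknown quorum type: %r" % quorum_type)


if __name__ == "__main__":
    # Four bricks, only the last one still up.
    one_left = [False, False, False, True]
    print(quorum_met(one_left, "fixed", quorum_count=1))
    print(quorum_met(one_left, "auto"))
```

With a fixed count of 1 the set stays usable with a single surviving brick, which matches the behaviour described in the message; the price is that nothing prevents two isolated halves of the replica set from accepting conflicting writes.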
URL: From amukherj at redhat.com Sat Jan 19 02:25:32 2019 From: amukherj at redhat.com (Atin Mukherjee) Date: Sat, 19 Jan 2019 07:55:32 +0530 Subject: [Gluster-users] glusterfs 4.1.6 error in starting glusterd service In-Reply-To: References: Message-ID: I have received but haven?t got a chance to look at them. I can only come back on this sometime early next week based on my schedule. On Fri, 18 Jan 2019 at 16:52, Amudhan P wrote: > Hi Atin, > > I have sent files to your email directly in other mail. hope you have > received. > > regards > Amudhan > > On Thu, Jan 17, 2019 at 3:43 PM Atin Mukherjee > wrote: > >> Can you please run 'glusterd -LDEBUG' and share back the glusterd.log? >> Instead of doing too many back and forth I suggest you to share the content >> of /var/lib/glusterd from all the nodes. Also do mention which particular >> node the glusterd service is unable to come up. >> >> On Thu, Jan 17, 2019 at 11:34 AM Amudhan P wrote: >> >>> I have created the folder in the path as said but still, service failed >>> to start below is the error msg in glusterd.log >>> >>> [2019-01-16 14:50:14.555742] I [MSGID: 100030] [glusterfsd.c:2741:main] >>> 0-/usr/local/sbin/glusterd: Started running /usr/local/sbin/glusterd >>> version 4.1.6 (args: /usr/local/sbin/glusterd -p /var/run/glusterd.pid) >>> [2019-01-16 14:50:14.559835] I [MSGID: 106478] [glusterd.c:1423:init] >>> 0-management: Maximum allowed open file descriptors set to 65536 >>> [2019-01-16 14:50:14.559894] I [MSGID: 106479] [glusterd.c:1481:init] >>> 0-management: Using /var/lib/glusterd as working directory >>> [2019-01-16 14:50:14.559912] I [MSGID: 106479] [glusterd.c:1486:init] >>> 0-management: Using /var/run/gluster as pid file working directory >>> [2019-01-16 14:50:14.563834] W [MSGID: 103071] >>> [rdma.c:4629:__gf_rdma_ctx_create] 0-rpc-transport/rdma: rdma_cm event >>> channel creation failed [No such device] >>> [2019-01-16 14:50:14.563867] W [MSGID: 103055] [rdma.c:4938:init] >>> 
0-rdma.management: Failed to initialize IB Device >>> [2019-01-16 14:50:14.563882] W [rpc-transport.c:351:rpc_transport_load] >>> 0-rpc-transport: 'rdma' initialization failed >>> [2019-01-16 14:50:14.563957] W [rpcsvc.c:1781:rpcsvc_create_listener] >>> 0-rpc-service: cannot create listener, initing the transport failed >>> [2019-01-16 14:50:14.563974] E [MSGID: 106244] [glusterd.c:1764:init] >>> 0-management: creation of 1 listeners failed, continuing with succeeded >>> transport >>> [2019-01-16 14:50:15.565868] I [MSGID: 106513] >>> [glusterd-store.c:2240:glusterd_restore_op_version] 0-glusterd: retrieved >>> op-version: 40100 >>> [2019-01-16 14:50:15.642532] I [MSGID: 106544] >>> [glusterd.c:158:glusterd_uuid_init] 0-management: retrieved UUID: >>> d6bf51a7-c296-492f-8dac-e81efa9dd22d >>> [2019-01-16 14:50:15.675333] I [MSGID: 106498] >>> [glusterd-handler.c:3614:glusterd_friend_add_from_peerinfo] 0-management: >>> connect returned 0 >>> [2019-01-16 14:50:15.675421] W [MSGID: 106061] >>> [glusterd-handler.c:3408:glusterd_transport_inet_options_build] 0-glusterd: >>> Failed to get tcp-user-timeout >>> [2019-01-16 14:50:15.675451] I >>> [rpc-clnt.c:1059:rpc_clnt_connection_init] 0-management: setting >>> frame-timeout to 600 >>> *[2019-01-16 14:50:15.676912] E [MSGID: 106187] >>> [glusterd-store.c:4662:glusterd_resolve_all_bricks] 0-glusterd: resolve >>> brick failed in restore* >>> *[2019-01-16 14:50:15.676956] E [MSGID: 101019] >>> [xlator.c:720:xlator_init] 0-management: Initialization of volume >>> 'management' failed, review your volfile again* >>> [2019-01-16 14:50:15.676973] E [MSGID: 101066] >>> [graph.c:367:glusterfs_graph_init] 0-management: initializing translator >>> failed >>> [2019-01-16 14:50:15.676986] E [MSGID: 101176] >>> [graph.c:738:glusterfs_graph_activate] 0-graph: init failed >>> [2019-01-16 14:50:15.677479] W [glusterfsd.c:1514:cleanup_and_exit] >>> (-->/usr/local/sbin/glusterd(glusterfs_volumes_init+0xc2) [0x409f52] >>> 
-->/usr/local/sbin/glusterd(glusterfs_process_volfp+0x151) [0x409e41] >>> -->/usr/local/sbin/glusterd(cleanup_and_exit+0x5f) [0x40942f] ) 0-: >>> received signum (-1), shutting down >>> >>> >>> On Thu, Jan 17, 2019 at 8:06 AM Atin Mukherjee >>> wrote: >>> >>>> If gluster volume info/status shows the brick to be /media/disk4/brick4 >>>> then you'd need to mount the same path and hence you'd need to create the >>>> brick4 directory explicitly. I fail to understand the rationale how only >>>> /media/disk4 can be used as the mount path for the brick. >>>> >>>> On Wed, Jan 16, 2019 at 5:24 PM Amudhan P wrote: >>>> >>>>> Yes, I did mount bricks but the folder 'brick4' was still not created >>>>> inside the brick. >>>>> Do I need to create this folder because when I run replace-brick it >>>>> will create folder inside the brick. I have seen this behavior before when >>>>> running replace-brick or heal begins. >>>>> >>>>> On Wed, Jan 16, 2019 at 5:05 PM Atin Mukherjee >>>>> wrote: >>>>> >>>>>> >>>>>> >>>>>> On Wed, Jan 16, 2019 at 5:02 PM Amudhan P >>>>>> wrote: >>>>>> >>>>>>> Atin, >>>>>>> I have copied the content of 'gfs-tst' from vol folder in another >>>>>>> node. when starting service again fails with error msg in glusterd.log file. 
>>>>>>> >>>>>>> [2019-01-15 20:16:59.513023] I [MSGID: 100030] >>>>>>> [glusterfsd.c:2741:main] 0-/usr/local/sbin/glusterd: Started running >>>>>>> /usr/local/sbin/glusterd version 4.1.6 (args: /usr/local/sbin/glusterd -p >>>>>>> /var/run/glusterd.pid) >>>>>>> [2019-01-15 20:16:59.517164] I [MSGID: 106478] >>>>>>> [glusterd.c:1423:init] 0-management: Maximum allowed open file descriptors >>>>>>> set to 65536 >>>>>>> [2019-01-15 20:16:59.517264] I [MSGID: 106479] >>>>>>> [glusterd.c:1481:init] 0-management: Using /var/lib/glusterd as working >>>>>>> directory >>>>>>> [2019-01-15 20:16:59.517283] I [MSGID: 106479] >>>>>>> [glusterd.c:1486:init] 0-management: Using /var/run/gluster as pid file >>>>>>> working directory >>>>>>> [2019-01-15 20:16:59.521508] W [MSGID: 103071] >>>>>>> [rdma.c:4629:__gf_rdma_ctx_create] 0-rpc-transport/rdma: rdma_cm event >>>>>>> channel creation failed [No such device] >>>>>>> [2019-01-15 20:16:59.521544] W [MSGID: 103055] [rdma.c:4938:init] >>>>>>> 0-rdma.management: Failed to initialize IB Device >>>>>>> [2019-01-15 20:16:59.521562] W >>>>>>> [rpc-transport.c:351:rpc_transport_load] 0-rpc-transport: 'rdma' >>>>>>> initialization failed >>>>>>> [2019-01-15 20:16:59.521629] W >>>>>>> [rpcsvc.c:1781:rpcsvc_create_listener] 0-rpc-service: cannot create >>>>>>> listener, initing the transport failed >>>>>>> [2019-01-15 20:16:59.521648] E [MSGID: 106244] >>>>>>> [glusterd.c:1764:init] 0-management: creation of 1 listeners failed, >>>>>>> continuing with succeeded transport >>>>>>> [2019-01-15 20:17:00.529390] I [MSGID: 106513] >>>>>>> [glusterd-store.c:2240:glusterd_restore_op_version] 0-glusterd: retrieved >>>>>>> op-version: 40100 >>>>>>> [2019-01-15 20:17:00.608354] I [MSGID: 106544] >>>>>>> [glusterd.c:158:glusterd_uuid_init] 0-management: retrieved UUID: >>>>>>> d6bf51a7-c296-492f-8dac-e81efa9dd22d >>>>>>> [2019-01-15 20:17:00.650911] W [MSGID: 106425] >>>>>>> [glusterd-store.c:2643:glusterd_store_retrieve_bricks] 0-management: failed 
>>>>>>> to get statfs() call on brick /media/disk4/brick4 [No such file or >>>>>>> directory] >>>>>>> >>>>>> >>>>>> This means that underlying brick /media/disk4/brick4 doesn't exist. >>>>>> You already mentioned that you had replaced the faulty disk, but have you >>>>>> not mounted it yet? >>>>>> >>>>>> >>>>>>> [2019-01-15 20:17:00.691240] I [MSGID: 106498] >>>>>>> [glusterd-handler.c:3614:glusterd_friend_add_from_peerinfo] 0-management: >>>>>>> connect returned 0 >>>>>>> [2019-01-15 20:17:00.691307] W [MSGID: 106061] >>>>>>> [glusterd-handler.c:3408:glusterd_transport_inet_options_build] 0-glusterd: >>>>>>> Failed to get tcp-user-timeout >>>>>>> [2019-01-15 20:17:00.691331] I >>>>>>> [rpc-clnt.c:1059:rpc_clnt_connection_init] 0-management: setting >>>>>>> frame-timeout to 600 >>>>>>> [2019-01-15 20:17:00.692547] E [MSGID: 106187] >>>>>>> [glusterd-store.c:4662:glusterd_resolve_all_bricks] 0-glusterd: resolve >>>>>>> brick failed in restore >>>>>>> [2019-01-15 20:17:00.692582] E [MSGID: 101019] >>>>>>> [xlator.c:720:xlator_init] 0-management: Initialization of volume >>>>>>> 'management' failed, review your volfile again >>>>>>> [2019-01-15 20:17:00.692597] E [MSGID: 101066] >>>>>>> [graph.c:367:glusterfs_graph_init] 0-management: initializing translator >>>>>>> failed >>>>>>> [2019-01-15 20:17:00.692607] E [MSGID: 101176] >>>>>>> [graph.c:738:glusterfs_graph_activate] 0-graph: init failed >>>>>>> [2019-01-15 20:17:00.693004] W [glusterfsd.c:1514:cleanup_and_exit] >>>>>>> (-->/usr/local/sbin/glusterd(glusterfs_volumes_init+0xc2) [0x409f52] >>>>>>> -->/usr/local/sbin/glusterd(glusterfs_process_volfp+0x151) [0x409e41] >>>>>>> -->/usr/local/sbin/glusterd(cleanup_and_exit+0x5f) [0x40942f] ) 0-: >>>>>>> received signum (-1), shutting down >>>>>>> >>>>>>> >>>>>>> On Wed, Jan 16, 2019 at 4:34 PM Atin Mukherjee >>>>>>> wrote: >>>>>>> >>>>>>>> This is a case of partial write of a transaction and as the host >>>>>>>> ran out of space for the root partition where all the 
glusterd related >>>>>>>> configurations are persisted, the transaction couldn't be written and hence >>>>>>>> the new (replaced) brick's information wasn't persisted in the >>>>>>>> configuration. The workaround for this is to copy the content of >>>>>>>> /var/lib/glusterd/vols/gfs-tst/ from one of the nodes in the trusted >>>>>>>> storage pool to the node where glusterd service fails to come up and post >>>>>>>> that restarting the glusterd service should be able to make peer status >>>>>>>> reporting all nodes healthy and connected. >>>>>>>> >>>>>>>> On Wed, Jan 16, 2019 at 3:49 PM Amudhan P >>>>>>>> wrote: >>>>>>>> >>>>>>>>> Hi, >>>>>>>>> >>>>>>>>> In short, when I started glusterd service I am getting following >>>>>>>>> error msg in the glusterd.log file in one server. >>>>>>>>> what needs to be done? >>>>>>>>> >>>>>>>>> error logged in glusterd.log >>>>>>>>> >>>>>>>>> [2019-01-15 17:50:13.956053] I [MSGID: 100030] >>>>>>>>> [glusterfsd.c:2741:main] 0-/usr/local/sbin/glusterd: Started running >>>>>>>>> /usr/local/sbin/glusterd version 4.1.6 (args: /usr/local/sbin/glusterd -p >>>>>>>>> /var/run/glusterd.pid) >>>>>>>>> [2019-01-15 17:50:13.960131] I [MSGID: 106478] >>>>>>>>> [glusterd.c:1423:init] 0-management: Maximum allowed open file descriptors >>>>>>>>> set to 65536 >>>>>>>>> [2019-01-15 17:50:13.960193] I [MSGID: 106479] >>>>>>>>> [glusterd.c:1481:init] 0-management: Using /var/lib/glusterd as working >>>>>>>>> directory >>>>>>>>> [2019-01-15 17:50:13.960212] I [MSGID: 106479] >>>>>>>>> [glusterd.c:1486:init] 0-management: Using /var/run/gluster as pid file >>>>>>>>> working directory >>>>>>>>> [2019-01-15 17:50:13.964437] W [MSGID: 103071] >>>>>>>>> [rdma.c:4629:__gf_rdma_ctx_create] 0-rpc-transport/rdma: rdma_cm event >>>>>>>>> channel creation failed [No such device] >>>>>>>>> [2019-01-15 17:50:13.964474] W [MSGID: 103055] [rdma.c:4938:init] >>>>>>>>> 0-rdma.management: Failed to initialize IB Device >>>>>>>>> [2019-01-15 17:50:13.964491] W >>>>>>>>> 
[rpc-transport.c:351:rpc_transport_load] 0-rpc-transport: 'rdma' >>>>>>>>> initialization failed >>>>>>>>> [2019-01-15 17:50:13.964560] W >>>>>>>>> [rpcsvc.c:1781:rpcsvc_create_listener] 0-rpc-service: cannot create >>>>>>>>> listener, initing the transport failed >>>>>>>>> [2019-01-15 17:50:13.964579] E [MSGID: 106244] >>>>>>>>> [glusterd.c:1764:init] 0-management: creation of 1 listeners failed, >>>>>>>>> continuing with succeeded transport >>>>>>>>> [2019-01-15 17:50:14.967681] I [MSGID: 106513] >>>>>>>>> [glusterd-store.c:2240:glusterd_restore_op_version] 0-glusterd: retrieved >>>>>>>>> op-version: 40100 >>>>>>>>> [2019-01-15 17:50:14.973931] I [MSGID: 106544] >>>>>>>>> [glusterd.c:158:glusterd_uuid_init] 0-management: retrieved UUID: >>>>>>>>> d6bf51a7-c296-492f-8dac-e81efa9dd22d >>>>>>>>> [2019-01-15 17:50:15.046620] E [MSGID: 101032] >>>>>>>>> [store.c:441:gf_store_handle_retrieve] 0-: Path corresponding to >>>>>>>>> /var/lib/glusterd/vols/gfs-tst/bricks/IP.3:-media-disk3-brick3. [No such >>>>>>>>> file or directory] >>>>>>>>> [2019-01-15 17:50:15.046685] E [MSGID: 106201] >>>>>>>>> [glusterd-store.c:3384:glusterd_store_retrieve_volumes] 0-management: >>>>>>>>> Unable to restore volume: gfs-tst >>>>>>>>> [2019-01-15 17:50:15.046718] E [MSGID: 101019] >>>>>>>>> [xlator.c:720:xlator_init] 0-management: Initialization of volume >>>>>>>>> 'management' failed, review your volfile again >>>>>>>>> [2019-01-15 17:50:15.046732] E [MSGID: 101066] >>>>>>>>> [graph.c:367:glusterfs_graph_init] 0-management: initializing translator >>>>>>>>> failed >>>>>>>>> [2019-01-15 17:50:15.046741] E [MSGID: 101176] >>>>>>>>> [graph.c:738:glusterfs_graph_activate] 0-graph: init failed >>>>>>>>> [2019-01-15 17:50:15.047171] W >>>>>>>>> [glusterfsd.c:1514:cleanup_and_exit] >>>>>>>>> (-->/usr/local/sbin/glusterd(glusterfs_volumes >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> In long, I am trying to simulate a situation. 
where volume stopped
>>>>>>>>> abnormally and
>>>>>>>>> entire cluster restarted with some missing disks.
>>>>>>>>>
>>>>>>>>> My test cluster is set up with 3 nodes and each has four disks, I
>>>>>>>>> have setup a volume with disperse 4+2.
>>>>>>>>> In Node-3 2 disks have failed, to replace I have shutdown all
>>>>>>>>> system
>>>>>>>>>
>>>>>>>>> below are the steps done.
>>>>>>>>>
>>>>>>>>> 1. umount from client machine
>>>>>>>>> 2. shutdown all system by running `shutdown -h now` command (without stopping volume and stop service)
>>>>>>>>> 3. replace faulty disk in Node-3
>>>>>>>>> 4. powered ON all system
>>>>>>>>> 5. format replaced drives, and mount all drives
>>>>>>>>> 6. start glusterd service in all node (success)
>>>>>>>>> 7. Now running `volume status` command from node-3
>>>>>>>>> output : [2019-01-15 16:52:17.718422] : v status : FAILED : Staging failed on 0083ec0c-40bf-472a-a128-458924e56c96. Please check log file for details.
>>>>>>>>> 8. running `volume start gfs-tst` command from node-3
>>>>>>>>> output : [2019-01-15 16:53:19.410252] : v start gfs-tst : FAILED : Volume gfs-tst already started
>>>>>>>>>
>>>>>>>>> 9. running `gluster v status` in other node.
showing all brick >>>>>>>>> available but 'self-heal daemon' not running >>>>>>>>> @gfstst-node2:~$ sudo gluster v status >>>>>>>>> Status of volume: gfs-tst >>>>>>>>> Gluster process TCP Port RDMA Port >>>>>>>>> Online Pid >>>>>>>>> >>>>>>>>> ------------------------------------------------------------------------------ >>>>>>>>> Brick IP.2:/media/disk1/brick1 49152 0 Y >>>>>>>>> 1517 >>>>>>>>> Brick IP.4:/media/disk1/brick1 49152 0 Y >>>>>>>>> 1668 >>>>>>>>> Brick IP.2:/media/disk2/brick2 49153 0 Y >>>>>>>>> 1522 >>>>>>>>> Brick IP.4:/media/disk2/brick2 49153 0 Y >>>>>>>>> 1678 >>>>>>>>> Brick IP.2:/media/disk3/brick3 49154 0 Y >>>>>>>>> 1527 >>>>>>>>> Brick IP.4:/media/disk3/brick3 49154 0 Y >>>>>>>>> 1677 >>>>>>>>> Brick IP.2:/media/disk4/brick4 49155 0 Y >>>>>>>>> 1541 >>>>>>>>> Brick IP.4:/media/disk4/brick4 49155 0 Y >>>>>>>>> 1683 >>>>>>>>> Self-heal Daemon on localhost N/A N/A >>>>>>>>> Y 2662 >>>>>>>>> Self-heal Daemon on IP.4 N/A N/A Y >>>>>>>>> 2786 >>>>>>>>> >>>>>>>>> 10. in the above output 'volume already started'. so, running >>>>>>>>> `reset-brick` command >>>>>>>>> v reset-brick gfs-tst IP.3:/media/disk3/brick3 >>>>>>>>> IP.3:/media/disk3/brick3 commit force >>>>>>>>> >>>>>>>>> output : [2019-01-15 16:57:37.916942] : v reset-brick gfs-tst >>>>>>>>> IP.3:/media/disk3/brick3 IP.3:/media/disk3/brick3 commit force : FAILED : >>>>>>>>> /media/disk3/brick3 is already part of a volume >>>>>>>>> >>>>>>>>> 11. reset-brick command was not working, so, tried stopping volume >>>>>>>>> and start with force command >>>>>>>>> output : [2019-01-15 17:01:04.570794] : v start gfs-tst force : >>>>>>>>> FAILED : Pre-validation failed on localhost. Please check log file for >>>>>>>>> details >>>>>>>>> >>>>>>>>> 12. now stopped service in all node and tried starting again. >>>>>>>>> except node-3 other nodes service started successfully without any issues. >>>>>>>>> >>>>>>>>> in node-3 receiving following message. 
>>>>>>>>> >>>>>>>>> sudo service glusterd start >>>>>>>>> * Starting glusterd service glusterd >>>>>>>>> >>>>>>>>> [fail] >>>>>>>>> /usr/local/sbin/glusterd: option requires an argument -- 'f' >>>>>>>>> Try `glusterd --help' or `glusterd --usage' for more information. >>>>>>>>> >>>>>>>>> 13. checking glusterd log file found that OS drive was running out >>>>>>>>> of space >>>>>>>>> output : [2019-01-15 16:51:37.210792] W [MSGID: 101012] >>>>>>>>> [store.c:372:gf_store_save_value] 0-management: fflush failed. [No space >>>>>>>>> left on device] >>>>>>>>> [2019-01-15 16:51:37.210874] E [MSGID: 106190] >>>>>>>>> [glusterd-store.c:1058:glusterd_volume_exclude_options_write] 0-management: >>>>>>>>> Unable to write volume values for gfs-tst >>>>>>>>> >>>>>>>>> 14. cleared some space in OS drive but still, service is not >>>>>>>>> running. below is the error logged in glusterd.log >>>>>>>>> >>>>>>>>> [2019-01-15 17:50:13.956053] I [MSGID: 100030] >>>>>>>>> [glusterfsd.c:2741:main] 0-/usr/local/sbin/glusterd: Started running >>>>>>>>> /usr/local/sbin/glusterd version 4.1.6 (args: /usr/local/sbin/glusterd -p >>>>>>>>> /var/run/glusterd.pid) >>>>>>>>> [2019-01-15 17:50:13.960131] I [MSGID: 106478] >>>>>>>>> [glusterd.c:1423:init] 0-management: Maximum allowed open file descriptors >>>>>>>>> set to 65536 >>>>>>>>> [2019-01-15 17:50:13.960193] I [MSGID: 106479] >>>>>>>>> [glusterd.c:1481:init] 0-management: Using /var/lib/glusterd as working >>>>>>>>> directory >>>>>>>>> [2019-01-15 17:50:13.960212] I [MSGID: 106479] >>>>>>>>> [glusterd.c:1486:init] 0-management: Using /var/run/gluster as pid file >>>>>>>>> working directory >>>>>>>>> [2019-01-15 17:50:13.964437] W [MSGID: 103071] >>>>>>>>> [rdma.c:4629:__gf_rdma_ctx_create] 0-rpc-transport/rdma: rdma_cm event >>>>>>>>> channel creation failed [No such device] >>>>>>>>> [2019-01-15 17:50:13.964474] W [MSGID: 103055] [rdma.c:4938:init] >>>>>>>>> 0-rdma.management: Failed to initialize IB Device >>>>>>>>> [2019-01-15 
17:50:13.964491] W >>>>>>>>> [rpc-transport.c:351:rpc_transport_load] 0-rpc-transport: 'rdma' >>>>>>>>> initialization failed >>>>>>>>> [2019-01-15 17:50:13.964560] W >>>>>>>>> [rpcsvc.c:1781:rpcsvc_create_listener] 0-rpc-service: cannot create >>>>>>>>> listener, initing the transport failed >>>>>>>>> [2019-01-15 17:50:13.964579] E [MSGID: 106244] >>>>>>>>> [glusterd.c:1764:init] 0-management: creation of 1 listeners failed, >>>>>>>>> continuing with succeeded transport >>>>>>>>> [2019-01-15 17:50:14.967681] I [MSGID: 106513] >>>>>>>>> [glusterd-store.c:2240:glusterd_restore_op_version] 0-glusterd: retrieved >>>>>>>>> op-version: 40100 >>>>>>>>> [2019-01-15 17:50:14.973931] I [MSGID: 106544] >>>>>>>>> [glusterd.c:158:glusterd_uuid_init] 0-management: retrieved UUID: >>>>>>>>> d6bf51a7-c296-492f-8dac-e81efa9dd22d >>>>>>>>> [2019-01-15 17:50:15.046620] E [MSGID: 101032] >>>>>>>>> [store.c:441:gf_store_handle_retrieve] 0-: Path corresponding to >>>>>>>>> /var/lib/glusterd/vols/gfs-tst/bricks/IP.3:-media-disk3-brick3. 
[No such >>>>>>>>> file or directory] >>>>>>>>> [2019-01-15 17:50:15.046685] E [MSGID: 106201] >>>>>>>>> [glusterd-store.c:3384:glusterd_store_retrieve_volumes] 0-management: >>>>>>>>> Unable to restore volume: gfs-tst >>>>>>>>> [2019-01-15 17:50:15.046718] E [MSGID: 101019] >>>>>>>>> [xlator.c:720:xlator_init] 0-management: Initialization of volume >>>>>>>>> 'management' failed, review your volfile again >>>>>>>>> [2019-01-15 17:50:15.046732] E [MSGID: 101066] >>>>>>>>> [graph.c:367:glusterfs_graph_init] 0-management: initializing translator >>>>>>>>> failed >>>>>>>>> [2019-01-15 17:50:15.046741] E [MSGID: 101176] >>>>>>>>> [graph.c:738:glusterfs_graph_activate] 0-graph: init failed >>>>>>>>> [2019-01-15 17:50:15.047171] W >>>>>>>>> [glusterfsd.c:1514:cleanup_and_exit] >>>>>>>>> (-->/usr/local/sbin/glusterd(glusterfs_volumes_init+0xc2) [0x409f52] >>>>>>>>> -->/usr/local/sbin/glusterd(glusterfs_process_volfp+0x151) [0x409e41] >>>>>>>>> -->/usr/local/sbin/glusterd(cleanup_and_exit+0x5f) [0x40942f] ) 0-: >>>>>>>>> received signum (-1), shutting down >>>>>>>>> >>>>>>>>> >>>>>>>>> 15. 
In other node running `volume status' still shows bricks node3 >>>>>>>>> is live >>>>>>>>> but 'peer status' showing node-3 disconnected >>>>>>>>> >>>>>>>>> @gfstst-node2:~$ sudo gluster v status >>>>>>>>> Status of volume: gfs-tst >>>>>>>>> Gluster process TCP Port RDMA Port >>>>>>>>> Online Pid >>>>>>>>> >>>>>>>>> ------------------------------------------------------------------------------ >>>>>>>>> Brick IP.2:/media/disk1/brick1 49152 0 Y >>>>>>>>> 1517 >>>>>>>>> Brick IP.4:/media/disk1/brick1 49152 0 Y >>>>>>>>> 1668 >>>>>>>>> Brick IP.2:/media/disk2/brick2 49153 0 Y >>>>>>>>> 1522 >>>>>>>>> Brick IP.4:/media/disk2/brick2 49153 0 Y >>>>>>>>> 1678 >>>>>>>>> Brick IP.2:/media/disk3/brick3 49154 0 Y >>>>>>>>> 1527 >>>>>>>>> Brick IP.4:/media/disk3/brick3 49154 0 Y >>>>>>>>> 1677 >>>>>>>>> Brick IP.2:/media/disk4/brick4 49155 0 Y >>>>>>>>> 1541 >>>>>>>>> Brick IP.4:/media/disk4/brick4 49155 0 Y >>>>>>>>> 1683 >>>>>>>>> Self-heal Daemon on localhost N/A N/A Y >>>>>>>>> 2662 >>>>>>>>> Self-heal Daemon on IP.4 N/A N/A Y >>>>>>>>> 2786 >>>>>>>>> >>>>>>>>> Task Status of Volume gfs-tst >>>>>>>>> >>>>>>>>> ------------------------------------------------------------------------------ >>>>>>>>> There are no active volume tasks >>>>>>>>> >>>>>>>>> >>>>>>>>> root at gfstst-node2:~$ sudo gluster pool list >>>>>>>>> UUID Hostname State >>>>>>>>> d6bf51a7-c296-492f-8dac-e81efa9dd22d IP.3 Disconnected >>>>>>>>> c1cbb58e-3ceb-4637-9ba3-3d28ef20b143 IP.4 Connected >>>>>>>>> 0083ec0c-40bf-472a-a128-458924e56c96 localhost Connected >>>>>>>>> >>>>>>>>> root at gfstst-node2:~$ sudo gluster peer status >>>>>>>>> Number of Peers: 2 >>>>>>>>> >>>>>>>>> Hostname: IP.3 >>>>>>>>> Uuid: d6bf51a7-c296-492f-8dac-e81efa9dd22d >>>>>>>>> State: Peer in Cluster (Disconnected) >>>>>>>>> >>>>>>>>> Hostname: IP.4 >>>>>>>>> Uuid: c1cbb58e-3ceb-4637-9ba3-3d28ef20b143 >>>>>>>>> State: Peer in Cluster (Connected) >>>>>>>>> >>>>>>>>> >>>>>>>>> regards >>>>>>>>> Amudhan >>>>>>>>> 
_______________________________________________
>>>>>>>>> Gluster-users mailing list
>>>>>>>>> Gluster-users at gluster.org
>>>>>>>>> https://lists.gluster.org/mailman/listinfo/gluster-users
>>>>>>>>
>>>>>>>>
--
- Atin (atinm)

From amudhan83 at gmail.com Sun Jan 20 02:59:06 2019
From: amudhan83 at gmail.com (Amudhan P)
Date: Sun, 20 Jan 2019 08:29:06 +0530
Subject: [Gluster-users] glusterfs 4.1.6 error in starting glusterd service
In-Reply-To:
References:
Message-ID:

Ok, no problem.

On Sat 19 Jan, 2019, 7:55 AM Atin Mukherjee wrote:

> I have received but haven't got a chance to look at them. I can only come
> back on this sometime early next week based on my schedule.
>
> On Fri, 18 Jan 2019 at 16:52, Amudhan P wrote:
>
>> Hi Atin,
>>
>> I have sent files to your email directly in other mail. hope you have
>> received.
>>
>> regards
>> Amudhan
>>
>> On Thu, Jan 17, 2019 at 3:43 PM Atin Mukherjee wrote:
>>
>>> Can you please run 'glusterd -LDEBUG' and share back the glusterd.log?
>>> Instead of doing too many back and forth I suggest you to share the content
>>> of /var/lib/glusterd from all the nodes. Also do mention which particular
>>> node the glusterd service is unable to come up.
>>> >>> On Thu, Jan 17, 2019 at 11:34 AM Amudhan P wrote: >>> >>>> I have created the folder in the path as said but still, service failed >>>> to start below is the error msg in glusterd.log >>>> >>>> [2019-01-16 14:50:14.555742] I [MSGID: 100030] [glusterfsd.c:2741:main] >>>> 0-/usr/local/sbin/glusterd: Started running /usr/local/sbin/glusterd >>>> version 4.1.6 (args: /usr/local/sbin/glusterd -p /var/run/glusterd.pid) >>>> [2019-01-16 14:50:14.559835] I [MSGID: 106478] [glusterd.c:1423:init] >>>> 0-management: Maximum allowed open file descriptors set to 65536 >>>> [2019-01-16 14:50:14.559894] I [MSGID: 106479] [glusterd.c:1481:init] >>>> 0-management: Using /var/lib/glusterd as working directory >>>> [2019-01-16 14:50:14.559912] I [MSGID: 106479] [glusterd.c:1486:init] >>>> 0-management: Using /var/run/gluster as pid file working directory >>>> [2019-01-16 14:50:14.563834] W [MSGID: 103071] >>>> [rdma.c:4629:__gf_rdma_ctx_create] 0-rpc-transport/rdma: rdma_cm event >>>> channel creation failed [No such device] >>>> [2019-01-16 14:50:14.563867] W [MSGID: 103055] [rdma.c:4938:init] >>>> 0-rdma.management: Failed to initialize IB Device >>>> [2019-01-16 14:50:14.563882] W [rpc-transport.c:351:rpc_transport_load] >>>> 0-rpc-transport: 'rdma' initialization failed >>>> [2019-01-16 14:50:14.563957] W [rpcsvc.c:1781:rpcsvc_create_listener] >>>> 0-rpc-service: cannot create listener, initing the transport failed >>>> [2019-01-16 14:50:14.563974] E [MSGID: 106244] [glusterd.c:1764:init] >>>> 0-management: creation of 1 listeners failed, continuing with succeeded >>>> transport >>>> [2019-01-16 14:50:15.565868] I [MSGID: 106513] >>>> [glusterd-store.c:2240:glusterd_restore_op_version] 0-glusterd: retrieved >>>> op-version: 40100 >>>> [2019-01-16 14:50:15.642532] I [MSGID: 106544] >>>> [glusterd.c:158:glusterd_uuid_init] 0-management: retrieved UUID: >>>> d6bf51a7-c296-492f-8dac-e81efa9dd22d >>>> [2019-01-16 14:50:15.675333] I [MSGID: 106498] >>>> 
[glusterd-handler.c:3614:glusterd_friend_add_from_peerinfo] 0-management: >>>> connect returned 0 >>>> [2019-01-16 14:50:15.675421] W [MSGID: 106061] >>>> [glusterd-handler.c:3408:glusterd_transport_inet_options_build] 0-glusterd: >>>> Failed to get tcp-user-timeout >>>> [2019-01-16 14:50:15.675451] I >>>> [rpc-clnt.c:1059:rpc_clnt_connection_init] 0-management: setting >>>> frame-timeout to 600 >>>> *[2019-01-16 14:50:15.676912] E [MSGID: 106187] >>>> [glusterd-store.c:4662:glusterd_resolve_all_bricks] 0-glusterd: resolve >>>> brick failed in restore* >>>> *[2019-01-16 14:50:15.676956] E [MSGID: 101019] >>>> [xlator.c:720:xlator_init] 0-management: Initialization of volume >>>> 'management' failed, review your volfile again* >>>> [2019-01-16 14:50:15.676973] E [MSGID: 101066] >>>> [graph.c:367:glusterfs_graph_init] 0-management: initializing translator >>>> failed >>>> [2019-01-16 14:50:15.676986] E [MSGID: 101176] >>>> [graph.c:738:glusterfs_graph_activate] 0-graph: init failed >>>> [2019-01-16 14:50:15.677479] W [glusterfsd.c:1514:cleanup_and_exit] >>>> (-->/usr/local/sbin/glusterd(glusterfs_volumes_init+0xc2) [0x409f52] >>>> -->/usr/local/sbin/glusterd(glusterfs_process_volfp+0x151) [0x409e41] >>>> -->/usr/local/sbin/glusterd(cleanup_and_exit+0x5f) [0x40942f] ) 0-: >>>> received signum (-1), shutting down >>>> >>>> >>>> On Thu, Jan 17, 2019 at 8:06 AM Atin Mukherjee >>>> wrote: >>>> >>>>> If gluster volume info/status shows the brick to be >>>>> /media/disk4/brick4 then you'd need to mount the same path and hence you'd >>>>> need to create the brick4 directory explicitly. I fail to understand the >>>>> rationale how only /media/disk4 can be used as the mount path for the >>>>> brick. >>>>> >>>>> On Wed, Jan 16, 2019 at 5:24 PM Amudhan P wrote: >>>>> >>>>>> Yes, I did mount bricks but the folder 'brick4' was still not created >>>>>> inside the brick. 
>>>>>> Do I need to create this folder because when I run replace-brick it >>>>>> will create folder inside the brick. I have seen this behavior before when >>>>>> running replace-brick or heal begins. >>>>>> >>>>>> On Wed, Jan 16, 2019 at 5:05 PM Atin Mukherjee >>>>>> wrote: >>>>>> >>>>>>> >>>>>>> >>>>>>> On Wed, Jan 16, 2019 at 5:02 PM Amudhan P >>>>>>> wrote: >>>>>>> >>>>>>>> Atin, >>>>>>>> I have copied the content of 'gfs-tst' from vol folder in another >>>>>>>> node. when starting service again fails with error msg in glusterd.log file. >>>>>>>> >>>>>>>> [2019-01-15 20:16:59.513023] I [MSGID: 100030] >>>>>>>> [glusterfsd.c:2741:main] 0-/usr/local/sbin/glusterd: Started running >>>>>>>> /usr/local/sbin/glusterd version 4.1.6 (args: /usr/local/sbin/glusterd -p >>>>>>>> /var/run/glusterd.pid) >>>>>>>> [2019-01-15 20:16:59.517164] I [MSGID: 106478] >>>>>>>> [glusterd.c:1423:init] 0-management: Maximum allowed open file descriptors >>>>>>>> set to 65536 >>>>>>>> [2019-01-15 20:16:59.517264] I [MSGID: 106479] >>>>>>>> [glusterd.c:1481:init] 0-management: Using /var/lib/glusterd as working >>>>>>>> directory >>>>>>>> [2019-01-15 20:16:59.517283] I [MSGID: 106479] >>>>>>>> [glusterd.c:1486:init] 0-management: Using /var/run/gluster as pid file >>>>>>>> working directory >>>>>>>> [2019-01-15 20:16:59.521508] W [MSGID: 103071] >>>>>>>> [rdma.c:4629:__gf_rdma_ctx_create] 0-rpc-transport/rdma: rdma_cm event >>>>>>>> channel creation failed [No such device] >>>>>>>> [2019-01-15 20:16:59.521544] W [MSGID: 103055] [rdma.c:4938:init] >>>>>>>> 0-rdma.management: Failed to initialize IB Device >>>>>>>> [2019-01-15 20:16:59.521562] W >>>>>>>> [rpc-transport.c:351:rpc_transport_load] 0-rpc-transport: 'rdma' >>>>>>>> initialization failed >>>>>>>> [2019-01-15 20:16:59.521629] W >>>>>>>> [rpcsvc.c:1781:rpcsvc_create_listener] 0-rpc-service: cannot create >>>>>>>> listener, initing the transport failed >>>>>>>> [2019-01-15 20:16:59.521648] E [MSGID: 106244] >>>>>>>> 
[glusterd.c:1764:init] 0-management: creation of 1 listeners failed, >>>>>>>> continuing with succeeded transport >>>>>>>> [2019-01-15 20:17:00.529390] I [MSGID: 106513] >>>>>>>> [glusterd-store.c:2240:glusterd_restore_op_version] 0-glusterd: retrieved >>>>>>>> op-version: 40100 >>>>>>>> [2019-01-15 20:17:00.608354] I [MSGID: 106544] >>>>>>>> [glusterd.c:158:glusterd_uuid_init] 0-management: retrieved UUID: >>>>>>>> d6bf51a7-c296-492f-8dac-e81efa9dd22d >>>>>>>> [2019-01-15 20:17:00.650911] W [MSGID: 106425] >>>>>>>> [glusterd-store.c:2643:glusterd_store_retrieve_bricks] 0-management: failed >>>>>>>> to get statfs() call on brick /media/disk4/brick4 [No such file or >>>>>>>> directory] >>>>>>>> >>>>>>> >>>>>>> This means that underlying brick /media/disk4/brick4 doesn't exist. >>>>>>> You already mentioned that you had replaced the faulty disk, but have you >>>>>>> not mounted it yet? >>>>>>> >>>>>>> >>>>>>>> [2019-01-15 20:17:00.691240] I [MSGID: 106498] >>>>>>>> [glusterd-handler.c:3614:glusterd_friend_add_from_peerinfo] 0-management: >>>>>>>> connect returned 0 >>>>>>>> [2019-01-15 20:17:00.691307] W [MSGID: 106061] >>>>>>>> [glusterd-handler.c:3408:glusterd_transport_inet_options_build] 0-glusterd: >>>>>>>> Failed to get tcp-user-timeout >>>>>>>> [2019-01-15 20:17:00.691331] I >>>>>>>> [rpc-clnt.c:1059:rpc_clnt_connection_init] 0-management: setting >>>>>>>> frame-timeout to 600 >>>>>>>> [2019-01-15 20:17:00.692547] E [MSGID: 106187] >>>>>>>> [glusterd-store.c:4662:glusterd_resolve_all_bricks] 0-glusterd: resolve >>>>>>>> brick failed in restore >>>>>>>> [2019-01-15 20:17:00.692582] E [MSGID: 101019] >>>>>>>> [xlator.c:720:xlator_init] 0-management: Initialization of volume >>>>>>>> 'management' failed, review your volfile again >>>>>>>> [2019-01-15 20:17:00.692597] E [MSGID: 101066] >>>>>>>> [graph.c:367:glusterfs_graph_init] 0-management: initializing translator >>>>>>>> failed >>>>>>>> [2019-01-15 20:17:00.692607] E [MSGID: 101176] >>>>>>>> 
[graph.c:738:glusterfs_graph_activate] 0-graph: init failed >>>>>>>> [2019-01-15 20:17:00.693004] W [glusterfsd.c:1514:cleanup_and_exit] >>>>>>>> (-->/usr/local/sbin/glusterd(glusterfs_volumes_init+0xc2) [0x409f52] >>>>>>>> -->/usr/local/sbin/glusterd(glusterfs_process_volfp+0x151) [0x409e41] >>>>>>>> -->/usr/local/sbin/glusterd(cleanup_and_exit+0x5f) [0x40942f] ) 0-: >>>>>>>> received signum (-1), shutting down >>>>>>>> >>>>>>>> >>>>>>>> On Wed, Jan 16, 2019 at 4:34 PM Atin Mukherjee >>>>>>>> wrote: >>>>>>>> >>>>>>>>> This is a case of partial write of a transaction and as the host >>>>>>>>> ran out of space for the root partition where all the glusterd related >>>>>>>>> configurations are persisted, the transaction couldn't be written and hence >>>>>>>>> the new (replaced) brick's information wasn't persisted in the >>>>>>>>> configuration. The workaround for this is to copy the content of >>>>>>>>> /var/lib/glusterd/vols/gfs-tst/ from one of the nodes in the trusted >>>>>>>>> storage pool to the node where glusterd service fails to come up and post >>>>>>>>> that restarting the glusterd service should be able to make peer status >>>>>>>>> reporting all nodes healthy and connected. >>>>>>>>> >>>>>>>>> On Wed, Jan 16, 2019 at 3:49 PM Amudhan P >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>>> Hi, >>>>>>>>>> >>>>>>>>>> In short, when I started glusterd service I am getting following >>>>>>>>>> error msg in the glusterd.log file in one server. >>>>>>>>>> what needs to be done? 
>>>>>>>>>> >>>>>>>>>> error logged in glusterd.log >>>>>>>>>> >>>>>>>>>> [2019-01-15 17:50:13.956053] I [MSGID: 100030] >>>>>>>>>> [glusterfsd.c:2741:main] 0-/usr/local/sbin/glusterd: Started running >>>>>>>>>> /usr/local/sbin/glusterd version 4.1.6 (args: /usr/local/sbin/glusterd -p >>>>>>>>>> /var/run/glusterd.pid) >>>>>>>>>> [2019-01-15 17:50:13.960131] I [MSGID: 106478] >>>>>>>>>> [glusterd.c:1423:init] 0-management: Maximum allowed open file descriptors >>>>>>>>>> set to 65536 >>>>>>>>>> [2019-01-15 17:50:13.960193] I [MSGID: 106479] >>>>>>>>>> [glusterd.c:1481:init] 0-management: Using /var/lib/glusterd as working >>>>>>>>>> directory >>>>>>>>>> [2019-01-15 17:50:13.960212] I [MSGID: 106479] >>>>>>>>>> [glusterd.c:1486:init] 0-management: Using /var/run/gluster as pid file >>>>>>>>>> working directory >>>>>>>>>> [2019-01-15 17:50:13.964437] W [MSGID: 103071] >>>>>>>>>> [rdma.c:4629:__gf_rdma_ctx_create] 0-rpc-transport/rdma: rdma_cm event >>>>>>>>>> channel creation failed [No such device] >>>>>>>>>> [2019-01-15 17:50:13.964474] W [MSGID: 103055] [rdma.c:4938:init] >>>>>>>>>> 0-rdma.management: Failed to initialize IB Device >>>>>>>>>> [2019-01-15 17:50:13.964491] W >>>>>>>>>> [rpc-transport.c:351:rpc_transport_load] 0-rpc-transport: 'rdma' >>>>>>>>>> initialization failed >>>>>>>>>> [2019-01-15 17:50:13.964560] W >>>>>>>>>> [rpcsvc.c:1781:rpcsvc_create_listener] 0-rpc-service: cannot create >>>>>>>>>> listener, initing the transport failed >>>>>>>>>> [2019-01-15 17:50:13.964579] E [MSGID: 106244] >>>>>>>>>> [glusterd.c:1764:init] 0-management: creation of 1 listeners failed, >>>>>>>>>> continuing with succeeded transport >>>>>>>>>> [2019-01-15 17:50:14.967681] I [MSGID: 106513] >>>>>>>>>> [glusterd-store.c:2240:glusterd_restore_op_version] 0-glusterd: retrieved >>>>>>>>>> op-version: 40100 >>>>>>>>>> [2019-01-15 17:50:14.973931] I [MSGID: 106544] >>>>>>>>>> [glusterd.c:158:glusterd_uuid_init] 0-management: retrieved UUID: >>>>>>>>>> 
d6bf51a7-c296-492f-8dac-e81efa9dd22d >>>>>>>>>> [2019-01-15 17:50:15.046620] E [MSGID: 101032] >>>>>>>>>> [store.c:441:gf_store_handle_retrieve] 0-: Path corresponding to >>>>>>>>>> /var/lib/glusterd/vols/gfs-tst/bricks/IP.3:-media-disk3-brick3. [No such >>>>>>>>>> file or directory] >>>>>>>>>> [2019-01-15 17:50:15.046685] E [MSGID: 106201] >>>>>>>>>> [glusterd-store.c:3384:glusterd_store_retrieve_volumes] 0-management: >>>>>>>>>> Unable to restore volume: gfs-tst >>>>>>>>>> [2019-01-15 17:50:15.046718] E [MSGID: 101019] >>>>>>>>>> [xlator.c:720:xlator_init] 0-management: Initialization of volume >>>>>>>>>> 'management' failed, review your volfile again >>>>>>>>>> [2019-01-15 17:50:15.046732] E [MSGID: 101066] >>>>>>>>>> [graph.c:367:glusterfs_graph_init] 0-management: initializing translator >>>>>>>>>> failed >>>>>>>>>> [2019-01-15 17:50:15.046741] E [MSGID: 101176] >>>>>>>>>> [graph.c:738:glusterfs_graph_activate] 0-graph: init failed >>>>>>>>>> [2019-01-15 17:50:15.047171] W >>>>>>>>>> [glusterfsd.c:1514:cleanup_and_exit] >>>>>>>>>> (-->/usr/local/sbin/glusterd(glusterfs_volumes >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> In long, I am trying to simulate a situation. where volume stoped >>>>>>>>>> abnormally and >>>>>>>>>> entire cluster restarted with some missing disks. >>>>>>>>>> >>>>>>>>>> My test cluster is set up with 3 nodes and each has four disks, I >>>>>>>>>> have setup a volume with disperse 4+2. >>>>>>>>>> In Node-3 2 disks have failed, to replace I have shutdown all >>>>>>>>>> system >>>>>>>>>> >>>>>>>>>> below are the steps done. >>>>>>>>>> >>>>>>>>>> 1. umount from client machine >>>>>>>>>> 2. shutdown all system by running `shutdown -h now` command ( >>>>>>>>>> without stopping volume and stop service) >>>>>>>>>> 3. replace faulty disk in Node-3 >>>>>>>>>> 4. powered ON all system >>>>>>>>>> 5. format replaced drives, and mount all drives >>>>>>>>>> 6. start glusterd service in all node (success) >>>>>>>>>> 7. 
Now running `voulume status` command from node-3 >>>>>>>>>> output : [2019-01-15 16:52:17.718422] : v status : FAILED : >>>>>>>>>> Staging failed on 0083ec0c-40bf-472a-a128-458924e56c96. Please check log >>>>>>>>>> file for details. >>>>>>>>>> 8. running `voulume start gfs-tst` command from node-3 >>>>>>>>>> output : [2019-01-15 16:53:19.410252] : v start gfs-tst : FAILED >>>>>>>>>> : Volume gfs-tst already started >>>>>>>>>> >>>>>>>>>> 9. running `gluster v status` in other node. showing all brick >>>>>>>>>> available but 'self-heal daemon' not running >>>>>>>>>> @gfstst-node2:~$ sudo gluster v status >>>>>>>>>> Status of volume: gfs-tst >>>>>>>>>> Gluster process TCP Port RDMA Port >>>>>>>>>> Online Pid >>>>>>>>>> >>>>>>>>>> ------------------------------------------------------------------------------ >>>>>>>>>> Brick IP.2:/media/disk1/brick1 49152 0 Y >>>>>>>>>> 1517 >>>>>>>>>> Brick IP.4:/media/disk1/brick1 49152 0 Y >>>>>>>>>> 1668 >>>>>>>>>> Brick IP.2:/media/disk2/brick2 49153 0 Y >>>>>>>>>> 1522 >>>>>>>>>> Brick IP.4:/media/disk2/brick2 49153 0 Y >>>>>>>>>> 1678 >>>>>>>>>> Brick IP.2:/media/disk3/brick3 49154 0 Y >>>>>>>>>> 1527 >>>>>>>>>> Brick IP.4:/media/disk3/brick3 49154 0 Y >>>>>>>>>> 1677 >>>>>>>>>> Brick IP.2:/media/disk4/brick4 49155 0 Y >>>>>>>>>> 1541 >>>>>>>>>> Brick IP.4:/media/disk4/brick4 49155 0 Y >>>>>>>>>> 1683 >>>>>>>>>> Self-heal Daemon on localhost N/A N/A >>>>>>>>>> Y 2662 >>>>>>>>>> Self-heal Daemon on IP.4 N/A N/A Y >>>>>>>>>> 2786 >>>>>>>>>> >>>>>>>>>> 10. in the above output 'volume already started'. so, running >>>>>>>>>> `reset-brick` command >>>>>>>>>> v reset-brick gfs-tst IP.3:/media/disk3/brick3 >>>>>>>>>> IP.3:/media/disk3/brick3 commit force >>>>>>>>>> >>>>>>>>>> output : [2019-01-15 16:57:37.916942] : v reset-brick gfs-tst >>>>>>>>>> IP.3:/media/disk3/brick3 IP.3:/media/disk3/brick3 commit force : FAILED : >>>>>>>>>> /media/disk3/brick3 is already part of a volume >>>>>>>>>> >>>>>>>>>> 11. 
reset-brick command was not working, so, tried stopping >>>>>>>>>> volume and start with force command >>>>>>>>>> output : [2019-01-15 17:01:04.570794] : v start gfs-tst force : >>>>>>>>>> FAILED : Pre-validation failed on localhost. Please check log file for >>>>>>>>>> details >>>>>>>>>> >>>>>>>>>> 12. now stopped service in all node and tried starting again. >>>>>>>>>> except node-3 other nodes service started successfully without any issues. >>>>>>>>>> >>>>>>>>>> in node-3 receiving following message. >>>>>>>>>> >>>>>>>>>> sudo service glusterd start >>>>>>>>>> * Starting glusterd service glusterd >>>>>>>>>> >>>>>>>>>> [fail] >>>>>>>>>> /usr/local/sbin/glusterd: option requires an argument -- 'f' >>>>>>>>>> Try `glusterd --help' or `glusterd --usage' for more information. >>>>>>>>>> >>>>>>>>>> 13. checking glusterd log file found that OS drive was running >>>>>>>>>> out of space >>>>>>>>>> output : [2019-01-15 16:51:37.210792] W [MSGID: 101012] >>>>>>>>>> [store.c:372:gf_store_save_value] 0-management: fflush failed. [No space >>>>>>>>>> left on device] >>>>>>>>>> [2019-01-15 16:51:37.210874] E [MSGID: 106190] >>>>>>>>>> [glusterd-store.c:1058:glusterd_volume_exclude_options_write] 0-management: >>>>>>>>>> Unable to write volume values for gfs-tst >>>>>>>>>> >>>>>>>>>> 14. cleared some space in OS drive but still, service is not >>>>>>>>>> running. 
below is the error logged in glusterd.log >>>>>>>>>> >>>>>>>>>> [2019-01-15 17:50:13.956053] I [MSGID: 100030] >>>>>>>>>> [glusterfsd.c:2741:main] 0-/usr/local/sbin/glusterd: Started running >>>>>>>>>> /usr/local/sbin/glusterd version 4.1.6 (args: /usr/local/sbin/glusterd -p >>>>>>>>>> /var/run/glusterd.pid) >>>>>>>>>> [2019-01-15 17:50:13.960131] I [MSGID: 106478] >>>>>>>>>> [glusterd.c:1423:init] 0-management: Maximum allowed open file descriptors >>>>>>>>>> set to 65536 >>>>>>>>>> [2019-01-15 17:50:13.960193] I [MSGID: 106479] >>>>>>>>>> [glusterd.c:1481:init] 0-management: Using /var/lib/glusterd as working >>>>>>>>>> directory >>>>>>>>>> [2019-01-15 17:50:13.960212] I [MSGID: 106479] >>>>>>>>>> [glusterd.c:1486:init] 0-management: Using /var/run/gluster as pid file >>>>>>>>>> working directory >>>>>>>>>> [2019-01-15 17:50:13.964437] W [MSGID: 103071] >>>>>>>>>> [rdma.c:4629:__gf_rdma_ctx_create] 0-rpc-transport/rdma: rdma_cm event >>>>>>>>>> channel creation failed [No such device] >>>>>>>>>> [2019-01-15 17:50:13.964474] W [MSGID: 103055] [rdma.c:4938:init] >>>>>>>>>> 0-rdma.management: Failed to initialize IB Device >>>>>>>>>> [2019-01-15 17:50:13.964491] W >>>>>>>>>> [rpc-transport.c:351:rpc_transport_load] 0-rpc-transport: 'rdma' >>>>>>>>>> initialization failed >>>>>>>>>> [2019-01-15 17:50:13.964560] W >>>>>>>>>> [rpcsvc.c:1781:rpcsvc_create_listener] 0-rpc-service: cannot create >>>>>>>>>> listener, initing the transport failed >>>>>>>>>> [2019-01-15 17:50:13.964579] E [MSGID: 106244] >>>>>>>>>> [glusterd.c:1764:init] 0-management: creation of 1 listeners failed, >>>>>>>>>> continuing with succeeded transport >>>>>>>>>> [2019-01-15 17:50:14.967681] I [MSGID: 106513] >>>>>>>>>> [glusterd-store.c:2240:glusterd_restore_op_version] 0-glusterd: retrieved >>>>>>>>>> op-version: 40100 >>>>>>>>>> [2019-01-15 17:50:14.973931] I [MSGID: 106544] >>>>>>>>>> [glusterd.c:158:glusterd_uuid_init] 0-management: retrieved UUID: >>>>>>>>>> 
d6bf51a7-c296-492f-8dac-e81efa9dd22d >>>>>>>>>> [2019-01-15 17:50:15.046620] E [MSGID: 101032] >>>>>>>>>> [store.c:441:gf_store_handle_retrieve] 0-: Path corresponding to >>>>>>>>>> /var/lib/glusterd/vols/gfs-tst/bricks/IP.3:-media-disk3-brick3. [No such >>>>>>>>>> file or directory] >>>>>>>>>> [2019-01-15 17:50:15.046685] E [MSGID: 106201] >>>>>>>>>> [glusterd-store.c:3384:glusterd_store_retrieve_volumes] 0-management: >>>>>>>>>> Unable to restore volume: gfs-tst >>>>>>>>>> [2019-01-15 17:50:15.046718] E [MSGID: 101019] >>>>>>>>>> [xlator.c:720:xlator_init] 0-management: Initialization of volume >>>>>>>>>> 'management' failed, review your volfile again >>>>>>>>>> [2019-01-15 17:50:15.046732] E [MSGID: 101066] >>>>>>>>>> [graph.c:367:glusterfs_graph_init] 0-management: initializing translator >>>>>>>>>> failed >>>>>>>>>> [2019-01-15 17:50:15.046741] E [MSGID: 101176] >>>>>>>>>> [graph.c:738:glusterfs_graph_activate] 0-graph: init failed >>>>>>>>>> [2019-01-15 17:50:15.047171] W >>>>>>>>>> [glusterfsd.c:1514:cleanup_and_exit] >>>>>>>>>> (-->/usr/local/sbin/glusterd(glusterfs_volumes_init+0xc2) [0x409f52] >>>>>>>>>> -->/usr/local/sbin/glusterd(glusterfs_process_volfp+0x151) [0x409e41] >>>>>>>>>> -->/usr/local/sbin/glusterd(cleanup_and_exit+0x5f) [0x40942f] ) 0-: >>>>>>>>>> received signum (-1), shutting down >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> 15. 
In other node running `volume status' still shows bricks >>>>>>>>>> node3 is live >>>>>>>>>> but 'peer status' showing node-3 disconnected >>>>>>>>>> >>>>>>>>>> @gfstst-node2:~$ sudo gluster v status >>>>>>>>>> Status of volume: gfs-tst >>>>>>>>>> Gluster process TCP Port RDMA Port >>>>>>>>>> Online Pid >>>>>>>>>> >>>>>>>>>> ------------------------------------------------------------------------------ >>>>>>>>>> Brick IP.2:/media/disk1/brick1 49152 0 Y >>>>>>>>>> 1517 >>>>>>>>>> Brick IP.4:/media/disk1/brick1 49152 0 Y >>>>>>>>>> 1668 >>>>>>>>>> Brick IP.2:/media/disk2/brick2 49153 0 Y >>>>>>>>>> 1522 >>>>>>>>>> Brick IP.4:/media/disk2/brick2 49153 0 Y >>>>>>>>>> 1678 >>>>>>>>>> Brick IP.2:/media/disk3/brick3 49154 0 Y >>>>>>>>>> 1527 >>>>>>>>>> Brick IP.4:/media/disk3/brick3 49154 0 Y >>>>>>>>>> 1677 >>>>>>>>>> Brick IP.2:/media/disk4/brick4 49155 0 Y >>>>>>>>>> 1541 >>>>>>>>>> Brick IP.4:/media/disk4/brick4 49155 0 Y >>>>>>>>>> 1683 >>>>>>>>>> Self-heal Daemon on localhost N/A N/A Y >>>>>>>>>> 2662 >>>>>>>>>> Self-heal Daemon on IP.4 N/A N/A Y >>>>>>>>>> 2786 >>>>>>>>>> >>>>>>>>>> Task Status of Volume gfs-tst >>>>>>>>>> >>>>>>>>>> ------------------------------------------------------------------------------ >>>>>>>>>> There are no active volume tasks >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> root at gfstst-node2:~$ sudo gluster pool list >>>>>>>>>> UUID Hostname State >>>>>>>>>> d6bf51a7-c296-492f-8dac-e81efa9dd22d IP.3 Disconnected >>>>>>>>>> c1cbb58e-3ceb-4637-9ba3-3d28ef20b143 IP.4 Connected >>>>>>>>>> 0083ec0c-40bf-472a-a128-458924e56c96 localhost Connected >>>>>>>>>> >>>>>>>>>> root at gfstst-node2:~$ sudo gluster peer status >>>>>>>>>> Number of Peers: 2 >>>>>>>>>> >>>>>>>>>> Hostname: IP.3 >>>>>>>>>> Uuid: d6bf51a7-c296-492f-8dac-e81efa9dd22d >>>>>>>>>> State: Peer in Cluster (Disconnected) >>>>>>>>>> >>>>>>>>>> Hostname: IP.4 >>>>>>>>>> Uuid: c1cbb58e-3ceb-4637-9ba3-3d28ef20b143 >>>>>>>>>> State: Peer in Cluster (Connected) >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> 
regards
>>>>>>>>>> Amudhan
>>>>>>>>>> _______________________________________________
>>>>>>>>>> Gluster-users mailing list
>>>>>>>>>> Gluster-users at gluster.org
>>>>>>>>>> https://lists.gluster.org/mailman/listinfo/gluster-users
>>>>>>>>>
>>>>>>>>> --
> - Atin (atinm)
>

From shaik.salam at tcs.com Mon Jan 21 12:06:07 2019
From: shaik.salam at tcs.com (Shaik Salam)
Date: Mon, 21 Jan 2019 17:36:07 +0530
Subject: [Gluster-users] Unable to create new volume due to pending operations
Message-ID:

Hi,

We have deployed glusterfs as containers on OpenShift Origin. We are unable to create new volumes for OpenShift pods and observe the following error:

Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: create volume err: error creating volume
Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: create volume err: error creating volume Server busy. Retry operation later..

We believe the issue is caused by pending operations in the heketi db. We exported the db, removed the pending operations (volumes, bricks) from it, removed the LVs from the physical hosts, and imported the db again. However, heketi is still trying to delete volumes that were removed from the db as part of the pending operations, and we still cannot create volumes because of the "server busy" error.

Can you please let me know where heketi is still getting volume IDs that are not present in the heketi db? And why are we unable to create volumes? (There is no relevant information in the glusterd or glusterfsd logs.)

[negroni] Started POST /volumes
[heketi] WARNING 2019/01/21 11:48:31 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 260.577µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/21 11:48:39 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 221.477µs
[negroni] Started DELETE /volumes/99e87ecd0a816ac34ae5a04eabc1d606
[negroni] Completed 404 Not Found in 151.896µs
[negroni] Started DELETE /volumes/99e87ecd0a816ac34ae5a04eabc1d606
[negroni] Completed 404 Not Found in 125.387µs
[negroni] Started DELETE /volumes/520bc5f4e1bfd029855a72f9ca7ebf6c
[negroni] Completed 404 Not Found in 168.23µs
[negroni] Started DELETE /volumes/99e87ecd0a816ac34ae5a04eabc1d606
[negroni] Completed 404 Not Found in 123.231µs
[negroni] Started DELETE /volumes/520bc5f4e1bfd029855a72f9ca7ebf6c
[negroni] Completed 404 Not Found in 160.416µs
[negroni] Started DELETE /volumes/99e87ecd0a816ac34ae5a04eabc1d606
[negroni] Completed 404 Not Found in 124.439µs
[negroni] Started DELETE /volumes/520bc5f4e1bfd029855a72f9ca7ebf6c
[negroni] Completed 404 Not Found in 126.748µs
[negroni] Started DELETE /volumes/99e87ecd0a816ac34ae5a04eabc1d606
[negroni] Completed 404 Not Found in 138.377µs
[negroni] Started DELETE /volumes/520bc5f4e1bfd029855a72f9ca7ebf6c
[negroni] Completed 404 Not Found in 138.477µs
[negroni] Started DELETE /volumes/520bc5f4e1bfd029855a72f9ca7ebf6c
[negroni] Completed 404 Not Found in 267.79µs

BR
Salam

=====-----=====-----=====
Notice: The information contained in this e-mail message and/or attachments to it may contain confidential or privileged information. If you are not the intended recipient, any dissemination, use, review, distribution, printing or copying of the information contained in this e-mail message and/or attachments to it are strictly prohibited.
If you have received this communication in error, please notify us by reply e-mail or telephone and immediately and permanently delete the message and any attachments. Thank you

From shaik.salam at tcs.com Mon Jan 21 13:06:02 2019
From: shaik.salam at tcs.com (Shaik Salam)
Date: Mon, 21 Jan 2019 18:36:02 +0530
Subject: [Gluster-users] Bricks are going offline unable to recover with heal/start force commands
Message-ID:

Hi,

Bricks are offline, and we tried to recover them with the following commands:

gluster volume heal
gluster volume start force

But the bricks are still offline.

sh-4.2# gluster volume status vol_3442e86b6d994a14de73f1b8c82cf0b8
Status of volume: vol_3442e86b6d994a14de73f1b8c82cf0b8
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 192.168.3.6:/var/lib/heketi/mounts/vg
_ca57f326195c243be2380ce4e42a4191/brick_952
d75fd193c7209c9a81acbc23a3747/brick         49166     0          Y       269
Brick 192.168.3.5:/var/lib/heketi/mounts/vg
_d5f17487744584e3652d3ca943b0b91b/brick_e15
c12cceae12c8ab7782dd57cf5b6c1/brick         N/A       N/A        N       N/A
Brick 192.168.3.15:/var/lib/heketi/mounts/v
g_462ea199185376b03e4b0317363bb88c/brick_17
36459d19e8aaa1dcb5a87f48747d04/brick        49173     0          Y       225
Self-heal Daemon on localhost               N/A       N/A        Y       45826
Self-heal Daemon on 192.168.3.6             N/A       N/A        Y       65196
Self-heal Daemon on 192.168.3.15            N/A       N/A        Y       52915

Task Status of Volume vol_3442e86b6d994a14de73f1b8c82cf0b8
------------------------------------------------------------------------------

We can see the following events when we force-start the volume:

/mgmt/glusterd.so(+0xe2b3a) [0x7fca9e139b3a] -->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2605) [0x7fca9e139605] -->/lib64/libglusterfs.so.0(runner_log+0x115) [0x7fcaa346f0e5] ) 0-management: Ran script: /var/lib/glusterd/hooks/1/start/post/S29CTDBsetup.sh --volname=vol_3442e86b6d994a14de73f1b8c82cf0b8 --first=no --version=1
--volume-op=start --gd-workdir=/var/lib/glusterd [2019-01-21 08:22:34.555068] E [run.c:241:runner_log] (-->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2b3a) [0x7fca9e139b3a] -->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2563) [0x7fca9e139563] -->/lib64/libglusterfs.so.0(runner_log+0x115) [0x7fcaa346f0e5] ) 0-management: Failed to execute script: /var/lib/glusterd/hooks/1/start/post/S30samba-start.sh --volname=vol_3442e86b6d994a14de73f1b8c82cf0b8 --first=no --version=1 --volume-op=start --gd-workdir=/var/lib/glusterd [2019-01-21 08:22:53.389049] I [MSGID: 106499] [glusterd-handler.c:4314:__glusterd_handle_status_volume] 0-management: Received status volume req for volume vol_3442e86b6d994a14de73f1b8c82cf0b8 [2019-01-21 08:23:25.346839] I [MSGID: 106487] [glusterd-handler.c:1486:__glusterd_handle_cli_list_friends] 0-glusterd: Received cli list req we can see following events from when we heal volumes. [2019-01-21 08:20:07.576070] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-glusterfs: error returned while attempting to connect to host:(null), port:0 [2019-01-21 08:20:07.580225] I [cli-rpc-ops.c:9182:gf_cli_heal_volume_cbk] 0-cli: Received resp to heal volume [2019-01-21 08:20:07.580326] I [input.c:31:cli_batch] 0-: Exiting with: -1 [2019-01-21 08:22:30.423311] I [cli.c:768:main] 0-cli: Started running gluster with version 4.1.5 [2019-01-21 08:22:30.463648] I [MSGID: 101190] [event-epoll.c:617:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1 [2019-01-21 08:22:30.463718] I [socket.c:2632:socket_event_handler] 0-transport: EPOLLERR - disconnecting now [2019-01-21 08:22:30.463859] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-glusterfs: error returned while attempting to connect to host:(null), port:0 [2019-01-21 08:22:33.427710] I [socket.c:2632:socket_event_handler] 0-transport: EPOLLERR - disconnecting now [2019-01-21 08:22:34.581555] I [cli-rpc-ops.c:1472:gf_cli_start_volume_cbk] 0-cli: Received resp to start volume [2019-01-21 
08:22:34.581678] I [input.c:31:cli_batch] 0-: Exiting with: 0
[2019-01-21 08:22:53.345351] I [cli.c:768:main] 0-cli: Started running gluster with version 4.1.5
[2019-01-21 08:22:53.387992] I [MSGID: 101190] [event-epoll.c:617:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2019-01-21 08:22:53.388059] I [socket.c:2632:socket_event_handler] 0-transport: EPOLLERR - disconnecting now
[2019-01-21 08:22:53.388138] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-glusterfs: error returned while attempting to connect to host:(null), port:0
[2019-01-21 08:22:53.394737] I [input.c:31:cli_batch] 0-: Exiting with: 0
[2019-01-21 08:23:25.304688] I [cli.c:768:main] 0-cli: Started running gluster with version 4.1.5
[2019-01-21 08:23:25.346319] I [MSGID: 101190] [event-epoll.c:617:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2019-01-21 08:23:25.346389] I [socket.c:2632:socket_event_handler] 0-transport: EPOLLERR - disconnecting now
[2019-01-21 08:23:25.346500] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-glusterfs: error returned while attempting to connect to host:(null), port:0

Please let us know steps to recover bricks.

BR
Salam

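The status output in the message above marks one brick as offline (Online = N). As a rough sketch only, assuming the plain-text layout shown in that message (where the CLI wraps long brick names over several lines), a small Python helper could pick the offline bricks out of saved `gluster volume status` output. The function name offline_bricks and the parsing rules are illustrative assumptions, not part of gluster or heketi.

```python
def offline_bricks(status_output):
    """Return brick names whose Online column is 'N' in `gluster volume status` text."""
    offline = []
    name_parts = None  # pieces of a brick name wrapped over several lines
    for line in status_output.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "Brick":
            # A new brick row starts; drop the literal "Brick" keyword.
            name_parts = []
            fields = fields[1:]
        if name_parts is None:
            # Header, separator, or self-heal daemon row: not a brick.
            continue
        # The last line of a row ends with: TCP port, RDMA port, Online, Pid.
        if len(fields) >= 5 and fields[-2] in ("Y", "N"):
            name_parts.append(fields[0])
            if fields[-2] == "N":
                offline.append("".join(name_parts))
            name_parts = None
        else:
            name_parts.extend(fields)
    return offline


# Sample rows copied from the status output quoted in the message above.
SAMPLE = """\
Gluster process                             TCP Port  RDMA Port  Online  Pid
Brick 192.168.3.6:/var/lib/heketi/mounts/vg
_ca57f326195c243be2380ce4e42a4191/brick_952
d75fd193c7209c9a81acbc23a3747/brick         49166     0          Y       269
Brick 192.168.3.5:/var/lib/heketi/mounts/vg
_d5f17487744584e3652d3ca943b0b91b/brick_e15
c12cceae12c8ab7782dd57cf5b6c1/brick         N/A       N/A        N       N/A
Self-heal Daemon on localhost               N/A       N/A        Y       45826
"""

for brick in offline_bricks(SAMPLE):
    print("offline:", brick)
```

With the sample rows, only the 192.168.3.5 brick is reported. Brick paths containing spaces are not handled by this sketch.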
From shaik.salam at tcs.com Mon Jan 21 13:47:22 2019
From: shaik.salam at tcs.com (Shaik Salam)
Date: Mon, 21 Jan 2019 19:17:22 +0530
Subject: [Gluster-users] [Bugs] Unable to create new volume due to pending operations
In-Reply-To:
References:
Message-ID:

Hi,

Can you please reply to my issue? I think it is already a known issue.

Can you please let me know where heketi is still getting volume IDs that are not present in the heketi db? And why are we unable to create volumes? (There is no relevant information in the glusterd or glusterfsd logs.)

BR
Salam

From: "Shaik Salam"
To:
Date: 01/21/2019 05:47 PM
Subject: [Bugs] Unable to create new volume due to pending operations
Sent by: bugs-bounces at gluster.org

"External email. Open with Caution"

Hi,

We have deployed glusterfs as containers on OpenShift Origin. We are unable to create new volumes for OpenShift pods and observe the following error:

Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: create volume err: error creating volume
Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: create volume err: error creating volume Server busy. Retry operation later..

We believe the issue is caused by pending operations in the heketi db. We exported the db, removed the pending operations (volumes, bricks) from it, removed the LVs from the physical hosts, and imported the db again. However, heketi is still trying to delete volumes that were removed from the db as part of the pending operations, and we still cannot create volumes because of the "server busy" error.

Can you please let me know where heketi is still getting volume IDs that are not present in the heketi db? And why are we unable to create volumes? (There is no relevant information in the glusterd or glusterfsd logs.)

[negroni] Started POST /volumes
[heketi] WARNING 2019/01/21 11:48:31 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 260.577µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/21 11:48:39 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 221.477µs
[negroni] Started DELETE /volumes/99e87ecd0a816ac34ae5a04eabc1d606
[negroni] Completed 404 Not Found in 151.896µs
[negroni] Started DELETE /volumes/99e87ecd0a816ac34ae5a04eabc1d606
[negroni] Completed 404 Not Found in 125.387µs
[negroni] Started DELETE /volumes/520bc5f4e1bfd029855a72f9ca7ebf6c
[negroni] Completed 404 Not Found in 168.23µs
[negroni] Started DELETE /volumes/99e87ecd0a816ac34ae5a04eabc1d606
[negroni] Completed 404 Not Found in 123.231µs
[negroni] Started DELETE /volumes/520bc5f4e1bfd029855a72f9ca7ebf6c
[negroni] Completed 404 Not Found in 160.416µs
[negroni] Started DELETE /volumes/99e87ecd0a816ac34ae5a04eabc1d606
[negroni] Completed 404 Not Found in 124.439µs
[negroni] Started DELETE /volumes/520bc5f4e1bfd029855a72f9ca7ebf6c
[negroni] Completed 404 Not Found in 126.748µs
[negroni] Started DELETE /volumes/99e87ecd0a816ac34ae5a04eabc1d606
[negroni] Completed 404 Not Found in 138.377µs
[negroni] Started DELETE /volumes/520bc5f4e1bfd029855a72f9ca7ebf6c
[negroni] Completed 404 Not Found in 138.477µs
[negroni] Started DELETE /volumes/520bc5f4e1bfd029855a72f9ca7ebf6c
[negroni] Completed 404 Not Found in 267.79µs

BR
Salam

_______________________________________________
Bugs mailing list
Bugs at gluster.org
https://lists.gluster.org/mailman/listinfo/bugs

From jocarbur at gmail.com Mon Jan 21 15:56:45 2019
From: jocarbur at gmail.com (Jose V. Carrion)
Date: Mon, 21 Jan 2019 16:56:45 +0100
Subject: [Gluster-users] trouble moving files
In-Reply-To:
References:
Message-ID:

I have a gluster 3.12.6-1 installation with 2 distributed volumes: volume0 and volume1.

I'm moving files from volume0 to volume1 (using the glusterfs client and the mv command), but the free space and free inodes in volume0 have remained the same since the start of the move task.

Why does this happen?

Thanks in advance.

From shaik.salam at tcs.com Mon Jan 21 16:27:25 2019
From: shaik.salam at tcs.com (Shaik Salam)
Date: Mon, 21 Jan 2019 21:57:25 +0530
Subject: [Gluster-users] Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: create volume err: error creating volume
Message-ID:

Hi,

We are also facing a similar issue on OpenShift Origin while creating PVCs for pods.

Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: create volume err: error creating volume
Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: create volume err: error creating volume Server busy. Retry operation later..

heketi looks fine.

[negroni] Completed 200 OK in 116.41µs
[negroni] Started GET /queue/756488c7baccc2a64252b1a82b2c70b3
[negroni] Completed 200 OK in 124.552µs
[negroni] Started GET /queue/756488c7baccc2a64252b1a82b2c70b3
[negroni] Completed 200 OK in 128.632µs
[negroni] Started GET /queue/756488c7baccc2a64252b1a82b2c70b3
[negroni] Completed 200 OK in 134.856µs
[negroni] Started GET /queue/756488c7baccc2a64252b1a82b2c70b3
[negroni] Completed 200 OK in 123.378µs
[negroni] Started GET /queue/756488c7baccc2a64252b1a82b2c70b3
[negroni] Completed 200 OK in 134.202µs
[negroni] Started GET /queue/756488c7baccc2a64252b1a82b2c70b3
[negroni] Completed 200 OK in 120.114µs
[negroni] Started GET /queue/756488c7baccc2a64252b1a82b2c70b3
[negroni] Completed 200 OK in 141.04µs
[negroni] Started GET /queue/756488c7baccc2a64252b1a82b2c70b3
[negroni] Completed 200 OK in 122.628µs
[negroni] Started GET /queue/756488c7baccc2a64252b1a82b2c70b3
[negroni] Completed 200 OK in 150.651µs
[negroni] Started GET /queue/756488c7baccc2a64252b1a82b2c70b3
[negroni] Completed 200 OK in 116.978µs
[negroni] Started GET /queue/756488c7baccc2a64252b1a82b2c70b3
[negroni] Completed 200 OK in 110.189µs
[negroni] Started GET /queue/756488c7baccc2a64252b1a82b2c70b3
[negroni] Started GET /queue/756488c7baccc2a64252b1a82b2c70b3
[negroni] Completed 200 OK in 226.655µs
[negroni] Started GET /queue/756488c7baccc2a64252b1a82b2c70b3
[negroni] Completed 200 OK in 129.487µs
[negroni] Started GET /queue/756488c7baccc2a64252b1a82b2c70b3
[negroni] Completed 200 OK in 116.809µs
[negroni] Started GET /queue/756488c7baccc2a64252b1a82b2c70b3
[negroni] Completed 200 OK in 118.697µs
[negroni] Started GET /queue/756488c7baccc2a64252b1a82b2c70b3
[negroni] Completed 200 OK in 112.947µs
[negroni] Started GET /queue/756488c7baccc2a64252b1a82b2c70b3
[negroni] Completed 200 OK in 134.569µs
[negroni] Started GET /queue/756488c7baccc2a64252b1a82b2c70b3
[negroni] Completed 200 OK in 119.018µs

BR
Salam

From shaik.salam at tcs.com Mon Jan 21 16:33:24 2019
From: shaik.salam at tcs.com (Shaik Salam)
Date: Mon, 21 Jan 2019 22:03:24 +0530
Subject: [Gluster-users] Bricks are going offline unable to recover with heal/start force commands
Message-ID:

Hi,

Bricks are offline, and we are unable to recover them with the following commands:

gluster volume heal
gluster volume start force

But the bricks are still offline.

sh-4.2# gluster volume status vol_3442e86b6d994a14de73f1b8c82cf0b8
Status of volume: vol_3442e86b6d994a14de73f1b8c82cf0b8
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 192.168.3.6:/var/lib/heketi/mounts/vg
_ca57f326195c243be2380ce4e42a4191/brick_952
d75fd193c7209c9a81acbc23a3747/brick         49166     0          Y       269
Brick 192.168.3.5:/var/lib/heketi/mounts/vg
_d5f17487744584e3652d3ca943b0b91b/brick_e15
c12cceae12c8ab7782dd57cf5b6c1/brick         N/A       N/A        N       N/A
Brick 192.168.3.15:/var/lib/heketi/mounts/v
g_462ea199185376b03e4b0317363bb88c/brick_17
36459d19e8aaa1dcb5a87f48747d04/brick        49173     0          Y       225
Self-heal Daemon on localhost               N/A       N/A        Y       45826
Self-heal Daemon on 192.168.3.6             N/A       N/A        Y       65196
Self-heal Daemon on 192.168.3.15            N/A       N/A        Y       52915

Task Status of Volume vol_3442e86b6d994a14de73f1b8c82cf0b8
------------------------------------------------------------------------------

We can see the following events when we force-start the volume:

/mgmt/glusterd.so(+0xe2b3a) [0x7fca9e139b3a] -->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2605) [0x7fca9e139605] -->/lib64/libglusterfs.so.0(runner_log+0x115) [0x7fcaa346f0e5] ) 0-management: Ran script: /var/lib/glusterd/hooks/1/start/post/S29CTDBsetup.sh --volname=vol_3442e86b6d994a14de73f1b8c82cf0b8 --first=no --version=1 --volume-op=start --gd-workdir=/var/lib/glusterd
[2019-01-21 08:22:34.555068] E [run.c:241:runner_log] (-->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2b3a) [0x7fca9e139b3a] -->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2563) [0x7fca9e139563] -->/lib64/libglusterfs.so.0(runner_log+0x115) [0x7fcaa346f0e5] ) 0-management: Failed to execute script: /var/lib/glusterd/hooks/1/start/post/S30samba-start.sh --volname=vol_3442e86b6d994a14de73f1b8c82cf0b8 --first=no --version=1 --volume-op=start --gd-workdir=/var/lib/glusterd
[2019-01-21 08:22:53.389049] I [MSGID: 106499] [glusterd-handler.c:4314:__glusterd_handle_status_volume] 0-management: Received status volume req for volume vol_3442e86b6d994a14de73f1b8c82cf0b8
[2019-01-21 08:23:25.346839] I [MSGID: 106487] [glusterd-handler.c:1486:__glusterd_handle_cli_list_friends] 0-glusterd: Received cli list req

We can see following events from when we heal volumes.

[2019-01-21 08:20:07.576070] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-glusterfs: error returned while attempting to connect to host:(null), port:0
[2019-01-21 08:20:07.580225] I [cli-rpc-ops.c:9182:gf_cli_heal_volume_cbk] 0-cli: Received resp to heal volume
[2019-01-21 08:20:07.580326] I [input.c:31:cli_batch] 0-: Exiting with: -1
[2019-01-21 08:22:30.423311] I [cli.c:768:main] 0-cli: Started running gluster with version 4.1.5
[2019-01-21 08:22:30.463648] I [MSGID: 101190] [event-epoll.c:617:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2019-01-21 08:22:30.463718] I [socket.c:2632:socket_event_handler] 0-transport: EPOLLERR - disconnecting now
[2019-01-21 08:22:30.463859] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-glusterfs: error returned while attempting to connect to host:(null), port:0
[2019-01-21 08:22:33.427710] I [socket.c:2632:socket_event_handler] 0-transport: EPOLLERR - disconnecting now
[2019-01-21 08:22:34.581555] I [cli-rpc-ops.c:1472:gf_cli_start_volume_cbk] 0-cli: Received resp to start volume
[2019-01-21 08:22:34.581678] I [input.c:31:cli_batch] 0-: Exiting with: 0
[2019-01-21 08:22:53.345351] I [cli.c:768:main] 0-cli: Started running gluster with version 4.1.5
[2019-01-21 08:22:53.387992] I [MSGID: 101190] [event-epoll.c:617:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2019-01-21 08:22:53.388059] I [socket.c:2632:socket_event_handler] 0-transport: EPOLLERR - disconnecting now
[2019-01-21 08:22:53.388138] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-glusterfs: error returned while attempting to connect to host:(null), port:0
[2019-01-21 08:22:53.394737] I [input.c:31:cli_batch] 0-: Exiting with: 0
[2019-01-21 08:23:25.304688] I [cli.c:768:main] 0-cli: Started running gluster with version 4.1.5
[2019-01-21 08:23:25.346319] I [MSGID: 101190] [event-epoll.c:617:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2019-01-21 08:23:25.346389] I [socket.c:2632:socket_event_handler] 0-transport: EPOLLERR - disconnecting now
[2019-01-21 08:23:25.346500] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-glusterfs: error returned while attempting to connect to host:(null), port:0

Please let us know steps to recover bricks.

BR
Salam
=====-----=====-----=====
Notice: The information contained in this e-mail message and/or attachments to it may contain confidential or privileged information. If you are not the intended recipient, any dissemination, use, review, distribution, printing or copying of the information contained in this e-mail message and/or attachments to it are strictly prohibited. If you have received this communication in error, please notify us by reply e-mail or telephone and immediately and permanently delete the message and any attachments. Thank you
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From atumball at redhat.com  Tue Jan 22 05:53:41 2019
From: atumball at redhat.com (Amar Tumballi Suryanarayan)
Date: Tue, 22 Jan 2019 11:23:41 +0530
Subject: [Gluster-users] 'dirfingerprint' to get glusterfs directory stats
In-Reply-To: <11132dc6-1d5f-490b-5cfe-f6427be75024@umich.edu>
References: <11132dc6-1d5f-490b-5cfe-f6427be75024@umich.edu>
Message-ID:

Thanks for working on this Manhong!

We always like people trying out new things with glusterfs. It would be great if you can publish a blog, and let Amye know about it, so we can also publish same / or have a link to it in our website. An added bonus if you can tweet about it with mention of @gluster

Regards,
Amar

On Tue, Jan 8, 2019 at 2:17 AM Manhong Dai wrote:

> Hi,
>
> I released a python program 'dirfingerprint' at
> https://github.com/daimh/dirfingerprint/ . We have been using this
> program to get directory stat recursively from each brick node for
> glusterfs filesystem. as it is always slower to access file meta data
> info from gluster filesystem indirectly than brick node directly.
>
> In our environment, I did the steps below before accessing brick
> nodes.
>
> 1, generate a ssh key, and put it under all brick nodes.
>
> 2, ssh to each brick node so '.ssh/known_hosts' has an entry for each node.
>
> 3, as all our brick node has the actual data storage mounted under
> /brick, the dirfingerprint command I used is something like
>
> dirfingerprint --gluster-brick=node1:/brick --gluster-brick=node2:/brick
> /home
>
> Feel free to let me know if you have any questions or suggestions.
>
>
> Best,
>
> Manhong
>
> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-users

--
Amar Tumballi (amarts)
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From atumball at redhat.com  Tue Jan 22 05:57:35 2019
From: atumball at redhat.com (Amar Tumballi Suryanarayan)
Date: Tue, 22 Jan 2019 11:27:35 +0530
Subject: [Gluster-users] Glusterfs backup and restore
In-Reply-To:
References:
Message-ID:

Kannan,

Currently GlusterFS depends on backend LVM snapshot for snapshot feature, and LVM is not very friendly with migrating blocks. Currently best way of sending the data out from snapshot would be to do the 'xfsdump' from snapshot, and send it to another place. May be that would work faster.

-Amar

On Tue, Jan 8, 2019 at 7:49 AM Kannan V wrote:

> Hi,
> I am able to take the glusterfs snapshot and activated it.
> Now I want to send the snapshot to another machine for backup (Preferably
> tar file).
> When there is a problem, I wanted to take the backed up data from another
> machine and restore.
> I could not compress the data. I mean snapshot have been created at
> " /var/lib/glusterd/snaps/"
> Now if i compress the snapshot, actual data is not present.
> Where exactly, I have to compress the data and restore back ?
> Kindly provide your suggestions.
> Thanks,
> Kannan V
> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-users

--
Amar Tumballi (amarts)
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From shaik.salam at tcs.com  Tue Jan 22 05:58:56 2019
From: shaik.salam at tcs.com (Shaik Salam)
Date: Tue, 22 Jan 2019 11:28:56 +0530
Subject: [Gluster-users] Bricks are going offline unable to recover with heal/start force commands
In-Reply-To:
References:
Message-ID:

Can anyone respond how to recover bricks apart from heal/start force according to below events from logs.
Please let me know any other logs required.
Thanks in advance.

BR
Salam

From: Shaik Salam/HYD/TCS
To: bugs at gluster.org, gluster-users at gluster.org
Date: 01/21/2019 10:03 PM
Subject: Bricks are going offline unable to recover with heal/start force commands

Hi,

Bricks are in offline and unable to recover with following commands

gluster volume heal
gluster volume start force

But still bricks are offline.

sh-4.2# gluster volume status vol_3442e86b6d994a14de73f1b8c82cf0b8
Status of volume: vol_3442e86b6d994a14de73f1b8c82cf0b8
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 192.168.3.6:/var/lib/heketi/mounts/vg
_ca57f326195c243be2380ce4e42a4191/brick_952
d75fd193c7209c9a81acbc23a3747/brick         49166     0          Y       269
Brick 192.168.3.5:/var/lib/heketi/mounts/vg
_d5f17487744584e3652d3ca943b0b91b/brick_e15
c12cceae12c8ab7782dd57cf5b6c1/brick         N/A       N/A        N       N/A
Brick 192.168.3.15:/var/lib/heketi/mounts/v
g_462ea199185376b03e4b0317363bb88c/brick_17
36459d19e8aaa1dcb5a87f48747d04/brick        49173     0          Y       225
Self-heal Daemon on localhost               N/A       N/A        Y       45826
Self-heal Daemon on 192.168.3.6             N/A       N/A        Y       65196
Self-heal Daemon on 192.168.3.15            N/A       N/A        Y       52915

Task Status of Volume vol_3442e86b6d994a14de73f1b8c82cf0b8
------------------------------------------------------------------------------

We can see following events from when we start forcing volumes

/mgmt/glusterd.so(+0xe2b3a) [0x7fca9e139b3a] -->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2605) [0x7fca9e139605] -->/lib64/libglusterfs.so.0(runner_log+0x115) [0x7fcaa346f0e5] ) 0-management: Ran script: /var/lib/glusterd/hooks/1/start/post/S29CTDBsetup.sh --volname=vol_3442e86b6d994a14de73f1b8c82cf0b8 --first=no --version=1 --volume-op=start --gd-workdir=/var/lib/glusterd
[2019-01-21 08:22:34.555068] E [run.c:241:runner_log] (-->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2b3a) [0x7fca9e139b3a] -->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2563) [0x7fca9e139563] -->/lib64/libglusterfs.so.0(runner_log+0x115) [0x7fcaa346f0e5] ) 0-management: Failed to execute script: /var/lib/glusterd/hooks/1/start/post/S30samba-start.sh --volname=vol_3442e86b6d994a14de73f1b8c82cf0b8 --first=no --version=1 --volume-op=start --gd-workdir=/var/lib/glusterd
[2019-01-21 08:22:53.389049] I [MSGID: 106499] [glusterd-handler.c:4314:__glusterd_handle_status_volume] 0-management: Received status volume req for volume vol_3442e86b6d994a14de73f1b8c82cf0b8
[2019-01-21 08:23:25.346839] I [MSGID: 106487] [glusterd-handler.c:1486:__glusterd_handle_cli_list_friends] 0-glusterd: Received cli list req

We can see following events from when we heal volumes.

[2019-01-21 08:20:07.576070] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-glusterfs: error returned while attempting to connect to host:(null), port:0
[2019-01-21 08:20:07.580225] I [cli-rpc-ops.c:9182:gf_cli_heal_volume_cbk] 0-cli: Received resp to heal volume
[2019-01-21 08:20:07.580326] I [input.c:31:cli_batch] 0-: Exiting with: -1
[2019-01-21 08:22:30.423311] I [cli.c:768:main] 0-cli: Started running gluster with version 4.1.5
[2019-01-21 08:22:30.463648] I [MSGID: 101190] [event-epoll.c:617:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2019-01-21 08:22:30.463718] I [socket.c:2632:socket_event_handler] 0-transport: EPOLLERR - disconnecting now
[2019-01-21 08:22:30.463859] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-glusterfs: error returned while attempting to connect to host:(null), port:0
[2019-01-21 08:22:33.427710] I [socket.c:2632:socket_event_handler] 0-transport: EPOLLERR - disconnecting now
[2019-01-21 08:22:34.581555] I [cli-rpc-ops.c:1472:gf_cli_start_volume_cbk] 0-cli: Received resp to start volume
[2019-01-21 08:22:34.581678] I [input.c:31:cli_batch] 0-: Exiting with: 0
[2019-01-21 08:22:53.345351] I [cli.c:768:main] 0-cli: Started running gluster with version 4.1.5
[2019-01-21 08:22:53.387992] I [MSGID: 101190] [event-epoll.c:617:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2019-01-21 08:22:53.388059] I [socket.c:2632:socket_event_handler] 0-transport: EPOLLERR - disconnecting now
[2019-01-21 08:22:53.388138] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-glusterfs: error returned while attempting to connect to host:(null), port:0
[2019-01-21 08:22:53.394737] I [input.c:31:cli_batch] 0-: Exiting with: 0
[2019-01-21 08:23:25.304688] I [cli.c:768:main] 0-cli: Started running gluster with version 4.1.5
[2019-01-21 08:23:25.346319] I [MSGID: 101190] [event-epoll.c:617:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2019-01-21 08:23:25.346389] I [socket.c:2632:socket_event_handler] 0-transport: EPOLLERR - disconnecting now
[2019-01-21 08:23:25.346500] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-glusterfs: error returned while attempting to connect to host:(null), port:0

Please let us know steps to recover bricks.

BR
Salam
=====-----=====-----=====
Notice: The information contained in this e-mail message and/or attachments to it may contain confidential or privileged information. If you are not the intended recipient, any dissemination, use, review, distribution, printing or copying of the information contained in this e-mail message and/or attachments to it are strictly prohibited. If you have received this communication in error, please notify us by reply e-mail or telephone and immediately and permanently delete the message and any attachments. Thank you
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From atumball at redhat.com  Tue Jan 22 06:00:42 2019
From: atumball at redhat.com (Amar Tumballi Suryanarayan)
Date: Tue, 22 Jan 2019 11:30:42 +0530
Subject: [Gluster-users] java application crushes while reading a zip file
In-Reply-To:
References:
Message-ID:

Dmitry,

Thanks for the detailed updates on this thread. Let us know how your 'production' setup is running. For much smoother next upgrade, we request you to help out with some early testing of glusterfs-6 RC builds which are expected to be out by Feb 1st week.

Also, if it is possible for you to automate the tests, it would be great to have it in our regression, so we can always be sure your setup would never break in future releases.

Regards,
Amar

On Mon, Jan 7, 2019 at 11:42 PM Dmitry Isakbayev wrote:

> This system is going into production.  I will try to replicate this
> problem on the next installation.
>
> On Wed, Jan 2, 2019 at 9:25 PM Raghavendra Gowdappa
> wrote:
>
>>
>> On Wed, Jan 2, 2019 at 9:59 PM Dmitry Isakbayev
>> wrote:
>>
>>> Still no JVM crushes.  Is it possible that running glusterfs with
>>> performance options turned off for a couple of days cleared out the "stale
>>> metadata issue"?
>>>
>>
>> restarting these options, would've cleared the existing cache and hence
>> previous stale metadata would've been cleared. Hitting stale metadata
>> again depends on races. That might be the reason you are still not seeing
>> the issue. Can you try with enabling all perf xlators (default
>> configuration)?
>>
>>>
>>> On Mon, Dec 31, 2018 at 1:38 PM Dmitry Isakbayev
>>> wrote:
>>>
>>>> The software ran with all of the options turned off over the weekend
>>>> without any problems.
>>>> I will try to collect the debug info for you.  I have re-enabled the 3
>>>> three options, but yet to see the problem reoccurring.
>>>>
>>>>
>>>> On Sat, Dec 29, 2018 at 6:46 PM Raghavendra Gowdappa <
>>>> rgowdapp at redhat.com> wrote:
>>>>
>>>>> Thanks Dmitry. Can you provide the following debug info I asked
>>>>> earlier:
>>>>>
>>>>> * strace -ff -v ... of java application
>>>>> * dump of the I/O traffic seen by the mountpoint (use --dump-fuse
>>>>> while mounting).
>>>>>
>>>>> regards,
>>>>> Raghavendra
>>>>>
>>>>> On Sat, Dec 29, 2018 at 2:08 AM Dmitry Isakbayev
>>>>> wrote:
>>>>>
>>>>>> These 3 options seem to trigger both (reading zip file and renaming
>>>>>> files) problems.
>>>>>>
>>>>>> Options Reconfigured:
>>>>>> performance.io-cache: off
>>>>>> performance.stat-prefetch: off
>>>>>> performance.quick-read: off
>>>>>> performance.parallel-readdir: off
>>>>>> *performance.readdir-ahead: on*
>>>>>> *performance.write-behind: on*
>>>>>> *performance.read-ahead: on*
>>>>>> performance.client-io-threads: off
>>>>>> nfs.disable: on
>>>>>> transport.address-family: inet
>>>>>>
>>>>>>
>>>>>> On Fri, Dec 28, 2018 at 10:24 AM Dmitry Isakbayev
>>>>>> wrote:
>>>>>>
>>>>>>> Turning a single option on at a time still worked fine.  I will keep
>>>>>>> trying.
>>>>>>>
>>>>>>> We had used 4.1.5 on KVM/CentOS7.5 at AWS without these issues or
>>>>>>> log messages.  Do you suppose these issues are triggered by the new
>>>>>>> environment or did not exist in 4.1.5?
>>>>>>>
>>>>>>> [root at node1 ~]# glusterfs --version
>>>>>>> glusterfs 4.1.5
>>>>>>>
>>>>>>> On AWS using
>>>>>>> [root at node1 ~]# hostnamectl
>>>>>>>    Static hostname: node1
>>>>>>>          Icon name: computer-vm
>>>>>>>            Chassis: vm
>>>>>>>         Machine ID: b30d0f2110ac3807b210c19ede3ce88f
>>>>>>>            Boot ID: 52bb159a0aa94043a40e7c7651967bd9
>>>>>>>     Virtualization: kvm
>>>>>>>   Operating System: CentOS Linux 7 (Core)
>>>>>>>        CPE OS Name: cpe:/o:centos:centos:7
>>>>>>>             Kernel: Linux 3.10.0-862.3.2.el7.x86_64
>>>>>>>       Architecture: x86-64
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Dec 28, 2018 at 8:56 AM Raghavendra Gowdappa <
>>>>>>> rgowdapp at redhat.com> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Dec 28, 2018 at 7:23 PM Dmitry Isakbayev
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Ok. I will try different options.
>>>>>>>>>
>>>>>>>>> This system is scheduled to go into production soon.  What version
>>>>>>>>> would you recommend to roll back to?
>>>>>>>>>
>>>>>>>>
>>>>>>>> These are long standing issues. So, rolling back may not make these
>>>>>>>> issues go away. Instead if you think performance is agreeable to you,
>>>>>>>> please keep these xlators off in production.
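
The advice above (keep the performance xlators off) can be sketched as a small helper that only builds the command strings, following the `gluster volume set volumename option value` form quoted elsewhere in this archive. This is a minimal sketch, not a tested procedure: the option names and the volume name `gv0` are taken from the thread, while the function name `volume_set_commands` is hypothetical. Nothing is executed.

```python
# Performance options listed under "Options Reconfigured" in this thread.
PERF_OPTIONS = [
    "performance.io-cache",
    "performance.stat-prefetch",
    "performance.quick-read",
    "performance.parallel-readdir",
    "performance.readdir-ahead",
    "performance.write-behind",
    "performance.read-ahead",
    "performance.client-io-threads",
]


def volume_set_commands(volume, options, value="off"):
    """Return one 'gluster volume set' command string per option (hypothetical helper)."""
    return [f"gluster volume set {volume} {opt} {value}" for opt in options]


if __name__ == "__main__":
    # Print the commands for review; running them is left to the operator.
    for cmd in volume_set_commands("gv0", PERF_OPTIONS):
        print(cmd)
```

Passing `value="on"` with a single-element list gives the command for re-enabling one option at a time, which is the bisection the thread describes.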
>>>>>>>>
>>>>>>>>
>>>>>>>>> On Thu, Dec 27, 2018 at 10:55 PM Raghavendra Gowdappa <
>>>>>>>>> rgowdapp at redhat.com> wrote:
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Fri, Dec 28, 2018 at 3:13 AM Dmitry Isakbayev <
>>>>>>>>>> isakdim at gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Raghavendra,
>>>>>>>>>>>
>>>>>>>>>>> Thanks for the suggestion.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I am using
>>>>>>>>>>>
>>>>>>>>>>> [root at jl-fanexoss1p glusterfs]# gluster --version
>>>>>>>>>>> glusterfs 5.0
>>>>>>>>>>>
>>>>>>>>>>> On
>>>>>>>>>>> [root at jl-fanexoss1p glusterfs]# hostnamectl
>>>>>>>>>>>          Icon name: computer-vm
>>>>>>>>>>>            Chassis: vm
>>>>>>>>>>>         Machine ID: e44b8478ef7a467d98363614f4e50535
>>>>>>>>>>>            Boot ID: eed98992fdda4c88bdd459a89101766b
>>>>>>>>>>>     Virtualization: vmware
>>>>>>>>>>>   Operating System: Red Hat Enterprise Linux Server 7.5 (Maipo)
>>>>>>>>>>>        CPE OS Name: cpe:/o:redhat:enterprise_linux:7.5:GA:server
>>>>>>>>>>>             Kernel: Linux 3.10.0-862.14.4.el7.x86_64
>>>>>>>>>>>       Architecture: x86-64
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I have configured the following options
>>>>>>>>>>>
>>>>>>>>>>> [root at jl-fanexoss1p glusterfs]# gluster volume info
>>>>>>>>>>> Volume Name: gv0
>>>>>>>>>>> Type: Replicate
>>>>>>>>>>> Volume ID: 5ffbda09-c5e2-4abc-b89e-79b5d8a40824
>>>>>>>>>>> Status: Started
>>>>>>>>>>> Snapshot Count: 0
>>>>>>>>>>> Number of Bricks: 1 x 3 = 3
>>>>>>>>>>> Transport-type: tcp
>>>>>>>>>>> Bricks:
>>>>>>>>>>> Brick1: jl-fanexoss1p.cspire.net:/data/brick1/gv0
>>>>>>>>>>> Brick2: sl-fanexoss2p.cspire.net:/data/brick1/gv0
>>>>>>>>>>> Brick3: nxquorum1p.cspire.net:/data/brick1/gv0
>>>>>>>>>>> Options Reconfigured:
>>>>>>>>>>> performance.io-cache: off
>>>>>>>>>>> performance.stat-prefetch: off
>>>>>>>>>>> performance.quick-read: off
>>>>>>>>>>> performance.parallel-readdir: off
>>>>>>>>>>> performance.readdir-ahead: off
>>>>>>>>>>> performance.write-behind: off
>>>>>>>>>>> performance.read-ahead: off
>>>>>>>>>>> performance.client-io-threads: off
>>>>>>>>>>> nfs.disable: on
>>>>>>>>>>> transport.address-family: inet
>>>>>>>>>>>
>>>>>>>>>>> I don't know if it is related, but I am seeing a lot of
>>>>>>>>>>> [2018-12-27 20:19:23.776080] W [MSGID: 114031]
>>>>>>>>>>> [client-rpc-fops_v2.c:1932:client4_0_seek_cbk] 2-gv0-client-0: remote
>>>>>>>>>>> operation failed [No such device or address]
>>>>>>>>>>> [2018-12-27 20:19:47.735190] E [MSGID: 101191]
>>>>>>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch
>>>>>>>>>>> handler
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> These msgs were introduced by patch [1]. To the best of my
>>>>>>>>>> knowledge they are benign. We'll be sending a patch to fix these msgs
>>>>>>>>>> though.
>>>>>>>>>>
>>>>>>>>>> +Mohit Agrawal +Milind Changire
>>>>>>>>>> . Can you try to identify why we are
>>>>>>>>>> seeing these messages? If possible please send a patch to fix this.
>>>>>>>>>>
>>>>>>>>>> [1]
>>>>>>>>>> https://review.gluster.org/r/I578c3fc67713f4234bd3abbec5d3fbba19059ea5
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> And java.io exceptions trying to rename files.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> When you see the errors is it possible to collect,
>>>>>>>>>> * strace of the java application (strace -ff -v ...)
>>>>>>>>>> * fuse-dump of the glusterfs mount (use option --dump-fuse while
>>>>>>>>>> mounting)?
>>>>>>>>>>
>>>>>>>>>> I also need another favour from you. By trial and error, can you
>>>>>>>>>> point out which of the many performance xlators you've turned off is
>>>>>>>>>> causing the issue?
>>>>>>>>>>
>>>>>>>>>> The above two data-points will help us to fix the problem.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> Thank You,
>>>>>>>>>>> Dmitry
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Dec 27, 2018 at 3:48 PM Raghavendra Gowdappa <
>>>>>>>>>>> rgowdapp at redhat.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> What version of glusterfs are you using? It might be either
>>>>>>>>>>>> * a stale metadata issue.
>>>>>>>>>>>> * inconsistent ctime issue.
>>>>>>>>>>>>
>>>>>>>>>>>> Can you try turning off all performance xlators? If the issue
>>>>>>>>>>>> is 1, that should help.
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Dec 28, 2018 at 1:51 AM Dmitry Isakbayev <
>>>>>>>>>>>> isakdim at gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Attempted to set 'performance.read-ahead off' according to
>>>>>>>>>>>>> https://jira.apache.org/jira/browse/AMQ-7041
>>>>>>>>>>>>> That did not help.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Dec 24, 2018 at 2:11 PM Dmitry Isakbayev <
>>>>>>>>>>>>> isakdim at gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> The core file generated by JVM suggests that it happens
>>>>>>>>>>>>>> because the file is changing while it is being read -
>>>>>>>>>>>>>> https://bugs.java.com/bugdatabase/view_bug.do?bug_id=8186557.
>>>>>>>>>>>>>> The application reads in the zipfile and goes through the zip
>>>>>>>>>>>>>> entries, then reloads the file and goes through the zip entries again.  It does so
>>>>>>>>>>>>>> 3 times.  The application never crashes on the 1st cycle but sometimes
>>>>>>>>>>>>>> crashes on the 2nd or 3rd cycle.
>>>>>>>>>>>>>> The zip file is generated about 20 seconds prior to it being
>>>>>>>>>>>>>> used and is not updated or even used by any other application.  I have
>>>>>>>>>>>>>> never seen this problem on a plain file system.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I would appreciate any suggestions on how to go debugging
>>>>>>>>>>>>>> this issue.  I can change the source code of the java application.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>> Dmitry
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>> Gluster-users mailing list
>>>>>>>>>>>>> Gluster-users at gluster.org
>>>>>>>>>>>>> https://lists.gluster.org/mailman/listinfo/gluster-users
>>>>>>>>>>>>
>>>>>>>>>>>> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-users

--
Amar Tumballi (amarts)
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From atumball at redhat.com  Tue Jan 22 06:02:01 2019
From: atumball at redhat.com (Amar Tumballi Suryanarayan)
Date: Tue, 22 Jan 2019 11:32:01 +0530
Subject: [Gluster-users] [External] Re: Self Heal Confusion
In-Reply-To:
References: <24cfe8a5-dadb-6271-9b7f-af8670f43fce@l1049h.com> <14cbebd2-44d0-8558-1e26-944e1dec15a7@l1049h.com> <1851464190.54195617.1545898152060.JavaMail.zimbra@redhat.com> <988970243.54246776.1545976827971.JavaMail.zimbra@redhat.com> <9d548f7b-1859-f438-2cb9-9ca1cb3baa86@l1049h.com> <3c2edc47-2cdc-0b90-a708-c59cc8a51937@l1049h.com> <7fe3f846-7289-5885-9905-3e7812964970@l1049h.com>
Message-ID:

Brett,

On Sat, Jan 5, 2019 at 3:54 AM Brett Holcomb wrote:

> I wrote a script to search the output of gluster volume heal projects
> info, picks the brick I gave it and then deletes any of the files listed
> that actually exist in .glusterfs/dir1/dir2.  I did this on the first host
> which had 85 pending and that cleared them up so I'll do it via ssh on the
> other two servers.
>
> Hopefully that will clear it up and glusterfs will be happy again.
>
>
If things are fine now, consider posting those scripts as patch to glusterfs, or post in your own github account, so in future we can refer others to use same scripts when in trouble.

Thanks.

-Amar

> Thanks everyone for the help.
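
The script itself was not posted to the list. As a rough illustration of the lookup it describes, the sketch below maps `<gfid:...>` entries from heal-info style text to the `.glusterfs/dir1/dir2` location on a brick, where dir1 and dir2 are the first two pairs of hex characters of the gfid. Everything here is an assumption for illustration: the function names are hypothetical, the sample text and host name `server1` are made up, and the gfid is invented; only the brick path `/srv/gfs01/Projects` comes from the thread. The sketch only prints candidate paths and deletes nothing.

```python
import re
from pathlib import PurePosixPath

# Matches "<gfid:xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx>" entries in heal-info style text.
GFID_RE = re.compile(
    r"<gfid:([0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12})>"
)


def gfid_backend_path(brick_root, gfid):
    """Return <brick>/.glusterfs/<first 2 hex>/<next 2 hex>/<gfid> (hypothetical helper)."""
    gfid = gfid.lower()
    return PurePosixPath(brick_root) / ".glusterfs" / gfid[0:2] / gfid[2:4] / gfid


def candidate_paths(heal_info_text, brick_root):
    """List the backend paths for every gfid entry found in the given text."""
    return [gfid_backend_path(brick_root, g) for g in GFID_RE.findall(heal_info_text)]


if __name__ == "__main__":
    # Made-up sample; real input would be saved output of the heal info command.
    sample = """Brick server1:/srv/gfs01/Projects
<gfid:1a2b3c4d-0000-1111-2222-333344445555>
Status: Connected
Number of entries: 1
"""
    for path in candidate_paths(sample, "/srv/gfs01/Projects"):
        print(path)
```

Checking that each printed path exists on the brick, and reviewing the list before removing anything, would be the cautious next step.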
>
>
> On 12/31/18 4:39 AM, Davide Obbi wrote:
>
> cluster.quorum-type auto
> cluster.quorum-count (null)
> cluster.server-quorum-type off
> cluster.server-quorum-ratio 0
> cluster.quorum-reads                    no
>
> Where exactly do I remove the gfid entries from - the .glusterfs
> directory? --> yes can't remember exactly where but try to do a find in
> the brick paths with the gfid  it should return something
>
> Where do I put the cluster.heal-timeout option - which file? --> gluster
> volume set volumename option value
>
> On Mon, Dec 31, 2018 at 10:34 AM Brett Holcomb
> wrote:
>
>> That is probably the case as a lot of files were deleted some time ago.
>>
>> I'm on version 5.2 but was on 3.12 until about a week ago.
>>
>> Here is the quorum info.  I'm running a distributed replicated volumes
>> in 2 x 3 = 6
>>
>> cluster.quorum-type auto
>> cluster.quorum-count (null)
>> cluster.server-quorum-type off
>> cluster.server-quorum-ratio 0
>> cluster.quorum-reads                    no
>>
>> Where exactly do I remove the gfid entries from - the .glusterfs
>> directory?  Do I just delete all the directories and files under this
>> directory?
>>
>> Where do I put the cluster.heal-timeout option - which file?
>>
>> I think you've hit on the cause of the issue.  Thinking back we've had
>> some extended power outages and due to a misconfiguration in the swap
>> file device name a couple of the nodes did not come up and I didn't
>> catch it for a while so maybe the deletes occurred then.
>>
>> Thank you.
>>
>> On 12/31/18 2:58 AM, Davide Obbi wrote:
>> > if the long GFID does not correspond to any file it could mean the
>> > file has been deleted by the client mounting the volume. I think this
>> > is caused when the delete was issued and the number of active bricks
>> > were not reaching quorum majority or a second brick was taken down
>> > while another was down or did not finish the selfheal, the latter more
>> > likely.
>> > It would be interesting to see:
>> > - what version of glusterfs you running, it happened to me with 3.12
>> > - volume quorum rules: "gluster volume get vol all | grep quorum"
>> >
>> > To clean it up if i remember correctly it should be possible to delete
>> > the gfid entries from the brick mounts on the glusterfs server nodes
>> > reporting the files to heal.
>> >
>> > As a side note you might want to consider changing the selfheal
>> > timeout to more aggressive schedule in cluster.heal-timeout option
>> _______________________________________________
>> Gluster-users mailing list
>> Gluster-users at gluster.org
>> https://lists.gluster.org/mailman/listinfo/gluster-users
>
> --
> Davide Obbi
> System Administrator
>
> Booking.com B.V.
> Vijzelstraat 66-80 Amsterdam 1017HL Netherlands
> Direct +31207031558
> [image: Booking.com]
> Empowering people to experience the world since 1996
> 43 languages, 214+ offices worldwide, 141,000+ global destinations, 29
> million reported listings
> Subsidiary of Booking Holdings Inc. (NASDAQ: BKNG)
>
> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-users

--
Amar Tumballi (amarts)
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From atumball at redhat.com  Tue Jan 22 06:07:04 2019
From: atumball at redhat.com (Amar Tumballi Suryanarayan)
Date: Tue, 22 Jan 2019 11:37:04 +0530
Subject: [Gluster-users] [Bugs] Bricks are going offline unable to recover with heal/start force commands
In-Reply-To:
References:
Message-ID:

Hi Shaik,

Can you check what is there in brick logs? They are located in /var/log/glusterfs/bricks/*.

Looks like the samba hooks script failed, but that shouldn't matter in this use case.

Also, I see that you are trying to setup heketi to provision volumes, which means you may be using gluster in container usecases.
If you are still in 'PoC' phase, can you give https://github.com/gluster/gcs a try? That makes the deployment and the stack little simpler.

-Amar

On Tue, Jan 22, 2019 at 11:29 AM Shaik Salam wrote:

> Can anyone respond how to recover bricks apart from heal/start force
> according to below events from logs.
> Please let me know any other logs required.
> Thanks in advance.
>
> BR
> Salam
>
>
> From:        Shaik Salam/HYD/TCS
> To:        bugs at gluster.org, gluster-users at gluster.org
> Date:        01/21/2019 10:03 PM
> Subject:        Bricks are going offline unable to recover with
> heal/start force commands
> ------------------------------
>
> Hi,
>
> Bricks are in offline and unable to recover with following commands
>
> gluster volume heal
>
> gluster volume start force
>
> But still bricks are offline.
>
> sh-4.2# gluster volume status vol_3442e86b6d994a14de73f1b8c82cf0b8
> Status of volume: vol_3442e86b6d994a14de73f1b8c82cf0b8
> Gluster process                             TCP Port  RDMA Port  Online  Pid
> ------------------------------------------------------------------------------
> Brick 192.168.3.6:/var/lib/heketi/mounts/vg
> _ca57f326195c243be2380ce4e42a4191/brick_952
> d75fd193c7209c9a81acbc23a3747/brick         49166     0          Y       269
> Brick 192.168.3.5:/var/lib/heketi/mounts/vg
> _d5f17487744584e3652d3ca943b0b91b/brick_e15
> c12cceae12c8ab7782dd57cf5b6c1/brick         N/A       N/A        N       N/A
> Brick 192.168.3.15:/var/lib/heketi/mounts/v
> g_462ea199185376b03e4b0317363bb88c/brick_17
> 36459d19e8aaa1dcb5a87f48747d04/brick        49173     0          Y       225
> Self-heal Daemon on localhost               N/A       N/A        Y       45826
> Self-heal Daemon on 192.168.3.6             N/A       N/A        Y       65196
> Self-heal Daemon on 192.168.3.15            N/A       N/A        Y       52915
>
> Task Status of Volume vol_3442e86b6d994a14de73f1b8c82cf0b8
> ------------------------------------------------------------------------------
>
> We can see following events from when we start forcing volumes
>
> /mgmt/glusterd.so(+0xe2b3a) [0x7fca9e139b3a] -->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2605) [0x7fca9e139605] -->/lib64/libglusterfs.so.0(runner_log+0x115) [0x7fcaa346f0e5] ) 0-management: Ran script: /var/lib/glusterd/hooks/1/start/post/S29CTDBsetup.sh --volname=vol_3442e86b6d994a14de73f1b8c82cf0b8 --first=no --version=1 --volume-op=start --gd-workdir=/var/lib/glusterd
> [2019-01-21 08:22:34.555068] E [run.c:241:runner_log] (-->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2b3a) [0x7fca9e139b3a] -->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2563) [0x7fca9e139563] -->/lib64/libglusterfs.so.0(runner_log+0x115) [0x7fcaa346f0e5] ) 0-management: Failed to execute script: /var/lib/glusterd/hooks/1/start/post/S30samba-start.sh --volname=vol_3442e86b6d994a14de73f1b8c82cf0b8 --first=no --version=1 --volume-op=start --gd-workdir=/var/lib/glusterd
> [2019-01-21 08:22:53.389049] I [MSGID: 106499] [glusterd-handler.c:4314:__glusterd_handle_status_volume] 0-management: Received status volume req for volume vol_3442e86b6d994a14de73f1b8c82cf0b8
> [2019-01-21 08:23:25.346839] I [MSGID: 106487] [glusterd-handler.c:1486:__glusterd_handle_cli_list_friends] 0-glusterd: Received cli list req
>
> We can see following events from when we heal volumes.
> > [2019-01-21 08:20:07.576070] W [rpc-clnt.c:1753:rpc_clnt_submit] > 0-glusterfs: error returned while attempting to connect to host:(null), > port:0 > [2019-01-21 08:20:07.580225] I [cli-rpc-ops.c:9182:gf_cli_heal_volume_cbk] > 0-cli: Received resp to heal volume > [2019-01-21 08:20:07.580326] I [input.c:31:cli_batch] 0-: Exiting with: -1 > [2019-01-21 08:22:30.423311] I [cli.c:768:main] 0-cli: Started running > gluster with version 4.1.5 > [2019-01-21 08:22:30.463648] I [MSGID: 101190] > [event-epoll.c:617:event_dispatch_epoll_worker] 0-epoll: Started thread > with index 1 > [2019-01-21 08:22:30.463718] I [socket.c:2632:socket_event_handler] > 0-transport: EPOLLERR - disconnecting now > [2019-01-21 08:22:30.463859] W [rpc-clnt.c:1753:rpc_clnt_submit] > 0-glusterfs: error returned while attempting to connect to host:(null), > port:0 > [2019-01-21 08:22:33.427710] I [socket.c:2632:socket_event_handler] > 0-transport: EPOLLERR - disconnecting now > [2019-01-21 08:22:34.581555] I > [cli-rpc-ops.c:1472:gf_cli_start_volume_cbk] 0-cli: Received resp to start > volume > [2019-01-21 08:22:34.581678] I [input.c:31:cli_batch] 0-: Exiting with: 0 > [2019-01-21 08:22:53.345351] I [cli.c:768:main] 0-cli: Started running > gluster with version 4.1.5 > [2019-01-21 08:22:53.387992] I [MSGID: 101190] > [event-epoll.c:617:event_dispatch_epoll_worker] 0-epoll: Started thread > with index 1 > [2019-01-21 08:22:53.388059] I [socket.c:2632:socket_event_handler] > 0-transport: EPOLLERR - disconnecting now > [2019-01-21 08:22:53.388138] W [rpc-clnt.c:1753:rpc_clnt_submit] > 0-glusterfs: error returned while attempting to connect to host:(null), > port:0 > [2019-01-21 08:22:53.394737] I [input.c:31:cli_batch] 0-: Exiting with: 0 > [2019-01-21 08:23:25.304688] I [cli.c:768:main] 0-cli: Started running > gluster with version 4.1.5 > [2019-01-21 08:23:25.346319] I [MSGID: 101190] > [event-epoll.c:617:event_dispatch_epoll_worker] 0-epoll: Started thread > with index 1 > [2019-01-21 
08:23:25.346389] I [socket.c:2632:socket_event_handler]
> 0-transport: EPOLLERR - disconnecting now
> [2019-01-21 08:23:25.346500] W [rpc-clnt.c:1753:rpc_clnt_submit]
> 0-glusterfs: error returned while attempting to connect to host:(null),
> port:0
>
> Please let us know steps to recover bricks.
>
> BR
> Salam
>

--
Amar Tumballi (amarts)

From atumball at redhat.com Tue Jan 22 06:10:27 2019
From: atumball at redhat.com (Amar Tumballi Suryanarayan)
Date: Tue, 22 Jan 2019 11:40:27 +0530
Subject: [Gluster-users] usage of harddisks: each hdd a brick? raid?
In-Reply-To:
References: <22faba73-4e60-9a55-75bf-e52ce59858b3@ya.ru> <258abbde-5a3a-0df2-988a-cb4d1b8b5347@ya.ru>
Message-ID:

On Thu, Jan 10, 2019 at 1:56 PM Hu Bert wrote:

> Hi,
>
> > > We are also using 10TB disks, heal takes 7-8 days.
> > > You can play with "cluster.shd-max-threads" setting. It is default 1 I
> > > think. I am using it with 4.
> > > Below you can find more info:
> > > https://access.redhat.com/solutions/882233
> > cluster.shd-max-threads: 8
> > cluster.shd-wait-qlength: 10000
>
> Our setup:
> cluster.shd-max-threads: 2
> cluster.shd-wait-qlength: 10000
>
> > >> Volume Name: shared
> > >> Type: Distributed-Replicate
> > A, you have distributed-replicated volume, but I choose only replicated
> > (for beginning simplicity :)
> > May be replicated volume are healing faster?
>
> Well, maybe our setup with 3 servers and 4 disks=bricks == 12 bricks,
> resulting in a distributed-replicate volume (all /dev/sd{a,b,c,d}
> identical) , isn't optimal? And it would be better to create a
> replicate 3 volume with only 1 (big) brick per server (with 4 disks:
> either a logical volume or sw/hw raid)?
>
> But it would be interesting to know if a replicate volume is healing
> faster than a distributed-replicate volume - even if there was only 1
> faulty brick.
>

We don't have any data points to confirm this, but it may be true, especially because crawling can get a little slower when DHT (i.e., distribute) is involved, which means healing would get slower too.

We are experimenting with a few performance enhancement patches (like https://review.gluster.org/20636); it would be great to see how things work with a newer base. We will keep the list updated about performance numbers once we have more data on them.

-Amar

>
> Thx
> Hubert
>

--
Amar Tumballi (amarts)
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From atumball at redhat.com Tue Jan 22 06:12:55 2019
From: atumball at redhat.com (Amar Tumballi Suryanarayan)
Date: Tue, 22 Jan 2019 11:42:55 +0530
Subject: [Gluster-users] Samba+Gluster: Performance measurements for small files
In-Reply-To:
References:
Message-ID:

For the Samba use case, please make sure you have nl-cache (i.e., 'negative-lookup cache') enabled. We have seen some improvements from this option.

-Amar

On Tue, Dec 18, 2018 at 8:23 PM David Spisla wrote:

> Dear Gluster Community,
>
> it is a known fact that Samba+Gluster has a bad smallfile performance. We
> now have some test measurements created by this setup: 2-Node-Cluster on
> real hardware with Replica-2 Volume (just one subvolume), Gluster v.4.1.6,
> Samba v4.7. Samba writes to Gluster via FUSE. Files created by fio. We used
> a Windows System as Client which is in the same network like the servers.
>
> The measurements are as follows. In each test case 400 files were written:
>
>              64KiB_x_400 files   1MiB_x_400 files   10MiB_x_400 files
> 1 Thread     0,77 MiB/s          8,05 MiB/s         72,67 MiB/s
> 4 Threads    0,86 MiB/s          8,92 MiB/s         90,38 MiB/s
> 8 Threads    0,87 MiB/s          8,92 MiB/s         94,75 MiB/s
>
> Does anyone have measurements that are in a similar range or are significantly different?
> We do not know which values can still be considered "normal" and which are not.
> We also know that there are options to improve performance. But first of all we are interested
> in whether there are reference values.
> Regards
> David Spisla

--
Amar Tumballi (amarts)
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From shaik.salam at tcs.com Tue Jan 22 06:36:57 2019
From: shaik.salam at tcs.com (Shaik Salam)
Date: Tue, 22 Jan 2019 12:06:57 +0530
Subject: [Gluster-users] [Bugs] Bricks are going offline unable to recover with heal/start force commands
In-Reply-To:
References:
Message-ID:

Hi Surya,

I have enabled the DEBUG log level for the brick, but nothing is being written to the brick log.

gluster volume set vol_3442e86b6d994a14de73f1b8c82cf0b8 diagnostics.brick-log-level DEBUG

sh-4.2# pwd
/var/log/glusterfs/bricks

sh-4.2# ls -la |grep brick_e15c12cceae12c8ab7782dd57cf5b6c1
-rw-------. 1 root root 0 Jan 20 02:46 var-lib-heketi-mounts-vg_d5f17487744584e3652d3ca943b0b91b-brick_e15c12cceae12c8ab7782dd57cf5b6c1-brick.log

BR
Salam

From: "Amar Tumballi Suryanarayan"
To: "Shaik Salam"
Cc: "gluster-users at gluster.org List"
Date: 01/22/2019 11:38 AM
Subject: Re: [Bugs] Bricks are going offline unable to recover with heal/start force commands

"External email. Open with Caution"

Hi Shaik,

Can you check what is there in brick logs? They are located in /var/log/glusterfs/bricks/*?

Looks like the samba hooks script failed, but that shouldn't matter in this use case.

Also, I see that you are trying to setup heketi to provision volumes, which means you may be using gluster in container usecases. If you are still in 'PoC' phase, can you give https://github.com/gluster/gcs a try? That makes the deployment and the stack little simpler.

-Amar

On Tue, Jan 22, 2019 at 11:29 AM Shaik Salam wrote:

Can anyone respond how to recover bricks apart from heal/start force according to below events from logs.
Please let me know any other logs required.
Thanks in advance.

BR
Salam

From: Shaik Salam/HYD/TCS
To: bugs at gluster.org, gluster-users at gluster.org
Date: 01/21/2019 10:03 PM
Subject: Bricks are going offline unable to recover with heal/start force commands

Hi,

Bricks are in offline and unable to recover with following commands

gluster volume heal

gluster volume start force

But still bricks are offline.

sh-4.2# gluster volume status vol_3442e86b6d994a14de73f1b8c82cf0b8
Status of volume: vol_3442e86b6d994a14de73f1b8c82cf0b8
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 192.168.3.6:/var/lib/heketi/mounts/vg
_ca57f326195c243be2380ce4e42a4191/brick_952
d75fd193c7209c9a81acbc23a3747/brick         49166     0          Y       269
Brick 192.168.3.5:/var/lib/heketi/mounts/vg
_d5f17487744584e3652d3ca943b0b91b/brick_e15
c12cceae12c8ab7782dd57cf5b6c1/brick         N/A       N/A        N       N/A
Brick 192.168.3.15:/var/lib/heketi/mounts/v
g_462ea199185376b03e4b0317363bb88c/brick_17
36459d19e8aaa1dcb5a87f48747d04/brick        49173     0          Y       225
Self-heal Daemon on localhost               N/A       N/A        Y       45826
Self-heal Daemon on 192.168.3.6             N/A       N/A        Y       65196
Self-heal Daemon on 192.168.3.15            N/A       N/A        Y       52915

Task Status of Volume vol_3442e86b6d994a14de73f1b8c82cf0b8
------------------------------------------------------------------------------

We can see following events from when we start forcing volumes

/mgmt/glusterd.so(+0xe2b3a) [0x7fca9e139b3a] -->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2605) [0x7fca9e139605] -->/lib64/libglusterfs.so.0(runner_log+0x115) [0x7fcaa346f0e5] ) 0-management: Ran script: /var/lib/glusterd/hooks/1/start/post/S29CTDBsetup.sh --volname=vol_3442e86b6d994a14de73f1b8c82cf0b8 --first=no --version=1 --volume-op=start --gd-workdir=/var/lib/glusterd
[2019-01-21 08:22:34.555068] E [run.c:241:runner_log] (-->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2b3a) [0x7fca9e139b3a] -->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2563) [0x7fca9e139563] -->/lib64/libglusterfs.so.0(runner_log+0x115) [0x7fcaa346f0e5] ) 0-management: Failed to execute script: /var/lib/glusterd/hooks/1/start/post/S30samba-start.sh --volname=vol_3442e86b6d994a14de73f1b8c82cf0b8 --first=no --version=1 --volume-op=start --gd-workdir=/var/lib/glusterd
[2019-01-21 08:22:53.389049] I [MSGID: 106499] [glusterd-handler.c:4314:__glusterd_handle_status_volume] 0-management: Received status volume req for volume vol_3442e86b6d994a14de73f1b8c82cf0b8
[2019-01-21 08:23:25.346839] I [MSGID: 106487] [glusterd-handler.c:1486:__glusterd_handle_cli_list_friends] 0-glusterd: Received cli list req

We can see following events from when we heal volumes.

[2019-01-21 08:20:07.576070] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-glusterfs: error returned while attempting to connect to host:(null), port:0
[2019-01-21 08:20:07.580225] I [cli-rpc-ops.c:9182:gf_cli_heal_volume_cbk] 0-cli: Received resp to heal volume
[2019-01-21 08:20:07.580326] I [input.c:31:cli_batch] 0-: Exiting with: -1
[2019-01-21 08:22:30.423311] I [cli.c:768:main] 0-cli: Started running gluster with version 4.1.5
[2019-01-21 08:22:30.463648] I [MSGID: 101190] [event-epoll.c:617:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2019-01-21 08:22:30.463718] I [socket.c:2632:socket_event_handler] 0-transport: EPOLLERR - disconnecting now
[2019-01-21 08:22:30.463859] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-glusterfs: error returned while attempting to connect to host:(null), port:0
[2019-01-21 08:22:33.427710] I [socket.c:2632:socket_event_handler] 0-transport: EPOLLERR - disconnecting now
[2019-01-21 08:22:34.581555] I [cli-rpc-ops.c:1472:gf_cli_start_volume_cbk] 0-cli: Received resp to start volume
[2019-01-21 08:22:34.581678] I [input.c:31:cli_batch] 0-: Exiting with: 0
[2019-01-21 08:22:53.345351] I [cli.c:768:main] 0-cli: Started running gluster with version 4.1.5
[2019-01-21 08:22:53.387992] I [MSGID: 101190] [event-epoll.c:617:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2019-01-21 08:22:53.388059] I [socket.c:2632:socket_event_handler] 0-transport: EPOLLERR - disconnecting now
[2019-01-21 08:22:53.388138] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-glusterfs: error returned while attempting to connect to host:(null), port:0
[2019-01-21 08:22:53.394737] I [input.c:31:cli_batch] 0-: Exiting with: 0
[2019-01-21 08:23:25.304688] I [cli.c:768:main] 0-cli: Started running gluster with version 4.1.5
[2019-01-21 08:23:25.346319] I [MSGID: 101190] [event-epoll.c:617:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2019-01-21 08:23:25.346389] I [socket.c:2632:socket_event_handler] 0-transport: EPOLLERR - disconnecting now
[2019-01-21 08:23:25.346500] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-glusterfs: error returned while attempting to connect to host:(null), port:0

Please let us know steps to recover bricks.

BR
Salam

=====-----=====-----=====
Notice: The information contained in this e-mail message and/or attachments to it may contain confidential or privileged information. If you are not the intended recipient, any dissemination, use, review, distribution, printing or copying of the information contained in this e-mail message and/or attachments to it are strictly prohibited. If you have received this communication in error, please notify us by reply e-mail or telephone and immediately and permanently delete the message and any attachments. Thank you

_______________________________________________
Bugs mailing list
Bugs at gluster.org
https://lists.gluster.org/mailman/listinfo/bugs

--
Amar Tumballi (amarts)

From shaik.salam at tcs.com Tue Jan 22 07:16:31 2019
From: shaik.salam at tcs.com (Shaik Salam)
Date: Tue, 22 Jan 2019 12:46:31 +0530
Subject: [Gluster-users] [Bugs] Bricks are going offline unable to recover with heal/start force commands
In-Reply-To:
References:
Message-ID:

Hi Surya,

This is already a customer setup, so we can't redeploy it.
I enabled debug logging at the brick level, but nothing is being written to the log.
Can you tell me whether there are any other ways to troubleshoot, or other logs to look at?
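
A possible next step for the question above, sketched under assumptions and not taken from the thread: the status output earlier in this message chain marks only the 192.168.3.5 brick as offline, so a small helper can pull the offline brick paths out of the status output, after which the brick process and brick log can be checked on that node. The function name offline_bricks is hypothetical. The parsing assumes the column layout shown in the status output above (long brick paths wrapped over several lines, with the port, online and pid columns on the last one) and brick paths without spaces.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): read the output of
# "gluster volume status <VOL>" on stdin and print the path of every
# brick whose Online column is "N".
# Assumptions: brick paths contain no spaces; a long path may be wrapped
# over several lines, and the TCP port, RDMA port, Online and Pid columns
# sit on the last line of the brick entry.
offline_bricks() {
    awk '
        # Start of a brick entry: drop the "Brick " prefix (fields are re-split).
        /^Brick / { inbrick = 1; path = ""; sub(/^Brick /, "") }
        # A line holding only a path fragment: keep collecting.
        inbrick && NF == 1 { path = path $1; next }
        # Last line of the entry: path fragment followed by the four columns.
        inbrick && NF == 5 {
            path = path $1
            if ($4 == "N") print path
            inbrick = 0
        }
    '
}

# Usage on a gluster node (commands as they appear in this thread):
#   gluster volume status vol_3442e86b6d994a14de73f1b8c82cf0b8 | offline_bricks
#   ps -ax | grep glusterfsd
```

If the helper prints a brick that has no matching glusterfsd process on its node, the brick process most likely never started, which would be consistent with the zero-byte brick log shown earlier; that reading is an inference, not something confirmed in the thread.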

From spisla80 at gmail.com Tue Jan 22 08:09:32 2019
From: spisla80 at gmail.com (David Spisla)
Date: Tue, 22 Jan 2019 09:09:32 +0100
Subject: [Gluster-users] Samba+Gluster: Performance measurements for small files
In-Reply-To:
References:
Message-ID:

Hello Amar,
thank you for the advice. We already use the nl-cache option and a bunch of other settings. At the moment we are trying the samba-vfs-glusterfs plugin to access a gluster volume via Samba, and performance has increased with it. Additionally, we are looking for some performance measurements to compare with. Maybe someone in the community also does performance tests. Does Red Hat have official reference measurements?

Regards
David Spisla

On Tue, 22 Jan 2019 at 07:14, Amar Tumballi Suryanarayan <atumball at redhat.com> wrote:

> For Samba usecase, please make sure you have nl-cache (ie,
> 'negative-lookup cache') enabled. We have seen some improvements from this
> value.
>
> -Amar
>
> On Tue, Dec 18, 2018 at 8:23 PM David Spisla wrote:
>
>> Dear Gluster Community,
>>
>> it is a known fact that Samba+Gluster has a bad smallfile performance. We
>> now have some test measurements created by this setup: 2-Node-Cluster on
>> real hardware with Replica-2 Volume (just one subvolume), Gluster v.4.1.6,
>> Samba v4.7. Samba writes to Gluster via FUSE. Files created by fio. We used
>> a Windows System as Client which is in the same network like the servers.
>>
>> The measurements are as follows. In each test case 400 files were written:
>>
>>              64KiB_x_400 files   1MiB_x_400 files   10MiB_x_400 files
>> 1 Thread     0,77 MiB/s          8,05 MiB/s         72,67 MiB/s
>> 4 Threads    0,86 MiB/s          8,92 MiB/s         90,38 MiB/s
>> 8 Threads    0,87 MiB/s          8,92 MiB/s         94,75 MiB/s
>>
>> Does anyone have measurements that are in a similar range or are significantly different?
>> We do not know which values can still be considered "normal" and which are not.
>> We also know that there are options to improve performance. But first of all we are interested
>> in whether there are reference values.
>> Regards
>> David Spisla
>
> --
> Amar Tumballi (amarts)
>

From amudhan83 at gmail.com Tue Jan 22 08:19:51 2019
From: amudhan83 at gmail.com (Amudhan P)
Date: Tue, 22 Jan 2019 13:49:51 +0530
Subject: [Gluster-users] Increasing Bitrot speed glusterfs 4.1.6
In-Reply-To:
References:
Message-ID:

Is the bitrot feature in GlusterFS production ready, or is it still in a beta phase?

On Mon, Jan 14, 2019 at 12:46 PM Amudhan P wrote:

> Resending mail.
>
> I have a total size of 50GB files per node and it has crossed 5 days but
> till now not completed bitrot signature process, yet 20GB+ files are
> pending for completion.
>
> On Fri, Jan 11, 2019 at 12:02 PM Amudhan P wrote:
>
>> Hi,
>>
>> How do I increase the speed of bitrot file signature process in
>> glusterfs 4.1.6?
>> Currently, it's processing 250 KB/s. is there any way to do the changes
>> thru gluster cli?
>>
>> regards
>> Amudhan
>>
>

From davide.obbi at booking.com Tue Jan 22 08:23:03 2019
From: davide.obbi at booking.com (Davide Obbi)
Date: Tue, 22 Jan 2019 09:23:03 +0100
Subject: [Gluster-users] [External] Re: Samba+Gluster: Performance measurements for small files
In-Reply-To:
References:
Message-ID:

Hi David,

I haven't tested Samba, only the GlusterFS FUSE client. I posted the results a few months ago; the tests were conducted using Gluster 4.1.5:

*Options Reconfigured:*
client.event-threads                  3
performance.cache-size                8GB
performance.io-thread-count           24
network.inode-lru-limit               1048576
performance.parallel-readdir          on
performance.cache-invalidation        on
performance.md-cache-timeout          600
features.cache-invalidation           on
features.cache-invalidation-timeout   600
performance.client-io-threads         on

nr of clients                         6
Network                               10Gb
Clients Mem                           128GB
Clients Cores                         22
Centos                                7.5.1804
Kernel                                3.10.0-862.14.4.el7.x86_64

nr of servers/bricks per volume       3
Network                               100Gb *node to node is 100Gb, clients 10Gb
Server Mem                            377GB
Server Cores                          56 *Intel 5120 CPU
Storage                               4x8TB NVME
Centos                                7.5.1804
Kernel                                3.10.0-862.14.4.el7.x86_64

For example, these are the FOPS with a 128K I/O size (considered the sweet spot for GlusterFS according to the documentation), with 8 threads per client in blue and 4 threads per client in red:
[image: image.png]
Below, 4K:
[image: image.png]
and 1MB:
[image: image.png]

On Tue, Jan 22, 2019 at 9:09 AM David Spisla wrote:

> Hello Amar,
> thank you for the advice. We already use nl-cache option and a bunch of
> other settings. At the moment we try the samba-vfs-glusterfs plugin to
> access a gluster volume via samba. The performance increase now.
> Additionally we are looking for some performance measurements to compare
> with. Maybe someone in the community also does performance tests. Does
> Redhat has some official reference measurement?
>
> Regards
> David Spisla
>
> On Tue, 22 Jan 2019 at 07:14, Amar Tumballi Suryanarayan <atumball at redhat.com> wrote:
>
>> For Samba usecase, please make sure you have nl-cache (ie,
>> 'negative-lookup cache') enabled. We have seen some improvements from this
>> value.
>>
>> -Amar
>>
>> On Tue, Dec 18, 2018 at 8:23 PM David Spisla wrote:
>>
>>> Dear Gluster Community,
>>>
>>> it is a known fact that Samba+Gluster has a bad smallfile performance.
>>> We now have some test measurements created by this setup: 2-Node-Cluster on
>>> real hardware with Replica-2 Volume (just one subvolume), Gluster v.4.1.6,
>>> Samba v4.7. Samba writes to Gluster via FUSE. Files created by fio. We used
>>> a Windows System as Client which is in the same network like the servers.
>>>
>>> The measurements are as follows. In each test case 400 files were
>>> written:
>>>
>>>              64KiB_x_400 files   1MiB_x_400 files   10MiB_x_400 files
>>> 1 Thread     0,77 MiB/s          8,05 MiB/s         72,67 MiB/s
>>> 4 Threads    0,86 MiB/s          8,92 MiB/s         90,38 MiB/s
>>> 8 Threads    0,87 MiB/s          8,92 MiB/s         94,75 MiB/s
>>>
>>> Does anyone have measurements that are in a similar range or are significantly different?
>>> We do not know which values can still be considered "normal" and which are not.
>>> We also know that there are options to improve performance. But first of all we are interested
>>> in whether there are reference values.
>>> Regards
>>> David Spisla
>>
>> --
>> Amar Tumballi (amarts)
>>

--
Davide Obbi
Senior System Administrator
Booking.com B.V.
Vijzelstraat 66-80 Amsterdam 1017HL Netherlands
Direct +31207031558

From srakonde at redhat.com Tue Jan 22 08:51:05 2019
From: srakonde at redhat.com (Sanju Rakonde)
Date: Tue, 22 Jan 2019 14:21:05 +0530
Subject: [Gluster-users] [Bugs] Bricks are going offline unable to recover with heal/start force commands
In-Reply-To:
References:
Message-ID:

Hi Shaik,

Can you please provide the complete glusterd and cmd_history logs from all the nodes in the cluster? Also, please paste the output of the following commands (from all nodes):
1. gluster --version
2. gluster volume info
3. gluster volume status
4. gluster peer status
5. ps -ax | grep glusterfsd

On Tue, Jan 22, 2019 at 12:47 PM Shaik Salam wrote:

> Hi Surya,
>
> It is already customer setup and cant redeploy again.
> Enabled debug for brick level log but nothing writing to it.
> Can you tell me is any other ways to troubleshoot or logs to look??
>
>
> From: Shaik Salam/HYD/TCS
> To: "Amar Tumballi Suryanarayan"
> Cc: "gluster-users at gluster.org List"
> Date: 01/22/2019 12:06 PM
> Subject: Re: [Bugs] Bricks are going offline unable to recover
> with heal/start force commands
> ------------------------------
>
>
> Hi Surya,
>
> I have enabled DEBUG mode for brick level. But nothing writing to brick
> log.
>
> gluster volume set vol_3442e86b6d994a14de73f1b8c82cf0b8
> diagnostics.brick-log-level DEBUG
>
> sh-4.2# pwd
> /var/log/glusterfs/bricks
>
> sh-4.2# ls -la |grep brick_e15c12cceae12c8ab7782dd57cf5b6c1
> -rw-------. 1 root root 0 Jan 20 02:46
> var-lib-heketi-mounts-vg_d5f17487744584e3652d3ca943b0b91b-brick_e15c12cceae12c8ab7782dd57cf5b6c1-brick.log
>
> BR
> Salam
>
>
>
>
> From: "Amar Tumballi Suryanarayan"
> To: "Shaik Salam"
> Cc: "gluster-users at gluster.org List"
> Date: 01/22/2019 11:38 AM
> Subject: Re: [Bugs] Bricks are going offline unable to recover
> with heal/start force commands
> ------------------------------
>
>
>
> *"External email. Open with Caution"*
> Hi Shaik,
>
> Can you check what is there in brick logs? They are located in
> /var/log/glusterfs/bricks/*?
>
> Looks like the samba hooks script failed, but that shouldn't matter in
> this use case.
>
> Also, I see that you are trying to setup heketi to provision volumes,
> which means you may be using gluster in container usecases. If you are
> still in 'PoC' phase, can you give *https://github.com/gluster/gcs*
> a try? That makes the deployment and the
> stack little simpler.
> > -Amar > > > > > On Tue, Jan 22, 2019 at 11:29 AM Shaik Salam <*shaik.salam at tcs.com* > > wrote: > Can anyone respond how to recover bricks apart from heal/start force > according to below events from logs. > Please let me know any other logs required. > Thanks in advance. > > BR > Salam > > > > From: Shaik Salam/HYD/TCS > To: *bugs at gluster.org* , > *gluster-users at gluster.org* > Date: 01/21/2019 10:03 PM > Subject: Bricks are going offline unable to recover with > heal/start force commands > ------------------------------ > > > Hi, > > Bricks are in offline and unable to recover with following commands > > gluster volume heal > > gluster volume start force > > But still bricks are offline. > > > sh-4.2# gluster volume status vol_3442e86b6d994a14de73f1b8c82cf0b8 > Status of volume: vol_3442e86b6d994a14de73f1b8c82cf0b8 > Gluster process TCP Port RDMA Port Online > Pid > > ------------------------------------------------------------------------------ > Brick 192.168.3.6:/var/lib/heketi/mounts/vg > _ca57f326195c243be2380ce4e42a4191/brick_952 > d75fd193c7209c9a81acbc23a3747/brick 49166 0 Y > 269 > Brick 192.168.3.5:/var/lib/heketi/mounts/vg > _d5f17487744584e3652d3ca943b0b91b/brick_e15 > c12cceae12c8ab7782dd57cf5b6c1/brick N/A N/A N > N/A > Brick 192.168.3.15:/var/lib/heketi/mounts/v > g_462ea199185376b03e4b0317363bb88c/brick_17 > 36459d19e8aaa1dcb5a87f48747d04/brick 49173 0 Y > 225 > Self-heal Daemon on localhost N/A N/A Y > 45826 > Self-heal Daemon on 192.168.3.6 N/A N/A Y > 65196 > Self-heal Daemon on 192.168.3.15 N/A N/A Y > 52915 > > Task Status of Volume vol_3442e86b6d994a14de73f1b8c82cf0b8 > > ------------------------------------------------------------------------------ > > > We can see following events from when we start forcing volumes > > /mgmt/glusterd.so(+0xe2b3a) [0x7fca9e139b3a] > -->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2605) > [0x7fca9e139605] -->/lib64/libglusterfs.so.0(runner_log+0x115) > [0x7fcaa346f0e5] ) 0-management: 
Ran script: > /var/lib/glusterd/hooks/1/start/post/S29CTDBsetup.sh > --volname=vol_3442e86b6d994a14de73f1b8c82cf0b8 --first=no --version=1 > --volume-op=start --gd-workdir=/var/lib/glusterd > [2019-01-21 08:22:34.555068] E [run.c:241:runner_log] > (-->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2b3a) > [0x7fca9e139b3a] > -->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2563) > [0x7fca9e139563] -->/lib64/libglusterfs.so.0(runner_log+0x115) > [0x7fcaa346f0e5] ) 0-management: Failed to execute script: > /var/lib/glusterd/hooks/1/start/post/S30samba-start.sh > --volname=vol_3442e86b6d994a14de73f1b8c82cf0b8 --first=no --version=1 > --volume-op=start --gd-workdir=/var/lib/glusterd > [2019-01-21 08:22:53.389049] I [MSGID: 106499] > [glusterd-handler.c:4314:__glusterd_handle_status_volume] 0-management: > Received status volume req for volume vol_3442e86b6d994a14de73f1b8c82cf0b8 > [2019-01-21 08:23:25.346839] I [MSGID: 106487] > [glusterd-handler.c:1486:__glusterd_handle_cli_list_friends] 0-glusterd: > Received cli list req > > > We can see following events from when we heal volumes. 
> > [2019-01-21 08:20:07.576070] W [rpc-clnt.c:1753:rpc_clnt_submit] > 0-glusterfs: error returned while attempting to connect to host:(null), > port:0 > [2019-01-21 08:20:07.580225] I [cli-rpc-ops.c:9182:gf_cli_heal_volume_cbk] > 0-cli: Received resp to heal volume > [2019-01-21 08:20:07.580326] I [input.c:31:cli_batch] 0-: Exiting with: -1 > [2019-01-21 08:22:30.423311] I [cli.c:768:main] 0-cli: Started running > gluster with version 4.1.5 > [2019-01-21 08:22:30.463648] I [MSGID: 101190] > [event-epoll.c:617:event_dispatch_epoll_worker] 0-epoll: Started thread > with index 1 > [2019-01-21 08:22:30.463718] I [socket.c:2632:socket_event_handler] > 0-transport: EPOLLERR - disconnecting now > [2019-01-21 08:22:30.463859] W [rpc-clnt.c:1753:rpc_clnt_submit] > 0-glusterfs: error returned while attempting to connect to host:(null), > port:0 > [2019-01-21 08:22:33.427710] I [socket.c:2632:socket_event_handler] > 0-transport: EPOLLERR - disconnecting now > [2019-01-21 08:22:34.581555] I > [cli-rpc-ops.c:1472:gf_cli_start_volume_cbk] 0-cli: Received resp to start > volume > [2019-01-21 08:22:34.581678] I [input.c:31:cli_batch] 0-: Exiting with: 0 > [2019-01-21 08:22:53.345351] I [cli.c:768:main] 0-cli: Started running > gluster with version 4.1.5 > [2019-01-21 08:22:53.387992] I [MSGID: 101190] > [event-epoll.c:617:event_dispatch_epoll_worker] 0-epoll: Started thread > with index 1 > [2019-01-21 08:22:53.388059] I [socket.c:2632:socket_event_handler] > 0-transport: EPOLLERR - disconnecting now > [2019-01-21 08:22:53.388138] W [rpc-clnt.c:1753:rpc_clnt_submit] > 0-glusterfs: error returned while attempting to connect to host:(null), > port:0 > [2019-01-21 08:22:53.394737] I [input.c:31:cli_batch] 0-: Exiting with: 0 > [2019-01-21 08:23:25.304688] I [cli.c:768:main] 0-cli: Started running > gluster with version 4.1.5 > [2019-01-21 08:23:25.346319] I [MSGID: 101190] > [event-epoll.c:617:event_dispatch_epoll_worker] 0-epoll: Started thread > with index 1 > [2019-01-21 
08:23:25.346389] I [socket.c:2632:socket_event_handler] > 0-transport: EPOLLERR - disconnecting now > [2019-01-21 08:23:25.346500] W [rpc-clnt.c:1753:rpc_clnt_submit] > 0-glusterfs: error returned while attempting to connect to host:(null), > port:0 > > > > Please let us know steps to recover bricks. > > > BR > Salam > =====-----=====-----===== > Notice: The information contained in this e-mail > message and/or attachments to it may contain > confidential or privileged information. If you are > not the intended recipient, any dissemination, use, > review, distribution, printing or copying of the > information contained in this e-mail message > and/or attachments to it are strictly prohibited. If > you have received this communication in error, > please notify us by reply e-mail or telephone and > immediately and permanently delete the message > and any attachments. Thank you > _______________________________________________ > Bugs mailing list > *Bugs at gluster.org* > *https://lists.gluster.org/mailman/listinfo/bugs* > > > > -- > Amar Tumballi (amarts) > _______________________________________________ > Gluster-users mailing list > Gluster-users at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-users -- Thanks, Sanju -------------- next part -------------- An HTML attachment was scrubbed... URL: From atumball at redhat.com Tue Jan 22 09:10:33 2019 From: atumball at redhat.com (Amar Tumballi Suryanarayan) Date: Tue, 22 Jan 2019 14:40:33 +0530 Subject: [Gluster-users] Increasing Bitrot speed glusterfs 4.1.6 In-Reply-To: References: Message-ID: On Tue, Jan 22, 2019 at 1:50 PM Amudhan P wrote: > > Bitrot feature in Glusterfs is production ready or is it in beta phase? > > We have not done extensive performance testing with BitRot, as it is known to consume resources, and depending on the resources (CPU/Memory) available, the speed would vary a lot. With respect to the functionality, we have had no issues reported on the feature in a long time now. 
It is also supported in Red Hat Gluster product, and I haven't seen any bugs arising from there either. For community, considering we are not doing active development effort on the module, I would call the support at Beta phase. -Amar > On Mon, Jan 14, 2019 at 12:46 PM Amudhan P wrote: > >> Resending mail. >> >> I have a total size of 50GB files per node and it has crossed 5 days but >> till now not completed bitrot signature process? yet 20GB+ files are >> pending for completion. >> >> On Fri, Jan 11, 2019 at 12:02 PM Amudhan P wrote: >> >>> Hi, >>> >>> How do I increase the speed of bitrot file signature process in >>> glusterfs 4.1.6? >>> Currently, it's processing 250 KB/s. is there any way to do the changes >>> thru gluster cli? >>> >>> regards >>> Amudhan >>> >> _______________________________________________ > Gluster-users mailing list > Gluster-users at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-users -- Amar Tumballi (amarts) -------------- next part -------------- An HTML attachment was scrubbed... URL: From snowmailer at gmail.com Tue Jan 22 09:27:27 2019 From: snowmailer at gmail.com (Martin Toth) Date: Tue, 22 Jan 2019 10:27:27 +0100 Subject: [Gluster-users] Self/Healing process after node maintenance Message-ID: <82F3A23F-7C04-45C4-9685-0904032333A7@gmail.com> Hi all, I just want to ensure myself how self-healing process exactly works, because I need to turn one of my nodes down for maintenance. I have replica 3 setup. Nothing complicated. 3 nodes, 1 volume, 1 brick per node (ZFS pool). All nodes running Qemu VMs and disks of VMs are on Gluster volume. I want to turn off node1 for maintenance. If I will migrate all VMs to node2 and node3 and shutdown node1, I suppose everything will be running without downtime. (2 nodes of 3 will be online) My question is if I will start up node1 after maintenance and node1 will be done back online in running state, this will trigger self-healing process on all disk files of all VMs.. 
will this healing process be only and only on node1?
Can node2 and node3 run VMs without problem while node1 will be healing these files?
I want to ensure myself this files (VM disks) will not get "locked" on node2 and node3 while self-healing will be in process on node1.

Thanks for clarification in advance.

BR!
From ravishankar at redhat.com Tue Jan 22 10:04:30 2019
From: ravishankar at redhat.com (Ravishankar N)
Date: Tue, 22 Jan 2019 15:34:30 +0530
Subject: [Gluster-users] Self/Healing process after node maintenance
In-Reply-To: <82F3A23F-7C04-45C4-9685-0904032333A7@gmail.com>
References: <82F3A23F-7C04-45C4-9685-0904032333A7@gmail.com>
Message-ID: <640b8701-7bb6-9a4e-f713-bd606ffd2f47@redhat.com>

On 01/22/2019 02:57 PM, Martin Toth wrote:
> Hi all,
>
> I just want to ensure myself how self-healing process exactly works, because I need to turn one of my nodes down for maintenance.
> I have replica 3 setup. Nothing complicated. 3 nodes, 1 volume, 1 brick per node (ZFS pool). All nodes running Qemu VMs and disks of VMs are on Gluster volume.
>
> I want to turn off node1 for maintenance. If I will migrate all VMs to node2 and node3 and shutdown node1, I suppose everything will be running without downtime. (2 nodes of 3 will be online)

Yes it should. Before you `shutdown` a node, kill all the gluster processes on it. i.e. `pkill gluster`.

>
> My question is if I will start up node1 after maintenance and node1 will be done back online in running state, this will trigger self-healing process on all disk files of all VMs.. will this healing process be only and only on node1?

The list of files needing heal on node1 are captured on the other 2 nodes that were up, so the selfheal daemons on those nodes will do the heals.

> Can node2 and node3 run VMs without problem while node1 will be healing these files?

Yes. You might notice some performance drop if there are a lot of heals happening though.

> I want to ensure myself this files (VM disks) will not get "locked" on node2 and node3 while self-healing will be in process on node1.

Heal won't block I/O from clients indefinitely. If both are writing to overlapping offset, one of them (i.e. either heal or client I/O) will get the lock, do its job and release the lock so that the other can acquire it and continue.

HTH,
Ravi

>
> Thanks for clarification in advance.
>
> BR!
> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-users
From nbalacha at redhat.com Tue Jan 22 10:36:31 2019
From: nbalacha at redhat.com (Nithya Balachandran)
Date: Tue, 22 Jan 2019 16:06:31 +0530
Subject: [Gluster-users] usage of harddisks: each hdd a brick? raid?
In-Reply-To: 
References: <22faba73-4e60-9a55-75bf-e52ce59858b3@ya.ru> <258abbde-5a3a-0df2-988a-cb4d1b8b5347@ya.ru>
Message-ID: 

On Tue, 22 Jan 2019 at 11:42, Amar Tumballi Suryanarayan <atumball at redhat.com> wrote:

>
>
> On Thu, Jan 10, 2019 at 1:56 PM Hu Bert wrote:
>
>> Hi,
>>
>> > > We ara also using 10TB disks, heal takes 7-8 days.
>> > > You can play with "cluster.shd-max-threads" setting. It is default 1 I
>> > > think. I am using it with 4.
>> > > Below you can find more info:
>> > > https://access.redhat.com/solutions/882233
>> > cluster.shd-max-threads: 8
>> > cluster.shd-wait-qlength: 10000
>>
>> Our setup:
>> cluster.shd-max-threads: 2
>> cluster.shd-wait-qlength: 10000
>>
>> > >> Volume Name: shared
>> > >> Type: Distributed-Replicate
>> > A, you have distributed-replicated volume, but I choose only replicated
>> > (for beginning simplicity :)
>> > May be replicated volume are healing faster?
>>
>> Well, maybe our setup with 3 servers and 4 disks=bricks == 12 bricks,
>> resulting in a distributed-replicate volume (all /dev/sd{a,b,c,d}
>> identical) , isn't optimal? And it would be better to create a
>> replicate 3 volume with only 1 (big) brick per server (with 4 disks:
>> either a logical volume or sw/hw raid)?
>>
>> But it would be interesting to know if a replicate volume is healing
>> faster than a distributed-replicate volume - even if there was only 1
>> faulty brick.
>>
>>
> We don't have any data point to agree to this, but it may be true.
> Specially, as the crawling when DHT (ie, distribute) is involved can get
> little slower, which means, the healing would get slower too.
>

If the healing is being done by the Self heal daemon, the slowdown is not due to dht (shd does not load dht).

>
> We are trying to experiment few performance enhancement patches (like
> https://review.gluster.org/20636), would be great to see how things work
> with newer base. Will keep the list updated about performance numbers once
> we have some more data on them.
>
> -Amar
>
>
>>
>> Thx
>> Hubert
>> _______________________________________________
>> Gluster-users mailing list
>> Gluster-users at gluster.org
>> https://lists.gluster.org/mailman/listinfo/gluster-users
>>
>>
>>
>
> --
> Amar Tumballi (amarts)
> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-users
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
From asl at launay.org Tue Jan 22 12:09:58 2019
From: asl at launay.org (Arnaud Launay)
Date: Tue, 22 Jan 2019 13:09:58 +0100
Subject: [Gluster-users] File renaming not geo-replicated
In-Reply-To: 
References: <20181210121530.GA5953@launay.org> <20181217100644.GA17243@launay.org> <20181217103912.GB17243@launay.org>
Message-ID: <20190122120958.GA6519@launay.org>

Hello Sunny,

On Mon, Dec 17, 2018 at 04:19:04PM +0530, Sunny Kumar wrote:
> Can you please share geo-replication log for master and mount log form slave.

Master log, when doing

root at prod01:/srv/www# touch coin2.txt && sleep 30 && mv coin2.txt bouh42.txt
root at prod01:/srv/www#

==> gsyncd.log <==
[2019-01-22 11:37:51.480606] I [master(worker /var/lib/glusterbricks/gbrick1/www1):1484:crawl] _GMaster: slave's time stime=(1548151871, 0)
[2019-01-22 11:37:51.524636] I [master(worker /var/lib/glusterbricks/gbrick1/www1):1398:process] _GMaster: Entry Time Taken MKD=0 MKN=0 LIN=0 SYM=0 REN=0 RMD=0 CRE=1 duration=0.0088 UNL=0
[2019-01-22 11:37:51.524750] I [master(worker /var/lib/glusterbricks/gbrick1/www1):1408:process] _GMaster: Data/Metadata Time Taken SETA=1 SETX=0 meta_duration=0.0085 data_duration=0.0134 DATA=0 XATT=0
[2019-01-22 11:37:51.524945] I [master(worker /var/lib/glusterbricks/gbrick1/www1):1418:process] _GMaster: Batch Completed changelog_end=1548157069 entry_stime=(1548157068, 0) changelog_start=1548157069 stime=(1548157068, 0) duration=0.0326 num_changelogs=1 mode=live_changelog
[2019-01-22 11:38:21.584166] I [master(worker /var/lib/glusterbricks/gbrick1/www1):1484:crawl] _GMaster: slave's time stime=(1548157068, 0)
[2019-01-22 11:38:21.655014] I [master(worker /var/lib/glusterbricks/gbrick1/www1):1398:process] _GMaster: Entry Time Taken MKD=0 MKN=0 LIN=0 SYM=0 REN=0 RMD=0 CRE=2 duration=0.0410 UNL=2
[2019-01-22 11:38:21.655116] I [master(worker /var/lib/glusterbricks/gbrick1/www1):1408:process] _GMaster: Data/Metadata Time Taken SETA=0 SETX=0 meta_duration=0.0000 data_duration=0.0138 DATA=0 XATT=0
[2019-01-22 11:38:21.655251] I [master(worker /var/lib/glusterbricks/gbrick1/www1):1418:process] _GMaster: Batch Completed changelog_end=1548157099 entry_stime=(1548157098, 0) changelog_start=1548157099 stime=(1548157098, 0) duration=0.0595 num_changelogs=1 mode=live_changelog

On the slave, I don't see any change on the log files when doing that... Which log are you interested in?

(BTW, I just re-did the test after upgrading to 4.1.7, no change)

Arnaud.
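
The "Entry Time Taken" lines in the master log above carry what appear to be one counter per entry operation (MKD, REN, CRE, UNL and so on). A minimal Python sketch, assuming only the KEY=value layout visible in those lines, that pulls the counters out so a batch can be checked for a counted rename. The function name parse_entry_counters is hypothetical and is not part of gsyncd:

```python
import re

# Matches integer counters such as "REN=0" or "CRE=2". Lower-case fields
# like "duration=0.0410" are not matched because keys must be upper-case,
# and the lookahead rejects any value that continues into a decimal part.
COUNTER_RE = re.compile(r"\b([A-Z]{3,4})=(\d+)(?![\d.])")


def parse_entry_counters(line):
    """Return the integer counters of an 'Entry Time Taken' log line.

    Hypothetical helper for illustration; returns an empty dict for any
    other kind of line.
    """
    if "Entry Time Taken" not in line:
        return {}
    return {key: int(value) for key, value in COUNTER_RE.findall(line)}


# Second batch from the log above (logged about 30 s after the first one).
sample = (
    "[2019-01-22 11:38:21.655014] I [master(worker "
    "/var/lib/glusterbricks/gbrick1/www1):1398:process] _GMaster: "
    "Entry Time Taken MKD=0 MKN=0 LIN=0 SYM=0 REN=0 RMD=0 CRE=2 "
    "duration=0.0410 UNL=2"
)

counters = parse_entry_counters(sample)
print(counters["REN"], counters["CRE"], counters["UNL"])
```

For that second batch the sketch reports REN=0 alongside CRE=2 and UNL=2, i.e. no rename was counted in that batch. What this means for the slave side cannot be told from the master log alone.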
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 833 bytes
Desc: not available
URL: 
From sunkumar at redhat.com Tue Jan 22 12:19:37 2019
From: sunkumar at redhat.com (Sunny Kumar)
Date: Tue, 22 Jan 2019 17:49:37 +0530
Subject: [Gluster-users] File renaming not geo-replicated
In-Reply-To: <20190122120958.GA6519@launay.org>
References: <20181210121530.GA5953@launay.org> <20181217100644.GA17243@launay.org> <20181217103912.GB17243@launay.org> <20190122120958.GA6519@launay.org>
Message-ID: 

Hi Arnaud,

To analyse this behaviour I need the geo-replication log and the mount log from the slave, and not just snippets; please share the complete logs.
You can find the logs on the master under /var/log/glusterfs/geo-replication/* and on the slave node under /var/log/glusterfs/geo-replication-slave/*.

- Sunny
From srangana at redhat.com Tue Jan 22 13:58:01 2019
From: srangana at redhat.com (Shyam Ranganathan)
Date: Tue, 22 Jan 2019 08:58:01 -0500
Subject: [Gluster-users] Announcing Gluster release 5.3 and 4.1.7
Message-ID: <6a6fde0c-0f20-7911-22bf-b21395357647@redhat.com>

The Gluster community is pleased to announce the release of Gluster 4.1.7 and 5.3 (packages available at [1] & [2]).

Release notes for the release can be found at [3] & [4].

Major changes, features and limitations addressed in this release:

- This release fixes several security vulnerabilities as listed in the release notes.

Thanks, Gluster community [1] Packages for 4.1.7: https://download.gluster.org/pub/gluster/glusterfs/4.1/4.1.7/ [2] Packages for 5.3: https://download.gluster.org/pub/gluster/glusterfs/5/5.3/ [3] Release notes for 4.1.7: https://docs.gluster.org/en/latest/release-notes/4.1.7/ [4] Release notes for 5.3: https://docs.gluster.org/en/latest/release-notes/5.3/ From meira at cesup.ufrgs.br Tue Jan 22 20:20:08 2019 From: meira at cesup.ufrgs.br (Lindolfo Meira) Date: Tue, 22 Jan 2019 18:20:08 -0200 (-02) Subject: [Gluster-users] writev: Transport endpoint is not connected Message-ID: Dear all, I've been trying to benchmark a gluster file system using the MPIIO API of IOR. Almost all of the times I try to run the application with more than 6 tasks performing I/O (mpirun -n N, for N > 6) I get the error: "writev: Transport endpoint is not connected". And then each one of the N tasks returns "ERROR: cannot open file to get file size, MPI MPI_ERR_FILE: invalid file, (aiori-MPIIO.c:488)". Does anyone have any idea what's going on? I'm writing from a single node, to a system configured for stripe over 6 bricks. The volume is mounted with the options _netdev and transport=rdma. I'm using OpenMPI 2.1.2 (I tested version 4.0.0 and nothing changed). IOR arguments used: -B -E -F -q -w -k -z -i=1 -t=2m -b=1g -a=MPIIO. Running OpenSUSE Leap 15.0 and GlusterFS 5.3. 
Output of "gluster volume info" follows bellow: Volume Name: gfs Type: Stripe Volume ID: ea159033-5f7f-40ac-bad0-6f46613a336b Status: Started Snapshot Count: 0 Number of Bricks: 1 x 6 = 6 Transport-type: rdma Bricks: Brick1: pfs01-ib:/mnt/data/gfs Brick2: pfs02-ib:/mnt/data/gfs Brick3: pfs03-ib:/mnt/data/gfs Brick4: pfs04-ib:/mnt/data/gfs Brick5: pfs05-ib:/mnt/data/gfs Brick6: pfs06-ib:/mnt/data/gfs Options Reconfigured: nfs.disable: on Thanks in advance, Lindolfo Meira, MSc Diretor Geral, Centro Nacional de Supercomputa??o Universidade Federal do Rio Grande do Sul +55 (51) 3308-3139 From hunter86_bg at yahoo.com Tue Jan 22 23:09:00 2019 From: hunter86_bg at yahoo.com (Strahil Nikolov) Date: Tue, 22 Jan 2019 23:09:00 +0000 (UTC) Subject: [Gluster-users] Performance issue, need guidance References: <662056908.2026192.1548198541012.ref@mail.yahoo.com> Message-ID: <662056908.2026192.1548198541012@mail.yahoo.com> Hello Community, I would be very grateful if you share your thoughts about my problem.I'm quite new to gluster , so keep that in mind. Setup:I have 3 nodes for an ovirt setup (2 hosts + 1 gluster arbiter). The diagram of my setup is:?ovirt | | | | | | | | | | | ovirt | | | | I'm having bad write & read performance from the VM, despite having a SSD (not an enterprise-grade one , but still a SSD) for LVM writeback cache. Writing directly to the bricks is far faster than my network can do (1 Gbit/s network). I have checked the bandwidth and it seems that maximum possible for me is 123MB/s, yet observed speeds via glusterfs fuse client (testing from one of the hosts) is no more than 56MB/s and from VMs is around 20-30 MB/s. Here is my volume profile info (I'm not sure what to look for): [root at ovirt2 tuned]# gluster volume profile data infoBrick: ovirt2.localdomain:/gluster_bricks/data/data---------------------------------------------------Cumulative Stats:? ?Block Size:? ? ? ? ? ? ? ? 256b+? ? ? ? ? ? ? ? ?512b+? ? ? ? ? ? ? ? 1024b+??No. of Reads:? ? ? ? ? ? ? 
? ?5854? ? ? ? ? ? ? ? ? ?198? ? ? ? ? ? ? ? ? ?159?No. of Writes:? ? ? ? ? ? ? ? ? ? 2? ? ? ? ? ? ? ? ? 6025? ? ? ? ? ? ? ? ? 1430??? ?Block Size:? ? ? ? ? ? ? ?2048b+? ? ? ? ? ? ? ? 4096b+? ? ? ? ? ? ? ? 8192b+??No. of Reads:? ? ? ? ? ? ? ? ? 302? ? ? ? ? ? ? ? ? 9950? ? ? ? ? ? ? ? ? 6485?No. of Writes:? ? ? ? ? ? ? ? ? 611? ? ? ? ? ? ? ? ?23513? ? ? ? ? ? ? ? ? 6540??? ?Block Size:? ? ? ? ? ? ? 16384b+? ? ? ? ? ? ? ?32768b+? ? ? ? ? ? ? ?65536b+??No. of Reads:? ? ? ? ? ? ? ? ?6952? ? ? ? ? ? ? ? ? 1774? ? ? ? ? ? ? ? ? 1699?No. of Writes:? ? ? ? ? ? ? ? ?6439? ? ? ? ? ? ? ? ? 5870? ? ? ? ? ? ? ? ? 5171??? ?Block Size:? ? ? ? ? ? ?131072b+??No. of Reads:? ? ? ? ? ? ? ? 48690?No. of Writes:? ? ? ? ? ? ? ?127023??%-latency? ?Avg-latency? ?Min-Latency? ?Max-Latency? ?No. of calls? ? ? ? ?Fop?---------? ?-----------? ?-----------? ?-----------? ?------------? ? ? ? ----? ? ? 0.00? ? ? ?0.00 us? ? ? ?0.00 us? ? ? ?0.00 us? ? ? ? ? ? ?14? ? ? FORGET? ? ? 0.00? ? ? ?0.00 us? ? ? ?0.00 us? ? ? ?0.00 us? ? ? ? ? ? 534? ? ?RELEASE? ? ? 0.00? ? ? ?0.00 us? ? ? ?0.00 us? ? ? ?0.00 us? ? ? ? ? ?2180? RELEASEDIR? ? ? 0.00? ? ? 44.00 us? ? ? 44.00 us? ? ? 44.00 us? ? ? ? ? ? ? 1? ? GETXATTR? ? ? 0.00? ? ? 28.75 us? ? ? 23.00 us? ? ? 33.00 us? ? ? ? ? ? ? 4? ? ?INODELK? ? ? 0.00? ? ? 85.00 us? ? ? 85.00 us? ? ? 85.00 us? ? ? ? ? ? ? 2? ? ?SETATTR? ? ? 0.00? ? ? 52.50 us? ? ? 31.00 us? ? ?113.00 us? ? ? ? ? ? ? 8? ? ?OPENDIR? ? ? 0.00? ? ? 33.83 us? ? ? 23.00 us? ? ? 60.00 us? ? ? ? ? ? ?29? ? ?ENTRYLK? ? ? 0.00? ? ? 40.32 us? ? ? 24.00 us? ? ?157.00 us? ? ? ? ? ? ?53? ? ? ? STAT? ? ? 0.00? ? ? 59.32 us? ? ? 43.00 us? ? ?121.00 us? ? ? ? ? ? ?38? ? ? ?FSTAT? ? ? 0.00? ? ?188.08 us? ? ? 28.00 us? ? ?458.00 us? ? ? ? ? ? ?13? ? READDIRP? ? ? 0.00? ? ?323.46 us? ? ?286.00 us? ? ?398.00 us? ? ? ? ? ? ?13? ? ? ?MKNOD? ? ? 0.00? ? ? 42.84 us? ? ? 26.00 us? ? ?142.00 us? ? ? ? ? ? 109? ? ? STATFS? ? ? 0.01? ? ?147.36 us? ? ? 71.00 us? ? ?365.00 us? ? ? ? ? ? 200? ? ? LOOKUP? ? ? 0.17? 
? ?914.71 us? ? ?186.00 us? ? 3887.00 us? ? ? ? ? ? 533? ? ? ? READ? ? ? 3.57? ? ?322.84 us? ? ? 41.00 us 1552044.00 us? ? ? ? ? 32308? ? FXATTROP? ? ?21.39? ? 1090.04 us? ? ? 15.00 us 1228946.00 us? ? ? ? ? 57261? ? FINODELK? ? ?27.42? ? 2527.96 us? ? ? 98.00 us 1552471.00 us? ? ? ? ? 31651? ? ? ?WRITE? ? ?47.43? ?12047.80 us? ? ?203.00 us 1891369.00 us? ? ? ? ? 11489? ? ? ?FSYNC?? ? Duration: 59012 seconds? ?Data Read: 6853540304 bytesData Written: 17667709568 bytes?Interval 0 Stats:? ?Block Size:? ? ? ? ? ? ? ? 256b+? ? ? ? ? ? ? ? ?512b+? ? ? ? ? ? ? ? 1024b+??No. of Reads:? ? ? ? ? ? ? ? ?5854? ? ? ? ? ? ? ? ? ?198? ? ? ? ? ? ? ? ? ?159?No. of Writes:? ? ? ? ? ? ? ? ? ? 2? ? ? ? ? ? ? ? ? 6025? ? ? ? ? ? ? ? ? 1430??? ?Block Size:? ? ? ? ? ? ? ?2048b+? ? ? ? ? ? ? ? 4096b+? ? ? ? ? ? ? ? 8192b+??No. of Reads:? ? ? ? ? ? ? ? ? 302? ? ? ? ? ? ? ? ? 9950? ? ? ? ? ? ? ? ? 6485?No. of Writes:? ? ? ? ? ? ? ? ? 611? ? ? ? ? ? ? ? ?23513? ? ? ? ? ? ? ? ? 6540??? ?Block Size:? ? ? ? ? ? ? 16384b+? ? ? ? ? ? ? ?32768b+? ? ? ? ? ? ? ?65536b+??No. of Reads:? ? ? ? ? ? ? ? ?6952? ? ? ? ? ? ? ? ? 1774? ? ? ? ? ? ? ? ? 1699?No. of Writes:? ? ? ? ? ? ? ? ?6439? ? ? ? ? ? ? ? ? 5870? ? ? ? ? ? ? ? ? 5171??? ?Block Size:? ? ? ? ? ? ?131072b+??No. of Reads:? ? ? ? ? ? ? ? 48690?No. of Writes:? ? ? ? ? ? ? ?127023?%-latency? ?Avg-latency? ?Min-Latency? ?Max-Latency? ?No. of calls? ? ? ? ?Fop?---------? ?-----------? ?-----------? ?-----------? ?------------? ? ? ? ----? ? ? 0.00? ? ? ?0.00 us? ? ? ?0.00 us? ? ? ?0.00 us? ? ? ? ? ? ?14? ? ? FORGET? ? ? 0.00? ? ? ?0.00 us? ? ? ?0.00 us? ? ? ?0.00 us? ? ? ? ? ? 534? ? ?RELEASE? ? ? 0.00? ? ? ?0.00 us? ? ? ?0.00 us? ? ? ?0.00 us? ? ? ? ? ?2180? RELEASEDIR? ? ? 0.00? ? ? 44.00 us? ? ? 44.00 us? ? ? 44.00 us? ? ? ? ? ? ? 1? ? GETXATTR? ? ? 0.00? ? ? 28.75 us? ? ? 23.00 us? ? ? 33.00 us? ? ? ? ? ? ? 4? ? ?INODELK? ? ? 0.00? ? ? 85.00 us? ? ? 85.00 us? ? ? 85.00 us? ? ? ? ? ? ? 2? ? ?SETATTR? ? ? 0.00? ? ? 52.50 us? ? ? 31.00 us? ? 
?113.00 us? ? ? ? ? ? ? 8? ? ?OPENDIR? ? ? 0.00? ? ? 33.83 us? ? ? 23.00 us? ? ? 60.00 us? ? ? ? ? ? ?29? ? ?ENTRYLK? ? ? 0.00? ? ? 40.32 us? ? ? 24.00 us? ? ?157.00 us? ? ? ? ? ? ?53? ? ? ? STAT? ? ? 0.00? ? ? 59.32 us? ? ? 43.00 us? ? ?121.00 us? ? ? ? ? ? ?38? ? ? ?FSTAT? ? ? 0.00? ? ?188.08 us? ? ? 28.00 us? ? ?458.00 us? ? ? ? ? ? ?13? ? READDIRP? ? ? 0.00? ? ?323.46 us? ? ?286.00 us? ? ?398.00 us? ? ? ? ? ? ?13? ? ? ?MKNOD? ? ? 0.00? ? ? 42.84 us? ? ? 26.00 us? ? ?142.00 us? ? ? ? ? ? 109? ? ? STATFS? ? ? 0.01? ? ?147.36 us? ? ? 71.00 us? ? ?365.00 us? ? ? ? ? ? 200? ? ? LOOKUP? ? ? 0.17? ? ?914.71 us? ? ?186.00 us? ? 3887.00 us? ? ? ? ? ? 533? ? ? ? READ? ? ? 3.57? ? ?322.84 us? ? ? 41.00 us 1552044.00 us? ? ? ? ? 32308? ? FXATTROP? ? ?21.39? ? 1090.04 us? ? ? 15.00 us 1228946.00 us? ? ? ? ? 57261? ? FINODELK? ? ?27.42? ? 2527.96 us? ? ? 98.00 us 1552471.00 us? ? ? ? ? 31651? ? ? ?WRITE? ? ?47.43? ?12047.80 us? ? ?203.00 us 1891369.00 us? ? ? ? ? 11489? ? ? ?FSYNC?? ? Duration: 59012 seconds? ?Data Read: 6853540304 bytesData Written: 17667709568 bytes?Brick: ovirt3.localdomain:/gluster_bricks/data/data---------------------------------------------------Cumulative Stats:? ?Block Size:? ? ? ? ? ? ? ? ? 1b+??No. of Reads:? ? ? ? ? ? ? ? ? ? 0?No. of Writes:? ? ? ? ? ? ? ?257997??%-latency? ?Avg-latency? ?Min-Latency? ?Max-Latency? ?No. of calls? ? ? ? ?Fop?---------? ?-----------? ?-----------? ?-----------? ?------------? ? ? ? ----? ? ? 0.00? ? ? ?0.00 us? ? ? ?0.00 us? ? ? ?0.00 us? ? ? ? ? ? ?53? ? ? FORGET? ? ? 0.00? ? ? ?0.00 us? ? ? ?0.00 us? ? ? ?0.00 us? ? ? ? ? ?3180? ? ?RELEASE? ? ? 0.00? ? ? ?0.00 us? ? ? ?0.00 us? ? ? ?0.00 us? ? ? ? ? ?2672? RELEASEDIR? ? ?0.00? ? ? 14.33 us? ? ? 13.00 us? ? ? 16.00 us? ? ? ? ? ? ? 3? ? ? ?FLUSH? ? ? 0.00? ? ? 43.00 us? ? ? 43.00 us? ? ? 43.00 us? ? ? ? ? ? ? 1? ? TRUNCATE? ? ? 0.00? ? ?229.00 us? ? ?229.00 us? ? ?229.00 us? ? ? ? ? ? ? 1? ? ? CREATE? ? ? 0.00? ? ? 80.00 us? ? ? 48.00 us? ? ?114.00 us? ? ? ? ? ? ? 
3? ? ?XATTROP? ? ? 0.00? ? ? 66.83 us? ? ? 57.00 us? ? ? 79.00 us? ? ? ? ? ? ? 6? ? ?SETATTR? ? ? 0.00? ? ? 29.76 us? ? ? ?1.00 us? ? ? 72.00 us? ? ? ? ? ? ?34? ? ?OPENDIR? ? ? 0.00? ? ? 96.12 us? ? ? 12.00 us? ? ?240.00 us? ? ? ? ? ? ?17? ? GETXATTR? ? ? 0.01? ? ?375.00 us? ? ?216.00 us? ? ?790.00 us? ? ? ? ? ? ? 6? ? ?READDIR? ? ? 0.01? ? ? 30.90 us? ? ? 11.00 us? ? ?186.00 us? ? ? ? ? ? ?73? ? ?INODELK? ? ? 0.01? ? ?178.89 us? ? ? 36.00 us? ? ?541.00 us? ? ? ? ? ? ?19? ? ? ? OPEN? ? ? 0.02? ? ?113.86 us? ? ? 57.00 us? ? ?313.00 us? ? ? ? ? ? ?78? ? ? UNLINK? ? ? 0.05? ? ?212.36 us? ? ?146.00 us? ? ?488.00 us? ? ? ? ? ? ?90? ? ? ?MKNOD? ? ? 0.14? ? ? 35.88 us? ? ? 10.00 us? ? ?399.00 us? ? ? ? ? ?1636? ? ?ENTRYLK? ? ? 0.42? ? ? 80.78 us? ? ? 21.00 us? ? ?495.00 us? ? ? ? ? ?2122? ? ? LOOKUP? ? ? 7.92? ? ? 44.74 us? ? ? 12.00 us? ?10903.00 us? ? ? ? ? 71878? ? ? ?WRITE? ? ?11.52? ? ? 62.55 us? ? ? 25.00 us? ?27390.00 us? ? ? ? ? 74768? ? FXATTROP? ? ?12.00? ? ? 27.54 us? ? ? ?9.00 us? ? 7191.00 us? ? ? ? ?176968? ? FINODELK? ? ?67.90? ? 2384.96 us? ? ? 53.00 us? ?82033.00 us? ? ? ? ? 11562? ? ? ?FSYNC?? ? Duration: 75025 seconds? ?Data Read: 0 bytesData Written: 257997 bytes?Interval 2 Stats:? ?Block Size:? ? ? ? ? ? ? ? ? 1b+??No. of Reads:? ? ? ? ? ? ? ? ? ? 0?No. of Writes:? ? ? ? ? ? ? ?201556??%-latency? ?Avg-latency? ?Min-Latency? ?Max-Latency? ?No. of calls? ? ? ? ?Fop?---------? ?-----------? ?-----------? ?-----------? ?------------? ? ? ? ----? ? ? 0.00? ? ? ?0.00 us? ? ? ?0.00 us? ? ? ?0.00 us? ? ? ? ? ? ?19? ? ? FORGET? ? ? 0.00? ? ? ?0.00 us? ? ? ?0.00 us? ? ? ?0.00 us? ? ? ? ? ?2046? ? ?RELEASE? ? ? 0.00? ? ? ?0.00 us? ? ? ?0.00 us? ? ? ?0.00 us? ? ? ? ? ?2606? RELEASEDIR? ? ? 0.00? ? ? 38.50 us? ? ? 35.00 us? ? ? 42.00 us? ? ? ? ? ? ? 2? ? GETXATTR? ? ? 0.00? ? ? 33.17 us? ? ? 15.00 us? ? ? 62.00 us? ? ? ? ? ? ? 6? ? ?INODELK? ? ? 0.00? ? ? 71.00 us? ? ? 63.00 us? ? ? 79.00 us? ? ? ? ? ? ? 3? ? ?SETATTR? ? ? 0.00? ? ? 41.75 us? ? ? 24.00 us? ? ? 
72.00 us? ? ? ? ? ? ?12? ? ?OPENDIR? ? ? 0.00? ? ? 26.28 us? ? ? 12.00 us? ? ? 48.00 us? ? ? ? ? ? ?29? ? ?ENTRYLK? ? ? 0.01? ? ?191.15 us? ? ?154.00 us? ? ?240.00 us? ? ? ? ? ? ?13? ? ? ?MKNOD? ? ? 0.10? ? ?112.97 us? ? ? 40.00 us? ? ?170.00 us? ? ? ? ? ? 299? ? ? LOOKUP? ? ? 4.22? ? ? 45.65 us? ? ? 14.00 us? ? 1288.00 us? ? ? ? ? 31732? ? ? ?WRITE? ? ? 5.99? ? ? 35.51 us? ? ? ?9.00 us? ? 7124.00 us? ? ? ? ? 57863? ? FINODELK? ? ? 9.57? ? ?101.29 us? ? ? 33.00 us? ?27390.00 us? ? ? ? ? 32435? ? FXATTROP? ? ?80.11? ? 2390.06 us? ? ? 61.00 us? ?82033.00 us? ? ? ? ? 11507? ? ? ?FSYNC?? ? Duration: 74302 seconds? ?Data Read: 0 bytesData Written: 201556 bytes?Brick: ovirt1.localdomain:/gluster_bricks/data/data---------------------------------------------------Cumulative Stats:? ?Block Size:? ? ? ? ? ? ? ? 256b+? ? ? ? ? ? ? ? ?512b+? ? ? ? ? ? ? ? 1024b+??No. of Reads:? ? ? ? ? ? ? ? 13408? ? ? ? ? ? ? ? ? ?183? ? ? ? ? ? ? ? ? 2200?No. of Writes:? ? ? ? ? ? ? ? ? ?11? ? ? ? ? ? ? ? ?12677? ? ? ? ? ? ? ? ? 1556??? ?Block Size:? ? ? ? ? ? ? ?2048b+? ? ? ? ? ? ? ? 4096b+? ? ? ? ? ? ? ? 8192b+??No. of Reads:? ? ? ? ? ? ? ? ? ?60? ? ? ? ? ? ? ? ? 3393? ? ? ? ? ? ? ? ? 1278?No. of Writes:? ? ? ? ? ? ? ? ? 666? ? ? ? ? ? ? ? ?59118? ? ? ? ? ? ? ? ?22688??? ?Block Size:? ? ? ? ? ? ? 16384b+? ? ? ? ? ? ? ?32768b+? ? ? ? ? ? ? ?65536b+??No. of Reads:? ? ? ? ? ? ? ? 10658? ? ? ? ? ? ? ? ? ?574? ? ? ? ? ? ? ? ? ?954?No. of Writes:? ? ? ? ? ? ? ? 32082? ? ? ? ? ? ? ? ?29900? ? ? ? ? ? ? ? ?61801??? ?Block Size:? ? ? ? ? ? ?131072b+??No. of Reads:? ? ? ? ? ? ? ? 55974?No. of Writes:? ? ? ? ? ? ? ?576153??%-latency? ?Avg-latency? ?Min-Latency? ?Max-Latency? ?No. of calls? ? ? ? ?Fop?---------? ?-----------? ?-----------? ?-----------? ?------------? ? ? ? ----? ? ? 0.00? ? ? ?0.00 us? ? ? ?0.00 us? ? ? ?0.00 us? ? ? ? ? ? ?58? ? ? FORGET? ? ? 0.00? ? ? ?0.00 us? ? ? ?0.00 us? ? ? ?0.00 us? ? ? ? ? ?3357? ? ?RELEASE? ? ? 0.00? ? ? ?0.00 us? ? ? ?0.00 us? ? ? ?0.00 us? ? ? ? ? ?4823? 
RELEASEDIR
      0.00       26.00 us       24.00 us       28.00 us              3       FLUSH
      0.00       64.00 us       52.00 us       81.00 us              3       FSTAT
      0.00      202.00 us      202.00 us      202.00 us              1    TRUNCATE
      0.00      271.00 us      271.00 us      271.00 us              1      CREATE
      0.00      135.33 us       72.00 us      180.00 us              3     XATTROP
      0.00      103.50 us       84.00 us      155.00 us              6     SETATTR
      0.00       69.77 us       17.00 us      174.00 us             13    GETXATTR
      0.00      438.67 us      399.00 us      479.00 us              3    READDIRP
      0.00       45.12 us        3.00 us       94.00 us             34     OPENDIR
      0.00       91.63 us       53.00 us      184.00 us             19        OPEN
      0.00      331.57 us      262.00 us      381.00 us              7     READDIR
      0.00       46.28 us       24.00 us      176.00 us            337      STATFS
      0.00      223.68 us      101.00 us      630.00 us             78      UNLINK
      0.00      318.33 us      176.00 us      456.00 us             90       MKNOD
      0.00      237.58 us       18.00 us   148201.00 us           2122      LOOKUP
      0.01     1323.15 us      104.00 us    73159.00 us            704        READ
      0.02     1890.07 us       15.00 us    89692.00 us           1634     ENTRYLK
      0.07   137914.67 us       22.00 us  1523239.00 us             73     INODELK
      0.22      395.07 us       30.00 us   676921.00 us          74768    FXATTROP
      1.16    13685.56 us     1739.00 us   650863.00 us          11562       FSYNC
      3.08     5851.12 us       94.00 us  1033754.00 us          71879       WRITE
     95.43    76480.58 us       16.00 us  5368628.00 us         170274    FINODELK

    Duration: 134846 seconds
   Data Read: 7665929388 bytes
Data Written: 84724663200 bytes

Interval 2 Stats:
   Block Size:                256b+                512b+               1024b+
 No. of Reads:                 7432                  130                 1738
No. of Writes:                    6                 6816                 1430

   Block Size:               2048b+               4096b+               8192b+
 No. of Reads:                   48                 3099                 1040
No. of Writes:                  611                24732                 6649

   Block Size:              16384b+              32768b+              65536b+
 No. of Reads:                 9382                  442                  832
No. of Writes:                 6602                 6234                 5531

   Block Size:             131072b+
 No. of Reads:                20719
No. of Writes:               142946

 %-latency    Avg-latency    Min-Latency    Max-Latency   No. of calls         Fop
 ---------    -----------    -----------    -----------   ------------        ----
      0.00        0.00 us        0.00 us        0.00 us             19      FORGET
      0.00        0.00 us        0.00 us        0.00 us           2046     RELEASE
      0.00        0.00 us        0.00 us        0.00 us           2606  RELEASEDIR
      0.00       63.00 us       47.00 us       79.00 us              2    GETXATTR
      0.00       64.00 us       52.00 us       81.00 us              3       FSTAT
      0.00       33.83 us       27.00 us       41.00 us              6     INODELK
      0.00      116.33 us       92.00 us      155.00 us              3     SETATTR
      0.00      479.00 us      479.00 us      479.00 us              1    READDIRP
      0.00       48.75 us       32.00 us       83.00 us             12     OPENDIR
      0.00      283.31 us      176.00 us      383.00 us             13       MKNOD
      0.00       46.40 us       26.00 us      142.00 us            164      STATFS
      0.00     1303.45 us       20.00 us    36916.00 us             29     ENTRYLK
      0.00     1054.11 us       63.00 us   148201.00 us            299      LOOKUP
      0.01     1302.85 us      104.00 us    73159.00 us            618        READ
      0.20      812.06 us       40.00 us   676921.00 us          32435    FXATTROP
      1.19    13697.54 us     1739.00 us   650863.00 us          11507       FSYNC
      1.94     8117.26 us       98.00 us  1033754.00 us          31733       WRITE
     96.67   229052.02 us       17.00 us  5368628.00 us          56103    FINODELK

    Duration: 74302 seconds
   Data Read: 2999932124 bytes
Data Written: 19810606944 bytes

Any help will be appreciated.

Best Regards,
Strahil Nikolov
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From rgowdapp at redhat.com Wed Jan 23 02:33:42 2019
From: rgowdapp at redhat.com (Raghavendra Gowdappa)
Date: Wed, 23 Jan 2019 08:03:42 +0530
Subject: [Gluster-users] writev: Transport endpoint is not connected
In-Reply-To:
References:
Message-ID:

On Wed, Jan 23, 2019 at 1:59 AM Lindolfo Meira wrote:

> Dear all,
>
> I've been trying to benchmark a gluster file system using the MPIIO API of
> IOR. Almost all of the times I try to run the application with more than 6
> tasks performing I/O (mpirun -n N, for N > 6) I get the error: "writev:
> Transport endpoint is not connected". And then each one of the N tasks
> returns "ERROR: cannot open file to get file size, MPI MPI_ERR_FILE:
> invalid file, (aiori-MPIIO.c:488)".
>
> Does anyone have any idea what's going on?
>
> I'm writing from a single node, to a system configured for stripe over 6
> bricks. The volume is mounted with the options _netdev and transport=rdma.
> I'm using OpenMPI 2.1.2 (I tested version 4.0.0 and nothing changed). IOR
> arguments used: -B -E -F -q -w -k -z -i=1 -t=2m -b=1g -a=MPIIO. Running
> OpenSUSE Leap 15.0 and GlusterFS 5.3. Output of "gluster volume info"
> follows bellow:
>
> Volume Name: gfs
> Type: Stripe
>

+Dhananjay, Krutika

stripe has been deprecated. You can use sharded volumes.

> Volume ID: ea159033-5f7f-40ac-bad0-6f46613a336b
> Status: Started
> Snapshot Count: 0
> Number of Bricks: 1 x 6 = 6
> Transport-type: rdma
> Bricks:
> Brick1: pfs01-ib:/mnt/data/gfs
> Brick2: pfs02-ib:/mnt/data/gfs
> Brick3: pfs03-ib:/mnt/data/gfs
> Brick4: pfs04-ib:/mnt/data/gfs
> Brick5: pfs05-ib:/mnt/data/gfs
> Brick6: pfs06-ib:/mnt/data/gfs
> Options Reconfigured:
> nfs.disable: on
>
>
> Thanks in advance,
>
> Lindolfo Meira, MSc
> Diretor Geral, Centro Nacional de Supercomputação
> Universidade Federal do Rio Grande do Sul
> +55 (51) 3308-3139
> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-users
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From hunter86_bg at yahoo.com Wed Jan 23 07:15:36 2019
From: hunter86_bg at yahoo.com (Strahil)
Date: Wed, 23 Jan 2019 09:15:36 +0200
Subject: [Gluster-users] Performance issue, need guidance
In-Reply-To: <662056908.2026192.1548198541012@mail.yahoo.com>
Message-ID:

An HTML attachment was scrubbed...
URL:

From shaik.salam at tcs.com Wed Jan 23 07:18:09 2019
From: shaik.salam at tcs.com (Shaik Salam)
Date: Wed, 23 Jan 2019 12:48:09 +0530
Subject: [Gluster-users] [Bugs] Bricks are going offline unable to recover with heal/start force commands
In-Reply-To:
References:
Message-ID:

Hi Sanju,

Please find the requested information in the attached logs.

The brick below is offline. I tried the start force and heal commands, but it does not come up.

sh-4.2# sh-4.2# gluster --version glusterfs 4.1.5 sh-4.2# gluster volume status vol_3442e86b6d994a14de73f1b8c82cf0b8 Status of volume: vol_3442e86b6d994a14de73f1b8c82cf0b8 Gluster process TCP Port RDMA Port Online Pid ------------------------------------------------------------------------------ Brick 192.168.3.6:/var/lib/heketi/mounts/vg _ca57f326195c243be2380ce4e42a4191/brick_952 d75fd193c7209c9a81acbc23a3747/brick 49166 0 Y 269 Brick 192.168.3.5:/var/lib/heketi/mounts/vg _d5f17487744584e3652d3ca943b0b91b/brick_e15 c12cceae12c8ab7782dd57cf5b6c1/brick N/A N/A N N/A Brick 192.168.3.15:/var/lib/heketi/mounts/v g_462ea199185376b03e4b0317363bb88c/brick_17 36459d19e8aaa1dcb5a87f48747d04/brick 49173 0 Y 225 Self-heal Daemon on localhost N/A N/A Y 45826 Self-heal Daemon on 192.168.3.6 N/A N/A Y 65196 Self-heal Daemon on 192.168.3.15 N/A N/A Y 52915 Task Status of Volume vol_3442e86b6d994a14de73f1b8c82cf0b8 ------------------------------------------------------------------------------ We can see following events from when we start forcing volumes /mgmt/glusterd.so(+0xe2b3a) [0x7fca9e139b3a] -->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2605) [0x7fca9e139605] -->/lib64/libglusterfs.so.0(runner_log+0x115) [0x7fcaa346f0e5] ) 0-management: Ran script: /var/lib/glusterd/hooks/1/start/post/S29CTDBsetup.sh --volname=vol_3442e86b6d994a14de73f1b8c82cf0b8 --first=no --version=1 --volume-op=start --gd-workdir=/var/lib/glusterd [2019-01-21 08:22:34.555068] E [run.c:241:runner_log] (-->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2b3a) [0x7fca9e139b3a] -->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2563) [0x7fca9e139563] -->/lib64/libglusterfs.so.0(runner_log+0x115) [0x7fcaa346f0e5] ) 0-management: Failed to execute script: /var/lib/glusterd/hooks/1/start/post/S30samba-start.sh --volname=vol_3442e86b6d994a14de73f1b8c82cf0b8 --first=no --version=1 --volume-op=start --gd-workdir=/var/lib/glusterd [2019-01-21 08:22:53.389049] I [MSGID: 106499] 
[glusterd-handler.c:4314:__glusterd_handle_status_volume] 0-management: Received status volume req for volume vol_3442e86b6d994a14de73f1b8c82cf0b8 [2019-01-21 08:23:25.346839] I [MSGID: 106487] [glusterd-handler.c:1486:__glusterd_handle_cli_list_friends] 0-glusterd: Received cli list req We can see following events from when we heal volumes. [2019-01-21 08:20:07.576070] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-glusterfs: error returned while attempting to connect to host:(null), port:0 [2019-01-21 08:20:07.580225] I [cli-rpc-ops.c:9182:gf_cli_heal_volume_cbk] 0-cli: Received resp to heal volume [2019-01-21 08:20:07.580326] I [input.c:31:cli_batch] 0-: Exiting with: -1 [2019-01-21 08:22:30.423311] I [cli.c:768:main] 0-cli: Started running gluster with version 4.1.5 [2019-01-21 08:22:30.463648] I [MSGID: 101190] [event-epoll.c:617:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1 [2019-01-21 08:22:30.463718] I [socket.c:2632:socket_event_handler] 0-transport: EPOLLERR - disconnecting now [2019-01-21 08:22:30.463859] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-glusterfs: error returned while attempting to connect to host:(null), port:0 [2019-01-21 08:22:33.427710] I [socket.c:2632:socket_event_handler] 0-transport: EPOLLERR - disconnecting now [2019-01-21 08:22:34.581555] I [cli-rpc-ops.c:1472:gf_cli_start_volume_cbk] 0-cli: Received resp to start volume [2019-01-21 08:22:34.581678] I [input.c:31:cli_batch] 0-: Exiting with: 0 [2019-01-21 08:22:53.345351] I [cli.c:768:main] 0-cli: Started running gluster with version 4.1.5 [2019-01-21 08:22:53.387992] I [MSGID: 101190] [event-epoll.c:617:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1 [2019-01-21 08:22:53.388059] I [socket.c:2632:socket_event_handler] 0-transport: EPOLLERR - disconnecting now [2019-01-21 08:22:53.388138] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-glusterfs: error returned while attempting to connect to host:(null), port:0 [2019-01-21 08:22:53.394737] I [input.c:31:cli_batch] 0-: 
Exiting with: 0 [2019-01-21 08:23:25.304688] I [cli.c:768:main] 0-cli: Started running gluster with version 4.1.5 [2019-01-21 08:23:25.346319] I [MSGID: 101190] [event-epoll.c:617:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1 [2019-01-21 08:23:25.346389] I [socket.c:2632:socket_event_handler] 0-transport: EPOLLERR - disconnecting now [2019-01-21 08:23:25.346500] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-glusterfs: error returned while attempting to connect to host:(null), port:0 Enabled DEBUG mode for brick level. But nothing writing to brick log. gluster volume set vol_3442e86b6d994a14de73f1b8c82cf0b8 diagnostics.brick-log-level DEBUG sh-4.2# pwd /var/log/glusterfs/bricks sh-4.2# ls -la |grep brick_e15c12cceae12c8ab7782dd57cf5b6c1 -rw-------. 1 root root 0 Jan 20 02:46 var-lib-heketi-mounts-vg_d5f17487744584e3652d3ca943b0b91b-brick_e15c12cceae12c8ab7782dd57cf5b6c1-brick.log From: Sanju Rakonde To: Shaik Salam Cc: Amar Tumballi Suryanarayan , "gluster-users at gluster.org List" Date: 01/22/2019 02:21 PM Subject: Re: [Gluster-users] [Bugs] Bricks are going offline unable to recover with heal/start force commands "External email. Open with Caution" Hi Shaik, Can you please provide us complete glusterd and cmd_history logs from all the nodes in the cluster? Also please paste output of the following commands (from all nodes): 1. gluster --version 2. gluster volume info 3. gluster volume status 4. gluster peer status 5. ps -ax | grep glusterfsd On Tue, Jan 22, 2019 at 12:47 PM Shaik Salam wrote: Hi Surya, It is already customer setup and cant redeploy again. Enabled debug for brick level log but nothing writing to it. Can you tell me is any other ways to troubleshoot or logs to look?? From: Shaik Salam/HYD/TCS To: "Amar Tumballi Suryanarayan" Cc: "gluster-users at gluster.org List" Date: 01/22/2019 12:06 PM Subject: Re: [Bugs] Bricks are going offline unable to recover with heal/start force commands Hi Surya, I have enabled DEBUG mode for brick level. 
But nothing writing to brick log. gluster volume set vol_3442e86b6d994a14de73f1b8c82cf0b8 diagnostics.brick-log-level DEBUG sh-4.2# pwd /var/log/glusterfs/bricks sh-4.2# ls -la |grep brick_e15c12cceae12c8ab7782dd57cf5b6c1 -rw-------. 1 root root 0 Jan 20 02:46 var-lib-heketi-mounts-vg_d5f17487744584e3652d3ca943b0b91b-brick_e15c12cceae12c8ab7782dd57cf5b6c1-brick.log BR Salam From: "Amar Tumballi Suryanarayan" To: "Shaik Salam" Cc: "gluster-users at gluster.org List" Date: 01/22/2019 11:38 AM Subject: Re: [Bugs] Bricks are going offline unable to recover with heal/start force commands "External email. Open with Caution" Hi Shaik, Can you check what is there in brick logs? They are located in /var/log/glusterfs/bricks/*? Looks like the samba hooks script failed, but that shouldn't matter in this use case. Also, I see that you are trying to setup heketi to provision volumes, which means you may be using gluster in container usecases. If you are still in 'PoC' phase, can you give https://github.com/gluster/gcs a try? That makes the deployment and the stack little simpler. -Amar On Tue, Jan 22, 2019 at 11:29 AM Shaik Salam wrote: Can anyone respond how to recover bricks apart from heal/start force according to below events from logs. Please let me know any other logs required. Thanks in advance. BR Salam From: Shaik Salam/HYD/TCS To: bugs at gluster.org, gluster-users at gluster.org Date: 01/21/2019 10:03 PM Subject: Bricks are going offline unable to recover with heal/start force commands Hi, Bricks are in offline and unable to recover with following commands gluster volume heal gluster volume start force But still bricks are offline. 
sh-4.2# gluster volume status vol_3442e86b6d994a14de73f1b8c82cf0b8 Status of volume: vol_3442e86b6d994a14de73f1b8c82cf0b8 Gluster process TCP Port RDMA Port Online Pid ------------------------------------------------------------------------------ Brick 192.168.3.6:/var/lib/heketi/mounts/vg _ca57f326195c243be2380ce4e42a4191/brick_952 d75fd193c7209c9a81acbc23a3747/brick 49166 0 Y 269 Brick 192.168.3.5:/var/lib/heketi/mounts/vg _d5f17487744584e3652d3ca943b0b91b/brick_e15 c12cceae12c8ab7782dd57cf5b6c1/brick N/A N/A N N/A Brick 192.168.3.15:/var/lib/heketi/mounts/v g_462ea199185376b03e4b0317363bb88c/brick_17 36459d19e8aaa1dcb5a87f48747d04/brick 49173 0 Y 225 Self-heal Daemon on localhost N/A N/A Y 45826 Self-heal Daemon on 192.168.3.6 N/A N/A Y 65196 Self-heal Daemon on 192.168.3.15 N/A N/A Y 52915 Task Status of Volume vol_3442e86b6d994a14de73f1b8c82cf0b8 ------------------------------------------------------------------------------ We can see following events from when we start forcing volumes /mgmt/glusterd.so(+0xe2b3a) [0x7fca9e139b3a] -->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2605) [0x7fca9e139605] -->/lib64/libglusterfs.so.0(runner_log+0x115) [0x7fcaa346f0e5] ) 0-management: Ran script: /var/lib/glusterd/hooks/1/start/post/S29CTDBsetup.sh --volname=vol_3442e86b6d994a14de73f1b8c82cf0b8 --first=no --version=1 --volume-op=start --gd-workdir=/var/lib/glusterd [2019-01-21 08:22:34.555068] E [run.c:241:runner_log] (-->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2b3a) [0x7fca9e139b3a] -->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2563) [0x7fca9e139563] -->/lib64/libglusterfs.so.0(runner_log+0x115) [0x7fcaa346f0e5] ) 0-management: Failed to execute script: /var/lib/glusterd/hooks/1/start/post/S30samba-start.sh --volname=vol_3442e86b6d994a14de73f1b8c82cf0b8 --first=no --version=1 --volume-op=start --gd-workdir=/var/lib/glusterd [2019-01-21 08:22:53.389049] I [MSGID: 106499] [glusterd-handler.c:4314:__glusterd_handle_status_volume] 
0-management: Received status volume req for volume vol_3442e86b6d994a14de73f1b8c82cf0b8 [2019-01-21 08:23:25.346839] I [MSGID: 106487] [glusterd-handler.c:1486:__glusterd_handle_cli_list_friends] 0-glusterd: Received cli list req We can see following events from when we heal volumes. [2019-01-21 08:20:07.576070] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-glusterfs: error returned while attempting to connect to host:(null), port:0 [2019-01-21 08:20:07.580225] I [cli-rpc-ops.c:9182:gf_cli_heal_volume_cbk] 0-cli: Received resp to heal volume [2019-01-21 08:20:07.580326] I [input.c:31:cli_batch] 0-: Exiting with: -1 [2019-01-21 08:22:30.423311] I [cli.c:768:main] 0-cli: Started running gluster with version 4.1.5 [2019-01-21 08:22:30.463648] I [MSGID: 101190] [event-epoll.c:617:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1 [2019-01-21 08:22:30.463718] I [socket.c:2632:socket_event_handler] 0-transport: EPOLLERR - disconnecting now [2019-01-21 08:22:30.463859] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-glusterfs: error returned while attempting to connect to host:(null), port:0 [2019-01-21 08:22:33.427710] I [socket.c:2632:socket_event_handler] 0-transport: EPOLLERR - disconnecting now [2019-01-21 08:22:34.581555] I [cli-rpc-ops.c:1472:gf_cli_start_volume_cbk] 0-cli: Received resp to start volume [2019-01-21 08:22:34.581678] I [input.c:31:cli_batch] 0-: Exiting with: 0 [2019-01-21 08:22:53.345351] I [cli.c:768:main] 0-cli: Started running gluster with version 4.1.5 [2019-01-21 08:22:53.387992] I [MSGID: 101190] [event-epoll.c:617:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1 [2019-01-21 08:22:53.388059] I [socket.c:2632:socket_event_handler] 0-transport: EPOLLERR - disconnecting now [2019-01-21 08:22:53.388138] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-glusterfs: error returned while attempting to connect to host:(null), port:0 [2019-01-21 08:22:53.394737] I [input.c:31:cli_batch] 0-: Exiting with: 0 [2019-01-21 08:23:25.304688] I 
[cli.c:768:main] 0-cli: Started running gluster with version 4.1.5 [2019-01-21 08:23:25.346319] I [MSGID: 101190] [event-epoll.c:617:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1 [2019-01-21 08:23:25.346389] I [socket.c:2632:socket_event_handler] 0-transport: EPOLLERR - disconnecting now [2019-01-21 08:23:25.346500] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-glusterfs: error returned while attempting to connect to host:(null), port:0 Please let us know steps to recover bricks. BR Salam =====-----=====-----===== Notice: The information contained in this e-mail message and/or attachments to it may contain confidential or privileged information. If you are not the intended recipient, any dissemination, use, review, distribution, printing or copying of the information contained in this e-mail message and/or attachments to it are strictly prohibited. If you have received this communication in error, please notify us by reply e-mail or telephone and immediately and permanently delete the message and any attachments. Thank you _______________________________________________ Bugs mailing list Bugs at gluster.org https://lists.gluster.org/mailman/listinfo/bugs -- Amar Tumballi (amarts) _______________________________________________ Gluster-users mailing list Gluster-users at gluster.org https://lists.gluster.org/mailman/listinfo/gluster-users -- Thanks, Sanju -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: firstnode.log Type: application/octet-stream Size: 272336 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: second.log Type: application/octet-stream Size: 396731 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: thirdnode.log Type: application/octet-stream Size: 280962 bytes Desc: not available URL: From srakonde at redhat.com Wed Jan 23 08:44:58 2019 From: srakonde at redhat.com (Sanju Rakonde) Date: Wed, 23 Jan 2019 14:14:58 +0530 Subject: [Gluster-users] [Bugs] Bricks are going offline unable to recover with heal/start force commands In-Reply-To: References: Message-ID: Hi Shaik, I can see below errors in glusterd logs. [2019-01-22 09:20:17.540196] E [MSGID: 101012] [common-utils.c:4010:gf_is_service_running] 0-: Unable to read pidfile: /var/run/gluster/vols/vol_e1aa1283d5917485d88c4a742eeff422/192.168.3.6-var-lib-heketi-mounts-vg_526f35058433c6b03130bba4e0a7dd87-brick_9e7c382e5f853d471c347bc5590359af-brick.pid [2019-01-22 09:20:17.546408] E [MSGID: 101012] [common-utils.c:4010:gf_is_service_running] 0-: Unable to read pidfile: /var/run/gluster/vols/vol_f0ed498d7e781d7bb896244175b31f9e/192.168.3.6-var-lib-heketi-mounts-vg_56391bec3c8bfe4fc116de7bddfc2af4-brick_47ed9e0663ad0f6f676ddd6ad7e3dcde-brick.pid [2019-01-22 09:20:17.552575] E [MSGID: 101012] [common-utils.c:4010:gf_is_service_running] 0-: Unable to read pidfile: /var/run/gluster/vols/vol_f387519c9b004ec14e80696db88ef0f8/192.168.3.6-var-lib-heketi-mounts-vg_56391bec3c8bfe4fc116de7bddfc2af4-brick_06ad6c73dfbf6a5fc21334f98c9973c2-brick.pid [2019-01-22 09:20:17.558888] E [MSGID: 101012] [common-utils.c:4010:gf_is_service_running] 0-: Unable to read pidfile: /var/run/gluster/vols/vol_f8ca343c60e6efe541fe02d16ca02a7d/192.168.3.6-var-lib-heketi-mounts-vg_526f35058433c6b03130bba4e0a7dd87-brick_525225f65753b05dfe33aeaeb9c5de39-brick.pid [2019-01-22 09:20:17.565266] E [MSGID: 101012] [common-utils.c:4010:gf_is_service_running] 0-: Unable to read pidfile: /var/run/gluster/vols/vol_fe882e074c0512fd9271fc2ff5a0bfe1/192.168.3.6-var-lib-heketi-mounts-vg_28708570b029e5eff0a996c453a11691-brick_d4f30d6e465a8544b759a7016fb5aab5-brick.pid [2019-01-22 09:20:17.585926] E [MSGID: 106028] 
[glusterd-utils.c:8222:glusterd_brick_signal] 0-glusterd: Unable to get pid of brick process [2019-01-22 09:20:17.617806] E [MSGID: 106028] [glusterd-utils.c:8222:glusterd_brick_signal] 0-glusterd: Unable to get pid of brick process [2019-01-22 09:20:17.649628] E [MSGID: 101012] [common-utils.c:4010:gf_is_service_running] 0-: Unable to read pidfile: /var/run/gluster/glustershd/glustershd.pid [2019-01-22 09:20:17.649700] E [MSGID: 101012] [common-utils.c:4010:gf_is_service_running] 0-: Unable to read pidfile: /var/run/gluster/glustershd/glustershd.pid So it looks like, neither gf_is_service_running() nor glusterd_brick_signal() are able to read the pid file. That means pidfiles might be having nothing to read. Can you please paste the contents of brick pidfiles. You can find brick pidfiles in /var/run/gluster/vols// or you can just run this command "for i in `ls /var/run/gluster/vols/*/*.pid`;do echo $i;cat $i;done" On Wed, Jan 23, 2019 at 12:49 PM Shaik Salam wrote: > Hi Sanju, > > Please find requested information attached logs. > > > > > Below brick is offline and try to start force/heal commands but doesn't > makes up. 
> > sh-4.2# > sh-4.2# gluster --version > glusterfs 4.1.5 > > > sh-4.2# gluster volume status vol_3442e86b6d994a14de73f1b8c82cf0b8 > Status of volume: vol_3442e86b6d994a14de73f1b8c82cf0b8 > Gluster process TCP Port RDMA Port Online > Pid > > ------------------------------------------------------------------------------ > Brick 192.168.3.6:/var/lib/heketi/mounts/vg > _ca57f326195c243be2380ce4e42a4191/brick_952 > d75fd193c7209c9a81acbc23a3747/brick 49166 0 Y > 269 > Brick 192.168.3.5:/var/lib/heketi/mounts/vg > _d5f17487744584e3652d3ca943b0b91b/brick_e15 > c12cceae12c8ab7782dd57cf5b6c1/brick N/A N/A N > N/A > Brick 192.168.3.15:/var/lib/heketi/mounts/v > g_462ea199185376b03e4b0317363bb88c/brick_17 > 36459d19e8aaa1dcb5a87f48747d04/brick 49173 0 Y > 225 > Self-heal Daemon on localhost N/A N/A Y > 45826 > Self-heal Daemon on 192.168.3.6 N/A N/A Y > 65196 > Self-heal Daemon on 192.168.3.15 N/A N/A Y > 52915 > > Task Status of Volume vol_3442e86b6d994a14de73f1b8c82cf0b8 > > ------------------------------------------------------------------------------ > > > We can see following events from when we start forcing volumes > > /mgmt/glusterd.so(+0xe2b3a) [0x7fca9e139b3a] > -->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2605) > [0x7fca9e139605] -->/lib64/libglusterfs.so.0(runner_log+0x115) > [0x7fcaa346f0e5] ) 0-management: Ran script: > /var/lib/glusterd/hooks/1/start/post/S29CTDBsetup.sh > --volname=vol_3442e86b6d994a14de73f1b8c82cf0b8 --first=no --version=1 > --volume-op=start --gd-workdir=/var/lib/glusterd > [2019-01-21 08:22:34.555068] E [run.c:241:runner_log] > (-->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2b3a) > [0x7fca9e139b3a] > -->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2563) > [0x7fca9e139563] -->/lib64/libglusterfs.so.0(runner_log+0x115) > [0x7fcaa346f0e5] ) 0-management: Failed to execute script: > /var/lib/glusterd/hooks/1/start/post/S30samba-start.sh > --volname=vol_3442e86b6d994a14de73f1b8c82cf0b8 --first=no --version=1 > 
--volume-op=start --gd-workdir=/var/lib/glusterd > [2019-01-21 08:22:53.389049] I [MSGID: 106499] > [glusterd-handler.c:4314:__glusterd_handle_status_volume] 0-management: > Received status volume req for volume vol_3442e86b6d994a14de73f1b8c82cf0b8 > [2019-01-21 08:23:25.346839] I [MSGID: 106487] > [glusterd-handler.c:1486:__glusterd_handle_cli_list_friends] 0-glusterd: > Received cli list req > > > We can see following events from when we heal volumes. > > [2019-01-21 08:20:07.576070] W [rpc-clnt.c:1753:rpc_clnt_submit] > 0-glusterfs: error returned while attempting to connect to host:(null), > port:0 > [2019-01-21 08:20:07.580225] I [cli-rpc-ops.c:9182:gf_cli_heal_volume_cbk] > 0-cli: Received resp to heal volume > [2019-01-21 08:20:07.580326] I [input.c:31:cli_batch] 0-: Exiting with: -1 > [2019-01-21 08:22:30.423311] I [cli.c:768:main] 0-cli: Started running > gluster with version 4.1.5 > [2019-01-21 08:22:30.463648] I [MSGID: 101190] > [event-epoll.c:617:event_dispatch_epoll_worker] 0-epoll: Started thread > with index 1 > [2019-01-21 08:22:30.463718] I [socket.c:2632:socket_event_handler] > 0-transport: EPOLLERR - disconnecting now > [2019-01-21 08:22:30.463859] W [rpc-clnt.c:1753:rpc_clnt_submit] > 0-glusterfs: error returned while attempting to connect to host:(null), > port:0 > [2019-01-21 08:22:33.427710] I [socket.c:2632:socket_event_handler] > 0-transport: EPOLLERR - disconnecting now > [2019-01-21 08:22:34.581555] I > [cli-rpc-ops.c:1472:gf_cli_start_volume_cbk] 0-cli: Received resp to start > volume > [2019-01-21 08:22:34.581678] I [input.c:31:cli_batch] 0-: Exiting with: 0 > [2019-01-21 08:22:53.345351] I [cli.c:768:main] 0-cli: Started running > gluster with version 4.1.5 > [2019-01-21 08:22:53.387992] I [MSGID: 101190] > [event-epoll.c:617:event_dispatch_epoll_worker] 0-epoll: Started thread > with index 1 > [2019-01-21 08:22:53.388059] I [socket.c:2632:socket_event_handler] > 0-transport: EPOLLERR - disconnecting now > [2019-01-21 08:22:53.388138] 
W [rpc-clnt.c:1753:rpc_clnt_submit] > 0-glusterfs: error returned while attempting to connect to host:(null), > port:0 > [2019-01-21 08:22:53.394737] I [input.c:31:cli_batch] 0-: Exiting with: 0 > [2019-01-21 08:23:25.304688] I [cli.c:768:main] 0-cli: Started running > gluster with version 4.1.5 > [2019-01-21 08:23:25.346319] I [MSGID: 101190] > [event-epoll.c:617:event_dispatch_epoll_worker] 0-epoll: Started thread > with index 1 > [2019-01-21 08:23:25.346389] I [socket.c:2632:socket_event_handler] > 0-transport: EPOLLERR - disconnecting now > [2019-01-21 08:23:25.346500] W [rpc-clnt.c:1753:rpc_clnt_submit] > 0-glusterfs: error returned while attempting to connect to host:(null), > port:0 > > > Enabled DEBUG mode for brick level. But nothing writing to brick log. > > gluster volume set vol_3442e86b6d994a14de73f1b8c82cf0b8 > diagnostics.brick-log-level DEBUG > > sh-4.2# pwd > /var/log/glusterfs/bricks > > sh-4.2# ls -la |grep brick_e15c12cceae12c8ab7782dd57cf5b6c1 > -rw-------. 1 root root 0 Jan 20 02:46 > var-lib-heketi-mounts-vg_d5f17487744584e3652d3ca943b0b91b-brick_e15c12cceae12c8ab7782dd57cf5b6c1-brick.log > > > > > > > From: Sanju Rakonde > To: Shaik Salam > Cc: Amar Tumballi Suryanarayan , " > gluster-users at gluster.org List" > Date: 01/22/2019 02:21 PM > Subject: Re: [Gluster-users] [Bugs] Bricks are going offline > unable to recover with heal/start force commands > ------------------------------ > > > > *"External email. Open with Caution"* > Hi Shaik, > > Can you please provide us complete glusterd and cmd_history logs from all > the nodes in the cluster? Also please paste output of the following > commands (from all nodes): > 1. gluster --version > 2. gluster volume info > 3. gluster volume status > 4. gluster peer status > 5. ps -ax | grep glusterfsd > > On Tue, Jan 22, 2019 at 12:47 PM Shaik Salam <*shaik.salam at tcs.com* > > wrote: > Hi Surya, > > It is already customer setup and cant redeploy again. 
> Enabled debug for brick level log but nothing writing to it. > Can you tell me is any other ways to troubleshoot or logs to look?? > > > From: Shaik Salam/HYD/TCS > To: "Amar Tumballi Suryanarayan" <*atumball at redhat.com* > > > Cc: "*gluster-users at gluster.org* List" > <*gluster-users at gluster.org* > > Date: 01/22/2019 12:06 PM > Subject: Re: [Bugs] Bricks are going offline unable to recover > with heal/start force commands > ------------------------------ > > > Hi Surya, > > I have enabled DEBUG mode for brick level. But nothing writing to brick > log. > > gluster volume set vol_3442e86b6d994a14de73f1b8c82cf0b8 > diagnostics.brick-log-level DEBUG > > sh-4.2# pwd > /var/log/glusterfs/bricks > > sh-4.2# ls -la |grep brick_e15c12cceae12c8ab7782dd57cf5b6c1 > -rw-------. 1 root root 0 Jan 20 02:46 > var-lib-heketi-mounts-vg_d5f17487744584e3652d3ca943b0b91b-brick_e15c12cceae12c8ab7782dd57cf5b6c1-brick.log > > BR > Salam > > > > > From: "Amar Tumballi Suryanarayan" <*atumball at redhat.com* > > > To: "Shaik Salam" <*shaik.salam at tcs.com* > > Cc: "*gluster-users at gluster.org* List" > <*gluster-users at gluster.org* > > Date: 01/22/2019 11:38 AM > Subject: Re: [Bugs] Bricks are going offline unable to recover > with heal/start force commands > ------------------------------ > > > > * "External email. Open with Caution"* > Hi Shaik, > > Can you check what is there in brick logs? They are located in > /var/log/glusterfs/bricks/*? > > Looks like the samba hooks script failed, but that shouldn't matter in > this use case. > > Also, I see that you are trying to setup heketi to provision volumes, > which means you may be using gluster in container usecases. If you are > still in 'PoC' phase, can you give *https://github.com/gluster/gcs* > a try? That makes the deployment and the > stack little simpler. 
> > -Amar > > > > > On Tue, Jan 22, 2019 at 11:29 AM Shaik Salam <*shaik.salam at tcs.com* > > wrote: > Can anyone respond how to recover bricks apart from heal/start force > according to below events from logs. > Please let me know any other logs required. > Thanks in advance. > > BR > Salam > > > > From: Shaik Salam/HYD/TCS > To: *bugs at gluster.org* , > *gluster-users at gluster.org* > Date: 01/21/2019 10:03 PM > Subject: Bricks are going offline unable to recover with > heal/start force commands > ------------------------------ > > > Hi, > > Bricks are in offline and unable to recover with following commands > > gluster volume heal > > gluster volume start force > > But still bricks are offline. > > > sh-4.2# gluster volume status vol_3442e86b6d994a14de73f1b8c82cf0b8 > Status of volume: vol_3442e86b6d994a14de73f1b8c82cf0b8 > Gluster process TCP Port RDMA Port Online > Pid > > ------------------------------------------------------------------------------ > Brick 192.168.3.6:/var/lib/heketi/mounts/vg > _ca57f326195c243be2380ce4e42a4191/brick_952 > d75fd193c7209c9a81acbc23a3747/brick 49166 0 Y > 269 > Brick 192.168.3.5:/var/lib/heketi/mounts/vg > _d5f17487744584e3652d3ca943b0b91b/brick_e15 > c12cceae12c8ab7782dd57cf5b6c1/brick N/A N/A N > N/A > Brick 192.168.3.15:/var/lib/heketi/mounts/v > g_462ea199185376b03e4b0317363bb88c/brick_17 > 36459d19e8aaa1dcb5a87f48747d04/brick 49173 0 Y > 225 > Self-heal Daemon on localhost N/A N/A Y > 45826 > Self-heal Daemon on 192.168.3.6 N/A N/A Y > 65196 > Self-heal Daemon on 192.168.3.15 N/A N/A Y > 52915 > > Task Status of Volume vol_3442e86b6d994a14de73f1b8c82cf0b8 > > ------------------------------------------------------------------------------ > > > We can see following events from when we start forcing volumes > > /mgmt/glusterd.so(+0xe2b3a) [0x7fca9e139b3a] > -->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2605) > [0x7fca9e139605] -->/lib64/libglusterfs.so.0(runner_log+0x115) > [0x7fcaa346f0e5] ) 0-management: 
Ran script: > /var/lib/glusterd/hooks/1/start/post/S29CTDBsetup.sh > --volname=vol_3442e86b6d994a14de73f1b8c82cf0b8 --first=no --version=1 > --volume-op=start --gd-workdir=/var/lib/glusterd > [2019-01-21 08:22:34.555068] E [run.c:241:runner_log] > (-->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2b3a) > [0x7fca9e139b3a] > -->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2563) > [0x7fca9e139563] -->/lib64/libglusterfs.so.0(runner_log+0x115) > [0x7fcaa346f0e5] ) 0-management: Failed to execute script: > /var/lib/glusterd/hooks/1/start/post/S30samba-start.sh > --volname=vol_3442e86b6d994a14de73f1b8c82cf0b8 --first=no --version=1 > --volume-op=start --gd-workdir=/var/lib/glusterd > [2019-01-21 08:22:53.389049] I [MSGID: 106499] > [glusterd-handler.c:4314:__glusterd_handle_status_volume] 0-management: > Received status volume req for volume vol_3442e86b6d994a14de73f1b8c82cf0b8 > [2019-01-21 08:23:25.346839] I [MSGID: 106487] > [glusterd-handler.c:1486:__glusterd_handle_cli_list_friends] 0-glusterd: > Received cli list req > > > We can see following events from when we heal volumes. 
> > [2019-01-21 08:20:07.576070] W [rpc-clnt.c:1753:rpc_clnt_submit] > 0-glusterfs: error returned while attempting to connect to host:(null), > port:0 > [2019-01-21 08:20:07.580225] I [cli-rpc-ops.c:9182:gf_cli_heal_volume_cbk] > 0-cli: Received resp to heal volume > [2019-01-21 08:20:07.580326] I [input.c:31:cli_batch] 0-: Exiting with: -1 > [2019-01-21 08:22:30.423311] I [cli.c:768:main] 0-cli: Started running > gluster with version 4.1.5 > [2019-01-21 08:22:30.463648] I [MSGID: 101190] > [event-epoll.c:617:event_dispatch_epoll_worker] 0-epoll: Started thread > with index 1 > [2019-01-21 08:22:30.463718] I [socket.c:2632:socket_event_handler] > 0-transport: EPOLLERR - disconnecting now > [2019-01-21 08:22:30.463859] W [rpc-clnt.c:1753:rpc_clnt_submit] > 0-glusterfs: error returned while attempting to connect to host:(null), > port:0 > [2019-01-21 08:22:33.427710] I [socket.c:2632:socket_event_handler] > 0-transport: EPOLLERR - disconnecting now > [2019-01-21 08:22:34.581555] I > [cli-rpc-ops.c:1472:gf_cli_start_volume_cbk] 0-cli: Received resp to start > volume > [2019-01-21 08:22:34.581678] I [input.c:31:cli_batch] 0-: Exiting with: 0 > [2019-01-21 08:22:53.345351] I [cli.c:768:main] 0-cli: Started running > gluster with version 4.1.5 > [2019-01-21 08:22:53.387992] I [MSGID: 101190] > [event-epoll.c:617:event_dispatch_epoll_worker] 0-epoll: Started thread > with index 1 > [2019-01-21 08:22:53.388059] I [socket.c:2632:socket_event_handler] > 0-transport: EPOLLERR - disconnecting now > [2019-01-21 08:22:53.388138] W [rpc-clnt.c:1753:rpc_clnt_submit] > 0-glusterfs: error returned while attempting to connect to host:(null), > port:0 > [2019-01-21 08:22:53.394737] I [input.c:31:cli_batch] 0-: Exiting with: 0 > [2019-01-21 08:23:25.304688] I [cli.c:768:main] 0-cli: Started running > gluster with version 4.1.5 > [2019-01-21 08:23:25.346319] I [MSGID: 101190] > [event-epoll.c:617:event_dispatch_epoll_worker] 0-epoll: Started thread > with index 1 > [2019-01-21 
08:23:25.346389] I [socket.c:2632:socket_event_handler]
> 0-transport: EPOLLERR - disconnecting now
> [2019-01-21 08:23:25.346500] W [rpc-clnt.c:1753:rpc_clnt_submit]
> 0-glusterfs: error returned while attempting to connect to host:(null),
> port:0
>
> Please let us know steps to recover bricks.
>
> BR
> Salam
> =====-----=====-----=====
> Notice: The information contained in this e-mail
> message and/or attachments to it may contain
> confidential or privileged information. If you are
> not the intended recipient, any dissemination, use,
> review, distribution, printing or copying of the
> information contained in this e-mail message
> and/or attachments to it are strictly prohibited. If
> you have received this communication in error,
> please notify us by reply e-mail or telephone and
> immediately and permanently delete the message
> and any attachments. Thank you
> _______________________________________________
> Bugs mailing list
> *Bugs at gluster.org*
> *https://lists.gluster.org/mailman/listinfo/bugs*
>
> --
> Amar Tumballi (amarts)
> _______________________________________________
> Gluster-users mailing list
> *Gluster-users at gluster.org*
> *https://lists.gluster.org/mailman/listinfo/gluster-users*
>
> --
> Thanks,
> Sanju
>

--
Thanks,
Sanju
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From shaik.salam at tcs.com Wed Jan 23 12:20:13 2019
From: shaik.salam at tcs.com (Shaik Salam)
Date: Wed, 23 Jan 2019 17:50:13 +0530
Subject: [Gluster-users] [Bugs] Bricks are going offline unable to recover with heal/start force commands
In-Reply-To:
References:
Message-ID:

Hi Sanju,

Please find the requested information below. Sorry to repeat myself: I am retrying the start force command with brick-level debug logging enabled, using one volume as an example. Please correct me if I am doing anything wrong.
[root at master ~]# oc rsh glusterfs-storage-vll7x sh-4.2# gluster volume info vol_3442e86b6d994a14de73f1b8c82cf0b8 Volume Name: vol_3442e86b6d994a14de73f1b8c82cf0b8 Type: Replicate Volume ID: 15477f36-22e8-4757-a0ce-9000b63fa849 Status: Started Snapshot Count: 0 Number of Bricks: 1 x 3 = 3 Transport-type: tcp Bricks: Brick1: 192.168.3.6:/var/lib/heketi/mounts/vg_ca57f326195c243be2380ce4e42a4191/brick_952d75fd193c7209c9a81acbc23a3747/brick Brick2: 192.168.3.5:/var/lib/heketi/mounts/vg_d5f17487744584e3652d3ca943b0b91b/ brick_e15c12cceae12c8ab7782dd57cf5b6c1/brick Brick3: 192.168.3.15:/var/lib/heketi/mounts/vg_462ea199185376b03e4b0317363bb88c/brick_1736459d19e8aaa1dcb5a87f48747d04/brick Options Reconfigured: diagnostics.brick-log-level: INFO performance.client-io-threads: off nfs.disable: on transport.address-family: inet sh-4.2# gluster volume status vol_3442e86b6d994a14de73f1b8c82cf0b8 Status of volume: vol_3442e86b6d994a14de73f1b8c82cf0b8 Gluster process TCP Port RDMA Port Online Pid ------------------------------------------------------------------------------ Brick 192.168.3.6:/var/lib/heketi/mounts/vg _ca57f326195c243be2380ce4e42a4191/brick_952 d75fd193c7209c9a81acbc23a3747/brick 49157 0 Y 250 Brick 192.168.3.5:/var/lib/heketi/mounts/vg _d5f17487744584e3652d3ca943b0b91b/brick_e15 c12cceae12c8ab7782dd57cf5b6c1/brick N/A N/A N N/A Brick 192.168.3.15:/var/lib/heketi/mounts/v g_462ea199185376b03e4b0317363bb88c/brick_17 36459d19e8aaa1dcb5a87f48747d04/brick 49173 0 Y 225 Self-heal Daemon on localhost N/A N/A Y 108434 Self-heal Daemon on matrix1.matrix.orange.l ab N/A N/A Y 69525 Self-heal Daemon on matrix2.matrix.orange.l ab N/A N/A Y 18569 gluster volume set vol_3442e86b6d994a14de73f1b8c82cf0b8 diagnostics.brick-log-level DEBUG volume set: success sh-4.2# gluster volume get vol_3442e86b6d994a14de73f1b8c82cf0b8 all |grep log cluster.entry-change-log on cluster.data-change-log on cluster.metadata-change-log on diagnostics.brick-log-level DEBUG sh-4.2# cd 
/var/log/glusterfs/bricks/ sh-4.2# ls -la |grep brick_e15c12cceae12c8ab7782dd57cf5b6c1 -rw-------. 1 root root 0 Jan 20 02:46 var-lib-heketi-mounts-vg_d5f17487744584e3652d3ca943b0b91b-brick_e15c12cceae12c8ab7782dd57cf5b6c1-brick.log >>> Noting in log -rw-------. 1 root root 189057 Jan 18 09:20 var-lib-heketi-mounts-vg_d5f17487744584e3652d3ca943b0b91b-brick_e15c12cceae12c8ab7782dd57cf5b6c1-brick.log-20190120 [2019-01-23 11:49:32.475956] I [run.c:241:runner_log] (-->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2b3a) [0x7fca9e139b3a] -->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2605) [0x7fca9e139605] -->/lib64/libglusterfs.so.0(runner_log+0x115) [0x7fcaa346f0e5] ) 0-management: Ran script: /var/lib/glusterd/hooks/1/set/post/S30samba-set.sh --volname=vol_3442e86b6d994a14de73f1b8c82cf0b8 -o diagnostics.brick-log-level=DEBUG --gd-workdir=/var/lib/glusterd [2019-01-23 11:49:32.483191] I [run.c:241:runner_log] (-->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2b3a) [0x7fca9e139b3a] -->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2605) [0x7fca9e139605] -->/lib64/libglusterfs.so.0(runner_log+0x115) [0x7fcaa346f0e5] ) 0-management: Ran script: /var/lib/glusterd/hooks/1/set/post/S32gluster_enable_shared_storage.sh --volname=vol_3442e86b6d994a14de73f1b8c82cf0b8 -o diagnostics.brick-log-level=DEBUG --gd-workdir=/var/lib/glusterd [2019-01-23 11:48:59.111292] W [MSGID: 106036] [glusterd-snapshot.c:9514:glusterd_handle_snapshot_fn] 0-management: Snapshot list failed [2019-01-23 11:50:14.112271] E [MSGID: 106026] [glusterd-snapshot.c:3962:glusterd_handle_snapshot_list] 0-management: Volume (vol_63854b105c40802bdec77290e91858ea) does not exist [Invalid argument] [2019-01-23 11:50:14.112305] W [MSGID: 106036] [glusterd-snapshot.c:9514:glusterd_handle_snapshot_fn] 0-management: Snapshot list failed [2019-01-23 11:50:20.322902] I [glusterd-utils.c:5994:glusterd_brick_start] 0-management: discovered already-running brick 
/var/lib/heketi/mounts/vg_d5f17487744584e3652d3ca943b0b91b/brick_e15c12cceae12c8ab7782dd57cf5b6c1/brick [2019-01-23 11:50:20.322925] I [MSGID: 106142] [glusterd-pmap.c:297:pmap_registry_bind] 0-pmap: adding brick /var/lib/heketi/mounts/vg_d5f17487744584e3652d3ca943b0b91b/brick_e15c12cceae12c8ab7782dd57cf5b6c1/brick on port 49165 [2019-01-23 11:50:20.327557] I [MSGID: 106131] [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: nfs already stopped [2019-01-23 11:50:20.327586] I [MSGID: 106568] [glusterd-svc-mgmt.c:235:glusterd_svc_stop] 0-management: nfs service is stopped [2019-01-23 11:50:20.327604] I [MSGID: 106599] [glusterd-nfs-svc.c:82:glusterd_nfssvc_manager] 0-management: nfs/server.so xlator is not installed [2019-01-23 11:50:20.337735] I [MSGID: 106568] [glusterd-proc-mgmt.c:87:glusterd_proc_stop] 0-management: Stopping glustershd daemon running in pid: 69525 [2019-01-23 11:50:21.338058] I [MSGID: 106568] [glusterd-svc-mgmt.c:235:glusterd_svc_stop] 0-management: glustershd service is stopped [2019-01-23 11:50:21.338180] I [MSGID: 106567] [glusterd-svc-mgmt.c:203:glusterd_svc_start] 0-management: Starting glustershd service [2019-01-23 11:50:21.348234] I [MSGID: 106131] [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: bitd already stopped [2019-01-23 11:50:21.348285] I [MSGID: 106568] [glusterd-svc-mgmt.c:235:glusterd_svc_stop] 0-management: bitd service is stopped [2019-01-23 11:50:21.348866] I [MSGID: 106131] [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: scrub already stopped [2019-01-23 11:50:21.348883] I [MSGID: 106568] [glusterd-svc-mgmt.c:235:glusterd_svc_stop] 0-management: scrub service is stopped [2019-01-23 11:50:22.356502] I [run.c:241:runner_log] (-->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2b3a) [0x7fca9e139b3a] -->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2605) [0x7fca9e139605] -->/lib64/libglusterfs.so.0(runner_log+0x115) [0x7fcaa346f0e5] ) 0-management: Ran script: 
/var/lib/glusterd/hooks/1/start/post/S29CTDBsetup.sh --volname=vol_3442e86b6d994a14de73f1b8c82cf0b8 --first=no --version=1 --volume-op=start --gd-workdir=/var/lib/glusterd [2019-01-23 11:50:22.368845] E [run.c:241:runner_log] (-->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2b3a) [0x7fca9e139b3a] -->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2563) [0x7fca9e139563] -->/lib64/libglusterfs.so.0(runner_log+0x115) [0x7fcaa346f0e5] ) 0-management: Failed to execute script: /var/lib/glusterd/hooks/1/start/post/S30samba-start.sh --volname=vol_3442e86b6d994a14de73f1b8c82cf0b8 --first=no --version=1 --volume-op=start --gd-workdir=/var/lib/glusterd sh-4.2# gluster volume status vol_3442e86b6d994a14de73f1b8c82cf0b8 Status of volume: vol_3442e86b6d994a14de73f1b8c82cf0b8 Gluster process TCP Port RDMA Port Online Pid ------------------------------------------------------------------------------ Brick 192.168.3.6:/var/lib/heketi/mounts/vg _ca57f326195c243be2380ce4e42a4191/brick_952 d75fd193c7209c9a81acbc23a3747/brick 49157 0 Y 250 Brick 192.168.3.5:/var/lib/heketi/mounts/vg _d5f17487744584e3652d3ca943b0b91b/brick_e15 c12cceae12c8ab7782dd57cf5b6c1/brick N/A N/A N N/A Brick 192.168.3.15:/var/lib/heketi/mounts/v g_462ea199185376b03e4b0317363bb88c/brick_17 36459d19e8aaa1dcb5a87f48747d04/brick 49173 0 Y 225 Self-heal Daemon on localhost N/A N/A Y 109550 Self-heal Daemon on 192.168.3.6 N/A N/A Y 52557 Self-heal Daemon on 192.168.3.15 N/A N/A Y 16946 Task Status of Volume vol_3442e86b6d994a14de73f1b8c82cf0b8 ------------------------------------------------------------------------------ There are no active volume tasks From: "Sanju Rakonde" To: "Shaik Salam" Cc: "Amar Tumballi Suryanarayan" , "gluster-users at gluster.org List" , "Murali Kottakota" Date: 01/23/2019 02:15 PM Subject: Re: [Gluster-users] [Bugs] Bricks are going offline unable to recover with heal/start force commands "External email. Open with Caution" Hi Shaik, I can see below errors in glusterd logs. 
[2019-01-22 09:20:17.540196] E [MSGID: 101012] [common-utils.c:4010:gf_is_service_running] 0-: Unable to read pidfile: /var/run/gluster/vols/vol_e1aa1283d5917485d88c4a742eeff422/192.168.3.6-var-lib-heketi-mounts-vg_526f35058433c6b03130bba4e0a7dd87-brick_9e7c382e5f853d471c347bc5590359af-brick.pid [2019-01-22 09:20:17.546408] E [MSGID: 101012] [common-utils.c:4010:gf_is_service_running] 0-: Unable to read pidfile: /var/run/gluster/vols/vol_f0ed498d7e781d7bb896244175b31f9e/192.168.3.6-var-lib-heketi-mounts-vg_56391bec3c8bfe4fc116de7bddfc2af4-brick_47ed9e0663ad0f6f676ddd6ad7e3dcde-brick.pid [2019-01-22 09:20:17.552575] E [MSGID: 101012] [common-utils.c:4010:gf_is_service_running] 0-: Unable to read pidfile: /var/run/gluster/vols/vol_f387519c9b004ec14e80696db88ef0f8/192.168.3.6-var-lib-heketi-mounts-vg_56391bec3c8bfe4fc116de7bddfc2af4-brick_06ad6c73dfbf6a5fc21334f98c9973c2-brick.pid [2019-01-22 09:20:17.558888] E [MSGID: 101012] [common-utils.c:4010:gf_is_service_running] 0-: Unable to read pidfile: /var/run/gluster/vols/vol_f8ca343c60e6efe541fe02d16ca02a7d/192.168.3.6-var-lib-heketi-mounts-vg_526f35058433c6b03130bba4e0a7dd87-brick_525225f65753b05dfe33aeaeb9c5de39-brick.pid [2019-01-22 09:20:17.565266] E [MSGID: 101012] [common-utils.c:4010:gf_is_service_running] 0-: Unable to read pidfile: /var/run/gluster/vols/vol_fe882e074c0512fd9271fc2ff5a0bfe1/192.168.3.6-var-lib-heketi-mounts-vg_28708570b029e5eff0a996c453a11691-brick_d4f30d6e465a8544b759a7016fb5aab5-brick.pid [2019-01-22 09:20:17.585926] E [MSGID: 106028] [glusterd-utils.c:8222:glusterd_brick_signal] 0-glusterd: Unable to get pid of brick process [2019-01-22 09:20:17.617806] E [MSGID: 106028] [glusterd-utils.c:8222:glusterd_brick_signal] 0-glusterd: Unable to get pid of brick process [2019-01-22 09:20:17.649628] E [MSGID: 101012] [common-utils.c:4010:gf_is_service_running] 0-: Unable to read pidfile: /var/run/gluster/glustershd/glustershd.pid [2019-01-22 09:20:17.649700] E [MSGID: 101012] 
[common-utils.c:4010:gf_is_service_running] 0-: Unable to read pidfile: /var/run/gluster/glustershd/glustershd.pid

So it looks like neither gf_is_service_running() nor glusterd_brick_signal() is able to read the pidfile, which suggests the pidfiles may be empty. Can you please paste the contents of the brick pidfiles? You can find the brick pidfiles in /var/run/gluster/vols// or you can just run this command:

"for i in `ls /var/run/gluster/vols/*/*.pid`;do echo $i;cat $i;done"

On Wed, Jan 23, 2019 at 12:49 PM Shaik Salam wrote:

Hi Sanju,

Please find the requested information in the attached logs. The brick below is offline; I tried the start force and heal commands, but it does not come up.

sh-4.2#
sh-4.2# gluster --version
glusterfs 4.1.5
sh-4.2# gluster volume status vol_3442e86b6d994a14de73f1b8c82cf0b8
Status of volume: vol_3442e86b6d994a14de73f1b8c82cf0b8
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 192.168.3.6:/var/lib/heketi/mounts/vg
_ca57f326195c243be2380ce4e42a4191/brick_952
d75fd193c7209c9a81acbc23a3747/brick         49166     0          Y       269
Brick 192.168.3.5:/var/lib/heketi/mounts/vg
_d5f17487744584e3652d3ca943b0b91b/brick_e15
c12cceae12c8ab7782dd57cf5b6c1/brick         N/A       N/A        N       N/A
Brick 192.168.3.15:/var/lib/heketi/mounts/v
g_462ea199185376b03e4b0317363bb88c/brick_17
36459d19e8aaa1dcb5a87f48747d04/brick        49173     0          Y       225
Self-heal Daemon on localhost               N/A       N/A        Y       45826
Self-heal Daemon on 192.168.3.6             N/A       N/A        Y       65196
Self-heal Daemon on 192.168.3.15            N/A       N/A        Y       52915

Task Status of Volume vol_3442e86b6d994a14de73f1b8c82cf0b8
------------------------------------------------------------------------------

We can see the following events when we force-start the volume:

/mgmt/glusterd.so(+0xe2b3a) [0x7fca9e139b3a] -->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2605) [0x7fca9e139605] -->/lib64/libglusterfs.so.0(runner_log+0x115) [0x7fcaa346f0e5] ) 0-management: Ran script:
/var/lib/glusterd/hooks/1/start/post/S29CTDBsetup.sh --volname=vol_3442e86b6d994a14de73f1b8c82cf0b8 --first=no --version=1 --volume-op=start --gd-workdir=/var/lib/glusterd [2019-01-21 08:22:34.555068] E [run.c:241:runner_log] (-->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2b3a) [0x7fca9e139b3a] -->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2563) [0x7fca9e139563] -->/lib64/libglusterfs.so.0(runner_log+0x115) [0x7fcaa346f0e5] ) 0-management: Failed to execute script: /var/lib/glusterd/hooks/1/start/post/S30samba-start.sh --volname=vol_3442e86b6d994a14de73f1b8c82cf0b8 --first=no --version=1 --volume-op=start --gd-workdir=/var/lib/glusterd [2019-01-21 08:22:53.389049] I [MSGID: 106499] [glusterd-handler.c:4314:__glusterd_handle_status_volume] 0-management: Received status volume req for volume vol_3442e86b6d994a14de73f1b8c82cf0b8 [2019-01-21 08:23:25.346839] I [MSGID: 106487] [glusterd-handler.c:1486:__glusterd_handle_cli_list_friends] 0-glusterd: Received cli list req We can see following events from when we heal volumes. 
[2019-01-21 08:20:07.576070] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-glusterfs: error returned while attempting to connect to host:(null), port:0 [2019-01-21 08:20:07.580225] I [cli-rpc-ops.c:9182:gf_cli_heal_volume_cbk] 0-cli: Received resp to heal volume [2019-01-21 08:20:07.580326] I [input.c:31:cli_batch] 0-: Exiting with: -1 [2019-01-21 08:22:30.423311] I [cli.c:768:main] 0-cli: Started running gluster with version 4.1.5 [2019-01-21 08:22:30.463648] I [MSGID: 101190] [event-epoll.c:617:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1 [2019-01-21 08:22:30.463718] I [socket.c:2632:socket_event_handler] 0-transport: EPOLLERR - disconnecting now [2019-01-21 08:22:30.463859] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-glusterfs: error returned while attempting to connect to host:(null), port:0 [2019-01-21 08:22:33.427710] I [socket.c:2632:socket_event_handler] 0-transport: EPOLLERR - disconnecting now [2019-01-21 08:22:34.581555] I [cli-rpc-ops.c:1472:gf_cli_start_volume_cbk] 0-cli: Received resp to start volume [2019-01-21 08:22:34.581678] I [input.c:31:cli_batch] 0-: Exiting with: 0 [2019-01-21 08:22:53.345351] I [cli.c:768:main] 0-cli: Started running gluster with version 4.1.5 [2019-01-21 08:22:53.387992] I [MSGID: 101190] [event-epoll.c:617:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1 [2019-01-21 08:22:53.388059] I [socket.c:2632:socket_event_handler] 0-transport: EPOLLERR - disconnecting now [2019-01-21 08:22:53.388138] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-glusterfs: error returned while attempting to connect to host:(null), port:0 [2019-01-21 08:22:53.394737] I [input.c:31:cli_batch] 0-: Exiting with: 0 [2019-01-21 08:23:25.304688] I [cli.c:768:main] 0-cli: Started running gluster with version 4.1.5 [2019-01-21 08:23:25.346319] I [MSGID: 101190] [event-epoll.c:617:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1 [2019-01-21 08:23:25.346389] I [socket.c:2632:socket_event_handler] 0-transport: EPOLLERR - 
disconnecting now
[2019-01-21 08:23:25.346500] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-glusterfs: error returned while attempting to connect to host:(null), port:0

I enabled DEBUG mode at the brick level, but nothing is being written to the brick log.

gluster volume set vol_3442e86b6d994a14de73f1b8c82cf0b8 diagnostics.brick-log-level DEBUG
sh-4.2# pwd
/var/log/glusterfs/bricks
sh-4.2# ls -la |grep brick_e15c12cceae12c8ab7782dd57cf5b6c1
-rw-------. 1 root root 0 Jan 20 02:46 var-lib-heketi-mounts-vg_d5f17487744584e3652d3ca943b0b91b-brick_e15c12cceae12c8ab7782dd57cf5b6c1-brick.log

From: Sanju Rakonde
To: Shaik Salam
Cc: Amar Tumballi Suryanarayan , " gluster-users at gluster.org List"
Date: 01/22/2019 02:21 PM
Subject: Re: [Gluster-users] [Bugs] Bricks are going offline unable to recover with heal/start force commands

"External email. Open with Caution"

Hi Shaik,

Can you please provide us with the complete glusterd and cmd_history logs from all the nodes in the cluster? Also, please paste the output of the following commands (from all nodes):

1. gluster --version
2. gluster volume info
3. gluster volume status
4. gluster peer status
5. ps -ax | grep glusterfsd

On Tue, Jan 22, 2019 at 12:47 PM Shaik Salam wrote:

Hi Surya,

This is already a customer setup and we cannot redeploy it. I enabled debug logging at the brick level, but nothing is being written to the log. Can you tell me whether there are any other ways to troubleshoot, or other logs to look at?

From: Shaik Salam/HYD/TCS
To: "Amar Tumballi Suryanarayan"
Cc: "gluster-users at gluster.org List"
Date: 01/22/2019 12:06 PM
Subject: Re: [Bugs] Bricks are going offline unable to recover with heal/start force commands

Hi Surya,

I have enabled DEBUG mode at the brick level, but nothing is being written to the brick log.

gluster volume set vol_3442e86b6d994a14de73f1b8c82cf0b8 diagnostics.brick-log-level DEBUG
sh-4.2# pwd
/var/log/glusterfs/bricks
sh-4.2# ls -la |grep brick_e15c12cceae12c8ab7782dd57cf5b6c1
-rw-------. 1 root root 0 Jan 20 02:46 var-lib-heketi-mounts-vg_d5f17487744584e3652d3ca943b0b91b-brick_e15c12cceae12c8ab7782dd57cf5b6c1-brick.log

BR
Salam

From: "Amar Tumballi Suryanarayan"
To: "Shaik Salam"
Cc: "gluster-users at gluster.org List"
Date: 01/22/2019 11:38 AM
Subject: Re: [Bugs] Bricks are going offline unable to recover with heal/start force commands

"External email. Open with Caution"

Hi Shaik,

Can you check what is in the brick logs? They are located in /var/log/glusterfs/bricks/*.

It looks like the samba hook script failed, but that shouldn't matter in this use case.

Also, I see that you are trying to set up heketi to provision volumes, which means you may be using gluster in container use cases. If you are still in the 'PoC' phase, can you give https://github.com/gluster/gcs a try? That makes the deployment and the stack a little simpler.

-Amar

On Tue, Jan 22, 2019 at 11:29 AM Shaik Salam wrote:

Can anyone tell me how to recover the bricks, other than with heal/start force, given the log events below? Please let me know if any other logs are required. Thanks in advance.

BR
Salam

From: Shaik Salam/HYD/TCS
To: bugs at gluster.org, gluster-users at gluster.org
Date: 01/21/2019 10:03 PM
Subject: Bricks are going offline unable to recover with heal/start force commands

Hi,

Bricks are offline and we are unable to recover them with the following commands:

gluster volume heal
gluster volume start force

The bricks are still offline.
sh-4.2# gluster volume status vol_3442e86b6d994a14de73f1b8c82cf0b8 Status of volume: vol_3442e86b6d994a14de73f1b8c82cf0b8 Gluster process TCP Port RDMA Port Online Pid ------------------------------------------------------------------------------ Brick 192.168.3.6:/var/lib/heketi/mounts/vg _ca57f326195c243be2380ce4e42a4191/brick_952 d75fd193c7209c9a81acbc23a3747/brick 49166 0 Y 269 Brick 192.168.3.5:/var/lib/heketi/mounts/vg _d5f17487744584e3652d3ca943b0b91b/brick_e15 c12cceae12c8ab7782dd57cf5b6c1/brick N/A N/A N N/A Brick 192.168.3.15:/var/lib/heketi/mounts/v g_462ea199185376b03e4b0317363bb88c/brick_17 36459d19e8aaa1dcb5a87f48747d04/brick 49173 0 Y 225 Self-heal Daemon on localhost N/A N/A Y 45826 Self-heal Daemon on 192.168.3.6 N/A N/A Y 65196 Self-heal Daemon on 192.168.3.15 N/A N/A Y 52915 Task Status of Volume vol_3442e86b6d994a14de73f1b8c82cf0b8 ------------------------------------------------------------------------------ We can see following events from when we start forcing volumes /mgmt/glusterd.so(+0xe2b3a) [0x7fca9e139b3a] -->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2605) [0x7fca9e139605] -->/lib64/libglusterfs.so.0(runner_log+0x115) [0x7fcaa346f0e5] ) 0-management: Ran script: /var/lib/glusterd/hooks/1/start/post/S29CTDBsetup.sh --volname=vol_3442e86b6d994a14de73f1b8c82cf0b8 --first=no --version=1 --volume-op=start --gd-workdir=/var/lib/glusterd [2019-01-21 08:22:34.555068] E [run.c:241:runner_log] (-->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2b3a) [0x7fca9e139b3a] -->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2563) [0x7fca9e139563] -->/lib64/libglusterfs.so.0(runner_log+0x115) [0x7fcaa346f0e5] ) 0-management: Failed to execute script: /var/lib/glusterd/hooks/1/start/post/S30samba-start.sh --volname=vol_3442e86b6d994a14de73f1b8c82cf0b8 --first=no --version=1 --volume-op=start --gd-workdir=/var/lib/glusterd [2019-01-21 08:22:53.389049] I [MSGID: 106499] [glusterd-handler.c:4314:__glusterd_handle_status_volume] 
0-management: Received status volume req for volume vol_3442e86b6d994a14de73f1b8c82cf0b8 [2019-01-21 08:23:25.346839] I [MSGID: 106487] [glusterd-handler.c:1486:__glusterd_handle_cli_list_friends] 0-glusterd: Received cli list req We can see following events from when we heal volumes. [2019-01-21 08:20:07.576070] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-glusterfs: error returned while attempting to connect to host:(null), port:0 [2019-01-21 08:20:07.580225] I [cli-rpc-ops.c:9182:gf_cli_heal_volume_cbk] 0-cli: Received resp to heal volume [2019-01-21 08:20:07.580326] I [input.c:31:cli_batch] 0-: Exiting with: -1 [2019-01-21 08:22:30.423311] I [cli.c:768:main] 0-cli: Started running gluster with version 4.1.5 [2019-01-21 08:22:30.463648] I [MSGID: 101190] [event-epoll.c:617:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1 [2019-01-21 08:22:30.463718] I [socket.c:2632:socket_event_handler] 0-transport: EPOLLERR - disconnecting now [2019-01-21 08:22:30.463859] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-glusterfs: error returned while attempting to connect to host:(null), port:0 [2019-01-21 08:22:33.427710] I [socket.c:2632:socket_event_handler] 0-transport: EPOLLERR - disconnecting now [2019-01-21 08:22:34.581555] I [cli-rpc-ops.c:1472:gf_cli_start_volume_cbk] 0-cli: Received resp to start volume [2019-01-21 08:22:34.581678] I [input.c:31:cli_batch] 0-: Exiting with: 0 [2019-01-21 08:22:53.345351] I [cli.c:768:main] 0-cli: Started running gluster with version 4.1.5 [2019-01-21 08:22:53.387992] I [MSGID: 101190] [event-epoll.c:617:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1 [2019-01-21 08:22:53.388059] I [socket.c:2632:socket_event_handler] 0-transport: EPOLLERR - disconnecting now [2019-01-21 08:22:53.388138] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-glusterfs: error returned while attempting to connect to host:(null), port:0 [2019-01-21 08:22:53.394737] I [input.c:31:cli_batch] 0-: Exiting with: 0 [2019-01-21 08:23:25.304688] I 
[cli.c:768:main] 0-cli: Started running gluster with version 4.1.5 [2019-01-21 08:23:25.346319] I [MSGID: 101190] [event-epoll.c:617:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1 [2019-01-21 08:23:25.346389] I [socket.c:2632:socket_event_handler] 0-transport: EPOLLERR - disconnecting now [2019-01-21 08:23:25.346500] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-glusterfs: error returned while attempting to connect to host:(null), port:0 Please let us know steps to recover bricks. BR Salam -- Amar Tumballi (amarts) -- Thanks, Sanju -- Thanks, Sanju -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: firsnode_brick.log Type: application/octet-stream Size: 5625 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: Secondnode_brick.log Type: application/octet-stream Size: 30409 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed...
Name: Thirdnode_brick.log
Type: application/octet-stream
Size: 47635 bytes
Desc: not available
URL:

From shaik.salam at tcs.com Wed Jan 23 12:42:32 2019
From: shaik.salam at tcs.com (Shaik Salam)
Date: Wed, 23 Jan 2019 18:12:32 +0530
Subject: [Gluster-users] Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: create volume err: error creating volume
In-Reply-To:
References:
Message-ID:

Hi,

We are also facing the following issue on OpenShift Origin while creating PVCs for pods.

Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: create volume err: error creating volume
Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: create volume err: error creating volume Server busy. Retry operation later..

Please find the heketi DB dump and log:

[negroni] Completed 429 Too Many Requests in 250.763µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:07:49 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 169.08µs
[negroni] Started DELETE /volumes/520bc5f4e1bfd029855a72f9ca7ebf6c
[negroni] Completed 404 Not Found in 148.125µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:08:04 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 496.624µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:08:04 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 101.673µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:08:19 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 209.681µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:08:19 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 103.595µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:08:34 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 297.594µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:08:34 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 96.75µs
[negroni] Started POST /volumes
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:08:49 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 477.007µs
[heketi] WARNING 2019/01/23 12:08:49 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 165.38µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:09:04 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 488.253µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:09:04 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 171.836µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:09:19 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 208.59µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:09:19 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 125.141µs
[negroni] Started DELETE /volumes/99e87ecd0a816ac34ae5a04eabc1d606
[negroni] Completed 404 Not Found in 138.687µs
[negroni] Started POST /volumes

BR
Salam

From: Shaik Salam/HYD/TCS
To: bugs at gluster.org, gluster-users at gluster.org
Date: 01/21/2019 09:57 PM
Subject: Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: create volume err: error creating volume

Hi,

We are also facing a similar issue on OpenShift Origin while creating PVCs for pods.

Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: create volume err: error creating volume
Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: create volume err: error creating volume Server busy. Retry operation later..

Heketi looks fine.
[negroni] Completed 200 OK in 116.41µs
[negroni] Started GET /queue/756488c7baccc2a64252b1a82b2c70b3
[negroni] Completed 200 OK in 124.552µs
[negroni] Started GET /queue/756488c7baccc2a64252b1a82b2c70b3
[negroni] Completed 200 OK in 128.632µs
[negroni] Started GET /queue/756488c7baccc2a64252b1a82b2c70b3
[negroni] Completed 200 OK in 134.856µs
[negroni] Started GET /queue/756488c7baccc2a64252b1a82b2c70b3
[negroni] Completed 200 OK in 123.378µs
[negroni] Started GET /queue/756488c7baccc2a64252b1a82b2c70b3
[negroni] Completed 200 OK in 134.202µs
[negroni] Started GET /queue/756488c7baccc2a64252b1a82b2c70b3
[negroni] Completed 200 OK in 120.114µs
[negroni] Started GET /queue/756488c7baccc2a64252b1a82b2c70b3
[negroni] Completed 200 OK in 141.04µs
[negroni] Started GET /queue/756488c7baccc2a64252b1a82b2c70b3
[negroni] Completed 200 OK in 122.628µs
[negroni] Started GET /queue/756488c7baccc2a64252b1a82b2c70b3
[negroni] Completed 200 OK in 150.651µs
[negroni] Started GET /queue/756488c7baccc2a64252b1a82b2c70b3
[negroni] Completed 200 OK in 116.978µs
[negroni] Started GET /queue/756488c7baccc2a64252b1a82b2c70b3
[negroni] Completed 200 OK in 110.189µs
[negroni] Started GET /queue/756488c7baccc2a64252b1a82b2c70b3
[negroni] Started GET /queue/756488c7baccc2a64252b1a82b2c70b3
[negroni] Completed 200 OK in 226.655µs
[negroni] Started GET /queue/756488c7baccc2a64252b1a82b2c70b3
[negroni] Completed 200 OK in 129.487µs
[negroni] Started GET /queue/756488c7baccc2a64252b1a82b2c70b3
[negroni] Completed 200 OK in 116.809µs
[negroni] Started GET /queue/756488c7baccc2a64252b1a82b2c70b3
[negroni] Completed 200 OK in 118.697µs
[negroni] Started GET /queue/756488c7baccc2a64252b1a82b2c70b3
[negroni] Completed 200 OK in 112.947µs
[negroni] Started GET /queue/756488c7baccc2a64252b1a82b2c70b3
[negroni] Completed 200 OK in 134.569µs
[negroni] Started GET /queue/756488c7baccc2a64252b1a82b2c70b3
[negroni] Completed 200 OK in 119.018µs

BR
Salam
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: heketi-gluster.db.txt
URL:
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: heketi.log.txt
URL:

From shaik.salam at tcs.com Wed Jan 23 12:49:20 2019
From: shaik.salam at tcs.com (Shaik Salam)
Date: Wed, 23 Jan 2019 18:19:20 +0530
Subject: [Gluster-users] Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: create volume err: error creating volume
Message-ID:

Hi,

We are also facing the following issue on OpenShift Origin while creating PVCs for pods.

Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: create volume err: error creating volume
Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: create volume err: error creating volume Server busy. Retry operation later..
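
One way to check whether the "Server busy" failures line up with heketi throttling is to tally the response codes and in-flight warnings in the heketi log included with this message. The following is only a minimal sketch in Python, assuming the log has been saved as plain text; the `summarize` helper and both regular expressions are illustrative names written for this example and are not part of heketi or negroni. The sample lines are copied from the log in this message.

```python
import re
from collections import Counter

# Matches negroni completion lines, e.g.
# "[negroni] Completed 429 Too Many Requests in 169.08µs"
COMPLETED = re.compile(r"\[negroni\] Completed (\d{3}) ")

# Matches heketi throttling warnings, e.g.
# "operations in-flight (8) exceeds limit (8)"
THROTTLED = re.compile(r"operations in-flight \((\d+)\) exceeds limit \((\d+)\)")


def summarize(log_text):
    """Return (Counter of HTTP status codes, number of throttle warnings).

    Works on flattened text as well, because the patterns do not
    depend on line boundaries.
    """
    statuses = Counter(COMPLETED.findall(log_text))
    throttled = len(THROTTLED.findall(log_text))
    return statuses, throttled


# Hypothetical excerpt for illustration; in practice read the saved log file.
sample = """\
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:07:49 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 169.08µs
[negroni] Started DELETE /volumes/520bc5f4e1bfd029855a72f9ca7ebf6c
[negroni] Completed 404 Not Found in 148.125µs
"""

statuses, throttled = summarize(sample)
print(dict(statuses), throttled)
```

If nearly every POST /volumes is answered with 429 and each is paired with an in-flight warning, that would be consistent with heketi rejecting new requests because earlier operations have not completed, rather than with a problem in the individual create requests.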
Please find the heketi DB dump and log.

[negroni] Completed 429 Too Many Requests in 250.763µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:07:49 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 169.08µs
[negroni] Started DELETE /volumes/520bc5f4e1bfd029855a72f9ca7ebf6c
[negroni] Completed 404 Not Found in 148.125µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:08:04 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 496.624µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:08:04 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 101.673µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:08:19 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 209.681µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:08:19 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 103.595µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:08:34 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 297.594µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:08:34 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 96.75µs
[negroni] Started POST /volumes
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:08:49 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 477.007µs
[heketi] WARNING 2019/01/23 12:08:49 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 165.38µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:09:04 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 488.253µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:09:04 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 171.836µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:09:19 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 208.59µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:09:19 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 125.141µs
[negroni] Started DELETE /volumes/99e87ecd0a816ac34ae5a04eabc1d606
[negroni] Completed 404 Not Found in 138.687µs
[negroni] Started POST /volumes

BR
Salam
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: heketi-gluster.db.txt
URL:
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: heketi.log.txt
URL:

From meira at cesup.ufrgs.br Wed Jan 23 14:24:06 2019
From: meira at cesup.ufrgs.br (Lindolfo Meira)
Date: Wed, 23 Jan 2019 12:24:06 -0200 (-02)
Subject: [Gluster-users] writev: Transport endpoint is not connected
In-Reply-To:
References:
Message-ID:

Does this remark have anything to do with the problem I'm talking about?
Because I took the time to recreate the volume, changing its type and enabling shard, and the problem persists :/


Lindolfo Meira, MSc
Diretor Geral, Centro Nacional de Supercomputação
Universidade Federal do Rio Grande do Sul
+55 (51) 3308-3139

On Wed, 23 Jan 2019, Raghavendra Gowdappa wrote:

> On Wed, Jan 23, 2019 at 1:59 AM Lindolfo Meira wrote:
>
> > Dear all,
> >
> > I've been trying to benchmark a gluster file system using the MPIIO API of
> > IOR. Almost all of the times I try to run the application with more than 6
> > tasks performing I/O (mpirun -n N, for N > 6) I get the error: "writev:
> > Transport endpoint is not connected". And then each one of the N tasks
> > returns "ERROR: cannot open file to get file size, MPI MPI_ERR_FILE:
> > invalid file, (aiori-MPIIO.c:488)".
> >
> > Does anyone have any idea what's going on?
> >
> > I'm writing from a single node, to a system configured for stripe over 6
> > bricks. The volume is mounted with the options _netdev and transport=rdma.
> > I'm using OpenMPI 2.1.2 (I tested version 4.0.0 and nothing changed). IOR
> > arguments used: -B -E -F -q -w -k -z -i=1 -t=2m -b=1g -a=MPIIO. Running
> > OpenSUSE Leap 15.0 and GlusterFS 5.3. Output of "gluster volume info"
> > follows below:
> >
> > Volume Name: gfs
> > Type: Stripe
> >
>
> +Dhananjay, Krutika
> stripe has been deprecated. You can use sharded volumes.
>
>
> > Volume ID: ea159033-5f7f-40ac-bad0-6f46613a336b
> > Status: Started
> > Snapshot Count: 0
> > Number of Bricks: 1 x 6 = 6
> > Transport-type: rdma
> > Bricks:
> > Brick1: pfs01-ib:/mnt/data/gfs
> > Brick2: pfs02-ib:/mnt/data/gfs
> > Brick3: pfs03-ib:/mnt/data/gfs
> > Brick4: pfs04-ib:/mnt/data/gfs
> > Brick5: pfs05-ib:/mnt/data/gfs
> > Brick6: pfs06-ib:/mnt/data/gfs
> > Options Reconfigured:
> > nfs.disable: on
> >
> >
> > Thanks in advance,
> >
> > Lindolfo Meira, MSc
> > Diretor Geral, Centro Nacional de Supercomputação
> > Universidade Federal do Rio Grande do Sul
> > +55 (51) 3308-3139
> > _______________________________________________
> > Gluster-users mailing list
> > Gluster-users at gluster.org
> > https://lists.gluster.org/mailman/listinfo/gluster-users
>

From atumball at redhat.com Wed Jan 23 14:28:43 2019
From: atumball at redhat.com (Amar Tumballi Suryanarayan)
Date: Wed, 23 Jan 2019 19:58:43 +0530
Subject: [Gluster-users] writev: Transport endpoint is not connected
In-Reply-To:
References:
Message-ID:

Hi Lindolfo,

Can you now share the 'gluster volume info' from your setup?

Please note some basic documentation on shard is available @
https://staged-gluster-docs.readthedocs.io/en/release3.7.0beta1/Features/shard/

-Amar

On Wed, Jan 23, 2019 at 7:55 PM Lindolfo Meira wrote:

> Does this remark have anything to do with the problem I'm talking about?
> Because I took the time to recreate the volume, changing its type and
> enabling shard, and the problem persists :/
>
>
> Lindolfo Meira, MSc
> Diretor Geral, Centro Nacional de Supercomputação
> Universidade Federal do Rio Grande do Sul
> +55 (51) 3308-3139
>
> On Wed, 23 Jan 2019, Raghavendra Gowdappa wrote:
>
> > On Wed, Jan 23, 2019 at 1:59 AM Lindolfo Meira wrote:
> >
> > > Dear all,
> > >
> > > I've been trying to benchmark a gluster file system using the MPIIO API of
> > > IOR. Almost all of the times I try to run the application with more than 6
> > > tasks performing I/O (mpirun -n N, for N > 6) I get the error: "writev:
> > > Transport endpoint is not connected". And then each one of the N tasks
> > > returns "ERROR: cannot open file to get file size, MPI MPI_ERR_FILE:
> > > invalid file, (aiori-MPIIO.c:488)".
> > >
> > > Does anyone have any idea what's going on?
> > >
> > > I'm writing from a single node, to a system configured for stripe over 6
> > > bricks. The volume is mounted with the options _netdev and transport=rdma.
> > > I'm using OpenMPI 2.1.2 (I tested version 4.0.0 and nothing changed). IOR
> > > arguments used: -B -E -F -q -w -k -z -i=1 -t=2m -b=1g -a=MPIIO. Running
> > > OpenSUSE Leap 15.0 and GlusterFS 5.3. Output of "gluster volume info"
> > > follows below:
> > >
> > > Volume Name: gfs
> > > Type: Stripe
> > >
> >
> > +Dhananjay, Krutika
> > stripe has been deprecated. You can use sharded volumes.
> >
> >
> > > Volume ID: ea159033-5f7f-40ac-bad0-6f46613a336b
> > > Status: Started
> > > Snapshot Count: 0
> > > Number of Bricks: 1 x 6 = 6
> > > Transport-type: rdma
> > > Bricks:
> > > Brick1: pfs01-ib:/mnt/data/gfs
> > > Brick2: pfs02-ib:/mnt/data/gfs
> > > Brick3: pfs03-ib:/mnt/data/gfs
> > > Brick4: pfs04-ib:/mnt/data/gfs
> > > Brick5: pfs05-ib:/mnt/data/gfs
> > > Brick6: pfs06-ib:/mnt/data/gfs
> > > Options Reconfigured:
> > > nfs.disable: on
> > >
> > >
> > > Thanks in advance,
> > >
> > > Lindolfo Meira, MSc
> > > Diretor Geral, Centro Nacional de Supercomputação
> > > Universidade Federal do Rio Grande do Sul
> > > +55 (51) 3308-3139
> > > _______________________________________________
> > > Gluster-users mailing list
> > > Gluster-users at gluster.org
> > > https://lists.gluster.org/mailman/listinfo/gluster-users
> >
> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-users

--
Amar Tumballi (amarts)

From hunter86_bg at yahoo.com Wed Jan 23 14:34:08 2019
From: hunter86_bg at yahoo.com (Strahil Nikolov)
Date: Wed, 23 Jan 2019 14:34:08 +0000 (UTC)
Subject: [Gluster-users] Re: Performance issue, need guidance
In-Reply-To:
References:
Message-ID: <8417167.2335576.1548254048384@mail.yahoo.com>

Dear Community,

I'm quite puzzled with this situation.

1. Added 'server.allow-insecure on' (with 'option rpc-auth-allow-insecure on' in glusterd.vol)

Volume info after that change:

Volume Name: data
Type: Replicate
Volume ID: 9b06a1e9-8102-4cd7-bc56-84960a1efaa2
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp
Bricks:
Brick1: ovirt1.localdomain:/gluster_bricks/data/data
Brick2: ovirt2.localdomain:/gluster_bricks/data/data
Brick3: ovirt3.localdomain:/gluster_bricks/data/data (arbiter)
Options Reconfigured:
performance.client-io-threads: off
nfs.disable: on
transport.address-family: inet
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.low-prio-threads: 32
network.remote-dio: off
cluster.eager-lock: enable
cluster.quorum-type: auto
cluster.server-quorum-type: server
cluster.data-self-heal-algorithm: full
cluster.locking-scheme: granular
cluster.shd-max-threads: 8
cluster.shd-wait-qlength: 10000
features.shard: on
user.cifs: off
storage.owner-uid: 36
storage.owner-gid: 36
network.ping-timeout: 30
performance.strict-o-direct: on
cluster.granular-entry-heal: enable
server.allow-insecure: on

Seems no positive or negative effect so far.

2. Tested with tmpfs on all bricks -> ovirt1 mounted gluster volume -> max 60MB/s (bs=1M without 'oflag=direct')

[root@ovirt1 data]# dd if=/dev/zero of=large_io bs=1M count=4000 status=progress
4177526784 bytes (4.2 GB) copied, 70.843409 s, 59.0 MB/s
4000+0 records in
4000+0 records out
4194304000 bytes (4.2 GB) copied, 71.1407 s, 59.0 MB/s
[root@ovirt1 data]# rm -f large_io
[root@ovirt1 data]# gluster volume profile data info

Brick: ovirt1.localdomain:/gluster_bricks/data/data
---------------------------------------------------
Cumulative Stats:
   Block Size:             131072b+
 No. of Reads:                    8
No. of Writes:                44968

%-latency   Avg-latency  Min-Latency   Max-Latency  No. of calls  Fop
---------   -----------  -----------   -----------  ------------  ----
     0.00       0.00 us      0.00 us       0.00 us             3  FORGET
     0.00       0.00 us      0.00 us       0.00 us            35  RELEASE
     0.00       0.00 us      0.00 us       0.00 us            28  RELEASEDIR
     0.00      78.00 us     78.00 us      78.00 us             1  FSTAT
     0.00      35.67 us     26.00 us      73.00 us             6  FLUSH
     0.00     324.00 us    324.00 us     324.00 us             1  XATTROP
     0.00      45.80 us     38.00 us      54.00 us            10  STAT
     0.00     227.67 us    216.00 us     242.00 us             3  CREATE
     0.00     113.38 us     68.00 us     381.00 us             8  READ
     0.00      39.82 us      1.00 us     148.00 us            28  OPENDIR
     0.00      67.54 us     10.00 us     283.00 us            24  GETXATTR
     0.00      59.97 us     45.00 us     113.00 us            32  OPEN
     0.00      24.41 us     13.00 us      89.00 us           161  INODELK
     0.00      43.43 us     28.00 us     214.00 us            93  STATFS
     0.00     246.35 us     11.00 us    1155.00 us            20  READDIR
     0.00     283.00 us    233.00 us     353.00 us            18  READDIRP
     0.00     153.23 us    122.00 us     259.00 us            87  MKNOD
     0.01      99.77 us     10.00 us     258.00 us           442  LOOKUP
     0.31      49.22 us     27.00 us     540.00 us         45620  FXATTROP
     0.77     124.24 us     87.00 us     604.00 us         44968  WRITE
     0.93   15767.71 us     15.00 us  305833.00 us           431  ENTRYLK
     1.99  160711.39 us   3332.00 us  406037.00 us            90  UNLINK
    96.00    5167.82 us     18.00 us   55972.00 us        135349  FINODELK

    Duration: 380 seconds
   Data Read: 1048576 bytes
Data Written: 5894045696 bytes

Interval 0 Stats:
   Block Size:             131072b+
 No. of Reads:                    8
No. of Writes:                44968

%-latency   Avg-latency  Min-Latency   Max-Latency  No. of calls  Fop
---------   -----------  -----------   -----------  ------------  ----
     0.00       0.00 us      0.00 us       0.00 us             3  FORGET
     0.00       0.00 us      0.00 us       0.00 us            35  RELEASE
     0.00       0.00 us      0.00 us       0.00 us            28  RELEASEDIR
     0.00      78.00 us     78.00 us      78.00 us             1  FSTAT
     0.00      35.67 us     26.00 us      73.00 us             6  FLUSH
     0.00     324.00 us    324.00 us     324.00 us             1  XATTROP
     0.00      45.80 us     38.00 us      54.00 us            10  STAT
     0.00     227.67 us    216.00 us     242.00 us             3  CREATE
     0.00     113.38 us     68.00 us     381.00 us             8  READ
     0.00      39.82 us      1.00 us     148.00 us            28  OPENDIR
     0.00      67.54 us     10.00 us     283.00 us            24  GETXATTR
     0.00      59.97 us     45.00 us     113.00 us            32  OPEN
     0.00      24.41 us     13.00 us      89.00 us           161  INODELK
     0.00      43.43 us     28.00 us     214.00 us            93  STATFS
     0.00     246.35 us     11.00 us    1155.00 us            20  READDIR
     0.00     283.00 us    233.00 us     353.00 us            18  READDIRP
     0.00     153.23 us    122.00 us     259.00 us            87  MKNOD
     0.01      99.77 us     10.00 us     258.00 us           442  LOOKUP
     0.31      49.22 us     27.00 us     540.00 us         45620  FXATTROP
     0.77     124.24 us     87.00 us     604.00 us         44968  WRITE
     0.93   15767.71 us     15.00 us  305833.00 us           431  ENTRYLK
     1.99  160711.39 us   3332.00 us  406037.00 us            90  UNLINK
    96.00    5167.82 us     18.00 us   55972.00 us        135349  FINODELK

    Duration: 380 seconds
   Data Read: 1048576 bytes
Data Written: 5894045696 bytes

Brick: ovirt3.localdomain:/gluster_bricks/data/data
---------------------------------------------------
Cumulative Stats:
   Block Size:                  1b+
 No. of Reads:                    0
No. of Writes:                39328

%-latency   Avg-latency  Min-Latency   Max-Latency  No. of calls  Fop
---------   -----------  -----------   -----------  ------------  ----
     0.00       0.00 us      0.00 us       0.00 us             2  FORGET
     0.00       0.00 us      0.00 us       0.00 us            12  RELEASE
     0.00       0.00 us      0.00 us       0.00 us            17  RELEASEDIR
     0.00     101.00 us    101.00 us     101.00 us             1  FSTAT
     0.00      51.50 us     20.00 us      81.00 us             4  FLUSH
     0.01     219.50 us    188.00 us     251.00 us             2  CREATE
     0.01      43.45 us     11.00 us      90.00 us            11  GETXATTR
     0.01      62.30 us     38.00 us     119.00 us            10  OPEN
     0.01      50.59 us      1.00 us     102.00 us            17  OPENDIR
     0.01      24.60 us     12.00 us      64.00 us            40  INODELK
     0.02     176.30 us     10.00 us     765.00 us            10  READDIR
     0.07      63.08 us     39.00 us     133.00 us            78  UNLINK
     0.13      27.35 us     10.00 us      91.00 us           333  ENTRYLK
     0.13     126.89 us     99.00 us     179.00 us            76  MKNOD
     0.42     116.70 us      8.00 us    8661.00 us           261  LOOKUP
    28.73      51.79 us     22.00 us    2574.00 us         39822  FXATTROP
    29.52      53.87 us     16.00 us    3290.00 us         39328  WRITE
    40.92      24.71 us     10.00 us    3224.00 us        118864  FINODELK

    Duration: 189 seconds
   Data Read: 0 bytes
Data Written: 39328 bytes

Interval 0 Stats:
   Block Size:                  1b+
 No. of Reads:                    0
No. of Writes:                39328

%-latency   Avg-latency  Min-Latency   Max-Latency  No. of calls  Fop
---------   -----------  -----------   -----------  ------------  ----
     0.00       0.00 us      0.00 us       0.00 us             2  FORGET
     0.00       0.00 us      0.00 us       0.00 us            12  RELEASE
     0.00       0.00 us      0.00 us       0.00 us            17  RELEASEDIR
     0.00     101.00 us    101.00 us     101.00 us             1  FSTAT
     0.00      51.50 us     20.00 us      81.00 us             4  FLUSH
     0.01     219.50 us    188.00 us     251.00 us             2  CREATE
     0.01      43.45 us     11.00 us      90.00 us            11  GETXATTR
     0.01      62.30 us     38.00 us     119.00 us            10  OPEN
     0.01      50.59 us      1.00 us     102.00 us            17  OPENDIR
     0.01      24.60 us     12.00 us      64.00 us            40  INODELK
     0.02     176.30 us     10.00 us     765.00 us            10  READDIR
     0.07      63.08 us     39.00 us     133.00 us            78  UNLINK
     0.13      27.35 us     10.00 us      91.00 us           333  ENTRYLK
     0.13     126.89 us     99.00 us     179.00 us            76  MKNOD
     0.42     116.70 us      8.00 us    8661.00 us           261  LOOKUP
    28.73      51.79 us     22.00 us    2574.00 us         39822  FXATTROP
    29.52      53.87 us     16.00 us    3290.00 us         39328  WRITE
    40.92      24.71 us     10.00 us    3224.00 us        118864  FINODELK

    Duration: 189 seconds
   Data Read: 0 bytes
Data Written: 39328 bytes

Brick: ovirt2.localdomain:/gluster_bricks/data/data
---------------------------------------------------
Cumulative Stats:
   Block Size:                512b+              131072b+
 No. of Reads:                    0                     0
No. of Writes:                   36                 76758

%-latency   Avg-latency  Min-Latency   Max-Latency  No. of calls  Fop
---------   -----------  -----------   -----------  ------------  ----
     0.00       0.00 us      0.00 us       0.00 us             6  FORGET
     0.00       0.00 us      0.00 us       0.00 us            87  RELEASE
     0.00       0.00 us      0.00 us       0.00 us            96  RELEASEDIR
     0.00     100.50 us     80.00 us     121.00 us             2  REMOVEXATTR
     0.00     101.00 us    101.00 us     101.00 us             2  SETXATTR
     0.00      36.18 us     22.00 us      62.00 us            11  FLUSH
     0.00      57.44 us     42.00 us      77.00 us             9  FTRUNCATE
     0.00      82.56 us     59.00 us     138.00 us             9  FSTAT
     0.00      89.42 us     67.00 us     161.00 us            12  SETATTR
     0.00     272.40 us    235.00 us     296.00 us             5  CREATE
     0.01     154.28 us     88.00 us     320.00 us            18  XATTROP
     0.01      45.29 us      1.00 us     319.00 us            96  OPENDIR
     0.01      86.69 us     30.00 us     379.00 us            62  STAT
     0.01      64.30 us     47.00 us     169.00 us            84  OPEN
     0.02     107.34 us     23.00 us     273.00 us            73  READDIRP
     0.02    4688.00 us     86.00 us    9290.00 us             2  TRUNCATE
     0.02      59.29 us     13.00 us     394.00 us           165  GETXATTR
     0.03     128.51 us     27.00 us     338.00 us            96  FSYNC
     0.03     240.75 us     14.00 us    1943.00 us            52  READDIR
     0.04      65.59 us     26.00 us     293.00 us           279  STATFS
     0.06     180.77 us    118.00 us     306.00 us           148  MKNOD
     0.14      37.98 us     17.00 us     192.00 us          1598  INODELK
     0.67      91.68 us     12.00 us    1141.00 us          3186  LOOKUP
    10.10      55.92 us     28.00 us    1658.00 us         78608  FXATTROP
    11.89    6814.76 us     18.00 us  301246.00 us           760  ENTRYLK
    19.44      36.55 us     14.00 us    2353.00 us        231535  FINODELK
    25.21     142.92 us     62.00 us     593.00 us         76794  WRITE
    32.28   91283.68 us     28.00 us  316658.00 us           154  UNLINK

    Duration: 1206 seconds
   Data Read: 0 bytes
Data Written: 10060843008 bytes

Interval 0 Stats:
   Block Size:                512b+              131072b+
 No. of Reads:                    0                     0
No. of Writes:                   36                 76758

%-latency   Avg-latency  Min-Latency   Max-Latency  No. of calls  Fop
---------   -----------  -----------   -----------  ------------  ----
     0.00       0.00 us      0.00 us       0.00 us             6  FORGET
     0.00       0.00 us      0.00 us       0.00 us            87  RELEASE
     0.00       0.00 us      0.00 us       0.00 us            96  RELEASEDIR
     0.00     100.50 us     80.00 us     121.00 us             2  REMOVEXATTR
     0.00     101.00 us    101.00 us     101.00 us             2  SETXATTR
     0.00      36.18 us     22.00 us      62.00 us            11  FLUSH
     0.00      57.44 us     42.00 us      77.00 us             9  FTRUNCATE
     0.00      82.56 us     59.00 us     138.00 us             9  FSTAT
     0.00      89.42 us     67.00 us     161.00 us            12  SETATTR
     0.00     272.40 us    235.00 us     296.00 us             5  CREATE
     0.01     154.28 us     88.00 us     320.00 us            18  XATTROP
     0.01      45.29 us      1.00 us     319.00 us            96  OPENDIR
     0.01      86.69 us     30.00 us     379.00 us            62  STAT
     0.01      64.30 us     47.00 us     169.00 us            84  OPEN
     0.02     107.34 us     23.00 us     273.00 us            73  READDIRP
     0.02    4688.00 us     86.00 us    9290.00 us             2  TRUNCATE
     0.02      59.29 us     13.00 us     394.00 us           165  GETXATTR
     0.03     128.51 us     27.00 us     338.00 us            96  FSYNC
     0.03     240.75 us     14.00 us    1943.00 us            52  READDIR
     0.04      65.59 us     26.00 us     293.00 us           279  STATFS
     0.06     180.77 us    118.00 us     306.00 us           148  MKNOD
     0.14      37.98 us     17.00 us     192.00 us          1598  INODELK
     0.67      91.66 us     12.00 us    1141.00 us          3186  LOOKUP
    10.10      55.92 us     28.00 us    1658.00 us         78608  FXATTROP
    11.89    6814.76 us     18.00 us  301246.00 us           760  ENTRYLK
    19.44      36.55 us     14.00 us    2353.00 us        231535  FINODELK
    25.21     142.92 us     62.00 us     593.00 us         76794  WRITE
    32.28   91283.68 us     28.00 us  316658.00 us           154  UNLINK

    Duration: 1206 seconds
   Data Read: 0 bytes
Data Written: 10060843008 bytes

This indicates to me that it's not a problem in Disk/LVM/FileSystem layout.
Most probably I haven't created the volume properly or some option/feature is disabled ?!?
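
A minimal sketch, assuming Python is available on the node, for ranking the FOP rows of the profile output above by their share of latency. The regular expression is modelled only on the rows shown in this message; the function name `top_fops` and the returned field names are illustrative, not part of any gluster tool.

```python
import re

# One FOP row: %-latency, avg/min/max latency in microseconds, call count, FOP name.
ROW_RE = re.compile(
    r"(\d+\.\d+)\s+(\d+\.\d+) us\s+(\d+\.\d+) us\s+(\d+\.\d+) us\s+(\d+)\s+([A-Z]+)"
)


def top_fops(profile_text, n=3):
    """Return the n FOP rows with the highest %-latency, largest first."""
    rows = [
        {
            "pct": float(pct),
            "avg_us": float(avg),
            "max_us": float(mx),
            "calls": int(calls),
            "fop": fop,
        }
        for pct, avg, _mn, mx, calls, fop in ROW_RE.findall(profile_text)
    ]
    return sorted(rows, key=lambda r: r["pct"], reverse=True)[:n]


if __name__ == "__main__":
    # Four rows copied from the ovirt1 brick table in this message.
    sample = (
        "0.77 124.24 us 87.00 us 604.00 us 44968 WRITE "
        "0.93 15767.71 us 15.00 us 305833.00 us 431 ENTRYLK "
        "1.99 160711.39 us 3332.00 us 406037.00 us 90 UNLINK "
        "96.00 5167.82 us 18.00 us 55972.00 us 135349 FINODELK"
    )
    for row in top_fops(sample, n=2):
        print(row["fop"], row["pct"], row["avg_us"])
```

Applied to one brick at a time, this makes it easier to see that on ovirt1 FINODELK accounts for 96.00 % of latency at an average of 5167.82 us, whereas the same FOP averages 24.71 us on ovirt3 and 36.55 us on ovirt2.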
Network shows OK for a gigabit:

[root@ovirt1 data]# dd if=/dev/zero status=progress | nc ovirt2 9999
3569227264 bytes (3.6 GB) copied, 29.001052 s, 123 MB/s^C
7180980+0 records in
7180979+0 records out
3676661248 bytes (3.7 GB) copied, 29.8739 s, 123 MB/s

Design: https://drive.google.com/file/d/1SiW21ASPXHRAEuE_jZ50R3FoO-NcnFqT/view?usp=sharing
Gluster version is 3.12.15

I would appreciate any hint on how to proceed further from this point!

Thanks in advance.

Best Regards,
Strahil Nikolov

From meira at cesup.ufrgs.br Wed Jan 23 14:43:04 2019
From: meira at cesup.ufrgs.br (Lindolfo Meira)
Date: Wed, 23 Jan 2019 12:43:04 -0200 (-02)
Subject: [Gluster-users] writev: Transport endpoint is not connected
In-Reply-To:
References:
Message-ID:

Hi Amar.

Yeah, I've taken a look at the documentation. Below is the output of volume info on the new volume. Pretty standard.

Volume Name: gfs
Type: Distribute
Volume ID: b5ef065f-1ba2-481f-8108-e8f6d2d3f036
Status: Started
Snapshot Count: 0
Number of Bricks: 6
Transport-type: rdma
Bricks:
Brick1: pfs01-ib:/mnt/data
Brick2: pfs02-ib:/mnt/data
Brick3: pfs03-ib:/mnt/data
Brick4: pfs04-ib:/mnt/data
Brick5: pfs05-ib:/mnt/data
Brick6: pfs06-ib:/mnt/data
Options Reconfigured:
features.shard: on
nfs.disable: on


Lindolfo Meira, MSc
Diretor Geral, Centro Nacional de Supercomputação
Universidade Federal do Rio Grande do Sul
+55 (51) 3308-3139

On Wed, 23 Jan 2019, Amar Tumballi Suryanarayan wrote:

> Hi Lindolfo,
>
> Can you now share the 'gluster volume info' from your setup?
>
> Please note some basic documentation on shard is available @
> https://staged-gluster-docs.readthedocs.io/en/release3.7.0beta1/Features/shard/
>
> -Amar
>
> On Wed, Jan 23, 2019 at 7:55 PM Lindolfo Meira wrote:
>
> > Does this remark have anything to do with the problem I'm talking about?
> > Because I took the time to recreate the volume, changing its type and
> > enabling shard, and the problem persists :/
> >
> >
> > Lindolfo Meira, MSc
> > Diretor Geral, Centro Nacional de Supercomputação
> > Universidade Federal do Rio Grande do Sul
> > +55 (51) 3308-3139
> >
> > On Wed, 23 Jan 2019, Raghavendra Gowdappa wrote:
> >
> > > On Wed, Jan 23, 2019 at 1:59 AM Lindolfo Meira wrote:
> > >
> > > > Dear all,
> > > >
> > > > I've been trying to benchmark a gluster file system using the MPIIO API of
> > > > IOR. Almost all of the times I try to run the application with more than 6
> > > > tasks performing I/O (mpirun -n N, for N > 6) I get the error: "writev:
> > > > Transport endpoint is not connected". And then each one of the N tasks
> > > > returns "ERROR: cannot open file to get file size, MPI MPI_ERR_FILE:
> > > > invalid file, (aiori-MPIIO.c:488)".
> > > >
> > > > Does anyone have any idea what's going on?
> > > >
> > > > I'm writing from a single node, to a system configured for stripe over 6
> > > > bricks. The volume is mounted with the options _netdev and transport=rdma.
> > > > I'm using OpenMPI 2.1.2 (I tested version 4.0.0 and nothing changed). IOR
> > > > arguments used: -B -E -F -q -w -k -z -i=1 -t=2m -b=1g -a=MPIIO. Running
> > > > OpenSUSE Leap 15.0 and GlusterFS 5.3. Output of "gluster volume info"
> > > > follows below:
> > > >
> > > > Volume Name: gfs
> > > > Type: Stripe
> > > >
> > >
> > > +Dhananjay, Krutika
> > > stripe has been deprecated. You can use sharded volumes.
> > >
> > >
> > > > Volume ID: ea159033-5f7f-40ac-bad0-6f46613a336b
> > > > Status: Started
> > > > Snapshot Count: 0
> > > > Number of Bricks: 1 x 6 = 6
> > > > Transport-type: rdma
> > > > Bricks:
> > > > Brick1: pfs01-ib:/mnt/data/gfs
> > > > Brick2: pfs02-ib:/mnt/data/gfs
> > > > Brick3: pfs03-ib:/mnt/data/gfs
> > > > Brick4: pfs04-ib:/mnt/data/gfs
> > > > Brick5: pfs05-ib:/mnt/data/gfs
> > > > Brick6: pfs06-ib:/mnt/data/gfs
> > > > Options Reconfigured:
> > > > nfs.disable: on
> > > >
> > > >
> > > > Thanks in advance,
> > > >
> > > > Lindolfo Meira, MSc
> > > > Diretor Geral, Centro Nacional de Supercomputação
> > > > Universidade Federal do Rio Grande do Sul
> > > > +55 (51) 3308-3139
> > > > _______________________________________________
> > > > Gluster-users mailing list
> > > > Gluster-users at gluster.org
> > > > https://lists.gluster.org/mailman/listinfo/gluster-users
> > >
> > _______________________________________________
> > Gluster-users mailing list
> > Gluster-users at gluster.org
> > https://lists.gluster.org/mailman/listinfo/gluster-users
>
>
>
> --
> Amar Tumballi (amarts)
>

From hunter86_bg at yahoo.com Wed Jan 23 15:07:25 2019
From: hunter86_bg at yahoo.com (Strahil Nikolov)
Date: Wed, 23 Jan 2019 15:07:25 +0000 (UTC)
Subject: [Gluster-users] Gluster performance issues - need advise
References: <215916002.2380263.1548256046035.ref@mail.yahoo.com>
Message-ID: <215916002.2380263.1548256046035@mail.yahoo.com>

Hello Community,

recently I have built a new lab based on oVirt and CentOS 7.
During deployment I had some hiccups, but now the engine is up and running - but gluster is causing me trouble.

Symptoms: Slow VM install from DVD, poor write performance. The latter has been tested via:
dd if=/dev/zero of=/rhev/data-center/mnt/glusterSD/ovirt1.localdomain:_data bs=1M count=1000 status=progress

The reported speed is 60MB/s which is way too low for my setup.
My lab design: https://drive.google.com/file/d/1SiW21ASPXHRAEuE_jZ50R3FoO-NcnFqT/view?usp=sharing Gluster version is 3.12.15 So far I have done: 1. Added 'server.allow-insecure on' (with 'option rpc-auth-allow-insecure on' in glusterd.vol) Volume info after that change: Volume Name: data Type: Replicate Volume ID: 9b06a1e9-8102-4cd7-bc56-84960a1efaa2 Status: Started Snapshot Count: 0 Number of Bricks: 1 x (2 + 1) = 3 Transport-type: tcp Bricks: Brick1: ovirt1.localdomain:/gluster_bricks/data/data Brick2: ovirt2.localdomain:/gluster_bricks/data/data Brick3: ovirt3.localdomain:/gluster_bricks/data/data (arbiter) Options Reconfigured: performance.client-io-threads: off nfs.disable: on transport.address-family: inet performance.quick-read: off performance.read-ahead: off performance.io-cache: off performance.low-prio-threads: 32 network.remote-dio: off cluster.eager-lock: enable cluster.quorum-type: auto cluster.server-quorum-type: server cluster.data-self-heal-algorithm: full cluster.locking-scheme: granular cluster.shd-max-threads: 8 cluster.shd-wait-qlength: 10000 features.shard: on user.cifs: off storage.owner-uid: 36 storage.owner-gid: 36 network.ping-timeout: 30 performance.strict-o-direct: on cluster.granular-entry-heal: enable server.allow-insecure: on Seems no positive or negative effect so far. 2. Tested with tmpfs on all bricks -> ovirt1 mounted gluster volume -> max 60MB/s (bs=1M without 'oflag=direct') [root at ovirt1 data]# dd if=/dev/zero of=large_io bs=1M count=4000 status=progress 4177526784 bytes (4.2 GB) copied, 70.843409 s, 59.0 MB/s 4000+0 records in 4000+0 records out 4194304000 bytes (4.2 GB) copied, 71.1407 s, 59.0 MB/s [root at ovirt1 data]# rm -f large_io [root at ovirt1 data]# gluster volume profile data info Brick: ovirt1.localdomain:/gluster_bricks/data/data --------------------------------------------------- Cumulative Stats: Block Size: 131072b+ No. of Reads: 8 No. of Writes: 44968 %-latency Avg-latency Min-Latency Max-Latency No. 
of calls Fop --------- ----------- ----------- ----------- ------------ ---- 0.00 0.00 us 0.00 us 0.00 us 3 FORGET 0.00 0.00 us 0.00 us 0.00 us 35 RELEASE 0.00 0.00 us 0.00 us 0.00 us 28 RELEASEDIR 0.00 78.00 us 78.00 us 78.00 us 1 FSTAT 0.00 35.67 us 26.00 us 73.00 us 6 FLUSH 0.00 324.00 us 324.00 us 324.00 us 1 XATTROP 0.00 45.80 us 38.00 us 54.00 us 10 STAT 0.00 227.67 us 216.00 us 242.00 us 3 CREATE 0.00 113.38 us 68.00 us 381.00 us 8 READ 0.00 39.82 us 1.00 us 148.00 us 28 OPENDIR 0.00 67.54 us 10.00 us 283.00 us 24 GETXATTR 0.00 59.97 us 45.00 us 113.00 us 32 OPEN 0.00 24.41 us 13.00 us 89.00 us 161 INODELK 0.00 43.43 us 28.00 us 214.00 us 93 STATFS 0.00 246.35 us 11.00 us 1155.00 us 20 READDIR 0.00 283.00 us 233.00 us 353.00 us 18 READDIRP 0.00 153.23 us 122.00 us 259.00 us 87 MKNOD 0.01 99.77 us 10.00 us 258.00 us 442 LOOKUP 0.31 49.22 us 27.00 us 540.00 us 45620 FXATTROP 0.77 124.24 us 87.00 us 604.00 us 44968 WRITE 0.93 15767.71 us 15.00 us 305833.00 us 431 ENTRYLK 1.99 160711.39 us 3332.00 us 406037.00 us 90 UNLINK 96.00 5167.82 us 18.00 us 55972.00 us 135349 FINODELK Duration: 380 seconds Data Read: 1048576 bytes Data Written: 5894045696 bytes Interval 0 Stats: Block Size: 131072b+ No. of Reads: 8 No. of Writes: 44968 %-latency Avg-latency Min-Latency Max-Latency No. 
of calls Fop --------- ----------- ----------- ----------- ------------ ---- 0.00 0.00 us 0.00 us 0.00 us 3 FORGET 0.00 0.00 us 0.00 us 0.00 us 35 RELEASE 0.00 0.00 us 0.00 us 0.00 us 28 RELEASEDIR 0.00 78.00 us 78.00 us 78.00 us 1 FSTAT 0.00 35.67 us 26.00 us 73.00 us 6 FLUSH 0.00 324.00 us 324.00 us 324.00 us 1 XATTROP 0.00 45.80 us 38.00 us 54.00 us 10 STAT 0.00 227.67 us 216.00 us 242.00 us 3 CREATE 0.00 113.38 us 68.00 us 381.00 us 8 READ 0.00 39.82 us 1.00 us 148.00 us 28 OPENDIR 0.00 67.54 us 10.00 us 283.00 us 24 GETXATTR 0.00 59.97 us 45.00 us 113.00 us 32 OPEN 0.00 24.41 us 13.00 us 89.00 us 161 INODELK 0.00 43.43 us 28.00 us 214.00 us 93 STATFS 0.00 246.35 us 11.00 us 1155.00 us 20 READDIR 0.00 283.00 us 233.00 us 353.00 us 18 READDIRP 0.00 153.23 us 122.00 us 259.00 us 87 MKNOD 0.01 99.77 us 10.00 us 258.00 us 442 LOOKUP 0.31 49.22 us 27.00 us 540.00 us 45620 FXATTROP 0.77 124.24 us 87.00 us 604.00 us 44968 WRITE 0.93 15767.71 us 15.00 us 305833.00 us 431 ENTRYLK 1.99 160711.39 us 3332.00 us 406037.00 us 90 UNLINK 96.00 5167.82 us 18.00 us 55972.00 us 135349 FINODELK Duration: 380 seconds Data Read: 1048576 bytes Data Written: 5894045696 bytes Brick: ovirt3.localdomain:/gluster_bricks/data/data --------------------------------------------------- Cumulative Stats: Block Size: 1b+ No. of Reads: 0 No. of Writes: 39328 %-latency Avg-latency Min-Latency Max-Latency No. 
of calls Fop --------- ----------- ----------- ----------- ------------ ---- 0.00 0.00 us 0.00 us 0.00 us 2 FORGET 0.00 0.00 us 0.00 us 0.00 us 12 RELEASE 0.00 0.00 us 0.00 us 0.00 us 17 RELEASEDIR 0.00 101.00 us 101.00 us 101.00 us 1 FSTAT 0.00 51.50 us 20.00 us 81.00 us 4 FLUSH 0.01 219.50 us 188.00 us 251.00 us 2 CREATE 0.01 43.45 us 11.00 us 90.00 us 11 GETXATTR 0.01 62.30 us 38.00 us 119.00 us 10 OPEN 0.01 50.59 us 1.00 us 102.00 us 17 OPENDIR 0.01 24.60 us 12.00 us 64.00 us 40 INODELK 0.02 176.30 us 10.00 us 765.00 us 10 READDIR 0.07 63.08 us 39.00 us 133.00 us 78 UNLINK 0.13 27.35 us 10.00 us 91.00 us 333 ENTRYLK 0.13 126.89 us 99.00 us 179.00 us 76 MKNOD 0.42 116.70 us 8.00 us 8661.00 us 261 LOOKUP 28.73 51.79 us 22.00 us 2574.00 us 39822 FXATTROP 29.52 53.87 us 16.00 us 3290.00 us 39328 WRITE 40.92 24.71 us 10.00 us 3224.00 us 118864 FINODELK Duration: 189 seconds Data Read: 0 bytes Data Written: 39328 bytes Interval 0 Stats: Block Size: 1b+ No. of Reads: 0 No. of Writes: 39328 %-latency Avg-latency Min-Latency Max-Latency No. 
of calls Fop
 ---------   -----------   -----------   -----------  ------------  ----
      0.00       0.00 us       0.00 us       0.00 us             2  FORGET
      0.00       0.00 us       0.00 us       0.00 us            12  RELEASE
      0.00       0.00 us       0.00 us       0.00 us            17  RELEASEDIR
      0.00     101.00 us     101.00 us     101.00 us             1  FSTAT
      0.00      51.50 us      20.00 us      81.00 us             4  FLUSH
      0.01     219.50 us     188.00 us     251.00 us             2  CREATE
      0.01      43.45 us      11.00 us      90.00 us            11  GETXATTR
      0.01      62.30 us      38.00 us     119.00 us            10  OPEN
      0.01      50.59 us       1.00 us     102.00 us            17  OPENDIR
      0.01      24.60 us      12.00 us      64.00 us            40  INODELK
      0.02     176.30 us      10.00 us     765.00 us            10  READDIR
      0.07      63.08 us      39.00 us     133.00 us            78  UNLINK
      0.13      27.35 us      10.00 us      91.00 us           333  ENTRYLK
      0.13     126.89 us      99.00 us     179.00 us            76  MKNOD
      0.42     116.70 us       8.00 us    8661.00 us           261  LOOKUP
     28.73      51.79 us      22.00 us    2574.00 us         39822  FXATTROP
     29.52      53.87 us      16.00 us    3290.00 us         39328  WRITE
     40.92      24.71 us      10.00 us    3224.00 us        118864  FINODELK

    Duration: 189 seconds
   Data Read: 0 bytes
Data Written: 39328 bytes

Brick: ovirt2.localdomain:/gluster_bricks/data/data
---------------------------------------------------
Cumulative Stats:
   Block Size:                512b+              131072b+
 No. of Reads:                    0                     0
No. of Writes:                   36                 76758
 %-latency   Avg-latency   Min-Latency   Max-Latency  No. of calls  Fop
 ---------   -----------   -----------   -----------  ------------  ----
      0.00       0.00 us       0.00 us       0.00 us             6  FORGET
      0.00       0.00 us       0.00 us       0.00 us            87  RELEASE
      0.00       0.00 us       0.00 us       0.00 us            96  RELEASEDIR
      0.00     100.50 us      80.00 us     121.00 us             2  REMOVEXATTR
      0.00     101.00 us     101.00 us     101.00 us             2  SETXATTR
      0.00      36.18 us      22.00 us      62.00 us            11  FLUSH
      0.00      57.44 us      42.00 us      77.00 us             9  FTRUNCATE
      0.00      82.56 us      59.00 us     138.00 us             9  FSTAT
      0.00      89.42 us      67.00 us     161.00 us            12  SETATTR
      0.00     272.40 us     235.00 us     296.00 us             5  CREATE
      0.01     154.28 us      88.00 us     320.00 us            18  XATTROP
      0.01      45.29 us       1.00 us     319.00 us            96  OPENDIR
      0.01      86.69 us      30.00 us     379.00 us            62  STAT
      0.01      64.30 us      47.00 us     169.00 us            84  OPEN
      0.02     107.34 us      23.00 us     273.00 us            73  READDIRP
      0.02    4688.00 us      86.00 us    9290.00 us             2  TRUNCATE
      0.02      59.29 us      13.00 us     394.00 us           165  GETXATTR
      0.03     128.51 us      27.00 us     338.00 us            96  FSYNC
      0.03     240.75 us      14.00 us    1943.00 us            52  READDIR
      0.04      65.59 us      26.00 us     293.00 us           279  STATFS
      0.06     180.77 us     118.00 us     306.00 us           148  MKNOD
      0.14      37.98 us      17.00 us     192.00 us          1598  INODELK
      0.67      91.68 us      12.00 us    1141.00 us          3186  LOOKUP
     10.10      55.92 us      28.00 us    1658.00 us         78608  FXATTROP
     11.89    6814.76 us      18.00 us  301246.00 us           760  ENTRYLK
     19.44      36.55 us      14.00 us    2353.00 us        231535  FINODELK
     25.21     142.92 us      62.00 us     593.00 us         76794  WRITE
     32.28   91283.68 us      28.00 us  316658.00 us           154  UNLINK

    Duration: 1206 seconds
   Data Read: 0 bytes
Data Written: 10060843008 bytes

Interval 0 Stats:
   Block Size:                512b+              131072b+
 No. of Reads:                    0                     0
No. of Writes:                   36                 76758
 %-latency   Avg-latency   Min-Latency   Max-Latency  No. of calls  Fop
 ---------   -----------   -----------   -----------  ------------  ----
      0.00       0.00 us       0.00 us       0.00 us             6  FORGET
      0.00       0.00 us       0.00 us       0.00 us            87  RELEASE
      0.00       0.00 us       0.00 us       0.00 us            96  RELEASEDIR
      0.00     100.50 us      80.00 us     121.00 us             2  REMOVEXATTR
      0.00     101.00 us     101.00 us     101.00 us             2  SETXATTR
      0.00      36.18 us      22.00 us      62.00 us            11  FLUSH
      0.00      57.44 us      42.00 us      77.00 us             9  FTRUNCATE
      0.00      82.56 us      59.00 us     138.00 us             9  FSTAT
      0.00      89.42 us      67.00 us     161.00 us            12  SETATTR
      0.00     272.40 us     235.00 us     296.00 us             5  CREATE
      0.01     154.28 us      88.00 us     320.00 us            18  XATTROP
      0.01      45.29 us       1.00 us     319.00 us            96  OPENDIR
      0.01      86.69 us      30.00 us     379.00 us            62  STAT
      0.01      64.30 us      47.00 us     169.00 us            84  OPEN
      0.02     107.34 us      23.00 us     273.00 us            73  READDIRP
      0.02    4688.00 us      86.00 us    9290.00 us             2  TRUNCATE
      0.02      59.29 us      13.00 us     394.00 us           165  GETXATTR
      0.03     128.51 us      27.00 us     338.00 us            96  FSYNC
      0.03     240.75 us      14.00 us    1943.00 us            52  READDIR
      0.04      65.59 us      26.00 us     293.00 us           279  STATFS
      0.06     180.77 us     118.00 us     306.00 us           148  MKNOD
      0.14      37.98 us      17.00 us     192.00 us          1598  INODELK
      0.67      91.66 us      12.00 us    1141.00 us          3186  LOOKUP
     10.10      55.92 us      28.00 us    1658.00 us         78608  FXATTROP
     11.89    6814.76 us      18.00 us  301246.00 us           760  ENTRYLK
     19.44      36.55 us      14.00 us    2353.00 us        231535  FINODELK
     25.21     142.92 us      62.00 us     593.00 us         76794  WRITE
     32.28   91283.68 us      28.00 us  316658.00 us           154  UNLINK

    Duration: 1206 seconds
   Data Read: 0 bytes
Data Written: 10060843008 bytes

This indicates to me that it's not a problem in Disk/LVM/FileSystem layout.

Most probably I haven't created the volume properly or some option/feature
is disabled ?!?
Network shows OK for a gigabit:
[root at ovirt1 data]# dd if=/dev/zero status=progress | nc ovirt2 9999
3569227264 bytes (3.6 GB) copied, 29.001052 s, 123 MB/s^C
7180980+0 records in
7180979+0 records out
3676661248 bytes (3.7 GB) copied, 29.8739 s, 123 MB/s

I'm looking for any help... you can share your volume info also.

Thanks in advance.
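As a rough cross-check of the two dd results above, the following Python sketch (not part of the original test run) compares the measured write speed with the ceiling one would expect if every written byte crossed the same gigabit link twice. The factor of two is an assumption: it models client-side replication to the two data bricks and treats the arbiter as receiving no file data, which is consistent with the 39328 bytes written to ovirt3 in the profile. The function name `expected_write_ceiling` is made up for illustration.

```python
# Figures taken from the dd runs quoted in this thread.
LINK_MB_S = 123.0      # dd | nc from ovirt1 to ovirt2
MEASURED_MB_S = 59.0   # dd onto the gluster mount, bs=1M


def expected_write_ceiling(link_mb_s: float, full_copies_over_link: int) -> float:
    """Upper bound on client write speed when each byte is sent
    `full_copies_over_link` times over one shared link.

    Assumption: the copies share the link evenly and nothing else
    (locking, xattr updates, arbiter metadata) costs bandwidth.
    """
    if full_copies_over_link < 1:
        raise ValueError("need at least one copy over the link")
    return link_mb_s / full_copies_over_link


# Assumed: two full data copies per write (replica 2 + arbiter).
ceiling = expected_write_ceiling(LINK_MB_S, 2)
print(f"ceiling: {ceiling:.1f} MB/s, measured: {MEASURED_MB_S:.1f} MB/s "
      f"({MEASURED_MB_S / ceiling:.0%} of ceiling)")
```

If the assumption holds, a result close to the ceiling would point at the network rather than at the disks or the volume options; if the client writes one copy to a local brick, the ceiling would be higher and this estimate does not apply.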
Best Regards,
Strahil Nikolov

From atumball at redhat.com Wed Jan 23 15:49:21 2019
From: atumball at redhat.com (Amar Tumballi Suryanarayan)
Date: Wed, 23 Jan 2019 21:19:21 +0530
Subject: [Gluster-users] Gluster performance issues - need advise
In-Reply-To: <215916002.2380263.1548256046035@mail.yahoo.com>
References: <215916002.2380263.1548256046035.ref@mail.yahoo.com>
 <215916002.2380263.1548256046035@mail.yahoo.com>
Message-ID:

I didn't fully understand the issue; I may have missed something.

Are you concerned that the performance is 49MB/s both with and without the
perf options? Or are you expecting 123MB/s, since that is the speed you get
over the network?

If it is the first, note that 'performance.write-behind on' is in effect in
both configurations, and it is the only perf xlator that comes into play
during the test you ran.

If it is the second, please note that gluster does client-side replication,
which means the network bandwidth is split in half for write operations
(write(), creat(), etc.), so the number you are getting is almost the
maximum for 1GbE.

Regards,
Amar

On Wed, Jan 23, 2019 at 8:38 PM Strahil Nikolov wrote:

> Hello Community,
>
> recently I have built a new lab based on oVirt and CentOS 7.
> During deployment I had some hiccups, but now the engine is up and running
> - but gluster is causing me trouble.
>
> Symptoms: Slow VM install from DVD, poor write performance. The latter has
> been tested via:
> dd if=/dev/zero
> of=/rhev/data-center/mnt/glusterSD/ovirt1.localdomain:_data bs=1M
> count=1000 status=progress
>
> The reported speed is 60MB/s which is way too low for my setup.
>
> My lab design:
>
> https://drive.google.com/file/d/1SiW21ASPXHRAEuE_jZ50R3FoO-NcnFqT/view?usp=sharing
> Gluster version is 3.12.15
>
> So far I have done:
>
> 1.
Added 'server.allow-insecure on' (with 'option rpc-auth-allow-insecure
> on' in glusterd.vol)
> Volume info after that change:
>
> Volume Name: data
> Type: Replicate
> Volume ID: 9b06a1e9-8102-4cd7-bc56-84960a1efaa2
> Status: Started
> Snapshot Count: 0
> Number of Bricks: 1 x (2 + 1) = 3
> Transport-type: tcp
> Bricks:
> Brick1: ovirt1.localdomain:/gluster_bricks/data/data
> Brick2: ovirt2.localdomain:/gluster_bricks/data/data
> Brick3: ovirt3.localdomain:/gluster_bricks/data/data (arbiter)
> Options Reconfigured:
> performance.client-io-threads: off
> nfs.disable: on
> transport.address-family: inet
> performance.quick-read: off
> performance.read-ahead: off
> performance.io-cache: off
> performance.low-prio-threads: 32
> network.remote-dio: off
> cluster.eager-lock: enable
> cluster.quorum-type: auto
> cluster.server-quorum-type: server
> cluster.data-self-heal-algorithm: full
> cluster.locking-scheme: granular
> cluster.shd-max-threads: 8
> cluster.shd-wait-qlength: 10000
> features.shard: on
> user.cifs: off
> storage.owner-uid: 36
> storage.owner-gid: 36
> network.ping-timeout: 30
> performance.strict-o-direct: on
> cluster.granular-entry-heal: enable
> server.allow-insecure: on
>
> Seems no positive or negative effect so far.
>
> 2. Tested with tmpfs on all bricks -> ovirt1 mounted gluster volume ->
> max 60MB/s (bs=1M without 'oflag=direct')
>
>
> [root at ovirt1 data]# dd if=/dev/zero of=large_io bs=1M count=4000
> status=progress
> 4177526784 bytes (4.2 GB) copied, 70.843409 s, 59.0 MB/s
> 4000+0 records in
> 4000+0 records out
> 4194304000 bytes (4.2 GB) copied, 71.1407 s, 59.0 MB/s
> [root at ovirt1 data]# rm -f large_io
> [root at ovirt1 data]# gluster volume profile data info
> Brick: ovirt1.localdomain:/gluster_bricks/data/data
> ---------------------------------------------------
> Cumulative Stats:
> Block Size: 131072b+
> No. of Reads: 8
> No. of Writes: 44968
> %-latency Avg-latency Min-Latency Max-Latency No.
of calls Fop > --------- ----------- ----------- ----------- ------------ ---- > 0.00 0.00 us 0.00 us 0.00 us 3 FORGET > 0.00 0.00 us 0.00 us 0.00 us 35 RELEASE > 0.00 0.00 us 0.00 us 0.00 us 28 RELEASEDIR > 0.00 78.00 us 78.00 us 78.00 us 1 FSTAT > 0.00 35.67 us 26.00 us 73.00 us 6 FLUSH > 0.00 324.00 us 324.00 us 324.00 us 1 XATTROP > 0.00 45.80 us 38.00 us 54.00 us 10 STAT > 0.00 227.67 us 216.00 us 242.00 us 3 CREATE > 0.00 113.38 us 68.00 us 381.00 us 8 READ > 0.00 39.82 us 1.00 us 148.00 us 28 OPENDIR > 0.00 67.54 us 10.00 us 283.00 us 24 GETXATTR > 0.00 59.97 us 45.00 us 113.00 us 32 OPEN > 0.00 24.41 us 13.00 us 89.00 us 161 INODELK > 0.00 43.43 us 28.00 us 214.00 us 93 STATFS > 0.00 246.35 us 11.00 us 1155.00 us 20 READDIR > 0.00 283.00 us 233.00 us 353.00 us 18 READDIRP > 0.00 153.23 us 122.00 us 259.00 us 87 MKNOD > 0.01 99.77 us 10.00 us 258.00 us 442 LOOKUP > 0.31 49.22 us 27.00 us 540.00 us 45620 FXATTROP > 0.77 124.24 us 87.00 us 604.00 us 44968 WRITE > 0.93 15767.71 us 15.00 us 305833.00 us 431 ENTRYLK > 1.99 160711.39 us 3332.00 us 406037.00 us 90 UNLINK > 96.00 5167.82 us 18.00 us 55972.00 us 135349 FINODELK > > Duration: 380 seconds > Data Read: 1048576 bytes > Data Written: 5894045696 bytes > > Interval 0 Stats: > Block Size: 131072b+ > No. of Reads: 8 > No. of Writes: 44968 > %-latency Avg-latency Min-Latency Max-Latency No. 
of calls Fop > --------- ----------- ----------- ----------- ------------ ---- > 0.00 0.00 us 0.00 us 0.00 us 3 FORGET > 0.00 0.00 us 0.00 us 0.00 us 35 RELEASE > 0.00 0.00 us 0.00 us 0.00 us 28 RELEASEDIR > 0.00 78.00 us 78.00 us 78.00 us 1 FSTAT > 0.00 35.67 us 26.00 us 73.00 us 6 FLUSH > 0.00 324.00 us 324.00 us 324.00 us 1 XATTROP > 0.00 45.80 us 38.00 us 54.00 us 10 STAT > 0.00 227.67 us 216.00 us 242.00 us 3 CREATE > 0.00 113.38 us 68.00 us 381.00 us 8 READ > 0.00 39.82 us 1.00 us 148.00 us 28 OPENDIR > 0.00 67.54 us 10.00 us 283.00 us 24 GETXATTR > 0.00 59.97 us 45.00 us 113.00 us 32 OPEN > 0.00 24.41 us 13.00 us 89.00 us 161 INODELK > 0.00 43.43 us 28.00 us 214.00 us 93 STATFS > 0.00 246.35 us 11.00 us 1155.00 us 20 READDIR > 0.00 283.00 us 233.00 us 353.00 us 18 READDIRP > 0.00 153.23 us 122.00 us 259.00 us 87 MKNOD > 0.01 99.77 us 10.00 us 258.00 us 442 LOOKUP > 0.31 49.22 us 27.00 us 540.00 us 45620 FXATTROP > 0.77 124.24 us 87.00 us 604.00 us 44968 WRITE > 0.93 15767.71 us 15.00 us 305833.00 us 431 ENTRYLK > 1.99 160711.39 us 3332.00 us 406037.00 us 90 UNLINK > 96.00 5167.82 us 18.00 us 55972.00 us 135349 FINODELK > > Duration: 380 seconds > Data Read: 1048576 bytes > Data Written: 5894045696 bytes > > Brick: ovirt3.localdomain:/gluster_bricks/data/data > --------------------------------------------------- > Cumulative Stats: > Block Size: 1b+ > No. of Reads: 0 > No. of Writes: 39328 > %-latency Avg-latency Min-Latency Max-Latency No. 
of calls Fop > --------- ----------- ----------- ----------- ------------ ---- > 0.00 0.00 us 0.00 us 0.00 us 2 FORGET > 0.00 0.00 us 0.00 us 0.00 us 12 RELEASE > 0.00 0.00 us 0.00 us 0.00 us 17 RELEASEDIR > 0.00 101.00 us 101.00 us 101.00 us 1 FSTAT > 0.00 51.50 us 20.00 us 81.00 us 4 FLUSH > 0.01 219.50 us 188.00 us 251.00 us 2 CREATE > 0.01 43.45 us 11.00 us 90.00 us 11 GETXATTR > 0.01 62.30 us 38.00 us 119.00 us 10 OPEN > 0.01 50.59 us 1.00 us 102.00 us 17 OPENDIR > 0.01 24.60 us 12.00 us 64.00 us 40 INODELK > 0.02 176.30 us 10.00 us 765.00 us 10 READDIR > 0.07 63.08 us 39.00 us 133.00 us 78 UNLINK > 0.13 27.35 us 10.00 us 91.00 us 333 ENTRYLK > 0.13 126.89 us 99.00 us 179.00 us 76 MKNOD > 0.42 116.70 us 8.00 us 8661.00 us 261 LOOKUP > 28.73 51.79 us 22.00 us 2574.00 us 39822 FXATTROP > 29.52 53.87 us 16.00 us 3290.00 us 39328 WRITE > 40.92 24.71 us 10.00 us 3224.00 us 118864 FINODELK > > Duration: 189 seconds > Data Read: 0 bytes > Data Written: 39328 bytes > > Interval 0 Stats: > Block Size: 1b+ > No. of Reads: 0 > No. of Writes: 39328 > %-latency Avg-latency Min-Latency Max-Latency No. 
of calls Fop > --------- ----------- ----------- ----------- ------------ ---- > 0.00 0.00 us 0.00 us 0.00 us 2 FORGET > 0.00 0.00 us 0.00 us 0.00 us 12 RELEASE > 0.00 0.00 us 0.00 us 0.00 us 17 RELEASEDIR > 0.00 101.00 us 101.00 us 101.00 us 1 FSTAT > 0.00 51.50 us 20.00 us 81.00 us 4 FLUSH > 0.01 219.50 us 188.00 us 251.00 us 2 CREATE > 0.01 43.45 us 11.00 us 90.00 us 11 GETXATTR > 0.01 62.30 us 38.00 us 119.00 us 10 OPEN > 0.01 50.59 us 1.00 us 102.00 us 17 OPENDIR > 0.01 24.60 us 12.00 us 64.00 us 40 INODELK > 0.02 176.30 us 10.00 us 765.00 us 10 READDIR > 0.07 63.08 us 39.00 us 133.00 us 78 UNLINK > 0.13 27.35 us 10.00 us 91.00 us 333 ENTRYLK > 0.13 126.89 us 99.00 us 179.00 us 76 MKNOD > 0.42 116.70 us 8.00 us 8661.00 us 261 LOOKUP > 28.73 51.79 us 22.00 us 2574.00 us 39822 FXATTROP > 29.52 53.87 us 16.00 us 3290.00 us 39328 WRITE > 40.92 24.71 us 10.00 us 3224.00 us 118864 FINODELK > > Duration: 189 seconds > Data Read: 0 bytes > Data Written: 39328 bytes > > Brick: ovirt2.localdomain:/gluster_bricks/data/data > --------------------------------------------------- > Cumulative Stats: > Block Size: 512b+ 131072b+ > No. of Reads: 0 0 > No. of Writes: 36 76758 > %-latency Avg-latency Min-Latency Max-Latency No. 
of calls Fop > --------- ----------- ----------- ----------- ------------ ---- > 0.00 0.00 us 0.00 us 0.00 us 6 FORGET > 0.00 0.00 us 0.00 us 0.00 us 87 RELEASE > 0.00 0.00 us 0.00 us 0.00 us 96 RELEASEDIR > 0.00 100.50 us 80.00 us 121.00 us 2 REMOVEXATTR > 0.00 101.00 us 101.00 us 101.00 us 2 SETXATTR > 0.00 36.18 us 22.00 us 62.00 us 11 FLUSH > 0.00 57.44 us 42.00 us 77.00 us 9 FTRUNCATE > 0.00 82.56 us 59.00 us 138.00 us 9 FSTAT > 0.00 89.42 us 67.00 us 161.00 us 12 SETATTR > 0.00 272.40 us 235.00 us 296.00 us 5 CREATE > 0.01 154.28 us 88.00 us 320.00 us 18 XATTROP > 0.01 45.29 us 1.00 us 319.00 us 96 OPENDIR > 0.01 86.69 us 30.00 us 379.00 us 62 STAT > 0.01 64.30 us 47.00 us 169.00 us 84 OPEN > 0.02 107.34 us 23.00 us 273.00 us 73 READDIRP > 0.02 4688.00 us 86.00 us 9290.00 us 2 TRUNCATE > 0.02 59.29 us 13.00 us 394.00 us 165 GETXATTR > 0.03 128.51 us 27.00 us 338.00 us 96 FSYNC > 0.03 240.75 us 14.00 us 1943.00 us 52 READDIR > 0.04 65.59 us 26.00 us 293.00 us 279 STATFS > 0.06 180.77 us 118.00 us 306.00 us 148 MKNOD > 0.14 37.98 us 17.00 us 192.00 us 1598 INODELK > 0.67 91.68 us 12.00 us 1141.00 us 3186 LOOKUP > 10.10 55.92 us 28.00 us 1658.00 us 78608 FXATTROP > 11.89 6814.76 us 18.00 us 301246.00 us 760 ENTRYLK > 19.44 36.55 us 14.00 us 2353.00 us 231535 FINODELK > 25.21 142.92 us 62.00 us 593.00 us 76794 WRITE > 32.28 91283.68 us 28.00 us 316658.00 us 154 UNLINK > > Duration: 1206 seconds > Data Read: 0 bytes > Data Written: 10060843008 bytes > > Interval 0 Stats: > Block Size: 512b+ 131072b+ > No. of Reads: 0 0 > No. of Writes: 36 76758 > %-latency Avg-latency Min-Latency Max-Latency No. 
of calls Fop > --------- ----------- ----------- ----------- ------------ ---- > 0.00 0.00 us 0.00 us 0.00 us 6 FORGET > 0.00 0.00 us 0.00 us 0.00 us 87 RELEASE > 0.00 0.00 us 0.00 us 0.00 us 96 RELEASEDIR > 0.00 100.50 us 80.00 us 121.00 us 2 REMOVEXATTR > 0.00 101.00 us 101.00 us 101.00 us 2 SETXATTR > 0.00 36.18 us 22.00 us 62.00 us 11 FLUSH > 0.00 57.44 us 42.00 us 77.00 us 9 FTRUNCATE > 0.00 82.56 us 59.00 us 138.00 us 9 FSTAT > 0.00 89.42 us 67.00 us 161.00 us 12 SETATTR > 0.00 272.40 us 235.00 us 296.00 us 5 CREATE > 0.01 154.28 us 88.00 us 320.00 us 18 XATTROP > 0.01 45.29 us 1.00 us 319.00 us 96 OPENDIR > 0.01 86.69 us 30.00 us 379.00 us 62 STAT > 0.01 64.30 us 47.00 us 169.00 us 84 OPEN > 0.02 107.34 us 23.00 us 273.00 us 73 READDIRP > 0.02 4688.00 us 86.00 us 9290.00 us 2 TRUNCATE > 0.02 59.29 us 13.00 us 394.00 us 165 GETXATTR > 0.03 128.51 us 27.00 us 338.00 us 96 FSYNC > 0.03 240.75 us 14.00 us 1943.00 us 52 READDIR > 0.04 65.59 us 26.00 us 293.00 us 279 STATFS > 0.06 180.77 us 118.00 us 306.00 us 148 MKNOD > 0.14 37.98 us 17.00 us 192.00 us 1598 INODELK > 0.67 91.66 us 12.00 us 1141.00 us 3186 LOOKUP > 10.10 55.92 us 28.00 us 1658.00 us 78608 FXATTROP > 11.89 6814.76 us 18.00 us 301246.00 us 760 ENTRYLK > 19.44 36.55 us 14.00 us 2353.00 us 231535 FINODELK > 25.21 142.92 us 62.00 us 593.00 us 76794 WRITE > 32.28 91283.68 us 28.00 us 316658.00 us 154 UNLINK > > Duration: 1206 seconds > Data Read: 0 bytes > Data Written: 10060843008 bytes > > > > This indicates to me that it's not a problem in Disk/LVM/FileSystem layout. > > Most probably I haven't created the volume properly or some option/feature > is disabled ?!? > Network shows OK for a gigabit: > [root at ovirt1 data]# dd if=/dev/zero status=progress | nc ovirt2 9999 > 3569227264 bytes (3.6 GB) copied, 29.001052 s, 123 MB/s^C > 7180980+0 records in > 7180979+0 records out > 3676661248 bytes (3.7 GB) copied, 29.8739 s, 123 MB/s > > > I'm looking for any help... you can share your volume info also. 
>
> Thanks in advance.
>
> Best Regards,
> Strahil Nikolov
> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-users
>

--
Amar Tumballi (amarts)

From hunter86_bg at yahoo.com Wed Jan 23 16:43:50 2019
From: hunter86_bg at yahoo.com (Strahil)
Date: Wed, 23 Jan 2019 18:43:50 +0200
Subject: [Gluster-users] Gluster performance issues - need advise
In-Reply-To:
Message-ID: <54e562bd-3465-4c69-8fda-8060a52e9c22@email.android.com>

An HTML attachment was scrubbed...

From amukherj at redhat.com Wed Jan 23 17:32:35 2019
From: amukherj at redhat.com (Atin Mukherjee)
Date: Wed, 23 Jan 2019 23:02:35 +0530
Subject: [Gluster-users] glusterfs 4.1.6 error in starting glusterd service
In-Reply-To:
References:
Message-ID:

Amudhan,

I see that you have provided the configuration content of the volume
gfs-tst, whereas the request was to share a dump of /var/lib/glusterd/*.
I cannot debug this further until you share the correct dump.

On Thu, Jan 17, 2019 at 3:43 PM Atin Mukherjee wrote:

> Can you please run 'glusterd -LDEBUG' and share back the glusterd.log?
> Instead of going back and forth too many times, I suggest you share the
> content of /var/lib/glusterd from all the nodes. Also, please mention on
> which particular node the glusterd service is unable to come up.
> > On Thu, Jan 17, 2019 at 11:34 AM Amudhan P wrote: > >> I have created the folder in the path as said but still, service failed >> to start below is the error msg in glusterd.log >> >> [2019-01-16 14:50:14.555742] I [MSGID: 100030] [glusterfsd.c:2741:main] >> 0-/usr/local/sbin/glusterd: Started running /usr/local/sbin/glusterd >> version 4.1.6 (args: /usr/local/sbin/glusterd -p /var/run/glusterd.pid) >> [2019-01-16 14:50:14.559835] I [MSGID: 106478] [glusterd.c:1423:init] >> 0-management: Maximum allowed open file descriptors set to 65536 >> [2019-01-16 14:50:14.559894] I [MSGID: 106479] [glusterd.c:1481:init] >> 0-management: Using /var/lib/glusterd as working directory >> [2019-01-16 14:50:14.559912] I [MSGID: 106479] [glusterd.c:1486:init] >> 0-management: Using /var/run/gluster as pid file working directory >> [2019-01-16 14:50:14.563834] W [MSGID: 103071] >> [rdma.c:4629:__gf_rdma_ctx_create] 0-rpc-transport/rdma: rdma_cm event >> channel creation failed [No such device] >> [2019-01-16 14:50:14.563867] W [MSGID: 103055] [rdma.c:4938:init] >> 0-rdma.management: Failed to initialize IB Device >> [2019-01-16 14:50:14.563882] W [rpc-transport.c:351:rpc_transport_load] >> 0-rpc-transport: 'rdma' initialization failed >> [2019-01-16 14:50:14.563957] W [rpcsvc.c:1781:rpcsvc_create_listener] >> 0-rpc-service: cannot create listener, initing the transport failed >> [2019-01-16 14:50:14.563974] E [MSGID: 106244] [glusterd.c:1764:init] >> 0-management: creation of 1 listeners failed, continuing with succeeded >> transport >> [2019-01-16 14:50:15.565868] I [MSGID: 106513] >> [glusterd-store.c:2240:glusterd_restore_op_version] 0-glusterd: retrieved >> op-version: 40100 >> [2019-01-16 14:50:15.642532] I [MSGID: 106544] >> [glusterd.c:158:glusterd_uuid_init] 0-management: retrieved UUID: >> d6bf51a7-c296-492f-8dac-e81efa9dd22d >> [2019-01-16 14:50:15.675333] I [MSGID: 106498] >> [glusterd-handler.c:3614:glusterd_friend_add_from_peerinfo] 0-management: >> connect returned 
0 >> [2019-01-16 14:50:15.675421] W [MSGID: 106061] >> [glusterd-handler.c:3408:glusterd_transport_inet_options_build] 0-glusterd: >> Failed to get tcp-user-timeout >> [2019-01-16 14:50:15.675451] I [rpc-clnt.c:1059:rpc_clnt_connection_init] >> 0-management: setting frame-timeout to 600 >> *[2019-01-16 14:50:15.676912] E [MSGID: 106187] >> [glusterd-store.c:4662:glusterd_resolve_all_bricks] 0-glusterd: resolve >> brick failed in restore* >> *[2019-01-16 14:50:15.676956] E [MSGID: 101019] >> [xlator.c:720:xlator_init] 0-management: Initialization of volume >> 'management' failed, review your volfile again* >> [2019-01-16 14:50:15.676973] E [MSGID: 101066] >> [graph.c:367:glusterfs_graph_init] 0-management: initializing translator >> failed >> [2019-01-16 14:50:15.676986] E [MSGID: 101176] >> [graph.c:738:glusterfs_graph_activate] 0-graph: init failed >> [2019-01-16 14:50:15.677479] W [glusterfsd.c:1514:cleanup_and_exit] >> (-->/usr/local/sbin/glusterd(glusterfs_volumes_init+0xc2) [0x409f52] >> -->/usr/local/sbin/glusterd(glusterfs_process_volfp+0x151) [0x409e41] >> -->/usr/local/sbin/glusterd(cleanup_and_exit+0x5f) [0x40942f] ) 0-: >> received signum (-1), shutting down >> >> >> On Thu, Jan 17, 2019 at 8:06 AM Atin Mukherjee >> wrote: >> >>> If gluster volume info/status shows the brick to be /media/disk4/brick4 >>> then you'd need to mount the same path and hence you'd need to create the >>> brick4 directory explicitly. I fail to understand the rationale how only >>> /media/disk4 can be used as the mount path for the brick. >>> >>> On Wed, Jan 16, 2019 at 5:24 PM Amudhan P wrote: >>> >>>> Yes, I did mount bricks but the folder 'brick4' was still not created >>>> inside the brick. >>>> Do I need to create this folder because when I run replace-brick it >>>> will create folder inside the brick. I have seen this behavior before when >>>> running replace-brick or heal begins. 
>>>> >>>> On Wed, Jan 16, 2019 at 5:05 PM Atin Mukherjee >>>> wrote: >>>> >>>>> >>>>> >>>>> On Wed, Jan 16, 2019 at 5:02 PM Amudhan P wrote: >>>>> >>>>>> Atin, >>>>>> I have copied the content of 'gfs-tst' from vol folder in another >>>>>> node. when starting service again fails with error msg in glusterd.log file. >>>>>> >>>>>> [2019-01-15 20:16:59.513023] I [MSGID: 100030] >>>>>> [glusterfsd.c:2741:main] 0-/usr/local/sbin/glusterd: Started running >>>>>> /usr/local/sbin/glusterd version 4.1.6 (args: /usr/local/sbin/glusterd -p >>>>>> /var/run/glusterd.pid) >>>>>> [2019-01-15 20:16:59.517164] I [MSGID: 106478] [glusterd.c:1423:init] >>>>>> 0-management: Maximum allowed open file descriptors set to 65536 >>>>>> [2019-01-15 20:16:59.517264] I [MSGID: 106479] [glusterd.c:1481:init] >>>>>> 0-management: Using /var/lib/glusterd as working directory >>>>>> [2019-01-15 20:16:59.517283] I [MSGID: 106479] [glusterd.c:1486:init] >>>>>> 0-management: Using /var/run/gluster as pid file working directory >>>>>> [2019-01-15 20:16:59.521508] W [MSGID: 103071] >>>>>> [rdma.c:4629:__gf_rdma_ctx_create] 0-rpc-transport/rdma: rdma_cm event >>>>>> channel creation failed [No such device] >>>>>> [2019-01-15 20:16:59.521544] W [MSGID: 103055] [rdma.c:4938:init] >>>>>> 0-rdma.management: Failed to initialize IB Device >>>>>> [2019-01-15 20:16:59.521562] W >>>>>> [rpc-transport.c:351:rpc_transport_load] 0-rpc-transport: 'rdma' >>>>>> initialization failed >>>>>> [2019-01-15 20:16:59.521629] W [rpcsvc.c:1781:rpcsvc_create_listener] >>>>>> 0-rpc-service: cannot create listener, initing the transport failed >>>>>> [2019-01-15 20:16:59.521648] E [MSGID: 106244] [glusterd.c:1764:init] >>>>>> 0-management: creation of 1 listeners failed, continuing with succeeded >>>>>> transport >>>>>> [2019-01-15 20:17:00.529390] I [MSGID: 106513] >>>>>> [glusterd-store.c:2240:glusterd_restore_op_version] 0-glusterd: retrieved >>>>>> op-version: 40100 >>>>>> [2019-01-15 20:17:00.608354] I [MSGID: 106544] 
>>>>>> [glusterd.c:158:glusterd_uuid_init] 0-management: retrieved UUID: >>>>>> d6bf51a7-c296-492f-8dac-e81efa9dd22d >>>>>> [2019-01-15 20:17:00.650911] W [MSGID: 106425] >>>>>> [glusterd-store.c:2643:glusterd_store_retrieve_bricks] 0-management: failed >>>>>> to get statfs() call on brick /media/disk4/brick4 [No such file or >>>>>> directory] >>>>>> >>>>> >>>>> This means that underlying brick /media/disk4/brick4 doesn't exist. >>>>> You already mentioned that you had replaced the faulty disk, but have you >>>>> not mounted it yet? >>>>> >>>>> >>>>>> [2019-01-15 20:17:00.691240] I [MSGID: 106498] >>>>>> [glusterd-handler.c:3614:glusterd_friend_add_from_peerinfo] 0-management: >>>>>> connect returned 0 >>>>>> [2019-01-15 20:17:00.691307] W [MSGID: 106061] >>>>>> [glusterd-handler.c:3408:glusterd_transport_inet_options_build] 0-glusterd: >>>>>> Failed to get tcp-user-timeout >>>>>> [2019-01-15 20:17:00.691331] I >>>>>> [rpc-clnt.c:1059:rpc_clnt_connection_init] 0-management: setting >>>>>> frame-timeout to 600 >>>>>> [2019-01-15 20:17:00.692547] E [MSGID: 106187] >>>>>> [glusterd-store.c:4662:glusterd_resolve_all_bricks] 0-glusterd: resolve >>>>>> brick failed in restore >>>>>> [2019-01-15 20:17:00.692582] E [MSGID: 101019] >>>>>> [xlator.c:720:xlator_init] 0-management: Initialization of volume >>>>>> 'management' failed, review your volfile again >>>>>> [2019-01-15 20:17:00.692597] E [MSGID: 101066] >>>>>> [graph.c:367:glusterfs_graph_init] 0-management: initializing translator >>>>>> failed >>>>>> [2019-01-15 20:17:00.692607] E [MSGID: 101176] >>>>>> [graph.c:738:glusterfs_graph_activate] 0-graph: init failed >>>>>> [2019-01-15 20:17:00.693004] W [glusterfsd.c:1514:cleanup_and_exit] >>>>>> (-->/usr/local/sbin/glusterd(glusterfs_volumes_init+0xc2) [0x409f52] >>>>>> -->/usr/local/sbin/glusterd(glusterfs_process_volfp+0x151) [0x409e41] >>>>>> -->/usr/local/sbin/glusterd(cleanup_and_exit+0x5f) [0x40942f] ) 0-: >>>>>> received signum (-1), shutting down >>>>>> >>>>>> 
>>>>>> On Wed, Jan 16, 2019 at 4:34 PM Atin Mukherjee >>>>>> wrote: >>>>>> >>>>>>> This is a case of partial write of a transaction and as the host ran >>>>>>> out of space for the root partition where all the glusterd related >>>>>>> configurations are persisted, the transaction couldn't be written and hence >>>>>>> the new (replaced) brick's information wasn't persisted in the >>>>>>> configuration. The workaround for this is to copy the content of >>>>>>> /var/lib/glusterd/vols/gfs-tst/ from one of the nodes in the trusted >>>>>>> storage pool to the node where glusterd service fails to come up and post >>>>>>> that restarting the glusterd service should be able to make peer status >>>>>>> reporting all nodes healthy and connected. >>>>>>> >>>>>>> On Wed, Jan 16, 2019 at 3:49 PM Amudhan P >>>>>>> wrote: >>>>>>> >>>>>>>> Hi, >>>>>>>> >>>>>>>> In short, when I started glusterd service I am getting following >>>>>>>> error msg in the glusterd.log file in one server. >>>>>>>> what needs to be done? 
>>>>>>>> >>>>>>>> error logged in glusterd.log >>>>>>>> >>>>>>>> [2019-01-15 17:50:13.956053] I [MSGID: 100030] >>>>>>>> [glusterfsd.c:2741:main] 0-/usr/local/sbin/glusterd: Started running >>>>>>>> /usr/local/sbin/glusterd version 4.1.6 (args: /usr/local/sbin/glusterd -p >>>>>>>> /var/run/glusterd.pid) >>>>>>>> [2019-01-15 17:50:13.960131] I [MSGID: 106478] >>>>>>>> [glusterd.c:1423:init] 0-management: Maximum allowed open file descriptors >>>>>>>> set to 65536 >>>>>>>> [2019-01-15 17:50:13.960193] I [MSGID: 106479] >>>>>>>> [glusterd.c:1481:init] 0-management: Using /var/lib/glusterd as working >>>>>>>> directory >>>>>>>> [2019-01-15 17:50:13.960212] I [MSGID: 106479] >>>>>>>> [glusterd.c:1486:init] 0-management: Using /var/run/gluster as pid file >>>>>>>> working directory >>>>>>>> [2019-01-15 17:50:13.964437] W [MSGID: 103071] >>>>>>>> [rdma.c:4629:__gf_rdma_ctx_create] 0-rpc-transport/rdma: rdma_cm event >>>>>>>> channel creation failed [No such device] >>>>>>>> [2019-01-15 17:50:13.964474] W [MSGID: 103055] [rdma.c:4938:init] >>>>>>>> 0-rdma.management: Failed to initialize IB Device >>>>>>>> [2019-01-15 17:50:13.964491] W >>>>>>>> [rpc-transport.c:351:rpc_transport_load] 0-rpc-transport: 'rdma' >>>>>>>> initialization failed >>>>>>>> [2019-01-15 17:50:13.964560] W >>>>>>>> [rpcsvc.c:1781:rpcsvc_create_listener] 0-rpc-service: cannot create >>>>>>>> listener, initing the transport failed >>>>>>>> [2019-01-15 17:50:13.964579] E [MSGID: 106244] >>>>>>>> [glusterd.c:1764:init] 0-management: creation of 1 listeners failed, >>>>>>>> continuing with succeeded transport >>>>>>>> [2019-01-15 17:50:14.967681] I [MSGID: 106513] >>>>>>>> [glusterd-store.c:2240:glusterd_restore_op_version] 0-glusterd: retrieved >>>>>>>> op-version: 40100 >>>>>>>> [2019-01-15 17:50:14.973931] I [MSGID: 106544] >>>>>>>> [glusterd.c:158:glusterd_uuid_init] 0-management: retrieved UUID: >>>>>>>> d6bf51a7-c296-492f-8dac-e81efa9dd22d >>>>>>>> [2019-01-15 17:50:15.046620] E [MSGID: 101032] 
>>>>>>>> [store.c:441:gf_store_handle_retrieve] 0-: Path corresponding to >>>>>>>> /var/lib/glusterd/vols/gfs-tst/bricks/IP.3:-media-disk3-brick3. [No such >>>>>>>> file or directory] >>>>>>>> [2019-01-15 17:50:15.046685] E [MSGID: 106201] >>>>>>>> [glusterd-store.c:3384:glusterd_store_retrieve_volumes] 0-management: >>>>>>>> Unable to restore volume: gfs-tst >>>>>>>> [2019-01-15 17:50:15.046718] E [MSGID: 101019] >>>>>>>> [xlator.c:720:xlator_init] 0-management: Initialization of volume >>>>>>>> 'management' failed, review your volfile again >>>>>>>> [2019-01-15 17:50:15.046732] E [MSGID: 101066] >>>>>>>> [graph.c:367:glusterfs_graph_init] 0-management: initializing translator >>>>>>>> failed >>>>>>>> [2019-01-15 17:50:15.046741] E [MSGID: 101176] >>>>>>>> [graph.c:738:glusterfs_graph_activate] 0-graph: init failed >>>>>>>> [2019-01-15 17:50:15.047171] W [glusterfsd.c:1514:cleanup_and_exit] >>>>>>>> (-->/usr/local/sbin/glusterd(glusterfs_volumes >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> In long, I am trying to simulate a situation. where volume stoped >>>>>>>> abnormally and >>>>>>>> entire cluster restarted with some missing disks. >>>>>>>> >>>>>>>> My test cluster is set up with 3 nodes and each has four disks, I >>>>>>>> have setup a volume with disperse 4+2. >>>>>>>> In Node-3 2 disks have failed, to replace I have shutdown all system >>>>>>>> >>>>>>>> below are the steps done. >>>>>>>> >>>>>>>> 1. umount from client machine >>>>>>>> 2. shutdown all system by running `shutdown -h now` command ( >>>>>>>> without stopping volume and stop service) >>>>>>>> 3. replace faulty disk in Node-3 >>>>>>>> 4. powered ON all system >>>>>>>> 5. format replaced drives, and mount all drives >>>>>>>> 6. start glusterd service in all node (success) >>>>>>>> 7. Now running `voulume status` command from node-3 >>>>>>>> output : [2019-01-15 16:52:17.718422] : v status : FAILED : >>>>>>>> Staging failed on 0083ec0c-40bf-472a-a128-458924e56c96. Please check log >>>>>>>> file for details. 
>>>>>>>> 8. running `voulume start gfs-tst` command from node-3 >>>>>>>> output : [2019-01-15 16:53:19.410252] : v start gfs-tst : FAILED : >>>>>>>> Volume gfs-tst already started >>>>>>>> >>>>>>>> 9. running `gluster v status` in other node. showing all brick >>>>>>>> available but 'self-heal daemon' not running >>>>>>>> @gfstst-node2:~$ sudo gluster v status >>>>>>>> Status of volume: gfs-tst >>>>>>>> Gluster process TCP Port RDMA Port >>>>>>>> Online Pid >>>>>>>> >>>>>>>> ------------------------------------------------------------------------------ >>>>>>>> Brick IP.2:/media/disk1/brick1 49152 0 Y >>>>>>>> 1517 >>>>>>>> Brick IP.4:/media/disk1/brick1 49152 0 Y >>>>>>>> 1668 >>>>>>>> Brick IP.2:/media/disk2/brick2 49153 0 Y >>>>>>>> 1522 >>>>>>>> Brick IP.4:/media/disk2/brick2 49153 0 Y >>>>>>>> 1678 >>>>>>>> Brick IP.2:/media/disk3/brick3 49154 0 Y >>>>>>>> 1527 >>>>>>>> Brick IP.4:/media/disk3/brick3 49154 0 Y >>>>>>>> 1677 >>>>>>>> Brick IP.2:/media/disk4/brick4 49155 0 Y >>>>>>>> 1541 >>>>>>>> Brick IP.4:/media/disk4/brick4 49155 0 Y >>>>>>>> 1683 >>>>>>>> Self-heal Daemon on localhost N/A N/A Y >>>>>>>> 2662 >>>>>>>> Self-heal Daemon on IP.4 N/A N/A Y >>>>>>>> 2786 >>>>>>>> >>>>>>>> 10. in the above output 'volume already started'. so, running >>>>>>>> `reset-brick` command >>>>>>>> v reset-brick gfs-tst IP.3:/media/disk3/brick3 >>>>>>>> IP.3:/media/disk3/brick3 commit force >>>>>>>> >>>>>>>> output : [2019-01-15 16:57:37.916942] : v reset-brick gfs-tst >>>>>>>> IP.3:/media/disk3/brick3 IP.3:/media/disk3/brick3 commit force : FAILED : >>>>>>>> /media/disk3/brick3 is already part of a volume >>>>>>>> >>>>>>>> 11. reset-brick command was not working, so, tried stopping volume >>>>>>>> and start with force command >>>>>>>> output : [2019-01-15 17:01:04.570794] : v start gfs-tst force : >>>>>>>> FAILED : Pre-validation failed on localhost. Please check log file for >>>>>>>> details >>>>>>>> >>>>>>>> 12. now stopped service in all node and tried starting again. 
>>>>>>>> except node-3 other nodes service started successfully without any issues. >>>>>>>> >>>>>>>> in node-3 receiving following message. >>>>>>>> >>>>>>>> sudo service glusterd start >>>>>>>> * Starting glusterd service glusterd >>>>>>>> >>>>>>>> [fail] >>>>>>>> /usr/local/sbin/glusterd: option requires an argument -- 'f' >>>>>>>> Try `glusterd --help' or `glusterd --usage' for more information. >>>>>>>> >>>>>>>> 13. checking glusterd log file found that OS drive was running out >>>>>>>> of space >>>>>>>> output : [2019-01-15 16:51:37.210792] W [MSGID: 101012] >>>>>>>> [store.c:372:gf_store_save_value] 0-management: fflush failed. [No space >>>>>>>> left on device] >>>>>>>> [2019-01-15 16:51:37.210874] E [MSGID: 106190] >>>>>>>> [glusterd-store.c:1058:glusterd_volume_exclude_options_write] 0-management: >>>>>>>> Unable to write volume values for gfs-tst >>>>>>>> >>>>>>>> 14. cleared some space in OS drive but still, service is not >>>>>>>> running. below is the error logged in glusterd.log >>>>>>>> >>>>>>>> [2019-01-15 17:50:13.956053] I [MSGID: 100030] >>>>>>>> [glusterfsd.c:2741:main] 0-/usr/local/sbin/glusterd: Started running >>>>>>>> /usr/local/sbin/glusterd version 4.1.6 (args: /usr/local/sbin/glusterd -p >>>>>>>> /var/run/glusterd.pid) >>>>>>>> [2019-01-15 17:50:13.960131] I [MSGID: 106478] >>>>>>>> [glusterd.c:1423:init] 0-management: Maximum allowed open file descriptors >>>>>>>> set to 65536 >>>>>>>> [2019-01-15 17:50:13.960193] I [MSGID: 106479] >>>>>>>> [glusterd.c:1481:init] 0-management: Using /var/lib/glusterd as working >>>>>>>> directory >>>>>>>> [2019-01-15 17:50:13.960212] I [MSGID: 106479] >>>>>>>> [glusterd.c:1486:init] 0-management: Using /var/run/gluster as pid file >>>>>>>> working directory >>>>>>>> [2019-01-15 17:50:13.964437] W [MSGID: 103071] >>>>>>>> [rdma.c:4629:__gf_rdma_ctx_create] 0-rpc-transport/rdma: rdma_cm event >>>>>>>> channel creation failed [No such device] >>>>>>>> [2019-01-15 17:50:13.964474] W [MSGID: 103055] 
[rdma.c:4938:init] 0-rdma.management: Failed to initialize IB Device
>>>>>>>> [2019-01-15 17:50:13.964491] W [rpc-transport.c:351:rpc_transport_load] 0-rpc-transport: 'rdma' initialization failed
>>>>>>>> [2019-01-15 17:50:13.964560] W [rpcsvc.c:1781:rpcsvc_create_listener] 0-rpc-service: cannot create listener, initing the transport failed
>>>>>>>> [2019-01-15 17:50:13.964579] E [MSGID: 106244] [glusterd.c:1764:init] 0-management: creation of 1 listeners failed, continuing with succeeded transport
>>>>>>>> [2019-01-15 17:50:14.967681] I [MSGID: 106513] [glusterd-store.c:2240:glusterd_restore_op_version] 0-glusterd: retrieved op-version: 40100
>>>>>>>> [2019-01-15 17:50:14.973931] I [MSGID: 106544] [glusterd.c:158:glusterd_uuid_init] 0-management: retrieved UUID: d6bf51a7-c296-492f-8dac-e81efa9dd22d
>>>>>>>> [2019-01-15 17:50:15.046620] E [MSGID: 101032] [store.c:441:gf_store_handle_retrieve] 0-: Path corresponding to /var/lib/glusterd/vols/gfs-tst/bricks/IP.3:-media-disk3-brick3.
[No such file or directory]
>>>>>>>> [2019-01-15 17:50:15.046685] E [MSGID: 106201] [glusterd-store.c:3384:glusterd_store_retrieve_volumes] 0-management: Unable to restore volume: gfs-tst
>>>>>>>> [2019-01-15 17:50:15.046718] E [MSGID: 101019] [xlator.c:720:xlator_init] 0-management: Initialization of volume 'management' failed, review your volfile again
>>>>>>>> [2019-01-15 17:50:15.046732] E [MSGID: 101066] [graph.c:367:glusterfs_graph_init] 0-management: initializing translator failed
>>>>>>>> [2019-01-15 17:50:15.046741] E [MSGID: 101176] [graph.c:738:glusterfs_graph_activate] 0-graph: init failed
>>>>>>>> [2019-01-15 17:50:15.047171] W [glusterfsd.c:1514:cleanup_and_exit] (-->/usr/local/sbin/glusterd(glusterfs_volumes_init+0xc2) [0x409f52] -->/usr/local/sbin/glusterd(glusterfs_process_volfp+0x151) [0x409e41] -->/usr/local/sbin/glusterd(cleanup_and_exit+0x5f) [0x40942f] ) 0-: received signum (-1), shutting down
>>>>>>>>
>>>>>>>>
>>>>>>>> 15.
In other node running `volume status' still shows bricks node3 is live
>>>>>>>> but 'peer status' showing node-3 disconnected
>>>>>>>>
>>>>>>>> @gfstst-node2:~$ sudo gluster v status
>>>>>>>> Status of volume: gfs-tst
>>>>>>>> Gluster process                  TCP Port  RDMA Port  Online  Pid
>>>>>>>> ------------------------------------------------------------------------------
>>>>>>>> Brick IP.2:/media/disk1/brick1   49152     0          Y       1517
>>>>>>>> Brick IP.4:/media/disk1/brick1   49152     0          Y       1668
>>>>>>>> Brick IP.2:/media/disk2/brick2   49153     0          Y       1522
>>>>>>>> Brick IP.4:/media/disk2/brick2   49153     0          Y       1678
>>>>>>>> Brick IP.2:/media/disk3/brick3   49154     0          Y       1527
>>>>>>>> Brick IP.4:/media/disk3/brick3   49154     0          Y       1677
>>>>>>>> Brick IP.2:/media/disk4/brick4   49155     0          Y       1541
>>>>>>>> Brick IP.4:/media/disk4/brick4   49155     0          Y       1683
>>>>>>>> Self-heal Daemon on localhost    N/A       N/A        Y       2662
>>>>>>>> Self-heal Daemon on IP.4         N/A       N/A        Y       2786
>>>>>>>>
>>>>>>>> Task Status of Volume gfs-tst
>>>>>>>> ------------------------------------------------------------------------------
>>>>>>>> There are no active volume tasks
>>>>>>>>
>>>>>>>>
>>>>>>>> root at gfstst-node2:~$ sudo gluster pool list
>>>>>>>> UUID                                  Hostname   State
>>>>>>>> d6bf51a7-c296-492f-8dac-e81efa9dd22d  IP.3       Disconnected
>>>>>>>> c1cbb58e-3ceb-4637-9ba3-3d28ef20b143  IP.4       Connected
>>>>>>>> 0083ec0c-40bf-472a-a128-458924e56c96  localhost  Connected
>>>>>>>>
>>>>>>>> root at gfstst-node2:~$ sudo gluster peer status
>>>>>>>> Number of Peers: 2
>>>>>>>>
>>>>>>>> Hostname: IP.3
>>>>>>>> Uuid: d6bf51a7-c296-492f-8dac-e81efa9dd22d
>>>>>>>> State: Peer in Cluster (Disconnected)
>>>>>>>>
>>>>>>>> Hostname: IP.4
>>>>>>>> Uuid: c1cbb58e-3ceb-4637-9ba3-3d28ef20b143
>>>>>>>> State: Peer in Cluster (Connected)
>>>>>>>>
>>>>>>>>
>>>>>>>> regards
>>>>>>>> Amudhan
>>>>>>>> _______________________________________________
>>>>>>>> Gluster-users mailing
list
>>>>>>>> Gluster-users at gluster.org
>>>>>>>> https://lists.gluster.org/mailman/listinfo/gluster-users

From gilberto.nunes32 at gmail.com Wed Jan 23 20:05:42 2019
From: gilberto.nunes32 at gmail.com (Gilberto Nunes)
Date: Wed, 23 Jan 2019 18:05:42 -0200
Subject: [Gluster-users] Access to Servers hangs after stop one server...
Message-ID:

Hi there...

I have set up two servers as a replica, like this:

gluster vol create Vol01 server1:/data/storage server2:/data/storage

Then I created a config file on the client, like this:

volume remote1
  type protocol/client
  option transport-type tcp
  option remote-host server1
  option remote-subvolume /data/storage
end-volume

volume remote2
  type protocol/client
  option transport-type tcp
  option remote-host server2
  option remote-subvolume /data/storage
end-volume

volume replicate
  type cluster/replicate
  subvolumes remote1 remote2
end-volume

volume writebehind
  type performance/write-behind
  option window-size 1MB
  subvolumes replicate
end-volume

volume cache
  type performance/io-cache
  option cache-size 512MB
  subvolumes writebehind
end-volume

And added this line to /etc/fstab:

/etc/glusterfs/datastore.vol /mnt glusterfs defaults,_netdev 0 0

After mounting /mnt, I can access the servers. So far so good!
But when I make server1 crash, I am unable to access /mnt or even use gluster vol status on server2.

Everything hangs!

I have tried with replicated, distributed and replicated-distributed too.
I am using Debian Stretch, with the gluster package installed via apt from the standard Debian repo, glusterfs-server 3.8.8-1.

I am sorry if this is a newbie question, but isn't a glusterfs share supposed to stay online if one server goes down?

Any advice will be welcome.

Best
---
Gilberto Nunes Ferreira

(47) 3025-5907
(47) 99676-7530 - Whatsapp / Telegram

Skype: gilberto.nunes36
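The client volfile in the message above follows a regular pattern: one protocol/client stanza per server, then a cluster/replicate stanza that lists them. A minimal sketch that renders that part of the file, assuming the same host names and brick path as in the message (the function name `render_client_volfile` is made up for illustration, and the write-behind and io-cache stanzas are left out):

```python
def render_client_volfile(servers, brick_path):
    """Return volfile text: one protocol/client stanza per server, then a replicate stanza."""
    stanzas = []
    names = []
    for index, host in enumerate(servers, start=1):
        name = f"remote{index}"  # remote1, remote2, ... as in the message
        names.append(name)
        stanzas.append("\n".join([
            f"volume {name}",
            "  type protocol/client",
            "  option transport-type tcp",
            f"  option remote-host {host}",
            f"  option remote-subvolume {brick_path}",
            "end-volume",
        ]))
    # The replicate translator sits on top of all client stanzas.
    stanzas.append("\n".join([
        "volume replicate",
        "  type cluster/replicate",
        f"  subvolumes {' '.join(names)}",
        "end-volume",
    ]))
    return "\n\n".join(stanzas) + "\n"


if __name__ == "__main__":
    print(render_client_volfile(["server1", "server2"], "/data/storage"), end="")
```

This only reproduces the layout of the hand-written file; it says nothing about whether mounting from a hand-written volfile is the right approach for this setup.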
From meira at cesup.ufrgs.br Wed Jan 23 21:31:00 2019
From: meira at cesup.ufrgs.br (Lindolfo Meira)
Date: Wed, 23 Jan 2019 19:31:00 -0200 (-02)
Subject: [Gluster-users] Can't write to volume using vim/nano
Message-ID:

Am I missing something here? A mere write operation, using vim or nano, cannot be performed on a gluster volume mounted over fuse! What gives?

Lindolfo Meira, MSc
Diretor Geral, Centro Nacional de Supercomputação
Universidade Federal do Rio Grande do Sul
+55 (51) 3308-3139

From jim.kinney at gmail.com Wed Jan 23 21:44:17 2019
From: jim.kinney at gmail.com (Jim Kinney)
Date: Wed, 23 Jan 2019 16:44:17 -0500
Subject: [Gluster-users] Can't write to volume using vim/nano
In-Reply-To:
References:
Message-ID: <959c36a43c4353b869fd40468b6b95ab17c143b3.camel@gmail.com>

Check permissions on the mount. I have multiple dozens of systems mounting 18 "exports" using fuse and it works for multiple user read/write based on user access permissions to the mount point space. /home is mounted for 150+ users plus another dozen+ lab storage spaces. I do manage user access with freeIPA across all systems to keep things consistent.

--
James P. Kinney III

Every time you stop a school, you will have to build a jail. What you gain at one end you lose at the other. It's like feeding a dog on his own tail. It won't fatten the dog.
- Speech 11/23/1900 Mark Twain

http://heretothereideas.blogspot.com/
From meira at cesup.ufrgs.br Wed Jan 23 23:19:20 2019
From: meira at cesup.ufrgs.br (Lindolfo Meira)
Date: Wed, 23 Jan 2019 21:19:20 -0200 (-02)
Subject: [Gluster-users] Can't write to volume using vim/nano
In-Reply-To: <959c36a43c4353b869fd40468b6b95ab17c143b3.camel@gmail.com>
References: <959c36a43c4353b869fd40468b6b95ab17c143b3.camel@gmail.com>
Message-ID:

Hi Jim. Thanks for taking the time.

Sorry I didn't express myself properly. It's not a simple matter of permissions. Users can write to the volume all right. It's when vim and nano are used, or when small file writes are performed (by cat or echo), that it doesn't work. The file is updated with the write on the server, but it shows up as empty on the client.

I guess it has something to do with the size of the write, because I ran a test writing to a file one byte at a time, and it never showed up as having any content on the client (although on the server it kept growing accordingly).

I should point out that I'm using a sharded volume. But when I was testing a striped volume, it also happened. Output of "gluster volume info" follows below:

Volume Name: gfs
Type: Distribute
Volume ID: b5ef065f-1ba2-481f-8108-e8f6d2d3f036
Status: Started
Snapshot Count: 0
Number of Bricks: 6
Transport-type: rdma
Bricks:
Brick1: pfs01-ib:/mnt/data
Brick2: pfs02-ib:/mnt/data
Brick3: pfs03-ib:/mnt/data
Brick4: pfs04-ib:/mnt/data
Brick5: pfs05-ib:/mnt/data
Brick6: pfs06-ib:/mnt/data
Options Reconfigured:
nfs.disable: on
features.shard: on

Lindolfo Meira, MSc
Diretor Geral, Centro Nacional de Supercomputação
Universidade Federal do Rio Grande do Sul
+55 (51) 3308-3139
From meira at cesup.ufrgs.br Wed Jan 23 23:39:00 2019
From: meira at cesup.ufrgs.br (Lindolfo Meira)
Date: Wed, 23 Jan 2019 21:39:00 -0200 (-02)
Subject: [Gluster-users] Can't write to volume using vim/nano
In-Reply-To:
References: <959c36a43c4353b869fd40468b6b95ab17c143b3.camel@gmail.com>
Message-ID:

Just checked: when the write is >= 340 bytes, everything works as expected. If the write is smaller, the error occurs. And when it does, nothing is logged on the server.
The client, however, logs the following:

[2019-01-23 23:28:54.554664] W [MSGID: 103046] [rdma.c:3502:gf_rdma_decode_header] 0-rpc-transport/rdma: received a msg of type RDMA_ERROR

[2019-01-23 23:28:54.554728] W [MSGID: 103046] [rdma.c:3939:gf_rdma_process_recv] 0-rpc-transport/rdma: peer (172.24.1.6:49152), couldn't encode or decode the msg properly or write chunks were not provided for replies that were bigger than RDMA_INLINE_THRESHOLD (2048)

[2019-01-23 23:28:54.554765] W [MSGID: 114031] [client-rpc-fops_v2.c:680:client4_0_writev_cbk] 0-gfs-client-5: remote operation failed [Transport endpoint is not connected]

[2019-01-23 23:28:54.554850] W [fuse-bridge.c:1436:fuse_err_cbk] 0-glusterfs-fuse: 1723199: FLUSH() ERR => -1 (Transport endpoint is not connected)

Lindolfo Meira, MSc
Diretor Geral, Centro Nacional de Supercomputação
Universidade Federal do Rio Grande do Sul
+55 (51) 3308-3139
From meira at cesup.ufrgs.br Wed Jan 23 23:46:11 2019
From: meira at cesup.ufrgs.br (Lindolfo Meira)
Date: Wed, 23 Jan 2019 21:46:11 -0200 (-02)
Subject: [Gluster-users] Can't write to volume using vim/nano
In-Reply-To:
References: <959c36a43c4353b869fd40468b6b95ab17c143b3.camel@gmail.com>
Message-ID:

Also, I noticed that any subsequent write (after the first write of 340 bytes or more), regardless of its size, works as expected.

Lindolfo Meira, MSc
Diretor Geral, Centro Nacional de Supercomputação
Universidade Federal do Rio Grande do Sul
+55 (51) 3308-3139
From jim.kinney at gmail.com Thu Jan 24 00:00:05 2019
From: jim.kinney at gmail.com (Jim Kinney)
Date: Wed, 23 Jan 2019 19:00:05 -0500
Subject: [Gluster-users] Can't write to volume using vim/nano
In-Reply-To:
References: <959c36a43c4353b869fd40468b6b95ab17c143b3.camel@gmail.com>
Message-ID:

That really sounds like a bug with the sharding. I'm not using sharding on my setup and files are writeable (vim) with 2 bytes and no errors occur. Perhaps the small size is cached until it's large enough to trigger a write.
-- James P.
Kinney III

Every time you stop a school, you will have to build a jail. What you gain at one end you lose at the other. It's like feeding a dog on his own tail. It won't fatten the dog.
- Speech 11/23/1900 Mark Twain

http://heretothereideas.blogspot.com/

From atumball at redhat.com Thu Jan 24 05:54:26 2019
From: atumball at redhat.com (Amar Tumballi Suryanarayan)
Date: Thu, 24 Jan 2019 11:24:26 +0530
Subject: [Gluster-users] Can't write to volume using vim/nano
In-Reply-To:
References: <959c36a43c4353b869fd40468b6b95ab17c143b3.camel@gmail.com>
Message-ID:

I suspect this is a bug in the 'Transport: rdma' part. We have called for de-scoping that feature, as we are lacking experts in that domain right now. I recommend using the IPoIB option with the tcp/socket transport type (which is the default). That should fix most of the issues.

-Amar
--
Amar Tumballi (amarts)

From shaik.salam at tcs.com Thu Jan 24 05:58:05 2019
From: shaik.salam at tcs.com (Shaik Salam)
Date: Thu, 24 Jan 2019 11:28:05 +0530
Subject: [Gluster-users] Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: create volume err: error creating volume
In-Reply-To:
References:
Message-ID:

Hi Surya,

Could you please help us resolve the issue below (or at least suggest a workaround for creating volumes)?
The db dump and log are attached. Please let me know if anything else needs to be checked.
Please guide us.

BR
Salam

From: Shaik Salam/HYD/TCS
To: "Amar Tumballi Suryanarayan" , bugs at gluster.org, "gluster-users at gluster.org List"
Cc: "Murali Kottakota" , "Sanju Rakonde"
Date: 01/23/2019 06:19 PM
Subject: Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: create volume err: error creating volume

Hi,

We are also facing the following issue on OpenShift Origin while creating PVCs for pods:

Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: create volume err: error creating volume
Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: create volume err: error creating volume Server busy. Retry operation later..
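The "Server busy. Retry operation later." text suggests the request was throttled rather than rejected outright, so a caller that wants to keep trying can wait between attempts. A minimal sketch of that idea, assuming a caller-supplied `create` function that raises a `ServerBusy` exception when the server answers busy (both names are hypothetical and not part of heketi or Kubernetes):

```python
import time


class ServerBusy(Exception):
    """Signals a 'server busy' answer; a hypothetical type, not a heketi one."""


def retry_when_busy(create, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call create() until it succeeds, doubling the wait after each busy answer."""
    for attempt in range(attempts):
        try:
            return create()
        except ServerBusy:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller see the failure
            sleep(base_delay * 2 ** attempt)


if __name__ == "__main__":
    answers = iter([ServerBusy(), ServerBusy(), "volume-created"])

    def fake_create():
        # Stand-in for a real volume create request.
        answer = next(answers)
        if isinstance(answer, Exception):
            raise answer
        return answer

    # Skip the real waiting in this demonstration.
    print(retry_when_busy(fake_create, sleep=lambda seconds: None))
```

Backing off only helps if the operations already in flight eventually finish; if they are stuck, retries will keep getting the same answer.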
Please find heketidb dump and log

[negroni] Completed 429 Too Many Requests in 250.763µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:07:49 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 169.08µs
[negroni] Started DELETE /volumes/520bc5f4e1bfd029855a72f9ca7ebf6c
[negroni] Completed 404 Not Found in 148.125µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:08:04 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 496.624µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:08:04 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 101.673µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:08:19 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 209.681µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:08:19 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 103.595µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:08:34 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 297.594µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:08:34 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 96.75µs
[negroni] Started POST /volumes
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:08:49 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 477.007µs
[heketi] WARNING 2019/01/23 12:08:49 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 165.38µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:09:04 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 488.253µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:09:04 operations in-flight (8) exceeds limit (8)
[negroni]
Completed 429 Too Many Requests in 171.836?s [negroni] Started POST /volumes [heketi] WARNING 2019/01/23 12:09:19 operations in-flight (8) exceeds limit (8) [negroni] Completed 429 Too Many Requests in 208.59?s [negroni] Started POST /volumes [heketi] WARNING 2019/01/23 12:09:19 operations in-flight (8) exceeds limit (8) [negroni] Completed 429 Too Many Requests in 125.141?s [negroni] Started DELETE /volumes/99e87ecd0a816ac34ae5a04eabc1d606 [negroni] Completed 404 Not Found in 138.687?s [negroni] Started POST /volumes BR Salam =====-----=====-----===== Notice: The information contained in this e-mail message and/or attachments to it may contain confidential or privileged information. If you are not the intended recipient, any dissemination, use, review, distribution, printing or copying of the information contained in this e-mail message and/or attachments to it are strictly prohibited. If you have received this communication in error, please notify us by reply e-mail or telephone and immediately and permanently delete the message and any attachments. Thank you -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: heketi-gluster.db.txt URL: -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: heketi.log.txt URL: From atumball at redhat.com Thu Jan 24 06:28:31 2019 From: atumball at redhat.com (Amar Tumballi Suryanarayan) Date: Thu, 24 Jan 2019 11:58:31 +0530 Subject: [Gluster-users] Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: create volume err: error creating volume In-Reply-To: References: Message-ID: Looks like the requests to create volume has increased significantly. Heketi can handle 8 parallel volume create requests, and looks like there are 429 volume create requests pending. I am not an expert in this. 
Added few more people in CC to help when they get to see this. -Amar On Thu, Jan 24, 2019 at 11:28 AM Shaik Salam wrote: > Hi Surya, > > Could you please help us to resolve below issue (at lease workaround for > creating volume) > Attached db dump and log. Please let me know any other things need to > check. > Please guide us. > > BR > Salam > > > > From: Shaik Salam/HYD/TCS > To: "Amar Tumballi Suryanarayan" , > bugs at gluster.org, "gluster-users at gluster.org List" < > gluster-users at gluster.org> > Cc: "Murali Kottakota" , "Sanju Rakonde" > > Date: 01/23/2019 06:19 PM > Subject: Failed to provision volume with StorageClass > "glusterfs-storage": glusterfs: create volume err: error creating volume > ------------------------------ > > > > > > Hi, > > We are facing also following issue on openshift origin while we are > creating pvc for pods. > > Failed to provision volume with StorageClass "glusterfs-storage": > glusterfs: create volume err: error creating volume > Failed to provision volume with StorageClass "glusterfs-storage": > glusterfs: create volume err: error creating volume Server busy. Retry > operation later.. 
> > BR > Salam > > -- Amar Tumballi (amarts) -------------- next part -------------- An HTML attachment was scrubbed... URL: From shaik.salam at tcs.com Thu Jan 24 06:38:29 2019 From: shaik.salam at tcs.com (Shaik Salam) Date: Thu, 24 Jan 2019 12:08:29 +0530 Subject: [Gluster-users] Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: create volume err: error creating volume In-Reply-To: References: Message-ID: Hi All, Could you please help us to resolve issue (atleast workaround). 429 volumes are not requested at all in cluster. I am trying to create only one volume at a time. 
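A note on reading the log quoted in this thread: in the `[negroni]` lines, 429 is the HTTP status code ("Too Many Requests") returned for each rejected call, while the number of operations Heketi is tracking appears in the `[heketi]` warning as `operations in-flight (8) exceeds limit (8)`. The following is a minimal Python sketch, assuming only the two line formats quoted in this thread; the function name `summarize_heketi_log` is hypothetical and not part of Heketi. It tallies the status codes and collects the in-flight/limit pairs so the two numbers are not confused:

```python
import re
from collections import Counter

# Patterns are taken from the log excerpt in this thread (an assumption about
# the general format); lines matching neither pattern are simply ignored.
COMPLETED = re.compile(r"\[negroni\] Completed (\d{3}) ")
IN_FLIGHT = re.compile(r"operations in-flight \((\d+)\) exceeds limit \((\d+)\)")


def summarize_heketi_log(lines):
    """Return (Counter of HTTP status codes, set of (in_flight, limit) pairs)."""
    statuses = Counter()
    in_flight = set()
    for line in lines:
        match = COMPLETED.search(line)
        if match:
            statuses[int(match.group(1))] += 1
        match = IN_FLIGHT.search(line)
        if match:
            in_flight.add((int(match.group(1)), int(match.group(2))))
    return statuses, in_flight


# A few lines copied from the excerpt above.
sample = [
    "[negroni] Started POST /volumes",
    "[heketi] WARNING 2019/01/23 12:07:49 operations in-flight (8) exceeds limit (8)",
    "[negroni] Completed 429 Too Many Requests in 169.08µs",
    "[negroni] Started DELETE /volumes/520bc5f4e1bfd029855a72f9ca7ebf6c",
    "[negroni] Completed 404 Not Found in 148.125µs",
]
statuses, in_flight = summarize_heketi_log(sample)
print(dict(statuses))
print(in_flight)
```

Run over the full excerpt, a tally like this would contain only 429 and 404 responses and a single in-flight/limit pair of 8 and 8, which matches the correction later in the thread that 8 operations, not 429, were pending.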
BR Salam From: "Amar Tumballi Suryanarayan" To: "Shaik Salam" Cc: "gluster-users at gluster.org List" , "Madhu Rajanna" , "Raghavendra Talur" , "Michael Adam" Date: 01/24/2019 11:59 AM Subject: Re: Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: create volume err: error creating volume "External email. Open with Caution" Looks like the requests to create volume has increased significantly. Heketi can handle 8 parallel volume create requests, and looks like there are 429 volume create requests pending. I am not an expert in this. Added few more people in CC to help when they get to see this. -Amar On Thu, Jan 24, 2019 at 11:28 AM Shaik Salam wrote: Hi Surya, Could you please help us to resolve below issue (at lease workaround for creating volume) Attached db dump and log. Please let me know any other things need to check. Please guide us. BR Salam From: Shaik Salam/HYD/TCS To: "Amar Tumballi Suryanarayan" , bugs at gluster.org, "gluster-users at gluster.org List" < gluster-users at gluster.org> Cc: "Murali Kottakota" , "Sanju Rakonde" Date: 01/23/2019 06:19 PM Subject: Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: create volume err: error creating volume Hi, We are facing also following issue on openshift origin while we are creating pvc for pods. Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: create volume err: error creating volume Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: create volume err: error creating volume Server busy. Retry operation later.. 
BR Salam -- Amar Tumballi (amarts) -------------- next part -------------- An HTML attachment was scrubbed... URL: From atumball at redhat.com Thu Jan 24 06:44:45 2019 From: atumball at redhat.com (Amar Tumballi Suryanarayan) Date: Thu, 24 Jan 2019 12:14:45 +0530 Subject: [Gluster-users] Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: create volume err: error creating volume In-Reply-To: References: Message-ID: On Thu, Jan 24, 2019 at 12:08 PM Shaik Salam wrote: > Hi All, > > Could you please help us to resolve issue (atleast workaround). > 429 volumes are not requested at all in cluster. I am trying to create > only one volume at a time. 
> > BR > Salam > > > > From: "Amar Tumballi Suryanarayan" > To: "Shaik Salam" > Cc: "gluster-users at gluster.org List" , > "Madhu Rajanna" , "Raghavendra Talur" < > rtalur at redhat.com>, "Michael Adam" > Date: 01/24/2019 11:59 AM > Subject: Re: Failed to provision volume with StorageClass > "glusterfs-storage": glusterfs: create volume err: error creating volume > ------------------------------ > > > > *"External email. Open with Caution"* > Looks like the requests to create volume has increased significantly. > Heketi can handle 8 parallel volume create requests, and looks like there > are 429 volume create requests pending. > > I was wrong. There seems to be just 8 requests pending. Not sure why its not proceeding though. -Amar > I am not an expert in this. Added few more people in CC to help when they > get to see this. > > -Amar > > On Thu, Jan 24, 2019 at 11:28 AM Shaik Salam <*shaik.salam at tcs.com* > > wrote: > Hi Surya, > > Could you please help us to resolve below issue (at lease workaround for > creating volume) > Attached db dump and log. Please let me know any other things need to > check. > Please guide us. > > BR > Salam > > > > From: Shaik Salam/HYD/TCS > To: "Amar Tumballi Suryanarayan" <*atumball at redhat.com* > >, *bugs at gluster.org* , " > *gluster-users at gluster.org* List" < > *gluster-users at gluster.org* > > Cc: "Murali Kottakota" <*murali.kottakota at tcs.com* > >, "Sanju Rakonde" <*srakonde at redhat.com* > > > Date: 01/23/2019 06:19 PM > Subject: Failed to provision volume with StorageClass > "glusterfs-storage": glusterfs: create volume err: error creating volume > ------------------------------ > > > > > > Hi, > > We are facing also following issue on openshift origin while we are > creating pvc for pods. 
> > Failed to provision volume with StorageClass "glusterfs-storage": > glusterfs: create volume err: error creating volume > Failed to provision volume with StorageClass "glusterfs-storage": > glusterfs: create volume err: error creating volume Server busy. Retry > operation later.. > > BR > Salam > > > > -- > Amar Tumballi (amarts) > -- Amar Tumballi (amarts) -------------- next part -------------- An HTML attachment was scrubbed... 
URL:

From shaik.salam at tcs.com Thu Jan 24 06:47:16 2019
From: shaik.salam at tcs.com (Shaik Salam)
Date: Thu, 24 Jan 2019 12:17:16 +0530
Subject: [Gluster-users] [Bugs] Bricks are going offline unable to recover with heal/start force commands
In-Reply-To:
References:
Message-ID:

Hi Sanju,

Could you please have look my issue if you have time (atleast provide workaround).

BR
Salam

From: Shaik Salam/HYD/TCS
To: "Sanju Rakonde"
Cc: "Amar Tumballi Suryanarayan", "gluster-users at gluster.org List", "Murali Kottakota"
Date: 01/23/2019 05:50 PM
Subject: Re: [Gluster-users] [Bugs] Bricks are going offline unable to recover with heal/start force commands

Hi Sanju,

Please find requested information.

Sorry to repeat again I am trying start force command once brick log enabled to debug by taking one volume example.
Please correct me If I am doing wrong.

[root at master ~]# oc rsh glusterfs-storage-vll7x
sh-4.2# gluster volume info vol_3442e86b6d994a14de73f1b8c82cf0b8

Volume Name: vol_3442e86b6d994a14de73f1b8c82cf0b8
Type: Replicate
Volume ID: 15477f36-22e8-4757-a0ce-9000b63fa849
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 192.168.3.6:/var/lib/heketi/mounts/vg_ca57f326195c243be2380ce4e42a4191/brick_952d75fd193c7209c9a81acbc23a3747/brick
Brick2: 192.168.3.5:/var/lib/heketi/mounts/vg_d5f17487744584e3652d3ca943b0b91b/brick_e15c12cceae12c8ab7782dd57cf5b6c1/brick
Brick3: 192.168.3.15:/var/lib/heketi/mounts/vg_462ea199185376b03e4b0317363bb88c/brick_1736459d19e8aaa1dcb5a87f48747d04/brick
Options Reconfigured:
diagnostics.brick-log-level: INFO
performance.client-io-threads: off
nfs.disable: on
transport.address-family: inet

sh-4.2# gluster volume status vol_3442e86b6d994a14de73f1b8c82cf0b8
Status of volume: vol_3442e86b6d994a14de73f1b8c82cf0b8
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 192.168.3.6:/var/lib/heketi/mounts/vg
_ca57f326195c243be2380ce4e42a4191/brick_952
d75fd193c7209c9a81acbc23a3747/brick         49157     0          Y       250
Brick 192.168.3.5:/var/lib/heketi/mounts/vg
_d5f17487744584e3652d3ca943b0b91b/brick_e15
c12cceae12c8ab7782dd57cf5b6c1/brick         N/A       N/A        N       N/A
Brick 192.168.3.15:/var/lib/heketi/mounts/v
g_462ea199185376b03e4b0317363bb88c/brick_17
36459d19e8aaa1dcb5a87f48747d04/brick        49173     0          Y       225
Self-heal Daemon on localhost               N/A       N/A        Y       108434
Self-heal Daemon on matrix1.matrix.orange.l
ab                                          N/A       N/A        Y       69525
Self-heal Daemon on matrix2.matrix.orange.l
ab                                          N/A       N/A        Y       18569

gluster volume set vol_3442e86b6d994a14de73f1b8c82cf0b8 diagnostics.brick-log-level DEBUG
volume set: success

sh-4.2# gluster volume get vol_3442e86b6d994a14de73f1b8c82cf0b8 all |grep log
cluster.entry-change-log                on
cluster.data-change-log                 on
cluster.metadata-change-log             on
diagnostics.brick-log-level             DEBUG

sh-4.2# cd /var/log/glusterfs/bricks/
sh-4.2# ls -la |grep brick_e15c12cceae12c8ab7782dd57cf5b6c1
-rw-------. 1 root root 0 Jan 20 02:46 var-lib-heketi-mounts-vg_d5f17487744584e3652d3ca943b0b91b-brick_e15c12cceae12c8ab7782dd57cf5b6c1-brick.log >>> Noting in log
-rw-------. 
1 root root 189057 Jan 18 09:20 var-lib-heketi-mounts-vg_d5f17487744584e3652d3ca943b0b91b-brick_e15c12cceae12c8ab7782dd57cf5b6c1-brick.log-20190120 [2019-01-23 11:49:32.475956] I [run.c:241:runner_log] (-->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2b3a) [0x7fca9e139b3a] -->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2605) [0x7fca9e139605] -->/lib64/libglusterfs.so.0(runner_log+0x115) [0x7fcaa346f0e5] ) 0-management: Ran script: /var/lib/glusterd/hooks/1/set/post/S30samba-set.sh --volname=vol_3442e86b6d994a14de73f1b8c82cf0b8 -o diagnostics.brick-log-level=DEBUG --gd-workdir=/var/lib/glusterd [2019-01-23 11:49:32.483191] I [run.c:241:runner_log] (-->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2b3a) [0x7fca9e139b3a] -->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2605) [0x7fca9e139605] -->/lib64/libglusterfs.so.0(runner_log+0x115) [0x7fcaa346f0e5] ) 0-management: Ran script: /var/lib/glusterd/hooks/1/set/post/S32gluster_enable_shared_storage.sh --volname=vol_3442e86b6d994a14de73f1b8c82cf0b8 -o diagnostics.brick-log-level=DEBUG --gd-workdir=/var/lib/glusterd [2019-01-23 11:48:59.111292] W [MSGID: 106036] [glusterd-snapshot.c:9514:glusterd_handle_snapshot_fn] 0-management: Snapshot list failed [2019-01-23 11:50:14.112271] E [MSGID: 106026] [glusterd-snapshot.c:3962:glusterd_handle_snapshot_list] 0-management: Volume (vol_63854b105c40802bdec77290e91858ea) does not exist [Invalid argument] [2019-01-23 11:50:14.112305] W [MSGID: 106036] [glusterd-snapshot.c:9514:glusterd_handle_snapshot_fn] 0-management: Snapshot list failed [2019-01-23 11:50:20.322902] I [glusterd-utils.c:5994:glusterd_brick_start] 0-management: discovered already-running brick /var/lib/heketi/mounts/vg_d5f17487744584e3652d3ca943b0b91b/brick_e15c12cceae12c8ab7782dd57cf5b6c1/brick [2019-01-23 11:50:20.322925] I [MSGID: 106142] [glusterd-pmap.c:297:pmap_registry_bind] 0-pmap: adding brick 
/var/lib/heketi/mounts/vg_d5f17487744584e3652d3ca943b0b91b/brick_e15c12cceae12c8ab7782dd57cf5b6c1/brick on port 49165 [2019-01-23 11:50:20.327557] I [MSGID: 106131] [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: nfs already stopped [2019-01-23 11:50:20.327586] I [MSGID: 106568] [glusterd-svc-mgmt.c:235:glusterd_svc_stop] 0-management: nfs service is stopped [2019-01-23 11:50:20.327604] I [MSGID: 106599] [glusterd-nfs-svc.c:82:glusterd_nfssvc_manager] 0-management: nfs/server.so xlator is not installed [2019-01-23 11:50:20.337735] I [MSGID: 106568] [glusterd-proc-mgmt.c:87:glusterd_proc_stop] 0-management: Stopping glustershd daemon running in pid: 69525 [2019-01-23 11:50:21.338058] I [MSGID: 106568] [glusterd-svc-mgmt.c:235:glusterd_svc_stop] 0-management: glustershd service is stopped [2019-01-23 11:50:21.338180] I [MSGID: 106567] [glusterd-svc-mgmt.c:203:glusterd_svc_start] 0-management: Starting glustershd service [2019-01-23 11:50:21.348234] I [MSGID: 106131] [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: bitd already stopped [2019-01-23 11:50:21.348285] I [MSGID: 106568] [glusterd-svc-mgmt.c:235:glusterd_svc_stop] 0-management: bitd service is stopped [2019-01-23 11:50:21.348866] I [MSGID: 106131] [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: scrub already stopped [2019-01-23 11:50:21.348883] I [MSGID: 106568] [glusterd-svc-mgmt.c:235:glusterd_svc_stop] 0-management: scrub service is stopped [2019-01-23 11:50:22.356502] I [run.c:241:runner_log] (-->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2b3a) [0x7fca9e139b3a] -->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2605) [0x7fca9e139605] -->/lib64/libglusterfs.so.0(runner_log+0x115) [0x7fcaa346f0e5] ) 0-management: Ran script: /var/lib/glusterd/hooks/1/start/post/S29CTDBsetup.sh --volname=vol_3442e86b6d994a14de73f1b8c82cf0b8 --first=no --version=1 --volume-op=start --gd-workdir=/var/lib/glusterd [2019-01-23 11:50:22.368845] E [run.c:241:runner_log] 
(-->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2b3a) [0x7fca9e139b3a] -->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2563) [0x7fca9e139563] -->/lib64/libglusterfs.so.0(runner_log+0x115) [0x7fcaa346f0e5] ) 0-management: Failed to execute script: /var/lib/glusterd/hooks/1/start/post/S30samba-start.sh --volname=vol_3442e86b6d994a14de73f1b8c82cf0b8 --first=no --version=1 --volume-op=start --gd-workdir=/var/lib/glusterd sh-4.2# gluster volume status vol_3442e86b6d994a14de73f1b8c82cf0b8 Status of volume: vol_3442e86b6d994a14de73f1b8c82cf0b8 Gluster process TCP Port RDMA Port Online Pid ------------------------------------------------------------------------------ Brick 192.168.3.6:/var/lib/heketi/mounts/vg _ca57f326195c243be2380ce4e42a4191/brick_952 d75fd193c7209c9a81acbc23a3747/brick 49157 0 Y 250 Brick 192.168.3.5:/var/lib/heketi/mounts/vg _d5f17487744584e3652d3ca943b0b91b/brick_e15 c12cceae12c8ab7782dd57cf5b6c1/brick N/A N/A N N/A Brick 192.168.3.15:/var/lib/heketi/mounts/v g_462ea199185376b03e4b0317363bb88c/brick_17 36459d19e8aaa1dcb5a87f48747d04/brick 49173 0 Y 225 Self-heal Daemon on localhost N/A N/A Y 109550 Self-heal Daemon on 192.168.3.6 N/A N/A Y 52557 Self-heal Daemon on 192.168.3.15 N/A N/A Y 16946 Task Status of Volume vol_3442e86b6d994a14de73f1b8c82cf0b8 ------------------------------------------------------------------------------ There are no active volume tasks From: "Sanju Rakonde" To: "Shaik Salam" Cc: "Amar Tumballi Suryanarayan" , "gluster-users at gluster.org List" , "Murali Kottakota" Date: 01/23/2019 02:15 PM Subject: Re: [Gluster-users] [Bugs] Bricks are going offline unable to recover with heal/start force commands "External email. Open with Caution" Hi Shaik, I can see below errors in glusterd logs. 
[2019-01-22 09:20:17.540196] E [MSGID: 101012] [common-utils.c:4010:gf_is_service_running] 0-: Unable to read pidfile: /var/run/gluster/vols/vol_e1aa1283d5917485d88c4a742eeff422/192.168.3.6-var-lib-heketi-mounts-vg_526f35058433c6b03130bba4e0a7dd87-brick_9e7c382e5f853d471c347bc5590359af-brick.pid [2019-01-22 09:20:17.546408] E [MSGID: 101012] [common-utils.c:4010:gf_is_service_running] 0-: Unable to read pidfile: /var/run/gluster/vols/vol_f0ed498d7e781d7bb896244175b31f9e/192.168.3.6-var-lib-heketi-mounts-vg_56391bec3c8bfe4fc116de7bddfc2af4-brick_47ed9e0663ad0f6f676ddd6ad7e3dcde-brick.pid [2019-01-22 09:20:17.552575] E [MSGID: 101012] [common-utils.c:4010:gf_is_service_running] 0-: Unable to read pidfile: /var/run/gluster/vols/vol_f387519c9b004ec14e80696db88ef0f8/192.168.3.6-var-lib-heketi-mounts-vg_56391bec3c8bfe4fc116de7bddfc2af4-brick_06ad6c73dfbf6a5fc21334f98c9973c2-brick.pid [2019-01-22 09:20:17.558888] E [MSGID: 101012] [common-utils.c:4010:gf_is_service_running] 0-: Unable to read pidfile: /var/run/gluster/vols/vol_f8ca343c60e6efe541fe02d16ca02a7d/192.168.3.6-var-lib-heketi-mounts-vg_526f35058433c6b03130bba4e0a7dd87-brick_525225f65753b05dfe33aeaeb9c5de39-brick.pid [2019-01-22 09:20:17.565266] E [MSGID: 101012] [common-utils.c:4010:gf_is_service_running] 0-: Unable to read pidfile: /var/run/gluster/vols/vol_fe882e074c0512fd9271fc2ff5a0bfe1/192.168.3.6-var-lib-heketi-mounts-vg_28708570b029e5eff0a996c453a11691-brick_d4f30d6e465a8544b759a7016fb5aab5-brick.pid [2019-01-22 09:20:17.585926] E [MSGID: 106028] [glusterd-utils.c:8222:glusterd_brick_signal] 0-glusterd: Unable to get pid of brick process [2019-01-22 09:20:17.617806] E [MSGID: 106028] [glusterd-utils.c:8222:glusterd_brick_signal] 0-glusterd: Unable to get pid of brick process [2019-01-22 09:20:17.649628] E [MSGID: 101012] [common-utils.c:4010:gf_is_service_running] 0-: Unable to read pidfile: /var/run/gluster/glustershd/glustershd.pid [2019-01-22 09:20:17.649700] E [MSGID: 101012] 
[common-utils.c:4010:gf_is_service_running] 0-: Unable to read pidfile: /var/run/gluster/glustershd/glustershd.pid So it looks like, neither gf_is_service_running() nor glusterd_brick_signal() are able to read the pid file. That means pidfiles might be having nothing to read. Can you please paste the contents of brick pidfiles. You can find brick pidfiles in /var/run/gluster/vols// or you can just run this command "for i in `ls /var/run/gluster/vols/*/*.pid`;do echo $i;cat $i;done" On Wed, Jan 23, 2019 at 12:49 PM Shaik Salam wrote: Hi Sanju, Please find requested information attached logs. Below brick is offline and try to start force/heal commands but doesn't makes up. sh-4.2# sh-4.2# gluster --version glusterfs 4.1.5 sh-4.2# gluster volume status vol_3442e86b6d994a14de73f1b8c82cf0b8 Status of volume: vol_3442e86b6d994a14de73f1b8c82cf0b8 Gluster process TCP Port RDMA Port Online Pid ------------------------------------------------------------------------------ Brick 192.168.3.6:/var/lib/heketi/mounts/vg _ca57f326195c243be2380ce4e42a4191/brick_952 d75fd193c7209c9a81acbc23a3747/brick 49166 0 Y 269 Brick 192.168.3.5:/var/lib/heketi/mounts/vg _d5f17487744584e3652d3ca943b0b91b/brick_e15 c12cceae12c8ab7782dd57cf5b6c1/brick N/A N/A N N/A Brick 192.168.3.15:/var/lib/heketi/mounts/v g_462ea199185376b03e4b0317363bb88c/brick_17 36459d19e8aaa1dcb5a87f48747d04/brick 49173 0 Y 225 Self-heal Daemon on localhost N/A N/A Y 45826 Self-heal Daemon on 192.168.3.6 N/A N/A Y 65196 Self-heal Daemon on 192.168.3.15 N/A N/A Y 52915 Task Status of Volume vol_3442e86b6d994a14de73f1b8c82cf0b8 ------------------------------------------------------------------------------ We can see following events from when we start forcing volumes /mgmt/glusterd.so(+0xe2b3a) [0x7fca9e139b3a] -->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2605) [0x7fca9e139605] -->/lib64/libglusterfs.so.0(runner_log+0x115) [0x7fcaa346f0e5] ) 0-management: Ran script: 
/var/lib/glusterd/hooks/1/start/post/S29CTDBsetup.sh --volname=vol_3442e86b6d994a14de73f1b8c82cf0b8 --first=no --version=1 --volume-op=start --gd-workdir=/var/lib/glusterd [2019-01-21 08:22:34.555068] E [run.c:241:runner_log] (-->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2b3a) [0x7fca9e139b3a] -->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2563) [0x7fca9e139563] -->/lib64/libglusterfs.so.0(runner_log+0x115) [0x7fcaa346f0e5] ) 0-management: Failed to execute script: /var/lib/glusterd/hooks/1/start/post/S30samba-start.sh --volname=vol_3442e86b6d994a14de73f1b8c82cf0b8 --first=no --version=1 --volume-op=start --gd-workdir=/var/lib/glusterd [2019-01-21 08:22:53.389049] I [MSGID: 106499] [glusterd-handler.c:4314:__glusterd_handle_status_volume] 0-management: Received status volume req for volume vol_3442e86b6d994a14de73f1b8c82cf0b8 [2019-01-21 08:23:25.346839] I [MSGID: 106487] [glusterd-handler.c:1486:__glusterd_handle_cli_list_friends] 0-glusterd: Received cli list req We can see following events from when we heal volumes. 
[2019-01-21 08:20:07.576070] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-glusterfs: error returned while attempting to connect to host:(null), port:0 [2019-01-21 08:20:07.580225] I [cli-rpc-ops.c:9182:gf_cli_heal_volume_cbk] 0-cli: Received resp to heal volume [2019-01-21 08:20:07.580326] I [input.c:31:cli_batch] 0-: Exiting with: -1 [2019-01-21 08:22:30.423311] I [cli.c:768:main] 0-cli: Started running gluster with version 4.1.5 [2019-01-21 08:22:30.463648] I [MSGID: 101190] [event-epoll.c:617:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1 [2019-01-21 08:22:30.463718] I [socket.c:2632:socket_event_handler] 0-transport: EPOLLERR - disconnecting now [2019-01-21 08:22:30.463859] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-glusterfs: error returned while attempting to connect to host:(null), port:0 [2019-01-21 08:22:33.427710] I [socket.c:2632:socket_event_handler] 0-transport: EPOLLERR - disconnecting now [2019-01-21 08:22:34.581555] I [cli-rpc-ops.c:1472:gf_cli_start_volume_cbk] 0-cli: Received resp to start volume [2019-01-21 08:22:34.581678] I [input.c:31:cli_batch] 0-: Exiting with: 0 [2019-01-21 08:22:53.345351] I [cli.c:768:main] 0-cli: Started running gluster with version 4.1.5 [2019-01-21 08:22:53.387992] I [MSGID: 101190] [event-epoll.c:617:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1 [2019-01-21 08:22:53.388059] I [socket.c:2632:socket_event_handler] 0-transport: EPOLLERR - disconnecting now [2019-01-21 08:22:53.388138] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-glusterfs: error returned while attempting to connect to host:(null), port:0 [2019-01-21 08:22:53.394737] I [input.c:31:cli_batch] 0-: Exiting with: 0 [2019-01-21 08:23:25.304688] I [cli.c:768:main] 0-cli: Started running gluster with version 4.1.5 [2019-01-21 08:23:25.346319] I [MSGID: 101190] [event-epoll.c:617:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1 [2019-01-21 08:23:25.346389] I [socket.c:2632:socket_event_handler] 0-transport: EPOLLERR - 
disconnecting now [2019-01-21 08:23:25.346500] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-glusterfs: error returned while attempting to connect to host:(null), port:0

I enabled DEBUG mode at the brick level, but nothing is being written to the brick log.

gluster volume set vol_3442e86b6d994a14de73f1b8c82cf0b8 diagnostics.brick-log-level DEBUG
sh-4.2# pwd
/var/log/glusterfs/bricks
sh-4.2# ls -la |grep brick_e15c12cceae12c8ab7782dd57cf5b6c1
-rw-------. 1 root root 0 Jan 20 02:46 var-lib-heketi-mounts-vg_d5f17487744584e3652d3ca943b0b91b-brick_e15c12cceae12c8ab7782dd57cf5b6c1-brick.log

From: Sanju Rakonde To: Shaik Salam Cc: Amar Tumballi Suryanarayan , " gluster-users at gluster.org List" Date: 01/22/2019 02:21 PM Subject: Re: [Gluster-users] [Bugs] Bricks are going offline unable to recover with heal/start force commands "External email. Open with Caution"

Hi Shaik, Can you please provide the complete glusterd and cmd_history logs from all the nodes in the cluster? Also, please paste the output of the following commands (from all nodes): 1. gluster --version 2. gluster volume info 3. gluster volume status 4. gluster peer status 5. ps -ax | grep glusterfsd

On Tue, Jan 22, 2019 at 12:47 PM Shaik Salam wrote: Hi Surya, This is already a customer setup, so we can't redeploy it. I enabled debug for the brick-level log, but nothing is being written to it. Can you tell me whether there are any other ways to troubleshoot, or other logs to look at?

From: Shaik Salam/HYD/TCS To: "Amar Tumballi Suryanarayan" Cc: "gluster-users at gluster.org List" Date: 01/22/2019 12:06 PM Subject: Re: [Bugs] Bricks are going offline unable to recover with heal/start force commands

Hi Surya, I have enabled DEBUG mode at the brick level, but nothing is being written to the brick log.

gluster volume set vol_3442e86b6d994a14de73f1b8c82cf0b8 diagnostics.brick-log-level DEBUG
sh-4.2# pwd
/var/log/glusterfs/bricks
sh-4.2# ls -la |grep brick_e15c12cceae12c8ab7782dd57cf5b6c1
-rw-------. 1 root root 0 Jan 20 02:46 var-lib-heketi-mounts-vg_d5f17487744584e3652d3ca943b0b91b-brick_e15c12cceae12c8ab7782dd57cf5b6c1-brick.log

BR Salam

From: "Amar Tumballi Suryanarayan" To: "Shaik Salam" Cc: "gluster-users at gluster.org List" Date: 01/22/2019 11:38 AM Subject: Re: [Bugs] Bricks are going offline unable to recover with heal/start force commands

Hi Shaik, Can you check what is in the brick logs? They are located in /var/log/glusterfs/bricks/*. It looks like the samba hook script failed, but that shouldn't matter in this use case. Also, I see that you are trying to set up heketi to provision volumes, which means you may be using gluster in container use cases. If you are still in the 'PoC' phase, can you give https://github.com/gluster/gcs a try? That makes the deployment and the stack a little simpler. -Amar

On Tue, Jan 22, 2019 at 11:29 AM Shaik Salam wrote: Can anyone advise how to recover the bricks, other than with heal/start force, given the log events below? Please let me know if any other logs are required. Thanks in advance. BR Salam

From: Shaik Salam/HYD/TCS To: bugs at gluster.org, gluster-users at gluster.org Date: 01/21/2019 10:03 PM Subject: Bricks are going offline unable to recover with heal/start force commands

Hi, Bricks are offline and we are unable to recover them with the following commands: gluster volume heal gluster volume start force But the bricks are still offline.
sh-4.2# gluster volume status vol_3442e86b6d994a14de73f1b8c82cf0b8
Status of volume: vol_3442e86b6d994a14de73f1b8c82cf0b8
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 192.168.3.6:/var/lib/heketi/mounts/vg_ca57f326195c243be2380ce4e42a4191/brick_952d75fd193c7209c9a81acbc23a3747/brick   49166     0          Y       269
Brick 192.168.3.5:/var/lib/heketi/mounts/vg_d5f17487744584e3652d3ca943b0b91b/brick_e15c12cceae12c8ab7782dd57cf5b6c1/brick   N/A       N/A        N       N/A
Brick 192.168.3.15:/var/lib/heketi/mounts/vg_462ea199185376b03e4b0317363bb88c/brick_1736459d19e8aaa1dcb5a87f48747d04/brick  49173     0          Y       225
Self-heal Daemon on localhost               N/A       N/A        Y       45826
Self-heal Daemon on 192.168.3.6             N/A       N/A        Y       65196
Self-heal Daemon on 192.168.3.15            N/A       N/A        Y       52915

Task Status of Volume vol_3442e86b6d994a14de73f1b8c82cf0b8
------------------------------------------------------------------------------

We can see following events from when we start forcing volumes
/mgmt/glusterd.so(+0xe2b3a) [0x7fca9e139b3a] -->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2605) [0x7fca9e139605] -->/lib64/libglusterfs.so.0(runner_log+0x115) [0x7fcaa346f0e5] ) 0-management: Ran script: /var/lib/glusterd/hooks/1/start/post/S29CTDBsetup.sh --volname=vol_3442e86b6d994a14de73f1b8c82cf0b8 --first=no --version=1 --volume-op=start --gd-workdir=/var/lib/glusterd
[2019-01-21 08:22:34.555068] E [run.c:241:runner_log] (-->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2b3a) [0x7fca9e139b3a] -->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2563) [0x7fca9e139563] -->/lib64/libglusterfs.so.0(runner_log+0x115) [0x7fcaa346f0e5] ) 0-management: Failed to execute script: /var/lib/glusterd/hooks/1/start/post/S30samba-start.sh --volname=vol_3442e86b6d994a14de73f1b8c82cf0b8 --first=no --version=1 --volume-op=start --gd-workdir=/var/lib/glusterd
[2019-01-21 08:22:53.389049] I [MSGID: 106499] [glusterd-handler.c:4314:__glusterd_handle_status_volume]
0-management: Received status volume req for volume vol_3442e86b6d994a14de73f1b8c82cf0b8 [2019-01-21 08:23:25.346839] I [MSGID: 106487] [glusterd-handler.c:1486:__glusterd_handle_cli_list_friends] 0-glusterd: Received cli list req We can see following events from when we heal volumes. [2019-01-21 08:20:07.576070] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-glusterfs: error returned while attempting to connect to host:(null), port:0 [2019-01-21 08:20:07.580225] I [cli-rpc-ops.c:9182:gf_cli_heal_volume_cbk] 0-cli: Received resp to heal volume [2019-01-21 08:20:07.580326] I [input.c:31:cli_batch] 0-: Exiting with: -1 [2019-01-21 08:22:30.423311] I [cli.c:768:main] 0-cli: Started running gluster with version 4.1.5 [2019-01-21 08:22:30.463648] I [MSGID: 101190] [event-epoll.c:617:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1 [2019-01-21 08:22:30.463718] I [socket.c:2632:socket_event_handler] 0-transport: EPOLLERR - disconnecting now [2019-01-21 08:22:30.463859] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-glusterfs: error returned while attempting to connect to host:(null), port:0 [2019-01-21 08:22:33.427710] I [socket.c:2632:socket_event_handler] 0-transport: EPOLLERR - disconnecting now [2019-01-21 08:22:34.581555] I [cli-rpc-ops.c:1472:gf_cli_start_volume_cbk] 0-cli: Received resp to start volume [2019-01-21 08:22:34.581678] I [input.c:31:cli_batch] 0-: Exiting with: 0 [2019-01-21 08:22:53.345351] I [cli.c:768:main] 0-cli: Started running gluster with version 4.1.5 [2019-01-21 08:22:53.387992] I [MSGID: 101190] [event-epoll.c:617:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1 [2019-01-21 08:22:53.388059] I [socket.c:2632:socket_event_handler] 0-transport: EPOLLERR - disconnecting now [2019-01-21 08:22:53.388138] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-glusterfs: error returned while attempting to connect to host:(null), port:0 [2019-01-21 08:22:53.394737] I [input.c:31:cli_batch] 0-: Exiting with: 0 [2019-01-21 08:23:25.304688] I 
[cli.c:768:main] 0-cli: Started running gluster with version 4.1.5 [2019-01-21 08:23:25.346319] I [MSGID: 101190] [event-epoll.c:617:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1 [2019-01-21 08:23:25.346389] I [socket.c:2632:socket_event_handler] 0-transport: EPOLLERR - disconnecting now [2019-01-21 08:23:25.346500] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-glusterfs: error returned while attempting to connect to host:(null), port:0 Please let us know steps to recover bricks. BR Salam =====-----=====-----===== Notice: The information contained in this e-mail message and/or attachments to it may contain confidential or privileged information. If you are not the intended recipient, any dissemination, use, review, distribution, printing or copying of the information contained in this e-mail message and/or attachments to it are strictly prohibited. If you have received this communication in error, please notify us by reply e-mail or telephone and immediately and permanently delete the message and any attachments. Thank you _______________________________________________ Bugs mailing list Bugs at gluster.org https://lists.gluster.org/mailman/listinfo/bugs -- Amar Tumballi (amarts) _______________________________________________ Gluster-users mailing list Gluster-users at gluster.org https://lists.gluster.org/mailman/listinfo/gluster-users -- Thanks, Sanju -- Thanks, Sanju -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: firsnode_brick.log Type: application/octet-stream Size: 5625 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: Secondnode_brick.log Type: application/octet-stream Size: 30409 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: Thirdnode_brick.log Type: application/octet-stream Size: 47635 bytes Desc: not available URL:
From shaik.salam at tcs.com Thu Jan 24 06:51:20 2019 From: shaik.salam at tcs.com (Shaik Salam) Date: Thu, 24 Jan 2019 12:21:20 +0530 Subject: [Gluster-users] Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: server busy Message-ID:

Hi All, We are also facing the following issue on OpenShift Origin while creating PVCs for pods (please provide at least a workaround so that we can move forward):

Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: create volume err: error creating volume
Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: create volume err: error creating volume Server busy. Retry operation later..

Please find the heketi DB dump and log attached.

[negroni] Completed 429 Too Many Requests in 250.763µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:07:49 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 169.08µs
[negroni] Started DELETE /volumes/520bc5f4e1bfd029855a72f9ca7ebf6c
[negroni] Completed 404 Not Found in 148.125µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:08:04 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 496.624µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:08:04 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 101.673µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:08:19 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 209.681µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:08:19 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 103.595µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:08:34 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 297.594µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:08:34 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 96.75µs
[negroni] Started POST /volumes
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:08:49 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 477.007µs
[heketi] WARNING 2019/01/23 12:08:49 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 165.38µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:09:04 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 488.253µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:09:04 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 171.836µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:09:19 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 208.59µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:09:19 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 125.141µs
[negroni] Started DELETE /volumes/99e87ecd0a816ac34ae5a04eabc1d606
[negroni] Completed 404 Not Found in 138.687µs
[negroni] Started POST /volumes

BR Salam
-------------- next part -------------- An HTML attachment was scrubbed...
URL: -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: heketi-gluster.db.txt URL: -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: heketi.log.txt URL:
From shaik.salam at tcs.com Thu Jan 24 07:06:07 2019 From: shaik.salam at tcs.com (Shaik Salam) Date: Thu, 24 Jan 2019 12:36:07 +0530 Subject: [Gluster-users] Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: server busy In-Reply-To: References: Message-ID:

Hi Madhu, Could you please have a look at my issue if you have time (at least a workaround)? I am unable to send mail to "John Mulligan", who is currently handling the issue https://bugzilla.redhat.com/show_bug.cgi?id=1636912

BR Salam

From: Shaik Salam/HYD/TCS To: "John Mulligan" , "Michael Adam" , "Madhu Rajanna" Cc: "gluster-users at gluster.org List" Date: 01/24/2019 12:21 PM Subject: Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: server busy

[----snip----]

-------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: heketi-gluster.db.txt URL: -------------- next part -------------- An embedded and charset-unspecified text was scrubbed...
Name: heketi.log.txt URL:
From revirii at googlemail.com Thu Jan 24 07:17:06 2019 From: revirii at googlemail.com (Hu Bert) Date: Thu, 24 Jan 2019 08:17:06 +0100 Subject: [Gluster-users] gluster 5.3: transport endpoint gets disconnected - Assertion failed: GF_MEM_TRAILER_MAGIC Message-ID:

Good morning,

we currently transfer some data to a new glusterfs volume; to check the throughput of the new volume/setup while the transfer is running i decided to create some files on one of the gluster servers with dd in loop:

while true; do dd if=/dev/urandom of=/shared/private/1G.file bs=1M count=1024; rm /shared/private/1G.file; done

/shared/private is the mount point of the glusterfs volume. The dd should run for about an hour. But now it happened twice that during this loop the transport endpoint gets disconnected:

dd: failed to open '/shared/private/1G.file': Transport endpoint is not connected
rm: cannot remove '/shared/private/1G.file': Transport endpoint is not connected

In the /var/log/glusterfs/shared-private.log i see:

[2019-01-24 07:03:28.938745] W [MSGID: 108001] [afr-transaction.c:1062:afr_handle_quorum] 0-persistent-replicate-0: 7212652e-c437-426c-a0a9-a47f5972fffe: Failing WRITE as quorum is not met [Transport endpoint is not connected]
[2019-01-24 07:03:28.939280] E [mem-pool.c:331:__gf_free] (-->/usr/lib/x86_64-linux-gnu/glusterfs/5.3/xlator/cluster/replicate.so(+0x5be8c) [0x7eff84248e8c] -->/usr/lib/x86_64-linux-gnu/glusterfs/5.3/xlator/cluster/replicate.so(+0x5be18) [0x7eff84248e18] -->/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(__gf_free+0xf6) [0x7eff8a9485a6] ) 0-: Assertion failed: GF_MEM_TRAILER_MAGIC == *(uint32_t *)((char *)free_ptr + header->size)
[----snip----]

The whole output can be found here: https://pastebin.com/qTMmFxx0
gluster volume info here: https://pastebin.com/ENTWZ7j3

After umount + mount the transport endpoint is connected again - until the next disconnect. A /core file gets generated. Maybe someone wants to have a look at this file?
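
A minimal sketch of a bounded version of the dd loop above, in case it helps to pin down when the disconnect happens. It assumes GNU dd (for bs=1M) and a date that supports +%s. The function name write_loop and its arguments are made up for illustration and are not from the original post. It stops at the first failed write and prints a UTC timestamp, which may be easier to line up with /var/log/glusterfs/shared-private.log (assuming the gluster client log timestamps are in UTC).

```shell
#!/bin/sh
# Sketch only: write_loop is a hypothetical helper, not part of the original post.
# Repeats the write/delete cycle until DURATION_SECONDS have passed and
# stops at the first failed write.
# Usage: write_loop TARGET_DIR DURATION_SECONDS [SIZE_MIB]
write_loop() {
    dir=$1
    duration=$2
    size_mib=${3:-1024}            # 1024 x 1 MiB = the 1G file from the post
    file="$dir/1G.file"
    end=$(( $(date +%s) + duration ))
    passes=0
    while [ "$(date +%s)" -lt "$end" ]; do
        # Capture dd's stderr so its error text (for example "Transport
        # endpoint is not connected") is shown only when the write fails.
        if ! err=$(dd if=/dev/urandom of="$file" bs=1M count="$size_mib" 2>&1); then
            echo "$(date -u '+%Y-%m-%d %H:%M:%S') UTC: write failed after $passes passes" >&2
            echo "$err" >&2
            return 1
        fi
        rm -f "$file" || return 1
        passes=$((passes + 1))
    done
    echo "completed $passes passes"
}
```

For the one-hour run described above this would be called as: write_loop /shared/private 3600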
From mrajanna at redhat.com Thu Jan 24 07:21:35 2019 From: mrajanna at redhat.com (Madhu Rajanna) Date: Thu, 24 Jan 2019 12:51:35 +0530 Subject: [Gluster-users] Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: server busy In-Reply-To: References: Message-ID:

Hi Shaik,

can you provide the output of $ heketi-cli server operations info from the heketi pod?

As a workaround, you can try restarting the heketi pod. This will cause the current operations to go stale, but other pending PVCs may go to the Bound state.

Regards, Madhu R

On Thu, Jan 24, 2019 at 12:36 PM Shaik Salam wrote:
> H Madhu,
>
> Could you please have look my issue If you have time (atleast workaround).
> I am unable to send mail to "John Mulligan" "
> who is currently handling issue
> https://bugzilla.redhat.com/show_bug.cgi?id=1636912
>
> BR
> Salam
>
> [----snip----]

-- Madhu Rajanna Software Engineer Red Hat Bangalore, India mrajanna at redhat.com M: +91-9741133155
-------------- next part -------------- An HTML attachment was scrubbed...
URL:
From atumball at redhat.com Thu Jan 24 07:29:05 2019 From: atumball at redhat.com (Amar Tumballi Suryanarayan) Date: Thu, 24 Jan 2019 12:59:05 +0530 Subject: [Gluster-users] gluster 5.3: transport endpoint gets disconnected - Assertion failed: GF_MEM_TRAILER_MAGIC In-Reply-To: References: Message-ID:

On Thu, Jan 24, 2019 at 12:47 PM Hu Bert wrote:
> [----snip----]
>
> After umount + mount the transport endpoint is connected again - until
> the next disconnect. A /core file gets generated. Maybe someone wants
> to have a look at this file?

Hi Hu Bert,

Thanks for the logs and the report. 'Transport endpoint is not connected' on a mount can have two causes.

1. The brick holding the file (for a replica volume, all of its bricks) is unreachable or down. The mount returns to a normal state once the bricks are restarted.

2. The client process crashes or hits an assertion. In this case /dev/fuse is no longer connected to a process, but the mount still holds a reference. This needs 'umount' and mount again to work.

We will look into this issue and get back to you.

Regards, Amar

-- Amar Tumballi (amarts)
-------------- next part -------------- An HTML attachment was scrubbed... URL:
From shaik.salam at tcs.com Thu Jan 24 07:45:59 2019 From: shaik.salam at tcs.com (Shaik Salam) Date: Thu, 24 Jan 2019 13:15:59 +0530 Subject: [Gluster-users] Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: server busy In-Reply-To: References: Message-ID:

Hi Madhu,

I restarted the heketi pod many times, but that did not resolve the issue.

sh-4.4# heketi-cli server operations info
Operation Counts: Total: 0 In-Flight: 0 New: 0 Stale: 0

As you can see, all operation counts are zero at this point. When I then try to create a single volume, the in-flight count slowly climbs to 8, as observed below.
sh-4.4# heketi-cli server operations info
Operation Counts: Total: 0 In-Flight: 6 New: 0 Stale: 0
sh-4.4# heketi-cli server operations info
Operation Counts: Total: 0 In-Flight: 7 New: 0 Stale: 0
sh-4.4# heketi-cli server operations info
Operation Counts: Total: 0 In-Flight: 7 New: 0 Stale: 0
sh-4.4# heketi-cli server operations info
Operation Counts: Total: 0 In-Flight: 7 New: 0 Stale: 0
sh-4.4# heketi-cli server operations info
Operation Counts: Total: 0 In-Flight: 7 New: 0 Stale: 0

[negroni] Completed 200 OK in 186.286µs
[negroni] Started POST /volumes
[negroni] Started GET /operations
[negroni] Completed 200 OK in 166.294µs
[negroni] Started GET /operations
[negroni] Completed 200 OK in 186.411µs
[negroni] Started GET /operations
[negroni] Completed 200 OK in 179.796µs
[negroni] Started POST /volumes
[negroni] Started POST /volumes
[negroni] Started POST /volumes
[negroni] Started POST /volumes
[negroni] Started GET /operations
[negroni] Completed 200 OK in 131.108µs
[negroni] Started POST /volumes
[negroni] Started GET /operations
[negroni] Completed 200 OK in 111.392µs
[negroni] Started GET /operations
[negroni] Completed 200 OK in 265.023µs
[negroni] Started GET /operations
[negroni] Completed 200 OK in 179.364µs
[negroni] Started GET /operations
[negroni] Completed 200 OK in 295.058µs
[negroni] Started GET /operations
[negroni] Completed 200 OK in 146.857µs
[negroni] Started POST /volumes
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/24 07:43:36 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 403.166µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/24 07:43:51 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 193.554µs

But the volume for the pod is still not being created.
1:15:36 PM Warning Provisioning failed Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: create volume err: error creating volume Server busy. Retry operation later.. 9 times in the last 2 minutes 1:13:21 PM Warning Provisioning failed Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: create volume err: error creating volume . 8 times in the last From: "Madhu Rajanna" To: "Shaik Salam" Cc: "gluster-users at gluster.org List" , "Michael Adam" Date: 01/24/2019 12:51 PM Subject: Re: Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: server busy "External email. Open with Caution" HI Shaik, can you provide me the outpout of $heketi-cli server operations info from heketi pod as a workround you can try restarting the heketi pod. This will cause the current operations to go stale, but other pending pvcs may go to Bound state Regards, Madhu R On Thu, Jan 24, 2019 at 12:36 PM Shaik Salam wrote: H Madhu, Could you please have look my issue If you have time (atleast workaround). I am unable to send mail to "John Mulligan" " who is currently handling issue https://bugzilla.redhat.com/show_bug.cgi?id=1636912 BR Salam From: Shaik Salam/HYD/TCS To: "John Mulligan" , "Michael Adam" < madam at redhat.com>, "Madhu Rajanna" Cc: "gluster-users at gluster.org List" Date: 01/24/2019 12:21 PM Subject: Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: server busy Hi All, We are facing also following issue on openshift origin while we are creating pvc for pods. (atlease provide workaround to move further) Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: create volume err: error creating volume Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: create volume err: error creating volume Server busy. Retry operation later.. 
Please find the heketidb dump and log below.

[negroni] Completed 429 Too Many Requests in 250.763µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:07:49 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 169.08µs
[negroni] Started DELETE /volumes/520bc5f4e1bfd029855a72f9ca7ebf6c
[negroni] Completed 404 Not Found in 148.125µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:08:04 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 496.624µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:08:04 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 101.673µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:08:19 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 209.681µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:08:19 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 103.595µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:08:34 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 297.594µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:08:34 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 96.75µs
[negroni] Started POST /volumes
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:08:49 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 477.007µs
[heketi] WARNING 2019/01/23 12:08:49 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 165.38µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:09:04 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 488.253µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:09:04 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 171.836µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:09:19 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 208.59µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:09:19 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 125.141µs
[negroni] Started DELETE /volumes/99e87ecd0a816ac34ae5a04eabc1d606
[negroni] Completed 404 Not Found in 138.687µs
[negroni] Started POST /volumes

BR
Salam

=====-----=====-----=====
Notice: The information contained in this e-mail message and/or attachments to it may contain confidential or privileged information. If you are not the intended recipient, any dissemination, use, review, distribution, printing or copying of the information contained in this e-mail message and/or attachments to it are strictly prohibited. If you have received this communication in error, please notify us by reply e-mail or telephone and immediately and permanently delete the message and any attachments. Thank you

--
Madhu Rajanna
Software Engineer
Red Hat Bangalore, India
mrajanna at redhat.com M: +91-9741133155

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From mrajanna at redhat.com Thu Jan 24 08:03:28 2019
From: mrajanna at redhat.com (Madhu Rajanna)
Date: Thu, 24 Jan 2019 13:33:28 +0530
Subject: [Gluster-users] Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: server busy
In-Reply-To:
References:
Message-ID:

The heketi logs you have attached are not complete, I believe. Can you provide the complete heketi logs, and can we also get the output of "ps aux" from the gluster pods? I want to see if any lvm commands or gluster commands are "stuck".

On Thu, Jan 24, 2019 at 1:16 PM Shaik Salam wrote:

> Hi Madhu.
>
> I tried lot of times restarted heketi pod but not resolved.
> > sh-4.4# heketi-cli server operations info > Operation Counts: > Total: 0 > In-Flight: 0 > New: 0 > Stale: 0 > > Now you can see all operations are zero. Now I try to create single volume > below is observation in-flight reaching slowly to 8. > > sh-4.4# heketi-cli server operations infoCLI_SERVER=http://localhost:8080 > ; export HEKETI_CLI_USE Operation > Counts: > Total: 0 > In-Flight: 6 > New: 0 > Stale: 0 > sh-4.4# heketi-cli server operations info > Operation Counts: > Total: 0 > In-Flight: 7 > New: 0 > Stale: 0 > sh-4.4# heketi-cli server operations info > Operation Counts: > Total: 0 > In-Flight: 7 > New: 0 > Stale: 0 > sh-4.4# heketi-cli server operations info > Operation Counts: > Total: 0 > In-Flight: 7 > New: 0 > Stale: 0 > sh-4.4# heketi-cli server operations info > Operation Counts: > Total: 0 > In-Flight: 7 > New: 0 > Stale: 0 > > [negroni] Completed 200 OK in 186.286?s > [negroni] Started POST /volumes > [negroni] Started GET /operations > [negroni] Completed 200 OK in 166.294?s > [negroni] Started GET /operations > [negroni] Completed 200 OK in 186.411?s > [negroni] Started GET /operations > [negroni] Completed 200 OK in 179.796?s > [negroni] Started POST /volumes > [negroni] Started POST /volumes > [negroni] Started POST /volumes > [negroni] Started POST /volumes > [negroni] Started GET /operations > [negroni] Completed 200 OK in 131.108?s > [negroni] Started POST /volumes > [negroni] Started GET /operations > [negroni] Completed 200 OK in 111.392?s > [negroni] Started GET /operations > [negroni] Completed 200 OK in 265.023?s > [negroni] Started GET /operations > [negroni] Completed 200 OK in 179.364?s > [negroni] Started GET /operations > [negroni] Completed 200 OK in 295.058?s > [negroni] Started GET /operations > [negroni] Completed 200 OK in 146.857?s > [negroni] Started POST /volumes > [negroni] Started POST /volumes > [heketi] WARNING 2019/01/24 07:43:36 operations in-flight (8) exceeds > limit (8) > [negroni] Completed 429 Too Many 
Requests in 403.166?s > [negroni] Started POST /volumes > [heketi] WARNING 2019/01/24 07:43:51 operations in-flight (8) exceeds > limit (8) > [negroni] Completed 429 Too Many Requests in 193.554?s > > > But for pod volume is not creating. > > 1:15:36 PM > Warning > Provisioning failed Failed to provision volume with StorageClass > "glusterfs-storage": glusterfs: create volume err: error creating volume > Server busy. Retry operation later.. > 9 times in the last 2 minutes > 1:13:21 PM > Warning > Provisioning failed Failed to provision volume with StorageClass > "glusterfs-storage": glusterfs: create volume err: error creating volume . > 8 times in the last > > > > > From: "Madhu Rajanna" > To: "Shaik Salam" > Cc: "gluster-users at gluster.org List" , > "Michael Adam" > Date: 01/24/2019 12:51 PM > Subject: Re: Failed to provision volume with StorageClass > "glusterfs-storage": glusterfs: server busy > ------------------------------ > > > > *"External email. Open with Caution"* > HI Shaik, > > can you provide me the outpout of $heketi-cli server operations info > from heketi pod > > as a workround you can try restarting the heketi pod. This will cause the > current operations to go stale, but other pending pvcs may go to Bound > state > > Regards, > > Madhu R > > On Thu, Jan 24, 2019 at 12:36 PM Shaik Salam <*shaik.salam at tcs.com* > > wrote: > H Madhu, > > Could you please have look my issue If you have time (atleast workaround). 
> I am unable to send mail to "John Mulligan" <*John_Mulligan at redhat.com* > >" who is currently handling issue > *https://bugzilla.redhat.com/show_bug.cgi?id=1636912* > > > BR > Salam > > > From: Shaik Salam/HYD/TCS > To: "John Mulligan" <*John_Mulligan at redhat.com* > >, "Michael Adam" <*madam at redhat.com* > >, "Madhu Rajanna" <*mrajanna at redhat.com* > > > Cc: "*gluster-users at gluster.org* List" > <*gluster-users at gluster.org* > > Date: 01/24/2019 12:21 PM > Subject: Failed to provision volume with StorageClass > "glusterfs-storage": glusterfs: server busy > ------------------------------ > > > > > Hi All, > > We are facing also following issue on openshift origin while we are > creating pvc for pods. (atlease provide workaround to move further) > > Failed to provision volume with StorageClass "glusterfs-storage": > glusterfs: create volume err: error creating volume > Failed to provision volume with StorageClass "glusterfs-storage": > glusterfs: create volume err: error creating volume Server busy. Retry > operation later.. 
> > Please find heketidb dump and log > > [negroni] Completed 429 Too Many Requests in 250.763?s > [negroni] Started POST /volumes > [heketi] WARNING 2019/01/23 12:07:49 operations in-flight (8) exceeds > limit (8) > [negroni] Completed 429 Too Many Requests in 169.08?s > [negroni] Started DELETE /volumes/520bc5f4e1bfd029855a72f9ca7ebf6c > [negroni] Completed 404 Not Found in 148.125?s > [negroni] Started POST /volumes > [heketi] WARNING 2019/01/23 12:08:04 operations in-flight (8) exceeds > limit (8) > [negroni] Completed 429 Too Many Requests in 496.624?s > [negroni] Started POST /volumes > [heketi] WARNING 2019/01/23 12:08:04 operations in-flight (8) exceeds > limit (8) > [negroni] Completed 429 Too Many Requests in 101.673?s > [negroni] Started POST /volumes > [heketi] WARNING 2019/01/23 12:08:19 operations in-flight (8) exceeds > limit (8) > [negroni] Completed 429 Too Many Requests in 209.681?s > [negroni] Started POST /volumes > [heketi] WARNING 2019/01/23 12:08:19 operations in-flight (8) exceeds > limit (8) > [negroni] Completed 429 Too Many Requests in 103.595?s > [negroni] Started POST /volumes > [heketi] WARNING 2019/01/23 12:08:34 operations in-flight (8) exceeds > limit (8) > [negroni] Completed 429 Too Many Requests in 297.594?s > [negroni] Started POST /volumes > [heketi] WARNING 2019/01/23 12:08:34 operations in-flight (8) exceeds > limit (8) > [negroni] Completed 429 Too Many Requests in 96.75?s > [negroni] Started POST /volumes > [negroni] Started POST /volumes > [heketi] WARNING 2019/01/23 12:08:49 operations in-flight (8) exceeds > limit (8) > [negroni] Completed 429 Too Many Requests in 477.007?s > [heketi] WARNING 2019/01/23 12:08:49 operations in-flight (8) exceeds > limit (8) > [negroni] Completed 429 Too Many Requests in 165.38?s > [negroni] Started POST /volumes > [heketi] WARNING 2019/01/23 12:09:04 operations in-flight (8) exceeds > limit (8) > [negroni] Completed 429 Too Many Requests in 488.253?s > [negroni] Started POST /volumes > 
[heketi] WARNING 2019/01/23 12:09:04 operations in-flight (8) exceeds limit (8)
> [negroni] Completed 429 Too Many Requests in 171.836µs
> [negroni] Started POST /volumes
> [heketi] WARNING 2019/01/23 12:09:19 operations in-flight (8) exceeds limit (8)
> [negroni] Completed 429 Too Many Requests in 208.59µs
> [negroni] Started POST /volumes
> [heketi] WARNING 2019/01/23 12:09:19 operations in-flight (8) exceeds limit (8)
> [negroni] Completed 429 Too Many Requests in 125.141µs
> [negroni] Started DELETE /volumes/99e87ecd0a816ac34ae5a04eabc1d606
> [negroni] Completed 404 Not Found in 138.687µs
> [negroni] Started POST /volumes
>
> BR
> Salam
> =====-----=====-----=====
> Notice: The information contained in this e-mail
> message and/or attachments to it may contain
> confidential or privileged information. If you are
> not the intended recipient, any dissemination, use,
> review, distribution, printing or copying of the
> information contained in this e-mail message
> and/or attachments to it are strictly prohibited. If
> you have received this communication in error,
> please notify us by reply e-mail or telephone and
> immediately and permanently delete the message
> and any attachments. Thank you
>
> --
> Madhu Rajanna
> Software Engineer
> Red Hat Bangalore, India
> mrajanna at redhat.com M: +91-9741133155
>

--
Madhu Rajanna
Software Engineer
Red Hat Bangalore, India
mrajanna at redhat.com M: +91-9741133155

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From srakonde at redhat.com Thu Jan 24 08:07:53 2019
From: srakonde at redhat.com (Sanju Rakonde)
Date: Thu, 24 Jan 2019 13:37:53 +0530
Subject: [Gluster-users] [Bugs] Bricks are going offline unable to recover with heal/start force commands
In-Reply-To:
References:
Message-ID:

Shaik,

Previously I suspected that the brick pid file was missing, but I see it is present.

From the second node (this brick is in offline state):
/var/run/gluster/vols/vol_3442e86b6d994a14de73f1b8c82cf0b8/192.168.3.5-var-lib-heketi-mounts-vg_d5f17487744584e3652d3ca943b0b91b-brick_e15c12cceae12c8ab7782dd57cf5b6c1-brick.pid
271

Are you still seeing the error "Unable to read pidfile:" in the glusterd log?

I also suspect that the brick may be missing its extended attributes. Are you seeing the "brick is deemed not to be a part of the volume" error in the glusterd log? If not, can you please provide us the output of "getfattr -m -d -e hex "

On Thu, Jan 24, 2019 at 12:18 PM Shaik Salam wrote:

> Hi Sanju,
>
> Could you please have look my issue if you have time (atleast provide
> workaround).
>
> BR
> Salam
>
> From: Shaik Salam/HYD/TCS
> To: "Sanju Rakonde"
> Cc: "Amar Tumballi Suryanarayan" , "
> gluster-users at gluster.org List" , "Murali
> Kottakota"
> Date: 01/23/2019 05:50 PM
> Subject: Re: [Gluster-users] [Bugs] Bricks are going offline
> unable to recover with heal/start force commands
> ------------------------------
>
> Hi Sanju,
>
> Please find requested information.
>
> Sorry to repeat again I am trying start force command once brick log
> enabled to debug by taking one volume example.
> Please correct me If I am doing wrong.
> > > [root at master ~]# oc rsh glusterfs-storage-vll7x > sh-4.2# gluster volume info vol_3442e86b6d994a14de73f1b8c82cf0b8 > > Volume Name: vol_3442e86b6d994a14de73f1b8c82cf0b8 > Type: Replicate > Volume ID: 15477f36-22e8-4757-a0ce-9000b63fa849 > Status: Started > Snapshot Count: 0 > Number of Bricks: 1 x 3 = 3 > Transport-type: tcp > Bricks: > Brick1: 192.168.3.6: > /var/lib/heketi/mounts/vg_ca57f326195c243be2380ce4e42a4191/brick_952d75fd193c7209c9a81acbc23a3747/brick > Brick2: 192.168.3.5: > /var/lib/heketi/mounts/vg_d5f17487744584e3652d3ca943b0b91b/ > brick_e15c12cceae12c8ab7782dd57cf5b6c1/brick > Brick3: 192.168.3.15: > /var/lib/heketi/mounts/vg_462ea199185376b03e4b0317363bb88c/brick_1736459d19e8aaa1dcb5a87f48747d04/brick > Options Reconfigured: > diagnostics.brick-log-level: INFO > performance.client-io-threads: off > nfs.disable: on > transport.address-family: inet > sh-4.2# gluster volume status vol_3442e86b6d994a14de73f1b8c82cf0b8 > Status of volume: vol_3442e86b6d994a14de73f1b8c82cf0b8 > Gluster process TCP Port RDMA Port Online > Pid > > ------------------------------------------------------------------------------ > Brick 192.168.3.6:/var/lib/heketi/mounts/vg > _ca57f326195c243be2380ce4e42a4191/brick_952 > d75fd193c7209c9a81acbc23a3747/brick 49157 0 Y > 250 > Brick 192.168.3.5:/var/lib/heketi/mounts/vg > _d5f17487744584e3652d3ca943b0b91b/brick_e15 > c12cceae12c8ab7782dd57cf5b6c1/brick N/A N/A N > N/A > Brick 192.168.3.15:/var/lib/heketi/mounts/v > g_462ea199185376b03e4b0317363bb88c/brick_17 > 36459d19e8aaa1dcb5a87f48747d04/brick 49173 0 Y > 225 > Self-heal Daemon on localhost N/A N/A Y > 108434 > Self-heal Daemon on matrix1.matrix.orange.l > ab N/A N/A Y > 69525 > Self-heal Daemon on matrix2.matrix.orange.l > ab N/A N/A Y > 18569 > > gluster volume set vol_3442e86b6d994a14de73f1b8c82cf0b8 > diagnostics.brick-log-level DEBUG > volume set: success > sh-4.2# gluster volume get vol_3442e86b6d994a14de73f1b8c82cf0b8 all |grep > log > cluster.entry-change-log 
on > cluster.data-change-log on > cluster.metadata-change-log on > diagnostics.brick-log-level DEBUG > > sh-4.2# cd /var/log/glusterfs/bricks/ > sh-4.2# ls -la |grep brick_e15c12cceae12c8ab7782dd57cf5b6c1 > -rw-------. 1 root root 0 Jan 20 02:46 > var-lib-heketi-mounts-vg_d5f17487744584e3652d3ca943b0b91b-brick_e15c12cceae12c8ab7782dd57cf5b6c1-brick.log > >>> Noting in log > > -rw-------. 1 root root 189057 Jan 18 09:20 > var-lib-heketi-mounts-vg_d5f17487744584e3652d3ca943b0b91b-brick_e15c12cceae12c8ab7782dd57cf5b6c1-brick.log-20190120 > > [2019-01-23 11:49:32.475956] I [run.c:241:runner_log] > (-->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2b3a) > [0x7fca9e139b3a] > -->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2605) > [0x7fca9e139605] -->/lib64/libglusterfs.so.0(runner_log+0x115) > [0x7fcaa346f0e5] ) 0-management: Ran script: > /var/lib/glusterd/hooks/1/set/post/S30samba-set.sh > --volname=vol_3442e86b6d994a14de73f1b8c82cf0b8 -o > diagnostics.brick-log-level=DEBUG --gd-workdir=/var/lib/glusterd > [2019-01-23 11:49:32.483191] I [run.c:241:runner_log] > (-->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2b3a) > [0x7fca9e139b3a] > -->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2605) > [0x7fca9e139605] -->/lib64/libglusterfs.so.0(runner_log+0x115) > [0x7fcaa346f0e5] ) 0-management: Ran script: > /var/lib/glusterd/hooks/1/set/post/S32gluster_enable_shared_storage.sh > --volname=vol_3442e86b6d994a14de73f1b8c82cf0b8 -o > diagnostics.brick-log-level=DEBUG --gd-workdir=/var/lib/glusterd > [2019-01-23 11:48:59.111292] W [MSGID: 106036] > [glusterd-snapshot.c:9514:glusterd_handle_snapshot_fn] 0-management: > Snapshot list failed > [2019-01-23 11:50:14.112271] E [MSGID: 106026] > [glusterd-snapshot.c:3962:glusterd_handle_snapshot_list] 0-management: > Volume (vol_63854b105c40802bdec77290e91858ea) does not exist [Invalid > argument] > [2019-01-23 11:50:14.112305] W [MSGID: 106036] > 
[glusterd-snapshot.c:9514:glusterd_handle_snapshot_fn] 0-management: > Snapshot list failed > [2019-01-23 11:50:20.322902] I > [glusterd-utils.c:5994:glusterd_brick_start] 0-management: discovered > already-running brick > /var/lib/heketi/mounts/vg_d5f17487744584e3652d3ca943b0b91b/brick_e15c12cceae12c8ab7782dd57cf5b6c1/brick > [2019-01-23 11:50:20.322925] I [MSGID: 106142] > [glusterd-pmap.c:297:pmap_registry_bind] 0-pmap: adding brick > /var/lib/heketi/mounts/vg_d5f17487744584e3652d3ca943b0b91b/brick_e15c12cceae12c8ab7782dd57cf5b6c1/brick > on port 49165 > [2019-01-23 11:50:20.327557] I [MSGID: 106131] > [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: nfs already > stopped > [2019-01-23 11:50:20.327586] I [MSGID: 106568] > [glusterd-svc-mgmt.c:235:glusterd_svc_stop] 0-management: nfs service is > stopped > [2019-01-23 11:50:20.327604] I [MSGID: 106599] > [glusterd-nfs-svc.c:82:glusterd_nfssvc_manager] 0-management: nfs/server.so > xlator is not installed > [2019-01-23 11:50:20.337735] I [MSGID: 106568] > [glusterd-proc-mgmt.c:87:glusterd_proc_stop] 0-management: Stopping > glustershd daemon running in pid: 69525 > [2019-01-23 11:50:21.338058] I [MSGID: 106568] > [glusterd-svc-mgmt.c:235:glusterd_svc_stop] 0-management: glustershd > service is stopped > [2019-01-23 11:50:21.338180] I [MSGID: 106567] > [glusterd-svc-mgmt.c:203:glusterd_svc_start] 0-management: Starting > glustershd service > [2019-01-23 11:50:21.348234] I [MSGID: 106131] > [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: bitd already > stopped > [2019-01-23 11:50:21.348285] I [MSGID: 106568] > [glusterd-svc-mgmt.c:235:glusterd_svc_stop] 0-management: bitd service is > stopped > [2019-01-23 11:50:21.348866] I [MSGID: 106131] > [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: scrub already > stopped > [2019-01-23 11:50:21.348883] I [MSGID: 106568] > [glusterd-svc-mgmt.c:235:glusterd_svc_stop] 0-management: scrub service is > stopped > [2019-01-23 11:50:22.356502] I 
[run.c:241:runner_log] > (-->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2b3a) > [0x7fca9e139b3a] > -->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2605) > [0x7fca9e139605] -->/lib64/libglusterfs.so.0(runner_log+0x115) > [0x7fcaa346f0e5] ) 0-management: Ran script: > /var/lib/glusterd/hooks/1/start/post/S29CTDBsetup.sh > --volname=vol_3442e86b6d994a14de73f1b8c82cf0b8 --first=no --version=1 > --volume-op=start --gd-workdir=/var/lib/glusterd > [2019-01-23 11:50:22.368845] E [run.c:241:runner_log] > (-->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2b3a) > [0x7fca9e139b3a] > -->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2563) > [0x7fca9e139563] -->/lib64/libglusterfs.so.0(runner_log+0x115) > [0x7fcaa346f0e5] ) 0-management: Failed to execute script: > /var/lib/glusterd/hooks/1/start/post/S30samba-start.sh > --volname=vol_3442e86b6d994a14de73f1b8c82cf0b8 --first=no --version=1 > --volume-op=start --gd-workdir=/var/lib/glusterd > > > sh-4.2# gluster volume status vol_3442e86b6d994a14de73f1b8c82cf0b8 > Status of volume: vol_3442e86b6d994a14de73f1b8c82cf0b8 > Gluster process TCP Port RDMA Port Online > Pid > > ------------------------------------------------------------------------------ > Brick 192.168.3.6:/var/lib/heketi/mounts/vg > _ca57f326195c243be2380ce4e42a4191/brick_952 > d75fd193c7209c9a81acbc23a3747/brick 49157 0 Y > 250 > Brick 192.168.3.5:/var/lib/heketi/mounts/vg > _d5f17487744584e3652d3ca943b0b91b/brick_e15 > c12cceae12c8ab7782dd57cf5b6c1/brick N/A N/A N > N/A > Brick 192.168.3.15:/var/lib/heketi/mounts/v > g_462ea199185376b03e4b0317363bb88c/brick_17 > 36459d19e8aaa1dcb5a87f48747d04/brick 49173 0 Y > 225 > Self-heal Daemon on localhost N/A N/A Y > 109550 > Self-heal Daemon on 192.168.3.6 N/A N/A Y > 52557 > Self-heal Daemon on 192.168.3.15 N/A N/A Y > 16946 > > Task Status of Volume vol_3442e86b6d994a14de73f1b8c82cf0b8 > > ------------------------------------------------------------------------------ > There are no 
active volume tasks > > > > > From: "Sanju Rakonde" > To: "Shaik Salam" > Cc: "Amar Tumballi Suryanarayan" , " > gluster-users at gluster.org List" , "Murali > Kottakota" > Date: 01/23/2019 02:15 PM > Subject: Re: [Gluster-users] [Bugs] Bricks are going offline > unable to recover with heal/start force commands > ------------------------------ > > > > *"External email. Open with Caution"* > Hi Shaik, > > I can see below errors in glusterd logs. > > [2019-01-22 09:20:17.540196] E [MSGID: 101012] > [common-utils.c:4010:gf_is_service_running] 0-: Unable to read pidfile: > /var/run/gluster/vols/vol_e1aa1283d5917485d88c4a742eeff422/192.168.3.6-var-lib-heketi-mounts-vg_526f35058433c6b03130bba4e0a7dd87-brick_9e7c382e5f853d471c347bc5590359af-brick.pid > [2019-01-22 09:20:17.546408] E [MSGID: 101012] > [common-utils.c:4010:gf_is_service_running] 0-: Unable to read pidfile: > /var/run/gluster/vols/vol_f0ed498d7e781d7bb896244175b31f9e/192.168.3.6-var-lib-heketi-mounts-vg_56391bec3c8bfe4fc116de7bddfc2af4-brick_47ed9e0663ad0f6f676ddd6ad7e3dcde-brick.pid > [2019-01-22 09:20:17.552575] E [MSGID: 101012] > [common-utils.c:4010:gf_is_service_running] 0-: Unable to read pidfile: > /var/run/gluster/vols/vol_f387519c9b004ec14e80696db88ef0f8/192.168.3.6-var-lib-heketi-mounts-vg_56391bec3c8bfe4fc116de7bddfc2af4-brick_06ad6c73dfbf6a5fc21334f98c9973c2-brick.pid > [2019-01-22 09:20:17.558888] E [MSGID: 101012] > [common-utils.c:4010:gf_is_service_running] 0-: Unable to read pidfile: > /var/run/gluster/vols/vol_f8ca343c60e6efe541fe02d16ca02a7d/192.168.3.6-var-lib-heketi-mounts-vg_526f35058433c6b03130bba4e0a7dd87-brick_525225f65753b05dfe33aeaeb9c5de39-brick.pid > [2019-01-22 09:20:17.565266] E [MSGID: 101012] > [common-utils.c:4010:gf_is_service_running] 0-: Unable to read pidfile: > /var/run/gluster/vols/vol_fe882e074c0512fd9271fc2ff5a0bfe1/192.168.3.6-var-lib-heketi-mounts-vg_28708570b029e5eff0a996c453a11691-brick_d4f30d6e465a8544b759a7016fb5aab5-brick.pid > [2019-01-22 09:20:17.585926] E 
[MSGID: 106028] > [glusterd-utils.c:8222:glusterd_brick_signal] 0-glusterd: Unable to get pid > of brick process > [2019-01-22 09:20:17.617806] E [MSGID: 106028] > [glusterd-utils.c:8222:glusterd_brick_signal] 0-glusterd: Unable to get pid > of brick process > [2019-01-22 09:20:17.649628] E [MSGID: 101012] > [common-utils.c:4010:gf_is_service_running] 0-: Unable to read pidfile: > /var/run/gluster/glustershd/glustershd.pid > [2019-01-22 09:20:17.649700] E [MSGID: 101012] > [common-utils.c:4010:gf_is_service_running] 0-: Unable to read pidfile: > /var/run/gluster/glustershd/glustershd.pid > > So it looks like, neither gf_is_service_running() > nor glusterd_brick_signal() are able to read the pid file. That means > pidfiles might be having nothing to read. > > Can you please paste the contents of brick pidfiles. You can find brick > pidfiles in /var/run/gluster/vols// or you can just run this > command "for i in `ls /var/run/gluster/vols/*/*.pid`;do echo $i;cat $i;done" > > On Wed, Jan 23, 2019 at 12:49 PM Shaik Salam <*shaik.salam at tcs.com* > > wrote: > Hi Sanju, > > Please find requested information attached logs. > > > > > Below brick is offline and try to start force/heal commands but doesn't > makes up. 
> > sh-4.2# > sh-4.2# gluster --version > glusterfs 4.1.5 > > > sh-4.2# gluster volume status vol_3442e86b6d994a14de73f1b8c82cf0b8 > Status of volume: vol_3442e86b6d994a14de73f1b8c82cf0b8 > Gluster process TCP Port RDMA Port Online > Pid > > ------------------------------------------------------------------------------ > Brick 192.168.3.6:/var/lib/heketi/mounts/vg > _ca57f326195c243be2380ce4e42a4191/brick_952 > d75fd193c7209c9a81acbc23a3747/brick 49166 0 Y > 269 > Brick 192.168.3.5:/var/lib/heketi/mounts/vg > _d5f17487744584e3652d3ca943b0b91b/brick_e15 > c12cceae12c8ab7782dd57cf5b6c1/brick N/A N/A N > N/A > Brick 192.168.3.15:/var/lib/heketi/mounts/v > g_462ea199185376b03e4b0317363bb88c/brick_17 > 36459d19e8aaa1dcb5a87f48747d04/brick 49173 0 Y > 225 > Self-heal Daemon on localhost N/A N/A Y > 45826 > Self-heal Daemon on 192.168.3.6 N/A N/A Y > 65196 > Self-heal Daemon on 192.168.3.15 N/A N/A Y > 52915 > > Task Status of Volume vol_3442e86b6d994a14de73f1b8c82cf0b8 > > ------------------------------------------------------------------------------ > > > We can see following events from when we start forcing volumes > > /mgmt/glusterd.so(+0xe2b3a) [0x7fca9e139b3a] > -->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2605) > [0x7fca9e139605] -->/lib64/libglusterfs.so.0(runner_log+0x115) > [0x7fcaa346f0e5] ) 0-management: Ran script: > /var/lib/glusterd/hooks/1/start/post/S29CTDBsetup.sh > --volname=vol_3442e86b6d994a14de73f1b8c82cf0b8 --first=no --version=1 > --volume-op=start --gd-workdir=/var/lib/glusterd > [2019-01-21 08:22:34.555068] E [run.c:241:runner_log] > (-->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2b3a) > [0x7fca9e139b3a] > -->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2563) > [0x7fca9e139563] -->/lib64/libglusterfs.so.0(runner_log+0x115) > [0x7fcaa346f0e5] ) 0-management: Failed to execute script: > /var/lib/glusterd/hooks/1/start/post/S30samba-start.sh > --volname=vol_3442e86b6d994a14de73f1b8c82cf0b8 --first=no --version=1 > 
--volume-op=start --gd-workdir=/var/lib/glusterd > [2019-01-21 08:22:53.389049] I [MSGID: 106499] > [glusterd-handler.c:4314:__glusterd_handle_status_volume] 0-management: > Received status volume req for volume vol_3442e86b6d994a14de73f1b8c82cf0b8 > [2019-01-21 08:23:25.346839] I [MSGID: 106487] > [glusterd-handler.c:1486:__glusterd_handle_cli_list_friends] 0-glusterd: > Received cli list req > > > We can see following events from when we heal volumes. > > [2019-01-21 08:20:07.576070] W [rpc-clnt.c:1753:rpc_clnt_submit] > 0-glusterfs: error returned while attempting to connect to host:(null), > port:0 > [2019-01-21 08:20:07.580225] I [cli-rpc-ops.c:9182:gf_cli_heal_volume_cbk] > 0-cli: Received resp to heal volume > [2019-01-21 08:20:07.580326] I [input.c:31:cli_batch] 0-: Exiting with: -1 > [2019-01-21 08:22:30.423311] I [cli.c:768:main] 0-cli: Started running > gluster with version 4.1.5 > [2019-01-21 08:22:30.463648] I [MSGID: 101190] > [event-epoll.c:617:event_dispatch_epoll_worker] 0-epoll: Started thread > with index 1 > [2019-01-21 08:22:30.463718] I [socket.c:2632:socket_event_handler] > 0-transport: EPOLLERR - disconnecting now > [2019-01-21 08:22:30.463859] W [rpc-clnt.c:1753:rpc_clnt_submit] > 0-glusterfs: error returned while attempting to connect to host:(null), > port:0 > [2019-01-21 08:22:33.427710] I [socket.c:2632:socket_event_handler] > 0-transport: EPOLLERR - disconnecting now > [2019-01-21 08:22:34.581555] I > [cli-rpc-ops.c:1472:gf_cli_start_volume_cbk] 0-cli: Received resp to start > volume > [2019-01-21 08:22:34.581678] I [input.c:31:cli_batch] 0-: Exiting with: 0 > [2019-01-21 08:22:53.345351] I [cli.c:768:main] 0-cli: Started running > gluster with version 4.1.5 > [2019-01-21 08:22:53.387992] I [MSGID: 101190] > [event-epoll.c:617:event_dispatch_epoll_worker] 0-epoll: Started thread > with index 1 > [2019-01-21 08:22:53.388059] I [socket.c:2632:socket_event_handler] > 0-transport: EPOLLERR - disconnecting now > [2019-01-21 08:22:53.388138] 
W [rpc-clnt.c:1753:rpc_clnt_submit] > 0-glusterfs: error returned while attempting to connect to host:(null), > port:0 > [2019-01-21 08:22:53.394737] I [input.c:31:cli_batch] 0-: Exiting with: 0 > [2019-01-21 08:23:25.304688] I [cli.c:768:main] 0-cli: Started running > gluster with version 4.1.5 > [2019-01-21 08:23:25.346319] I [MSGID: 101190] > [event-epoll.c:617:event_dispatch_epoll_worker] 0-epoll: Started thread > with index 1 > [2019-01-21 08:23:25.346389] I [socket.c:2632:socket_event_handler] > 0-transport: EPOLLERR - disconnecting now > [2019-01-21 08:23:25.346500] W [rpc-clnt.c:1753:rpc_clnt_submit] > 0-glusterfs: error returned while attempting to connect to host:(null), > port:0 > > > Enabled DEBUG mode for brick level. But nothing writing to brick log. > > gluster volume set vol_3442e86b6d994a14de73f1b8c82cf0b8 > diagnostics.brick-log-level DEBUG > > sh-4.2# pwd > /var/log/glusterfs/bricks > > sh-4.2# ls -la |grep brick_e15c12cceae12c8ab7782dd57cf5b6c1 > -rw-------. 1 root root 0 Jan 20 02:46 > var-lib-heketi-mounts-vg_d5f17487744584e3652d3ca943b0b91b-brick_e15c12cceae12c8ab7782dd57cf5b6c1-brick.log > > > > > > > From: Sanju Rakonde <*srakonde at redhat.com* > > To: Shaik Salam <*shaik.salam at tcs.com* > > Cc: Amar Tumballi Suryanarayan <*atumball at redhat.com* > >, "*gluster-users at gluster.org* > List" <*gluster-users at gluster.org* > > > Date: 01/22/2019 02:21 PM > Subject: Re: [Gluster-users] [Bugs] Bricks are going offline > unable to recover with heal/start force commands > ------------------------------ > > > > * "External email. Open with Caution"* > Hi Shaik, > > Can you please provide us complete glusterd and cmd_history logs from all > the nodes in the cluster? Also please paste output of the following > commands (from all nodes): > 1. gluster --version > 2. gluster volume info > 3. gluster volume status > 4. gluster peer status > 5. 
ps -ax | grep glusterfsd > > On Tue, Jan 22, 2019 at 12:47 PM Shaik Salam <*shaik.salam at tcs.com* > > wrote: > Hi Surya, > > It is already customer setup and cant redeploy again. > Enabled debug for brick level log but nothing writing to it. > Can you tell me is any other ways to troubleshoot or logs to look?? > > > From: Shaik Salam/HYD/TCS > To: "Amar Tumballi Suryanarayan" <*atumball at redhat.com* > > > Cc: "*gluster-users at gluster.org* List" > <*gluster-users at gluster.org* > > Date: 01/22/2019 12:06 PM > Subject: Re: [Bugs] Bricks are going offline unable to recover > with heal/start force commands > ------------------------------ > > > Hi Surya, > > I have enabled DEBUG mode for brick level. But nothing writing to brick > log. > > gluster volume set vol_3442e86b6d994a14de73f1b8c82cf0b8 > diagnostics.brick-log-level DEBUG > > sh-4.2# pwd > /var/log/glusterfs/bricks > > sh-4.2# ls -la |grep brick_e15c12cceae12c8ab7782dd57cf5b6c1 > -rw-------. 1 root root 0 Jan 20 02:46 > var-lib-heketi-mounts-vg_d5f17487744584e3652d3ca943b0b91b-brick_e15c12cceae12c8ab7782dd57cf5b6c1-brick.log > > BR > Salam > > > > > From: "Amar Tumballi Suryanarayan" <*atumball at redhat.com* > > > To: "Shaik Salam" <*shaik.salam at tcs.com* > > Cc: "*gluster-users at gluster.org* List" > <*gluster-users at gluster.org* > > Date: 01/22/2019 11:38 AM > Subject: Re: [Bugs] Bricks are going offline unable to recover > with heal/start force commands > ------------------------------ > > > > * "External email. Open with Caution"* > Hi Shaik, > > Can you check what is there in brick logs? They are located in > /var/log/glusterfs/bricks/*? > > Looks like the samba hooks script failed, but that shouldn't matter in > this use case. > > Also, I see that you are trying to setup heketi to provision volumes, > which means you may be using gluster in container usecases. If you are > still in 'PoC' phase, can you give *https://github.com/gluster/gcs* > a try? 
That makes the deployment and the > stack little simpler. > > -Amar > > > > > On Tue, Jan 22, 2019 at 11:29 AM Shaik Salam <*shaik.salam at tcs.com* > > wrote: > Can anyone respond how to recover bricks apart from heal/start force > according to below events from logs. > Please let me know any other logs required. > Thanks in advance. > > BR > Salam > > > > From: Shaik Salam/HYD/TCS > To: *bugs at gluster.org* , > *gluster-users at gluster.org* > Date: 01/21/2019 10:03 PM > Subject: Bricks are going offline unable to recover with > heal/start force commands > ------------------------------ > > > Hi, > > Bricks are in offline and unable to recover with following commands > > gluster volume heal > > gluster volume start force > > But still bricks are offline. > > > sh-4.2# gluster volume status vol_3442e86b6d994a14de73f1b8c82cf0b8 > Status of volume: vol_3442e86b6d994a14de73f1b8c82cf0b8 > Gluster process TCP Port RDMA Port Online > Pid > > ------------------------------------------------------------------------------ > Brick 192.168.3.6:/var/lib/heketi/mounts/vg > _ca57f326195c243be2380ce4e42a4191/brick_952 > d75fd193c7209c9a81acbc23a3747/brick 49166 0 Y > 269 > Brick 192.168.3.5:/var/lib/heketi/mounts/vg > _d5f17487744584e3652d3ca943b0b91b/brick_e15 > c12cceae12c8ab7782dd57cf5b6c1/brick N/A N/A N > N/A > Brick 192.168.3.15:/var/lib/heketi/mounts/v > g_462ea199185376b03e4b0317363bb88c/brick_17 > 36459d19e8aaa1dcb5a87f48747d04/brick 49173 0 Y > 225 > Self-heal Daemon on localhost N/A N/A Y > 45826 > Self-heal Daemon on 192.168.3.6 N/A N/A Y > 65196 > Self-heal Daemon on 192.168.3.15 N/A N/A Y > 52915 > > Task Status of Volume vol_3442e86b6d994a14de73f1b8c82cf0b8 > > ------------------------------------------------------------------------------ > > > We can see following events from when we start forcing volumes > > /mgmt/glusterd.so(+0xe2b3a) [0x7fca9e139b3a] > -->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2605) > [0x7fca9e139605] 
-->/lib64/libglusterfs.so.0(runner_log+0x115) > [0x7fcaa346f0e5] ) 0-management: Ran script: > /var/lib/glusterd/hooks/1/start/post/S29CTDBsetup.sh > --volname=vol_3442e86b6d994a14de73f1b8c82cf0b8 --first=no --version=1 > --volume-op=start --gd-workdir=/var/lib/glusterd > [2019-01-21 08:22:34.555068] E [run.c:241:runner_log] > (-->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2b3a) > [0x7fca9e139b3a] > -->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2563) > [0x7fca9e139563] -->/lib64/libglusterfs.so.0(runner_log+0x115) > [0x7fcaa346f0e5] ) 0-management: Failed to execute script: > /var/lib/glusterd/hooks/1/start/post/S30samba-start.sh > --volname=vol_3442e86b6d994a14de73f1b8c82cf0b8 --first=no --version=1 > --volume-op=start --gd-workdir=/var/lib/glusterd > [2019-01-21 08:22:53.389049] I [MSGID: 106499] > [glusterd-handler.c:4314:__glusterd_handle_status_volume] 0-management: > Received status volume req for volume vol_3442e86b6d994a14de73f1b8c82cf0b8 > [2019-01-21 08:23:25.346839] I [MSGID: 106487] > [glusterd-handler.c:1486:__glusterd_handle_cli_list_friends] 0-glusterd: > Received cli list req > > > We can see following events from when we heal volumes. 
> > [2019-01-21 08:20:07.576070] W [rpc-clnt.c:1753:rpc_clnt_submit] > 0-glusterfs: error returned while attempting to connect to host:(null), > port:0 > [2019-01-21 08:20:07.580225] I [cli-rpc-ops.c:9182:gf_cli_heal_volume_cbk] > 0-cli: Received resp to heal volume > [2019-01-21 08:20:07.580326] I [input.c:31:cli_batch] 0-: Exiting with: -1 > [2019-01-21 08:22:30.423311] I [cli.c:768:main] 0-cli: Started running > gluster with version 4.1.5 > [2019-01-21 08:22:30.463648] I [MSGID: 101190] > [event-epoll.c:617:event_dispatch_epoll_worker] 0-epoll: Started thread > with index 1 > [2019-01-21 08:22:30.463718] I [socket.c:2632:socket_event_handler] > 0-transport: EPOLLERR - disconnecting now > [2019-01-21 08:22:30.463859] W [rpc-clnt.c:1753:rpc_clnt_submit] > 0-glusterfs: error returned while attempting to connect to host:(null), > port:0 > [2019-01-21 08:22:33.427710] I [socket.c:2632:socket_event_handler] > 0-transport: EPOLLERR - disconnecting now > [2019-01-21 08:22:34.581555] I > [cli-rpc-ops.c:1472:gf_cli_start_volume_cbk] 0-cli: Received resp to start > volume > [2019-01-21 08:22:34.581678] I [input.c:31:cli_batch] 0-: Exiting with: 0 > [2019-01-21 08:22:53.345351] I [cli.c:768:main] 0-cli: Started running > gluster with version 4.1.5 > [2019-01-21 08:22:53.387992] I [MSGID: 101190] > [event-epoll.c:617:event_dispatch_epoll_worker] 0-epoll: Started thread > with index 1 > [2019-01-21 08:22:53.388059] I [socket.c:2632:socket_event_handler] > 0-transport: EPOLLERR - disconnecting now > [2019-01-21 08:22:53.388138] W [rpc-clnt.c:1753:rpc_clnt_submit] > 0-glusterfs: error returned while attempting to connect to host:(null), > port:0 > [2019-01-21 08:22:53.394737] I [input.c:31:cli_batch] 0-: Exiting with: 0 > [2019-01-21 08:23:25.304688] I [cli.c:768:main] 0-cli: Started running > gluster with version 4.1.5 > [2019-01-21 08:23:25.346319] I [MSGID: 101190] > [event-epoll.c:617:event_dispatch_epoll_worker] 0-epoll: Started thread > with index 1 > [2019-01-21 
08:23:25.346389] I [socket.c:2632:socket_event_handler] > 0-transport: EPOLLERR - disconnecting now > [2019-01-21 08:23:25.346500] W [rpc-clnt.c:1753:rpc_clnt_submit] > 0-glusterfs: error returned while attempting to connect to host:(null), > port:0 > > > > Please let us know steps to recover bricks. > > > BR > Salam > =====-----=====-----===== > Notice: The information contained in this e-mail > message and/or attachments to it may contain > confidential or privileged information. If you are > not the intended recipient, any dissemination, use, > review, distribution, printing or copying of the > information contained in this e-mail message > and/or attachments to it are strictly prohibited. If > you have received this communication in error, > please notify us by reply e-mail or telephone and > immediately and permanently delete the message > and any attachments. Thank you > _______________________________________________ > Bugs mailing list > *Bugs at gluster.org* > *https://lists.gluster.org/mailman/listinfo/bugs* > > > > -- > Amar Tumballi (amarts) > _______________________________________________ > Gluster-users mailing list > *Gluster-users at gluster.org* > *https://lists.gluster.org/mailman/listinfo/gluster-users* > > > > -- > Thanks, > Sanju > > > -- > Thanks, > Sanju > -- Thanks, Sanju -------------- next part -------------- An HTML attachment was scrubbed... URL: From shaik.salam at tcs.com Thu Jan 24 08:10:41 2019 From: shaik.salam at tcs.com (Shaik Salam) Date: Thu, 24 Jan 2019 13:40:41 +0530 Subject: [Gluster-users] Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: server busy In-Reply-To: References: Message-ID: Hi Madhu, Please find requested info. BR Salam From: Madhu Rajanna To: Shaik Salam Cc: "gluster-users at gluster.org List" , Michael Adam Date: 01/24/2019 01:33 PM Subject: Re: Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: server busy "External email. 
Open with Caution"

The heketi logs you have attached are not complete, I believe. Can you provide the complete heketi logs, and can we also get the output of "ps aux" from the gluster pods? I want to see if any lvm commands or gluster commands are "stuck".

On Thu, Jan 24, 2019 at 1:16 PM Shaik Salam wrote:

Hi Madhu.

I tried restarting the heketi pod many times, but the issue is not resolved.

sh-4.4# heketi-cli server operations info
Operation Counts:
Total: 0
In-Flight: 0
New: 0
Stale: 0

Now you can see all operation counts are zero. When I then try to create a single volume, the in-flight count slowly climbs to 8, as observed below.

sh-4.4# heketi-cli server operations info
Operation Counts:
Total: 0
In-Flight: 6
New: 0
Stale: 0
sh-4.4# heketi-cli server operations info
Operation Counts:
Total: 0
In-Flight: 7
New: 0
Stale: 0
sh-4.4# heketi-cli server operations info
Operation Counts:
Total: 0
In-Flight: 7
New: 0
Stale: 0
sh-4.4# heketi-cli server operations info
Operation Counts:
Total: 0
In-Flight: 7
New: 0
Stale: 0
sh-4.4# heketi-cli server operations info
Operation Counts:
Total: 0
In-Flight: 7
New: 0
Stale: 0

[negroni] Completed 200 OK in 186.286µs
[negroni] Started POST /volumes
[negroni] Started GET /operations
[negroni] Completed 200 OK in 166.294µs
[negroni] Started GET /operations
[negroni] Completed 200 OK in 186.411µs
[negroni] Started GET /operations
[negroni] Completed 200 OK in 179.796µs
[negroni] Started POST /volumes
[negroni] Started POST /volumes
[negroni] Started POST /volumes
[negroni] Started POST /volumes
[negroni] Started GET /operations
[negroni] Completed 200 OK in 131.108µs
[negroni] Started POST /volumes
[negroni] Started GET /operations
[negroni] Completed 200 OK in 111.392µs
[negroni] Started GET /operations
[negroni] Completed 200 OK in 265.023µs
[negroni] Started GET /operations
[negroni] Completed 200 OK in 179.364µs
[negroni] Started GET /operations
[negroni] Completed 200 OK in 295.058µs
[negroni] Started GET /operations
[negroni] Completed 200 OK in 146.857µs
[negroni] Started POST /volumes
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/24 07:43:36 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 403.166µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/24 07:43:51 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 193.554µs

But the volume for the pod is still not being created.

1:15:36 PM Warning Provisioning failed Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: create volume err: error creating volume Server busy. Retry operation later.. 9 times in the last 2 minutes
1:13:21 PM Warning Provisioning failed Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: create volume err: error creating volume . 8 times in the last

From: "Madhu Rajanna"
To: "Shaik Salam"
Cc: "gluster-users at gluster.org List", "Michael Adam"
Date: 01/24/2019 12:51 PM
Subject: Re: Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: server busy

"External email. Open with Caution"

Hi Shaik,

Can you provide me the output of $heketi-cli server operations info from the heketi pod?

As a workaround you can try restarting the heketi pod. This will cause the current operations to go stale, but other pending PVCs may go to Bound state.

Regards,
Madhu R

On Thu, Jan 24, 2019 at 12:36 PM Shaik Salam wrote:

Hi Madhu,

Could you please have a look at my issue if you have time (at least a workaround)?
I am unable to send mail to "John Mulligan", who is currently handling the issue https://bugzilla.redhat.com/show_bug.cgi?id=1636912

BR
Salam

From: Shaik Salam/HYD/TCS
To: "John Mulligan", "Michael Adam" <madam at redhat.com>, "Madhu Rajanna"
Cc: "gluster-users at gluster.org List"
Date: 01/24/2019 12:21 PM
Subject: Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: server busy

Hi All,

We are also facing the following issue on OpenShift Origin while creating PVCs for pods. (Please provide at least a workaround so that we can move further.)

Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: create volume err: error creating volume
Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: create volume err: error creating volume Server busy. Retry operation later..

Please find the heketidb dump and log below.

[negroni] Completed 429 Too Many Requests in 250.763µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:07:49 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 169.08µs
[negroni] Started DELETE /volumes/520bc5f4e1bfd029855a72f9ca7ebf6c
[negroni] Completed 404 Not Found in 148.125µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:08:04 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 496.624µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:08:04 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 101.673µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:08:19 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 209.681µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:08:19 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 103.595µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:08:34 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 297.594µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:08:34 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 96.75µs
[negroni] Started POST /volumes
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:08:49 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 477.007µs
[heketi] WARNING 2019/01/23 12:08:49 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 165.38µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:09:04 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 488.253µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:09:04 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 171.836µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:09:19 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 208.59µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:09:19 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 125.141µs
[negroni] Started DELETE /volumes/99e87ecd0a816ac34ae5a04eabc1d606
[negroni] Completed 404 Not Found in 138.687µs
[negroni] Started POST /volumes

BR
Salam

=====-----=====-----=====
Notice: The information contained in this e-mail message and/or attachments to it may contain confidential or privileged information. If you are not the intended recipient, any dissemination, use, review, distribution, printing or copying of the information contained in this e-mail message and/or attachments to it are strictly prohibited. If you have received this communication in error, please notify us by reply e-mail or telephone and immediately and permanently delete the message and any attachments.
Thank you

--
Madhu Rajanna
Software Engineer
Red Hat Bangalore, India
mrajanna at redhat.com M: +91-9741133155

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: heketi.log.txt
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: ps-aux from 3nodes.txt

From mrajanna at redhat.com Thu Jan 24 08:24:45 2019
From: mrajanna at redhat.com (Madhu Rajanna)
Date: Thu, 24 Jan 2019 13:54:45 +0530
Subject: [Gluster-users] Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: server busy
In-Reply-To:
References:
Message-ID:

The logs you provided are not complete, so I am not able to figure out which command is stuck. Can you reattach the complete output of `ps aux` and also attach the complete heketi logs?

On Thu, Jan 24, 2019 at 1:41 PM Shaik Salam wrote:

> Hi Madhu,
>
> Please find requested info.
>
> BR
> Salam
>
> From: Madhu Rajanna
> To: Shaik Salam
> Cc: "gluster-users at gluster.org List",
> Michael Adam
> Date: 01/24/2019 01:33 PM
> Subject: Re: Failed to provision volume with StorageClass
> "glusterfs-storage": glusterfs: server busy
> ------------------------------
>
> "External email. Open with Caution"
> The heketi logs you have attached are not complete, I believe. Can you
> provide the complete heketi logs,
> and can we also get the output of "ps aux" from the gluster pods? I want
> to see if any lvm commands or gluster commands are "stuck".
>
> On Thu, Jan 24, 2019 at 1:16 PM Shaik Salam <shaik.salam at tcs.com> wrote:
> Hi Madhu.
>
> I tried restarting the heketi pod many times, but the issue is not resolved.
>
> sh-4.4# heketi-cli server operations info
> Operation Counts:
> Total: 0
> In-Flight: 0
> New: 0
> Stale: 0
>
> Now you can see all operation counts are zero. When I then try to create a
> single volume, the in-flight count slowly climbs to 8, as observed below.
>
> sh-4.4# heketi-cli server operations info
> Operation Counts:
> Total: 0
> In-Flight: 6
> New: 0
> Stale: 0
> sh-4.4# heketi-cli server operations info
> Operation Counts:
> Total: 0
> In-Flight: 7
> New: 0
> Stale: 0
> sh-4.4# heketi-cli server operations info
> Operation Counts:
> Total: 0
> In-Flight: 7
> New: 0
> Stale: 0
> sh-4.4# heketi-cli server operations info
> Operation Counts:
> Total: 0
> In-Flight: 7
> New: 0
> Stale: 0
> sh-4.4# heketi-cli server operations info
> Operation Counts:
> Total: 0
> In-Flight: 7
> New: 0
> Stale: 0
>
> [negroni] Completed 200 OK in 186.286µs
> [negroni] Started POST /volumes
> [negroni] Started GET /operations
> [negroni] Completed 200 OK in 166.294µs
> [negroni] Started GET /operations
> [negroni] Completed 200 OK in 186.411µs
> [negroni] Started GET /operations
> [negroni] Completed 200 OK in 179.796µs
> [negroni] Started POST /volumes
> [negroni] Started POST /volumes
> [negroni] Started POST /volumes
> [negroni] Started POST /volumes
> [negroni] Started GET /operations
> [negroni] Completed 200 OK in 131.108µs
> [negroni] Started POST /volumes
> [negroni] Started GET /operations
> [negroni] Completed 200 OK in 111.392µs
> [negroni] Started GET /operations
> [negroni] Completed 200 OK in 265.023µs
> [negroni] Started GET /operations
> [negroni] Completed 200 OK in 179.364µs
> [negroni] Started GET /operations
> [negroni] Completed 200 OK in 295.058µs
> [negroni] Started GET /operations
> [negroni] Completed 200 OK in 146.857µs
> [negroni] Started POST /volumes
> [negroni] Started POST /volumes
> [heketi] WARNING 2019/01/24 07:43:36 operations in-flight (8) exceeds
> limit (8)
> [negroni] Completed 429 Too Many Requests in 403.166µs
> [negroni] Started POST /volumes
> [heketi] WARNING 2019/01/24 07:43:51 operations in-flight (8) exceeds
> limit (8)
> [negroni] Completed 429 Too Many Requests in 193.554µs
>
> But the volume for the pod is still not being created.
> 1:15:36 PM
> Warning
> Provisioning failed Failed to provision volume with StorageClass
> "glusterfs-storage": glusterfs: create volume err: error creating volume
> Server busy. Retry operation later..
> 9 times in the last 2 minutes
> 1:13:21 PM
> Warning
> Provisioning failed Failed to provision volume with StorageClass
> "glusterfs-storage": glusterfs: create volume err: error creating volume .
> 8 times in the last
>
> From: "Madhu Rajanna" <mrajanna at redhat.com>
> To: "Shaik Salam" <shaik.salam at tcs.com>
> Cc: "gluster-users at gluster.org List"
> <gluster-users at gluster.org>, "Michael Adam"
> <madam at redhat.com>
> Date: 01/24/2019 12:51 PM
> Subject: Re: Failed to provision volume with StorageClass
> "glusterfs-storage": glusterfs: server busy
> ------------------------------
>
> "External email. Open with Caution"
> Hi Shaik,
>
> Can you provide me the output of $heketi-cli server operations info
> from the heketi pod?
>
> As a workaround you can try restarting the heketi pod. This will cause the
> current operations to go stale, but other pending PVCs may go to Bound
> state.
>
> Regards,
>
> Madhu R
>
> On Thu, Jan 24, 2019 at 12:36 PM Shaik Salam <shaik.salam at tcs.com> wrote:
> Hi Madhu,
>
> Could you please have a look at my issue if you have time (at least a workaround)?
> I am unable to send mail to "John Mulligan" <John_Mulligan at redhat.com>,
> who is currently handling the issue
> https://bugzilla.redhat.com/show_bug.cgi?id=1636912
>
> BR
> Salam
>
> From: Shaik Salam/HYD/TCS
> To: "John Mulligan" <John_Mulligan at redhat.com>, "Michael Adam"
> <madam at redhat.com>, "Madhu Rajanna" <mrajanna at redhat.com>
> Cc: "gluster-users at gluster.org List"
> <gluster-users at gluster.org>
> Date: 01/24/2019 12:21 PM
> Subject: Failed to provision volume with StorageClass
> "glusterfs-storage": glusterfs: server busy
> ------------------------------
>
> Hi All,
>
> We are also facing the following issue on OpenShift Origin while
> creating PVCs for pods. (Please provide at least a workaround so that we can move further.)
>
> Failed to provision volume with StorageClass "glusterfs-storage":
> glusterfs: create volume err: error creating volume
> Failed to provision volume with StorageClass "glusterfs-storage":
> glusterfs: create volume err: error creating volume Server busy. Retry
> operation later..
>
> Please find the heketidb dump and log below.
>
> [negroni] Completed 429 Too Many Requests in 250.763µs
> [negroni] Started POST /volumes
> [heketi] WARNING 2019/01/23 12:07:49 operations in-flight (8) exceeds
> limit (8)
> [negroni] Completed 429 Too Many Requests in 169.08µs
> [negroni] Started DELETE /volumes/520bc5f4e1bfd029855a72f9ca7ebf6c
> [negroni] Completed 404 Not Found in 148.125µs
> [negroni] Started POST /volumes
> [heketi] WARNING 2019/01/23 12:08:04 operations in-flight (8) exceeds
> limit (8)
> [negroni] Completed 429 Too Many Requests in 496.624µs
> [negroni] Started POST /volumes
> [heketi] WARNING 2019/01/23 12:08:04 operations in-flight (8) exceeds
> limit (8)
> [negroni] Completed 429 Too Many Requests in 101.673µs
> [negroni] Started POST /volumes
> [heketi] WARNING 2019/01/23 12:08:19 operations in-flight (8) exceeds
> limit (8)
> [negroni] Completed 429 Too Many Requests in 209.681µs
> [negroni] Started POST /volumes
> [heketi] WARNING 2019/01/23 12:08:19 operations in-flight (8) exceeds
> limit (8)
> [negroni] Completed 429 Too Many Requests in 103.595µs
> [negroni] Started POST /volumes
> [heketi] WARNING 2019/01/23 12:08:34 operations in-flight (8) exceeds
> limit (8)
> [negroni] Completed 429 Too Many Requests in 297.594µs
> [negroni] Started POST /volumes
> [heketi] WARNING 2019/01/23 12:08:34 operations in-flight (8) exceeds
> limit (8)
> [negroni] Completed 429 Too Many Requests in 96.75µs
> [negroni] Started POST /volumes
> [negroni] Started POST /volumes
> [heketi] WARNING 2019/01/23 12:08:49 operations in-flight (8) exceeds
> limit (8)
> [negroni] Completed 429 Too Many Requests in 477.007µs
> [heketi] WARNING 2019/01/23 12:08:49 operations in-flight (8) exceeds
> limit (8)
> [negroni] Completed 429 Too Many Requests in 165.38µs
> [negroni] Started POST /volumes
> [heketi] WARNING 2019/01/23 12:09:04 operations in-flight (8) exceeds
> limit (8)
> [negroni] Completed 429 Too Many Requests in 488.253µs
> [negroni] Started POST /volumes
> [heketi] WARNING 2019/01/23 12:09:04 operations in-flight (8) exceeds
> limit (8)
> [negroni] Completed 429 Too Many Requests in 171.836µs
> [negroni] Started POST /volumes
> [heketi] WARNING 2019/01/23 12:09:19 operations in-flight (8) exceeds
> limit (8)
> [negroni] Completed 429 Too Many Requests in 208.59µs
> [negroni] Started POST /volumes
> [heketi] WARNING 2019/01/23 12:09:19 operations in-flight (8) exceeds
> limit (8)
> [negroni] Completed 429 Too Many Requests in 125.141µs
> [negroni] Started DELETE /volumes/99e87ecd0a816ac34ae5a04eabc1d606
> [negroni] Completed 404 Not Found in 138.687µs
> [negroni] Started POST /volumes
>
> BR
> Salam
> =====-----=====-----=====
> Notice: The information contained in this e-mail
> message and/or attachments to it may contain
> confidential or privileged information. If you are
> not the intended recipient, any dissemination, use,
> review, distribution, printing or copying of the
> information contained in this e-mail message
> and/or attachments to it are strictly prohibited. If
> you have received this communication in error,
> please notify us by reply e-mail or telephone and
> immediately and permanently delete the message
> and any attachments. Thank you
>
> --
> Madhu Rajanna
> Software Engineer
> Red Hat Bangalore, India
> mrajanna at redhat.com M: +91-9741133155
>

--
Madhu Rajanna
Software Engineer
Red Hat Bangalore, India
mrajanna at redhat.com M: +91-9741133155

From shaik.salam at tcs.com Thu Jan 24 08:35:04 2019
From: shaik.salam at tcs.com (Shaik Salam)
Date: Thu, 24 Jan 2019 14:05:04 +0530
Subject: [Gluster-users] [Bugs] Bricks are going offline unable to recover with heal/start force commands
In-Reply-To:
References:
Message-ID:

Hi Sanju,

Please find the requested information.

Are you still seeing the error "Unable to read pidfile:" in glusterd log? >>>> No
Are you seeing "brick is deemed not to be a part of the volume" error in glusterd log? >>>> No

sh-4.2# getfattr -m -d -e hex /var/lib/heketi/mounts/vg_d5f17487744584e3652d3ca943b0b91b/brick_e15c12cceae12c8ab7782dd57cf5b6c1/brick
sh-4.2# getfattr -m -d -e hex /var/lib/heketi/mounts/vg_d5f17487744584e3652d3ca943b0b91b/brick_e15c12cceae1^C8ab7782dd57cf5b6c1/brick
sh-4.2# pwd
/var/lib/heketi/mounts/vg_d5f17487744584e3652d3ca943b0b91b/brick_e15c12cceae12c8ab7782dd57cf5b6c1/brick
sh-4.2# getfattr -m -d -e hex /var/lib/heketi/mounts/vg_d5f17487744584e3652d3ca943b0b91b/brick_e15c12cceae12c8ab7782dd57cf5b6c1/brick
sh-4.2# getfattr -m -d -e hex /var/lib/heketi/mounts/vg_d5f17487744584e3652d3ca943b0b91b/brick_e15c12cceae12c8ab7782dd57cf5b6c1/brick/
sh-4.2# getfattr -m -d -e hex /var/lib/heketi/mounts/vg_d5f17487744584e3652d3ca943b0b91b/brick_e15c12cceae12c8ab7782dd57cf5b6c1/brick/
sh-4.2# getfattr -m -d -e hex /var/lib/heketi/mounts/vg_d5f17487744584e3652d3ca943b0b91b/brick_e15c12cceae12c8ab7782dd57cf5b6c1/brick/
sh-4.2# getfattr -d -m . -e hex /var/lib/heketi/mounts/vg_d5f17487744584e3652d3ca943b0b91b/brick_e15c12cceae12c8ab7782dd57cf5b6c1/brick/
getfattr: Removing leading '/' from absolute path names
# file: var/lib/heketi/mounts/vg_d5f17487744584e3652d3ca943b0b91b/brick_e15c12cceae12c8ab7782dd57cf5b6c1/brick/
security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.vol_3442e86b6d994a14de73f1b8c82cf0b8-client-0=0x000000000000000000000000
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x000000010000000000000000ffffffff
trusted.glusterfs.volume-id=0x15477f3622e84757a0ce9000b63fa849

sh-4.2# ls -la |wc -l
86
sh-4.2# pwd
/var/lib/heketi/mounts/vg_d5f17487744584e3652d3ca943b0b91b/brick_e15c12cceae12c8ab7782dd57cf5b6c1/brick
sh-4.2#

From: "Sanju Rakonde"
To: "Shaik Salam"
Cc: "Amar Tumballi Suryanarayan", "gluster-users at gluster.org List", "Murali Kottakota"
Date: 01/24/2019 01:38 PM
Subject: Re: [Gluster-users] [Bugs] Bricks are going offline unable to recover with heal/start force commands

"External email. Open with Caution"

Shaik,

Previously I was suspecting, whether brick pid file is missing. But I see it is present.

From second node (this brick is in offline state):
/var/run/gluster/vols/vol_3442e86b6d994a14de73f1b8c82cf0b8/192.168.3.5-var-lib-heketi-mounts-vg_d5f17487744584e3652d3ca943b0b91b-brick_e15c12cceae12c8ab7782dd57cf5b6c1-brick.pid
271

Are you still seeing the error "Unable to read pidfile:" in glusterd log?

I also suspect whether brick is missing its extended attributes. Are you seeing "brick is deemed not to be a part of the volume" error in glusterd log? If not can you please provide us output of "getfattr -m -d -e hex "

On Thu, Jan 24, 2019 at 12:18 PM Shaik Salam wrote:

Hi Sanju,

Could you please have look my issue if you have time (atleast provide workaround).
BR Salam From: Shaik Salam/HYD/TCS To: "Sanju Rakonde" Cc: "Amar Tumballi Suryanarayan" , " gluster-users at gluster.org List" , "Murali Kottakota" Date: 01/23/2019 05:50 PM Subject: Re: [Gluster-users] [Bugs] Bricks are going offline unable to recover with heal/start force commands Hi Sanju, Please find requested information. Sorry to repeat again I am trying start force command once brick log enabled to debug by taking one volume example. Please correct me If I am doing wrong. [root at master ~]# oc rsh glusterfs-storage-vll7x sh-4.2# gluster volume info vol_3442e86b6d994a14de73f1b8c82cf0b8 Volume Name: vol_3442e86b6d994a14de73f1b8c82cf0b8 Type: Replicate Volume ID: 15477f36-22e8-4757-a0ce-9000b63fa849 Status: Started Snapshot Count: 0 Number of Bricks: 1 x 3 = 3 Transport-type: tcp Bricks: Brick1: 192.168.3.6:/var/lib/heketi/mounts/vg_ca57f326195c243be2380ce4e42a4191/brick_952d75fd193c7209c9a81acbc23a3747/brick Brick2: 192.168.3.5:/var/lib/heketi/mounts/vg_d5f17487744584e3652d3ca943b0b91b/ brick_e15c12cceae12c8ab7782dd57cf5b6c1/brick Brick3: 192.168.3.15:/var/lib/heketi/mounts/vg_462ea199185376b03e4b0317363bb88c/brick_1736459d19e8aaa1dcb5a87f48747d04/brick Options Reconfigured: diagnostics.brick-log-level: INFO performance.client-io-threads: off nfs.disable: on transport.address-family: inet sh-4.2# gluster volume status vol_3442e86b6d994a14de73f1b8c82cf0b8 Status of volume: vol_3442e86b6d994a14de73f1b8c82cf0b8 Gluster process TCP Port RDMA Port Online Pid ------------------------------------------------------------------------------ Brick 192.168.3.6:/var/lib/heketi/mounts/vg _ca57f326195c243be2380ce4e42a4191/brick_952 d75fd193c7209c9a81acbc23a3747/brick 49157 0 Y 250 Brick 192.168.3.5:/var/lib/heketi/mounts/vg _d5f17487744584e3652d3ca943b0b91b/brick_e15 c12cceae12c8ab7782dd57cf5b6c1/brick N/A N/A N N/A Brick 192.168.3.15:/var/lib/heketi/mounts/v g_462ea199185376b03e4b0317363bb88c/brick_17 36459d19e8aaa1dcb5a87f48747d04/brick 49173 0 Y 225 Self-heal Daemon on 
localhost N/A N/A Y 108434 Self-heal Daemon on matrix1.matrix.orange.l ab N/A N/A Y 69525 Self-heal Daemon on matrix2.matrix.orange.l ab N/A N/A Y 18569 gluster volume set vol_3442e86b6d994a14de73f1b8c82cf0b8 diagnostics.brick-log-level DEBUG volume set: success sh-4.2# gluster volume get vol_3442e86b6d994a14de73f1b8c82cf0b8 all |grep log cluster.entry-change-log on cluster.data-change-log on cluster.metadata-change-log on diagnostics.brick-log-level DEBUG sh-4.2# cd /var/log/glusterfs/bricks/ sh-4.2# ls -la |grep brick_e15c12cceae12c8ab7782dd57cf5b6c1 -rw-------. 1 root root 0 Jan 20 02:46 var-lib-heketi-mounts-vg_d5f17487744584e3652d3ca943b0b91b-brick_e15c12cceae12c8ab7782dd57cf5b6c1-brick.log >>> Noting in log -rw-------. 1 root root 189057 Jan 18 09:20 var-lib-heketi-mounts-vg_d5f17487744584e3652d3ca943b0b91b-brick_e15c12cceae12c8ab7782dd57cf5b6c1-brick.log-20190120 [2019-01-23 11:49:32.475956] I [run.c:241:runner_log] (-->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2b3a) [0x7fca9e139b3a] -->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2605) [0x7fca9e139605] -->/lib64/libglusterfs.so.0(runner_log+0x115) [0x7fcaa346f0e5] ) 0-management: Ran script: /var/lib/glusterd/hooks/1/set/post/S30samba-set.sh --volname=vol_3442e86b6d994a14de73f1b8c82cf0b8 -o diagnostics.brick-log-level=DEBUG --gd-workdir=/var/lib/glusterd [2019-01-23 11:49:32.483191] I [run.c:241:runner_log] (-->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2b3a) [0x7fca9e139b3a] -->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2605) [0x7fca9e139605] -->/lib64/libglusterfs.so.0(runner_log+0x115) [0x7fcaa346f0e5] ) 0-management: Ran script: /var/lib/glusterd/hooks/1/set/post/S32gluster_enable_shared_storage.sh --volname=vol_3442e86b6d994a14de73f1b8c82cf0b8 -o diagnostics.brick-log-level=DEBUG --gd-workdir=/var/lib/glusterd [2019-01-23 11:48:59.111292] W [MSGID: 106036] [glusterd-snapshot.c:9514:glusterd_handle_snapshot_fn] 0-management: Snapshot list failed [2019-01-23 
11:50:14.112271] E [MSGID: 106026] [glusterd-snapshot.c:3962:glusterd_handle_snapshot_list] 0-management: Volume (vol_63854b105c40802bdec77290e91858ea) does not exist [Invalid argument] [2019-01-23 11:50:14.112305] W [MSGID: 106036] [glusterd-snapshot.c:9514:glusterd_handle_snapshot_fn] 0-management: Snapshot list failed [2019-01-23 11:50:20.322902] I [glusterd-utils.c:5994:glusterd_brick_start] 0-management: discovered already-running brick /var/lib/heketi/mounts/vg_d5f17487744584e3652d3ca943b0b91b/brick_e15c12cceae12c8ab7782dd57cf5b6c1/brick [2019-01-23 11:50:20.322925] I [MSGID: 106142] [glusterd-pmap.c:297:pmap_registry_bind] 0-pmap: adding brick /var/lib/heketi/mounts/vg_d5f17487744584e3652d3ca943b0b91b/brick_e15c12cceae12c8ab7782dd57cf5b6c1/brick on port 49165 [2019-01-23 11:50:20.327557] I [MSGID: 106131] [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: nfs already stopped [2019-01-23 11:50:20.327586] I [MSGID: 106568] [glusterd-svc-mgmt.c:235:glusterd_svc_stop] 0-management: nfs service is stopped [2019-01-23 11:50:20.327604] I [MSGID: 106599] [glusterd-nfs-svc.c:82:glusterd_nfssvc_manager] 0-management: nfs/server.so xlator is not installed [2019-01-23 11:50:20.337735] I [MSGID: 106568] [glusterd-proc-mgmt.c:87:glusterd_proc_stop] 0-management: Stopping glustershd daemon running in pid: 69525 [2019-01-23 11:50:21.338058] I [MSGID: 106568] [glusterd-svc-mgmt.c:235:glusterd_svc_stop] 0-management: glustershd service is stopped [2019-01-23 11:50:21.338180] I [MSGID: 106567] [glusterd-svc-mgmt.c:203:glusterd_svc_start] 0-management: Starting glustershd service [2019-01-23 11:50:21.348234] I [MSGID: 106131] [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: bitd already stopped [2019-01-23 11:50:21.348285] I [MSGID: 106568] [glusterd-svc-mgmt.c:235:glusterd_svc_stop] 0-management: bitd service is stopped [2019-01-23 11:50:21.348866] I [MSGID: 106131] [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: scrub already stopped [2019-01-23 
11:50:21.348883] I [MSGID: 106568] [glusterd-svc-mgmt.c:235:glusterd_svc_stop] 0-management: scrub service is stopped [2019-01-23 11:50:22.356502] I [run.c:241:runner_log] (-->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2b3a) [0x7fca9e139b3a] -->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2605) [0x7fca9e139605] -->/lib64/libglusterfs.so.0(runner_log+0x115) [0x7fcaa346f0e5] ) 0-management: Ran script: /var/lib/glusterd/hooks/1/start/post/S29CTDBsetup.sh --volname=vol_3442e86b6d994a14de73f1b8c82cf0b8 --first=no --version=1 --volume-op=start --gd-workdir=/var/lib/glusterd [2019-01-23 11:50:22.368845] E [run.c:241:runner_log] (-->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2b3a) [0x7fca9e139b3a] -->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2563) [0x7fca9e139563] -->/lib64/libglusterfs.so.0(runner_log+0x115) [0x7fcaa346f0e5] ) 0-management: Failed to execute script: /var/lib/glusterd/hooks/1/start/post/S30samba-start.sh --volname=vol_3442e86b6d994a14de73f1b8c82cf0b8 --first=no --version=1 --volume-op=start --gd-workdir=/var/lib/glusterd sh-4.2# gluster volume status vol_3442e86b6d994a14de73f1b8c82cf0b8 Status of volume: vol_3442e86b6d994a14de73f1b8c82cf0b8 Gluster process TCP Port RDMA Port Online Pid ------------------------------------------------------------------------------ Brick 192.168.3.6:/var/lib/heketi/mounts/vg _ca57f326195c243be2380ce4e42a4191/brick_952 d75fd193c7209c9a81acbc23a3747/brick 49157 0 Y 250 Brick 192.168.3.5:/var/lib/heketi/mounts/vg _d5f17487744584e3652d3ca943b0b91b/brick_e15 c12cceae12c8ab7782dd57cf5b6c1/brick N/A N/A N N/A Brick 192.168.3.15:/var/lib/heketi/mounts/v g_462ea199185376b03e4b0317363bb88c/brick_17 36459d19e8aaa1dcb5a87f48747d04/brick 49173 0 Y 225 Self-heal Daemon on localhost N/A N/A Y 109550 Self-heal Daemon on 192.168.3.6 N/A N/A Y 52557 Self-heal Daemon on 192.168.3.15 N/A N/A Y 16946 Task Status of Volume vol_3442e86b6d994a14de73f1b8c82cf0b8 
------------------------------------------------------------------------------ There are no active volume tasks From: "Sanju Rakonde" To: "Shaik Salam" Cc: "Amar Tumballi Suryanarayan" , " gluster-users at gluster.org List" , "Murali Kottakota" Date: 01/23/2019 02:15 PM Subject: Re: [Gluster-users] [Bugs] Bricks are going offline unable to recover with heal/start force commands "External email. Open with Caution" Hi Shaik, I can see below errors in glusterd logs. [2019-01-22 09:20:17.540196] E [MSGID: 101012] [common-utils.c:4010:gf_is_service_running] 0-: Unable to read pidfile: /var/run/gluster/vols/vol_e1aa1283d5917485d88c4a742eeff422/192.168.3.6-var-lib-heketi-mounts-vg_526f35058433c6b03130bba4e0a7dd87-brick_9e7c382e5f853d471c347bc5590359af-brick.pid [2019-01-22 09:20:17.546408] E [MSGID: 101012] [common-utils.c:4010:gf_is_service_running] 0-: Unable to read pidfile: /var/run/gluster/vols/vol_f0ed498d7e781d7bb896244175b31f9e/192.168.3.6-var-lib-heketi-mounts-vg_56391bec3c8bfe4fc116de7bddfc2af4-brick_47ed9e0663ad0f6f676ddd6ad7e3dcde-brick.pid [2019-01-22 09:20:17.552575] E [MSGID: 101012] [common-utils.c:4010:gf_is_service_running] 0-: Unable to read pidfile: /var/run/gluster/vols/vol_f387519c9b004ec14e80696db88ef0f8/192.168.3.6-var-lib-heketi-mounts-vg_56391bec3c8bfe4fc116de7bddfc2af4-brick_06ad6c73dfbf6a5fc21334f98c9973c2-brick.pid [2019-01-22 09:20:17.558888] E [MSGID: 101012] [common-utils.c:4010:gf_is_service_running] 0-: Unable to read pidfile: /var/run/gluster/vols/vol_f8ca343c60e6efe541fe02d16ca02a7d/192.168.3.6-var-lib-heketi-mounts-vg_526f35058433c6b03130bba4e0a7dd87-brick_525225f65753b05dfe33aeaeb9c5de39-brick.pid [2019-01-22 09:20:17.565266] E [MSGID: 101012] [common-utils.c:4010:gf_is_service_running] 0-: Unable to read pidfile: /var/run/gluster/vols/vol_fe882e074c0512fd9271fc2ff5a0bfe1/192.168.3.6-var-lib-heketi-mounts-vg_28708570b029e5eff0a996c453a11691-brick_d4f30d6e465a8544b759a7016fb5aab5-brick.pid [2019-01-22 09:20:17.585926] E [MSGID: 106028] 
[glusterd-utils.c:8222:glusterd_brick_signal] 0-glusterd: Unable to get pid of brick process [2019-01-22 09:20:17.617806] E [MSGID: 106028] [glusterd-utils.c:8222:glusterd_brick_signal] 0-glusterd: Unable to get pid of brick process [2019-01-22 09:20:17.649628] E [MSGID: 101012] [common-utils.c:4010:gf_is_service_running] 0-: Unable to read pidfile: /var/run/gluster/glustershd/glustershd.pid [2019-01-22 09:20:17.649700] E [MSGID: 101012] [common-utils.c:4010:gf_is_service_running] 0-: Unable to read pidfile: /var/run/gluster/glustershd/glustershd.pid So it looks like neither gf_is_service_running() nor glusterd_brick_signal() is able to read the pid file. That means the pidfiles may have nothing in them to read. Can you please paste the contents of the brick pidfiles? You can find the brick pidfiles in /var/run/gluster/vols// or you can just run this command: "for i in `ls /var/run/gluster/vols/*/*.pid`;do echo $i;cat $i;done" On Wed, Jan 23, 2019 at 12:49 PM Shaik Salam wrote: Hi Sanju, Please find the requested information in the attached logs. The brick below is offline; I tried the start force and heal commands, but it does not come up.
sh-4.2# sh-4.2# gluster --version glusterfs 4.1.5 sh-4.2# gluster volume status vol_3442e86b6d994a14de73f1b8c82cf0b8 Status of volume: vol_3442e86b6d994a14de73f1b8c82cf0b8 Gluster process TCP Port RDMA Port Online Pid ------------------------------------------------------------------------------ Brick 192.168.3.6:/var/lib/heketi/mounts/vg _ca57f326195c243be2380ce4e42a4191/brick_952 d75fd193c7209c9a81acbc23a3747/brick 49166 0 Y 269 Brick 192.168.3.5:/var/lib/heketi/mounts/vg _d5f17487744584e3652d3ca943b0b91b/brick_e15 c12cceae12c8ab7782dd57cf5b6c1/brick N/A N/A N N/A Brick 192.168.3.15:/var/lib/heketi/mounts/v g_462ea199185376b03e4b0317363bb88c/brick_17 36459d19e8aaa1dcb5a87f48747d04/brick 49173 0 Y 225 Self-heal Daemon on localhost N/A N/A Y 45826 Self-heal Daemon on 192.168.3.6 N/A N/A Y 65196 Self-heal Daemon on 192.168.3.15 N/A N/A Y 52915 Task Status of Volume vol_3442e86b6d994a14de73f1b8c82cf0b8 ------------------------------------------------------------------------------ We can see following events from when we start forcing volumes /mgmt/glusterd.so(+0xe2b3a) [0x7fca9e139b3a] -->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2605) [0x7fca9e139605] -->/lib64/libglusterfs.so.0(runner_log+0x115) [0x7fcaa346f0e5] ) 0-management: Ran script: /var/lib/glusterd/hooks/1/start/post/S29CTDBsetup.sh --volname=vol_3442e86b6d994a14de73f1b8c82cf0b8 --first=no --version=1 --volume-op=start --gd-workdir=/var/lib/glusterd [2019-01-21 08:22:34.555068] E [run.c:241:runner_log] (-->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2b3a) [0x7fca9e139b3a] -->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2563) [0x7fca9e139563] -->/lib64/libglusterfs.so.0(runner_log+0x115) [0x7fcaa346f0e5] ) 0-management: Failed to execute script: /var/lib/glusterd/hooks/1/start/post/S30samba-start.sh --volname=vol_3442e86b6d994a14de73f1b8c82cf0b8 --first=no --version=1 --volume-op=start --gd-workdir=/var/lib/glusterd [2019-01-21 08:22:53.389049] I [MSGID: 106499] 
[glusterd-handler.c:4314:__glusterd_handle_status_volume] 0-management: Received status volume req for volume vol_3442e86b6d994a14de73f1b8c82cf0b8 [2019-01-21 08:23:25.346839] I [MSGID: 106487] [glusterd-handler.c:1486:__glusterd_handle_cli_list_friends] 0-glusterd: Received cli list req We can see following events from when we heal volumes. [2019-01-21 08:20:07.576070] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-glusterfs: error returned while attempting to connect to host:(null), port:0 [2019-01-21 08:20:07.580225] I [cli-rpc-ops.c:9182:gf_cli_heal_volume_cbk] 0-cli: Received resp to heal volume [2019-01-21 08:20:07.580326] I [input.c:31:cli_batch] 0-: Exiting with: -1 [2019-01-21 08:22:30.423311] I [cli.c:768:main] 0-cli: Started running gluster with version 4.1.5 [2019-01-21 08:22:30.463648] I [MSGID: 101190] [event-epoll.c:617:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1 [2019-01-21 08:22:30.463718] I [socket.c:2632:socket_event_handler] 0-transport: EPOLLERR - disconnecting now [2019-01-21 08:22:30.463859] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-glusterfs: error returned while attempting to connect to host:(null), port:0 [2019-01-21 08:22:33.427710] I [socket.c:2632:socket_event_handler] 0-transport: EPOLLERR - disconnecting now [2019-01-21 08:22:34.581555] I [cli-rpc-ops.c:1472:gf_cli_start_volume_cbk] 0-cli: Received resp to start volume [2019-01-21 08:22:34.581678] I [input.c:31:cli_batch] 0-: Exiting with: 0 [2019-01-21 08:22:53.345351] I [cli.c:768:main] 0-cli: Started running gluster with version 4.1.5 [2019-01-21 08:22:53.387992] I [MSGID: 101190] [event-epoll.c:617:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1 [2019-01-21 08:22:53.388059] I [socket.c:2632:socket_event_handler] 0-transport: EPOLLERR - disconnecting now [2019-01-21 08:22:53.388138] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-glusterfs: error returned while attempting to connect to host:(null), port:0 [2019-01-21 08:22:53.394737] I [input.c:31:cli_batch] 0-: 
Exiting with: 0 [2019-01-21 08:23:25.304688] I [cli.c:768:main] 0-cli: Started running gluster with version 4.1.5 [2019-01-21 08:23:25.346319] I [MSGID: 101190] [event-epoll.c:617:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1 [2019-01-21 08:23:25.346389] I [socket.c:2632:socket_event_handler] 0-transport: EPOLLERR - disconnecting now [2019-01-21 08:23:25.346500] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-glusterfs: error returned while attempting to connect to host:(null), port:0 I enabled DEBUG mode at the brick level, but nothing is being written to the brick log. gluster volume set vol_3442e86b6d994a14de73f1b8c82cf0b8 diagnostics.brick-log-level DEBUG sh-4.2# pwd /var/log/glusterfs/bricks sh-4.2# ls -la |grep brick_e15c12cceae12c8ab7782dd57cf5b6c1 -rw-------. 1 root root 0 Jan 20 02:46 var-lib-heketi-mounts-vg_d5f17487744584e3652d3ca943b0b91b-brick_e15c12cceae12c8ab7782dd57cf5b6c1-brick.log From: Sanju Rakonde To: Shaik Salam Cc: Amar Tumballi Suryanarayan , " gluster-users at gluster.org List" Date: 01/22/2019 02:21 PM Subject: Re: [Gluster-users] [Bugs] Bricks are going offline unable to recover with heal/start force commands "External email. Open with Caution" Hi Shaik, Can you please provide us the complete glusterd and cmd_history logs from all the nodes in the cluster? Also please paste the output of the following commands (from all nodes): 1. gluster --version 2. gluster volume info 3. gluster volume status 4. gluster peer status 5. ps -ax | grep glusterfsd On Tue, Jan 22, 2019 at 12:47 PM Shaik Salam wrote: Hi Surya, This is already a customer setup and we can't redeploy it. I enabled debug for the brick-level log, but nothing is being written to it. Can you tell me whether there are any other ways to troubleshoot, or other logs to look at? From: Shaik Salam/HYD/TCS To: "Amar Tumballi Suryanarayan" Cc: "gluster-users at gluster.org List" Date: 01/22/2019 12:06 PM Subject: Re: [Bugs] Bricks are going offline unable to recover with heal/start force commands Hi Surya, I have enabled DEBUG mode at the brick level.
But nothing is being written to the brick log. gluster volume set vol_3442e86b6d994a14de73f1b8c82cf0b8 diagnostics.brick-log-level DEBUG sh-4.2# pwd /var/log/glusterfs/bricks sh-4.2# ls -la |grep brick_e15c12cceae12c8ab7782dd57cf5b6c1 -rw-------. 1 root root 0 Jan 20 02:46 var-lib-heketi-mounts-vg_d5f17487744584e3652d3ca943b0b91b-brick_e15c12cceae12c8ab7782dd57cf5b6c1-brick.log BR Salam From: "Amar Tumballi Suryanarayan" To: "Shaik Salam" Cc: "gluster-users at gluster.org List" Date: 01/22/2019 11:38 AM Subject: Re: [Bugs] Bricks are going offline unable to recover with heal/start force commands "External email. Open with Caution" Hi Shaik, Can you check what is in the brick logs? They are located in /var/log/glusterfs/bricks/*. It looks like the samba hook script failed, but that shouldn't matter in this use case. Also, I see that you are trying to set up heketi to provision volumes, which means you may be using gluster in container use cases. If you are still in the 'PoC' phase, can you give https://github.com/gluster/gcs a try? That makes the deployment and the stack a little simpler. -Amar On Tue, Jan 22, 2019 at 11:29 AM Shaik Salam wrote: Can anyone advise how to recover the bricks, other than with heal/start force, given the log events below? Please let me know if any other logs are required. Thanks in advance. BR Salam From: Shaik Salam/HYD/TCS To: bugs at gluster.org, gluster-users at gluster.org Date: 01/21/2019 10:03 PM Subject: Bricks are going offline unable to recover with heal/start force commands Hi, Bricks are offline and we are unable to recover them with the following commands: gluster volume heal gluster volume start force The bricks are still offline.
sh-4.2# gluster volume status vol_3442e86b6d994a14de73f1b8c82cf0b8 Status of volume: vol_3442e86b6d994a14de73f1b8c82cf0b8 Gluster process TCP Port RDMA Port Online Pid ------------------------------------------------------------------------------ Brick 192.168.3.6:/var/lib/heketi/mounts/vg _ca57f326195c243be2380ce4e42a4191/brick_952 d75fd193c7209c9a81acbc23a3747/brick 49166 0 Y 269 Brick 192.168.3.5:/var/lib/heketi/mounts/vg _d5f17487744584e3652d3ca943b0b91b/brick_e15 c12cceae12c8ab7782dd57cf5b6c1/brick N/A N/A N N/A Brick 192.168.3.15:/var/lib/heketi/mounts/v g_462ea199185376b03e4b0317363bb88c/brick_17 36459d19e8aaa1dcb5a87f48747d04/brick 49173 0 Y 225 Self-heal Daemon on localhost N/A N/A Y 45826 Self-heal Daemon on 192.168.3.6 N/A N/A Y 65196 Self-heal Daemon on 192.168.3.15 N/A N/A Y 52915 Task Status of Volume vol_3442e86b6d994a14de73f1b8c82cf0b8 ------------------------------------------------------------------------------ We can see following events from when we start forcing volumes /mgmt/glusterd.so(+0xe2b3a) [0x7fca9e139b3a] -->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2605) [0x7fca9e139605] -->/lib64/libglusterfs.so.0(runner_log+0x115) [0x7fcaa346f0e5] ) 0-management: Ran script: /var/lib/glusterd/hooks/1/start/post/S29CTDBsetup.sh --volname=vol_3442e86b6d994a14de73f1b8c82cf0b8 --first=no --version=1 --volume-op=start --gd-workdir=/var/lib/glusterd [2019-01-21 08:22:34.555068] E [run.c:241:runner_log] (-->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2b3a) [0x7fca9e139b3a] -->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2563) [0x7fca9e139563] -->/lib64/libglusterfs.so.0(runner_log+0x115) [0x7fcaa346f0e5] ) 0-management: Failed to execute script: /var/lib/glusterd/hooks/1/start/post/S30samba-start.sh --volname=vol_3442e86b6d994a14de73f1b8c82cf0b8 --first=no --version=1 --volume-op=start --gd-workdir=/var/lib/glusterd [2019-01-21 08:22:53.389049] I [MSGID: 106499] [glusterd-handler.c:4314:__glusterd_handle_status_volume] 
0-management: Received status volume req for volume vol_3442e86b6d994a14de73f1b8c82cf0b8 [2019-01-21 08:23:25.346839] I [MSGID: 106487] [glusterd-handler.c:1486:__glusterd_handle_cli_list_friends] 0-glusterd: Received cli list req We can see following events from when we heal volumes. [2019-01-21 08:20:07.576070] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-glusterfs: error returned while attempting to connect to host:(null), port:0 [2019-01-21 08:20:07.580225] I [cli-rpc-ops.c:9182:gf_cli_heal_volume_cbk] 0-cli: Received resp to heal volume [2019-01-21 08:20:07.580326] I [input.c:31:cli_batch] 0-: Exiting with: -1 [2019-01-21 08:22:30.423311] I [cli.c:768:main] 0-cli: Started running gluster with version 4.1.5 [2019-01-21 08:22:30.463648] I [MSGID: 101190] [event-epoll.c:617:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1 [2019-01-21 08:22:30.463718] I [socket.c:2632:socket_event_handler] 0-transport: EPOLLERR - disconnecting now [2019-01-21 08:22:30.463859] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-glusterfs: error returned while attempting to connect to host:(null), port:0 [2019-01-21 08:22:33.427710] I [socket.c:2632:socket_event_handler] 0-transport: EPOLLERR - disconnecting now [2019-01-21 08:22:34.581555] I [cli-rpc-ops.c:1472:gf_cli_start_volume_cbk] 0-cli: Received resp to start volume [2019-01-21 08:22:34.581678] I [input.c:31:cli_batch] 0-: Exiting with: 0 [2019-01-21 08:22:53.345351] I [cli.c:768:main] 0-cli: Started running gluster with version 4.1.5 [2019-01-21 08:22:53.387992] I [MSGID: 101190] [event-epoll.c:617:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1 [2019-01-21 08:22:53.388059] I [socket.c:2632:socket_event_handler] 0-transport: EPOLLERR - disconnecting now [2019-01-21 08:22:53.388138] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-glusterfs: error returned while attempting to connect to host:(null), port:0 [2019-01-21 08:22:53.394737] I [input.c:31:cli_batch] 0-: Exiting with: 0 [2019-01-21 08:23:25.304688] I 
[cli.c:768:main] 0-cli: Started running gluster with version 4.1.5 [2019-01-21 08:23:25.346319] I [MSGID: 101190] [event-epoll.c:617:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1 [2019-01-21 08:23:25.346389] I [socket.c:2632:socket_event_handler] 0-transport: EPOLLERR - disconnecting now [2019-01-21 08:23:25.346500] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-glusterfs: error returned while attempting to connect to host:(null), port:0 Please let us know steps to recover bricks. BR Salam -- Amar Tumballi (amarts) -- Thanks, Sanju From srakonde at redhat.com Thu Jan 24 09:02:26 2019 From: srakonde at redhat.com (Sanju Rakonde) Date: Thu, 24 Jan 2019 14:32:26 +0530 Subject: [Gluster-users] [Bugs] Bricks are going offline unable to recover with heal/start force commands In-Reply-To: References: Message-ID: Shaik, Sorry to ask this again. What errors are you seeing in glusterd logs? Can you share the latest logs? On Thu, Jan 24, 2019 at 2:05 PM Shaik Salam wrote: > Hi Sanju, > > Please find requested information.
> > Are you still seeing the error "Unable to read pidfile:" in glusterd log? > >>>> No > Are you seeing "brick is deemed not to be a part of the volume" error in > glusterd log?>>>> No > > sh-4.2# getfattr -m -d -e hex > /var/lib/heketi/mounts/vg_d5f17487744584e3652d3ca943b0b91b/brick_e15c12cceae12c8ab7782dd57cf5b6c1/brick > sh-4.2# getfattr -m -d -e hex > /var/lib/heketi/mounts/vg_d5f17487744584e3652d3ca943b0b91b/brick_e15c12cceae1^C8ab7782dd57cf5b6c1/brick > sh-4.2# pwd > > /var/lib/heketi/mounts/vg_d5f17487744584e3652d3ca943b0b91b/brick_e15c12cceae12c8ab7782dd57cf5b6c1/brick > sh-4.2# getfattr -m -d -e hex > /var/lib/heketi/mounts/vg_d5f17487744584e3652d3ca943b0b91b/brick_e15c12cceae12c8ab7782dd57cf5b6c1/brick > sh-4.2# getfattr -m -d -e hex > /var/lib/heketi/mounts/vg_d5f17487744584e3652d3ca943b0b91b/brick_e15c12cceae12c8ab7782dd57cf5b6c1/brick/ > sh-4.2# getfattr -m -d -e hex > /var/lib/heketi/mounts/vg_d5f17487744584e3652d3ca943b0b91b/brick_e15c12cceae12c8ab7782dd57cf5b6c1/brick/ > sh-4.2# getfattr -m -d -e hex > /var/lib/heketi/mounts/vg_d5f17487744584e3652d3ca943b0b91b/brick_e15c12cceae12c8ab7782dd57cf5b6c1/brick/ > sh-4.2# getfattr -d -m . 
-e hex > /var/lib/heketi/mounts/vg_d5f17487744584e3652d3ca943b0b91b/brick_e15c12cceae12c8ab7782dd57cf5b6c1/brick/ > getfattr: Removing leading '/' from absolute path names > # file: > var/lib/heketi/mounts/vg_d5f17487744584e3652d3ca943b0b91b/brick_e15c12cceae12c8ab7782dd57cf5b6c1/brick/ > > security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000 > trusted.afr.dirty=0x000000000000000000000000 > > trusted.afr.vol_3442e86b6d994a14de73f1b8c82cf0b8-client-0=0x000000000000000000000000 > trusted.gfid=0x00000000000000000000000000000001 > trusted.glusterfs.dht=0x000000010000000000000000ffffffff > trusted.glusterfs.volume-id=0x15477f3622e84757a0ce9000b63fa849 > > sh-4.2# ls -la |wc -l > 86 > sh-4.2# pwd > > /var/lib/heketi/mounts/vg_d5f17487744584e3652d3ca943b0b91b/brick_e15c12cceae12c8ab7782dd57cf5b6c1/brick > sh-4.2# > > > > From: "Sanju Rakonde" > To: "Shaik Salam" > Cc: "Amar Tumballi Suryanarayan" , " > gluster-users at gluster.org List" , "Murali > Kottakota" > Date: 01/24/2019 01:38 PM > Subject: Re: [Gluster-users] [Bugs] Bricks are going offline > unable to recover with heal/start force commands > ------------------------------ > > > > *"External email. Open with Caution"* > Shaik, > > Previously I was suspecting, whether brick pid file is missing. But I see > it is present. > > From second node (this brick is in offline state): > > /var/run/gluster/vols/vol_3442e86b6d994a14de73f1b8c82cf0b8/192.168.3.5-var-lib-heketi-mounts-vg_d5f17487744584e3652d3ca943b0b91b-brick_e15c12cceae12c8ab7782dd57cf5b6c1-brick.pid > 271 > Are you still seeing the error "Unable to read pidfile:" in glusterd log? > > I also suspect whether brick is missing its extended attributes. Are you > seeing "brick is deemed not to be a part of the volume" error in glusterd > log? 
If not can you please provide us output of "getfattr -m -d -e hex > " > > On Thu, Jan 24, 2019 at 12:18 PM Shaik Salam <*shaik.salam at tcs.com* > > wrote: > Hi Sanju, > > Could you please have look my issue if you have time (atleast provide > workaround). > > BR > Salam > > > > From: Shaik Salam/HYD/TCS > To: "Sanju Rakonde" <*srakonde at redhat.com* > > Cc: "Amar Tumballi Suryanarayan" <*atumball at redhat.com* > >, "*gluster-users at gluster.org* > List" <*gluster-users at gluster.org* > >, "Murali Kottakota" < > *murali.kottakota at tcs.com* > > Date: 01/23/2019 05:50 PM > Subject: Re: [Gluster-users] [Bugs] Bricks are going offline > unable to recover with heal/start force commands > ------------------------------ > > > > > Hi Sanju, > > Please find requested information. > > Sorry to repeat again I am trying start force command once brick log > enabled to debug by taking one volume example. > Please correct me If I am doing wrong. > > > [root at master ~]# oc rsh glusterfs-storage-vll7x > sh-4.2# gluster volume info vol_3442e86b6d994a14de73f1b8c82cf0b8 > > Volume Name: vol_3442e86b6d994a14de73f1b8c82cf0b8 > Type: Replicate > Volume ID: 15477f36-22e8-4757-a0ce-9000b63fa849 > Status: Started > Snapshot Count: 0 > Number of Bricks: 1 x 3 = 3 > Transport-type: tcp > Bricks: > Brick1: 192.168.3.6: > /var/lib/heketi/mounts/vg_ca57f326195c243be2380ce4e42a4191/brick_952d75fd193c7209c9a81acbc23a3747/brick > Brick2: 192.168.3.5: > /var/lib/heketi/mounts/vg_d5f17487744584e3652d3ca943b0b91b/ > brick_e15c12cceae12c8ab7782dd57cf5b6c1/brick > Brick3: 192.168.3.15: > /var/lib/heketi/mounts/vg_462ea199185376b03e4b0317363bb88c/brick_1736459d19e8aaa1dcb5a87f48747d04/brick > Options Reconfigured: > diagnostics.brick-log-level: INFO > performance.client-io-threads: off > nfs.disable: on > transport.address-family: inet > sh-4.2# gluster volume status vol_3442e86b6d994a14de73f1b8c82cf0b8 > Status of volume: vol_3442e86b6d994a14de73f1b8c82cf0b8 > Gluster process TCP Port RDMA Port 
Online > Pid > > ------------------------------------------------------------------------------ > Brick 192.168.3.6:/var/lib/heketi/mounts/vg > _ca57f326195c243be2380ce4e42a4191/brick_952 > d75fd193c7209c9a81acbc23a3747/brick 49157 0 Y > 250 > Brick 192.168.3.5:/var/lib/heketi/mounts/vg > _d5f17487744584e3652d3ca943b0b91b/brick_e15 > c12cceae12c8ab7782dd57cf5b6c1/brick N/A N/A N > N/A > Brick 192.168.3.15:/var/lib/heketi/mounts/v > g_462ea199185376b03e4b0317363bb88c/brick_17 > 36459d19e8aaa1dcb5a87f48747d04/brick 49173 0 Y > 225 > Self-heal Daemon on localhost N/A N/A Y > 108434 > Self-heal Daemon on matrix1.matrix.orange.l > ab N/A N/A Y > 69525 > Self-heal Daemon on matrix2.matrix.orange.l > ab N/A N/A Y > 18569 > > gluster volume set vol_3442e86b6d994a14de73f1b8c82cf0b8 > diagnostics.brick-log-level DEBUG > volume set: success > sh-4.2# gluster volume get vol_3442e86b6d994a14de73f1b8c82cf0b8 all |grep > log > cluster.entry-change-log on > cluster.data-change-log on > cluster.metadata-change-log on > diagnostics.brick-log-level DEBUG > > sh-4.2# cd /var/log/glusterfs/bricks/ > sh-4.2# ls -la |grep brick_e15c12cceae12c8ab7782dd57cf5b6c1 > -rw-------. 1 root root 0 Jan 20 02:46 > var-lib-heketi-mounts-vg_d5f17487744584e3652d3ca943b0b91b-brick_e15c12cceae12c8ab7782dd57cf5b6c1-brick.log > >>> Noting in log > > -rw-------. 
1 root root 189057 Jan 18 09:20 > var-lib-heketi-mounts-vg_d5f17487744584e3652d3ca943b0b91b-brick_e15c12cceae12c8ab7782dd57cf5b6c1-brick.log-20190120 > > [2019-01-23 11:49:32.475956] I [run.c:241:runner_log] > (-->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2b3a) > [0x7fca9e139b3a] > -->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2605) > [0x7fca9e139605] -->/lib64/libglusterfs.so.0(runner_log+0x115) > [0x7fcaa346f0e5] ) 0-management: Ran script: > /var/lib/glusterd/hooks/1/set/post/S30samba-set.sh > --volname=vol_3442e86b6d994a14de73f1b8c82cf0b8 -o > diagnostics.brick-log-level=DEBUG --gd-workdir=/var/lib/glusterd > [2019-01-23 11:49:32.483191] I [run.c:241:runner_log] > (-->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2b3a) > [0x7fca9e139b3a] > -->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2605) > [0x7fca9e139605] -->/lib64/libglusterfs.so.0(runner_log+0x115) > [0x7fcaa346f0e5] ) 0-management: Ran script: > /var/lib/glusterd/hooks/1/set/post/S32gluster_enable_shared_storage.sh > --volname=vol_3442e86b6d994a14de73f1b8c82cf0b8 -o > diagnostics.brick-log-level=DEBUG --gd-workdir=/var/lib/glusterd > [2019-01-23 11:48:59.111292] W [MSGID: 106036] > [glusterd-snapshot.c:9514:glusterd_handle_snapshot_fn] 0-management: > Snapshot list failed > [2019-01-23 11:50:14.112271] E [MSGID: 106026] > [glusterd-snapshot.c:3962:glusterd_handle_snapshot_list] 0-management: > Volume (vol_63854b105c40802bdec77290e91858ea) does not exist [Invalid > argument] > [2019-01-23 11:50:14.112305] W [MSGID: 106036] > [glusterd-snapshot.c:9514:glusterd_handle_snapshot_fn] 0-management: > Snapshot list failed > [2019-01-23 11:50:20.322902] I > [glusterd-utils.c:5994:glusterd_brick_start] 0-management: discovered > already-running brick > /var/lib/heketi/mounts/vg_d5f17487744584e3652d3ca943b0b91b/brick_e15c12cceae12c8ab7782dd57cf5b6c1/brick > [2019-01-23 11:50:20.322925] I [MSGID: 106142] > [glusterd-pmap.c:297:pmap_registry_bind] 0-pmap: adding brick > 
/var/lib/heketi/mounts/vg_d5f17487744584e3652d3ca943b0b91b/brick_e15c12cceae12c8ab7782dd57cf5b6c1/brick > on port 49165 > [2019-01-23 11:50:20.327557] I [MSGID: 106131] > [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: nfs already > stopped > [2019-01-23 11:50:20.327586] I [MSGID: 106568] > [glusterd-svc-mgmt.c:235:glusterd_svc_stop] 0-management: nfs service is > stopped > [2019-01-23 11:50:20.327604] I [MSGID: 106599] > [glusterd-nfs-svc.c:82:glusterd_nfssvc_manager] 0-management: nfs/server.so > xlator is not installed > [2019-01-23 11:50:20.337735] I [MSGID: 106568] > [glusterd-proc-mgmt.c:87:glusterd_proc_stop] 0-management: Stopping > glustershd daemon running in pid: 69525 > [2019-01-23 11:50:21.338058] I [MSGID: 106568] > [glusterd-svc-mgmt.c:235:glusterd_svc_stop] 0-management: glustershd > service is stopped > [2019-01-23 11:50:21.338180] I [MSGID: 106567] > [glusterd-svc-mgmt.c:203:glusterd_svc_start] 0-management: Starting > glustershd service > [2019-01-23 11:50:21.348234] I [MSGID: 106131] > [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: bitd already > stopped > [2019-01-23 11:50:21.348285] I [MSGID: 106568] > [glusterd-svc-mgmt.c:235:glusterd_svc_stop] 0-management: bitd service is > stopped > [2019-01-23 11:50:21.348866] I [MSGID: 106131] > [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: scrub already > stopped > [2019-01-23 11:50:21.348883] I [MSGID: 106568] > [glusterd-svc-mgmt.c:235:glusterd_svc_stop] 0-management: scrub service is > stopped > [2019-01-23 11:50:22.356502] I [run.c:241:runner_log] > (-->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2b3a) > [0x7fca9e139b3a] > -->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2605) > [0x7fca9e139605] -->/lib64/libglusterfs.so.0(runner_log+0x115) > [0x7fcaa346f0e5] ) 0-management: Ran script: > /var/lib/glusterd/hooks/1/start/post/S29CTDBsetup.sh > --volname=vol_3442e86b6d994a14de73f1b8c82cf0b8 --first=no --version=1 > --volume-op=start 
--gd-workdir=/var/lib/glusterd > [2019-01-23 11:50:22.368845] E [run.c:241:runner_log] > (-->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2b3a) > [0x7fca9e139b3a] > -->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2563) > [0x7fca9e139563] -->/lib64/libglusterfs.so.0(runner_log+0x115) > [0x7fcaa346f0e5] ) 0-management: Failed to execute script: > /var/lib/glusterd/hooks/1/start/post/S30samba-start.sh > --volname=vol_3442e86b6d994a14de73f1b8c82cf0b8 --first=no --version=1 > --volume-op=start --gd-workdir=/var/lib/glusterd > > > sh-4.2# gluster volume status vol_3442e86b6d994a14de73f1b8c82cf0b8 > Status of volume: vol_3442e86b6d994a14de73f1b8c82cf0b8 > Gluster process TCP Port RDMA Port Online > Pid > > ------------------------------------------------------------------------------ > Brick 192.168.3.6:/var/lib/heketi/mounts/vg > _ca57f326195c243be2380ce4e42a4191/brick_952 > d75fd193c7209c9a81acbc23a3747/brick 49157 0 Y > 250 > Brick 192.168.3.5:/var/lib/heketi/mounts/vg > _d5f17487744584e3652d3ca943b0b91b/brick_e15 > c12cceae12c8ab7782dd57cf5b6c1/brick N/A N/A N > N/A > Brick 192.168.3.15:/var/lib/heketi/mounts/v > g_462ea199185376b03e4b0317363bb88c/brick_17 > 36459d19e8aaa1dcb5a87f48747d04/brick 49173 0 Y > 225 > Self-heal Daemon on localhost N/A N/A Y > 109550 > Self-heal Daemon on 192.168.3.6 N/A N/A Y > 52557 > Self-heal Daemon on 192.168.3.15 N/A N/A Y > 16946 > > Task Status of Volume vol_3442e86b6d994a14de73f1b8c82cf0b8 > > ------------------------------------------------------------------------------ > There are no active volume tasks > > > > > From: "Sanju Rakonde" <*srakonde at redhat.com* > > To: "Shaik Salam" <*shaik.salam at tcs.com* > > Cc: "Amar Tumballi Suryanarayan" <*atumball at redhat.com* > >, "*gluster-users at gluster.org* > List" <*gluster-users at gluster.org* > >, "Murali Kottakota" < > *murali.kottakota at tcs.com* > > Date: 01/23/2019 02:15 PM > Subject: Re: [Gluster-users] [Bugs] Bricks are going offline > unable to 
recover with heal/start force commands > ------------------------------ > > > > * "External email. Open with Caution"* > Hi Shaik, > > I can see below errors in glusterd logs. > > [2019-01-22 09:20:17.540196] E [MSGID: 101012] > [common-utils.c:4010:gf_is_service_running] 0-: Unable to read pidfile: > /var/run/gluster/vols/vol_e1aa1283d5917485d88c4a742eeff422/192.168.3.6-var-lib-heketi-mounts-vg_526f35058433c6b03130bba4e0a7dd87-brick_9e7c382e5f853d471c347bc5590359af-brick.pid > > [2019-01-22 09:20:17.546408] E [MSGID: 101012] > [common-utils.c:4010:gf_is_service_running] 0-: Unable to read pidfile: > /var/run/gluster/vols/vol_f0ed498d7e781d7bb896244175b31f9e/192.168.3.6-var-lib-heketi-mounts-vg_56391bec3c8bfe4fc116de7bddfc2af4-brick_47ed9e0663ad0f6f676ddd6ad7e3dcde-brick.pid > > [2019-01-22 09:20:17.552575] E [MSGID: 101012] > [common-utils.c:4010:gf_is_service_running] 0-: Unable to read pidfile: > /var/run/gluster/vols/vol_f387519c9b004ec14e80696db88ef0f8/192.168.3.6-var-lib-heketi-mounts-vg_56391bec3c8bfe4fc116de7bddfc2af4-brick_06ad6c73dfbf6a5fc21334f98c9973c2-brick.pid > > [2019-01-22 09:20:17.558888] E [MSGID: 101012] > [common-utils.c:4010:gf_is_service_running] 0-: Unable to read pidfile: > /var/run/gluster/vols/vol_f8ca343c60e6efe541fe02d16ca02a7d/192.168.3.6-var-lib-heketi-mounts-vg_526f35058433c6b03130bba4e0a7dd87-brick_525225f65753b05dfe33aeaeb9c5de39-brick.pid > > [2019-01-22 09:20:17.565266] E [MSGID: 101012] > [common-utils.c:4010:gf_is_service_running] 0-: Unable to read pidfile: > /var/run/gluster/vols/vol_fe882e074c0512fd9271fc2ff5a0bfe1/192.168.3.6-var-lib-heketi-mounts-vg_28708570b029e5eff0a996c453a11691-brick_d4f30d6e465a8544b759a7016fb5aab5-brick.pid > > [2019-01-22 09:20:17.585926] E [MSGID: 106028] > [glusterd-utils.c:8222:glusterd_brick_signal] 0-glusterd: Unable to get pid > of brick process > [2019-01-22 09:20:17.617806] E [MSGID: 106028] > [glusterd-utils.c:8222:glusterd_brick_signal] 0-glusterd: Unable to get pid > of brick process > 
[2019-01-22 09:20:17.649628] E [MSGID: 101012] > [common-utils.c:4010:gf_is_service_running] 0-: Unable to read pidfile: > /var/run/gluster/glustershd/glustershd.pid > [2019-01-22 09:20:17.649700] E [MSGID: 101012] > [common-utils.c:4010:gf_is_service_running] 0-: Unable to read pidfile: > /var/run/gluster/glustershd/glustershd.pid > > So it looks like, neither gf_is_service_running() > nor glusterd_brick_signal() are able to read the pid file. That means > pidfiles might be having nothing to read. > > Can you please paste the contents of brick pidfiles. You can find brick > pidfiles in /var/run/gluster/vols// or you can just run this > command "for i in `ls /var/run/gluster/vols/*/*.pid`;do echo $i;cat > $i;done" > > On Wed, Jan 23, 2019 at 12:49 PM Shaik Salam <*shaik.salam at tcs.com* > > wrote: > Hi Sanju, > > Please find requested information attached logs. > > > > > Below brick is offline and try to start force/heal commands but doesn't > makes up. > > sh-4.2# > sh-4.2# gluster --version > glusterfs 4.1.5 > > > sh-4.2# gluster volume status vol_3442e86b6d994a14de73f1b8c82cf0b8 > Status of volume: vol_3442e86b6d994a14de73f1b8c82cf0b8 > Gluster process TCP Port RDMA Port Online > Pid > > ------------------------------------------------------------------------------ > Brick 192.168.3.6:/var/lib/heketi/mounts/vg > _ca57f326195c243be2380ce4e42a4191/brick_952 > d75fd193c7209c9a81acbc23a3747/brick 49166 0 Y > 269 > Brick 192.168.3.5:/var/lib/heketi/mounts/vg > _d5f17487744584e3652d3ca943b0b91b/brick_e15 > c12cceae12c8ab7782dd57cf5b6c1/brick N/A N/A N > N/A > Brick 192.168.3.15:/var/lib/heketi/mounts/v > g_462ea199185376b03e4b0317363bb88c/brick_17 > 36459d19e8aaa1dcb5a87f48747d04/brick 49173 0 Y > 225 > Self-heal Daemon on localhost N/A N/A Y > 45826 > Self-heal Daemon on 192.168.3.6 N/A N/A Y > 65196 > Self-heal Daemon on 192.168.3.15 N/A N/A Y > 52915 > > Task Status of Volume vol_3442e86b6d994a14de73f1b8c82cf0b8 > > 
------------------------------------------------------------------------------ > > > We can see following events from when we start forcing volumes > > /mgmt/glusterd.so(+0xe2b3a) [0x7fca9e139b3a] > -->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2605) > [0x7fca9e139605] -->/lib64/libglusterfs.so.0(runner_log+0x115) > [0x7fcaa346f0e5] ) 0-management: Ran script: > /var/lib/glusterd/hooks/1/start/post/S29CTDBsetup.sh > --volname=vol_3442e86b6d994a14de73f1b8c82cf0b8 --first=no --version=1 > --volume-op=start --gd-workdir=/var/lib/glusterd > [2019-01-21 08:22:34.555068] E [run.c:241:runner_log] > (-->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2b3a) > [0x7fca9e139b3a] > -->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2563) > [0x7fca9e139563] -->/lib64/libglusterfs.so.0(runner_log+0x115) > [0x7fcaa346f0e5] ) 0-management: Failed to execute script: > /var/lib/glusterd/hooks/1/start/post/S30samba-start.sh > --volname=vol_3442e86b6d994a14de73f1b8c82cf0b8 --first=no --version=1 > --volume-op=start --gd-workdir=/var/lib/glusterd > [2019-01-21 08:22:53.389049] I [MSGID: 106499] > [glusterd-handler.c:4314:__glusterd_handle_status_volume] 0-management: > Received status volume req for volume vol_3442e86b6d994a14de73f1b8c82cf0b8 > [2019-01-21 08:23:25.346839] I [MSGID: 106487] > [glusterd-handler.c:1486:__glusterd_handle_cli_list_friends] 0-glusterd: > Received cli list req > > > We can see following events from when we heal volumes. 
> > [2019-01-21 08:20:07.576070] W [rpc-clnt.c:1753:rpc_clnt_submit] > 0-glusterfs: error returned while attempting to connect to host:(null), > port:0 > [2019-01-21 08:20:07.580225] I [cli-rpc-ops.c:9182:gf_cli_heal_volume_cbk] > 0-cli: Received resp to heal volume > [2019-01-21 08:20:07.580326] I [input.c:31:cli_batch] 0-: Exiting with: -1 > [2019-01-21 08:22:30.423311] I [cli.c:768:main] 0-cli: Started running > gluster with version 4.1.5 > [2019-01-21 08:22:30.463648] I [MSGID: 101190] > [event-epoll.c:617:event_dispatch_epoll_worker] 0-epoll: Started thread > with index 1 > [2019-01-21 08:22:30.463718] I [socket.c:2632:socket_event_handler] > 0-transport: EPOLLERR - disconnecting now > [2019-01-21 08:22:30.463859] W [rpc-clnt.c:1753:rpc_clnt_submit] > 0-glusterfs: error returned while attempting to connect to host:(null), > port:0 > [2019-01-21 08:22:33.427710] I [socket.c:2632:socket_event_handler] > 0-transport: EPOLLERR - disconnecting now > [2019-01-21 08:22:34.581555] I > [cli-rpc-ops.c:1472:gf_cli_start_volume_cbk] 0-cli: Received resp to start > volume > [2019-01-21 08:22:34.581678] I [input.c:31:cli_batch] 0-: Exiting with: 0 > [2019-01-21 08:22:53.345351] I [cli.c:768:main] 0-cli: Started running > gluster with version 4.1.5 > [2019-01-21 08:22:53.387992] I [MSGID: 101190] > [event-epoll.c:617:event_dispatch_epoll_worker] 0-epoll: Started thread > with index 1 > [2019-01-21 08:22:53.388059] I [socket.c:2632:socket_event_handler] > 0-transport: EPOLLERR - disconnecting now > [2019-01-21 08:22:53.388138] W [rpc-clnt.c:1753:rpc_clnt_submit] > 0-glusterfs: error returned while attempting to connect to host:(null), > port:0 > [2019-01-21 08:22:53.394737] I [input.c:31:cli_batch] 0-: Exiting with: 0 > [2019-01-21 08:23:25.304688] I [cli.c:768:main] 0-cli: Started running > gluster with version 4.1.5 > [2019-01-21 08:23:25.346319] I [MSGID: 101190] > [event-epoll.c:617:event_dispatch_epoll_worker] 0-epoll: Started thread > with index 1 > [2019-01-21 
08:23:25.346389] I [socket.c:2632:socket_event_handler] > 0-transport: EPOLLERR - disconnecting now > [2019-01-21 08:23:25.346500] W [rpc-clnt.c:1753:rpc_clnt_submit] > 0-glusterfs: error returned while attempting to connect to host:(null), > port:0 > > > Enabled DEBUG mode for brick level. But nothing writing to brick log. > > gluster volume set vol_3442e86b6d994a14de73f1b8c82cf0b8 > diagnostics.brick-log-level DEBUG > > sh-4.2# pwd > /var/log/glusterfs/bricks > > sh-4.2# ls -la |grep brick_e15c12cceae12c8ab7782dd57cf5b6c1 > -rw-------. 1 root root 0 Jan 20 02:46 > var-lib-heketi-mounts-vg_d5f17487744584e3652d3ca943b0b91b-brick_e15c12cceae12c8ab7782dd57cf5b6c1-brick.log > > > > > > > From: Sanju Rakonde <*srakonde at redhat.com* > > To: Shaik Salam <*shaik.salam at tcs.com* > > Cc: Amar Tumballi Suryanarayan <*atumball at redhat.com* > >, "*gluster-users at gluster.org* > List" <*gluster-users at gluster.org* > > > Date: 01/22/2019 02:21 PM > Subject: Re: [Gluster-users] [Bugs] Bricks are going offline > unable to recover with heal/start force commands > ------------------------------ > > > > * "External email. Open with Caution"* > Hi Shaik, > > Can you please provide us complete glusterd and cmd_history logs from all > the nodes in the cluster? Also please paste output of the following > commands (from all nodes): > 1. gluster --version > 2. gluster volume info > 3. gluster volume status > 4. gluster peer status > 5. ps -ax | grep glusterfsd > > On Tue, Jan 22, 2019 at 12:47 PM Shaik Salam <*shaik.salam at tcs.com* > > wrote: > Hi Surya, > > It is already customer setup and cant redeploy again. > Enabled debug for brick level log but nothing writing to it. > Can you tell me is any other ways to troubleshoot or logs to look?? 
> > > From: Shaik Salam/HYD/TCS > To: "Amar Tumballi Suryanarayan" <*atumball at redhat.com* > > > Cc: "*gluster-users at gluster.org* List" > <*gluster-users at gluster.org* > > Date: 01/22/2019 12:06 PM > Subject: Re: [Bugs] Bricks are going offline unable to recover > with heal/start force commands > ------------------------------ > > > Hi Surya, > > I have enabled DEBUG mode for brick level. But nothing writing to brick > log. > > gluster volume set vol_3442e86b6d994a14de73f1b8c82cf0b8 > diagnostics.brick-log-level DEBUG > > sh-4.2# pwd > /var/log/glusterfs/bricks > > sh-4.2# ls -la |grep brick_e15c12cceae12c8ab7782dd57cf5b6c1 > -rw-------. 1 root root 0 Jan 20 02:46 > var-lib-heketi-mounts-vg_d5f17487744584e3652d3ca943b0b91b-brick_e15c12cceae12c8ab7782dd57cf5b6c1-brick.log > > BR > Salam > > > > > From: "Amar Tumballi Suryanarayan" <*atumball at redhat.com* > > > To: "Shaik Salam" <*shaik.salam at tcs.com* > > Cc: "*gluster-users at gluster.org* List" > <*gluster-users at gluster.org* > > Date: 01/22/2019 11:38 AM > Subject: Re: [Bugs] Bricks are going offline unable to recover > with heal/start force commands > ------------------------------ > > > > * "External email. Open with Caution"* > Hi Shaik, > > Can you check what is there in brick logs? They are located in > /var/log/glusterfs/bricks/*? > > Looks like the samba hooks script failed, but that shouldn't matter in > this use case. > > Also, I see that you are trying to setup heketi to provision volumes, > which means you may be using gluster in container usecases. If you are > still in 'PoC' phase, can you give *https://github.com/gluster/gcs* > a try? That makes the deployment and the > stack little simpler. > > -Amar > > > > > On Tue, Jan 22, 2019 at 11:29 AM Shaik Salam <*shaik.salam at tcs.com* > > wrote: > Can anyone respond how to recover bricks apart from heal/start force > according to below events from logs. > Please let me know any other logs required. > Thanks in advance. 
> > BR > Salam > > > > From: Shaik Salam/HYD/TCS > To: *bugs at gluster.org* , > *gluster-users at gluster.org* > Date: 01/21/2019 10:03 PM > Subject: Bricks are going offline unable to recover with > heal/start force commands > ------------------------------ > > > Hi, > > Bricks are in offline and unable to recover with following commands > > gluster volume heal > > gluster volume start force > > But still bricks are offline. > > > sh-4.2# gluster volume status vol_3442e86b6d994a14de73f1b8c82cf0b8 > Status of volume: vol_3442e86b6d994a14de73f1b8c82cf0b8 > Gluster process TCP Port RDMA Port Online > Pid > > ------------------------------------------------------------------------------ > Brick 192.168.3.6:/var/lib/heketi/mounts/vg > _ca57f326195c243be2380ce4e42a4191/brick_952 > d75fd193c7209c9a81acbc23a3747/brick 49166 0 Y > 269 > Brick 192.168.3.5:/var/lib/heketi/mounts/vg > _d5f17487744584e3652d3ca943b0b91b/brick_e15 > c12cceae12c8ab7782dd57cf5b6c1/brick N/A N/A N > N/A > Brick 192.168.3.15:/var/lib/heketi/mounts/v > g_462ea199185376b03e4b0317363bb88c/brick_17 > 36459d19e8aaa1dcb5a87f48747d04/brick 49173 0 Y > 225 > Self-heal Daemon on localhost N/A N/A Y > 45826 > Self-heal Daemon on 192.168.3.6 N/A N/A Y > 65196 > Self-heal Daemon on 192.168.3.15 N/A N/A Y > 52915 > > Task Status of Volume vol_3442e86b6d994a14de73f1b8c82cf0b8 > > ------------------------------------------------------------------------------ > > > We can see following events from when we start forcing volumes > > /mgmt/glusterd.so(+0xe2b3a) [0x7fca9e139b3a] > -->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2605) > [0x7fca9e139605] -->/lib64/libglusterfs.so.0(runner_log+0x115) > [0x7fcaa346f0e5] ) 0-management: Ran script: > /var/lib/glusterd/hooks/1/start/post/S29CTDBsetup.sh > --volname=vol_3442e86b6d994a14de73f1b8c82cf0b8 --first=no --version=1 > --volume-op=start --gd-workdir=/var/lib/glusterd > [2019-01-21 08:22:34.555068] E [run.c:241:runner_log] > 
(-->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2b3a) > [0x7fca9e139b3a] > -->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2563) > [0x7fca9e139563] -->/lib64/libglusterfs.so.0(runner_log+0x115) > [0x7fcaa346f0e5] ) 0-management: Failed to execute script: > /var/lib/glusterd/hooks/1/start/post/S30samba-start.sh > --volname=vol_3442e86b6d994a14de73f1b8c82cf0b8 --first=no --version=1 > --volume-op=start --gd-workdir=/var/lib/glusterd > [2019-01-21 08:22:53.389049] I [MSGID: 106499] > [glusterd-handler.c:4314:__glusterd_handle_status_volume] 0-management: > Received status volume req for volume vol_3442e86b6d994a14de73f1b8c82cf0b8 > [2019-01-21 08:23:25.346839] I [MSGID: 106487] > [glusterd-handler.c:1486:__glusterd_handle_cli_list_friends] 0-glusterd: > Received cli list req > > > We can see following events from when we heal volumes. > > [2019-01-21 08:20:07.576070] W [rpc-clnt.c:1753:rpc_clnt_submit] > 0-glusterfs: error returned while attempting to connect to host:(null), > port:0 > [2019-01-21 08:20:07.580225] I [cli-rpc-ops.c:9182:gf_cli_heal_volume_cbk] > 0-cli: Received resp to heal volume > [2019-01-21 08:20:07.580326] I [input.c:31:cli_batch] 0-: Exiting with: -1 > [2019-01-21 08:22:30.423311] I [cli.c:768:main] 0-cli: Started running > gluster with version 4.1.5 > [2019-01-21 08:22:30.463648] I [MSGID: 101190] > [event-epoll.c:617:event_dispatch_epoll_worker] 0-epoll: Started thread > with index 1 > [2019-01-21 08:22:30.463718] I [socket.c:2632:socket_event_handler] > 0-transport: EPOLLERR - disconnecting now > [2019-01-21 08:22:30.463859] W [rpc-clnt.c:1753:rpc_clnt_submit] > 0-glusterfs: error returned while attempting to connect to host:(null), > port:0 > [2019-01-21 08:22:33.427710] I [socket.c:2632:socket_event_handler] > 0-transport: EPOLLERR - disconnecting now > [2019-01-21 08:22:34.581555] I > [cli-rpc-ops.c:1472:gf_cli_start_volume_cbk] 0-cli: Received resp to start > volume > [2019-01-21 08:22:34.581678] I 
[input.c:31:cli_batch] 0-: Exiting with: 0 > [2019-01-21 08:22:53.345351] I [cli.c:768:main] 0-cli: Started running > gluster with version 4.1.5 > [2019-01-21 08:22:53.387992] I [MSGID: 101190] > [event-epoll.c:617:event_dispatch_epoll_worker] 0-epoll: Started thread > with index 1 > [2019-01-21 08:22:53.388059] I [socket.c:2632:socket_event_handler] > 0-transport: EPOLLERR - disconnecting now > [2019-01-21 08:22:53.388138] W [rpc-clnt.c:1753:rpc_clnt_submit] > 0-glusterfs: error returned while attempting to connect to host:(null), > port:0 > [2019-01-21 08:22:53.394737] I [input.c:31:cli_batch] 0-: Exiting with: 0 > [2019-01-21 08:23:25.304688] I [cli.c:768:main] 0-cli: Started running > gluster with version 4.1.5 > [2019-01-21 08:23:25.346319] I [MSGID: 101190] > [event-epoll.c:617:event_dispatch_epoll_worker] 0-epoll: Started thread > with index 1 > [2019-01-21 08:23:25.346389] I [socket.c:2632:socket_event_handler] > 0-transport: EPOLLERR - disconnecting now > [2019-01-21 08:23:25.346500] W [rpc-clnt.c:1753:rpc_clnt_submit] > 0-glusterfs: error returned while attempting to connect to host:(null), > port:0 > > > > Please let us know steps to recover bricks. > > > BR > Salam > =====-----=====-----===== > Notice: The information contained in this e-mail > message and/or attachments to it may contain > confidential or privileged information. If you are > not the intended recipient, any dissemination, use, > review, distribution, printing or copying of the > information contained in this e-mail message > and/or attachments to it are strictly prohibited. If > you have received this communication in error, > please notify us by reply e-mail or telephone and > immediately and permanently delete the message > and any attachments. 
Thank you
> _______________________________________________
> Bugs mailing list
> *Bugs at gluster.org*
> *https://lists.gluster.org/mailman/listinfo/bugs*
>
> --
> Amar Tumballi (amarts)
> _______________________________________________
> Gluster-users mailing list
> *Gluster-users at gluster.org*
> *https://lists.gluster.org/mailman/listinfo/gluster-users*
>
> --
> Thanks,
> Sanju

--
Thanks,
Sanju

From shaik.salam at tcs.com Thu Jan 24 09:53:12 2019
From: shaik.salam at tcs.com (Shaik Salam)
Date: Thu, 24 Jan 2019 15:23:12 +0530
Subject: [Gluster-users] Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: server busy
In-Reply-To:
References:
Message-ID:

Hi Madhu,

This is the complete one after the restart of the heketi pod, along with the process log.

BR
Salam

From: "Madhu Rajanna"
To: "Shaik Salam"
Cc: "gluster-users at gluster.org List", "Michael Adam"
Date: 01/24/2019 01:55 PM
Subject: Re: Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: server busy

The logs you provided are not complete, so I am not able to figure out which command is stuck. Can you reattach the complete output of `ps aux` and also attach the complete heketi logs?

On Thu, Jan 24, 2019 at 1:41 PM Shaik Salam wrote:

Hi Madhu,

Please find requested info.

BR
Salam

From: Madhu Rajanna
To: Shaik Salam
Cc: "gluster-users at gluster.org List", Michael Adam
Date: 01/24/2019 01:33 PM
Subject: Re: Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: server busy

The heketi logs you attached are not complete, I believe. Can you provide the complete heketi logs, and can we also get the output of "ps aux" from the gluster pods? I want to see if any lvm commands or gluster commands are "stuck".

On Thu, Jan 24, 2019 at 1:16 PM Shaik Salam wrote:

Hi Madhu.
I tried restarting the heketi pod many times, but the issue is not resolved.

sh-4.4# heketi-cli server operations info
Operation Counts:
  Total: 0
  In-Flight: 0
  New: 0
  Stale: 0

Now you can see all operation counts are zero. When I then try to create a single volume, the in-flight count slowly climbs to 8, as observed below.

sh-4.4# heketi-cli server operations info
Operation Counts:
  Total: 0
  In-Flight: 6
  New: 0
  Stale: 0
sh-4.4# heketi-cli server operations info
Operation Counts:
  Total: 0
  In-Flight: 7
  New: 0
  Stale: 0
sh-4.4# heketi-cli server operations info
Operation Counts:
  Total: 0
  In-Flight: 7
  New: 0
  Stale: 0
sh-4.4# heketi-cli server operations info
Operation Counts:
  Total: 0
  In-Flight: 7
  New: 0
  Stale: 0
sh-4.4# heketi-cli server operations info
Operation Counts:
  Total: 0
  In-Flight: 7
  New: 0
  Stale: 0

[negroni] Completed 200 OK in 186.286µs
[negroni] Started POST /volumes
[negroni] Started GET /operations
[negroni] Completed 200 OK in 166.294µs
[negroni] Started GET /operations
[negroni] Completed 200 OK in 186.411µs
[negroni] Started GET /operations
[negroni] Completed 200 OK in 179.796µs
[negroni] Started POST /volumes
[negroni] Started POST /volumes
[negroni] Started POST /volumes
[negroni] Started POST /volumes
[negroni] Started GET /operations
[negroni] Completed 200 OK in 131.108µs
[negroni] Started POST /volumes
[negroni] Started GET /operations
[negroni] Completed 200 OK in 111.392µs
[negroni] Started GET /operations
[negroni] Completed 200 OK in 265.023µs
[negroni] Started GET /operations
[negroni] Completed 200 OK in 179.364µs
[negroni] Started GET /operations
[negroni] Completed 200 OK in 295.058µs
[negroni] Started GET /operations
[negroni] Completed 200 OK in 146.857µs
[negroni] Started POST /volumes
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/24 07:43:36 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 403.166µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/24 07:43:51 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 193.554µs

But the volume for the pod is not getting created.

1:15:36 PM Warning Provisioning failed Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: create volume err: error creating volume Server busy. Retry operation later.. 9 times in the last 2 minutes
1:13:21 PM Warning Provisioning failed Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: create volume err: error creating volume . 8 times in the last

From: "Madhu Rajanna"
To: "Shaik Salam"
Cc: "gluster-users at gluster.org List", "Michael Adam"
Date: 01/24/2019 12:51 PM
Subject: Re: Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: server busy

Hi Shaik,

Can you provide me the output of

$ heketi-cli server operations info

from the heketi pod?

As a workaround you can try restarting the heketi pod. This will cause the current operations to go stale, but other pending PVCs may go to Bound state.

Regards,
Madhu R

On Thu, Jan 24, 2019 at 12:36 PM Shaik Salam wrote:

Hi Madhu,

Could you please have a look at my issue if you have time (at least a workaround)?
I am unable to send mail to "John Mulligan", who is currently handling issue https://bugzilla.redhat.com/show_bug.cgi?id=1636912

BR
Salam

From: Shaik Salam/HYD/TCS
To: "John Mulligan", "Michael Adam" <madam at redhat.com>, "Madhu Rajanna"
Cc: "gluster-users at gluster.org List"
Date: 01/24/2019 12:21 PM
Subject: Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: server busy

Hi All,

We are also facing the following issue on OpenShift Origin while creating PVCs for pods (please provide at least a workaround so we can move further).

Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: create volume err: error creating volume
Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: create volume err: error creating volume Server busy. Retry operation later..

Please find heketidb dump and log

[negroni] Completed 429 Too Many Requests in 250.763µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:07:49 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 169.08µs
[negroni] Started DELETE /volumes/520bc5f4e1bfd029855a72f9ca7ebf6c
[negroni] Completed 404 Not Found in 148.125µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:08:04 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 496.624µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:08:04 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 101.673µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:08:19 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 209.681µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:08:19 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 103.595µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:08:34 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 297.594µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:08:34 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 96.75µs
[negroni] Started POST /volumes
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:08:49 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 477.007µs
[heketi] WARNING 2019/01/23 12:08:49 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 165.38µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:09:04 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 488.253µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:09:04 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 171.836µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:09:19 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 208.59µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:09:19 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 125.141µs
[negroni] Started DELETE /volumes/99e87ecd0a816ac34ae5a04eabc1d606
[negroni] Completed 404 Not Found in 138.687µs
[negroni] Started POST /volumes

BR
Salam

--
Madhu Rajanna
Software Engineer
Red Hat Bangalore, India
mrajanna at redhat.com
M: +91-9741133155

-------------- next part --------------
A non-text attachment was scrubbed...
Name: heketi-pod-complete.log
Type: application/octet-stream
Size: 135146 bytes
Desc: not available
URL:
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: ps-aux.txt URL: From shaik.salam at tcs.com Thu Jan 24 10:29:02 2019 From: shaik.salam at tcs.com (Shaik Salam) Date: Thu, 24 Jan 2019 15:59:02 +0530 Subject: [Gluster-users] [Bugs] Bricks are going offline unable to recover with heal/start force commands In-Reply-To: References: Message-ID: Hi Sanju, Please find requested information (these are latest logs :) ). I can see only following error messages related to brick "brick_e15c12cceae12c8ab7782dd57cf5b6c1" (on secondnode log) [2019-01-23 11:50:20.322902] I [glusterd-utils.c:5994:glusterd_brick_start] 0-management: discovered already-running brick /var/lib/heketi/mounts/vg_d5f17487744584e3652d3ca943b0b91b/brick_e15c12cceae12c8ab7782dd57cf5b6c1/brick [2019-01-23 11:50:20.322925] I [MSGID: 106142] [glusterd-pmap.c:297:pmap_registry_bind] 0-pmap: adding brick /var/lib/heketi/mounts/vg_d5f17487744584e3652d3ca943b0b91b/brick_e15c12cceae12c8ab7782dd57cf5b6c1/ brick on port 49165 >> showing running on port but not [2019-01-23 11:50:20.327557] I [MSGID: 106131] [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: nfs already stopped [2019-01-23 11:50:20.327586] I [MSGID: 106568] [glusterd-svc-mgmt.c:235:glusterd_svc_stop] 0-management: nfs service is stopped [2019-01-23 11:50:20.327604] I [MSGID: 106599] [glusterd-nfs-svc.c:82:glusterd_nfssvc_manager] 0-management: nfs/server.so xlator is not installed [2019-01-23 11:50:20.337735] I [MSGID: 106568] [glusterd-proc-mgmt.c:87:glusterd_proc_stop] 0-management: Stopping glustershd daemon running in pid: 69525 [2019-01-23 11:50:21.338058] I [MSGID: 106568] [glusterd-svc-mgmt.c:235:glusterd_svc_stop] 0-management: glustershd service is stopped [2019-01-23 11:50:21.338180] I [MSGID: 106567] [glusterd-svc-mgmt.c:203:glusterd_svc_start] 0-management: Starting glustershd service [2019-01-23 11:50:21.348234] I [MSGID: 106131] [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: bitd already stopped [2019-01-23 11:50:21.348285] I [MSGID: 106568] 
[glusterd-svc-mgmt.c:235:glusterd_svc_stop] 0-management: bitd service is stopped [2019-01-23 11:50:21.348866] I [MSGID: 106131] [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: scrub already stopped [2019-01-23 11:50:21.348883] I [MSGID: 106568] [glusterd-svc-mgmt.c:235:glusterd_svc_stop] 0-management: scrub service is stopped [2019-01-23 11:50:22.356502] I [run.c:241:runner_log] (-->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2b3a) [0x7fca9e139b3a] -->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2605) [0x7fca9e139605] -->/lib64/libglusterfs.so.0(runner_log+0x115) [0x7fcaa346f0e5] ) 0-management: Ran script: /var/lib/glusterd/hooks/1/start/post/S29CTDBsetup.sh --volname=vol_3442e86b6d994a14de73f1b8c82cf0b8 --first=no --version=1 --volume-op=start --gd-workdir=/var/lib/glusterd [2019-01-23 11:50:22.368845] E [run.c:241:runner_log] (-->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2b3a) [0x7fca9e139b3a] -->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2563) [0x7fca9e139563] -->/lib64/libglusterfs.so.0(runner_log+0x115) [0x7fcaa346f0e5] ) 0-management: Failed to execute script: /var/lib/glusterd/hooks/1/start/post/S30samba-start.sh --volname=vol_3442e86b6d994a14de73f1b8c82cf0b8 --first=no --version=1 --volume-op=start --gd-workdir=/var/lib/glusterd sh-4.2# gluster volume status vol_3442e86b6d994a14de73f1b8c82cf0b8 Status of volume: vol_3442e86b6d994a14de73f1b8c82cf0b8 Gluster process TCP Port RDMA Port Online Pid ------------------------------------------------------------------------------ Brick 192.168.3.6:/var/lib/heketi/mounts/vg _ca57f326195c243be2380ce4e42a4191/brick_952 d75fd193c7209c9a81acbc23a3747/brick 49157 0 Y 250 Brick 192.168.3.5:/var/lib/heketi/mounts/vg _d5f17487744584e3652d3ca943b0b91b/brick_e15 c12cceae12c8ab7782dd57cf5b6c1/brick N/A N/A N N/A Brick 192.168.3.15:/var/lib/heketi/mounts/v g_462ea199185376b03e4b0317363bb88c/brick_17 36459d19e8aaa1dcb5a87f48747d04/brick 49173 0 Y 225 Self-heal Daemon on 
localhost N/A N/A Y 109550 Self-heal Daemon on 192.168.3.6 N/A N/A Y 52557 Self-heal Daemon on 192.168.3.15 N/A N/A Y 16946 Task Status of Volume vol_3442e86b6d994a14de73f1b8c82cf0b8 ------------------------------------------------------------------------------ There are no active volume tasks BR Salam From: "Sanju Rakonde" To: "Shaik Salam" Cc: "Amar Tumballi Suryanarayan" , "gluster-users at gluster.org List" , "Murali Kottakota" Date: 01/24/2019 02:32 PM Subject: Re: [Gluster-users] [Bugs] Bricks are going offline unable to recover with heal/start force commands "External email. Open with Caution" Shaik, Sorry to ask this again. What errors are you seeing in glusterd logs? Can you share the latest logs? On Thu, Jan 24, 2019 at 2:05 PM Shaik Salam wrote: Hi Sanju, Please find requsted information. Are you still seeing the error "Unable to read pidfile:" in glusterd log? >>>> No Are you seeing "brick is deemed not to be a part of the volume" error in glusterd log?>>>> No sh-4.2# getfattr -m -d -e hex /var/lib/heketi/mounts/vg_d5f17487744584e3652d3ca943b0b91b/brick_e15c12cceae12c8ab7782dd57cf5b6c1/brick sh-4.2# getfattr -m -d -e hex /var/lib/heketi/mounts/vg_d5f17487744584e3652d3ca943b0b91b/brick_e15c12cceae1^C8ab7782dd57cf5b6c1/brick sh-4.2# pwd /var/lib/heketi/mounts/vg_d5f17487744584e3652d3ca943b0b91b/brick_e15c12cceae12c8ab7782dd57cf5b6c1/brick sh-4.2# getfattr -m -d -e hex /var/lib/heketi/mounts/vg_d5f17487744584e3652d3ca943b0b91b/brick_e15c12cceae12c8ab7782dd57cf5b6c1/brick sh-4.2# getfattr -m -d -e hex /var/lib/heketi/mounts/vg_d5f17487744584e3652d3ca943b0b91b/brick_e15c12cceae12c8ab7782dd57cf5b6c1/brick/ sh-4.2# getfattr -m -d -e hex /var/lib/heketi/mounts/vg_d5f17487744584e3652d3ca943b0b91b/brick_e15c12cceae12c8ab7782dd57cf5b6c1/brick/ sh-4.2# getfattr -m -d -e hex /var/lib/heketi/mounts/vg_d5f17487744584e3652d3ca943b0b91b/brick_e15c12cceae12c8ab7782dd57cf5b6c1/brick/ sh-4.2# getfattr -d -m . 
-e hex /var/lib/heketi/mounts/vg_d5f17487744584e3652d3ca943b0b91b/brick_e15c12cceae12c8ab7782dd57cf5b6c1/brick/
getfattr: Removing leading '/' from absolute path names
# file: var/lib/heketi/mounts/vg_d5f17487744584e3652d3ca943b0b91b/brick_e15c12cceae12c8ab7782dd57cf5b6c1/brick/
security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.vol_3442e86b6d994a14de73f1b8c82cf0b8-client-0=0x000000000000000000000000
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x000000010000000000000000ffffffff
trusted.glusterfs.volume-id=0x15477f3622e84757a0ce9000b63fa849

sh-4.2# ls -la |wc -l
86
sh-4.2# pwd
/var/lib/heketi/mounts/vg_d5f17487744584e3652d3ca943b0b91b/brick_e15c12cceae12c8ab7782dd57cf5b6c1/brick
sh-4.2#

From: "Sanju Rakonde"
To: "Shaik Salam"
Cc: "Amar Tumballi Suryanarayan", "gluster-users at gluster.org List", "Murali Kottakota"
Date: 01/24/2019 01:38 PM
Subject: Re: [Gluster-users] [Bugs] Bricks are going offline unable to recover with heal/start force commands

Shaik,

Previously I suspected that the brick pid file was missing, but I see it is present.

From second node (this brick is in offline state):
/var/run/gluster/vols/vol_3442e86b6d994a14de73f1b8c82cf0b8/192.168.3.5-var-lib-heketi-mounts-vg_d5f17487744584e3652d3ca943b0b91b-brick_e15c12cceae12c8ab7782dd57cf5b6c1-brick.pid
271

Are you still seeing the error "Unable to read pidfile:" in glusterd log?

I also suspect that the brick may be missing its extended attributes. Are you seeing "brick is deemed not to be a part of the volume" error in glusterd log? If not, can you please provide us the output of "getfattr -m -d -e hex "

On Thu, Jan 24, 2019 at 12:18 PM Shaik Salam wrote:

Hi Sanju,

Could you please have a look at my issue if you have time (at least provide a workaround).
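The missing-xattr suspicion raised in this thread can be checked by comparing the brick's trusted.glusterfs.volume-id attribute with the "Volume ID" that gluster volume info reports. The following is only a minimal sketch, not something posted by the participants: the helper name xattr_to_uuid is hypothetical, and the two values are copied by hand from the getfattr and volume info output in this thread rather than read from a live brick.

```shell
#!/bin/sh
# Hypothetical helper: convert a hex xattr value, as printed by
# `getfattr -e hex`, into the hyphenated UUID form used by `gluster volume info`.
xattr_to_uuid() {
    # Strip the leading "0x", then group the 32 hex digits as 8-4-4-4-12.
    printf '%s\n' "${1#0x}" |
        sed -E 's/^(.{8})(.{4})(.{4})(.{4})(.{12})$/\1-\2-\3-\4-\5/'
}

# Values copied from this thread (assumed unchanged on the live system).
xattr='0x15477f3622e84757a0ce9000b63fa849'        # trusted.glusterfs.volume-id on the brick
volume_id='15477f36-22e8-4757-a0ce-9000b63fa849'  # "Volume ID" from gluster volume info

if [ "$(xattr_to_uuid "$xattr")" = "$volume_id" ]; then
    echo "volume-id xattr matches"
else
    echo "volume-id xattr MISMATCH"
fi
```

With the values shown in this thread the two agree, which is consistent with the brick still carrying its volume-id attribute; it says nothing about why the brick process itself stays offline.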
BR Salam From: Shaik Salam/HYD/TCS To: "Sanju Rakonde" Cc: "Amar Tumballi Suryanarayan" , " gluster-users at gluster.org List" , "Murali Kottakota" Date: 01/23/2019 05:50 PM Subject: Re: [Gluster-users] [Bugs] Bricks are going offline unable to recover with heal/start force commands Hi Sanju, Please find requested information. Sorry to repeat again I am trying start force command once brick log enabled to debug by taking one volume example. Please correct me If I am doing wrong. [root at master ~]# oc rsh glusterfs-storage-vll7x sh-4.2# gluster volume info vol_3442e86b6d994a14de73f1b8c82cf0b8 Volume Name: vol_3442e86b6d994a14de73f1b8c82cf0b8 Type: Replicate Volume ID: 15477f36-22e8-4757-a0ce-9000b63fa849 Status: Started Snapshot Count: 0 Number of Bricks: 1 x 3 = 3 Transport-type: tcp Bricks: Brick1: 192.168.3.6:/var/lib/heketi/mounts/vg_ca57f326195c243be2380ce4e42a4191/brick_952d75fd193c7209c9a81acbc23a3747/brick Brick2: 192.168.3.5:/var/lib/heketi/mounts/vg_d5f17487744584e3652d3ca943b0b91b/ brick_e15c12cceae12c8ab7782dd57cf5b6c1/brick Brick3: 192.168.3.15:/var/lib/heketi/mounts/vg_462ea199185376b03e4b0317363bb88c/brick_1736459d19e8aaa1dcb5a87f48747d04/brick Options Reconfigured: diagnostics.brick-log-level: INFO performance.client-io-threads: off nfs.disable: on transport.address-family: inet sh-4.2# gluster volume status vol_3442e86b6d994a14de73f1b8c82cf0b8 Status of volume: vol_3442e86b6d994a14de73f1b8c82cf0b8 Gluster process TCP Port RDMA Port Online Pid ------------------------------------------------------------------------------ Brick 192.168.3.6:/var/lib/heketi/mounts/vg _ca57f326195c243be2380ce4e42a4191/brick_952 d75fd193c7209c9a81acbc23a3747/brick 49157 0 Y 250 Brick 192.168.3.5:/var/lib/heketi/mounts/vg _d5f17487744584e3652d3ca943b0b91b/brick_e15 c12cceae12c8ab7782dd57cf5b6c1/brick N/A N/A N N/A Brick 192.168.3.15:/var/lib/heketi/mounts/v g_462ea199185376b03e4b0317363bb88c/brick_17 36459d19e8aaa1dcb5a87f48747d04/brick 49173 0 Y 225 Self-heal Daemon on 
localhost N/A N/A Y 108434 Self-heal Daemon on matrix1.matrix.orange.l ab N/A N/A Y 69525 Self-heal Daemon on matrix2.matrix.orange.l ab N/A N/A Y 18569 gluster volume set vol_3442e86b6d994a14de73f1b8c82cf0b8 diagnostics.brick-log-level DEBUG volume set: success sh-4.2# gluster volume get vol_3442e86b6d994a14de73f1b8c82cf0b8 all |grep log cluster.entry-change-log on cluster.data-change-log on cluster.metadata-change-log on diagnostics.brick-log-level DEBUG sh-4.2# cd /var/log/glusterfs/bricks/ sh-4.2# ls -la |grep brick_e15c12cceae12c8ab7782dd57cf5b6c1 -rw-------. 1 root root 0 Jan 20 02:46 var-lib-heketi-mounts-vg_d5f17487744584e3652d3ca943b0b91b-brick_e15c12cceae12c8ab7782dd57cf5b6c1-brick.log >>> Noting in log -rw-------. 1 root root 189057 Jan 18 09:20 var-lib-heketi-mounts-vg_d5f17487744584e3652d3ca943b0b91b-brick_e15c12cceae12c8ab7782dd57cf5b6c1-brick.log-20190120 [2019-01-23 11:49:32.475956] I [run.c:241:runner_log] (-->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2b3a) [0x7fca9e139b3a] -->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2605) [0x7fca9e139605] -->/lib64/libglusterfs.so.0(runner_log+0x115) [0x7fcaa346f0e5] ) 0-management: Ran script: /var/lib/glusterd/hooks/1/set/post/S30samba-set.sh --volname=vol_3442e86b6d994a14de73f1b8c82cf0b8 -o diagnostics.brick-log-level=DEBUG --gd-workdir=/var/lib/glusterd [2019-01-23 11:49:32.483191] I [run.c:241:runner_log] (-->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2b3a) [0x7fca9e139b3a] -->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2605) [0x7fca9e139605] -->/lib64/libglusterfs.so.0(runner_log+0x115) [0x7fcaa346f0e5] ) 0-management: Ran script: /var/lib/glusterd/hooks/1/set/post/S32gluster_enable_shared_storage.sh --volname=vol_3442e86b6d994a14de73f1b8c82cf0b8 -o diagnostics.brick-log-level=DEBUG --gd-workdir=/var/lib/glusterd [2019-01-23 11:48:59.111292] W [MSGID: 106036] [glusterd-snapshot.c:9514:glusterd_handle_snapshot_fn] 0-management: Snapshot list failed [2019-01-23 
11:50:14.112271] E [MSGID: 106026] [glusterd-snapshot.c:3962:glusterd_handle_snapshot_list] 0-management: Volume (vol_63854b105c40802bdec77290e91858ea) does not exist [Invalid argument] [2019-01-23 11:50:14.112305] W [MSGID: 106036] [glusterd-snapshot.c:9514:glusterd_handle_snapshot_fn] 0-management: Snapshot list failed [2019-01-23 11:50:20.322902] I [glusterd-utils.c:5994:glusterd_brick_start] 0-management: discovered already-running brick /var/lib/heketi/mounts/vg_d5f17487744584e3652d3ca943b0b91b/brick_e15c12cceae12c8ab7782dd57cf5b6c1/brick [2019-01-23 11:50:20.322925] I [MSGID: 106142] [glusterd-pmap.c:297:pmap_registry_bind] 0-pmap: adding brick /var/lib/heketi/mounts/vg_d5f17487744584e3652d3ca943b0b91b/brick_e15c12cceae12c8ab7782dd57cf5b6c1/brick on port 49165 [2019-01-23 11:50:20.327557] I [MSGID: 106131] [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: nfs already stopped [2019-01-23 11:50:20.327586] I [MSGID: 106568] [glusterd-svc-mgmt.c:235:glusterd_svc_stop] 0-management: nfs service is stopped [2019-01-23 11:50:20.327604] I [MSGID: 106599] [glusterd-nfs-svc.c:82:glusterd_nfssvc_manager] 0-management: nfs/server.so xlator is not installed [2019-01-23 11:50:20.337735] I [MSGID: 106568] [glusterd-proc-mgmt.c:87:glusterd_proc_stop] 0-management: Stopping glustershd daemon running in pid: 69525 [2019-01-23 11:50:21.338058] I [MSGID: 106568] [glusterd-svc-mgmt.c:235:glusterd_svc_stop] 0-management: glustershd service is stopped [2019-01-23 11:50:21.338180] I [MSGID: 106567] [glusterd-svc-mgmt.c:203:glusterd_svc_start] 0-management: Starting glustershd service [2019-01-23 11:50:21.348234] I [MSGID: 106131] [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: bitd already stopped [2019-01-23 11:50:21.348285] I [MSGID: 106568] [glusterd-svc-mgmt.c:235:glusterd_svc_stop] 0-management: bitd service is stopped [2019-01-23 11:50:21.348866] I [MSGID: 106131] [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: scrub already stopped [2019-01-23 
11:50:21.348883] I [MSGID: 106568] [glusterd-svc-mgmt.c:235:glusterd_svc_stop] 0-management: scrub service is stopped [2019-01-23 11:50:22.356502] I [run.c:241:runner_log] (-->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2b3a) [0x7fca9e139b3a] -->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2605) [0x7fca9e139605] -->/lib64/libglusterfs.so.0(runner_log+0x115) [0x7fcaa346f0e5] ) 0-management: Ran script: /var/lib/glusterd/hooks/1/start/post/S29CTDBsetup.sh --volname=vol_3442e86b6d994a14de73f1b8c82cf0b8 --first=no --version=1 --volume-op=start --gd-workdir=/var/lib/glusterd [2019-01-23 11:50:22.368845] E [run.c:241:runner_log] (-->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2b3a) [0x7fca9e139b3a] -->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2563) [0x7fca9e139563] -->/lib64/libglusterfs.so.0(runner_log+0x115) [0x7fcaa346f0e5] ) 0-management: Failed to execute script: /var/lib/glusterd/hooks/1/start/post/S30samba-start.sh --volname=vol_3442e86b6d994a14de73f1b8c82cf0b8 --first=no --version=1 --volume-op=start --gd-workdir=/var/lib/glusterd sh-4.2# gluster volume status vol_3442e86b6d994a14de73f1b8c82cf0b8 Status of volume: vol_3442e86b6d994a14de73f1b8c82cf0b8 Gluster process TCP Port RDMA Port Online Pid ------------------------------------------------------------------------------ Brick 192.168.3.6:/var/lib/heketi/mounts/vg _ca57f326195c243be2380ce4e42a4191/brick_952 d75fd193c7209c9a81acbc23a3747/brick 49157 0 Y 250 Brick 192.168.3.5:/var/lib/heketi/mounts/vg _d5f17487744584e3652d3ca943b0b91b/brick_e15 c12cceae12c8ab7782dd57cf5b6c1/brick N/A N/A N N/A Brick 192.168.3.15:/var/lib/heketi/mounts/v g_462ea199185376b03e4b0317363bb88c/brick_17 36459d19e8aaa1dcb5a87f48747d04/brick 49173 0 Y 225 Self-heal Daemon on localhost N/A N/A Y 109550 Self-heal Daemon on 192.168.3.6 N/A N/A Y 52557 Self-heal Daemon on 192.168.3.15 N/A N/A Y 16946 Task Status of Volume vol_3442e86b6d994a14de73f1b8c82cf0b8 
------------------------------------------------------------------------------ There are no active volume tasks From: "Sanju Rakonde" To: "Shaik Salam" Cc: "Amar Tumballi Suryanarayan" , " gluster-users at gluster.org List" , "Murali Kottakota" Date: 01/23/2019 02:15 PM Subject: Re: [Gluster-users] [Bugs] Bricks are going offline unable to recover with heal/start force commands "External email. Open with Caution" Hi Shaik, I can see below errors in glusterd logs. [2019-01-22 09:20:17.540196] E [MSGID: 101012] [common-utils.c:4010:gf_is_service_running] 0-: Unable to read pidfile: /var/run/gluster/vols/vol_e1aa1283d5917485d88c4a742eeff422/192.168.3.6-var-lib-heketi-mounts-vg_526f35058433c6b03130bba4e0a7dd87-brick_9e7c382e5f853d471c347bc5590359af-brick.pid [2019-01-22 09:20:17.546408] E [MSGID: 101012] [common-utils.c:4010:gf_is_service_running] 0-: Unable to read pidfile: /var/run/gluster/vols/vol_f0ed498d7e781d7bb896244175b31f9e/192.168.3.6-var-lib-heketi-mounts-vg_56391bec3c8bfe4fc116de7bddfc2af4-brick_47ed9e0663ad0f6f676ddd6ad7e3dcde-brick.pid [2019-01-22 09:20:17.552575] E [MSGID: 101012] [common-utils.c:4010:gf_is_service_running] 0-: Unable to read pidfile: /var/run/gluster/vols/vol_f387519c9b004ec14e80696db88ef0f8/192.168.3.6-var-lib-heketi-mounts-vg_56391bec3c8bfe4fc116de7bddfc2af4-brick_06ad6c73dfbf6a5fc21334f98c9973c2-brick.pid [2019-01-22 09:20:17.558888] E [MSGID: 101012] [common-utils.c:4010:gf_is_service_running] 0-: Unable to read pidfile: /var/run/gluster/vols/vol_f8ca343c60e6efe541fe02d16ca02a7d/192.168.3.6-var-lib-heketi-mounts-vg_526f35058433c6b03130bba4e0a7dd87-brick_525225f65753b05dfe33aeaeb9c5de39-brick.pid [2019-01-22 09:20:17.565266] E [MSGID: 101012] [common-utils.c:4010:gf_is_service_running] 0-: Unable to read pidfile: /var/run/gluster/vols/vol_fe882e074c0512fd9271fc2ff5a0bfe1/192.168.3.6-var-lib-heketi-mounts-vg_28708570b029e5eff0a996c453a11691-brick_d4f30d6e465a8544b759a7016fb5aab5-brick.pid [2019-01-22 09:20:17.585926] E [MSGID: 106028] 
[glusterd-utils.c:8222:glusterd_brick_signal] 0-glusterd: Unable to get pid of brick process
[2019-01-22 09:20:17.617806] E [MSGID: 106028] [glusterd-utils.c:8222:glusterd_brick_signal] 0-glusterd: Unable to get pid of brick process
[2019-01-22 09:20:17.649628] E [MSGID: 101012] [common-utils.c:4010:gf_is_service_running] 0-: Unable to read pidfile: /var/run/gluster/glustershd/glustershd.pid
[2019-01-22 09:20:17.649700] E [MSGID: 101012] [common-utils.c:4010:gf_is_service_running] 0-: Unable to read pidfile: /var/run/gluster/glustershd/glustershd.pid

So it looks like neither gf_is_service_running() nor glusterd_brick_signal() is able to read the pid file. That means the pidfiles might be empty. Can you please paste the contents of the brick pidfiles? You can find the brick pidfiles in /var/run/gluster/vols// or you can just run this command:

for i in `ls /var/run/gluster/vols/*/*.pid`;do echo $i;cat $i;done

On Wed, Jan 23, 2019 at 12:49 PM Shaik Salam wrote:

Hi Sanju,
Please find the requested information in the attached logs. The brick below is offline; I tried the start force/heal commands but it does not come up.
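The pidfile loop suggested above can be extended slightly so that it also flags empty pidfiles and reports whether the recorded process is still alive. This is a minimal sketch and was not posted in the thread. The function name report_pidfiles is hypothetical, and kill -0 only succeeds for a process the caller is allowed to signal, so the sketch assumes it is run as root inside the gluster pod.

```shell
#!/bin/sh
# Hypothetical helper: for every *.pid file in the given directory, print
# the file name and whether it is empty, points at a live process, or
# points at a process that no longer exists (or cannot be signalled).
report_pidfiles() {
    dir=$1
    for f in "$dir"/*.pid; do
        [ -e "$f" ] || continue              # glob matched nothing
        pid=$(tr -d '[:space:]' < "$f")      # strip newline and blanks
        if [ -z "$pid" ]; then
            echo "$f: empty"
        elif kill -0 "$pid" 2>/dev/null; then
            echo "$f: pid $pid running"
        else
            echo "$f: pid $pid not running"
        fi
    done
}

# One directory per volume, as in the command quoted above.
for d in /var/run/gluster/vols/*/; do
    report_pidfiles "$d"
done
```

An "empty" line would be consistent with the "Unable to read pidfile" errors, while a "not running" line would point at a stale pidfile left behind by a brick process that has exited.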
sh-4.2# sh-4.2# gluster --version glusterfs 4.1.5 sh-4.2# gluster volume status vol_3442e86b6d994a14de73f1b8c82cf0b8 Status of volume: vol_3442e86b6d994a14de73f1b8c82cf0b8 Gluster process TCP Port RDMA Port Online Pid ------------------------------------------------------------------------------ Brick 192.168.3.6:/var/lib/heketi/mounts/vg _ca57f326195c243be2380ce4e42a4191/brick_952 d75fd193c7209c9a81acbc23a3747/brick 49166 0 Y 269 Brick 192.168.3.5:/var/lib/heketi/mounts/vg _d5f17487744584e3652d3ca943b0b91b/brick_e15 c12cceae12c8ab7782dd57cf5b6c1/brick N/A N/A N N/A Brick 192.168.3.15:/var/lib/heketi/mounts/v g_462ea199185376b03e4b0317363bb88c/brick_17 36459d19e8aaa1dcb5a87f48747d04/brick 49173 0 Y 225 Self-heal Daemon on localhost N/A N/A Y 45826 Self-heal Daemon on 192.168.3.6 N/A N/A Y 65196 Self-heal Daemon on 192.168.3.15 N/A N/A Y 52915 Task Status of Volume vol_3442e86b6d994a14de73f1b8c82cf0b8 ------------------------------------------------------------------------------ We can see following events from when we start forcing volumes /mgmt/glusterd.so(+0xe2b3a) [0x7fca9e139b3a] -->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2605) [0x7fca9e139605] -->/lib64/libglusterfs.so.0(runner_log+0x115) [0x7fcaa346f0e5] ) 0-management: Ran script: /var/lib/glusterd/hooks/1/start/post/S29CTDBsetup.sh --volname=vol_3442e86b6d994a14de73f1b8c82cf0b8 --first=no --version=1 --volume-op=start --gd-workdir=/var/lib/glusterd [2019-01-21 08:22:34.555068] E [run.c:241:runner_log] (-->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2b3a) [0x7fca9e139b3a] -->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2563) [0x7fca9e139563] -->/lib64/libglusterfs.so.0(runner_log+0x115) [0x7fcaa346f0e5] ) 0-management: Failed to execute script: /var/lib/glusterd/hooks/1/start/post/S30samba-start.sh --volname=vol_3442e86b6d994a14de73f1b8c82cf0b8 --first=no --version=1 --volume-op=start --gd-workdir=/var/lib/glusterd [2019-01-21 08:22:53.389049] I [MSGID: 106499] 
[glusterd-handler.c:4314:__glusterd_handle_status_volume] 0-management: Received status volume req for volume vol_3442e86b6d994a14de73f1b8c82cf0b8 [2019-01-21 08:23:25.346839] I [MSGID: 106487] [glusterd-handler.c:1486:__glusterd_handle_cli_list_friends] 0-glusterd: Received cli list req We can see following events from when we heal volumes. [2019-01-21 08:20:07.576070] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-glusterfs: error returned while attempting to connect to host:(null), port:0 [2019-01-21 08:20:07.580225] I [cli-rpc-ops.c:9182:gf_cli_heal_volume_cbk] 0-cli: Received resp to heal volume [2019-01-21 08:20:07.580326] I [input.c:31:cli_batch] 0-: Exiting with: -1 [2019-01-21 08:22:30.423311] I [cli.c:768:main] 0-cli: Started running gluster with version 4.1.5 [2019-01-21 08:22:30.463648] I [MSGID: 101190] [event-epoll.c:617:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1 [2019-01-21 08:22:30.463718] I [socket.c:2632:socket_event_handler] 0-transport: EPOLLERR - disconnecting now [2019-01-21 08:22:30.463859] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-glusterfs: error returned while attempting to connect to host:(null), port:0 [2019-01-21 08:22:33.427710] I [socket.c:2632:socket_event_handler] 0-transport: EPOLLERR - disconnecting now [2019-01-21 08:22:34.581555] I [cli-rpc-ops.c:1472:gf_cli_start_volume_cbk] 0-cli: Received resp to start volume [2019-01-21 08:22:34.581678] I [input.c:31:cli_batch] 0-: Exiting with: 0 [2019-01-21 08:22:53.345351] I [cli.c:768:main] 0-cli: Started running gluster with version 4.1.5 [2019-01-21 08:22:53.387992] I [MSGID: 101190] [event-epoll.c:617:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1 [2019-01-21 08:22:53.388059] I [socket.c:2632:socket_event_handler] 0-transport: EPOLLERR - disconnecting now [2019-01-21 08:22:53.388138] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-glusterfs: error returned while attempting to connect to host:(null), port:0 [2019-01-21 08:22:53.394737] I [input.c:31:cli_batch] 0-: 
Exiting with: 0
[2019-01-21 08:23:25.304688] I [cli.c:768:main] 0-cli: Started running gluster with version 4.1.5
[2019-01-21 08:23:25.346319] I [MSGID: 101190] [event-epoll.c:617:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2019-01-21 08:23:25.346389] I [socket.c:2632:socket_event_handler] 0-transport: EPOLLERR - disconnecting now
[2019-01-21 08:23:25.346500] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-glusterfs: error returned while attempting to connect to host:(null), port:0

Enabled DEBUG mode at brick level, but nothing is being written to the brick log.

gluster volume set vol_3442e86b6d994a14de73f1b8c82cf0b8 diagnostics.brick-log-level DEBUG
sh-4.2# pwd
/var/log/glusterfs/bricks
sh-4.2# ls -la |grep brick_e15c12cceae12c8ab7782dd57cf5b6c1
-rw-------. 1 root root 0 Jan 20 02:46 var-lib-heketi-mounts-vg_d5f17487744584e3652d3ca943b0b91b-brick_e15c12cceae12c8ab7782dd57cf5b6c1-brick.log

From: Sanju Rakonde
To: Shaik Salam
Cc: Amar Tumballi Suryanarayan, "gluster-users at gluster.org List"
Date: 01/22/2019 02:21 PM
Subject: Re: [Gluster-users] [Bugs] Bricks are going offline unable to recover with heal/start force commands

Hi Shaik,

Can you please provide us the complete glusterd and cmd_history logs from all the nodes in the cluster? Also please paste the output of the following commands (from all nodes):
1. gluster --version
2. gluster volume info
3. gluster volume status
4. gluster peer status
5. ps -ax | grep glusterfsd

On Tue, Jan 22, 2019 at 12:47 PM Shaik Salam wrote:

Hi Surya,
This is already a customer setup and we cannot redeploy it. I enabled debug for the brick-level log but nothing is being written to it. Can you tell me whether there are any other ways to troubleshoot, or other logs to look at?

From: Shaik Salam/HYD/TCS
To: "Amar Tumballi Suryanarayan"
Cc: "gluster-users at gluster.org List"
Date: 01/22/2019 12:06 PM
Subject: Re: [Bugs] Bricks are going offline unable to recover with heal/start force commands

Hi Surya,
I have enabled DEBUG mode at brick level, but nothing is being written to the brick log.

gluster volume set vol_3442e86b6d994a14de73f1b8c82cf0b8 diagnostics.brick-log-level DEBUG
sh-4.2# pwd
/var/log/glusterfs/bricks
sh-4.2# ls -la |grep brick_e15c12cceae12c8ab7782dd57cf5b6c1
-rw-------. 1 root root 0 Jan 20 02:46 var-lib-heketi-mounts-vg_d5f17487744584e3652d3ca943b0b91b-brick_e15c12cceae12c8ab7782dd57cf5b6c1-brick.log

BR
Salam

From: "Amar Tumballi Suryanarayan"
To: "Shaik Salam"
Cc: "gluster-users at gluster.org List"
Date: 01/22/2019 11:38 AM
Subject: Re: [Bugs] Bricks are going offline unable to recover with heal/start force commands

Hi Shaik,

Can you check what is in the brick logs? They are located in /var/log/glusterfs/bricks/*.

It looks like the samba hook script failed, but that shouldn't matter in this use case.

Also, I see that you are trying to set up heketi to provision volumes, which means you may be using gluster in container use cases. If you are still in the 'PoC' phase, can you give https://github.com/gluster/gcs a try? That makes the deployment and the stack a little simpler.

-Amar

On Tue, Jan 22, 2019 at 11:29 AM Shaik Salam wrote:

Can anyone advise how to recover the bricks, other than with heal/start force, based on the log events below? Please let me know if any other logs are required.
Thanks in advance.

BR
Salam

From: Shaik Salam/HYD/TCS
To: bugs at gluster.org, gluster-users at gluster.org
Date: 01/21/2019 10:03 PM
Subject: Bricks are going offline unable to recover with heal/start force commands

Hi,

Bricks are offline and cannot be recovered with the following commands:

gluster volume heal
gluster volume start force

The bricks are still offline.
sh-4.2# gluster volume status vol_3442e86b6d994a14de73f1b8c82cf0b8 Status of volume: vol_3442e86b6d994a14de73f1b8c82cf0b8 Gluster process TCP Port RDMA Port Online Pid ------------------------------------------------------------------------------ Brick 192.168.3.6:/var/lib/heketi/mounts/vg _ca57f326195c243be2380ce4e42a4191/brick_952 d75fd193c7209c9a81acbc23a3747/brick 49166 0 Y 269 Brick 192.168.3.5:/var/lib/heketi/mounts/vg _d5f17487744584e3652d3ca943b0b91b/brick_e15 c12cceae12c8ab7782dd57cf5b6c1/brick N/A N/A N N/A Brick 192.168.3.15:/var/lib/heketi/mounts/v g_462ea199185376b03e4b0317363bb88c/brick_17 36459d19e8aaa1dcb5a87f48747d04/brick 49173 0 Y 225 Self-heal Daemon on localhost N/A N/A Y 45826 Self-heal Daemon on 192.168.3.6 N/A N/A Y 65196 Self-heal Daemon on 192.168.3.15 N/A N/A Y 52915 Task Status of Volume vol_3442e86b6d994a14de73f1b8c82cf0b8 ------------------------------------------------------------------------------ We can see following events from when we start forcing volumes /mgmt/glusterd.so(+0xe2b3a) [0x7fca9e139b3a] -->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2605) [0x7fca9e139605] -->/lib64/libglusterfs.so.0(runner_log+0x115) [0x7fcaa346f0e5] ) 0-management: Ran script: /var/lib/glusterd/hooks/1/start/post/S29CTDBsetup.sh --volname=vol_3442e86b6d994a14de73f1b8c82cf0b8 --first=no --version=1 --volume-op=start --gd-workdir=/var/lib/glusterd [2019-01-21 08:22:34.555068] E [run.c:241:runner_log] (-->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2b3a) [0x7fca9e139b3a] -->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2563) [0x7fca9e139563] -->/lib64/libglusterfs.so.0(runner_log+0x115) [0x7fcaa346f0e5] ) 0-management: Failed to execute script: /var/lib/glusterd/hooks/1/start/post/S30samba-start.sh --volname=vol_3442e86b6d994a14de73f1b8c82cf0b8 --first=no --version=1 --volume-op=start --gd-workdir=/var/lib/glusterd [2019-01-21 08:22:53.389049] I [MSGID: 106499] [glusterd-handler.c:4314:__glusterd_handle_status_volume] 
0-management: Received status volume req for volume vol_3442e86b6d994a14de73f1b8c82cf0b8 [2019-01-21 08:23:25.346839] I [MSGID: 106487] [glusterd-handler.c:1486:__glusterd_handle_cli_list_friends] 0-glusterd: Received cli list req We can see following events from when we heal volumes. [2019-01-21 08:20:07.576070] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-glusterfs: error returned while attempting to connect to host:(null), port:0 [2019-01-21 08:20:07.580225] I [cli-rpc-ops.c:9182:gf_cli_heal_volume_cbk] 0-cli: Received resp to heal volume [2019-01-21 08:20:07.580326] I [input.c:31:cli_batch] 0-: Exiting with: -1 [2019-01-21 08:22:30.423311] I [cli.c:768:main] 0-cli: Started running gluster with version 4.1.5 [2019-01-21 08:22:30.463648] I [MSGID: 101190] [event-epoll.c:617:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1 [2019-01-21 08:22:30.463718] I [socket.c:2632:socket_event_handler] 0-transport: EPOLLERR - disconnecting now [2019-01-21 08:22:30.463859] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-glusterfs: error returned while attempting to connect to host:(null), port:0 [2019-01-21 08:22:33.427710] I [socket.c:2632:socket_event_handler] 0-transport: EPOLLERR - disconnecting now [2019-01-21 08:22:34.581555] I [cli-rpc-ops.c:1472:gf_cli_start_volume_cbk] 0-cli: Received resp to start volume [2019-01-21 08:22:34.581678] I [input.c:31:cli_batch] 0-: Exiting with: 0 [2019-01-21 08:22:53.345351] I [cli.c:768:main] 0-cli: Started running gluster with version 4.1.5 [2019-01-21 08:22:53.387992] I [MSGID: 101190] [event-epoll.c:617:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1 [2019-01-21 08:22:53.388059] I [socket.c:2632:socket_event_handler] 0-transport: EPOLLERR - disconnecting now [2019-01-21 08:22:53.388138] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-glusterfs: error returned while attempting to connect to host:(null), port:0 [2019-01-21 08:22:53.394737] I [input.c:31:cli_batch] 0-: Exiting with: 0 [2019-01-21 08:23:25.304688] I 
[cli.c:768:main] 0-cli: Started running gluster with version 4.1.5
[2019-01-21 08:23:25.346319] I [MSGID: 101190] [event-epoll.c:617:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2019-01-21 08:23:25.346389] I [socket.c:2632:socket_event_handler] 0-transport: EPOLLERR - disconnecting now
[2019-01-21 08:23:25.346500] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-glusterfs: error returned while attempting to connect to host:(null), port:0

Please let us know the steps to recover the bricks.

BR
Salam

--
Amar Tumballi (amarts)

--
Thanks,
Sanju

-------------- next part --------------
A non-text attachment was scrubbed...
Name: firstnode.log
Type: application/octet-stream
Size: 294510 bytes
-------------- next part --------------
A non-text attachment was scrubbed...
Name: secondnode.log
Type: application/octet-stream
Size: 1260140 bytes
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Thirdnode.log
Type: application/octet-stream
Size: 295999 bytes
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: volume-level-info.log.txt

From hunter86_bg at yahoo.com Thu Jan 24 10:32:48 2019
From: hunter86_bg at yahoo.com (Strahil Nikolov)
Date: Thu, 24 Jan 2019 10:32:48 +0000 (UTC)
Subject: [Gluster-users] Re: Gluster performance issues - need advise
In-Reply-To: <54e562bd-3465-4c69-8fda-8060a52e9c22@email.android.com>
References: <54e562bd-3465-4c69-8fda-8060a52e9c22@email.android.com>
Message-ID: <1172976507.352481.1548325968837@mail.yahoo.com>

Dear Amar, Community,

it seems the issue is in the fuse client itself.

Here is the latest update:

1. I have added the following:
server.event-threads: 4
client.event-threads: 4
performance.stat-prefetch: on
performance.strict-o-direct: off
Results: no change

2. Allowed nfs and connected ovirt1 to the gluster volume:
nfs.disable: off
Results: Drastic improvement in performance as follows:
[root at ovirt1 data]# dd if=/dev/zero of=largeio bs=1M count=5000 status=progress
5000+0 records in
5000+0 records out
5242880000 bytes (5.2 GB) copied, 53.0443 s, 98.8 MB/s

So I would be happy if anyone could guide me in fixing the situation, as the fuse client is the best way to use glusterfs, and it seems the glusterfs-server is not the guilty one.

Thanks in advance for your guidance. I have learned so much.

Best Regards,
Strahil Nikolov

From: Strahil
To: Amar Tumballi Suryanarayan
Cc: Gluster-users
Sent: Wednesday, 23 January 2019, 18:44
Subject: Re: [Gluster-users] Gluster performance issues - need advise

Dear Amar,

Thanks for your email. Actually my concerns were on both topics. Would you recommend any perf options that will be suitable?
After you mentioned the network usage, I just checked it, and it seems that during the test session ovirt1 (both client and host) is using no more than 455Mbit/s, which is half the network bandwidth.

I'm still in the middle of nowhere, so any ideas are welcome.

Best Regards,
Strahil Nikolov

On Jan 23, 2019 17:49, Amar Tumballi Suryanarayan wrote:

I didn't understand the issue properly. Mostly I missed something.

Are you concerned that the performance is 49MB/s with and without perf options? Or are you expecting it to be 123MB/s, since that is the speed you get over the n/w?

If it is the first problem, then you actually have 'performance.write-behind on' in both configurations, and it is the only perf xlator which comes into action during the test you ran.

If it is the second, then please be informed that gluster does client-side replication, which means the n/w would be split in half for write operations (like write(), creat() etc.), so the number you are getting is almost the maximum with 1GbE.

Regards,
Amar

On Wed, Jan 23, 2019 at 8:38 PM Strahil Nikolov wrote:

Hello Community,

recently I have built a new lab based on oVirt and CentOS 7. During deployment I had some hiccups, but now the engine is up and running - but gluster is causing me trouble.

Symptoms: Slow VM install from DVD, poor write performance. The latter has been tested via:
dd if=/dev/zero of=/rhev/data-center/mnt/glusterSD/ovirt1.localdomain:_data bs=1M count=1000 status=progress

The reported speed is 60MB/s, which is way too low for my setup.

My lab design:
https://drive.google.com/file/d/1SiW21ASPXHRAEuE_jZ50R3FoO-NcnFqT/view?usp=sharing
Gluster version is 3.12.15

So far I have done:
1.
Added 'server.allow-insecure on' (with 'option rpc-auth-allow-insecure on' in glusterd.vol)

Volume info after that change:

Volume Name: data
Type: Replicate
Volume ID: 9b06a1e9-8102-4cd7-bc56-84960a1efaa2
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp
Bricks:
Brick1: ovirt1.localdomain:/gluster_bricks/data/data
Brick2: ovirt2.localdomain:/gluster_bricks/data/data
Brick3: ovirt3.localdomain:/gluster_bricks/data/data (arbiter)
Options Reconfigured:
performance.client-io-threads: off
nfs.disable: on
transport.address-family: inet
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.low-prio-threads: 32
network.remote-dio: off
cluster.eager-lock: enable
cluster.quorum-type: auto
cluster.server-quorum-type: server
cluster.data-self-heal-algorithm: full
cluster.locking-scheme: granular
cluster.shd-max-threads: 8
cluster.shd-wait-qlength: 10000
features.shard: on
user.cifs: off
storage.owner-uid: 36
storage.owner-gid: 36
network.ping-timeout: 30
performance.strict-o-direct: on
cluster.granular-entry-heal: enable
server.allow-insecure: on

Seems no positive or negative effect so far.

2. Tested with tmpfs on all bricks -> ovirt1 mounted gluster volume -> max 60MB/s (bs=1M without 'oflag=direct')

[root at ovirt1 data]# dd if=/dev/zero of=large_io bs=1M count=4000 status=progress
4177526784 bytes (4.2 GB) copied, 70.843409 s, 59.0 MB/s
4000+0 records in
4000+0 records out
4194304000 bytes (4.2 GB) copied, 71.1407 s, 59.0 MB/s
[root at ovirt1 data]# rm -f large_io
[root at ovirt1 data]# gluster volume profile data info
Brick: ovirt1.localdomain:/gluster_bricks/data/data
---------------------------------------------------
Cumulative Stats:
   Block Size:             131072b+
 No. of Reads:                    8
No. of Writes:                44968
 %-latency   Avg-latency   Min-Latency   Max-Latency   No. of calls         Fop
 ---------   -----------   -----------   -----------   ------------        ----
      0.00       0.00 us       0.00 us       0.00 us              3      FORGET
      0.00       0.00 us       0.00 us       0.00 us             35     RELEASE
      0.00       0.00 us       0.00 us       0.00 us             28  RELEASEDIR
      0.00      78.00 us      78.00 us      78.00 us              1       FSTAT
      0.00      35.67 us      26.00 us      73.00 us              6       FLUSH
      0.00     324.00 us     324.00 us     324.00 us              1     XATTROP
      0.00      45.80 us      38.00 us      54.00 us             10        STAT
      0.00     227.67 us     216.00 us     242.00 us              3      CREATE
      0.00     113.38 us      68.00 us     381.00 us              8        READ
      0.00      39.82 us       1.00 us     148.00 us             28     OPENDIR
      0.00      67.54 us      10.00 us     283.00 us             24    GETXATTR
      0.00      59.97 us      45.00 us     113.00 us             32        OPEN
      0.00      24.41 us      13.00 us      89.00 us            161     INODELK
      0.00      43.43 us      28.00 us     214.00 us             93      STATFS
      0.00     246.35 us      11.00 us    1155.00 us             20     READDIR
      0.00     283.00 us     233.00 us     353.00 us             18    READDIRP
      0.00     153.23 us     122.00 us     259.00 us             87       MKNOD
      0.01      99.77 us      10.00 us     258.00 us            442      LOOKUP
      0.31      49.22 us      27.00 us     540.00 us          45620    FXATTROP
      0.77     124.24 us      87.00 us     604.00 us          44968       WRITE
      0.93   15767.71 us      15.00 us  305833.00 us            431     ENTRYLK
      1.99  160711.39 us    3332.00 us  406037.00 us             90      UNLINK
     96.00    5167.82 us      18.00 us   55972.00 us         135349    FINODELK

    Duration: 380 seconds
   Data Read: 1048576 bytes
Data Written: 5894045696 bytes

Interval 0 Stats:
   Block Size:             131072b+
 No. of Reads:                    8
No. of Writes:                44968
 %-latency   Avg-latency   Min-Latency   Max-Latency   No. of calls         Fop
 ---------   -----------   -----------   -----------   ------------        ----
      0.00       0.00 us       0.00 us       0.00 us              3      FORGET
      0.00       0.00 us       0.00 us       0.00 us             35     RELEASE
      0.00       0.00 us       0.00 us       0.00 us             28  RELEASEDIR
      0.00      78.00 us      78.00 us      78.00 us              1       FSTAT
      0.00      35.67 us      26.00 us      73.00 us              6       FLUSH
      0.00     324.00 us     324.00 us     324.00 us              1     XATTROP
      0.00      45.80 us      38.00 us      54.00 us             10        STAT
      0.00     227.67 us     216.00 us     242.00 us              3      CREATE
      0.00     113.38 us      68.00 us     381.00 us              8        READ
      0.00      39.82 us       1.00 us     148.00 us             28     OPENDIR
      0.00      67.54 us      10.00 us     283.00 us             24    GETXATTR
      0.00      59.97 us      45.00 us     113.00 us             32        OPEN
      0.00      24.41 us      13.00 us      89.00 us            161     INODELK
      0.00      43.43 us      28.00 us     214.00 us             93      STATFS
      0.00     246.35 us      11.00 us    1155.00 us             20     READDIR
      0.00     283.00 us     233.00 us     353.00 us             18    READDIRP
      0.00     153.23 us     122.00 us     259.00 us             87       MKNOD
      0.01      99.77 us      10.00 us     258.00 us            442      LOOKUP
      0.31      49.22 us      27.00 us     540.00 us          45620    FXATTROP
      0.77     124.24 us      87.00 us     604.00 us          44968       WRITE
      0.93   15767.71 us      15.00 us  305833.00 us            431     ENTRYLK
      1.99  160711.39 us    3332.00 us  406037.00 us             90      UNLINK
     96.00    5167.82 us      18.00 us   55972.00 us         135349    FINODELK

    Duration: 380 seconds
   Data Read: 1048576 bytes
Data Written: 5894045696 bytes

Brick: ovirt3.localdomain:/gluster_bricks/data/data
---------------------------------------------------
Cumulative Stats:
   Block Size:                  1b+
 No. of Reads:                    0
No. of Writes:                39328
 %-latency   Avg-latency   Min-Latency   Max-Latency   No.
of calls? ? ? ? Fop ---------? -----------? -----------? -----------? ------------? ? ? ? ---- 0.00? ? ? 0.00 us? ? ? 0.00 us? ? ? 0.00 us? ? ? ? ? ? ? 2? ? ? FORGET 0.00? ? ? 0.00 us? ? ? 0.00 us? ? ? 0.00 us? ? ? ? ? ? 12? ? RELEASE 0.00? ? ? 0.00 us? ? ? 0.00 us? ? ? 0.00 us? ? ? ? ? ? 17? RELEASEDIR 0.00? ? 101.00 us? ? 101.00 us? ? 101.00 us? ? ? ? ? ? ? 1? ? ? FSTAT 0.00? ? ? 51.50 us? ? ? 20.00 us? ? ? 81.00 us? ? ? ? ? ? ? 4? ? ? FLUSH 0.01? ? 219.50 us? ? 188.00 us? ? 251.00 us? ? ? ? ? ? ? 2? ? ? CREATE 0.01? ? ? 43.45 us? ? ? 11.00 us? ? ? 90.00 us? ? ? ? ? ? 11? ? GETXATTR 0.01? ? ? 62.30 us? ? ? 38.00 us? ? 119.00 us? ? ? ? ? ? 10? ? ? ? OPEN 0.01? ? ? 50.59 us? ? ? 1.00 us? ? 102.00 us? ? ? ? ? ? 17? ? OPENDIR 0.01? ? ? 24.60 us? ? ? 12.00 us? ? ? 64.00 us? ? ? ? ? ? 40? ? INODELK 0.02? ? 176.30 us? ? ? 10.00 us? ? 765.00 us? ? ? ? ? ? 10? ? READDIR 0.07? ? ? 63.08 us? ? ? 39.00 us? ? 133.00 us? ? ? ? ? ? 78? ? ? UNLINK 0.13? ? ? 27.35 us? ? ? 10.00 us? ? ? 91.00 us? ? ? ? ? ? 333? ? ENTRYLK 0.13? ? 126.89 us? ? ? 99.00 us? ? 179.00 us? ? ? ? ? ? 76? ? ? MKNOD 0.42? ? 116.70 us? ? ? 8.00 us? ? 8661.00 us? ? ? ? ? ? 261? ? ? LOOKUP 28.73? ? ? 51.79 us? ? ? 22.00 us? ? 2574.00 us? ? ? ? ? 39822? ? FXATTROP 29.52? ? ? 53.87 us? ? ? 16.00 us? ? 3290.00 us? ? ? ? ? 39328? ? ? WRITE 40.92? ? ? 24.71 us? ? ? 10.00 us? ? 3224.00 us? ? ? ? 118864? ? FINODELK Duration: 189 seconds Data Read: 0 bytes Data Written: 39328 bytes Interval 0 Stats: Block Size:? ? ? ? ? ? ? ? ? 1b+ No. of Reads:? ? ? ? ? ? ? ? ? ? 0 No. of Writes:? ? ? ? ? ? ? ? 39328 %-latency? Avg-latency? Min-Latency? Max-Latency? No. of calls? ? ? ? Fop ---------? -----------? -----------? -----------? ------------? ? ? ? ---- 0.00? ? ? 0.00 us? ? ? 0.00 us? ? ? 0.00 us? ? ? ? ? ? ? 2? ? ? FORGET 0.00? ? ? 0.00 us? ? ? 0.00 us? ? ? 0.00 us? ? ? ? ? ? 12? ? RELEASE 0.00? ? ? 0.00 us? ? ? 0.00 us? ? ? 0.00 us? ? ? ? ? ? 17? RELEASEDIR 0.00? ? 101.00 us? ? 101.00 us? ? 101.00 us? ? ? ? ? ? ? 1? ? ? 
FSTAT 0.00? ? ? 51.50 us? ? ? 20.00 us? ? ? 81.00 us? ? ? ? ? ? ? 4? ? ? FLUSH 0.01? ? 219.50 us? ? 188.00 us? ? 251.00 us? ? ? ? ? ? ? 2? ? ? CREATE 0.01? ? ? 43.45 us? ? ? 11.00 us? ? ? 90.00 us? ? ? ? ? ? 11? ? GETXATTR 0.01? ? ? 62.30 us? ? ? 38.00 us? ? 119.00 us? ? ? ? ? ? 10? ? ? ? OPEN 0.01? ? ? 50.59 us? ? ? 1.00 us? ? 102.00 us? ? ? ? ? ? 17? ? OPENDIR 0.01? ? ? 24.60 us? ? ? 12.00 us? ? ? 64.00 us? ? ? ? ? ? 40? ? INODELK 0.02? ? 176.30 us? ? ? 10.00 us? ? 765.00 us? ? ? ? ? ? 10? ? READDIR 0.07? ? ? 63.08 us? ? ? 39.00 us? ? 133.00 us? ? ? ? ? ? 78? ? ? UNLINK 0.13? ? ? 27.35 us? ? ? 10.00 us? ? ? 91.00 us? ? ? ? ? ? 333? ? ENTRYLK 0.13? ? 126.89 us? ? ? 99.00 us? ? 179.00 us? ? ? ? ? ? 76? ? ? MKNOD 0.42? ? 116.70 us? ? ? 8.00 us? ? 8661.00 us? ? ? ? ? ? 261? ? ? LOOKUP 28.73? ? ? 51.79 us? ? ? 22.00 us? ? 2574.00 us? ? ? ? ? 39822? ? FXATTROP 29.52? ? ? 53.87 us? ? ? 16.00 us? ? 3290.00 us? ? ? ? ? 39328? ? ? WRITE 40.92? ? ? 24.71 us? ? ? 10.00 us? ? 3224.00 us? ? ? ? 118864? ? FINODELK Duration: 189 seconds Data Read: 0 bytes Data Written: 39328 bytes Brick: ovirt2.localdomain:/gluster_bricks/data/data --------------------------------------------------- Cumulative Stats: Block Size:? ? ? ? ? ? ? ? 512b+? ? ? ? ? ? ? 131072b+ No. of Reads:? ? ? ? ? ? ? ? ? ? 0? ? ? ? ? ? ? ? ? ? 0 No. of Writes:? ? ? ? ? ? ? ? ? 36? ? ? ? ? ? ? ? 76758 %-latency? Avg-latency? Min-Latency? Max-Latency? No. of calls? ? ? ? Fop ---------? -----------? -----------? -----------? ------------? ? ? ? ---- 0.00? ? ? 0.00 us? ? ? 0.00 us? ? ? 0.00 us? ? ? ? ? ? ? 6? ? ? FORGET 0.00? ? ? 0.00 us? ? ? 0.00 us? ? ? 0.00 us? ? ? ? ? ? 87? ? RELEASE 0.00? ? ? 0.00 us? ? ? 0.00 us? ? ? 0.00 us? ? ? ? ? ? 96? RELEASEDIR 0.00? ? 100.50 us? ? ? 80.00 us? ? 121.00 us? ? ? ? ? ? ? 2 REMOVEXATTR 0.00? ? 101.00 us? ? 101.00 us? ? 101.00 us? ? ? ? ? ? ? 2? ? SETXATTR 0.00? ? ? 36.18 us? ? ? 22.00 us? ? ? 62.00 us? ? ? ? ? ? 11? ? ? FLUSH 0.00? ? ? 57.44 us? ? ? 42.00 us? ? ? 77.00 us? ? ? 
? ? ? ? 9? FTRUNCATE 0.00? ? ? 82.56 us? ? ? 59.00 us? ? 138.00 us? ? ? ? ? ? ? 9? ? ? FSTAT 0.00? ? ? 89.42 us? ? ? 67.00 us? ? 161.00 us? ? ? ? ? ? 12? ? SETATTR 0.00? ? 272.40 us? ? 235.00 us? ? 296.00 us? ? ? ? ? ? ? 5? ? ? CREATE 0.01? ? 154.28 us? ? ? 88.00 us? ? 320.00 us? ? ? ? ? ? 18? ? XATTROP 0.01? ? ? 45.29 us? ? ? 1.00 us? ? 319.00 us? ? ? ? ? ? 96? ? OPENDIR 0.01? ? ? 86.69 us? ? ? 30.00 us? ? 379.00 us? ? ? ? ? ? 62? ? ? ? STAT 0.01? ? ? 64.30 us? ? ? 47.00 us? ? 169.00 us? ? ? ? ? ? 84? ? ? ? OPEN 0.02? ? 107.34 us? ? ? 23.00 us? ? 273.00 us? ? ? ? ? ? 73? ? READDIRP 0.02? ? 4688.00 us? ? ? 86.00 us? ? 9290.00 us? ? ? ? ? ? ? 2? ? TRUNCATE 0.02? ? ? 59.29 us? ? ? 13.00 us? ? 394.00 us? ? ? ? ? ? 165? ? GETXATTR 0.03? ? 128.51 us? ? ? 27.00 us? ? 338.00 us? ? ? ? ? ? 96? ? ? FSYNC 0.03? ? 240.75 us? ? ? 14.00 us? ? 1943.00 us? ? ? ? ? ? 52? ? READDIR 0.04? ? ? 65.59 us? ? ? 26.00 us? ? 293.00 us? ? ? ? ? ? 279? ? ? STATFS 0.06? ? 180.77 us? ? 118.00 us? ? 306.00 us? ? ? ? ? ? 148? ? ? MKNOD 0.14? ? ? 37.98 us? ? ? 17.00 us? ? 192.00 us? ? ? ? ? 1598? ? INODELK 0.67? ? ? 91.68 us? ? ? 12.00 us? ? 1141.00 us? ? ? ? ? 3186? ? ? LOOKUP 10.10? ? ? 55.92 us? ? ? 28.00 us? ? 1658.00 us? ? ? ? ? 78608? ? FXATTROP 11.89? ? 6814.76 us? ? ? 18.00 us? 301246.00 us? ? ? ? ? ? 760? ? ENTRYLK 19.44? ? ? 36.55 us? ? ? 14.00 us? ? 2353.00 us? ? ? ? 231535? ? FINODELK 25.21? ? 142.92 us? ? ? 62.00 us? ? 593.00 us? ? ? ? ? 76794? ? ? WRITE 32.28? 91283.68 us? ? ? 28.00 us? 316658.00 us? ? ? ? ? ? 154? ? ? UNLINK Duration: 1206 seconds Data Read: 0 bytes Data Written: 10060843008 bytes Interval 0 Stats: Block Size:? ? ? ? ? ? ? ? 512b+? ? ? ? ? ? ? 131072b+ No. of Reads:? ? ? ? ? ? ? ? ? ? 0? ? ? ? ? ? ? ? ? ? 0 No. of Writes:? ? ? ? ? ? ? ? ? 36? ? ? ? ? ? ? ? 76758 %-latency? Avg-latency? Min-Latency? Max-Latency? No. of calls? ? ? ? Fop ---------? -----------? -----------? -----------? ------------? ? ? ? ---- 0.00? ? ? 0.00 us? ? ? 0.00 us? ? ? 0.00 us? ? ? ? ? ? ? 
6? ? ? FORGET 0.00? ? ? 0.00 us? ? ? 0.00 us? ? ? 0.00 us? ? ? ? ? ? 87? ? RELEASE 0.00? ? ? 0.00 us? ? ? 0.00 us? ? ? 0.00 us? ? ? ? ? ? 96? RELEASEDIR 0.00? ? 100.50 us? ? ? 80.00 us? ? 121.00 us? ? ? ? ? ? ? 2 REMOVEXATTR 0.00? ? 101.00 us? ? 101.00 us? ? 101.00 us? ? ? ? ? ? ? 2? ? SETXATTR 0.00? ? ? 36.18 us? ? ? 22.00 us? ? ? 62.00 us? ? ? ? ? ? 11? ? ? FLUSH 0.00? ? ? 57.44 us? ? ? 42.00 us? ? ? 77.00 us? ? ? ? ? ? ? 9? FTRUNCATE 0.00? ? ? 82.56 us? ? ? 59.00 us? ? 138.00 us? ? ? ? ? ? ? 9? ? ? FSTAT 0.00? ? ? 89.42 us? ? ? 67.00 us? ? 161.00 us? ? ? ? ? ? 12? ? SETATTR 0.00? ? 272.40 us? ? 235.00 us? ? 296.00 us? ? ? ? ? ? ? 5? ? ? CREATE 0.01? ? 154.28 us? ? ? 88.00 us? ? 320.00 us? ? ? ? ? ? 18? ? XATTROP 0.01? ? ? 45.29 us? ? ? 1.00 us? ? 319.00 us? ? ? ? ? ? 96? ? OPENDIR 0.01? ? ? 86.69 us? ? ? 30.00 us? ? 379.00 us? ? ? ? ? ? 62? ? ? ? STAT 0.01? ? ? 64.30 us? ? ? 47.00 us? ? 169.00 us? ? ? ? ? ? 84? ? ? ? OPEN 0.02? ? 107.34 us? ? ? 23.00 us? ? 273.00 us? ? ? ? ? ? 73? ? READDIRP 0.02? ? 4688.00 us? ? ? 86.00 us? ? 9290.00 us? ? ? ? ? ? ? 2? ? TRUNCATE 0.02? ? ? 59.29 us? ? ? 13.00 us? ? 394.00 us? ? ? ? ? ? 165? ? GETXATTR 0.03? ? 128.51 us? ? ? 27.00 us? ? 338.00 us? ? ? ? ? ? 96? ? ? FSYNC 0.03? ? 240.75 us? ? ? 14.00 us? ? 1943.00 us? ? ? ? ? ? 52? ? READDIR 0.04? ? ? 65.59 us? ? ? 26.00 us? ? 293.00 us? ? ? ? ? ? 279? ? ? STATFS 0.06? ? 180.77 us? ? 118.00 us? ? 306.00 us? ? ? ? ? ? 148? ? ? MKNOD 0.14? ? ? 37.98 us? ? ? 17.00 us? ? 192.00 us? ? ? ? ? 1598? ? INODELK 0.67? ? ? 91.66 us? ? ? 12.00 us? ? 1141.00 us? ? ? ? ? 3186? ? ? LOOKUP 10.10? ? ? 55.92 us? ? ? 28.00 us? ? 1658.00 us? ? ? ? ? 78608? ? FXATTROP 11.89? ? 6814.76 us? ? ? 18.00 us? 301246.00 us? ? ? ? ? ? 760? ? ENTRYLK 19.44? ? ? 36.55 us? ? ? 14.00 us? ? 2353.00 us? ? ? ? 231535? ? FINODELK 25.21? ? 142.92 us? ? ? 62.00 us? ? 593.00 us? ? ? ? ? 76794? ? ? WRITE 32.28? 91283.68 us? ? ? 28.00 us? 316658.00 us? ? ? ? ? ? 154? ? ? 
UNLINK Duration: 1206 seconds Data Read: 0 bytes Data Written: 10060843008 bytes This indicates to me that it's not a problem in Disk/LVM/FileSystem layout. Most probably I haven't created the volume properly or some option/feature is disabled ?!? Network shows OK for a gigabit: [root at ovirt1 data]# dd if=/dev/zero status=progress | nc ovirt2 9999 3569227264 bytes (3.6 GB) copied, 29.001052 s, 123 MB/s^C 7180980+0 records in 7180979+0 records out 3676661248 bytes (3.7 GB) copied, 29.8739 s, 123 MB/s I'm looking for any help... you can share your volume info also. Thanks in advance. Best Regards, Strahil Nikolov _______________________________________________ Gluster-users mailing list Gluster-users at gluster.org https://lists.gluster.org/mailman/listinfo/gluster-users -- Amar Tumballi (amarts) Dear Amar, Thanks for your email. Actually my concerns were on both topics. Would you recommend any perf options that will be suitable ? After mentioning the network usage, I just checked it and it seems duringthe test session, ovirt1 (both client and host) is using no more than 455Mbit/s which is half the network bandwidth. I'm still in the middle of nowhere, so any ideas are welcome. Best Regards, Strahil Nikolov On Jan 23, 2019 17:49, Amar Tumballi Suryanarayan wrote: > > I didn't understand the issue properly. Mostly I missed something. > > Are you concerned the performance is 49MB/s with and without perf options? or are you expecting it to be 123MB/s as over the n/w you get that speed? > > If it is the first problem, then you are actually having 'performance.write-behind on' in both options, and it is the only perf xlator which comes into action during the test you ran. > > If it is the second, then please be informed that gluster does client side replication, which means, n/w would be split in half for write operations (like write(), creat() etc), so the number you are getting is almost the maximum with 1GbE. 
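
A quick back-of-the-envelope check of the halving argument above. This is only a sketch under a simplified model that is an assumption here, not something stated in the thread: every byte written crosses the client's single link once per full data copy, and the arbiter's metadata-only traffic is ignored. The 123 MB/s figure is the dd | nc measurement and 455 Mbit/s is the usage observed on ovirt1; the function names are made up for illustration.

```python
def expected_write_mb_s(link_mb_s: float, full_copies_over_wire: int) -> float:
    """Upper bound on client write throughput when each write is sent
    full_copies_over_wire times over the same link (simplified model)."""
    if full_copies_over_wire < 1:
        raise ValueError("need at least one copy on the wire")
    return link_mb_s / full_copies_over_wire


def mbit_to_mbyte(mbit_s: float) -> float:
    """Convert Mbit/s to MB/s (8 bits per byte, decimal units)."""
    return mbit_s / 8.0


link = 123.0  # MB/s, from the dd | nc test between ovirt1 and ovirt2

# Two full data copies sharing one gigabit link (assumed model).
print(expected_write_mb_s(link, 2))   # 61.5

# NIC usage reported for ovirt1 during the dd test, expressed in MB/s.
print(mbit_to_mbyte(455.0))           # 56.875
```

Both values are in the neighbourhood of the ~59 MB/s that dd reported, which is at least consistent with the explanation above; it does not by itself rule out other causes.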
>
> Regards,
> Amar
>
> On Wed, Jan 23, 2019 at 8:38 PM Strahil Nikolov wrote:
>>
>> Hello Community,
>>
>> recently I have built a new lab based on oVirt and CentOS 7.
>> During deployment I had some hiccups, but now the engine is up and running - but gluster is causing me trouble.
>>
>> Symptoms: Slow VM install from DVD, poor write performance. The latter has been tested via:
>> dd if=/dev/zero of=/rhev/data-center/mnt/glusterSD/ovirt1.localdomain:_data bs=1M count=1000 status=progress
>>
>> The reported speed is 60MB/s which is way too low for my setup.
>>
>> My lab design:
>> https://drive.google.com/file/d/1SiW21ASPXHRAEuE_jZ50R3FoO-NcnFqT/view?usp=sharing
>> Gluster version is 3.12.15
>>
>> So far I have done:
>>
>> 1. Added 'server.allow-insecure on' (with 'option rpc-auth-allow-insecure on' in glusterd.vol)
>> Volume info after that change:
>>
>> Volume Name: data
>> Type: Replicate
>> Volume ID: 9b06a1e9-8102-4cd7-bc56-84960a1efaa2
>> Status: Started
>> Snapshot Count: 0
>> Number of Bricks: 1 x (2 + 1) = 3
>> Transport-type: tcp
>> Bricks:
>> Brick1: ovirt1.localdomain:/gluster_bricks/data/data
>> Brick2: ovirt2.localdomain:/gluster_bricks/data/data
>> Brick3: ovirt3.localdomain:/gluster_bricks/data/data (arbiter)
>> Options Reconfigured:
>> performance.client-io-threads: off
>> nfs.disable: on
>> transport.address-family: inet
>> performance.quick-read: off
>> performance.read-ahead: off
>> performance.io-cache: off
>> performance.low-prio-threads: 32
>> network.remote-dio: off
>> cluster.eager-lock: enable
>> cluster.quorum-type: auto
>> cluster.server-quorum-type: server
>> cluster.data-self-heal-algorithm: full
>> cluster.locking-scheme: granular
>> cluster.shd-max-threads: 8
>> cluster.shd-wait-qlength: 10000
>> features.shard: on
>> user.cifs: off
>> storage.owner-uid: 36
>> storage.owner-gid: 36
>> network.ping-timeout: 30
>> performance.strict-o-direct: on
>> cluster.granular-entry-heal: enable
>> server.allow-insecure: on
>>
>> Seems no positive or negative effect so far.
>>
>> 2. Tested with tmpfs on all bricks -> ovirt1 mounted gluster volume -> max 60MB/s (bs=1M without 'oflag=direct')
>>
>> [root@ovirt1 data]# dd if=/dev/zero of=large_io bs=1M count=4000 status=progress
>> 4177526784 bytes (4.2 GB) copied, 70.843409 s, 59.0 MB/s
>> 4000+0 records in
>> 4000+0 records out
>> 4194304000 bytes (4.2 GB) copied, 71.1407 s, 59.0 MB/s
>> [root@ovirt1 data]# rm -f large_io
>>
>> This indicates to me that it's not a problem in Disk/LVM/FileSystem layout.
>>
>> Most probably I haven't created the volume properly or some option/feature is disabled ?!?
>> Network shows OK for a gigabit:
>> [root@ovirt1 data]# dd if=/dev/zero status=progress | nc ovirt2 9999
>> 3569227264 bytes (3.6 GB) copied, 29.001052 s, 123 MB/s^C
>> 7180980+0 records in
>> 7180979+0 records out
>> 3676661248 bytes (3.7 GB) copied, 29.8739 s, 123 MB/s
>>
>> I'm looking for any help... you can share your volume info also.
>>
>> Thanks in advance.
>>
>> Best Regards,
>> Strahil Nikolov
>
> --
> Amar Tumballi (amarts)

From srakonde at redhat.com Thu Jan 24 10:41:56 2019
From: srakonde at redhat.com (Sanju Rakonde)
Date: Thu, 24 Jan 2019 16:11:56 +0530
Subject: [Gluster-users] [Bugs] Bricks are going offline unable to recover with heal/start force commands
In-Reply-To:
References:
Message-ID:

Mohit,

Have we come across this kind of issue? This user is using gluster version 4.1. Did we fix any related bug afterwards?

It looks like the setup has some issues, but I'm not sure.

On Thu, Jan 24, 2019 at 4:01 PM Shaik Salam wrote:

>
> Hi Sanju,
>
> Please find requested information (these are latest logs :) ).
>
> I can see only the following error messages related to brick
> "brick_e15c12cceae12c8ab7782dd57cf5b6c1" (in the second node's log):
>
> [2019-01-23 11:50:20.322902] I [glusterd-utils.c:5994:glusterd_brick_start] 0-management: discovered already-running brick /var/lib/heketi/mounts/vg_d5f17487744584e3652d3ca943b0b91b/brick_e15c12cceae12c8ab7782dd57cf5b6c1/brick
> [2019-01-23 11:50:20.322925] I [MSGID: 106142] [glusterd-pmap.c:297:pmap_registry_bind] 0-pmap: adding brick /var/lib/heketi/mounts/vg_d5f17487744584e3652d3ca943b0b91b/brick_e15c12cceae12c8ab7782dd57cf5b6c1/brick on port 49165 >> showing running on port but not
> [2019-01-23 11:50:20.327557] I [MSGID: 106131] [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: nfs already stopped
> [2019-01-23 11:50:20.327586] I [MSGID: 106568] [glusterd-svc-mgmt.c:235:glusterd_svc_stop] 0-management: nfs service is stopped
> [2019-01-23 11:50:20.327604] I [MSGID: 106599] [glusterd-nfs-svc.c:82:glusterd_nfssvc_manager] 0-management: nfs/server.so xlator is not installed
> [2019-01-23 11:50:20.337735] I [MSGID: 106568] [glusterd-proc-mgmt.c:87:glusterd_proc_stop] 0-management: Stopping glustershd daemon running in pid: 69525
> [2019-01-23 11:50:21.338058] I [MSGID: 106568] [glusterd-svc-mgmt.c:235:glusterd_svc_stop] 0-management: glustershd service is stopped
> [2019-01-23 11:50:21.338180] I [MSGID: 106567] [glusterd-svc-mgmt.c:203:glusterd_svc_start] 0-management: Starting glustershd service
> [2019-01-23 11:50:21.348234] I [MSGID: 106131] [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: bitd already stopped
> [2019-01-23 11:50:21.348285] I [MSGID: 106568] [glusterd-svc-mgmt.c:235:glusterd_svc_stop] 0-management: bitd service is stopped
> [2019-01-23 11:50:21.348866] I [MSGID: 106131] [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: scrub already stopped
> [2019-01-23 11:50:21.348883] I [MSGID: 106568] [glusterd-svc-mgmt.c:235:glusterd_svc_stop] 0-management: scrub service is stopped
> [2019-01-23 11:50:22.356502] I [run.c:241:runner_log] (-->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2b3a) [0x7fca9e139b3a] -->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2605) [0x7fca9e139605] -->/lib64/libglusterfs.so.0(runner_log+0x115) [0x7fcaa346f0e5] ) 0-management: Ran script: /var/lib/glusterd/hooks/1/start/post/S29CTDBsetup.sh --volname=vol_3442e86b6d994a14de73f1b8c82cf0b8 --first=no --version=1 --volume-op=start --gd-workdir=/var/lib/glusterd
> [2019-01-23 11:50:22.368845] E [run.c:241:runner_log] (-->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2b3a) [0x7fca9e139b3a] -->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2563) [0x7fca9e139563] -->/lib64/libglusterfs.so.0(runner_log+0x115) [0x7fcaa346f0e5] ) 0-management: Failed to execute script: /var/lib/glusterd/hooks/1/start/post/S30samba-start.sh --volname=vol_3442e86b6d994a14de73f1b8c82cf0b8 --first=no --version=1 --volume-op=start --gd-workdir=/var/lib/glusterd
>
> sh-4.2# gluster volume status vol_3442e86b6d994a14de73f1b8c82cf0b8
> Status of volume: vol_3442e86b6d994a14de73f1b8c82cf0b8
> Gluster process                             TCP Port  RDMA Port  Online  Pid
> ------------------------------------------------------------------------------
> Brick 192.168.3.6:/var/lib/heketi/mounts/vg
> _ca57f326195c243be2380ce4e42a4191/brick_952
> d75fd193c7209c9a81acbc23a3747/brick         49157     0          Y       250
> Brick 192.168.3.5:/var/lib/heketi/mounts/vg
> _d5f17487744584e3652d3ca943b0b91b/brick_e15
> c12cceae12c8ab7782dd57cf5b6c1/brick         N/A       N/A        N       N/A
> Brick 192.168.3.15:/var/lib/heketi/mounts/v
> g_462ea199185376b03e4b0317363bb88c/brick_17
> 36459d19e8aaa1dcb5a87f48747d04/brick        49173     0          Y       225
> Self-heal Daemon on localhost               N/A       N/A        Y       109550
> Self-heal Daemon on 192.168.3.6             N/A       N/A        Y       52557
> Self-heal Daemon on 192.168.3.15            N/A       N/A        Y       16946
>
> Task Status of Volume vol_3442e86b6d994a14de73f1b8c82cf0b8
> ------------------------------------------------------------------------------
> There are no active volume tasks
>
> BR
> Salam
>
> From: "Sanju Rakonde"
> To: "Shaik Salam"
> Cc: "Amar Tumballi Suryanarayan" , "gluster-users at gluster.org List" , "Murali Kottakota"
> Date: 01/24/2019 02:32 PM
> Subject: Re: [Gluster-users] [Bugs] Bricks are going offline unable to recover with heal/start force commands
> ------------------------------
>
> *"External email. Open with Caution"*
> Shaik,
>
> Sorry to ask this again. What errors are you seeing in glusterd logs? Can you share the latest logs?
>
> On Thu, Jan 24, 2019 at 2:05 PM Shaik Salam <*shaik.salam at tcs.com* > wrote:
> Hi Sanju,
>
> Please find requested information.
>
> Are you still seeing the error "Unable to read pidfile:" in glusterd log? >>>> No
> Are you seeing "brick is deemed not to be a part of the volume" error in glusterd log? >>>> No
>
> sh-4.2# getfattr -m -d -e hex /var/lib/heketi/mounts/vg_d5f17487744584e3652d3ca943b0b91b/brick_e15c12cceae12c8ab7782dd57cf5b6c1/brick
> sh-4.2# getfattr -m -d -e hex /var/lib/heketi/mounts/vg_d5f17487744584e3652d3ca943b0b91b/brick_e15c12cceae1^C8ab7782dd57cf5b6c1/brick
> sh-4.2# pwd
> /var/lib/heketi/mounts/vg_d5f17487744584e3652d3ca943b0b91b/brick_e15c12cceae12c8ab7782dd57cf5b6c1/brick
> sh-4.2# getfattr -m -d -e hex /var/lib/heketi/mounts/vg_d5f17487744584e3652d3ca943b0b91b/brick_e15c12cceae12c8ab7782dd57cf5b6c1/brick
> sh-4.2# getfattr -m -d -e hex /var/lib/heketi/mounts/vg_d5f17487744584e3652d3ca943b0b91b/brick_e15c12cceae12c8ab7782dd57cf5b6c1/brick/
> sh-4.2# getfattr -m -d -e hex /var/lib/heketi/mounts/vg_d5f17487744584e3652d3ca943b0b91b/brick_e15c12cceae12c8ab7782dd57cf5b6c1/brick/
> sh-4.2# getfattr -m -d -e hex /var/lib/heketi/mounts/vg_d5f17487744584e3652d3ca943b0b91b/brick_e15c12cceae12c8ab7782dd57cf5b6c1/brick/
> sh-4.2# getfattr -d -m . -e hex /var/lib/heketi/mounts/vg_d5f17487744584e3652d3ca943b0b91b/brick_e15c12cceae12c8ab7782dd57cf5b6c1/brick/
> getfattr: Removing leading '/' from absolute path names
> # file: var/lib/heketi/mounts/vg_d5f17487744584e3652d3ca943b0b91b/brick_e15c12cceae12c8ab7782dd57cf5b6c1/brick/
> security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
> trusted.afr.dirty=0x000000000000000000000000
> trusted.afr.vol_3442e86b6d994a14de73f1b8c82cf0b8-client-0=0x000000000000000000000000
> trusted.gfid=0x00000000000000000000000000000001
> trusted.glusterfs.dht=0x000000010000000000000000ffffffff
> trusted.glusterfs.volume-id=0x15477f3622e84757a0ce9000b63fa849
>
> sh-4.2# ls -la |wc -l
> 86
> sh-4.2# pwd
> /var/lib/heketi/mounts/vg_d5f17487744584e3652d3ca943b0b91b/brick_e15c12cceae12c8ab7782dd57cf5b6c1/brick
> sh-4.2#
>
> From: "Sanju Rakonde" <*srakonde at redhat.com* >
> To: "Shaik Salam" <*shaik.salam at tcs.com* >
> Cc: "Amar Tumballi Suryanarayan" <*atumball at redhat.com* >, "*gluster-users at gluster.org* List" <*gluster-users at gluster.org* >, "Murali Kottakota" <*murali.kottakota at tcs.com* >
> Date: 01/24/2019 01:38 PM
> Subject: Re: [Gluster-users] [Bugs] Bricks are going offline unable to recover with heal/start force commands
> ------------------------------
>
> *"External email. Open with Caution"*
> Shaik,
>
> Previously I suspected that the brick pid file was missing, but I see it is present.
>
> From second node (this brick is in offline state):
> /var/run/gluster/vols/vol_3442e86b6d994a14de73f1b8c82cf0b8/192.168.3.5-var-lib-heketi-mounts-vg_d5f17487744584e3652d3ca943b0b91b-brick_e15c12cceae12c8ab7782dd57cf5b6c1-brick.pid
> 271
> Are you still seeing the error "Unable to read pidfile:" in glusterd log?
>
> I also suspect that the brick may be missing its extended attributes. Are you seeing "brick is deemed not to be a part of the volume" error in glusterd log?
If not can you please provide us output of "getfattr -m -d -e hex > " > > On Thu, Jan 24, 2019 at 12:18 PM Shaik Salam <*shaik.salam at tcs.com* > > wrote: > Hi Sanju, > > Could you please have look my issue if you have time (atleast provide > workaround). > > BR > Salam > > > > From: Shaik Salam/HYD/TCS > To: "Sanju Rakonde" <*srakonde at redhat.com* > > Cc: "Amar Tumballi Suryanarayan" <*atumball at redhat.com* > >, "*gluster-users at gluster.org* > List" <*gluster-users at gluster.org* > >, "Murali Kottakota" < > *murali.kottakota at tcs.com* > > Date: 01/23/2019 05:50 PM > Subject: Re: [Gluster-users] [Bugs] Bricks are going offline > unable to recover with heal/start force commands > ------------------------------ > > > > > Hi Sanju, > > Please find requested information. > > Sorry to repeat again I am trying start force command once brick log > enabled to debug by taking one volume example. > Please correct me If I am doing wrong. > > > [root at master ~]# oc rsh glusterfs-storage-vll7x > sh-4.2# gluster volume info vol_3442e86b6d994a14de73f1b8c82cf0b8 > > Volume Name: vol_3442e86b6d994a14de73f1b8c82cf0b8 > Type: Replicate > Volume ID: 15477f36-22e8-4757-a0ce-9000b63fa849 > Status: Started > Snapshot Count: 0 > Number of Bricks: 1 x 3 = 3 > Transport-type: tcp > Bricks: > Brick1: 192.168.3.6: > /var/lib/heketi/mounts/vg_ca57f326195c243be2380ce4e42a4191/brick_952d75fd193c7209c9a81acbc23a3747/brick > Brick2: 192.168.3.5: > /var/lib/heketi/mounts/vg_d5f17487744584e3652d3ca943b0b91b/ > brick_e15c12cceae12c8ab7782dd57cf5b6c1/brick > Brick3: 192.168.3.15: > /var/lib/heketi/mounts/vg_462ea199185376b03e4b0317363bb88c/brick_1736459d19e8aaa1dcb5a87f48747d04/brick > Options Reconfigured: > diagnostics.brick-log-level: INFO > performance.client-io-threads: off > nfs.disable: on > transport.address-family: inet > sh-4.2# gluster volume status vol_3442e86b6d994a14de73f1b8c82cf0b8 > Status of volume: vol_3442e86b6d994a14de73f1b8c82cf0b8 > Gluster process TCP Port RDMA Port 
Online > Pid > > ------------------------------------------------------------------------------ > Brick 192.168.3.6:/var/lib/heketi/mounts/vg > _ca57f326195c243be2380ce4e42a4191/brick_952 > d75fd193c7209c9a81acbc23a3747/brick 49157 0 Y > 250 > Brick 192.168.3.5:/var/lib/heketi/mounts/vg > _d5f17487744584e3652d3ca943b0b91b/brick_e15 > c12cceae12c8ab7782dd57cf5b6c1/brick N/A N/A N > N/A > Brick 192.168.3.15:/var/lib/heketi/mounts/v > g_462ea199185376b03e4b0317363bb88c/brick_17 > 36459d19e8aaa1dcb5a87f48747d04/brick 49173 0 Y > 225 > Self-heal Daemon on localhost N/A N/A Y > 108434 > Self-heal Daemon on matrix1.matrix.orange.l > ab N/A N/A Y > 69525 > Self-heal Daemon on matrix2.matrix.orange.l > ab N/A N/A Y > 18569 > > gluster volume set vol_3442e86b6d994a14de73f1b8c82cf0b8 > diagnostics.brick-log-level DEBUG > volume set: success > sh-4.2# gluster volume get vol_3442e86b6d994a14de73f1b8c82cf0b8 all |grep > log > cluster.entry-change-log on > cluster.data-change-log on > cluster.metadata-change-log on > diagnostics.brick-log-level DEBUG > > sh-4.2# cd /var/log/glusterfs/bricks/ > sh-4.2# ls -la |grep brick_e15c12cceae12c8ab7782dd57cf5b6c1 > -rw-------. 1 root root 0 Jan 20 02:46 > var-lib-heketi-mounts-vg_d5f17487744584e3652d3ca943b0b91b-brick_e15c12cceae12c8ab7782dd57cf5b6c1-brick.log > >>> Noting in log > > -rw-------. 
1 root root 189057 Jan 18 09:20 > var-lib-heketi-mounts-vg_d5f17487744584e3652d3ca943b0b91b-brick_e15c12cceae12c8ab7782dd57cf5b6c1-brick.log-20190120 > > [2019-01-23 11:49:32.475956] I [run.c:241:runner_log] > (-->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2b3a) > [0x7fca9e139b3a] > -->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2605) > [0x7fca9e139605] -->/lib64/libglusterfs.so.0(runner_log+0x115) > [0x7fcaa346f0e5] ) 0-management: Ran script: > /var/lib/glusterd/hooks/1/set/post/S30samba-set.sh > --volname=vol_3442e86b6d994a14de73f1b8c82cf0b8 -o > diagnostics.brick-log-level=DEBUG --gd-workdir=/var/lib/glusterd > [2019-01-23 11:49:32.483191] I [run.c:241:runner_log] > (-->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2b3a) > [0x7fca9e139b3a] > -->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2605) > [0x7fca9e139605] -->/lib64/libglusterfs.so.0(runner_log+0x115) > [0x7fcaa346f0e5] ) 0-management: Ran script: > /var/lib/glusterd/hooks/1/set/post/S32gluster_enable_shared_storage.sh > --volname=vol_3442e86b6d994a14de73f1b8c82cf0b8 -o > diagnostics.brick-log-level=DEBUG --gd-workdir=/var/lib/glusterd > [2019-01-23 11:48:59.111292] W [MSGID: 106036] > [glusterd-snapshot.c:9514:glusterd_handle_snapshot_fn] 0-management: > Snapshot list failed > [2019-01-23 11:50:14.112271] E [MSGID: 106026] > [glusterd-snapshot.c:3962:glusterd_handle_snapshot_list] 0-management: > Volume (vol_63854b105c40802bdec77290e91858ea) does not exist [Invalid > argument] > [2019-01-23 11:50:14.112305] W [MSGID: 106036] > [glusterd-snapshot.c:9514:glusterd_handle_snapshot_fn] 0-management: > Snapshot list failed > [2019-01-23 11:50:20.322902] I > [glusterd-utils.c:5994:glusterd_brick_start] 0-management: discovered > already-running brick > /var/lib/heketi/mounts/vg_d5f17487744584e3652d3ca943b0b91b/brick_e15c12cceae12c8ab7782dd57cf5b6c1/brick > [2019-01-23 11:50:20.322925] I [MSGID: 106142] > [glusterd-pmap.c:297:pmap_registry_bind] 0-pmap: adding brick > 
/var/lib/heketi/mounts/vg_d5f17487744584e3652d3ca943b0b91b/brick_e15c12cceae12c8ab7782dd57cf5b6c1/brick > on port 49165 > [2019-01-23 11:50:20.327557] I [MSGID: 106131] > [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: nfs already > stopped > [2019-01-23 11:50:20.327586] I [MSGID: 106568] > [glusterd-svc-mgmt.c:235:glusterd_svc_stop] 0-management: nfs service is > stopped > [2019-01-23 11:50:20.327604] I [MSGID: 106599] > [glusterd-nfs-svc.c:82:glusterd_nfssvc_manager] 0-management: nfs/server.so > xlator is not installed > [2019-01-23 11:50:20.337735] I [MSGID: 106568] > [glusterd-proc-mgmt.c:87:glusterd_proc_stop] 0-management: Stopping > glustershd daemon running in pid: 69525 > [2019-01-23 11:50:21.338058] I [MSGID: 106568] > [glusterd-svc-mgmt.c:235:glusterd_svc_stop] 0-management: glustershd > service is stopped > [2019-01-23 11:50:21.338180] I [MSGID: 106567] > [glusterd-svc-mgmt.c:203:glusterd_svc_start] 0-management: Starting > glustershd service > [2019-01-23 11:50:21.348234] I [MSGID: 106131] > [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: bitd already > stopped > [2019-01-23 11:50:21.348285] I [MSGID: 106568] > [glusterd-svc-mgmt.c:235:glusterd_svc_stop] 0-management: bitd service is > stopped > [2019-01-23 11:50:21.348866] I [MSGID: 106131] > [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: scrub already > stopped > [2019-01-23 11:50:21.348883] I [MSGID: 106568] > [glusterd-svc-mgmt.c:235:glusterd_svc_stop] 0-management: scrub service is > stopped > [2019-01-23 11:50:22.356502] I [run.c:241:runner_log] > (-->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2b3a) > [0x7fca9e139b3a] > -->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2605) > [0x7fca9e139605] -->/lib64/libglusterfs.so.0(runner_log+0x115) > [0x7fcaa346f0e5] ) 0-management: Ran script: > /var/lib/glusterd/hooks/1/start/post/S29CTDBsetup.sh > --volname=vol_3442e86b6d994a14de73f1b8c82cf0b8 --first=no --version=1 > --volume-op=start 
--gd-workdir=/var/lib/glusterd > [2019-01-23 11:50:22.368845] E [run.c:241:runner_log] > (-->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2b3a) > [0x7fca9e139b3a] > -->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2563) > [0x7fca9e139563] -->/lib64/libglusterfs.so.0(runner_log+0x115) > [0x7fcaa346f0e5] ) 0-management: Failed to execute script: > /var/lib/glusterd/hooks/1/start/post/S30samba-start.sh > --volname=vol_3442e86b6d994a14de73f1b8c82cf0b8 --first=no --version=1 > --volume-op=start --gd-workdir=/var/lib/glusterd > > > sh-4.2# gluster volume status vol_3442e86b6d994a14de73f1b8c82cf0b8 > Status of volume: vol_3442e86b6d994a14de73f1b8c82cf0b8 > Gluster process TCP Port RDMA Port Online > Pid > > ------------------------------------------------------------------------------ > Brick 192.168.3.6:/var/lib/heketi/mounts/vg > _ca57f326195c243be2380ce4e42a4191/brick_952 > d75fd193c7209c9a81acbc23a3747/brick 49157 0 Y > 250 > Brick 192.168.3.5:/var/lib/heketi/mounts/vg > _d5f17487744584e3652d3ca943b0b91b/brick_e15 > c12cceae12c8ab7782dd57cf5b6c1/brick N/A N/A N > N/A > Brick 192.168.3.15:/var/lib/heketi/mounts/v > g_462ea199185376b03e4b0317363bb88c/brick_17 > 36459d19e8aaa1dcb5a87f48747d04/brick 49173 0 Y > 225 > Self-heal Daemon on localhost N/A N/A Y > 109550 > Self-heal Daemon on 192.168.3.6 N/A N/A Y > 52557 > Self-heal Daemon on 192.168.3.15 N/A N/A Y > 16946 > > Task Status of Volume vol_3442e86b6d994a14de73f1b8c82cf0b8 > > ------------------------------------------------------------------------------ > There are no active volume tasks > > > > > From: "Sanju Rakonde" <*srakonde at redhat.com* > > To: "Shaik Salam" <*shaik.salam at tcs.com* > > Cc: "Amar Tumballi Suryanarayan" <*atumball at redhat.com* > >, "*gluster-users at gluster.org* > List" <*gluster-users at gluster.org* > >, "Murali Kottakota" < > *murali.kottakota at tcs.com* > > Date: 01/23/2019 02:15 PM > Subject: Re: [Gluster-users] [Bugs] Bricks are going offline > unable to 
recover with heal/start force commands > ------------------------------ > > > > * "External email. Open with Caution"* > Hi Shaik, > > I can see below errors in glusterd logs. > > [2019-01-22 09:20:17.540196] E [MSGID: 101012] > [common-utils.c:4010:gf_is_service_running] 0-: Unable to read pidfile: > /var/run/gluster/vols/vol_e1aa1283d5917485d88c4a742eeff422/192.168.3.6-var-lib-heketi-mounts-vg_526f35058433c6b03130bba4e0a7dd87-brick_9e7c382e5f853d471c347bc5590359af-brick.pid > > [2019-01-22 09:20:17.546408] E [MSGID: 101012] > [common-utils.c:4010:gf_is_service_running] 0-: Unable to read pidfile: > /var/run/gluster/vols/vol_f0ed498d7e781d7bb896244175b31f9e/192.168.3.6-var-lib-heketi-mounts-vg_56391bec3c8bfe4fc116de7bddfc2af4-brick_47ed9e0663ad0f6f676ddd6ad7e3dcde-brick.pid > > [2019-01-22 09:20:17.552575] E [MSGID: 101012] > [common-utils.c:4010:gf_is_service_running] 0-: Unable to read pidfile: > /var/run/gluster/vols/vol_f387519c9b004ec14e80696db88ef0f8/192.168.3.6-var-lib-heketi-mounts-vg_56391bec3c8bfe4fc116de7bddfc2af4-brick_06ad6c73dfbf6a5fc21334f98c9973c2-brick.pid > > [2019-01-22 09:20:17.558888] E [MSGID: 101012] > [common-utils.c:4010:gf_is_service_running] 0-: Unable to read pidfile: > /var/run/gluster/vols/vol_f8ca343c60e6efe541fe02d16ca02a7d/192.168.3.6-var-lib-heketi-mounts-vg_526f35058433c6b03130bba4e0a7dd87-brick_525225f65753b05dfe33aeaeb9c5de39-brick.pid > > [2019-01-22 09:20:17.565266] E [MSGID: 101012] > [common-utils.c:4010:gf_is_service_running] 0-: Unable to read pidfile: > /var/run/gluster/vols/vol_fe882e074c0512fd9271fc2ff5a0bfe1/192.168.3.6-var-lib-heketi-mounts-vg_28708570b029e5eff0a996c453a11691-brick_d4f30d6e465a8544b759a7016fb5aab5-brick.pid > > [2019-01-22 09:20:17.585926] E [MSGID: 106028] > [glusterd-utils.c:8222:glusterd_brick_signal] 0-glusterd: Unable to get pid > of brick process > [2019-01-22 09:20:17.617806] E [MSGID: 106028] > [glusterd-utils.c:8222:glusterd_brick_signal] 0-glusterd: Unable to get pid > of brick process > 
[2019-01-22 09:20:17.649628] E [MSGID: 101012] > [common-utils.c:4010:gf_is_service_running] 0-: Unable to read pidfile: > /var/run/gluster/glustershd/glustershd.pid > [2019-01-22 09:20:17.649700] E [MSGID: 101012] > [common-utils.c:4010:gf_is_service_running] 0-: Unable to read pidfile: > /var/run/gluster/glustershd/glustershd.pid > > So it looks like, neither gf_is_service_running() > nor glusterd_brick_signal() are able to read the pid file. That means > pidfiles might be having nothing to read. > > Can you please paste the contents of brick pidfiles. You can find brick > pidfiles in /var/run/gluster/vols// or you can just run this > command "for i in `ls /var/run/gluster/vols/*/*.pid`;do echo $i;cat > $i;done" > > On Wed, Jan 23, 2019 at 12:49 PM Shaik Salam <*shaik.salam at tcs.com* > > wrote: > Hi Sanju, > > Please find requested information attached logs. > > > > > Below brick is offline and try to start force/heal commands but doesn't > makes up. > > sh-4.2# > sh-4.2# gluster --version > glusterfs 4.1.5 > > > sh-4.2# gluster volume status vol_3442e86b6d994a14de73f1b8c82cf0b8 > Status of volume: vol_3442e86b6d994a14de73f1b8c82cf0b8 > Gluster process TCP Port RDMA Port Online > Pid > > ------------------------------------------------------------------------------ > Brick 192.168.3.6:/var/lib/heketi/mounts/vg > _ca57f326195c243be2380ce4e42a4191/brick_952 > d75fd193c7209c9a81acbc23a3747/brick 49166 0 Y > 269 > Brick 192.168.3.5:/var/lib/heketi/mounts/vg > _d5f17487744584e3652d3ca943b0b91b/brick_e15 > c12cceae12c8ab7782dd57cf5b6c1/brick N/A N/A N > N/A > Brick 192.168.3.15:/var/lib/heketi/mounts/v > g_462ea199185376b03e4b0317363bb88c/brick_17 > 36459d19e8aaa1dcb5a87f48747d04/brick 49173 0 Y > 225 > Self-heal Daemon on localhost N/A N/A Y > 45826 > Self-heal Daemon on 192.168.3.6 N/A N/A Y > 65196 > Self-heal Daemon on 192.168.3.15 N/A N/A Y > 52915 > > Task Status of Volume vol_3442e86b6d994a14de73f1b8c82cf0b8 > > 
------------------------------------------------------------------------------ > > > We can see following events from when we start forcing volumes > > /mgmt/glusterd.so(+0xe2b3a) [0x7fca9e139b3a] > -->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2605) > [0x7fca9e139605] -->/lib64/libglusterfs.so.0(runner_log+0x115) > [0x7fcaa346f0e5] ) 0-management: Ran script: > /var/lib/glusterd/hooks/1/start/post/S29CTDBsetup.sh > --volname=vol_3442e86b6d994a14de73f1b8c82cf0b8 --first=no --version=1 > --volume-op=start --gd-workdir=/var/lib/glusterd > [2019-01-21 08:22:34.555068] E [run.c:241:runner_log] > (-->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2b3a) > [0x7fca9e139b3a] > -->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2563) > [0x7fca9e139563] -->/lib64/libglusterfs.so.0(runner_log+0x115) > [0x7fcaa346f0e5] ) 0-management: Failed to execute script: > /var/lib/glusterd/hooks/1/start/post/S30samba-start.sh > --volname=vol_3442e86b6d994a14de73f1b8c82cf0b8 --first=no --version=1 > --volume-op=start --gd-workdir=/var/lib/glusterd > [2019-01-21 08:22:53.389049] I [MSGID: 106499] > [glusterd-handler.c:4314:__glusterd_handle_status_volume] 0-management: > Received status volume req for volume vol_3442e86b6d994a14de73f1b8c82cf0b8 > [2019-01-21 08:23:25.346839] I [MSGID: 106487] > [glusterd-handler.c:1486:__glusterd_handle_cli_list_friends] 0-glusterd: > Received cli list req > > > We can see following events from when we heal volumes. 
> > [2019-01-21 08:20:07.576070] W [rpc-clnt.c:1753:rpc_clnt_submit] > 0-glusterfs: error returned while attempting to connect to host:(null), > port:0 > [2019-01-21 08:20:07.580225] I [cli-rpc-ops.c:9182:gf_cli_heal_volume_cbk] > 0-cli: Received resp to heal volume > [2019-01-21 08:20:07.580326] I [input.c:31:cli_batch] 0-: Exiting with: -1 > [2019-01-21 08:22:30.423311] I [cli.c:768:main] 0-cli: Started running > gluster with version 4.1.5 > [2019-01-21 08:22:30.463648] I [MSGID: 101190] > [event-epoll.c:617:event_dispatch_epoll_worker] 0-epoll: Started thread > with index 1 > [2019-01-21 08:22:30.463718] I [socket.c:2632:socket_event_handler] > 0-transport: EPOLLERR - disconnecting now > [2019-01-21 08:22:30.463859] W [rpc-clnt.c:1753:rpc_clnt_submit] > 0-glusterfs: error returned while attempting to connect to host:(null), > port:0 > [2019-01-21 08:22:33.427710] I [socket.c:2632:socket_event_handler] > 0-transport: EPOLLERR - disconnecting now > [2019-01-21 08:22:34.581555] I > [cli-rpc-ops.c:1472:gf_cli_start_volume_cbk] 0-cli: Received resp to start > volume > [2019-01-21 08:22:34.581678] I [input.c:31:cli_batch] 0-: Exiting with: 0 > [2019-01-21 08:22:53.345351] I [cli.c:768:main] 0-cli: Started running > gluster with version 4.1.5 > [2019-01-21 08:22:53.387992] I [MSGID: 101190] > [event-epoll.c:617:event_dispatch_epoll_worker] 0-epoll: Started thread > with index 1 > [2019-01-21 08:22:53.388059] I [socket.c:2632:socket_event_handler] > 0-transport: EPOLLERR - disconnecting now > [2019-01-21 08:22:53.388138] W [rpc-clnt.c:1753:rpc_clnt_submit] > 0-glusterfs: error returned while attempting to connect to host:(null), > port:0 > [2019-01-21 08:22:53.394737] I [input.c:31:cli_batch] 0-: Exiting with: 0 > [2019-01-21 08:23:25.304688] I [cli.c:768:main] 0-cli: Started running > gluster with version 4.1.5 > [2019-01-21 08:23:25.346319] I [MSGID: 101190] > [event-epoll.c:617:event_dispatch_epoll_worker] 0-epoll: Started thread > with index 1 > [2019-01-21 
08:23:25.346389] I [socket.c:2632:socket_event_handler] > 0-transport: EPOLLERR - disconnecting now > [2019-01-21 08:23:25.346500] W [rpc-clnt.c:1753:rpc_clnt_submit] > 0-glusterfs: error returned while attempting to connect to host:(null), > port:0 > > > Enabled DEBUG mode for brick level. But nothing writing to brick log. > > gluster volume set vol_3442e86b6d994a14de73f1b8c82cf0b8 > diagnostics.brick-log-level DEBUG > > sh-4.2# pwd > /var/log/glusterfs/bricks > > sh-4.2# ls -la |grep brick_e15c12cceae12c8ab7782dd57cf5b6c1 > -rw-------. 1 root root 0 Jan 20 02:46 > var-lib-heketi-mounts-vg_d5f17487744584e3652d3ca943b0b91b-brick_e15c12cceae12c8ab7782dd57cf5b6c1-brick.log > > > > > > > From: Sanju Rakonde <*srakonde at redhat.com* > > To: Shaik Salam <*shaik.salam at tcs.com* > > Cc: Amar Tumballi Suryanarayan <*atumball at redhat.com* > >, "*gluster-users at gluster.org* > List" <*gluster-users at gluster.org* > > > Date: 01/22/2019 02:21 PM > Subject: Re: [Gluster-users] [Bugs] Bricks are going offline > unable to recover with heal/start force commands > ------------------------------ > > > > * "External email. Open with Caution"* > Hi Shaik, > > Can you please provide us complete glusterd and cmd_history logs from all > the nodes in the cluster? Also please paste output of the following > commands (from all nodes): > 1. gluster --version > 2. gluster volume info > 3. gluster volume status > 4. gluster peer status > 5. ps -ax | grep glusterfsd > > On Tue, Jan 22, 2019 at 12:47 PM Shaik Salam <*shaik.salam at tcs.com* > > wrote: > Hi Surya, > > It is already customer setup and cant redeploy again. > Enabled debug for brick level log but nothing writing to it. > Can you tell me is any other ways to troubleshoot or logs to look?? 
> > > From: Shaik Salam/HYD/TCS > To: "Amar Tumballi Suryanarayan" <*atumball at redhat.com* > > > Cc: "*gluster-users at gluster.org* List" > <*gluster-users at gluster.org* > > Date: 01/22/2019 12:06 PM > Subject: Re: [Bugs] Bricks are going offline unable to recover > with heal/start force commands > ------------------------------ > > > Hi Surya, > > I have enabled DEBUG mode for brick level. But nothing writing to brick > log. > > gluster volume set vol_3442e86b6d994a14de73f1b8c82cf0b8 > diagnostics.brick-log-level DEBUG > > sh-4.2# pwd > /var/log/glusterfs/bricks > > sh-4.2# ls -la |grep brick_e15c12cceae12c8ab7782dd57cf5b6c1 > -rw-------. 1 root root 0 Jan 20 02:46 > var-lib-heketi-mounts-vg_d5f17487744584e3652d3ca943b0b91b-brick_e15c12cceae12c8ab7782dd57cf5b6c1-brick.log > > BR > Salam > > > > > From: "Amar Tumballi Suryanarayan" <*atumball at redhat.com* > > > To: "Shaik Salam" <*shaik.salam at tcs.com* > > Cc: "*gluster-users at gluster.org* List" > <*gluster-users at gluster.org* > > Date: 01/22/2019 11:38 AM > Subject: Re: [Bugs] Bricks are going offline unable to recover > with heal/start force commands > ------------------------------ > > > > * "External email. Open with Caution"* > Hi Shaik, > > Can you check what is there in brick logs? They are located in > /var/log/glusterfs/bricks/*? > > Looks like the samba hooks script failed, but that shouldn't matter in > this use case. > > Also, I see that you are trying to setup heketi to provision volumes, > which means you may be using gluster in container usecases. If you are > still in 'PoC' phase, can you give *https://github.com/gluster/gcs* > a try? That makes the deployment and the > stack little simpler. > > -Amar > > > > > On Tue, Jan 22, 2019 at 11:29 AM Shaik Salam <*shaik.salam at tcs.com* > > wrote: > Can anyone respond how to recover bricks apart from heal/start force > according to below events from logs. > Please let me know any other logs required. > Thanks in advance. 
> > BR > Salam > > > > From: Shaik Salam/HYD/TCS > To: *bugs at gluster.org* , > *gluster-users at gluster.org* > Date: 01/21/2019 10:03 PM > Subject: Bricks are going offline unable to recover with > heal/start force commands > ------------------------------ > > > Hi, > > Bricks are in offline and unable to recover with following commands > > gluster volume heal > > gluster volume start force > > But still bricks are offline. > > > sh-4.2# gluster volume status vol_3442e86b6d994a14de73f1b8c82cf0b8 > Status of volume: vol_3442e86b6d994a14de73f1b8c82cf0b8 > Gluster process TCP Port RDMA Port Online > Pid > > ------------------------------------------------------------------------------ > Brick 192.168.3.6:/var/lib/heketi/mounts/vg > _ca57f326195c243be2380ce4e42a4191/brick_952 > d75fd193c7209c9a81acbc23a3747/brick 49166 0 Y > 269 > Brick 192.168.3.5:/var/lib/heketi/mounts/vg > _d5f17487744584e3652d3ca943b0b91b/brick_e15 > c12cceae12c8ab7782dd57cf5b6c1/brick N/A N/A N > N/A > Brick 192.168.3.15:/var/lib/heketi/mounts/v > g_462ea199185376b03e4b0317363bb88c/brick_17 > 36459d19e8aaa1dcb5a87f48747d04/brick 49173 0 Y > 225 > Self-heal Daemon on localhost N/A N/A Y > 45826 > Self-heal Daemon on 192.168.3.6 N/A N/A Y > 65196 > Self-heal Daemon on 192.168.3.15 N/A N/A Y > 52915 > > Task Status of Volume vol_3442e86b6d994a14de73f1b8c82cf0b8 > > ------------------------------------------------------------------------------ > > > We can see following events from when we start forcing volumes > > /mgmt/glusterd.so(+0xe2b3a) [0x7fca9e139b3a] > -->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2605) > [0x7fca9e139605] -->/lib64/libglusterfs.so.0(runner_log+0x115) > [0x7fcaa346f0e5] ) 0-management: Ran script: > /var/lib/glusterd/hooks/1/start/post/S29CTDBsetup.sh > --volname=vol_3442e86b6d994a14de73f1b8c82cf0b8 --first=no --version=1 > --volume-op=start --gd-workdir=/var/lib/glusterd > [2019-01-21 08:22:34.555068] E [run.c:241:runner_log] > 
(-->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2b3a) > [0x7fca9e139b3a] > -->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2563) > [0x7fca9e139563] -->/lib64/libglusterfs.so.0(runner_log+0x115) > [0x7fcaa346f0e5] ) 0-management: Failed to execute script: > /var/lib/glusterd/hooks/1/start/post/S30samba-start.sh > --volname=vol_3442e86b6d994a14de73f1b8c82cf0b8 --first=no --version=1 > --volume-op=start --gd-workdir=/var/lib/glusterd > [2019-01-21 08:22:53.389049] I [MSGID: 106499] > [glusterd-handler.c:4314:__glusterd_handle_status_volume] 0-management: > Received status volume req for volume vol_3442e86b6d994a14de73f1b8c82cf0b8 > [2019-01-21 08:23:25.346839] I [MSGID: 106487] > [glusterd-handler.c:1486:__glusterd_handle_cli_list_friends] 0-glusterd: > Received cli list req > > > We can see following events from when we heal volumes. > > [2019-01-21 08:20:07.576070] W [rpc-clnt.c:1753:rpc_clnt_submit] > 0-glusterfs: error returned while attempting to connect to host:(null), > port:0 > [2019-01-21 08:20:07.580225] I [cli-rpc-ops.c:9182:gf_cli_heal_volume_cbk] > 0-cli: Received resp to heal volume > [2019-01-21 08:20:07.580326] I [input.c:31:cli_batch] 0-: Exiting with: -1 > [2019-01-21 08:22:30.423311] I [cli.c:768:main] 0-cli: Started running > gluster with version 4.1.5 > [2019-01-21 08:22:30.463648] I [MSGID: 101190] > [event-epoll.c:617:event_dispatch_epoll_worker] 0-epoll: Started thread > with index 1 > [2019-01-21 08:22:30.463718] I [socket.c:2632:socket_event_handler] > 0-transport: EPOLLERR - disconnecting now > [2019-01-21 08:22:30.463859] W [rpc-clnt.c:1753:rpc_clnt_submit] > 0-glusterfs: error returned while attempting to connect to host:(null), > port:0 > [2019-01-21 08:22:33.427710] I [socket.c:2632:socket_event_handler] > 0-transport: EPOLLERR - disconnecting now > [2019-01-21 08:22:34.581555] I > [cli-rpc-ops.c:1472:gf_cli_start_volume_cbk] 0-cli: Received resp to start > volume > [2019-01-21 08:22:34.581678] I 
[input.c:31:cli_batch] 0-: Exiting with: 0 > [2019-01-21 08:22:53.345351] I [cli.c:768:main] 0-cli: Started running > gluster with version 4.1.5 > [2019-01-21 08:22:53.387992] I [MSGID: 101190] > [event-epoll.c:617:event_dispatch_epoll_worker] 0-epoll: Started thread > with index 1 > [2019-01-21 08:22:53.388059] I [socket.c:2632:socket_event_handler] > 0-transport: EPOLLERR - disconnecting now > [2019-01-21 08:22:53.388138] W [rpc-clnt.c:1753:rpc_clnt_submit] > 0-glusterfs: error returned while attempting to connect to host:(null), > port:0 > [2019-01-21 08:22:53.394737] I [input.c:31:cli_batch] 0-: Exiting with: 0 > [2019-01-21 08:23:25.304688] I [cli.c:768:main] 0-cli: Started running > gluster with version 4.1.5 > [2019-01-21 08:23:25.346319] I [MSGID: 101190] > [event-epoll.c:617:event_dispatch_epoll_worker] 0-epoll: Started thread > with index 1 > [2019-01-21 08:23:25.346389] I [socket.c:2632:socket_event_handler] > 0-transport: EPOLLERR - disconnecting now > [2019-01-21 08:23:25.346500] W [rpc-clnt.c:1753:rpc_clnt_submit] > 0-glusterfs: error returned while attempting to connect to host:(null), > port:0 > > > > Please let us know steps to recover bricks. > > > BR > Salam > =====-----=====-----===== > Notice: The information contained in this e-mail > message and/or attachments to it may contain > confidential or privileged information. If you are > not the intended recipient, any dissemination, use, > review, distribution, printing or copying of the > information contained in this e-mail message > and/or attachments to it are strictly prohibited. If > you have received this communication in error, > please notify us by reply e-mail or telephone and > immediately and permanently delete the message > and any attachments. 
Thank you
> _______________________________________________
> Bugs mailing list
> *Bugs at gluster.org*
> *https://lists.gluster.org/mailman/listinfo/bugs*
>
> --
> Amar Tumballi (amarts)
> _______________________________________________
> Gluster-users mailing list
> *Gluster-users at gluster.org*
> *https://lists.gluster.org/mailman/listinfo/gluster-users*
>
> --
> Thanks,
> Sanju

--
Thanks,
Sanju
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From shaik.salam at tcs.com Thu Jan 24 10:42:34 2019
From: shaik.salam at tcs.com (Shaik Salam)
Date: Thu, 24 Jan 2019 16:12:34 +0530
Subject: [Gluster-users] Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: server busy
In-Reply-To:
References:
Message-ID:

Hi Madhu,

Please let me know if any other information is required.

BR
Salam

From: Shaik Salam/HYD/TCS
To: "Madhu Rajanna"
Cc: "gluster-users at gluster.org List", "Michael Adam"
Date: 01/24/2019 03:23 PM
Subject: Re: Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: server busy

Hi Madhu,

This is the complete log after the restart of the heketi pod, plus the process log.

BR
Salam

From: "Madhu Rajanna"
To: "Shaik Salam"
Cc: "gluster-users at gluster.org List", "Michael Adam"
Date: 01/24/2019 01:55 PM
Subject: Re: Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: server busy

"External email. Open with Caution"

The logs you provided are not complete, so I am not able to figure out which command is stuck. Can you reattach the complete output of `ps aux` and also attach the complete heketi logs?

On Thu, Jan 24, 2019 at 1:41 PM Shaik Salam wrote:

Hi Madhu,

Please find requested info.
BR
Salam

From: Madhu Rajanna
To: Shaik Salam
Cc: "gluster-users at gluster.org List", Michael Adam
Date: 01/24/2019 01:33 PM
Subject: Re: Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: server busy

"External email. Open with Caution"

The heketi logs you have attached are not complete, I believe. Can you provide the complete heketi logs, and can we also get the output of "ps aux" from the gluster pods? I want to see if any lvm commands or gluster commands are "stuck".

On Thu, Jan 24, 2019 at 1:16 PM Shaik Salam wrote:

Hi Madhu.

I restarted the heketi pod many times, but the issue is not resolved.

sh-4.4# heketi-cli server operations info
Operation Counts:
  Total: 0
  In-Flight: 0
  New: 0
  Stale: 0

Now you can see all operations are zero.
Now, when I try to create a single volume, the in-flight count slowly climbs to 8, as observed below.

sh-4.4# heketi-cli server operations infoCLI_SERVER=http://localhost:8080 ; export HEKETI_CLI_USE
Operation Counts:
  Total: 0
  In-Flight: 6
  New: 0
  Stale: 0
sh-4.4# heketi-cli server operations info
Operation Counts:
  Total: 0
  In-Flight: 7
  New: 0
  Stale: 0
sh-4.4# heketi-cli server operations info
Operation Counts:
  Total: 0
  In-Flight: 7
  New: 0
  Stale: 0
sh-4.4# heketi-cli server operations info
Operation Counts:
  Total: 0
  In-Flight: 7
  New: 0
  Stale: 0
sh-4.4# heketi-cli server operations info
Operation Counts:
  Total: 0
  In-Flight: 7
  New: 0
  Stale: 0

[negroni] Completed 200 OK in 186.286µs
[negroni] Started POST /volumes
[negroni] Started GET /operations
[negroni] Completed 200 OK in 166.294µs
[negroni] Started GET /operations
[negroni] Completed 200 OK in 186.411µs
[negroni] Started GET /operations
[negroni] Completed 200 OK in 179.796µs
[negroni] Started POST /volumes
[negroni] Started POST /volumes
[negroni] Started POST /volumes
[negroni] Started POST /volumes
[negroni] Started GET /operations
[negroni] Completed 200 OK in 131.108µs
[negroni] Started POST /volumes
[negroni] Started GET /operations
[negroni] Completed 200 OK in 111.392µs
[negroni] Started GET /operations
[negroni] Completed 200 OK in 265.023µs
[negroni] Started GET /operations
[negroni] Completed 200 OK in 179.364µs
[negroni] Started GET /operations
[negroni] Completed 200 OK in 295.058µs
[negroni] Started GET /operations
[negroni] Completed 200 OK in 146.857µs
[negroni] Started POST /volumes
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/24 07:43:36 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 403.166µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/24 07:43:51 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 193.554µs

But the volume for the pod is still not being created.

1:15:36 PM  Warning  Provisioning failed  Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: create volume err: error creating volume Server busy. Retry operation later.. 9 times in the last 2 minutes
1:13:21 PM  Warning  Provisioning failed  Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: create volume err: error creating volume . 8 times in the last

From: "Madhu Rajanna"
To: "Shaik Salam"
Cc: "gluster-users at gluster.org List", "Michael Adam"
Date: 01/24/2019 12:51 PM
Subject: Re: Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: server busy

"External email. Open with Caution"

Hi Shaik,

Can you provide me the output of

$ heketi-cli server operations info

from the heketi pod? As a workaround you can try restarting the heketi pod. This will cause the current operations to go stale, but other pending PVCs may go to the Bound state.

Regards,
Madhu R

On Thu, Jan 24, 2019 at 12:36 PM Shaik Salam wrote:

Hi Madhu,

Could you please have a look at my issue if you have time (at least a workaround)?
I am unable to send mail to "John Mulligan", who is currently handling the issue https://bugzilla.redhat.com/show_bug.cgi?id=1636912

BR
Salam

From: Shaik Salam/HYD/TCS
To: "John Mulligan", "Michael Adam" <madam at redhat.com>, "Madhu Rajanna"
Cc: "gluster-users at gluster.org List"
Date: 01/24/2019 12:21 PM
Subject: Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: server busy

Hi All,

We are also facing the following issue on OpenShift Origin while creating PVCs for pods. (Please provide at least a workaround so we can move further.)

Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: create volume err: error creating volume
Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: create volume err: error creating volume Server busy. Retry operation later..

Please find the heketidb dump and log below.

[negroni] Completed 429 Too Many Requests in 250.763µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:07:49 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 169.08µs
[negroni] Started DELETE /volumes/520bc5f4e1bfd029855a72f9ca7ebf6c
[negroni] Completed 404 Not Found in 148.125µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:08:04 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 496.624µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:08:04 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 101.673µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:08:19 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 209.681µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:08:19 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 103.595µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:08:34 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too
Many Requests in 297.594µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:08:34 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 96.75µs
[negroni] Started POST /volumes
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:08:49 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 477.007µs
[heketi] WARNING 2019/01/23 12:08:49 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 165.38µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:09:04 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 488.253µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:09:04 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 171.836µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:09:19 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 208.59µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:09:19 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 125.141µs
[negroni] Started DELETE /volumes/99e87ecd0a816ac34ae5a04eabc1d606
[negroni] Completed 404 Not Found in 138.687µs
[negroni] Started POST /volumes

BR
Salam

=====-----=====-----=====
Notice: The information contained in this e-mail message and/or attachments to it may contain confidential or privileged information. If you are not the intended recipient, any dissemination, use, review, distribution, printing or copying of the information contained in this e-mail message and/or attachments to it are strictly prohibited. If you have received this communication in error, please notify us by reply e-mail or telephone and immediately and permanently delete the message and any attachments.
Thank you

--
Madhu Rajanna
Software Engineer
Red Hat Bangalore, India
mrajanna at redhat.com
M: +91-9741133155
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: heketi-pod-complete.log
Type: application/octet-stream
Size: 135146 bytes
Desc: not available
URL:
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: ps-aux.txt
URL:

From shaik.salam at tcs.com Thu Jan 24 11:43:36 2019
From: shaik.salam at tcs.com (Shaik Salam)
Date: Thu, 24 Jan 2019 17:13:36 +0530
Subject: [Gluster-users] [Bugs] Bricks are going offline unable to recover with heal/start force commands
In-Reply-To:
References:
Message-ID:

Hi Mohit,

We have been facing this issue for the last month. Could you please provide at least a workaround so we can move further? Please let me know if any logs are required. Thanks in advance.

BR
Salam

From: "Sanju Rakonde"
To: "Mohit Agrawal", "Shaik Salam"
Cc: "Amar Tumballi Suryanarayan", "gluster-users at gluster.org List", "Murali Kottakota"
Date: 01/24/2019 04:12 PM
Subject: Re: [Gluster-users] [Bugs] Bricks are going offline unable to recover with heal/start force commands

"External email. Open with Caution"

Mohit,

Have we come across this kind of issue? This user is running gluster version 4.1. Did we fix any related bug afterwards? It looks like the setup has some issues, but I'm not sure.

On Thu, Jan 24, 2019 at 4:01 PM Shaik Salam wrote:

Hi Sanju,

Please find requested information (these are latest logs :) ).
I can see only following error messages related to brick "brick_e15c12cceae12c8ab7782dd57cf5b6c1" (on secondnode log) [2019-01-23 11:50:20.322902] I [glusterd-utils.c:5994:glusterd_brick_start] 0-management: discovered already-running brick /var/lib/heketi/mounts/vg_d5f17487744584e3652d3ca943b0b91b/brick_e15c12cceae12c8ab7782dd57cf5b6c1/brick [2019-01-23 11:50:20.322925] I [MSGID: 106142] [glusterd-pmap.c:297:pmap_registry_bind] 0-pmap: adding brick /var/lib/heketi/mounts/vg_d5f17487744584e3652d3ca943b0b91b/brick_e15c12cceae12c8ab7782dd57cf5b6c1/ brick on port 49165 >> showing running on port but not [2019-01-23 11:50:20.327557] I [MSGID: 106131] [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: nfs already stopped [2019-01-23 11:50:20.327586] I [MSGID: 106568] [glusterd-svc-mgmt.c:235:glusterd_svc_stop] 0-management: nfs service is stopped [2019-01-23 11:50:20.327604] I [MSGID: 106599] [glusterd-nfs-svc.c:82:glusterd_nfssvc_manager] 0-management: nfs/server.so xlator is not installed [2019-01-23 11:50:20.337735] I [MSGID: 106568] [glusterd-proc-mgmt.c:87:glusterd_proc_stop] 0-management: Stopping glustershd daemon running in pid: 69525 [2019-01-23 11:50:21.338058] I [MSGID: 106568] [glusterd-svc-mgmt.c:235:glusterd_svc_stop] 0-management: glustershd service is stopped [2019-01-23 11:50:21.338180] I [MSGID: 106567] [glusterd-svc-mgmt.c:203:glusterd_svc_start] 0-management: Starting glustershd service [2019-01-23 11:50:21.348234] I [MSGID: 106131] [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: bitd already stopped [2019-01-23 11:50:21.348285] I [MSGID: 106568] [glusterd-svc-mgmt.c:235:glusterd_svc_stop] 0-management: bitd service is stopped [2019-01-23 11:50:21.348866] I [MSGID: 106131] [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: scrub already stopped [2019-01-23 11:50:21.348883] I [MSGID: 106568] [glusterd-svc-mgmt.c:235:glusterd_svc_stop] 0-management: scrub service is stopped [2019-01-23 11:50:22.356502] I 
[run.c:241:runner_log] (-->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2b3a) [0x7fca9e139b3a] -->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2605) [0x7fca9e139605] -->/lib64/libglusterfs.so.0(runner_log+0x115) [0x7fcaa346f0e5] ) 0-management: Ran script: /var/lib/glusterd/hooks/1/start/post/S29CTDBsetup.sh --volname=vol_3442e86b6d994a14de73f1b8c82cf0b8 --first=no --version=1 --volume-op=start --gd-workdir=/var/lib/glusterd [2019-01-23 11:50:22.368845] E [run.c:241:runner_log] (-->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2b3a) [0x7fca9e139b3a] -->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2563) [0x7fca9e139563] -->/lib64/libglusterfs.so.0(runner_log+0x115) [0x7fcaa346f0e5] ) 0-management: Failed to execute script: /var/lib/glusterd/hooks/1/start/post/S30samba-start.sh --volname=vol_3442e86b6d994a14de73f1b8c82cf0b8 --first=no --version=1 --volume-op=start --gd-workdir=/var/lib/glusterd sh-4.2# gluster volume status vol_3442e86b6d994a14de73f1b8c82cf0b8 Status of volume: vol_3442e86b6d994a14de73f1b8c82cf0b8 Gluster process TCP Port RDMA Port Online Pid ------------------------------------------------------------------------------ Brick 192.168.3.6:/var/lib/heketi/mounts/vg _ca57f326195c243be2380ce4e42a4191/brick_952 d75fd193c7209c9a81acbc23a3747/brick 49157 0 Y 250 Brick 192.168.3.5:/var/lib/heketi/mounts/vg _d5f17487744584e3652d3ca943b0b91b/brick_e15 c12cceae12c8ab7782dd57cf5b6c1/brick N/A N/A N N/A Brick 192.168.3.15:/var/lib/heketi/mounts/v g_462ea199185376b03e4b0317363bb88c/brick_17 36459d19e8aaa1dcb5a87f48747d04/brick 49173 0 Y 225 Self-heal Daemon on localhost N/A N/A Y 109550 Self-heal Daemon on 192.168.3.6 N/A N/A Y 52557 Self-heal Daemon on 192.168.3.15 N/A N/A Y 16946 Task Status of Volume vol_3442e86b6d994a14de73f1b8c82cf0b8 ------------------------------------------------------------------------------ There are no active volume tasks BR Salam From: "Sanju Rakonde" To: "Shaik Salam" Cc: "Amar Tumballi 
Suryanarayan" , " gluster-users at gluster.org List" , "Murali Kottakota" Date: 01/24/2019 02:32 PM Subject: Re: [Gluster-users] [Bugs] Bricks are going offline unable to recover with heal/start force commands "External email. Open with Caution" Shaik, Sorry to ask this again. What errors are you seeing in glusterd logs? Can you share the latest logs? On Thu, Jan 24, 2019 at 2:05 PM Shaik Salam wrote: Hi Sanju, Please find requsted information. Are you still seeing the error "Unable to read pidfile:" in glusterd log? >>>> No Are you seeing "brick is deemed not to be a part of the volume" error in glusterd log?>>>> No sh-4.2# getfattr -m -d -e hex /var/lib/heketi/mounts/vg_d5f17487744584e3652d3ca943b0b91b/brick_e15c12cceae12c8ab7782dd57cf5b6c1/brick sh-4.2# getfattr -m -d -e hex /var/lib/heketi/mounts/vg_d5f17487744584e3652d3ca943b0b91b/brick_e15c12cceae1^C8ab7782dd57cf5b6c1/brick sh-4.2# pwd /var/lib/heketi/mounts/vg_d5f17487744584e3652d3ca943b0b91b/brick_e15c12cceae12c8ab7782dd57cf5b6c1/brick sh-4.2# getfattr -m -d -e hex /var/lib/heketi/mounts/vg_d5f17487744584e3652d3ca943b0b91b/brick_e15c12cceae12c8ab7782dd57cf5b6c1/brick sh-4.2# getfattr -m -d -e hex /var/lib/heketi/mounts/vg_d5f17487744584e3652d3ca943b0b91b/brick_e15c12cceae12c8ab7782dd57cf5b6c1/brick/ sh-4.2# getfattr -m -d -e hex /var/lib/heketi/mounts/vg_d5f17487744584e3652d3ca943b0b91b/brick_e15c12cceae12c8ab7782dd57cf5b6c1/brick/ sh-4.2# getfattr -m -d -e hex /var/lib/heketi/mounts/vg_d5f17487744584e3652d3ca943b0b91b/brick_e15c12cceae12c8ab7782dd57cf5b6c1/brick/ sh-4.2# getfattr -d -m . 
-e hex /var/lib/heketi/mounts/vg_d5f17487744584e3652d3ca943b0b91b/brick_e15c12cceae12c8ab7782dd57cf5b6c1/brick/
getfattr: Removing leading '/' from absolute path names
# file: var/lib/heketi/mounts/vg_d5f17487744584e3652d3ca943b0b91b/brick_e15c12cceae12c8ab7782dd57cf5b6c1/brick/
security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.vol_3442e86b6d994a14de73f1b8c82cf0b8-client-0=0x000000000000000000000000
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x000000010000000000000000ffffffff
trusted.glusterfs.volume-id=0x15477f3622e84757a0ce9000b63fa849

sh-4.2# ls -la |wc -l
86
sh-4.2# pwd
/var/lib/heketi/mounts/vg_d5f17487744584e3652d3ca943b0b91b/brick_e15c12cceae12c8ab7782dd57cf5b6c1/brick
sh-4.2#

From: "Sanju Rakonde"
To: "Shaik Salam"
Cc: "Amar Tumballi Suryanarayan", "gluster-users at gluster.org List", "Murali Kottakota"
Date: 01/24/2019 01:38 PM
Subject: Re: [Gluster-users] [Bugs] Bricks are going offline unable to recover with heal/start force commands

"External email. Open with Caution"

Shaik,

Previously I suspected that the brick pid file was missing, but I see it is present.

From second node (this brick is in offline state):
/var/run/gluster/vols/vol_3442e86b6d994a14de73f1b8c82cf0b8/192.168.3.5-var-lib-heketi-mounts-vg_d5f17487744584e3652d3ca943b0b91b-brick_e15c12cceae12c8ab7782dd57cf5b6c1-brick.pid
271

Are you still seeing the error "Unable to read pidfile:" in the glusterd log?

I also suspect that the brick may be missing its extended attributes. Are you seeing a "brick is deemed not to be a part of the volume" error in the glusterd log? If not, can you please provide us the output of "getfattr -m -d -e hex "

On Thu, Jan 24, 2019 at 12:18 PM Shaik Salam wrote:

Hi Sanju,

Could you please have a look at my issue if you have time (at least provide a workaround)?
BR Salam From: Shaik Salam/HYD/TCS To: "Sanju Rakonde" Cc: "Amar Tumballi Suryanarayan" , " gluster-users at gluster.org List" , "Murali Kottakota" Date: 01/23/2019 05:50 PM Subject: Re: [Gluster-users] [Bugs] Bricks are going offline unable to recover with heal/start force commands Hi Sanju, Please find requested information. Sorry to repeat again I am trying start force command once brick log enabled to debug by taking one volume example. Please correct me If I am doing wrong. [root at master ~]# oc rsh glusterfs-storage-vll7x sh-4.2# gluster volume info vol_3442e86b6d994a14de73f1b8c82cf0b8 Volume Name: vol_3442e86b6d994a14de73f1b8c82cf0b8 Type: Replicate Volume ID: 15477f36-22e8-4757-a0ce-9000b63fa849 Status: Started Snapshot Count: 0 Number of Bricks: 1 x 3 = 3 Transport-type: tcp Bricks: Brick1: 192.168.3.6:/var/lib/heketi/mounts/vg_ca57f326195c243be2380ce4e42a4191/brick_952d75fd193c7209c9a81acbc23a3747/brick Brick2: 192.168.3.5:/var/lib/heketi/mounts/vg_d5f17487744584e3652d3ca943b0b91b/ brick_e15c12cceae12c8ab7782dd57cf5b6c1/brick Brick3: 192.168.3.15:/var/lib/heketi/mounts/vg_462ea199185376b03e4b0317363bb88c/brick_1736459d19e8aaa1dcb5a87f48747d04/brick Options Reconfigured: diagnostics.brick-log-level: INFO performance.client-io-threads: off nfs.disable: on transport.address-family: inet sh-4.2# gluster volume status vol_3442e86b6d994a14de73f1b8c82cf0b8 Status of volume: vol_3442e86b6d994a14de73f1b8c82cf0b8 Gluster process TCP Port RDMA Port Online Pid ------------------------------------------------------------------------------ Brick 192.168.3.6:/var/lib/heketi/mounts/vg _ca57f326195c243be2380ce4e42a4191/brick_952 d75fd193c7209c9a81acbc23a3747/brick 49157 0 Y 250 Brick 192.168.3.5:/var/lib/heketi/mounts/vg _d5f17487744584e3652d3ca943b0b91b/brick_e15 c12cceae12c8ab7782dd57cf5b6c1/brick N/A N/A N N/A Brick 192.168.3.15:/var/lib/heketi/mounts/v g_462ea199185376b03e4b0317363bb88c/brick_17 36459d19e8aaa1dcb5a87f48747d04/brick 49173 0 Y 225 Self-heal Daemon on 
localhost N/A N/A Y 108434 Self-heal Daemon on matrix1.matrix.orange.l ab N/A N/A Y 69525 Self-heal Daemon on matrix2.matrix.orange.l ab N/A N/A Y 18569 gluster volume set vol_3442e86b6d994a14de73f1b8c82cf0b8 diagnostics.brick-log-level DEBUG volume set: success sh-4.2# gluster volume get vol_3442e86b6d994a14de73f1b8c82cf0b8 all |grep log cluster.entry-change-log on cluster.data-change-log on cluster.metadata-change-log on diagnostics.brick-log-level DEBUG sh-4.2# cd /var/log/glusterfs/bricks/ sh-4.2# ls -la |grep brick_e15c12cceae12c8ab7782dd57cf5b6c1 -rw-------. 1 root root 0 Jan 20 02:46 var-lib-heketi-mounts-vg_d5f17487744584e3652d3ca943b0b91b-brick_e15c12cceae12c8ab7782dd57cf5b6c1-brick.log >>> Noting in log -rw-------. 1 root root 189057 Jan 18 09:20 var-lib-heketi-mounts-vg_d5f17487744584e3652d3ca943b0b91b-brick_e15c12cceae12c8ab7782dd57cf5b6c1-brick.log-20190120 [2019-01-23 11:49:32.475956] I [run.c:241:runner_log] (-->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2b3a) [0x7fca9e139b3a] -->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2605) [0x7fca9e139605] -->/lib64/libglusterfs.so.0(runner_log+0x115) [0x7fcaa346f0e5] ) 0-management: Ran script: /var/lib/glusterd/hooks/1/set/post/S30samba-set.sh --volname=vol_3442e86b6d994a14de73f1b8c82cf0b8 -o diagnostics.brick-log-level=DEBUG --gd-workdir=/var/lib/glusterd [2019-01-23 11:49:32.483191] I [run.c:241:runner_log] (-->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2b3a) [0x7fca9e139b3a] -->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2605) [0x7fca9e139605] -->/lib64/libglusterfs.so.0(runner_log+0x115) [0x7fcaa346f0e5] ) 0-management: Ran script: /var/lib/glusterd/hooks/1/set/post/S32gluster_enable_shared_storage.sh --volname=vol_3442e86b6d994a14de73f1b8c82cf0b8 -o diagnostics.brick-log-level=DEBUG --gd-workdir=/var/lib/glusterd [2019-01-23 11:48:59.111292] W [MSGID: 106036] [glusterd-snapshot.c:9514:glusterd_handle_snapshot_fn] 0-management: Snapshot list failed [2019-01-23 
11:50:14.112271] E [MSGID: 106026] [glusterd-snapshot.c:3962:glusterd_handle_snapshot_list] 0-management: Volume (vol_63854b105c40802bdec77290e91858ea) does not exist [Invalid argument] [2019-01-23 11:50:14.112305] W [MSGID: 106036] [glusterd-snapshot.c:9514:glusterd_handle_snapshot_fn] 0-management: Snapshot list failed [2019-01-23 11:50:20.322902] I [glusterd-utils.c:5994:glusterd_brick_start] 0-management: discovered already-running brick /var/lib/heketi/mounts/vg_d5f17487744584e3652d3ca943b0b91b/brick_e15c12cceae12c8ab7782dd57cf5b6c1/brick [2019-01-23 11:50:20.322925] I [MSGID: 106142] [glusterd-pmap.c:297:pmap_registry_bind] 0-pmap: adding brick /var/lib/heketi/mounts/vg_d5f17487744584e3652d3ca943b0b91b/brick_e15c12cceae12c8ab7782dd57cf5b6c1/brick on port 49165 [2019-01-23 11:50:20.327557] I [MSGID: 106131] [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: nfs already stopped [2019-01-23 11:50:20.327586] I [MSGID: 106568] [glusterd-svc-mgmt.c:235:glusterd_svc_stop] 0-management: nfs service is stopped [2019-01-23 11:50:20.327604] I [MSGID: 106599] [glusterd-nfs-svc.c:82:glusterd_nfssvc_manager] 0-management: nfs/server.so xlator is not installed [2019-01-23 11:50:20.337735] I [MSGID: 106568] [glusterd-proc-mgmt.c:87:glusterd_proc_stop] 0-management: Stopping glustershd daemon running in pid: 69525 [2019-01-23 11:50:21.338058] I [MSGID: 106568] [glusterd-svc-mgmt.c:235:glusterd_svc_stop] 0-management: glustershd service is stopped [2019-01-23 11:50:21.338180] I [MSGID: 106567] [glusterd-svc-mgmt.c:203:glusterd_svc_start] 0-management: Starting glustershd service [2019-01-23 11:50:21.348234] I [MSGID: 106131] [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: bitd already stopped [2019-01-23 11:50:21.348285] I [MSGID: 106568] [glusterd-svc-mgmt.c:235:glusterd_svc_stop] 0-management: bitd service is stopped [2019-01-23 11:50:21.348866] I [MSGID: 106131] [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: scrub already stopped [2019-01-23 
11:50:21.348883] I [MSGID: 106568] [glusterd-svc-mgmt.c:235:glusterd_svc_stop] 0-management: scrub service is stopped [2019-01-23 11:50:22.356502] I [run.c:241:runner_log] (-->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2b3a) [0x7fca9e139b3a] -->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2605) [0x7fca9e139605] -->/lib64/libglusterfs.so.0(runner_log+0x115) [0x7fcaa346f0e5] ) 0-management: Ran script: /var/lib/glusterd/hooks/1/start/post/S29CTDBsetup.sh --volname=vol_3442e86b6d994a14de73f1b8c82cf0b8 --first=no --version=1 --volume-op=start --gd-workdir=/var/lib/glusterd [2019-01-23 11:50:22.368845] E [run.c:241:runner_log] (-->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2b3a) [0x7fca9e139b3a] -->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2563) [0x7fca9e139563] -->/lib64/libglusterfs.so.0(runner_log+0x115) [0x7fcaa346f0e5] ) 0-management: Failed to execute script: /var/lib/glusterd/hooks/1/start/post/S30samba-start.sh --volname=vol_3442e86b6d994a14de73f1b8c82cf0b8 --first=no --version=1 --volume-op=start --gd-workdir=/var/lib/glusterd sh-4.2# gluster volume status vol_3442e86b6d994a14de73f1b8c82cf0b8 Status of volume: vol_3442e86b6d994a14de73f1b8c82cf0b8 Gluster process TCP Port RDMA Port Online Pid ------------------------------------------------------------------------------ Brick 192.168.3.6:/var/lib/heketi/mounts/vg _ca57f326195c243be2380ce4e42a4191/brick_952 d75fd193c7209c9a81acbc23a3747/brick 49157 0 Y 250 Brick 192.168.3.5:/var/lib/heketi/mounts/vg _d5f17487744584e3652d3ca943b0b91b/brick_e15 c12cceae12c8ab7782dd57cf5b6c1/brick N/A N/A N N/A Brick 192.168.3.15:/var/lib/heketi/mounts/v g_462ea199185376b03e4b0317363bb88c/brick_17 36459d19e8aaa1dcb5a87f48747d04/brick 49173 0 Y 225 Self-heal Daemon on localhost N/A N/A Y 109550 Self-heal Daemon on 192.168.3.6 N/A N/A Y 52557 Self-heal Daemon on 192.168.3.15 N/A N/A Y 16946 Task Status of Volume vol_3442e86b6d994a14de73f1b8c82cf0b8 
------------------------------------------------------------------------------ There are no active volume tasks From: "Sanju Rakonde" To: "Shaik Salam" Cc: "Amar Tumballi Suryanarayan" , " gluster-users at gluster.org List" , "Murali Kottakota" Date: 01/23/2019 02:15 PM Subject: Re: [Gluster-users] [Bugs] Bricks are going offline unable to recover with heal/start force commands "External email. Open with Caution" Hi Shaik, I can see below errors in glusterd logs. [2019-01-22 09:20:17.540196] E [MSGID: 101012] [common-utils.c:4010:gf_is_service_running] 0-: Unable to read pidfile: /var/run/gluster/vols/vol_e1aa1283d5917485d88c4a742eeff422/192.168.3.6-var-lib-heketi-mounts-vg_526f35058433c6b03130bba4e0a7dd87-brick_9e7c382e5f853d471c347bc5590359af-brick.pid [2019-01-22 09:20:17.546408] E [MSGID: 101012] [common-utils.c:4010:gf_is_service_running] 0-: Unable to read pidfile: /var/run/gluster/vols/vol_f0ed498d7e781d7bb896244175b31f9e/192.168.3.6-var-lib-heketi-mounts-vg_56391bec3c8bfe4fc116de7bddfc2af4-brick_47ed9e0663ad0f6f676ddd6ad7e3dcde-brick.pid [2019-01-22 09:20:17.552575] E [MSGID: 101012] [common-utils.c:4010:gf_is_service_running] 0-: Unable to read pidfile: /var/run/gluster/vols/vol_f387519c9b004ec14e80696db88ef0f8/192.168.3.6-var-lib-heketi-mounts-vg_56391bec3c8bfe4fc116de7bddfc2af4-brick_06ad6c73dfbf6a5fc21334f98c9973c2-brick.pid [2019-01-22 09:20:17.558888] E [MSGID: 101012] [common-utils.c:4010:gf_is_service_running] 0-: Unable to read pidfile: /var/run/gluster/vols/vol_f8ca343c60e6efe541fe02d16ca02a7d/192.168.3.6-var-lib-heketi-mounts-vg_526f35058433c6b03130bba4e0a7dd87-brick_525225f65753b05dfe33aeaeb9c5de39-brick.pid [2019-01-22 09:20:17.565266] E [MSGID: 101012] [common-utils.c:4010:gf_is_service_running] 0-: Unable to read pidfile: /var/run/gluster/vols/vol_fe882e074c0512fd9271fc2ff5a0bfe1/192.168.3.6-var-lib-heketi-mounts-vg_28708570b029e5eff0a996c453a11691-brick_d4f30d6e465a8544b759a7016fb5aab5-brick.pid [2019-01-22 09:20:17.585926] E [MSGID: 106028] 
[glusterd-utils.c:8222:glusterd_brick_signal] 0-glusterd: Unable to get pid of brick process
[2019-01-22 09:20:17.617806] E [MSGID: 106028] [glusterd-utils.c:8222:glusterd_brick_signal] 0-glusterd: Unable to get pid of brick process
[2019-01-22 09:20:17.649628] E [MSGID: 101012] [common-utils.c:4010:gf_is_service_running] 0-: Unable to read pidfile: /var/run/gluster/glustershd/glustershd.pid
[2019-01-22 09:20:17.649700] E [MSGID: 101012] [common-utils.c:4010:gf_is_service_running] 0-: Unable to read pidfile: /var/run/gluster/glustershd/glustershd.pid

So it looks like neither gf_is_service_running() nor glusterd_brick_signal() is able to read the pid file. That means the pidfiles might be empty.

Can you please paste the contents of the brick pidfiles? You can find the brick pidfiles in /var/run/gluster/vols// or you can just run this command:

"for i in `ls /var/run/gluster/vols/*/*.pid`;do echo $i;cat $i;done"

On Wed, Jan 23, 2019 at 12:49 PM Shaik Salam wrote:

Hi Sanju,

Please find the requested information in the attached logs.

The brick below is offline. We tried the start force and heal commands, but it does not come up.
sh-4.2# sh-4.2# gluster --version glusterfs 4.1.5 sh-4.2# gluster volume status vol_3442e86b6d994a14de73f1b8c82cf0b8 Status of volume: vol_3442e86b6d994a14de73f1b8c82cf0b8 Gluster process TCP Port RDMA Port Online Pid ------------------------------------------------------------------------------ Brick 192.168.3.6:/var/lib/heketi/mounts/vg _ca57f326195c243be2380ce4e42a4191/brick_952 d75fd193c7209c9a81acbc23a3747/brick 49166 0 Y 269 Brick 192.168.3.5:/var/lib/heketi/mounts/vg _d5f17487744584e3652d3ca943b0b91b/brick_e15 c12cceae12c8ab7782dd57cf5b6c1/brick N/A N/A N N/A Brick 192.168.3.15:/var/lib/heketi/mounts/v g_462ea199185376b03e4b0317363bb88c/brick_17 36459d19e8aaa1dcb5a87f48747d04/brick 49173 0 Y 225 Self-heal Daemon on localhost N/A N/A Y 45826 Self-heal Daemon on 192.168.3.6 N/A N/A Y 65196 Self-heal Daemon on 192.168.3.15 N/A N/A Y 52915 Task Status of Volume vol_3442e86b6d994a14de73f1b8c82cf0b8 ------------------------------------------------------------------------------ We can see following events from when we start forcing volumes /mgmt/glusterd.so(+0xe2b3a) [0x7fca9e139b3a] -->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2605) [0x7fca9e139605] -->/lib64/libglusterfs.so.0(runner_log+0x115) [0x7fcaa346f0e5] ) 0-management: Ran script: /var/lib/glusterd/hooks/1/start/post/S29CTDBsetup.sh --volname=vol_3442e86b6d994a14de73f1b8c82cf0b8 --first=no --version=1 --volume-op=start --gd-workdir=/var/lib/glusterd [2019-01-21 08:22:34.555068] E [run.c:241:runner_log] (-->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2b3a) [0x7fca9e139b3a] -->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2563) [0x7fca9e139563] -->/lib64/libglusterfs.so.0(runner_log+0x115) [0x7fcaa346f0e5] ) 0-management: Failed to execute script: /var/lib/glusterd/hooks/1/start/post/S30samba-start.sh --volname=vol_3442e86b6d994a14de73f1b8c82cf0b8 --first=no --version=1 --volume-op=start --gd-workdir=/var/lib/glusterd [2019-01-21 08:22:53.389049] I [MSGID: 106499] 
[glusterd-handler.c:4314:__glusterd_handle_status_volume] 0-management: Received status volume req for volume vol_3442e86b6d994a14de73f1b8c82cf0b8 [2019-01-21 08:23:25.346839] I [MSGID: 106487] [glusterd-handler.c:1486:__glusterd_handle_cli_list_friends] 0-glusterd: Received cli list req We can see following events from when we heal volumes. [2019-01-21 08:20:07.576070] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-glusterfs: error returned while attempting to connect to host:(null), port:0 [2019-01-21 08:20:07.580225] I [cli-rpc-ops.c:9182:gf_cli_heal_volume_cbk] 0-cli: Received resp to heal volume [2019-01-21 08:20:07.580326] I [input.c:31:cli_batch] 0-: Exiting with: -1 [2019-01-21 08:22:30.423311] I [cli.c:768:main] 0-cli: Started running gluster with version 4.1.5 [2019-01-21 08:22:30.463648] I [MSGID: 101190] [event-epoll.c:617:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1 [2019-01-21 08:22:30.463718] I [socket.c:2632:socket_event_handler] 0-transport: EPOLLERR - disconnecting now [2019-01-21 08:22:30.463859] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-glusterfs: error returned while attempting to connect to host:(null), port:0 [2019-01-21 08:22:33.427710] I [socket.c:2632:socket_event_handler] 0-transport: EPOLLERR - disconnecting now [2019-01-21 08:22:34.581555] I [cli-rpc-ops.c:1472:gf_cli_start_volume_cbk] 0-cli: Received resp to start volume [2019-01-21 08:22:34.581678] I [input.c:31:cli_batch] 0-: Exiting with: 0 [2019-01-21 08:22:53.345351] I [cli.c:768:main] 0-cli: Started running gluster with version 4.1.5 [2019-01-21 08:22:53.387992] I [MSGID: 101190] [event-epoll.c:617:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1 [2019-01-21 08:22:53.388059] I [socket.c:2632:socket_event_handler] 0-transport: EPOLLERR - disconnecting now [2019-01-21 08:22:53.388138] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-glusterfs: error returned while attempting to connect to host:(null), port:0 [2019-01-21 08:22:53.394737] I [input.c:31:cli_batch] 0-: 
Exiting with: 0
[2019-01-21 08:23:25.304688] I [cli.c:768:main] 0-cli: Started running gluster with version 4.1.5
[2019-01-21 08:23:25.346319] I [MSGID: 101190] [event-epoll.c:617:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2019-01-21 08:23:25.346389] I [socket.c:2632:socket_event_handler] 0-transport: EPOLLERR - disconnecting now
[2019-01-21 08:23:25.346500] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-glusterfs: error returned while attempting to connect to host:(null), port:0

Enabled DEBUG mode at brick level, but nothing is being written to the brick log.

gluster volume set vol_3442e86b6d994a14de73f1b8c82cf0b8 diagnostics.brick-log-level DEBUG

sh-4.2# pwd
/var/log/glusterfs/bricks
sh-4.2# ls -la |grep brick_e15c12cceae12c8ab7782dd57cf5b6c1
-rw-------. 1 root root 0 Jan 20 02:46 var-lib-heketi-mounts-vg_d5f17487744584e3652d3ca943b0b91b-brick_e15c12cceae12c8ab7782dd57cf5b6c1-brick.log

From: Sanju Rakonde
To: Shaik Salam
Cc: Amar Tumballi Suryanarayan, "gluster-users at gluster.org List"
Date: 01/22/2019 02:21 PM
Subject: Re: [Gluster-users] [Bugs] Bricks are going offline unable to recover with heal/start force commands

"External email. Open with Caution"

Hi Shaik,

Can you please provide us complete glusterd and cmd_history logs from all the nodes in the cluster? Also please paste the output of the following commands (from all nodes):
1. gluster --version
2. gluster volume info
3. gluster volume status
4. gluster peer status
5. ps -ax | grep glusterfsd

On Tue, Jan 22, 2019 at 12:47 PM Shaik Salam wrote:

Hi Surya,

This is already a customer setup, so we can't redeploy it. I enabled debug for the brick-level log, but nothing is being written to it. Can you tell me whether there are any other ways to troubleshoot, or other logs to look at?

From: Shaik Salam/HYD/TCS
To: "Amar Tumballi Suryanarayan"
Cc: "gluster-users at gluster.org List"
Date: 01/22/2019 12:06 PM
Subject: Re: [Bugs] Bricks are going offline unable to recover with heal/start force commands

Hi Surya,

I have enabled DEBUG mode for brick level.
But nothing is being written to the brick log.

gluster volume set vol_3442e86b6d994a14de73f1b8c82cf0b8 diagnostics.brick-log-level DEBUG

sh-4.2# pwd
/var/log/glusterfs/bricks
sh-4.2# ls -la |grep brick_e15c12cceae12c8ab7782dd57cf5b6c1
-rw-------. 1 root root 0 Jan 20 02:46 var-lib-heketi-mounts-vg_d5f17487744584e3652d3ca943b0b91b-brick_e15c12cceae12c8ab7782dd57cf5b6c1-brick.log

BR
Salam

From: "Amar Tumballi Suryanarayan"
To: "Shaik Salam"
Cc: "gluster-users at gluster.org List"
Date: 01/22/2019 11:38 AM
Subject: Re: [Bugs] Bricks are going offline unable to recover with heal/start force commands

"External email. Open with Caution"

Hi Shaik,

Can you check what is in the brick logs? They are located in /var/log/glusterfs/bricks/*.

It looks like the samba hook script failed, but that shouldn't matter in this use case.

Also, I see that you are trying to set up heketi to provision volumes, which means you may be using gluster in container use cases. If you are still in the 'PoC' phase, can you give https://github.com/gluster/gcs a try? That makes the deployment and the stack a little simpler.

-Amar

On Tue, Jan 22, 2019 at 11:29 AM Shaik Salam wrote:

Can anyone advise how to recover the bricks, other than with heal/start force, based on the log events below? Please let me know if any other logs are required. Thanks in advance.

BR
Salam

From: Shaik Salam/HYD/TCS
To: bugs at gluster.org, gluster-users at gluster.org
Date: 01/21/2019 10:03 PM
Subject: Bricks are going offline unable to recover with heal/start force commands

Hi,

Bricks are offline and we are unable to recover them with the following commands:

gluster volume heal
gluster volume start force

But still bricks are offline.
sh-4.2# gluster volume status vol_3442e86b6d994a14de73f1b8c82cf0b8 Status of volume: vol_3442e86b6d994a14de73f1b8c82cf0b8 Gluster process TCP Port RDMA Port Online Pid ------------------------------------------------------------------------------ Brick 192.168.3.6:/var/lib/heketi/mounts/vg _ca57f326195c243be2380ce4e42a4191/brick_952 d75fd193c7209c9a81acbc23a3747/brick 49166 0 Y 269 Brick 192.168.3.5:/var/lib/heketi/mounts/vg _d5f17487744584e3652d3ca943b0b91b/brick_e15 c12cceae12c8ab7782dd57cf5b6c1/brick N/A N/A N N/A Brick 192.168.3.15:/var/lib/heketi/mounts/v g_462ea199185376b03e4b0317363bb88c/brick_17 36459d19e8aaa1dcb5a87f48747d04/brick 49173 0 Y 225 Self-heal Daemon on localhost N/A N/A Y 45826 Self-heal Daemon on 192.168.3.6 N/A N/A Y 65196 Self-heal Daemon on 192.168.3.15 N/A N/A Y 52915 Task Status of Volume vol_3442e86b6d994a14de73f1b8c82cf0b8 ------------------------------------------------------------------------------ We can see following events from when we start forcing volumes /mgmt/glusterd.so(+0xe2b3a) [0x7fca9e139b3a] -->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2605) [0x7fca9e139605] -->/lib64/libglusterfs.so.0(runner_log+0x115) [0x7fcaa346f0e5] ) 0-management: Ran script: /var/lib/glusterd/hooks/1/start/post/S29CTDBsetup.sh --volname=vol_3442e86b6d994a14de73f1b8c82cf0b8 --first=no --version=1 --volume-op=start --gd-workdir=/var/lib/glusterd [2019-01-21 08:22:34.555068] E [run.c:241:runner_log] (-->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2b3a) [0x7fca9e139b3a] -->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2563) [0x7fca9e139563] -->/lib64/libglusterfs.so.0(runner_log+0x115) [0x7fcaa346f0e5] ) 0-management: Failed to execute script: /var/lib/glusterd/hooks/1/start/post/S30samba-start.sh --volname=vol_3442e86b6d994a14de73f1b8c82cf0b8 --first=no --version=1 --volume-op=start --gd-workdir=/var/lib/glusterd [2019-01-21 08:22:53.389049] I [MSGID: 106499] [glusterd-handler.c:4314:__glusterd_handle_status_volume] 
0-management: Received status volume req for volume vol_3442e86b6d994a14de73f1b8c82cf0b8 [2019-01-21 08:23:25.346839] I [MSGID: 106487] [glusterd-handler.c:1486:__glusterd_handle_cli_list_friends] 0-glusterd: Received cli list req We can see following events from when we heal volumes. [2019-01-21 08:20:07.576070] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-glusterfs: error returned while attempting to connect to host:(null), port:0 [2019-01-21 08:20:07.580225] I [cli-rpc-ops.c:9182:gf_cli_heal_volume_cbk] 0-cli: Received resp to heal volume [2019-01-21 08:20:07.580326] I [input.c:31:cli_batch] 0-: Exiting with: -1 [2019-01-21 08:22:30.423311] I [cli.c:768:main] 0-cli: Started running gluster with version 4.1.5 [2019-01-21 08:22:30.463648] I [MSGID: 101190] [event-epoll.c:617:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1 [2019-01-21 08:22:30.463718] I [socket.c:2632:socket_event_handler] 0-transport: EPOLLERR - disconnecting now [2019-01-21 08:22:30.463859] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-glusterfs: error returned while attempting to connect to host:(null), port:0 [2019-01-21 08:22:33.427710] I [socket.c:2632:socket_event_handler] 0-transport: EPOLLERR - disconnecting now [2019-01-21 08:22:34.581555] I [cli-rpc-ops.c:1472:gf_cli_start_volume_cbk] 0-cli: Received resp to start volume [2019-01-21 08:22:34.581678] I [input.c:31:cli_batch] 0-: Exiting with: 0 [2019-01-21 08:22:53.345351] I [cli.c:768:main] 0-cli: Started running gluster with version 4.1.5 [2019-01-21 08:22:53.387992] I [MSGID: 101190] [event-epoll.c:617:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1 [2019-01-21 08:22:53.388059] I [socket.c:2632:socket_event_handler] 0-transport: EPOLLERR - disconnecting now [2019-01-21 08:22:53.388138] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-glusterfs: error returned while attempting to connect to host:(null), port:0 [2019-01-21 08:22:53.394737] I [input.c:31:cli_batch] 0-: Exiting with: 0 [2019-01-21 08:23:25.304688] I 
[cli.c:768:main] 0-cli: Started running gluster with version 4.1.5
[2019-01-21 08:23:25.346319] I [MSGID: 101190] [event-epoll.c:617:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2019-01-21 08:23:25.346389] I [socket.c:2632:socket_event_handler] 0-transport: EPOLLERR - disconnecting now
[2019-01-21 08:23:25.346500] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-glusterfs: error returned while attempting to connect to host:(null), port:0

Please let us know steps to recover bricks.

BR
Salam

--
Amar Tumballi (amarts)

--
Thanks,
Sanju

From dijuremo at gmail.com Thu Jan 24 12:42:49 2019
From: dijuremo at gmail.com (Diego Remolina)
Date: Thu, 24 Jan 2019 07:42:49 -0500
Subject: [Gluster-users] Access to Servers hangs after stop one server...
In-Reply-To:
References:
Message-ID:

Glusterfs needs quorum, so if you have two servers and one goes down, there is no quorum, so all writes stop until the server comes back up.
You can add a third server as an arbiter, which does not store file data in its bricks but still uses a small amount of space (to keep metadata for the files).

HTH,
Diego

On Wed, Jan 23, 2019 at 3:06 PM Gilberto Nunes wrote:

> Hi there...
>
> I have set up two servers as a replica, like this:
>
> gluster vol create Vol01 server1:/data/storage server2:/data/storage
>
> Then I created a config file on the client, like this:
>
> volume remote1
>   type protocol/client
>   option transport-type tcp
>   option remote-host server1
>   option remote-subvolume /data/storage
> end-volume
>
> volume remote2
>   type protocol/client
>   option transport-type tcp
>   option remote-host server2
>   option remote-subvolume /data/storage
> end-volume
>
> volume replicate
>   type cluster/replicate
>   subvolumes remote1 remote2
> end-volume
>
> volume writebehind
>   type performance/write-behind
>   option window-size 1MB
>   subvolumes replicate
> end-volume
>
> volume cache
>   type performance/io-cache
>   option cache-size 512MB
>   subvolumes writebehind
> end-volume
>
> And added this line to /etc/fstab:
>
> /etc/glusterfs/datastore.vol /mnt glusterfs defaults,_netdev 0 0
>
> After mounting /mnt, I can access the servers. So far so good!
> But when I make server1 crash, I am unable to access /mnt or even use
> gluster vol status
> on server2.
>
> Everything hangs!
>
> I have tried with replicated, distributed and replicated-distributed too.
> I am using Debian Stretch, with the gluster package installed via apt,
> provided by the standard Debian repo, glusterfs-server 3.8.8-1.
>
> I am sorry if this is a newbie question, but isn't a glusterfs share
> supposed to stay online if one server goes down?
>
> Any advice will be welcome.
>
> Best
>
> ---
> Gilberto Nunes Ferreira
>
> (47) 3025-5907
> (47) 99676-7530 - Whatsapp / Telegram
>
> Skype: gilberto.nunes36

From jim.kinney at gmail.com Thu Jan 24 12:45:23 2019
From: jim.kinney at gmail.com (Jim Kinney)
Date: Thu, 24 Jan 2019 07:45:23 -0500
Subject: [Gluster-users] Can't write to volume using vim/nano
In-Reply-To:
References: <959c36a43c4353b869fd40468b6b95ab17c143b3.camel@gmail.com>
Message-ID:

I have rdma capability. Will test and report back. I'm still on v 3.12.

On January 24, 2019 12:54:26 AM EST, Amar Tumballi Suryanarayan wrote:
>I suspect this is a bug with the 'Transport: rdma' part. We have called out for
>de-scoping that feature as we are lacking experts in that domain right now.
>Recommend you use the IPoIB option, and use the tcp/socket transport type
>(which is the default). That should mostly fix all the issues.
>
>-Amar
>
>On Thu, Jan 24, 2019 at 5:31 AM Jim Kinney wrote:
>
>> That really sounds like a bug with the sharding. I'm not using sharding on
>> my setup and files are writeable (vim) with 2 bytes and no errors occur.
>> Perhaps the small size is cached until it's large enough to trigger a write
>>
>> On Wed, 2019-01-23 at 21:46 -0200, Lindolfo Meira wrote:
>> >> Also I noticed that any subsequent write (after the first write with 340
>> >> bytes or more), regardless of the size, will work as expected.
>> >>
>> >> Lindolfo Meira, MSc
>> >> Diretor Geral, Centro Nacional de Supercomputação
>> >> Universidade Federal do Rio Grande do Sul
>> >> +55 (51) 3308-3139
>> >>
>> >> On Wed, 23 Jan 2019, Lindolfo Meira wrote:
>> >>
>> >> Just checked: when the write is >= 340 bytes, everything works as
>> >> supposed.
If the write is smaller, the error takes place. And when it
>> >> does, nothing is logged on the server. The client, however, logs the
>> >> following:
>> >>
>> >> [2019-01-23 23:28:54.554664] W [MSGID: 103046] [rdma.c:3502:gf_rdma_decode_header] 0-rpc-transport/rdma: received a msg of type RDMA_ERROR
>> >> [2019-01-23 23:28:54.554728] W [MSGID: 103046] [rdma.c:3939:gf_rdma_process_recv] 0-rpc-transport/rdma: peer (172.24.1.6:49152), couldn't encode or decode the msg properly or write chunks were not provided for replies that were bigger than RDMA_INLINE_THRESHOLD (2048)
>> >> [2019-01-23 23:28:54.554765] W [MSGID: 114031] [client-rpc-fops_v2.c:680:client4_0_writev_cbk] 0-gfs-client-5: remote operation failed [Transport endpoint is not connected]
>> >> [2019-01-23 23:28:54.554850] W [fuse-bridge.c:1436:fuse_err_cbk] 0-glusterfs-fuse: 1723199: FLUSH() ERR => -1 (Transport endpoint is not connected)
>> >>
>> >> Lindolfo Meira, MSc
>> >> Diretor Geral, Centro Nacional de Supercomputação
>> >> Universidade Federal do Rio Grande do Sul
>> >> +55 (51) 3308-3139
>> >>
>> >> On Wed, 23 Jan 2019, Lindolfo Meira wrote:
>> >>
>> >> Hi Jim. Thanks for taking the time.
>> >>
>> >> Sorry I didn't express myself properly. It's not a simple matter of
>> >> permissions. Users can write to the volume alright. It's when vim and nano
>> >> are used, or when small file writes are performed (by cat or echo), that
>> >> it doesn't work. The file is updated with the write in the server, but it
>> >> shows up as empty in the client.
>> >>
>> >> I guess it has something to do with the size of the write, because I ran a
>> >> test writing to a file one byte at a time, and it never showed up as
>> >> having any content in the client (although in the server it kept growing
>> >> accordingly).
>> >>
>> >> I should point out that I'm using a sharded volume. But when I was testing
>> >> a striped volume, it also happened.
Output of "gluster volume info"
>> >> follows below:
>> >>
>> >> Volume Name: gfs
>> >> Type: Distribute
>> >> Volume ID: b5ef065f-1ba2-481f-8108-e8f6d2d3f036
>> >> Status: Started
>> >> Snapshot Count: 0
>> >> Number of Bricks: 6
>> >> Transport-type: rdma
>> >> Bricks:
>> >> Brick1: pfs01-ib:/mnt/data
>> >> Brick2: pfs02-ib:/mnt/data
>> >> Brick3: pfs03-ib:/mnt/data
>> >> Brick4: pfs04-ib:/mnt/data
>> >> Brick5: pfs05-ib:/mnt/data
>> >> Brick6: pfs06-ib:/mnt/data
>> >> Options Reconfigured:
>> >> nfs.disable: on
>> >> features.shard: on
>> >>
>> >> Lindolfo Meira, MSc
>> >> Diretor Geral, Centro Nacional de Supercomputação
>> >> Universidade Federal do Rio Grande do Sul
>> >> +55 (51) 3308-3139
>> >>
>> >> On Wed, 23 Jan 2019, Jim Kinney wrote:
>> >>
>> >> Check permissions on the mount. I have multiple dozens of systems
>> >> mounting 18 "exports" using fuse and it works for multiple user
>> >> read/write based on user access permissions to the mount point space.
>> >> /home is mounted for 150+ users plus another dozen+ lab storage spaces.
>> >> I do manage user access with freeIPA across all systems to keep things
>> >> consistent.
>> >>
>> >> On Wed, 2019-01-23 at 19:31 -0200, Lindolfo Meira wrote:
>> >>
>> >> Am I missing something here? A mere write operation, using vim or
>> >> nano, cannot be performed on a gluster volume mounted over fuse! What
>> >> gives?
>> >>
>> >> Lindolfo Meira, MSc
>> >> Diretor Geral, Centro Nacional de Supercomputação
>> >> Universidade Federal do Rio Grande do Sul
>> >> +55 (51) 3308-3139
>> >>
>> >> --
>> >> James P. Kinney III
>> >>
>> >> Every time you stop a school, you will have to build a jail. What you
>> >> gain at one end you lose at the other. It's like feeding a dog on his
>> >> own tail. It won't fatten the dog.
>> >> - Speech 11/23/1900 Mark Twain
>> >>
>> >> http://heretothereideas.blogspot.com/
>
>--
>Amar Tumballi (amarts)

--
Sent from my Android device with K-9 Mail. All tyopes are thumb related and reflect authenticity.

From moagrawa at redhat.com Thu Jan 24 12:48:41 2019
From: moagrawa at redhat.com (Mohit Agrawal)
Date: Thu, 24 Jan 2019 18:18:41 +0530
Subject: [Gluster-users] [Bugs] Bricks are going offline unable to recover with heal/start force commands
In-Reply-To:
References:
Message-ID:

Hi Salam,

Based on the information currently available, the brick's pidfile contains "271" as its pid:

cat 192.168.3.5-var-lib-heketi-mounts-vg_d5f17487744584e3652d3ca943b0b91b-brick_e15c12cceae12c8ab7782dd57cf5b6c1-brick.pid
271

Can you please check on the nodes whether any process is running with that pid?

ps -aef | grep 271

If a process other than glusterfsd is running with pid "271", you can use the following workaround:

1) Clean up the pidfile:
   >192.168.3.5-var-lib-heketi-mounts-vg_d5f17487744584e3652d3ca943b0b91b-brick_e15c12cceae12c8ab7782dd57cf5b6c1-brick.pid
2) Start the volume with the force option.

The volume should then start. If you are still not able to start the volume, kindly share the glusterd logs as well as the brick logs for that brick.
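
The pid check described above could be sketched as a small shell helper. This is only an illustration and not part of the gluster tooling: the function name classify_pidfile and the words it prints are made up for this example, and it assumes `ps -p PID -o comm=` is available on the node.

```shell
#!/bin/sh
# Hypothetical helper (not a gluster command): classify a brick pidfile.
# Prints one of: missing, empty, stale, foreign, ok.
classify_pidfile() {
    pidfile=$1

    # No pidfile at all.
    [ -f "$pidfile" ] || { echo missing; return; }

    # Pidfile exists but holds nothing readable.
    pid=$(tr -d '[:space:]' < "$pidfile")
    [ -n "$pid" ] || { echo empty; return; }

    # Ask ps for the command name that currently owns this pid.
    comm=$(ps -p "$pid" -o comm= 2>/dev/null)
    [ -n "$comm" ] || { echo stale; return; }

    # The pid is alive: is it really a brick process?
    case $comm in
        *glusterfsd*) echo ok ;;
        *)            echo foreign ;;
    esac
}
```

A result of "foreign" corresponds to the case described in this message (pid reused by something other than glusterfsd), where the pidfile is truncated and the volume is started with `gluster volume start <volname> force`. The helper itself changes nothing; it only reports, so it could be looped over /var/run/gluster/vols/*/*.pid before deciding what to clean up.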
Thanks, Mohit Agrawal On Thu, Jan 24, 2019 at 5:14 PM Shaik Salam wrote: > Hi Mohit, > > We are facing this issue from last one month could you please atleast > provide workaround to move further. > Please let me know any logs required. Thanks in advance. > > BR > Salam > > > > From: "Sanju Rakonde" > To: "Mohit Agrawal" , "Shaik Salam" < > shaik.salam at tcs.com> > Cc: "Amar Tumballi Suryanarayan" , " > gluster-users at gluster.org List" , "Murali > Kottakota" > Date: 01/24/2019 04:12 PM > Subject: Re: [Gluster-users] [Bugs] Bricks are going offline > unable to recover with heal/start force commands > ------------------------------ > > > > *"External email. Open with Caution"* > Mohit, > > Have we came across this kind of issue? This user using gluster 4.1 > version. Did we fix any related bug afterwards? > > Looks like setup has some issues but I'm not sure. > > On Thu, Jan 24, 2019 at 4:01 PM Shaik Salam <*shaik.salam at tcs.com* > > wrote: > > > Hi Sanju, > > Please find requested information (these are latest logs :) ). 
> > I can see only following error messages related to brick > "brick_e15c12cceae12c8ab7782dd57cf5b6c1" (on secondnode log) > > [2019-01-23 11:50:20.322902] I > [glusterd-utils.c:5994:glusterd_brick_start] 0-management: discovered > already-running brick > /var/lib/heketi/mounts/vg_d5f17487744584e3652d3ca943b0b91b/brick_e15c12cceae12c8ab7782dd57cf5b6c1/brick > [2019-01-23 11:50:20.322925] I [MSGID: 106142] > [glusterd-pmap.c:297:pmap_registry_bind] 0-pmap: adding brick > /var/lib/heketi/mounts/vg_d5f17487744584e3652d3ca943b0b91b/brick_e15c12cceae12c8ab7782dd57cf5b6c1/brick > on port 49165 >> showing running on port but not > [2019-01-23 11:50:20.327557] I [MSGID: 106131] > [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: nfs already > stopped > [2019-01-23 11:50:20.327586] I [MSGID: 106568] > [glusterd-svc-mgmt.c:235:glusterd_svc_stop] 0-management: nfs service is > stopped > [2019-01-23 11:50:20.327604] I [MSGID: 106599] > [glusterd-nfs-svc.c:82:glusterd_nfssvc_manager] 0-management: nfs/server.so > xlator is not installed > [2019-01-23 11:50:20.337735] I [MSGID: 106568] > [glusterd-proc-mgmt.c:87:glusterd_proc_stop] 0-management: Stopping > glustershd daemon running in pid: 69525 > [2019-01-23 11:50:21.338058] I [MSGID: 106568] > [glusterd-svc-mgmt.c:235:glusterd_svc_stop] 0-management: glustershd > service is stopped > [2019-01-23 11:50:21.338180] I [MSGID: 106567] > [glusterd-svc-mgmt.c:203:glusterd_svc_start] 0-management: Starting > glustershd service > [2019-01-23 11:50:21.348234] I [MSGID: 106131] > [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: bitd already > stopped > [2019-01-23 11:50:21.348285] I [MSGID: 106568] > [glusterd-svc-mgmt.c:235:glusterd_svc_stop] 0-management: bitd service is > stopped > [2019-01-23 11:50:21.348866] I [MSGID: 106131] > [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: scrub already > stopped > [2019-01-23 11:50:21.348883] I [MSGID: 106568] > [glusterd-svc-mgmt.c:235:glusterd_svc_stop] 
0-management: scrub service is > stopped > [2019-01-23 11:50:22.356502] I [run.c:241:runner_log] > (-->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2b3a) > [0x7fca9e139b3a] > -->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2605) > [0x7fca9e139605] -->/lib64/libglusterfs.so.0(runner_log+0x115) > [0x7fcaa346f0e5] ) 0-management: Ran script: > /var/lib/glusterd/hooks/1/start/post/S29CTDBsetup.sh > --volname=vol_3442e86b6d994a14de73f1b8c82cf0b8 --first=no --version=1 > --volume-op=start --gd-workdir=/var/lib/glusterd > [2019-01-23 11:50:22.368845] E [run.c:241:runner_log] > (-->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2b3a) > [0x7fca9e139b3a] > -->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2563) > [0x7fca9e139563] -->/lib64/libglusterfs.so.0(runner_log+0x115) > [0x7fcaa346f0e5] ) 0-management: Failed to execute script: > /var/lib/glusterd/hooks/1/start/post/S30samba-start.sh > --volname=vol_3442e86b6d994a14de73f1b8c82cf0b8 --first=no --version=1 > --volume-op=start --gd-workdir=/var/lib/glusterd > > > sh-4.2# gluster volume status vol_3442e86b6d994a14de73f1b8c82cf0b8 > Status of volume: vol_3442e86b6d994a14de73f1b8c82cf0b8 > Gluster process TCP Port RDMA Port Online > Pid > > ------------------------------------------------------------------------------ > Brick 192.168.3.6:/var/lib/heketi/mounts/vg > _ca57f326195c243be2380ce4e42a4191/brick_952 > d75fd193c7209c9a81acbc23a3747/brick 49157 0 Y > 250 > Brick 192.168.3.5:/var/lib/heketi/mounts/vg > _d5f17487744584e3652d3ca943b0b91b/brick_e15 > c12cceae12c8ab7782dd57cf5b6c1/brick N/A N/A N > N/A > Brick 192.168.3.15:/var/lib/heketi/mounts/v > g_462ea199185376b03e4b0317363bb88c/brick_17 > 36459d19e8aaa1dcb5a87f48747d04/brick 49173 0 Y > 225 > Self-heal Daemon on localhost N/A N/A Y > 109550 > Self-heal Daemon on 192.168.3.6 N/A N/A Y > 52557 > Self-heal Daemon on 192.168.3.15 N/A N/A Y > 16946 > > Task Status of Volume vol_3442e86b6d994a14de73f1b8c82cf0b8 > > 
------------------------------------------------------------------------------ > There are no active volume tasks > > > BR > Salam > > > > From: "Sanju Rakonde" <*srakonde at redhat.com* > > To: "Shaik Salam" <*shaik.salam at tcs.com* > > Cc: "Amar Tumballi Suryanarayan" <*atumball at redhat.com* > >, "*gluster-users at gluster.org* > List" <*gluster-users at gluster.org* > >, "Murali Kottakota" < > *murali.kottakota at tcs.com* > > Date: 01/24/2019 02:32 PM > Subject: Re: [Gluster-users] [Bugs] Bricks are going offline > unable to recover with heal/start force commands > ------------------------------ > > > > * "External email. Open with Caution"* > Shaik, > > Sorry to ask this again. What errors are you seeing in glusterd logs? Can > you share the latest logs? > > On Thu, Jan 24, 2019 at 2:05 PM Shaik Salam <*shaik.salam at tcs.com* > > wrote: > Hi Sanju, > > Please find requsted information. > > Are you still seeing the error "Unable to read pidfile:" in glusterd log? > >>>> No > Are you seeing "brick is deemed not to be a part of the volume" error in > glusterd log?>>>> No > > sh-4.2# getfattr -m -d -e hex > /var/lib/heketi/mounts/vg_d5f17487744584e3652d3ca943b0b91b/brick_e15c12cceae12c8ab7782dd57cf5b6c1/brick > sh-4.2# getfattr -m -d -e hex > /var/lib/heketi/mounts/vg_d5f17487744584e3652d3ca943b0b91b/brick_e15c12cceae1^C8ab7782dd57cf5b6c1/brick > sh-4.2# pwd > > /var/lib/heketi/mounts/vg_d5f17487744584e3652d3ca943b0b91b/brick_e15c12cceae12c8ab7782dd57cf5b6c1/brick > sh-4.2# getfattr -m -d -e hex > /var/lib/heketi/mounts/vg_d5f17487744584e3652d3ca943b0b91b/brick_e15c12cceae12c8ab7782dd57cf5b6c1/brick > sh-4.2# getfattr -m -d -e hex > /var/lib/heketi/mounts/vg_d5f17487744584e3652d3ca943b0b91b/brick_e15c12cceae12c8ab7782dd57cf5b6c1/brick/ > sh-4.2# getfattr -m -d -e hex > /var/lib/heketi/mounts/vg_d5f17487744584e3652d3ca943b0b91b/brick_e15c12cceae12c8ab7782dd57cf5b6c1/brick/ > sh-4.2# getfattr -m -d -e hex > 
/var/lib/heketi/mounts/vg_d5f17487744584e3652d3ca943b0b91b/brick_e15c12cceae12c8ab7782dd57cf5b6c1/brick/ > sh-4.2# getfattr -d -m . -e hex > /var/lib/heketi/mounts/vg_d5f17487744584e3652d3ca943b0b91b/brick_e15c12cceae12c8ab7782dd57cf5b6c1/brick/ > getfattr: Removing leading '/' from absolute path names > # file: > var/lib/heketi/mounts/vg_d5f17487744584e3652d3ca943b0b91b/brick_e15c12cceae12c8ab7782dd57cf5b6c1/brick/ > > security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000 > trusted.afr.dirty=0x000000000000000000000000 > > trusted.afr.vol_3442e86b6d994a14de73f1b8c82cf0b8-client-0=0x000000000000000000000000 > trusted.gfid=0x00000000000000000000000000000001 > trusted.glusterfs.dht=0x000000010000000000000000ffffffff > trusted.glusterfs.volume-id=0x15477f3622e84757a0ce9000b63fa849 > > sh-4.2# ls -la |wc -l > 86 > sh-4.2# pwd > > /var/lib/heketi/mounts/vg_d5f17487744584e3652d3ca943b0b91b/brick_e15c12cceae12c8ab7782dd57cf5b6c1/brick > sh-4.2# > > > > From: "Sanju Rakonde" <*srakonde at redhat.com* > > To: "Shaik Salam" <*shaik.salam at tcs.com* > > Cc: "Amar Tumballi Suryanarayan" <*atumball at redhat.com* > >, "*gluster-users at gluster.org* > List" <*gluster-users at gluster.org* > >, "Murali Kottakota" < > *murali.kottakota at tcs.com* > > Date: 01/24/2019 01:38 PM > Subject: Re: [Gluster-users] [Bugs] Bricks are going offline > unable to recover with heal/start force commands > ------------------------------ > > > > * "External email. Open with Caution"* > Shaik, > > Previously I was suspecting, whether brick pid file is missing. But I see > it is present. > > From second node (this brick is in offline state): > /var/run/gluster/vols/vol_3442e86b6d994a14de73f1b8c82cf0b8/192.168.3.5-var-lib-heketi-mounts-vg_d5f17487744584e3652d3ca943b0b91b-brick_e15c12cceae12c8ab7782dd57cf5b6c1-brick.pid > > 271 > Are you still seeing the error "Unable to read pidfile:" in glusterd log? 
> > I also suspect whether brick is missing its extended attributes. Are you > seeing "brick is deemed not to be a part of the volume" error in glusterd > log? If not can you please provide us output of "getfattr -m -d -e hex > " > > On Thu, Jan 24, 2019 at 12:18 PM Shaik Salam <*shaik.salam at tcs.com* > > wrote: > Hi Sanju, > > Could you please have look my issue if you have time (atleast provide > workaround). > > BR > Salam > > > > From: Shaik Salam/HYD/TCS > To: "Sanju Rakonde" <*srakonde at redhat.com* > > Cc: "Amar Tumballi Suryanarayan" <*atumball at redhat.com* > >, "*gluster-users at gluster.org* > List" <*gluster-users at gluster.org* > >, "Murali Kottakota" < > *murali.kottakota at tcs.com* > > Date: 01/23/2019 05:50 PM > Subject: Re: [Gluster-users] [Bugs] Bricks are going offline > unable to recover with heal/start force commands > ------------------------------ > > > > > Hi Sanju, > > Please find requested information. > > Sorry to repeat again I am trying start force command once brick log > enabled to debug by taking one volume example. > Please correct me If I am doing wrong. 
> > > [root at master ~]# oc rsh glusterfs-storage-vll7x > sh-4.2# gluster volume info vol_3442e86b6d994a14de73f1b8c82cf0b8 > > Volume Name: vol_3442e86b6d994a14de73f1b8c82cf0b8 > Type: Replicate > Volume ID: 15477f36-22e8-4757-a0ce-9000b63fa849 > Status: Started > Snapshot Count: 0 > Number of Bricks: 1 x 3 = 3 > Transport-type: tcp > Bricks: > Brick1: 192.168.3.6: > /var/lib/heketi/mounts/vg_ca57f326195c243be2380ce4e42a4191/brick_952d75fd193c7209c9a81acbc23a3747/brick > Brick2: 192.168.3.5: > /var/lib/heketi/mounts/vg_d5f17487744584e3652d3ca943b0b91b/ > brick_e15c12cceae12c8ab7782dd57cf5b6c1/brick > Brick3: 192.168.3.15: > /var/lib/heketi/mounts/vg_462ea199185376b03e4b0317363bb88c/brick_1736459d19e8aaa1dcb5a87f48747d04/brick > Options Reconfigured: > diagnostics.brick-log-level: INFO > performance.client-io-threads: off > nfs.disable: on > transport.address-family: inet > sh-4.2# gluster volume status vol_3442e86b6d994a14de73f1b8c82cf0b8 > Status of volume: vol_3442e86b6d994a14de73f1b8c82cf0b8 > Gluster process TCP Port RDMA Port Online > Pid > > ------------------------------------------------------------------------------ > Brick 192.168.3.6:/var/lib/heketi/mounts/vg > _ca57f326195c243be2380ce4e42a4191/brick_952 > d75fd193c7209c9a81acbc23a3747/brick 49157 0 Y > 250 > Brick 192.168.3.5:/var/lib/heketi/mounts/vg > _d5f17487744584e3652d3ca943b0b91b/brick_e15 > c12cceae12c8ab7782dd57cf5b6c1/brick N/A N/A N > N/A > Brick 192.168.3.15:/var/lib/heketi/mounts/v > g_462ea199185376b03e4b0317363bb88c/brick_17 > 36459d19e8aaa1dcb5a87f48747d04/brick 49173 0 Y > 225 > Self-heal Daemon on localhost N/A N/A Y > 108434 > Self-heal Daemon on matrix1.matrix.orange.l > ab N/A N/A Y > 69525 > Self-heal Daemon on matrix2.matrix.orange.l > ab N/A N/A Y > 18569 > > gluster volume set vol_3442e86b6d994a14de73f1b8c82cf0b8 > diagnostics.brick-log-level DEBUG > volume set: success > sh-4.2# gluster volume get vol_3442e86b6d994a14de73f1b8c82cf0b8 all |grep > log > cluster.entry-change-log 
on > cluster.data-change-log on > cluster.metadata-change-log on > diagnostics.brick-log-level DEBUG > > sh-4.2# cd /var/log/glusterfs/bricks/ > sh-4.2# ls -la |grep brick_e15c12cceae12c8ab7782dd57cf5b6c1 > -rw-------. 1 root root 0 Jan 20 02:46 > var-lib-heketi-mounts-vg_d5f17487744584e3652d3ca943b0b91b-brick_e15c12cceae12c8ab7782dd57cf5b6c1-brick.log > >>> Noting in log > > -rw-------. 1 root root 189057 Jan 18 09:20 > var-lib-heketi-mounts-vg_d5f17487744584e3652d3ca943b0b91b-brick_e15c12cceae12c8ab7782dd57cf5b6c1-brick.log-20190120 > > [2019-01-23 11:49:32.475956] I [run.c:241:runner_log] > (-->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2b3a) > [0x7fca9e139b3a] > -->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2605) > [0x7fca9e139605] -->/lib64/libglusterfs.so.0(runner_log+0x115) > [0x7fcaa346f0e5] ) 0-management: Ran script: > /var/lib/glusterd/hooks/1/set/post/S30samba-set.sh > --volname=vol_3442e86b6d994a14de73f1b8c82cf0b8 -o > diagnostics.brick-log-level=DEBUG --gd-workdir=/var/lib/glusterd > [2019-01-23 11:49:32.483191] I [run.c:241:runner_log] > (-->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2b3a) > [0x7fca9e139b3a] > -->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2605) > [0x7fca9e139605] -->/lib64/libglusterfs.so.0(runner_log+0x115) > [0x7fcaa346f0e5] ) 0-management: Ran script: > /var/lib/glusterd/hooks/1/set/post/S32gluster_enable_shared_storage.sh > --volname=vol_3442e86b6d994a14de73f1b8c82cf0b8 -o > diagnostics.brick-log-level=DEBUG --gd-workdir=/var/lib/glusterd > [2019-01-23 11:48:59.111292] W [MSGID: 106036] > [glusterd-snapshot.c:9514:glusterd_handle_snapshot_fn] 0-management: > Snapshot list failed > [2019-01-23 11:50:14.112271] E [MSGID: 106026] > [glusterd-snapshot.c:3962:glusterd_handle_snapshot_list] 0-management: > Volume (vol_63854b105c40802bdec77290e91858ea) does not exist [Invalid > argument] > [2019-01-23 11:50:14.112305] W [MSGID: 106036] > 
[glusterd-snapshot.c:9514:glusterd_handle_snapshot_fn] 0-management: Snapshot list failed
> [2019-01-23 11:50:20.322902] I [glusterd-utils.c:5994:glusterd_brick_start] 0-management: discovered already-running brick /var/lib/heketi/mounts/vg_d5f17487744584e3652d3ca943b0b91b/brick_e15c12cceae12c8ab7782dd57cf5b6c1/brick
> [2019-01-23 11:50:20.322925] I [MSGID: 106142] [glusterd-pmap.c:297:pmap_registry_bind] 0-pmap: adding brick /var/lib/heketi/mounts/vg_d5f17487744584e3652d3ca943b0b91b/brick_e15c12cceae12c8ab7782dd57cf5b6c1/brick on port 49165
> [2019-01-23 11:50:20.327557] I [MSGID: 106131] [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: nfs already stopped
> [2019-01-23 11:50:20.327586] I [MSGID: 106568] [glusterd-svc-mgmt.c:235:glusterd_svc_stop] 0-management: nfs service is stopped
> [2019-01-23 11:50:20.327604] I [MSGID: 106599] [glusterd-nfs-svc.c:82:glusterd_nfssvc_manager] 0-management: nfs/server.so xlator is not installed
> [2019-01-23 11:50:20.337735] I [MSGID: 106568] [glusterd-proc-mgmt.c:87:glusterd_proc_stop] 0-management: Stopping glustershd daemon running in pid: 69525
> [2019-01-23 11:50:21.338058] I [MSGID: 106568] [glusterd-svc-mgmt.c:235:glusterd_svc_stop] 0-management: glustershd service is stopped
> [2019-01-23 11:50:21.338180] I [MSGID: 106567] [glusterd-svc-mgmt.c:203:glusterd_svc_start] 0-management: Starting glustershd service
> [2019-01-23 11:50:21.348234] I [MSGID: 106131] [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: bitd already stopped
> [2019-01-23 11:50:21.348285] I [MSGID: 106568] [glusterd-svc-mgmt.c:235:glusterd_svc_stop] 0-management: bitd service is stopped
> [2019-01-23 11:50:21.348866] I [MSGID: 106131] [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: scrub already stopped
> [2019-01-23 11:50:21.348883] I [MSGID: 106568] [glusterd-svc-mgmt.c:235:glusterd_svc_stop] 0-management: scrub service is stopped
> [2019-01-23 11:50:22.356502] I [run.c:241:runner_log] (-->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2b3a) [0x7fca9e139b3a] -->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2605) [0x7fca9e139605] -->/lib64/libglusterfs.so.0(runner_log+0x115) [0x7fcaa346f0e5] ) 0-management: Ran script: /var/lib/glusterd/hooks/1/start/post/S29CTDBsetup.sh --volname=vol_3442e86b6d994a14de73f1b8c82cf0b8 --first=no --version=1 --volume-op=start --gd-workdir=/var/lib/glusterd
> [2019-01-23 11:50:22.368845] E [run.c:241:runner_log] (-->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2b3a) [0x7fca9e139b3a] -->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2563) [0x7fca9e139563] -->/lib64/libglusterfs.so.0(runner_log+0x115) [0x7fcaa346f0e5] ) 0-management: Failed to execute script: /var/lib/glusterd/hooks/1/start/post/S30samba-start.sh --volname=vol_3442e86b6d994a14de73f1b8c82cf0b8 --first=no --version=1 --volume-op=start --gd-workdir=/var/lib/glusterd
>
> sh-4.2# gluster volume status vol_3442e86b6d994a14de73f1b8c82cf0b8
> Status of volume: vol_3442e86b6d994a14de73f1b8c82cf0b8
> Gluster process                             TCP Port  RDMA Port  Online  Pid
> ------------------------------------------------------------------------------
> Brick 192.168.3.6:/var/lib/heketi/mounts/vg
> _ca57f326195c243be2380ce4e42a4191/brick_952
> d75fd193c7209c9a81acbc23a3747/brick         49157     0          Y       250
> Brick 192.168.3.5:/var/lib/heketi/mounts/vg
> _d5f17487744584e3652d3ca943b0b91b/brick_e15
> c12cceae12c8ab7782dd57cf5b6c1/brick         N/A       N/A        N       N/A
> Brick 192.168.3.15:/var/lib/heketi/mounts/v
> g_462ea199185376b03e4b0317363bb88c/brick_17
> 36459d19e8aaa1dcb5a87f48747d04/brick        49173     0          Y       225
> Self-heal Daemon on localhost               N/A       N/A        Y       109550
> Self-heal Daemon on 192.168.3.6             N/A       N/A        Y       52557
> Self-heal Daemon on 192.168.3.15            N/A       N/A        Y       16946
>
> Task Status of Volume vol_3442e86b6d994a14de73f1b8c82cf0b8
> ------------------------------------------------------------------------------
> There are no active volume tasks
>
>
> From: "Sanju Rakonde" <srakonde at redhat.com>
> To: "Shaik Salam" <shaik.salam at tcs.com>
> Cc: "Amar Tumballi Suryanarayan" <atumball at redhat.com>, "gluster-users at gluster.org List" <gluster-users at gluster.org>, "Murali Kottakota" <murali.kottakota at tcs.com>
> Date: 01/23/2019 02:15 PM
> Subject: Re: [Gluster-users] [Bugs] Bricks are going offline unable to recover with heal/start force commands
>
> Hi Shaik,
>
> I can see the errors below in the glusterd logs.
>
> [2019-01-22 09:20:17.540196] E [MSGID: 101012] [common-utils.c:4010:gf_is_service_running] 0-: Unable to read pidfile: /var/run/gluster/vols/vol_e1aa1283d5917485d88c4a742eeff422/192.168.3.6-var-lib-heketi-mounts-vg_526f35058433c6b03130bba4e0a7dd87-brick_9e7c382e5f853d471c347bc5590359af-brick.pid
> [2019-01-22 09:20:17.546408] E [MSGID: 101012] [common-utils.c:4010:gf_is_service_running] 0-: Unable to read pidfile: /var/run/gluster/vols/vol_f0ed498d7e781d7bb896244175b31f9e/192.168.3.6-var-lib-heketi-mounts-vg_56391bec3c8bfe4fc116de7bddfc2af4-brick_47ed9e0663ad0f6f676ddd6ad7e3dcde-brick.pid
> [2019-01-22 09:20:17.552575] E [MSGID: 101012] [common-utils.c:4010:gf_is_service_running] 0-: Unable to read pidfile: /var/run/gluster/vols/vol_f387519c9b004ec14e80696db88ef0f8/192.168.3.6-var-lib-heketi-mounts-vg_56391bec3c8bfe4fc116de7bddfc2af4-brick_06ad6c73dfbf6a5fc21334f98c9973c2-brick.pid
> [2019-01-22 09:20:17.558888] E [MSGID: 101012] [common-utils.c:4010:gf_is_service_running] 0-: Unable to read pidfile: /var/run/gluster/vols/vol_f8ca343c60e6efe541fe02d16ca02a7d/192.168.3.6-var-lib-heketi-mounts-vg_526f35058433c6b03130bba4e0a7dd87-brick_525225f65753b05dfe33aeaeb9c5de39-brick.pid
> [2019-01-22 09:20:17.565266] E [MSGID: 101012] [common-utils.c:4010:gf_is_service_running] 0-: Unable to read pidfile: /var/run/gluster/vols/vol_fe882e074c0512fd9271fc2ff5a0bfe1/192.168.3.6-var-lib-heketi-mounts-vg_28708570b029e5eff0a996c453a11691-brick_d4f30d6e465a8544b759a7016fb5aab5-brick.pid
> [2019-01-22 09:20:17.585926] E [MSGID: 106028] [glusterd-utils.c:8222:glusterd_brick_signal] 0-glusterd: Unable to get pid of brick process
> [2019-01-22 09:20:17.617806] E [MSGID: 106028] [glusterd-utils.c:8222:glusterd_brick_signal] 0-glusterd: Unable to get pid of brick process
> [2019-01-22 09:20:17.649628] E [MSGID: 101012] [common-utils.c:4010:gf_is_service_running] 0-: Unable to read pidfile: /var/run/gluster/glustershd/glustershd.pid
> [2019-01-22 09:20:17.649700] E [MSGID: 101012] [common-utils.c:4010:gf_is_service_running] 0-: Unable to read pidfile: /var/run/gluster/glustershd/glustershd.pid
>
> So it looks like neither gf_is_service_running() nor glusterd_brick_signal()
> is able to read the pidfile. That means the pidfiles might be empty.
>
> Can you please paste the contents of the brick pidfiles? You can find the
> brick pidfiles in /var/run/gluster/vols//, or you can just run this command:
>
> for i in `ls /var/run/gluster/vols/*/*.pid`;do echo $i;cat $i;done
>
> On Wed, Jan 23, 2019 at 12:49 PM Shaik Salam <shaik.salam at tcs.com> wrote:
> Hi Sanju,
>
> Please find the requested information in the attached logs.
>
> The brick below is offline. We tried the start force and heal commands,
> but it does not come up.
>
> sh-4.2# gluster --version
> glusterfs 4.1.5
>
> sh-4.2# gluster volume status vol_3442e86b6d994a14de73f1b8c82cf0b8
> Status of volume: vol_3442e86b6d994a14de73f1b8c82cf0b8
> Gluster process                             TCP Port  RDMA Port  Online  Pid
> ------------------------------------------------------------------------------
> Brick 192.168.3.6:/var/lib/heketi/mounts/vg
> _ca57f326195c243be2380ce4e42a4191/brick_952
> d75fd193c7209c9a81acbc23a3747/brick         49166     0          Y       269
> Brick 192.168.3.5:/var/lib/heketi/mounts/vg
> _d5f17487744584e3652d3ca943b0b91b/brick_e15
> c12cceae12c8ab7782dd57cf5b6c1/brick         N/A       N/A        N       N/A
> Brick 192.168.3.15:/var/lib/heketi/mounts/v
> g_462ea199185376b03e4b0317363bb88c/brick_17
> 36459d19e8aaa1dcb5a87f48747d04/brick        49173     0          Y       225
> Self-heal Daemon on localhost               N/A       N/A        Y       45826
> Self-heal Daemon on 192.168.3.6             N/A       N/A        Y       65196
> Self-heal Daemon on 192.168.3.15            N/A       N/A        Y       52915
>
> Task Status of Volume vol_3442e86b6d994a14de73f1b8c82cf0b8
> ------------------------------------------------------------------------------
>
> We can see the following events when we force-start the volume:
>
> /mgmt/glusterd.so(+0xe2b3a) [0x7fca9e139b3a] -->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2605) [0x7fca9e139605] -->/lib64/libglusterfs.so.0(runner_log+0x115) [0x7fcaa346f0e5] ) 0-management: Ran script: /var/lib/glusterd/hooks/1/start/post/S29CTDBsetup.sh --volname=vol_3442e86b6d994a14de73f1b8c82cf0b8 --first=no --version=1 --volume-op=start --gd-workdir=/var/lib/glusterd
> [2019-01-21 08:22:34.555068] E [run.c:241:runner_log] (-->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2b3a) [0x7fca9e139b3a] -->/usr/lib64/glusterfs/4.1.5/xlator/mgmt/glusterd.so(+0xe2563) [0x7fca9e139563] -->/lib64/libglusterfs.so.0(runner_log+0x115) [0x7fcaa346f0e5] ) 0-management: Failed to execute script: /var/lib/glusterd/hooks/1/start/post/S30samba-start.sh --volname=vol_3442e86b6d994a14de73f1b8c82cf0b8 --first=no --version=1 --volume-op=start --gd-workdir=/var/lib/glusterd
> [2019-01-21 08:22:53.389049] I [MSGID: 106499] [glusterd-handler.c:4314:__glusterd_handle_status_volume] 0-management: Received status volume req for volume vol_3442e86b6d994a14de73f1b8c82cf0b8
> [2019-01-21 08:23:25.346839] I [MSGID: 106487] [glusterd-handler.c:1486:__glusterd_handle_cli_list_friends] 0-glusterd: Received cli list req
>
> We can see the following events when we heal the volume:
>
> [2019-01-21 08:20:07.576070] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-glusterfs: error returned while attempting to connect to host:(null), port:0
> [2019-01-21 08:20:07.580225] I [cli-rpc-ops.c:9182:gf_cli_heal_volume_cbk] 0-cli: Received resp to heal volume
> [2019-01-21 08:20:07.580326] I [input.c:31:cli_batch] 0-: Exiting with: -1
> [2019-01-21 08:22:30.423311] I [cli.c:768:main] 0-cli: Started running gluster with version 4.1.5
> [2019-01-21 08:22:30.463648] I [MSGID: 101190] [event-epoll.c:617:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
> [2019-01-21 08:22:30.463718] I [socket.c:2632:socket_event_handler] 0-transport: EPOLLERR - disconnecting now
> [2019-01-21 08:22:30.463859] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-glusterfs: error returned while attempting to connect to host:(null), port:0
> [2019-01-21 08:22:33.427710] I [socket.c:2632:socket_event_handler] 0-transport: EPOLLERR - disconnecting now
> [2019-01-21 08:22:34.581555] I [cli-rpc-ops.c:1472:gf_cli_start_volume_cbk] 0-cli: Received resp to start volume
> [2019-01-21 08:22:34.581678] I [input.c:31:cli_batch] 0-: Exiting with: 0
> [2019-01-21 08:22:53.345351] I [cli.c:768:main] 0-cli: Started running gluster with version 4.1.5
> [2019-01-21 08:22:53.387992] I [MSGID: 101190] [event-epoll.c:617:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
> [2019-01-21 08:22:53.388059] I [socket.c:2632:socket_event_handler] 0-transport: EPOLLERR - disconnecting now
> [2019-01-21 08:22:53.388138] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-glusterfs: error returned while attempting to connect to host:(null), port:0
> [2019-01-21 08:22:53.394737] I [input.c:31:cli_batch] 0-: Exiting with: 0
> [2019-01-21 08:23:25.304688] I [cli.c:768:main] 0-cli: Started running gluster with version 4.1.5
> [2019-01-21 08:23:25.346319] I [MSGID: 101190] [event-epoll.c:617:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
> [2019-01-21 08:23:25.346389] I [socket.c:2632:socket_event_handler] 0-transport: EPOLLERR - disconnecting now
> [2019-01-21 08:23:25.346500] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-glusterfs: error returned while attempting to connect to host:(null), port:0
>
> I enabled DEBUG mode at brick level, but nothing is written to the brick log.
>
> gluster volume set vol_3442e86b6d994a14de73f1b8c82cf0b8 diagnostics.brick-log-level DEBUG
>
> sh-4.2# pwd
> /var/log/glusterfs/bricks
>
> sh-4.2# ls -la |grep brick_e15c12cceae12c8ab7782dd57cf5b6c1
> -rw-------. 1 root root 0 Jan 20 02:46 var-lib-heketi-mounts-vg_d5f17487744584e3652d3ca943b0b91b-brick_e15c12cceae12c8ab7782dd57cf5b6c1-brick.log
>
>
> From: Sanju Rakonde <srakonde at redhat.com>
> To: Shaik Salam <shaik.salam at tcs.com>
> Cc: Amar Tumballi Suryanarayan <atumball at redhat.com>, "gluster-users at gluster.org List" <gluster-users at gluster.org>
> Date: 01/22/2019 02:21 PM
> Subject: Re: [Gluster-users] [Bugs] Bricks are going offline unable to recover with heal/start force commands
>
> Hi Shaik,
>
> Can you please provide us the complete glusterd and cmd_history logs from
> all the nodes in the cluster? Also, please paste the output of the
> following commands (from all nodes):
> 1. gluster --version
> 2. gluster volume info
> 3. gluster volume status
> 4. gluster peer status
> 5. ps -ax | grep glusterfsd
>
> On Tue, Jan 22, 2019 at 12:47 PM Shaik Salam <shaik.salam at tcs.com> wrote:
> Hi Surya,
>
> This is already a customer setup and we can't redeploy it.
> I enabled debug for the brick-level log, but nothing is written to it.
> Can you tell me whether there are other ways to troubleshoot, or other
> logs to look at?
>
>
> From: Shaik Salam/HYD/TCS
> To: "Amar Tumballi Suryanarayan" <atumball at redhat.com>
> Cc: "gluster-users at gluster.org List" <gluster-users at gluster.org>
> Date: 01/22/2019 12:06 PM
> Subject: Re: [Bugs] Bricks are going offline unable to recover with heal/start force commands
>
> Hi Surya,
>
> I have enabled DEBUG mode at brick level, but nothing is written to the
> brick log.
>
> BR
> Salam
>
>
> From: "Amar Tumballi Suryanarayan" <atumball at redhat.com>
> To: "Shaik Salam" <shaik.salam at tcs.com>
> Cc: "gluster-users at gluster.org List" <gluster-users at gluster.org>
> Date: 01/22/2019 11:38 AM
> Subject: Re: [Bugs] Bricks are going offline unable to recover with heal/start force commands
>
> Hi Shaik,
>
> Can you check what is in the brick logs? They are located in
> /var/log/glusterfs/bricks/*.
>
> It looks like the samba hook script failed, but that shouldn't matter in
> this use case.
>
> Also, I see that you are trying to set up heketi to provision volumes,
> which means you may be using gluster in container use cases. If you are
> still in the 'PoC' phase, can you give https://github.com/gluster/gcs a
> try? That makes the deployment and the stack a little simpler.
>
> -Amar
>
> On Tue, Jan 22, 2019 at 11:29 AM Shaik Salam <shaik.salam at tcs.com> wrote:
> Can anyone tell us how to recover the bricks, apart from heal/start force,
> given the events from the logs below?
> Please let me know if any other logs are required.
> Thanks in advance.
>
> BR
> Salam
>
>
> From: Shaik Salam/HYD/TCS
> To: bugs at gluster.org, gluster-users at gluster.org
> Date: 01/21/2019 10:03 PM
> Subject: Bricks are going offline unable to recover with heal/start force commands
>
> Hi,
>
> Bricks are offline and we are unable to recover them with the following
> commands:
>
> gluster volume heal
>
> gluster volume start force
>
> The bricks are still offline.
>
> Please let us know the steps to recover the bricks.
>
> BR
> Salam
>
> --
> Amar Tumballi (amarts)
>
> --
> Thanks,
> Sanju

From gilberto.nunes32 at gmail.com  Thu Jan 24 13:23:15 2019
From: gilberto.nunes32 at gmail.com (Gilberto Nunes)
Date: Thu, 24 Jan 2019 11:23:15 -0200
Subject: [Gluster-users] Access to Servers hangs after stop one server...
In-Reply-To:
References:
Message-ID:

Yep!
But as I mentioned in my previous e-mail, this issue occurs even with 3 or 4
servers.
I don't know what is happening.
---
Gilberto Nunes Ferreira

(47) 3025-5907
(47) 99676-7530 - Whatsapp / Telegram

Skype: gilberto.nunes36


On Thu, Jan 24, 2019 at 10:43, Diego Remolina wrote:

> Glusterfs needs quorum, so if you have two servers and one goes down,
> there is no quorum, so all writes stop until the server comes back up. You
> can add a third server as an arbiter which does not store data in the
> bricks, but still uses some minimal space (to keep metadata for the files).
>
> HTH,
>
> Diego
>
> On Wed, Jan 23, 2019 at 3:06 PM Gilberto Nunes wrote:
>
>> Hi there...
>>
>> I have set up two servers as a replica, like this:
>>
>> gluster vol create Vol01 server1:/data/storage server2:/data/storage
>>
>> Then I created a config file on the client, like this:
>>
>> volume remote1
>>   type protocol/client
>>   option transport-type tcp
>>   option remote-host server1
>>   option remote-subvolume /data/storage
>> end-volume
>>
>> volume remote2
>>   type protocol/client
>>   option transport-type tcp
>>   option remote-host server2
>>   option remote-subvolume /data/storage
>> end-volume
>>
>> volume replicate
>>   type cluster/replicate
>>   subvolumes remote1 remote2
>> end-volume
>>
>> volume writebehind
>>   type performance/write-behind
>>   option window-size 1MB
>>   subvolumes replicate
>> end-volume
>>
>> volume cache
>>   type performance/io-cache
>>   option cache-size 512MB
>>   subvolumes writebehind
>> end-volume
>>
>> And added this line to /etc/fstab:
>>
>> /etc/glusterfs/datastore.vol /mnt glusterfs defaults,_netdev 0 0
>>
>> After mounting /mnt, I can access the servers. So far so good!
>> But when I make server1 crash, I am unable to access /mnt or even run
>> gluster vol status
>> on server2.
>>
>> Everything hangs!
>>
>> I have tried with replicated, distributed and replicated-distributed too.
>> I am using Debian Stretch, with the gluster package installed via apt,
>> provided by the standard Debian repo, glusterfs-server 3.8.8-1
>>
>> I am sorry if this is a newbie question, but isn't a glusterfs share
>> supposed to stay online if one server goes down?
>>
>> Any advice will be welcome.
>>
>> Best
>>
>> ---
>> Gilberto Nunes Ferreira

From scott.c.worthington at gmail.com  Thu Jan 24 13:27:45 2019
From: scott.c.worthington at gmail.com (Scott Worthington)
Date: Thu, 24 Jan 2019 08:27:45 -0500
Subject: [Gluster-users] Access to Servers hangs after stop one server...
In-Reply-To:
References:
Message-ID:

I think your mount statement in /etc/fstab is only referencing ONE of the
gluster servers.

Please take a look at the "More redundant mount" section:

https://www.jamescoyle.net/how-to/439-mount-a-glusterfs-volume

Then try taking down one of the gluster servers and report back the results.

From gilberto.nunes32 at gmail.com  Thu Jan 24 13:43:47 2019
From: gilberto.nunes32 at gmail.com (Gilberto Nunes)
Date: Thu, 24 Jan 2019 11:43:47 -0200
Subject: [Gluster-users] Access to Servers hangs after stop one server...
In-Reply-To:
References:
Message-ID:

>I think your mount statement in /etc/fstab is only referencing ONE of the
>gluster servers.
>
>Please take a look at "More redundant mount" section:
>
>https://www.jamescoyle.net/how-to/439-mount-a-glusterfs-volume
>
>Then try taking down one of the gluster servers and report back results.

Guys! I have followed the very same instructions found on James's website.
One of method his mentioned in that website, is create a file into /etc/glusterfs directory, named datastore.vol, for instance, with this content: volume remote1 type protocol/client option transport-type tcp option remote-host server1 option remote-subvolume /data/storage end-volume volume remote2 type protocol/client option transport-type tcp option remote-host server2 option remote-subvolume /data/storage end-volume volume remote3 type protocol/client option transport-type tcp option remote-host server3 option remote-subvolume /data/storage end-volume volume replicate type cluster/replicate subvolumes remote1 remote2 remote3 end-volume volume writebehind type performance/write-behind option window-size 1MB subvolumes replicate end-volume volume cache type performance/io-cache option cache-size 512MB subvolumes writebehind end-volume and then include this line into fstab: /etc/glusterfs/datastore.vol [MOUNT] glusterfs rw,allow_other, default_permissions,max_read=131072 0 0 What I doing wrong??? Thanks --- Gilberto Nunes Ferreira (47) 3025-5907 (47) 99676-7530 - Whatsapp / Telegram Skype: gilberto.nunes36 Em qui, 24 de jan de 2019 ?s 11:27, Scott Worthington < scott.c.worthington at gmail.com> escreveu: > I think your mount statement in /etc/fstab is only referencing ONE of the > gluster servers. > > Please take a look at "More redundant mount" section: > > https://www.jamescoyle.net/how-to/439-mount-a-glusterfs-volume > > Then try taking down one of the gluster servers and report back results. > > On Thu, Jan 24, 2019 at 8:24 AM Gilberto Nunes > wrote: > >> Yep! >> But as I mentioned in previously e-mail, even with 3 or 4 servers this >> issues occurr. >> I don't know what's happen. 
>> >> --- >> Gilberto Nunes Ferreira >> >> (47) 3025-5907 >> (47) 99676-7530 - Whatsapp / Telegram >> >> Skype: gilberto.nunes36 >> >> >> >> >> >> Em qui, 24 de jan de 2019 ?s 10:43, Diego Remolina >> escreveu: >> >>> Glusterfs needs quorum, so if you have two servers and one goes down, >>> there is no quorum, so all writes stop until the server comes back up. You >>> can add a third server as an arbiter which does not store data in the >>> bricks, but still uses some minimal space (to keep metadata for the files). >>> >>> HTH, >>> >>> DIego >>> >>> On Wed, Jan 23, 2019 at 3:06 PM Gilberto Nunes < >>> gilberto.nunes32 at gmail.com> wrote: >>> >>>> Hit there... >>>> >>>> I have set up two server as replica, like this: >>>> >>>> gluster vol create Vol01 server1:/data/storage server2:/data/storage >>>> >>>> Then I create a config file in client, like this: >>>> volume remote1 >>>> type protocol/client >>>> option transport-type tcp >>>> option remote-host server1 >>>> option remote-subvolume /data/storage >>>> end-volume >>>> >>>> volume remote2 >>>> type protocol/client >>>> option transport-type tcp >>>> option remote-host server2 >>>> option remote-subvolume /data/storage >>>> end-volume >>>> >>>> volume replicate >>>> type cluster/replicate >>>> subvolumes remote1 remote2 >>>> end-volume >>>> >>>> volume writebehind >>>> type performance/write-behind >>>> option window-size 1MB >>>> subvolumes replicate >>>> end-volume >>>> >>>> volume cache >>>> type performance/io-cache >>>> option cache-size 512MB >>>> subvolumes writebehind >>>> end-volume >>>> >>>> And add this line in /etc/fstab >>>> >>>> /etc/glusterfs/datastore.vol /mnt glusterfs defaults,_netdev 0 0 >>>> >>>> After mount /mnt, I can access the servers. So far so good! >>>> But when I make server1 crash, I was unable to access /mnt or even use >>>> gluster vol status >>>> on server2 >>>> >>>> Everything hangon! >>>> >>>> I have tried with replicated, distributed and replicated-distributed >>>> too. 
>>>> I am using Debian Stretch, with the gluster package installed via apt,
>>>> provided by the standard Debian repo, glusterfs-server 3.8.8-1
>>>>
>>>> I am sorry if this is a newbie question, but isn't a glusterfs share
>>>> supposed to stay online if one server goes down?
>>>>
>>>> Any advice will be welcome
>>>>
>>>> Best
>>>>
>>>> ---
>>>> Gilberto Nunes Ferreira
>>>>
>>>> (47) 3025-5907
>>>> (47) 99676-7530 - Whatsapp / Telegram
>>>>
>>>> Skype: gilberto.nunes36
>>>>
>>>> _______________________________________________
>>>> Gluster-users mailing list
>>>> Gluster-users at gluster.org
>>>> https://lists.gluster.org/mailman/listinfo/gluster-users

From dijuremo at gmail.com  Thu Jan 24 13:47:19 2019
From: dijuremo at gmail.com (Diego Remolina)
Date: Thu, 24 Jan 2019 08:47:19 -0500
Subject: [Gluster-users] Access to Servers hangs after stop one server...
In-Reply-To:
References:
Message-ID:

Show us output of:

gluster v status

Have you configured firewall rules properly for all ports being used?

Diego

On Thu, Jan 24, 2019 at 8:44 AM Gilberto Nunes wrote:

> >I think your mount statement in /etc/fstab is only referencing ONE of the
> >gluster servers.
> >
> >Please take a look at "More redundant mount" section:
> >
> >https://www.jamescoyle.net/how-to/439-mount-a-glusterfs-volume
> >
> >Then try taking down one of the gluster servers and report back results.
>
> Guys! I have followed the very same instructions found on James's website.
> One of the methods he mentions there is to create a file in the
> /etc/glusterfs directory, named datastore.vol for instance, with this
> content:
>
> volume remote1
>   type protocol/client
>   option transport-type tcp
>   option remote-host server1
>   option remote-subvolume /data/storage
> end-volume
>
> volume remote2
>   type protocol/client
>   option transport-type tcp
>   option remote-host server2
>   option remote-subvolume /data/storage
> end-volume
>
> volume remote3
>   type protocol/client
>   option transport-type tcp
>   option remote-host server3
>   option remote-subvolume /data/storage
> end-volume
>
> volume replicate
>   type cluster/replicate
>   subvolumes remote1 remote2 remote3
> end-volume
>
> volume writebehind
>   type performance/write-behind
>   option window-size 1MB
>   subvolumes replicate
> end-volume
>
> volume cache
>   type performance/io-cache
>   option cache-size 512MB
>   subvolumes writebehind
> end-volume
>
> and then include this line in fstab:
>
> /etc/glusterfs/datastore.vol [MOUNT] glusterfs rw,allow_other,default_permissions,max_read=131072 0 0
>
> What am I doing wrong???
>
> Thanks
>
> ---
> Gilberto Nunes Ferreira
>
> (47) 3025-5907
> (47) 99676-7530 - Whatsapp / Telegram
>
> Skype: gilberto.nunes36
>
> On Thu, Jan 24, 2019 at 11:27, Scott Worthington <scott.c.worthington at gmail.com> wrote:
>
>> I think your mount statement in /etc/fstab is only referencing ONE of the
>> gluster servers.
>>
>> Please take a look at "More redundant mount" section:
>>
>> https://www.jamescoyle.net/how-to/439-mount-a-glusterfs-volume
>>
>> Then try taking down one of the gluster servers and report back results.
>>
>> On Thu, Jan 24, 2019 at 8:24 AM Gilberto Nunes <gilberto.nunes32 at gmail.com> wrote:
>>
>>> Yep!
>>> But as I mentioned in my previous e-mail, even with 3 or 4 servers this
>>> issue occurs.
>>> I don't know what is happening.
>>>
>>> ---
>>> Gilberto Nunes Ferreira
>>>
>>> (47) 3025-5907
>>> (47) 99676-7530 - Whatsapp / Telegram
>>>
>>> Skype: gilberto.nunes36
>>>
>>> On Thu, Jan 24, 2019 at 10:43, Diego Remolina wrote:
>>>
>>>> Glusterfs needs quorum, so if you have two servers and one goes down,
>>>> there is no quorum, so all writes stop until the server comes back up. You
>>>> can add a third server as an arbiter which does not store data in the
>>>> bricks, but still uses some minimal space (to keep metadata for the files).
>>>>
>>>> HTH,
>>>>
>>>> Diego
>>>>
>>>> On Wed, Jan 23, 2019 at 3:06 PM Gilberto Nunes <gilberto.nunes32 at gmail.com> wrote:
>>>>
>>>>> Hi there...
>>>>>
>>>>> I have set up two servers as a replica, like this:
>>>>>
>>>>> gluster vol create Vol01 server1:/data/storage server2:/data/storage
>>>>>
>>>>> Then I created a config file on the client, like this:
>>>>>
>>>>> volume remote1
>>>>>   type protocol/client
>>>>>   option transport-type tcp
>>>>>   option remote-host server1
>>>>>   option remote-subvolume /data/storage
>>>>> end-volume
>>>>>
>>>>> volume remote2
>>>>>   type protocol/client
>>>>>   option transport-type tcp
>>>>>   option remote-host server2
>>>>>   option remote-subvolume /data/storage
>>>>> end-volume
>>>>>
>>>>> volume replicate
>>>>>   type cluster/replicate
>>>>>   subvolumes remote1 remote2
>>>>> end-volume
>>>>>
>>>>> volume writebehind
>>>>>   type performance/write-behind
>>>>>   option window-size 1MB
>>>>>   subvolumes replicate
>>>>> end-volume
>>>>>
>>>>> volume cache
>>>>>   type performance/io-cache
>>>>>   option cache-size 512MB
>>>>>   subvolumes writebehind
>>>>> end-volume
>>>>>
>>>>> And added this line to /etc/fstab:
>>>>>
>>>>> /etc/glusterfs/datastore.vol /mnt glusterfs defaults,_netdev 0 0
>>>>>
>>>>> After mounting /mnt, I can access the servers. So far so good!
>>>>> But when I make server1 crash, I am unable to access /mnt or even use
>>>>> gluster vol status
>>>>> on server2
>>>>>
>>>>> Everything hangs!
>>>>>
>>>>> I have tried with replicated, distributed and replicated-distributed
>>>>> too.
>>>>> I am using Debian Stretch, with the gluster package installed via apt,
>>>>> provided by the standard Debian repo, glusterfs-server 3.8.8-1
>>>>>
>>>>> I am sorry if this is a newbie question, but isn't a glusterfs share
>>>>> supposed to stay online if one server goes down?
>>>>>
>>>>> Any advice will be welcome
>>>>>
>>>>> Best
>>>>>
>>>>> ---
>>>>> Gilberto Nunes Ferreira
>>>>>
>>>>> (47) 3025-5907
>>>>> (47) 99676-7530 - Whatsapp / Telegram
>>>>>
>>>>> Skype: gilberto.nunes36

From atumball at redhat.com  Thu Jan 24 13:49:37 2019
From: atumball at redhat.com (Amar Tumballi Suryanarayan)
Date: Thu, 24 Jan 2019 19:19:37 +0530
Subject: [Gluster-users] Access to Servers hangs after stop one server...
In-Reply-To:
References:
Message-ID:

Also note that mounting with a 'static' volfile like this is not
recommended, as you wouldn't get any of the features of gluster's Software
Defined Storage behavior.

This was the approach we used some 8 years ago. With the introduction of the
management daemon called glusterd, the way of dealing with volfiles has
changed, and they are now created with the gluster CLI.

To keep the /etc/fstab mount from hanging when a server is down, search for
the 'backup-volfile-server' option of glusterfs; that is what should be used.
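
For illustration, a minimal sketch of an /etc/fstab entry that lets glusterd
serve the volfile and names fallback volfile servers. It assumes the volume
Vol01, the mount point /mnt and the hosts server1 to server3 used elsewhere
in this thread. The option spelling is an assumption to verify against the
mount.glusterfs helper on the installed release: recent releases document
backup-volfile-servers (a colon-separated list), while older ones use the
singular backupvolfile-server.

```
# /etc/fstab (illustrative; names taken from this thread, not a tested configuration)
# server1 is contacted first to fetch the volfile from glusterd;
# server2 and server3 are tried in turn if server1 is unreachable at mount time.
server1:/Vol01  /mnt  glusterfs  defaults,_netdev,backup-volfile-servers=server2:server3  0 0
```

The backup servers only matter for fetching the volfile at mount time. Once
mounted, the client talks to the bricks directly, so whether I/O continues
after a server failure depends on the replica layout and quorum settings of
the volume, not on this option.
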

Regards,
Amar

On Thu, Jan 24, 2019 at 7:17 PM Diego Remolina wrote:

> Show us output of:
>
> gluster v status
>
> Have you configured firewall rules properly for all ports being used?
>
> Diego

--
Amar Tumballi (amarts)

From gilberto.nunes32 at gmail.com  Thu Jan 24 13:55:07 2019
From: gilberto.nunes32 at gmail.com (Gilberto Nunes)
Date: Thu, 24 Jan 2019 11:55:07 -0200
Subject: [Gluster-users] Access to Servers hangs after stop one server...
In-Reply-To:
References:
Message-ID:

Thanks, I'll check it out.
---
Gilberto Nunes Ferreira

(47) 3025-5907
(47) 99676-7530 - Whatsapp / Telegram

Skype: gilberto.nunes36

On Thu, Jan 24, 2019 at 11:50, Amar Tumballi Suryanarayan <atumball at redhat.com> wrote:

> Also note that mounting with a 'static' volfile like this is not
> recommended, as you wouldn't get any of the features of gluster's Software
> Defined Storage behavior.
>
> This was the approach we used some 8 years ago. With the introduction of the
> management daemon called glusterd, the way of dealing with volfiles has
> changed, and they are now created with the gluster CLI.
>
> To keep the /etc/fstab mount from hanging when a server is down, search for
> the 'backup-volfile-server' option of glusterfs; that is what should be used.
>
> Regards,
> Amar

From gilberto.nunes32 at gmail.com  Thu Jan 24 14:02:56 2019
From: gilberto.nunes32 at gmail.com (Gilberto Nunes)
Date: Thu, 24 Jan 2019 12:02:56 -0200
Subject: [Gluster-users] Access to Servers hangs after stop one server...
In-Reply-To:
References:
Message-ID:

Ok! Now I get it...
However, on server2, gluster v status times out.
The gluster v info command works fine, but gluster v status hangs and gives
up with a timeout message.

Here is the output of gluster v status:

gluster v status
Status of volume: Vol01
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick server1:/data/storage                 49164     0          Y       684
Brick server2:/data/storage                 49157     0          Y       610
Brick server3:/data/storage                 49155     0          Y       621
Brick server4:/data/storage                 49155     0          Y       594
Self-heal Daemon on localhost               N/A       N/A        Y       667
Self-heal Daemon on server4                 N/A       N/A        Y       623
Self-heal Daemon on server1                 N/A       N/A        Y       720
Self-heal Daemon on server3                 N/A       N/A        Y       651

Task Status of Volume Vol01
------------------------------------------------------------------------------
There are no active volume tasks

And gluster v info:

gluster vol info

Volume Name: Vol01
Type: Distributed-Replicate
Volume ID: 7b71f498-8512-4160-93a0-e32a8c70ecac
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: server1:/data/storage
Brick2: server2:/data/storage
Brick3: server3:/data/storage
Brick4: server4:/data/storage
Options Reconfigured:
nfs.disable: on
performance.readdir-ahead: on
transport.address-family: inet

Now fstab looks like this:

server1:/Vol01 /mnt glusterfs defaults,_netdev,backupvolfile-server=server2:server3:server4 0 0

I am working here with 4 KVM VMs (it's just a lab). When I suspend server1
(virsh suspend server1), after a few seconds I am able to access the /mnt
mount point with no problems.
So it seems backupvolfile-server works fine after all!

Thanks
---
Gilberto Nunes Ferreira

(47) 3025-5907
(47) 99676-7530 - Whatsapp / Telegram

Skype: gilberto.nunes36

On Thu, Jan 24, 2019 at 11:55, Gilberto Nunes <gilberto.nunes32 at gmail.com> wrote:

> Thanks, I'll check it out.

From scott.c.worthington at gmail.com  Thu Jan 24 14:04:30 2019
From: scott.c.worthington at gmail.com (Scott Worthington)
Date: Thu, 24 Jan 2019 09:04:30 -0500
Subject: [Gluster-users] Access to Servers hangs after stop one server...
In-Reply-To:
References:
Message-ID:

Amar,

Is this documentation relevant for Diego?

https://gluster.readthedocs.io/en/latest/Administrator%20Guide/Setting%20Up%20Clients/#manual-mount

"If backupvolfile-server option is added while mounting fuse client, when
the first volfile server fails, then the server specified in
backupvolfile-server option is used as volfile server to mount the client."

Or is there 'better' documentation?

On Thu, Jan 24, 2019 at 8:51 AM Amar Tumballi Suryanarayan <atumball at redhat.com> wrote:

> Also note that mounting with a 'static' volfile like this is not
> recommended, as you wouldn't get any of the features of gluster's Software
> Defined Storage behavior.
> > this was an approach we used to have say 8 years before. With the > introduction of management daemon called glusterd, the way of dealing with > volfiles have changed, and it is created with gluster CLI. > > About having /etc/fstab not hang when a server is down, search for > 'backup-volfile-server' option with glusterfs, and that should be used. > > Regards, > Amar > > On Thu, Jan 24, 2019 at 7:17 PM Diego Remolina wrote: > >> Show us output of: >> >> gluster v status >> >> Have you configured firewall rules properly for all ports being used? >> >> Diego >> >> On Thu, Jan 24, 2019 at 8:44 AM Gilberto Nunes < >> gilberto.nunes32 at gmail.com> wrote: >> >>> >I think your mount statement in /etc/fstab is only referencing ONE of >>> the gluster servers. >>> > >>> >Please take a look at "More redundant mount" section: >>> > >>> >https://www.jamescoyle.net/how-to/439-mount-a-glusterfs-volume >>> > >>> >Then try taking down one of the gluster servers and report back results. >>> >>> Guys! I have followed the very same instruction that found in the >>> James's website. 
>>> One of method his mentioned in that website, is create a file into >>> /etc/glusterfs directory, named datastore.vol, for instance, with this >>> content: >>> >>> volume remote1 >>> type protocol/client >>> option transport-type tcp >>> option remote-host server1 >>> option remote-subvolume /data/storage >>> end-volume >>> >>> volume remote2 >>> type protocol/client >>> option transport-type tcp >>> option remote-host server2 >>> option remote-subvolume /data/storage >>> end-volume >>> >>> volume remote3 >>> type protocol/client >>> option transport-type tcp >>> option remote-host server3 >>> option remote-subvolume /data/storage >>> end-volume >>> >>> volume replicate >>> type cluster/replicate >>> subvolumes remote1 remote2 remote3 >>> end-volume >>> >>> volume writebehind >>> type performance/write-behind >>> option window-size 1MB >>> subvolumes replicate >>> end-volume >>> >>> volume cache >>> type performance/io-cache >>> option cache-size 512MB >>> subvolumes writebehind >>> end-volume >>> >>> >>> and then include this line into fstab: >>> >>> /etc/glusterfs/datastore.vol [MOUNT] glusterfs rw,allow_other, >>> default_permissions,max_read=131072 0 0 >>> >>> What I doing wrong??? >>> >>> Thanks >>> >>> >>> >>> >>> >>> >>> --- >>> Gilberto Nunes Ferreira >>> >>> (47) 3025-5907 >>> (47) 99676-7530 - Whatsapp / Telegram >>> >>> Skype: gilberto.nunes36 >>> >>> >>> >>> >>> >>> Em qui, 24 de jan de 2019 ?s 11:27, Scott Worthington < >>> scott.c.worthington at gmail.com> escreveu: >>> >>>> I think your mount statement in /etc/fstab is only referencing ONE of >>>> the gluster servers. >>>> >>>> Please take a look at "More redundant mount" section: >>>> >>>> https://www.jamescoyle.net/how-to/439-mount-a-glusterfs-volume >>>> >>>> Then try taking down one of the gluster servers and report back results. >>>> >>>> On Thu, Jan 24, 2019 at 8:24 AM Gilberto Nunes < >>>> gilberto.nunes32 at gmail.com> wrote: >>>> >>>>> Yep! 
>>>>> But as I mentioned in previously e-mail, even with 3 or 4 servers this >>>>> issues occurr. >>>>> I don't know what's happen. >>>>> >>>>> --- >>>>> Gilberto Nunes Ferreira >>>>> >>>>> (47) 3025-5907 >>>>> (47) 99676-7530 - Whatsapp / Telegram >>>>> >>>>> Skype: gilberto.nunes36 >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> Em qui, 24 de jan de 2019 ?s 10:43, Diego Remolina >>>>> escreveu: >>>>> >>>>>> Glusterfs needs quorum, so if you have two servers and one goes down, >>>>>> there is no quorum, so all writes stop until the server comes back up. You >>>>>> can add a third server as an arbiter which does not store data in the >>>>>> bricks, but still uses some minimal space (to keep metadata for the files). >>>>>> >>>>>> HTH, >>>>>> >>>>>> DIego >>>>>> >>>>>> On Wed, Jan 23, 2019 at 3:06 PM Gilberto Nunes < >>>>>> gilberto.nunes32 at gmail.com> wrote: >>>>>> >>>>>>> Hit there... >>>>>>> >>>>>>> I have set up two server as replica, like this: >>>>>>> >>>>>>> gluster vol create Vol01 server1:/data/storage server2:/data/storage >>>>>>> >>>>>>> Then I create a config file in client, like this: >>>>>>> volume remote1 >>>>>>> type protocol/client >>>>>>> option transport-type tcp >>>>>>> option remote-host server1 >>>>>>> option remote-subvolume /data/storage >>>>>>> end-volume >>>>>>> >>>>>>> volume remote2 >>>>>>> type protocol/client >>>>>>> option transport-type tcp >>>>>>> option remote-host server2 >>>>>>> option remote-subvolume /data/storage >>>>>>> end-volume >>>>>>> >>>>>>> volume replicate >>>>>>> type cluster/replicate >>>>>>> subvolumes remote1 remote2 >>>>>>> end-volume >>>>>>> >>>>>>> volume writebehind >>>>>>> type performance/write-behind >>>>>>> option window-size 1MB >>>>>>> subvolumes replicate >>>>>>> end-volume >>>>>>> >>>>>>> volume cache >>>>>>> type performance/io-cache >>>>>>> option cache-size 512MB >>>>>>> subvolumes writebehind >>>>>>> end-volume >>>>>>> >>>>>>> And add this line in /etc/fstab >>>>>>> >>>>>>> /etc/glusterfs/datastore.vol /mnt 
glusterfs defaults,_netdev 0 0 >>>>>>> >>>>>>> After mounting /mnt, I can access the servers. So far so good! >>>>>>> But when I make server1 crash, I am unable to access /mnt or even >>>>>>> use >>>>>>> gluster vol status >>>>>>> on server2. >>>>>>> >>>>>>> Everything hangs! >>>>>>> >>>>>>> I have tried with replicated, distributed and replicated-distributed >>>>>>> too. >>>>>>> I am using Debian Stretch, with the gluster package installed via apt, >>>>>>> provided by the standard Debian repo, glusterfs-server 3.8.8-1 >>>>>>> >>>>>>> I am sorry if this is a newbie question, but isn't a glusterfs share >>>>>>> supposed to stay online if one server goes down? >>>>>>> >>>>>>> Any advice will be welcome >>>>>>> >>>>>>> Best >>>>>>> >>>>>>> --- >>>>>>> Gilberto Nunes Ferreira >>>>>>> >>>>>>> (47) 3025-5907 >>>>>>> (47) 99676-7530 - Whatsapp / Telegram >>>>>>> >>>>>>> Skype: gilberto.nunes36 >>>>>>> >>>>>>> _______________________________________________ >>>>>>> Gluster-users mailing list >>>>>>> Gluster-users at gluster.org >>>>>>> https://lists.gluster.org/mailman/listinfo/gluster-users > > -- > Amar Tumballi (amarts) -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From shaik.salam at tcs.com Thu Jan 24 14:11:50 2019 From: shaik.salam at tcs.com (Shaik Salam) Date: Thu, 24 Jan 2019 19:41:50 +0530 Subject: [Gluster-users] Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: server busy In-Reply-To: References: Message-ID: Hi Madhu, Sorry to disturb you. Could you please provide at least a workaround (to clear the requests which are stuck) so that we can move further? We are also not able to find the root cause from the glusterd logs. Please find the attachment. BR Salam From: Shaik Salam/HYD/TCS To: "Madhu Rajanna" Cc: "gluster-users at gluster.org List" , "Michael Adam" Date: 01/24/2019 04:12 PM Subject: Re: Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: server busy Hi Madhu, Please let me know if any other information is required. BR Salam From: Shaik Salam/HYD/TCS To: "Madhu Rajanna" Cc: "gluster-users at gluster.org List" , "Michael Adam" Date: 01/24/2019 03:23 PM Subject: Re: Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: server busy Hi Madhu, This is the complete log after the restart of the heketi pod, together with the process list. BR Salam [attachment "heketi-pod-complete.log" deleted by Shaik Salam/HYD/TCS] [attachment "ps-aux.txt" deleted by Shaik Salam/HYD/TCS] From: "Madhu Rajanna" To: "Shaik Salam" Cc: "gluster-users at gluster.org List" , "Michael Adam" Date: 01/24/2019 01:55 PM Subject: Re: Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: server busy The logs you provided are not complete, so I am not able to figure out which command is stuck. Can you reattach the complete output of `ps aux` and also attach the complete heketi logs? On Thu, Jan 24, 2019 at 1:41 PM Shaik Salam wrote: Hi Madhu, Please find the requested info. 
BR Salam From: Madhu Rajanna To: Shaik Salam Cc: "gluster-users at gluster.org List" , Michael Adam Date: 01/24/2019 01:33 PM Subject: Re: Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: server busy The heketi logs you have attached are not complete, I believe. Can you provide the complete heketi logs, and can we also get the output of "ps aux" from the gluster pods? I want to see if any lvm commands or gluster commands are "stuck". On Thu, Jan 24, 2019 at 1:16 PM Shaik Salam wrote: Hi Madhu. I restarted the heketi pod many times, but the issue is not resolved. sh-4.4# heketi-cli server operations info Operation Counts: Total: 0 In-Flight: 0 New: 0 Stale: 0 Now you can see all operations are zero. Now I try to create a single volume; the observation below shows in-flight slowly reaching 8. sh-4.4# heketi-cli server operations info Operation Counts: Total: 0 In-Flight: 6 New: 0 Stale: 0 sh-4.4# heketi-cli server operations info Operation Counts: Total: 0 In-Flight: 7 New: 0 Stale: 0 sh-4.4# heketi-cli server operations info Operation Counts: Total: 0 In-Flight: 7 New: 0 Stale: 0 sh-4.4# heketi-cli server operations info Operation Counts: Total: 0 In-Flight: 7 New: 0 Stale: 0 sh-4.4# heketi-cli server operations info Operation Counts: Total: 0 In-Flight: 7 New: 0 Stale: 0 [negroni] Completed 200 OK in 186.286µs [negroni] Started POST /volumes [negroni] Started GET /operations [negroni] Completed 200 OK in 166.294µs [negroni] Started GET /operations [negroni] Completed 200 OK in 186.411µs [negroni] Started GET /operations [negroni] Completed 200 OK in 179.796µs [negroni] Started POST /volumes [negroni] Started POST /volumes [negroni] Started POST /volumes [negroni] Started POST /volumes [negroni] Started GET /operations [negroni] Completed 200 OK in 131.108µs [negroni] Started POST /volumes [negroni] Started GET /operations [negroni] Completed 200 OK in 111.392µs 
[negroni] Started GET /operations [negroni] Completed 200 OK in 265.023µs [negroni] Started GET /operations [negroni] Completed 200 OK in 179.364µs [negroni] Started GET /operations [negroni] Completed 200 OK in 295.058µs [negroni] Started GET /operations [negroni] Completed 200 OK in 146.857µs [negroni] Started POST /volumes [negroni] Started POST /volumes [heketi] WARNING 2019/01/24 07:43:36 operations in-flight (8) exceeds limit (8) [negroni] Completed 429 Too Many Requests in 403.166µs [negroni] Started POST /volumes [heketi] WARNING 2019/01/24 07:43:51 operations in-flight (8) exceeds limit (8) [negroni] Completed 429 Too Many Requests in 193.554µs But the volume for the pod is not being created. 1:15:36 PM Warning Provisioning failed Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: create volume err: error creating volume Server busy. Retry operation later.. 9 times in the last 2 minutes 1:13:21 PM Warning Provisioning failed Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: create volume err: error creating volume . 8 times in the last From: "Madhu Rajanna" To: "Shaik Salam" Cc: "gluster-users at gluster.org List" , "Michael Adam" Date: 01/24/2019 12:51 PM Subject: Re: Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: server busy Hi Shaik, can you provide me the output of $heketi-cli server operations info from the heketi pod? As a workaround you can try restarting the heketi pod. This will cause the current operations to go stale, but other pending PVCs may go to the Bound state. Regards, Madhu R On Thu, Jan 24, 2019 at 12:36 PM Shaik Salam wrote: Hi Madhu, Could you please have a look at my issue if you have time (at least a workaround). 
I am unable to send mail to "John Mulligan", who is currently handling the issue https://bugzilla.redhat.com/show_bug.cgi?id=1636912 BR Salam From: Shaik Salam/HYD/TCS To: "John Mulligan" , "Michael Adam" < madam at redhat.com>, "Madhu Rajanna" Cc: "gluster-users at gluster.org List" Date: 01/24/2019 12:21 PM Subject: Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: server busy Hi All, We are also facing the following issue on OpenShift Origin while creating PVCs for pods. (Please provide at least a workaround so that we can move further.) Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: create volume err: error creating volume Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: create volume err: error creating volume Server busy. Retry operation later.. Please find the heketi db dump and log [negroni] Completed 429 Too Many Requests in 250.763µs [negroni] Started POST /volumes [heketi] WARNING 2019/01/23 12:07:49 operations in-flight (8) exceeds limit (8) [negroni] Completed 429 Too Many Requests in 169.08µs [negroni] Started DELETE /volumes/520bc5f4e1bfd029855a72f9ca7ebf6c [negroni] Completed 404 Not Found in 148.125µs [negroni] Started POST /volumes [heketi] WARNING 2019/01/23 12:08:04 operations in-flight (8) exceeds limit (8) [negroni] Completed 429 Too Many Requests in 496.624µs [negroni] Started POST /volumes [heketi] WARNING 2019/01/23 12:08:04 operations in-flight (8) exceeds limit (8) [negroni] Completed 429 Too Many Requests in 101.673µs [negroni] Started POST /volumes [heketi] WARNING 2019/01/23 12:08:19 operations in-flight (8) exceeds limit (8) [negroni] Completed 429 Too Many Requests in 209.681µs [negroni] Started POST /volumes [heketi] WARNING 2019/01/23 12:08:19 operations in-flight (8) exceeds limit (8) [negroni] Completed 429 Too Many Requests in 103.595µs [negroni] Started POST /volumes [heketi] WARNING 2019/01/23 12:08:34 operations in-flight (8) exceeds limit (8) [negroni] Completed 429 Too 
Many Requests in 297.594µs [negroni] Started POST /volumes [heketi] WARNING 2019/01/23 12:08:34 operations in-flight (8) exceeds limit (8) [negroni] Completed 429 Too Many Requests in 96.75µs [negroni] Started POST /volumes [negroni] Started POST /volumes [heketi] WARNING 2019/01/23 12:08:49 operations in-flight (8) exceeds limit (8) [negroni] Completed 429 Too Many Requests in 477.007µs [heketi] WARNING 2019/01/23 12:08:49 operations in-flight (8) exceeds limit (8) [negroni] Completed 429 Too Many Requests in 165.38µs [negroni] Started POST /volumes [heketi] WARNING 2019/01/23 12:09:04 operations in-flight (8) exceeds limit (8) [negroni] Completed 429 Too Many Requests in 488.253µs [negroni] Started POST /volumes [heketi] WARNING 2019/01/23 12:09:04 operations in-flight (8) exceeds limit (8) [negroni] Completed 429 Too Many Requests in 171.836µs [negroni] Started POST /volumes [heketi] WARNING 2019/01/23 12:09:19 operations in-flight (8) exceeds limit (8) [negroni] Completed 429 Too Many Requests in 208.59µs [negroni] Started POST /volumes [heketi] WARNING 2019/01/23 12:09:19 operations in-flight (8) exceeds limit (8) [negroni] Completed 429 Too Many Requests in 125.141µs [negroni] Started DELETE /volumes/99e87ecd0a816ac34ae5a04eabc1d606 [negroni] Completed 404 Not Found in 138.687µs [negroni] Started POST /volumes BR Salam =====-----=====-----===== Notice: The information contained in this e-mail message and/or attachments to it may contain confidential or privileged information. If you are not the intended recipient, any dissemination, use, review, distribution, printing or copying of the information contained in this e-mail message and/or attachments to it are strictly prohibited. If you have received this communication in error, please notify us by reply e-mail or telephone and immediately and permanently delete the message and any attachments. 
Thank you -- Madhu Rajanna Software Engineer Red Hat Bangalore, India mrajanna at redhat.com M: +91-9741133155 -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: first-node.log.txt URL: -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: secondnode.log.txt URL: -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: thirdnode.log.txt URL: From budic at onholyground.com Thu Jan 24 15:53:26 2019 From: budic at onholyground.com (Darrell Budic) Date: Thu, 24 Jan 2019 09:53:26 -0600 Subject: [Gluster-users] Re: Gluster performance issues - need advise In-Reply-To: <1172976507.352481.1548325968837@mail.yahoo.com> References: <54e562bd-3465-4c69-8fda-8060a52e9c22@email.android.com> <1172976507.352481.1548325968837@mail.yahoo.com> Message-ID: Strahil- The fuse client is what it is; it's limited by operating in user land and waiting for the gluster servers to acknowledge all the writes. I noted you're using ovirt, so you should look into enabling the libgfapi engine setting to run your VMs with libgfapi natively. You can't test directly from the host with that, but you can run your tests inside the VMs. I saw significant throughput and latency improvements that way. It's still somewhat beta, so you'll probably need to search the ovirt-users mailing list to find info on enabling it. Good luck! > On Jan 24, 2019, at 4:32 AM, Strahil Nikolov wrote: > > Dear Amar, Community, > > it seems the issue is in the fuse client itself. > > Here is the latest update: > 1. 
I have added the following: > server.event-threads: 4 > client.event-threads: 4 > performance.stat-prefetch: on > performance.strict-o-direct: off > Results: no change > > 2. Allowed nfs and connected ovirt1 to the gluster volume: > nfs.disable: off > Results: Drastic improvement in performance as follows: > > [root@ovirt1 data]# dd if=/dev/zero of=largeio bs=1M count=5000 status=progress > 5000+0 records in > 5000+0 records out > 5242880000 bytes (5.2 GB) copied, 53.0443 s, 98.8 MB/s > > So I would be happy if anyone could guide me in fixing the situation, as the fuse client is the best way to use glusterfs, and it seems the glusterfs-server is not the guilty one. > > Thanks in advance for your guidance. I have learned so much. > > Best Regards, > Strahil Nikolov > > > From: Strahil > To: Amar Tumballi Suryanarayan > Cc: Gluster-users > Sent: Wednesday, 23 January 2019, 18:44 > Subject: Re: [Gluster-users] Gluster performance issues - need advise > > Dear Amar, > > Thanks for your email. > > Actually my concerns were on both topics. > Would you recommend any perf options that will be suitable? > > After you mentioned the network usage, I just checked it, and it seems that during the test session ovirt1 (both client and host) is using no more than 455Mbit/s, which is half the network bandwidth. > > I'm still in the middle of nowhere, so any ideas are welcome. > > Best Regards, > Strahil Nikolov > > On Jan 23, 2019 17:49, Amar Tumballi Suryanarayan wrote: > I didn't understand the issue properly. Most likely I missed something. > > Are you concerned that the performance is 49MB/s with and without perf options? Or are you expecting it to be 123MB/s, as you get that speed over the network? > > If it is the first problem, then you actually have 'performance.write-behind on' in both cases, and it is the only perf xlator which comes into action during the test you ran. 
> > If it is the second, then please be informed that gluster does client-side replication, which means the network bandwidth is split in half for write operations (like write(), creat() etc.), so the number you are getting is almost the maximum with 1GbE. > > Regards, > Amar > > On Wed, Jan 23, 2019 at 8:38 PM Strahil Nikolov > wrote: > Hello Community, > > recently I have built a new lab based on oVirt and CentOS 7. > During deployment I had some hiccups, but now the engine is up and running - but gluster is causing me trouble. > > Symptoms: Slow VM install from DVD, poor write performance. The latter has been tested via: > dd if=/dev/zero of=/rhev/data-center/mnt/glusterSD/ovirt1.localdomain:_data bs=1M count=1000 status=progress > > The reported speed is 60MB/s which is way too low for my setup. > > My lab design: > https://drive.google.com/file/d/1SiW21ASPXHRAEuE_jZ50R3FoO-NcnFqT/view?usp=sharing > Gluster version is 3.12.15 > > So far I have done: > > 1. Added 'server.allow-insecure on' (with 'option rpc-auth-allow-insecure on' in glusterd.vol) > Volume info after that change: > > Volume Name: data > Type: Replicate > Volume ID: 9b06a1e9-8102-4cd7-bc56-84960a1efaa2 > Status: Started > Snapshot Count: 0 > Number of Bricks: 1 x (2 + 1) = 3 > Transport-type: tcp > Bricks: > Brick1: ovirt1.localdomain:/gluster_bricks/data/data > Brick2: ovirt2.localdomain:/gluster_bricks/data/data > Brick3: ovirt3.localdomain:/gluster_bricks/data/data (arbiter) > Options Reconfigured: > performance.client-io-threads: off > nfs.disable: on > transport.address-family: inet > performance.quick-read: off > performance.read-ahead: off > performance.io-cache: off > performance.low-prio-threads: 32 > network.remote-dio: off > cluster.eager-lock: enable > cluster.quorum-type: auto > cluster.server-quorum-type: server > cluster.data-self-heal-algorithm: full > cluster.locking-scheme: granular > cluster.shd-max-threads: 8 > cluster.shd-wait-qlength: 10000 > features.shard: on > user.cifs: off > 
storage.owner-uid: 36 > storage.owner-gid: 36 > network.ping-timeout: 30 > performance.strict-o-direct: on > cluster.granular-entry-heal: enable > server.allow-insecure: on > > Seems no positive or negative effect so far. > > 2. Tested with tmpfs on all bricks -> ovirt1 mounted gluster volume -> max 60MB/s (bs=1M without 'oflag=direct') > > > [root at ovirt1 data]# dd if=/dev/zero of=large_io bs=1M count=4000 status=progress > 4177526784 bytes (4.2 GB) copied, 70.843409 s, 59.0 MB/s > 4000+0 records in > 4000+0 records out > 4194304000 bytes (4.2 GB) copied, 71.1407 s, 59.0 MB/s > [root at ovirt1 data]# rm -f large_io > [root at ovirt1 data]# gluster volume profile data info > Brick: ovirt1.localdomain:/gluster_bricks/data/data > --------------------------------------------------- > Cumulative Stats: > Block Size: 131072b+ > No. of Reads: 8 > No. of Writes: 44968 > %-latency Avg-latency Min-Latency Max-Latency No. of calls Fop > --------- ----------- ----------- ----------- ------------ ---- > 0.00 0.00 us 0.00 us 0.00 us 3 FORGET > 0.00 0.00 us 0.00 us 0.00 us 35 RELEASE > 0.00 0.00 us 0.00 us 0.00 us 28 RELEASEDIR > 0.00 78.00 us 78.00 us 78.00 us 1 FSTAT > 0.00 35.67 us 26.00 us 73.00 us 6 FLUSH > 0.00 324.00 us 324.00 us 324.00 us 1 XATTROP > 0.00 45.80 us 38.00 us 54.00 us 10 STAT > 0.00 227.67 us 216.00 us 242.00 us 3 CREATE > 0.00 113.38 us 68.00 us 381.00 us 8 READ > 0.00 39.82 us 1.00 us 148.00 us 28 OPENDIR > 0.00 67.54 us 10.00 us 283.00 us 24 GETXATTR > 0.00 59.97 us 45.00 us 113.00 us 32 OPEN > 0.00 24.41 us 13.00 us 89.00 us 161 INODELK > 0.00 43.43 us 28.00 us 214.00 us 93 STATFS > 0.00 246.35 us 11.00 us 1155.00 us 20 READDIR > 0.00 283.00 us 233.00 us 353.00 us 18 READDIRP > 0.00 153.23 us 122.00 us 259.00 us 87 MKNOD > 0.01 99.77 us 10.00 us 258.00 us 442 LOOKUP > 0.31 49.22 us 27.00 us 540.00 us 45620 FXATTROP > 0.77 124.24 us 87.00 us 604.00 us 44968 WRITE > 0.93 15767.71 us 15.00 us 305833.00 us 431 ENTRYLK > 1.99 160711.39 us 3332.00 us 
406037.00 us 90 UNLINK > 96.00 5167.82 us 18.00 us 55972.00 us 135349 FINODELK > > Duration: 380 seconds > Data Read: 1048576 bytes > Data Written: 5894045696 bytes > > Interval 0 Stats: > Block Size: 131072b+ > No. of Reads: 8 > No. of Writes: 44968 > %-latency Avg-latency Min-Latency Max-Latency No. of calls Fop > --------- ----------- ----------- ----------- ------------ ---- > 0.00 0.00 us 0.00 us 0.00 us 3 FORGET > 0.00 0.00 us 0.00 us 0.00 us 35 RELEASE > 0.00 0.00 us 0.00 us 0.00 us 28 RELEASEDIR > 0.00 78.00 us 78.00 us 78.00 us 1 FSTAT > 0.00 35.67 us 26.00 us 73.00 us 6 FLUSH > 0.00 324.00 us 324.00 us 324.00 us 1 XATTROP > 0.00 45.80 us 38.00 us 54.00 us 10 STAT > 0.00 227.67 us 216.00 us 242.00 us 3 CREATE > 0.00 113.38 us 68.00 us 381.00 us 8 READ > 0.00 39.82 us 1.00 us 148.00 us 28 OPENDIR > 0.00 67.54 us 10.00 us 283.00 us 24 GETXATTR > 0.00 59.97 us 45.00 us 113.00 us 32 OPEN > 0.00 24.41 us 13.00 us 89.00 us 161 INODELK > 0.00 43.43 us 28.00 us 214.00 us 93 STATFS > 0.00 246.35 us 11.00 us 1155.00 us 20 READDIR > 0.00 283.00 us 233.00 us 353.00 us 18 READDIRP > 0.00 153.23 us 122.00 us 259.00 us 87 MKNOD > 0.01 99.77 us 10.00 us 258.00 us 442 LOOKUP > 0.31 49.22 us 27.00 us 540.00 us 45620 FXATTROP > 0.77 124.24 us 87.00 us 604.00 us 44968 WRITE > 0.93 15767.71 us 15.00 us 305833.00 us 431 ENTRYLK > 1.99 160711.39 us 3332.00 us 406037.00 us 90 UNLINK > 96.00 5167.82 us 18.00 us 55972.00 us 135349 FINODELK > > Duration: 380 seconds > Data Read: 1048576 bytes > Data Written: 5894045696 bytes > > Brick: ovirt3.localdomain:/gluster_bricks/data/data > --------------------------------------------------- > Cumulative Stats: > Block Size: 1b+ > No. of Reads: 0 > No. of Writes: 39328 > %-latency Avg-latency Min-Latency Max-Latency No. 
of calls Fop > --------- ----------- ----------- ----------- ------------ ---- > 0.00 0.00 us 0.00 us 0.00 us 2 FORGET > 0.00 0.00 us 0.00 us 0.00 us 12 RELEASE > 0.00 0.00 us 0.00 us 0.00 us 17 RELEASEDIR > 0.00 101.00 us 101.00 us 101.00 us 1 FSTAT > 0.00 51.50 us 20.00 us 81.00 us 4 FLUSH > 0.01 219.50 us 188.00 us 251.00 us 2 CREATE > 0.01 43.45 us 11.00 us 90.00 us 11 GETXATTR > 0.01 62.30 us 38.00 us 119.00 us 10 OPEN > 0.01 50.59 us 1.00 us 102.00 us 17 OPENDIR > 0.01 24.60 us 12.00 us 64.00 us 40 INODELK > 0.02 176.30 us 10.00 us 765.00 us 10 READDIR > 0.07 63.08 us 39.00 us 133.00 us 78 UNLINK > 0.13 27.35 us 10.00 us 91.00 us 333 ENTRYLK > 0.13 126.89 us 99.00 us 179.00 us 76 MKNOD > 0.42 116.70 us 8.00 us 8661.00 us 261 LOOKUP > 28.73 51.79 us 22.00 us 2574.00 us 39822 FXATTROP > 29.52 53.87 us 16.00 us 3290.00 us 39328 WRITE > 40.92 24.71 us 10.00 us 3224.00 us 118864 FINODELK > > Duration: 189 seconds > Data Read: 0 bytes > Data Written: 39328 bytes > > Interval 0 Stats: > Block Size: 1b+ > No. of Reads: 0 > No. of Writes: 39328 > %-latency Avg-latency Min-Latency Max-Latency No. 
of calls Fop > --------- ----------- ----------- ----------- ------------ ---- > 0.00 0.00 us 0.00 us 0.00 us 2 FORGET > 0.00 0.00 us 0.00 us 0.00 us 12 RELEASE > 0.00 0.00 us 0.00 us 0.00 us 17 RELEASEDIR > 0.00 101.00 us 101.00 us 101.00 us 1 FSTAT > 0.00 51.50 us 20.00 us 81.00 us 4 FLUSH > 0.01 219.50 us 188.00 us 251.00 us 2 CREATE > 0.01 43.45 us 11.00 us 90.00 us 11 GETXATTR > 0.01 62.30 us 38.00 us 119.00 us 10 OPEN > 0.01 50.59 us 1.00 us 102.00 us 17 OPENDIR > 0.01 24.60 us 12.00 us 64.00 us 40 INODELK > 0.02 176.30 us 10.00 us 765.00 us 10 READDIR > 0.07 63.08 us 39.00 us 133.00 us 78 UNLINK > 0.13 27.35 us 10.00 us 91.00 us 333 ENTRYLK > 0.13 126.89 us 99.00 us 179.00 us 76 MKNOD > 0.42 116.70 us 8.00 us 8661.00 us 261 LOOKUP > 28.73 51.79 us 22.00 us 2574.00 us 39822 FXATTROP > 29.52 53.87 us 16.00 us 3290.00 us 39328 WRITE > 40.92 24.71 us 10.00 us 3224.00 us 118864 FINODELK > > Duration: 189 seconds > Data Read: 0 bytes > Data Written: 39328 bytes > > Brick: ovirt2.localdomain:/gluster_bricks/data/data > --------------------------------------------------- > Cumulative Stats: > Block Size: 512b+ 131072b+ > No. of Reads: 0 0 > No. of Writes: 36 76758 > %-latency Avg-latency Min-Latency Max-Latency No. 
of calls Fop > --------- ----------- ----------- ----------- ------------ ---- > 0.00 0.00 us 0.00 us 0.00 us 6 FORGET > 0.00 0.00 us 0.00 us 0.00 us 87 RELEASE > 0.00 0.00 us 0.00 us 0.00 us 96 RELEASEDIR > 0.00 100.50 us 80.00 us 121.00 us 2 REMOVEXATTR > 0.00 101.00 us 101.00 us 101.00 us 2 SETXATTR > 0.00 36.18 us 22.00 us 62.00 us 11 FLUSH > 0.00 57.44 us 42.00 us 77.00 us 9 FTRUNCATE > 0.00 82.56 us 59.00 us 138.00 us 9 FSTAT > 0.00 89.42 us 67.00 us 161.00 us 12 SETATTR > 0.00 272.40 us 235.00 us 296.00 us 5 CREATE > 0.01 154.28 us 88.00 us 320.00 us 18 XATTROP > 0.01 45.29 us 1.00 us 319.00 us 96 OPENDIR > 0.01 86.69 us 30.00 us 379.00 us 62 STAT > 0.01 64.30 us 47.00 us 169.00 us 84 OPEN > 0.02 107.34 us 23.00 us 273.00 us 73 READDIRP > 0.02 4688.00 us 86.00 us 9290.00 us 2 TRUNCATE > 0.02 59.29 us 13.00 us 394.00 us 165 GETXATTR > 0.03 128.51 us 27.00 us 338.00 us 96 FSYNC > 0.03 240.75 us 14.00 us 1943.00 us 52 READDIR > 0.04 65.59 us 26.00 us 293.00 us 279 STATFS > 0.06 180.77 us 118.00 us 306.00 us 148 MKNOD > 0.14 37.98 us 17.00 us 192.00 us 1598 INODELK > 0.67 91.68 us 12.00 us 1141.00 us 3186 LOOKUP > 10.10 55.92 us 28.00 us 1658.00 us 78608 FXATTROP > 11.89 6814.76 us 18.00 us 301246.00 us 760 ENTRYLK > 19.44 36.55 us 14.00 us 2353.00 us 231535 FINODELK > 25.21 142.92 us 62.00 us 593.00 us 76794 WRITE > 32.28 91283.68 us 28.00 us 316658.00 us 154 UNLINK > > Duration: 1206 seconds > Data Read: 0 bytes > Data Written: 10060843008 bytes > > Interval 0 Stats: > Block Size: 512b+ 131072b+ > No. of Reads: 0 0 > No. of Writes: 36 76758 > %-latency Avg-latency Min-Latency Max-Latency No. 
of calls Fop > --------- ----------- ----------- ----------- ------------ ---- > 0.00 0.00 us 0.00 us 0.00 us 6 FORGET > 0.00 0.00 us 0.00 us 0.00 us 87 RELEASE > 0.00 0.00 us 0.00 us 0.00 us 96 RELEASEDIR > 0.00 100.50 us 80.00 us 121.00 us 2 REMOVEXATTR > 0.00 101.00 us 101.00 us 101.00 us 2 SETXATTR > 0.00 36.18 us 22.00 us 62.00 us 11 FLUSH > 0.00 57.44 us 42.00 us 77.00 us 9 FTRUNCATE > 0.00 82.56 us 59.00 us 138.00 us 9 FSTAT > 0.00 89.42 us 67.00 us 161.00 us 12 SETATTR > 0.00 272.40 us 235.00 us 296.00 us 5 CREATE > 0.01 154.28 us 88.00 us 320.00 us 18 XATTROP > 0.01 45.29 us 1.00 us 319.00 us 96 OPENDIR > 0.01 86.69 us 30.00 us 379.00 us 62 STAT > 0.01 64.30 us 47.00 us 169.00 us 84 OPEN > 0.02 107.34 us 23.00 us 273.00 us 73 READDIRP > 0.02 4688.00 us 86.00 us 9290.00 us 2 TRUNCATE > 0.02 59.29 us 13.00 us 394.00 us 165 GETXATTR > 0.03 128.51 us 27.00 us 338.00 us 96 FSYNC > 0.03 240.75 us 14.00 us 1943.00 us 52 READDIR > 0.04 65.59 us 26.00 us 293.00 us 279 STATFS > 0.06 180.77 us 118.00 us 306.00 us 148 MKNOD > 0.14 37.98 us 17.00 us 192.00 us 1598 INODELK > 0.67 91.66 us 12.00 us 1141.00 us 3186 LOOKUP > 10.10 55.92 us 28.00 us 1658.00 us 78608 FXATTROP > 11.89 6814.76 us 18.00 us 301246.00 us 760 ENTRYLK > 19.44 36.55 us 14.00 us 2353.00 us 231535 FINODELK > 25.21 142.92 us 62.00 us 593.00 us 76794 WRITE > 32.28 91283.68 us 28.00 us 316658.00 us 154 UNLINK > > Duration: 1206 seconds > Data Read: 0 bytes > Data Written: 10060843008 bytes > > > > This indicates to me that it's not a problem in Disk/LVM/FileSystem layout. > > Most probably I haven't created the volume properly or some option/feature is disabled ?!? > Network shows OK for a gigabit: > [root at ovirt1 data]# dd if=/dev/zero status=progress | nc ovirt2 9999 > 3569227264 bytes (3.6 GB) copied, 29.001052 s, 123 MB/s^C > 7180980+0 records in > 7180979+0 records out > 3676661248 bytes (3.7 GB) copied, 29.8739 s, 123 MB/s > > > I'm looking for any help... you can share your volume info also. 
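[Editor's sketch, not part of the original message] The dd figures quoted in this thread can be related to each other with a little arithmetic. The Python below is a minimal sketch under stated assumptions: it recomputes the three throughput numbers from the byte counts and durations dd printed, and then applies the "split in half" rule of thumb from Amar's reply quoted earlier. The helper name mb_per_s and the constant DATA_COPIES are made up for illustration, and the factor of 2 is an assumption taken from that reply, not something measured here.

```python
# Throughput figures recomputed from the dd output quoted in this thread.

def mb_per_s(num_bytes: int, seconds: float) -> float:
    """Return throughput in decimal megabytes per second, the unit dd prints."""
    return num_bytes / seconds / 1_000_000

fuse_write = mb_per_s(4_194_304_000, 71.1407)   # dd onto the FUSE-mounted volume
raw_network = mb_per_s(3_676_661_248, 29.8739)  # dd piped through nc to ovirt2
nfs_write = mb_per_s(5_242_880_000, 53.0443)    # dd onto the NFS-mounted volume

# ASSUMPTION (from Amar's reply, not measured here): with client-side
# replication the client pushes every write to two data bricks over the same
# link, so the usable write rate is roughly half of the raw link rate.
DATA_COPIES = 2
expected_fuse = raw_network / DATA_COPIES

print(f"FUSE write:        {fuse_write:6.1f} MB/s")
print(f"NFS write:         {nfs_write:6.1f} MB/s")
print(f"Raw network:       {raw_network:6.1f} MB/s")
print(f"Expected for FUSE: {expected_fuse:6.1f} MB/s (raw / {DATA_COPIES})")
print(f"FUSE / expected:   {fuse_write / expected_fuse:.0%}")
```

Under that assumption the FUSE result lands close to the halved link rate, which is consistent with Amar's remark that the number is "almost the maximum with 1GbE". The sketch makes no attempt to explain why the NFS mount is faster.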
> > Thanks in advance. > > Best Regards, > Strahil Nikolov > _______________________________________________ > Gluster-users mailing list > Gluster-users at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-users > > > -- > Amar Tumballi (amarts) > > Dear Amar, > > Thanks for your email. > > Actually my concerns were on both topics. > Would you recommend any perf options that will be suitable? > > After you mentioned the network usage, I just checked it, and it seems that during the test session ovirt1 (both client and host) is using no more than 455Mbit/s, which is half the network bandwidth. > > I'm still in the middle of nowhere, so any ideas are welcome. > > Best Regards, > Strahil Nikolov > > On Jan 23, 2019 17:49, Amar Tumballi Suryanarayan > wrote: > > > > I didn't understand the issue properly. Most likely I missed something. > > > > Are you concerned that the performance is 49MB/s with and without perf options? Or are you expecting it to be 123MB/s, as you get that speed over the network? > > > > If it is the first problem, then you actually have 'performance.write-behind on' in both cases, and it is the only perf xlator which comes into action during the test you ran. > > > > If it is the second, then please be informed that gluster does client-side replication, which means the network bandwidth is split in half for write operations (like write(), creat() etc.), so the number you are getting is almost the maximum with 1GbE. > > > > Regards, > > Amar > > > > On Wed, Jan 23, 2019 at 8:38 PM Strahil Nikolov > wrote: > >> > >> Hello Community, > >> > >> recently I have built a new lab based on oVirt and CentOS 7. > >> During deployment I had some hiccups, but now the engine is up and running - but gluster is causing me trouble. > >> > >> Symptoms: Slow VM install from DVD, poor write performance. 
The latter has been tested via: > >> dd if=/dev/zero of=/rhev/data-center/mnt/glusterSD/ovirt1.localdomain:_data bs=1M count=1000 status=progress > >> > >> The reported speed is 60MB/s which is way too low for my setup. > >> > >> My lab design: > >> https://drive.google.com/file/d/1SiW21ASPXHRAEuE_jZ50R3FoO-NcnFqT/view?usp=sharing > >> Gluster version is 3.12.15 > >> > >> So far I have done: > >> > >> 1. Added 'server.allow-insecure on' (with 'option rpc-auth-allow-insecure on' in glusterd.vol) > >> Volume info after that change: > >> > >> Volume Name: data > >> Type: Replicate > >> Volume ID: 9b06a1e9-8102-4cd7-bc56-84960a1efaa2 > >> Status: Started > >> Snapshot Count: 0 > >> Number of Bricks: 1 x (2 + 1) = 3 > >> Transport-type: tcp > >> Bricks: > >> Brick1: ovirt1.localdomain:/gluster_bricks/data/data > >> Brick2: ovirt2.localdomain:/gluster_bricks/data/data > >> Brick3: ovirt3.localdomain:/gluster_bricks/data/data (arbiter) > >> Options Reconfigured: > >> performance.client-io-threads: off > >> nfs.disable: on > >> transport.address-family: inet > >> performance.quick-read: off > >> performance.read-ahead: off > >> performance.io-cache: off > >> performance.low-prio-threads: 32 > >> network.remote-dio: off > >> cluster.eager-lock: enable > >> cluster.quorum-type: auto > >> cluster.server-quorum-type: server > >> cluster.data-self-heal-algorithm: full > >> cluster.locking-scheme: granular > >> cluster.shd-max-threads: 8 > >> cluster.shd-wait-qlength: 10000 > >> features.shard: on > >> user.cifs: off > >> storage.owner-uid: 36 > >> storage.owner-gid: 36 > >> network.ping-timeout: 30 > >> performance.strict-o-direct: on > >> cluster.granular-entry-heal: enable > >> server.allow-insecure: on > >> > >> Seems no positive or negative effect so far. > >> > >> 2. 
Tested with tmpfs on all bricks -> ovirt1 mounted gluster volume -> max 60MB/s (bs=1M without 'oflag=direct') > >> > >> > >> [root at ovirt1 data]# dd if=/dev/zero of=large_io bs=1M count=4000 status=progress > >> 4177526784 bytes (4.2 GB) copied, 70.843409 s, 59.0 MB/s > >> 4000+0 records in > >> 4000+0 records out > >> 4194304000 bytes (4.2 GB) copied, 71.1407 s, 59.0 MB/s > >> [root at ovirt1 data]# rm -f large_io > >> [root at ovirt1 data]# gluster volume profile data info > >> Brick: ovirt1.localdomain:/gluster_bricks/data/data > >> --------------------------------------------------- > >> Cumulative Stats: > >> Block Size: 131072b+ > >> No. of Reads: 8 > >> No. of Writes: 44968 > >> %-latency Avg-latency Min-Latency Max-Latency No. of calls Fop > >> --------- ----------- ----------- ----------- ------------ ---- > >> 0.00 0.00 us 0.00 us 0.00 us 3 FORGET > >> 0.00 0.00 us 0.00 us 0.00 us 35 RELEASE > >> 0.00 0.00 us 0.00 us 0.00 us 28 RELEASEDIR > >> 0.00 78.00 us 78.00 us 78.00 us 1 FSTAT > >> 0.00 35.67 us 26.00 us 73.00 us 6 FLUSH > >> 0.00 324.00 us 324.00 us 324.00 us 1 XATTROP > >> 0.00 45.80 us 38.00 us 54.00 us 10 STAT > >> 0.00 227.67 us 216.00 us 242.00 us 3 CREATE > >> 0.00 113.38 us 68.00 us 381.00 us 8 READ > >> 0.00 39.82 us 1.00 us 148.00 us 28 OPENDIR > >> 0.00 67.54 us 10.00 us 283.00 us 24 GETXATTR > >> 0.00 59.97 us 45.00 us 113.00 us 32 OPEN > >> 0.00 24.41 us 13.00 us 89.00 us 161 INODELK > >> 0.00 43.43 us 28.00 us 214.00 us 93 STATFS > >> 0.00 246.35 us 11.00 us 1155.00 us 20 READDIR > >> 0.00 283.00 us 233.00 us 353.00 us 18 READDIRP > >> 0.00 153.23 us 122.00 us 259.00 us 87 MKNOD > >> 0.01 99.77 us 10.00 us 258.00 us 442 LOOKUP > >> 0.31 49.22 us 27.00 us 540.00 us 45620 FXATTROP > >> 0.77 124.24 us 87.00 us 604.00 us 44968 WRITE > >> 0.93 15767.71 us 15.00 us 305833.00 us 431 ENTRYLK > >> 1.99 160711.39 us 3332.00 us 406037.00 us 90 UNLINK > >> 96.00 5167.82 us 18.00 us 55972.00 us 135349 FINODELK > >> > >> Duration: 380 seconds > >> 
Data Read: 1048576 bytes > >> Data Written: 5894045696 bytes > >> > >> Interval 0 Stats: > >> Block Size: 131072b+ > >> No. of Reads: 8 > >> No. of Writes: 44968 > >> %-latency Avg-latency Min-Latency Max-Latency No. of calls Fop > >> --------- ----------- ----------- ----------- ------------ ---- > >> 0.00 0.00 us 0.00 us 0.00 us 3 FORGET > >> 0.00 0.00 us 0.00 us 0.00 us 35 RELEASE > >> 0.00 0.00 us 0.00 us 0.00 us 28 RELEASEDIR > >> 0.00 78.00 us 78.00 us 78.00 us 1 FSTAT > >> 0.00 35.67 us 26.00 us 73.00 us 6 FLUSH > >> 0.00 324.00 us 324.00 us 324.00 us 1 XATTROP > >> 0.00 45.80 us 38.00 us 54.00 us 10 STAT > >> 0.00 227.67 us 216.00 us 242.00 us 3 CREATE > >> 0.00 113.38 us 68.00 us 381.00 us 8 READ > >> 0.00 39.82 us 1.00 us 148.00 us 28 OPENDIR > >> 0.00 67.54 us 10.00 us 283.00 us 24 GETXATTR > >> 0.00 59.97 us 45.00 us 113.00 us 32 OPEN > >> 0.00 24.41 us 13.00 us 89.00 us 161 INODELK > >> 0.00 43.43 us 28.00 us 214.00 us 93 STATFS > >> 0.00 246.35 us 11.00 us 1155.00 us 20 READDIR > >> 0.00 283.00 us 233.00 us 353.00 us 18 READDIRP > >> 0.00 153.23 us 122.00 us 259.00 us 87 MKNOD > >> 0.01 99.77 us 10.00 us 258.00 us 442 LOOKUP > >> 0.31 49.22 us 27.00 us 540.00 us 45620 FXATTROP > >> 0.77 124.24 us 87.00 us 604.00 us 44968 WRITE > >> 0.93 15767.71 us 15.00 us 305833.00 us 431 ENTRYLK > >> 1.99 160711.39 us 3332.00 us 406037.00 us 90 UNLINK > >> 96.00 5167.82 us 18.00 us 55972.00 us 135349 FINODELK > >> > >> Duration: 380 seconds > >> Data Read: 1048576 bytes > >> Data Written: 5894045696 bytes > >> > >> Brick: ovirt3.localdomain:/gluster_bricks/data/data > >> --------------------------------------------------- > >> Cumulative Stats: > >> Block Size: 1b+ > >> No. of Reads: 0 > >> No. of Writes: 39328 > >> %-latency Avg-latency Min-Latency Max-Latency No. 
of calls Fop > >> --------- ----------- ----------- ----------- ------------ ---- > >> 0.00 0.00 us 0.00 us 0.00 us 2 FORGET > >> 0.00 0.00 us 0.00 us 0.00 us 12 RELEASE > >> 0.00 0.00 us 0.00 us 0.00 us 17 RELEASEDIR > >> 0.00 101.00 us 101.00 us 101.00 us 1 FSTAT > >> 0.00 51.50 us 20.00 us 81.00 us 4 FLUSH > >> 0.01 219.50 us 188.00 us 251.00 us 2 CREATE > >> 0.01 43.45 us 11.00 us 90.00 us 11 GETXATTR > >> 0.01 62.30 us 38.00 us 119.00 us 10 OPEN > >> 0.01 50.59 us 1.00 us 102.00 us 17 OPENDIR > >> 0.01 24.60 us 12.00 us 64.00 us 40 INODELK > >> 0.02 176.30 us 10.00 us 765.00 us 10 READDIR > >> 0.07 63.08 us 39.00 us 133.00 us 78 UNLINK > >> 0.13 27.35 us 10.00 us 91.00 us 333 ENTRYLK > >> 0.13 126.89 us 99.00 us 179.00 us 76 MKNOD > >> 0.42 116.70 us 8.00 us 8661.00 us 261 LOOKUP > >> 28.73 51.79 us 22.00 us 2574.00 us 39822 FXATTROP > >> 29.52 53.87 us 16.00 us 3290.00 us 39328 WRITE > >> 40.92 24.71 us 10.00 us 3224.00 us 118864 FINODELK > >> > >> Duration: 189 seconds > >> Data Read: 0 bytes > >> Data Written: 39328 bytes > >> > >> Interval 0 Stats: > >> Block Size: 1b+ > >> No. of Reads: 0 > >> No. of Writes: 39328 > >> %-latency Avg-latency Min-Latency Max-Latency No. 
of calls Fop > >> --------- ----------- ----------- ----------- ------------ ---- > >> 0.00 0.00 us 0.00 us 0.00 us 2 FORGET > >> 0.00 0.00 us 0.00 us 0.00 us 12 RELEASE > >> 0.00 0.00 us 0.00 us 0.00 us 17 RELEASEDIR > >> 0.00 101.00 us 101.00 us 101.00 us 1 FSTAT > >> 0.00 51.50 us 20.00 us 81.00 us 4 FLUSH > >> 0.01 219.50 us 188.00 us 251.00 us 2 CREATE > >> 0.01 43.45 us 11.00 us 90.00 us 11 GETXATTR > >> 0.01 62.30 us 38.00 us 119.00 us 10 OPEN > >> 0.01 50.59 us 1.00 us 102.00 us 17 OPENDIR > >> 0.01 24.60 us 12.00 us 64.00 us 40 INODELK > >> 0.02 176.30 us 10.00 us 765.00 us 10 READDIR > >> 0.07 63.08 us 39.00 us 133.00 us 78 UNLINK > >> 0.13 27.35 us 10.00 us 91.00 us 333 ENTRYLK > >> 0.13 126.89 us 99.00 us 179.00 us 76 MKNOD > >> 0.42 116.70 us 8.00 us 8661.00 us 261 LOOKUP > >> 28.73 51.79 us 22.00 us 2574.00 us 39822 FXATTROP > >> 29.52 53.87 us 16.00 us 3290.00 us 39328 WRITE > >> 40.92 24.71 us 10.00 us 3224.00 us 118864 FINODELK > >> > >> Duration: 189 seconds > >> Data Read: 0 bytes > >> Data Written: 39328 bytes > >> > >> Brick: ovirt2.localdomain:/gluster_bricks/data/data > >> --------------------------------------------------- > >> Cumulative Stats: > >> Block Size: 512b+ 131072b+ > >> No. of Reads: 0 0 > >> No. of Writes: 36 76758 > >> %-latency Avg-latency Min-Latency Max-Latency No. 
of calls Fop > >> --------- ----------- ----------- ----------- ------------ ---- > >> 0.00 0.00 us 0.00 us 0.00 us 6 FORGET > >> 0.00 0.00 us 0.00 us 0.00 us 87 RELEASE > >> 0.00 0.00 us 0.00 us 0.00 us 96 RELEASEDIR > >> 0.00 100.50 us 80.00 us 121.00 us 2 REMOVEXATTR > >> 0.00 101.00 us 101.00 us 101.00 us 2 SETXATTR > >> 0.00 36.18 us 22.00 us 62.00 us 11 FLUSH > >> 0.00 57.44 us 42.00 us 77.00 us 9 FTRUNCATE > >> 0.00 82.56 us 59.00 us 138.00 us 9 FSTAT > >> 0.00 89.42 us 67.00 us 161.00 us 12 SETATTR > >> 0.00 272.40 us 235.00 us 296.00 us 5 CREATE > >> 0.01 154.28 us 88.00 us 320.00 us 18 XATTROP > >> 0.01 45.29 us 1.00 us 319.00 us 96 OPENDIR > >> 0.01 86.69 us 30.00 us 379.00 us 62 STAT > >> 0.01 64.30 us 47.00 us 169.00 us 84 OPEN > >> 0.02 107.34 us 23.00 us 273.00 us 73 READDIRP > >> 0.02 4688.00 us 86.00 us 9290.00 us 2 TRUNCATE > >> 0.02 59.29 us 13.00 us 394.00 us 165 GETXATTR > >> 0.03 128.51 us 27.00 us 338.00 us 96 FSYNC > >> 0.03 240.75 us 14.00 us 1943.00 us 52 READDIR > >> 0.04 65.59 us 26.00 us 293.00 us 279 STATFS > >> 0.06 180.77 us 118.00 us 306.00 us 148 MKNOD > >> 0.14 37.98 us 17.00 us 192.00 us 1598 INODELK > >> 0.67 91.68 us 12.00 us 1141.00 us 3186 LOOKUP > >> 10.10 55.92 us 28.00 us 1658.00 us 78608 FXATTROP > >> 11.89 6814.76 us 18.00 us 301246.00 us 760 ENTRYLK > >> 19.44 36.55 us 14.00 us 2353.00 us 231535 FINODELK > >> 25.21 142.92 us 62.00 us 593.00 us 76794 WRITE > >> 32.28 91283.68 us 28.00 us 316658.00 us 154 UNLINK > >> > >> Duration: 1206 seconds > >> Data Read: 0 bytes > >> Data Written: 10060843008 bytes > >> > >> Interval 0 Stats: > >> Block Size: 512b+ 131072b+ > >> No. of Reads: 0 0 > >> No. of Writes: 36 76758 > >> %-latency Avg-latency Min-Latency Max-Latency No. 
of calls Fop > >> --------- ----------- ----------- ----------- ------------ ---- > >> 0.00 0.00 us 0.00 us 0.00 us 6 FORGET > >> 0.00 0.00 us 0.00 us 0.00 us 87 RELEASE > >> 0.00 0.00 us 0.00 us 0.00 us 96 RELEASEDIR > >> 0.00 100.50 us 80.00 us 121.00 us 2 REMOVEXATTR > >> 0.00 101.00 us 101.00 us 101.00 us 2 SETXATTR > >> 0.00 36.18 us 22.00 us 62.00 us 11 FLUSH > >> 0.00 57.44 us 42.00 us 77.00 us 9 FTRUNCATE > >> 0.00 82.56 us 59.00 us 138.00 us 9 FSTAT > >> 0.00 89.42 us 67.00 us 161.00 us 12 SETATTR > >> 0.00 272.40 us 235.00 us 296.00 us 5 CREATE > >> 0.01 154.28 us 88.00 us 320.00 us 18 XATTROP > >> 0.01 45.29 us 1.00 us 319.00 us 96 OPENDIR > >> 0.01 86.69 us 30.00 us 379.00 us 62 STAT > >> 0.01 64.30 us 47.00 us 169.00 us 84 OPEN > >> 0.02 107.34 us 23.00 us 273.00 us 73 READDIRP > >> 0.02 4688.00 us 86.00 us 9290.00 us 2 TRUNCATE > >> 0.02 59.29 us 13.00 us 394.00 us 165 GETXATTR > >> 0.03 128.51 us 27.00 us 338.00 us 96 FSYNC > >> 0.03 240.75 us 14.00 us 1943.00 us 52 READDIR > >> 0.04 65.59 us 26.00 us 293.00 us 279 STATFS > >> 0.06 180.77 us 118.00 us 306.00 us 148 MKNOD > >> 0.14 37.98 us 17.00 us 192.00 us 1598 INODELK > >> 0.67 91.66 us 12.00 us 1141.00 us 3186 LOOKUP > >> 10.10 55.92 us 28.00 us 1658.00 us 78608 FXATTROP > >> 11.89 6814.76 us 18.00 us 301246.00 us 760 ENTRYLK > >> 19.44 36.55 us 14.00 us 2353.00 us 231535 FINODELK > >> 25.21 142.92 us 62.00 us 593.00 us 76794 WRITE > >> 32.28 91283.68 us 28.00 us 316658.00 us 154 UNLINK > >> > >> Duration: 1206 seconds > >> Data Read: 0 bytes > >> Data Written: 10060843008 bytes > >> > >> > >> > >> This indicates to me that it's not a problem in Disk/LVM/FileSystem layout. > >> > >> Most probably I haven't created the volume properly or some option/feature is disabled ?!? 
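The per-fop rows in the profile output above can be ranked mechanically instead of by eye. The following is only a minimal sketch: the function names and the regular expression are illustrative assumptions, not part of any gluster tooling, and it expects one profile row per line (quote markers such as "> >>" are tolerated). The sample rows are copied from the ovirt1 brick statistics above.

```python
import re

# One row of `gluster volume profile <vol> info`, e.g. (from the ovirt1 brick above):
#   96.00  5167.82 us  18.00 us  55972.00 us  135349  FINODELK
# Leading mail quote markers ("> >>") and whitespace are skipped.
ROW = re.compile(
    r"^[>\s]*(?P<pct>\d+\.\d+)\s+"
    r"(?P<avg>\d+\.\d+) us\s+"
    r"(?P<min>\d+\.\d+) us\s+"
    r"(?P<max>\d+\.\d+) us\s+"
    r"(?P<calls>\d+)\s+"
    r"(?P<fop>[A-Z]+)\s*$"
)


def parse_profile_rows(text):
    """Return one dict per recognised profile row; other lines are ignored."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            rows.append({
                "pct": float(match.group("pct")),
                "avg_us": float(match.group("avg")),
                "min_us": float(match.group("min")),
                "max_us": float(match.group("max")),
                "calls": int(match.group("calls")),
                "fop": match.group("fop"),
            })
    return rows


def top_fops(rows, n=3):
    """Return the n rows with the largest share of total latency."""
    return sorted(rows, key=lambda row: row["pct"], reverse=True)[:n]


# Hypothetical excerpt: five rows taken from the ovirt1 statistics above.
SAMPLE = """\
 0.31      49.22 us      27.00 us     540.00 us   45620  FXATTROP
 0.77     124.24 us      87.00 us     604.00 us   44968  WRITE
 0.93   15767.71 us      15.00 us  305833.00 us     431  ENTRYLK
 1.99  160711.39 us    3332.00 us  406037.00 us      90  UNLINK
96.00    5167.82 us      18.00 us   55972.00 us  135349  FINODELK
"""

if __name__ == "__main__":
    for row in top_fops(parse_profile_rows(SAMPLE)):
        print(row["fop"], row["pct"], row["calls"])
```

On the ovirt1 rows this ranks FINODELK first, which matches the 96.00 in the %-latency column; whether that points to a volume option or to something else is not something the sketch can decide.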
> >> Network shows OK for a gigabit:
> >> [root at ovirt1 data]# dd if=/dev/zero status=progress | nc ovirt2 9999
> >> 3569227264 bytes (3.6 GB) copied, 29.001052 s, 123 MB/s^C
> >> 7180980+0 records in
> >> 7180979+0 records out
> >> 3676661248 bytes (3.7 GB) copied, 29.8739 s, 123 MB/s
> >>
> >> I'm looking for any help... you can share your volume info also.
> >>
> >> Thanks in advance.
> >>
> >> Best Regards,
> >> Strahil Nikolov
> >
> > --
> > Amar Tumballi (amarts)

From gilberto.nunes32 at gmail.com Thu Jan 24 17:20:39 2019
From: gilberto.nunes32 at gmail.com (Gilberto Nunes)
Date: Thu, 24 Jan 2019 15:20:39 -0200
Subject: [Gluster-users] Access to Servers hangs after stop one server...
In-Reply-To:
References:
Message-ID:

I just want to let you know, guys, that I was able to create a 2-server HA setup with the command below:

gluster vol create Vol01 replica 2 transport tcp server1:/data/storage server2:/data/storage

(so here it works in replicated mode)

After that, I put this into fstab:

server1:/Vol01 /mnt glusterfs defaults,_netdev,backupvolfile-server=server2 0 0

Then I shut down server1 and, after a few seconds, the /mnt mount point worked fine. I could create other files in it, and after server1 came back online, the files created were replicated from server2 to server1.

Everything works as expected!
Thanks a lot

---
Gilberto Nunes Ferreira

(47) 3025-5907
(47) 99676-7530 - Whatsapp / Telegram

Skype: gilberto.nunes36

On Thu, Jan 24, 2019 at 12:04, Scott Worthington <scott.c.worthington at gmail.com> wrote:

> Amar,
>
> Is this documentation relevant for Diego?
>
> https://gluster.readthedocs.io/en/latest/Administrator%20Guide/Setting%20Up%20Clients/#manual-mount
>
> "If backupvolfile-server option is added while mounting fuse client, when
> the first volfile server fails, then the server specified in
> backupvolfile-server option is used as volfile server to mount the
> client."
>
> Or is there 'better' documentation?
>
> On Thu, Jan 24, 2019 at 8:51 AM Amar Tumballi Suryanarayan <atumball at redhat.com> wrote:
>
>> Also note that this way of mounting with a 'static' volfile is not
>> recommended, as you wouldn't get any features out of gluster's Software
>> Defined Storage behavior.
>>
>> This was the approach we used, say, 8 years ago. With the introduction
>> of the management daemon called glusterd, the way of dealing with
>> volfiles has changed, and they are created with the gluster CLI.
>>
>> About having /etc/fstab not hang when a server is down, search for the
>> 'backup-volfile-server' option with glusterfs; that is what should be used.
>>
>> Regards,
>> Amar
>>
>> On Thu, Jan 24, 2019 at 7:17 PM Diego Remolina wrote:
>>
>>> Show us output of:
>>>
>>> gluster v status
>>>
>>> Have you configured firewall rules properly for all ports being used?
>>>
>>> Diego
>>>
>>> On Thu, Jan 24, 2019 at 8:44 AM Gilberto Nunes <gilberto.nunes32 at gmail.com> wrote:
>>>
>>>> >I think your mount statement in /etc/fstab is only referencing ONE of
>>>> >the gluster servers.
>>>> >
>>>> >Please take a look at "More redundant mount" section:
>>>> >
>>>> >https://www.jamescoyle.net/how-to/439-mount-a-glusterfs-volume
>>>> >
>>>> >Then try taking down one of the gluster servers and report back
>>>> >results.
>>>>
>>>> Guys!
>>>> I have followed the very same instructions found on James's website.
>>>> One of the methods he mentions on that website is to create a file in the
>>>> /etc/glusterfs directory, named datastore.vol, for instance, with this
>>>> content:
>>>>
>>>> volume remote1
>>>>   type protocol/client
>>>>   option transport-type tcp
>>>>   option remote-host server1
>>>>   option remote-subvolume /data/storage
>>>> end-volume
>>>>
>>>> volume remote2
>>>>   type protocol/client
>>>>   option transport-type tcp
>>>>   option remote-host server2
>>>>   option remote-subvolume /data/storage
>>>> end-volume
>>>>
>>>> volume remote3
>>>>   type protocol/client
>>>>   option transport-type tcp
>>>>   option remote-host server3
>>>>   option remote-subvolume /data/storage
>>>> end-volume
>>>>
>>>> volume replicate
>>>>   type cluster/replicate
>>>>   subvolumes remote1 remote2 remote3
>>>> end-volume
>>>>
>>>> volume writebehind
>>>>   type performance/write-behind
>>>>   option window-size 1MB
>>>>   subvolumes replicate
>>>> end-volume
>>>>
>>>> volume cache
>>>>   type performance/io-cache
>>>>   option cache-size 512MB
>>>>   subvolumes writebehind
>>>> end-volume
>>>>
>>>> and then include this line in fstab:
>>>>
>>>> /etc/glusterfs/datastore.vol [MOUNT] glusterfs rw,allow_other,default_permissions,max_read=131072 0 0
>>>>
>>>> What am I doing wrong?
>>>>
>>>> Thanks
>>>>
>>>> ---
>>>> Gilberto Nunes Ferreira
>>>>
>>>> (47) 3025-5907
>>>> (47) 99676-7530 - Whatsapp / Telegram
>>>>
>>>> Skype: gilberto.nunes36
>>>>
>>>> On Thu, Jan 24, 2019 at 11:27, Scott Worthington <scott.c.worthington at gmail.com> wrote:
>>>>
>>>>> I think your mount statement in /etc/fstab is only referencing ONE of
>>>>> the gluster servers.
>>>>>
>>>>> Please take a look at "More redundant mount" section:
>>>>>
>>>>> https://www.jamescoyle.net/how-to/439-mount-a-glusterfs-volume
>>>>>
>>>>> Then try taking down one of the gluster servers and report back
>>>>> results.
>>>>>
>>>>> On Thu, Jan 24, 2019 at 8:24 AM Gilberto Nunes <gilberto.nunes32 at gmail.com> wrote:
>>>>>
>>>>>> Yep!
>>>>>> But as I mentioned in my previous e-mail, even with 3 or 4 servers
>>>>>> this issue occurs.
>>>>>> I don't know what is happening.
>>>>>>
>>>>>> ---
>>>>>> Gilberto Nunes Ferreira
>>>>>>
>>>>>> (47) 3025-5907
>>>>>> (47) 99676-7530 - Whatsapp / Telegram
>>>>>>
>>>>>> Skype: gilberto.nunes36
>>>>>>
>>>>>> On Thu, Jan 24, 2019 at 10:43, Diego Remolina <dijuremo at gmail.com> wrote:
>>>>>>
>>>>>>> Glusterfs needs quorum, so if you have two servers and one goes
>>>>>>> down, there is no quorum, so all writes stop until the server comes back
>>>>>>> up. You can add a third server as an arbiter which does not store data in
>>>>>>> the bricks, but still uses some minimal space (to keep metadata for the
>>>>>>> files).
>>>>>>>
>>>>>>> HTH,
>>>>>>>
>>>>>>> Diego
>>>>>>>
>>>>>>> On Wed, Jan 23, 2019 at 3:06 PM Gilberto Nunes <gilberto.nunes32 at gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi there...
>>>>>>>>
>>>>>>>> I have set up two servers as a replica, like this:
>>>>>>>>
>>>>>>>> gluster vol create Vol01 server1:/data/storage server2:/data/storage
>>>>>>>>
>>>>>>>> Then I created a config file on the client, like this:
>>>>>>>>
>>>>>>>> volume remote1
>>>>>>>>   type protocol/client
>>>>>>>>   option transport-type tcp
>>>>>>>>   option remote-host server1
>>>>>>>>   option remote-subvolume /data/storage
>>>>>>>> end-volume
>>>>>>>>
>>>>>>>> volume remote2
>>>>>>>>   type protocol/client
>>>>>>>>   option transport-type tcp
>>>>>>>>   option remote-host server2
>>>>>>>>   option remote-subvolume /data/storage
>>>>>>>> end-volume
>>>>>>>>
>>>>>>>> volume replicate
>>>>>>>>   type cluster/replicate
>>>>>>>>   subvolumes remote1 remote2
>>>>>>>> end-volume
>>>>>>>>
>>>>>>>> volume writebehind
>>>>>>>>   type performance/write-behind
>>>>>>>>   option window-size 1MB
>>>>>>>>   subvolumes replicate
>>>>>>>> end-volume
>>>>>>>>
>>>>>>>> volume cache
>>>>>>>>   type performance/io-cache
>>>>>>>>   option cache-size 512MB
>>>>>>>>   subvolumes writebehind
>>>>>>>> end-volume
>>>>>>>>
>>>>>>>> And added this line to /etc/fstab:
>>>>>>>>
>>>>>>>> /etc/glusterfs/datastore.vol /mnt glusterfs defaults,_netdev 0 0
>>>>>>>>
>>>>>>>> After mounting /mnt, I can access the servers. So far so good!
>>>>>>>> But when I made server1 crash, I was unable to access /mnt or even
>>>>>>>> use
>>>>>>>> gluster vol status
>>>>>>>> on server2.
>>>>>>>>
>>>>>>>> Everything hangs!
>>>>>>>>
>>>>>>>> I have tried with replicated, distributed and
>>>>>>>> replicated-distributed too.
>>>>>>>> I am using Debian Stretch, with the gluster package installed via apt,
>>>>>>>> provided by the standard Debian repo, glusterfs-server 3.8.8-1.
>>>>>>>>
>>>>>>>> I am sorry if this is a newbie question, but isn't a glusterfs share
>>>>>>>> supposed to stay online if one server goes down?
>>>>>>>>
>>>>>>>> Any advice will be welcome.
>>>>>>>>
>>>>>>>> Best
>>>>>>>>
>>>>>>>> ---
>>>>>>>> Gilberto Nunes Ferreira
>>>>>>>>
>>>>>>>> (47) 3025-5907
>>>>>>>> (47) 99676-7530 - Whatsapp / Telegram
>>>>>>>>
>>>>>>>> Skype: gilberto.nunes36
>>
>> --
>> Amar Tumballi (amarts)

From mrajanna at redhat.com Thu Jan 24 17:21:45 2019
From: mrajanna at redhat.com (Madhu Rajanna)
Date: Thu, 24 Jan 2019 22:51:45 +0530
Subject: [Gluster-users] Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: server busy
In-Reply-To:
References:
Message-ID:

Adding John, who has a better idea of how to debug this one.

@Shaik Salam can you share some more info on the hardware on which you are running heketi (kernel details)?

On Thu, Jan 24, 2019 at 7:42 PM Shaik Salam wrote:

> Hi Madhu,
>
> Sorry to disturb you. Could you please provide at least a workaround (to clear
> the requests which are stuck) so we can move further.
> We are also not able to find the root cause from the glusterd logs. Please find
> the attachment.
>
> BR
> Salam
>
> From: Shaik Salam/HYD/TCS
> To: "Madhu Rajanna"
> Cc: "gluster-users at gluster.org List", "Michael Adam"
> Date: 01/24/2019 04:12 PM
> Subject: Re: Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: server busy
> ------------------------------
>
> Hi Madhu,
>
> Please let me know if any other information is required.
>
> BR
> Salam
>
> From: Shaik Salam/HYD/TCS
> To: "Madhu Rajanna"
> Cc: "gluster-users at gluster.org List", "Michael Adam"
> Date: 01/24/2019 03:23 PM
> Subject: Re: Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: server busy
> ------------------------------
>
> Hi Madhu,
>
> This is the complete log after a restart of the heketi pod, plus the process list.
>
> BR
> Salam
>
> [attachment "heketi-pod-complete.log" deleted by Shaik Salam/HYD/TCS]
> [attachment "ps-aux.txt" deleted by Shaik Salam/HYD/TCS]
>
> From: "Madhu Rajanna"
> To: "Shaik Salam"
> Cc: "gluster-users at gluster.org List", "Michael Adam"
> Date: 01/24/2019 01:55 PM
> Subject: Re: Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: server busy
> ------------------------------
>
> "External email. Open with Caution"
> The logs you provided are not complete, so I am not able to figure out which
> command is stuck. Can you reattach the complete output of `ps aux` and
> also attach the complete heketi logs?
>
> On Thu, Jan 24, 2019 at 1:41 PM Shaik Salam <shaik.salam at tcs.com> wrote:
> Hi Madhu,
>
> Please find requested info.
>
> BR
> Salam
>
> From: Madhu Rajanna <mrajanna at redhat.com>
> To: Shaik Salam <shaik.salam at tcs.com>
> Cc: "gluster-users at gluster.org List" <gluster-users at gluster.org>, Michael Adam <madam at redhat.com>
> Date: 01/24/2019 01:33 PM
> Subject: Re: Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: server busy
> ------------------------------
>
> "External email. Open with Caution"
> The heketi logs you have attached are not complete, I believe. Can you
> provide the complete heketi logs?
> Also, can we get the output of "ps aux" from the gluster pods? I want
> to see if any lvm commands or gluster commands are "stuck".
>
> On Thu, Jan 24, 2019 at 1:16 PM Shaik Salam <shaik.salam at tcs.com> wrote:
> Hi Madhu.
>
> I restarted the heketi pod many times, but the issue is not resolved.
>
> sh-4.4# heketi-cli server operations info
> Operation Counts:
>   Total: 0
>   In-Flight: 0
>   New: 0
>   Stale: 0
>
> Now you can see all operation counts are zero. When I try to create a single
> volume, the in-flight count slowly climbs to 8, as observed below.
>
> sh-4.4# heketi-cli server operations infoCLI_SERVER=http://localhost:8080 ; export HEKETI_CLI_USE
> Operation Counts:
>   Total: 0
>   In-Flight: 6
>   New: 0
>   Stale: 0
> sh-4.4# heketi-cli server operations info
> Operation Counts:
>   Total: 0
>   In-Flight: 7
>   New: 0
>   Stale: 0
> sh-4.4# heketi-cli server operations info
> Operation Counts:
>   Total: 0
>   In-Flight: 7
>   New: 0
>   Stale: 0
> sh-4.4# heketi-cli server operations info
> Operation Counts:
>   Total: 0
>   In-Flight: 7
>   New: 0
>   Stale: 0
> sh-4.4# heketi-cli server operations info
> Operation Counts:
>   Total: 0
>   In-Flight: 7
>   New: 0
>   Stale: 0
>
> [negroni] Completed 200 OK in 186.286µs
> [negroni] Started POST /volumes
> [negroni] Started GET /operations
> [negroni] Completed 200 OK in 166.294µs
> [negroni] Started GET /operations
> [negroni] Completed 200 OK in 186.411µs
> [negroni] Started GET /operations
> [negroni] Completed 200 OK in 179.796µs
> [negroni] Started POST /volumes
> [negroni] Started POST /volumes
> [negroni] Started POST /volumes
> [negroni] Started POST /volumes
> [negroni] Started GET /operations
> [negroni] Completed 200 OK in 131.108µs
> [negroni] Started POST /volumes
> [negroni] Started GET /operations
> [negroni] Completed 200 OK in 111.392µs
> [negroni] Started GET /operations
> [negroni] Completed 200 OK in 265.023µs
> [negroni] Started GET /operations
> [negroni] Completed 200 OK in 179.364µs
> [negroni] Started GET /operations
> [negroni] Completed 200 OK in 295.058µs
> [negroni] Started GET /operations
> [negroni] Completed 200 OK in 146.857µs
> [negroni] Started POST /volumes
> [negroni] Started POST /volumes
> [heketi] WARNING 2019/01/24 07:43:36 operations in-flight (8) exceeds limit (8)
> [negroni] Completed 429 Too Many Requests in 403.166µs
> [negroni] Started POST /volumes
> [heketi] WARNING 2019/01/24 07:43:51 operations in-flight (8) exceeds limit (8)
> [negroni] Completed 429 Too Many Requests in 193.554µs
>
> But the volume for the pod is not being created.
> 1:15:36 PM
> Warning
> Provisioning failed Failed to provision volume with StorageClass
> "glusterfs-storage": glusterfs: create volume err: error creating volume
> Server busy. Retry operation later..
> 9 times in the last 2 minutes
> 1:13:21 PM
> Warning
> Provisioning failed Failed to provision volume with StorageClass
> "glusterfs-storage": glusterfs: create volume err: error creating volume .
> 8 times in the last
>
> From: "Madhu Rajanna" <mrajanna at redhat.com>
> To: "Shaik Salam" <shaik.salam at tcs.com>
> Cc: "gluster-users at gluster.org List" <gluster-users at gluster.org>, "Michael Adam" <madam at redhat.com>
> Date: 01/24/2019 12:51 PM
> Subject: Re: Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: server busy
> ------------------------------
>
> "External email. Open with Caution"
> Hi Shaik,
>
> Can you provide me the output of $heketi-cli server operations info
> from the heketi pod?
>
> As a workaround you can try restarting the heketi pod. This will cause the
> current operations to go stale, but other pending PVCs may go to Bound
> state.
>
> Regards,
>
> Madhu R
>
> On Thu, Jan 24, 2019 at 12:36 PM Shaik Salam <shaik.salam at tcs.com> wrote:
> Hi Madhu,
>
> Could you please have a look at my issue if you have time (at least a workaround).
> I am unable to send mail to "John Mulligan" <John_Mulligan at redhat.com>,
> who is currently handling the issue
> https://bugzilla.redhat.com/show_bug.cgi?id=1636912
>
> BR
> Salam
>
> From: Shaik Salam/HYD/TCS
> To: "John Mulligan" <John_Mulligan at redhat.com>, "Michael Adam" <madam at redhat.com>, "Madhu Rajanna" <mrajanna at redhat.com>
> Cc: "gluster-users at gluster.org List" <gluster-users at gluster.org>
> Date: 01/24/2019 12:21 PM
> Subject: Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: server busy
> ------------------------------
>
> Hi All,
>
> We are also facing the following issue on OpenShift Origin while we are
> creating PVCs for pods. (Please provide at least a workaround so we can move further.)
>
> Failed to provision volume with StorageClass "glusterfs-storage":
> glusterfs: create volume err: error creating volume
> Failed to provision volume with StorageClass "glusterfs-storage":
> glusterfs: create volume err: error creating volume Server busy. Retry
> operation later..
>
> Please find heketidb dump and log
>
> [negroni] Completed 429 Too Many Requests in 250.763µs
> [negroni] Started POST /volumes
> [heketi] WARNING 2019/01/23 12:07:49 operations in-flight (8) exceeds limit (8)
> [negroni] Completed 429 Too Many Requests in 169.08µs
> [negroni] Started DELETE /volumes/520bc5f4e1bfd029855a72f9ca7ebf6c
> [negroni] Completed 404 Not Found in 148.125µs
> [negroni] Started POST /volumes
> [heketi] WARNING 2019/01/23 12:08:04 operations in-flight (8) exceeds limit (8)
> [negroni] Completed 429 Too Many Requests in 496.624µs
> [negroni] Started POST /volumes
> [heketi] WARNING 2019/01/23 12:08:04 operations in-flight (8) exceeds limit (8)
> [negroni] Completed 429 Too Many Requests in 101.673µs
> [negroni] Started POST /volumes
> [heketi] WARNING 2019/01/23 12:08:19 operations in-flight (8) exceeds limit (8)
> [negroni] Completed 429 Too Many Requests in 209.681µs
> [negroni] Started POST /volumes
> [heketi] WARNING 2019/01/23 12:08:19 operations in-flight (8) exceeds limit (8)
> [negroni] Completed 429 Too Many Requests in 103.595µs
> [negroni] Started POST /volumes
> [heketi] WARNING 2019/01/23 12:08:34 operations in-flight (8) exceeds limit (8)
> [negroni] Completed 429 Too Many Requests in 297.594µs
> [negroni] Started POST /volumes
> [heketi] WARNING 2019/01/23 12:08:34 operations in-flight (8) exceeds limit (8)
> [negroni] Completed 429 Too Many Requests in 96.75µs
> [negroni] Started POST /volumes
> [negroni] Started POST /volumes
> [heketi] WARNING 2019/01/23 12:08:49 operations in-flight (8) exceeds limit (8)
> [negroni] Completed 429 Too Many Requests in 477.007µs
> [heketi] WARNING 2019/01/23 12:08:49 operations in-flight (8) exceeds limit (8)
> [negroni] Completed 429 Too Many Requests in 165.38µs
> [negroni] Started POST /volumes
> [heketi] WARNING 2019/01/23 12:09:04 operations in-flight (8) exceeds limit (8)
> [negroni] Completed 429 Too Many Requests in 488.253µs
> [negroni] Started POST /volumes
> [heketi] WARNING 2019/01/23 12:09:04 operations in-flight (8) exceeds limit (8)
> [negroni] Completed 429 Too Many Requests in 171.836µs
> [negroni] Started POST /volumes
> [heketi] WARNING 2019/01/23 12:09:19 operations in-flight (8) exceeds limit (8)
> [negroni] Completed 429 Too Many Requests in 208.59µs
> [negroni] Started POST /volumes
> [heketi] WARNING 2019/01/23 12:09:19 operations in-flight (8) exceeds limit (8)
> [negroni] Completed 429 Too Many Requests in 125.141µs
> [negroni] Started DELETE /volumes/99e87ecd0a816ac34ae5a04eabc1d606
> [negroni] Completed 404 Not Found in 138.687µs
> [negroni] Started POST /volumes
>
> BR
> Salam
> =====-----=====-----=====
> Notice: The information contained in this e-mail
> message and/or attachments to it may contain
> confidential or privileged information. If you are
> not the intended recipient, any dissemination, use,
> review, distribution, printing or copying of the
> information contained in this e-mail message
> and/or attachments to it are strictly prohibited. If
> you have received this communication in error,
> please notify us by reply e-mail or telephone and
> immediately and permanently delete the message
> and any attachments. Thank you
>
> --
> Madhu Rajanna
> Software Engineer
> Red Hat Bangalore, India
> mrajanna at redhat.com M: +91-9741133155

--
Madhu Rajanna
Software Engineer
Red Hat Bangalore, India
mrajanna at redhat.com M: +91-9741133155
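The `heketi-cli server operations info` output quoted in this thread is simple enough to check from a script while waiting for the in-flight count to drain. The following is a minimal sketch, not heketi tooling: the function names are hypothetical, and the default limit of 8 is taken only from the "operations in-flight (8) exceeds limit (8)" warnings in the log above, so it may differ on other deployments.

```python
import re

# Matches lines such as "  In-Flight: 7"; the "Operation Counts:" header
# carries no number and is therefore skipped.
COUNT_LINE = re.compile(r"^\s*([A-Za-z-]+):\s*(\d+)\s*$")


def parse_operation_counts(text):
    """Turn the 'Operation Counts' block into a {name: count} dict."""
    counts = {}
    for line in text.splitlines():
        match = COUNT_LINE.match(line)
        if match:
            counts[match.group(1)] = int(match.group(2))
    return counts


def at_limit(counts, limit=8):
    """True when the in-flight count has reached the (assumed) limit."""
    return counts.get("In-Flight", 0) >= limit


# Sample copied from the thread above.
SAMPLE = """\
Operation Counts:
  Total: 0
  In-Flight: 7
  New: 0
  Stale: 0
"""

if __name__ == "__main__":
    counts = parse_operation_counts(SAMPLE)
    print(counts, "throttled" if at_limit(counts) else "accepting requests")
```

With the sample from the thread, the in-flight count is 7, so one more stuck request would be enough to make every further POST /volumes return 429.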
From meira at cesup.ufrgs.br Thu Jan 24 17:47:38 2019
From: meira at cesup.ufrgs.br (Lindolfo Meira)
Date: Thu, 24 Jan 2019 15:47:38 -0200 (-02)
Subject: [Gluster-users] Can't write to volume using vim/nano
In-Reply-To:
References: <959c36a43c4353b869fd40468b6b95ab17c143b3.camel@gmail.com>
Message-ID:

It does indeed look like a bug in the RDMA protocol implementation. I tested TCP and it works fine. It's a huge bummer for me because my network links work pretty solidly at 50Gb/s in RDMA, while IPoIB gives me (in the best case) less than 30Gb/s :/

Also, I couldn't create a volume with transport mode tcp,rdma. The logs aren't very helpful: they just say "failed to create volfile" and "could not generate gfproxy client volfiles". If I create a TCP volume I can later change it to tcp,rdma, but it's not cleanly achieved, and the resulting volume doesn't work (one might be able to mount it, but writes always fail).

Lindolfo Meira, MSc
General Director, Centro Nacional de Supercomputação
Universidade Federal do Rio Grande do Sul
+55 (51) 3308-3139

On Thu, 24 Jan 2019, Jim Kinney wrote:

> I have rdma capability. Will test and report back. I'm still on v 3.12.
>
> On January 24, 2019 12:54:26 AM EST, Amar Tumballi Suryanarayan wrote:
> >I suspect this is a bug with the 'Transport: rdma' part. We have called out
> >for de-scoping that feature as we are lacking experts in that domain right
> >now. I recommend you use the IPoIB option, and use the tcp/socket transport
> >type (which is the default). That should mostly fix all the issues.
> >
> >-Amar
> >
> >On Thu, Jan 24, 2019 at 5:31 AM Jim Kinney wrote:
> >
> >> That really sounds like a bug with the sharding. I'm not using sharding on
> >> my setup and files are writeable (vim) with 2 bytes and no errors occur.
> >> Perhaps the small size is cached until it's large enough to trigger a write
> >>
> >> On Wed, 2019-01-23 at 21:46 -0200, Lindolfo Meira wrote:
> >>
> >> Also I noticed that any subsequent write (after the first write with 340
> >> bytes or more), regardless of the size, will work as expected.
> >>
> >> Lindolfo Meira, MSc
> >> General Director, Centro Nacional de Supercomputação
> >> Universidade Federal do Rio Grande do Sul
> >> +55 (51) 3308-3139
> >>
> >> On Wed, 23 Jan 2019, Lindolfo Meira wrote:
> >>
> >> Just checked: when the write is >= 340 bytes, everything works as
> >> supposed. If the write is smaller, the error takes place. And when it
> >> does, nothing is logged on the server. The client, however, logs the
> >> following:
> >>
> >> [2019-01-23 23:28:54.554664] W [MSGID: 103046] [rdma.c:3502:gf_rdma_decode_header] 0-rpc-transport/rdma: received a msg of type RDMA_ERROR
> >>
> >> [2019-01-23 23:28:54.554728] W [MSGID: 103046] [rdma.c:3939:gf_rdma_process_recv] 0-rpc-transport/rdma: peer (172.24.1.6:49152), couldn't encode or decode the msg properly or write chunks were not provided for replies that were bigger than RDMA_INLINE_THRESHOLD (2048)
> >>
> >> [2019-01-23 23:28:54.554765] W [MSGID: 114031] [client-rpc-fops_v2.c:680:client4_0_writev_cbk] 0-gfs-client-5: remote operation failed [Transport endpoint is not connected]
> >>
> >> [2019-01-23 23:28:54.554850] W [fuse-bridge.c:1436:fuse_err_cbk] 0-glusterfs-fuse: 1723199: FLUSH() ERR => -1 (Transport endpoint is not connected)
> >>
> >> Lindolfo Meira, MSc
> >> General Director, Centro Nacional de Supercomputação
> >> Universidade Federal do Rio Grande do Sul
> >> +55 (51) 3308-3139
> >>
> >> On Wed, 23 Jan 2019, Lindolfo Meira wrote:
> >>
> >> Hi Jim.
Thanks for taking the time. > >> > >> > >> Sorry I didn't express myself properly. It's not a simple matter of > >> > >> permissions. Users can write to the volume alright. It's when vim and > >nano > >> > >> are used, or when small file writes are performed (by cat or echo), > >that > >> > >> it doesn't work. The file is updated with the write in the server, > >but it > >> > >> shows up as empty in the client. > >> > >> > >> I guess it has something to do with the size of the write, because I > >ran a > >> > >> test writing to a file one byte at a time, and it never showed up as > >> > >> having any content in the client (although in the server it kept > >growing > >> > >> accordingly). > >> > >> > >> I should point out that I'm using a sharded volume. But when I was > >testing > >> > >> a striped volume, it also happened. Output of "gluster volume info" > >> > >> follows bellow: > >> > >> > >> Volume Name: gfs > >> > >> Type: Distribute > >> > >> Volume ID: b5ef065f-1ba2-481f-8108-e8f6d2d3f036 > >> > >> Status: Started > >> > >> Snapshot Count: 0 > >> > >> Number of Bricks: 6 > >> > >> Transport-type: rdma > >> > >> Bricks: > >> > >> Brick1: pfs01-ib:/mnt/data > >> > >> Brick2: pfs02-ib:/mnt/data > >> > >> Brick3: pfs03-ib:/mnt/data > >> > >> Brick4: pfs04-ib:/mnt/data > >> > >> Brick5: pfs05-ib:/mnt/data > >> > >> Brick6: pfs06-ib:/mnt/data > >> > >> Options Reconfigured: > >> > >> nfs.disable: on > >> > >> features.shard: on > >> > >> > >> > >> > >> Lindolfo Meira, MSc > >> > >> Diretor Geral, Centro Nacional de Supercomputa??o > >> > >> Universidade Federal do Rio Grande do Sul > >> > >> +55 (51) 3308-3139 > >> > >> > >> On Wed, 23 Jan 2019, Jim Kinney wrote: > >> > >> > >> Check permissions on the mount. I have multiple dozens of systems > >> > >> mounting 18 "exports" using fuse and it works for multiple user > >> > >> read/write based on user access permissions to the mount point space. 
> >> > >> /home is mounted for 150+ users plus another dozen+ lab storage > >spaces. > >> > >> I do manage user access with freeIPA across all systems to keep > >things > >> > >> consistent. > >> > >> On Wed, 2019-01-23 at 19:31 -0200, Lindolfo Meira wrote: > >> > >> Am I missing something here? A mere write operation, using vim or > >> > >> nano, cannot be performed on a gluster volume mounted over fuse! What > >> > >> gives? > >> > >> Lindolfo Meira, MScDiretor Geral, Centro Nacional de > >> > >> Supercomputa??oUniversidade Federal do Rio Grande do Sul+55 (51) > >> > >> 3308-3139_______________________________________________Gluster-users > >> > >> mailing > >> > >> listGluster-users at gluster.org > >> > >> > >> https://lists.gluster.org/mailman/listinfo/gluster-users > >> > >> > >> -- > >> > >> James P. Kinney III > >> > >> > >> Every time you stop a school, you will have to build a jail. What you > >> > >> gain at one end you lose at the other. It's like feeding a dog on his > >> > >> own tail. It won't fatten the dog. > >> > >> - Speech 11/23/1900 Mark Twain > >> > >> > >> http://heretothereideas.blogspot.com/ > >> > >> > >> > >> -- > >> > >> James P. Kinney III Every time you stop a school, you will have to > >build a > >> jail. What you gain at one end you lose at the other. It's like > >feeding a > >> dog on his own tail. It won't fatten the dog. - Speech 11/23/1900 > >Mark > >> Twain http://heretothereideas.blogspot.com/ > >> > >> _______________________________________________ > >> Gluster-users mailing list > >> Gluster-users at gluster.org > >> https://lists.gluster.org/mailman/listinfo/gluster-users > > > > > > > >-- > >Amar Tumballi (amarts) > > -- > Sent from my Android device with K-9 Mail. All tyopes are thumb related and reflect authenticity. 
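
A small probe along the following lines may help anyone trying to reproduce the size threshold reported in this thread (writes below 340 bytes failing on the rdma-transport volume, larger ones succeeding). It is only a sketch under stated assumptions: the function name `probe_small_writes`, the probe file names and the list of sizes are illustrative and come from neither the thread nor Gluster itself. It writes one file per size into a directory, reads it back from the same client, and records whether the content survived. The demo at the bottom runs against a temporary directory; pointing it at a FUSE mount of the affected volume is left to the reader.

```python
import contextlib
import os
import tempfile


def probe_small_writes(directory, sizes):
    """Write one file per size in `directory`, read it back, report the outcome.

    Returns a dict mapping each size to "ok", "read back N bytes", or
    "error: <reason>" when the write, flush, or read raised OSError.
    """
    results = {}
    for size in sizes:
        # Hypothetical file naming, chosen only for this sketch.
        path = os.path.join(directory, "probe_%d.bin" % size)
        try:
            with open(path, "wb") as fh:
                fh.write(b"x" * size)
                fh.flush()
                os.fsync(fh.fileno())  # push the data out before reading back
            with open(path, "rb") as fh:
                seen = len(fh.read())
            results[size] = "ok" if seen == size else "read back %d bytes" % seen
        except OSError as exc:
            # e.g. "Transport endpoint is not connected" on a failing mount
            results[size] = "error: %s" % exc.strerror
        finally:
            # Best-effort cleanup; ignore failures on a broken mount.
            with contextlib.suppress(OSError):
                os.remove(path)
    return results


if __name__ == "__main__":
    # A temporary local directory stands in for the mount point here.
    with tempfile.TemporaryDirectory() as target:
        for size, outcome in probe_small_writes(target, [1, 2, 339, 340, 2048]).items():
            print(size, outcome)
```

On a healthy filesystem every size should come back as "ok". On a mount showing the behaviour described above, one would expect the sizes below the threshold to report an error or a short read, which narrows down where the boundary sits.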
From thing.thing at gmail.com Fri Jan 25 01:11:58 2019
From: thing.thing at gmail.com (Thing)
Date: Fri, 25 Jan 2019 14:11:58 +1300
Subject: [Gluster-users] gluster 5, Centos 7.6 and nfs
Message-ID:

Hi,

I am trying to get the above combo working, but I can't get NFS from a RHEL 7.6 client to work. I keep getting,

"[root at vuwunicodsktop2 ~]# showmount -e 10.180.48.186
clnt_create: RPC: Program not registered
[root at vuwunicodsktop2 ~]#"

nmap and verbose mounting shows,

=======
[root at vuwunicodsktop2 ~]# nmap centos76-001

Starting Nmap 6.40 ( http://nmap.org ) at 2019-01-25 14:03 NZDT
Nmap scan report for centos76-001 (10.180.48.186)
Host is up (0.00014s latency).
Not shown: 986 filtered ports
PORT      STATE  SERVICE
22/tcp    open   ssh
111/tcp   open   rpcbind
139/tcp   closed netbios-ssn
445/tcp   closed microsoft-ds
2049/tcp  closed nfs
49152/tcp closed unknown
49153/tcp closed unknown
49154/tcp closed unknown
49155/tcp closed unknown
49156/tcp closed unknown
49157/tcp closed unknown
49158/tcp closed unknown
49159/tcp closed unknown
49160/tcp closed unknown
MAC Address: 00:50:56:8D:60:25 (VMware)

Nmap done: 1 IP address (1 host up) scanned in 4.34 seconds
[root at vuwunicodsktop2 ~]# mount -vv -t nfs centos76-001:/gv0 /gv0
mount.nfs: timeout set for Fri Jan 25 14:05:43 2019
mount.nfs: trying text-based options 'vers=4.1,addr=10.180.48.186,clientaddr=10.180.48.181'
mount.nfs: mount(2): Connection refused
mount.nfs: trying text-based options 'addr=10.180.48.186'
mount.nfs: prog 100003, trying vers=3, prot=6
mount.nfs: trying 10.180.48.186 prog 100003 vers 3 prot TCP port 2049
mount.nfs: portmap query failed: RPC: Remote system error - Connection refused
mount.nfs: trying text-based options 'vers=4.1,addr=10.180.48.186,clientaddr=10.180.48.181'
mount.nfs: mount(2): Connection refused
mount.nfs: trying text-based options 'addr=10.180.48.186'
mount.nfs: prog 100003, trying vers=3, prot=6
mount.nfs: trying 10.180.48.186 prog 100003 vers 3 prot TCP port 2049
mount.nfs: portmap query failed: RPC: Remote system error - Connection refused
mount.nfs: trying text-based options 'vers=4.1,addr=10.180.48.186,clientaddr=10.180.48.181'
mount.nfs: mount(2): Connection refused
mount.nfs: trying text-based options 'addr=10.180.48.186'
mount.nfs: prog 100003, trying vers=3, prot=6
mount.nfs: trying 10.180.48.186 prog 100003 vers 3 prot TCP port 2049
mount.nfs: portmap query failed: RPC: Remote system error - Connection refused
mount.nfs: trying text-based options 'vers=4.1,addr=10.180.48.186,clientaddr=10.180.48.181'
mount.nfs: mount(2): Connection refused
mount.nfs: trying text-based options 'addr=10.180.48.186'
mount.nfs: prog 100003, trying vers=3, prot=6
mount.nfs: trying 10.180.48.186 prog 100003 vers 3 prot TCP port 2049
mount.nfs: portmap query failed: RPC: Remote system error - Connection refused
^C
========

When I mount with glusterfs it works fine, so I appear to be missing something. Has anyone got the URL of a current howto for the above combo, please? rpcbind is running.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From jthottan at redhat.com Fri Jan 25 04:07:43 2019
From: jthottan at redhat.com (Jiffin Thottan)
Date: Thu, 24 Jan 2019 23:07:43 -0500 (EST)
Subject: [Gluster-users] gluster 5, Centos 7.6 and nfs
In-Reply-To:
References:
Message-ID: <1284291505.74842190.1548389263645.JavaMail.zimbra@redhat.com>

There are two NFS solutions for gluster: one is the built-in NFS server, aka gluster NFS (gNFS); the other is nfs-ganesha. gNFS has been deprecated in favour of nfs-ganesha. Are you trying to use nfs-ganesha here?

--
Jiffin

----- Original Message -----
From: "Thing"
To: "Gluster Users"
Sent: Friday, January 25, 2019 6:41:58 AM
Subject: [Gluster-users] gluster 5, Centos 7.6 and nfs

Hi,

I am trying to get the above combo working, but I can't get NFS from a RHEL 7.6 client to work.
I keep getting,

"[root at vuwunicodsktop2 ~]# showmount -e 10.180.48.186
clnt_create: RPC: Program not registered
[root at vuwunicodsktop2 ~]#"

nmap and verbose mounting shows,

=======
[root at vuwunicodsktop2 ~]# nmap centos76-001

Starting Nmap 6.40 ( http://nmap.org ) at 2019-01-25 14:03 NZDT
Nmap scan report for centos76-001 (10.180.48.186)
Host is up (0.00014s latency).
Not shown: 986 filtered ports
PORT      STATE  SERVICE
22/tcp    open   ssh
111/tcp   open   rpcbind
139/tcp   closed netbios-ssn
445/tcp   closed microsoft-ds
2049/tcp  closed nfs
49152/tcp closed unknown
49153/tcp closed unknown
49154/tcp closed unknown
49155/tcp closed unknown
49156/tcp closed unknown
49157/tcp closed unknown
49158/tcp closed unknown
49159/tcp closed unknown
49160/tcp closed unknown
MAC Address: 00:50:56:8D:60:25 (VMware)

Nmap done: 1 IP address (1 host up) scanned in 4.34 seconds
[root at vuwunicodsktop2 ~]# mount -vv -t nfs centos76-001:/gv0 /gv0
mount.nfs: timeout set for Fri Jan 25 14:05:43 2019
mount.nfs: trying text-based options 'vers=4.1,addr=10.180.48.186,clientaddr=10.180.48.181'
mount.nfs: mount(2): Connection refused
mount.nfs: trying text-based options 'addr=10.180.48.186'
mount.nfs: prog 100003, trying vers=3, prot=6
mount.nfs: trying 10.180.48.186 prog 100003 vers 3 prot TCP port 2049
mount.nfs: portmap query failed: RPC: Remote system error - Connection refused
mount.nfs: trying text-based options 'vers=4.1,addr=10.180.48.186,clientaddr=10.180.48.181'
mount.nfs: mount(2): Connection refused
mount.nfs: trying text-based options 'addr=10.180.48.186'
mount.nfs: prog 100003, trying vers=3, prot=6
mount.nfs: trying 10.180.48.186 prog 100003 vers 3 prot TCP port 2049
mount.nfs: portmap query failed: RPC: Remote system error - Connection refused
mount.nfs: trying text-based options 'vers=4.1,addr=10.180.48.186,clientaddr=10.180.48.181'
mount.nfs: mount(2): Connection refused
mount.nfs: trying text-based options 'addr=10.180.48.186'
mount.nfs: prog 100003, trying vers=3, prot=6
mount.nfs: trying 10.180.48.186 prog 100003 vers 3 prot TCP port 2049
mount.nfs: portmap query failed: RPC: Remote system error - Connection refused
mount.nfs: trying text-based options 'vers=4.1,addr=10.180.48.186,clientaddr=10.180.48.181'
mount.nfs: mount(2): Connection refused
mount.nfs: trying text-based options 'addr=10.180.48.186'
mount.nfs: prog 100003, trying vers=3, prot=6
mount.nfs: trying 10.180.48.186 prog 100003 vers 3 prot TCP port 2049
mount.nfs: portmap query failed: RPC: Remote system error - Connection refused
^C
========

When I mount with glusterfs it works fine, so I appear to be missing something. Has anyone got the URL of a current howto for the above combo, please? rpcbind is running.

_______________________________________________
Gluster-users mailing list
Gluster-users at gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users

From g1patnaik at gmail.com Fri Jan 25 08:10:50 2019
From: g1patnaik at gmail.com (Jeevan Patnaik)
Date: Fri, 25 Jan 2019 13:40:50 +0530
Subject: [Gluster-users] Is it required for a node to meet quorum over all the nodes in storage pool?
Message-ID:

Hi,

I'm going through the concepts of quorum and split-brain in clusters in general, and trying again to understand GlusterFS quorums, which I previously found difficult to understand accurately.

When we talk about server quorum, my understanding is that the concept is similar to STONITH in a cluster, i.e., we shoot the node that probably has issues, or bring its bricks down, preventing access altogether. But I don't get how it calculates quorum.

My understanding:
In a distributed replicated volume,
1. All bricks in a replica set should receive the same data writes, and hence at least 51% quorum must be met on each replica set. Now consider the following 3x replica configuration:
ServerA,B,C,D,E,F -> brickA,B,C,D,E,F respectively, and serverG without any brick in the storage pool.

Scenario: ServerA,B,F form a partition, i.e., they are isolated from the other nodes in the storage pool.

But the bricks on serverA,B,C belong to the same sub-volume. Hence, if we consider quorum over sub-volumes, A and B meet quorum for their only participating sub-volume and can serve the corresponding bricks, and the corresponding brick on C should go down.

But when we consider quorum over the storage pool, C,D,E,G meet quorum whereas A,B,F do not. Hence, the bricks on A,B,F should fail. And for C, quorum will still not be met for its sub-volume, so it will go into read-only mode. The sub-volume on D and E should work normally.

So, assuming that only sub-volume quorum is considered, we don't have any downtime on sub-volumes, but we have two partitions, and if clients can access both, they can still write and read on both partitions separately and without data conflict. The split-brain problem arises when some clients can access one partition and some the other.

If quorum is considered over the entire storage pool, this split-brain will not be seen, as the problem nodes will be dead.

So why is it not mandatory to enable server quorum to avoid this split-brain issue?

I also assume that the quorum percentage should be greater than 50%. There is an option to set a custom percentage. Why is it required? If all that is required is to kill the problem partition (group of nodes) by checking whether it holds the largest possible share (i.e. greater than 50%), does the percentage really matter?

Thanks in advance!

Regards,
Jeevan.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From amukherj at redhat.com Fri Jan 25 08:26:19 2019
From: amukherj at redhat.com (Atin Mukherjee)
Date: Fri, 25 Jan 2019 13:56:19 +0530
Subject: [Gluster-users] glusterfs 4.1.6 error in starting glusterd service
In-Reply-To:
References:
Message-ID:

Amudhan,

So here's the issue:

On node3, 'cat /var/lib/glusterd/peers/*' doesn't show node2's details, and that's why glusterd wasn't able to resolve the brick(s) hosted on node2.
Can you please pick up the 0083ec0c-40bf-472a-a128-458924e56c96 file from /var/lib/glusterd/peers/ on node 4, place it in the same location on node 3, and then restart the glusterd service on node 3?

On Thu, Jan 24, 2019 at 11:57 AM Amudhan P wrote:

> Atin,
>
> Sorry, i missed to send entire `glusterd` folder. Now attached zip
> contains `glusterd` folder from all nodes.
>
> the problem node is node3 IP 10.1.2.3, `glusterd` log file is inside node3
> folder.
>
> regards
> Amudhan
>
> On Wed, Jan 23, 2019 at 11:02 PM Atin Mukherjee wrote:
>
>> Amudhan,
>>
>> I see that you have provided the content of the configuration of the
>> volume gfs-tst where the request was to share the dump of
>> /var/lib/glusterd/* . I can not debug this further until you share the
>> correct dump.
>>
>> On Thu, Jan 17, 2019 at 3:43 PM Atin Mukherjee wrote:
>>
>>> Can you please run 'glusterd -LDEBUG' and share back the glusterd.log?
>>> Instead of doing too many back and forth I suggest you to share the content
>>> of /var/lib/glusterd from all the nodes. Also do mention which particular
>>> node the glusterd service is unable to come up.
>>> >>> On Thu, Jan 17, 2019 at 11:34 AM Amudhan P wrote: >>> >>>> I have created the folder in the path as said but still, service failed >>>> to start below is the error msg in glusterd.log >>>> >>>> [2019-01-16 14:50:14.555742] I [MSGID: 100030] [glusterfsd.c:2741:main] >>>> 0-/usr/local/sbin/glusterd: Started running /usr/local/sbin/glusterd >>>> version 4.1.6 (args: /usr/local/sbin/glusterd -p /var/run/glusterd.pid) >>>> [2019-01-16 14:50:14.559835] I [MSGID: 106478] [glusterd.c:1423:init] >>>> 0-management: Maximum allowed open file descriptors set to 65536 >>>> [2019-01-16 14:50:14.559894] I [MSGID: 106479] [glusterd.c:1481:init] >>>> 0-management: Using /var/lib/glusterd as working directory >>>> [2019-01-16 14:50:14.559912] I [MSGID: 106479] [glusterd.c:1486:init] >>>> 0-management: Using /var/run/gluster as pid file working directory >>>> [2019-01-16 14:50:14.563834] W [MSGID: 103071] >>>> [rdma.c:4629:__gf_rdma_ctx_create] 0-rpc-transport/rdma: rdma_cm event >>>> channel creation failed [No such device] >>>> [2019-01-16 14:50:14.563867] W [MSGID: 103055] [rdma.c:4938:init] >>>> 0-rdma.management: Failed to initialize IB Device >>>> [2019-01-16 14:50:14.563882] W [rpc-transport.c:351:rpc_transport_load] >>>> 0-rpc-transport: 'rdma' initialization failed >>>> [2019-01-16 14:50:14.563957] W [rpcsvc.c:1781:rpcsvc_create_listener] >>>> 0-rpc-service: cannot create listener, initing the transport failed >>>> [2019-01-16 14:50:14.563974] E [MSGID: 106244] [glusterd.c:1764:init] >>>> 0-management: creation of 1 listeners failed, continuing with succeeded >>>> transport >>>> [2019-01-16 14:50:15.565868] I [MSGID: 106513] >>>> [glusterd-store.c:2240:glusterd_restore_op_version] 0-glusterd: retrieved >>>> op-version: 40100 >>>> [2019-01-16 14:50:15.642532] I [MSGID: 106544] >>>> [glusterd.c:158:glusterd_uuid_init] 0-management: retrieved UUID: >>>> d6bf51a7-c296-492f-8dac-e81efa9dd22d >>>> [2019-01-16 14:50:15.675333] I [MSGID: 106498] >>>> 
[glusterd-handler.c:3614:glusterd_friend_add_from_peerinfo] 0-management: >>>> connect returned 0 >>>> [2019-01-16 14:50:15.675421] W [MSGID: 106061] >>>> [glusterd-handler.c:3408:glusterd_transport_inet_options_build] 0-glusterd: >>>> Failed to get tcp-user-timeout >>>> [2019-01-16 14:50:15.675451] I >>>> [rpc-clnt.c:1059:rpc_clnt_connection_init] 0-management: setting >>>> frame-timeout to 600 >>>> *[2019-01-16 14:50:15.676912] E [MSGID: 106187] >>>> [glusterd-store.c:4662:glusterd_resolve_all_bricks] 0-glusterd: resolve >>>> brick failed in restore* >>>> *[2019-01-16 14:50:15.676956] E [MSGID: 101019] >>>> [xlator.c:720:xlator_init] 0-management: Initialization of volume >>>> 'management' failed, review your volfile again* >>>> [2019-01-16 14:50:15.676973] E [MSGID: 101066] >>>> [graph.c:367:glusterfs_graph_init] 0-management: initializing translator >>>> failed >>>> [2019-01-16 14:50:15.676986] E [MSGID: 101176] >>>> [graph.c:738:glusterfs_graph_activate] 0-graph: init failed >>>> [2019-01-16 14:50:15.677479] W [glusterfsd.c:1514:cleanup_and_exit] >>>> (-->/usr/local/sbin/glusterd(glusterfs_volumes_init+0xc2) [0x409f52] >>>> -->/usr/local/sbin/glusterd(glusterfs_process_volfp+0x151) [0x409e41] >>>> -->/usr/local/sbin/glusterd(cleanup_and_exit+0x5f) [0x40942f] ) 0-: >>>> received signum (-1), shutting down >>>> >>>> >>>> On Thu, Jan 17, 2019 at 8:06 AM Atin Mukherjee >>>> wrote: >>>> >>>>> If gluster volume info/status shows the brick to be >>>>> /media/disk4/brick4 then you'd need to mount the same path and hence you'd >>>>> need to create the brick4 directory explicitly. I fail to understand the >>>>> rationale how only /media/disk4 can be used as the mount path for the >>>>> brick. >>>>> >>>>> On Wed, Jan 16, 2019 at 5:24 PM Amudhan P wrote: >>>>> >>>>>> Yes, I did mount bricks but the folder 'brick4' was still not created >>>>>> inside the brick. 
>>>>>> Do I need to create this folder because when I run replace-brick it >>>>>> will create folder inside the brick. I have seen this behavior before when >>>>>> running replace-brick or heal begins. >>>>>> >>>>>> On Wed, Jan 16, 2019 at 5:05 PM Atin Mukherjee >>>>>> wrote: >>>>>> >>>>>>> >>>>>>> >>>>>>> On Wed, Jan 16, 2019 at 5:02 PM Amudhan P >>>>>>> wrote: >>>>>>> >>>>>>>> Atin, >>>>>>>> I have copied the content of 'gfs-tst' from vol folder in another >>>>>>>> node. when starting service again fails with error msg in glusterd.log file. >>>>>>>> >>>>>>>> [2019-01-15 20:16:59.513023] I [MSGID: 100030] >>>>>>>> [glusterfsd.c:2741:main] 0-/usr/local/sbin/glusterd: Started running >>>>>>>> /usr/local/sbin/glusterd version 4.1.6 (args: /usr/local/sbin/glusterd -p >>>>>>>> /var/run/glusterd.pid) >>>>>>>> [2019-01-15 20:16:59.517164] I [MSGID: 106478] >>>>>>>> [glusterd.c:1423:init] 0-management: Maximum allowed open file descriptors >>>>>>>> set to 65536 >>>>>>>> [2019-01-15 20:16:59.517264] I [MSGID: 106479] >>>>>>>> [glusterd.c:1481:init] 0-management: Using /var/lib/glusterd as working >>>>>>>> directory >>>>>>>> [2019-01-15 20:16:59.517283] I [MSGID: 106479] >>>>>>>> [glusterd.c:1486:init] 0-management: Using /var/run/gluster as pid file >>>>>>>> working directory >>>>>>>> [2019-01-15 20:16:59.521508] W [MSGID: 103071] >>>>>>>> [rdma.c:4629:__gf_rdma_ctx_create] 0-rpc-transport/rdma: rdma_cm event >>>>>>>> channel creation failed [No such device] >>>>>>>> [2019-01-15 20:16:59.521544] W [MSGID: 103055] [rdma.c:4938:init] >>>>>>>> 0-rdma.management: Failed to initialize IB Device >>>>>>>> [2019-01-15 20:16:59.521562] W >>>>>>>> [rpc-transport.c:351:rpc_transport_load] 0-rpc-transport: 'rdma' >>>>>>>> initialization failed >>>>>>>> [2019-01-15 20:16:59.521629] W >>>>>>>> [rpcsvc.c:1781:rpcsvc_create_listener] 0-rpc-service: cannot create >>>>>>>> listener, initing the transport failed >>>>>>>> [2019-01-15 20:16:59.521648] E [MSGID: 106244] >>>>>>>> 
[glusterd.c:1764:init] 0-management: creation of 1 listeners failed, >>>>>>>> continuing with succeeded transport >>>>>>>> [2019-01-15 20:17:00.529390] I [MSGID: 106513] >>>>>>>> [glusterd-store.c:2240:glusterd_restore_op_version] 0-glusterd: retrieved >>>>>>>> op-version: 40100 >>>>>>>> [2019-01-15 20:17:00.608354] I [MSGID: 106544] >>>>>>>> [glusterd.c:158:glusterd_uuid_init] 0-management: retrieved UUID: >>>>>>>> d6bf51a7-c296-492f-8dac-e81efa9dd22d >>>>>>>> [2019-01-15 20:17:00.650911] W [MSGID: 106425] >>>>>>>> [glusterd-store.c:2643:glusterd_store_retrieve_bricks] 0-management: failed >>>>>>>> to get statfs() call on brick /media/disk4/brick4 [No such file or >>>>>>>> directory] >>>>>>>> >>>>>>> >>>>>>> This means that underlying brick /media/disk4/brick4 doesn't exist. >>>>>>> You already mentioned that you had replaced the faulty disk, but have you >>>>>>> not mounted it yet? >>>>>>> >>>>>>> >>>>>>>> [2019-01-15 20:17:00.691240] I [MSGID: 106498] >>>>>>>> [glusterd-handler.c:3614:glusterd_friend_add_from_peerinfo] 0-management: >>>>>>>> connect returned 0 >>>>>>>> [2019-01-15 20:17:00.691307] W [MSGID: 106061] >>>>>>>> [glusterd-handler.c:3408:glusterd_transport_inet_options_build] 0-glusterd: >>>>>>>> Failed to get tcp-user-timeout >>>>>>>> [2019-01-15 20:17:00.691331] I >>>>>>>> [rpc-clnt.c:1059:rpc_clnt_connection_init] 0-management: setting >>>>>>>> frame-timeout to 600 >>>>>>>> [2019-01-15 20:17:00.692547] E [MSGID: 106187] >>>>>>>> [glusterd-store.c:4662:glusterd_resolve_all_bricks] 0-glusterd: resolve >>>>>>>> brick failed in restore >>>>>>>> [2019-01-15 20:17:00.692582] E [MSGID: 101019] >>>>>>>> [xlator.c:720:xlator_init] 0-management: Initialization of volume >>>>>>>> 'management' failed, review your volfile again >>>>>>>> [2019-01-15 20:17:00.692597] E [MSGID: 101066] >>>>>>>> [graph.c:367:glusterfs_graph_init] 0-management: initializing translator >>>>>>>> failed >>>>>>>> [2019-01-15 20:17:00.692607] E [MSGID: 101176] >>>>>>>> 
[graph.c:738:glusterfs_graph_activate] 0-graph: init failed >>>>>>>> [2019-01-15 20:17:00.693004] W [glusterfsd.c:1514:cleanup_and_exit] >>>>>>>> (-->/usr/local/sbin/glusterd(glusterfs_volumes_init+0xc2) [0x409f52] >>>>>>>> -->/usr/local/sbin/glusterd(glusterfs_process_volfp+0x151) [0x409e41] >>>>>>>> -->/usr/local/sbin/glusterd(cleanup_and_exit+0x5f) [0x40942f] ) 0-: >>>>>>>> received signum (-1), shutting down >>>>>>>> >>>>>>>> >>>>>>>> On Wed, Jan 16, 2019 at 4:34 PM Atin Mukherjee >>>>>>>> wrote: >>>>>>>> >>>>>>>>> This is a case of partial write of a transaction and as the host >>>>>>>>> ran out of space for the root partition where all the glusterd related >>>>>>>>> configurations are persisted, the transaction couldn't be written and hence >>>>>>>>> the new (replaced) brick's information wasn't persisted in the >>>>>>>>> configuration. The workaround for this is to copy the content of >>>>>>>>> /var/lib/glusterd/vols/gfs-tst/ from one of the nodes in the trusted >>>>>>>>> storage pool to the node where glusterd service fails to come up and post >>>>>>>>> that restarting the glusterd service should be able to make peer status >>>>>>>>> reporting all nodes healthy and connected. >>>>>>>>> >>>>>>>>> On Wed, Jan 16, 2019 at 3:49 PM Amudhan P >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>>> Hi, >>>>>>>>>> >>>>>>>>>> In short, when I started glusterd service I am getting following >>>>>>>>>> error msg in the glusterd.log file in one server. >>>>>>>>>> what needs to be done? 
>>>>>>>>>> >>>>>>>>>> error logged in glusterd.log >>>>>>>>>> >>>>>>>>>> [2019-01-15 17:50:13.956053] I [MSGID: 100030] >>>>>>>>>> [glusterfsd.c:2741:main] 0-/usr/local/sbin/glusterd: Started running >>>>>>>>>> /usr/local/sbin/glusterd version 4.1.6 (args: /usr/local/sbin/glusterd -p >>>>>>>>>> /var/run/glusterd.pid) >>>>>>>>>> [2019-01-15 17:50:13.960131] I [MSGID: 106478] >>>>>>>>>> [glusterd.c:1423:init] 0-management: Maximum allowed open file descriptors >>>>>>>>>> set to 65536 >>>>>>>>>> [2019-01-15 17:50:13.960193] I [MSGID: 106479] >>>>>>>>>> [glusterd.c:1481:init] 0-management: Using /var/lib/glusterd as working >>>>>>>>>> directory >>>>>>>>>> [2019-01-15 17:50:13.960212] I [MSGID: 106479] >>>>>>>>>> [glusterd.c:1486:init] 0-management: Using /var/run/gluster as pid file >>>>>>>>>> working directory >>>>>>>>>> [2019-01-15 17:50:13.964437] W [MSGID: 103071] >>>>>>>>>> [rdma.c:4629:__gf_rdma_ctx_create] 0-rpc-transport/rdma: rdma_cm event >>>>>>>>>> channel creation failed [No such device] >>>>>>>>>> [2019-01-15 17:50:13.964474] W [MSGID: 103055] [rdma.c:4938:init] >>>>>>>>>> 0-rdma.management: Failed to initialize IB Device >>>>>>>>>> [2019-01-15 17:50:13.964491] W >>>>>>>>>> [rpc-transport.c:351:rpc_transport_load] 0-rpc-transport: 'rdma' >>>>>>>>>> initialization failed >>>>>>>>>> [2019-01-15 17:50:13.964560] W >>>>>>>>>> [rpcsvc.c:1781:rpcsvc_create_listener] 0-rpc-service: cannot create >>>>>>>>>> listener, initing the transport failed >>>>>>>>>> [2019-01-15 17:50:13.964579] E [MSGID: 106244] >>>>>>>>>> [glusterd.c:1764:init] 0-management: creation of 1 listeners failed, >>>>>>>>>> continuing with succeeded transport >>>>>>>>>> [2019-01-15 17:50:14.967681] I [MSGID: 106513] >>>>>>>>>> [glusterd-store.c:2240:glusterd_restore_op_version] 0-glusterd: retrieved >>>>>>>>>> op-version: 40100 >>>>>>>>>> [2019-01-15 17:50:14.973931] I [MSGID: 106544] >>>>>>>>>> [glusterd.c:158:glusterd_uuid_init] 0-management: retrieved UUID: >>>>>>>>>> 
d6bf51a7-c296-492f-8dac-e81efa9dd22d >>>>>>>>>> [2019-01-15 17:50:15.046620] E [MSGID: 101032] >>>>>>>>>> [store.c:441:gf_store_handle_retrieve] 0-: Path corresponding to >>>>>>>>>> /var/lib/glusterd/vols/gfs-tst/bricks/IP.3:-media-disk3-brick3. [No such >>>>>>>>>> file or directory] >>>>>>>>>> [2019-01-15 17:50:15.046685] E [MSGID: 106201] >>>>>>>>>> [glusterd-store.c:3384:glusterd_store_retrieve_volumes] 0-management: >>>>>>>>>> Unable to restore volume: gfs-tst >>>>>>>>>> [2019-01-15 17:50:15.046718] E [MSGID: 101019] >>>>>>>>>> [xlator.c:720:xlator_init] 0-management: Initialization of volume >>>>>>>>>> 'management' failed, review your volfile again >>>>>>>>>> [2019-01-15 17:50:15.046732] E [MSGID: 101066] >>>>>>>>>> [graph.c:367:glusterfs_graph_init] 0-management: initializing translator >>>>>>>>>> failed >>>>>>>>>> [2019-01-15 17:50:15.046741] E [MSGID: 101176] >>>>>>>>>> [graph.c:738:glusterfs_graph_activate] 0-graph: init failed >>>>>>>>>> [2019-01-15 17:50:15.047171] W >>>>>>>>>> [glusterfsd.c:1514:cleanup_and_exit] >>>>>>>>>> (-->/usr/local/sbin/glusterd(glusterfs_volumes >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> In long, I am trying to simulate a situation. where volume stoped >>>>>>>>>> abnormally and >>>>>>>>>> entire cluster restarted with some missing disks. >>>>>>>>>> >>>>>>>>>> My test cluster is set up with 3 nodes and each has four disks, I >>>>>>>>>> have setup a volume with disperse 4+2. >>>>>>>>>> In Node-3 2 disks have failed, to replace I have shutdown all >>>>>>>>>> system >>>>>>>>>> >>>>>>>>>> below are the steps done. >>>>>>>>>> >>>>>>>>>> 1. umount from client machine >>>>>>>>>> 2. shutdown all system by running `shutdown -h now` command ( >>>>>>>>>> without stopping volume and stop service) >>>>>>>>>> 3. replace faulty disk in Node-3 >>>>>>>>>> 4. powered ON all system >>>>>>>>>> 5. format replaced drives, and mount all drives >>>>>>>>>> 6. start glusterd service in all node (success) >>>>>>>>>> 7. 
Now running `voulume status` command from node-3 >>>>>>>>>> output : [2019-01-15 16:52:17.718422] : v status : FAILED : >>>>>>>>>> Staging failed on 0083ec0c-40bf-472a-a128-458924e56c96. Please check log >>>>>>>>>> file for details. >>>>>>>>>> 8. running `voulume start gfs-tst` command from node-3 >>>>>>>>>> output : [2019-01-15 16:53:19.410252] : v start gfs-tst : FAILED >>>>>>>>>> : Volume gfs-tst already started >>>>>>>>>> >>>>>>>>>> 9. running `gluster v status` in other node. showing all brick >>>>>>>>>> available but 'self-heal daemon' not running >>>>>>>>>> @gfstst-node2:~$ sudo gluster v status >>>>>>>>>> Status of volume: gfs-tst >>>>>>>>>> Gluster process TCP Port RDMA Port >>>>>>>>>> Online Pid >>>>>>>>>> >>>>>>>>>> ------------------------------------------------------------------------------ >>>>>>>>>> Brick IP.2:/media/disk1/brick1 49152 0 Y >>>>>>>>>> 1517 >>>>>>>>>> Brick IP.4:/media/disk1/brick1 49152 0 Y >>>>>>>>>> 1668 >>>>>>>>>> Brick IP.2:/media/disk2/brick2 49153 0 Y >>>>>>>>>> 1522 >>>>>>>>>> Brick IP.4:/media/disk2/brick2 49153 0 Y >>>>>>>>>> 1678 >>>>>>>>>> Brick IP.2:/media/disk3/brick3 49154 0 Y >>>>>>>>>> 1527 >>>>>>>>>> Brick IP.4:/media/disk3/brick3 49154 0 Y >>>>>>>>>> 1677 >>>>>>>>>> Brick IP.2:/media/disk4/brick4 49155 0 Y >>>>>>>>>> 1541 >>>>>>>>>> Brick IP.4:/media/disk4/brick4 49155 0 Y >>>>>>>>>> 1683 >>>>>>>>>> Self-heal Daemon on localhost N/A N/A >>>>>>>>>> Y 2662 >>>>>>>>>> Self-heal Daemon on IP.4 N/A N/A Y >>>>>>>>>> 2786 >>>>>>>>>> >>>>>>>>>> 10. in the above output 'volume already started'. so, running >>>>>>>>>> `reset-brick` command >>>>>>>>>> v reset-brick gfs-tst IP.3:/media/disk3/brick3 >>>>>>>>>> IP.3:/media/disk3/brick3 commit force >>>>>>>>>> >>>>>>>>>> output : [2019-01-15 16:57:37.916942] : v reset-brick gfs-tst >>>>>>>>>> IP.3:/media/disk3/brick3 IP.3:/media/disk3/brick3 commit force : FAILED : >>>>>>>>>> /media/disk3/brick3 is already part of a volume >>>>>>>>>> >>>>>>>>>> 11. 
reset-brick command was not working, so, tried stopping volume and start with force command
>>>>>>>>>> output : [2019-01-15 17:01:04.570794] : v start gfs-tst force : FAILED : Pre-validation failed on localhost. Please check log file for details
>>>>>>>>>>
>>>>>>>>>> 12. now stopped service in all node and tried starting again. except node-3 other nodes service started successfully without any issues.
>>>>>>>>>>
>>>>>>>>>> in node-3 receiving following message.
>>>>>>>>>>
>>>>>>>>>> sudo service glusterd start
>>>>>>>>>> * Starting glusterd service glusterd [fail]
>>>>>>>>>> /usr/local/sbin/glusterd: option requires an argument -- 'f'
>>>>>>>>>> Try `glusterd --help' or `glusterd --usage' for more information.
>>>>>>>>>>
>>>>>>>>>> 13. checking glusterd log file found that OS drive was running out of space
>>>>>>>>>> output : [2019-01-15 16:51:37.210792] W [MSGID: 101012] [store.c:372:gf_store_save_value] 0-management: fflush failed. [No space left on device]
>>>>>>>>>> [2019-01-15 16:51:37.210874] E [MSGID: 106190] [glusterd-store.c:1058:glusterd_volume_exclude_options_write] 0-management: Unable to write volume values for gfs-tst
>>>>>>>>>>
>>>>>>>>>> 14. cleared some space in OS drive but still, service is not running. below is the error logged in glusterd.log
>>>>>>>>>>
>>>>>>>>>> [2019-01-15 17:50:13.956053] I [MSGID: 100030] [glusterfsd.c:2741:main] 0-/usr/local/sbin/glusterd: Started running /usr/local/sbin/glusterd version 4.1.6 (args: /usr/local/sbin/glusterd -p /var/run/glusterd.pid)
>>>>>>>>>> [2019-01-15 17:50:13.960131] I [MSGID: 106478] [glusterd.c:1423:init] 0-management: Maximum allowed open file descriptors set to 65536
>>>>>>>>>> [2019-01-15 17:50:13.960193] I [MSGID: 106479] [glusterd.c:1481:init] 0-management: Using /var/lib/glusterd as working directory
>>>>>>>>>> [2019-01-15 17:50:13.960212] I [MSGID: 106479] [glusterd.c:1486:init] 0-management: Using /var/run/gluster as pid file working directory
>>>>>>>>>> [2019-01-15 17:50:13.964437] W [MSGID: 103071] [rdma.c:4629:__gf_rdma_ctx_create] 0-rpc-transport/rdma: rdma_cm event channel creation failed [No such device]
>>>>>>>>>> [2019-01-15 17:50:13.964474] W [MSGID: 103055] [rdma.c:4938:init] 0-rdma.management: Failed to initialize IB Device
>>>>>>>>>> [2019-01-15 17:50:13.964491] W [rpc-transport.c:351:rpc_transport_load] 0-rpc-transport: 'rdma' initialization failed
>>>>>>>>>> [2019-01-15 17:50:13.964560] W [rpcsvc.c:1781:rpcsvc_create_listener] 0-rpc-service: cannot create listener, initing the transport failed
>>>>>>>>>> [2019-01-15 17:50:13.964579] E [MSGID: 106244] [glusterd.c:1764:init] 0-management: creation of 1 listeners failed, continuing with succeeded transport
>>>>>>>>>> [2019-01-15 17:50:14.967681] I [MSGID: 106513] [glusterd-store.c:2240:glusterd_restore_op_version] 0-glusterd: retrieved op-version: 40100
>>>>>>>>>> [2019-01-15 17:50:14.973931] I [MSGID: 106544] [glusterd.c:158:glusterd_uuid_init] 0-management: retrieved UUID: d6bf51a7-c296-492f-8dac-e81efa9dd22d
>>>>>>>>>> [2019-01-15 17:50:15.046620] E [MSGID: 101032] [store.c:441:gf_store_handle_retrieve] 0-: Path corresponding to /var/lib/glusterd/vols/gfs-tst/bricks/IP.3:-media-disk3-brick3. [No such file or directory]
>>>>>>>>>> [2019-01-15 17:50:15.046685] E [MSGID: 106201] [glusterd-store.c:3384:glusterd_store_retrieve_volumes] 0-management: Unable to restore volume: gfs-tst
>>>>>>>>>> [2019-01-15 17:50:15.046718] E [MSGID: 101019] [xlator.c:720:xlator_init] 0-management: Initialization of volume 'management' failed, review your volfile again
>>>>>>>>>> [2019-01-15 17:50:15.046732] E [MSGID: 101066] [graph.c:367:glusterfs_graph_init] 0-management: initializing translator failed
>>>>>>>>>> [2019-01-15 17:50:15.046741] E [MSGID: 101176] [graph.c:738:glusterfs_graph_activate] 0-graph: init failed
>>>>>>>>>> [2019-01-15 17:50:15.047171] W [glusterfsd.c:1514:cleanup_and_exit] (-->/usr/local/sbin/glusterd(glusterfs_volumes_init+0xc2) [0x409f52] -->/usr/local/sbin/glusterd(glusterfs_process_volfp+0x151) [0x409e41] -->/usr/local/sbin/glusterd(cleanup_and_exit+0x5f) [0x40942f] ) 0-: received signum (-1), shutting down
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> 15. In other node running `volume status' still shows bricks node3 is live
>>>>>>>>>> but 'peer status' showing node-3 disconnected
>>>>>>>>>>
>>>>>>>>>> @gfstst-node2:~$ sudo gluster v status
>>>>>>>>>> Status of volume: gfs-tst
>>>>>>>>>> Gluster process                    TCP Port  RDMA Port  Online  Pid
>>>>>>>>>> ------------------------------------------------------------------------------
>>>>>>>>>> Brick IP.2:/media/disk1/brick1     49152     0          Y       1517
>>>>>>>>>> Brick IP.4:/media/disk1/brick1     49152     0          Y       1668
>>>>>>>>>> Brick IP.2:/media/disk2/brick2     49153     0          Y       1522
>>>>>>>>>> Brick IP.4:/media/disk2/brick2     49153     0          Y       1678
>>>>>>>>>> Brick IP.2:/media/disk3/brick3     49154     0          Y       1527
>>>>>>>>>> Brick IP.4:/media/disk3/brick3     49154     0          Y       1677
>>>>>>>>>> Brick IP.2:/media/disk4/brick4     49155     0          Y       1541
>>>>>>>>>> Brick IP.4:/media/disk4/brick4     49155     0          Y       1683
>>>>>>>>>> Self-heal Daemon on localhost      N/A       N/A        Y       2662
>>>>>>>>>> Self-heal Daemon on IP.4           N/A       N/A        Y       2786
>>>>>>>>>>
>>>>>>>>>> Task Status of Volume gfs-tst
>>>>>>>>>> ------------------------------------------------------------------------------
>>>>>>>>>> There are no active volume tasks
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> root at gfstst-node2:~$ sudo gluster pool list
>>>>>>>>>> UUID                                  Hostname   State
>>>>>>>>>> d6bf51a7-c296-492f-8dac-e81efa9dd22d  IP.3       Disconnected
>>>>>>>>>> c1cbb58e-3ceb-4637-9ba3-3d28ef20b143  IP.4       Connected
>>>>>>>>>> 0083ec0c-40bf-472a-a128-458924e56c96  localhost  Connected
>>>>>>>>>>
>>>>>>>>>> root at gfstst-node2:~$ sudo gluster peer status
>>>>>>>>>> Number of Peers: 2
>>>>>>>>>>
>>>>>>>>>> Hostname: IP.3
>>>>>>>>>> Uuid: d6bf51a7-c296-492f-8dac-e81efa9dd22d
>>>>>>>>>> State: Peer in Cluster (Disconnected)
>>>>>>>>>>
>>>>>>>>>> Hostname: IP.4
>>>>>>>>>> Uuid: c1cbb58e-3ceb-4637-9ba3-3d28ef20b143
>>>>>>>>>> State: Peer in Cluster (Connected)
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
regards
>>>>>>>>>> Amudhan
>>>>>>>>>> _______________________________________________
>>>>>>>>>> Gluster-users mailing list
>>>>>>>>>> Gluster-users at gluster.org
>>>>>>>>>> https://lists.gluster.org/mailman/listinfo/gluster-users
>>>>>>>>>
>>>>>>>>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From shaik.salam at tcs.com Fri Jan 25 09:24:13 2019
From: shaik.salam at tcs.com (Shaik Salam)
Date: Fri, 25 Jan 2019 14:54:13 +0530
Subject: [Gluster-users] Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: server busy
In-Reply-To:
References:
Message-ID:

Hi John,

Please find the db dump and heketi log. Here is the kernel version. Please let me know if you need more information.

[root at app2 ~]# uname -a
Linux app2.matrix.nokia.com 3.10.0-514.el7.x86_64 #1 SMP Tue Nov 22 16:42:41 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

Hardware: HP GEN8

OS:
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"
CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"

From: "Madhu Rajanna"
To: "Shaik Salam", "John Mulligan"
Cc: "gluster-users at gluster.org List", "Michael Adam"
Date: 01/24/2019 10:52 PM
Subject: Re: Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: server busy

"External email. Open with Caution"

Adding John, who has a better idea of how to debug this one.

@Shaik Salam can you provide some more info on the hardware on which you are running heketi (kernel details)?

On Thu, Jan 24, 2019 at 7:42 PM Shaik Salam wrote:

Hi Madhu,

Sorry to disturb you. Could you please provide at least a workaround (to clear the requests which are stuck) so we can move further? We are also not able to find the root cause from the glusterd logs. Please find the attachment.

BR
Salam

From: Shaik Salam/HYD/TCS
To: "Madhu Rajanna"
Cc: "gluster-users at gluster.org List", "Michael Adam"
Date: 01/24/2019 04:12 PM
Subject: Re: Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: server busy

Hi Madhu,

Please let me know if any other information is required.

BR
Salam

From: Shaik Salam/HYD/TCS
To: "Madhu Rajanna"
Cc: "gluster-users at gluster.org List", "Michael Adam"
Date: 01/24/2019 03:23 PM
Subject: Re: Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: server busy

Hi Madhu,

This is the complete log after the restart of the heketi pod, and the process log.

BR
Salam

[attachment "heketi-pod-complete.log" deleted by Shaik Salam/HYD/TCS]
[attachment "ps-aux.txt" deleted by Shaik Salam/HYD/TCS]

From: "Madhu Rajanna"
To: "Shaik Salam"
Cc: "gluster-users at gluster.org List", "Michael Adam"
Date: 01/24/2019 01:55 PM
Subject: Re: Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: server busy

The logs you provided are not complete, so I am not able to figure out which command is stuck. Can you reattach the complete output of `ps aux` and also attach the complete heketi logs?

On Thu, Jan 24, 2019 at 1:41 PM Shaik Salam wrote:

Hi Madhu,

Please find the requested info.

BR
Salam

From: Madhu Rajanna
To: Shaik Salam
Cc: "gluster-users at gluster.org List", Michael Adam
Date: 01/24/2019 01:33 PM
Subject: Re: Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: server busy

I believe the heketi logs you have attached are not complete. Can you provide the complete heketi logs, and can we also get the output of "ps aux" from the gluster pods? I want to see if any lvm commands or gluster commands are "stuck".

On Thu, Jan 24, 2019 at 1:16 PM Shaik Salam wrote:

Hi Madhu,

I restarted the heketi pod many times, but the issue is not resolved.
sh-4.4# heketi-cli server operations info
Operation Counts:
  Total: 0
  In-Flight: 0
  New: 0
  Stale: 0

Now you can see all operations are zero. Now I try to create a single volume; below is the observation: in-flight slowly reaches 8.

sh-4.4# heketi-cli server operations infoCLI_SERVER=http://localhost:8080 ; export HEKETI_CLI_USE
Operation Counts:
  Total: 0
  In-Flight: 6
  New: 0
  Stale: 0
sh-4.4# heketi-cli server operations info
Operation Counts:
  Total: 0
  In-Flight: 7
  New: 0
  Stale: 0
sh-4.4# heketi-cli server operations info
Operation Counts:
  Total: 0
  In-Flight: 7
  New: 0
  Stale: 0
sh-4.4# heketi-cli server operations info
Operation Counts:
  Total: 0
  In-Flight: 7
  New: 0
  Stale: 0
sh-4.4# heketi-cli server operations info
Operation Counts:
  Total: 0
  In-Flight: 7
  New: 0
  Stale: 0

[negroni] Completed 200 OK in 186.286µs
[negroni] Started POST /volumes
[negroni] Started GET /operations
[negroni] Completed 200 OK in 166.294µs
[negroni] Started GET /operations
[negroni] Completed 200 OK in 186.411µs
[negroni] Started GET /operations
[negroni] Completed 200 OK in 179.796µs
[negroni] Started POST /volumes
[negroni] Started POST /volumes
[negroni] Started POST /volumes
[negroni] Started POST /volumes
[negroni] Started GET /operations
[negroni] Completed 200 OK in 131.108µs
[negroni] Started POST /volumes
[negroni] Started GET /operations
[negroni] Completed 200 OK in 111.392µs
[negroni] Started GET /operations
[negroni] Completed 200 OK in 265.023µs
[negroni] Started GET /operations
[negroni] Completed 200 OK in 179.364µs
[negroni] Started GET /operations
[negroni] Completed 200 OK in 295.058µs
[negroni] Started GET /operations
[negroni] Completed 200 OK in 146.857µs
[negroni] Started POST /volumes
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/24 07:43:36 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 403.166µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/24 07:43:51 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 193.554µs

But the volume for the pod is not being created.

1:15:36 PM  Warning  Provisioning failed  Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: create volume err: error creating volume Server busy. Retry operation later.. 9 times in the last 2 minutes
1:13:21 PM  Warning  Provisioning failed  Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: create volume err: error creating volume . 8 times in the last

From: "Madhu Rajanna"
To: "Shaik Salam"
Cc: "gluster-users at gluster.org List", "Michael Adam"
Date: 01/24/2019 12:51 PM
Subject: Re: Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: server busy

Hi Shaik,

Can you provide me the output of

$ heketi-cli server operations info

from the heketi pod? As a workaround you can try restarting the heketi pod. This will cause the current operations to go stale, but other pending PVCs may go to Bound state.

Regards,
Madhu R

On Thu, Jan 24, 2019 at 12:36 PM Shaik Salam wrote:

Hi Madhu,

Could you please have a look at my issue if you have time (at least a workaround)? I am unable to send mail to "John Mulligan", who is currently handling issue https://bugzilla.redhat.com/show_bug.cgi?id=1636912

BR
Salam

From: Shaik Salam/HYD/TCS
To: "John Mulligan", "Michael Adam" <madam at redhat.com>, "Madhu Rajanna"
Cc: "gluster-users at gluster.org List"
Date: 01/24/2019 12:21 PM
Subject: Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: server busy

Hi All,

We are also facing the following issue on OpenShift Origin while creating PVCs for pods (please provide at least a workaround so we can move further).

Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: create volume err: error creating volume
Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: create volume err: error creating volume Server busy. Retry operation later..
Please find heketidb dump and log

[negroni] Completed 429 Too Many Requests in 250.763µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:07:49 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 169.08µs
[negroni] Started DELETE /volumes/520bc5f4e1bfd029855a72f9ca7ebf6c
[negroni] Completed 404 Not Found in 148.125µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:08:04 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 496.624µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:08:04 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 101.673µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:08:19 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 209.681µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:08:19 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 103.595µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:08:34 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 297.594µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:08:34 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 96.75µs
[negroni] Started POST /volumes
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:08:49 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 477.007µs
[heketi] WARNING 2019/01/23 12:08:49 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 165.38µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:09:04 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 488.253µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:09:04 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 171.836µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:09:19 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 208.59µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:09:19 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 125.141µs
[negroni] Started DELETE /volumes/99e87ecd0a816ac34ae5a04eabc1d606
[negroni] Completed 404 Not Found in 138.687µs
[negroni] Started POST /volumes

BR
Salam

=====-----=====-----=====
Notice: The information contained in this e-mail message and/or attachments to it may contain confidential or privileged information. If you are not the intended recipient, any dissemination, use, review, distribution, printing or copying of the information contained in this e-mail message and/or attachments to it are strictly prohibited. If you have received this communication in error, please notify us by reply e-mail or telephone and immediately and permanently delete the message and any attachments. Thank you

--
Madhu Rajanna
Software Engineer
Red Hat Bangalore, India
mrajanna at redhat.com
M: +91-9741133155

-------------- next part --------------
An HTML attachment was scrubbed...
URL:
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: heketi-complete.log.txt
URL:
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
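The heketi log excerpts in this message repeat one pattern: a `POST /volumes` request is answered with `429 Too Many Requests` whenever the in-flight operation count has reached the limit of 8. A minimal Python sketch for tallying that pattern from a saved log is given here. The function name `summarize` and both regular expressions are illustrative assumptions derived only from the line formats quoted in this thread; they are not part of heketi.

```python
import re
from collections import Counter

# Patterns inferred from the log lines quoted in this thread (assumption).
COMPLETED = re.compile(r"\[negroni\] Completed (\d{3}) ")
THROTTLED = re.compile(r"operations in-flight \((\d+)\) exceeds limit \((\d+)\)")


def summarize(log_text):
    """Tally HTTP status codes and throttling warnings in a heketi log."""
    statuses = Counter(COMPLETED.findall(log_text))
    throttled = THROTTLED.findall(log_text)
    return {
        "statuses": dict(statuses),            # e.g. {"429": 12, "404": 2}
        "throttle_warnings": len(throttled),   # number of WARNING lines
        "limits_seen": sorted({int(limit) for _, limit in throttled}),
    }
```

Because the patterns are matched with `findall` over the whole text, the sketch should work both on a log with one entry per line and on a flattened paste such as the one in this archive.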
Name: heketi-gluster.db.txt
URL:

From shaik.salam at tcs.com Fri Jan 25 10:33:51 2019
From: shaik.salam at tcs.com (Shaik Salam)
Date: Fri, 25 Jan 2019 16:03:51 +0530
Subject: [Gluster-users] Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: server busy
In-Reply-To:
References:
Message-ID:

Hi John,

Could you please have look my issue If you have time (atleast provide workaround). Thanks in advance.

BR
Salam

_______________________________________________
Gluster-users mailing list
Gluster-users at gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: heketi-complete.log.txt
URL:
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: heketi-gluster.db.txt
URL:

From max.degraaf at kpn.com Fri Jan 25 11:30:02 2019
From: max.degraaf at kpn.com (max.degraaf at kpn.com)
Date: Fri, 25 Jan 2019 11:30:02 +0000
Subject: [Gluster-users] Brick stays offline after update from 4.1.6-1.el7 to 4.1.7-1.el7
Message-ID:

We have 2 nodes running CentOS 7.3. Running just fine with glusterfs 4.1.6-1.el7. This morning update both to 4.1.7-1.el7 and the only brick configured stays offline.

gluster peer status show no problems:

Number of Peers: 1

Hostname: 10.159.241.35
Uuid: 7453dbec-44fb-4e57-9471-6e653d287d3b
State: Peer in Cluster (Connected)

Number of Peers: 1

Hostname: 10.159.241.3
Uuid: 8f0e75bd-c782-4d21-aaf3-2d8a27e8a714
State: Peer in Cluster (Connected)

gluster volume status show the bricks offline:

Status of volume: gst
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick grpprdaalcgst01.cloudprod.local:/apps
/glusterfs-gst/gst                          49152     0          Y       8827
Brick grpprdapdcgst01.cloudprod.local:/apps
/glusterfs-gst/gst                          N/A       N/A        N       N/A
Self-heal Daemon on localhost               N/A       N/A        Y       8818
Self-heal Daemon on grpprdapdcgst01.cloudpr
od.local                                    N/A       N/A        Y       28111

Task Status of Volume gst
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: gst
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick grpprdaalcgst01.cloudprod.local:/apps
/glusterfs-gst/gst                          49152     0          Y       8827
Brick grpprdapdcgst01.cloudprod.local:/apps
/glusterfs-gst/gst                          N/A       N/A        N       N/A
Self-heal Daemon on localhost               N/A       N/A        Y       28111
Self-heal Daemon on 10.159.241.3            N/A       N/A        Y       8818

Task Status of Volume gst
------------------------------------------------------------------------------
There are no active volume tasks

Any idea on a fix? If not, how can we revert to 4.1.6-1.el7?
-------------- next part --------------
An HTML attachment was scrubbed...
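When a status report like the one in this message has to be checked on several nodes, the offline rows can be picked out mechanically. The following is a minimal Python sketch, assuming the column layout shown in this message (TCP port, RDMA port, Online, Pid) and assuming that a long process name is wrapped onto a second line as it appears here. The name `offline_processes` and the regular expression are hypothetical helpers written for this illustration, not part of the gluster CLI.

```python
import re

# A complete row ends with: TCP port, RDMA port, Online flag, Pid (assumed layout).
ROW = re.compile(
    r"^(?P<name>.+?)\s+(?P<tcp>\d+|N/A)\s+(?P<rdma>\d+|N/A)"
    r"\s+(?P<online>[YN])\s+(?P<pid>\d+|N/A)$"
)


def offline_processes(status_text):
    """Return names of bricks/daemons whose Online column is 'N'."""
    offline = []
    pending = ""  # first half of a row whose name was wrapped
    for raw in status_text.splitlines():
        line = raw.strip()
        if not line:
            continue
        # Skip headers and rulers unless we are in the middle of a wrapped row.
        if not pending and not line.startswith(("Brick ", "Self-heal Daemon")):
            continue
        candidate = pending + line  # rejoin a name split across two lines
        match = ROW.match(candidate)
        if match is None:
            pending = candidate     # row continues on the next line
            continue
        pending = ""
        if match.group("online") == "N":
            offline.append(match.group("name"))
    return offline
```

Fed the status text from this message, the sketch would be expected to report only the brick on grpprdapdcgst01; the brick log on that node is then the natural next place to look.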
URL: From hunter86_bg at yahoo.com Fri Jan 25 13:12:59 2019 From: hunter86_bg at yahoo.com (Strahil Nikolov) Date: Fri, 25 Jan 2019 13:12:59 +0000 (UTC) Subject: [Gluster-users] =?utf-8?b?0J7RgtC9OiAgR2x1c3RlciBwZXJmb3JtYW5j?= =?utf-8?q?e_issues_-_need_advise?= In-Reply-To: References: <54e562bd-3465-4c69-8fda-8060a52e9c22@email.android.com> <1172976507.352481.1548325968837@mail.yahoo.com> Message-ID: <1701539419.952862.1548421979393@mail.yahoo.com> Dear Darrell, I found the issue and now I can reach the maximum of the network with a fuse client.Here is a short overview:1. I noticed that working with a new gluster volume is reaching my network speed - I was quite excited.2. Then I have destroyed my gluster volume and created a new one and started adding features from ovirt Once I have added features.shard on -> I hit the same performance from before. Increasing the shard size to 16MB didn't help at all. For my case where I have 2 Virtualization hosts with single data gluster volume - sharding is not neccessary, but for larger setups it will be a problem. As this looks to me as a bug - can someone tell me where I can report it ? Thanks to all who guided me in this journey of GlusterFS ! I have learned so much , as my prior knowledge was only in Ceph. Best Regards,Strahil Nikolov ? ?????????, 24 ?????? 2019 ?., 17:53:50 ?. ???????+2, Darrell Budic ??????: Strahil- The fuse client is what it is, it?s limited by operating in user land and waiting for the gluster servers to acknowledge all the writes. I noted you're using ovirt, you should look into enabling the libgfapi engine setting to run your VMs with libgf natively. You can?t test directly from the host with that, but you can run your tests inside the VMs. I saw significant throughput and latency improvements that way. It?s still somewhat beta, so you?ll probably need to search the overt-users mailing list to find info on enabling it.? Good luck! 
On Jan 24, 2019, at 4:32 AM, Strahil Nikolov wrote:

Dear Amar, Community,

it seems the issue is in the fuse client itself.

Here is the latest update:

1. I have added the following:
server.event-threads: 4
client.event-threads: 4
performance.stat-prefetch: on
performance.strict-o-direct: off
Results: no change

2. Allowed nfs and connected ovirt1 to the gluster volume:
nfs.disable: off
Results: Drastic improvement in performance as follows:

[root at ovirt1 data]# dd if=/dev/zero of=largeio bs=1M count=5000 status=progress
5000+0 records in
5000+0 records out
5242880000 bytes (5.2 GB) copied, 53.0443 s, 98.8 MB/s

So I would be happy if anyone could guide me in fixing the situation, as the fuse client is the best way to use glusterfs and it seems the glusterfs-server is not the guilty one.

Thanks in advance for your guidance. I have learned so much.

Best Regards,
Strahil Nikolov

From: Strahil
To: Amar Tumballi Suryanarayan
Cc: Gluster-users
Sent: Wednesday, January 23, 2019 18:44
Subject: Re: [Gluster-users] Gluster performance issues - need advise

Dear Amar,

Thanks for your email. Actually my concerns were on both topics.
Would you recommend any perf options that will be suitable?

After mentioning the network usage, I just checked it and it seems that during the test session ovirt1 (both client and host) is using no more than 455Mbit/s, which is half the network bandwidth.

I'm still in the middle of nowhere, so any ideas are welcome.

Best Regards,
Strahil Nikolov

On Jan 23, 2019 17:49, Amar Tumballi Suryanarayan wrote:

I didn't understand the issue properly. Most likely I missed something.

Are you concerned that the performance is 49MB/s with and without perf options? Or are you expecting it to be 123MB/s, as you get that speed over the n/w?

If it is the first problem, then you actually have 'performance.write-behind on' in both cases, and it is the only perf xlator which comes into action during the test you ran.
If it is the second, then please be informed that gluster does client-side replication, which means the n/w would be split in half for write operations (like write(), creat() etc.), so the number you are getting is almost the maximum with 1GbE.

Regards,
Amar

On Wed, Jan 23, 2019 at 8:38 PM Strahil Nikolov wrote:

Hello Community,

recently I have built a new lab based on oVirt and CentOS 7.
During deployment I had some hiccups, but now the engine is up and running - but gluster is causing me trouble.

Symptoms: Slow VM install from DVD, poor write performance. The latter has been tested via:
dd if=/dev/zero of=/rhev/data-center/mnt/glusterSD/ovirt1.localdomain:_data bs=1M count=1000 status=progress

The reported speed is 60MB/s, which is way too low for my setup.

My lab design:
https://drive.google.com/file/d/1SiW21ASPXHRAEuE_jZ50R3FoO-NcnFqT/view?usp=sharing
Gluster version is 3.12.15

So far I have done:

1. Added 'server.allow-insecure on' (with 'option rpc-auth-allow-insecure on' in glusterd.vol)
Volume info after that change:

Volume Name: data
Type: Replicate
Volume ID: 9b06a1e9-8102-4cd7-bc56-84960a1efaa2
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp
Bricks:
Brick1: ovirt1.localdomain:/gluster_bricks/data/data
Brick2: ovirt2.localdomain:/gluster_bricks/data/data
Brick3: ovirt3.localdomain:/gluster_bricks/data/data (arbiter)
Options Reconfigured:
performance.client-io-threads: off
nfs.disable: on
transport.address-family: inet
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.low-prio-threads: 32
network.remote-dio: off
cluster.eager-lock: enable
cluster.quorum-type: auto
cluster.server-quorum-type: server
cluster.data-self-heal-algorithm: full
cluster.locking-scheme: granular
cluster.shd-max-threads: 8
cluster.shd-wait-qlength: 10000
features.shard: on
user.cifs: off
storage.owner-uid: 36
storage.owner-gid: 36
network.ping-timeout: 30
performance.strict-o-direct: on
cluster.granular-entry-heal: enable server.allow-insecure: on Seems no positive or negative effect so far. 2. Tested with tmpfs? on all bricks -> ovirt1 mounted gluster volume ->? max 60MB/s (bs=1M without 'oflag=direct') [root at ovirt1 data]# dd if=/dev/zero of=large_io bs=1M? count=4000 status=progress 4177526784 bytes (4.2 GB) copied, 70.843409 s, 59.0 MB/s 4000+0 records in 4000+0 records out 4194304000 bytes (4.2 GB) copied, 71.1407 s, 59.0 MB/s [root at ovirt1 data]# rm -f large_io [root at ovirt1 data]# gluster volume profile data info Brick: ovirt1.localdomain:/gluster_bricks/data/data --------------------------------------------------- Cumulative Stats: Block Size:? ? ? ? ? ? 131072b+ No. of Reads:? ? ? ? ? ? ? ? ? ? 8 No. of Writes:? ? ? ? ? ? ? ? 44968 %-latency? Avg-latency? Min-Latency? Max-Latency? No. of calls? ? ? ? Fop ---------? -----------? -----------? -----------? ------------? ? ? ? ---- 0.00? ? ? 0.00 us? ? ? 0.00 us? ? ? 0.00 us? ? ? ? ? ? ? 3? ? ? FORGET 0.00? ? ? 0.00 us? ? ? 0.00 us? ? ? 0.00 us? ? ? ? ? ? 35? ? RELEASE 0.00? ? ? 0.00 us? ? ? 0.00 us? ? ? 0.00 us? ? ? ? ? ? 28? RELEASEDIR 0.00? ? ? 78.00 us? ? ? 78.00 us? ? ? 78.00 us? ? ? ? ? ? ? 1? ? ? FSTAT 0.00? ? ? 35.67 us? ? ? 26.00 us? ? ? 73.00 us? ? ? ? ? ? ? 6? ? ? FLUSH 0.00? ? 324.00 us? ? 324.00 us? ? 324.00 us? ? ? ? ? ? ? 1? ? XATTROP 0.00? ? ? 45.80 us? ? ? 38.00 us? ? ? 54.00 us? ? ? ? ? ? 10? ? ? ? STAT 0.00? ? 227.67 us? ? 216.00 us? ? 242.00 us? ? ? ? ? ? ? 3? ? ? CREATE 0.00? ? 113.38 us? ? ? 68.00 us? ? 381.00 us? ? ? ? ? ? ? 8? ? ? ? READ 0.00? ? ? 39.82 us? ? ? 1.00 us? ? 148.00 us? ? ? ? ? ? 28? ? OPENDIR 0.00? ? ? 67.54 us? ? ? 10.00 us? ? 283.00 us? ? ? ? ? ? 24? ? GETXATTR 0.00? ? ? 59.97 us? ? ? 45.00 us? ? 113.00 us? ? ? ? ? ? 32? ? ? ? OPEN 0.00? ? ? 24.41 us? ? ? 13.00 us? ? ? 89.00 us? ? ? ? ? ? 161? ? INODELK 0.00? ? ? 43.43 us? ? ? 28.00 us? ? 214.00 us? ? ? ? ? ? 93? ? ? STATFS 0.00? ? 246.35 us? ? ? 11.00 us? ? 1155.00 us? ? ? ? ? ? 20? ? 
READDIR 0.00? ? 283.00 us? ? 233.00 us? ? 353.00 us? ? ? ? ? ? 18? ? READDIRP 0.00? ? 153.23 us? ? 122.00 us? ? 259.00 us? ? ? ? ? ? 87? ? ? MKNOD 0.01? ? ? 99.77 us? ? ? 10.00 us? ? 258.00 us? ? ? ? ? ? 442? ? ? LOOKUP 0.31? ? ? 49.22 us? ? ? 27.00 us? ? 540.00 us? ? ? ? ? 45620? ? FXATTROP 0.77? ? 124.24 us? ? ? 87.00 us? ? 604.00 us? ? ? ? ? 44968? ? ? WRITE 0.93? 15767.71 us? ? ? 15.00 us? 305833.00 us? ? ? ? ? ? 431? ? ENTRYLK 1.99? 160711.39 us? ? 3332.00 us? 406037.00 us? ? ? ? ? ? 90? ? ? UNLINK 96.00? ? 5167.82 us? ? ? 18.00 us? 55972.00 us? ? ? ? 135349? ? FINODELK Duration: 380 seconds Data Read: 1048576 bytes Data Written: 5894045696 bytes Interval 0 Stats: Block Size:? ? ? ? ? ? 131072b+ No. of Reads:? ? ? ? ? ? ? ? ? ? 8 No. of Writes:? ? ? ? ? ? ? ? 44968 %-latency? Avg-latency? Min-Latency? Max-Latency? No. of calls? ? ? ? Fop ---------? -----------? -----------? -----------? ------------? ? ? ? ---- 0.00? ? ? 0.00 us? ? ? 0.00 us? ? ? 0.00 us? ? ? ? ? ? ? 3? ? ? FORGET 0.00? ? ? 0.00 us? ? ? 0.00 us? ? ? 0.00 us? ? ? ? ? ? 35? ? RELEASE 0.00? ? ? 0.00 us? ? ? 0.00 us? ? ? 0.00 us? ? ? ? ? ? 28? RELEASEDIR 0.00? ? ? 78.00 us? ? ? 78.00 us? ? ? 78.00 us? ? ? ? ? ? ? 1? ? ? FSTAT 0.00? ? ? 35.67 us? ? ? 26.00 us? ? ? 73.00 us? ? ? ? ? ? ? 6? ? ? FLUSH 0.00? ? 324.00 us? ? 324.00 us? ? 324.00 us? ? ? ? ? ? ? 1? ? XATTROP 0.00? ? ? 45.80 us? ? ? 38.00 us? ? ? 54.00 us? ? ? ? ? ? 10? ? ? ? STAT 0.00? ? 227.67 us? ? 216.00 us? ? 242.00 us? ? ? ? ? ? ? 3? ? ? CREATE 0.00? ? 113.38 us? ? ? 68.00 us? ? 381.00 us? ? ? ? ? ? ? 8? ? ? ? READ 0.00? ? ? 39.82 us? ? ? 1.00 us? ? 148.00 us? ? ? ? ? ? 28? ? OPENDIR 0.00? ? ? 67.54 us? ? ? 10.00 us? ? 283.00 us? ? ? ? ? ? 24? ? GETXATTR 0.00? ? ? 59.97 us? ? ? 45.00 us? ? 113.00 us? ? ? ? ? ? 32? ? ? ? OPEN 0.00? ? ? 24.41 us? ? ? 13.00 us? ? ? 89.00 us? ? ? ? ? ? 161? ? INODELK 0.00? ? ? 43.43 us? ? ? 28.00 us? ? 214.00 us? ? ? ? ? ? 93? ? ? STATFS 0.00? ? 246.35 us? ? ? 11.00 us? ? 1155.00 us? ? ? ? ? ? 20? ? 
READDIR 0.00? ? 283.00 us? ? 233.00 us? ? 353.00 us? ? ? ? ? ? 18? ? READDIRP 0.00? ? 153.23 us? ? 122.00 us? ? 259.00 us? ? ? ? ? ? 87? ? ? MKNOD 0.01? ? ? 99.77 us? ? ? 10.00 us? ? 258.00 us? ? ? ? ? ? 442? ? ? LOOKUP 0.31? ? ? 49.22 us? ? ? 27.00 us? ? 540.00 us? ? ? ? ? 45620? ? FXATTROP 0.77? ? 124.24 us? ? ? 87.00 us? ? 604.00 us? ? ? ? ? 44968? ? ? WRITE 0.93? 15767.71 us? ? ? 15.00 us? 305833.00 us? ? ? ? ? ? 431? ? ENTRYLK 1.99? 160711.39 us? ? 3332.00 us? 406037.00 us? ? ? ? ? ? 90? ? ? UNLINK 96.00? ? 5167.82 us? ? ? 18.00 us? 55972.00 us? ? ? ? 135349? ? FINODELK Duration: 380 seconds Data Read: 1048576 bytes Data Written: 5894045696 bytes Brick: ovirt3.localdomain:/gluster_bricks/data/data --------------------------------------------------- Cumulative Stats: Block Size:? ? ? ? ? ? ? ? ? 1b+ No. of Reads:? ? ? ? ? ? ? ? ? ? 0 No. of Writes:? ? ? ? ? ? ? ? 39328 %-latency? Avg-latency? Min-Latency? Max-Latency? No. of calls? ? ? ? Fop ---------? -----------? -----------? -----------? ------------? ? ? ? ---- 0.00? ? ? 0.00 us? ? ? 0.00 us? ? ? 0.00 us? ? ? ? ? ? ? 2? ? ? FORGET 0.00? ? ? 0.00 us? ? ? 0.00 us? ? ? 0.00 us? ? ? ? ? ? 12? ? RELEASE 0.00? ? ? 0.00 us? ? ? 0.00 us? ? ? 0.00 us? ? ? ? ? ? 17? RELEASEDIR 0.00? ? 101.00 us? ? 101.00 us? ? 101.00 us? ? ? ? ? ? ? 1? ? ? FSTAT 0.00? ? ? 51.50 us? ? ? 20.00 us? ? ? 81.00 us? ? ? ? ? ? ? 4? ? ? FLUSH 0.01? ? 219.50 us? ? 188.00 us? ? 251.00 us? ? ? ? ? ? ? 2? ? ? CREATE 0.01? ? ? 43.45 us? ? ? 11.00 us? ? ? 90.00 us? ? ? ? ? ? 11? ? GETXATTR 0.01? ? ? 62.30 us? ? ? 38.00 us? ? 119.00 us? ? ? ? ? ? 10? ? ? ? OPEN 0.01? ? ? 50.59 us? ? ? 1.00 us? ? 102.00 us? ? ? ? ? ? 17? ? OPENDIR 0.01? ? ? 24.60 us? ? ? 12.00 us? ? ? 64.00 us? ? ? ? ? ? 40? ? INODELK 0.02? ? 176.30 us? ? ? 10.00 us? ? 765.00 us? ? ? ? ? ? 10? ? READDIR 0.07? ? ? 63.08 us? ? ? 39.00 us? ? 133.00 us? ? ? ? ? ? 78? ? ? UNLINK 0.13? ? ? 27.35 us? ? ? 10.00 us? ? ? 91.00 us? ? ? ? ? ? 333? ? ENTRYLK 0.13? ? 126.89 us? ? ? 99.00 us? ? 
179.00 us? ? ? ? ? ? 76? ? ? MKNOD 0.42? ? 116.70 us? ? ? 8.00 us? ? 8661.00 us? ? ? ? ? ? 261? ? ? LOOKUP 28.73? ? ? 51.79 us? ? ? 22.00 us? ? 2574.00 us? ? ? ? ? 39822? ? FXATTROP 29.52? ? ? 53.87 us? ? ? 16.00 us? ? 3290.00 us? ? ? ? ? 39328? ? ? WRITE 40.92? ? ? 24.71 us? ? ? 10.00 us? ? 3224.00 us? ? ? ? 118864? ? FINODELK Duration: 189 seconds Data Read: 0 bytes Data Written: 39328 bytes Interval 0 Stats: Block Size:? ? ? ? ? ? ? ? ? 1b+ No. of Reads:? ? ? ? ? ? ? ? ? ? 0 No. of Writes:? ? ? ? ? ? ? ? 39328 %-latency? Avg-latency? Min-Latency? Max-Latency? No. of calls? ? ? ? Fop ---------? -----------? -----------? -----------? ------------? ? ? ? ---- 0.00? ? ? 0.00 us? ? ? 0.00 us? ? ? 0.00 us? ? ? ? ? ? ? 2? ? ? FORGET 0.00? ? ? 0.00 us? ? ? 0.00 us? ? ? 0.00 us? ? ? ? ? ? 12? ? RELEASE 0.00? ? ? 0.00 us? ? ? 0.00 us? ? ? 0.00 us? ? ? ? ? ? 17? RELEASEDIR 0.00? ? 101.00 us? ? 101.00 us? ? 101.00 us? ? ? ? ? ? ? 1? ? ? FSTAT 0.00? ? ? 51.50 us? ? ? 20.00 us? ? ? 81.00 us? ? ? ? ? ? ? 4? ? ? FLUSH 0.01? ? 219.50 us? ? 188.00 us? ? 251.00 us? ? ? ? ? ? ? 2? ? ? CREATE 0.01? ? ? 43.45 us? ? ? 11.00 us? ? ? 90.00 us? ? ? ? ? ? 11? ? GETXATTR 0.01? ? ? 62.30 us? ? ? 38.00 us? ? 119.00 us? ? ? ? ? ? 10? ? ? ? OPEN 0.01? ? ? 50.59 us? ? ? 1.00 us? ? 102.00 us? ? ? ? ? ? 17? ? OPENDIR 0.01? ? ? 24.60 us? ? ? 12.00 us? ? ? 64.00 us? ? ? ? ? ? 40? ? INODELK 0.02? ? 176.30 us? ? ? 10.00 us? ? 765.00 us? ? ? ? ? ? 10? ? READDIR 0.07? ? ? 63.08 us? ? ? 39.00 us? ? 133.00 us? ? ? ? ? ? 78? ? ? UNLINK 0.13? ? ? 27.35 us? ? ? 10.00 us? ? ? 91.00 us? ? ? ? ? ? 333? ? ENTRYLK 0.13? ? 126.89 us? ? ? 99.00 us? ? 179.00 us? ? ? ? ? ? 76? ? ? MKNOD 0.42? ? 116.70 us? ? ? 8.00 us? ? 8661.00 us? ? ? ? ? ? 261? ? ? LOOKUP 28.73? ? ? 51.79 us? ? ? 22.00 us? ? 2574.00 us? ? ? ? ? 39822? ? FXATTROP 29.52? ? ? 53.87 us? ? ? 16.00 us? ? 3290.00 us? ? ? ? ? 39328? ? ? WRITE 40.92? ? ? 24.71 us? ? ? 10.00 us? ? 3224.00 us? ? ? ? 118864? ? 
FINODELK Duration: 189 seconds Data Read: 0 bytes Data Written: 39328 bytes Brick: ovirt2.localdomain:/gluster_bricks/data/data --------------------------------------------------- Cumulative Stats: Block Size:? ? ? ? ? ? ? ? 512b+? ? ? ? ? ? ? 131072b+ No. of Reads:? ? ? ? ? ? ? ? ? ? 0? ? ? ? ? ? ? ? ? ? 0 No. of Writes:? ? ? ? ? ? ? ? ? 36? ? ? ? ? ? ? ? 76758 %-latency? Avg-latency? Min-Latency? Max-Latency? No. of calls? ? ? ? Fop ---------? -----------? -----------? -----------? ------------? ? ? ? ---- 0.00? ? ? 0.00 us? ? ? 0.00 us? ? ? 0.00 us? ? ? ? ? ? ? 6? ? ? FORGET 0.00? ? ? 0.00 us? ? ? 0.00 us? ? ? 0.00 us? ? ? ? ? ? 87? ? RELEASE 0.00? ? ? 0.00 us? ? ? 0.00 us? ? ? 0.00 us? ? ? ? ? ? 96? RELEASEDIR 0.00? ? 100.50 us? ? ? 80.00 us? ? 121.00 us? ? ? ? ? ? ? 2 REMOVEXATTR 0.00? ? 101.00 us? ? 101.00 us? ? 101.00 us? ? ? ? ? ? ? 2? ? SETXATTR 0.00? ? ? 36.18 us? ? ? 22.00 us? ? ? 62.00 us? ? ? ? ? ? 11? ? ? FLUSH 0.00? ? ? 57.44 us? ? ? 42.00 us? ? ? 77.00 us? ? ? ? ? ? ? 9? FTRUNCATE 0.00? ? ? 82.56 us? ? ? 59.00 us? ? 138.00 us? ? ? ? ? ? ? 9? ? ? FSTAT 0.00? ? ? 89.42 us? ? ? 67.00 us? ? 161.00 us? ? ? ? ? ? 12? ? SETATTR 0.00? ? 272.40 us? ? 235.00 us? ? 296.00 us? ? ? ? ? ? ? 5? ? ? CREATE 0.01? ? 154.28 us? ? ? 88.00 us? ? 320.00 us? ? ? ? ? ? 18? ? XATTROP 0.01? ? ? 45.29 us? ? ? 1.00 us? ? 319.00 us? ? ? ? ? ? 96? ? OPENDIR 0.01? ? ? 86.69 us? ? ? 30.00 us? ? 379.00 us? ? ? ? ? ? 62? ? ? ? STAT 0.01? ? ? 64.30 us? ? ? 47.00 us? ? 169.00 us? ? ? ? ? ? 84? ? ? ? OPEN 0.02? ? 107.34 us? ? ? 23.00 us? ? 273.00 us? ? ? ? ? ? 73? ? READDIRP 0.02? ? 4688.00 us? ? ? 86.00 us? ? 9290.00 us? ? ? ? ? ? ? 2? ? TRUNCATE 0.02? ? ? 59.29 us? ? ? 13.00 us? ? 394.00 us? ? ? ? ? ? 165? ? GETXATTR 0.03? ? 128.51 us? ? ? 27.00 us? ? 338.00 us? ? ? ? ? ? 96? ? ? FSYNC 0.03? ? 240.75 us? ? ? 14.00 us? ? 1943.00 us? ? ? ? ? ? 52? ? READDIR 0.04? ? ? 65.59 us? ? ? 26.00 us? ? 293.00 us? ? ? ? ? ? 279? ? ? STATFS 0.06? ? 180.77 us? ? 118.00 us? ? 306.00 us? ? ? ? ? ? 
148? ? ? MKNOD 0.14? ? ? 37.98 us? ? ? 17.00 us? ? 192.00 us? ? ? ? ? 1598? ? INODELK 0.67? ? ? 91.68 us? ? ? 12.00 us? ? 1141.00 us? ? ? ? ? 3186? ? ? LOOKUP 10.10? ? ? 55.92 us? ? ? 28.00 us? ? 1658.00 us? ? ? ? ? 78608? ? FXATTROP 11.89? ? 6814.76 us? ? ? 18.00 us? 301246.00 us? ? ? ? ? ? 760? ? ENTRYLK 19.44? ? ? 36.55 us? ? ? 14.00 us? ? 2353.00 us? ? ? ? 231535? ? FINODELK 25.21? ? 142.92 us? ? ? 62.00 us? ? 593.00 us? ? ? ? ? 76794? ? ? WRITE 32.28? 91283.68 us? ? ? 28.00 us? 316658.00 us? ? ? ? ? ? 154? ? ? UNLINK Duration: 1206 seconds Data Read: 0 bytes Data Written: 10060843008 bytes Interval 0 Stats: Block Size:? ? ? ? ? ? ? ? 512b+? ? ? ? ? ? ? 131072b+ No. of Reads:? ? ? ? ? ? ? ? ? ? 0? ? ? ? ? ? ? ? ? ? 0 No. of Writes:? ? ? ? ? ? ? ? ? 36? ? ? ? ? ? ? ? 76758 %-latency? Avg-latency? Min-Latency? Max-Latency? No. of calls? ? ? ? Fop ---------? -----------? -----------? -----------? ------------? ? ? ? ---- 0.00? ? ? 0.00 us? ? ? 0.00 us? ? ? 0.00 us? ? ? ? ? ? ? 6? ? ? FORGET 0.00? ? ? 0.00 us? ? ? 0.00 us? ? ? 0.00 us? ? ? ? ? ? 87? ? RELEASE 0.00? ? ? 0.00 us? ? ? 0.00 us? ? ? 0.00 us? ? ? ? ? ? 96? RELEASEDIR 0.00? ? 100.50 us? ? ? 80.00 us? ? 121.00 us? ? ? ? ? ? ? 2 REMOVEXATTR 0.00? ? 101.00 us? ? 101.00 us? ? 101.00 us? ? ? ? ? ? ? 2? ? SETXATTR 0.00? ? ? 36.18 us? ? ? 22.00 us? ? ? 62.00 us? ? ? ? ? ? 11? ? ? FLUSH 0.00? ? ? 57.44 us? ? ? 42.00 us? ? ? 77.00 us? ? ? ? ? ? ? 9? FTRUNCATE 0.00? ? ? 82.56 us? ? ? 59.00 us? ? 138.00 us? ? ? ? ? ? ? 9? ? ? FSTAT 0.00? ? ? 89.42 us? ? ? 67.00 us? ? 161.00 us? ? ? ? ? ? 12? ? SETATTR 0.00? ? 272.40 us? ? 235.00 us? ? 296.00 us? ? ? ? ? ? ? 5? ? ? CREATE 0.01? ? 154.28 us? ? ? 88.00 us? ? 320.00 us? ? ? ? ? ? 18? ? XATTROP 0.01? ? ? 45.29 us? ? ? 1.00 us? ? 319.00 us? ? ? ? ? ? 96? ? OPENDIR 0.01? ? ? 86.69 us? ? ? 30.00 us? ? 379.00 us? ? ? ? ? ? 62? ? ? ? STAT 0.01? ? ? 64.30 us? ? ? 47.00 us? ? 169.00 us? ? ? ? ? ? 84? ? ? ? OPEN 0.02? ? 107.34 us? ? ? 23.00 us? ? 273.00 us? ? ? ? ? ? 73? ? 
READDIRP 0.02? ? 4688.00 us? ? ? 86.00 us? ? 9290.00 us? ? ? ? ? ? ? 2? ? TRUNCATE 0.02? ? ? 59.29 us? ? ? 13.00 us? ? 394.00 us? ? ? ? ? ? 165? ? GETXATTR 0.03? ? 128.51 us? ? ? 27.00 us? ? 338.00 us? ? ? ? ? ? 96? ? ? FSYNC 0.03? ? 240.75 us? ? ? 14.00 us? ? 1943.00 us? ? ? ? ? ? 52? ? READDIR 0.04? ? ? 65.59 us? ? ? 26.00 us? ? 293.00 us? ? ? ? ? ? 279? ? ? STATFS 0.06? ? 180.77 us? ? 118.00 us? ? 306.00 us? ? ? ? ? ? 148? ? ? MKNOD 0.14? ? ? 37.98 us? ? ? 17.00 us? ? 192.00 us? ? ? ? ? 1598? ? INODELK 0.67? ? ? 91.66 us? ? ? 12.00 us? ? 1141.00 us? ? ? ? ? 3186? ? ? LOOKUP 10.10? ? ? 55.92 us? ? ? 28.00 us? ? 1658.00 us? ? ? ? ? 78608? ? FXATTROP 11.89? ? 6814.76 us? ? ? 18.00 us? 301246.00 us? ? ? ? ? ? 760? ? ENTRYLK 19.44? ? ? 36.55 us? ? ? 14.00 us? ? 2353.00 us? ? ? ? 231535? ? FINODELK 25.21? ? 142.92 us? ? ? 62.00 us? ? 593.00 us? ? ? ? ? 76794? ? ? WRITE 32.28? 91283.68 us? ? ? 28.00 us? 316658.00 us? ? ? ? ? ? 154? ? ? UNLINK Duration: 1206 seconds Data Read: 0 bytes Data Written: 10060843008 bytes This indicates to me that it's not a problem in Disk/LVM/FileSystem layout. Most probably I haven't created the volume properly or some option/feature is disabled ?!? Network shows OK for a gigabit: [root at ovirt1 data]# dd if=/dev/zero status=progress | nc ovirt2 9999 3569227264 bytes (3.6 GB) copied, 29.001052 s, 123 MB/s^C 7180980+0 records in 7180979+0 records out 3676661248 bytes (3.7 GB) copied, 29.8739 s, 123 MB/s I'm looking for any help... you can share your volume info also. Thanks in advance. Best Regards, Strahil Nikolov _______________________________________________ Gluster-users mailing list Gluster-users at gluster.org https://lists.gluster.org/mailman/listinfo/gluster-users -- Amar Tumballi (amarts) Dear Amar, Thanks for your email. Actually my concerns were on both topics. Would you recommend any perf options that will be suitable ? 
After mentioning the network usage, I just checked it and it seems duringthe test session, ovirt1 (both client and host) is using no more than 455Mbit/s which is half the network bandwidth. I'm still in the middle of nowhere, so any ideas are welcome. Best Regards, Strahil Nikolov On Jan 23, 2019 17:49, Amar Tumballi Suryanarayan wrote: > > I didn't understand the issue properly. Mostly I missed something. > > Are you concerned the performance is 49MB/s with and without perf options? or are you expecting it to be 123MB/s as over the n/w you get that speed? > > If it is the first problem, then you are actually having 'performance.write-behind on' in both options, and it is the only perf xlator which comes into action during the test you ran. > > If it is the second, then please be informed that gluster does client side replication, which means, n/w would be split in half for write operations (like write(), creat() etc), so the number you are getting is almost the maximum with 1GbE. > > Regards, > Amar > > On Wed, Jan 23, 2019 at 8:38 PM Strahil Nikolov wrote: >> >> Hello Community, >> >> recently I have built a new lab based on oVirt and CentOS 7. >> During deployment I had some hicups, but now the engine is up and running - but gluster is causing me trouble. >> >> Symptoms: Slow VM install from DVD, poor write performance. The latter has been tested via: >> dd if=/dev/zero of=/rhev/data-center/mnt/glusterSD/ovirt1.localdomain:_data bs=1M count=1000 status=progress >> >> The reported speed is 60MB/s which is way too low for my setup. >> >> My lab design: >> https://drive.google.com/file/d/1SiW21ASPXHRAEuE_jZ50R3FoO-NcnFqT/view?usp=sharing >> Gluster version is 3.12.15 >> >> So far I have done: >> >> 1. 
Added 'server.allow-insecure on' (with 'option rpc-auth-allow-insecure on' in glusterd.vol) >> Volume info after that change: >> >> Volume Name: data >> Type: Replicate >> Volume ID: 9b06a1e9-8102-4cd7-bc56-84960a1efaa2 >> Status: Started >> Snapshot Count: 0 >> Number of Bricks: 1 x (2 + 1) = 3 >> Transport-type: tcp >> Bricks: >> Brick1: ovirt1.localdomain:/gluster_bricks/data/data >> Brick2: ovirt2.localdomain:/gluster_bricks/data/data >> Brick3: ovirt3.localdomain:/gluster_bricks/data/data (arbiter) >> Options Reconfigured: >> performance.client-io-threads: off >> nfs.disable: on >> transport.address-family: inet >> performance.quick-read: off >> performance.read-ahead: off >> performance.io-cache: off >> performance.low-prio-threads: 32 >> network.remote-dio: off >> cluster.eager-lock: enable >> cluster.quorum-type: auto >> cluster.server-quorum-type: server >> cluster.data-self-heal-algorithm: full >> cluster.locking-scheme: granular >> cluster.shd-max-threads: 8 >> cluster.shd-wait-qlength: 10000 >> features.shard: on >> user.cifs: off >> storage.owner-uid: 36 >> storage.owner-gid: 36 >> network.ping-timeout: 30 >> performance.strict-o-direct: on >> cluster.granular-entry-heal: enable >> server.allow-insecure: on >> >> Seems no positive or negative effect so far. >> >> 2. Tested with tmpfs? on all bricks -> ovirt1 mounted gluster volume ->? max 60MB/s (bs=1M without 'oflag=direct') >> >> >> [root at ovirt1 data]# dd if=/dev/zero of=large_io bs=1M? count=4000 status=progress >> 4177526784 bytes (4.2 GB) copied, 70.843409 s, 59.0 MB/s >> 4000+0 records in >> 4000+0 records out >> 4194304000 bytes (4.2 GB) copied, 71.1407 s, 59.0 MB/s >> [root at ovirt1 data]# rm -f large_io >> [root at ovirt1 data]# gluster volume profile data info >> Brick: ovirt1.localdomain:/gluster_bricks/data/data >> --------------------------------------------------- >> Cumulative Stats: >> Block Size:? ? ? ? ? ? 131072b+ >> No. of Reads:? ? ? ? ? ? ? ? ? ? 8 >> No. of Writes:? ? ? ? ? ? 
? ? 44968 >> %-latency? Avg-latency? Min-Latency? Max-Latency? No. of calls? ? ? ? Fop >> ---------? -----------? -----------? -----------? ------------? ? ? ? ---- >> 0.00? ? ? 0.00 us? ? ? 0.00 us? ? ? 0.00 us? ? ? ? ? ? ? 3? ? ? FORGET >> 0.00? ? ? 0.00 us? ? ? 0.00 us? ? ? 0.00 us? ? ? ? ? ? 35? ? RELEASE >> 0.00? ? ? 0.00 us? ? ? 0.00 us? ? ? 0.00 us? ? ? ? ? ? 28? RELEASEDIR >> 0.00? ? ? 78.00 us? ? ? 78.00 us? ? ? 78.00 us? ? ? ? ? ? ? 1? ? ? FSTAT >> 0.00? ? ? 35.67 us? ? ? 26.00 us? ? ? 73.00 us? ? ? ? ? ? ? 6? ? ? FLUSH >> 0.00? ? 324.00 us? ? 324.00 us? ? 324.00 us? ? ? ? ? ? ? 1? ? XATTROP >> 0.00? ? ? 45.80 us? ? ? 38.00 us? ? ? 54.00 us? ? ? ? ? ? 10? ? ? ? STAT >> 0.00? ? 227.67 us? ? 216.00 us? ? 242.00 us? ? ? ? ? ? ? 3? ? ? CREATE >> 0.00? ? 113.38 us? ? ? 68.00 us? ? 381.00 us? ? ? ? ? ? ? 8? ? ? ? READ >> 0.00? ? ? 39.82 us? ? ? 1.00 us? ? 148.00 us? ? ? ? ? ? 28? ? OPENDIR >> 0.00? ? ? 67.54 us? ? ? 10.00 us? ? 283.00 us? ? ? ? ? ? 24? ? GETXATTR >> 0.00? ? ? 59.97 us? ? ? 45.00 us? ? 113.00 us? ? ? ? ? ? 32? ? ? ? OPEN >> 0.00? ? ? 24.41 us? ? ? 13.00 us? ? ? 89.00 us? ? ? ? ? ? 161? ? INODELK >> 0.00? ? ? 43.43 us? ? ? 28.00 us? ? 214.00 us? ? ? ? ? ? 93? ? ? STATFS >> 0.00? ? 246.35 us? ? ? 11.00 us? ? 1155.00 us? ? ? ? ? ? 20? ? READDIR >> 0.00? ? 283.00 us? ? 233.00 us? ? 353.00 us? ? ? ? ? ? 18? ? READDIRP >> 0.00? ? 153.23 us? ? 122.00 us? ? 259.00 us? ? ? ? ? ? 87? ? ? MKNOD >> 0.01? ? ? 99.77 us? ? ? 10.00 us? ? 258.00 us? ? ? ? ? ? 442? ? ? LOOKUP >> 0.31? ? ? 49.22 us? ? ? 27.00 us? ? 540.00 us? ? ? ? ? 45620? ? FXATTROP >> 0.77? ? 124.24 us? ? ? 87.00 us? ? 604.00 us? ? ? ? ? 44968? ? ? WRITE >> 0.93? 15767.71 us? ? ? 15.00 us? 305833.00 us? ? ? ? ? ? 431? ? ENTRYLK >> 1.99? 160711.39 us? ? 3332.00 us? 406037.00 us? ? ? ? ? ? 90? ? ? UNLINK >> 96.00? ? 5167.82 us? ? ? 18.00 us? 55972.00 us? ? ? ? 135349? ? 
FINODELK >> >> Duration: 380 seconds >> Data Read: 1048576 bytes >> Data Written: 5894045696 bytes >> >> Interval 0 Stats: >> Block Size:? ? ? ? ? ? 131072b+ >> No. of Reads:? ? ? ? ? ? ? ? ? ? 8 >> No. of Writes:? ? ? ? ? ? ? ? 44968 >> %-latency? Avg-latency? Min-Latency? Max-Latency? No. of calls? ? ? ? Fop >> ---------? -----------? -----------? -----------? ------------? ? ? ? ---- >> 0.00? ? ? 0.00 us? ? ? 0.00 us? ? ? 0.00 us? ? ? ? ? ? ? 3? ? ? FORGET >> 0.00? ? ? 0.00 us? ? ? 0.00 us? ? ? 0.00 us? ? ? ? ? ? 35? ? RELEASE >> 0.00? ? ? 0.00 us? ? ? 0.00 us? ? ? 0.00 us? ? ? ? ? ? 28? RELEASEDIR >> 0.00? ? ? 78.00 us? ? ? 78.00 us? ? ? 78.00 us? ? ? ? ? ? ? 1? ? ? FSTAT >> 0.00? ? ? 35.67 us? ? ? 26.00 us? ? ? 73.00 us? ? ? ? ? ? ? 6? ? ? FLUSH >> 0.00? ? 324.00 us? ? 324.00 us? ? 324.00 us? ? ? ? ? ? ? 1? ? XATTROP >> 0.00? ? ? 45.80 us? ? ? 38.00 us? ? ? 54.00 us? ? ? ? ? ? 10? ? ? ? STAT >> 0.00? ? 227.67 us? ? 216.00 us? ? 242.00 us? ? ? ? ? ? ? 3? ? ? CREATE >> 0.00? ? 113.38 us? ? ? 68.00 us? ? 381.00 us? ? ? ? ? ? ? 8? ? ? ? READ >> 0.00? ? ? 39.82 us? ? ? 1.00 us? ? 148.00 us? ? ? ? ? ? 28? ? OPENDIR >> 0.00? ? ? 67.54 us? ? ? 10.00 us? ? 283.00 us? ? ? ? ? ? 24? ? GETXATTR >> 0.00? ? ? 59.97 us? ? ? 45.00 us? ? 113.00 us? ? ? ? ? ? 32? ? ? ? OPEN >> 0.00? ? ? 24.41 us? ? ? 13.00 us? ? ? 89.00 us? ? ? ? ? ? 161? ? INODELK >> 0.00? ? ? 43.43 us? ? ? 28.00 us? ? 214.00 us? ? ? ? ? ? 93? ? ? STATFS >> 0.00? ? 246.35 us? ? ? 11.00 us? ? 1155.00 us? ? ? ? ? ? 20? ? READDIR >> 0.00? ? 283.00 us? ? 233.00 us? ? 353.00 us? ? ? ? ? ? 18? ? READDIRP >> 0.00? ? 153.23 us? ? 122.00 us? ? 259.00 us? ? ? ? ? ? 87? ? ? MKNOD >> 0.01? ? ? 99.77 us? ? ? 10.00 us? ? 258.00 us? ? ? ? ? ? 442? ? ? LOOKUP >> 0.31? ? ? 49.22 us? ? ? 27.00 us? ? 540.00 us? ? ? ? ? 45620? ? FXATTROP >> 0.77? ? 124.24 us? ? ? 87.00 us? ? 604.00 us? ? ? ? ? 44968? ? ? WRITE >> 0.93? 15767.71 us? ? ? 15.00 us? 305833.00 us? ? ? ? ? ? 431? ? ENTRYLK >> 1.99? 160711.39 us? ? 3332.00 us? 
406037.00 us? ? ? ? ? ? 90? ? ? UNLINK >> 96.00? ? 5167.82 us? ? ? 18.00 us? 55972.00 us? ? ? ? 135349? ? FINODELK >> >> Duration: 380 seconds >> Data Read: 1048576 bytes >> Data Written: 5894045696 bytes >> >> Brick: ovirt3.localdomain:/gluster_bricks/data/data >> --------------------------------------------------- >> Cumulative Stats: >> Block Size:? ? ? ? ? ? ? ? ? 1b+ >> No. of Reads:? ? ? ? ? ? ? ? ? ? 0 >> No. of Writes:? ? ? ? ? ? ? ? 39328 >> %-latency? Avg-latency? Min-Latency? Max-Latency? No. of calls? ? ? ? Fop >> ---------? -----------? -----------? -----------? ------------? ? ? ? ---- >> 0.00? ? ? 0.00 us? ? ? 0.00 us? ? ? 0.00 us? ? ? ? ? ? ? 2? ? ? FORGET >> 0.00? ? ? 0.00 us? ? ? 0.00 us? ? ? 0.00 us? ? ? ? ? ? 12? ? RELEASE >> 0.00? ? ? 0.00 us? ? ? 0.00 us? ? ? 0.00 us? ? ? ? ? ? 17? RELEASEDIR >> 0.00? ? 101.00 us? ? 101.00 us? ? 101.00 us? ? ? ? ? ? ? 1? ? ? FSTAT >> 0.00? ? ? 51.50 us? ? ? 20.00 us? ? ? 81.00 us? ? ? ? ? ? ? 4? ? ? FLUSH >> 0.01? ? 219.50 us? ? 188.00 us? ? 251.00 us? ? ? ? ? ? ? 2? ? ? CREATE >> 0.01? ? ? 43.45 us? ? ? 11.00 us? ? ? 90.00 us? ? ? ? ? ? 11? ? GETXATTR >> 0.01? ? ? 62.30 us? ? ? 38.00 us? ? 119.00 us? ? ? ? ? ? 10? ? ? ? OPEN >> 0.01? ? ? 50.59 us? ? ? 1.00 us? ? 102.00 us? ? ? ? ? ? 17? ? OPENDIR >> 0.01? ? ? 24.60 us? ? ? 12.00 us? ? ? 64.00 us? ? ? ? ? ? 40? ? INODELK >> 0.02? ? 176.30 us? ? ? 10.00 us? ? 765.00 us? ? ? ? ? ? 10? ? READDIR >> 0.07? ? ? 63.08 us? ? ? 39.00 us? ? 133.00 us? ? ? ? ? ? 78? ? ? UNLINK >> 0.13? ? ? 27.35 us? ? ? 10.00 us? ? ? 91.00 us? ? ? ? ? ? 333? ? ENTRYLK >> 0.13? ? 126.89 us? ? ? 99.00 us? ? 179.00 us? ? ? ? ? ? 76? ? ? MKNOD >> 0.42? ? 116.70 us? ? ? 8.00 us? ? 8661.00 us? ? ? ? ? ? 261? ? ? LOOKUP >> 28.73? ? ? 51.79 us? ? ? 22.00 us? ? 2574.00 us? ? ? ? ? 39822? ? FXATTROP >> 29.52? ? ? 53.87 us? ? ? 16.00 us? ? 3290.00 us? ? ? ? ? 39328? ? ? WRITE >> 40.92? ? ? 24.71 us? ? ? 10.00 us? ? 3224.00 us? ? ? ? 118864? ? 
FINODELK >> >> Duration: 189 seconds >> Data Read: 0 bytes >> Data Written: 39328 bytes >> >> Interval 0 Stats: >> Block Size:? ? ? ? ? ? ? ? ? 1b+ >> No. of Reads:? ? ? ? ? ? ? ? ? ? 0 >> No. of Writes:? ? ? ? ? ? ? ? 39328 >> %-latency? Avg-latency? Min-Latency? Max-Latency? No. of calls? ? ? ? Fop >> ---------? -----------? -----------? -----------? ------------? ? ? ? ---- >> 0.00? ? ? 0.00 us? ? ? 0.00 us? ? ? 0.00 us? ? ? ? ? ? ? 2? ? ? FORGET >> 0.00? ? ? 0.00 us? ? ? 0.00 us? ? ? 0.00 us? ? ? ? ? ? 12? ? RELEASE >> 0.00? ? ? 0.00 us? ? ? 0.00 us? ? ? 0.00 us? ? ? ? ? ? 17? RELEASEDIR >> 0.00? ? 101.00 us? ? 101.00 us? ? 101.00 us? ? ? ? ? ? ? 1? ? ? FSTAT >> 0.00? ? ? 51.50 us? ? ? 20.00 us? ? ? 81.00 us? ? ? ? ? ? ? 4? ? ? FLUSH >> 0.01? ? 219.50 us? ? 188.00 us? ? 251.00 us? ? ? ? ? ? ? 2? ? ? CREATE >> 0.01? ? ? 43.45 us? ? ? 11.00 us? ? ? 90.00 us? ? ? ? ? ? 11? ? GETXATTR >> 0.01? ? ? 62.30 us? ? ? 38.00 us? ? 119.00 us? ? ? ? ? ? 10? ? ? ? OPEN >> 0.01? ? ? 50.59 us? ? ? 1.00 us? ? 102.00 us? ? ? ? ? ? 17? ? OPENDIR >> 0.01? ? ? 24.60 us? ? ? 12.00 us? ? ? 64.00 us? ? ? ? ? ? 40? ? INODELK >> 0.02? ? 176.30 us? ? ? 10.00 us? ? 765.00 us? ? ? ? ? ? 10? ? READDIR >> 0.07? ? ? 63.08 us? ? ? 39.00 us? ? 133.00 us? ? ? ? ? ? 78? ? ? UNLINK >> 0.13? ? ? 27.35 us? ? ? 10.00 us? ? ? 91.00 us? ? ? ? ? ? 333? ? ENTRYLK >> 0.13? ? 126.89 us? ? ? 99.00 us? ? 179.00 us? ? ? ? ? ? 76? ? ? MKNOD >> 0.42? ? 116.70 us? ? ? 8.00 us? ? 8661.00 us? ? ? ? ? ? 261? ? ? LOOKUP >> 28.73? ? ? 51.79 us? ? ? 22.00 us? ? 2574.00 us? ? ? ? ? 39822? ? FXATTROP >> 29.52? ? ? 53.87 us? ? ? 16.00 us? ? 3290.00 us? ? ? ? ? 39328? ? ? WRITE >> 40.92? ? ? 24.71 us? ? ? 10.00 us? ? 3224.00 us? ? ? ? 118864? ? FINODELK >> >> Duration: 189 seconds >> Data Read: 0 bytes >> Data Written: 39328 bytes >> >> Brick: ovirt2.localdomain:/gluster_bricks/data/data >> --------------------------------------------------- >> Cumulative Stats: >> Block Size:? ? ? ? ? ? ? ? 512b+? ? ? ? ? ? ? 
131072b+ >> No. of Reads:? ? ? ? ? ? ? ? ? ? 0? ? ? ? ? ? ? ? ? ? 0 >> No. of Writes:? ? ? ? ? ? ? ? ? 36? ? ? ? ? ? ? ? 76758 >> %-latency? Avg-latency? Min-Latency? Max-Latency? No. of calls? ? ? ? Fop >> ---------? -----------? -----------? -----------? ------------? ? ? ? ---- >> 0.00? ? ? 0.00 us? ? ? 0.00 us? ? ? 0.00 us? ? ? ? ? ? ? 6? ? ? FORGET >> 0.00? ? ? 0.00 us? ? ? 0.00 us? ? ? 0.00 us? ? ? ? ? ? 87? ? RELEASE >> 0.00? ? ? 0.00 us? ? ? 0.00 us? ? ? 0.00 us? ? ? ? ? ? 96? RELEASEDIR >> 0.00? ? 100.50 us? ? ? 80.00 us? ? 121.00 us? ? ? ? ? ? ? 2 REMOVEXATTR >> 0.00? ? 101.00 us? ? 101.00 us? ? 101.00 us? ? ? ? ? ? ? 2? ? SETXATTR >> 0.00? ? ? 36.18 us? ? ? 22.00 us? ? ? 62.00 us? ? ? ? ? ? 11? ? ? FLUSH >> 0.00? ? ? 57.44 us? ? ? 42.00 us? ? ? 77.00 us? ? ? ? ? ? ? 9? FTRUNCATE >> 0.00? ? ? 82.56 us? ? ? 59.00 us? ? 138.00 us? ? ? ? ? ? ? 9? ? ? FSTAT >> 0.00? ? ? 89.42 us? ? ? 67.00 us? ? 161.00 us? ? ? ? ? ? 12? ? SETATTR >> 0.00? ? 272.40 us? ? 235.00 us? ? 296.00 us? ? ? ? ? ? ? 5? ? ? CREATE >> 0.01? ? 154.28 us? ? ? 88.00 us? ? 320.00 us? ? ? ? ? ? 18? ? XATTROP >> 0.01? ? ? 45.29 us? ? ? 1.00 us? ? 319.00 us? ? ? ? ? ? 96? ? OPENDIR >> 0.01? ? ? 86.69 us? ? ? 30.00 us? ? 379.00 us? ? ? ? ? ? 62? ? ? ? STAT >> 0.01? ? ? 64.30 us? ? ? 47.00 us? ? 169.00 us? ? ? ? ? ? 84? ? ? ? OPEN >> 0.02? ? 107.34 us? ? ? 23.00 us? ? 273.00 us? ? ? ? ? ? 73? ? READDIRP >> 0.02? ? 4688.00 us? ? ? 86.00 us? ? 9290.00 us? ? ? ? ? ? ? 2? ? TRUNCATE >> 0.02? ? ? 59.29 us? ? ? 13.00 us? ? 394.00 us? ? ? ? ? ? 165? ? GETXATTR >> 0.03? ? 128.51 us? ? ? 27.00 us? ? 338.00 us? ? ? ? ? ? 96? ? ? FSYNC >> 0.03? ? 240.75 us? ? ? 14.00 us? ? 1943.00 us? ? ? ? ? ? 52? ? READDIR >> 0.04? ? ? 65.59 us? ? ? 26.00 us? ? 293.00 us? ? ? ? ? ? 279? ? ? STATFS >> 0.06? ? 180.77 us? ? 118.00 us? ? 306.00 us? ? ? ? ? ? 148? ? ? MKNOD >> 0.14? ? ? 37.98 us? ? ? 17.00 us? ? 192.00 us? ? ? ? ? 1598? ? INODELK >> 0.67? ? ? 91.68 us? ? ? 12.00 us? ? 1141.00 us? ? ? ? ? 3186? ? ? 
LOOKUP
>>     10.10       55.92 us       28.00 us     1658.00 us          78608    FXATTROP
>>     11.89     6814.76 us       18.00 us   301246.00 us            760     ENTRYLK
>>     19.44       36.55 us       14.00 us     2353.00 us         231535    FINODELK
>>     25.21      142.92 us       62.00 us      593.00 us          76794       WRITE
>>     32.28    91283.68 us       28.00 us   316658.00 us            154      UNLINK
>>
>> Duration: 1206 seconds
>> Data Read: 0 bytes
>> Data Written: 10060843008 bytes
>>
>> Interval 0 Stats:
>> Block Size:                512b+              131072b+
>> No. of Reads:                  0                     0
>> No. of Writes:                36                 76758
>> %-latency    Avg-latency    Min-Latency    Max-Latency   No. of calls         Fop
>> ---------    -----------    -----------    -----------   ------------        ----
>>      0.00        0.00 us        0.00 us        0.00 us              6      FORGET
>>      0.00        0.00 us        0.00 us        0.00 us             87     RELEASE
>>      0.00        0.00 us        0.00 us        0.00 us             96  RELEASEDIR
>>      0.00      100.50 us       80.00 us      121.00 us              2 REMOVEXATTR
>>      0.00      101.00 us      101.00 us      101.00 us              2    SETXATTR
>>      0.00       36.18 us       22.00 us       62.00 us             11       FLUSH
>>      0.00       57.44 us       42.00 us       77.00 us              9   FTRUNCATE
>>      0.00       82.56 us       59.00 us      138.00 us              9       FSTAT
>>      0.00       89.42 us       67.00 us      161.00 us             12     SETATTR
>>      0.00      272.40 us      235.00 us      296.00 us              5      CREATE
>>      0.01      154.28 us       88.00 us      320.00 us             18     XATTROP
>>      0.01       45.29 us        1.00 us      319.00 us             96     OPENDIR
>>      0.01       86.69 us       30.00 us      379.00 us             62        STAT
>>      0.01       64.30 us       47.00 us      169.00 us             84        OPEN
>>      0.02      107.34 us       23.00 us      273.00 us             73    READDIRP
>>      0.02     4688.00 us       86.00 us     9290.00 us              2    TRUNCATE
>>      0.02       59.29 us       13.00 us      394.00 us            165    GETXATTR
>>      0.03      128.51 us       27.00 us      338.00 us             96       FSYNC
>>      0.03      240.75 us       14.00 us     1943.00 us             52     READDIR
>>      0.04       65.59 us       26.00 us      293.00 us            279      STATFS
>>      0.06      180.77 us      118.00 us      306.00 us            148       MKNOD
>>      0.14       37.98 us       17.00 us      192.00 us           1598     INODELK
>>      0.67       91.66 us       12.00 us     1141.00 us           3186      LOOKUP
>>     10.10       55.92 us       28.00 us     1658.00 us          78608    FXATTROP
>>     11.89     6814.76 us       18.00 us   301246.00 us            760     ENTRYLK
>>     19.44       36.55 us       14.00 us     2353.00 us         231535    FINODELK
>>     25.21      142.92 us       62.00 us      593.00 us          76794       WRITE
>>     32.28    91283.68 us       28.00 us   316658.00 us            154      UNLINK
>>
>> Duration: 1206 seconds
>> Data Read: 0 bytes
>> Data Written: 10060843008 bytes
>>
>>
>>
>> This indicates to me that it's not a problem in Disk/LVM/FileSystem layout.
>>
>> Most probably I haven't created the volume properly or some option/feature is disabled ?!?
>> Network shows OK for a gigabit:
>> [root at ovirt1 data]# dd if=/dev/zero status=progress | nc ovirt2 9999
>> 3569227264 bytes (3.6 GB) copied, 29.001052 s, 123 MB/s^C
>> 7180980+0 records in
>> 7180979+0 records out
>> 3676661248 bytes (3.7 GB) copied, 29.8739 s, 123 MB/s
>>
>>
>> I'm looking for any help... you can share your volume info also.
>>
>> Thanks in advance.
>>
>> Best Regards,
>> Strahil Nikolov
>> _______________________________________________
>> Gluster-users mailing list
>> Gluster-users at gluster.org
>> https://lists.gluster.org/mailman/listinfo/gluster-users
>
>
>
> --
> Amar Tumballi (amarts)
_______________________________________________
Gluster-users mailing list
Gluster-users at gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From max.degraaf at kpn.com Fri Jan 25 15:09:42 2019
From: max.degraaf at kpn.com (max.degraaf at kpn.com)
Date: Fri, 25 Jan 2019 15:09:42 +0000
Subject: [Gluster-users] Brick stays offline after update from 4.1.6-1.el7 to 4.1.7-1.el7
In-Reply-To:
References:
Message-ID:

Found it. The filesystem on one of the nodes was corrupt. Removing that brick, fixing the filesystem and adding the brick again solved the problem.

________________________________
From: Graaf, Max de
Sent: Friday, January 25, 2019 1:32:41 PM
To: Gluster-users
Subject: Brick stays offline after update from 4.1.6-1.el7 to 4.1.7-1.el7

We have 2 nodes running CentOS 7.3. They were running just fine with glusterfs 4.1.6-1.el7. This morning we updated both to 4.1.7-1.el7 and the only brick configured stays offline.

gluster peer status shows no problems:

Number of Peers: 1

Hostname: 10.159.241.35
Uuid: 7453dbec-44fb-4e57-9471-6e653d287d3b
State: Peer in Cluster (Connected)

Number of Peers: 1

Hostname: 10.159.241.3
Uuid: 8f0e75bd-c782-4d21-aaf3-2d8a27e8a714
State: Peer in Cluster (Connected)

gluster volume status shows the bricks offline:

Status of volume: gst
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick grpprdaalcgst01.cloudprod.local:/apps
/glusterfs-gst/gst                          49152     0          Y       8827
Brick grpprdapdcgst01.cloudprod.local:/apps
/glusterfs-gst/gst                          N/A       N/A        N       N/A
Self-heal Daemon on localhost               N/A       N/A        Y       8818
Self-heal Daemon on grpprdapdcgst01.cloudpr
od.local                                    N/A       N/A        Y       28111

Task Status of Volume gst
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: gst
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick grpprdaalcgst01.cloudprod.local:/apps
/glusterfs-gst/gst                          49152     0          Y       8827
Brick grpprdapdcgst01.cloudprod.local:/apps
/glusterfs-gst/gst                          N/A       N/A        N       N/A
Self-heal Daemon on localhost               N/A       N/A        Y       28111
Self-heal Daemon on 10.159.241.3            N/A       N/A        Y       8818

Task Status of Volume gst
------------------------------------------------------------------------------
There are no active volume tasks

Any idea on a fix? If not, how can we revert to 4.1.6-1.el7?
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From g.amedick at uni-luebeck.de Fri Jan 25 15:10:58 2019
From: g.amedick at uni-luebeck.de (Gudrun Mareike Amedick)
Date: Fri, 25 Jan 2019 16:10:58 +0100
Subject: [Gluster-users] Files losing permissions in GlusterFS 3.12
Message-ID: <1548429058.2018.12.camel@uni-luebeck.de>

Hi all,

we have a problem with a distributed dispersed volume (GlusterFS 3.12). We have files that lost their permissions or gained sticky bits.
The files themselves seem to be okay.

It looks like this:

# ls -lah $file1
---------- 1 www-data www-data 45M Jan 12 07:01 $file1

# ls -lah $file2
-rw-rwS--T 1 $user $group 11K Jan  9 11:48 $file2

# ls -lah $file3
---------T 1 $user $group 6.8M Jan 12 08:17 $file3

This is not what the permissions are supposed to look like. They were 644 or 660 before. And they definitely had no sticky bits.
The permissions on the bricks match what I see on client side. So I think the original permissions are lost without a chance to recover them, right?


With some files with weird looking permissions (but not with all of them), I can do this:
# ls -lah $path/$file4
-rw-r--r-- 1 $user $group 6.0G Oct 11 09:34 $path/$file4
ls -lah $path | grep $file4
-rw-r-Sr-T  1 $user$group 6.0G Oct 11 09:34 $file4
So, the permissions I see depend on how I'm querying them. The permissions on brick side agree with the latter result, stat sees the former. I'm not sure how that works.

We know for at least a part of those files that they were okay on December 19th. We got the first reports of weird-looking permissions on January 12th. In between, there was a rebalance running (January 7th to January 11th). During that rebalance, a node was offline for a longer period of time due to hardware issues. The output of "gluster volume heal $VOLUME info" shows no files though.

For all files with broken permissions we found so far, the following lines are in the rebalance log:

[2019-01-07 09:31:11.004802] I [MSGID: 109045] [dht-common.c:2456:dht_lookup_cbk] 0-$VOLUME-dht: linkfile not having link subvol for $file5
[2019-01-07 09:31:11.262273] I [MSGID: 109069] [dht-common.c:1410:dht_lookup_unlink_of_false_linkto_cbk] 0-$VOLUME-dht: lookup_unlink returned with op_ret -> 0 and op-errno -> 0 for $file5
[2019-01-07 09:31:11.266014] I [dht-rebalance.c:1570:dht_migrate_file] 0-$VOLUME-dht: $file5: attempting to move from $VOLUME-readdir-ahead-0 to $VOLUME-readdir-ahead-5
[2019-01-07 09:31:11.278120] I [dht-rebalance.c:1570:dht_migrate_file] 0-$VOLUME-dht: $file5: attempting to move from $VOLUME-readdir-ahead-0 to $VOLUME-readdir-ahead-5
[2019-01-07 09:31:11.732175] W [dht-rebalance.c:2159:dht_migrate_file] 0-$VOLUME-dht: $file5: failed to perform removexattr on $VOLUME-readdir-ahead-0 (No data available)
[2019-01-07 09:31:11.737319] W [MSGID: 109023] [dht-rebalance.c:2179:dht_migrate_file] 0-$VOLUME-dht: $file5: failed to do a stat on $VOLUME-readdir-ahead-0 [No such file or directory]
[2019-01-07 09:31:11.744382] I [MSGID: 109022] [dht-rebalance.c:2218:dht_migrate_file] 0-$VOLUME-dht: completed migration of $file5 from subvolume $VOLUME-readdir-ahead-0 to $VOLUME-readdir-ahead-5
[2019-01-07 09:31:11.744676] I [MSGID: 109022] [dht-rebalance.c:2218:dht_migrate_file] 0-$VOLUME-dht: completed migration of $file5 from subvolume $VOLUME-readdir-ahead-0 to $VOLUME-readdir-ahead-5

I've searched the brick logs for $file5 with broken permissions and found this on all bricks from (I think) the subvolume $VOLUME-readdir-ahead-5:

[2019-01-07 09:32:13.821545] I [MSGID: 113030] [posix.c:2171:posix_unlink] 0-$VOLUME-posix: open-fd-key-status: 0 for $file5
[2019-01-07 09:32:13.821609] I [MSGID: 113031] [posix.c:2084:posix_skip_non_linkto_unlink] 0-posix: linkto_xattr status: 0 for $file5

Also, we noticed that many directories got their modification time
updated. It was set to the rebalance date. Is that supposed to happen?

We had parallel-readdir enabled during the rebalance. We disabled it since we had empty directories that couldn't be deleted. I was able to delete those dirs after that.

Also, we have directories that lost their GFID on some bricks. Again.

What happened? Can we do something to fix this? And could that happen again?

We want to upgrade to 4.1 soon. Is it safe to do that or could it make things worse?

Kind regards

Gudrun Amedick
-------------- next part --------------
A non-text attachment was scrubbed...
Name: smime.p7s
Type: application/x-pkcs7-signature
Size: 6743 bytes
Desc: not available
URL:

From nbalacha at redhat.com Mon Jan 28 04:20:12 2019
From: nbalacha at redhat.com (Nithya Balachandran)
Date: Mon, 28 Jan 2019 09:50:12 +0530
Subject: [Gluster-users] Files losing permissions in GlusterFS 3.12
In-Reply-To: <1548429058.2018.12.camel@uni-luebeck.de>
References: <1548429058.2018.12.camel@uni-luebeck.de>
Message-ID:

On Fri, 25 Jan 2019 at 20:51, Gudrun Mareike Amedick <g.amedick at uni-luebeck.de> wrote:

> Hi all,
>
> we have a problem with a distributed dispersed volume (GlusterFS 3.12). We
> have files that lost their permissions or gained sticky bits. The files
> themselves seem to be okay.
>
> It looks like this:
>
> # ls -lah $file1
> ---------- 1 www-data www-data 45M Jan 12 07:01 $file1
>
> # ls -lah $file2
> -rw-rwS--T 1 $user $group 11K Jan 9 11:48 $file2
>
> # ls -lah $file3
> ---------T 1 $user $group 6.8M Jan 12 08:17 $file3
>
>

These are linkto files (internal dht files) and should not be visible on the mount point. Are they consistently visible like this or do they revert to the proper permissions after some time?

> This is not what the permissions are supposed to look like. They were 644 or
> 660 before. And they definitely had no sticky bits.
> The permissions on the bricks match what I see on client side.
> So I think the original permissions are lost without a chance to recover them, right?
>
>
> With some files with weird looking permissions (but not with all of them),
> I can do this:
> # ls -lah $path/$file4
> -rw-r--r-- 1 $user $group 6.0G Oct 11 09:34 $path/$file4
> ls -lah $path | grep $file4
> -rw-r-Sr-T 1 $user$group 6.0G Oct 11 09:34 $file4
> So, the permissions I see depend on how I'm querying them. The permissions
> on brick side agree with the latter result, stat sees the former. I'm not
> sure how that works.
>

The S and T bits indicate that a file is being migrated. The difference seems to be because of the way lookup versus readdirp handle this - this looks like a bug. Lookup will strip out the internal permissions set. I don't think readdirp does. This is happening because a rebalance is in progress.

> We know for at least a part of those files that they were okay on December
> 19th. We got the first reports of weird-looking permissions on January
> 12th. In between, there was a rebalance running (January 7th to January
> 11th). During that rebalance, a node was offline for a longer period of time
> due to hardware issues. The output of "gluster volume heal $VOLUME info"
> shows no files though.
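
For anyone wanting to sort a long `ls -l` listing by the bit patterns discussed above, here is a minimal Python sketch. It only inspects the mode string. The function names and labels are made up for illustration, and treating a bare `---------T` as a "possible linkto file" is an assumption about how DHT usually marks such files, not something confirmed in this thread. Checking the `trusted.glusterfs.dht.linkto` xattr on the brick would be the more reliable test.

```python
def special_bits(mode: str) -> dict:
    """Extract setuid/setgid/sticky flags from an ls-style mode string
    such as '-rw-rwS--T' (file type char followed by nine permission chars)."""
    if len(mode) < 10:
        raise ValueError(f"not an ls-style mode string: {mode!r}")
    return {
        "setuid": mode[3] in "sS",   # owner execute slot
        "setgid": mode[6] in "sS",   # group execute slot
        "sticky": mode[9] in "tT",   # other execute slot
    }


def classify(mode: str) -> str:
    """Heuristic label (hypothetical names) based on the S and T bits."""
    bits = special_bits(mode)
    if bits["setgid"] and bits["sticky"]:
        # Per the reply above: S and T together indicate a file being migrated.
        return "migration marker"
    if bits["sticky"] and mode[1:9] == "-" * 8:
        # Assumption: only the sticky bit set, no other permissions.
        return "possible linkto file"
    return "no DHT marker"


if __name__ == "__main__":
    # Mode strings taken from the listings quoted in this thread.
    for mode in ("----------", "-rw-rwS--T", "---------T", "-rw-r-Sr-T", "-rw-r--r--"):
        print(mode, "->", classify(mode))
```

Note that the first listing in the thread (`----------`) carries neither bit, so this heuristic will not flag it even though the file is clearly affected.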
> > For all files with broken permissions we found so far, the following lines > are in the rebalance log: > > [2019-01-07 09:31:11.004802] I [MSGID: 109045] > [dht-common.c:2456:dht_lookup_cbk] 0-$VOLUME-dht: linkfile not having link > subvol for $file5 > [2019-01-07 09:31:11.262273] I [MSGID: 109069] > [dht-common.c:1410:dht_lookup_unlink_of_false_linkto_cbk] 0-$VOLUME-dht: > lookup_unlink returned with > op_ret -> 0 and op-errno -> 0 for $file5 > [2019-01-07 09:31:11.266014] I [dht-rebalance.c:1570:dht_migrate_file] > 0-$VOLUME-dht: $file5: attempting to move from $VOLUME-readdir-ahead-0 to > $VOLUME-readdir-ahead-5 > [2019-01-07 09:31:11.278120] I [dht-rebalance.c:1570:dht_migrate_file] > 0-$VOLUME-dht: $file5: attempting to move from $VOLUME-readdir-ahead-0 to > $VOLUME-readdir-ahead-5 > [2019-01-07 09:31:11.732175] W [dht-rebalance.c:2159:dht_migrate_file] > 0-$VOLUME-dht: $file5: failed to perform removexattr on > $VOLUME-readdir-ahead-0 > (No data available) > [2019-01-07 09:31:11.737319] W [MSGID: 109023] > [dht-rebalance.c:2179:dht_migrate_file] 0-$VOLUME-dht: $file5: failed to do > a stat on $VOLUME-readdir- > ahead-0 [No such file or directory] > [2019-01-07 09:31:11.744382] I [MSGID: 109022] > [dht-rebalance.c:2218:dht_migrate_file] 0-$VOLUME-dht: completed migration > of $file5 from subvolume > $VOLUME-readdir-ahead-0 to $VOLUME-readdir-ahead-5 > [2019-01-07 09:31:11.744676] I [MSGID: 109022] > [dht-rebalance.c:2218:dht_migrate_file] 0-$VOLUME-dht: completed migration > of $file5 from subvolume > $VOLUME-readdir-ahead-0 to $VOLUME-readdir-ahead-5 > > > > I've searched the brick logs for $file5 with broken permissions and found > this on all bricks from (I think) the subvolume $VOLUME-readdir-ahead-5: > > [2019-01-07 09:32:13.821545] I [MSGID: 113030] [posix.c:2171:posix_unlink] > 0-$VOLUME-posix: open-fd-key-status: 0 for $file5 > [2019-01-07 09:32:13.821609] I [MSGID: 113031] > [posix.c:2084:posix_skip_non_linkto_unlink] 0-posix: linkto_xattr status: 
0 > for $file5 > > > > Also, we noticed that many directories got their modification time > updated. It was set to the rebalance date. Is that supposed to happen? > > > We had parallel-readdir enabled during the rebalance. We disabled it since > we had empty directories that couldn't be deleted. I was able to delete > those dirs after that. > Was this disabled during the rebalance? parallel-readdirp changes the volume graph for clients but not for the rebalance process causing it to fail to find the linkto subvols. > > Also, we have directories who lost their GFID on some bricks. Again. Is this the missing symlink problem that was reported earlier? Regards, Nithya > > > What happened? Can we do something to fix this? And could that happen > again? > > We want to upgrade to 4.1 soon. Is it safe to do that or could it make > things worse? > > Kind regards > > Gudrun Amedick_______________________________________________ > Gluster-users mailing list > Gluster-users at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-users -------------- next part -------------- An HTML attachment was scrubbed... URL: From shaik.salam at tcs.com Mon Jan 28 07:05:03 2019 From: shaik.salam at tcs.com (Shaik Salam) Date: Mon, 28 Jan 2019 12:35:03 +0530 Subject: [Gluster-users] Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: server busy In-Reply-To: References: Message-ID: Hi Raghavendra, We are also facing following issue which is mentioned in case on openshift origin while we are creating pvc for pods. (Please provide workaround to move further (pod restart doesn't workout) https://bugzilla.redhat.com/show_bug.cgi?id=1630117 https://bugzilla.redhat.com/show_bug.cgi?id=1636912 Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: create volume err: error creating volume Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: create volume err: error creating volume Server busy. Retry operation later.. 
at a time to create only one volume when in-flight operations are zero. Once volume requested it reaches to 8. Now single volume not able to create and we are till now mostly 10 volumes are created. Please find heketidb dump and log [negroni] Completed 200 OK in 98.699?s [negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6 [negroni] Completed 200 OK in 106.654?s [negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6 [negroni] Completed 200 OK in 185.406?s [negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6 [negroni] Completed 200 OK in 102.664?s [negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6 [negroni] Completed 200 OK in 192.658?s [negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6 [negroni] Completed 200 OK in 198.611?s [negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6 [negroni] Completed 200 OK in 124.254?s [negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6 [negroni] Completed 200 OK in 101.491?s [negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6 [negroni] Completed 200 OK in 116.997?s [negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6 [negroni] Completed 200 OK in 100.171?s [negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6 [negroni] Completed 200 OK in 109.238?s [negroni] Started POST /volumes [heketi] WARNING 2019/01/28 06:50:57 operations in-flight (8) exceeds limit (8) [negroni] Completed 429 Too Many Requests in 191.118?s [negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6 [negroni] Completed 200 OK in 188.791?s [negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6 [negroni] Completed 200 OK in 94.436?s [negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6 [negroni] Completed 200 OK in 110.893?s [negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6 [negroni] Completed 200 OK in 112.132?s [negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6 [negroni] Completed 200 OK in 96.15?s [negroni] Started GET 
/queue/2604dd5965445711a4b6bc28592cb0f6 [negroni] Completed 200 OK in 112.682?s [negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6 [negroni] Completed 200 OK in 140.543?s [negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6 [negroni] Completed 200 OK in 182.066?s [negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6 [negroni] Completed 200 OK in 151.572?s BR Salam From: Shaik Salam/HYD/TCS To: "John Mulligan" Cc: "gluster-users at gluster.org List" , "Michael Adam" , Madhu Rajanna Date: 01/25/2019 04:03 PM Subject: Re: [Gluster-users] Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: server busy Hi John, Could you please have look my issue If you have time (atleast provide workaround). Thanks in advance. BR Salam From: "Shaik Salam" To: Cc: "gluster-users at gluster.org List" , "Michael Adam" Date: 01/25/2019 02:55 PM Subject: Re: [Gluster-users] Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: server busy Sent by: gluster-users-bounces at gluster.org "External email. Open with Caution" Hi John, Please find db dump and heketi log. Here kernel version. Please let me know If you need more information. 
[root at app2 ~]# uname -a Linux app2.matrix.nokia.com 3.10.0-514.el7.x86_64 #1 SMP Tue Nov 22 16:42:41 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux Hardware: HP GEN8 OS; NAME="CentOS Linux" VERSION="7 (Core)" ID="centos" ID_LIKE="rhel fedora" VERSION_ID="7" PRETTY_NAME="CentOS Linux 7 (Core)" ANSI_COLOR="0;31" CPE_NAME="cpe:/o:centos:centos:7" HOME_URL="https://www.centos.org/" BUG_REPORT_URL="https://bugs.centos.org/" CENTOS_MANTISBT_PROJECT="CentOS-7" CENTOS_MANTISBT_PROJECT_VERSION="7" REDHAT_SUPPORT_PRODUCT="centos" REDHAT_SUPPORT_PRODUCT_VERSION="7" From: "Madhu Rajanna" To: "Shaik Salam" , "John Mulligan" Cc: "gluster-users at gluster.org List" , "Michael Adam" Date: 01/24/2019 10:52 PM Subject: Re: Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: server busy "External email. Open with Caution" Adding John who is having more idea about how to debug this one. @Shaik Salam can you some more info on the hardware on which you are running heketi (kernel details) On Thu, Jan 24, 2019 at 7:42 PM Shaik Salam wrote: Hi Madhu, Sorry to disturb could you please provide atleast work around (to clear requests which stuck) to move further. We are also not able to find root cause from glusterd logs. Please find attachment. BR Salam From: Shaik Salam/HYD/TCS To: "Madhu Rajanna" Cc: "gluster-users at gluster.org List" , "Michael Adam" Date: 01/24/2019 04:12 PM Subject: Re: Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: server busy Hi Madhu, Please let me know If any other information required. BR Salam From: Shaik Salam/HYD/TCS To: "Madhu Rajanna" Cc: "gluster-users at gluster.org List" , "Michael Adam" Date: 01/24/2019 03:23 PM Subject: Re: Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: server busy Hi Madhu, This is complete one after restart of heketi pod and process log. 
BR Salam [attachment "heketi-pod-complete.log" deleted by Shaik Salam/HYD/TCS] [attachment "ps-aux.txt" deleted by Shaik Salam/HYD/TCS] From: "Madhu Rajanna" To: "Shaik Salam" Cc: "gluster-users at gluster.org List" , "Michael Adam" Date: 01/24/2019 01:55 PM Subject: Re: Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: server busy "External email. Open with Caution" the logs you provided is not complete, not able to figure out which command is struck, can you reattach the complete output of `ps aux` and also attach complete heketi logs. On Thu, Jan 24, 2019 at 1:41 PM Shaik Salam wrote: Hi Madhu, Please find requested info. BR Salam From: Madhu Rajanna To: Shaik Salam Cc: "gluster-users at gluster.org List" , Michael Adam Date: 01/24/2019 01:33 PM Subject: Re: Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: server busy "External email. Open with Caution" the heketi logs you have attached is not complete i believe, can you povide the complete heketi logs and also an we get the output of "ps aux" from the gluster pods ? I want to see if any lvm commands or gluster commands are "stuck". On Thu, Jan 24, 2019 at 1:16 PM Shaik Salam wrote: Hi Madhu. I tried lot of times restarted heketi pod but not resolved. sh-4.4# heketi-cli server operations info Operation Counts: Total: 0 In-Flight: 0 New: 0 Stale: 0 Now you can see all operations are zero. Now I try to create single volume below is observation in-flight reaching slowly to 8. 
sh-4.4# heketi-cli server operations infoCLI_SERVER=http://localhost:8080 ; export HEKETI_CLI_USE
Operation Counts:
  Total: 0
  In-Flight: 6
  New: 0
  Stale: 0
sh-4.4# heketi-cli server operations info
Operation Counts:
  Total: 0
  In-Flight: 7
  New: 0
  Stale: 0
sh-4.4# heketi-cli server operations info
Operation Counts:
  Total: 0
  In-Flight: 7
  New: 0
  Stale: 0
sh-4.4# heketi-cli server operations info
Operation Counts:
  Total: 0
  In-Flight: 7
  New: 0
  Stale: 0
sh-4.4# heketi-cli server operations info
Operation Counts:
  Total: 0
  In-Flight: 7
  New: 0
  Stale: 0

[negroni] Completed 200 OK in 186.286µs
[negroni] Started POST /volumes
[negroni] Started GET /operations
[negroni] Completed 200 OK in 166.294µs
[negroni] Started GET /operations
[negroni] Completed 200 OK in 186.411µs
[negroni] Started GET /operations
[negroni] Completed 200 OK in 179.796µs
[negroni] Started POST /volumes
[negroni] Started POST /volumes
[negroni] Started POST /volumes
[negroni] Started POST /volumes
[negroni] Started GET /operations
[negroni] Completed 200 OK in 131.108µs
[negroni] Started POST /volumes
[negroni] Started GET /operations
[negroni] Completed 200 OK in 111.392µs
[negroni] Started GET /operations
[negroni] Completed 200 OK in 265.023µs
[negroni] Started GET /operations
[negroni] Completed 200 OK in 179.364µs
[negroni] Started GET /operations
[negroni] Completed 200 OK in 295.058µs
[negroni] Started GET /operations
[negroni] Completed 200 OK in 146.857µs
[negroni] Started POST /volumes
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/24 07:43:36 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 403.166µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/24 07:43:51 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 193.554µs

But the volume for the pod is still not being created.
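
To watch the in-flight counter climb without reading the output by eye, a small Python sketch along these lines could help. It only parses text in the two formats pasted above (the `heketi-cli server operations info` counters and the heketi throttle warning). The function names are hypothetical, and how the text is captured (for example by piping the command output or the pod log into a file) is left as an assumption.

```python
import re

# Matches the counters printed by `heketi-cli server operations info`.
COUNT_RE = re.compile(r"(Total|In-Flight|New|Stale):\s*(\d+)")
# Matches heketi's throttle warning as it appears in the log above.
LIMIT_RE = re.compile(r"operations in-flight \((\d+)\) exceeds limit \((\d+)\)")


def parse_operation_counts(text: str) -> dict:
    """Return the operation counters as a dict of name -> int."""
    return {name: int(value) for name, value in COUNT_RE.findall(text)}


def throttled_requests(log_text: str) -> list:
    """Return one (in_flight, limit) tuple per throttle warning in the log."""
    return [(int(cur), int(limit)) for cur, limit in LIMIT_RE.findall(log_text)]


if __name__ == "__main__":
    info = "Operation Counts: Total: 0 In-Flight: 7 New: 0 Stale: 0"
    log = (
        "[negroni] Started POST /volumes\n"
        "[heketi] WARNING 2019/01/24 07:43:36 operations in-flight (8) exceeds limit (8)\n"
        "[negroni] Completed 429 Too Many Requests in 403.166µs\n"
    )
    print(parse_operation_counts(info))
    print(throttled_requests(log))
```

Counting the tuples returned by `throttled_requests` gives the number of rejected `POST /volumes` calls in a log excerpt, which may be easier to compare across restarts than the raw log.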
1:15:36 PM Warning Provisioning failed Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: create volume err: error creating volume Server busy. Retry operation later.. 9 times in the last 2 minutes 1:13:21 PM Warning Provisioning failed Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: create volume err: error creating volume . 8 times in the last From: "Madhu Rajanna" To: "Shaik Salam" Cc: "gluster-users at gluster.org List" , "Michael Adam" Date: 01/24/2019 12:51 PM Subject: Re: Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: server busy "External email. Open with Caution" HI Shaik, can you provide me the outpout of $heketi-cli server operations info from heketi pod as a workround you can try restarting the heketi pod. This will cause the current operations to go stale, but other pending pvcs may go to Bound state Regards, Madhu R On Thu, Jan 24, 2019 at 12:36 PM Shaik Salam wrote: H Madhu, Could you please have look my issue If you have time (atleast workaround). I am unable to send mail to "John Mulligan" " who is currently handling issue https://bugzilla.redhat.com/show_bug.cgi?id=1636912 BR Salam From: Shaik Salam/HYD/TCS To: "John Mulligan" , "Michael Adam" < madam at redhat.com>, "Madhu Rajanna" Cc: "gluster-users at gluster.org List" Date: 01/24/2019 12:21 PM Subject: Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: server busy Hi All, We are facing also following issue on openshift origin while we are creating pvc for pods. (atlease provide workaround to move further) Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: create volume err: error creating volume Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: create volume err: error creating volume Server busy. Retry operation later.. 
Please find heketidb dump and log [negroni] Completed 429 Too Many Requests in 250.763?s [negroni] Started POST /volumes [heketi] WARNING 2019/01/23 12:07:49 operations in-flight (8) exceeds limit (8) [negroni] Completed 429 Too Many Requests in 169.08?s [negroni] Started DELETE /volumes/520bc5f4e1bfd029855a72f9ca7ebf6c [negroni] Completed 404 Not Found in 148.125?s [negroni] Started POST /volumes [heketi] WARNING 2019/01/23 12:08:04 operations in-flight (8) exceeds limit (8) [negroni] Completed 429 Too Many Requests in 496.624?s [negroni] Started POST /volumes [heketi] WARNING 2019/01/23 12:08:04 operations in-flight (8) exceeds limit (8) [negroni] Completed 429 Too Many Requests in 101.673?s [negroni] Started POST /volumes [heketi] WARNING 2019/01/23 12:08:19 operations in-flight (8) exceeds limit (8) [negroni] Completed 429 Too Many Requests in 209.681?s [negroni] Started POST /volumes [heketi] WARNING 2019/01/23 12:08:19 operations in-flight (8) exceeds limit (8) [negroni] Completed 429 Too Many Requests in 103.595?s [negroni] Started POST /volumes [heketi] WARNING 2019/01/23 12:08:34 operations in-flight (8) exceeds limit (8) [negroni] Completed 429 Too Many Requests in 297.594?s [negroni] Started POST /volumes [heketi] WARNING 2019/01/23 12:08:34 operations in-flight (8) exceeds limit (8) [negroni] Completed 429 Too Many Requests in 96.75?s [negroni] Started POST /volumes [negroni] Started POST /volumes [heketi] WARNING 2019/01/23 12:08:49 operations in-flight (8) exceeds limit (8) [negroni] Completed 429 Too Many Requests in 477.007?s [heketi] WARNING 2019/01/23 12:08:49 operations in-flight (8) exceeds limit (8) [negroni] Completed 429 Too Many Requests in 165.38?s [negroni] Started POST /volumes [heketi] WARNING 2019/01/23 12:09:04 operations in-flight (8) exceeds limit (8) [negroni] Completed 429 Too Many Requests in 488.253?s [negroni] Started POST /volumes [heketi] WARNING 2019/01/23 12:09:04 operations in-flight (8) exceeds limit (8) [negroni] 
Completed 429 Too Many Requests in 171.836?s [negroni] Started POST /volumes [heketi] WARNING 2019/01/23 12:09:19 operations in-flight (8) exceeds limit (8) [negroni] Completed 429 Too Many Requests in 208.59?s [negroni] Started POST /volumes [heketi] WARNING 2019/01/23 12:09:19 operations in-flight (8) exceeds limit (8) [negroni] Completed 429 Too Many Requests in 125.141?s [negroni] Started DELETE /volumes/99e87ecd0a816ac34ae5a04eabc1d606 [negroni] Completed 404 Not Found in 138.687?s [negroni] Started POST /volumes BR Salam =====-----=====-----===== Notice: The information contained in this e-mail message and/or attachments to it may contain confidential or privileged information. If you are not the intended recipient, any dissemination, use, review, distribution, printing or copying of the information contained in this e-mail message and/or attachments to it are strictly prohibited. If you have received this communication in error, please notify us by reply e-mail or telephone and immediately and permanently delete the message and any attachments. Thank you -- Madhu Rajanna Software Engineer Red Hat Bangalore, India mrajanna at redhat.com M: +91-9741133155 -- Madhu Rajanna Software Engineer Red Hat Bangalore, India mrajanna at redhat.com M: +91-9741133155 -- Madhu Rajanna Software Engineer Red Hat Bangalore, India mrajanna at redhat.com M: +91-9741133155 -- Madhu Rajanna Software Engineer Red Hat Bangalore, India mrajanna at redhat.com M: +91-9741133155 [attachment "heketi-complete.log.txt" deleted by Shaik Salam/HYD/TCS] [attachment "heketi-gluster.db.txt" deleted by Shaik Salam/HYD/TCS] _______________________________________________ Gluster-users mailing list Gluster-users at gluster.org https://lists.gluster.org/mailman/listinfo/gluster-users -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: heketi-complete.log Type: application/octet-stream Size: 445422 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: heketi-dump.db Type: application/octet-stream Size: 67151 bytes Desc: not available URL: From shaik.salam at tcs.com Mon Jan 28 07:10:54 2019 From: shaik.salam at tcs.com (Shaik Salam) Date: Mon, 28 Jan 2019 12:40:54 +0530 Subject: [Gluster-users] Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: server busy In-Reply-To: References: Message-ID: Hi Raghavendra, We are also facing following issue which is mentioned in case on openshift origin while we are creating pvc for pods. (Please provide workaround to move further (pod restart doesn't workout) https://bugzilla.redhat.com/show_bug.cgi?id=1630117 https://bugzilla.redhat.com/show_bug.cgi?id=1636912 Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: create volume err: error creating volume Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: create volume err: error creating volume Server busy. Retry operation later.. at a time to create only one volume when in-flight operations are zero. Once volume requested it reaches to 8. Now single volume not able to create and we are till now mostly 10 volumes are created. 
Please find the heketi db dump and log.

[negroni] Completed 200 OK in 98.699µs
[negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6
[negroni] Completed 200 OK in 106.654µs
[negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6
[negroni] Completed 200 OK in 185.406µs
[negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6
[negroni] Completed 200 OK in 102.664µs
[negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6
[negroni] Completed 200 OK in 192.658µs
[negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6
[negroni] Completed 200 OK in 198.611µs
[negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6
[negroni] Completed 200 OK in 124.254µs
[negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6
[negroni] Completed 200 OK in 101.491µs
[negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6
[negroni] Completed 200 OK in 116.997µs
[negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6
[negroni] Completed 200 OK in 100.171µs
[negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6
[negroni] Completed 200 OK in 109.238µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/28 06:50:57 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 191.118µs
[negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6
[negroni] Completed 200 OK in 188.791µs
[negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6
[negroni] Completed 200 OK in 94.436µs
[negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6
[negroni] Completed 200 OK in 110.893µs
[negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6
[negroni] Completed 200 OK in 112.132µs
[negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6
[negroni] Completed 200 OK in 96.15µs
[negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6
[negroni] Completed 200 OK in 112.682µs
[negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6
[negroni] Completed 200 OK in 140.543µs
[negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6
[negroni] Completed 200 OK in 182.066µs
[negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6
[negroni] Completed 200 OK in 151.572µs

BR
Salam

From: Shaik Salam/HYD/TCS
To: "John Mulligan"
Cc: "gluster-users at gluster.org List", "Michael Adam", Madhu Rajanna
Date: 01/25/2019 04:03 PM
Subject: Re: [Gluster-users] Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: server busy

Hi John,

Could you please have a look at my issue if you have time (or at least provide a workaround)?
Thanks in advance.

BR
Salam

From: "Shaik Salam"
To: 
Cc: "gluster-users at gluster.org List", "Michael Adam"
Date: 01/25/2019 02:55 PM
Subject: Re: [Gluster-users] Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: server busy
Sent by: gluster-users-bounces at gluster.org

"External email. Open with Caution"

Hi John,

Please find the db dump and heketi log. Here is the kernel version.
Please let me know if you need more information.

[root at app2 ~]# uname -a
Linux app2.matrix.nokia.com 3.10.0-514.el7.x86_64 #1 SMP Tue Nov 22 16:42:41 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

Hardware: HP GEN8

OS:
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"
CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"

From: "Madhu Rajanna"
To: "Shaik Salam", "John Mulligan"
Cc: "gluster-users at gluster.org List", "Michael Adam"
Date: 01/24/2019 10:52 PM
Subject: Re: Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: server busy

Adding John, who has a better idea of how to debug this one.
@Shaik Salam, can you share some more info on the hardware on which you are running heketi (kernel details)?

On Thu, Jan 24, 2019 at 7:42 PM Shaik Salam wrote:

Hi Madhu,

Sorry to disturb you. Could you please provide at least a workaround (to clear the requests which are stuck) so that we can move further?
We are also not able to find the root cause from the glusterd logs. Please find the attachment.

BR
Salam

From: Shaik Salam/HYD/TCS
To: "Madhu Rajanna"
Cc: "gluster-users at gluster.org List", "Michael Adam"
Date: 01/24/2019 04:12 PM
Subject: Re: Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: server busy

Hi Madhu,

Please let me know if any other information is required.

BR
Salam

From: Shaik Salam/HYD/TCS
To: "Madhu Rajanna"
Cc: "gluster-users at gluster.org List", "Michael Adam"
Date: 01/24/2019 03:23 PM
Subject: Re: Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: server busy

Hi Madhu,

This is the complete log after a restart of the heketi pod, together with the process list.

BR
Salam

[attachment "heketi-pod-complete.log" deleted by Shaik Salam/HYD/TCS]
[attachment "ps-aux.txt" deleted by Shaik Salam/HYD/TCS]

From: "Madhu Rajanna"
To: "Shaik Salam"
Cc: "gluster-users at gluster.org List", "Michael Adam"
Date: 01/24/2019 01:55 PM
Subject: Re: Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: server busy

The logs you provided are not complete, so I am not able to figure out which command is stuck. Can you reattach the complete output of `ps aux` and also attach the complete heketi logs?

On Thu, Jan 24, 2019 at 1:41 PM Shaik Salam wrote:

Hi Madhu,

Please find the requested info.

BR
Salam

From: Madhu Rajanna
To: Shaik Salam
Cc: "gluster-users at gluster.org List", Michael Adam
Date: 01/24/2019 01:33 PM
Subject: Re: Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: server busy

The heketi logs you have attached are not complete, I believe. Can you provide the complete heketi logs, and can we also get the output of "ps aux" from the gluster pods? I want to see if any lvm commands or gluster commands are "stuck".

On Thu, Jan 24, 2019 at 1:16 PM Shaik Salam wrote:

Hi Madhu.

I restarted the heketi pod many times, but the issue is not resolved.

sh-4.4# heketi-cli server operations info
Operation Counts:
  Total: 0
  In-Flight: 0
  New: 0
  Stale: 0

As you can see, all operation counts are zero. When I then try to create a single volume, the in-flight count slowly climbs to 8, as observed below.

sh-4.4# heketi-cli server operations infoCLI_SERVER=http://localhost:8080 ; export HEKETI_CLI_USE
Operation Counts:
  Total: 0
  In-Flight: 6
  New: 0
  Stale: 0
sh-4.4# heketi-cli server operations info
Operation Counts:
  Total: 0
  In-Flight: 7
  New: 0
  Stale: 0
sh-4.4# heketi-cli server operations info
Operation Counts:
  Total: 0
  In-Flight: 7
  New: 0
  Stale: 0
sh-4.4# heketi-cli server operations info
Operation Counts:
  Total: 0
  In-Flight: 7
  New: 0
  Stale: 0
sh-4.4# heketi-cli server operations info
Operation Counts:
  Total: 0
  In-Flight: 7
  New: 0
  Stale: 0

[negroni] Completed 200 OK in 186.286µs
[negroni] Started POST /volumes
[negroni] Started GET /operations
[negroni] Completed 200 OK in 166.294µs
[negroni] Started GET /operations
[negroni] Completed 200 OK in 186.411µs
[negroni] Started GET /operations
[negroni] Completed 200 OK in 179.796µs
[negroni] Started POST /volumes
[negroni] Started POST /volumes
[negroni] Started POST /volumes
[negroni] Started POST /volumes
[negroni] Started GET /operations
[negroni] Completed 200 OK in 131.108µs
[negroni] Started POST /volumes
[negroni] Started GET /operations
[negroni] Completed 200 OK in 111.392µs
[negroni] Started GET /operations
[negroni] Completed 200 OK in 265.023µs
[negroni] Started GET /operations
[negroni] Completed 200 OK in 179.364µs
[negroni] Started GET /operations
[negroni] Completed 200 OK in 295.058µs
[negroni] Started GET /operations
[negroni] Completed 200 OK in 146.857µs
[negroni] Started POST /volumes
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/24 07:43:36 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 403.166µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/24 07:43:51 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 193.554µs

But the volume for the pod is not being created.

1:15:36 PM Warning Provisioning failed Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: create volume err: error creating volume Server busy. Retry operation later.. 9 times in the last 2 minutes
1:13:21 PM Warning Provisioning failed Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: create volume err: error creating volume . 8 times in the last

From: "Madhu Rajanna"
To: "Shaik Salam"
Cc: "gluster-users at gluster.org List", "Michael Adam"
Date: 01/24/2019 12:51 PM
Subject: Re: Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: server busy

Hi Shaik,

Can you provide me the output of

$heketi-cli server operations info

from the heketi pod?

As a workaround you can try restarting the heketi pod. This will cause the current operations to go stale, but other pending PVCs may go to Bound state.

Regards,
Madhu R

On Thu, Jan 24, 2019 at 12:36 PM Shaik Salam wrote:

Hi Madhu,

Could you please have a look at my issue if you have time (at least a workaround)?
I am unable to send mail to "John Mulligan", who is currently handling issue https://bugzilla.redhat.com/show_bug.cgi?id=1636912

BR
Salam

From: Shaik Salam/HYD/TCS
To: "John Mulligan", "Michael Adam" <madam at redhat.com>, "Madhu Rajanna"
Cc: "gluster-users at gluster.org List"
Date: 01/24/2019 12:21 PM
Subject: Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: server busy

Hi All,

We are also facing the following issue on OpenShift Origin while creating PVCs for pods (please at least provide a workaround so that we can move further).

Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: create volume err: error creating volume
Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: create volume err: error creating volume Server busy. Retry operation later..

Please find the heketi db dump and log.

[negroni] Completed 429 Too Many Requests in 250.763µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:07:49 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 169.08µs
[negroni] Started DELETE /volumes/520bc5f4e1bfd029855a72f9ca7ebf6c
[negroni] Completed 404 Not Found in 148.125µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:08:04 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 496.624µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:08:04 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 101.673µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:08:19 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 209.681µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:08:19 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 103.595µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:08:34 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 297.594µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:08:34 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 96.75µs
[negroni] Started POST /volumes
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:08:49 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 477.007µs
[heketi] WARNING 2019/01/23 12:08:49 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 165.38µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:09:04 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 488.253µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:09:04 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 171.836µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:09:19 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 208.59µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:09:19 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 125.141µs
[negroni] Started DELETE /volumes/99e87ecd0a816ac34ae5a04eabc1d606
[negroni] Completed 404 Not Found in 138.687µs
[negroni] Started POST /volumes

BR
Salam

=====-----=====-----=====
Notice: The information contained in this e-mail message and/or attachments to it may contain confidential or privileged information. If you are not the intended recipient, any dissemination, use, review, distribution, printing or copying of the information contained in this e-mail message and/or attachments to it are strictly prohibited. If you have received this communication in error, please notify us by reply e-mail or telephone and immediately and permanently delete the message and any attachments.
Thank you

--
Madhu Rajanna
Software Engineer
Red Hat Bangalore, India
mrajanna at redhat.com
M: +91-9741133155

[attachment "heketi-complete.log.txt" deleted by Shaik Salam/HYD/TCS]
[attachment "heketi-gluster.db.txt" deleted by Shaik Salam/HYD/TCS]
_______________________________________________
Gluster-users mailing list
Gluster-users at gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: heketi-complete.log
Type: application/octet-stream
Size: 445422 bytes
Desc: not available
URL: 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: heketi-dump.db
Type: application/octet-stream
Size: 67151 bytes
Desc: not available
URL: 

From shaik.salam at tcs.com Mon Jan 28 08:27:25 2019
From: shaik.salam at tcs.com (Shaik Salam)
Date: Mon, 28 Jan 2019 13:57:25 +0530
Subject: [Gluster-users] Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: server busy
In-Reply-To: 
References: 
Message-ID: 

Hi Raghavendra,

Could you please have look my issue If you have time.
Thanks in advance.

BR
Salam

[attachment "heketi-complete.log" deleted by Shaik Salam/HYD/TCS]
[attachment "heketi-dump.db" deleted by Shaik Salam/HYD/TCS]
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From mabi at protonmail.ch Mon Jan 28 08:14:26 2019
From: mabi at protonmail.ch (mabi)
Date: Mon, 28 Jan 2019 08:14:26 +0000
Subject: [Gluster-users] Max length for filename
Message-ID: 

Hello,

I saw this warning today in my fuse mount client log file:

[2019-01-28 06:01:25.091232] W [fuse-bridge.c:565:fuse_entry_cbk] 0-glusterfs-fuse: 530594537: LOOKUP() /data/somedir0/files/-somdir1/dir2/dir3/some super long filename?.mp3.TransferId1924513788.part => -1 (File name too long)

and was actually wondering on GlusterFS what is the maximum length for a filename?

I am using GlusterFS 4.1.6.

Regards,
Mabi

From cobanserkan at gmail.com Mon Jan 28 08:55:17 2019
From: cobanserkan at gmail.com (Serkan Çoban)
Date: Mon, 28 Jan 2019 11:55:17 +0300
Subject: [Gluster-users] Max length for filename
In-Reply-To: 
References: 
Message-ID: 

Filename max is 255 bytes, path name max is 4096 bytes.
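Because both limits are expressed in bytes rather than characters, a name containing non-ASCII characters can hit the limit with far fewer than 255 characters. A minimal sketch of a pre-flight check follows; the helper names are hypothetical, the two limits are simply the figures quoted in the reply above, and UTF-8 is assumed as the on-disk encoding.

```python
NAME_MAX_BYTES = 255   # per-component limit, as quoted in the reply above
PATH_MAX_BYTES = 4096  # whole-path limit, as quoted in the reply above


def too_long_components(path, encoding="utf-8"):
    """Return the path components whose encoded size exceeds NAME_MAX_BYTES."""
    return [
        part
        for part in path.split("/")
        if len(part.encode(encoding)) > NAME_MAX_BYTES
    ]


def path_too_long(path, encoding="utf-8"):
    """Return True if the whole encoded path exceeds PATH_MAX_BYTES."""
    return len(path.encode(encoding)) > PATH_MAX_BYTES


if __name__ == "__main__":
    # 132 characters, but 260 bytes in UTF-8, so this name is rejected.
    name = "é" * 128 + ".mp3"
    print(len(name), len(name.encode("utf-8")), too_long_components("/data/" + name))
```

A temporary suffix appended by an upload client, such as the `.TransferId1924513788.part` suffix in the log line above, counts toward the same per-component limit.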
On Mon, Jan 28, 2019 at 11:33 AM mabi wrote:
>
> Hello,
>
> I saw this warning today in my fuse mount client log file:
>
> [2019-01-28 06:01:25.091232] W [fuse-bridge.c:565:fuse_entry_cbk] 0-glusterfs-fuse: 530594537: LOOKUP() /data/somedir0/files/-somdir1/dir2/dir3/some super long filename?.mp3.TransferId1924513788.part => -1 (File name too long)
>
> and was actually wondering on GlusterFS what is the maximum length for a filename?
>
> I am using GlusterFS 4.1.6.
>
> Regards,
> Mabi
>
> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-users

From f.ruehlemann at uni-luebeck.de Mon Jan 28 09:23:41 2019
From: f.ruehlemann at uni-luebeck.de (Frank Ruehlemann)
Date: Mon, 28 Jan 2019 10:23:41 +0100
Subject: [Gluster-users] Files losing permissions in GlusterFS 3.12
In-Reply-To: 
References: <1548429058.2018.12.camel@uni-luebeck.de>
Message-ID: <1548667421.10294.221.camel@uni-luebeck.de>

On Monday, 28.01.2019, at 09:50 +0530, Nithya Balachandran wrote:
> On Fri, 25 Jan 2019 at 20:51, Gudrun Mareike Amedick <g.amedick at uni-luebeck.de> wrote:
>
> > Hi all,
> >
> > we have a problem with a distributed dispersed volume (GlusterFS 3.12). We
> > have files that lost their permissions or gained sticky bits. The files
> > themselves seem to be okay.
> >
> > It looks like this:
> >
> > # ls -lah $file1
> > ---------- 1 www-data www-data 45M Jan 12 07:01 $file1
> >
> > # ls -lah $file2
> > -rw-rwS--T 1 $user $group 11K Jan 9 11:48 $file2
> >
> > # ls -lah $file3
> > ---------T 1 $user $group 6.8M Jan 12 08:17 $file3
> >
>
> These are linkto files (internal dht files) and should not be visible on
> the mount point. Are they consistently visible like this or do they revert
> to the proper permissions after some time?

They haven't healed yet, even after more than 4 weeks.
Therefore we decided to recommend that our users fix their files by setting the correct permissions again, which worked without problems. But for analysis purposes we still have some broken files that nobody has touched yet.

We know these linkto files, but they were never visible to clients. We ran these ls commands on a client, not on a brick.

> > This is not what the permissions are supposed to look like. They were 644 or
> > 660 before. And they definitely had no sticky bits.
> > The permissions on the bricks match what I see on client side. So I think
> > the original permissions are lost without a chance to recover them, right?
> >
> >
> > With some files with weird looking permissions (but not with all of them),
> > I can do this:
> > # ls -lah $path/$file4
> > -rw-r--r-- 1 $user $group 6.0G Oct 11 09:34 $path/$file4
> > ls -lah $path | grep $file4
> > -rw-r-Sr-T 1 $user $group 6.0G Oct 11 09:34 $file4
> >
> > So, the permissions I see depend on how I'm querying them. The permissions
> > on brick side agree with the latter result, stat sees the former. I'm not
> > sure how that works.
> >
> The S and T bits indicate that a file is being migrated. The difference
> seems to be because of the way lookup versus readdirp handle this - this
> looks like a bug. Lookup will strip out the internal permissions set. I
> don't think readdirp does. This is happening because a rebalance is in
> progress.

There is no active rebalance. At least none is visible in "gluster volume rebalance $VOLUME status".

And the last line in the rebalance log file of this volume is:
"[2019-01-11 02:14:50.101944] W ? received signum (15), shutting down"

> > We know for at least a part of those files that they were okay at December
> > 19th. We got the first reports of weird-looking permissions at January
> > 12th. Between that, there was a rebalance running (January 7th to January
> > 11th). During that rebalance, a node was offline for a longer period of time
> > due to hardware issues.
The output of "gluster volume heal $VOLUME info" > > shows no files though. > > > > For all files with broken permissions we found so far, the following lines > > are in the rebalance log: > > > > [2019-01-07 09:31:11.004802] I [MSGID: 109045] > > [dht-common.c:2456:dht_lookup_cbk] 0-$VOLUME-dht: linkfile not having link > > subvol for $file5 > > [2019-01-07 09:31:11.262273] I [MSGID: 109069] > > [dht-common.c:1410:dht_lookup_unlink_of_false_linkto_cbk] 0-$VOLUME-dht: > > lookup_unlink returned with > > op_ret -> 0 and op-errno -> 0 for $file5 > > [2019-01-07 09:31:11.266014] I [dht-rebalance.c:1570:dht_migrate_file] > > 0-$VOLUME-dht: $file5: attempting to move from $VOLUME-readdir-ahead-0 to > > $VOLUME-readdir-ahead-5 > > [2019-01-07 09:31:11.278120] I [dht-rebalance.c:1570:dht_migrate_file] > > 0-$VOLUME-dht: $file5: attempting to move from $VOLUME-readdir-ahead-0 to > > $VOLUME-readdir-ahead-5 > > [2019-01-07 09:31:11.732175] W [dht-rebalance.c:2159:dht_migrate_file] > > 0-$VOLUME-dht: $file5: failed to perform removexattr on > > $VOLUME-readdir-ahead-0 > > (No data available) > > [2019-01-07 09:31:11.737319] W [MSGID: 109023] > > [dht-rebalance.c:2179:dht_migrate_file] 0-$VOLUME-dht: $file5: failed to do > > a stat on $VOLUME-readdir- > > ahead-0 [No such file or directory] > > [2019-01-07 09:31:11.744382] I [MSGID: 109022] > > [dht-rebalance.c:2218:dht_migrate_file] 0-$VOLUME-dht: completed migration > > of $file5 from subvolume > > $VOLUME-readdir-ahead-0 to $VOLUME-readdir-ahead-5 > > [2019-01-07 09:31:11.744676] I [MSGID: 109022] > > [dht-rebalance.c:2218:dht_migrate_file] 0-$VOLUME-dht: completed migration > > of $file5 from subvolume > > $VOLUME-readdir-ahead-0 to $VOLUME-readdir-ahead-5 > > > > > > > > I've searched the brick logs for $file5 with broken permissions and found > > this on all bricks from (I think) the subvolume $VOLUME-readdir-ahead-5: > > > > [2019-01-07 09:32:13.821545] I [MSGID: 113030] [posix.c:2171:posix_unlink] > > 0-$VOLUME-posix: 
open-fd-key-status: 0 for $file5 > > [2019-01-07 09:32:13.821609] I [MSGID: 113031] > > [posix.c:2084:posix_skip_non_linkto_unlink] 0-posix: linkto_xattr status: 0 > > for $file5 > > > > > > > > Also, we noticed that many directories got their modification time > > updated. It was set to the rebalance date. Is that supposed to happen? > > > > > > We had parallel-readdir enabled during the rebalance. We disabled it since > > we had empty directories that couldn't be deleted. I was able to delete > > those dirs after that. > > > > Was this disabled during the rebalance? parallel-readdirp changes the > volume graph for clients but not for the rebalance process causing it to > fail to find the linkto subvols. Yes, parallel-readdirp was enabled during the rebalance. But we disabled it after some files were invisible on the client side again. > > > > Also, we have directories that lost their GFID on some bricks. Again. > > > Is this the missing symlink problem that was reported earlier? > > Regards, > Nithya > > > > > > > > What happened? Can we do something to fix this? And could that happen > > again? > > > > We want to upgrade to 4.1 soon. Is it safe to do that or could it make > > things worse?
> > > > Kind regards > > > > Gudrun Amedick_______________________________________________ > > Gluster-users mailing list > > Gluster-users at gluster.org > > https://lists.gluster.org/mailman/listinfo/gluster-users > _______________________________________________ > Gluster-users mailing list > Gluster-users at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-users -- Frank Rühlemann IT-Systemtechnik UNIVERSITÄT ZU LÜBECK IT-Service-Center Ratzeburger Allee 160 23562 Lübeck Tel +49 451 3101 2034 Fax +49 451 3101 2004 ruehlemann at itsc.uni-luebeck.de www.itsc.uni-luebeck.de From isakdim at gmail.com Mon Jan 28 15:39:55 2019 From: isakdim at gmail.com (Dmitry Isakbayev) Date: Mon, 28 Jan 2019 10:39:55 -0500 Subject: [Gluster-users] java application crushes while reading a zip file In-Reply-To: References: Message-ID: Amar, Thank you for helping me troubleshoot the issues. I don't have the resources to test the software at this point, but I will keep it in mind. Regards, Dmitry On Tue, Jan 22, 2019 at 1:02 AM Amar Tumballi Suryanarayan < atumball at redhat.com> wrote: > Dmitry, > > Thanks for the detailed updates on this thread. Let us know how your > 'production' setup is running. For much smoother next upgrade, we request > you to help out with some early testing of glusterfs-6 RC builds which are > expected to be out by Feb 1st week. > > Also, if it is possible for you to automate the tests, it would be great > to have it in our regression, so we can always be sure your setup would > never break in future releases. > > Regards, > Amar > > On Mon, Jan 7, 2019 at 11:42 PM Dmitry Isakbayev > wrote: > >> This system is going into production. I will try to replicate this >> problem on the next installation. >> >> On Wed, Jan 2, 2019 at 9:25 PM Raghavendra Gowdappa >> wrote: >> >>> >>> >>> On Wed, Jan 2, 2019 at 9:59 PM Dmitry Isakbayev >>> wrote: >>> >>>> Still no JVM crashes.
Is it possible that running glusterfs with >>>> performance options turned off for a couple of days cleared out the "stale >>>> metadata issue"? >>>> >>> >>> restarting these options, would've cleared the existing cache and hence >>> previous stale metadata would've been cleared. Hitting stale metadata >>> again depends on races. That might be the reason you are still not seeing >>> the issue. Can you try with enabling all perf xlators (default >>> configuration)? >>> >>> >>>> >>>> On Mon, Dec 31, 2018 at 1:38 PM Dmitry Isakbayev >>>> wrote: >>>> >>>>> The software ran with all of the options turned off over the weekend >>>>> without any problems. >>>>> I will try to collect the debug info for you. I have re-enabled the 3 >>>>> three options, but yet to see the problem reoccurring. >>>>> >>>>> >>>>> On Sat, Dec 29, 2018 at 6:46 PM Raghavendra Gowdappa < >>>>> rgowdapp at redhat.com> wrote: >>>>> >>>>>> Thanks Dmitry. Can you provide the following debug info I asked >>>>>> earlier: >>>>>> >>>>>> * strace -ff -v ... of java application >>>>>> * dump of the I/O traffic seen by the mountpoint (use --dump-fuse >>>>>> while mounting). >>>>>> >>>>>> regards, >>>>>> Raghavendra >>>>>> >>>>>> On Sat, Dec 29, 2018 at 2:08 AM Dmitry Isakbayev >>>>>> wrote: >>>>>> >>>>>>> These 3 options seem to trigger both (reading zip file and renaming >>>>>>> files) problems. >>>>>>> >>>>>>> Options Reconfigured: >>>>>>> performance.io-cache: off >>>>>>> performance.stat-prefetch: off >>>>>>> performance.quick-read: off >>>>>>> performance.parallel-readdir: off >>>>>>> *performance.readdir-ahead: on* >>>>>>> *performance.write-behind: on* >>>>>>> *performance.read-ahead: on* >>>>>>> performance.client-io-threads: off >>>>>>> nfs.disable: on >>>>>>> transport.address-family: inet >>>>>>> >>>>>>> >>>>>>> On Fri, Dec 28, 2018 at 10:24 AM Dmitry Isakbayev >>>>>>> wrote: >>>>>>> >>>>>>>> Turning a single option on at a time still worked fine. I will >>>>>>>> keep trying. 
>>>>>>>> >>>>>>>> We had used 4.1.5 on KVM/CentOS7.5 at AWS without these issues or >>>>>>>> log messages. Do you suppose these issues are triggered by the new >>>>>>>> environment or did not exist in 4.1.5? >>>>>>>> >>>>>>>> [root at node1 ~]# glusterfs --version >>>>>>>> glusterfs 4.1.5 >>>>>>>> >>>>>>>> On AWS using >>>>>>>> [root at node1 ~]# hostnamectl >>>>>>>> Static hostname: node1 >>>>>>>> Icon name: computer-vm >>>>>>>> Chassis: vm >>>>>>>> Machine ID: b30d0f2110ac3807b210c19ede3ce88f >>>>>>>> Boot ID: 52bb159a0aa94043a40e7c7651967bd9 >>>>>>>> Virtualization: kvm >>>>>>>> Operating System: CentOS Linux 7 (Core) >>>>>>>> CPE OS Name: cpe:/o:centos:centos:7 >>>>>>>> Kernel: Linux 3.10.0-862.3.2.el7.x86_64 >>>>>>>> Architecture: x86-64 >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On Fri, Dec 28, 2018 at 8:56 AM Raghavendra Gowdappa < >>>>>>>> rgowdapp at redhat.com> wrote: >>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> On Fri, Dec 28, 2018 at 7:23 PM Dmitry Isakbayev < >>>>>>>>> isakdim at gmail.com> wrote: >>>>>>>>> >>>>>>>>>> Ok. I will try different options. >>>>>>>>>> >>>>>>>>>> This system is scheduled to go into production soon. What >>>>>>>>>> version would you recommend to roll back to? >>>>>>>>>> >>>>>>>>> >>>>>>>>> These are long standing issues. So, rolling back may not make >>>>>>>>> these issues go away. Instead if you think performance is agreeable to you, >>>>>>>>> please keep these xlators off in production. >>>>>>>>> >>>>>>>>> >>>>>>>>>> On Thu, Dec 27, 2018 at 10:55 PM Raghavendra Gowdappa < >>>>>>>>>> rgowdapp at redhat.com> wrote: >>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> On Fri, Dec 28, 2018 at 3:13 AM Dmitry Isakbayev < >>>>>>>>>>> isakdim at gmail.com> wrote: >>>>>>>>>>> >>>>>>>>>>>> Raghavendra, >>>>>>>>>>>> >>>>>>>>>>>> Thank for the suggestion. 
>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> I am suing >>>>>>>>>>>> >>>>>>>>>>>> [root at jl-fanexoss1p glusterfs]# gluster --version >>>>>>>>>>>> glusterfs 5.0 >>>>>>>>>>>> >>>>>>>>>>>> On >>>>>>>>>>>> [root at jl-fanexoss1p glusterfs]# hostnamectl >>>>>>>>>>>> Icon name: computer-vm >>>>>>>>>>>> Chassis: vm >>>>>>>>>>>> Machine ID: e44b8478ef7a467d98363614f4e50535 >>>>>>>>>>>> Boot ID: eed98992fdda4c88bdd459a89101766b >>>>>>>>>>>> Virtualization: vmware >>>>>>>>>>>> Operating System: Red Hat Enterprise Linux Server 7.5 (Maipo) >>>>>>>>>>>> CPE OS Name: cpe:/o:redhat:enterprise_linux:7.5:GA:server >>>>>>>>>>>> Kernel: Linux 3.10.0-862.14.4.el7.x86_64 >>>>>>>>>>>> Architecture: x86-64 >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> I have configured the following options >>>>>>>>>>>> >>>>>>>>>>>> [root at jl-fanexoss1p glusterfs]# gluster volume info >>>>>>>>>>>> Volume Name: gv0 >>>>>>>>>>>> Type: Replicate >>>>>>>>>>>> Volume ID: 5ffbda09-c5e2-4abc-b89e-79b5d8a40824 >>>>>>>>>>>> Status: Started >>>>>>>>>>>> Snapshot Count: 0 >>>>>>>>>>>> Number of Bricks: 1 x 3 = 3 >>>>>>>>>>>> Transport-type: tcp >>>>>>>>>>>> Bricks: >>>>>>>>>>>> Brick1: jl-fanexoss1p.cspire.net:/data/brick1/gv0 >>>>>>>>>>>> Brick2: sl-fanexoss2p.cspire.net:/data/brick1/gv0 >>>>>>>>>>>> Brick3: nxquorum1p.cspire.net:/data/brick1/gv0 >>>>>>>>>>>> Options Reconfigured: >>>>>>>>>>>> performance.io-cache: off >>>>>>>>>>>> performance.stat-prefetch: off >>>>>>>>>>>> performance.quick-read: off >>>>>>>>>>>> performance.parallel-readdir: off >>>>>>>>>>>> performance.readdir-ahead: off >>>>>>>>>>>> performance.write-behind: off >>>>>>>>>>>> performance.read-ahead: off >>>>>>>>>>>> performance.client-io-threads: off >>>>>>>>>>>> nfs.disable: on >>>>>>>>>>>> transport.address-family: inet >>>>>>>>>>>> >>>>>>>>>>>> I don't know if it is related, but I am seeing a lot of >>>>>>>>>>>> [2018-12-27 20:19:23.776080] W [MSGID: 114031] >>>>>>>>>>>> [client-rpc-fops_v2.c:1932:client4_0_seek_cbk] 2-gv0-client-0: remote 
>>>>>>>>>>>> operation failed [No such device or address] >>>>>>>>>>>> [2018-12-27 20:19:47.735190] E [MSGID: 101191] >>>>>>>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch >>>>>>>>>>>> handler >>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> These msgs were introduced by patch [1]. To the best of my >>>>>>>>>>> knowledge they are benign. We'll be sending a patch to fix these msgs >>>>>>>>>>> though. >>>>>>>>>>> >>>>>>>>>>> +Mohit Agrawal +Milind Changire >>>>>>>>>>> . Can you try to identify why we are >>>>>>>>>>> seeing these messages? If possible please send a patch to fix this. >>>>>>>>>>> >>>>>>>>>>> [1] >>>>>>>>>>> https://review.gluster.org/r/I578c3fc67713f4234bd3abbec5d3fbba19059ea5 >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>>> And java.io exceptions trying to rename files. >>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> When you see the errors is it possible to collect, >>>>>>>>>>> * strace of the java application (strace -ff -v ...) >>>>>>>>>>> * fuse-dump of the glusterfs mount (use option --dump-fuse while >>>>>>>>>>> mounting)? >>>>>>>>>>> >>>>>>>>>>> I also need another favour from you. By trail and error, can you >>>>>>>>>>> point out which of the many performance xlators you've turned off is >>>>>>>>>>> causing the issue? >>>>>>>>>>> >>>>>>>>>>> The above two data-points will help us to fix the problem. >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>>> Thank You, >>>>>>>>>>>> Dmitry >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> On Thu, Dec 27, 2018 at 3:48 PM Raghavendra Gowdappa < >>>>>>>>>>>> rgowdapp at redhat.com> wrote: >>>>>>>>>>>> >>>>>>>>>>>>> What version of glusterfs are you using? It might be either >>>>>>>>>>>>> * a stale metadata issue. >>>>>>>>>>>>> * inconsistent ctime issue. >>>>>>>>>>>>> >>>>>>>>>>>>> Can you try turning off all performance xlators? If the issue >>>>>>>>>>>>> is 1, that should help. 
>>>>>>>>>>>>> >>>>>>>>>>>>> On Fri, Dec 28, 2018 at 1:51 AM Dmitry Isakbayev < >>>>>>>>>>>>> isakdim at gmail.com> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>>> Attempted to set 'performance.read-ahead off` according to >>>>>>>>>>>>>> https://jira.apache.org/jira/browse/AMQ-7041 >>>>>>>>>>>>>> That did not help. >>>>>>>>>>>>>> >>>>>>>>>>>>>> On Mon, Dec 24, 2018 at 2:11 PM Dmitry Isakbayev < >>>>>>>>>>>>>> isakdim at gmail.com> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>>> The core file generated by JVM suggests that it happens >>>>>>>>>>>>>>> because the file is changing while it is being read - >>>>>>>>>>>>>>> https://bugs.java.com/bugdatabase/view_bug.do?bug_id=8186557 >>>>>>>>>>>>>>> . >>>>>>>>>>>>>>> The application reads in the zipfile and goes through the >>>>>>>>>>>>>>> zip entries, then reloads the file and goes the zip entries again. It does >>>>>>>>>>>>>>> so 3 times. The application never crushes on the 1st cycle but sometimes >>>>>>>>>>>>>>> crushes on the 2nd or 3rd cycle. >>>>>>>>>>>>>>> The zip file is generated about 20 seconds prior to it being >>>>>>>>>>>>>>> used and is not updated or even used by any other application. I have >>>>>>>>>>>>>>> never seen this problem on a plain file system. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> I would appreciate any suggestions on how to go debugging >>>>>>>>>>>>>>> this issue. I can change the source code of the java application. 
>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Regards, >>>>>>>>>>>>>>> Dmitry >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>>>> Gluster-users mailing list >>>>>>>>>>>>>> Gluster-users at gluster.org >>>>>>>>>>>>>> https://lists.gluster.org/mailman/listinfo/gluster-users >>>>>>>>>>>>> >>>>>>>>>>>>> _______________________________________________ >> Gluster-users mailing list >> Gluster-users at gluster.org >> https://lists.gluster.org/mailman/listinfo/gluster-users > > > > -- > Amar Tumballi (amarts) > -------------- next part -------------- An HTML attachment was scrubbed... URL: From cynthia.zhou at nokia-sbell.com Tue Jan 29 06:41:04 2019 From: cynthia.zhou at nokia-sbell.com (Zhou, Cynthia (NSB - CN/Hangzhou)) Date: Tue, 29 Jan 2019 06:41:04 +0000 Subject: [Gluster-users] query about glusterd epoll thread get stuck Message-ID: Hi, We are using glusterfs version 3.12 with 3 bricks. I find that occasionally, after rebooting all 3 sn nodes simultaneously, the glusterd process on one sn node may get stuck; when you try to execute a glusterd command, it does not respond.
Following is the glusterd stuck log and the gdb info of the stuck glusterd process, [2019-01-28 14:38:24.999329] I [rpc-clnt.c:1048:rpc_clnt_connection_init] 0-snapd: setting frame-timeout to 600 [2019-01-28 14:38:24.999450] I [rpc-clnt.c:1048:rpc_clnt_connection_init] 0-snapd: setting frame-timeout to 600 [2019-01-28 14:38:24.999597] I [MSGID: 106493] [glusterd-rpc-ops.c:701:__glusterd_friend_update_cbk] 0-management: Received ACC from uuid: a4315121-e127-42ca-9869-0fe451216a80 [2019-01-28 14:38:24.999692] I [MSGID: 106493] [glusterd-rpc-ops.c:486:__glusterd_friend_add_cbk] 0-glusterd: Received ACC from uuid: 7f694d7e-7613-4298-8da1-50cbb73ed47e, host: mn-0.local, port: 0 [2019-01-28 14:38:25.010624] W [socket.c:593:__socket_rwv] 0-management: readv on 192.168.1.14:24007 failed (No data available) [2019-01-28 14:38:25.010774] I [MSGID: 106004] [glusterd-handler.c:6317:__glusterd_peer_rpc_notify] 0-management: Peer (<7f694d7e-7613-4298-8da1-50cbb73ed47e>), in state , has disconnected from glusterd. 
[2019-01-28 14:38:25.010860] W [glusterd-locks.c:846:glusterd_mgmt_v3_unlock] (-->/usr/lib64/glusterfs/3.12.3/xlator/mgmt/glusterd.so(+0x1bba4) [0x7f1828aa0ba4] -->/usr/lib64/glusterfs/3.12.3/xlator/mgmt/glusterd.so(+0x2ed5d) [0x7f1828ab3d5d] -->/usr/lib64/glusterfs/3.12.3/xlator/mgmt/glusterd.so(+0xf7635) [0x7f1828b7c635] ) 0-management: Lock for vol ccs not held [2019-01-28 14:38:25.010903] W [MSGID: 106118] [glusterd-handler.c:6342:__glusterd_peer_rpc_notify] 0-management: Lock not released for ccs [2019-01-28 14:38:25.010931] W [glusterd-locks.c:846:glusterd_mgmt_v3_unlock] (-->/usr/lib64/glusterfs/3.12.3/xlator/mgmt/glusterd.so(+0x1bba4) [0x7f1828aa0ba4] -->/usr/lib64/glusterfs/3.12.3/xlator/mgmt/glusterd.so(+0x2ed5d) [0x7f1828ab3d5d] -->/usr/lib64/glusterfs/3.12.3/xlator/mgmt/glusterd.so(+0xf7635) [0x7f1828b7c635] ) 0-management: Lock for vol encryptfile not held [2019-01-28 14:38:25.010945] W [MSGID: 106118] [glusterd-handler.c:6342:__glusterd_peer_rpc_notify] 0-management: Lock not released for encryptfile [2019-01-28 14:38:25.010971] W [glusterd-locks.c:846:glusterd_mgmt_v3_unlock] (-->/usr/lib64/glusterfs/3.12.3/xlator/mgmt/glusterd.so(+0x1bba4) [0x7f1828aa0ba4] -->/usr/lib64/glusterfs/3.12.3/xlator/mgmt/glusterd.so(+0x2ed5d) [0x7f1828ab3d5d] -->/usr/lib64/glusterfs/3.12.3/xlator/mgmt/glusterd.so(+0xf7635) [0x7f1828b7c635] ) 0-management: Lock for vol export not held [2019-01-28 14:38:25.010983] W [MSGID: 106118] [glusterd-handler.c:6342:__glusterd_peer_rpc_notify] 0-management: Lock not released for export [2019-01-28 14:38:25.011006] W [glusterd-locks.c:846:glusterd_mgmt_v3_unlock] (-->/usr/lib64/glusterfs/3.12.3/xlator/mgmt/glusterd.so(+0x1bba4) [0x7f1828aa0ba4] -->/usr/lib64/glusterfs/3.12.3/xlator/mgmt/glusterd.so(+0x2ed5d) [0x7f1828ab3d5d] -->/usr/lib64/glusterfs/3.12.3/xlator/mgmt/glusterd.so(+0xf7635) [0x7f1828b7c635] ) 0-management: Lock for vol log not held [2019-01-28 14:38:25.011046] W [MSGID: 106118] 
[glusterd-handler.c:6342:__glusterd_peer_rpc_notify] 0-management: Lock not released for log [2019-01-28 14:38:25.011070] W [glusterd-locks.c:846:glusterd_mgmt_v3_unlock] (-->/usr/lib64/glusterfs/3.12.3/xlator/mgmt/glusterd.so(+0x1bba4) [0x7f1828aa0ba4] -->/usr/lib64/glusterfs/3.12.3/xlator/mgmt/glusterd.so(+0x2ed5d) [0x7f1828ab3d5d] -->/usr/lib64/glusterfs/3.12.3/xlator/mgmt/glusterd.so(+0xf7635) [0x7f1828b7c635] ) 0-management: Lock for vol mstate not held [2019-01-28 14:38:25.011082] W [MSGID: 106118] [glusterd-handler.c:6342:__glusterd_peer_rpc_notify] 0-management: Lock not released for mstate [2019-01-28 14:38:25.011104] W [glusterd-locks.c:846:glusterd_mgmt_v3_unlock] (-->/usr/lib64/glusterfs/3.12.3/xlator/mgmt/glusterd.so(+0x1bba4) [0x7f1828aa0ba4] -->/usr/lib64/glusterfs/3.12.3/xlator/mgmt/glusterd.so(+0x2ed5d) [0x7f1828ab3d5d] -->/usr/lib64/glusterfs/3.12.3/xlator/mgmt/glusterd.so(+0xf7635) [0x7f1828b7c635] ) 0-management: Lock for vol services not held [2019-01-28 14:38:25.011115] W [MSGID: 106118] [glusterd-handler.c:6342:__glusterd_peer_rpc_notify] 0-management: Lock not released for services [2019-01-28 14:38:25.011268] E [rpc-clnt.c:350:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x1e9)[0x7f182df24a99] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1f9)[0x7f182dce6f27] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0x1f)[0x7f182dce701a] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x11e)[0x7f182dce756e] (--> /lib64/libgfrpc.so.0(+0x12f9e)[0x7f182dce7f9e] ))))) 0-management: forced unwinding frame type(Peer mgmt) op(--(4)) called at 2019-01-28 14:38:24.999851 (xid=0x7) [2019-01-28 14:38:25.011286] E [MSGID: 106158] [glusterd-rpc-ops.c:684:__glusterd_friend_update_cbk] 0-management: RPC Error [2019-01-28 14:38:25.011301] I [MSGID: 106493] [glusterd-rpc-ops.c:701:__glusterd_friend_update_cbk] 0-management: Received RJT from uuid: 00000000-0000-0000-0000-000000000000 [2019-01-28 14:38:25.011424] I [MSGID: 106006] 
[glusterd-svc-mgmt.c:328:glusterd_svc_common_rpc_notify] 0-management: glustershd has connected with glusterd. [2019-01-28 14:38:25.011599] I [MSGID: 106143] [glusterd-pmap.c:295:pmap_registry_bind] 0-pmap: adding brick /mnt/bricks/ccs/brick on port 53952 [2019-01-28 14:38:25.011862] I [MSGID: 106492] [glusterd-handler.c:2718:__glusterd_handle_friend_update] 0-glusterd: Received friend update from uuid: 7f694d7e-7613-4298-8da1-50cbb73ed47e [2019-01-28 14:38:25.021058] I [MSGID: 106502] [glusterd-handler.c:2763:__glusterd_handle_friend_update] 0-management: Received my uuid as Friend [2019-01-28 14:38:25.021090] I [socket.c:3704:socket_submit_reply] 0-socket.management: not connected (priv->connected = -1) [2019-01-28 14:38:25.021099] E [rpcsvc.c:1364:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0xa, Program: GlusterD svc peer, ProgVers: 2, Proc: 4) to rpc-transport (socket.management) [2019-01-28 14:38:25.021108] E [MSGID: 106430] [glusterd-utils.c:568:glusterd_submit_reply] 0-glusterd: Reply submission failed [2019-01-28 14:38:25.021126] E [rpcsvc.c:559:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully [2019-01-28 14:38:25.021135] E [rpcsvc.c:1364:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0xa, Program: GlusterD svc peer, ProgVers: 2, Proc: 4) to rpc-transport (socket.management) [2019-01-28 14:38:25.021147] W [rpcsvc.c:565:rpcsvc_check_and_reply_error] 0-rpcsvc: failed to queue error reply Gdb info the stuck thread of glusterd is: (gdb) thread 8 [Switching to thread 8 (Thread 0x7f1826ec2700 (LWP 2418))] #0 0x00007f182cce787c in __lll_lock_wait () from /lib64/libpthread.so.0 (gdb) bt #0 0x00007f182cce787c in __lll_lock_wait () from /lib64/libpthread.so.0 #1 0x00007f182ccea677 in __lll_lock_elision () from /lib64/libpthread.so.0 #2 0x00007f182df5cae6 in iobref_unref () from /lib64/libglusterfs.so.0 #3 0x00007f182dce2f29 in rpc_transport_pollin_destroy () from /lib64/libgfrpc.so.0 #4 
0x00007f1827ccf319 in socket_event_poll_in () from /usr/lib64/glusterfs/3.12.3/rpc-transport/socket.so #5 0x00007f1827ccf932 in socket_event_handler () from /usr/lib64/glusterfs/3.12.3/rpc-transport/socket.so #6 0x00007f182df925d4 in event_dispatch_epoll_handler () from /lib64/libglusterfs.so.0 #7 0x00007f182df928ab in event_dispatch_epoll_worker () from /lib64/libglusterfs.so.0 #8 0x00007f182ccde5da in start_thread () from /lib64/libpthread.so.0 #9 0x00007f182c5b4e8f in clone () from /lib64/libc.so.6 (gdb) thread 9 [Switching to thread 9 (Thread 0x7f18266c1700 (LWP 2419))] #0 0x00007f182cce787c in __lll_lock_wait () from /lib64/libpthread.so.0 (gdb) bt #0 0x00007f182cce787c in __lll_lock_wait () from /lib64/libpthread.so.0 #1 0x00007f182cce2b42 in __pthread_mutex_cond_lock () from /lib64/libpthread.so.0 #2 0x00007f182cce44c8 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #3 0x00007f1827ccadab in socket_event_poll_err () from /usr/lib64/glusterfs/3.12.3/rpc-transport/socket.so #4 0x00007f1827ccf99c in socket_event_handler () from /usr/lib64/glusterfs/3.12.3/rpc-transport/socket.so #5 0x00007f182df925d4 in event_dispatch_epoll_handler () from /lib64/libglusterfs.so.0 #6 0x00007f182df928ab in event_dispatch_epoll_worker () from /lib64/libglusterfs.so.0 #7 0x00007f182ccde5da in start_thread () from /lib64/libpthread.so.0 #8 0x00007f182c5b4e8f in clone () from /lib64/libc.so.6 Have you ever encounter this issue? From the gdb info it seem the epoll thread get dead lock. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From sheggodu at redhat.com Tue Jan 29 07:06:06 2019 From: sheggodu at redhat.com (Sunil Kumar Heggodu Gopala Acharya) Date: Tue, 29 Jan 2019 12:36:06 +0530 Subject: [Gluster-users] Improvements to Gluster upstream documentation Message-ID: Hi, As part of our continuous effort to improve Gluster upstream documentation, we are proposing a change to the documentation theme that we are currently using through the glusterdocs pull request 454. Preview of the changes proposed can be viewed through this temporary website. Request you to review and share the comments/concerns/feedback. Regards, Sunil kumar AcharYa -------------- next part -------------- An HTML attachment was scrubbed... URL: From shaik.salam at tcs.com Tue Jan 29 07:25:38 2019 From: shaik.salam at tcs.com (Shaik Salam) Date: Tue, 29 Jan 2019 12:55:38 +0530 Subject: [Gluster-users] Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: server busy In-Reply-To: References: Message-ID: Hi Raghvendra/Michael/Madhu/John, Could you please help me out with this heketi issue. We are unable to create volumes. Please provide current workaround, if any. Thanks in advance. BR Salam From: Shaik Salam/HYD/TCS To: rtalur at redhat.com Cc: "John Mulligan" , "gluster-users at gluster.org List" , "Michael Adam" , Madhu Rajanna Date: 01/28/2019 12:40 PM Subject: Re: [Gluster-users] Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: server busy Hi Raghavendra, We are also facing following issue which is mentioned in case on openshift origin while we are creating pvc for pods.
(Please provide workaround to move further (pod restart doesn't workout) https://bugzilla.redhat.com/show_bug.cgi?id=1630117 https://bugzilla.redhat.com/show_bug.cgi?id=1636912 Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: create volume err: error creating volume Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: create volume err: error creating volume Server busy. Retry operation later.. at a time to create only one volume when in-flight operations are zero. Once volume requested it reaches to 8. Now single volume not able to create and we are till now mostly 10 volumes are created. Please find heketidb dump and log [negroni] Completed 200 OK in 98.699µs [negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6 [negroni] Completed 200 OK in 106.654µs [negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6 [negroni] Completed 200 OK in 185.406µs [negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6 [negroni] Completed 200 OK in 102.664µs [negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6 [negroni] Completed 200 OK in 192.658µs [negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6 [negroni] Completed 200 OK in 198.611µs [negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6 [negroni] Completed 200 OK in 124.254µs [negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6 [negroni] Completed 200 OK in 101.491µs [negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6 [negroni] Completed 200 OK in 116.997µs [negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6 [negroni] Completed 200 OK in 100.171µs [negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6 [negroni] Completed 200 OK in 109.238µs [negroni] Started POST /volumes [heketi] WARNING 2019/01/28 06:50:57 operations in-flight (8) exceeds limit (8) [negroni] Completed 429 Too Many Requests in 191.118µs [negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6 [negroni] Completed 200 OK
in 188.791µs [negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6 [negroni] Completed 200 OK in 94.436µs [negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6 [negroni] Completed 200 OK in 110.893µs [negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6 [negroni] Completed 200 OK in 112.132µs [negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6 [negroni] Completed 200 OK in 96.15µs [negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6 [negroni] Completed 200 OK in 112.682µs [negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6 [negroni] Completed 200 OK in 140.543µs [negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6 [negroni] Completed 200 OK in 182.066µs [negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6 [negroni] Completed 200 OK in 151.572µs BR Salam From: Shaik Salam/HYD/TCS To: "John Mulligan" Cc: "gluster-users at gluster.org List" , "Michael Adam" , Madhu Rajanna Date: 01/25/2019 04:03 PM Subject: Re: [Gluster-users] Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: server busy Hi John, Could you please have look my issue If you have time (atleast provide workaround). Thanks in advance. BR Salam From: "Shaik Salam" To: Cc: "gluster-users at gluster.org List" , "Michael Adam" Date: 01/25/2019 02:55 PM Subject: Re: [Gluster-users] Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: server busy Sent by: gluster-users-bounces at gluster.org "External email. Open with Caution" Hi John, Please find db dump and heketi log. Here kernel version. Please let me know If you need more information.
[root at app2 ~]# uname -a Linux app2.matrix.nokia.com 3.10.0-514.el7.x86_64 #1 SMP Tue Nov 22 16:42:41 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux Hardware: HP GEN8 OS; NAME="CentOS Linux" VERSION="7 (Core)" ID="centos" ID_LIKE="rhel fedora" VERSION_ID="7" PRETTY_NAME="CentOS Linux 7 (Core)" ANSI_COLOR="0;31" CPE_NAME="cpe:/o:centos:centos:7" HOME_URL="https://www.centos.org/" BUG_REPORT_URL="https://bugs.centos.org/" CENTOS_MANTISBT_PROJECT="CentOS-7" CENTOS_MANTISBT_PROJECT_VERSION="7" REDHAT_SUPPORT_PRODUCT="centos" REDHAT_SUPPORT_PRODUCT_VERSION="7" From: "Madhu Rajanna" To: "Shaik Salam" , "John Mulligan" Cc: "gluster-users at gluster.org List" , "Michael Adam" Date: 01/24/2019 10:52 PM Subject: Re: Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: server busy "External email. Open with Caution" Adding John who is having more idea about how to debug this one. @Shaik Salam can you some more info on the hardware on which you are running heketi (kernel details) On Thu, Jan 24, 2019 at 7:42 PM Shaik Salam wrote: Hi Madhu, Sorry to disturb could you please provide atleast work around (to clear requests which stuck) to move further. We are also not able to find root cause from glusterd logs. Please find attachment. BR Salam From: Shaik Salam/HYD/TCS To: "Madhu Rajanna" Cc: "gluster-users at gluster.org List" , "Michael Adam" Date: 01/24/2019 04:12 PM Subject: Re: Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: server busy Hi Madhu, Please let me know If any other information required. BR Salam From: Shaik Salam/HYD/TCS To: "Madhu Rajanna" Cc: "gluster-users at gluster.org List" , "Michael Adam" Date: 01/24/2019 03:23 PM Subject: Re: Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: server busy Hi Madhu, This is complete one after restart of heketi pod and process log. 
BR Salam [attachment "heketi-pod-complete.log" deleted by Shaik Salam/HYD/TCS] [attachment "ps-aux.txt" deleted by Shaik Salam/HYD/TCS] From: "Madhu Rajanna" To: "Shaik Salam" Cc: "gluster-users at gluster.org List" , "Michael Adam" Date: 01/24/2019 01:55 PM Subject: Re: Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: server busy "External email. Open with Caution" the logs you provided is not complete, not able to figure out which command is struck, can you reattach the complete output of `ps aux` and also attach complete heketi logs. On Thu, Jan 24, 2019 at 1:41 PM Shaik Salam wrote: Hi Madhu, Please find requested info. BR Salam From: Madhu Rajanna To: Shaik Salam Cc: "gluster-users at gluster.org List" , Michael Adam Date: 01/24/2019 01:33 PM Subject: Re: Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: server busy "External email. Open with Caution" the heketi logs you have attached is not complete i believe, can you povide the complete heketi logs and also an we get the output of "ps aux" from the gluster pods ? I want to see if any lvm commands or gluster commands are "stuck". On Thu, Jan 24, 2019 at 1:16 PM Shaik Salam wrote: Hi Madhu. I tried lot of times restarted heketi pod but not resolved. sh-4.4# heketi-cli server operations info Operation Counts: Total: 0 In-Flight: 0 New: 0 Stale: 0 Now you can see all operations are zero. Now I try to create single volume below is observation in-flight reaching slowly to 8. 
sh-4.4# heketi-cli server operations info
Operation Counts:
  Total: 0
  In-Flight: 6
  New: 0
  Stale: 0
sh-4.4# heketi-cli server operations info
Operation Counts:
  Total: 0
  In-Flight: 7
  New: 0
  Stale: 0
sh-4.4# heketi-cli server operations info
Operation Counts:
  Total: 0
  In-Flight: 7
  New: 0
  Stale: 0
sh-4.4# heketi-cli server operations info
Operation Counts:
  Total: 0
  In-Flight: 7
  New: 0
  Stale: 0
sh-4.4# heketi-cli server operations info
Operation Counts:
  Total: 0
  In-Flight: 7
  New: 0
  Stale: 0

[negroni] Completed 200 OK in 186.286µs
[negroni] Started POST /volumes
[negroni] Started GET /operations
[negroni] Completed 200 OK in 166.294µs
[negroni] Started GET /operations
[negroni] Completed 200 OK in 186.411µs
[negroni] Started GET /operations
[negroni] Completed 200 OK in 179.796µs
[negroni] Started POST /volumes
[negroni] Started POST /volumes
[negroni] Started POST /volumes
[negroni] Started POST /volumes
[negroni] Started GET /operations
[negroni] Completed 200 OK in 131.108µs
[negroni] Started POST /volumes
[negroni] Started GET /operations
[negroni] Completed 200 OK in 111.392µs
[negroni] Started GET /operations
[negroni] Completed 200 OK in 265.023µs
[negroni] Started GET /operations
[negroni] Completed 200 OK in 179.364µs
[negroni] Started GET /operations
[negroni] Completed 200 OK in 295.058µs
[negroni] Started GET /operations
[negroni] Completed 200 OK in 146.857µs
[negroni] Started POST /volumes
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/24 07:43:36 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 403.166µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/24 07:43:51 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 193.554µs

But for pod volume is not creating.
1:15:36 PM Warning Provisioning failed Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: create volume err: error creating volume Server busy. Retry operation later.. 9 times in the last 2 minutes
1:13:21 PM Warning Provisioning failed Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: create volume err: error creating volume . 8 times in the last

From: "Madhu Rajanna"
To: "Shaik Salam"
Cc: "gluster-users at gluster.org List" , "Michael Adam"
Date: 01/24/2019 12:51 PM
Subject: Re: Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: server busy

Hi Shaik,

Can you provide me the output of
$ heketi-cli server operations info
from the heketi pod?

As a workaround you can try restarting the heketi pod. This will cause the current operations to go stale, but other pending PVCs may go to the Bound state.

Regards,
Madhu R

On Thu, Jan 24, 2019 at 12:36 PM Shaik Salam wrote:

Hi Madhu,

Could you please have a look at my issue if you have time (at least a workaround)?
I am unable to send mail to "John Mulligan", who is currently handling the issue https://bugzilla.redhat.com/show_bug.cgi?id=1636912

BR
Salam

From: Shaik Salam/HYD/TCS
To: "John Mulligan" , "Michael Adam" < madam at redhat.com>, "Madhu Rajanna"
Cc: "gluster-users at gluster.org List"
Date: 01/24/2019 12:21 PM
Subject: Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: server busy

Hi All,

We are also facing the following issue on OpenShift Origin while creating PVCs for pods. (Please provide at least a workaround so that we can move further.)

Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: create volume err: error creating volume
Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: create volume err: error creating volume Server busy. Retry operation later..
Please find heketidb dump and log

[negroni] Completed 429 Too Many Requests in 250.763µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:07:49 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 169.08µs
[negroni] Started DELETE /volumes/520bc5f4e1bfd029855a72f9ca7ebf6c
[negroni] Completed 404 Not Found in 148.125µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:08:04 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 496.624µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:08:04 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 101.673µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:08:19 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 209.681µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:08:19 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 103.595µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:08:34 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 297.594µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:08:34 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 96.75µs
[negroni] Started POST /volumes
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:08:49 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 477.007µs
[heketi] WARNING 2019/01/23 12:08:49 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 165.38µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:09:04 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 488.253µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:09:04 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 171.836µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:09:19 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 208.59µs
[negroni] Started POST /volumes
[heketi] WARNING 2019/01/23 12:09:19 operations in-flight (8) exceeds limit (8)
[negroni] Completed 429 Too Many Requests in 125.141µs
[negroni] Started DELETE /volumes/99e87ecd0a816ac34ae5a04eabc1d606
[negroni] Completed 404 Not Found in 138.687µs
[negroni] Started POST /volumes

BR
Salam

=====-----=====-----=====
Notice: The information contained in this e-mail message and/or attachments to it may contain confidential or privileged information. If you are not the intended recipient, any dissemination, use, review, distribution, printing or copying of the information contained in this e-mail message and/or attachments to it are strictly prohibited. If you have received this communication in error, please notify us by reply e-mail or telephone and immediately and permanently delete the message and any attachments. Thank you

--
Madhu Rajanna
Software Engineer
Red Hat Bangalore, India
mrajanna at redhat.com
M: +91-9741133155

[attachment "heketi-complete.log.txt" deleted by Shaik Salam/HYD/TCS]
[attachment "heketi-gluster.db.txt" deleted by Shaik Salam/HYD/TCS]
_______________________________________________
Gluster-users mailing list
Gluster-users at gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: heketi-complete.log
Type: application/octet-stream
Size: 445422 bytes
Desc: not available
URL: 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: heketi-dump.db
Type: application/octet-stream
Size: 67151 bytes
Desc: not available
URL: 

From rgowdapp at redhat.com Tue Jan 29 11:32:12 2019
From: rgowdapp at redhat.com (Raghavendra Gowdappa)
Date: Tue, 29 Jan 2019 17:02:12 +0530
Subject: [Gluster-users] query about glusterd epoll thread get stuck
In-Reply-To: 
References: 
Message-ID: 

On Tue, Jan 29, 2019 at 12:11 PM Zhou, Cynthia (NSB - CN/Hangzhou) <cynthia.zhou at nokia-sbell.com> wrote:

> Hi,
>
> We are using glusterfs version 3.12 with 3 bricks. I find that occasionally, after rebooting all 3 sn nodes simultaneously, the glusterd process on one of the sn nodes may get stuck; when you try to execute a glusterd command it does not respond.
>
> Following are the glusterd log from the stuck node and the gdb info of the stuck glusterd process:
>
> [2019-01-28 14:38:24.999329] I [rpc-clnt.c:1048:rpc_clnt_connection_init] 0-snapd: setting frame-timeout to 600
> [2019-01-28 14:38:24.999450] I [rpc-clnt.c:1048:rpc_clnt_connection_init] 0-snapd: setting frame-timeout to 600
> [2019-01-28 14:38:24.999597] I [MSGID: 106493] [glusterd-rpc-ops.c:701:__glusterd_friend_update_cbk] 0-management: Received ACC from uuid: a4315121-e127-42ca-9869-0fe451216a80
> [2019-01-28 14:38:24.999692] I [MSGID: 106493] [glusterd-rpc-ops.c:486:__glusterd_friend_add_cbk] 0-glusterd: Received ACC from uuid: 7f694d7e-7613-4298-8da1-50cbb73ed47e, host: mn-0.local, port: 0
> [2019-01-28 14:38:25.010624] W [socket.c:593:__socket_rwv] 0-management: readv on 192.168.1.14:24007 failed (No data available)
> [2019-01-28 14:38:25.010774] I [MSGID: 106004] [glusterd-handler.c:6317:__glusterd_peer_rpc_notify] 0-management: Peer (<7f694d7e-7613-4298-8da1-50cbb73ed47e>), in state Cluster>, has disconnected from glusterd.
> > [2019-01-28 14:38:25.010860] W > [glusterd-locks.c:846:glusterd_mgmt_v3_unlock] > (-->/usr/lib64/glusterfs/3.12.3/xlator/mgmt/glusterd.so(+0x1bba4) > [0x7f1828aa0ba4] > -->/usr/lib64/glusterfs/3.12.3/xlator/mgmt/glusterd.so(+0x2ed5d) > [0x7f1828ab3d5d] > -->/usr/lib64/glusterfs/3.12.3/xlator/mgmt/glusterd.so(+0xf7635) > [0x7f1828b7c635] ) 0-management: Lock for vol ccs not held > > [2019-01-28 14:38:25.010903] W [MSGID: 106118] > [glusterd-handler.c:6342:__glusterd_peer_rpc_notify] 0-management: Lock not > released for ccs > > [2019-01-28 14:38:25.010931] W > [glusterd-locks.c:846:glusterd_mgmt_v3_unlock] > (-->/usr/lib64/glusterfs/3.12.3/xlator/mgmt/glusterd.so(+0x1bba4) > [0x7f1828aa0ba4] > -->/usr/lib64/glusterfs/3.12.3/xlator/mgmt/glusterd.so(+0x2ed5d) > [0x7f1828ab3d5d] > -->/usr/lib64/glusterfs/3.12.3/xlator/mgmt/glusterd.so(+0xf7635) > [0x7f1828b7c635] ) 0-management: Lock for vol encryptfile not held > > [2019-01-28 14:38:25.010945] W [MSGID: 106118] > [glusterd-handler.c:6342:__glusterd_peer_rpc_notify] 0-management: Lock not > released for encryptfile > > [2019-01-28 14:38:25.010971] W > [glusterd-locks.c:846:glusterd_mgmt_v3_unlock] > (-->/usr/lib64/glusterfs/3.12.3/xlator/mgmt/glusterd.so(+0x1bba4) > [0x7f1828aa0ba4] > -->/usr/lib64/glusterfs/3.12.3/xlator/mgmt/glusterd.so(+0x2ed5d) > [0x7f1828ab3d5d] > -->/usr/lib64/glusterfs/3.12.3/xlator/mgmt/glusterd.so(+0xf7635) > [0x7f1828b7c635] ) 0-management: Lock for vol export not held > > [2019-01-28 14:38:25.010983] W [MSGID: 106118] > [glusterd-handler.c:6342:__glusterd_peer_rpc_notify] 0-management: Lock not > released for export > > [2019-01-28 14:38:25.011006] W > [glusterd-locks.c:846:glusterd_mgmt_v3_unlock] > (-->/usr/lib64/glusterfs/3.12.3/xlator/mgmt/glusterd.so(+0x1bba4) > [0x7f1828aa0ba4] > -->/usr/lib64/glusterfs/3.12.3/xlator/mgmt/glusterd.so(+0x2ed5d) > [0x7f1828ab3d5d] > -->/usr/lib64/glusterfs/3.12.3/xlator/mgmt/glusterd.so(+0xf7635) > [0x7f1828b7c635] ) 0-management: Lock for vol log 
not held > > [2019-01-28 14:38:25.011046] W [MSGID: 106118] > [glusterd-handler.c:6342:__glusterd_peer_rpc_notify] 0-management: Lock not > released for log > > [2019-01-28 14:38:25.011070] W > [glusterd-locks.c:846:glusterd_mgmt_v3_unlock] > (-->/usr/lib64/glusterfs/3.12.3/xlator/mgmt/glusterd.so(+0x1bba4) > [0x7f1828aa0ba4] > -->/usr/lib64/glusterfs/3.12.3/xlator/mgmt/glusterd.so(+0x2ed5d) > [0x7f1828ab3d5d] > -->/usr/lib64/glusterfs/3.12.3/xlator/mgmt/glusterd.so(+0xf7635) > [0x7f1828b7c635] ) 0-management: Lock for vol mstate not held > > [2019-01-28 14:38:25.011082] W [MSGID: 106118] > [glusterd-handler.c:6342:__glusterd_peer_rpc_notify] 0-management: Lock not > released for mstate > > [2019-01-28 14:38:25.011104] W > [glusterd-locks.c:846:glusterd_mgmt_v3_unlock] > (-->/usr/lib64/glusterfs/3.12.3/xlator/mgmt/glusterd.so(+0x1bba4) > [0x7f1828aa0ba4] > -->/usr/lib64/glusterfs/3.12.3/xlator/mgmt/glusterd.so(+0x2ed5d) > [0x7f1828ab3d5d] > -->/usr/lib64/glusterfs/3.12.3/xlator/mgmt/glusterd.so(+0xf7635) > [0x7f1828b7c635] ) 0-management: Lock for vol services not held > > [2019-01-28 14:38:25.011115] W [MSGID: 106118] > [glusterd-handler.c:6342:__glusterd_peer_rpc_notify] 0-management: Lock not > released for services > > [2019-01-28 14:38:25.011268] E [rpc-clnt.c:350:saved_frames_unwind] (--> > /lib64/libglusterfs.so.0(_gf_log_callingfn+0x1e9)[0x7f182df24a99] (--> > /lib64/libgfrpc.so.0(saved_frames_unwind+0x1f9)[0x7f182dce6f27] (--> > /lib64/libgfrpc.so.0(saved_frames_destroy+0x1f)[0x7f182dce701a] (--> > /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x11e)[0x7f182dce756e] > (--> /lib64/libgfrpc.so.0(+0x12f9e)[0x7f182dce7f9e] ))))) 0-management: > forced unwinding frame type(Peer mgmt) op(--(4)) called at 2019-01-28 > 14:38:24.999851 (xid=0x7) > > [2019-01-28 14:38:25.011286] E [MSGID: 106158] > [glusterd-rpc-ops.c:684:__glusterd_friend_update_cbk] 0-management: RPC > Error > > [2019-01-28 14:38:25.011301] I [MSGID: 106493] > 
[glusterd-rpc-ops.c:701:__glusterd_friend_update_cbk] 0-management: > Received RJT from uuid: 00000000-0000-0000-0000-000000000000 > > [2019-01-28 14:38:25.011424] I [MSGID: 106006] > [glusterd-svc-mgmt.c:328:glusterd_svc_common_rpc_notify] 0-management: > glustershd has connected with glusterd. > > [2019-01-28 14:38:25.011599] I [MSGID: 106143] > [glusterd-pmap.c:295:pmap_registry_bind] 0-pmap: adding brick > /mnt/bricks/ccs/brick on port 53952 > > [2019-01-28 14:38:25.011862] I [MSGID: 106492] [glusterd-handler.c:2718: > __glusterd_handle_friend_update] 0-glusterd: Received friend update from > uuid: 7f694d7e-7613-4298-8da1-50cbb73ed47e > > [2019-01-28 14:38:25.021058] I [MSGID: 106502] > [glusterd-handler.c:2763:__glusterd_handle_friend_update] 0-management: > Received my uuid as Friend > > [2019-01-28 14:38:25.021090] I [socket.c:3704:socket_submit_reply] > 0-socket.management: not connected (priv->connected = -1) > > [2019-01-28 14:38:25.021099] E [rpcsvc.c:1364:rpcsvc_submit_generic] > 0-rpc-service: failed to submit message (XID: 0xa, Program: GlusterD svc > peer, ProgVers: 2, Proc: 4) to rpc-transport (socket.management) > > [2019-01-28 14:38:25.021108] E [MSGID: 106430] > [glusterd-utils.c:568:glusterd_submit_reply] 0-glusterd: Reply submission > failed > > [2019-01-28 14:38:25.021126] E [rpcsvc.c:559:rpcsvc_check_and_reply_error] > 0-rpcsvc: rpc actor failed to complete successfully > > [2019-01-28 14:38:25.021135] E [rpcsvc.c:1364:rpcsvc_submit_generic] > 0-rpc-service: failed to submit message (XID: 0xa, Program: GlusterD svc > peer, ProgVers: 2, Proc: 4) to rpc-transport (socket.management) > > [2019-01-28 14:38:25.021147] W [rpcsvc.c:565:rpcsvc_check_and_reply_error] > 0-rpcsvc: failed to queue error reply > > > > > > Gdb info the stuck thread of glusterd is: > > (gdb) thread 8 > > [Switching to thread 8 (Thread 0x7f1826ec2700 (LWP 2418))] > > #0 0x00007f182cce787c in __lll_lock_wait () from /lib64/libpthread.so.0 > > (gdb) bt > > #0 
0x00007f182cce787c in __lll_lock_wait () from /lib64/libpthread.so.0 > > #1 0x00007f182ccea677 in __lll_lock_elision () from /lib64/libpthread.so.0 > > #2 0x00007f182df5cae6 in iobref_unref () from /lib64/libglusterfs.so.0 > > #3 0x00007f182dce2f29 in rpc_transport_pollin_destroy () from > /lib64/libgfrpc.so.0 > > #4 0x00007f1827ccf319 in socket_event_poll_in () from > /usr/lib64/glusterfs/3.12.3/rpc-transport/socket.so > > #5 0x00007f1827ccf932 in socket_event_handler () from > /usr/lib64/glusterfs/3.12.3/rpc-transport/socket.so > > #6 0x00007f182df925d4 in event_dispatch_epoll_handler () from > /lib64/libglusterfs.so.0 > > #7 0x00007f182df928ab in event_dispatch_epoll_worker () from > /lib64/libglusterfs.so.0 > > #8 0x00007f182ccde5da in start_thread () from /lib64/libpthread.so.0 > > #9 0x00007f182c5b4e8f in clone () from /lib64/libc.so.6 > > (gdb) thread 9 > > [Switching to thread 9 (Thread 0x7f18266c1700 (LWP 2419))] > > #0 0x00007f182cce787c in __lll_lock_wait () from /lib64/libpthread.so.0 > > (gdb) bt > > #0 0x00007f182cce787c in __lll_lock_wait () from /lib64/libpthread.so.0 > > #1 0x00007f182cce2b42 in __pthread_mutex_cond_lock () from > /lib64/libpthread.so.0 > > #2 0x00007f182cce44c8 in pthread_cond_wait@@GLIBC_2.3.2 () from > /lib64/libpthread.so.0 > > #3 0x00007f1827ccadab in socket_event_poll_err () from > /usr/lib64/glusterfs/3.12.3/rpc-transport/socket.so > > #4 0x00007f1827ccf99c in socket_event_handler () from > /usr/lib64/glusterfs/3.12.3/rpc-transport/socket.so > > #5 0x00007f182df925d4 in event_dispatch_epoll_handler () from > /lib64/libglusterfs.so.0 > > #6 0x00007f182df928ab in event_dispatch_epoll_worker () from > /lib64/libglusterfs.so.0 > > #7 0x00007f182ccde5da in start_thread () from /lib64/libpthread.so.0 > > #8 0x00007f182c5b4e8f in clone () from /lib64/libc.so.6 > > > > > > Have you ever encounter this issue? From the gdb info it seem the epoll > thread get dead lock. > Its not a deadlock. 
Pollerr is serialized with any ongoing pollin/pollout request processing. The problem here is that iobuf_unref is waiting on a lock and hence pollin didn't complete. The question is: why is iobuf_unref stuck? Who holds the lock of the iobuf?
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From spisla80 at gmail.com Tue Jan 29 15:21:15 2019
From: spisla80 at gmail.com (David Spisla)
Date: Tue, 29 Jan 2019 16:21:15 +0100
Subject: [Gluster-users] Default Port Range for Bricks
Message-ID: 

Hello Gluster Community,

in glusterd.vol there are parameters to define the port range for the bricks. They are commented out by default:

# option base-port 49152
# option max-port 65535

I assume that glusterd is not using this range if the parameters are commented out. But what range does it use instead? Is there a way to find this out?

Regards
David Spisla
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From deqian.li at nokia-sbell.com Wed Jan 30 02:05:32 2019
From: deqian.li at nokia-sbell.com (Li, Deqian (NSB - CN/Hangzhou))
Date: Wed, 30 Jan 2019 02:05:32 +0000
Subject: [Gluster-users] query about glusterfs 3.12-3 write-behind.c coredump
Message-ID: <53269b77b7cf4bc59303101967083d95@nokia-sbell.com>

Hi,

Could you help to check this coredump?
We are using glusterfs 3.12-3 (3 replicated bricks solution) to do stability testing under high CPU load (around 80%, generated by stress) while doing I/O.
After several hours, a coredump happened on the glusterfs side.

[Current thread is 1 (Thread 0x7ffff37d2700 (LWP 3696))]
Missing separate debuginfos, use: dnf debuginfo-install rcp-pack-glusterfs-1.8.1_11_g99e9ca6-RCP2.wf28.x86_64
(gdb) bt
#0 0x00007ffff0d5c845 in wb_fulfill (wb_inode=0x7fffd406b3b0, liabilities=0x7fffdc234b50) at write-behind.c:1148
#1 0x00007ffff0d5e4d5 in wb_process_queue (wb_inode=0x7fffd406b3b0) at write-behind.c:1718
#2 0x00007ffff0d5eda7 in wb_writev (frame=0x7fffe0086290, this=0x7fffec014b00, fd=0x7fffe4034070, vector=0x7fffdc445720, count=1, offset=67108863, flags=32770, iobref=0x7fffdc00d550, xdata=0x0) at write-behind.c:1825
#3 0x00007ffff0b51fcb in du_writev_resume (ret=0, frame=0x7fffdc0305a0, opaque=0x7fffdc0305a0) at disk-usage.c:490
#4 0x00007ffff7b3510d in synctask_wrap () at syncop.c:377
#5 0x00007ffff60d0660 in ?? () from /lib64/libc.so.6
#6 0x0000000000000000 in ?? ()
(gdb) p wb_inode
$1 = (wb_inode_t *) 0x7fffd406b3b0
(gdb) frame 2
#2 0x00007ffff0d5eda7 in wb_writev (frame=0x7fffe0086290, this=0x7fffec014b00, fd=0x7fffe4034070, vector=0x7fffdc445720, count=1, offset=67108863, flags=32770, iobref=0x7fffdc00d550, xdata=0x0) at write-behind.c:1825
1825 in write-behind.c
(gdb) p *fd
$2 = {pid = 18154, flags = 32962, refcount = 0, inode_list = {next = 0x7fffe4034080, prev = 0x7fffe4034080}, inode = 0x0, lock = {spinlock = 0, mutex = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = -1, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' , "\377\377\377\377", '\000' , __align = 0}}, _ctx = 0x7fffe4022930, xl_count = 17, lk_ctx = 0x7fffe40350e0, anonymous = _gf_false}
(gdb) p fd
$3 = (fd_t *) 0x7fffe4034070
(gdb) p wb_inode->this
$1 = (xlator_t *) 0xffffffffffffff00

After adding a test log, I found that the FOP sequence on the write-behind xlator side was messed up, as shown below. On the FUSE side the FLUSH comes after write2, but on the WB side the FLUSH falls between write2's 'wb_do_unwinds' and 'wb_fulfill'.
So I think this is a problem. I think it's possible that the FLUSH and the later RELEASE operation destroy the fd, which causes 'wb_in->this(0xffffffffffffff00)'. Do you think so?
And I think the synctask_new in our newly added disk-usage xlator delays the write operation, while the FLUSH operation has no such delay (because it does not invoke the disk-usage xlator).

Do you agree with my speculation? And how can we fix it? (We don't want to move the disk-usage xlator.)

Problematic FOP sequence:

FUSE side:      WB side:

Write 1         write1
                Write2 do unwind
Write 2         FLUSH
                Release(destroy fd)
FLUSH           write2 (wb_fulfill) then coredump.
Release

int
wb_fulfill (wb_inode_t *wb_inode, list_head_t *liabilities)
{
        wb_request_t *req = NULL;
        wb_request_t *head = NULL;
        wb_request_t *tmp = NULL;
        wb_conf_t *conf = NULL;
        off_t expected_offset = 0;
        size_t curr_aggregate = 0;
        size_t vector_count = 0;
        int ret = 0;

        conf = wb_inode->this->private;   --> this line coredumps

        list_for_each_entry_safe (req, tmp, liabilities, winds) {
                list_del_init (&req->winds);

        ....

volume ccs-write-behind
    type performance/write-behind
    subvolumes ccs-dht
end-volume

volume ccs-disk-usage   --> we added a new xlator here for write ops, just to check whether the disk is full; it uses synctask_new for writes.
    type performance/disk-usage
    subvolumes ccs-write-behind
end-volume

volume ccs-read-ahead
    type performance/read-ahead
    subvolumes ccs-disk-usage
end-volume

Ps.
Part of Our new translator code

int
du_writev (call_frame_t *frame, xlator_t *this, fd_t *fd,
           struct iovec *vector, int count, off_t off, uint32_t flags,
           struct iobref *iobref, dict_t *xdata)
{
        int op_errno = -1;
        int ret = -1;
        du_local_t *local = NULL;
        loc_t tmp_loc = {0,};

        VALIDATE_OR_GOTO (frame, err);
        VALIDATE_OR_GOTO (this, err);
        VALIDATE_OR_GOTO (fd, err);

        tmp_loc.gfid[15] = 1;
        tmp_loc.inode = fd->inode;
        tmp_loc.parent = fd->inode;
        local = du_local_init (frame, &tmp_loc, fd, GF_FOP_WRITE);
        if (!local) {
                op_errno = ENOMEM;
                goto err;
        }
        local->vector = iov_dup (vector, count);
        local->offset = off;
        local->count = count;
        local->flags = flags;
        local->iobref = iobref_ref (iobref);

        ret = synctask_new(this->ctx->env, du_get_du_info,du_writev_resume,frame,frame);
        if(ret)
        {
                op_errno = -1;
                gf_log (this->name, GF_LOG_WARNING,"synctask_new return failure ret(%d) ",ret);
                goto err;
        }
        return 0;
err:
        op_errno = (op_errno == -1) ? errno : op_errno;
        DU_STACK_UNWIND (writev, frame, -1, op_errno, NULL, NULL, NULL);
        return 0;
}

Br,
Li Deqian
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From rgowdapp at redhat.com Wed Jan 30 02:59:34 2019
From: rgowdapp at redhat.com (Raghavendra Gowdappa)
Date: Wed, 30 Jan 2019 08:29:34 +0530
Subject: [Gluster-users] query about glusterfs 3.12-3 write-behind.c coredump
In-Reply-To: <53269b77b7cf4bc59303101967083d95@nokia-sbell.com>
References: <53269b77b7cf4bc59303101967083d95@nokia-sbell.com>
Message-ID: 

On Wed, Jan 30, 2019 at 7:35 AM Li, Deqian (NSB - CN/Hangzhou) <deqian.li at nokia-sbell.com> wrote:

> Hi,
>
> Could you help to check this coredump?
>
> We are using glusterfs 3.12-3(3 replicated bricks solution ) to do stability testing under high CPU load like 80% by stress and doing I/O.
>
> After several hours, coredump happened in glusterfs side .
>
> [Current thread is 1 (Thread 0x7ffff37d2700 (LWP 3696))]
> Missing separate debuginfos, use: dnf debuginfo-install rcp-pack-glusterfs-1.8.1_11_g99e9ca6-RCP2.wf28.x86_64
> (gdb) bt
> #0 0x00007ffff0d5c845 in wb_fulfill (wb_inode=0x7fffd406b3b0, liabilities=0x7fffdc234b50) at write-behind.c:1148
> #1 0x00007ffff0d5e4d5 in wb_process_queue (wb_inode=0x7fffd406b3b0) at write-behind.c:1718
> #2 0x00007ffff0d5eda7 in wb_writev (frame=0x7fffe0086290, this=0x7fffec014b00, fd=0x7fffe4034070, vector=0x7fffdc445720, count=1, offset=67108863, flags=32770, iobref=0x7fffdc00d550, xdata=0x0) at write-behind.c:1825
> #3 0x00007ffff0b51fcb in du_writev_resume (ret=0, frame=0x7fffdc0305a0, opaque=0x7fffdc0305a0) at disk-usage.c:490
> #4 0x00007ffff7b3510d in synctask_wrap () at syncop.c:377
> #5 0x00007ffff60d0660 in ?? () from /lib64/libc.so.6
> #6 0x0000000000000000 in ?? ()
> (gdb) p wb_inode
> $1 = (wb_inode_t *) 0x7fffd406b3b0
> (gdb) frame 2
> #2 0x00007ffff0d5eda7 in wb_writev (frame=0x7fffe0086290, this=0x7fffec014b00, fd=0x7fffe4034070, vector=0x7fffdc445720, count=1, offset=67108863, flags=32770, iobref=0x7fffdc00d550, xdata=0x0) at write-behind.c:1825
> 1825 in write-behind.c
> (gdb) p *fd
> $2 = {pid = 18154, flags = 32962, refcount = 0, inode_list = {next = 0x7fffe4034080, prev = 0x7fffe4034080}, inode = 0x0, lock = {spinlock = 0, mutex = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = -1, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' , "\377\377\377\377", '\000' , __align = 0}}, _ctx = 0x7fffe4022930, xl_count = 17, lk_ctx = 0x7fffe40350e0, anonymous = _gf_false}
> (gdb) p fd
> $3 = (fd_t *) 0x7fffe4034070
>
> (gdb) p wb_inode->this
> $1 = (xlator_t *) 0xffffffffffffff00
>
> After adding test log I found the FOP sequence in write-behind xlator side was mass as bellow showing. In the FUSE side the FLUSH is after write2, but in the WB side, FLUSH is between write2 'wb_do_unwinds' and 'wb_fulfill'.
>
> So I think this should has problem. I think it's possible that the FLUSH and later RELEASE operation will destroy the fd , it will cause 'wb_in->this(0xffffffffffffff00)'. Do you think so?
>
> And I think our new adding disk-usage xlator's synctask_new will dealy the write operation, but the FLUSH operation without this delay(because not invoked the disk-usage xlator).
>
> Do you agree with my speculation ? and how to fix?(we don't want to move the disk-usage xlator)
>
> Problematic FOP sequence :
>
> FUSE side:      WB side:
>
> Write 1         write1
>                 Write2 do unwind
> Write 2         FLUSH
>                 Release(destroy fd)
> FLUSH           write2 (wb_fulfill) then coredump.
> Release
>
> int
> wb_fulfill (wb_inode_t *wb_inode, list_head_t *liabilities)
> {
>         wb_request_t *req = NULL;
>         wb_request_t *head = NULL;
>         wb_request_t *tmp = NULL;
>         wb_conf_t *conf = NULL;
>         off_t expected_offset = 0;
>         size_t curr_aggregate = 0;
>         size_t vector_count = 0;
>         int ret = 0;
>
>         conf = wb_inode->this->private;   --> this line coredump
>
>         list_for_each_entry_safe (req, tmp, liabilities, winds) {
>                 list_del_init (&req->winds);
>
>         ....
>
> volume ccs-write-behind
>     type performance/write-behind
>     subvolumes ccs-dht
> end-volume
>
> volume ccs-disk-usage   --> we add a new xlator here for write op ,just for checking if disk if full. And synctask_new for write.
>     type performance/disk-usage
>     subvolumes ccs-write-behind
> end-volume
>
> volume ccs-read-ahead
>     type performance/read-ahead
>     subvolumes ccs-disk-usage
> end-volume
>
> Ps. Part of Our new translator code
>
> int
> du_writev (call_frame_t *frame, xlator_t *this, fd_t *fd,
>            struct iovec *vector, int count, off_t off, uint32_t flags,
>            struct iobref *iobref, dict_t *xdata)
> {
>         int op_errno = -1;
>         int ret = -1;
>         du_local_t *local = NULL;
>         loc_t tmp_loc = {0,};
>
>         VALIDATE_OR_GOTO (frame, err);
>         VALIDATE_OR_GOTO (this, err);
>         VALIDATE_OR_GOTO (fd, err);
>
>         tmp_loc.gfid[15] = 1;
>         tmp_loc.inode = fd->inode;
>         tmp_loc.parent = fd->inode;
>         local = du_local_init (frame, &tmp_loc, fd, GF_FOP_WRITE);
>         if (!local) {
>                 op_errno = ENOMEM;
>                 goto err;
>         }
>         local->vector = iov_dup (vector, count);
>         local->offset = off;
>         local->count = count;
>         local->flags = flags;
>         local->iobref = iobref_ref (iobref);
>
>         ret = synctask_new(this->ctx->env, du_get_du_info,du_writev_resume,frame,frame);
>

Can you paste the code of,
* du_get_du_info
* du_writev_resume

>         if(ret)
>         {
>                 op_errno = -1;
>                 gf_log (this->name, GF_LOG_WARNING,"synctask_new return failure ret(%d) ",ret);
>                 goto err;
>         }
>         return 0;
> err:
>         op_errno = (op_errno == -1) ? errno : op_errno;
>         DU_STACK_UNWIND (writev, frame, -1, op_errno, NULL, NULL, NULL);
>         return 0;
> }
>
> Br,
> Li Deqian
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From deqian.li at nokia-sbell.com Wed Jan 30 03:02:57 2019
From: deqian.li at nokia-sbell.com (Li, Deqian (NSB - CN/Hangzhou))
Date: Wed, 30 Jan 2019 03:02:57 +0000
Subject: [Gluster-users] query about glusterfs 3.12-3 write-behind.c coredump
In-Reply-To: 
References: <53269b77b7cf4bc59303101967083d95@nokia-sbell.com>
Message-ID: <44342369a0a946f18925228d8cf52148@nokia-sbell.com>

Hi,

Yes, thanks very much for your quick response.
I attach the whole file, not very big.
Br, Li Deqian From: Raghavendra Gowdappa Sent: Wednesday, January 30, 2019 11:00 AM To: Li, Deqian (NSB - CN/Hangzhou) Cc: gluster-users Subject: Re: query about glusterfs 3.12-3 write-behind.c coredump On Wed, Jan 30, 2019 at 7:35 AM Li, Deqian (NSB - CN/Hangzhou) > wrote: Hi, Could you help to check this coredump? We are using glusterfs 3.12-3(3 replicated bricks solution ) to do stability testing under high CPU load like 80% by stress and doing I/O. After several hours, coredump happened in glusterfs side . [Current thread is 1 (Thread 0x7ffff37d2700 (LWP 3696))] Missing separate debuginfos, use: dnf debuginfo-install rcp-pack-glusterfs-1.8.1_11_g99e9ca6-RCP2.wf28.x86_64 (gdb) bt #0 0x00007ffff0d5c845 in wb_fulfill (wb_inode=0x7fffd406b3b0, liabilities=0x7fffdc234b50) at write-behind.c:1148 #1 0x00007ffff0d5e4d5 in wb_process_queue (wb_inode=0x7fffd406b3b0) at write-behind.c:1718 #2 0x00007ffff0d5eda7 in wb_writev (frame=0x7fffe0086290, this=0x7fffec014b00, fd=0x7fffe4034070, vector=0x7fffdc445720, count=1, offset=67108863, flags=32770, iobref=0x7fffdc00d550, xdata=0x0) at write-behind.c:1825 #3 0x00007ffff0b51fcb in du_writev_resume (ret=0, frame=0x7fffdc0305a0, opaque=0x7fffdc0305a0) at disk-usage.c:490 #4 0x00007ffff7b3510d in synctask_wrap () at syncop.c:377 #5 0x00007ffff60d0660 in ?? () from /lib64/libc.so.6 #6 0x0000000000000000 in ?? 
() (gdb) p wb_inode $1 = (wb_inode_t *) 0x7fffd406b3b0 (gdb) frame 2 #2 0x00007ffff0d5eda7 in wb_writev (frame=0x7fffe0086290, this=0x7fffec014b00, fd=0x7fffe4034070, vector=0x7fffdc445720, count=1, offset=67108863, flags=32770, iobref=0x7fffdc00d550, xdata=0x0) at write-behind.c:1825 1825 in write-behind.c (gdb) p *fd $2 = {pid = 18154, flags = 32962, refcount = 0, inode_list = {next = 0x7fffe4034080, prev = 0x7fffe4034080}, inode = 0x0, lock = {spinlock = 0, mutex = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = -1, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' , "\377\377\377\377", '\000' , __align = 0}}, _ctx = 0x7fffe4022930, xl_count = 17, lk_ctx = 0x7fffe40350e0, anonymous = _gf_false} (gdb) p fd $3 = (fd_t *) 0x7fffe4034070 (gdb) p wb_inode->this $1 = (xlator_t *) 0xffffffffffffff00 After adding test log I found the FOP sequence in write-behind xlator side was mass as bellow showing. In the FUSE side the FLUSH is after write2, but in the WB side, FLUSH is between write2 ?wb_do_unwinds? and ?wb_fulfill?. So I think this should has problem. I think it?s possible that the FLUSH and later RELEASE operation will destroy the fd , it will cause ?wb_in->this(0xffffffffffffff00)?. Do you think so? And I think our new adding disk-usage xlator?s synctask_new will dealy the write operation, but the FLUSH operation without this delay(because not invoked the disk-usage xlator). Do you agree with my speculation ? and how to fix?(we don?t want to move the disk-usage xlator) Problematic FOP sequence : FUSE side: WB side: Write 1 write1 Write2 do unwind Write 2 FLUSH Release(destroy fd) FLUSH write2 (wb_fulfill) then coredump. 
Release

int
wb_fulfill (wb_inode_t *wb_inode, list_head_t *liabilities)
{
        wb_request_t  *req             = NULL;
        wb_request_t  *head            = NULL;
        wb_request_t  *tmp             = NULL;
        wb_conf_t     *conf            = NULL;
        off_t          expected_offset = 0;
        size_t         curr_aggregate  = 0;
        size_t         vector_count    = 0;
        int            ret             = 0;

        conf = wb_inode->this->private;   --> this line coredump

        list_for_each_entry_safe (req, tmp, liabilities, winds) {
                list_del_init (&req->winds);

        ....

volume ccs-write-behind
    type performance/write-behind
    subvolumes ccs-dht
end-volume

volume ccs-disk-usage      --> we add a new xlator here for write op, just for checking if the disk is full. And synctask_new for write.
    type performance/disk-usage
    subvolumes ccs-write-behind
end-volume

volume ccs-read-ahead
    type performance/read-ahead
    subvolumes ccs-disk-usage
end-volume

P.S. Part of our new translator code:

int
du_writev (call_frame_t *frame, xlator_t *this, fd_t *fd,
           struct iovec *vector, int count, off_t off, uint32_t flags,
           struct iobref *iobref, dict_t *xdata)
{
        int          op_errno = -1;
        int          ret      = -1;
        du_local_t  *local    = NULL;
        loc_t        tmp_loc  = {0,};

        VALIDATE_OR_GOTO (frame, err);
        VALIDATE_OR_GOTO (this, err);
        VALIDATE_OR_GOTO (fd, err);

        tmp_loc.gfid[15] = 1;
        tmp_loc.inode = fd->inode;
        tmp_loc.parent = fd->inode;
        local = du_local_init (frame, &tmp_loc, fd, GF_FOP_WRITE);
        if (!local) {
                op_errno = ENOMEM;
                goto err;
        }
        local->vector = iov_dup (vector, count);
        local->offset = off;
        local->count = count;
        local->flags = flags;
        local->iobref = iobref_ref (iobref);

        ret = synctask_new(this->ctx->env, du_get_du_info,du_writev_resume,frame,frame);

Can you paste the code of,
* du_get_du_info
* du_writev_resume

        if(ret)
        {
                op_errno = -1;
                gf_log (this->name, GF_LOG_WARNING,"synctask_new return failure ret(%d) ",ret);
                goto err;
        }
        return 0;
err:
        op_errno = (op_errno == -1) ?
                errno : op_errno;
        DU_STACK_UNWIND (writev, frame, -1, op_errno, NULL, NULL, NULL);
        return 0;
}

Br,
Li Deqian
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: disk-usage.c

From rgowdapp at redhat.com Wed Jan 30 04:14:18 2019
From: rgowdapp at redhat.com (Raghavendra Gowdappa)
Date: Wed, 30 Jan 2019 09:44:18 +0530
Subject: [Gluster-users] query about glusterfs 3.12-3 write-behind.c coredump
In-Reply-To: <53269b77b7cf4bc59303101967083d95@nokia-sbell.com>
References: <53269b77b7cf4bc59303101967083d95@nokia-sbell.com>
Message-ID:

On Wed, Jan 30, 2019 at 7:35 AM Li, Deqian (NSB - CN/Hangzhou) <deqian.li at nokia-sbell.com> wrote:

> After adding test logs I found that the FOP sequence on the write-behind
> xlator side was a mess, as shown below. On the FUSE side the FLUSH comes
> after write2, but on the WB side the FLUSH falls between the
> "wb_do_unwinds" and the "wb_fulfill" of write2.
>
> So I think this is a problem.

wb_do_unwinds the write after caching it. Once the write response is unwound, kernel issues a FLUSH. So, this is a valid sequence of operations and nothing wrong.

> I think it's possible that the FLUSH and the later RELEASE operation
> destroy the fd, which causes "wb_in->this(0xffffffffffffff00)". Do you
> think so?
>
> I also think the synctask_new in our newly added disk-usage xlator delays
> the write operation, while the FLUSH operation has no such delay (because
> it does not go through the disk-usage xlator).

Flush and release wait for completion of any on-going writes. So, unless write-behind has unwound the write, it won't see a flush. By the time write is unwound, write-behind makes sure to take references on objects it uses (like fds, iobref etc). So, I don't see a problem there.

> Do you agree with my speculation? And how can it be fixed? (We don't want
> to move the disk-usage xlator.)

I've still not found the RCA. We can discuss the fix once the RCA is found.

From rgowdapp at redhat.com Wed Jan 30 04:15:11 2019
From: rgowdapp at redhat.com (Raghavendra Gowdappa)
Date: Wed, 30 Jan 2019 09:45:11 +0530
Subject: [Gluster-users] query about glusterfs 3.12-3 write-behind.c coredump
In-Reply-To:
References: <53269b77b7cf4bc59303101967083d95@nokia-sbell.com>
Message-ID:

On Wed, Jan 30, 2019 at 9:44 AM Raghavendra Gowdappa wrote:

>> So I think this is a problem.
>
> wb_do_unwinds the write after caching it.

* wb_do_unwinds unwinds the write after caching it.

> Once the write response is unwound, kernel issues a FLUSH. So, this is a
> valid sequence of operations and nothing wrong.

From deqian.li at nokia-sbell.com Wed Jan 30 05:55:33 2019
From: deqian.li at nokia-sbell.com (Li, Deqian (NSB - CN/Hangzhou))
Date: Wed, 30 Jan 2019 05:55:33 +0000
Subject: [Gluster-users] query about glusterfs 3.12-3 write-behind.c coredump
In-Reply-To:
References: <53269b77b7cf4bc59303101967083d95@nokia-sbell.com>
Message-ID: <53d067d0b0484419941c25a3a3f8596f@nokia-sbell.com>

Hi,

>>Once the write response is unwound, kernel issues a FLUSH. So, this is a valid sequence of operations and nothing wrong.

I think I need to explain more of my meaning.
I added test logs into wb_process_queue() etc. and found the 4 logs below. Normally the log order should be 1, 4, 2, 3, because 1 and 4 belong to the same "write", but now it is 1, 2, 3, 4. I think that is abnormal (maybe caused by high CPU load or by the disk-usage xlator). I think that after the FLUSH (2, 3) the "fd" is destroyed, which then causes the incorrect pointer in 4. Do you agree with this speculation? Here "refcount" means "fd->refcount".

1. [2019-01-28 04:23:26.751555] E [MSGID: 131007] [write-behind.c:1237:wb_do_unwinds] 0-ldq before do un wind : wb_in->this(0x7fc544014b10),wb_in(0x7fc5140a9b40),fop=WRITE, fd=0x7fc53c054280,refcount:8, threadid:3320
2. [2019-01-28 04:23:26.778756] E [fuse-bridge.c:2562:fuse_flush] 0-glusterfs-fuse: 1674240:ldq FLUSH fd 0x7fc53c054280,threadid:5649
3. [2019-01-28 04:23:26.786131] E [MSGID: 131007] [write-behind.c:1715:wb_do_winds] 0-ldq before do wind : wb_in->this(0x7fc544014b10),wb_in(0x7fc5140a9b40),fop=FLUSH, fd=0x7fc53c054280,refcount:14, threadid:5646
4. [2019-01-28 04:23:26.798515] E [MSGID: 131007] [write-behind.c:1755:wb_process_queue] 0-ldq before fulfill : wb_in->this(0xffffffffffffff00),wb_in(0x7fc5140a9b40), threadid:3320

>>Flush and release wait for completion of any on-going writes. So, unless write-behind has unwound the write, it won't see a flush. By the time write is unwound, write-behind makes sure to take references on objects it uses (like fds, iobref etc).

In wb_process_queue() we can see that after wb_do_unwinds() it still needs to call wb_do_winds() and wb_fulfill(). Do you think that once a "write" operation has executed "wb_do_unwinds()", it wakes up the next operation in the FUSE queue, such as "FLUSH", and that this "FLUSH" can finish its job and then release the "fd"? Then the thread of the previous "write" continues into "wb_fulfill()" and dumps core?

Also, do you have more detailed documentation about the glusterfs write-behind, especially about how "fd->refcount" is incremented and decremented?

Thanks.

P.S. The debug code is in red.
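As a point of reference for the refcount question above, here is a minimal single-threaded sketch of the general pattern Raghavendra describes, where the write path takes its own reference before replying so that a later release cannot destroy the object underneath it. This is not GlusterFS code: `demo_fd_t`, `demo_fd_ref`, `demo_fd_unref` and `demo_write_then_release` are hypothetical names invented for illustration, and a real implementation would protect the counter with a lock or atomics.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for an fd-like object; NOT the GlusterFS fd_t. */
typedef struct {
    int  refcount;   /* number of holders that still need the object */
    bool destroyed;  /* set instead of calling free() so the demo can inspect it */
} demo_fd_t;

/* Take a reference: the caller promises a matching demo_fd_unref(). */
static demo_fd_t *demo_fd_ref(demo_fd_t *fd)
{
    fd->refcount++;
    return fd;
}

/* Drop a reference; the last holder "destroys" the object. */
static void demo_fd_unref(demo_fd_t *fd)
{
    assert(fd->refcount > 0);
    if (--fd->refcount == 0)
        fd->destroyed = true;
}

/* Replays the order seen in the logs: unwind, then release, then the
 * delayed fulfill step. Returns true if the object was still alive when
 * the delayed step ran and was destroyed only after the last unref. */
bool demo_write_then_release(void)
{
    demo_fd_t fd = { .refcount = 1, .destroyed = false }; /* ref held by open */

    demo_fd_t *queued = demo_fd_ref(&fd);  /* write path pins the object first */
    /* ... the reply to the application would be sent here (the "unwind") ... */

    demo_fd_unref(&fd);                    /* release drops the open reference */

    bool alive_at_fulfill = !queued->destroyed; /* delayed step sees a live object */

    demo_fd_unref(queued);                 /* write path is done; last ref goes away */
    return alive_at_fulfill && fd.destroyed;
}
```

Under this pattern a release arriving between the unwind and the delayed step is harmless, so a dangling pointer at that point would suggest that some reference was not taken, or was dropped too early, somewhere along the path.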
void
wb_process_queue (wb_inode_t *wb_inode)
{
        list_head_t tasks        = {0, };
        list_head_t lies         = {0, };
        list_head_t liabilities  = {0, };
        int         wind_failure = 0;

        INIT_LIST_HEAD (&tasks);
        INIT_LIST_HEAD (&lies);
        INIT_LIST_HEAD (&liabilities);

        do {
                gf_log_callingfn (wb_inode->this->name, GF_LOG_DEBUG,
                                  "processing queues");

                LOCK (&wb_inode->lock);
                {
                        __wb_preprocess_winds (wb_inode);

                        __wb_pick_winds (wb_inode, &tasks, &liabilities);

                        __wb_pick_unwinds (wb_inode, &lies);

                }
                UNLOCK (&wb_inode->lock);

                wb_do_unwinds (wb_inode, &lies);

                wb_do_winds (wb_inode, &tasks);

                gf_msg ("ldq before fulfill ", GF_LOG_ERROR, 0,
                        WRITE_BEHIND_MSG_RES_UNAVAILABLE,
                        "wb_in->this(%p),wb_in(%p), threadid:%ld",
                        wb_inode->this,wb_inode, syscall(SYS_gettid));

                /* If there is an error in wb_fulfill before winding write
                 * requests, we would miss invocation of wb_process_queue
                 * from wb_fulfill_cbk. So, retry processing again.
                 */
                wind_failure = wb_fulfill (wb_inode, &liabilities);

                gf_msg ("ldq after fulfill ", GF_LOG_ERROR, 0,
                        WRITE_BEHIND_MSG_RES_UNAVAILABLE,
                        "wb_in->this(%p),wb_in(%p), thid:%ld",
                        wb_inode->this,wb_inode,syscall(SYS_gettid));
        } while (wind_failure);

        return;
}

void
wb_do_unwinds (wb_inode_t *wb_inode, list_head_t *lies)
{
        wb_request_t *req   = NULL;
        wb_request_t *tmp   = NULL;
        call_frame_t *frame = NULL;
        struct iatt   buf   = {0, };

        list_for_each_entry_safe (req, tmp, lies, unwinds) {
                frame = req->stub->frame;

                STACK_UNWIND_STRICT (writev, frame, req->op_ret, req->op_errno,
                                     &buf, &buf, NULL); /* :O */
                req->stub->frame = NULL;

                list_del_init (&req->unwinds);

                if(req->fd)
                        gf_msg ("ldq before do un wind ", GF_LOG_ERROR, 0,
                                WRITE_BEHIND_MSG_RES_UNAVAILABLE,
                                "wb_in->this(%p),wb_in(%p),fop=%s, fd=%p,refcount:%d, threadid:%ld",
                                wb_inode->this,wb_inode,gf_fop_list[req->fop],
                                req->fd,req->fd->refcount,syscall(SYS_gettid));

                wb_request_unref (req);
        }

        return;
}

Br,
Li Deqian

From: Raghavendra Gowdappa
Sent: Wednesday, January 30, 2019 12:15 PM
To: Li, Deqian (NSB - CN/Hangzhou)
Cc: gluster-users
Subject: Re: query
about glusterfs 3.12-3 write-behind.c coredump

From spisla80 at gmail.com Wed Jan 30 11:14:54 2019
From: spisla80 at gmail.com (David Spisla)
Date: Wed, 30 Jan 2019 12:14:54 +0100
Subject: [Gluster-users] VolumeOpt Set fails of a freshly created volume
In-Reply-To:
References:
Message-ID:

Hello Gluster Community,

today I got the same error messages in glusterd.log when setting volume options of a freshly created volume.
See the log entry:

[2019-01-30 10:15:55.597268] I [run.c:242:runner_log] (-->/usr/lib64/glusterfs/5.3/xlator/mgmt/glusterd.so(+0xdad2a) [0x7f08ce71ed2a] -->/usr/lib64/glusterfs/5.3/xlator/mgmt/glusterd.so(+0xda81c) [0x7f08ce71e81c] -->/usr/lib64/libglusterfs.so.0(runner_log+0x105) [0x7f08d4bd0575] ) 0-management: Ran script: /var/lib/glusterd/hooks/1/set/post/S32gluster_enable_shared_storage.sh --volname=integration-archive1 -o cluster.lookup-optimize=on --gd-workdir=/var/lib/glusterd
*[2019-01-30 10:15:55.806303] W [socket.c:719:__socket_rwv] 0-management: readv on 10.10.12.102:24007 failed (Input/output error)*
*[2019-01-30 10:15:55.806344] E [socket.c:246:ssl_dump_error_stack] 0-management: error:140943F2:SSL routines:ssl3_read_bytes:sslv3 alert unexpected message*
The message "E [MSGID: 101191] [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch handler" repeated 51 times between [2019-01-30 10:15:51.659656] and [2019-01-30 10:15:55.635151]
[2019-01-30 10:15:55.806370] I [MSGID: 106004] [glusterd-handler.c:6430:__glusterd_peer_rpc_notify] 0-management: Peer (), in state , has disconnected from glusterd.
[2019-01-30 10:15:55.806487] W [glusterd-locks.c:795:glusterd_mgmt_v3_unlock] (-->/usr/lib64/glusterfs/5.3/xlator/mgmt/glusterd.so(+0x24349) [0x7f08ce668349] -->/usr/lib64/glusterfs/5.3/xlator/mgmt/glusterd.so(+0x2d950) [0x7f08ce671950] -->/usr/lib64/glusterfs/5.3/xlator/mgmt/glusterd.so(+0xe0239) [0x7f08ce724239] ) 0-management: Lock for vol archive1 not held
[2019-01-30 10:15:55.806505] W [MSGID: 106117] [glusterd-handler.c:6451:__glusterd_peer_rpc_notify] 0-management: Lock not released for archive1
[2019-01-30 10:15:55.806522] W [glusterd-locks.c:795:glusterd_mgmt_v3_unlock] (-->/usr/lib64/glusterfs/5.3/xlator/mgmt/glusterd.so(+0x24349) [0x7f08ce668349] -->/usr/lib64/glusterfs/5.3/xlator/mgmt/glusterd.so(+0x2d950) [0x7f08ce671950] -->/usr/lib64/glusterfs/5.3/xlator/mgmt/glusterd.so(+0xe0239) [0x7f08ce724239] ) 0-management: Lock for vol archive2 not held
[2019-01-30 10:15:55.806529] W [MSGID: 106117] [glusterd-handler.c:6451:__glusterd_peer_rpc_notify] 0-management: Lock not released for archive2
[2019-01-30 10:15:55.806543] W [glusterd-locks.c:795:glusterd_mgmt_v3_unlock] (-->/usr/lib64/glusterfs/5.3/xlator/mgmt/glusterd.so(+0x24349) [0x7f08ce668349] -->/usr/lib64/glusterfs/5.3/xlator/mgmt/glusterd.so(+0x2d950) [0x7f08ce671950] -->/usr/lib64/glusterfs/5.3/xlator/mgmt/glusterd.so(+0xe0239) [0x7f08ce724239] ) 0-management: Lock for vol gluster_shared_storage not held
[2019-01-30 10:15:55.806553] W [MSGID: 106117] [glusterd-handler.c:6451:__glusterd_peer_rpc_notify] 0-management: Lock not released for gluster_shared_storage
[2019-01-30 10:15:55.806576] W [glusterd-locks.c:806:glusterd_mgmt_v3_unlock] (-->/usr/lib64/glusterfs/5.3/xlator/mgmt/glusterd.so(+0x24349) [0x7f08ce668349] -->/usr/lib64/glusterfs/5.3/xlator/mgmt/glusterd.so(+0x2d950) [0x7f08ce671950] -->/usr/lib64/glusterfs/5.3/xlator/mgmt/glusterd.so(+0xe0074) [0x7f08ce724074] ) 0-management: Lock owner mismatch. Lock for vol integration-archive1 held by 451b6e04-5098-4a35-a312-edbb0d8328a0
[2019-01-30 10:15:55.806584] W [MSGID: 106117] [glusterd-handler.c:6451:__glusterd_peer_rpc_notify] 0-management: Lock not released for integration-archive1
[2019-01-30 10:15:55.806846] E [rpc-clnt.c:346:saved_frames_unwind] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x17d)[0x7f08d4b8122d] (--> /usr/lib64/libgfrpc.so.0(+0xca3d)[0x7f08d4948a3d] (--> /usr/lib64/libgfrpc.so.0(+0xcb5e)[0x7f08d4948b5e] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x8b)[0x7f08d494a0bb] (--> /usr/lib64/libgfrpc.so.0(+0xec68)[0x7f08d494ac68] ))))) 0-management: forced unwinding frame type(glusterd mgmt v3) op(--(1)) called at 2019-01-30 10:15:55.804680 (xid=0x1ae)
[2019-01-30 10:15:55.806865] E [MSGID: 106115] [glusterd-mgmt.c:116:gd_mgmt_v3_collate_errors] 0-management: Locking failed on fs-lrunning-c2-n2. Please check log file for details.
[2019-01-30 10:15:55.806914] E [MSGID: 101191] [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch handler
[2019-01-30 10:15:55.806898] E [MSGID: 106150] [glusterd-syncop.c:1904:gd_sync_task_begin] 0-management: Locking Peers Failed.
The message "E [MSGID: 101191] [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch handler" repeated 4 times between [2019-01-30 10:15:55.806914] and [2019-01-30 10:15:56.322122]
[2019-01-30 10:15:56.322287] E [MSGID: 106529] [glusterd-volume-ops.c:1916:glusterd_op_stage_delete_volume] 0-management: Some of the peers are down
[2019-01-30 10:15:56.322319] E [MSGID: 106301] [glusterd-syncop.c:1308:gd_stage_op_phase] 0-management: Staging of operation 'Volume Delete' failed on localhost : Some of the peers are down

Again my peer "fs-lrunning-c2-n2" is not connected, and again there is an SSL error message.

@Milind Changire Any idea whether this SSL error is related to the peer disconnect problem? Or is there a problem with the port mapping in Gluster v5.x?
Regards
David Spisla

On Thu, Jan 17, 2019 at 03:42, Atin Mukherjee <amukherj at redhat.com> wrote:

> On Wed, Jan 16, 2019 at 9:48 PM David Spisla wrote:
>
>> Dear Gluster Community,
>>
>> I created a replica 4 volume from gluster-node1 on a 4-node cluster with
>> SSL/TLS network encryption. While setting the 'cluster.use-compound-fops'
>> option, I got the error:
>>
>> $ volume set: failed: Commit failed on gluster-node2. Please check log
>> file for details.
>>
>> Here is the glusterd.log from gluster-node1:
>>
>> *[2019-01-15 15:18:36.813034] I [run.c:242:runner_log] (-->/usr/lib64/glusterfs/5.2/xlator/mgmt/glusterd.so(+0xdad2a) [0x7fc24d91cd2a] -->/usr/lib64/glusterfs/5.2/xlator/mgmt/glusterd.so(+0xda81c) [0x7fc24d91c81c] -->/usr/lib64/libglusterfs.so.0(runner_log+0x105) [0x7fc253dce0b5] ) 0-management: Ran script: /var/lib/glusterd/hooks/1/set/post/S30samba-set.sh --volname=integration-archive1 -o cluster.use-compound-fops=on --gd-workdir=/var/lib/glusterd*
>> [2019-01-15 15:18:36.821193] I [run.c:242:runner_log] (-->/usr/lib64/glusterfs/5.2/xlator/mgmt/glusterd.so(+0xdad2a) [0x7fc24d91cd2a] -->/usr/lib64/glusterfs/5.2/xlator/mgmt/glusterd.so(+0xda81c) [0x7fc24d91c81c] -->/usr/lib64/libglusterfs.so.0(runner_log+0x105) [0x7fc253dce0b5] ) 0-management: Ran script: /var/lib/glusterd/hooks/1/set/post/S32gluster_enable_shared_storage.sh --volname=integration-archive1 -o cluster.use-compound-fops=on --gd-workdir=/var/lib/glusterd
>> [2019-01-15 15:18:36.842383] W [socket.c:719:__socket_rwv] 0-management: readv on 10.10.12.42:24007 failed (Input/output error)
>> *[2019-01-15 15:18:36.842415] E [socket.c:246:ssl_dump_error_stack] 0-management: error:140943F2:SSL routines:ssl3_read_bytes:sslv3 alert unexpected message*
>> The message "E [MSGID: 101191] [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch handler" repeated 81 times between [2019-01-15 15:18:30.735508] and [2019-01-15 15:18:36.808994]
>> [2019-01-15 15:18:36.842439] I [MSGID: 106004] [glusterd-handler.c:6430:__glusterd_peer_rpc_notify] 0-management: Peer <gluster-node2> (<02724bb6-cb34-4ec3-8306-c2950e0acf9b>), in state <Peer in Cluster>, has disconnected from glusterd.
>
> The above shows there was a peer disconnect event received from
> gluster-node2 and this sequence might have happened while the commit
> operation was in-flight and hence the volume set failed on gluster-node2.
> Related to ssl error, I'd request Milind to comment.
>
>> [2019-01-15 15:18:36.842638] W [glusterd-locks.c:795:glusterd_mgmt_v3_unlock] (-->/usr/lib64/glusterfs/5.2/xlator/mgmt/glusterd.so(+0x24349) [0x7fc24d866349] -->/usr/lib64/glusterfs/5.2/xlator/mgmt/glusterd.so(+0x2d950) [0x7fc24d86f950] -->/usr/lib64/glusterfs/5.2/xlator/mgmt/glusterd.so(+0xe0239) [0x7fc24d922239] ) 0-management: Lock for vol archive1 not held
>> [2019-01-15 15:18:36.842656] W [MSGID: 106117] [glusterd-handler.c:6451:__glusterd_peer_rpc_notify] 0-management: Lock not released for archive1
>> [2019-01-15 15:18:36.842674] W [glusterd-locks.c:795:glusterd_mgmt_v3_unlock] (-->/usr/lib64/glusterfs/5.2/xlator/mgmt/glusterd.so(+0x24349) [0x7fc24d866349] -->/usr/lib64/glusterfs/5.2/xlator/mgmt/glusterd.so(+0x2d950) [0x7fc24d86f950] -->/usr/lib64/glusterfs/5.2/xlator/mgmt/glusterd.so(+0xe0239) [0x7fc24d922239] ) 0-management: Lock for vol archive2 not held
>> [2019-01-15 15:18:36.842680] W [MSGID: 106117] [glusterd-handler.c:6451:__glusterd_peer_rpc_notify] 0-management: Lock not released for archive2
>> [2019-01-15 15:18:36.842694] W [glusterd-locks.c:795:glusterd_mgmt_v3_unlock] (-->/usr/lib64/glusterfs/5.2/xlator/mgmt/glusterd.so(+0x24349) [0x7fc24d866349] -->/usr/lib64/glusterfs/5.2/xlator/mgmt/glusterd.so(+0x2d950) [0x7fc24d86f950] -->/usr/lib64/glusterfs/5.2/xlator/mgmt/glusterd.so(+0xe0239) [0x7fc24d922239] ) 0-management: Lock for vol
gluster_shared_storage not >> held >> [2019-01-15 15:18:36.842702] W [MSGID: 106117] >> [glusterd-handler.c:6451:__glusterd_peer_rpc_notify] 0-management: Lock not >> released for gluster_shared_storage >> [2019-01-15 15:18:36.842719] W >> [glusterd-locks.c:806:glusterd_mgmt_v3_unlock] >> (-->/usr/lib64/glusterfs/5.2/xlator/mgmt/glusterd.so(+0x24349) >> [0x7fc24d866349] >> -->/usr/lib64/glusterfs/5.2/xlator/mgmt/glusterd.so(+0x2d950) >> [0x7fc24d86f950] >> -->/usr/lib64/glusterfs/5.2/xlator/mgmt/glusterd.so(+0xe0074) >> [0x7fc24d922074] ) 0-management: Lock owner mismatch. Lock for vol >> integration-archive1 held by ffdaa400-82cc-4ada-8ea7-144bf3714269 >> [2019-01-15 15:18:36.842727] W [MSGID: 106117] >> [glusterd-handler.c:6451:__glusterd_peer_rpc_notify] 0-management: Lock not >> released for integration-archive1 >> [2019-01-15 15:18:36.842970] E [rpc-clnt.c:346:saved_frames_unwind] (--> >> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x17d)[0x7fc253d7f18d] (--> >> /usr/lib64/libgfrpc.so.0(+0xca3d)[0x7fc253b46a3d] (--> >> /usr/lib64/libgfrpc.so.0(+0xcb5e)[0x7fc253b46b5e] (--> >> /usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x8b)[0x7fc253b480bb] >> (--> /usr/lib64/libgfrpc.so.0(+0xec68)[0x7fc253b48c68] ))))) 0-management: >> forced unwinding frame type(glusterd mgmt) op(--(4)) called at 2019-01-15 >> 15:18:36.802613 (xid=0x6da) >> [2019-01-15 15:18:36.842994] E [MSGID: 106152] >> [glusterd-syncop.c:104:gd_collate_errors] 0-glusterd: Commit failed on >> gluster-node2. Please check log file for details. 
>> >> And here glusterd.log from gluster-node2: >> >> *[2019-01-15 15:18:36.901788] I [run.c:242:runner_log] >> (-->/usr/lib64/glusterfs/5.2/xlator/mgmt/glusterd.so(+0xdad2a) >> [0x7f9fba02cd2a] >> -->/usr/lib64/glusterfs/5.2/xlator/mgmt/glusterd.so(+0xda81c) >> [0x7f9fba02c81c] -->/usr/lib64/libglusterfs.so.0(runner_log+0x105) >> [0x7f9fc04de0b5] ) 0-management: Ran script: >> /var/lib/glusterd/hooks/1/set/post/S32gluster_enable_shared_storage.sh >> --volname=integration-archive1 -o cluster.use-compound-fops=on >> --gd-workdir=/var/lib/glusterd* >> The message "E [MSGID: 101191] >> [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch >> handler" repeated 35 times between [2019-01-15 15:18:24.832023] and >> [2019-01-15 15:18:47.049407] >> [2019-01-15 15:18:47.049443] I [MSGID: 106163] >> [glusterd-handshake.c:1389:__glusterd_mgmt_hndsk_versions_ack] >> 0-management: using the op-version 50000 >> [2019-01-15 15:18:47.053439] E [MSGID: 101191] >> [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch >> handler >> [2019-01-15 15:18:47.053479] E [MSGID: 101191] >> [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch >> handler >> [2019-01-15 15:18:47.059899] I [MSGID: 106490] >> [glusterd-handler.c:2586:__glusterd_handle_incoming_friend_req] 0-glusterd: >> Received probe from uuid: ffdaa400-82cc-4ada-8ea7-144bf3714269 >> [2019-01-15 15:18:47.063471] I [MSGID: 106493] >> [glusterd-handler.c:3843:glusterd_xfer_friend_add_resp] 0-glusterd: >> Responded to fs-lrunning-c1-n1 (0), ret: 0, op_ret: 0 >> [2019-01-15 15:18:47.066148] I [MSGID: 106492] >> [glusterd-handler.c:2771:__glusterd_handle_friend_update] 0-glusterd: >> Received friend update from uuid: ffdaa400-82cc-4ada-8ea7-144bf3714269 >> [2019-01-15 15:18:47.067264] I [MSGID: 106502] >> [glusterd-handler.c:2812:__glusterd_handle_friend_update] 0-management: >> Received my uuid as Friend >> [2019-01-15 15:18:47.078696] I [MSGID: 106493] >> 
[glusterd-rpc-ops.c:681:__glusterd_friend_update_cbk] 0-management: >> Received ACC from uuid: ffdaa400-82cc-4ada-8ea7-144bf3714269 >> [2019-01-15 15:19:05.377216] E [MSGID: 101191] >> [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch >> handler >> The message "E [MSGID: 101191] >> [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch >> handler" repeated 3 times between [2019-01-15 15:19:05.377216] and >> [2019-01-15 15:19:06.124297] >> >> Maybe there was only a temporary network interruption, but on the other >> hand there is an SSL error message in the log file from gluster-node1. >> Any ideas? >> >> Regards >> David Spisla

From g.amedick at uni-luebeck.de Wed Jan 30 13:42:32 2019
From: g.amedick at uni-luebeck.de (Gudrun Mareike Amedick)
Date: Wed, 30 Jan 2019 14:42:32 +0100
Subject: [Gluster-users] Files losing permissions in GlusterFS 3.12
In-Reply-To: <1548667421.10294.221.camel@uni-luebeck.de>
References: <1548429058.2018.12.camel@uni-luebeck.de> <1548667421.10294.221.camel@uni-luebeck.de>
Message-ID: <1548855752.2018.39.camel@uni-luebeck.de>

Hi, a bit of additional info inline.

On Monday, 28.01.2019, at 10:23 +0100, Frank Ruehlemann wrote: > On Monday, 28.01.2019, at 09:50 +0530, Nithya Balachandran wrote: > > > > On Fri, 25 Jan 2019 at 20:51, Gudrun Mareike Amedick < > > g.amedick at uni-luebeck.de> wrote: > > > > > > > > Hi all, > > > > > > we have a problem with a distributed dispersed volume (GlusterFS 3.12). We > > > have files that lost their permissions or gained sticky bits. The files > > > themselves seem to be okay.
> > > > > > It looks like this: > > > > > > # ls -lah $file1 > > > ---------- 1 www-data www-data 45M Jan 12 07:01 $file1 > > > > > > # ls -lah $file2 > > > -rw-rwS--T 1 $user $group 11K Jan  9 11:48 $file2 > > > > > > # ls -lah $file3 > > > ---------T 1 $user $group 6.8M Jan 12 08:17 $file3 > > > > > > These are linkto files (internal dht files) and should not be visible on > > the mount point. Are they consistently visible like this or do they revert > > to the proper permissions after some time? > They didn't heal yet, even after more than 4 weeks. Therefore we decided > to recommend our users to fix their files by setting the correct > permissions again, which worked without problems. But for analysis > reasons we still have some broken files nobody touched yet. > > We know these linkto files but they were never visible to clients. We > did these ls-commands on a client, not on a brick.

They have linkfile permissions, but on the brick side it looks like this:

root at gluster06:~# ls -lah /$brick/$file3
---------T 2 $user $group 1.7M Jan 12 08:17 /$brick/$file3

That seems to be too big for a linkfile. Also, there is no file it could link to. There's no other file with that name at that path on any other subvolume.

> > > > > > > > > This is not what the permissions are supposed to look like. They were 644 or > > > 660 before. And they definitely had no sticky bits. > > > The permissions on the bricks match what I see on client side. So I think > > > the original permissions are lost without a chance to recover them, right? > > > > > > > > > With some files with weird looking permissions (but not with all of them), > > > I can do this: > > > # ls -lah $path/$file4 > > > -rw-r--r-- 1 $user $group 6.0G Oct 11 09:34 $path/$file4 > > > ls -lah $path | grep $file4 > > > -rw-r-Sr-T  1 $user $group 6.0G Oct 11 09:34 $file4 > > > > > > > > So, the permissions I see depend on how I'm querying them.
The permissions > > > on brick side agree with the latter result, stat sees the former. I'm not > > > sure how that works. > > > > > The S and T bits indicate that a file is being migrated. The difference > > seems to be because of the way lookup versus readdirp handle this - this > > looks like a bug. Lookup will strip out the internal permissions set. I > > don't think readdirp does. This is happening because a rebalance is in > > progress. > There is no active rebalance. At least none is visible in "gluster volume rebalance > $VOLUME status". > > And the last line in the rebalance log file of this volume is: > "[2019-01-11 02:14:50.101944] W ? received signum (15), shutting down" > > > > > > > > > We know for at least a part of those files that they were okay on December > > > 19th. We got the first reports of weird-looking permissions on January > > > 12th. In between, there was a rebalance running (January 7th to January > > > 11th). During that rebalance, a node was offline for a longer period of time > > > due to hardware issues. The output of "gluster volume heal $VOLUME info" > > > shows no files though.
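To make the permission strings in this thread easier to read, here is a small sketch that decodes them with Python's standard `stat` module. It shows that `-rw-r-Sr-T` is mode 644 plus the setgid (S) and sticky (T) bits, and that masking those two bits off yields the `-rw-r--r--` reported by a direct lookup. The helper names `strip_migration_bits` and `looks_like_linkto_mode` are hypothetical, and the rule "a linkto file is a regular file whose only mode bit is the sticky bit" is an assumption drawn from the `---------T` "linkfile permissions" mentioned above. A real check on a brick would also have to inspect the file's extended attributes and size.

```python
import stat

# Bits that, according to the explanation in this thread, mark a file as
# "being migrated" by DHT: setgid (shown as S) and sticky (shown as T).
MIGRATION_BITS = stat.S_ISGID | stat.S_ISVTX


def strip_migration_bits(mode):
    """Return the mode with the setgid and sticky bits cleared."""
    return mode & ~MIGRATION_BITS


def looks_like_linkto_mode(mode):
    """Assumption: a linkto file shows up as a regular file with mode ---------T."""
    return stat.S_ISREG(mode) and stat.S_IMODE(mode) == stat.S_ISVTX


if __name__ == "__main__":
    # $file4 as seen through readdir: regular file, 644 + setgid + sticky.
    file4_readdir = stat.S_IFREG | 0o3644
    print(stat.filemode(file4_readdir))
    print(stat.filemode(strip_migration_bits(file4_readdir)))

    # $file3 as seen on client and brick: only the sticky bit is set.
    file3 = stat.S_IFREG | 0o1000
    print(stat.filemode(file3), looks_like_linkto_mode(file3))
```

On a brick, the mode passed to these helpers could come from `os.stat(path).st_mode`. Since the file reported here is 1.7M and has no target on another subvolume, a matching mode alone would not prove that it really is a linkto file.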
> > > > > > For all files with broken permissions we found so far, the following lines > > > are in the rebalance log: > > > > > > [2019-01-07 09:31:11.004802] I [MSGID: 109045] > > > [dht-common.c:2456:dht_lookup_cbk] 0-$VOLUME-dht: linkfile not having link > > > subvol for $file5 > > > [2019-01-07 09:31:11.262273] I [MSGID: 109069] > > > [dht-common.c:1410:dht_lookup_unlink_of_false_linkto_cbk] 0-$VOLUME-dht: > > > lookup_unlink returned with > > > op_ret -> 0 and op-errno -> 0 for $file5 > > > [2019-01-07 09:31:11.266014] I [dht-rebalance.c:1570:dht_migrate_file] > > > 0-$VOLUME-dht: $file5: attempting to move from $VOLUME-readdir-ahead-0 to > > > $VOLUME-readdir-ahead-5 > > > [2019-01-07 09:31:11.278120] I [dht-rebalance.c:1570:dht_migrate_file] > > > 0-$VOLUME-dht: $file5: attempting to move from $VOLUME-readdir-ahead-0 to > > > $VOLUME-readdir-ahead-5 > > > [2019-01-07 09:31:11.732175] W [dht-rebalance.c:2159:dht_migrate_file] > > > 0-$VOLUME-dht: $file5: failed to perform removexattr on > > > $VOLUME-readdir-ahead-0 > > > (No data available) > > > [2019-01-07 09:31:11.737319] W [MSGID: 109023] > > > [dht-rebalance.c:2179:dht_migrate_file] 0-$VOLUME-dht: $file5: failed to do > > > a stat on $VOLUME-readdir- > > > ahead-0 [No such file or directory] > > > [2019-01-07 09:31:11.744382] I [MSGID: 109022] > > > [dht-rebalance.c:2218:dht_migrate_file] 0-$VOLUME-dht: completed migration > > > of $file5 from subvolume > > > $VOLUME-readdir-ahead-0 to $VOLUME-readdir-ahead-5 > > > [2019-01-07 09:31:11.744676] I [MSGID: 109022] > > > [dht-rebalance.c:2218:dht_migrate_file] 0-$VOLUME-dht: completed migration > > > of $file5 from subvolume > > > $VOLUME-readdir-ahead-0 to $VOLUME-readdir-ahead-5 > > > > > > > > > > > > I've searched the brick logs for $file5 with broken permissions and found > > > this on all bricks from (I think) the subvolume $VOLUME-readdir-ahead-5: > > > > > > [2019-01-07 09:32:13.821545] I [MSGID: 113030] [posix.c:2171:posix_unlink] > > > 
0-$VOLUME-posix: open-fd-key-status: 0 for $file5 > > > [2019-01-07 09:32:13.821609] I [MSGID: 113031] > > > [posix.c:2084:posix_skip_non_linkto_unlink] 0-posix: linkto_xattr status: 0 > > > for $file5 > > > > > > > > > > > > Also, we noticed that many directories got their modification time > > > updated. It was set to the rebalance date. Is that supposed to happen? > > > > > > > > > We had parallel-readdir enabled during the rebalance. We disabled it since > > > we had empty directories that couldn't be deleted. I was able to delete > > > those dirs after that. > > > > > Was this disabled during the rebalance? parallel-readdirp changes the > > volume graph for clients but not for the rebalance process causing it to > > fail to find the linkto subvols. > Yes, parallel-readdirp was enabled during the rebalance. But we disabled > it after some files were invisible on the client side again.

The timetable looks like this:

December 12th: parallel-readdir enabled
January 7th: rebalance started
January 11th/12th: rebalance finished (varied a bit, some servers were faster)
January 15th: parallel-readdir disabled

> > > > > > > > > > > > Also, we have directories that lost their GFID on some bricks. Again. > > > > Is this the missing symlink problem that was reported earlier?

It looks like it. I had a dir with a missing GFID on one brick, I couldn't see some files on the client side, I recreated the GFID symlink and everything was fine again. And in the brick log, I had this entry (with 1d372a8a-4958-4700-8ef1-fa4f756baad3 being the GFID of the dir in question):

[2019-01-13 17:57:55.020859] W [MSGID: 113103] [posix.c:301:posix_lookup] 0-$VOLUME-posix: Found stale gfid handle /srv/glusterfs/bricks/$brick/data/.glusterfs/1d/37/1d372a8a-4958-4700-8ef1-fa4f756baad3, removing it. [No such file or directory]

Very familiar. At least, I know how to fix that :D

Kind regards

Gudrun

> > > > Regards, > > Nithya > > > > > > > > > > > > > > > > > What happened? Can we do something to fix this?
And could that happen > > > again? > > > > > > We want to upgrade to 4.1 soon. Is it safe to do that or could it make > > > things worse? > > > > > > Kind regards > > > > > > Gudrun Amedick

From amudhan83 at gmail.com Wed Jan 30 13:55:58 2019
From: amudhan83 at gmail.com (Amudhan P)
Date: Wed, 30 Jan 2019 19:25:58 +0530
Subject: [Gluster-users] glusterfs 4.1.6 error in starting glusterd service
In-Reply-To: References: Message-ID:

Hi Atin,

yes, it worked, thank you.

What would be the cause of this issue?

On Fri, Jan 25, 2019 at 1:56 PM Atin Mukherjee wrote: > Amudhan, > > So here's the issue: > > In node3, 'cat /var/lib/glusterd/peers/* ' doesn't show up node2's details > and that's why glusterd wasn't able to resolve the brick(s) hosted on node2. > > Can you please pick up 0083ec0c-40bf-472a-a128-458924e56c96 file from > /var/lib/glusterd/peers/ from node 4 and place it in the same location in > node 3 and then restart glusterd service on node 3? > > > On Thu, Jan 24, 2019 at 11:57 AM Amudhan P wrote: > >> Atin, >> >> Sorry, I forgot to send the entire `glusterd` folder. The attached zip now >> contains the `glusterd` folder from all nodes. >> >> The problem node is node3 (IP 10.1.2.3); the `glusterd` log file is inside >> the node3 folder.
>> >> regards >> Amudhan >> >> On Wed, Jan 23, 2019 at 11:02 PM Atin Mukherjee >> wrote: >> >>> Amudhan, >>> >>> I see that you have provided the content of the configuration of the >>> volume gfs-tst where the request was to share the dump of >>> /var/lib/glusterd/* . I can not debug this further until you share the >>> correct dump. >>> >>> On Thu, Jan 17, 2019 at 3:43 PM Atin Mukherjee >>> wrote: >>> >>>> Can you please run 'glusterd -LDEBUG' and share back the glusterd.log? >>>> Instead of doing too many back and forth I suggest you to share the content >>>> of /var/lib/glusterd from all the nodes. Also do mention which particular >>>> node the glusterd service is unable to come up. >>>> >>>> On Thu, Jan 17, 2019 at 11:34 AM Amudhan P wrote: >>>> >>>>> I have created the folder in the path as said but still, service >>>>> failed to start below is the error msg in glusterd.log >>>>> >>>>> [2019-01-16 14:50:14.555742] I [MSGID: 100030] >>>>> [glusterfsd.c:2741:main] 0-/usr/local/sbin/glusterd: Started running >>>>> /usr/local/sbin/glusterd version 4.1.6 (args: /usr/local/sbin/glusterd -p >>>>> /var/run/glusterd.pid) >>>>> [2019-01-16 14:50:14.559835] I [MSGID: 106478] [glusterd.c:1423:init] >>>>> 0-management: Maximum allowed open file descriptors set to 65536 >>>>> [2019-01-16 14:50:14.559894] I [MSGID: 106479] [glusterd.c:1481:init] >>>>> 0-management: Using /var/lib/glusterd as working directory >>>>> [2019-01-16 14:50:14.559912] I [MSGID: 106479] [glusterd.c:1486:init] >>>>> 0-management: Using /var/run/gluster as pid file working directory >>>>> [2019-01-16 14:50:14.563834] W [MSGID: 103071] >>>>> [rdma.c:4629:__gf_rdma_ctx_create] 0-rpc-transport/rdma: rdma_cm event >>>>> channel creation failed [No such device] >>>>> [2019-01-16 14:50:14.563867] W [MSGID: 103055] [rdma.c:4938:init] >>>>> 0-rdma.management: Failed to initialize IB Device >>>>> [2019-01-16 14:50:14.563882] W >>>>> [rpc-transport.c:351:rpc_transport_load] 0-rpc-transport: 'rdma' >>>>> 
initialization failed >>>>> [2019-01-16 14:50:14.563957] W [rpcsvc.c:1781:rpcsvc_create_listener] >>>>> 0-rpc-service: cannot create listener, initing the transport failed >>>>> [2019-01-16 14:50:14.563974] E [MSGID: 106244] [glusterd.c:1764:init] >>>>> 0-management: creation of 1 listeners failed, continuing with succeeded >>>>> transport >>>>> [2019-01-16 14:50:15.565868] I [MSGID: 106513] >>>>> [glusterd-store.c:2240:glusterd_restore_op_version] 0-glusterd: retrieved >>>>> op-version: 40100 >>>>> [2019-01-16 14:50:15.642532] I [MSGID: 106544] >>>>> [glusterd.c:158:glusterd_uuid_init] 0-management: retrieved UUID: >>>>> d6bf51a7-c296-492f-8dac-e81efa9dd22d >>>>> [2019-01-16 14:50:15.675333] I [MSGID: 106498] >>>>> [glusterd-handler.c:3614:glusterd_friend_add_from_peerinfo] 0-management: >>>>> connect returned 0 >>>>> [2019-01-16 14:50:15.675421] W [MSGID: 106061] >>>>> [glusterd-handler.c:3408:glusterd_transport_inet_options_build] 0-glusterd: >>>>> Failed to get tcp-user-timeout >>>>> [2019-01-16 14:50:15.675451] I >>>>> [rpc-clnt.c:1059:rpc_clnt_connection_init] 0-management: setting >>>>> frame-timeout to 600 >>>>> *[2019-01-16 14:50:15.676912] E [MSGID: 106187] >>>>> [glusterd-store.c:4662:glusterd_resolve_all_bricks] 0-glusterd: resolve >>>>> brick failed in restore* >>>>> *[2019-01-16 14:50:15.676956] E [MSGID: 101019] >>>>> [xlator.c:720:xlator_init] 0-management: Initialization of volume >>>>> 'management' failed, review your volfile again* >>>>> [2019-01-16 14:50:15.676973] E [MSGID: 101066] >>>>> [graph.c:367:glusterfs_graph_init] 0-management: initializing translator >>>>> failed >>>>> [2019-01-16 14:50:15.676986] E [MSGID: 101176] >>>>> [graph.c:738:glusterfs_graph_activate] 0-graph: init failed >>>>> [2019-01-16 14:50:15.677479] W [glusterfsd.c:1514:cleanup_and_exit] >>>>> (-->/usr/local/sbin/glusterd(glusterfs_volumes_init+0xc2) [0x409f52] >>>>> -->/usr/local/sbin/glusterd(glusterfs_process_volfp+0x151) [0x409e41] >>>>> 
-->/usr/local/sbin/glusterd(cleanup_and_exit+0x5f) [0x40942f] ) 0-: >>>>> received signum (-1), shutting down >>>>> >>>>> >>>>> On Thu, Jan 17, 2019 at 8:06 AM Atin Mukherjee >>>>> wrote: >>>>> >>>>>> If gluster volume info/status shows the brick to be >>>>>> /media/disk4/brick4 then you'd need to mount the same path and hence you'd >>>>>> need to create the brick4 directory explicitly. I fail to understand the >>>>>> rationale how only /media/disk4 can be used as the mount path for the >>>>>> brick. >>>>>> >>>>>> On Wed, Jan 16, 2019 at 5:24 PM Amudhan P >>>>>> wrote: >>>>>> >>>>>>> Yes, I did mount bricks but the folder 'brick4' was still not >>>>>>> created inside the brick. >>>>>>> Do I need to create this folder because when I run replace-brick it >>>>>>> will create folder inside the brick. I have seen this behavior before when >>>>>>> running replace-brick or heal begins. >>>>>>> >>>>>>> On Wed, Jan 16, 2019 at 5:05 PM Atin Mukherjee >>>>>>> wrote: >>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On Wed, Jan 16, 2019 at 5:02 PM Amudhan P >>>>>>>> wrote: >>>>>>>> >>>>>>>>> Atin, >>>>>>>>> I have copied the content of 'gfs-tst' from vol folder in another >>>>>>>>> node. when starting service again fails with error msg in glusterd.log file. 
>>>>>>>>> >>>>>>>>> [2019-01-15 20:16:59.513023] I [MSGID: 100030] >>>>>>>>> [glusterfsd.c:2741:main] 0-/usr/local/sbin/glusterd: Started running >>>>>>>>> /usr/local/sbin/glusterd version 4.1.6 (args: /usr/local/sbin/glusterd -p >>>>>>>>> /var/run/glusterd.pid) >>>>>>>>> [2019-01-15 20:16:59.517164] I [MSGID: 106478] >>>>>>>>> [glusterd.c:1423:init] 0-management: Maximum allowed open file descriptors >>>>>>>>> set to 65536 >>>>>>>>> [2019-01-15 20:16:59.517264] I [MSGID: 106479] >>>>>>>>> [glusterd.c:1481:init] 0-management: Using /var/lib/glusterd as working >>>>>>>>> directory >>>>>>>>> [2019-01-15 20:16:59.517283] I [MSGID: 106479] >>>>>>>>> [glusterd.c:1486:init] 0-management: Using /var/run/gluster as pid file >>>>>>>>> working directory >>>>>>>>> [2019-01-15 20:16:59.521508] W [MSGID: 103071] >>>>>>>>> [rdma.c:4629:__gf_rdma_ctx_create] 0-rpc-transport/rdma: rdma_cm event >>>>>>>>> channel creation failed [No such device] >>>>>>>>> [2019-01-15 20:16:59.521544] W [MSGID: 103055] [rdma.c:4938:init] >>>>>>>>> 0-rdma.management: Failed to initialize IB Device >>>>>>>>> [2019-01-15 20:16:59.521562] W >>>>>>>>> [rpc-transport.c:351:rpc_transport_load] 0-rpc-transport: 'rdma' >>>>>>>>> initialization failed >>>>>>>>> [2019-01-15 20:16:59.521629] W >>>>>>>>> [rpcsvc.c:1781:rpcsvc_create_listener] 0-rpc-service: cannot create >>>>>>>>> listener, initing the transport failed >>>>>>>>> [2019-01-15 20:16:59.521648] E [MSGID: 106244] >>>>>>>>> [glusterd.c:1764:init] 0-management: creation of 1 listeners failed, >>>>>>>>> continuing with succeeded transport >>>>>>>>> [2019-01-15 20:17:00.529390] I [MSGID: 106513] >>>>>>>>> [glusterd-store.c:2240:glusterd_restore_op_version] 0-glusterd: retrieved >>>>>>>>> op-version: 40100 >>>>>>>>> [2019-01-15 20:17:00.608354] I [MSGID: 106544] >>>>>>>>> [glusterd.c:158:glusterd_uuid_init] 0-management: retrieved UUID: >>>>>>>>> d6bf51a7-c296-492f-8dac-e81efa9dd22d >>>>>>>>> [2019-01-15 20:17:00.650911] W [MSGID: 106425] >>>>>>>>> 
[glusterd-store.c:2643:glusterd_store_retrieve_bricks] 0-management: failed >>>>>>>>> to get statfs() call on brick /media/disk4/brick4 [No such file or >>>>>>>>> directory] >>>>>>>>> >>>>>>>> >>>>>>>> This means that underlying brick /media/disk4/brick4 doesn't exist. >>>>>>>> You already mentioned that you had replaced the faulty disk, but have you >>>>>>>> not mounted it yet? >>>>>>>> >>>>>>>> >>>>>>>>> [2019-01-15 20:17:00.691240] I [MSGID: 106498] >>>>>>>>> [glusterd-handler.c:3614:glusterd_friend_add_from_peerinfo] 0-management: >>>>>>>>> connect returned 0 >>>>>>>>> [2019-01-15 20:17:00.691307] W [MSGID: 106061] >>>>>>>>> [glusterd-handler.c:3408:glusterd_transport_inet_options_build] 0-glusterd: >>>>>>>>> Failed to get tcp-user-timeout >>>>>>>>> [2019-01-15 20:17:00.691331] I >>>>>>>>> [rpc-clnt.c:1059:rpc_clnt_connection_init] 0-management: setting >>>>>>>>> frame-timeout to 600 >>>>>>>>> [2019-01-15 20:17:00.692547] E [MSGID: 106187] >>>>>>>>> [glusterd-store.c:4662:glusterd_resolve_all_bricks] 0-glusterd: resolve >>>>>>>>> brick failed in restore >>>>>>>>> [2019-01-15 20:17:00.692582] E [MSGID: 101019] >>>>>>>>> [xlator.c:720:xlator_init] 0-management: Initialization of volume >>>>>>>>> 'management' failed, review your volfile again >>>>>>>>> [2019-01-15 20:17:00.692597] E [MSGID: 101066] >>>>>>>>> [graph.c:367:glusterfs_graph_init] 0-management: initializing translator >>>>>>>>> failed >>>>>>>>> [2019-01-15 20:17:00.692607] E [MSGID: 101176] >>>>>>>>> [graph.c:738:glusterfs_graph_activate] 0-graph: init failed >>>>>>>>> [2019-01-15 20:17:00.693004] W >>>>>>>>> [glusterfsd.c:1514:cleanup_and_exit] >>>>>>>>> (-->/usr/local/sbin/glusterd(glusterfs_volumes_init+0xc2) [0x409f52] >>>>>>>>> -->/usr/local/sbin/glusterd(glusterfs_process_volfp+0x151) [0x409e41] >>>>>>>>> -->/usr/local/sbin/glusterd(cleanup_and_exit+0x5f) [0x40942f] ) 0-: >>>>>>>>> received signum (-1), shutting down >>>>>>>>> >>>>>>>>> >>>>>>>>> On Wed, Jan 16, 2019 at 4:34 PM Atin Mukherjee < 
>>>>>>>>> amukherj at redhat.com> wrote: >>>>>>>>> >>>>>>>>>> This is a case of partial write of a transaction and as the host >>>>>>>>>> ran out of space for the root partition where all the glusterd related >>>>>>>>>> configurations are persisted, the transaction couldn't be written and hence >>>>>>>>>> the new (replaced) brick's information wasn't persisted in the >>>>>>>>>> configuration. The workaround for this is to copy the content of >>>>>>>>>> /var/lib/glusterd/vols/gfs-tst/ from one of the nodes in the trusted >>>>>>>>>> storage pool to the node where glusterd service fails to come up and post >>>>>>>>>> that restarting the glusterd service should be able to make peer status >>>>>>>>>> reporting all nodes healthy and connected. >>>>>>>>>> >>>>>>>>>> On Wed, Jan 16, 2019 at 3:49 PM Amudhan P >>>>>>>>>> wrote: >>>>>>>>>> >>>>>>>>>>> Hi, >>>>>>>>>>> >>>>>>>>>>> In short, when I started glusterd service I am getting following >>>>>>>>>>> error msg in the glusterd.log file in one server. >>>>>>>>>>> what needs to be done? 
>>>>>>>>>>> >>>>>>>>>>> error logged in glusterd.log >>>>>>>>>>> >>>>>>>>>>> [2019-01-15 17:50:13.956053] I [MSGID: 100030] >>>>>>>>>>> [glusterfsd.c:2741:main] 0-/usr/local/sbin/glusterd: Started running >>>>>>>>>>> /usr/local/sbin/glusterd version 4.1.6 (args: /usr/local/sbin/glusterd -p >>>>>>>>>>> /var/run/glusterd.pid) >>>>>>>>>>> [2019-01-15 17:50:13.960131] I [MSGID: 106478] >>>>>>>>>>> [glusterd.c:1423:init] 0-management: Maximum allowed open file descriptors >>>>>>>>>>> set to 65536 >>>>>>>>>>> [2019-01-15 17:50:13.960193] I [MSGID: 106479] >>>>>>>>>>> [glusterd.c:1481:init] 0-management: Using /var/lib/glusterd as working >>>>>>>>>>> directory >>>>>>>>>>> [2019-01-15 17:50:13.960212] I [MSGID: 106479] >>>>>>>>>>> [glusterd.c:1486:init] 0-management: Using /var/run/gluster as pid file >>>>>>>>>>> working directory >>>>>>>>>>> [2019-01-15 17:50:13.964437] W [MSGID: 103071] >>>>>>>>>>> [rdma.c:4629:__gf_rdma_ctx_create] 0-rpc-transport/rdma: rdma_cm event >>>>>>>>>>> channel creation failed [No such device] >>>>>>>>>>> [2019-01-15 17:50:13.964474] W [MSGID: 103055] >>>>>>>>>>> [rdma.c:4938:init] 0-rdma.management: Failed to initialize IB Device >>>>>>>>>>> [2019-01-15 17:50:13.964491] W >>>>>>>>>>> [rpc-transport.c:351:rpc_transport_load] 0-rpc-transport: 'rdma' >>>>>>>>>>> initialization failed >>>>>>>>>>> [2019-01-15 17:50:13.964560] W >>>>>>>>>>> [rpcsvc.c:1781:rpcsvc_create_listener] 0-rpc-service: cannot create >>>>>>>>>>> listener, initing the transport failed >>>>>>>>>>> [2019-01-15 17:50:13.964579] E [MSGID: 106244] >>>>>>>>>>> [glusterd.c:1764:init] 0-management: creation of 1 listeners failed, >>>>>>>>>>> continuing with succeeded transport >>>>>>>>>>> [2019-01-15 17:50:14.967681] I [MSGID: 106513] >>>>>>>>>>> [glusterd-store.c:2240:glusterd_restore_op_version] 0-glusterd: retrieved >>>>>>>>>>> op-version: 40100 >>>>>>>>>>> [2019-01-15 17:50:14.973931] I [MSGID: 106544] >>>>>>>>>>> [glusterd.c:158:glusterd_uuid_init] 0-management: retrieved UUID: 
>>>>>>>>>>> d6bf51a7-c296-492f-8dac-e81efa9dd22d >>>>>>>>>>> [2019-01-15 17:50:15.046620] E [MSGID: 101032] >>>>>>>>>>> [store.c:441:gf_store_handle_retrieve] 0-: Path corresponding to >>>>>>>>>>> /var/lib/glusterd/vols/gfs-tst/bricks/IP.3:-media-disk3-brick3. [No such >>>>>>>>>>> file or directory] >>>>>>>>>>> [2019-01-15 17:50:15.046685] E [MSGID: 106201] >>>>>>>>>>> [glusterd-store.c:3384:glusterd_store_retrieve_volumes] 0-management: >>>>>>>>>>> Unable to restore volume: gfs-tst >>>>>>>>>>> [2019-01-15 17:50:15.046718] E [MSGID: 101019] >>>>>>>>>>> [xlator.c:720:xlator_init] 0-management: Initialization of volume >>>>>>>>>>> 'management' failed, review your volfile again >>>>>>>>>>> [2019-01-15 17:50:15.046732] E [MSGID: 101066] >>>>>>>>>>> [graph.c:367:glusterfs_graph_init] 0-management: initializing translator >>>>>>>>>>> failed >>>>>>>>>>> [2019-01-15 17:50:15.046741] E [MSGID: 101176] >>>>>>>>>>> [graph.c:738:glusterfs_graph_activate] 0-graph: init failed >>>>>>>>>>> [2019-01-15 17:50:15.047171] W >>>>>>>>>>> [glusterfsd.c:1514:cleanup_and_exit] >>>>>>>>>>> (-->/usr/local/sbin/glusterd(glusterfs_volumes >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> In long, I am trying to simulate a situation. where volume >>>>>>>>>>> stoped abnormally and >>>>>>>>>>> entire cluster restarted with some missing disks. >>>>>>>>>>> >>>>>>>>>>> My test cluster is set up with 3 nodes and each has four disks, >>>>>>>>>>> I have setup a volume with disperse 4+2. >>>>>>>>>>> In Node-3 2 disks have failed, to replace I have shutdown all >>>>>>>>>>> system >>>>>>>>>>> >>>>>>>>>>> below are the steps done. >>>>>>>>>>> >>>>>>>>>>> 1. umount from client machine >>>>>>>>>>> 2. shutdown all system by running `shutdown -h now` command ( >>>>>>>>>>> without stopping volume and stop service) >>>>>>>>>>> 3. replace faulty disk in Node-3 >>>>>>>>>>> 4. powered ON all system >>>>>>>>>>> 5. format replaced drives, and mount all drives >>>>>>>>>>> 6. 
start glusterd service in all node (success) >>>>>>>>>>> 7. Now running `voulume status` command from node-3 >>>>>>>>>>> output : [2019-01-15 16:52:17.718422] : v status : FAILED : >>>>>>>>>>> Staging failed on 0083ec0c-40bf-472a-a128-458924e56c96. Please check log >>>>>>>>>>> file for details. >>>>>>>>>>> 8. running `voulume start gfs-tst` command from node-3 >>>>>>>>>>> output : [2019-01-15 16:53:19.410252] : v start gfs-tst : >>>>>>>>>>> FAILED : Volume gfs-tst already started >>>>>>>>>>> >>>>>>>>>>> 9. running `gluster v status` in other node. showing all brick >>>>>>>>>>> available but 'self-heal daemon' not running >>>>>>>>>>> @gfstst-node2:~$ sudo gluster v status >>>>>>>>>>> Status of volume: gfs-tst >>>>>>>>>>> Gluster process TCP Port RDMA Port >>>>>>>>>>> Online Pid >>>>>>>>>>> >>>>>>>>>>> ------------------------------------------------------------------------------ >>>>>>>>>>> Brick IP.2:/media/disk1/brick1 49152 0 Y >>>>>>>>>>> 1517 >>>>>>>>>>> Brick IP.4:/media/disk1/brick1 49152 0 Y >>>>>>>>>>> 1668 >>>>>>>>>>> Brick IP.2:/media/disk2/brick2 49153 0 Y >>>>>>>>>>> 1522 >>>>>>>>>>> Brick IP.4:/media/disk2/brick2 49153 0 Y >>>>>>>>>>> 1678 >>>>>>>>>>> Brick IP.2:/media/disk3/brick3 49154 0 Y >>>>>>>>>>> 1527 >>>>>>>>>>> Brick IP.4:/media/disk3/brick3 49154 0 Y >>>>>>>>>>> 1677 >>>>>>>>>>> Brick IP.2:/media/disk4/brick4 49155 0 Y >>>>>>>>>>> 1541 >>>>>>>>>>> Brick IP.4:/media/disk4/brick4 49155 0 Y >>>>>>>>>>> 1683 >>>>>>>>>>> Self-heal Daemon on localhost N/A N/A >>>>>>>>>>> Y 2662 >>>>>>>>>>> Self-heal Daemon on IP.4 N/A N/A Y >>>>>>>>>>> 2786 >>>>>>>>>>> >>>>>>>>>>> 10. in the above output 'volume already started'. 
so, running >>>>>>>>>>> `reset-brick` command >>>>>>>>>>> v reset-brick gfs-tst IP.3:/media/disk3/brick3 >>>>>>>>>>> IP.3:/media/disk3/brick3 commit force >>>>>>>>>>> >>>>>>>>>>> output : [2019-01-15 16:57:37.916942] : v reset-brick gfs-tst >>>>>>>>>>> IP.3:/media/disk3/brick3 IP.3:/media/disk3/brick3 commit force : FAILED : >>>>>>>>>>> /media/disk3/brick3 is already part of a volume >>>>>>>>>>> >>>>>>>>>>> 11. reset-brick command was not working, so, tried stopping >>>>>>>>>>> volume and start with force command >>>>>>>>>>> output : [2019-01-15 17:01:04.570794] : v start gfs-tst force : >>>>>>>>>>> FAILED : Pre-validation failed on localhost. Please check log file for >>>>>>>>>>> details >>>>>>>>>>> >>>>>>>>>>> 12. now stopped service in all node and tried starting again. >>>>>>>>>>> except node-3 other nodes service started successfully without any issues. >>>>>>>>>>> >>>>>>>>>>> in node-3 receiving following message. >>>>>>>>>>> >>>>>>>>>>> sudo service glusterd start >>>>>>>>>>> * Starting glusterd service glusterd >>>>>>>>>>> >>>>>>>>>>> [fail] >>>>>>>>>>> /usr/local/sbin/glusterd: option requires an argument -- 'f' >>>>>>>>>>> Try `glusterd --help' or `glusterd --usage' for more information. >>>>>>>>>>> >>>>>>>>>>> 13. checking glusterd log file found that OS drive was running >>>>>>>>>>> out of space >>>>>>>>>>> output : [2019-01-15 16:51:37.210792] W [MSGID: 101012] >>>>>>>>>>> [store.c:372:gf_store_save_value] 0-management: fflush failed. [No space >>>>>>>>>>> left on device] >>>>>>>>>>> [2019-01-15 16:51:37.210874] E [MSGID: 106190] >>>>>>>>>>> [glusterd-store.c:1058:glusterd_volume_exclude_options_write] 0-management: >>>>>>>>>>> Unable to write volume values for gfs-tst >>>>>>>>>>> >>>>>>>>>>> 14. cleared some space in OS drive but still, service is not >>>>>>>>>>> running. 
below is the error logged in glusterd.log >>>>>>>>>>> >>>>>>>>>>> [2019-01-15 17:50:13.956053] I [MSGID: 100030] >>>>>>>>>>> [glusterfsd.c:2741:main] 0-/usr/local/sbin/glusterd: Started running >>>>>>>>>>> /usr/local/sbin/glusterd version 4.1.6 (args: /usr/local/sbin/glusterd -p >>>>>>>>>>> /var/run/glusterd.pid) >>>>>>>>>>> [2019-01-15 17:50:13.960131] I [MSGID: 106478] >>>>>>>>>>> [glusterd.c:1423:init] 0-management: Maximum allowed open file descriptors >>>>>>>>>>> set to 65536 >>>>>>>>>>> [2019-01-15 17:50:13.960193] I [MSGID: 106479] >>>>>>>>>>> [glusterd.c:1481:init] 0-management: Using /var/lib/glusterd as working >>>>>>>>>>> directory >>>>>>>>>>> [2019-01-15 17:50:13.960212] I [MSGID: 106479] >>>>>>>>>>> [glusterd.c:1486:init] 0-management: Using /var/run/gluster as pid file >>>>>>>>>>> working directory >>>>>>>>>>> [2019-01-15 17:50:13.964437] W [MSGID: 103071] >>>>>>>>>>> [rdma.c:4629:__gf_rdma_ctx_create] 0-rpc-transport/rdma: rdma_cm event >>>>>>>>>>> channel creation failed [No such device] >>>>>>>>>>> [2019-01-15 17:50:13.964474] W [MSGID: 103055] >>>>>>>>>>> [rdma.c:4938:init] 0-rdma.management: Failed to initialize IB Device >>>>>>>>>>> [2019-01-15 17:50:13.964491] W >>>>>>>>>>> [rpc-transport.c:351:rpc_transport_load] 0-rpc-transport: 'rdma' >>>>>>>>>>> initialization failed >>>>>>>>>>> [2019-01-15 17:50:13.964560] W >>>>>>>>>>> [rpcsvc.c:1781:rpcsvc_create_listener] 0-rpc-service: cannot create >>>>>>>>>>> listener, initing the transport failed >>>>>>>>>>> [2019-01-15 17:50:13.964579] E [MSGID: 106244] >>>>>>>>>>> [glusterd.c:1764:init] 0-management: creation of 1 listeners failed, >>>>>>>>>>> continuing with succeeded transport >>>>>>>>>>> [2019-01-15 17:50:14.967681] I [MSGID: 106513] >>>>>>>>>>> [glusterd-store.c:2240:glusterd_restore_op_version] 0-glusterd: retrieved >>>>>>>>>>> op-version: 40100 >>>>>>>>>>> [2019-01-15 17:50:14.973931] I [MSGID: 106544] >>>>>>>>>>> [glusterd.c:158:glusterd_uuid_init] 0-management: retrieved UUID: >>>>>>>>>>> 
d6bf51a7-c296-492f-8dac-e81efa9dd22d >>>>>>>>>>> [2019-01-15 17:50:15.046620] E [MSGID: 101032] >>>>>>>>>>> [store.c:441:gf_store_handle_retrieve] 0-: Path corresponding to >>>>>>>>>>> /var/lib/glusterd/vols/gfs-tst/bricks/IP.3:-media-disk3-brick3. [No such >>>>>>>>>>> file or directory] >>>>>>>>>>> [2019-01-15 17:50:15.046685] E [MSGID: 106201] >>>>>>>>>>> [glusterd-store.c:3384:glusterd_store_retrieve_volumes] 0-management: >>>>>>>>>>> Unable to restore volume: gfs-tst >>>>>>>>>>> [2019-01-15 17:50:15.046718] E [MSGID: 101019] >>>>>>>>>>> [xlator.c:720:xlator_init] 0-management: Initialization of volume >>>>>>>>>>> 'management' failed, review your volfile again >>>>>>>>>>> [2019-01-15 17:50:15.046732] E [MSGID: 101066] >>>>>>>>>>> [graph.c:367:glusterfs_graph_init] 0-management: initializing translator >>>>>>>>>>> failed >>>>>>>>>>> [2019-01-15 17:50:15.046741] E [MSGID: 101176] >>>>>>>>>>> [graph.c:738:glusterfs_graph_activate] 0-graph: init failed >>>>>>>>>>> [2019-01-15 17:50:15.047171] W >>>>>>>>>>> [glusterfsd.c:1514:cleanup_and_exit] >>>>>>>>>>> (-->/usr/local/sbin/glusterd(glusterfs_volumes_init+0xc2) [0x409f52] >>>>>>>>>>> -->/usr/local/sbin/glusterd(glusterfs_process_volfp+0x151) [0x409e41] >>>>>>>>>>> -->/usr/local/sbin/glusterd(cleanup_and_exit+0x5f) [0x40942f] ) 0-: >>>>>>>>>>> received signum (-1), shutting down >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> 15. 
In other node running `volume status' still shows bricks >>>>>>>>>>> node3 is live >>>>>>>>>>> but 'peer status' showing node-3 disconnected >>>>>>>>>>> >>>>>>>>>>> @gfstst-node2:~$ sudo gluster v status >>>>>>>>>>> Status of volume: gfs-tst >>>>>>>>>>> Gluster process TCP Port RDMA Port >>>>>>>>>>> Online Pid >>>>>>>>>>> >>>>>>>>>>> ------------------------------------------------------------------------------ >>>>>>>>>>> Brick IP.2:/media/disk1/brick1 49152 0 Y >>>>>>>>>>> 1517 >>>>>>>>>>> Brick IP.4:/media/disk1/brick1 49152 0 Y >>>>>>>>>>> 1668 >>>>>>>>>>> Brick IP.2:/media/disk2/brick2 49153 0 Y >>>>>>>>>>> 1522 >>>>>>>>>>> Brick IP.4:/media/disk2/brick2 49153 0 Y >>>>>>>>>>> 1678 >>>>>>>>>>> Brick IP.2:/media/disk3/brick3 49154 0 Y >>>>>>>>>>> 1527 >>>>>>>>>>> Brick IP.4:/media/disk3/brick3 49154 0 Y >>>>>>>>>>> 1677 >>>>>>>>>>> Brick IP.2:/media/disk4/brick4 49155 0 Y >>>>>>>>>>> 1541 >>>>>>>>>>> Brick IP.4:/media/disk4/brick4 49155 0 Y >>>>>>>>>>> 1683 >>>>>>>>>>> Self-heal Daemon on localhost N/A N/A Y >>>>>>>>>>> 2662 >>>>>>>>>>> Self-heal Daemon on IP.4 N/A N/A Y >>>>>>>>>>> 2786 >>>>>>>>>>> >>>>>>>>>>> Task Status of Volume gfs-tst >>>>>>>>>>> >>>>>>>>>>> ------------------------------------------------------------------------------ >>>>>>>>>>> There are no active volume tasks >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> root at gfstst-node2:~$ sudo gluster pool list >>>>>>>>>>> UUID Hostname State >>>>>>>>>>> d6bf51a7-c296-492f-8dac-e81efa9dd22d IP.3 Disconnected >>>>>>>>>>> c1cbb58e-3ceb-4637-9ba3-3d28ef20b143 IP.4 Connected >>>>>>>>>>> 0083ec0c-40bf-472a-a128-458924e56c96 localhost Connected >>>>>>>>>>> >>>>>>>>>>> root at gfstst-node2:~$ sudo gluster peer status >>>>>>>>>>> Number of Peers: 2 >>>>>>>>>>> >>>>>>>>>>> Hostname: IP.3 >>>>>>>>>>> Uuid: d6bf51a7-c296-492f-8dac-e81efa9dd22d >>>>>>>>>>> State: Peer in Cluster (Disconnected) >>>>>>>>>>> >>>>>>>>>>> Hostname: IP.4 >>>>>>>>>>> Uuid: c1cbb58e-3ceb-4637-9ba3-3d28ef20b143 >>>>>>>>>>> State: Peer in 
Cluster (Connected) >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> regards >>>>>>>>>>> Amudhan >>>>>>>>>>> _______________________________________________ >>>>>>>>>>> Gluster-users mailing list >>>>>>>>>>> Gluster-users at gluster.org >>>>>>>>>>> https://lists.gluster.org/mailman/listinfo/gluster-users >>>>>>>>>> >>>>>>>>>> -------------- next part -------------- An HTML attachment was scrubbed... URL: From mjc at avtechpulse.com Wed Jan 30 14:53:24 2019 From: mjc at avtechpulse.com (Dr. Michael J. Chudobiak) Date: Wed, 30 Jan 2019 09:53:24 -0500 Subject: [Gluster-users] chrome / chromium crash on gluster Message-ID: <893b30a1-e2cb-fa49-c014-9a68bbd1b7dd@avtechpulse.com> I run Fedora 29 clients and servers, with user home folders mounted on gluster. This worked fine with Fedora 27 clients, but on F29 clients the chrome and chromium browsers crash. The backtrace info (see below) suggests problems with sqlite. Anyone else run into this? gluster and sqlite have had issues in the past... Firefox runs just fine, even though it is an sqlite user too. chromium clients mounted on local drives work fine. - Mike clients: glusterfs-5.3-1.fc29.x86_64, chromium-71.0.3578.98-1.fc29.x86_64 server: glusterfs-server-5.3-1.fc29.x86_64 [mjc at daisy ~]$ chromium-browser [18826:18826:0130/094436.431828:ERROR:sandbox_linux.cc(364)] InitializeSandbox() called with multiple threads in process gpu-process. 
[18785:18785:0130/094440.905900:ERROR:x11_input_method_context_impl_gtk.cc(144)] Not implemented reached in virtual void libgtkui::X11InputMethodContextImplGtk::SetSurroundingText(const string16&, const gfx::Range&) Received signal 7 BUS_ADRERR 7fc30e9bd000 #0 0x7fc34b008261 base::debug::StackTrace::StackTrace() #1 0x7fc34b00869b base::debug::(anonymous namespace)::StackDumpSignalHandler() #2 0x7fc34b008cb7 base::debug::(anonymous namespace)::StackDumpSignalHandler() #3 0x7fc3401fe030 #4 0x7fc33f5820f0 __memmove_avx_unaligned_erms #5 0x7fc346099491 unixRead #6 0x7fc3460d2784 readDbPage #7 0x7fc3460d5e4f getPageNormal #8 0x7fc3460d5f01 getPageMMap #9 0x7fc3460958f5 btreeGetPage #10 0x7fc3460ec47b sqlite3BtreeBeginTrans #11 0x7fc3460fd1e8 sqlite3VdbeExec #12 0x7fc3461056af chrome_sqlite3_step #13 0x7fc3464071c7 sql::Statement::StepInternal() #14 0x7fc3464072de sql::Statement::Step() #15 0x555fd21699d7 autofill::AutofillTable::GetAutofillProfiles() #16 0x555fd2160808 autofill::AutofillProfileSyncableService::MergeDataAndStartSyncing() #17 0x555fd1d25207 syncer::SharedChangeProcessor::StartAssociation() #18 0x555fd1d09652 _ZN4base8internal7InvokerINS0_9BindStateIMN6syncer21SharedChangeProcessorEFvNS_17RepeatingCallbackIFvNS3_18DataTypeController15ConfigureResultERKNS3_15SyncMergeResultESA_EEEPNS3_10SyncClientEPNS3_29GenericChangeProcessorFactoryEPNS3_9UserShareESt10unique_ptrINS3_20DataTypeErrorHandlerESt14default_deleteISK_EEEJ13scoped_refptrIS4_ESC_SE_SG_SI_NS0_13PassedWrapperISN_EEEEEFvvEE3RunEPNS0_13BindStateBaseE #19 0x7fc34af4309d base::debug::TaskAnnotator::RunTask() #20 0x7fc34afcda86 base::internal::TaskTracker::RunOrSkipTask() #21 0x7fc34b01b6a2 base::internal::TaskTrackerPosix::RunOrSkipTask() #22 0x7fc34afd07d6 base::internal::TaskTracker::RunAndPopNextTask() #23 0x7fc34afca5e7 base::internal::SchedulerWorker::RunWorker() #24 0x7fc34afcac84 base::internal::SchedulerWorker::RunSharedWorker() #25 0x7fc34b01aa09 base::(anonymous namespace)::ThreadFunc() #26 
0x7fc3401f358e start_thread #27 0x7fc33f51d6a3 __GI___clone r8: 00000cbfd93d4a00 r9: 00000000cbfd93d4 r10: 000000000000011c r11: 0000000000000000 r12: 00000cbfd940eb00 r13: 0000000000000000 r14: 0000000000000000 r15: 00000cbfd9336c00 di: 00000cbfd93d4a00 si: 00007fc30e9bd000 bp: 00007fc30faff7e0 bx: 0000000000000800 dx: 0000000000000800 ax: 00000cbfd93d4a00 cx: 0000000000000800 sp: 00007fc30faff788 ip: 00007fc33f5820f0 efl: 0000000000010287 cgf: 002b000000000033 erf: 0000000000000004 trp: 000000000000000e msk: 0000000000000000 cr2: 00007fc30e9bd000 [end of stack trace] Calling _exit(1). Core file will not be generated. From archon810 at gmail.com Wed Jan 30 20:26:22 2019 From: archon810 at gmail.com (Artem Russakovskii) Date: Wed, 30 Jan 2019 12:26:22 -0800 Subject: [Gluster-users] Message repeated over and over after upgrade from 4.1 to 5.3: W [dict.c:761:dict_ref] (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) [0x7fd966fcd329] -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) [0x7fd9671deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) [0x7fd9731ea218] ) 2-dict: dict is NULL [Invalid argument] Message-ID: I found a similar issue here: https://bugzilla.redhat.com/show_bug.cgi?id=1313567. There's a comment from 3 days ago from someone else with 5.3 who started seeing the spam. Here's the message that repeats over and over: [2019-01-30 20:23:24.481581] W [dict.c:761:dict_ref] (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) [0x7fd966fcd329] -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) [0x7fd9671deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) [0x7fd9731ea218] ) 2-dict: dict is NULL [Invalid argument] Is there any fix for this issue? Thanks. Sincerely, Artem -- Founder, Android Police, APK Mirror, Illogical Robot LLC beerpla.net | +ArtemRussakovskii | @ArtemR -------------- next part -------------- An HTML attachment was scrubbed...
URL: From archon810 at gmail.com Wed Jan 30 20:43:35 2019 From: archon810 at gmail.com (Artem Russakovskii) Date: Wed, 30 Jan 2019 12:43:35 -0800 Subject: [Gluster-users] Message repeated over and over after upgrade from 4.1 to 5.3: W [dict.c:761:dict_ref] (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) [0x7fd966fcd329] -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) [0x7fd9671deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) [0x7fd9731ea218] ) 2-dict: dict is NULL [Invalid argument] In-Reply-To: References: Message-ID: Also, not sure if related or not, but I got a ton of these "Failed to dispatch handler" in my logs as well. Many people have been commenting about this issue here https://bugzilla.redhat.com/show_bug.cgi?id=1651246. ==> mnt-SITE_data1.log <== > [2019-01-30 20:38:20.783713] W [dict.c:761:dict_ref] > (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) > [0x7fd966fcd329] > -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) > [0x7fd9671deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) > [0x7fd9731ea218] ) 2-dict: dict is NULL [Invalid argument] > ==> mnt-SITE_data3.log <== > The message "E [MSGID: 101191] > [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch > handler" repeated 413 times between [2019-01-30 20:36:23.881090] and > [2019-01-30 20:38:20.015593] > The message "I [MSGID: 108031] [afr-common.c:2543:afr_local_discovery_cbk] > 2-SITE_data3-replicate-0: selecting local read_child SITE_data3-client-0" > repeated 42 times between [2019-01-30 20:36:23.290287] and [2019-01-30 > 20:38:20.280306] > ==> mnt-SITE_data1.log <== > The message "I [MSGID: 108031] [afr-common.c:2543:afr_local_discovery_cbk] > 2-SITE_data1-replicate-0: selecting local read_child SITE_data1-client-0" > repeated 50 times between [2019-01-30 20:36:22.247367] and [2019-01-30 > 20:38:19.459789] > The message "E [MSGID: 101191] > 
[event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch > handler" repeated 2654 times between [2019-01-30 20:36:22.667327] and > [2019-01-30 20:38:20.546355] > [2019-01-30 20:38:21.492319] I [MSGID: 108031] > [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data1-replicate-0: > selecting local read_child SITE_data1-client-0 > ==> mnt-SITE_data3.log <== > [2019-01-30 20:38:22.349689] I [MSGID: 108031] > [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data3-replicate-0: > selecting local read_child SITE_data3-client-0 > ==> mnt-SITE_data1.log <== > [2019-01-30 20:38:22.762941] E [MSGID: 101191] > [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch > handler I'm hoping raising the issue here on the mailing list may bring some additional eyeballs and get them both fixed. Thanks. Sincerely, Artem -- Founder, Android Police , APK Mirror , Illogical Robot LLC beerpla.net | +ArtemRussakovskii | @ArtemR On Wed, Jan 30, 2019 at 12:26 PM Artem Russakovskii wrote: > I found a similar issue here: > https://bugzilla.redhat.com/show_bug.cgi?id=1313567. There's a comment > from 3 days ago from someone else with 5.3 who started seeing the spam. > > Here's the command that repeats over and over: > [2019-01-30 20:23:24.481581] W [dict.c:761:dict_ref] > (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) > [0x7fd966fcd329] > -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) > [0x7fd9671deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) > [0x7fd9731ea218] ) 2-dict: dict is NULL [Invalid argument] > > Is there any fix for this issue? > > Thanks. > > Sincerely, > Artem > > -- > Founder, Android Police , APK Mirror > , Illogical Robot LLC > beerpla.net | +ArtemRussakovskii > | @ArtemR > > -------------- next part -------------- An HTML attachment was scrubbed... 
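To put a number on how noisy these messages are, the "repeated N times between [...] and [...]" summary lines quoted above can be turned into a rate. The following is only a rough sketch: the function name `repeat_rates` is made up for illustration, and it assumes the log text is in its original one-entry-per-line form (not the wrapped, quoted form shown in this mail).

```python
import re
from datetime import datetime

# Matches the summary lines gluster writes when it suppresses duplicates, e.g.
#   The message "E [MSGID: ...] ..." repeated 413 times between [t1] and [t2]
_SUMMARY_RE = re.compile(
    r'The message "(?P<msg>.*?)" repeated (?P<count>\d+) times '
    r"between \[(?P<start>[^\]]+)\] and \[(?P<end>[^\]]+)\]"
)
_TS_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def repeat_rates(log_text):
    """Yield (message, count, repeats_per_second) for each summary line."""
    for m in _SUMMARY_RE.finditer(log_text):
        start = datetime.strptime(m.group("start"), _TS_FORMAT)
        end = datetime.strptime(m.group("end"), _TS_FORMAT)
        seconds = (end - start).total_seconds()
        count = int(m.group("count"))
        # Guard against a zero-length interval instead of dividing by zero.
        rate = count / seconds if seconds > 0 else float("inf")
        yield m.group("msg"), count, rate


if __name__ == "__main__":
    sample = (
        'The message "E [MSGID: 101191] '
        "[event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: "
        'Failed to dispatch handler" repeated 2654 times between '
        "[2019-01-30 20:36:22.667327] and [2019-01-30 20:38:20.546355]"
    )
    for msg, count, rate in repeat_rates(sample):
        print(f"{count} repeats, {rate:.1f}/s: {msg}")
```

For the 2654-repeat line above, the interval is just under two minutes, so this works out to roughly 22 messages per second on that one mount.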
URL: From rgowdapp at redhat.com Thu Jan 31 02:37:29 2019 From: rgowdapp at redhat.com (Raghavendra Gowdappa) Date: Thu, 31 Jan 2019 08:07:29 +0530 Subject: [Gluster-users] Message repeated over and over after upgrade from 4.1 to 5.3: W [dict.c:761:dict_ref] (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) [0x7fd966fcd329] -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) [0x7fd9671deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) [0x7fd9731ea218] ) 2-dict: dict is NULL [Invalid argument] In-Reply-To: References: Message-ID: On Thu, Jan 31, 2019 at 2:14 AM Artem Russakovskii wrote: > Also, not sure if related or not, but I got a ton of these "Failed to > dispatch handler" in my logs as well. Many people have been commenting > about this issue here https://bugzilla.redhat.com/show_bug.cgi?id=1651246. > https://review.gluster.org/#/c/glusterfs/+/22046/ addresses this. > ==> mnt-SITE_data1.log <== >> [2019-01-30 20:38:20.783713] W [dict.c:761:dict_ref] >> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >> [0x7fd966fcd329] >> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >> [0x7fd9671deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >> [0x7fd9731ea218] ) 2-dict: dict is NULL [Invalid argument] >> ==> mnt-SITE_data3.log <== >> The message "E [MSGID: 101191] >> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch >> handler" repeated 413 times between [2019-01-30 20:36:23.881090] and >> [2019-01-30 20:38:20.015593] >> The message "I [MSGID: 108031] >> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data3-replicate-0: >> selecting local read_child SITE_data3-client-0" repeated 42 times between >> [2019-01-30 20:36:23.290287] and [2019-01-30 20:38:20.280306] >> ==> mnt-SITE_data1.log <== >> The message "I [MSGID: 108031] >> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data1-replicate-0: >> selecting local read_child SITE_data1-client-0" repeated 50 
times between >> [2019-01-30 20:36:22.247367] and [2019-01-30 20:38:19.459789] >> The message "E [MSGID: 101191] >> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch >> handler" repeated 2654 times between [2019-01-30 20:36:22.667327] and >> [2019-01-30 20:38:20.546355] >> [2019-01-30 20:38:21.492319] I [MSGID: 108031] >> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data1-replicate-0: >> selecting local read_child SITE_data1-client-0 >> ==> mnt-SITE_data3.log <== >> [2019-01-30 20:38:22.349689] I [MSGID: 108031] >> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data3-replicate-0: >> selecting local read_child SITE_data3-client-0 >> ==> mnt-SITE_data1.log <== >> [2019-01-30 20:38:22.762941] E [MSGID: 101191] >> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch >> handler > > > I'm hoping raising the issue here on the mailing list may bring some > additional eyeballs and get them both fixed. > > Thanks. > > Sincerely, > Artem > > -- > Founder, Android Police , APK Mirror > , Illogical Robot LLC > beerpla.net | +ArtemRussakovskii > | @ArtemR > > > > On Wed, Jan 30, 2019 at 12:26 PM Artem Russakovskii > wrote: > >> I found a similar issue here: >> https://bugzilla.redhat.com/show_bug.cgi?id=1313567. There's a comment >> from 3 days ago from someone else with 5.3 who started seeing the spam. >> >> Here's the command that repeats over and over: >> [2019-01-30 20:23:24.481581] W [dict.c:761:dict_ref] >> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >> [0x7fd966fcd329] >> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >> [0x7fd9671deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >> [0x7fd9731ea218] ) 2-dict: dict is NULL [Invalid argument] >> > +Milind Changire Can you check why this message is logged and send a fix? >> Is there any fix for this issue? >> >> Thanks. 
>> >> Sincerely, >> Artem >> >> -- >> Founder, Android Police , APK Mirror >> , Illogical Robot LLC >> beerpla.net | +ArtemRussakovskii >> | @ArtemR >> >> > _______________________________________________ > Gluster-users mailing list > Gluster-users at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-users -------------- next part -------------- An HTML attachment was scrubbed... URL: From amukherj at redhat.com Thu Jan 31 03:21:48 2019 From: amukherj at redhat.com (Atin Mukherjee) Date: Thu, 31 Jan 2019 08:51:48 +0530 Subject: [Gluster-users] Default Port Range for Bricks In-Reply-To: References: Message-ID: On Tue, Jan 29, 2019 at 8:52 PM David Spisla wrote: > Hello Gluster Community, > > in glusterd.vol are parameters to define the port range for the bricks. > They are commented out per default: > > # option base-port 49152 > # option max-port 65535 > I assume that glusterd is not using this range if the parameters are commented out. > > The commented-out base-port and max-port values you see in glusterd.vol are the same defaults that are defined in the glusterd codebase, so glusterd uses that range even when the options are commented out. These options were added to the config so that users who want more granular control over the port range can define it in this file. However, from glusterfs-6 onwards we have fixed bug 1659857, after which the default max port is 60999. > But what range instead? Is there a way to find this out? > > Regards > > David Spisla > > > > > _______________________________________________ > Gluster-users mailing list > Gluster-users at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-users -------------- next part -------------- An HTML attachment was scrubbed...
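The defaults described in the reply above can be checked against a local glusterd.vol with a small script. This is a minimal sketch and not part of glusterd: the function name `effective_port_range` and the parsing rules are assumptions for illustration, and the fallback values (49152, 65535, and 60999 from glusterfs-6 onwards) are simply the numbers quoted in this thread. Any line starting with `#` is treated as commented out.

```python
import re

# Defaults as quoted in this thread; the glusterfs-6 maximum comes from the
# reply above and has not been checked against the source tree here.
DEFAULT_BASE_PORT = 49152
DEFAULT_MAX_PORT = 65535
DEFAULT_MAX_PORT_V6 = 60999

_OPTION_RE = re.compile(r"^\s*option\s+(base-port|max-port)\s+(\d+)\s*$")


def effective_port_range(volfile_text, glusterfs_major=5):
    """Return (base_port, max_port) for the given glusterd.vol contents.

    Commented-out options are ignored, so the defaults apply for them.
    """
    base = DEFAULT_BASE_PORT
    top = DEFAULT_MAX_PORT_V6 if glusterfs_major >= 6 else DEFAULT_MAX_PORT
    for line in volfile_text.splitlines():
        if line.lstrip().startswith("#"):
            continue  # commented out: keep the default
        match = _OPTION_RE.match(line)
        if match is None:
            continue  # not a port option
        if match.group(1) == "base-port":
            base = int(match.group(2))
        else:
            top = int(match.group(2))
    return base, top


if __name__ == "__main__":
    sample = """
volume management
    type mgmt/glusterd
#   option base-port 49152
#   option max-port  65535
end-volume
"""
    print(effective_port_range(sample))
    print(effective_port_range(sample, glusterfs_major=6))
```

On a real node the text would come from reading the glusterd.vol file itself (its location depends on how gluster was installed), and the ports actually in use by bricks can be compared against `gluster volume status`.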
URL: From amukherj at redhat.com Thu Jan 31 03:24:25 2019 From: amukherj at redhat.com (Atin Mukherjee) Date: Thu, 31 Jan 2019 08:54:25 +0530 Subject: [Gluster-users] glusterfs 4.1.6 error in starting glusterd service In-Reply-To: References: Message-ID: I'm not sure how you ended up in a state where one of the nodes lost the information about one peer in the cluster. I suspect that an incorrect step during a replace-node operation landed you in this situation. Unless you can elaborate on all the steps you performed in the cluster, it will be difficult to figure out the exact cause. On Wed, Jan 30, 2019 at 7:25 PM Amudhan P wrote: > Hi Atin, > > yes, it worked out thank you. > > what would be the cause of this issue? > > > > On Fri, Jan 25, 2019 at 1:56 PM Atin Mukherjee > wrote: > >> Amudhan, >> >> So here's the issue: >> >> In node3, 'cat /var/lib/glusterd/peers/* ' doesn't show up node2's >> details and that's why glusterd wasn't able to resolve the brick(s) hosted >> on node2. >> >> Can you please pick up 0083ec0c-40bf-472a-a128-458924e56c96 file from >> /var/lib/glusterd/peers/ from node 4 and place it in the same location in >> node 3 and then restart glusterd service on node 3? >> >> >> On Thu, Jan 24, 2019 at 11:57 AM Amudhan P wrote: >> >>> Atin, >>> >>> Sorry, i missed to send entire `glusterd` folder. Now attached zip >>> contains `glusterd` folder from all nodes. >>> >>> the problem node is node3 IP 10.1.2.3, `glusterd` log file is inside >>> node3 folder. >>> >>> regards >>> Amudhan >>> >>> On Wed, Jan 23, 2019 at 11:02 PM Atin Mukherjee >>> wrote: >>> >>>> Amudhan, >>>> >>>> I see that you have provided the content of the configuration of the >>>> volume gfs-tst where the request was to share the dump of >>>> /var/lib/glusterd/* . I can not debug this further until you share the >>>> correct dump.
>>>> >>>> On Thu, Jan 17, 2019 at 3:43 PM Atin Mukherjee >>>> wrote: >>>> >>>>> Can you please run 'glusterd -LDEBUG' and share back the glusterd.log? >>>>> Instead of doing too many back and forth I suggest you to share the content >>>>> of /var/lib/glusterd from all the nodes. Also do mention which particular >>>>> node the glusterd service is unable to come up. >>>>> >>>>> On Thu, Jan 17, 2019 at 11:34 AM Amudhan P >>>>> wrote: >>>>> >>>>>> I have created the folder in the path as said but still, service >>>>>> failed to start below is the error msg in glusterd.log >>>>>> >>>>>> [2019-01-16 14:50:14.555742] I [MSGID: 100030] >>>>>> [glusterfsd.c:2741:main] 0-/usr/local/sbin/glusterd: Started running >>>>>> /usr/local/sbin/glusterd version 4.1.6 (args: /usr/local/sbin/glusterd -p >>>>>> /var/run/glusterd.pid) >>>>>> [2019-01-16 14:50:14.559835] I [MSGID: 106478] [glusterd.c:1423:init] >>>>>> 0-management: Maximum allowed open file descriptors set to 65536 >>>>>> [2019-01-16 14:50:14.559894] I [MSGID: 106479] [glusterd.c:1481:init] >>>>>> 0-management: Using /var/lib/glusterd as working directory >>>>>> [2019-01-16 14:50:14.559912] I [MSGID: 106479] [glusterd.c:1486:init] >>>>>> 0-management: Using /var/run/gluster as pid file working directory >>>>>> [2019-01-16 14:50:14.563834] W [MSGID: 103071] >>>>>> [rdma.c:4629:__gf_rdma_ctx_create] 0-rpc-transport/rdma: rdma_cm event >>>>>> channel creation failed [No such device] >>>>>> [2019-01-16 14:50:14.563867] W [MSGID: 103055] [rdma.c:4938:init] >>>>>> 0-rdma.management: Failed to initialize IB Device >>>>>> [2019-01-16 14:50:14.563882] W >>>>>> [rpc-transport.c:351:rpc_transport_load] 0-rpc-transport: 'rdma' >>>>>> initialization failed >>>>>> [2019-01-16 14:50:14.563957] W [rpcsvc.c:1781:rpcsvc_create_listener] >>>>>> 0-rpc-service: cannot create listener, initing the transport failed >>>>>> [2019-01-16 14:50:14.563974] E [MSGID: 106244] [glusterd.c:1764:init] >>>>>> 0-management: creation of 1 listeners failed, 
continuing with succeeded >>>>>> transport >>>>>> [2019-01-16 14:50:15.565868] I [MSGID: 106513] >>>>>> [glusterd-store.c:2240:glusterd_restore_op_version] 0-glusterd: retrieved >>>>>> op-version: 40100 >>>>>> [2019-01-16 14:50:15.642532] I [MSGID: 106544] >>>>>> [glusterd.c:158:glusterd_uuid_init] 0-management: retrieved UUID: >>>>>> d6bf51a7-c296-492f-8dac-e81efa9dd22d >>>>>> [2019-01-16 14:50:15.675333] I [MSGID: 106498] >>>>>> [glusterd-handler.c:3614:glusterd_friend_add_from_peerinfo] 0-management: >>>>>> connect returned 0 >>>>>> [2019-01-16 14:50:15.675421] W [MSGID: 106061] >>>>>> [glusterd-handler.c:3408:glusterd_transport_inet_options_build] 0-glusterd: >>>>>> Failed to get tcp-user-timeout >>>>>> [2019-01-16 14:50:15.675451] I >>>>>> [rpc-clnt.c:1059:rpc_clnt_connection_init] 0-management: setting >>>>>> frame-timeout to 600 >>>>>> *[2019-01-16 14:50:15.676912] E [MSGID: 106187] >>>>>> [glusterd-store.c:4662:glusterd_resolve_all_bricks] 0-glusterd: resolve >>>>>> brick failed in restore* >>>>>> *[2019-01-16 14:50:15.676956] E [MSGID: 101019] >>>>>> [xlator.c:720:xlator_init] 0-management: Initialization of volume >>>>>> 'management' failed, review your volfile again* >>>>>> [2019-01-16 14:50:15.676973] E [MSGID: 101066] >>>>>> [graph.c:367:glusterfs_graph_init] 0-management: initializing translator >>>>>> failed >>>>>> [2019-01-16 14:50:15.676986] E [MSGID: 101176] >>>>>> [graph.c:738:glusterfs_graph_activate] 0-graph: init failed >>>>>> [2019-01-16 14:50:15.677479] W [glusterfsd.c:1514:cleanup_and_exit] >>>>>> (-->/usr/local/sbin/glusterd(glusterfs_volumes_init+0xc2) [0x409f52] >>>>>> -->/usr/local/sbin/glusterd(glusterfs_process_volfp+0x151) [0x409e41] >>>>>> -->/usr/local/sbin/glusterd(cleanup_and_exit+0x5f) [0x40942f] ) 0-: >>>>>> received signum (-1), shutting down >>>>>> >>>>>> >>>>>> On Thu, Jan 17, 2019 at 8:06 AM Atin Mukherjee >>>>>> wrote: >>>>>> >>>>>>> If gluster volume info/status shows the brick to be >>>>>>> /media/disk4/brick4 then you'd 
need to mount the same path and hence you'd >>>>>>> need to create the brick4 directory explicitly. I fail to understand the >>>>>>> rationale how only /media/disk4 can be used as the mount path for the >>>>>>> brick. >>>>>>> >>>>>>> On Wed, Jan 16, 2019 at 5:24 PM Amudhan P >>>>>>> wrote: >>>>>>> >>>>>>>> Yes, I did mount bricks but the folder 'brick4' was still not >>>>>>>> created inside the brick. >>>>>>>> Do I need to create this folder because when I run replace-brick it >>>>>>>> will create folder inside the brick. I have seen this behavior before when >>>>>>>> running replace-brick or heal begins. >>>>>>>> >>>>>>>> On Wed, Jan 16, 2019 at 5:05 PM Atin Mukherjee >>>>>>>> wrote: >>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> On Wed, Jan 16, 2019 at 5:02 PM Amudhan P >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>>> Atin, >>>>>>>>>> I have copied the content of 'gfs-tst' from vol folder in another >>>>>>>>>> node. when starting service again fails with error msg in glusterd.log file. >>>>>>>>>> >>>>>>>>>> [2019-01-15 20:16:59.513023] I [MSGID: 100030] >>>>>>>>>> [glusterfsd.c:2741:main] 0-/usr/local/sbin/glusterd: Started running >>>>>>>>>> /usr/local/sbin/glusterd version 4.1.6 (args: /usr/local/sbin/glusterd -p >>>>>>>>>> /var/run/glusterd.pid) >>>>>>>>>> [2019-01-15 20:16:59.517164] I [MSGID: 106478] >>>>>>>>>> [glusterd.c:1423:init] 0-management: Maximum allowed open file descriptors >>>>>>>>>> set to 65536 >>>>>>>>>> [2019-01-15 20:16:59.517264] I [MSGID: 106479] >>>>>>>>>> [glusterd.c:1481:init] 0-management: Using /var/lib/glusterd as working >>>>>>>>>> directory >>>>>>>>>> [2019-01-15 20:16:59.517283] I [MSGID: 106479] >>>>>>>>>> [glusterd.c:1486:init] 0-management: Using /var/run/gluster as pid file >>>>>>>>>> working directory >>>>>>>>>> [2019-01-15 20:16:59.521508] W [MSGID: 103071] >>>>>>>>>> [rdma.c:4629:__gf_rdma_ctx_create] 0-rpc-transport/rdma: rdma_cm event >>>>>>>>>> channel creation failed [No such device] >>>>>>>>>> [2019-01-15 20:16:59.521544] W [MSGID: 
103055] [rdma.c:4938:init] >>>>>>>>>> 0-rdma.management: Failed to initialize IB Device >>>>>>>>>> [2019-01-15 20:16:59.521562] W >>>>>>>>>> [rpc-transport.c:351:rpc_transport_load] 0-rpc-transport: 'rdma' >>>>>>>>>> initialization failed >>>>>>>>>> [2019-01-15 20:16:59.521629] W >>>>>>>>>> [rpcsvc.c:1781:rpcsvc_create_listener] 0-rpc-service: cannot create >>>>>>>>>> listener, initing the transport failed >>>>>>>>>> [2019-01-15 20:16:59.521648] E [MSGID: 106244] >>>>>>>>>> [glusterd.c:1764:init] 0-management: creation of 1 listeners failed, >>>>>>>>>> continuing with succeeded transport >>>>>>>>>> [2019-01-15 20:17:00.529390] I [MSGID: 106513] >>>>>>>>>> [glusterd-store.c:2240:glusterd_restore_op_version] 0-glusterd: retrieved >>>>>>>>>> op-version: 40100 >>>>>>>>>> [2019-01-15 20:17:00.608354] I [MSGID: 106544] >>>>>>>>>> [glusterd.c:158:glusterd_uuid_init] 0-management: retrieved UUID: >>>>>>>>>> d6bf51a7-c296-492f-8dac-e81efa9dd22d >>>>>>>>>> [2019-01-15 20:17:00.650911] W [MSGID: 106425] >>>>>>>>>> [glusterd-store.c:2643:glusterd_store_retrieve_bricks] 0-management: failed >>>>>>>>>> to get statfs() call on brick /media/disk4/brick4 [No such file or >>>>>>>>>> directory] >>>>>>>>>> >>>>>>>>> >>>>>>>>> This means that underlying brick /media/disk4/brick4 doesn't >>>>>>>>> exist. You already mentioned that you had replaced the faulty disk, but >>>>>>>>> have you not mounted it yet? 
>>>>>>>>> >>>>>>>>> >>>>>>>>>> [2019-01-15 20:17:00.691240] I [MSGID: 106498] >>>>>>>>>> [glusterd-handler.c:3614:glusterd_friend_add_from_peerinfo] 0-management: >>>>>>>>>> connect returned 0 >>>>>>>>>> [2019-01-15 20:17:00.691307] W [MSGID: 106061] >>>>>>>>>> [glusterd-handler.c:3408:glusterd_transport_inet_options_build] 0-glusterd: >>>>>>>>>> Failed to get tcp-user-timeout >>>>>>>>>> [2019-01-15 20:17:00.691331] I >>>>>>>>>> [rpc-clnt.c:1059:rpc_clnt_connection_init] 0-management: setting >>>>>>>>>> frame-timeout to 600 >>>>>>>>>> [2019-01-15 20:17:00.692547] E [MSGID: 106187] >>>>>>>>>> [glusterd-store.c:4662:glusterd_resolve_all_bricks] 0-glusterd: resolve >>>>>>>>>> brick failed in restore >>>>>>>>>> [2019-01-15 20:17:00.692582] E [MSGID: 101019] >>>>>>>>>> [xlator.c:720:xlator_init] 0-management: Initialization of volume >>>>>>>>>> 'management' failed, review your volfile again >>>>>>>>>> [2019-01-15 20:17:00.692597] E [MSGID: 101066] >>>>>>>>>> [graph.c:367:glusterfs_graph_init] 0-management: initializing translator >>>>>>>>>> failed >>>>>>>>>> [2019-01-15 20:17:00.692607] E [MSGID: 101176] >>>>>>>>>> [graph.c:738:glusterfs_graph_activate] 0-graph: init failed >>>>>>>>>> [2019-01-15 20:17:00.693004] W >>>>>>>>>> [glusterfsd.c:1514:cleanup_and_exit] >>>>>>>>>> (-->/usr/local/sbin/glusterd(glusterfs_volumes_init+0xc2) [0x409f52] >>>>>>>>>> -->/usr/local/sbin/glusterd(glusterfs_process_volfp+0x151) [0x409e41] >>>>>>>>>> -->/usr/local/sbin/glusterd(cleanup_and_exit+0x5f) [0x40942f] ) 0-: >>>>>>>>>> received signum (-1), shutting down >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On Wed, Jan 16, 2019 at 4:34 PM Atin Mukherjee < >>>>>>>>>> amukherj at redhat.com> wrote: >>>>>>>>>> >>>>>>>>>>> This is a case of partial write of a transaction and as the host >>>>>>>>>>> ran out of space for the root partition where all the glusterd related >>>>>>>>>>> configurations are persisted, the transaction couldn't be written and hence >>>>>>>>>>> the new (replaced) brick's information 
wasn't persisted in the >>>>>>>>>>> configuration. The workaround for this is to copy the content of >>>>>>>>>>> /var/lib/glusterd/vols/gfs-tst/ from one of the nodes in the trusted >>>>>>>>>>> storage pool to the node where glusterd service fails to come up and post >>>>>>>>>>> that restarting the glusterd service should be able to make peer status >>>>>>>>>>> reporting all nodes healthy and connected. >>>>>>>>>>> >>>>>>>>>>> On Wed, Jan 16, 2019 at 3:49 PM Amudhan P >>>>>>>>>>> wrote: >>>>>>>>>>> >>>>>>>>>>>> Hi, >>>>>>>>>>>> >>>>>>>>>>>> In short, when I started glusterd service I am getting >>>>>>>>>>>> following error msg in the glusterd.log file in one server. >>>>>>>>>>>> what needs to be done? >>>>>>>>>>>> >>>>>>>>>>>> error logged in glusterd.log >>>>>>>>>>>> >>>>>>>>>>>> [2019-01-15 17:50:13.956053] I [MSGID: 100030] >>>>>>>>>>>> [glusterfsd.c:2741:main] 0-/usr/local/sbin/glusterd: Started running >>>>>>>>>>>> /usr/local/sbin/glusterd version 4.1.6 (args: /usr/local/sbin/glusterd -p >>>>>>>>>>>> /var/run/glusterd.pid) >>>>>>>>>>>> [2019-01-15 17:50:13.960131] I [MSGID: 106478] >>>>>>>>>>>> [glusterd.c:1423:init] 0-management: Maximum allowed open file descriptors >>>>>>>>>>>> set to 65536 >>>>>>>>>>>> [2019-01-15 17:50:13.960193] I [MSGID: 106479] >>>>>>>>>>>> [glusterd.c:1481:init] 0-management: Using /var/lib/glusterd as working >>>>>>>>>>>> directory >>>>>>>>>>>> [2019-01-15 17:50:13.960212] I [MSGID: 106479] >>>>>>>>>>>> [glusterd.c:1486:init] 0-management: Using /var/run/gluster as pid file >>>>>>>>>>>> working directory >>>>>>>>>>>> [2019-01-15 17:50:13.964437] W [MSGID: 103071] >>>>>>>>>>>> [rdma.c:4629:__gf_rdma_ctx_create] 0-rpc-transport/rdma: rdma_cm event >>>>>>>>>>>> channel creation failed [No such device] >>>>>>>>>>>> [2019-01-15 17:50:13.964474] W [MSGID: 103055] >>>>>>>>>>>> [rdma.c:4938:init] 0-rdma.management: Failed to initialize IB Device >>>>>>>>>>>> [2019-01-15 17:50:13.964491] W >>>>>>>>>>>> [rpc-transport.c:351:rpc_transport_load] 
0-rpc-transport: 'rdma' >>>>>>>>>>>> initialization failed >>>>>>>>>>>> [2019-01-15 17:50:13.964560] W >>>>>>>>>>>> [rpcsvc.c:1781:rpcsvc_create_listener] 0-rpc-service: cannot create >>>>>>>>>>>> listener, initing the transport failed >>>>>>>>>>>> [2019-01-15 17:50:13.964579] E [MSGID: 106244] >>>>>>>>>>>> [glusterd.c:1764:init] 0-management: creation of 1 listeners failed, >>>>>>>>>>>> continuing with succeeded transport >>>>>>>>>>>> [2019-01-15 17:50:14.967681] I [MSGID: 106513] >>>>>>>>>>>> [glusterd-store.c:2240:glusterd_restore_op_version] 0-glusterd: retrieved >>>>>>>>>>>> op-version: 40100 >>>>>>>>>>>> [2019-01-15 17:50:14.973931] I [MSGID: 106544] >>>>>>>>>>>> [glusterd.c:158:glusterd_uuid_init] 0-management: retrieved UUID: >>>>>>>>>>>> d6bf51a7-c296-492f-8dac-e81efa9dd22d >>>>>>>>>>>> [2019-01-15 17:50:15.046620] E [MSGID: 101032] >>>>>>>>>>>> [store.c:441:gf_store_handle_retrieve] 0-: Path corresponding to >>>>>>>>>>>> /var/lib/glusterd/vols/gfs-tst/bricks/IP.3:-media-disk3-brick3. [No such >>>>>>>>>>>> file or directory] >>>>>>>>>>>> [2019-01-15 17:50:15.046685] E [MSGID: 106201] >>>>>>>>>>>> [glusterd-store.c:3384:glusterd_store_retrieve_volumes] 0-management: >>>>>>>>>>>> Unable to restore volume: gfs-tst >>>>>>>>>>>> [2019-01-15 17:50:15.046718] E [MSGID: 101019] >>>>>>>>>>>> [xlator.c:720:xlator_init] 0-management: Initialization of volume >>>>>>>>>>>> 'management' failed, review your volfile again >>>>>>>>>>>> [2019-01-15 17:50:15.046732] E [MSGID: 101066] >>>>>>>>>>>> [graph.c:367:glusterfs_graph_init] 0-management: initializing translator >>>>>>>>>>>> failed >>>>>>>>>>>> [2019-01-15 17:50:15.046741] E [MSGID: 101176] >>>>>>>>>>>> [graph.c:738:glusterfs_graph_activate] 0-graph: init failed >>>>>>>>>>>> [2019-01-15 17:50:15.047171] W >>>>>>>>>>>> [glusterfsd.c:1514:cleanup_and_exit] >>>>>>>>>>>> (-->/usr/local/sbin/glusterd(glusterfs_volumes >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> In long, I am trying to simulate a situation. 
where volume >>>>>>>>>>>> stoped abnormally and >>>>>>>>>>>> entire cluster restarted with some missing disks. >>>>>>>>>>>> >>>>>>>>>>>> My test cluster is set up with 3 nodes and each has four disks, >>>>>>>>>>>> I have setup a volume with disperse 4+2. >>>>>>>>>>>> In Node-3 2 disks have failed, to replace I have shutdown all >>>>>>>>>>>> system >>>>>>>>>>>> >>>>>>>>>>>> below are the steps done. >>>>>>>>>>>> >>>>>>>>>>>> 1. umount from client machine >>>>>>>>>>>> 2. shutdown all system by running `shutdown -h now` command ( >>>>>>>>>>>> without stopping volume and stop service) >>>>>>>>>>>> 3. replace faulty disk in Node-3 >>>>>>>>>>>> 4. powered ON all system >>>>>>>>>>>> 5. format replaced drives, and mount all drives >>>>>>>>>>>> 6. start glusterd service in all node (success) >>>>>>>>>>>> 7. Now running `voulume status` command from node-3 >>>>>>>>>>>> output : [2019-01-15 16:52:17.718422] : v status : FAILED : >>>>>>>>>>>> Staging failed on 0083ec0c-40bf-472a-a128-458924e56c96. Please check log >>>>>>>>>>>> file for details. >>>>>>>>>>>> 8. running `voulume start gfs-tst` command from node-3 >>>>>>>>>>>> output : [2019-01-15 16:53:19.410252] : v start gfs-tst : >>>>>>>>>>>> FAILED : Volume gfs-tst already started >>>>>>>>>>>> >>>>>>>>>>>> 9. running `gluster v status` in other node. 
showing all brick >>>>>>>>>>>> available but 'self-heal daemon' not running >>>>>>>>>>>> @gfstst-node2:~$ sudo gluster v status >>>>>>>>>>>> Status of volume: gfs-tst >>>>>>>>>>>> Gluster process TCP Port RDMA >>>>>>>>>>>> Port Online Pid >>>>>>>>>>>> >>>>>>>>>>>> ------------------------------------------------------------------------------ >>>>>>>>>>>> Brick IP.2:/media/disk1/brick1 49152 0 Y >>>>>>>>>>>> 1517 >>>>>>>>>>>> Brick IP.4:/media/disk1/brick1 49152 0 Y >>>>>>>>>>>> 1668 >>>>>>>>>>>> Brick IP.2:/media/disk2/brick2 49153 0 Y >>>>>>>>>>>> 1522 >>>>>>>>>>>> Brick IP.4:/media/disk2/brick2 49153 0 Y >>>>>>>>>>>> 1678 >>>>>>>>>>>> Brick IP.2:/media/disk3/brick3 49154 0 Y >>>>>>>>>>>> 1527 >>>>>>>>>>>> Brick IP.4:/media/disk3/brick3 49154 0 Y >>>>>>>>>>>> 1677 >>>>>>>>>>>> Brick IP.2:/media/disk4/brick4 49155 0 Y >>>>>>>>>>>> 1541 >>>>>>>>>>>> Brick IP.4:/media/disk4/brick4 49155 0 Y >>>>>>>>>>>> 1683 >>>>>>>>>>>> Self-heal Daemon on localhost N/A N/A >>>>>>>>>>>> Y 2662 >>>>>>>>>>>> Self-heal Daemon on IP.4 N/A N/A Y >>>>>>>>>>>> 2786 >>>>>>>>>>>> >>>>>>>>>>>> 10. in the above output 'volume already started'. so, running >>>>>>>>>>>> `reset-brick` command >>>>>>>>>>>> v reset-brick gfs-tst IP.3:/media/disk3/brick3 >>>>>>>>>>>> IP.3:/media/disk3/brick3 commit force >>>>>>>>>>>> >>>>>>>>>>>> output : [2019-01-15 16:57:37.916942] : v reset-brick gfs-tst >>>>>>>>>>>> IP.3:/media/disk3/brick3 IP.3:/media/disk3/brick3 commit force : FAILED : >>>>>>>>>>>> /media/disk3/brick3 is already part of a volume >>>>>>>>>>>> >>>>>>>>>>>> 11. reset-brick command was not working, so, tried stopping >>>>>>>>>>>> volume and start with force command >>>>>>>>>>>> output : [2019-01-15 17:01:04.570794] : v start gfs-tst force >>>>>>>>>>>> : FAILED : Pre-validation failed on localhost. Please check log file for >>>>>>>>>>>> details >>>>>>>>>>>> >>>>>>>>>>>> 12. now stopped service in all node and tried starting again. 
>>>>>>>>>>>> except node-3 other nodes service started successfully without any issues. >>>>>>>>>>>> >>>>>>>>>>>> in node-3 receiving following message. >>>>>>>>>>>> >>>>>>>>>>>> sudo service glusterd start >>>>>>>>>>>> * Starting glusterd service glusterd >>>>>>>>>>>> >>>>>>>>>>>> [fail] >>>>>>>>>>>> /usr/local/sbin/glusterd: option requires an argument -- 'f' >>>>>>>>>>>> Try `glusterd --help' or `glusterd --usage' for more >>>>>>>>>>>> information. >>>>>>>>>>>> >>>>>>>>>>>> 13. checking glusterd log file found that OS drive was running >>>>>>>>>>>> out of space >>>>>>>>>>>> output : [2019-01-15 16:51:37.210792] W [MSGID: 101012] >>>>>>>>>>>> [store.c:372:gf_store_save_value] 0-management: fflush failed. [No space >>>>>>>>>>>> left on device] >>>>>>>>>>>> [2019-01-15 16:51:37.210874] E [MSGID: 106190] >>>>>>>>>>>> [glusterd-store.c:1058:glusterd_volume_exclude_options_write] 0-management: >>>>>>>>>>>> Unable to write volume values for gfs-tst >>>>>>>>>>>> >>>>>>>>>>>> 14. cleared some space in OS drive but still, service is not >>>>>>>>>>>> running. 
below is the error logged in glusterd.log >>>>>>>>>>>> >>>>>>>>>>>> [2019-01-15 17:50:13.956053] I [MSGID: 100030] >>>>>>>>>>>> [glusterfsd.c:2741:main] 0-/usr/local/sbin/glusterd: Started running >>>>>>>>>>>> /usr/local/sbin/glusterd version 4.1.6 (args: /usr/local/sbin/glusterd -p >>>>>>>>>>>> /var/run/glusterd.pid) >>>>>>>>>>>> [2019-01-15 17:50:13.960131] I [MSGID: 106478] >>>>>>>>>>>> [glusterd.c:1423:init] 0-management: Maximum allowed open file descriptors >>>>>>>>>>>> set to 65536 >>>>>>>>>>>> [2019-01-15 17:50:13.960193] I [MSGID: 106479] >>>>>>>>>>>> [glusterd.c:1481:init] 0-management: Using /var/lib/glusterd as working >>>>>>>>>>>> directory >>>>>>>>>>>> [2019-01-15 17:50:13.960212] I [MSGID: 106479] >>>>>>>>>>>> [glusterd.c:1486:init] 0-management: Using /var/run/gluster as pid file >>>>>>>>>>>> working directory >>>>>>>>>>>> [2019-01-15 17:50:13.964437] W [MSGID: 103071] >>>>>>>>>>>> [rdma.c:4629:__gf_rdma_ctx_create] 0-rpc-transport/rdma: rdma_cm event >>>>>>>>>>>> channel creation failed [No such device] >>>>>>>>>>>> [2019-01-15 17:50:13.964474] W [MSGID: 103055] >>>>>>>>>>>> [rdma.c:4938:init] 0-rdma.management: Failed to initialize IB Device >>>>>>>>>>>> [2019-01-15 17:50:13.964491] W >>>>>>>>>>>> [rpc-transport.c:351:rpc_transport_load] 0-rpc-transport: 'rdma' >>>>>>>>>>>> initialization failed >>>>>>>>>>>> [2019-01-15 17:50:13.964560] W >>>>>>>>>>>> [rpcsvc.c:1781:rpcsvc_create_listener] 0-rpc-service: cannot create >>>>>>>>>>>> listener, initing the transport failed >>>>>>>>>>>> [2019-01-15 17:50:13.964579] E [MSGID: 106244] >>>>>>>>>>>> [glusterd.c:1764:init] 0-management: creation of 1 listeners failed, >>>>>>>>>>>> continuing with succeeded transport >>>>>>>>>>>> [2019-01-15 17:50:14.967681] I [MSGID: 106513] >>>>>>>>>>>> [glusterd-store.c:2240:glusterd_restore_op_version] 0-glusterd: retrieved >>>>>>>>>>>> op-version: 40100 >>>>>>>>>>>> [2019-01-15 17:50:14.973931] I [MSGID: 106544] >>>>>>>>>>>> [glusterd.c:158:glusterd_uuid_init] 
0-management: retrieved UUID: >>>>>>>>>>>> d6bf51a7-c296-492f-8dac-e81efa9dd22d >>>>>>>>>>>> [2019-01-15 17:50:15.046620] E [MSGID: 101032] >>>>>>>>>>>> [store.c:441:gf_store_handle_retrieve] 0-: Path corresponding to >>>>>>>>>>>> /var/lib/glusterd/vols/gfs-tst/bricks/IP.3:-media-disk3-brick3. [No such >>>>>>>>>>>> file or directory] >>>>>>>>>>>> [2019-01-15 17:50:15.046685] E [MSGID: 106201] >>>>>>>>>>>> [glusterd-store.c:3384:glusterd_store_retrieve_volumes] 0-management: >>>>>>>>>>>> Unable to restore volume: gfs-tst >>>>>>>>>>>> [2019-01-15 17:50:15.046718] E [MSGID: 101019] >>>>>>>>>>>> [xlator.c:720:xlator_init] 0-management: Initialization of volume >>>>>>>>>>>> 'management' failed, review your volfile again >>>>>>>>>>>> [2019-01-15 17:50:15.046732] E [MSGID: 101066] >>>>>>>>>>>> [graph.c:367:glusterfs_graph_init] 0-management: initializing translator >>>>>>>>>>>> failed >>>>>>>>>>>> [2019-01-15 17:50:15.046741] E [MSGID: 101176] >>>>>>>>>>>> [graph.c:738:glusterfs_graph_activate] 0-graph: init failed >>>>>>>>>>>> [2019-01-15 17:50:15.047171] W >>>>>>>>>>>> [glusterfsd.c:1514:cleanup_and_exit] >>>>>>>>>>>> (-->/usr/local/sbin/glusterd(glusterfs_volumes_init+0xc2) [0x409f52] >>>>>>>>>>>> -->/usr/local/sbin/glusterd(glusterfs_process_volfp+0x151) [0x409e41] >>>>>>>>>>>> -->/usr/local/sbin/glusterd(cleanup_and_exit+0x5f) [0x40942f] ) 0-: >>>>>>>>>>>> received signum (-1), shutting down >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> 15. 
In other node running `volume status' still shows bricks >>>>>>>>>>>> node3 is live >>>>>>>>>>>> but 'peer status' showing node-3 disconnected >>>>>>>>>>>> >>>>>>>>>>>> @gfstst-node2:~$ sudo gluster v status >>>>>>>>>>>> Status of volume: gfs-tst >>>>>>>>>>>> Gluster process TCP Port RDMA >>>>>>>>>>>> Port Online Pid >>>>>>>>>>>> >>>>>>>>>>>> ------------------------------------------------------------------------------ >>>>>>>>>>>> Brick IP.2:/media/disk1/brick1 49152 0 Y >>>>>>>>>>>> 1517 >>>>>>>>>>>> Brick IP.4:/media/disk1/brick1 49152 0 Y >>>>>>>>>>>> 1668 >>>>>>>>>>>> Brick IP.2:/media/disk2/brick2 49153 0 Y >>>>>>>>>>>> 1522 >>>>>>>>>>>> Brick IP.4:/media/disk2/brick2 49153 0 Y >>>>>>>>>>>> 1678 >>>>>>>>>>>> Brick IP.2:/media/disk3/brick3 49154 0 Y >>>>>>>>>>>> 1527 >>>>>>>>>>>> Brick IP.4:/media/disk3/brick3 49154 0 Y >>>>>>>>>>>> 1677 >>>>>>>>>>>> Brick IP.2:/media/disk4/brick4 49155 0 Y >>>>>>>>>>>> 1541 >>>>>>>>>>>> Brick IP.4:/media/disk4/brick4 49155 0 Y >>>>>>>>>>>> 1683 >>>>>>>>>>>> Self-heal Daemon on localhost N/A N/A Y >>>>>>>>>>>> 2662 >>>>>>>>>>>> Self-heal Daemon on IP.4 N/A N/A Y >>>>>>>>>>>> 2786 >>>>>>>>>>>> >>>>>>>>>>>> Task Status of Volume gfs-tst >>>>>>>>>>>> >>>>>>>>>>>> ------------------------------------------------------------------------------ >>>>>>>>>>>> There are no active volume tasks >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> root at gfstst-node2:~$ sudo gluster pool list >>>>>>>>>>>> UUID Hostname State >>>>>>>>>>>> d6bf51a7-c296-492f-8dac-e81efa9dd22d IP.3 Disconnected >>>>>>>>>>>> c1cbb58e-3ceb-4637-9ba3-3d28ef20b143 IP.4 Connected >>>>>>>>>>>> 0083ec0c-40bf-472a-a128-458924e56c96 localhost >>>>>>>>>>>> Connected >>>>>>>>>>>> >>>>>>>>>>>> root at gfstst-node2:~$ sudo gluster peer status >>>>>>>>>>>> Number of Peers: 2 >>>>>>>>>>>> >>>>>>>>>>>> Hostname: IP.3 >>>>>>>>>>>> Uuid: d6bf51a7-c296-492f-8dac-e81efa9dd22d >>>>>>>>>>>> State: Peer in Cluster (Disconnected) >>>>>>>>>>>> >>>>>>>>>>>> Hostname: IP.4 >>>>>>>>>>>> Uuid: 
c1cbb58e-3ceb-4637-9ba3-3d28ef20b143 >>>>>>>>>>>> State: Peer in Cluster (Connected) >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> regards >>>>>>>>>>>> Amudhan >>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>> Gluster-users mailing list >>>>>>>>>>>> Gluster-users at gluster.org >>>>>>>>>>>> https://lists.gluster.org/mailman/listinfo/gluster-users >>>>>>>>>>> >>>>>>>>>>> -------------- next part -------------- An HTML attachment was scrubbed... URL: From amukherj at redhat.com Thu Jan 31 03:33:06 2019 From: amukherj at redhat.com (Atin Mukherjee) Date: Thu, 31 Jan 2019 09:03:06 +0530 Subject: [Gluster-users] Is it required for a node to meet quorum over all the nodes in storage pool? In-Reply-To: References: Message-ID: On Fri, Jan 25, 2019 at 1:41 PM Jeevan Patnaik wrote: > Hi, > > I'm just going through the concepts of quorum and split-brains with a > cluster in general, and trying to understand GlusterFS quorums again which > I previously found difficult to accurately understand. > > When we talk about server quorums, what I understand is that the concept > is similar to STONITH in cluster i.e., we shoot the node that probably have > issues/ make the bricks down preventing access at all. But I don't get how > it calculates quorum. > > My understanding: > In a distributed replicated volume, > 1. All bricks in a replica set should have same data writes and hence, it > is required to meet atleast 51% quorum on those replica sets. Now > considering following 3x replica configuration: > ServerA,B,C,D,E,F-> brickA,B,C,D,E,F respectively and serverG without any > brick in storage pool. > Please note server quorum isn't calculated based on number of active bricks rather number of active nodes in the cluster. So in this case even if server G doesn't host any bricks in the storage pool, the quorum will be decided based on total number of servers/peers in the cluster vs total number of active peers in the cluster. 
If you'd like to know more about it, please refer to
https://staged-gluster-docs.readthedocs.io/en/release3.7.0beta1/Features/server-quorum/
or
https://github.com/gluster/glusterfs/blob/master/xlators/mgmt/glusterd/src/glusterd-server-quorum.c#L281
(in case you are happy to browse the source code and understand the logic).

> Scenario:
> ServerA,B,F formed a partition, i.e., they are isolated from the other nodes in
> the storage pool.
>
> But the bricks on serverA,B,C belong to the same sub-volume. Hence, if we
> consider quorum over sub-volumes, A and B meet quorum for their only
> participating sub-volume and can serve the corresponding bricks. And the
> corresponding bricks on C should go down.
>
> But when we consider quorum over the storage pool, C,D,E,G meet quorum
> whereas A,B,F do not. Hence, bricks on A,B,F should fail. And for C, the
> quorum still will not be met for its sub-volume. So, it will go to
> read-only mode. The sub-volume on D and E should work normally.
>
> So, with the assumption that only sub-volume quorum is considered, we don't
> have any downtime on sub-volumes, but we have two partitions and if clients
> can access both, clients can still write and read on both the partitions
> separately and without data conflict. The split-brain problem arises when
> some clients can access one partition and some the other.
>
> If quorum is considered for the entire storage pool, then this split-brain
> will not be seen as the problem nodes will be dead.
>
> And so why is it not mandatory to enable server quorum to avoid this
> split-brain issue?
>
> And I also assume that the quorum percentage should be greater than 50%.
> There's an option to set a custom percentage. Why is it required?
> If all that is required is to kill the problem node partition (group) by
> identifying if it has the largest possible share (i.e. greater than 50),
> does the percentage really matter?
>
> Thanks in advance!
>
> Regards,
> Jeevan.
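As a rough illustration of the pool-level counting described in this reply, the scenario above with seven peers can be worked through as below. This is a simplified sketch, not glusterd's actual code: the function name `has_server_quorum` is hypothetical, and both the "strictly more than half" default and the treatment of a custom ratio as a minimum percentage are assumptions made for illustration. The glusterd-server-quorum.c file linked above remains the authoritative logic.

```python
from typing import Optional


def has_server_quorum(active_peers: int, total_peers: int,
                      ratio_percent: Optional[float] = None) -> bool:
    """Return True if a partition with `active_peers` reachable peers keeps quorum.

    Counts peers (servers) in the trusted storage pool, not bricks.
    ASSUMPTION: without a custom ratio, strictly more than half must be active;
    with a custom ratio, at least that percentage of peers must be active.
    """
    if total_peers <= 0 or not 0 <= active_peers <= total_peers:
        raise ValueError("peer counts out of range")
    if ratio_percent is None:
        return 2 * active_peers > total_peers
    return active_peers * 100 >= ratio_percent * total_peers


# Pool from the scenario: servers A-F host bricks, server G hosts none.
pool = ["A", "B", "C", "D", "E", "F", "G"]
side_one = ["A", "B", "F"]
side_two = ["C", "D", "E", "G"]

for side in (side_one, side_two):
    print(",".join(side), "keeps quorum:", has_server_quorum(len(side), len(pool)))
```

Under these assumptions the A,B,F side holds 3 of 7 peers and fails the check, while C,D,E,G holds 4 of 7 and passes, even though serverG hosts no brick. Without serverG, a 3/3 split would leave neither side with more than half.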
From amudhan83 at gmail.com Thu Jan 31 06:20:19 2019
From: amudhan83 at gmail.com (Amudhan P)
Date: Thu, 31 Jan 2019 11:50:19 +0530
Subject: [Gluster-users] glusterfs 4.1.6 error in starting glusterd service
In-Reply-To:
References:
Message-ID:

Hi Atin,

These are exactly the steps I performed that led to the failure. In addition, the OS drive on node3 was running out of space when the service failed, so I cleared some space on the OS drive, but the service still failed to start.

I was trying to simulate a situation where the volume stopped abnormally and the entire cluster restarted with some missing disks.

My test cluster is set up with 3 nodes, each with four disks, and I have set up a volume with disperse 4+2.
On Node-3, 2 disks failed; to replace them I shut down all systems.

Below are the steps done.

1. umount from client machine
2. shut down all systems by running the `shutdown -h now` command (without stopping the volume or the service)
3. replace faulty disk in Node-3
4. powered ON all systems
5. format replaced drives, and mount all drives
6. start glusterd service on all nodes (success)
7. Now running `volume status` command from node-3
output : [2019-01-15 16:52:17.718422] : v status : FAILED : Staging failed on 0083ec0c-40bf-472a-a128-458924e56c96. Please check log file for details.
8. running `volume start gfs-tst` command from node-3
output : [2019-01-15 16:53:19.410252] : v start gfs-tst : FAILED : Volume gfs-tst already started

9. running `gluster v status` on another node.
showing all bricks available but 'self-heal daemon' not running
@gfstst-node2:~$ sudo gluster v status
Status of volume: gfs-tst
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick IP.2:/media/disk1/brick1              49152     0          Y       1517
Brick IP.4:/media/disk1/brick1              49152     0          Y       1668
Brick IP.2:/media/disk2/brick2              49153     0          Y       1522
Brick IP.4:/media/disk2/brick2              49153     0          Y       1678
Brick IP.2:/media/disk3/brick3              49154     0          Y       1527
Brick IP.4:/media/disk3/brick3              49154     0          Y       1677
Brick IP.2:/media/disk4/brick4              49155     0          Y       1541
Brick IP.4:/media/disk4/brick4              49155     0          Y       1683
Self-heal Daemon on localhost               N/A       N/A        Y       2662
Self-heal Daemon on IP.4                    N/A       N/A        Y       2786

10. since the earlier output said 'volume already started', ran the `reset-brick` command
v reset-brick gfs-tst IP.3:/media/disk3/brick3 IP.3:/media/disk3/brick3 commit force

output : [2019-01-15 16:57:37.916942] : v reset-brick gfs-tst IP.3:/media/disk3/brick3 IP.3:/media/disk3/brick3 commit force : FAILED : /media/disk3/brick3 is already part of a volume

11. the reset-brick command was not working, so I tried stopping the volume and starting it with the force option
output : [2019-01-15 17:01:04.570794] : v start gfs-tst force : FAILED : Pre-validation failed on localhost. Please check log file for details

12. now stopped the service on all nodes and tried starting it again. Except on node-3, the service started successfully on the other nodes without any issues.

on node-3 I receive the following message.

sudo service glusterd start
 * Starting glusterd service glusterd                                    [fail]
/usr/local/sbin/glusterd: option requires an argument -- 'f'
Try `glusterd --help' or `glusterd --usage' for more information.

13. checking the glusterd log file, I found that the OS drive was running out of space
output : [2019-01-15 16:51:37.210792] W [MSGID: 101012] [store.c:372:gf_store_save_value] 0-management: fflush failed. [No space left on device]
[2019-01-15 16:51:37.210874] E [MSGID: 106190] [glusterd-store.c:1058:glusterd_volume_exclude_options_write] 0-management: Unable to write volume values for gfs-tst

14. cleared some space on the OS drive, but the service is still not running. Below is the error logged in glusterd.log

[2019-01-15 17:50:13.956053] I [MSGID: 100030] [glusterfsd.c:2741:main] 0-/usr/local/sbin/glusterd: Started running /usr/local/sbin/glusterd version 4.1.6 (args: /usr/local/sbin/glusterd -p /var/run/glusterd.pid)
[2019-01-15 17:50:13.960131] I [MSGID: 106478] [glusterd.c:1423:init] 0-management: Maximum allowed open file descriptors set to 65536
[2019-01-15 17:50:13.960193] I [MSGID: 106479] [glusterd.c:1481:init] 0-management: Using /var/lib/glusterd as working directory
[2019-01-15 17:50:13.960212] I [MSGID: 106479] [glusterd.c:1486:init] 0-management: Using /var/run/gluster as pid file working directory
[2019-01-15 17:50:13.964437] W [MSGID: 103071] [rdma.c:4629:__gf_rdma_ctx_create] 0-rpc-transport/rdma: rdma_cm event channel creation failed [No such device]
[2019-01-15 17:50:13.964474] W [MSGID: 103055] [rdma.c:4938:init] 0-rdma.management: Failed to initialize IB Device
[2019-01-15 17:50:13.964491] W [rpc-transport.c:351:rpc_transport_load] 0-rpc-transport: 'rdma' initialization failed
[2019-01-15 17:50:13.964560] W [rpcsvc.c:1781:rpcsvc_create_listener] 0-rpc-service: cannot create listener, initing the transport failed
[2019-01-15 17:50:13.964579] E [MSGID: 106244] [glusterd.c:1764:init] 0-management: creation of 1 listeners failed, continuing with succeeded transport
[2019-01-15 17:50:14.967681] I [MSGID: 106513] [glusterd-store.c:2240:glusterd_restore_op_version] 0-glusterd: retrieved op-version: 40100
[2019-01-15 17:50:14.973931] I [MSGID: 106544] [glusterd.c:158:glusterd_uuid_init] 0-management: retrieved UUID: d6bf51a7-c296-492f-8dac-e81efa9dd22d
[2019-01-15 17:50:15.046620] E [MSGID: 101032] [store.c:441:gf_store_handle_retrieve] 0-: Path corresponding to /var/lib/glusterd/vols/gfs-tst/bricks/IP.3:-media-disk3-brick3. [No such file or directory]
[2019-01-15 17:50:15.046685] E [MSGID: 106201] [glusterd-store.c:3384:glusterd_store_retrieve_volumes] 0-management: Unable to restore volume: gfs-tst
[2019-01-15 17:50:15.046718] E [MSGID: 101019] [xlator.c:720:xlator_init] 0-management: Initialization of volume 'management' failed, review your volfile again
[2019-01-15 17:50:15.046732] E [MSGID: 101066] [graph.c:367:glusterfs_graph_init] 0-management: initializing translator failed
[2019-01-15 17:50:15.046741] E [MSGID: 101176] [graph.c:738:glusterfs_graph_activate] 0-graph: init failed
[2019-01-15 17:50:15.047171] W [glusterfsd.c:1514:cleanup_and_exit] (-->/usr/local/sbin/glusterd(glusterfs_volumes_init+0xc2) [0x409f52] -->/usr/local/sbin/glusterd(glusterfs_process_volfp+0x151) [0x409e41] -->/usr/local/sbin/glusterd(cleanup_and_exit+0x5f) [0x40942f] ) 0-: received signum (-1), shutting down

15. On another node, running `volume status`

@gfstst-node2:~$ sudo gluster v status
Status of volume: gfs-tst
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick IP.2:/media/disk1/brick1              49152     0          Y       1517
Brick IP.4:/media/disk1/brick1              49152     0          Y       1668
Brick IP.2:/media/disk2/brick2              49153     0          Y       1522
Brick IP.4:/media/disk2/brick2              49153     0          Y       1678
Brick IP.2:/media/disk3/brick3              49154     0          Y       1527
Brick IP.4:/media/disk3/brick3              49154     0          Y       1677
Brick IP.2:/media/disk4/brick4              49155     0          Y       1541
Brick IP.4:/media/disk4/brick4              49155     0          Y       1683
Self-heal Daemon on localhost               N/A       N/A        Y       2662
Self-heal Daemon on IP.4                    N/A       N/A        Y       2786

Task Status of Volume gfs-tst
------------------------------------------------------------------------------
There are no active volume tasks

16. 'peer status' command showing node-3 disconnected

root@gfstst-node2:~$ sudo gluster pool list
UUID                                    Hostname    State
d6bf51a7-c296-492f-8dac-e81efa9dd22d    IP.3        Disconnected
c1cbb58e-3ceb-4637-9ba3-3d28ef20b143    IP.4        Connected
0083ec0c-40bf-472a-a128-458924e56c96    localhost   Connected

root@gfstst-node2:~$ sudo gluster peer status
Number of Peers: 2

Hostname: IP.3
Uuid: d6bf51a7-c296-492f-8dac-e81efa9dd22d
State: Peer in Cluster (Disconnected)

Hostname: IP.4
Uuid: c1cbb58e-3ceb-4637-9ba3-3d28ef20b143
State: Peer in Cluster (Connected)

regards
Amudhan

On Thu, Jan 31, 2019 at 8:54 AM Atin Mukherjee wrote:

> I'm not sure how you ended up in a state where one of the nodes lost the
> information of one peer from the cluster. I suspect that while doing a
> replace node operation you somehow landed in this situation through an
> incorrect step. Unless you can elaborate more on all the steps you have
> performed in the cluster, it'd be difficult to figure out the exact cause.
>
> On Wed, Jan 30, 2019 at 7:25 PM Amudhan P wrote:
>
>> Hi Atin,
>>
>> yes, it worked out, thank you.
>>
>> what would be the cause of this issue?
>>
>>
>>
>> On Fri, Jan 25, 2019 at 1:56 PM Atin Mukherjee
>> wrote:
>>
>>> Amudhan,
>>>
>>> So here's the issue:
>>>
>>> In node3, 'cat /var/lib/glusterd/peers/* ' doesn't show up node2's
>>> details and that's why glusterd wasn't able to resolve the brick(s) hosted
>>> on node2.
>>>
>>> Can you please pick up 0083ec0c-40bf-472a-a128-458924e56c96 file from
>>> /var/lib/glusterd/peers/ from node 4 and place it in the same location in
>>> node 3 and then restart glusterd service on node 3?
>>>
>>>
>>> On Thu, Jan 24, 2019 at 11:57 AM Amudhan P wrote:
>>>
>>>> Atin,
>>>>
>>>> Sorry, I forgot to send the entire `glusterd` folder. The attached zip
>>>> now contains the `glusterd` folder from all nodes.
>>>>
>>>> the problem node is node3 IP 10.1.2.3, `glusterd` log file is inside
>>>> node3 folder.
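The diagnosis given earlier in this thread (node 3 had no file for node 2 under /var/lib/glusterd/peers/) can be checked with a small script along these lines. This is only a sketch under stated assumptions: the helper name `missing_peer_files` is hypothetical, it relies solely on the peer files being named after the peer UUID as the thread describes, and the demo runs against a temporary directory rather than a real /var/lib/glusterd/peers/.

```python
import os
import tempfile
from typing import Iterable, Set

# UUIDs as reported by `gluster pool list` on node 2 in this thread.
NODE2_UUID = "0083ec0c-40bf-472a-a128-458924e56c96"
NODE4_UUID = "c1cbb58e-3ceb-4637-9ba3-3d28ef20b143"


def missing_peer_files(peers_dir: str, expected_uuids: Iterable[str]) -> Set[str]:
    """Return the expected peer UUIDs that have no file in `peers_dir`.

    ASSUMPTION: each known peer is stored as a file named after its UUID.
    """
    present = set(os.listdir(peers_dir))
    return {uuid for uuid in expected_uuids if uuid not in present}


# Demo: mimic node 3's state, where only node 4's peer file exists.
with tempfile.TemporaryDirectory() as demo_dir:
    open(os.path.join(demo_dir, NODE4_UUID), "w").close()
    demo_missing = missing_peer_files(demo_dir, [NODE2_UUID, NODE4_UUID])

print("missing peer files:", sorted(demo_missing))
```

On a real node, `peers_dir` would be /var/lib/glusterd/peers and the expected UUIDs would come from `gluster pool list` on a healthy node; any UUID reported missing is a candidate for copying over from another peer before restarting glusterd, as was suggested in the thread.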
>>>> >>>> regards >>>> Amudhan >>>> >>>> On Wed, Jan 23, 2019 at 11:02 PM Atin Mukherjee >>>> wrote: >>>> >>>>> Amudhan, >>>>> >>>>> I see that you have provided the content of the configuration of the >>>>> volume gfs-tst where the request was to share the dump of >>>>> /var/lib/glusterd/* . I can not debug this further until you share the >>>>> correct dump. >>>>> >>>>> On Thu, Jan 17, 2019 at 3:43 PM Atin Mukherjee >>>>> wrote: >>>>> >>>>>> Can you please run 'glusterd -LDEBUG' and share back the >>>>>> glusterd.log? Instead of doing too many back and forth I suggest you to >>>>>> share the content of /var/lib/glusterd from all the nodes. Also do mention >>>>>> which particular node the glusterd service is unable to come up. >>>>>> >>>>>> On Thu, Jan 17, 2019 at 11:34 AM Amudhan P >>>>>> wrote: >>>>>> >>>>>>> I have created the folder in the path as said but still, service >>>>>>> failed to start below is the error msg in glusterd.log >>>>>>> >>>>>>> [2019-01-16 14:50:14.555742] I [MSGID: 100030] >>>>>>> [glusterfsd.c:2741:main] 0-/usr/local/sbin/glusterd: Started running >>>>>>> /usr/local/sbin/glusterd version 4.1.6 (args: /usr/local/sbin/glusterd -p >>>>>>> /var/run/glusterd.pid) >>>>>>> [2019-01-16 14:50:14.559835] I [MSGID: 106478] >>>>>>> [glusterd.c:1423:init] 0-management: Maximum allowed open file descriptors >>>>>>> set to 65536 >>>>>>> [2019-01-16 14:50:14.559894] I [MSGID: 106479] >>>>>>> [glusterd.c:1481:init] 0-management: Using /var/lib/glusterd as working >>>>>>> directory >>>>>>> [2019-01-16 14:50:14.559912] I [MSGID: 106479] >>>>>>> [glusterd.c:1486:init] 0-management: Using /var/run/gluster as pid file >>>>>>> working directory >>>>>>> [2019-01-16 14:50:14.563834] W [MSGID: 103071] >>>>>>> [rdma.c:4629:__gf_rdma_ctx_create] 0-rpc-transport/rdma: rdma_cm event >>>>>>> channel creation failed [No such device] >>>>>>> [2019-01-16 14:50:14.563867] W [MSGID: 103055] [rdma.c:4938:init] >>>>>>> 0-rdma.management: Failed to initialize IB Device 
>>>>>>> [2019-01-16 14:50:14.563882] W >>>>>>> [rpc-transport.c:351:rpc_transport_load] 0-rpc-transport: 'rdma' >>>>>>> initialization failed >>>>>>> [2019-01-16 14:50:14.563957] W >>>>>>> [rpcsvc.c:1781:rpcsvc_create_listener] 0-rpc-service: cannot create >>>>>>> listener, initing the transport failed >>>>>>> [2019-01-16 14:50:14.563974] E [MSGID: 106244] >>>>>>> [glusterd.c:1764:init] 0-management: creation of 1 listeners failed, >>>>>>> continuing with succeeded transport >>>>>>> [2019-01-16 14:50:15.565868] I [MSGID: 106513] >>>>>>> [glusterd-store.c:2240:glusterd_restore_op_version] 0-glusterd: retrieved >>>>>>> op-version: 40100 >>>>>>> [2019-01-16 14:50:15.642532] I [MSGID: 106544] >>>>>>> [glusterd.c:158:glusterd_uuid_init] 0-management: retrieved UUID: >>>>>>> d6bf51a7-c296-492f-8dac-e81efa9dd22d >>>>>>> [2019-01-16 14:50:15.675333] I [MSGID: 106498] >>>>>>> [glusterd-handler.c:3614:glusterd_friend_add_from_peerinfo] 0-management: >>>>>>> connect returned 0 >>>>>>> [2019-01-16 14:50:15.675421] W [MSGID: 106061] >>>>>>> [glusterd-handler.c:3408:glusterd_transport_inet_options_build] 0-glusterd: >>>>>>> Failed to get tcp-user-timeout >>>>>>> [2019-01-16 14:50:15.675451] I >>>>>>> [rpc-clnt.c:1059:rpc_clnt_connection_init] 0-management: setting >>>>>>> frame-timeout to 600 >>>>>>> *[2019-01-16 14:50:15.676912] E [MSGID: 106187] >>>>>>> [glusterd-store.c:4662:glusterd_resolve_all_bricks] 0-glusterd: resolve >>>>>>> brick failed in restore* >>>>>>> *[2019-01-16 14:50:15.676956] E [MSGID: 101019] >>>>>>> [xlator.c:720:xlator_init] 0-management: Initialization of volume >>>>>>> 'management' failed, review your volfile again* >>>>>>> [2019-01-16 14:50:15.676973] E [MSGID: 101066] >>>>>>> [graph.c:367:glusterfs_graph_init] 0-management: initializing translator >>>>>>> failed >>>>>>> [2019-01-16 14:50:15.676986] E [MSGID: 101176] >>>>>>> [graph.c:738:glusterfs_graph_activate] 0-graph: init failed >>>>>>> [2019-01-16 14:50:15.677479] W 
[glusterfsd.c:1514:cleanup_and_exit] >>>>>>> (-->/usr/local/sbin/glusterd(glusterfs_volumes_init+0xc2) [0x409f52] >>>>>>> -->/usr/local/sbin/glusterd(glusterfs_process_volfp+0x151) [0x409e41] >>>>>>> -->/usr/local/sbin/glusterd(cleanup_and_exit+0x5f) [0x40942f] ) 0-: >>>>>>> received signum (-1), shutting down >>>>>>> >>>>>>> >>>>>>> On Thu, Jan 17, 2019 at 8:06 AM Atin Mukherjee >>>>>>> wrote: >>>>>>> >>>>>>>> If gluster volume info/status shows the brick to be >>>>>>>> /media/disk4/brick4 then you'd need to mount the same path and hence you'd >>>>>>>> need to create the brick4 directory explicitly. I fail to understand the >>>>>>>> rationale how only /media/disk4 can be used as the mount path for the >>>>>>>> brick. >>>>>>>> >>>>>>>> On Wed, Jan 16, 2019 at 5:24 PM Amudhan P >>>>>>>> wrote: >>>>>>>> >>>>>>>>> Yes, I did mount bricks but the folder 'brick4' was still not >>>>>>>>> created inside the brick. >>>>>>>>> Do I need to create this folder because when I run replace-brick >>>>>>>>> it will create folder inside the brick. I have seen this behavior before >>>>>>>>> when running replace-brick or heal begins. >>>>>>>>> >>>>>>>>> On Wed, Jan 16, 2019 at 5:05 PM Atin Mukherjee < >>>>>>>>> amukherj at redhat.com> wrote: >>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On Wed, Jan 16, 2019 at 5:02 PM Amudhan P >>>>>>>>>> wrote: >>>>>>>>>> >>>>>>>>>>> Atin, >>>>>>>>>>> I have copied the content of 'gfs-tst' from vol folder in >>>>>>>>>>> another node. when starting service again fails with error msg in >>>>>>>>>>> glusterd.log file. 
>>>>>>>>>>> >>>>>>>>>>> [2019-01-15 20:16:59.513023] I [MSGID: 100030] >>>>>>>>>>> [glusterfsd.c:2741:main] 0-/usr/local/sbin/glusterd: Started running >>>>>>>>>>> /usr/local/sbin/glusterd version 4.1.6 (args: /usr/local/sbin/glusterd -p >>>>>>>>>>> /var/run/glusterd.pid) >>>>>>>>>>> [2019-01-15 20:16:59.517164] I [MSGID: 106478] >>>>>>>>>>> [glusterd.c:1423:init] 0-management: Maximum allowed open file descriptors >>>>>>>>>>> set to 65536 >>>>>>>>>>> [2019-01-15 20:16:59.517264] I [MSGID: 106479] >>>>>>>>>>> [glusterd.c:1481:init] 0-management: Using /var/lib/glusterd as working >>>>>>>>>>> directory >>>>>>>>>>> [2019-01-15 20:16:59.517283] I [MSGID: 106479] >>>>>>>>>>> [glusterd.c:1486:init] 0-management: Using /var/run/gluster as pid file >>>>>>>>>>> working directory >>>>>>>>>>> [2019-01-15 20:16:59.521508] W [MSGID: 103071] >>>>>>>>>>> [rdma.c:4629:__gf_rdma_ctx_create] 0-rpc-transport/rdma: rdma_cm event >>>>>>>>>>> channel creation failed [No such device] >>>>>>>>>>> [2019-01-15 20:16:59.521544] W [MSGID: 103055] >>>>>>>>>>> [rdma.c:4938:init] 0-rdma.management: Failed to initialize IB Device >>>>>>>>>>> [2019-01-15 20:16:59.521562] W >>>>>>>>>>> [rpc-transport.c:351:rpc_transport_load] 0-rpc-transport: 'rdma' >>>>>>>>>>> initialization failed >>>>>>>>>>> [2019-01-15 20:16:59.521629] W >>>>>>>>>>> [rpcsvc.c:1781:rpcsvc_create_listener] 0-rpc-service: cannot create >>>>>>>>>>> listener, initing the transport failed >>>>>>>>>>> [2019-01-15 20:16:59.521648] E [MSGID: 106244] >>>>>>>>>>> [glusterd.c:1764:init] 0-management: creation of 1 listeners failed, >>>>>>>>>>> continuing with succeeded transport >>>>>>>>>>> [2019-01-15 20:17:00.529390] I [MSGID: 106513] >>>>>>>>>>> [glusterd-store.c:2240:glusterd_restore_op_version] 0-glusterd: retrieved >>>>>>>>>>> op-version: 40100 >>>>>>>>>>> [2019-01-15 20:17:00.608354] I [MSGID: 106544] >>>>>>>>>>> [glusterd.c:158:glusterd_uuid_init] 0-management: retrieved UUID: >>>>>>>>>>> d6bf51a7-c296-492f-8dac-e81efa9dd22d 
>>>>>>>>>>> [2019-01-15 20:17:00.650911] W [MSGID: 106425] >>>>>>>>>>> [glusterd-store.c:2643:glusterd_store_retrieve_bricks] 0-management: failed >>>>>>>>>>> to get statfs() call on brick /media/disk4/brick4 [No such file or >>>>>>>>>>> directory] >>>>>>>>>>> >>>>>>>>>> >>>>>>>>>> This means that underlying brick /media/disk4/brick4 doesn't >>>>>>>>>> exist. You already mentioned that you had replaced the faulty disk, but >>>>>>>>>> have you not mounted it yet? >>>>>>>>>> >>>>>>>>>> >>>>>>>>>>> [2019-01-15 20:17:00.691240] I [MSGID: 106498] >>>>>>>>>>> [glusterd-handler.c:3614:glusterd_friend_add_from_peerinfo] 0-management: >>>>>>>>>>> connect returned 0 >>>>>>>>>>> [2019-01-15 20:17:00.691307] W [MSGID: 106061] >>>>>>>>>>> [glusterd-handler.c:3408:glusterd_transport_inet_options_build] 0-glusterd: >>>>>>>>>>> Failed to get tcp-user-timeout >>>>>>>>>>> [2019-01-15 20:17:00.691331] I >>>>>>>>>>> [rpc-clnt.c:1059:rpc_clnt_connection_init] 0-management: setting >>>>>>>>>>> frame-timeout to 600 >>>>>>>>>>> [2019-01-15 20:17:00.692547] E [MSGID: 106187] >>>>>>>>>>> [glusterd-store.c:4662:glusterd_resolve_all_bricks] 0-glusterd: resolve >>>>>>>>>>> brick failed in restore >>>>>>>>>>> [2019-01-15 20:17:00.692582] E [MSGID: 101019] >>>>>>>>>>> [xlator.c:720:xlator_init] 0-management: Initialization of volume >>>>>>>>>>> 'management' failed, review your volfile again >>>>>>>>>>> [2019-01-15 20:17:00.692597] E [MSGID: 101066] >>>>>>>>>>> [graph.c:367:glusterfs_graph_init] 0-management: initializing translator >>>>>>>>>>> failed >>>>>>>>>>> [2019-01-15 20:17:00.692607] E [MSGID: 101176] >>>>>>>>>>> [graph.c:738:glusterfs_graph_activate] 0-graph: init failed >>>>>>>>>>> [2019-01-15 20:17:00.693004] W >>>>>>>>>>> [glusterfsd.c:1514:cleanup_and_exit] >>>>>>>>>>> (-->/usr/local/sbin/glusterd(glusterfs_volumes_init+0xc2) [0x409f52] >>>>>>>>>>> -->/usr/local/sbin/glusterd(glusterfs_process_volfp+0x151) [0x409e41] >>>>>>>>>>> -->/usr/local/sbin/glusterd(cleanup_and_exit+0x5f) 
[0x40942f] ) 0-: >>>>>>>>>>> received signum (-1), shutting down >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> On Wed, Jan 16, 2019 at 4:34 PM Atin Mukherjee < >>>>>>>>>>> amukherj at redhat.com> wrote: >>>>>>>>>>> >>>>>>>>>>>> This is a case of partial write of a transaction and as the >>>>>>>>>>>> host ran out of space for the root partition where all the glusterd related >>>>>>>>>>>> configurations are persisted, the transaction couldn't be written and hence >>>>>>>>>>>> the new (replaced) brick's information wasn't persisted in the >>>>>>>>>>>> configuration. The workaround for this is to copy the content of >>>>>>>>>>>> /var/lib/glusterd/vols/gfs-tst/ from one of the nodes in the trusted >>>>>>>>>>>> storage pool to the node where glusterd service fails to come up and post >>>>>>>>>>>> that restarting the glusterd service should be able to make peer status >>>>>>>>>>>> reporting all nodes healthy and connected. >>>>>>>>>>>> >>>>>>>>>>>> On Wed, Jan 16, 2019 at 3:49 PM Amudhan P >>>>>>>>>>>> wrote: >>>>>>>>>>>> >>>>>>>>>>>>> Hi, >>>>>>>>>>>>> >>>>>>>>>>>>> In short, when I started glusterd service I am getting >>>>>>>>>>>>> following error msg in the glusterd.log file in one server. >>>>>>>>>>>>> what needs to be done? 
>>>>>>>>>>>>> >>>>>>>>>>>>> error logged in glusterd.log >>>>>>>>>>>>> >>>>>>>>>>>>> [2019-01-15 17:50:13.956053] I [MSGID: 100030] >>>>>>>>>>>>> [glusterfsd.c:2741:main] 0-/usr/local/sbin/glusterd: Started running >>>>>>>>>>>>> /usr/local/sbin/glusterd version 4.1.6 (args: /usr/local/sbin/glusterd -p >>>>>>>>>>>>> /var/run/glusterd.pid) >>>>>>>>>>>>> [2019-01-15 17:50:13.960131] I [MSGID: 106478] >>>>>>>>>>>>> [glusterd.c:1423:init] 0-management: Maximum allowed open file descriptors >>>>>>>>>>>>> set to 65536 >>>>>>>>>>>>> [2019-01-15 17:50:13.960193] I [MSGID: 106479] >>>>>>>>>>>>> [glusterd.c:1481:init] 0-management: Using /var/lib/glusterd as working >>>>>>>>>>>>> directory >>>>>>>>>>>>> [2019-01-15 17:50:13.960212] I [MSGID: 106479] >>>>>>>>>>>>> [glusterd.c:1486:init] 0-management: Using /var/run/gluster as pid file >>>>>>>>>>>>> working directory >>>>>>>>>>>>> [2019-01-15 17:50:13.964437] W [MSGID: 103071] >>>>>>>>>>>>> [rdma.c:4629:__gf_rdma_ctx_create] 0-rpc-transport/rdma: rdma_cm event >>>>>>>>>>>>> channel creation failed [No such device] >>>>>>>>>>>>> [2019-01-15 17:50:13.964474] W [MSGID: 103055] >>>>>>>>>>>>> [rdma.c:4938:init] 0-rdma.management: Failed to initialize IB Device >>>>>>>>>>>>> [2019-01-15 17:50:13.964491] W >>>>>>>>>>>>> [rpc-transport.c:351:rpc_transport_load] 0-rpc-transport: 'rdma' >>>>>>>>>>>>> initialization failed >>>>>>>>>>>>> [2019-01-15 17:50:13.964560] W >>>>>>>>>>>>> [rpcsvc.c:1781:rpcsvc_create_listener] 0-rpc-service: cannot create >>>>>>>>>>>>> listener, initing the transport failed >>>>>>>>>>>>> [2019-01-15 17:50:13.964579] E [MSGID: 106244] >>>>>>>>>>>>> [glusterd.c:1764:init] 0-management: creation of 1 listeners failed, >>>>>>>>>>>>> continuing with succeeded transport >>>>>>>>>>>>> [2019-01-15 17:50:14.967681] I [MSGID: 106513] >>>>>>>>>>>>> [glusterd-store.c:2240:glusterd_restore_op_version] 0-glusterd: retrieved >>>>>>>>>>>>> op-version: 40100 >>>>>>>>>>>>> [2019-01-15 17:50:14.973931] I [MSGID: 106544] 
>>>>>>>>>>>>> [glusterd.c:158:glusterd_uuid_init] 0-management: retrieved UUID: >>>>>>>>>>>>> d6bf51a7-c296-492f-8dac-e81efa9dd22d >>>>>>>>>>>>> [2019-01-15 17:50:15.046620] E [MSGID: 101032] >>>>>>>>>>>>> [store.c:441:gf_store_handle_retrieve] 0-: Path corresponding to >>>>>>>>>>>>> /var/lib/glusterd/vols/gfs-tst/bricks/IP.3:-media-disk3-brick3. [No such >>>>>>>>>>>>> file or directory] >>>>>>>>>>>>> [2019-01-15 17:50:15.046685] E [MSGID: 106201] >>>>>>>>>>>>> [glusterd-store.c:3384:glusterd_store_retrieve_volumes] 0-management: >>>>>>>>>>>>> Unable to restore volume: gfs-tst >>>>>>>>>>>>> [2019-01-15 17:50:15.046718] E [MSGID: 101019] >>>>>>>>>>>>> [xlator.c:720:xlator_init] 0-management: Initialization of volume >>>>>>>>>>>>> 'management' failed, review your volfile again >>>>>>>>>>>>> [2019-01-15 17:50:15.046732] E [MSGID: 101066] >>>>>>>>>>>>> [graph.c:367:glusterfs_graph_init] 0-management: initializing translator >>>>>>>>>>>>> failed >>>>>>>>>>>>> [2019-01-15 17:50:15.046741] E [MSGID: 101176] >>>>>>>>>>>>> [graph.c:738:glusterfs_graph_activate] 0-graph: init failed >>>>>>>>>>>>> [2019-01-15 17:50:15.047171] W >>>>>>>>>>>>> [glusterfsd.c:1514:cleanup_and_exit] >>>>>>>>>>>>> (-->/usr/local/sbin/glusterd(glusterfs_volumes >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> In long, I am trying to simulate a situation. where volume >>>>>>>>>>>>> stoped abnormally and >>>>>>>>>>>>> entire cluster restarted with some missing disks. >>>>>>>>>>>>> >>>>>>>>>>>>> My test cluster is set up with 3 nodes and each has four >>>>>>>>>>>>> disks, I have setup a volume with disperse 4+2. >>>>>>>>>>>>> In Node-3 2 disks have failed, to replace I have shutdown all >>>>>>>>>>>>> system >>>>>>>>>>>>> >>>>>>>>>>>>> below are the steps done. >>>>>>>>>>>>> >>>>>>>>>>>>> 1. umount from client machine >>>>>>>>>>>>> 2. shutdown all system by running `shutdown -h now` command ( >>>>>>>>>>>>> without stopping volume and stop service) >>>>>>>>>>>>> 3. 
replace faulty disk in Node-3 >>>>>>>>>>>>> 4. powered ON all system >>>>>>>>>>>>> 5. format replaced drives, and mount all drives >>>>>>>>>>>>> 6. start glusterd service in all node (success) >>>>>>>>>>>>> 7. Now running `voulume status` command from node-3 >>>>>>>>>>>>> output : [2019-01-15 16:52:17.718422] : v status : FAILED : >>>>>>>>>>>>> Staging failed on 0083ec0c-40bf-472a-a128-458924e56c96. Please check log >>>>>>>>>>>>> file for details. >>>>>>>>>>>>> 8. running `voulume start gfs-tst` command from node-3 >>>>>>>>>>>>> output : [2019-01-15 16:53:19.410252] : v start gfs-tst : >>>>>>>>>>>>> FAILED : Volume gfs-tst already started >>>>>>>>>>>>> >>>>>>>>>>>>> 9. running `gluster v status` in other node. showing all brick >>>>>>>>>>>>> available but 'self-heal daemon' not running >>>>>>>>>>>>> @gfstst-node2:~$ sudo gluster v status >>>>>>>>>>>>> Status of volume: gfs-tst >>>>>>>>>>>>> Gluster process TCP Port RDMA >>>>>>>>>>>>> Port Online Pid >>>>>>>>>>>>> >>>>>>>>>>>>> ------------------------------------------------------------------------------ >>>>>>>>>>>>> Brick IP.2:/media/disk1/brick1 49152 0 >>>>>>>>>>>>> Y 1517 >>>>>>>>>>>>> Brick IP.4:/media/disk1/brick1 49152 0 >>>>>>>>>>>>> Y 1668 >>>>>>>>>>>>> Brick IP.2:/media/disk2/brick2 49153 0 >>>>>>>>>>>>> Y 1522 >>>>>>>>>>>>> Brick IP.4:/media/disk2/brick2 49153 0 >>>>>>>>>>>>> Y 1678 >>>>>>>>>>>>> Brick IP.2:/media/disk3/brick3 49154 0 >>>>>>>>>>>>> Y 1527 >>>>>>>>>>>>> Brick IP.4:/media/disk3/brick3 49154 0 >>>>>>>>>>>>> Y 1677 >>>>>>>>>>>>> Brick IP.2:/media/disk4/brick4 49155 0 >>>>>>>>>>>>> Y 1541 >>>>>>>>>>>>> Brick IP.4:/media/disk4/brick4 49155 0 >>>>>>>>>>>>> Y 1683 >>>>>>>>>>>>> Self-heal Daemon on localhost N/A N/A >>>>>>>>>>>>> Y 2662 >>>>>>>>>>>>> Self-heal Daemon on IP.4 N/A N/A >>>>>>>>>>>>> Y 2786 >>>>>>>>>>>>> >>>>>>>>>>>>> 10. in the above output 'volume already started'. 
so, running >>>>>>>>>>>>> `reset-brick` command >>>>>>>>>>>>> v reset-brick gfs-tst IP.3:/media/disk3/brick3 >>>>>>>>>>>>> IP.3:/media/disk3/brick3 commit force >>>>>>>>>>>>> >>>>>>>>>>>>> output : [2019-01-15 16:57:37.916942] : v reset-brick gfs-tst >>>>>>>>>>>>> IP.3:/media/disk3/brick3 IP.3:/media/disk3/brick3 commit force : FAILED : >>>>>>>>>>>>> /media/disk3/brick3 is already part of a volume >>>>>>>>>>>>> >>>>>>>>>>>>> 11. reset-brick command was not working, so, tried stopping >>>>>>>>>>>>> volume and start with force command >>>>>>>>>>>>> output : [2019-01-15 17:01:04.570794] : v start gfs-tst force >>>>>>>>>>>>> : FAILED : Pre-validation failed on localhost. Please check log file for >>>>>>>>>>>>> details >>>>>>>>>>>>> >>>>>>>>>>>>> 12. now stopped service in all node and tried starting again. >>>>>>>>>>>>> except node-3 other nodes service started successfully without any issues. >>>>>>>>>>>>> >>>>>>>>>>>>> in node-3 receiving following message. >>>>>>>>>>>>> >>>>>>>>>>>>> sudo service glusterd start >>>>>>>>>>>>> * Starting glusterd service glusterd >>>>>>>>>>>>> >>>>>>>>>>>>> [fail] >>>>>>>>>>>>> /usr/local/sbin/glusterd: option requires an argument -- 'f' >>>>>>>>>>>>> Try `glusterd --help' or `glusterd --usage' for more >>>>>>>>>>>>> information. >>>>>>>>>>>>> >>>>>>>>>>>>> 13. checking glusterd log file found that OS drive was running >>>>>>>>>>>>> out of space >>>>>>>>>>>>> output : [2019-01-15 16:51:37.210792] W [MSGID: 101012] >>>>>>>>>>>>> [store.c:372:gf_store_save_value] 0-management: fflush failed. [No space >>>>>>>>>>>>> left on device] >>>>>>>>>>>>> [2019-01-15 16:51:37.210874] E [MSGID: 106190] >>>>>>>>>>>>> [glusterd-store.c:1058:glusterd_volume_exclude_options_write] 0-management: >>>>>>>>>>>>> Unable to write volume values for gfs-tst >>>>>>>>>>>>> >>>>>>>>>>>>> 14. cleared some space in OS drive but still, service is not >>>>>>>>>>>>> running. 
below is the error logged in glusterd.log >>>>>>>>>>>>> >>>>>>>>>>>>> [2019-01-15 17:50:13.956053] I [MSGID: 100030] >>>>>>>>>>>>> [glusterfsd.c:2741:main] 0-/usr/local/sbin/glusterd: Started running >>>>>>>>>>>>> /usr/local/sbin/glusterd version 4.1.6 (args: /usr/local/sbin/glusterd -p >>>>>>>>>>>>> /var/run/glusterd.pid) >>>>>>>>>>>>> [2019-01-15 17:50:13.960131] I [MSGID: 106478] >>>>>>>>>>>>> [glusterd.c:1423:init] 0-management: Maximum allowed open file descriptors >>>>>>>>>>>>> set to 65536 >>>>>>>>>>>>> [2019-01-15 17:50:13.960193] I [MSGID: 106479] >>>>>>>>>>>>> [glusterd.c:1481:init] 0-management: Using /var/lib/glusterd as working >>>>>>>>>>>>> directory >>>>>>>>>>>>> [2019-01-15 17:50:13.960212] I [MSGID: 106479] >>>>>>>>>>>>> [glusterd.c:1486:init] 0-management: Using /var/run/gluster as pid file >>>>>>>>>>>>> working directory >>>>>>>>>>>>> [2019-01-15 17:50:13.964437] W [MSGID: 103071] >>>>>>>>>>>>> [rdma.c:4629:__gf_rdma_ctx_create] 0-rpc-transport/rdma: rdma_cm event >>>>>>>>>>>>> channel creation failed [No such device] >>>>>>>>>>>>> [2019-01-15 17:50:13.964474] W [MSGID: 103055] >>>>>>>>>>>>> [rdma.c:4938:init] 0-rdma.management: Failed to initialize IB Device >>>>>>>>>>>>> [2019-01-15 17:50:13.964491] W >>>>>>>>>>>>> [rpc-transport.c:351:rpc_transport_load] 0-rpc-transport: 'rdma' >>>>>>>>>>>>> initialization failed >>>>>>>>>>>>> [2019-01-15 17:50:13.964560] W >>>>>>>>>>>>> [rpcsvc.c:1781:rpcsvc_create_listener] 0-rpc-service: cannot create >>>>>>>>>>>>> listener, initing the transport failed >>>>>>>>>>>>> [2019-01-15 17:50:13.964579] E [MSGID: 106244] >>>>>>>>>>>>> [glusterd.c:1764:init] 0-management: creation of 1 listeners failed, >>>>>>>>>>>>> continuing with succeeded transport >>>>>>>>>>>>> [2019-01-15 17:50:14.967681] I [MSGID: 106513] >>>>>>>>>>>>> [glusterd-store.c:2240:glusterd_restore_op_version] 0-glusterd: retrieved >>>>>>>>>>>>> op-version: 40100 >>>>>>>>>>>>> [2019-01-15 17:50:14.973931] I [MSGID: 106544] >>>>>>>>>>>>> 
[glusterd.c:158:glusterd_uuid_init] 0-management: retrieved UUID: >>>>>>>>>>>>> d6bf51a7-c296-492f-8dac-e81efa9dd22d >>>>>>>>>>>>> [2019-01-15 17:50:15.046620] E [MSGID: 101032] >>>>>>>>>>>>> [store.c:441:gf_store_handle_retrieve] 0-: Path corresponding to >>>>>>>>>>>>> /var/lib/glusterd/vols/gfs-tst/bricks/IP.3:-media-disk3-brick3. [No such >>>>>>>>>>>>> file or directory] >>>>>>>>>>>>> [2019-01-15 17:50:15.046685] E [MSGID: 106201] >>>>>>>>>>>>> [glusterd-store.c:3384:glusterd_store_retrieve_volumes] 0-management: >>>>>>>>>>>>> Unable to restore volume: gfs-tst >>>>>>>>>>>>> [2019-01-15 17:50:15.046718] E [MSGID: 101019] >>>>>>>>>>>>> [xlator.c:720:xlator_init] 0-management: Initialization of volume >>>>>>>>>>>>> 'management' failed, review your volfile again >>>>>>>>>>>>> [2019-01-15 17:50:15.046732] E [MSGID: 101066] >>>>>>>>>>>>> [graph.c:367:glusterfs_graph_init] 0-management: initializing translator >>>>>>>>>>>>> failed >>>>>>>>>>>>> [2019-01-15 17:50:15.046741] E [MSGID: 101176] >>>>>>>>>>>>> [graph.c:738:glusterfs_graph_activate] 0-graph: init failed >>>>>>>>>>>>> [2019-01-15 17:50:15.047171] W >>>>>>>>>>>>> [glusterfsd.c:1514:cleanup_and_exit] >>>>>>>>>>>>> (-->/usr/local/sbin/glusterd(glusterfs_volumes_init+0xc2) [0x409f52] >>>>>>>>>>>>> -->/usr/local/sbin/glusterd(glusterfs_process_volfp+0x151) [0x409e41] >>>>>>>>>>>>> -->/usr/local/sbin/glusterd(cleanup_and_exit+0x5f) [0x40942f] ) 0-: >>>>>>>>>>>>> received signum (-1), shutting down >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> 15. 
In other node running `volume status' still shows bricks >>>>>>>>>>>>> node3 is live >>>>>>>>>>>>> but 'peer status' showing node-3 disconnected >>>>>>>>>>>>> >>>>>>>>>>>>> @gfstst-node2:~$ sudo gluster v status >>>>>>>>>>>>> Status of volume: gfs-tst >>>>>>>>>>>>> Gluster process TCP Port RDMA >>>>>>>>>>>>> Port Online Pid >>>>>>>>>>>>> >>>>>>>>>>>>> ------------------------------------------------------------------------------ >>>>>>>>>>>>> Brick IP.2:/media/disk1/brick1 49152 0 >>>>>>>>>>>>> Y 1517 >>>>>>>>>>>>> Brick IP.4:/media/disk1/brick1 49152 0 >>>>>>>>>>>>> Y 1668 >>>>>>>>>>>>> Brick IP.2:/media/disk2/brick2 49153 0 >>>>>>>>>>>>> Y 1522 >>>>>>>>>>>>> Brick IP.4:/media/disk2/brick2 49153 0 >>>>>>>>>>>>> Y 1678 >>>>>>>>>>>>> Brick IP.2:/media/disk3/brick3 49154 0 >>>>>>>>>>>>> Y 1527 >>>>>>>>>>>>> Brick IP.4:/media/disk3/brick3 49154 0 >>>>>>>>>>>>> Y 1677 >>>>>>>>>>>>> Brick IP.2:/media/disk4/brick4 49155 0 >>>>>>>>>>>>> Y 1541 >>>>>>>>>>>>> Brick IP.4:/media/disk4/brick4 49155 0 >>>>>>>>>>>>> Y 1683 >>>>>>>>>>>>> Self-heal Daemon on localhost N/A N/A >>>>>>>>>>>>> Y 2662 >>>>>>>>>>>>> Self-heal Daemon on IP.4 N/A N/A >>>>>>>>>>>>> Y 2786 >>>>>>>>>>>>> >>>>>>>>>>>>> Task Status of Volume gfs-tst >>>>>>>>>>>>> >>>>>>>>>>>>> ------------------------------------------------------------------------------ >>>>>>>>>>>>> There are no active volume tasks >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> root at gfstst-node2:~$ sudo gluster pool list >>>>>>>>>>>>> UUID Hostname State >>>>>>>>>>>>> d6bf51a7-c296-492f-8dac-e81efa9dd22d IP.3 >>>>>>>>>>>>> Disconnected >>>>>>>>>>>>> c1cbb58e-3ceb-4637-9ba3-3d28ef20b143 IP.4 Connected >>>>>>>>>>>>> 0083ec0c-40bf-472a-a128-458924e56c96 localhost >>>>>>>>>>>>> Connected >>>>>>>>>>>>> >>>>>>>>>>>>> root at gfstst-node2:~$ sudo gluster peer status >>>>>>>>>>>>> Number of Peers: 2 >>>>>>>>>>>>> >>>>>>>>>>>>> Hostname: IP.3 >>>>>>>>>>>>> Uuid: d6bf51a7-c296-492f-8dac-e81efa9dd22d >>>>>>>>>>>>> State: Peer in Cluster (Disconnected) 
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hostname: IP.4
>>>>>>>>>>>>> Uuid: c1cbb58e-3ceb-4637-9ba3-3d28ef20b143
>>>>>>>>>>>>> State: Peer in Cluster (Connected)
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> regards
>>>>>>>>>>>>> Amudhan
>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>> Gluster-users mailing list
>>>>>>>>>>>>> Gluster-users at gluster.org
>>>>>>>>>>>>> https://lists.gluster.org/mailman/listinfo/gluster-users
>>>>>>>>>>>>

From atumball at redhat.com Thu Jan 31 07:21:09 2019
From: atumball at redhat.com (Amar Tumballi Suryanarayan)
Date: Thu, 31 Jan 2019 12:51:09 +0530
Subject: [Gluster-users] chrome / chromium crash on gluster
In-Reply-To: <893b30a1-e2cb-fa49-c014-9a68bbd1b7dd@avtechpulse.com>
References: <893b30a1-e2cb-fa49-c014-9a68bbd1b7dd@avtechpulse.com>
Message-ID:

Interesting, I run F29 for all development, and didn't see anything like this.

Please share 'gluster volume info'. And also logs from mount process.

-Amar

On Wed, Jan 30, 2019 at 8:33 PM Dr. Michael J. Chudobiak <mjc at avtechpulse.com> wrote:

> I run Fedora 29 clients and servers, with user home folders mounted on gluster. This worked fine with Fedora 27 clients, but on F29 clients the chrome and chromium browsers crash. The backtrace info (see below) suggests problems with sqlite. Anyone else run into this? gluster and sqlite have had issues in the past...
>
> Firefox runs just fine, even though it is an sqlite user too.
>
> chromium clients mounted on local drives work fine.
>
> - Mike
>
>
> clients: glusterfs-5.3-1.fc29.x86_64, chromium-71.0.3578.98-1.fc29.x86_64
>
> server: glusterfs-server-5.3-1.fc29.x86_64
>
>
> [mjc at daisy ~]$ chromium-browser
> [18826:18826:0130/094436.431828:ERROR:sandbox_linux.cc(364)] InitializeSandbox() called with multiple threads in process gpu-process.
> [18785:18785:0130/094440.905900:ERROR:x11_input_method_context_impl_gtk.cc(144)] Not implemented reached in virtual void libgtkui::X11InputMethodContextImplGtk::SetSurroundingText(const string16&, const gfx::Range&)
> Received signal 7 BUS_ADRERR 7fc30e9bd000
> #0 0x7fc34b008261 base::debug::StackTrace::StackTrace()
> #1 0x7fc34b00869b base::debug::(anonymous namespace)::StackDumpSignalHandler()
> #2 0x7fc34b008cb7 base::debug::(anonymous namespace)::StackDumpSignalHandler()
> #3 0x7fc3401fe030
> #4 0x7fc33f5820f0 __memmove_avx_unaligned_erms
> #5 0x7fc346099491 unixRead
> #6 0x7fc3460d2784 readDbPage
> #7 0x7fc3460d5e4f getPageNormal
> #8 0x7fc3460d5f01 getPageMMap
> #9 0x7fc3460958f5 btreeGetPage
> #10 0x7fc3460ec47b sqlite3BtreeBeginTrans
> #11 0x7fc3460fd1e8 sqlite3VdbeExec
> #12 0x7fc3461056af chrome_sqlite3_step
> #13 0x7fc3464071c7 sql::Statement::StepInternal()
> #14 0x7fc3464072de sql::Statement::Step()
> #15 0x555fd21699d7 autofill::AutofillTable::GetAutofillProfiles()
> #16 0x555fd2160808 autofill::AutofillProfileSyncableService::MergeDataAndStartSyncing()
> #17 0x555fd1d25207 syncer::SharedChangeProcessor::StartAssociation()
> #18 0x555fd1d09652 _ZN4base8internal7InvokerINS0_9BindStateIMN6syncer21SharedChangeProcessorEFvNS_17RepeatingCallbackIFvNS3_18DataTypeController15ConfigureResultERKNS3_15SyncMergeResultESA_EEEPNS3_10SyncClientEPNS3_29GenericChangeProcessorFactoryEPNS3_9UserShareESt10unique_ptrINS3_20DataTypeErrorHandlerESt14default_deleteISK_EEEJ13scoped_refptrIS4_ESC_SE_SG_SI_NS0_13PassedWrapperISN_EEEEEFvvEE3RunEPNS0_13BindStateBaseE
> #19 0x7fc34af4309d base::debug::TaskAnnotator::RunTask()
> #20 0x7fc34afcda86 base::internal::TaskTracker::RunOrSkipTask()
> #21 0x7fc34b01b6a2 base::internal::TaskTrackerPosix::RunOrSkipTask()
> #22 0x7fc34afd07d6 base::internal::TaskTracker::RunAndPopNextTask()
> #23 0x7fc34afca5e7 base::internal::SchedulerWorker::RunWorker()
> #24 0x7fc34afcac84 base::internal::SchedulerWorker::RunSharedWorker()
> #25 0x7fc34b01aa09 base::(anonymous namespace)::ThreadFunc()
> #26 0x7fc3401f358e start_thread
> #27 0x7fc33f51d6a3 __GI___clone
> r8: 00000cbfd93d4a00 r9: 00000000cbfd93d4 r10: 000000000000011c r11: 0000000000000000
> r12: 00000cbfd940eb00 r13: 0000000000000000 r14: 0000000000000000 r15: 00000cbfd9336c00
> di: 00000cbfd93d4a00 si: 00007fc30e9bd000 bp: 00007fc30faff7e0 bx: 0000000000000800
> dx: 0000000000000800 ax: 00000cbfd93d4a00 cx: 0000000000000800 sp: 00007fc30faff788
> ip: 00007fc33f5820f0 efl: 0000000000010287 cgf: 002b000000000033 erf: 0000000000000004
> trp: 000000000000000e msk: 0000000000000000 cr2: 00007fc30e9bd000
> [end of stack trace]
> Calling _exit(1). Core file will not be generated.

--
Amar Tumballi (amarts)

From spisla80 at gmail.com Thu Jan 31 08:06:32 2019
From: spisla80 at gmail.com (David Spisla)
Date: Thu, 31 Jan 2019 09:06:32 +0100
Subject: [Gluster-users] Default Port Range for Bricks
In-Reply-To:
References:
Message-ID:

Thank you for the clarification.

On Thu, 31 Jan 2019 at 04:22, Atin Mukherjee <amukherj at redhat.com> wrote:

> On Tue, Jan 29, 2019 at 8:52 PM David Spisla wrote:
>
>> Hello Gluster Community,
>>
>> in glusterd.vol are parameters to define the port range for the bricks. They are commented out by default:
>>
>> # option base-port 49152
>> # option max-port 65535
>>
>> I assume that glusterd is not using this range if the parameters are commented out.
>>
> The current commented out config of base and max port you see defined in the glusterd.vol are the same default which is defined in glusterd codebase as well.
> The intention of introducing these options in the config was to ensure that if users want more granular control w.r.t. the port range, the same can be achieved by defining the range in this file.
>
> However, from glusterfs-6 onwards, we have fixed bug 1659857, which will consider the default max port to be 60999.
>
>> But what range instead? Is there a way to find this out?
>>
>> Regards
>>
>> David Spisla

From nbalacha at redhat.com Thu Jan 31 09:16:30 2019
From: nbalacha at redhat.com (Nithya Balachandran)
Date: Thu, 31 Jan 2019 14:46:30 +0530
Subject: [Gluster-users] Files losing permissions in GlusterFS 3.12
In-Reply-To: <1548855752.2018.39.camel@uni-luebeck.de>
References: <1548429058.2018.12.camel@uni-luebeck.de> <1548667421.10294.221.camel@uni-luebeck.de> <1548855752.2018.39.camel@uni-luebeck.de>
Message-ID:

On Wed, 30 Jan 2019 at 19:12, Gudrun Mareike Amedick <g.amedick at uni-luebeck.de> wrote:

> Hi,
>
> a bit of additional info inline
>
> On Monday, 28.01.2019, 10:23 +0100, Frank Ruehlemann wrote:
> > On Monday, 28.01.2019, 09:50 +0530, Nithya Balachandran wrote:
> > >
> > > On Fri, 25 Jan 2019 at 20:51, Gudrun Mareike Amedick <g.amedick at uni-luebeck.de> wrote:
> > >
> > > > Hi all,
> > > >
> > > > we have a problem with a distributed dispersed volume (GlusterFS 3.12). We have files that lost their permissions or gained sticky bits. The files themselves seem to be okay.
> > > >
> > > > It looks like this:
> > > >
> > > > # ls -lah $file1
> > > > ---------- 1 www-data www-data 45M Jan 12 07:01 $file1
> > > >
> > > > # ls -lah $file2
> > > > -rw-rwS--T 1 $user $group 11K Jan 9 11:48 $file2
> > > >
> > > > # ls -lah $file3
> > > > ---------T 1 $user $group 6.8M Jan 12 08:17 $file3
> > > >
> > > These are linkto files (internal dht files) and should not be visible on the mount point. Are they consistently visible like this or do they revert to the proper permissions after some time?
> > They haven't healed yet, even after more than 4 weeks. Therefore we decided to recommend our users to fix their files by setting the correct permissions again, which worked without problems. But for analysis reasons we still have some broken files nobody touched yet.
> >
> > We know these linkto files but they were never visible to clients. We did these ls-commands on a client, not on a brick.
>
> They have linkfile permissions but on brick side, it looks like this:
>
> root at gluster06:~# ls -lah /$brick/$file3
> ---------T 2 $user $group 1.7M Jan 12 08:17 /$brick/$file3
>
> That seems to be too big for a linkfile. Also, there is no file it could link to. There's no other file with that name at that path on any other subvolume.

This sounds like the rebalance failed to transition the file from a linkto to a data file once the migration was complete. Please check the rebalance logs on all nodes for any messages that refer to this file.
If you still see any such files, please check their xattrs directly on the brick. You should see one called trusted.glusterfs.dht.linkto. Let me know if that is missing.

Regards,
Nithya

> > > > This is not what the permissions are supposed to look like. They were 644 or 660 before. And they definitely had no sticky bits.
> > > > The permissions on the bricks match what I see on client side.
> > > > So I think the original permissions are lost without a chance to recover them, right?
> > > >
> > > > With some files with weird looking permissions (but not with all of them), I can do this:
> > > > # ls -lah $path/$file4
> > > > -rw-r--r-- 1 $user $group 6.0G Oct 11 09:34 $path/$file4
> > > > ls -lah $path | grep $file4
> > > > -rw-r-Sr-T 1 $user $group 6.0G Oct 11 09:34 $file4
> > > >
> > > > So, the permissions I see depend on how I'm querying them. The permissions on brick side agree with the latter result, stat sees the former. I'm not sure how that works.
> > > >
> > > The S and T bits indicate that a file is being migrated. The difference seems to be because of the way lookup versus readdirp handle this - this looks like a bug. Lookup will strip out the internal permissions set. I don't think readdirp does. This is happening because a rebalance is in progress.
> > There is no active rebalance. At least, none is visible in "gluster volume rebalance $VOLUME status".
> >
> > And the last line in the rebalance log file of this volume is:
> > "[2019-01-11 02:14:50.101944] W ... received signum (15), shutting down"
> >
> > > > We know for at least a part of those files that they were okay on December 19th. We got the first reports of weird-looking permissions on January 12th. In between, there was a rebalance running (January 7th to January 11th). During that rebalance, a node was offline for a longer period of time due to hardware issues. The output of "gluster volume heal $VOLUME info" shows no files though.
> > > >
> > > > For all files with broken permissions we found so far, the following lines are in the rebalance log:
> > > >
> > > > [2019-01-07 09:31:11.004802] I [MSGID: 109045] [dht-common.c:2456:dht_lookup_cbk] 0-$VOLUME-dht: linkfile not having link subvol for $file5
> > > > [2019-01-07 09:31:11.262273] I [MSGID: 109069] [dht-common.c:1410:dht_lookup_unlink_of_false_linkto_cbk] 0-$VOLUME-dht: lookup_unlink returned with op_ret -> 0 and op-errno -> 0 for $file5
> > > > [2019-01-07 09:31:11.266014] I [dht-rebalance.c:1570:dht_migrate_file] 0-$VOLUME-dht: $file5: attempting to move from $VOLUME-readdir-ahead-0 to $VOLUME-readdir-ahead-5
> > > > [2019-01-07 09:31:11.278120] I [dht-rebalance.c:1570:dht_migrate_file] 0-$VOLUME-dht: $file5: attempting to move from $VOLUME-readdir-ahead-0 to $VOLUME-readdir-ahead-5
> > > > [2019-01-07 09:31:11.732175] W [dht-rebalance.c:2159:dht_migrate_file] 0-$VOLUME-dht: $file5: failed to perform removexattr on $VOLUME-readdir-ahead-0 (No data available)
> > > > [2019-01-07 09:31:11.737319] W [MSGID: 109023] [dht-rebalance.c:2179:dht_migrate_file] 0-$VOLUME-dht: $file5: failed to do a stat on $VOLUME-readdir-ahead-0 [No such file or directory]
> > > > [2019-01-07 09:31:11.744382] I [MSGID: 109022] [dht-rebalance.c:2218:dht_migrate_file] 0-$VOLUME-dht: completed migration of $file5 from subvolume $VOLUME-readdir-ahead-0 to $VOLUME-readdir-ahead-5
> > > > [2019-01-07 09:31:11.744676] I [MSGID: 109022] [dht-rebalance.c:2218:dht_migrate_file] 0-$VOLUME-dht: completed migration of $file5 from subvolume $VOLUME-readdir-ahead-0 to $VOLUME-readdir-ahead-5
> > > >
> > > > I've searched the brick logs for $file5 with broken permissions and found this on all bricks from (I think) the subvolume $VOLUME-readdir-ahead-5:
> > > >
> > > > [2019-01-07 09:32:13.821545] I [MSGID: 113030] [posix.c:2171:posix_unlink] 0-$VOLUME-posix: open-fd-key-status: 0 for $file5
> > > > [2019-01-07 09:32:13.821609] I [MSGID: 113031] [posix.c:2084:posix_skip_non_linkto_unlink] 0-posix: linkto_xattr status: 0 for $file5
> > > >
> > > > Also, we noticed that many directories got their modification time updated. It was set to the rebalance date. Is that supposed to happen?
> > > >
> > > > We had parallel-readdir enabled during the rebalance. We disabled it since we had empty directories that couldn't be deleted. I was able to delete those dirs after that.
> > > >
> > > Was this disabled during the rebalance? parallel-readdirp changes the volume graph for clients but not for the rebalance process causing it to fail to find the linkto subvols.
> > Yes, parallel-readdirp was enabled during the rebalance. But we disabled it after some files were invisible on the client side again.
>
> The timetable looks like this:
>
> December 12th: parallel-readdir enabled
> January 7th: rebalance started
> January 11th/12th: rebalance finished (varied a bit, some servers were faster)
> January 15th: parallel-readdir disabled
>
> > > > Also, we have directories that lost their GFID on some bricks. Again.
> > >
> > > Is this the missing symlink problem that was reported earlier?
>
> Looks like it. I had a dir with missing GFID on one brick, I couldn't see some files on client side, I recreated the GFID symlink and everything was fine again.
> And in the brick log, I had this entry (with 1d372a8a-4958-4700-8ef1-fa4f756baad3 being the GFID of the dir in question):
>
> [2019-01-13 17:57:55.020859] W [MSGID: 113103] [posix.c:301:posix_lookup] 0-$VOLUME-posix: Found stale gfid handle /srv/glusterfs/bricks/$brick/data/.glusterfs/1d/37/1d372a8a-4958-4700-8ef1-fa4f756baad3, removing it. [No such file or directory]
>
> Very familiar. At least, I know how to fix that :D
>
> Kind regards
>
> Gudrun
>
> > > Regards,
> > > Nithya
> > >
> > > > What happened? Can we do something to fix this? And could that happen again?
> > > >
> > > > We want to upgrade to 4.1 soon. Is it safe to do that or could it make things worse?
> > > >
> > > > Kind regards
> > > >
> > > > Gudrun Amedick

From mjc at avtechpulse.com Thu Jan 31 13:54:06 2019
From: mjc at avtechpulse.com (Dr. Michael J. Chudobiak)
Date: Thu, 31 Jan 2019 08:54:06 -0500
Subject: [Gluster-users] chrome / chromium crash on gluster
In-Reply-To:
References: <893b30a1-e2cb-fa49-c014-9a68bbd1b7dd@avtechpulse.com>
Message-ID: <8c750643-fce2-0e55-4f40-bdfed69ec5f7@avtechpulse.com>

On 1/31/19 2:21 AM, Amar Tumballi Suryanarayan wrote:
> Interesting, I run F29 for all development, and didn't see anything like this.
>
> Please share 'gluster volume info'. And also logs from mount process.
[root at gluster1 ~]# gluster volume info

Volume Name: volume1
Type: Replicate
Volume ID: 91ef5aed-94be-44ff-a19d-c41682808159
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: gluster1:/gluster/brick1/data
Brick2: gluster2:/gluster/brick2/data
Options Reconfigured:
nfs.disable: on
server.allow-insecure: on
cluster.favorite-child-policy: mtime

And a client mount log is below - although the log is full of megabytes of:

The message "E [MSGID: 101191] [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch handler" repeated 20178 times between [2019-01-31 13:44:14.962950] and [2019-01-31 13:46:00.013310]

and

[2019-01-31 13:46:07.470163] W [dict.c:761:dict_ref] (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7c45) [0x7fb0e0b49c45] -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaba1) [0x7fb0e0b5cba1] -->/lib64/libglusterfs.so.0(dict_ref+0x60) [0x7fb0f2457c40] ) 0-dict: dict is NULL [Invalid argument]

so I've just shown the start of the log.

I guess that's related to https://bugzilla.redhat.com/show_bug.cgi?id=1651246.
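When a mount log is dominated by two or three repeating messages, a quick tally by level and originating function can show what else is in there. The following is a minimal Python sketch, assuming one log entry per line in the format quoted in this message; the `summarize` helper and both regular expressions are illustrative and not part of GlusterFS.

```python
import re
from collections import Counter

# Ordinary entry: "[timestamp] LEVEL [MSGID: n] [file.c:line:function] ..."
# (the MSGID part is optional, as in the dict_ref warning above).
ENTRY = re.compile(
    r"^\[(?P<ts>[\d\- :.]+)\] (?P<level>[A-Z]) "
    r"(?:\[MSGID: \d+\] )?\[(?P<src>[\w.-]+):\d+:(?P<func>\w+)\]"
)

# Roll-up written by the logger itself:
# 'The message "LEVEL [MSGID: n] [file.c:line:function] ..." repeated N times ...'
REPEAT = re.compile(
    r'^The message "(?P<level>[A-Z]) (?:\[MSGID: \d+\] )?'
    r'\[(?P<src>[\w.-]+):\d+:(?P<func>\w+)\].*" repeated (?P<n>\d+) times'
)


def summarize(lines):
    """Count log entries per (level, function), including rolled-up repeats."""
    counts = Counter()
    for line in lines:
        m = REPEAT.match(line)
        if m:
            # A roll-up line stands for N suppressed copies of the message.
            counts[(m.group("level"), m.group("func"))] += int(m.group("n"))
            continue
        m = ENTRY.match(line)
        if m:
            counts[(m.group("level"), m.group("func"))] += 1
        # Anything else (volume graph dump, blank lines) is ignored.
    return counts
```

Calling `summarize(open(path))` on a mount log (the path is whatever your client uses) and printing `counts.most_common(10)` should make it easier to spot entries other than the `dict_ref` and `event_dispatch_epoll_worker` noise.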
- Mike Mount log: [2019-01-31 13:44:00.775353] I [MSGID: 100030] [glusterfsd.c:2715:main] 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 5.3 (args: /usr/sbin/glusterfs --process-name fuse --volfile-server=gluster1 --volfile-server=gluster2 --volfile-id=/volume1 /fileserver2) [2019-01-31 13:44:00.817140] I [MSGID: 101190] [event-epoll.c:622:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1 [2019-01-31 13:44:00.926491] I [MSGID: 101190] [event-epoll.c:622:event_dispatch_epoll_worker] 0-epoll: Started thread with index 2 [2019-01-31 13:44:00.928102] I [MSGID: 114020] [client.c:2354:notify] 0-volume1-client-0: parent translators are ready, attempting connect on transport [2019-01-31 13:44:00.931063] I [MSGID: 114020] [client.c:2354:notify] 0-volume1-client-1: parent translators are ready, attempting connect on transport [2019-01-31 13:44:00.932144] I [rpc-clnt.c:2042:rpc_clnt_reconfig] 0-volume1-client-0: changing port to 49152 (from 0) Final graph: +------------------------------------------------------------------------------+ 1: volume volume1-client-0 2: type protocol/client 3: option ping-timeout 42 4: option remote-host gluster1 5: option remote-subvolume /gluster/brick1/data 6: option transport-type socket 7: option transport.tcp-user-timeout 0 8: option transport.socket.keepalive-time 20 9: option transport.socket.keepalive-interval 2 10: option transport.socket.keepalive-count 9 11: option send-gids true 12: end-volume 13: 14: volume volume1-client-1 15: type protocol/client 16: option ping-timeout 42 17: option remote-host gluster2 18: option remote-subvolume /gluster/brick2/data 19: option transport-type socket 20: option transport.tcp-user-timeout 0 21: option transport.socket.keepalive-time 20 22: option transport.socket.keepalive-interval 2 23: option transport.socket.keepalive-count 9 24: option send-gids true 25: end-volume 26: 27: volume volume1-replicate-0 28: type cluster/replicate 29: option afr-pending-xattr 
volume1-client-0,volume1-client-1 30: option favorite-child-policy mtime 31: option use-compound-fops off 32: subvolumes volume1-client-0 volume1-client-1 33: end-volume 34: 35: volume volume1-dht 36: type cluster/distribute [2019-01-31 13:44:00.932495] E [MSGID: 101191] [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch handler 37: option lock-migration off 38: option force-migration off 39: subvolumes volume1-replicate-0 40: end-volume 41: 42: volume volume1-write-behind 43: type performance/write-behind 44: subvolumes volume1-dht 45: end-volume 46: 47: volume volume1-read-ahead 48: type performance/read-ahead 49: subvolumes volume1-write-behind 50: end-volume 51: 52: volume volume1-readdir-ahead 53: type performance/readdir-ahead 54: option parallel-readdir off 55: option rda-request-size 131072 56: option rda-cache-limit 10MB 57: subvolumes volume1-read-ahead 58: end-volume 59: 60: volume volume1-io-cache 61: type performance/io-cache 62: subvolumes volume1-readdir-ahead 63: end-volume 64: 65: volume volume1-quick-read 66: type performance/quick-read 67: subvolumes volume1-io-cache 68: end-volume 69: 70: volume volume1-open-behind 71: type performance/open-behind 72: subvolumes volume1-quick-read 73: end-volume 74: 75: volume volume1-md-cache 76: type performance/md-cache 77: subvolumes volume1-open-behind 78: end-volume 79: 80: volume volume1 81: type debug/io-stats 82: option log-level INFO 83: option latency-measurement off 84: option count-fop-hits off 85: subvolumes volume1-md-cache 86: end-volume 87: 88: volume meta-autoload 89: type meta 90: subvolumes volume1 91: end-volume 92: +------------------------------------------------------------------------------+ [2019-01-31 13:44:00.933375] I [rpc-clnt.c:2042:rpc_clnt_reconfig] 0-volume1-client-1: changing port to 49152 (from 0) [2019-01-31 13:44:00.933549] E [MSGID: 101191] [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch handler [2019-01-31 
13:44:00.934170] I [MSGID: 114046] [client-handshake.c:1107:client_setvolume_cbk] 0-volume1-client-0: Connected to volume1-client-0, attached to remote volume '/gluster/brick1/data'. [2019-01-31 13:44:00.934210] I [MSGID: 108005] [afr-common.c:5237:__afr_handle_child_up_event] 0-volume1-replicate-0: Subvolume 'volume1-client-0' came back up; going online. [2019-01-31 13:44:00.935291] I [MSGID: 114046] [client-handshake.c:1107:client_setvolume_cbk] 0-volume1-client-1: Connected to volume1-client-1, attached to remote volume '/gluster/brick2/data'. [2019-01-31 13:44:00.937661] I [fuse-bridge.c:4267:fuse_init] 0-glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.24 kernel 7.28 [2019-01-31 13:44:00.937691] I [fuse-bridge.c:4878:fuse_graph_sync] 0-fuse: switched to graph 0 [2019-01-31 13:44:14.852144] W [dict.c:761:dict_ref] (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7c45) [0x7fb0e0b49c45] -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaba1) [0x7fb0e0b5cba1] -->/lib64/libglusterfs.so.0(dict_ref+0x60) [0x7fb0f2457c40] ) 0-dict: dict is NULL [Invalid argument] [2019-01-31 13:44:14.962950] E [MSGID: 101191] [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch handler [2019-01-31 13:44:15.038615] W [dict.c:761:dict_ref] (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7c45) [0x7fb0e0b49c45] -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaba1) [0x7fb0e0b5cba1] -->/lib64/libglusterfs.so.0(dict_ref+0x60) [0x7fb0f2457c40] ) 0-dict: dict is NULL [Invalid argument] [2019-01-31 13:44:15.040956] W [dict.c:761:dict_ref] (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7c45) [0x7fb0e0b49c45] -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaba1) [0x7fb0e0b5cba1] -->/lib64/libglusterfs.so.0(dict_ref+0x60) [0x7fb0f2457c40] ) 0-dict: dict is NULL [Invalid argument] [2019-01-31 13:44:15.041044] W [dict.c:761:dict_ref] 
(-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7c45) [0x7fb0e0b49c45] -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaba1) [0x7fb0e0b5cba1] -->/lib64/libglusterfs.so.0(dict_ref+0x60) [0x7fb0f2457c40] ) 0-dict: dict is NULL [Invalid argument] [2019-01-31 13:44:15.041467] W [dict.c:761:dict_ref] (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7c45) [0x7fb0e0b49c45] -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaba1) [0x7fb0e0b5cba1] -->/lib64/libglusterfs.so.0(dict_ref+0x60) [0x7fb0f2457c40] ) 0-dict: dict is NULL [Invalid argument] [2019-01-31 13:44:15.471018] W [dict.c:761:dict_ref] (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7c45) [0x7fb0e0b49c45] -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaba1) [0x7fb0e0b5cba1] -->/lib64/libglusterfs.so.0(dict_ref+0x60) [0x7fb0f2457c40] ) 0-dict: dict is NULL [Invalid argument] [2019-01-31 13:44:15.477003] W [dict.c:761:dict_ref] (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7c45) [0x7fb0e0b49c45] -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaba1) [0x7fb0e0b5cba1] -->/lib64/libglusterfs.so.0(dict_ref+0x60) [0x7fb0f2457c40] ) 0-dict: dict is NULL [Invalid argument] [2019-01-31 13:44:15.482380] W [dict.c:761:dict_ref] (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7c45) [0x7fb0e0b49c45] -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaba1) [0x7fb0e0b5cba1] -->/lib64/libglusterfs.so.0(dict_ref+0x60) [0x7fb0f2457c40] ) 0-dict: dict is NULL [Invalid argument] [2019-01-31 13:44:15.487047] W [dict.c:761:dict_ref] (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7c45) [0x7fb0e0b49c45] -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaba1) [0x7fb0e0b5cba1] -->/lib64/libglusterfs.so.0(dict_ref+0x60) [0x7fb0f2457c40] ) 0-dict: dict is NULL [Invalid argument] [2019-01-31 13:44:15.603624] W [dict.c:761:dict_ref] 
(-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7c45) [0x7fb0e0b49c45] -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaba1) [0x7fb0e0b5cba1] -->/lib64/libglusterfs.so.0(dict_ref+0x60) [0x7fb0f2457c40] ) 0-dict: dict is NULL [Invalid argument] [2019-01-31 13:44:15.607726] W [dict.c:761:dict_ref] (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7c45) [0x7fb0e0b49c45] -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaba1) [0x7fb0e0b5cba1] -->/lib64/libglusterfs.so.0(dict_ref+0x60) [0x7fb0f2457c40] ) 0-dict: dict is NULL [Invalid argument] [2019-01-31 13:44:15.607906] W [dict.c:761:dict_ref] (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7c45) [0x7fb0e0b49c45] -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaba1) [0x7fb0e0b5cba1] -->/lib64/libglusterfs.so.0(dict_ref+0x60) [0x7fb0f2457c40] ) 0-dict: dict is NULL [Invalid argument] From amudhan83 at gmail.com Thu Jan 31 14:08:26 2019 From: amudhan83 at gmail.com (Amudhan P) Date: Thu, 31 Jan 2019 19:38:26 +0530 Subject: [Gluster-users] glusterfs 4.1.6 improving folder listing Message-ID: Hi, What is the option to improve folder listing speed in glusterfs 4.1.6 with distributed-disperse volume? regards Amudhan -------------- next part -------------- An HTML attachment was scrubbed... URL: From g.amedick at uni-luebeck.de Thu Jan 31 15:36:58 2019 From: g.amedick at uni-luebeck.de (Gudrun Mareike Amedick) Date: Thu, 31 Jan 2019 16:36:58 +0100 Subject: [Gluster-users] Files losing permissions in GlusterFS 3.12 In-Reply-To: References: <1548429058.2018.12.camel@uni-luebeck.de> <1548667421.10294.221.camel@uni-luebeck.de> <1548855752.2018.39.camel@uni-luebeck.de> Message-ID: <1548949018.2018.57.camel@uni-luebeck.de> Hi Nithya, That's what I'm getting from file3: getfattr -d -m. -e hex $file3? 
# file: $file3
trusted.ec.config=0x0000080602000200
trusted.ec.dirty=0x00000000000000000000000000000000
trusted.ec.size=0x00000000006c8aba
trusted.ec.version=0x000000000000000f0000000000000019
trusted.gfid=0x47d6124290e844e2b733740134a657ce
trusted.gfid2path.60d8a15c6ccaf15b=0x36363732366635372d396533652d343337372d616637382d6366353061636434306265322f616c676f732e63707974686f6e2d33356d2d7838365f36342d6c696e75782d676e752e736f
trusted.glusterfs.quota.66726f57-9e3e-4377-af78-cf50acd40be2.contri.3=0x00000000001b24000000000000000001
trusted.pgfid.66726f57-9e3e-4377-af78-cf50acd40be2=0x00000001

So, no dht attribute. I think.

That's what I found in the rebalance logs. rebalance.log.3 was another rebalance that, to our knowledge, finished without problems. I included the results from both rebalances, just in case. There is no mention of this file in the logs of the other servers.

root at gluster06:/var/log/glusterfs# zgrep $file3 $VOLUME-rebalance.log*
$VOLUME-rebalance.log.1:[2019-01-12 07:17:06.243620] I [MSGID: 109045] [dht-common.c:2456:dht_lookup_cbk] 0-$VOLUME-dht: linkfile not having link subvol for $file3
$VOLUME-rebalance.log.1:[2019-01-12 07:17:06.275213] I [MSGID: 109069] [dht-common.c:1410:dht_lookup_unlink_of_false_linkto_cbk] 0-$VOLUME-dht: lookup_unlink returned with op_ret -> 0 and op-errno -> 0 for $file3
$VOLUME-rebalance.log.1:[2019-01-12 07:17:06.307754] I [dht-rebalance.c:1570:dht_migrate_file] 0-$VOLUME-dht: $file3: attempting to move from $VOLUME-readdir-ahead-6 to $VOLUME-readdir-ahead-8
$VOLUME-rebalance.log.1:[2019-01-12 07:17:06.341451] I [dht-rebalance.c:1570:dht_migrate_file] 0-$VOLUME-dht: $file3: attempting to move from $VOLUME-readdir-ahead-6 to $VOLUME-readdir-ahead-8
$VOLUME-rebalance.log.1:[2019-01-12 07:17:06.488473] I [MSGID: 109022] [dht-rebalance.c:2218:dht_migrate_file] 0-$VOLUME-dht: completed migration of $file3 from subvolume $VOLUME-readdir-ahead-6 to $VOLUME-readdir-ahead-8
$VOLUME-rebalance.log.1:[2019-01-12 07:17:06.494803] W
[MSGID: 109023] [dht-rebalance.c:2094:dht_migrate_file] 0-$VOLUME-dht: Migrate file failed:$file3: failed to get xattr from $VOLUME-readdir-ahead-6 [No such file or directory]
$VOLUME-rebalance.log.1:[2019-01-12 07:17:06.499016] W [dht-rebalance.c:2159:dht_migrate_file] 0-$VOLUME-dht: $file3: failed to perform removexattr on $VOLUME-readdir-ahead-8 (No data available)
$VOLUME-rebalance.log.1:[2019-01-12 07:17:06.499776] W [MSGID: 109023] [dht-rebalance.c:2179:dht_migrate_file] 0-$VOLUME-dht: $file3: failed to do a stat on $VOLUME-readdir-ahead-6 [No such file or directory]
$VOLUME-rebalance.log.1:[2019-01-12 07:17:06.500900] I [MSGID: 109022] [dht-rebalance.c:2218:dht_migrate_file] 0-$VOLUME-dht: completed migration of $file3 from subvolume $VOLUME-readdir-ahead-6 to $VOLUME-readdir-ahead-8
$VOLUME-rebalance.log.3.gz:[2018-12-10 23:18:43.145616] I [dht-rebalance.c:1570:dht_migrate_file] 0-$VOLUME-dht: $file3: attempting to move from $VOLUME-disperse-6 to $VOLUME-disperse-8
$VOLUME-rebalance.log.3.gz:[2018-12-10 23:18:43.150303] W [MSGID: 109023] [dht-rebalance.c:1013:__dht_check_free_space] 0-$VOLUME-dht: data movement of file {blocks:13896 name:($file3)} would result in dst node ($VOLUME-disperse-8:23116260576) having lower disk space than the source node ($VOLUME-disperse-6:23521698592).Skipping file.
$VOLUME-rebalance.log.3.gz:[2018-12-10 23:18:43.153051] I [MSGID: 109126] [dht-rebalance.c:2812:gf_defrag_migrate_single_file] 0-$VOLUME-dht: File migration skipped for $file3.

Kind regards,

Gudrun

On Thursday, 31.01.2019, 14:46 +0530, Nithya Balachandran wrote:
>
>
> On Wed, 30 Jan 2019 at 19:12, Gudrun Mareike Amedick wrote:
> > Hi,
> >
> > a bit of additional info inline
> > On Monday, 28.01.2019, 10:23 +0100, Frank Ruehlemann wrote:
> > > On Monday, 28.01.2019, 09:50 +0530, Nithya Balachandran wrote:
> > > >
> > > > On Fri, 25 Jan 2019 at 20:51, Gudrun Mareike Amedick <
> > > > g.amedick at uni-luebeck.de> wrote:
> > > >
> > > > >
> > > > > Hi all,
> > > > >
> > > > > we have a problem with a distributed dispersed volume (GlusterFS 3.12). We
> > > > > have files that lost their permissions or gained sticky bits. The files
> > > > > themselves seem to be okay.
> > > > >
> > > > > It looks like this:
> > > > >
> > > > > # ls -lah $file1
> > > > > ---------- 1 www-data www-data 45M Jan 12 07:01 $file1
> > > > >
> > > > > # ls -lah $file2
> > > > > -rw-rwS--T 1 $user $group 11K Jan  9 11:48 $file2
> > > > >
> > > > > # ls -lah $file3
> > > > > ---------T 1 $user $group 6.8M Jan 12 08:17 $file3
> > > > >
> > > > > These are linkto files (internal dht files) and should not be visible on
> > > > the mount point. Are they consistently visible like this or do they revert
> > > > to the proper permissions after some time?
> > > They didn't heal yet, even after more than 4 weeks. Therefore we decided
> > > to recommend our users to fix their files by setting the correct
> > > permissions again, which worked without problems. But for analysis
> > > reasons we still have some broken files nobody touched yet.
> > >
> > > We know these linkto files but they were never visible to clients. We
> > > did these ls-commands on a client, not on a brick.
> >
> > They have linkfile permissions but on brick side, it looks like this:
> >
> > root at gluster06:~# ls -lah /$brick/$file3
> > ---------T 2 $user $group 1.7M Jan 12 08:17 /$brick/$file3
> >
> > That seems to be too big for a linkfile. Also, there is no file it could link to. There's no other file with that name at that path on any other
> > subvolume.
> This sounds like the rebalance failed to transition the file from a linkto to a data file once the migration was complete. Please check the
> rebalance logs on all nodes for any messages that refer to this file.
> If you still see any such files, please check its xattrs directly on the brick. You should see one called trusted.glusterfs.dht.linkto. Let me
> know if that is missing.
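The xattr check suggested above can be scripted when many files are affected. This is a minimal Python sketch, assuming the text output of `getfattr -d -m. -e hex` as pasted earlier in this message; `parse_getfattr`, `has_linkto` and `decode_text` are hypothetical helper names, and nothing here talks to Gluster itself.

```python
def parse_getfattr(text):
    """Parse `getfattr -d -m. -e hex` output into {xattr name: hex digits}."""
    attrs = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the "# file: ..." header.
        if not line or line.startswith("#"):
            continue
        name, sep, value = line.partition("=")
        if sep and value.startswith("0x"):
            attrs[name] = value[2:]
    return attrs


def has_linkto(attrs):
    """True if the DHT linkto xattr named in the thread is present."""
    return "trusted.glusterfs.dht.linkto" in attrs


def decode_text(hex_digits):
    """Decode a hex-encoded xattr value that holds text, dropping trailing NULs."""
    raw = bytes.fromhex(hex_digits)
    return raw.decode("utf-8", errors="replace").rstrip("\x00")
```

For the `$file3` output above, `has_linkto` would return False, which matches the "no dht attribute" reading. `decode_text` is only meaningful for values that hold text, such as the `trusted.gfid2path.*` value, which appears to decode to a parent GFID followed by the file's base name; binary values such as `trusted.gfid` are better left as hex.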
> > Regards, > Nithya > > > > > > >? > > > >? > > > > >? > > > > > This is not what the permissions are supposed to look. They were 644 or > > > > > 660 before. And they definitely had no sticky bits. > > > > > The permissions on the bricks match what I see on client side. So I think > > > > > the original permissions are lost without a chance to recover them, right? > > > > >? > > > > >? > > > > > With some files with weird looking permissions (but not with all of them), > > > > > I can do this: > > > > > # ls -lah $path/$file4 > > > > > -rw-r--r-- 1 $user $group 6.0G Oct 11 09:34 $path/$file4 > > > > > ls -lah $path | grep $file4 > > > > > -rw-r-Sr-T??1 $user$group 6.0G Oct 11 09:34 $file4 > > > >? > > > > >? > > > > > So, the permissions I see depend on how I'm querying them. The permissions > > > > > on brick side agree with the ladder result, stat sees the former. I'm not > > > > > sure how that works. > > > > >? > > > > The S and T bits indicate that a file is being migrated. The difference > > > > seems to be because of the way lookup versus readdirp handle this??- this > > > > looks like a bug. Lookup will strip out the internal permissions set. I > > > > don't think readdirp does. This is happening because a rebalance is in > > > > progress. > > > There is no active rebalance. At least in "gluster volume rebalance > > > $VOLUME status" is none visible. > > >? > > > And in the rebalance log file of this volume is the last line: > > > "[2019-01-11 02:14:50.101944] W ? received signum (15), shutting down" > > >? > > > >? > > > > >? > > > > > We know for at least a part of those files that they were okay at December > > > > > 19th. We got the first reports of weird-looking permissions at January > > > > > 12th. Between that, there was a rebalance running (January 7th to January > > > > > 11th). During that rebalance, a node was offline for a longer period of time > > > > > due to hardware issues. 
The output of "gluster volume heal $VOLUME info" > > > > > shows no files though. > > > > >? > > > > > For all files with broken permissions we found so far, the following lines > > > > > are in the rebalance log: > > > > >? > > > > > [2019-01-07 09:31:11.004802] I [MSGID: 109045] > > > > > [dht-common.c:2456:dht_lookup_cbk] 0-$VOLUME-dht: linkfile not having link > > > > > subvol for $file5 > > > > > [2019-01-07 09:31:11.262273] I [MSGID: 109069] > > > > > [dht-common.c:1410:dht_lookup_unlink_of_false_linkto_cbk] 0-$VOLUME-dht: > > > > > lookup_unlink returned with > > > > > op_ret -> 0 and op-errno -> 0 for $file5 > > > > > [2019-01-07 09:31:11.266014] I [dht-rebalance.c:1570:dht_migrate_file] > > > > > 0-$VOLUME-dht: $file5: attempting to move from $VOLUME-readdir-ahead-0 to > > > > > $VOLUME-readdir-ahead-5 > > > > > [2019-01-07 09:31:11.278120] I [dht-rebalance.c:1570:dht_migrate_file] > > > > > 0-$VOLUME-dht: $file5: attempting to move from $VOLUME-readdir-ahead-0 to > > > > > $VOLUME-readdir-ahead-5 > > > > > [2019-01-07 09:31:11.732175] W [dht-rebalance.c:2159:dht_migrate_file] > > > > > 0-$VOLUME-dht: $file5: failed to perform removexattr on > > > > > $VOLUME-readdir-ahead-0 > > > > > (No data available) > > > > > [2019-01-07 09:31:11.737319] W [MSGID: 109023] > > > > > [dht-rebalance.c:2179:dht_migrate_file] 0-$VOLUME-dht: $file5: failed to do > > > > > a stat on $VOLUME-readdir- > > > > > ahead-0 [No such file or directory] > > > > > [2019-01-07 09:31:11.744382] I [MSGID: 109022] > > > > > [dht-rebalance.c:2218:dht_migrate_file] 0-$VOLUME-dht: completed migration > > > > > of $file5 from subvolume > > > > > $VOLUME-readdir-ahead-0 to $VOLUME-readdir-ahead-5 > > > > > [2019-01-07 09:31:11.744676] I [MSGID: 109022] > > > > > [dht-rebalance.c:2218:dht_migrate_file] 0-$VOLUME-dht: completed migration > > > > > of $file5 from subvolume > > > > > $VOLUME-readdir-ahead-0 to $VOLUME-readdir-ahead-5 > > > > >? > > > > >? > > > > >? 
> > > > > I've searched the brick logs for $file5 with broken permissions and found > > > > > this on all bricks from (I think) the subvolume $VOLUME-readdir-ahead-5: > > > > >? > > > > > [2019-01-07 09:32:13.821545] I [MSGID: 113030] [posix.c:2171:posix_unlink] > > > > > 0-$VOLUME-posix: open-fd-key-status: 0 for $file5 > > > > > [2019-01-07 09:32:13.821609] I [MSGID: 113031] > > > > > [posix.c:2084:posix_skip_non_linkto_unlink] 0-posix: linkto_xattr status: 0 > > > > > for $file5 > > > > >? > > > > >? > > > > >? > > > > > Also, we noticed that many directories got their modification time > > > > > updated. It was set to the rebalance date. Is that supposed to happen? > > > > >? > > > > >? > > > > > We had parallel-readdir enabled during the rebalance. We disabled it since > > > > > we had empty directories that couldn't be deleted. I was able to delete > > > > > those dirs after that. > > > > >? > > > > Was this disabled during the rebalance? parallel-readdirp changes the > > > > volume graph for clients but not for the rebalance process causing it to > > > > fail to find the linkto subvols. > > > Yes, parallel-readdirp was enabled during the rebalance. But we disabled > > > it after some files where invisible on the client side again. > > > > The timetable looks like this: > > > > December 12th: parallel-readdir enabled > > January 7th: rebalance started > > January 11th/12th: rebalance finished (varied a bit, some servers were faster) > > January 15th: parallel-readdir disabled > > > > >? > > > >? > > > > >? > > > > >? > > > > > Also, we have directories who lost their GFID on some bricks. Again. > > > >? > > > > Is this the missing symlink problem that was reported earlier? > > > > Looks like. I had a dir with missing GFID on one brick, I couldn't see some files on client side, I recreated the GFID symlink and everything was > > fine > > again. 
> > And in the brick log, I had this entry (with 1d372a8a-4958-4700-8ef1-fa4f756baad3 being the GFID of the dir in question):
> >
> > [2019-01-13 17:57:55.020859] W [MSGID: 113103] [posix.c:301:posix_lookup] 0-$VOLUME-posix: Found stale gfid handle
> > /srv/glusterfs/bricks/$brick/data/.glusterfs/1d/37/1d372a8a-4958-4700-8ef1-fa4f756baad3, removing it. [No such file or directory]
> >
> > Very familiar. At least, I know how to fix that :D
> >
> > Kind regards
> >
> > Gudrun
> >
> > > >
> > > > Regards,
> > > > Nithya
> > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > What happened? Can we do something to fix this? And could that happen
> > > > > again?
> > > > >
> > > > > We want to upgrade to 4.1 soon. Is it safe to do that or could it make
> > > > > things worse?
> > > > >
> > > > > Kind regards
> > > > >
> > > > > Gudrun Amedick_______________________________________________
> > > > > Gluster-users mailing list
> > > > > Gluster-users at gluster.org
> > > > > https://lists.gluster.org/mailman/listinfo/gluster-users
> > > > _______________________________________________
> > > > Gluster-users mailing list
> > > > Gluster-users at gluster.org
> > > > https://lists.gluster.org/mailman/listinfo/gluster-users
-------------- next part --------------
A non-text attachment was scrubbed...
Name: smime.p7s
Type: application/x-pkcs7-signature
Size: 6743 bytes
Desc: not available
URL:

From amye at redhat.com Thu Jan 31 17:33:13 2019
From: amye at redhat.com (Amye Scavarda)
Date: Thu, 31 Jan 2019 17:33:13 +0000
Subject: [Gluster-users] Gluster Monthly Newsletter, January 2019
Message-ID:

Gluster Monthly Newsletter, January 2019

Gluster Community Survey - open from February 1st through February 28! Give us your feedback, we'll send you a never before seen Gluster branded item!
https://www.gluster.org/gluster-community-survey-february-2019/

See you at FOSDEM!
We have a jam-packed Software Defined Storage day on Sunday, Feb 3rd (with a few sessions on the previous day): https://fosdem.org/2019/schedule/track/software_defined_storage/
We also have a shared stand with Ceph, come find us!

Gluster 6 - We're in planning for our Gluster 6 release, currently scheduled for Feb-March 2019. More details on the mailing lists at https://lists.gluster.org/pipermail/gluster-devel/2018-November/055672.html

Contributors
Top Contributing Companies: Red Hat, Comcast, DataLab, Gentoo Linux, Facebook, BioDec, Samsung, Etersoft
Top Contributors in January: Amar Tumballi, Kinglong Mee, Sunny Kumar, Susant Palai, Ravishankar N

Noteworthy Threads:
[Gluster-users] GCS 0.5 release - https://lists.gluster.org/pipermail/gluster-users/2019-January/035597.html
[Gluster-users] Announcing Gluster release 5.3 and 4.1.7 - https://lists.gluster.org/pipermail/gluster-users/2019-January/035656.html
[Gluster-users] Improvements to Gluster upstream documentation - https://lists.gluster.org/pipermail/gluster-users/2019-January/035741.html
[Gluster-devel] Tests for the GCS stack using the k8s framework
https://lists.gluster.org/pipermail/gluster-devel/2019-January/055765.html
[Gluster-devel] Gluster Maintainer's meeting: 7th Jan, 2019 - Agenda
https://lists.gluster.org/pipermail/gluster-devel/2019-January/055767.html
[Gluster-devel] Implementing multiplexing for self heal client.
https://lists.gluster.org/pipermail/gluster-devel/2019-January/055768.html
[Gluster-devel] Regression health for release-5.next and release-6
https://lists.gluster.org/pipermail/gluster-devel/2019-January/055775.html
[Gluster-devel] FUSE directory filehandle
https://lists.gluster.org/pipermail/gluster-devel/2019-January/055776.html
[Gluster-devel] Latency analysis of GlusterFS' network layer for pgbench
https://lists.gluster.org/pipermail/gluster-devel/2019-January/055782.html
[Gluster-devel] Release 6: Kick off!
https://lists.gluster.org/pipermail/gluster-devel/2019-January/055793.html
[Gluster-devel] Maintainer's meeting: Jan 21st, 2019
https://lists.gluster.org/pipermail/gluster-devel/2019-January/055798.html
[Gluster-devel] Infer results - Glusterfs
https://lists.gluster.org/pipermail/gluster-devel/2019-January/055814.html

Events:
FOSDEM, Feb 2-3 2019 in Brussels, Belgium - https://fosdem.org/2019/
Vault: February 25-26, 2019 - https://www.usenix.org/conference/vault19/

--
Amye Scavarda | amye at redhat.com | Gluster Community Lead
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From archon810 at gmail.com Thu Jan 31 18:09:45 2019
From: archon810 at gmail.com (Artem Russakovskii)
Date: Thu, 31 Jan 2019 10:09:45 -0800
Subject: [Gluster-users] Message repeated over and over after upgrade from 4.1 to 5.3: W [dict.c:761:dict_ref] (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) [0x7fd966fcd329] -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) [0x7fd9671deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) [0x7fd9731ea218] ) 2-dict: dict is NULL [Invalid argument]
In-Reply-To:
References:
Message-ID:

Within 24 hours after updating from rock solid 4.1 to 5.3, I already got a crash which others have mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=1313567 and had to unmount, kill gluster, and remount:

[2019-01-31 09:38:04.317604] W [dict.c:761:dict_ref] (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) [0x7fcccafcd329] -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument]
[2019-01-31 09:38:04.319308] W [dict.c:761:dict_ref] (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) [0x7fcccafcd329] -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58)
[0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument] [2019-01-31 09:38:04.320047] W [dict.c:761:dict_ref] (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) [0x7fcccafcd329] -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument] [2019-01-31 09:38:04.320677] W [dict.c:761:dict_ref] (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) [0x7fcccafcd329] -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument] The message "I [MSGID: 108031] [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data1-replicate-0: selecting local read_child SITE_data1-client-3" repeated 5 times between [2019-01-31 09:37:54.751905] and [2019-01-31 09:38:03.958061] The message "E [MSGID: 101191] [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch handler" repeated 72 times between [2019-01-31 09:37:53.746741] and [2019-01-31 09:38:04.696993] pending frames: frame : type(1) op(READ) frame : type(1) op(OPEN) frame : type(0) op(0) patchset: git://git.gluster.org/glusterfs.git signal received: 6 time of crash: 2019-01-31 09:38:04 configuration details: argp 1 backtrace 1 dlfcn 1 libpthread 1 llistxattr 1 setfsid 1 spinlock 1 epoll.h 1 xattr.h 1 st_atim.tv_nsec 1 package-string: glusterfs 5.3 /usr/lib64/libglusterfs.so.0(+0x2764c)[0x7fccd706664c] /usr/lib64/libglusterfs.so.0(gf_print_trace+0x306)[0x7fccd7070cb6] /lib64/libc.so.6(+0x36160)[0x7fccd622d160] /lib64/libc.so.6(gsignal+0x110)[0x7fccd622d0e0] /lib64/libc.so.6(abort+0x151)[0x7fccd622e6c1] /lib64/libc.so.6(+0x2e6fa)[0x7fccd62256fa] /lib64/libc.so.6(+0x2e772)[0x7fccd6225772] /lib64/libpthread.so.0(pthread_mutex_lock+0x228)[0x7fccd65bb0b8] 
/usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x32c4d)[0x7fcccbb01c4d] /usr/lib64/glusterfs/5.3/xlator/protocol/client.so(+0x65778)[0x7fcccbdd1778] /usr/lib64/libgfrpc.so.0(+0xe820)[0x7fccd6e31820] /usr/lib64/libgfrpc.so.0(+0xeb6f)[0x7fccd6e31b6f] /usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7fccd6e2e063] /usr/lib64/glusterfs/5.3/rpc-transport/socket.so(+0xa0b2)[0x7fccd0b7e0b2] /usr/lib64/libglusterfs.so.0(+0x854c3)[0x7fccd70c44c3] /lib64/libpthread.so.0(+0x7559)[0x7fccd65b8559] /lib64/libc.so.6(clone+0x3f)[0x7fccd62ef81f] --------- Do the pending patches fix the crash or only the repeated warnings? I'm running glusterfs on OpenSUSE 15.0 installed via http://download.opensuse.org/repositories/home:/glusterfs:/Leap15-5/openSUSE_Leap_15.0/, not too sure how to make it core dump. If it's not fixed by the patches above, has anyone already opened a ticket for the crashes that I can join and monitor? This is going to create a massive problem for us since production systems are crashing. Thanks. Sincerely, Artem -- Founder, Android Police , APK Mirror , Illogical Robot LLC beerpla.net | +ArtemRussakovskii | @ArtemR On Wed, Jan 30, 2019 at 6:37 PM Raghavendra Gowdappa wrote: > > > On Thu, Jan 31, 2019 at 2:14 AM Artem Russakovskii > wrote: > >> Also, not sure if related or not, but I got a ton of these "Failed to >> dispatch handler" in my logs as well. Many people have been commenting >> about this issue here https://bugzilla.redhat.com/show_bug.cgi?id=1651246 >> . >> > > https://review.gluster.org/#/c/glusterfs/+/22046/ addresses this. 
>
>
>> ==> mnt-SITE_data1.log <==
>>> [2019-01-30 20:38:20.783713] W [dict.c:761:dict_ref]
>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329)
>>> [0x7fd966fcd329]
>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5)
>>> [0x7fd9671deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58)
>>> [0x7fd9731ea218] ) 2-dict: dict is NULL [Invalid argument]
>>> ==> mnt-SITE_data3.log <==
>>> The message "E [MSGID: 101191]
>>> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch
>>> handler" repeated 413 times between [2019-01-30 20:36:23.881090] and
>>> [2019-01-30 20:38:20.015593]
>>> The message "I [MSGID: 108031]
>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data3-replicate-0:
>>> selecting local read_child SITE_data3-client-0" repeated 42 times between
>>> [2019-01-30 20:36:23.290287] and [2019-01-30 20:38:20.280306]
>>> ==> mnt-SITE_data1.log <==
>>> The message "I [MSGID: 108031]
>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data1-replicate-0:
>>> selecting local read_child SITE_data1-client-0" repeated 50 times between
>>> [2019-01-30 20:36:22.247367] and [2019-01-30 20:38:19.459789]
>>> The message "E [MSGID: 101191]
>>> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch
>>> handler" repeated 2654 times between [2019-01-30 20:36:22.667327] and
>>> [2019-01-30 20:38:20.546355]
>>> [2019-01-30 20:38:21.492319] I [MSGID: 108031]
>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data1-replicate-0:
>>> selecting local read_child SITE_data1-client-0
>>> ==> mnt-SITE_data3.log <==
>>> [2019-01-30 20:38:22.349689] I [MSGID: 108031]
>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data3-replicate-0:
>>> selecting local read_child SITE_data3-client-0
>>> ==> mnt-SITE_data1.log <==
>>> [2019-01-30 20:38:22.762941] E [MSGID: 101191]
>>> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch
>>> handler
>>
>>
>> I'm hoping raising the issue
here on the mailing list may bring some
>> additional eyeballs and get them both fixed.
>>
>> Thanks.
>>
>> Sincerely,
>> Artem
>>
>> --
>> Founder, Android Police, APK Mirror, Illogical Robot LLC
>> beerpla.net | +ArtemRussakovskii | @ArtemR
>>
>>
>>
>> On Wed, Jan 30, 2019 at 12:26 PM Artem Russakovskii wrote:
>>
>>> I found a similar issue here:
>>> https://bugzilla.redhat.com/show_bug.cgi?id=1313567. There's a comment
>>> from 3 days ago from someone else with 5.3 who started seeing the spam.
>>>
>>> Here's the message that repeats over and over:
>>> [2019-01-30 20:23:24.481581] W [dict.c:761:dict_ref]
>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329)
>>> [0x7fd966fcd329]
>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5)
>>> [0x7fd9671deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58)
>>> [0x7fd9731ea218] ) 2-dict: dict is NULL [Invalid argument]
>>>
>>
> +Milind Changire Can you check why this message is logged and send a fix?
>
>
>>> Is there any fix for this issue?
>>>
>>> Thanks.
>>>
>>> Sincerely,
>>> Artem
>>>
>>> --
>>> Founder, Android Police, APK Mirror, Illogical Robot LLC
>>> beerpla.net | +ArtemRussakovskii | @ArtemR
>>>
>>>
>> _______________________________________________
>> Gluster-users mailing list
>> Gluster-users at gluster.org
>> https://lists.gluster.org/mailman/listinfo/gluster-users
>
>