<div dir="ltr"><div>Hi Sunny and Strahil,</div><div><br></div><div>Thanks again for your responses. We don't have a lot of renaming activity - maybe some, but not a lot. We do have files which are open for writing for quite a while - they're call recordings being written as the call happens.<br></div><div><br></div><div>We've installed GlusterFS using the Ubuntu packages and I'd really like to avoid compiling and applying patches.</div><div><br></div><div>After enabling DEBUG the log at /var/log/glusterfs/geo-replication/gvol0_nvfs10_gvol0/gsyncd.log doesn't seem to show anything very unusual:<br></div><div><br></div><div>[2020-06-03 02:49:01.992177] I [master(worker /nodirectwritedata/gluster/gvol0):1384:process] _GMaster: Entry Time Taken MKD=0 MKN=0 LIN=0 SYM=0 REN=0 RMD=0CRE=0 duration=0.0000 UNL=0<br>[2020-06-03 02:49:01.992465] I [master(worker /nodirectwritedata/gluster/gvol0):1394:process] _GMaster: Data/Metadata Time Taken SETA=0 SETX=0 meta_duration=0.0000data_duration=13.0954 DATA=8 XATT=0<br>[2020-06-03 02:49:01.992863] I [master(worker /nodirectwritedata/gluster/gvol0):1404:process] _GMaster: Batch Completed changelog_end=1591152508 entry_stime=(1591152352, 0) changelog_start=1591152494 stime=(1591152507, 0) duration=13.1077 num_changelogs=2 mode=live_changelog<br>[2020-06-03 02:49:02.958687] D [repce(worker /nodirectwritedata/gluster/gvol0):196:push] RepceClient: call 19017:139678812452608:1591152542.96 keep_alive({'version': (1, 0), 'retval': 0, 'uuid': '8ee85fae-f3aa-4285-ad48-67a1dc17ed73', 'timeout': 1591152662, 'volume_mark': (1583043396, 161632)},) ...<br>[2020-06-03 02:49:02.979139] D [repce(worker /nodirectwritedata/gluster/gvol0):216:__call__] RepceClient: call 19017:139678812452608:1591152542.96 keep_alive -> 28<br>[2020-06-03 02:49:06.998127] D [master(worker /nodirectwritedata/gluster/gvol0):551:crawlwrap] _GMaster: ... crawl #114 done, took 30.180089 seconds<br>[2020-06-03 02:49:07.10132] D [repce(worker /nodirectwritedata/gluster/gvol0):196:push] RepceClient: call 19017:139679441716992:1591152547.01 scan() ...<br>[2020-06-03 02:49:07.10781] D [repce(worker /nodirectwritedata/gluster/gvol0):216:__call__] RepceClient: call 19017:139679441716992:1591152547.01 scan -> None<br>[2020-06-03 02:49:07.10935] D [repce(worker /nodirectwritedata/gluster/gvol0):196:push] RepceClient: call 19017:139679441716992:1591152547.01 getchanges() ...<br>[2020-06-03 02:49:07.11579] D [repce(worker /nodirectwritedata/gluster/gvol0):216:__call__] RepceClient: call 19017:139679441716992:1591152547.01 getchanges -> ['/var/lib/misc/gluster/gsyncd/gvol0_nvfs10_gvol0/nodirectwritedata-gluster-gvol0/.processing/CHANGELOG.1591152522']<br>[2020-06-03 02:49:07.11720] I [master(worker /nodirectwritedata/gluster/gvol0):1470:crawl] _GMaster: slave's time stime=(1591152507, 0)<br></div><div><br></div><div>Am I looking in the right place to find out what that gsyncd.py process is doing?</div><div><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, 2 Jun 2020 at 21:58, Sunny Kumar <<a href="mailto:sunkumar@redhat.com" target="_blank">sunkumar@redhat.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Hi David,<br>
<br>
You haven't answered my previous question regarding the type of your workload.<br>
---<br>
You can use the command below to enable debug logging:<br>
<br>
`gluster vol geo-rep <mastervol> <slavehost>::<slavevol> config log-level DEBUG`<br>
<br>
and after capturing the log, switch back to INFO mode:<br>
<br>
`gluster vol geo-rep <mastervol> <slavehost>::<slavevol> config log-level INFO`<br>
<br>
Please share the debug log and the geo-rep config so we can debug further.<br>
To get the config:<br>
<br>
`gluster vol geo-rep <mastervol> <slavehost>::<slavevol> config`<br>
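<br>
For your session (gvol0 replicating to nvfs10::gvol0, going by the paths earlier in this thread) a capture pass would look roughly like this - the output file name is just an example:<br>
<br>
# enable debug logging, then reproduce the high CPU for a few minutes<br>
gluster vol geo-rep gvol0 nvfs10::gvol0 config log-level DEBUG<br>
# collect /var/log/glusterfs/geo-replication/gvol0_nvfs10_gvol0/gsyncd.log,<br>
# then drop back to INFO and dump the session config<br>
gluster vol geo-rep gvol0 nvfs10::gvol0 config log-level INFO<br>
gluster vol geo-rep gvol0 nvfs10::gvol0 config > georep-config.txt<br>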
<br>
/sunny<br>
<br>
<br>
On Tue, Jun 2, 2020 at 10:18 AM Strahil Nikolov <<a href="mailto:hunter86_bg@yahoo.com" target="_blank">hunter86_bg@yahoo.com</a>> wrote:<br>
><br>
> Hi David,<br>
><br>
> in which log do you see the entries?<br>
><br>
> I think I have an explanation for why you see the process on only one of the master nodes - the geo-rep session is established from only one master node (I hope someone corrects me if I'm wrong) to one slave node. Thus it is natural to see the high CPU usage on only one master node in your situation.<br>
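><br>
> You can confirm that from the session status - only one brick worker per replica set should show as Active, e.g.:<br>
><br>
> gluster volume geo-replication gvol0 nvfs10::gvol0 status detail<br>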
><br>
> Do you see anything else in /var/log/glusterfs/geo-replication/<logs> (master nodes) or in /var/log/glusterfs/geo-replication-slaves (slaves) that could hint at the exact issue? I have a vague feeling that the python script is constantly looping over some data, causing the CPU hog.<br>
><br>
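> For example, you could sample what that worker process is actually doing (the PID is taken from your earlier ps output; whether strace/py-spy are installed is an assumption):<br>
><br>
> strace -p 32048 -f -c      # run for a minute, Ctrl-C prints a syscall summary<br>
> py-spy dump --pid 32048    # if available, shows the current python stack<br>
><br>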
> Sadly, I can't find instructions for increasing the log level of the geo-rep log.<br>
><br>
><br>
> Best Regards,<br>
> Strahil Nikolov<br>
><br>
><br>
> > On 2 June 2020 at 6:14:46 GMT+03:00, David Cunningham <<a href="mailto:dcunningham@voisonics.com" target="_blank">dcunningham@voisonics.com</a>> wrote:<br>
> >Hi Strahil and Sunny,<br>
> ><br>
> >Thank you for the replies. I checked the gfid on the master and slaves<br>
> >and<br>
> >they are the same. After moving the file away and back again it doesn't<br>
> >seem to be having the issue with that file any more.<br>
> ><br>
> >We are still getting higher CPU usage on one of the master nodes than<br>
> >the<br>
> >others. It logs this every few seconds:<br>
> ><br>
> >[2020-06-02 03:10:15.637815] I [master(worker<br>
> >/nodirectwritedata/gluster/gvol0):1384:process] _GMaster: Entry Time<br>
> >Taken<br>
> > MKD=0 MKN=0 LIN=0 SYM=0 REN=0 RMD=0CRE=0 duration=0.0000<br>
> >UNL=0<br>
> >[2020-06-02 03:10:15.638010] I [master(worker<br>
> >/nodirectwritedata/gluster/gvol0):1394:process] _GMaster: Data/Metadata<br>
> >Time Taken SETA=0 SETX=0<br>
> >meta_duration=0.0000data_duration=12.7878<br>
> > DATA=4 XATT=0<br>
> >[2020-06-02 03:10:15.638286] I [master(worker<br>
> >/nodirectwritedata/gluster/gvol0):1404:process] _GMaster: Batch<br>
> >Completed<br>
> >changelog_end=1591067378 entry_stime=(1591067167, 0)<br>
> >changelog_start=1591067364 stime=(1591067377, 0)<br>
> >duration=12.8068<br>
> > num_changelogs=2 mode=live_changelog<br>
> >[2020-06-02 03:10:20.658601] I [master(worker<br>
> >/nodirectwritedata/gluster/gvol0):1470:crawl] _GMaster: slave's time<br>
> > stime=(1591067377, 0)<br>
> >[2020-06-02 03:10:34.21799] I [master(worker<br>
> >/nodirectwritedata/gluster/gvol0):1954:syncjob] Syncer: Sync Time Taken<br>
> > duration=0.3826 num_files=8 job=1 return_code=0<br>
> >[2020-06-02 03:10:46.440535] I [master(worker<br>
> >/nodirectwritedata/gluster/gvol0):1384:process] _GMaster: Entry Time<br>
> >Taken<br>
> > MKD=0 MKN=0 LIN=0 SYM=0 REN=1 RMD=0CRE=2 duration=0.1314<br>
> >UNL=1<br>
> >[2020-06-02 03:10:46.440809] I [master(worker<br>
> >/nodirectwritedata/gluster/gvol0):1394:process] _GMaster: Data/Metadata<br>
> >Time Taken SETA=0 SETX=0<br>
> >meta_duration=0.0000data_duration=13.0171<br>
> > DATA=14 XATT=0<br>
> >[2020-06-02 03:10:46.441205] I [master(worker<br>
> >/nodirectwritedata/gluster/gvol0):1404:process] _GMaster: Batch<br>
> >Completed<br>
> >changelog_end=1591067420 entry_stime=(1591067419, 0)<br>
> >changelog_start=1591067392 stime=(1591067419, 0)<br>
> >duration=13.0322<br>
> > num_changelogs=3 mode=live_changelog<br>
> >[2020-06-02 03:10:51.460925] I [master(worker<br>
> >/nodirectwritedata/gluster/gvol0):1470:crawl] _GMaster: slave's time<br>
> > stime=(1591067419, 0)<br>
> ><br>
> >[2020-06-02 03:11:04.448913] I [master(worker<br>
> >/nodirectwritedata/gluster/gvol0):1954:syncjob] Syncer: Sync Time Taken<br>
> >duration=0.3466 num_files=3 job=1 return_code=0<br>
> ><br>
> >Whereas the other master nodes only log this:<br>
> ><br>
> >[2020-06-02 03:11:33.886938] I [gsyncd(config-get):308:main] <top>:<br>
> >Using<br>
> >session config file<br>
> >path=/var/lib/glusterd/geo-replication/gvol0_nvfs10_gvol0/gsyncd.conf<br>
> >[2020-06-02 03:11:33.993175] I [gsyncd(status):308:main] <top>: Using<br>
> >session config file<br>
> >path=/var/lib/glusterd/geo-replication/gvol0_nvfs10_gvol0/gsyncd.conf<br>
> ><br>
> >Can anyone help with what might cause the high CPU usage on one master<br>
> >node? The process is this one, and is using 70-100% of CPU:<br>
> ><br>
> >python2 /usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/gsyncd.py<br>
> >worker gvol0 nvfs10::gvol0 --feedback-fd 15 --local-path<br>
> >/nodirectwritedata/gluster/gvol0 --local-node cafs30 --local-node-id<br>
> >b7521445-ee93-4fed-8ced-6a609fa8c7d4 --slave-id<br>
> >cdcdb210-839c-4306-a4dc-e696b165ed17 --rpc-fd 12,11,9,13 --subvol-num 1<br>
> >--resource-remote nvfs30 --resource-remote-id<br>
> >1e698ccd-aeec-4ec4-96fe-383da8fc3b78<br>
> ><br>
> >Thank you in advance!<br>
> ><br>
> ><br>
> ><br>
> ><br>
> >On Sat, 30 May 2020 at 20:20, Strahil Nikolov <<a href="mailto:hunter86_bg@yahoo.com" target="_blank">hunter86_bg@yahoo.com</a>><br>
> >wrote:<br>
> ><br>
> >> Hey David,<br>
> >><br>
> >> for me a gfid mismatch means that the file was replaced/recreated -<br>
> >> just like vim on Linux does (and that is expected for a config file).<br>
> >><br>
> >> Have you checked the gfid of the file on both the source and the<br>
> >> destination - do they really match, or are they different?<br>
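> >><br>
> >> For example (brick path from your ps output; the fuse mount point is a placeholder):<br>
> >><br>
> >> getfattr -n trusted.gfid -e hex /nodirectwritedata/gluster/gvol0/path/to/file   # on a brick<br>
> >> getfattr -n glusterfs.gfid.string /mnt/gvol0/path/to/file                       # on a fuse mount<br>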
> >><br>
> >> What happens when you move the file away from the slave - does it fix<br>
> >> the issue?<br>
> >><br>
> >> Best Regards,<br>
> >> Strahil Nikolov<br>
> >><br>
> >> On 30 May 2020 at 1:10:56 GMT+03:00, David Cunningham <<br>
> >> <a href="mailto:dcunningham@voisonics.com" target="_blank">dcunningham@voisonics.com</a>> wrote:<br>
> >> >Hello,<br>
> >> ><br>
> >> >We're having an issue with a geo-replication process that has unusually<br>
> >> >high CPU use and is giving "Entry not present on master. Fixing gfid<br>
> >> >mismatch in slave" errors. Can anyone help with this?<br>
> >> ><br>
> >> >We have 3 GlusterFS replica nodes (we'll call the master), which also<br>
> >> >push data to a remote server (slave) using geo-replication. This has<br>
> >> >been running fine for a couple of months, but yesterday one of the<br>
> >> >master nodes started having unusually high CPU use. It's this process:<br>
> >> ><br>
> >> >root@cafs30:/var/log/glusterfs# ps aux | grep 32048<br>
> >> >root 32048 68.7 0.6 1843140 845756 ? Rl 02:51 493:51<br>
> >> >python2<br>
> >> >/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/gsyncd.py<br>
> >worker<br>
> >> >gvol0 nvfs10::gvol0 --feedback-fd 15 --local-path<br>
> >> >/nodirectwritedata/gluster/gvol0 --local-node cafs30 --local-node-id<br>
> >> >b7521445-ee93-4fed-8ced-6a609fa8c7d4 --slave-id<br>
> >> >cdcdb210-839c-4306-a4dc-e696b165ed17 --rpc-fd 12,11,9,13<br>
> >--subvol-num 1<br>
> >> >--resource-remote nvfs30 --resource-remote-id<br>
> >> >1e698ccd-aeec-4ec4-96fe-383da8fc3b78<br>
> >> ><br>
> >> >Here's what is being logged in<br>
> >> >/var/log/glusterfs/geo-replication/gvol0_nvfs10_gvol0/gsyncd.log:<br>
> >> ><br>
> >> >[2020-05-29 21:57:18.843524] I [master(worker<br>
> >> >/nodirectwritedata/gluster/gvol0):1470:crawl] _GMaster: slave's time<br>
> >> > stime=(1590789408, 0)<br>
> >> >[2020-05-29 21:57:30.626172] I [master(worker<br>
> >> >/nodirectwritedata/gluster/gvol0):813:fix_possible_entry_failures]<br>
> >> >_GMaster: Entry not present on master. Fixing gfid mismatch in<br>
> >slave.<br>
> >> >Deleting the entry retry_count=1 entry=({u'uid': 108, u'gfid':<br>
> >> >u'7c0b75e5-d8b7-454f-8010-112d613c599e', u'gid': 117, u'mode':<br>
> >33204,<br>
> >> >u'entry':<br>
> >> >u'.gfid/c5422396-1578-4b50-a29d-315be2a9c5d8/00a859f7xxxx.cfg',<br>
> >> >u'op': u'CREATE'}, 17, {u'slave_isdir': False, u'gfid_mismatch':<br>
> >True,<br>
> >> >u'slave_name': None, u'slave_gfid':<br>
> >> >u'ec4b0ace-2ec4-4ea5-adbc-9f519b81917c', u'name_mismatch': False,<br>
> >> >u'dst':<br>
> >> >False})<br>
> >> >[2020-05-29 21:57:30.627893] I [master(worker<br>
> >> >/nodirectwritedata/gluster/gvol0):813:fix_possible_entry_failures]<br>
> >> >_GMaster: Entry not present on master. Fixing gfid mismatch in<br>
> >slave.<br>
> >> >Deleting the entry retry_count=1 entry=({u'uid': 108, u'gfid':<br>
> >> >u'a4d52e40-2e2f-4885-be5f-65fe95a8ebd7', u'gid': 117, u'mode':<br>
> >33204,<br>
> >> >u'entry':<br>
> >><br>
> >>u'.gfid/f857c42e-22f1-4ce4-8f2e-13bdadedde45/polycom_00a859f7xxxx.cfg',<br>
> >> >u'op': u'CREATE'}, 17, {u'slave_isdir': False, u'gfid_mismatch':<br>
> >True,<br>
> >> >u'slave_name': None, u'slave_gfid':<br>
> >> >u'ece8da77-b5ea-45a7-9af7-7d4d8f55f74a', u'name_mismatch': False,<br>
> >> >u'dst':<br>
> >> >False})<br>
> >> >[2020-05-29 21:57:30.629532] I [master(worker<br>
> >> >/nodirectwritedata/gluster/gvol0):813:fix_possible_entry_failures]<br>
> >> >_GMaster: Entry not present on master. Fixing gfid mismatch in<br>
> >slave.<br>
> >> >Deleting the entry retry_count=1 entry=({u'uid': 108, u'gfid':<br>
> >> >u'3c525ad8-aeb2-46b6-9c41-7fb4987916f8', u'gid': 117, u'mode':<br>
> >33204,<br>
> >> >u'entry':<br>
> >><br>
> >>u'.gfid/f857c42e-22f1-4ce4-8f2e-13bdadedde45/00a859f7xxxx-directory.xml',<br>
> >> >u'op': u'CREATE'}, 17, {u'slave_isdir': False, u'gfid_mismatch':<br>
> >True,<br>
> >> >u'slave_name': None, u'slave_gfid':<br>
> >> >u'06717b5a-d842-495d-bd25-aab9cd454490', u'name_mismatch': False,<br>
> >> >u'dst':<br>
> >> >False})<br>
> >> >[2020-05-29 21:57:30.659123] I [master(worker<br>
> >> >/nodirectwritedata/gluster/gvol0):942:handle_entry_failures]<br>
> >_GMaster:<br>
> >> >Sucessfully fixed entry ops with gfid mismatch retry_count=1<br>
> >> >[2020-05-29 21:57:30.659343] I [master(worker<br>
> >> >/nodirectwritedata/gluster/gvol0):1194:process_change] _GMaster:<br>
> >Retry<br>
> >> >original entries. count = 1<br>
> >> >[2020-05-29 21:57:30.725810] I [master(worker<br>
> >> >/nodirectwritedata/gluster/gvol0):1197:process_change] _GMaster:<br>
> >> >Sucessfully fixed all entry ops with gfid mismatch<br>
> >> >[2020-05-29 21:57:31.747319] I [master(worker<br>
> >> >/nodirectwritedata/gluster/gvol0):1954:syncjob] Syncer: Sync Time<br>
> >Taken<br>
> >> >duration=0.7409 num_files=18 job=1 return_code=0<br>
> >> ><br>
> >> >We've verified that files like polycom_00a859f7xxxx.cfg referred to in<br>
> >> >the error do exist on the master nodes and the slave.<br>
> >> ><br>
> >> >We found this bug fix:<br>
> >> ><a href="https://bugzilla.redhat.com/show_bug.cgi?id=1642865" rel="noreferrer" target="_blank">https://bugzilla.redhat.com/show_bug.cgi?id=1642865</a><br>
> >> ><br>
> >> >However, that fix went into 5.1, and we're running 5.12 on the master<br>
> >> >nodes and the slave. A couple of GlusterFS clients connected to the<br>
> >> >master nodes are running 5.13.<br>
> >> ><br>
> >> >Would anyone have any suggestions? Thank you in advance.<br>
> >><br>
<br>
</blockquote></div><br clear="all"><br>-- <br><div dir="ltr"><div dir="ltr"><div><div dir="ltr"><div><div dir="ltr"><div><div dir="ltr"><div><div dir="ltr"><div>David Cunningham, Voisonics Limited<br><a href="http://voisonics.com/" target="_blank">http://voisonics.com/</a><br>USA: +1 213 221 1092<br>New Zealand: +64 (0)28 2558 3782</div></div></div></div></div></div></div></div></div></div></div>