[Gluster-users] Geo-replication: Entry not present on master. Fixing gfid mismatch in slave
David Cunningham
dcunningham at voisonics.com
Fri Jun 5 22:21:55 UTC 2020
Hi Sunny and Strahil,
Thanks again for your responses. We don't have a lot of renaming activity -
maybe some, but not a lot. We do have files which are open for writing for
quite a while - they're call recordings being written as the call happens.
We've installed GlusterFS using the Ubuntu packages and I'd really like to
avoid compiling and applying patches.
After enabling DEBUG the log at
/var/log/glusterfs/geo-replication/gvol0_nvfs10_gvol0/gsyncd.log doesn't
seem to show anything very unusual:
[2020-06-03 02:49:01.992177] I [master(worker
/nodirectwritedata/gluster/gvol0):1384:process] _GMaster: Entry Time Taken
MKD=0 MKN=0 LIN=0 SYM=0 REN=0 RMD=0CRE=0 duration=0.0000
UNL=0
[2020-06-03 02:49:01.992465] I [master(worker
/nodirectwritedata/gluster/gvol0):1394:process] _GMaster: Data/Metadata
Time Taken SETA=0 SETX=0 meta_duration=0.0000data_duration=13.0954
DATA=8 XATT=0
[2020-06-03 02:49:01.992863] I [master(worker
/nodirectwritedata/gluster/gvol0):1404:process] _GMaster: Batch Completed
changelog_end=1591152508 entry_stime=(1591152352, 0)
changelog_start=1591152494 stime=(1591152507, 0) duration=13.1077
num_changelogs=2 mode=live_changelog
[2020-06-03 02:49:02.958687] D [repce(worker
/nodirectwritedata/gluster/gvol0):196:push] RepceClient: call
19017:139678812452608:1591152542.96 keep_alive({'version': (1, 0),
'retval': 0, 'uuid': '8ee85fae-f3aa-4285-ad48-67a1dc17ed73', 'timeout':
1591152662, 'volume_mark': (1583043396, 161632)},) ...
[2020-06-03 02:49:02.979139] D [repce(worker
/nodirectwritedata/gluster/gvol0):216:__call__] RepceClient: call
19017:139678812452608:1591152542.96 keep_alive -> 28
[2020-06-03 02:49:06.998127] D [master(worker
/nodirectwritedata/gluster/gvol0):551:crawlwrap] _GMaster: ... crawl #114
done, took 30.180089 seconds
[2020-06-03 02:49:07.10132] D [repce(worker
/nodirectwritedata/gluster/gvol0):196:push] RepceClient: call
19017:139679441716992:1591152547.01 scan() ...
[2020-06-03 02:49:07.10781] D [repce(worker
/nodirectwritedata/gluster/gvol0):216:__call__] RepceClient: call
19017:139679441716992:1591152547.01 scan -> None
[2020-06-03 02:49:07.10935] D [repce(worker
/nodirectwritedata/gluster/gvol0):196:push] RepceClient: call
19017:139679441716992:1591152547.01 getchanges() ...
[2020-06-03 02:49:07.11579] D [repce(worker
/nodirectwritedata/gluster/gvol0):216:__call__] RepceClient: call
19017:139679441716992:1591152547.01 getchanges ->
['/var/lib/misc/gluster/gsyncd/gvol0_nvfs10_gvol0/nodirectwritedata-gluster-gvol0/.processing/CHANGELOG.1591152522']
[2020-06-03 02:49:07.11720] I [master(worker
/nodirectwritedata/gluster/gvol0):1470:crawl] _GMaster: slave's time
stime=(1591152507, 0)
Am I looking in the right place to find out what that gsyncd.py process is
doing?
On Tue, 2 Jun 2020 at 21:58, Sunny Kumar <sunkumar at redhat.com> wrote:
> Hi David,
>
> You haven't answered my previous question regarding the type of your
> workload.
> ---
> You can use the below command to enable debug log.
>
> `gluster vol geo-rep <mastervol> <slavehost>::<slavevol> config log-level
> DEBUG`
>
> and after capturing log again switch back to info mode:
>
> `gluster vol geo-rep <mastervol> <slavehost>::<slavevol> config log-level
> INFO`
>
> Please share the debug log and geo-rep config to debug further:
> for config:
>
> `gluster vol geo-rep <mastervol> <slavehost>::<slavevol> config`
>
> /sunny
>
>
> On Tue, Jun 2, 2020 at 10:18 AM Strahil Nikolov <hunter86_bg at yahoo.com>
> wrote:
> >
> > Hi David,
> >
> > in which log do you see the entries ?
> >
> > I think I got an explanation why you see the process only on one of the
> master nodes - geo-rep session is established from only 1 master node /I
> hope someone corrects me if I'm wrong/ to one slave node. Thus it will be
> natural to see the high CPU usage on only 1 master node in your situation.
> >
> > Do you see anything else in the :
> var/log/glusterfs/geo-replication/<logs> (master nodes) or in
> /var/log/glusterfs/geo-replication-slaves (slaves) that could hint of the
> exact issue. I have a vague feeling that that python script is constantly
> looping over some data causing the CPU hog.
> >
> > Sadly, I can't find an instruction for increasing the log level of the
> geo rep log .
> >
> >
> > Best Regards,
> > Strahil Nikolov
> >
> >
> > На 2 юни 2020 г. 6:14:46 GMT+03:00, David Cunningham <
> dcunningham at voisonics.com> написа:
> > >Hi Strahil and Sunny,
> > >
> > >Thank you for the replies. I checked the gfid on the master and slaves
> > >and
> > >they are the same. After moving the file away and back again it doesn't
> > >seem to be having the issue with that file any more.
> > >
> > >We are still getting higher CPU usage on one of the master nodes than
> > >the
> > >others. It logs this every few seconds:
> > >
> > >[2020-06-02 03:10:15.637815] I [master(worker
> > >/nodirectwritedata/gluster/gvol0):1384:process] _GMaster: Entry Time
> > >Taken
> > > MKD=0 MKN=0 LIN=0 SYM=0 REN=0 RMD=0CRE=0 duration=0.0000
> > >UNL=0
> > >[2020-06-02 03:10:15.638010] I [master(worker
> > >/nodirectwritedata/gluster/gvol0):1394:process] _GMaster: Data/Metadata
> > >Time Taken SETA=0 SETX=0
> > >meta_duration=0.0000data_duration=12.7878
> > > DATA=4 XATT=0
> > >[2020-06-02 03:10:15.638286] I [master(worker
> > >/nodirectwritedata/gluster/gvol0):1404:process] _GMaster: Batch
> > >Completed
> > >changelog_end=1591067378 entry_stime=(1591067167, 0)
> > >changelog_start=1591067364 stime=(1591067377, 0)
> > >duration=12.8068
> > > num_changelogs=2 mode=live_changelog
> > >[2020-06-02 03:10:20.658601] I [master(worker
> > >/nodirectwritedata/gluster/gvol0):1470:crawl] _GMaster: slave's time
> > > stime=(1591067377, 0)
> > >[2020-06-02 03:10:34.21799] I [master(worker
> > >/nodirectwritedata/gluster/gvol0):1954:syncjob] Syncer: Sync Time Taken
> > > duration=0.3826 num_files=8 job=1 return_code=0
> > >[2020-06-02 03:10:46.440535] I [master(worker
> > >/nodirectwritedata/gluster/gvol0):1384:process] _GMaster: Entry Time
> > >Taken
> > > MKD=0 MKN=0 LIN=0 SYM=0 REN=1 RMD=0CRE=2 duration=0.1314
> > >UNL=1
> > >[2020-06-02 03:10:46.440809] I [master(worker
> > >/nodirectwritedata/gluster/gvol0):1394:process] _GMaster: Data/Metadata
> > >Time Taken SETA=0 SETX=0
> > >meta_duration=0.0000data_duration=13.0171
> > > DATA=14 XATT=0
> > >[2020-06-02 03:10:46.441205] I [master(worker
> > >/nodirectwritedata/gluster/gvol0):1404:process] _GMaster: Batch
> > >Completed
> > >changelog_end=1591067420 entry_stime=(1591067419, 0)
> > >changelog_start=1591067392 stime=(1591067419, 0)
> > >duration=13.0322
> > > num_changelogs=3 mode=live_changelog
> > >[2020-06-02 03:10:51.460925] I [master(worker
> > >/nodirectwritedata/gluster/gvol0):1470:crawl] _GMaster: slave's time
> > > stime=(1591067419, 0)
> > >
> > >[2020-06-02 03:11:04.448913] I [master(worker
> > >/nodirectwritedata/gluster/gvol0):1954:syncjob] Syncer: Sync Time Taken
> > >duration=0.3466 num_files=3 job=1 return_code=0
> > >
> > >Whereas the other master nodes only log this:
> > >
> > >[2020-06-02 03:11:33.886938] I [gsyncd(config-get):308:main] <top>:
> > >Using
> > >session config file
> > >path=/var/lib/glusterd/geo-replication/gvol0_nvfs10_gvol0/gsyncd.conf
> > >[2020-06-02 03:11:33.993175] I [gsyncd(status):308:main] <top>: Using
> > >session config file
> > >path=/var/lib/glusterd/geo-replication/gvol0_nvfs10_gvol0/gsyncd.conf
> > >
> > >Can anyone help with what might cause the high CPU usage on one master
> > >node? The process is this one, and is using 70-100% of CPU:
> > >
> > >python2 /usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/gsyncd.py
> > >worker gvol0 nvfs10::gvol0 --feedback-fd 15 --local-path
> > >/nodirectwritedata/gluster/gvol0 --local-node cafs30 --local-node-id
> > >b7521445-ee93-4fed-8ced-6a609fa8c7d4 --slave-id
> > >cdcdb210-839c-4306-a4dc-e696b165ed17 --rpc-fd 12,11,9,13 --subvol-num 1
> > >--resource-remote nvfs30 --resource-remote-id
> > >1e698ccd-aeec-4ec4-96fe-383da8fc3b78
> > >
> > >Thank you in advance!
> > >
> > >
> > >
> > >
> > >On Sat, 30 May 2020 at 20:20, Strahil Nikolov <hunter86_bg at yahoo.com>
> > >wrote:
> > >
> > >> Hey David,
> > >>
> > >> for me a gfid mismatch means that the file was replaced/recreated
> > > -
> > >> just like vim in linux does (and it is expected for config file).
> > >>
> > >> Have you checked the gfid of the file on both source and
> > >destination,
> > >> do they really match or they are different ?
> > >>
> > >> What happens when you move away the file from the slave , does it
> > >fixes
> > >> the issue ?
> > >>
> > >> Best Regards,
> > >> Strahil Nikolov
> > >>
> > >> На 30 май 2020 г. 1:10:56 GMT+03:00, David Cunningham <
> > >> dcunningham at voisonics.com> написа:
> > >> >Hello,
> > >> >
> > >> >We're having an issue with a geo-replication process with unusually
> > >> >high
> > >> >CPU use and giving "Entry not present on master. Fixing gfid
> > >mismatch
> > >> >in
> > >> >slave" errors. Can anyone help on this?
> > >> >
> > >> >We have 3 GlusterFS replica nodes (we'll call the master), which
> > >also
> > >> >push
> > >> >data to a remote server (slave) using geo-replication. This has been
> > >> >running fine for a couple of months, but yesterday one of the master
> > >> >nodes
> > >> >started having unusually high CPU use. It's this process:
> > >> >
> > >> >root at cafs30:/var/log/glusterfs# ps aux | grep 32048
> > >> >root 32048 68.7 0.6 1843140 845756 ? Rl 02:51 493:51
> > >> >python2
> > >> >/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/gsyncd.py
> > >worker
> > >> >gvol0 nvfs10::gvol0 --feedback-fd 15 --local-path
> > >> >/nodirectwritedata/gluster/gvol0 --local-node cafs30 --local-node-id
> > >> >b7521445-ee93-4fed-8ced-6a609fa8c7d4 --slave-id
> > >> >cdcdb210-839c-4306-a4dc-e696b165ed17 --rpc-fd 12,11,9,13
> > >--subvol-num 1
> > >> >--resource-remote nvfs30 --resource-remote-id
> > >> >1e698ccd-aeec-4ec4-96fe-383da8fc3b78
> > >> >
> > >> >Here's what is being logged in
> > >> >/var/log/glusterfs/geo-replication/gvol0_nvfs10_gvol0/gsyncd.log:
> > >> >
> > >> >[2020-05-29 21:57:18.843524] I [master(worker
> > >> >/nodirectwritedata/gluster/gvol0):1470:crawl] _GMaster: slave's time
> > >> > stime=(1590789408, 0)
> > >> >[2020-05-29 21:57:30.626172] I [master(worker
> > >> >/nodirectwritedata/gluster/gvol0):813:fix_possible_entry_failures]
> > >> >_GMaster: Entry not present on master. Fixing gfid mismatch in
> > >slave.
> > >> >Deleting the entry retry_count=1 entry=({u'uid': 108, u'gfid':
> > >> >u'7c0b75e5-d8b7-454f-8010-112d613c599e', u'gid': 117, u'mode':
> > >33204,
> > >> >u'entry':
> > >> >u'.gfid/c5422396-1578-4b50-a29d-315be2a9c5d8/00a859f7xxxx.cfg',
> > >> >u'op': u'CREATE'}, 17, {u'slave_isdir': False, u'gfid_mismatch':
> > >True,
> > >> >u'slave_name': None, u'slave_gfid':
> > >> >u'ec4b0ace-2ec4-4ea5-adbc-9f519b81917c', u'name_mismatch': False,
> > >> >u'dst':
> > >> >False})
> > >> >[2020-05-29 21:57:30.627893] I [master(worker
> > >> >/nodirectwritedata/gluster/gvol0):813:fix_possible_entry_failures]
> > >> >_GMaster: Entry not present on master. Fixing gfid mismatch in
> > >slave.
> > >> >Deleting the entry retry_count=1 entry=({u'uid': 108, u'gfid':
> > >> >u'a4d52e40-2e2f-4885-be5f-65fe95a8ebd7', u'gid': 117, u'mode':
> > >33204,
> > >> >u'entry':
> > >>
> > >>u'.gfid/f857c42e-22f1-4ce4-8f2e-13bdadedde45/polycom_00a859f7xxxx.cfg',
> > >> >u'op': u'CREATE'}, 17, {u'slave_isdir': False, u'gfid_mismatch':
> > >True,
> > >> >u'slave_name': None, u'slave_gfid':
> > >> >u'ece8da77-b5ea-45a7-9af7-7d4d8f55f74a', u'name_mismatch': False,
> > >> >u'dst':
> > >> >False})
> > >> >[2020-05-29 21:57:30.629532] I [master(worker
> > >> >/nodirectwritedata/gluster/gvol0):813:fix_possible_entry_failures]
> > >> >_GMaster: Entry not present on master. Fixing gfid mismatch in
> > >slave.
> > >> >Deleting the entry retry_count=1 entry=({u'uid': 108, u'gfid':
> > >> >u'3c525ad8-aeb2-46b6-9c41-7fb4987916f8', u'gid': 117, u'mode':
> > >33204,
> > >> >u'entry':
> > >>
> >
> >>u'.gfid/f857c42e-22f1-4ce4-8f2e-13bdadedde45/00a859f7xxxx-directory.xml',
> > >> >u'op': u'CREATE'}, 17, {u'slave_isdir': False, u'gfid_mismatch':
> > >True,
> > >> >u'slave_name': None, u'slave_gfid':
> > >> >u'06717b5a-d842-495d-bd25-aab9cd454490', u'name_mismatch': False,
> > >> >u'dst':
> > >> >False})
> > >> >[2020-05-29 21:57:30.659123] I [master(worker
> > >> >/nodirectwritedata/gluster/gvol0):942:handle_entry_failures]
> > >_GMaster:
> > >> >Sucessfully fixed entry ops with gfid mismatch retry_count=1
> > >> >[2020-05-29 21:57:30.659343] I [master(worker
> > >> >/nodirectwritedata/gluster/gvol0):1194:process_change] _GMaster:
> > >Retry
> > >> >original entries. count = 1
> > >> >[2020-05-29 21:57:30.725810] I [master(worker
> > >> >/nodirectwritedata/gluster/gvol0):1197:process_change] _GMaster:
> > >> >Sucessfully fixed all entry ops with gfid mismatch
> > >> >[2020-05-29 21:57:31.747319] I [master(worker
> > >> >/nodirectwritedata/gluster/gvol0):1954:syncjob] Syncer: Sync Time
> > >Taken
> > >> >duration=0.7409 num_files=18 job=1 return_code=0
> > >> >
> > >> >We've verified that the files like polycom_00a859f7xxxx.cfg referred
> > >to
> > >> >in
> > >> >the error do exist on the master nodes and slave.
> > >> >
> > >> >We found this bug fix:
> > >> >https://bugzilla.redhat.com/show_bug.cgi?id=1642865
> > >> >
> > >> >However that fix went in 5.1, and we're running 5.12 on the master
> > >> >nodes
> > >> >and slave. A couple of GlusterFS clients connected to the master
> > >nodes
> > >> >are
> > >> >running 5.13.
> > >> >
> > >> >Would anyone have any suggestions? Thank you in advance.
> > >>
> > ________
> >
> >
> >
> > Community Meeting Calendar:
> >
> > Schedule -
> > Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
> > Bridge: https://bluejeans.com/441850968
> >
> > Gluster-users mailing list
> > Gluster-users at gluster.org
> > https://lists.gluster.org/mailman/listinfo/gluster-users
>
>
--
David Cunningham, Voisonics Limited
http://voisonics.com/
USA: +1 213 221 1092
New Zealand: +64 (0)28 2558 3782
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.gluster.org/pipermail/gluster-users/attachments/20200606/3a2a20d1/attachment.html>
More information about the Gluster-users
mailing list