[Bugs] [Bug 1512483] New: Not all files synced using geo-replication

bugzilla at redhat.com bugzilla at redhat.com
Mon Nov 13 10:42:08 UTC 2017


https://bugzilla.redhat.com/show_bug.cgi?id=1512483

            Bug ID: 1512483
           Summary: Not all files synced using geo-replication
           Product: GlusterFS
           Version: mainline
         Component: geo-replication
          Severity: urgent
          Assignee: bugs at gluster.org
          Reporter: khiremat at redhat.com
                CC: bugs at gluster.org, dimitri.ars at gmail.com,
                    khiremat at redhat.com, moagrawa at redhat.com
        Depends On: 1510342
            Blocks: 1510994



+++ This bug was initially created as a clone of Bug #1510342 +++

Description of problem:
When using Sonatype Nexus3 as a docker repository on glusterfs and
geo-replicating this volume, at least the .bytes files which contain the docker
layer data do not get synced. The files are created on the geo-replicated site,
but remain 0 bytes. Other files, like the .properties files are synced
properly.
The moment you add a character to the .bytes file manually (echo >>), the
.bytes file data does get synced. It seems that gluster doesn't detect data
being written to the file in some cases, at least not the way Nexus3 writes
those .bytes files. We suspect that more applications / files are affected by
this, resulting in corrupt / incomplete data on the geo-replicated site.

Version-Release number of selected component (if applicable):
3.12.1-2

How reproducible:
100%

Steps to Reproduce:
1. Run Sonatype Nexus3 with a hosted docker repository and its data
(/nexus-data) on a glusterfs volume which is geo-replicated.
2. docker push an arbitrary image into this nexus docker repo
3. ls -laRf blobs/default/content | grep '\.bytes$'
   on both the main site and the geo-replicated site, and see that the files on
the main site are non-zero in size while on the geo-replicated site they're 0 bytes
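Step 3 can also be scripted. The following is a minimal Python sketch, assuming
both the master volume and the geo-replicated volume are mounted and reachable
from one machine; the mount points and the script name compare_bytes.py are
hypothetical. It walks both trees and reports every .bytes file whose size
differs (for example non-zero on the master but 0 bytes on the slave).

```python
import os
import sys


def bytes_sizes(root):
    """Map each *.bytes path (relative to root) to its size in bytes."""
    sizes = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(".bytes"):
                full = os.path.join(dirpath, name)
                sizes[os.path.relpath(full, root)] = os.path.getsize(full)
    return sizes


def find_mismatches(master_root, slave_root):
    """Return (relpath, master_size, slave_size) for .bytes files whose size differs.

    A slave_size of None means the file does not exist on the slave side.
    """
    master = bytes_sizes(master_root)
    slave = bytes_sizes(slave_root)
    return [
        (rel, size, slave.get(rel))
        for rel, size in sorted(master.items())
        if slave.get(rel) != size
    ]


if __name__ == "__main__" and len(sys.argv) == 3:
    # Hypothetical usage:
    #   python3 compare_bytes.py /mnt/master/blobs/default/content /mnt/slave/blobs/default/content
    for rel, m, s in find_mismatches(sys.argv[1], sys.argv[2]):
        print(f"{rel}: master={m} slave={s}")
```

Comparing sizes only is deliberate: it is cheap and is exactly the symptom
reported here; a checksum comparison would also catch same-size corruption.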

Actual results:
main:
-rw-r--r--. 2 200 200  529 Nov  3 15:27 e39d9b3a-53e0-44bc-b4a2-31d145aeec81.bytes
-rw-r--r--. 1 200 200 1991435 Nov  3 19:00 613953bb-b542-4db7-ba18-01a331361994.bytes

geo-replicated:
-rw-r--r--. 0 root root    0 Nov  3 19:01 613953bb-b542-4db7-ba18-01a331361994.bytes
-rw-r--r--. 0 root root    0 Nov  3 15:28 e39d9b3a-53e0-44bc-b4a2-31d145aeec81.bytes

Expected results:
main:
-rw-r--r--. 2 200 200  529 Nov  3 15:27 e39d9b3a-53e0-44bc-b4a2-31d145aeec81.bytes
-rw-r--r--. 1 200 200 1991435 Nov  3 19:00 613953bb-b542-4db7-ba18-01a331361994.bytes

geo-replicated:
-rw-r--r--. 2 200 200  529 Nov  3 15:27 e39d9b3a-53e0-44bc-b4a2-31d145aeec81.bytes
-rw-r--r--. 1 200 200 1991435 Nov  3 19:00 613953bb-b542-4db7-ba18-01a331361994.bytes

Additional info:
Tried both rsync and use-tarssh, same issue.
Date/time is the same on the main and geo-replicated site servers.
An initial sync does correctly sync the .bytes files.
Maybe related to https://bugzilla.redhat.com/show_bug.cgi?id=1437244

--- Additional comment from Mohit Agrawal on 2017-11-08 23:06:21 EST ---

Hi,

 Can you please share the brick logs from master and slave nodes?

Regards
Mohit Agrawal

--- Additional comment from Dimitri Ars on 2017-11-09 02:29 EST ---



--- Additional comment from Dimitri Ars on 2017-11-09 02:30 EST ---



--- Additional comment from Dimitri Ars on 2017-11-09 02:52:40 EST ---

Logs attached. They don't contain the files from the first comment, but other
files that have the same problem, for example:

[root at X94pabgluster0 chap-24]# ls -al
total 9
drwxr-sr-x. 2  200 200 4096 Nov  8 20:04 .
drwxr-sr-x. 3  200 200 4096 Nov  8 20:04 ..
-rw-r--r--. 0 root 200    0 Nov  8 20:04 d327a69b-f9fb-455e-80b6-5bec8df0b6b9.bytes
-rw-r--r--. 1  200 200  356 Nov  8 20:04 d327a69b-f9fb-455e-80b6-5bec8df0b6b9.properties
[root at X94pabgluster0 chap-24]# getfattr -n glusterfs.gfid2path /mnt/blobs/default/content/vol-05/chap-24/d327a69b-f9fb-455e-80b6-5bec8df0b6b9.properties
getfattr: Removing leading '/' from absolute path names
# file: mnt/blobs/default/content/vol-05/chap-24/d327a69b-f9fb-455e-80b6-5bec8df0b6b9.properties
glusterfs.gfid2path="/blobs/default/content/vol-05/chap-24/d327a69b-f9fb-455e-80b6-5bec8df0b6b9.properties"

[root at X94pabgluster0 chap-24]# getfattr -n glusterfs.gfid2path /mnt/blobs/default/content/vol-05/chap-24/d327a69b-f9fb-455e-80b6-5bec8df0b6b9.bytes
getfattr: Removing leading '/' from absolute path names
# file: mnt/blobs/default/content/vol-05/chap-24/d327a69b-f9fb-455e-80b6-5bec8df0b6b9.bytes
glusterfs.gfid2path="/blobs/default/content/vol-29/chap-39/9ba4d149-3d54-4988-912a-cf3e81a4854d.bytes"


So there's a 0-byte file with 0 links, owned by root (but the group is already
the "destination owner").
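The gfid2path mismatch above (the .bytes file reports
vol-29/chap-39/9ba4d149-3d54-4988-912a-cf3e81a4854d.bytes instead of its current
path) can be checked per file. A sketch, assuming a GlusterFS FUSE mount on
Linux that exposes the glusterfs.gfid2path virtual xattr the same way getfattr
showed; the helper names are hypothetical, and stripping a trailing NUL is a
precaution, not documented behaviour.

```python
import os


def read_gfid2path(mount, relpath):
    """Read the glusterfs.gfid2path virtual xattr, like `getfattr -n` above.

    Only meaningful on a GlusterFS FUSE mount; elsewhere os.getxattr raises OSError.
    """
    value = os.getxattr(os.path.join(mount, relpath), "glusterfs.gfid2path")
    return value.decode().rstrip("\x00")  # precaution in case of a trailing NUL


def is_stale(reported, relpath):
    """True if gfid2path reports a path other than the file's current one."""
    return reported.lstrip("/") != relpath.lstrip("/")
```

For the .properties file the two paths agree; for the .bytes file they do not.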

--- Additional comment from Kotresh HR on 2017-11-09 05:31:34 EST ---

Hi,

We are interested in the I/O pattern that gets recorded in the changelog.
Could you please share the changelogs from the path below? It will be a tar file.

/var/lib/glusterfsd/misc/<master-vol-name>/<ssh....>/<md5sum of brickpath>/.processed/archive<date>.tar
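A sketch for locating those archives, assuming the directory layout in the path
above. The base directory, the master volume name and the elided <ssh....>
component are supplied by the caller (parameter names are hypothetical), and
the md5 is assumed to be computed over the brick path string with no trailing
newline.

```python
import glob
import hashlib
import os


def processed_archives(base, master_vol, slave_dir, brick_path):
    """List archive*.tar files in <base>/<master_vol>/<slave_dir>/<md5(brick)>/.processed.

    slave_dir stands for the <ssh....> component and must be given by the caller.
    """
    # Assumption: md5 of the brick path string itself, hex-encoded, no newline.
    brick_hash = hashlib.md5(brick_path.encode()).hexdigest()
    processed = os.path.join(base, master_vol, slave_dir, brick_hash, ".processed")
    return sorted(glob.glob(os.path.join(processed, "archive*.tar")))
```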

--- Additional comment from Dimitri Ars on 2017-11-09 07:47 EST ---

Changelog for Nexus3 adding docker images and anything else it does. We didn't
use Nexus for any other transactions, so hopefully it isn't too cluttered. As
stated, the .properties files are synced fine, but the .bytes files end up as
0 bytes on site B.
I don't know what to make of the strange gfid2path values on sites A and B for
the .bytes files.

--- Additional comment from Dimitri Ars on 2017-11-09 16:05:38 EST ---

Did some more testing. The problem is easy to reproduce with, for example, the
following shell commands:

cp /etc/group f1 ; \
mv f1 f2 ; \
ln f2 f3 ; \
mv f3 f4 ; \
unlink f2

File f4 should contain the /etc/group contents. It does on site A, but on
site B it is 0 bytes:

A:
-rw-r--r--.  1 nexus nexus  395 Nov  9 21:00 f4

B:
-rw-r--r--.  0 root 200    0 Nov  9 21:00 f4
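The same sequence expressed with Python's os and shutil modules, as a sketch for
scripting the reproducer. workdir is assumed to be a directory on the master
mount, and the function name is hypothetical. shutil.copyfile copies only the
data, not the mode bits, which does not matter for this pattern.

```python
import os
import shutil


def reproduce_rename_link_pattern(workdir, source="/etc/group"):
    """Replay the cp/mv/ln/mv/unlink sequence from the comment inside workdir.

    Returns the path of the surviving file (f4).
    """
    f1, f2, f3, f4 = (os.path.join(workdir, n) for n in ("f1", "f2", "f3", "f4"))
    shutil.copyfile(source, f1)  # cp /etc/group f1
    os.rename(f1, f2)            # mv f1 f2
    os.link(f2, f3)              # ln f2 f3
    os.rename(f3, f4)            # mv f3 f4
    os.unlink(f2)                # unlink f2
    return f4
```

After it runs, f4 on the master should hold the source data with a link count
of 1; on the slave it can then be compared with the listing above.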

--- Additional comment from Dimitri Ars on 2017-11-09 16:32:32 EST ---

Narrowed down even further:
echo testing > f1 ; \
ln f1 f2 ; \
mv f2 f3 ; \
unlink f1

Then f3 is the problem file.
If you take out the unlink, things are a bit better but still not correct: we
then have 2 linked files (f1 and f3) on site A, but 2 separate files (f1 and
f3) on site B. Both have the correct content, but I expected site B to have
2 linked files as well.
A:
-rw-r--r--.  2 nexus nexus    8 Nov  9 21:16 f1
-rw-r--r--.  2 nexus nexus    8 Nov  9 21:16 f3
B:
-rw-r--r--.  1 200 200    8 Nov  9 21:16 f1
-rw-r--r--.  1 200 200    8 Nov  9 21:16 f3

If I leave out the rename of the hardlink as well, things go fine. If I then
rename f2 to f3 after waiting 15 seconds or so (changelog rollover), the
rename goes fine as well, and everything is correct on both A and B.
It looks similar to https://bugzilla.redhat.com/show_bug.cgi?id=1448914, which
had this issue for extended attributes. Also,
https://bugzilla.redhat.com/show_bug.cgi?id=1296175 looks somewhat related.
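A minimal sketch of this narrowed-down sequence and its variants, with
hypothetical flags for skipping the final unlink and for delaying the rename
(the comment mentions about 15 seconds, i.e. past a changelog rollover).
workdir is again assumed to be a directory on the master mount.

```python
import os
import time


def reproduce_minimal(workdir, unlink_original=True, delay_before_rename=0.0):
    """Replay: echo testing > f1; ln f1 f2; mv f2 f3; unlink f1.

    unlink_original=False is the variant that leaves f1 and f3 linked.
    delay_before_rename > 0 waits between the link and the rename.
    Returns the path of f3.
    """
    f1, f2, f3 = (os.path.join(workdir, n) for n in ("f1", "f2", "f3"))
    with open(f1, "w") as f:
        f.write("testing\n")       # echo testing > f1 (8 bytes)
    os.link(f1, f2)                # ln f1 f2
    if delay_before_rename:
        time.sleep(delay_before_rename)
    os.rename(f2, f3)              # mv f2 f3
    if unlink_original:
        os.unlink(f1)              # unlink f1
    return f3
```

On the master, the link count of f3 is 1 with the unlink and 2 without it,
which is what the site A listing above shows.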

--- Additional comment from Dimitri Ars on 2017-11-10 06:17:46 EST ---

Although it's easy to reproduce and see, here is the related error logging from
the geo-replication-slaves log on site B
(4a734e2c-202a-4b11-8676-9a3219dc2101:192.168.5.7.%2Fvar%2Flib%2Fheketi%2Fmounts%2Fvg_1f8ca94513acde49ebe3167b58004159%2Fbrick_fe0308e73bb49c5a30f504ef853a731d%2Fbrick.vol_20e37cc674b396d041691341d69b81a6.gluster.log):

[2017-11-10 10:06:06.718657] W [MSGID: 114031] [client-rpc-fops.c:493:client3_3_stat_cbk] 0-vol_eeaf4e18532e9769aed04199eda0d1bd-client-2: remote operation failed [No such file or directory]
[2017-11-10 10:06:06.719246] W [MSGID: 114031] [client-rpc-fops.c:2860:client3_3_lookup_cbk] 0-vol_eeaf4e18532e9769aed04199eda0d1bd-client-2: remote operation failed. Path: /.gfid/5cbc6952-f50c-4040-af3b-cd9dfb2f8596 (5cbc6952-f50c-4040-af3b-cd9dfb2f8596) [No such file or directory]
[2017-11-10 10:06:06.719300] W [MSGID: 114031] [client-rpc-fops.c:2860:client3_3_lookup_cbk] 0-vol_eeaf4e18532e9769aed04199eda0d1bd-client-1: remote operation failed. Path: /.gfid/5cbc6952-f50c-4040-af3b-cd9dfb2f8596 (5cbc6952-f50c-4040-af3b-cd9dfb2f8596) [No such file or directory]
[2017-11-10 10:06:06.719322] W [MSGID: 114031] [client-rpc-fops.c:2860:client3_3_lookup_cbk] 0-vol_eeaf4e18532e9769aed04199eda0d1bd-client-0: remote operation failed. Path: /.gfid/5cbc6952-f50c-4040-af3b-cd9dfb2f8596 (5cbc6952-f50c-4040-af3b-cd9dfb2f8596) [No such file or directory]
[2017-11-10 10:06:06.721119] E [MSGID: 109040] [dht-helper.c:1378:dht_migration_complete_check_task] 0-vol_eeaf4e18532e9769aed04199eda0d1bd-dht: /.gfid/5cbc6952-f50c-4040-af3b-cd9dfb2f8596: failed to lookup the file on vol_eeaf4e18532e9769aed04199eda0d1bd-dht [No such file or directory]
[2017-11-10 10:06:06.721213] W [fuse-bridge.c:874:fuse_attr_cbk] 0-glusterfs-fuse: 2510782: STAT() /.gfid/5cbc6952-f50c-4040-af3b-cd9dfb2f8596 => -1 (No such file or directory)
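To pull the affected gfids out of a longer slave log, a regular expression over
the lookup failures is enough. A sketch, assuming the message format shown
above; the pattern and function name are hypothetical, and whitespace is
collapsed first because log entries can get wrapped (for example by mail
clients).

```python
import re
from collections import defaultdict

# Matches lookup failures such as:
# "0-<vol>-client-2: remote operation failed. Path: /.gfid/<gfid> (<gfid>) [...]"
LOOKUP_FAILED = re.compile(
    r"(?P<client>\S+-client-\d+): remote operation failed\. Path: "
    r"/\.gfid/(?P<gfid>[0-9a-f-]{36})"
)


def missing_gfids(log_text):
    """Map each gfid whose lookup failed to the sorted list of clients reporting it."""
    seen = defaultdict(set)
    # Collapse all whitespace so entries wrapped across lines still match.
    flat = " ".join(log_text.split())
    for m in LOOKUP_FAILED.finditer(flat):
        seen[m.group("gfid")].add(m.group("client"))
    return {gfid: sorted(clients) for gfid, clients in seen.items()}
```

The stat_cbk line has no "Path:" field and is intentionally not matched.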


Referenced Bugs:

https://bugzilla.redhat.com/show_bug.cgi?id=1510342
[Bug 1510342] Not all files synced using geo-replication
https://bugzilla.redhat.com/show_bug.cgi?id=1510994
[Bug 1510994] [GSS] Not all files synced using geo-replication