[Bugs] [Bug 1512691] New: PostgreSQL DB Restore: unexpected data beyond EOF

Mon Nov 13 20:54:31 UTC 2017

https://bugzilla.redhat.com/show_bug.cgi?id=1512691

            Bug ID: 1512691
           Summary: PostgreSQL DB Restore: unexpected data beyond EOF
           Product: GlusterFS
           Version: 3.12
         Component: nfs
          Assignee: bugs at gluster.org
          Reporter: norman.j.maul at aexp.com
                CC: bugs at gluster.org, fuzz at namm.de, hchiramm at redhat.com,
                    javishi at gmail.com, kostrzewa at 9livesdata.com,
                    rgoncalves at gmail.com, tcarlin at redhat.com

Per bug 1435832, I'm re-reporting this under a newer version (3.12). I'm not
the original submitter, but I can confirm that this problem still occurs in
3.12.1.

My steps to reproduce:

1. Install GlusterFS on 3+ nodes (I used the 3.12.1 packages from the CentOS
SIG) on RHEL 7.3. I used Heketi to set up the cluster, but AFAICT that
shouldn't matter.
2. Create a volume. I used Heketi again: distributed-replicated, 1x3.
3. Mount it somewhere (e.g. /mnt).

Then you can run the "pgbench" tool like this (pass the right mountpoint):

docker run -d --net host -v /mnt:/var/lib/postgresql/data --name pgbench postgres:alpine
docker exec -it pgbench psql -U postgres
create database pgbench;
time docker exec -it pgbench pgbench pgbench -U postgres -i -s 100
time docker exec -it pgbench pgbench pgbench -U postgres -c 50 -j 8 -t 20000 -r -P 10

Everything up to and including "create database" works fine. The first pgbench
command *almost* works... it adds all the rows, but then fails at the end,
around when it would do a vacuum. The second pgbench command fails
spectacularly right away.
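
For a sense of scale, here is a rough sizing of those two pgbench runs. It is only a sketch: the 100,000 rows per unit of scale factor comes from the pgbench documentation, the other numbers are copied from the commands above, and the function names are made up for illustration. In the second command, -c is the number of client sessions, -j the number of worker threads, -t the number of transactions per client, -r asks for per-statement latencies, and -P 10 prints progress every 10 seconds.

```python
# Rough sizing of the two pgbench invocations quoted above.
# ROWS_PER_SCALE_UNIT is the documented pgbench_accounts row count per
# unit of scale factor; the other values are copied from the commands.
ROWS_PER_SCALE_UNIT = 100_000


def accounts_rows(scale: int) -> int:
    """Rows that `pgbench -i -s <scale>` loads into pgbench_accounts."""
    return scale * ROWS_PER_SCALE_UNIT


def total_transactions(clients: int, per_client: int) -> int:
    """Transactions a `-c <clients> -t <per_client>` run attempts in total."""
    return clients * per_client


if __name__ == "__main__":
    print("accounts rows:", accounts_rows(100))            # -i -s 100
    print("transactions:", total_transactions(50, 20_000))  # -c 50 -t 20000
```

So the initialization step loads ten million account rows before the vacuum, and the second run attempts one million transactions spread over 50 concurrent clients.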

If you reduce the second command to "-c1 -j1", it will work (at least for a
while; it's incredibly slow for me, so I didn't wait around to see whether it
completes).

If /mnt is a regular local filesystem (or Ceph-RBD), this works fine. It only
fails if that's a GlusterFS volume.
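
One way to compare the two kinds of mount outside PostgreSQL might be a small probe like the one below. This is a hypothetical sketch, not a confirmed reproducer: it assumes (as far as I understand PostgreSQL 9.6) that the server finds the end of a relation file with lseek(SEEK_END) before extending it, and that the error is raised when the block it believes to be new already holds data. The function name probe and the 8192-byte block size (PostgreSQL's default) are assumptions. It also runs in a single process, whereas the failures above need concurrency, so a clean result does not rule the bug out.

```python
import os
import tempfile

BLOCK = 8192  # PostgreSQL's default page size; an assumption about the build


def probe(directory: str, blocks: int = 256) -> list:
    """Extend a scratch file block by block and return the block indexes at
    which the size seen through a second descriptor disagrees with what has
    been written so far."""
    mismatches = []
    fd_w, path = tempfile.mkstemp(dir=directory)
    fd_r = os.open(path, os.O_RDONLY)
    try:
        for i in range(blocks):
            os.write(fd_w, b"\x01" * BLOCK)
            # Ask where the file ends, the way an lseek(SEEK_END) caller would.
            size = os.lseek(fd_r, 0, os.SEEK_END)
            if size != (i + 1) * BLOCK:
                mismatches.append(i)
    finally:
        os.close(fd_r)
        os.close(fd_w)
        os.unlink(path)
    return mismatches
```

Calling probe("/mnt") against the GlusterFS mount and again against a local directory would show whether the reported file size ever lags behind the writes. A non-empty list only on the GlusterFS mount would point at stale size information; an empty list on both proves nothing either way.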

+++ This bug was initially created as a clone of Bug #1435832 +++

Description of problem:

I'm running Gluster in a Kubernetes cluster with the help of
https://github.com/gluster/gluster-kubernetes.  I have a postgresql container
where the /var/lib/postgresql/data/pgdata directory is a GlusterFS-mounted
persistent volume.  I then run another container to restore a PostgreSQL
backup.  It successfully restores all tables except one, which happens to be
the largest table (>100MB).  The error given for that table is:

```
ERROR:  unexpected data beyond EOF in block 14917 of relation base/78620/78991
HINT:  This has been seen to occur with buggy kernels; consider updating your
system.
CONTEXT:  COPY es_config_app_solutiondraft, line 906
```
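
To make the message easier to read: the path base/78620/78991 is PostgreSQL's on-disk layout base/<database OID>/<relfilenode>, and the block number can be turned into a byte offset in that file. The sketch below does the arithmetic, assuming the default 8192-byte block size (a custom build could differ). The names ERROR_RE and locate are made up for illustration.

```python
import re

BLOCK_SIZE = 8192  # default PostgreSQL block size; an assumption about this build

ERROR_RE = re.compile(
    r"unexpected data beyond EOF in block (\d+) of relation base/(\d+)/(\d+)"
)


def locate(message: str) -> dict:
    """Extract the database OID, relfilenode and block number from the error
    text, and compute the byte offset at which that block starts."""
    m = ERROR_RE.search(message)
    if m is None:
        raise ValueError("not an 'unexpected data beyond EOF' message")
    block, db_oid, relfilenode = map(int, m.groups())
    return {
        "database_oid": db_oid,
        "relfilenode": relfilenode,
        "block": block,
        "byte_offset": block * BLOCK_SIZE,
    }


if __name__ == "__main__":
    msg = ("ERROR:  unexpected data beyond EOF in block 14917 "
           "of relation base/78620/78991")
    print(locate(msg))
```

Under that assumption, block 14917 starts at byte 122,200,064, roughly 116.5 MiB into the file, which fits a failure partway through loading a table larger than 100MB.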

I have tried several different containers to perform the restore, including
ubuntu:16.04, postgresql:9.6.2, and alpine:3.5.  All have the same issue.
Interestingly, the entire restore works, including the large table, if I run it
directly on the postgresql container.  That makes me think this is related to
container-to-container networking and not necessarily Gluster's fault, but I
wanted to report it in case there are any suggestions or kernel setting tweaks
to fix the issue.

Version-Release number of selected component (if applicable):

GlusterFS 3.8.5

PostgreSQL 9.6.2 Container:
uname -a
Linux develop-postgresql-3992946951-3srqg 4.4.0-65-generic #86-Ubuntu SMP Thu
Feb 23 17:49:58 UTC 2017 x86_64 GNU/Linux

Other containers used for the restore are running the same 4.4.0-65-generic
kernel.

Kubernetes 1.5.1
Docker 1.12.6

How reproducible:

First, get kubernetes working with gluster and heketi.  See
https://github.com/gluster/gluster-kubernetes

Steps to Reproduce:
1. Start a PostgreSQL "pod" with /var/lib/postgresql/data/pgdata set up as a
persistent volume.
2. Start a second container that can access the postgres container.
3. Attempt to restore a backup containing a large table >100MB.

Actual results:

Restore fails on large table with above error.

Expected results:

Restore applies cleanly, even for large tables.

Additional info:

Volume is mounted as type fuse.glusterfs.  From postgresql container:
# mount
10.163.148.196:vol_6d09a586370e26a718a74d5d280f8dfd on
/var/lib/postgresql/data/pgdata type fuse.glusterfs
(rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072)

Googling this error returns some fairly old results, none of them conclusive.

--- Additional comment from  on 2017-03-28 14:19:25 EDT ---

My thoughts about it being a container networking issue were incorrect.  I now
believe this is truly a glusterfs + postgresql issue.  I confirmed that I
occasionally do get restore failures on the postgresql container itself which
eliminates the container networking interface (CNI).  I also get occasional
successful restores on separate restore containers which further eliminates
CNI.  The "unexpected data beyond EOF" error occurs intermittently, with a
success rate of roughly 30% regardless of how the restore is attempted.

Also, the table size for the failing table is actually 244MB.  All other tables
that do successfully restore are under 10MB.

--- Additional comment from Zbigniew Kostrzewa on 2017-09-28 01:29:38 EDT ---

Just recently I bumped into the same error using GlusterFS 3.10.5 and 3.12.1
(from SIG repositories).
I created a cluster of 3 VMs running CentOS 7.2 (uname below) and spun up a
PostgreSQL 9.6.2 container under Docker v17.06. The GlusterFS volume was
bind-mounted into the container at the default location where PostgreSQL
stores its data (/var/lib/postgresql/data). While filling the database with
data, at some point I got this "unexpected data beyond EOF" error.

A similar issue was discussed on the PostgreSQL mailing list, but for
PostgreSQL on NFS. In fact, such an issue was already reported and fixed in
RHEL5 (https://bugzilla.redhat.com/show_bug.cgi?id=672981).

I also tried the latest PostgreSQL Docker image (9.6.5), unfortunately with
the same results.

uname -a:
Linux node-10-9-4-109 3.10.0-327.el7.x86_64 #1 SMP Thu Nov 19 22:10:57 UTC 2015
x86_64 x86_64 x86_64 GNU/Linux

--- Additional comment from Rui on 2017-10-23 06:28:54 EDT ---

I'm having the same problem here.

I have installed postgresql 9.6.5 on 3.10.0-693.2.2.el7.x86_64 and executed
pgbench with a scale factor of 1000, i.e. 10.000.000 accounts. 

The first run used the OS's local filesystem. Everything went well.

After that I stopped postgresql, created a GlusterFS replicated volume (3
replicas), and copied the postgresql data directory into the GlusterFS volume.
The volume is mounted as type fuse.glusterfs.

10.112.76.37:gv0 on /mnt/batatas type fuse.glusterfs
(rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072)

After that I tried to run pgbench. With a concurrency level of one, things
work fine. However, with a concurrency level > 1, this error occurs:

client 1 aborted in state 9: ERROR:  unexpected data beyond EOF in block 316 of
relation base/16384/16516
HINT:  This has been seen to occur with buggy kernels; consider updating your
system.

I'm using glusterfs 3.12.2.

Any idea?

--- Additional comment from Niels de Vos on 2017-11-07 05:40:31 EST ---

This bug is getting closed because the 3.8 version is marked End-Of-Life. There
will be no further updates to this version. Please open a new bug against a
version that still receives bugfixes if you are still facing this issue in a
more current release.
