[Bugs] [Bug 1512691] PostgreSQL DB Restore: unexpected data beyond EOF

Sat Jun 30 14:24:18 UTC 2018

https://bugzilla.redhat.com/show_bug.cgi?id=1512691

--- Comment #25 from Raghavendra G <rgowdapp at redhat.com> ---
Note the pattern of "postgres: autovacuum launcher process"

openat(AT_FDCWD, "pg_stat_tmp/global.stat", O_RDONLY) = 4 <0.002975>
fstat(4, {st_mode=S_IFREG|0600, st_size=2457, ...}) = 0 <0.000007>
read(4,
"\235\274\245\1\207\366\206\371\332\22\2\0\0\0\0\0\0\0\0\0\24\0\0\0\0\0\0\0\277H\5\0"...,
8192) = 2457 <0.000079>
close(4)                                = 0 <0.000114>
sendto(10, "\1\0\0\0
\0\0\0\nN\331\371\332\22\2\0\352\254\321\371\332\22\2\0\0\0\0\0\0\0\0\0", 32,
0, NULL, 0) = 32 <0.000025>
select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=10000}) = 0 (Timeout) <0.010212>
openat(AT_FDCWD, "pg_stat_tmp/global.stat", O_RDONLY) = 4 <0.000870>
fstat(4, {st_mode=S_IFREG|0600, st_size=2457, ...}) = 0 <0.000008>
read(4,
"\235\274\245\1\207\366\206\371\332\22\2\0\0\0\0\0\0\0\0\0\24\0\0\0\0\0\0\0\277H\5\0"...,
8192) = 2457 <0.000106>
close(4)                                = 0 <0.000091>
select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=10000}) = 0 (Timeout) <0.010089>
openat(AT_FDCWD, "pg_stat_tmp/global.stat", O_RDONLY) = 4 <0.000704>
fstat(4, {st_mode=S_IFREG|0600, st_size=2457, ...}) = 0 <0.000008>
read(4,
"\235\274\245\1\364^\331\371\332\22\2\0\0\0\0\0\0\0\0\0\24\0\0\0\0\0\0\0\277H\5\0"...,
8192) = 2457 <0.000268>
close(4)                                = 0 <0.000076>
openat(AT_FDCWD, "pg_stat_tmp/global.stat", O_RDONLY) = 4 <0.000433>

Another process "postgres: stats collector process", acts as a writer to
global.stat (and other .stat files). It,
1. creates a pg_stat_tmp/global.tmp
2. writes to the file
3. rename (pg_stat_tmp/global.tmp, pg_stat_tmp/global.stat)

openat(AT_FDCWD, "pg_stat_tmp/global.tmp", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3
<0.004589>
fstat(3, {st_mode=S_IFREG|0600, st_size=0, ...}) = 0 <0.000011>
openat(AT_FDCWD, "pg_stat_tmp/db_0.tmp", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 4
<0.001744>
fstat(4, {st_mode=S_IFREG|0600, st_size=0, ...}) = 0 <0.000005>
write(4,
"\235\274\245\1T\275\4\0\0\0\0\0\0\24\0\0\0\0\0\0\0(\0\0\0\0\0\0\0\0\0\0"...,
2033) = 2033 <0.000099>
close(4)                                = 0 <0.000706>
rename("pg_stat_tmp/db_0.tmp", "pg_stat_tmp/db_0.stat") = 0 <0.003799>
write(3,
"\235\274\245\1\7C9\370\332\22\2\0\0\0\0\0\0\0\0\0\24\0\0\0\0\0\0\0\277H\5\0"...,
2457) = 2457 <0.000346>
close(3)                                = 0 <0.001612>
rename("pg_stat_tmp/global.tmp", "pg_stat_tmp/global.stat") = 0 <0.003582>

Basically, this I/O pattern expects that fd opened on a file before rename to 
be valid after rename overwrote that file (fd is opened on dst of rename before
rename is done).

While looking at open-behind for rename I found that while rename tries to open
fd on dst before rename is wound down, I found a bug in that code. Note that,

1. Only first ever fd opened on an inode is opened-behind. Rest of the opens
are unwound only after open is done on bricks.
2. So, ob_rename tries to open _that_ first fd, by doing this:

<ob_rename>
        stub = fop_rename_stub (frame, default_rename_resume, src, dst, xdata);
        if (!stub)
                goto err;

        if (dst->inode)
        fd = fd_lookup (dst->inode, 0);

        open_and_resume (this, fd, stub);
</ob_rename>

Note how it gets the fd to be opened - by doing fd_lookup (with pid 0).
However, fd_lookup with pid = 0 returns the first fd in the list of fds opened
on the inode.

<__fd_lookup>

    list_for_each_entry (iter_fd, &inode->fd_list, inode_list) {
                if (iter_fd->anonymous)
            /* If someone was interested in getting an                          
                           anonymous fd (or was OK getting an anonymous fd),    
                           they can as well call fd_anonymous() directly */
                    continue;

                   if (!pid || iter_fd->pid == pid) {
            fd = __fd_ref (iter_fd);
                    break;
                }
        }

</__fd_lookup>

But, if we look at __fd_bind which ads an fd to this list,

<__fd_bind>

        list_del_init (&fd->inode_list);
        list_add (&fd->inode_list, &fd->inode->fd_list);
        fd->inode->fd_count++;

</__fd_bind>

So, the first member of inode->fd_list is the recently opened fd on inode. But
the opened-behind fd is the oldest fd that is opened on the inode and is the
last member of inode->fd_list.

So, when more than one fds are opened on a file ob_rename actually tries to
open the fd which is already opened on bricks missing the fd which is
opened-behind and not opened on brick. When rename is done, dst is overwritten
with a new gfid . When operations like fstat/readv are done on the
opened-behind fd, they fail with ESTALE as the gfid is no longer present on
bricks being overwritten by rename.

Similar bug is present in ob_unlink too.

I couldn't reproduce the corruption while stracing (may be tracing altered the
race), but I could reproduce the bug without stracing (with all perf xlators
tuned on). When I turned off open-behind, the bug went away. Given the
reasoning and the fact that I couldn't reproduce the bug with open-behind off,
I conclude the RCA for corruption is the one I explained above.

I am going to send a fix to ob_rename and ob_unlink and report back the results
with the fix.

-- 
You are receiving this mail because:
You are on the CC list for the bug.
Unsubscribe from this bug https://bugzilla.redhat.com/token.cgi?t=TTG6VwMcU0&a=cc_unsubscribe