<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">On Tue, Mar 6, 2018 at 10:22 PM, Paul Anderson <span dir="ltr">&lt;<a href="mailto:pha@umich.edu" target="_blank">pha@umich.edu</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Raghavendra,<br>

<br>

I&#39;ve committed my test case to <a href="https://github.com/powool/gluster.git" rel="noreferrer" target="_blank">https://github.com/powool/<wbr>gluster.git</a> -<br>

it&#39;s grungy, and a work in progress, but I am happy to take change<br>

suggestions, especially if it will save folks significant time.<br>

<br>

For the rest, I&#39;ll reply inline below...<br>

<span class="gmail-"><br>

On Mon, Mar 5, 2018 at 10:39 PM, Raghavendra Gowdappa<br>

&lt;<a href="mailto:rgowdapp@redhat.com">rgowdapp@redhat.com</a>&gt; wrote:<br>

&gt; +Csaba.<br>

&gt;<br>

&gt; On Tue, Mar 6, 2018 at 2:52 AM, Paul Anderson &lt;<a href="mailto:pha@umich.edu">pha@umich.edu</a>&gt; wrote:<br>

&gt;&gt;<br>

&gt;&gt; Raghavendra,<br>

&gt;&gt;<br>

&gt;&gt; Thanks very much for your reply.<br>

&gt;&gt;<br>

&gt;&gt; I fixed our data corruption problem by disabling the volume<br>

&gt;&gt; performance.write-behind flag as you suggested, and simultaneously<br>

&gt;&gt; disabling caching in my client side mount command.<br>

&gt;<br>

&gt;<br>

&gt; Good to know it worked. Can you give us the output of<br>

&gt; # gluster volume info<br>

<br>

</span>[root@node-1 /]# gluster volume info<br>

<br>

Volume Name: dockerstore<br>

Type: Replicate<br>

Volume ID: fb08b9f4-0784-4534-9ed3-<wbr>e01ff71a0144<br>

Status: Started<br>

Snapshot Count: 0<br>

Number of Bricks: 1 x 3 = 3<br>

Transport-type: tcp<br>

Bricks:<br>

Brick1: 172.18.0.4:/data/glusterfs/<wbr>store/dockerstore<br>

Brick2: 172.18.0.3:/data/glusterfs/<wbr>store/dockerstore<br>

Brick3: 172.18.0.2:/data/glusterfs/<wbr>store/dockerstore<br>

Options Reconfigured:<br>

performance.client-io-threads: off<br>

nfs.disable: on<br>

transport.address-family: inet<br>

locks.mandatory-locking: optimal<br>

performance.flush-behind: off<br>

performance.write-behind: off<br>

<span class="gmail-"><br>

&gt;<br>

&gt; We would like to debug the problem in write-behind. Some questions:<br>

&gt;<br>

&gt; 1. What version of Glusterfs are you using?<br>

<br>

</span>On the server nodes:<br>

<br>

[root@node-1 /]# gluster --version<br>

glusterfs 3.13.2<br>

Repository revision: git://<a href="http://git.gluster.org/glusterfs.git" rel="noreferrer" target="_blank">git.gluster.org/<wbr>glusterfs.git</a><br>

<br>

On the docker container sqlite test node:<br>

<br>

root@b4055d8547d2:/# glusterfs --version<br>

glusterfs 3.8.8 built on Jan 11 2017 14:07:11<br></blockquote><div><br></div><div>I guess this is where the client is mounted. If I am correct about that, the client is running quite an old version. There have been a significant number of fixes between 3.8.8 and current master. I would suggest trying 3.13.2 patched with [1]. If you get a chance to try this, please report back on how the tests went.<br></div><div><br></div><div>[1] <a href="https://review.gluster.org/19673">https://review.gluster.org/19673</a></div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

<br>

I recognize that version skew could be an issue.<br>

<span class="gmail-"><br>

&gt; 2. Were you able to figure out whether it&#39;s stale data or metadata that is<br>

&gt; causing the issue?<br>

<br>

</span>I lean towards stale data based on the only real observation I have:<br>

<br>

While debugging, I put log messages in as to when the flock() is<br>

acquired, and when it is released. There is no instance where two<br>

different processes ever hold the same flock()&#39;d file. From what I<br>

have read, the locks are considered metadata, and they appear to me to<br>

be working, so that&#39;s why I&#39;m inclined to think stale data is the<br>

issue.<br>
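<br>

The logging described above can be sketched roughly as follows. This is a minimal Python illustration, not the PHP test in the repository; the separate lock file, the table name t, and the helper names locked_insert and integrity_ok are all hypothetical.<br>

```python
import fcntl
import os
import sqlite3
import time


def locked_insert(db_path, lock_path, value, log):
    """Insert one row while holding an exclusive flock() on lock_path.

    Every acquire and release is appended to `log` with a timestamp and
    the process id, so overlapping holders would show up as interleaved
    entries. The lock file and table name are assumptions for this sketch.
    """
    with open(lock_path, "a") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)  # blocks until the lock is free
        log.append((time.time(), os.getpid(), "acquired"))
        try:
            conn = sqlite3.connect(db_path)
            try:
                conn.execute("CREATE TABLE IF NOT EXISTS t (v TEXT)")
                conn.execute("INSERT INTO t (v) VALUES (?)", (value,))
                conn.commit()
            finally:
                conn.close()
        finally:
            # Log before unlocking so the logged interval lies inside the hold.
            log.append((time.time(), os.getpid(), "released"))
            fcntl.flock(lock_file, fcntl.LOCK_UN)


def integrity_ok(db_path):
    """Return True if PRAGMA integrity_check reports no problems."""
    conn = sqlite3.connect(db_path)
    try:
        return conn.execute("PRAGMA integrity_check").fetchone()[0] == "ok"
    finally:
        conn.close()
```

If the lock were ever held by two processes at once, their acquired and released entries would interleave when the logs are merged by timestamp. Clean lock logs combined with a failing integrity check would point away from the locks and towards stale data.<br>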

<span class="gmail-"><br>

&gt;<br>

&gt; There have been patches merged in write-behind in the recent past and one in the<br>

&gt; works which address metadata consistency. Would like to understand whether<br>

&gt; you&#39;ve run into any of the already identified issues.<br>

<br>

</span>Agreed!<br>

<br>

Thanks,<br>

<br>

Paul<br>

<div class="gmail-HOEnZb"><div class="gmail-h5"><br>

&gt;<br>

&gt; regards,<br>

&gt; Raghavendra<br>

&gt;&gt;<br>

&gt;&gt;<br>

&gt;&gt; In very modest testing, the flock() case appears to me to work well -<br>

&gt;&gt; before it would corrupt the db within a few transactions.<br>

&gt;&gt;<br>

&gt;&gt; Testing using built in sqlite3 locks is better (fcntl range locks),<br>

&gt;&gt; but has some behavioral issues (probably just requires query retry<br>

&gt;&gt; when the file is locked). I&#39;ll research this more, although the test<br>

&gt;&gt; case is not critical to our use case.<br>

&gt;&gt;<br>

&gt;&gt; There are no signs of O_DIRECT use in the sqlite3 code that I can see.<br>

&gt;&gt;<br>

&gt;&gt; I intend to set up tests that run much longer than a few minutes, to<br>

&gt;&gt; see if there are any longer term issues. Also, I want to experiment<br>

&gt;&gt; with data durability by killing various gluster server nodes during<br>

&gt;&gt; the tests.<br>

&gt;&gt;<br>

&gt;&gt; If anyone would like our test scripts, I can either tar them up and<br>

&gt;&gt; email them or put them in github - either is fine with me. (they rely<br>

&gt;&gt; on current builds of docker and docker-compose)<br>

&gt;&gt;<br>

&gt;&gt; Thanks again!!<br>

&gt;&gt;<br>

&gt;&gt; Paul<br>

&gt;&gt;<br>

&gt;&gt; On Mon, Mar 5, 2018 at 11:26 AM, Raghavendra Gowdappa<br>

&gt;&gt; &lt;<a href="mailto:rgowdapp@redhat.com">rgowdapp@redhat.com</a>&gt; wrote:<br>

&gt;&gt; &gt;<br>

&gt;&gt; &gt;<br>

&gt;&gt; &gt; On Mon, Mar 5, 2018 at 8:21 PM, Paul Anderson &lt;<a href="mailto:pha@umich.edu">pha@umich.edu</a>&gt; wrote:<br>

&gt;&gt; &gt;&gt;<br>

&gt;&gt; &gt;&gt; Hi,<br>

&gt;&gt; &gt;&gt;<br>

&gt;&gt; &gt;&gt; tl;dr summary of below: flock() works, but what does it take to make<br>

&gt;&gt; &gt;&gt; sync()/fsync() work in a 3 node GFS cluster?<br>

&gt;&gt; &gt;&gt;<br>

&gt;&gt; &gt;&gt; I am under the impression that POSIX flock, POSIX<br>

&gt;&gt; &gt;&gt; fcntl(F_SETLK/F_GETLK,...), and POSIX read/write/sync/fsync are all<br>

&gt;&gt; &gt;&gt; supported in cluster operations, such that in theory, SQLite3 should<br>

&gt;&gt; &gt;&gt; be able to atomically lock the file (or a subset of pages), modify<br>

&gt;&gt; &gt;&gt; pages, flush the pages to gluster, then release the lock, and thus<br>

&gt;&gt; &gt;&gt; satisfy the ACID property that SQLite3 appears to try to accomplish on<br>

&gt;&gt; &gt;&gt; a local filesystem.<br>

&gt;&gt; &gt;&gt;<br>

&gt;&gt; &gt;&gt; In a test we wrote that fires off 10 simple concurrent SQL insert,<br>

&gt;&gt; &gt;&gt; read, update loops, we discovered that we at least need to use flock()<br>

&gt;&gt; &gt;&gt; around the SQLite3 db connection open/update/close to protect it.<br>

&gt;&gt; &gt;&gt;<br>

&gt;&gt; &gt;&gt; However, that is not enough - although from testing, it looks like<br>

&gt;&gt; &gt;&gt; flock() works as advertised across gluster mounted files, sync/fsync<br>

&gt;&gt; &gt;&gt; don&#39;t appear to, so we end up getting corruption in the SQLite3 file<br>

&gt;&gt; &gt;&gt; (pragma integrity_check generally will show a bunch of problems after<br>

&gt;&gt; &gt;&gt; a short test).<br>

&gt;&gt; &gt;&gt;<br>

&gt;&gt; &gt;&gt; Is what we&#39;re trying to do achievable? We&#39;re testing using the docker<br>

&gt;&gt; &gt;&gt; container gluster/gluster-centos as the three servers, with a php test<br>

&gt;&gt; &gt;&gt; inside of php-cli using filesystem mounts. If we mount the gluster FS<br>

&gt;&gt; &gt;&gt; via sapk/plugin-gluster into the php-cli containers using docker, we<br>

&gt;&gt; &gt;&gt; seem to have better success sometimes, but I haven&#39;t figured out why,<br>

&gt;&gt; &gt;&gt; yet.<br>

&gt;&gt; &gt;&gt;<br>

&gt;&gt; &gt;&gt; I did see that I needed to set the server volume parameter<br>

&gt;&gt; &gt;&gt; &#39;performance.flush-behind off&#39;, otherwise it seems that flushes won&#39;t<br>

&gt;&gt; &gt;&gt; block as would be needed by SQLite3.<br>

&gt;&gt; &gt;<br>

&gt;&gt; &gt;<br>

&gt;&gt; &gt; If you are relying on fsync this shouldn&#39;t matter as fsync makes sure<br>

&gt;&gt; &gt; data<br>

&gt;&gt; &gt; is synced to disk.<br>

&gt;&gt; &gt;<br>

&gt;&gt; &gt;&gt;<br>

&gt;&gt; &gt;&gt; Does anyone have any suggestions? Any words of wisdom would be much<br>

&gt;&gt; &gt;&gt; appreciated.<br>

&gt;&gt; &gt;<br>

&gt;&gt; &gt;<br>

&gt;&gt; &gt; Can you experiment with turning on/off various performance xlators?<br>

&gt;&gt; &gt; Based on<br>

&gt;&gt; &gt; earlier issues, it&#39;s likely that there is stale metadata which might be<br>

&gt;&gt; &gt; causing the issue (not necessarily improper fsync behavior). I would<br>

&gt;&gt; &gt; suggest<br>

&gt;&gt; &gt; turning off all performance xlators. You can refer to [1] for a related<br>

&gt;&gt; &gt; discussion. In theory the only perf xlator relevant for fsync is<br>

&gt;&gt; &gt; write-behind and I am not aware of any issues where fsync is not<br>

&gt;&gt; &gt; working.<br>

&gt;&gt; &gt; Does the glusterfs log file have any messages complaining about writes or<br>

&gt;&gt; &gt; fsync<br>

&gt;&gt; &gt; failing? Does your application use O_DIRECT? If yes, please note that<br>

&gt;&gt; &gt; you<br>

&gt;&gt; &gt; need to turn the option performance.strict-o-direct on for write-behind<br>

&gt;&gt; &gt; to<br>

&gt;&gt; &gt; honour O_DIRECT<br>

&gt;&gt; &gt;<br>

&gt;&gt; &gt; Also, is it possible to identify the nature of the corruption - data or<br>

&gt;&gt; &gt; metadata?<br>

&gt;&gt; &gt; More detailed explanation will help to RCA the issue.<br>

&gt;&gt; &gt;<br>

&gt;&gt; &gt; Also, is your application running on a single mount or from multiple<br>

&gt;&gt; &gt; mounts?<br>

&gt;&gt; &gt; Can you collect strace of your application (strace -ff -T -p &lt;pid&gt; -o<br>

&gt;&gt; &gt; &lt;file&gt;)? If possible can you also collect fuse-dump using option<br>

&gt;&gt; &gt; --dump-fuse<br>

&gt;&gt; &gt; while mounting glusterfs?<br>

&gt;&gt; &gt;<br>

&gt;&gt; &gt; [1]<br>

&gt;&gt; &gt;<br>

&gt;&gt; &gt; <a href="http://lists.gluster.org/pipermail/gluster-users/2018-February/033503.html" rel="noreferrer" target="_blank">http://lists.gluster.org/<wbr>pipermail/gluster-users/2018-<wbr>February/033503.html</a><br>

&gt;&gt; &gt;<br>

&gt;&gt; &gt;&gt;<br>

&gt;&gt; &gt;&gt; Thanks,<br>

&gt;&gt; &gt;&gt;<br>

&gt;&gt; &gt;&gt; Paul<br>


&gt;&gt; &gt;<br>

&gt;&gt; &gt;<br>

&gt;<br>

&gt;<br>

</div></div></blockquote></div><br></div></div>