<div dir="ltr">Adding csaba<br></div><div class="gmail_extra"><br><div class="gmail_quote">On Tue, Mar 6, 2018 at 9:09 AM, Raghavendra Gowdappa <span dir="ltr"><<a href="mailto:rgowdapp@redhat.com" target="_blank">rgowdapp@redhat.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">+Csaba.<br><div><div class="gmail_extra"><br><div class="gmail_quote"><span class="">On Tue, Mar 6, 2018 at 2:52 AM, Paul Anderson <span dir="ltr"><<a href="mailto:pha@umich.edu" target="_blank">pha@umich.edu</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Raghavendra,<br>
<br>
Thanks very much for your reply.<br>
<br>
I fixed our data corruption problem by disabling the volume<br>
performance.write-behind flag as you suggested, and simultaneously<br>
disabling caching in my client-side mount command.<br></blockquote><div><br></div></span><div>Good to know it worked. Can you give us the output of</div><div># gluster volume info</div><div><br></div><div>We would like to debug the problem in write-behind. Some questions:<br></div><div><br></div><div>1. What version of GlusterFS are you using?</div><div>2. Were you able to figure out whether it's stale data or metadata that is causing the issue?</div><div><br></div><div>There have been patches merged in write-behind in the recent past and one in the works which address metadata consistency. We would like to understand whether you've run into any of the already identified issues.</div><div><br></div><div>regards,</div><div>Raghavendra<br></div><div><div class="h5"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<br>
In very modest testing, the flock() case appears to me to work well -<br>
before it would corrupt the db within a few transactions.<br>
<br>
Testing using built-in sqlite3 locks is better (fcntl range locks),<br>
but has some behavioral issues (probably just requires query retry<br>
when the file is locked). I'll research this more, although the test<br>
case is not critical to our use case.<br>
<br>
There are no signs of O_DIRECT use in the sqlite3 code that I can see.<br>
<br>
I intend to set up tests that run much longer than a few minutes, to<br>
see if there are any longer term issues. Also, I want to experiment<br>
with data durability by killing various gluster server nodes during<br>
the tests.<br>
<br>
If anyone would like our test scripts, I can either tar them up and<br>
email them or put them in github - either is fine with me. (they rely<br>
on current builds of docker and docker-compose)<br>
<br>
Thanks again!!<br>
<span class="m_8920908922300281730im m_8920908922300281730HOEnZb"><br>
Paul<br>
<br>
On Mon, Mar 5, 2018 at 11:26 AM, Raghavendra Gowdappa<br>
<<a href="mailto:rgowdapp@redhat.com" target="_blank">rgowdapp@redhat.com</a>> wrote:<br>
><br>
><br>
</span><div class="m_8920908922300281730HOEnZb"><div class="m_8920908922300281730h5">> On Mon, Mar 5, 2018 at 8:21 PM, Paul Anderson <<a href="mailto:pha@umich.edu" target="_blank">pha@umich.edu</a>> wrote:<br>
>><br>
>> Hi,<br>
>><br>
>> tl;dr summary of below: flock() works, but what does it take to make<br>
>> sync()/fsync() work in a 3 node GFS cluster?<br>
>><br>
>> I am under the impression that POSIX flock, POSIX<br>
>> fcntl(F_SETLK/F_GETLK,...), and POSIX read/write/sync/fsync are all<br>
>> supported in cluster operations, such that in theory, SQLite3 should<br>
>> be able to atomically lock the file (or a subset of pages), modify<br>
>> pages, flush the pages to gluster, then release the lock, and thus<br>
>> satisfy the ACID property that SQLite3 appears to try to accomplish on<br>
>> a local filesystem.<br>
>><br>
>> In a test we wrote that fires off 10 simple concurrent SQL insert,<br>
>> read, update loops, we discovered that we at least need to use flock()<br>
>> around the SQLite3 db connection open/update/close to protect it.<br>
>><br>
>> However, that is not enough - although from testing, it looks like<br>
>> flock() works as advertised across gluster mounted files, sync/fsync<br>
>> don't appear to, so we end up getting corruption in the SQLite3 file<br>
>> (pragma integrity_check generally will show a bunch of problems after<br>
>> a short test).<br>
>><br>
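A minimal Python sketch of the locking pattern described above, offered only as an illustration and not as the PHP test used in this thread. It takes an exclusive flock() on a separate lock file around the SQLite open/update/close sequence, then runs pragma integrity_check. The function names, the lock-file approach, and the table name t are all hypothetical.<br>
<pre>
```python
import fcntl
import sqlite3


def locked_update(db_path, lock_path, value):
    """Insert one row while holding an exclusive flock() on a lock file.

    Hypothetical helper: the lock file and the table 't' are assumptions
    made for this sketch, not details taken from the original test.
    """
    with open(lock_path, "w") as lock_file:
        # Block until no other process holds the lock.
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            conn = sqlite3.connect(db_path)
            try:
                conn.execute("CREATE TABLE IF NOT EXISTS t (v INTEGER)")
                conn.execute("INSERT INTO t (v) VALUES (?)", (value,))
                # With default settings SQLite is expected to sync the
                # database file as part of the commit.
                conn.commit()
            finally:
                conn.close()
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)


def integrity_ok(db_path):
    """Return True when 'PRAGMA integrity_check' reports a single 'ok' row."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute("PRAGMA integrity_check").fetchall()
    finally:
        conn.close()
    return rows == [("ok",)]
```
</pre>
Running locked_update from several processes against files on a gluster mount and then calling integrity_ok would be one way to repeat the check described here. A clean result on a local filesystem says nothing about behaviour on gluster.<br>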
>> Is what we're trying to do achievable? We're testing using the docker<br>
>> container gluster/gluster-centos as the three servers, with a php test<br>
>> inside of php-cli using filesystem mounts. If we mount the gluster FS<br>
>> via sapk/plugin-gluster into the php-cli containers using docker, we<br>
>> seem to have better success sometimes, but I haven't figured out why,<br>
>> yet.<br>
>><br>
>> I did see that I needed to set the server volume parameter<br>
>> 'performance.flush-behind off', otherwise it seems that flushes won't<br>
>> block as would be needed by SQLite3.<br>
><br>
><br>
> If you are relying on fsync this shouldn't matter as fsync makes sure data<br>
> is synced to disk.<br>
><br>
>><br>
>> Does anyone have any suggestions? Any words of wisdom would be much<br>
>> appreciated.<br>
><br>
><br>
> Can you experiment with turning on/off various performance xlators? Based on<br>
> earlier issues, it's likely that there is stale metadata which might be<br>
> causing the issue (not necessarily improper fsync behavior). I would suggest<br>
> turning off all performance xlators. You can refer to [1] for a related<br>
> discussion. In theory the only perf xlator relevant for fsync is<br>
> write-behind and I am not aware of any issues where fsync is not working.<br>
> Does the glusterfs log file have any messages complaining about writes or fsync<br>
> failing? Does your application use O_DIRECT? If yes, please note that you<br>
> need to turn the option performance.strict-o-direct on for write-behind to<br>
> honour O_DIRECT.<br>
><br>
> Also, is it possible to identify the nature of the corruption: data or metadata?<br>
> A more detailed explanation will help us root-cause the issue.<br>
><br>
> Also, is your application running on a single mount or from multiple mounts?<br>
> Can you collect an strace of your application (strace -ff -T -p <pid> -o<br>
> <file>)? If possible, can you also collect a fuse dump using the option --dump-fuse<br>
> while mounting glusterfs?<br>
><br>
> [1]<br>
> <a href="http://lists.gluster.org/pipermail/gluster-users/2018-February/033503.html" rel="noreferrer" target="_blank">http://lists.gluster.org/piper<wbr>mail/gluster-users/2018-Februa<wbr>ry/033503.html</a><br>
><br>
>><br>
>> Thanks,<br>
>><br>
>> Paul<br>
>> ______________________________<wbr>_________________<br>
>> Gluster-users mailing list<br>
>> <a href="mailto:Gluster-users@gluster.org" target="_blank">Gluster-users@gluster.org</a><br>
>> <a href="http://lists.gluster.org/mailman/listinfo/gluster-users" rel="noreferrer" target="_blank">http://lists.gluster.org/mailm<wbr>an/listinfo/gluster-users</a><br>
><br>
><br>
</div></div></blockquote></div></div></div><br></div></div></div>
</blockquote></div><br></div>