[Gluster-users] SQLite3 on 3 node cluster FS?

Raghavendra Gowdappa rgowdapp at redhat.com
Tue Mar 6 17:32:40 UTC 2018


On Tue, Mar 6, 2018 at 10:58 PM, Raghavendra Gowdappa <rgowdapp at redhat.com>
wrote:

>
>
> On Tue, Mar 6, 2018 at 10:22 PM, Paul Anderson <pha at umich.edu> wrote:
>
>> Raghavendra,
>>
>> I've committed my test case to https://github.com/powool/gluster.git -
>> it's grungy and a work in progress, but I am happy to take change
>> suggestions, especially if they will save folks significant time.
>>
>> For the rest, I'll reply inline below...
>>
>> On Mon, Mar 5, 2018 at 10:39 PM, Raghavendra Gowdappa
>> <rgowdapp at redhat.com> wrote:
>> > +Csaba.
>> >
>> > On Tue, Mar 6, 2018 at 2:52 AM, Paul Anderson <pha at umich.edu> wrote:
>> >>
>> >> Raghavendra,
>> >>
>> >> Thanks very much for your reply.
>> >>
>> >> I fixed our data corruption problem by disabling the volume
>> >> performance.write-behind flag as you suggested, and simultaneously
>> >> disabling caching in my client side mount command.
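>> >>
>> >> Concretely, that amounted to something like the following (the mount
>> >> options and mountpoint here are from memory, so treat them as
>> >> illustrative rather than exact):
>> >>
>> >>     # gluster volume set dockerstore performance.write-behind off
>> >>     # mount -t glusterfs -o attribute-timeout=0,entry-timeout=0 \
>> >>         172.18.0.4:/dockerstore /mnt/dockerstore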
>> >
>> >
>> > Good to know it worked. Can you give us the output of
>> > # gluster volume info
>>
>> [root at node-1 /]# gluster volume info
>>
>> Volume Name: dockerstore
>> Type: Replicate
>> Volume ID: fb08b9f4-0784-4534-9ed3-e01ff71a0144
>> Status: Started
>> Snapshot Count: 0
>> Number of Bricks: 1 x 3 = 3
>> Transport-type: tcp
>> Bricks:
>> Brick1: 172.18.0.4:/data/glusterfs/store/dockerstore
>> Brick2: 172.18.0.3:/data/glusterfs/store/dockerstore
>> Brick3: 172.18.0.2:/data/glusterfs/store/dockerstore
>> Options Reconfigured:
>> performance.client-io-threads: off
>> nfs.disable: on
>> transport.address-family: inet
>> locks.mandatory-locking: optimal
>> performance.flush-behind: off
>> performance.write-behind: off
>>
>> >
>> > We would like to debug the problem in write-behind. Some questions:
>> >
>> > 1. What version of Glusterfs are you using?
>>
>> On the server nodes:
>>
>> [root at node-1 /]# gluster --version
>> glusterfs 3.13.2
>> Repository revision: git://git.gluster.org/glusterfs.git
>>
>> On the docker container sqlite test node:
>>
>> root at b4055d8547d2:/# glusterfs --version
>> glusterfs 3.8.8 built on Jan 11 2017 14:07:11
>>
>
> I guess this is where the client is mounted. If I am correct about where
> the glusterfs client is mounted, the client is running quite an old
> version. There have been a significant number of fixes between 3.8.8 and
> the current master.
>

... a significant number of fixes to write-behind ...

> I would suggest trying out 3.13.2 patched with [1]. If you get a chance
> to try this out, please report back on how the tests went.
>

I would suggest trying out 3.13.2 patched with [1] and running the tests
with write-behind turned on.
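
For reference, a Gerrit change like [1] can usually be applied on top of a
source checkout roughly as follows (the refs/changes path and patchset
number here are a guess - the download section on the review page has the
exact ref):

  git clone https://github.com/gluster/glusterfs.git && cd glusterfs
  git checkout v3.13.2
  git fetch https://review.gluster.org/glusterfs refs/changes/73/19673/1
  git cherry-pick FETCH_HEAD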


> [1] https://review.gluster.org/19673
>
>
>> I recognize that version skew could be an issue.
>>
>> > 2. Were you able to figure out whether it's stale data or metadata
>> > that is causing the issue?
>>
>> I lean towards stale data based on the only real observation I have:
>>
>> While debugging, I added log messages recording when the flock() is
>> acquired and when it is released. There is no instance where two
>> different processes ever hold the same flock()'d file at once. From
>> what I have read, the locks are considered metadata, and they appear to
>> me to be working, so that's why I'm inclined to think stale data is the
>> issue.
>>
>> >
>> > There have been patches merged in write-behind in the recent past, and
>> > one in the works, which address metadata consistency. We would like to
>> > understand whether you've run into any of the already identified issues.
>>
>> Agreed!
>>
>> Thanks,
>>
>> Paul
>>
>> >
>> > regards,
>> > Raghavendra
>> >>
>> >>
>> >> In very modest testing, the flock() case appears to me to work well -
>> >> before it would corrupt the db within a few transactions.
>> >>
>> >> Testing using built in sqlite3 locks is better (fcntl range locks),
>> >> but has some behavioral issues (probably just requires query retry
>> >> when the file is locked). I'll research this more, although the test
>> >> case is not critical to our use case.
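>> >> >>
>> >> >> As a sketch of the retry idea - rendered in C for illustration,
>> >> >> since the actual tests are PHP, and sqlite3_busy_timeout() would be
>> >> >> the more idiomatic route:
>> >> >>
>> >> >>     #include <unistd.h>
>> >> >>     #include <sqlite3.h>
>> >> >>
>> >> >>     /* Retry a statement while SQLite's fcntl() range locks are
>> >> >>        held by another process, backing off between attempts. */
>> >> >>     int exec_with_retry(sqlite3 *db, const char *sql, int max_tries)
>> >> >>     {
>> >> >>         for (int i = 0; i < max_tries; i++) {
>> >> >>             int rc = sqlite3_exec(db, sql, NULL, NULL, NULL);
>> >> >>             if (rc != SQLITE_BUSY && rc != SQLITE_LOCKED)
>> >> >>                 return rc;
>> >> >>             usleep(100000);    /* wait 100 ms, then try again */
>> >> >>         }
>> >> >>         return SQLITE_BUSY;
>> >> >>     }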
>> >>
>> >> There are no signs of O_DIRECT use in the sqlite3 code that I can see.
>> >>
>> >> I intend to set up tests that run much longer than a few minutes, to
>> >> see if there are any longer term issues. Also, I want to experiment
>> >> with data durability by killing various gluster server nodes during
>> >> the tests.
>> >>
>> >> If anyone would like our test scripts, I can either tar them up and
>> >> email them or put them in github - either is fine with me. (they rely
>> >> on current builds of docker and docker-compose)
>> >>
>> >> Thanks again!!
>> >>
>> >> Paul
>> >>
>> >> On Mon, Mar 5, 2018 at 11:26 AM, Raghavendra Gowdappa
>> >> <rgowdapp at redhat.com> wrote:
>> >> >
>> >> >
>> >> > On Mon, Mar 5, 2018 at 8:21 PM, Paul Anderson <pha at umich.edu> wrote:
>> >> >>
>> >> >> Hi,
>> >> >>
>> >> >> tl;dr summary of below: flock() works, but what does it take to make
>> >> >> sync()/fsync() work in a 3 node GFS cluster?
>> >> >>
>> >> >> I am under the impression that POSIX flock, POSIX
>> >> >> fcntl(F_SETLK/F_GETLK,...), and POSIX read/write/sync/fsync are all
>> >> >> supported in cluster operations, such that in theory, SQLite3 should
>> >> >> be able to atomically lock the file (or a subset of its pages),
>> >> >> modify pages, flush the pages to gluster, then release the lock, and
>> >> >> thus satisfy the ACID property that SQLite3 appears to try to
>> >> >> accomplish on a local filesystem.
>> >> >>
>> >> >> In a test we wrote that fires off 10 simple concurrent SQL insert,
>> >> >> read, update loops, we discovered that we at least need to use
>> >> >> flock() around the SQLite3 db connection open/update/close to
>> >> >> protect it.
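>> >> >>
>> >> >> The shape of that, rendered in C for illustration (our tests are
>> >> >> actually PHP, and the lock file path here is made up):
>> >> >>
>> >> >>     #include <sys/file.h>
>> >> >>     #include <fcntl.h>
>> >> >>     #include <unistd.h>
>> >> >>     #include <sqlite3.h>
>> >> >>
>> >> >>     /* Serialize open/update/close across processes with an
>> >> >>        exclusive advisory flock() on a sidecar lock file. */
>> >> >>     int locked_update(const char *db_path, const char *sql)
>> >> >>     {
>> >> >>         /* hypothetical lock file alongside the db */
>> >> >>         int lockfd = open("/mnt/dockerstore/test.lock",
>> >> >>                           O_CREAT | O_RDWR, 0644);
>> >> >>         if (lockfd < 0)
>> >> >>             return -1;
>> >> >>         flock(lockfd, LOCK_EX);       /* blocks until acquired */
>> >> >>
>> >> >>         sqlite3 *db = NULL;
>> >> >>         int rc = sqlite3_open(db_path, &db);
>> >> >>         if (rc == SQLITE_OK)
>> >> >>             rc = sqlite3_exec(db, sql, NULL, NULL, NULL);
>> >> >>         sqlite3_close(db);            /* close before unlocking */
>> >> >>
>> >> >>         flock(lockfd, LOCK_UN);
>> >> >>         close(lockfd);
>> >> >>         return rc;
>> >> >>     }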
>> >> >>
>> >> >> However, that is not enough - although from testing, it looks like
>> >> >> flock() works as advertised across gluster mounted files, sync/fsync
>> >> >> don't appear to, so we end up getting corruption in the SQLite3 file
>> >> >> (pragma integrity_check generally will show a bunch of problems
>> >> >> after a short test).
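>> >> >>
>> >> >> (The check itself is just, e.g.:
>> >> >>
>> >> >>     sqlite3 /path/to/test.db 'PRAGMA integrity_check;'
>> >> >>
>> >> >> where the db path is whatever the test wrote to.)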
>> >> >>
>> >> >> Is what we're trying to do achievable? We're testing using the
>> >> >> docker container gluster/gluster-centos as the three servers, with a
>> >> >> php test inside of php-cli using filesystem mounts. If we mount the
>> >> >> gluster FS via sapk/plugin-gluster into the php-cli containers using
>> >> >> docker, we seem to have better success sometimes, but I haven't
>> >> >> figured out why yet.
>> >> >>
>> >> >> I did see that I needed to set the server volume parameter
>> >> >> 'performance.flush-behind off', otherwise it seems that flushes
>> >> >> won't block as would be needed by SQLite3.
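>> >> >>
>> >> >> (For completeness, that was set with:
>> >> >>
>> >> >>     # gluster volume set dockerstore performance.flush-behind off
>> >> >>
>> >> >> using our volume name.)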
>> >> >
>> >> >
>> >> > If you are relying on fsync this shouldn't matter, as fsync makes
>> >> > sure data is synced to disk.
>> >> >
>> >> >>
>> >> >> Does anyone have any suggestions? Any words of wisdom would be much
>> >> >> appreciated.
>> >> >
>> >> >
>> >> > Can you experiment with turning on/off various performance xlators?
>> >> > Based on earlier issues, it's likely that there is stale metadata
>> >> > which might be causing the issue (not necessarily improper fsync
>> >> > behavior). I would suggest turning off all performance xlators. You
>> >> > can refer to [1] for a related discussion. In theory the only perf
>> >> > xlator relevant for fsync is write-behind, and I am not aware of any
>> >> > issues where fsync is not working. Does the glusterfs log file have
>> >> > any messages complaining about writes or fsync failing? Does your
>> >> > application use O_DIRECT? If yes, please note that you need to turn
>> >> > the option performance.strict-o-direct on for write-behind to honour
>> >> > O_DIRECT.
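>> >> >
>> >> > As a starting point, something along these lines (the option list is
>> >> > illustrative, not exhaustive):
>> >> >
>> >> >     # gluster volume set <volname> performance.quick-read off
>> >> >     # gluster volume set <volname> performance.io-cache off
>> >> >     # gluster volume set <volname> performance.read-ahead off
>> >> >     # gluster volume set <volname> performance.stat-prefetch off
>> >> >     # gluster volume set <volname> performance.open-behind off
>> >> >     # gluster volume set <volname> performance.write-behind off
>> >> >
>> >> > and, if O_DIRECT is in play:
>> >> >
>> >> >     # gluster volume set <volname> performance.strict-o-direct on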
>> >> >
>> >> > Also, is it possible to identify the nature of the corruption - data
>> >> > or metadata? A more detailed explanation will help us RCA the issue.
>> >> >
>> >> > Also, is your application running on a single mount or from multiple
>> >> > mounts? Can you collect an strace of your application (strace -ff -T
>> >> > -p <pid> -o <file>)? If possible, can you also collect a fuse-dump
>> >> > using the option --dump-fuse while mounting glusterfs?
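>> >> >
>> >> > For the fuse-dump, the client invocation would look roughly like
>> >> > this (a sketch - substitute your server, volume and mountpoint):
>> >> >
>> >> >     # glusterfs --volfile-server=<server> --volfile-id=<volname> \
>> >> >           --dump-fuse=/tmp/fuse.dump /mnt/<mountpoint>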
>> >> >
>> >> > [1]
>> >> >
>> >> > http://lists.gluster.org/pipermail/gluster-users/2018-February/033503.html
>> >> >
>> >> >>
>> >> >> Thanks,
>> >> >>
>> >> >> Paul
>> >> >> _______________________________________________
>> >> >> Gluster-users mailing list
>> >> >> Gluster-users at gluster.org
>> >> >> http://lists.gluster.org/mailman/listinfo/gluster-users
>> >> >
>> >> >
>> >
>> >
>>
>
>