[Gluster-users] Slow write times to gluster disk
Pranith Kumar Karampuri
pkarampu at redhat.com
Sat May 13 03:17:11 UTC 2017
On Sat, May 13, 2017 at 8:44 AM, Pranith Kumar Karampuri <
pkarampu at redhat.com> wrote:
>
>
> On Fri, May 12, 2017 at 8:04 PM, Pat Haley <phaley at mit.edu> wrote:
>
>>
>> Hi Pranith,
>>
>> My question was about setting up a gluster volume on an ext4 partition.
>> I thought we had the bricks mounted as xfs for compatibility with gluster?
>>
>
> Oh that should not be a problem. It works fine.
>
Just that XFS doesn't impose the limits that ext4 does for things like the
number of hardlinks per inode, etc. (at least the last time I checked :-) ).
So it is better to use XFS for the bricks.
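
For example, ext4 caps the number of hard links per inode (65000, if I
remember right), and gluster keeps a hard link under .glusterfs/ for every
file on the brick, so that limit can actually be hit. A quick way to see the
difference on a scratch directory (paths here are just placeholders):

touch /mnt/ext4-test/target
for i in $(seq 1 70000); do
    ln /mnt/ext4-test/target /mnt/ext4-test/link-$i || break
done
# on ext4 the loop stops with "Too many links" well before 70000;
# on xfs it keeps going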
>
>
>>
>> Pat
>>
>>
>>
>> On 05/11/2017 12:06 PM, Pranith Kumar Karampuri wrote:
>>
>>
>>
>> On Thu, May 11, 2017 at 9:32 PM, Pat Haley <phaley at mit.edu> wrote:
>>
>>>
>>> Hi Pranith,
>>>
>>> The /home partition is mounted as ext4
>>> /home ext4 defaults,usrquota,grpquota 1 2
>>>
>>> The brick partitions are mounted as xfs
>>> /mnt/brick1 xfs defaults 0 0
>>> /mnt/brick2 xfs defaults 0 0
>>>
>>> Will this cause a problem with creating a volume under /home?
>>>
>>
>> I don't think the bottleneck is the disk. Could you run the same dd tests
>> on the new volume to confirm?
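>>
>> Something along these lines on the test volume's mount point should be
>> enough to compare (the mount point and file names are just examples):
>>
>> dd if=/dev/zero of=/mnt/testvol/zeros.txt bs=1048576 count=4096
>> dd if=/dev/zero of=/mnt/testvol/zeros-sync.txt bs=1048576 count=4096 oflag=sync
>> rm /mnt/testvol/zeros.txt /mnt/testvol/zeros-sync.txt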
>>
>>
>>>
>>> Pat
>>>
>>>
>>>
>>> On 05/11/2017 11:32 AM, Pranith Kumar Karampuri wrote:
>>>
>>>
>>>
>>> On Thu, May 11, 2017 at 8:57 PM, Pat Haley <phaley at mit.edu> wrote:
>>>
>>>>
>>>> Hi Pranith,
>>>>
>>>> Unfortunately, we don't have similar hardware for a small scale test.
>>>> All we have is our production hardware.
>>>>
>>>
>>> You said something about the /home partition, which has fewer disks; we can
>>> create a plain distribute volume inside one of its directories. Once we are
>>> done, we can remove the setup again. What do you say?
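>>>
>>> Roughly something like this, just to show the shape of it (directory, volume
>>> and mount point names are only examples; adjust them for your setup):
>>>
>>> mkdir -p /home/gluster-test/brick1 /home/gluster-test/brick2 /mnt/testvol
>>> # plain distribute: no replica count given
>>> # (add "force" at the end only if gluster complains about the brick path)
>>> gluster volume create test-volume mseas-data2:/home/gluster-test/brick1 mseas-data2:/home/gluster-test/brick2
>>> gluster volume start test-volume
>>> mount -t glusterfs mseas-data2:/test-volume /mnt/testvol
>>> # ... run the dd tests against /mnt/testvol ...
>>> umount /mnt/testvol
>>> gluster volume stop test-volume
>>> gluster volume delete test-volume
>>> rm -rf /home/gluster-test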
>>>
>>>
>>>>
>>>> Pat
>>>>
>>>>
>>>>
>>>>
>>>> On 05/11/2017 07:05 AM, Pranith Kumar Karampuri wrote:
>>>>
>>>>
>>>>
>>>> On Thu, May 11, 2017 at 2:48 AM, Pat Haley <phaley at mit.edu> wrote:
>>>>
>>>>>
>>>>> Hi Pranith,
>>>>>
>>>>> Since we are mounting the partitions as the bricks, I tried the dd
>>>>> test writing to <brick-path>/.glusterfs/<file-to-be-removed-after-test>.
>>>>> The results without oflag=sync were 1.6 Gb/s (faster than gluster but not
>>>>> as fast as I was expecting given the 1.2 Gb/s to the no-gluster area w/
>>>>> fewer disks).
>>>>>
>>>>
>>>> Okay, then 1.6 Gb/s is what we need to target, considering your volume is
>>>> just distribute. Is there any way you can do tests on similar hardware but
>>>> at a smaller scale, just so we can run the workload and learn more about
>>>> the bottlenecks in the system? We could probably try to get the speed up to
>>>> 1.2 Gb/s on the /home partition you were telling me about yesterday. Let me
>>>> know if that is something you are okay to do.
>>>>
>>>>
>>>>>
>>>>> Pat
>>>>>
>>>>>
>>>>>
>>>>> On 05/10/2017 01:27 PM, Pranith Kumar Karampuri wrote:
>>>>>
>>>>>
>>>>>
>>>>> On Wed, May 10, 2017 at 10:15 PM, Pat Haley <phaley at mit.edu> wrote:
>>>>>
>>>>>>
>>>>>> Hi Pranith,
>>>>>>
>>>>>> Not entirely sure (this isn't my area of expertise). I'll run your
>>>>>> answer by some other people who are more familiar with this.
>>>>>>
>>>>>> I am also uncertain about how to interpret the results when we also
>>>>>> add the dd tests writing to the /home area (no gluster, still on the same
>>>>>> machine)
>>>>>>
>>>>>> - dd test without oflag=sync (rough average of multiple tests)
>>>>>>   - gluster w/ fuse mount: 570 Mb/s
>>>>>>   - gluster w/ nfs mount: 390 Mb/s
>>>>>>   - nfs (no gluster): 1.2 Gb/s
>>>>>> - dd test with oflag=sync (rough average of multiple tests)
>>>>>>   - gluster w/ fuse mount: 5 Mb/s
>>>>>>   - gluster w/ nfs mount: 200 Mb/s
>>>>>>   - nfs (no gluster): 20 Mb/s
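>>>>>>
>>>>>> For concreteness, the two cases above differ only in the oflag=sync flag
>>>>>> to dd, i.e. something of this form (size and file name are just examples):
>>>>>>
>>>>>> dd if=/dev/zero of=zeros.txt bs=1048576 count=4096
>>>>>> dd if=/dev/zero of=zeros.txt bs=1048576 count=4096 oflag=sync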
>>>>>>
>>>>>> Given that the non-gluster area is a RAID-6 of 4 disks while each
>>>>>> brick of the gluster area is a RAID-6 of 32 disks, I would naively expect
>>>>>> the writes to the gluster area to be roughly 8x faster than to the
>>>>>> non-gluster.
>>>>>>
>>>>>
>>>>> I think a better test is to try writing to a file over NFS, without any
>>>>> gluster involved, to a location that is not inside the brick but is on the
>>>>> same disk(s). If you are mounting the whole partition as the brick, then we
>>>>> can write to a file inside the .glusterfs directory, something like
>>>>> <brick-path>/.glusterfs/<file-to-be-removed-after-test>.
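>>>>>
>>>>> For example (the file name here is just a throwaway; remove it afterwards):
>>>>>
>>>>> dd if=/dev/zero of=/mnt/brick1/.glusterfs/ddtest.tmp bs=1048576 count=4096
>>>>> rm /mnt/brick1/.glusterfs/ddtest.tmp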
>>>>>
>>>>>
>>>>>> I still think we have a speed issue; I can't tell whether fuse vs nfs is
>>>>>> part of the problem.
>>>>>>
>>>>>
>>>>> I got interested in the post because I read that FUSE was slower than NFS,
>>>>> which is counter-intuitive to my understanding, so I wanted clarification.
>>>>> Now that I have it, and FUSE outperformed NFS without sync, we can resume
>>>>> testing as described above and try to find what the problem is. Based on
>>>>> your email address I am guessing you are in Boston; I am in Bangalore, so
>>>>> if you are okay with this debugging stretching over multiple days because
>>>>> of the timezones, I will be happy to help. Please be a bit patient with me;
>>>>> I am under a release crunch, but I am very curious about the problem you
>>>>> posted.
>>>>>
>>>>>> Was there anything useful in the profiles?
>>>>>>
>>>>>
>>>>> Unfortunately the profiles didn't help me much. I think we are collecting
>>>>> them from an active volume, so they contain a lot of information that has
>>>>> nothing to do with dd, which makes it difficult to isolate dd's
>>>>> contribution. So I went through your post again and noticed something I
>>>>> hadn't paid much attention to earlier, i.e. oflag=sync, did my own tests on
>>>>> my setup with FUSE, and sent that reply.
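>>>>>
>>>>> If you get a chance, a profile captured around just one dd run would be
>>>>> easier to read. Roughly like this (the volume name is yours; the mount path
>>>>> and file name are placeholders):
>>>>>
>>>>> gluster volume profile data-volume start
>>>>> gluster volume profile data-volume info > /dev/null    # reset the interval counters
>>>>> dd if=/dev/zero of=/gluster-mount/ddtest.tmp bs=1048576 count=4096 oflag=sync
>>>>> gluster volume profile data-volume info > profile-during-dd.txt    # "Interval" section now mostly covers the dd
>>>>> gluster volume profile data-volume stop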
>>>>>
>>>>>
>>>>>>
>>>>>> Pat
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 05/10/2017 12:15 PM, Pranith Kumar Karampuri wrote:
>>>>>>
>>>>>> Okay good. At least this validates my doubts. Handling O_SYNC in
>>>>>> gluster NFS and fuse is a bit different.
>>>>>> When an application opens a file with O_SYNC on a FUSE mount, each write
>>>>>> syscall has to be written to disk as part of that syscall, whereas in the
>>>>>> case of NFS there is no concept of open. NFS performs the write through a
>>>>>> handle that marks it as a synchronous write, so the write() is performed
>>>>>> first and then an fsync(); a write on an fd with O_SYNC effectively becomes
>>>>>> write+fsync. My guess is that when multiple threads do this write+fsync()
>>>>>> on the same file, multiple writes get batched together before being written
>>>>>> to disk, which is why the throughput on the disk increases.
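>>>>>>
>>>>>> You can get a rough feel for the difference on any local filesystem with
>>>>>> dd: oflag=sync makes every write synchronous (like the O_SYNC path on the
>>>>>> fuse mount), while conv=fsync lets the writes stream in and flushes them
>>>>>> afterwards, which is closer in spirit to the write+fsync pattern where
>>>>>> batching can happen:
>>>>>>
>>>>>> dd if=/dev/zero of=testfile bs=1048576 count=1024 oflag=sync    # sync on every write
>>>>>> dd if=/dev/zero of=testfile bs=1048576 count=1024 conv=fsync    # write, then flush once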
>>>>>>
>>>>>> Does it answer your doubts?
>>>>>>
>>>>>> On Wed, May 10, 2017 at 9:35 PM, Pat Haley <phaley at mit.edu> wrote:
>>>>>>
>>>>>>>
>>>>>>> Without oflag=sync, and with only a single test of each, the FUSE mount is
>>>>>>> faster than NFS:
>>>>>>>
>>>>>>> FUSE:
>>>>>>> mseas-data2(dri_nascar)% dd if=/dev/zero count=4096 bs=1048576
>>>>>>> of=zeros.txt conv=sync
>>>>>>> 4096+0 records in
>>>>>>> 4096+0 records out
>>>>>>> 4294967296 bytes (4.3 GB) copied, 7.46961 s, 575 MB/s
>>>>>>>
>>>>>>>
>>>>>>> NFS
>>>>>>> mseas-data2(HYCOM)% dd if=/dev/zero count=4096 bs=1048576
>>>>>>> of=zeros.txt conv=sync
>>>>>>> 4096+0 records in
>>>>>>> 4096+0 records out
>>>>>>> 4294967296 bytes (4.3 GB) copied, 11.4264 s, 376 MB/s
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 05/10/2017 11:53 AM, Pranith Kumar Karampuri wrote:
>>>>>>>
>>>>>>> Could you let me know the speed without oflag=sync on both the
>>>>>>> mounts? No need to collect profiles.
>>>>>>>
>>>>>>> On Wed, May 10, 2017 at 9:17 PM, Pat Haley <phaley at mit.edu> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>> Here is what I see now:
>>>>>>>>
>>>>>>>> [root at mseas-data2 ~]# gluster volume info
>>>>>>>>
>>>>>>>> Volume Name: data-volume
>>>>>>>> Type: Distribute
>>>>>>>> Volume ID: c162161e-2a2d-4dac-b015-f31fd89ceb18
>>>>>>>> Status: Started
>>>>>>>> Number of Bricks: 2
>>>>>>>> Transport-type: tcp
>>>>>>>> Bricks:
>>>>>>>> Brick1: mseas-data2:/mnt/brick1
>>>>>>>> Brick2: mseas-data2:/mnt/brick2
>>>>>>>> Options Reconfigured:
>>>>>>>> diagnostics.count-fop-hits: on
>>>>>>>> diagnostics.latency-measurement: on
>>>>>>>> nfs.exports-auth-enable: on
>>>>>>>> diagnostics.brick-sys-log-level: WARNING
>>>>>>>> performance.readdir-ahead: on
>>>>>>>> nfs.disable: on
>>>>>>>> nfs.export-volumes: off
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On 05/10/2017 11:44 AM, Pranith Kumar Karampuri wrote:
>>>>>>>>
>>>>>>>> Is this the volume info you have?
>>>>>>>>
>>>>>>>> > [root at mseas-data2 ~]# gluster volume info
>>>>>>>> >
>>>>>>>> > Volume Name: data-volume
>>>>>>>> > Type: Distribute
>>>>>>>> > Volume ID: c162161e-2a2d-4dac-b015-f31fd89ceb18
>>>>>>>> > Status: Started
>>>>>>>> > Number of Bricks: 2
>>>>>>>> > Transport-type: tcp
>>>>>>>> > Bricks:
>>>>>>>> > Brick1: mseas-data2:/mnt/brick1
>>>>>>>> > Brick2: mseas-data2:/mnt/brick2
>>>>>>>> > Options Reconfigured:
>>>>>>>> > performance.readdir-ahead: on
>>>>>>>> > nfs.disable: on
>>>>>>>> > nfs.export-volumes: off
>>>>>>>>
>>>>>>>> I copied this from an old thread from 2016. This is a distribute
>>>>>>>> volume. Did you change any of the options in between?
>>>>>>>>
>>>>>>>> --
>>>>>>>>
>>>>>>>> -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
>>>>>>>> Pat Haley                          Email:  phaley at mit.edu
>>>>>>>> Center for Ocean Engineering       Phone:  (617) 253-6824
>>>>>>>> Dept. of Mechanical Engineering    Fax:    (617) 253-8125
>>>>>>>> MIT, Room 5-213                    http://web.mit.edu/phaley/www/
>>>>>>>> 77 Massachusetts Avenue
>>>>>>>> Cambridge, MA 02139-4301
>>>>>>>>
>
>
> --
> Pranith
>
--
Pranith