[Gluster-users] Unbalanced load

Joe Julian joe at julianfamily.org
Fri Sep 5 04:47:01 UTC 2014


That is about as far removed from anything useful for troubleshooting as possible. You're reporting a symptom from within a virtualized environment; it's the real systems that have the useful logs. Are there any errors in the client or brick logs? Libvirt logs? dmesg on the server? Is either server CPU bound? In swap?
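For example, something along these lines (a rough sketch only; the paths assume the default GlusterFS and libvirt log locations, and "vmguest" stands in for your guest's name):

    # error-level entries in the brick and self-heal logs on each server
    grep '\] E \[' /var/log/glusterfs/bricks/*.log | tail -n 50
    # error-level entries in the FUSE client (mount) log
    grep '\] E \[' /var/log/glusterfs/*.log | tail -n 50
    # qemu/libvirt log for the guest
    tail -n 50 /var/log/libvirt/qemu/vmguest.log
    # kernel messages on the server
    dmesg | tail -n 50
    # quick look at CPU load and swap usage
    top -b -n 1 | head -n 20
    free -m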


On September 4, 2014 9:12:16 PM PDT, "Miloš Kozák" <milos.kozak at lejmr.com> wrote:
>Hi,
>
>I ran a few more tests. I moved a file (a VM image) onto the GlusterFS
>mount, and during the load I got this on the console of the running VM:
>
>lost page write due to I/O error on vda1
>Buffer I/O error on device vda1, logical block 1049638
>lost page write due to I/O error on vda1
>Buffer I/O error on device vda1, logical block 1049646
>lost page write due to I/O error on vda1
>Buffer I/O error on device vda1, logical block 1049647
>lost page write due to I/O error on vda1
>Buffer I/O error on device vda1, logical block 1049649
>lost page write due to I/O error on vda1
>end_request: I/O error, dev vda, sector 8399688
>end_request: I/O error, dev vda, sector 8399728
>end_request: I/O error, dev vda, sector 8399736
>end_request: I/O error, dev vda, sector 8399776
>end_request: I/O error, dev vda, sector 8399792
>__ratelimit: 5 callbacks suppressed
>EXT4-fs error (device vda1): ext4_find_entry: reading directory #398064 offset 0
>EXT4-fs error (device vda1): ext4_find_entry: reading directory #398064 offset 0
>EXT4-fs error (device vda1): ext4_find_entry: reading directory #132029 offset 0
>
>Do you think it is related to the options that are set on the volume?
>
>     storage.owner-gid: 498
>     storage.owner-uid: 498
>     network.ping-timeout: 2
>     performance.io-thread-count: 3
>     cluster.server-quorum-type: server
>     network.remote-dio: enable
>     cluster.eager-lock: enable
>     performance.stat-prefetch: off
>     performance.io-cache: off
>     performance.read-ahead: off
>     performance.quick-read: off
>
>Thanks Milos
>
>
>On 14-09-03 at 04:01 PM, Milos Kozak wrote:
>> I have just tried to copy a VM image (raw), and it causes the same
>> problem.
>>
>> I have GlusterFS 3.5.2
>>
>>
>>
>> On 9/3/2014 9:14 AM, Roman wrote:
>>> Hi,
>>>
>>> I had some issues with files generated from /dev/zero too. Try real
>>> files or /dev/urandom :)
>>> I don't know if there is a real issue/bug with files generated from
>>> /dev/zero? The devs should check it out, /me thinks.
>>>
>>>
>>> 2014-09-03 16:11 GMT+03:00 Milos Kozak <milos.kozak at lejmr.com>:
>>>
>>>     Hi,
>>>
>>>     I am facing a quite strange problem with two servers that have the
>>>     same configuration and the same hardware. The servers are connected
>>>     by bonded 1GE. I have one volume:
>>>
>>>     [root at nodef02i 103]# gluster volume info
>>>
>>>     Volume Name: ph-fs-0
>>>     Type: Replicate
>>>     Volume ID: f8f569ea-e30c-43d0-bb94-b2f1164a7c9a
>>>     Status: Started
>>>     Number of Bricks: 1 x 2 = 2
>>>     Transport-type: tcp
>>>     Bricks:
>>>     Brick1: 10.11.100.1:/gfs/s3-sata-10k/fs
>>>     Brick2: 10.11.100.2:/gfs/s3-sata-10k/fs
>>>     Options Reconfigured:
>>>     storage.owner-gid: 498
>>>     storage.owner-uid: 498
>>>     network.ping-timeout: 2
>>>     performance.io-thread-count: 3
>>>     cluster.server-quorum-type: server
>>>     network.remote-dio: enable
>>>     cluster.eager-lock: enable
>>>     performance.stat-prefetch: off
>>>     performance.io-cache: off
>>>     performance.read-ahead: off
>>>     performance.quick-read: off
>>>
>>>     The volume is intended to host virtual servers (KVM); the
>>>     configuration follows the Gluster blog.
>>>
>>>
>>>     Currently I have only one virtual server deployed on top of this
>>>     volume, in order to see the effects of my stress tests. During the
>>>     tests I write to the volume, mounted through FUSE, with dd
>>>     (currently only one write at a time):
>>>
>>>     dd if=/dev/zero of=test2.img bs=1M count=20000 conv=fdatasync
>>>
>>>
>>>     Test 1) I run dd on nodef02i. The load on nodef02i is at most 1,
>>>     but on nodef01i it is around 14 (I have a 12-thread CPU). After
>>>     the write is done the load on nodef02i goes down, but on nodef01i
>>>     it goes up to 28 and stays there for 20 minutes. In the meantime
>>>     I can see:
>>>
>>>     [root at nodef01i 103]# gluster volume heal ph-fs-0 info
>>>     Volume ph-fs-0 is not started (Or) All the bricks are not running.
>>>     Volume heal failed
>>>
>>>     [root at nodef02i 103]# gluster volume heal ph-fs-0 info
>>>     Brick nodef01i.czprg:/gfs/s3-sata-10k/fs/
>>>     /3706a2cb0bb27ba5787b3c12388f4ebb - Possibly undergoing heal
>>>     /test.img - Possibly undergoing heal
>>>     Number of entries: 2
>>>
>>>     Brick nodef02i.czprg:/gfs/s3-sata-10k/fs/
>>>     /3706a2cb0bb27ba5787b3c12388f4ebb - Possibly undergoing heal
>>>     /test.img - Possibly undergoing heal
>>>     Number of entries: 2
>>>
>>>
>>>     [root at nodef01i 103]# gluster volume status
>>>     Status of volume: ph-fs-0
>>>     Gluster process                                Port   Online  Pid
>>>     ------------------------------------------------------------------
>>>     Brick 10.11.100.1:/gfs/s3-sata-10k/fs          49152  Y       56631
>>>     Brick 10.11.100.2:/gfs/s3-sata-10k/fs          49152  Y       3372
>>>     NFS Server on localhost                        2049   Y       56645
>>>     Self-heal Daemon on localhost                  N/A    Y       56649
>>>     NFS Server on 10.11.100.2                      2049   Y       3386
>>>     Self-heal Daemon on 10.11.100.2                N/A    Y       3387
>>>
>>>     Task Status of Volume ph-fs-0
>>>     ------------------------------------------------------------------
>>>     There are no active volume tasks
>>>
>>>     This very high load lasts another 20-30 minutes. During the first
>>>     test I restarted the glusterd service after 10 minutes, because it
>>>     seemed to me that the service was not working, but I could still
>>>     see a very high load on nodef01i.
>>>     Consequently, the virtual server reports errors about problems with
>>>     the EXT4 filesystem and MySQL stops.
>>>
>>>
>>>
>>>     When the load peaked I tried to run the same test in the opposite
>>>     direction: I wrote (dd) from nodef01i - test2. More or less the
>>>     same thing happened: I got an extremely high load on nodef01i and
>>>     minimal load on nodef02i. The outputs from heal were more or less
>>>     the same.
>>>
>>>
>>>     I would like to tweak this, but I don't know what I should focus on.
>>>     Thank you for your help.
>>>
>>>     Milos
>>>
>>>
>>>
>>>
>>> -- 
>>> Best regards,
>>> Roman.
>
>_______________________________________________
>Gluster-users mailing list
>Gluster-users at gluster.org
>http://supercolony.gluster.org/mailman/listinfo/gluster-users

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.