[Gluster-devel] how to set the default read/write block size for all transactions for optimal performance (e.g. anything similar to rsize, wsize nfs options?)
Anand Avati
avati at zresearch.com
Sun Jul 20 16:51:59 UTC 2008
write-behind is not being used in your configuration. You need to chain the
performance translators.
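For example, here is a minimal sketch of what chaining looks like on
the client side for one brick. This is untested against your setup, and
the names eon0-wb and eon0-ra and the sizes are only illustrative. Each
performance translator takes the previous volume as its subvolume, and
the volume at the top of the chain is the one the mount should use:

volume eon0
  type protocol/client
  option transport-type tcp/client
  option remote-host porpoise-san
  option remote-subvolume eon0
end-volume

volume eon0-wb
  type performance/write-behind
  option aggregate-size 131072   # 128KB, in bytes
  option flush-behind on
  subvolumes eon0                # stacked on the protocol/client volume
end-volume

volume eon0-ra
  type performance/read-ahead
  option page-size 131072        # in bytes
  option page-count 16
  subvolumes eon0-wb             # stacked on write-behind
end-volume

The same applies on the server side: protocol/server should export the
top of each chain (for example an io-threads volume whose subvolume is
the posix volume), not the raw storage/posix volumes. In the .vol files
below, the write-behind/read-ahead volumes point at the posix/client
volumes, but nothing uses them as a subvolume, so they never enter the
graph.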
avati
> I've tested several clustered file systems (OCFS2, XSAN, GFS2) and I
> really like the simplicity (unify/stripe translators) and portability
> of Gluster. However, I'm getting poor performance versus NFS when I
> use bs=4k with dd; with bs=128k the performance is comparable to gigE
> NFS. It's still nowhere near the speed of writing directly to the
> storage, but that's OK because everything is going over TCP/IP when
> using Gluster. Here's the test on the storage itself (15-disk RAID5 on
> an Infortrend Eonstor A16F-G2221 2GB FC <-> 4GB FC QLogic switch <->
> 4GB FC QLogic HBA on the server "porpoise", running XFS on the RAID5):
>
> 90 porpoise:/export/eon0/tmp% time dd if=/dev/zero of=testFile bs=4k count=500000
> 2048000000 bytes (2.0 GB) copied, 9.37949 s, 218 MB/s
>
> Here's the NFS mount going over gigE (server and client are the same):
>
> porpoise-san:/export/eon0 on /mnt/eon0 type nfs (rw,addr=10.2.179.3)
>
> Here's the test:
>
> 93 porpoise:/mnt/eon0/tmp% time dd if=/dev/zero of=testFile bs=4k count=500000
> 2048000000 bytes (2.0 GB) copied, 25.7614 s, 79.5 MB/s
>
> Basically, I'm looking for performance comparable to the NFS test
> above with Gluster. Here's the mount:
>
> glusterfs 5.1T 3.6G 5.1T 1% /export/glfs
>
> Here's the test:
>
> 88 porpoise:/export/glfs/tmp% time dd if=/dev/zero of=testFile bs=4k count=50000
> 204800000 bytes (205 MB) copied, 17.7291 s, 11.6 MB/s
> 0.106u 0.678s 0:17.73 4.3% 0+0k 0+0io 0pf+0w
>
> The data size was reduced for the GlusterFS test because I didn't want
> to wait :). But if I increase the bs, the speed improves:
>
> 99 porpoise:/export/glfs/tmp% time dd if=/dev/zero of=testFile bs=64k count=27500
> 1802240000 bytes (1.8 GB) copied, 26.4466 s, 68.1 MB/s
>
> If I increase the bs to 128k, the performance is even better:
>
> 100 porpoise:/export/glfs/tmp% time dd if=/dev/zero of=testFile bs=128k count=13750
> 1802240000 bytes (1.8 GB) copied, 21.2332 s, 84.9 MB/s
>
> How can I tell the Gluster server or client to use a default
> read/write block size of 128k or more? With NFS, the rsize and wsize
> mount options accomplish the same thing.
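>
> For instance, on the NFS side I have something like this in mind (the
> 131072 values are just illustrative):
>
> mount -t nfs -o rsize=131072,wsize=131072 porpoise-san:/export/eon0 /mnt/eon0
>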
> Here's my setup (I've dumped the non-relevant bricks as well):
>
> #### glusterfs-server.vol ####
>
> volume eon0
> type storage/posix
> option thread-count 8
> option cache-size 1024MB
> option directory /export/eon0
> end-volume
>
> volume eon1
> type storage/posix
> option directory /export/eon1
> end-volume
>
> volume eon2
> type storage/posix
> option directory /export/eon2
> end-volume
>
> volume glfs-ns
> type storage/posix
> option directory /export/glfs-ns
> end-volume
>
> volume writebehind
> type performance/write-behind
> #option aggregate-size 131072 # in bytes
> option aggregate-size 1MB # default is 0bytes
> option flush-behind on # default is 'off'
> subvolumes eon0
> end-volume
>
> volume writebehind
> type performance/write-behind
> option aggregate-size 131072 # in bytes
> subvolumes eon1
> end-volume
>
> volume writebehind
> type performance/write-behind
> option aggregate-size 131072 # in bytes
> subvolumes eon2
> end-volume
>
> volume writebehind
> type performance/write-behind
> option aggregate-size 131072 # in bytes
> subvolumes glfs-ns
> end-volume
>
> volume readahead
> type performance/read-ahead
> option page-size 65536 ### in bytes
> option page-count 16 ### memory cache size is page-count x page-size per file
> subvolumes eon0
> end-volume
>
> volume readahead
> type performance/read-ahead
> option page-size 65536 ### in bytes
> option page-count 16 ### memory cache size is page-count x page-size per file
> subvolumes eon1
> end-volume
>
> volume readahead
> type performance/read-ahead
> option page-size 65536 ### in bytes
> option page-count 16 ### memory cache size is page-count x page-size per file
> subvolumes eon2
> end-volume
>
> volume readahead
> type performance/read-ahead
> option page-size 65536 ### in bytes
> option page-count 16 ### memory cache size is page-count x page-size per file
> subvolumes glfs-ns
> end-volume
>
> volume iothreads
> type performance/io-threads
> option thread-count 4 # default is 1
> option cache-size 64MB
> subvolumes eon0
> end-volume
>
> volume iothreads
> type performance/io-threads
> option thread-count 4 # default is 1
> option cache-size 64MB
> subvolumes eon1
> end-volume
>
> volume iothreads
> type performance/io-threads
> option thread-count 4 # default is 1
> option cache-size 64MB
> subvolumes eon2
> end-volume
>
> volume iothreads
> type performance/io-threads
> option thread-count 4 # default is 1
> option cache-size 64MB
> subvolumes glfs-ns
> end-volume
>
> volume server
> type protocol/server
> option transport-type tcp/server
> option auth.ip.eon0.allow 10.2.179.*
> option auth.ip.eon1.allow 10.2.179.*
> option auth.ip.eon2.allow 10.2.179.*
> option auth.ip.glfs-ns.allow 10.2.179.*
> subvolumes eon0 eon1 eon2 glfs-ns
> end-volume
>
> ####
>
> #### glusterfs-client.vol ####
>
> volume eon0
> type protocol/client
> option transport-type tcp/client
> option remote-host porpoise-san
> option remote-subvolume eon0
> end-volume
>
> volume eon1
> type protocol/client
> option transport-type tcp/client
> option remote-host porpoise-san
> option remote-subvolume eon1
> end-volume
>
> volume eon2
> type protocol/client
> option transport-type tcp/client
> option remote-host porpoise-san
> option remote-subvolume eon2
> end-volume
>
> volume glfs-ns
> type protocol/client
> option transport-type tcp/client
> option remote-host porpoise-san
> option remote-subvolume glfs-ns
> end-volume
>
> volume writebehind
> type performance/write-behind
> #option aggregate-size 131072 # in bytes
> option aggregate-size 1MB # default is 0bytes
> option flush-behind on # default is 'off'
> subvolumes eon0
> end-volume
>
> volume writebehind
> type performance/write-behind
> option aggregate-size 131072 # in bytes
> subvolumes eon1
> end-volume
>
> volume writebehind
> type performance/write-behind
> option aggregate-size 131072 # in bytes
> subvolumes eon2
> end-volume
>
> volume writebehind
> type performance/write-behind
> option aggregate-size 131072 # in bytes
> subvolumes glfs-ns
> end-volume
>
> volume readahead
> type performance/read-ahead
> option page-size 1MB
> option page-count 2
> #option page-size 65536 ### in bytes
> #option page-count 16 ### memory cache size is page-count x page-size per file
> subvolumes eon0
> end-volume
>
> volume readahead
> type performance/read-ahead
> option page-size 65536 ### in bytes
> option page-count 16 ### memory cache size is page-count x page-size per file
> subvolumes eon1
> end-volume
>
> volume readahead
> type performance/read-ahead
> option page-size 65536 ### in bytes
> option page-count 16 ### memory cache size is page-count x page-size per file
> subvolumes eon2
> end-volume
>
> volume readahead
> type performance/read-ahead
> option page-size 65536 ### in bytes
> option page-count 16 ### memory cache size is page-count x page-size per file
> subvolumes glfs-ns
> end-volume
>
> volume io-cache
> type performance/io-cache
> option cache-size 64MB # default is 32MB
> option page-size 1MB #128KB is default option
> #option priority *.h:3,*.html:2,*:1 # default is '*:0'
> option priority *:0
> option force-revalidate-timeout 2 # default is 1
> subvolumes eon0
> end-volume
>
> #volume unify0
> # type cluster/unify
> # option scheduler rr # round robin
> # option namespace glfs-ns
> #subvolumes eon0 eon1 eon2
> # subvolumes eon0
> #end-volume
>
> #volume stripe0
> # type cluster/stripe
> # option block-size *:1MB
> # subvolumes eon0 eon1 eon2
> #end-volume
>
> ####
>
> I've tried everything from a very basic server/client setup with no
> translators to the setup above, and almost everything in between, to
> try to improve the performance. The server/client system is an Apple
> XServe G5 running Gentoo PPC64:
>
> Linux porpoise 2.6.24.4 #6 Sun Jul 20 00:16:04 CDT 2008 ppc64
> PPC970FX, altivec supported RackMac3,1 GNU/Linux
>
> % cat /proc/cpuinfo
> processor : 0
> cpu : PPC970FX, altivec supported
> clock : 2000.000000MHz
> revision : 3.0 (pvr 003c 0300)
> timebase : 33333333
> platform : PowerMac
> machine : RackMac3,1
> motherboard : RackMac3,1 MacRISC4 Power Macintosh
> detected as : 339 (XServe G5)
> pmac flags : 00000000
> L2 cache : 512K unified
> pmac-generation : NewWorld
>
> % cat /proc/meminfo
> MemTotal: 2006988 kB
> MemFree: 107864 kB
> Buffers: 676 kB
> Cached: 1775800 kB
> SwapCached: 0 kB
> Active: 46672 kB
> Inactive: 1762528 kB
> SwapTotal: 3583928 kB
> SwapFree: 3583624 kB
> Dirty: 0 kB
> Writeback: 0 kB
> AnonPages: 32744 kB
> Mapped: 10292 kB
> Slab: 65704 kB
> SReclaimable: 50620 kB
> SUnreclaim: 15084 kB
> PageTables: 1180 kB
> NFS_Unstable: 0 kB
> Bounce: 0 kB
> CommitLimit: 4587420 kB
> Committed_AS: 261212 kB
> VmallocTotal: 8589934592 kB
> VmallocUsed: 6352 kB
> VmallocChunk: 8589928088 kB
>
> Here's what's under the hood:
>
> # lspci
> 0000:f0:0b.0 Host bridge: Apple Computer Inc. U3H AGP Bridge
> 0001:00:01.0 PCI bridge: Advanced Micro Devices [AMD] AMD-8131 PCI-X
> Bridge (rev 12)
> 0001:00:02.0 PCI bridge: Advanced Micro Devices [AMD] AMD-8131 PCI-X
> Bridge (rev 12)
> 0001:00:03.0 PCI bridge: Apple Computer Inc. K2 HT-PCI Bridge
> 0001:00:04.0 PCI bridge: Apple Computer Inc. K2 HT-PCI Bridge
> 0001:00:05.0 PCI bridge: Apple Computer Inc. K2 HT-PCI Bridge
> 0001:00:06.0 PCI bridge: Apple Computer Inc. K2 HT-PCI Bridge
> 0001:00:07.0 PCI bridge: Apple Computer Inc. K2 HT-PCI Bridge
> 0001:01:07.0 Class ff00: Apple Computer Inc. K2 KeyLargo Mac/IO (rev 60)
> 0001:02:0b.0 USB Controller: NEC Corporation USB (rev 43)
> 0001:02:0b.1 USB Controller: NEC Corporation USB (rev 43)
> 0001:02:0b.2 USB Controller: NEC Corporation USB 2.0 (rev 04)
> 0001:03:0d.0 Class ff00: Apple Computer Inc. K2 ATA/100
> 0001:03:0e.0 FireWire (IEEE 1394): Apple Computer Inc. K2 FireWire
> 0001:05:0c.0 IDE interface: Broadcom K2 SATA
> 0001:06:02.0 VGA compatible controller: ATI Technologies Inc Radeon
> RV100 QY [Radeon 7000/VE]
> 0001:06:03.0 Fibre Channel: QLogic Corp. ISP2422-based 4Gb Fibre
> Channel to PCI-X HBA (rev 02)
> 0001:06:03.1 Fibre Channel: QLogic Corp. ISP2422-based 4Gb Fibre
> Channel to PCI-X HBA (rev 02)
> 0001:07:04.0 Ethernet controller: Broadcom Corporation NetXtreme
> BCM5704 Gigabit Ethernet (rev 03)
> 0001:07:04.1 Ethernet controller: Broadcom Corporation NetXtreme
> BCM5704 Gigabit Ethernet (rev 03)
>
> With glusterfs-1.3.10 and fuse-2.7.3glfs10 compiled from source. Any
> help would be greatly appreciated.
>
> Thanks,
> Sabuj Pattanayek
> Senior SysAdmin
> http://structbio.vanderbilt.edu
>
>
> _______________________________________________
> Gluster-devel mailing list
> Gluster-devel at nongnu.org
> http://lists.nongnu.org/mailman/listinfo/gluster-devel
>
--
If I traveled to the end of the rainbow
As Dame Fortune did intend,
Murphy would be there to tell me
The pot's at the other end.