[Gluster-devel] Confused on AFR, where does it happen client or server

Anand Avati avati at zresearch.com
Tue Jan 8 08:44:34 UTC 2008


Was it just one file being read/written? If not, please use rr (a more
deterministic scheduler) and share the numbers.
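
If you want to try that, switching the scheduler should just be a matter of
editing the unify volume in your client volfile; a minimal sketch (untested,
volume names taken from your configs below):

  volume bricks
    type cluster/unify
    subvolumes fsc1 fsc2        # or afr1 afr2 in the client-side-afr volfile
    option namespace afrns
    option scheduler rr         # round-robin placement instead of alu
  end-volume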

About the create and delete rate: client-side afr is definitely faster, since
the create operations happen in parallel (w.r.t. the network - 1x the time),
but if you have afr on the server, they happen serially across the machines
(2x the time: 1x up to the 1st server, and 1x to the remaining N-1 servers).
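
(Purely illustrative numbers: if one network hop for a create costs, say,
0.2 ms, client-side afr issues both creates at once and pays roughly 0.2 ms
per file, while server-side afr pays roughly 0.2 ms to reach the first server
plus another 0.2 ms for the forwarded copy - about 0.4 ms per file, hence the
1x vs. 2x above.)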

Note that for such a configuration unify is not needed; removing unify and
having plain AFR will increase your create rate further, since unify
serializes creates across the namespace and storage.
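
For example, a plain-AFR client volfile for your two servers could be as
small as something like this (host and volume names borrowed from your
configs below; an untested sketch, not a drop-in replacement):

  volume fsc1
    type protocol/client
    option transport-type tcp/client
    option remote-host 10.10.1.10
    option remote-subvolume brick1
  end-volume

  volume fsc2
    type protocol/client
    option transport-type tcp/client
    option remote-host 10.10.1.99
    option remote-subvolume brick1
  end-volume

  volume afr1
    type cluster/afr
    subvolumes fsc1 fsc2    # the client writes both copies; no unify/namespace on top
  end-volume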

About throughput, the writes seem to be faster with server AFR (or did my
eyes deceive me with that indentation?) and reads faster with client AFR.

The faster writes might be because of a good job done by the full-duplex
switch. When afr is on the client side, both copies go out on the same
outbound channel of the client NIC, effectively writing the two copies
serially; but when it is on the server side, the replication copy uses the
outbound channel of the server NIC while the main loop fetches the next
write block in parallel on the server's inbound channel. Using io-threads
between afr and protocol/client on the replication path might help further.
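
Concretely, in your server-side-afr volfile that would mean loading an
io-threads volume between afr1 and the protocol/client that points at the
other box, roughly like this (the brick1r-iot name is just made up for the
example):

  volume brick1r            # connection to the replica brick on the other server
    type protocol/client
    option transport-type tcp/client
    option remote-host 10.10.1.99
    option remote-subvolume brick2
  end-volume

  volume brick1r-iot        # io-threads on the replication path
    type performance/io-threads
    option thread-count 8
    option queue-limit 1024
    subvolumes brick1r
  end-volume

  volume afr1
    type cluster/afr
    subvolumes brick1 brick1r-iot
  end-volume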

The slower reads might be because... well, I'm still not sure. Maybe you have
secretly applied the striped readv patch for afr? :)

avati

2008/1/8, Sascha Ottolski <ottolski at web.de>:
>
> On Tuesday, 08 January 2008 06:06:30, Anand Avati wrote:
> > > > Brandon,
> > > >  who does the copy is decided by where the AFR translator is loaded.
> > > > If you have AFR loaded on the client side, then the client does the
> > > > two writes. You can also have AFR loaded on the server side, and have
> > > > the server do the replication. Translators can be loaded anywhere
> > > > (client or server, anywhere in the graph). You need to think more
> > > > along the lines of how you can 'program glusterfs' rather than how to
> > > > 'configure glusterfs'.
> > >
> > > Performance-wise, which is better?  Or does it make sense one way vs.
> > > the other based on number of clients?
> >
> > Depends. If the interconnect between server and client is precious, then
> > have the servers replicate (load afr on the server side) with replication
> > happening on a separate network. This is also good if you have servers
> > interconnected with high-speed networks like InfiniBand.
> >
> > If your servers have just one network interface (no separate network
> > for replication), and your client apps are IO bound, then it does not
> > matter where you load AFR; they would all give the same performance.
> >
> > avati
>
> I did a simple test recently which suggests that there is a significant
> performance difference: I compared client- vs. server-side afr with bonnie,
> for a setup with one client and two servers running tla patch628, connected
> over GB Ethernet; please see my results below.
>
> There was also a posting on this list with a lot of test results,
> suggesting that server-side afr is fastest:
> http://lists.nongnu.org/archive/html/gluster-devel/2007-08/msg00136.html
>
> In my own results, though, client-side afr seems to be better in most of
> the tests. I should note that I'm not sure whether the chosen setup (two
> servers afr-ing each other) has a negative impact on performance, so any
> comments on this would be highly appreciated (I've added the configs for
> the tests below).
>
> server side afr (I hope it stays readable):
>
> Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
>                     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
> Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
> stf-db22     31968M 31438  43 35528   0   990   0 32375  43 41107   1  38.1   0
>                     ------Sequential Create------ --------Random Create--------
>                     -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
>               files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
>                  16    34   0   416   0   190   0    35   0   511   0   227   0
>
>
> client side afr:
>
> Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
>                     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
> Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
> stf-db22     31968M 27583  38 31518   0   862   0 49522  63 56388   2  28.0   0
>                     ------Sequential Create------ --------Random Create--------
>                     -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
>               files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
>                  16   418   0  2225   1   948   1   455   0  2305   1   947   0
>
>
>
> server side afr config:
>
>
> glusterfs-server.vol.server_afr:
>
>   volume fsbrick1
>     type storage/posix
>     option directory /data1
>   end-volume
>
>   volume fsbrick2
>     type storage/posix
>     option directory /data2
>   end-volume
>
>   volume nsfsbrick1
>     type storage/posix
>     option directory /data-ns1
>   end-volume
>
>   volume brick1
>     type performance/io-threads
>     option thread-count 8
>     option queue-limit 1024
>     subvolumes fsbrick1
>   end-volume
>
>   volume brick2
>     type performance/io-threads
>     option thread-count 8
>     option queue-limit 1024
>     subvolumes fsbrick2
>   end-volume
>
>   volume brick1r
>     type protocol/client
>     option transport-type tcp/client
>     option remote-host 10.10.1.99
>     option remote-subvolume brick2
>   end-volume
>
>   volume afr1
>     type cluster/afr
>     subvolumes brick1 brick1r
>     # option replicate *:2 # obsolete with tla snapshot
>   end-volume
>
>   ### Add network serving capability to above bricks.
>   volume server
>     type protocol/server
>     option transport-type tcp/server     # For TCP/IP transport
>     option listen-port 6996              # Default is 6996
>     option client-volume-filename /etc/glusterfs/glusterfs-client.vol
>     subvolumes afr1 nsfsbrick1
>     option auth.ip.afr1.allow * # Allow access to the afr1 volume
>     option auth.ip.brick2.allow * # Allow access to brick2 (used by the peer server for replication)
>     option auth.ip.nsfsbrick1.allow * # Allow access to the namespace volume
>   end-volume
>
>
>
>
>
>
> glusterfs-client.vol.test.server_afr:
>
>   volume fsc1
>     type protocol/client
>     option transport-type tcp/client
>     option remote-host 10.10.1.10
>     option remote-subvolume afr1
>   end-volume
>
>   volume fsc2
>     type protocol/client
>     option transport-type tcp/client
>     option remote-host 10.10.1.99
>     option remote-subvolume afr1
>   end-volume
>
>   volume ns1
>     type protocol/client
>     option transport-type tcp/client
>     option remote-host 10.10.1.10
>     option remote-subvolume nsfsbrick1
>   end-volume
>
>   volume ns2
>     type protocol/client
>     option transport-type tcp/client
>     option remote-host 10.10.1.99
>     option remote-subvolume nsfsbrick1
>   end-volume
>
>   volume afrns
>     type cluster/afr
>     subvolumes ns1 ns2
>   end-volume
>
>   volume bricks
>     type cluster/unify
>     subvolumes fsc1 fsc2
>     option namespace afrns
>     option scheduler alu
>     option alu.limits.min-free-disk  5%   # Stop creating files when free space < 5%
>     option alu.limits.max-open-files 10000
>     option alu.order disk-usage:read-usage:write-usage:open-files-usage:disk-speed-usage
>     option alu.disk-usage.entry-threshold 2GB    # Units in KB, MB and GB are allowed
>     option alu.disk-usage.exit-threshold  60MB   # Units in KB, MB and GB are allowed
>     option alu.open-files-usage.entry-threshold 1024
>     option alu.open-files-usage.exit-threshold 32
>     option alu.stat-refresh.interval 10sec
>   end-volume
>
>   volume readahead
>     type performance/read-ahead
>     option page-size 256KB
>     option page-count 2
>     subvolumes bricks
>   end-volume
>
>   volume write-behind
>     type performance/write-behind
>     option aggregate-size 1MB
>     subvolumes readahead
>   end-volume
>
>
> -----------------------------------------------------------------------
>
> client side afr config:
>
>
> glusterfs-server.vol.client_afr:
>
>   volume fsbrick1
>     type storage/posix
>     option directory /data1
>   end-volume
>
>   volume fsbrick2
>     type storage/posix
>     option directory /data2
>   end-volume
>
>   volume nsfsbrick1
>     type storage/posix
>     option directory /data-ns1
>   end-volume
>
>   volume brick1
>     type performance/io-threads
>     option thread-count 8
>     option queue-limit 1024
>     subvolumes fsbrick1
>   end-volume
>
>   volume brick2
>     type performance/io-threads
>     option thread-count 8
>     option queue-limit 1024
>     subvolumes fsbrick2
>   end-volume
>
>   ### Add network serving capability to above bricks.
>   volume server
>     type protocol/server
>     option transport-type tcp/server     # For TCP/IP transport
>     option listen-port 6996              # Default is 6996
>     option client-volume-filename /etc/glusterfs/glusterfs-client.vol
>     subvolumes brick1 brick2 nsfsbrick1
>     option auth.ip.brick1.allow * # Allow access to the brick1 volume
>     option auth.ip.brick2.allow * # Allow access to the brick2 volume
>     option auth.ip.nsfsbrick1.allow * # Allow access to the namespace volume
>   end-volume
>
>
>
>
>
>
> glusterfs-client.vol.test.client_afr:
>
>   volume fsc1
>     type protocol/client
>     option transport-type tcp/client
>     option remote-host 10.10.1.10
>     option remote-subvolume brick1
>   end-volume
>
>   volume fsc1r
>     type protocol/client
>     option transport-type tcp/client
>     option remote-host 10.10.1.10
>     option remote-subvolume brick2
>   end-volume
>
>   volume fsc2
>     type protocol/client
>     option transport-type tcp/client
>     option remote-host 10.10.1.99
>     option remote-subvolume brick1
>   end-volume
>
>   volume fsc2r
>     type protocol/client
>     option transport-type tcp/client
>     option remote-host 10.10.1.99
>     option remote-subvolume brick2
>   end-volume
>
>   volume afr1
>     type cluster/afr
>     subvolumes fsc1 fsc2r
>     # option replicate *:2 # obsolete with tla snapshot
>   end-volume
>
>   volume afr2
>     type cluster/afr
>     subvolumes fsc2 fsc1r
>     # option replicate *:2 # obsolete with tla snapshot
>   end-volume
>
>   volume ns1
>     type protocol/client
>     option transport-type tcp/client
>     option remote-host 10.10.1.10
>     option remote-subvolume nsfsbrick1
>   end-volume
>
>   volume ns2
>     type protocol/client
>     option transport-type tcp/client
>     option remote-host 10.10.1.99
>     option remote-subvolume nsfsbrick1
>   end-volume
>
>   volume afrns
>     type cluster/afr
>     subvolumes ns1 ns2
>     # option replicate *:2 # obsolete with tla snapshot
>   end-volume
>
>   volume bricks
>     type cluster/unify
>     subvolumes afr1 afr2
>     option namespace afrns
>     option scheduler alu
>     option alu.limits.min-free-disk  5%   # Stop creating files when free space < 5%
>     option alu.limits.max-open-files 10000
>     option alu.order disk-usage:read-usage:write-usage:open-files-usage:disk-speed-usage
>     option alu.disk-usage.entry-threshold 2GB    # Units in KB, MB and GB are allowed
>     option alu.disk-usage.exit-threshold  60MB   # Units in KB, MB and GB are allowed
>     option alu.open-files-usage.entry-threshold 1024
>     option alu.open-files-usage.exit-threshold 32
>     option alu.stat-refresh.interval 10sec
>   end-volume
>
>   volume readahead
>     type performance/read-ahead
>     option page-size 256KB
>     option page-count 2
>     subvolumes bricks
>   end-volume
>
>   volume write-behind
>     type performance/write-behind
>     option aggregate-size 1MB
>     subvolumes readahead
>   end-volume
>
>
> Cheers, Sascha
>
>
> _______________________________________________
> Gluster-devel mailing list
> Gluster-devel at nongnu.org
> http://lists.nongnu.org/mailman/listinfo/gluster-devel
>



-- 
If I traveled to the end of the rainbow
As Dame Fortune did intend,
Murphy would be there to tell me
The pot's at the other end.


