[Gluster-devel] seeking advice: how to upgrade from 1.3.0pre4 to tla patch628?
Anand Avati
avati at zresearch.com
Tue Jan 8 09:06:33 UTC 2008
Sascha,
a few points -
1. do you really want 4 copies of the NS with AFR? I personally think that
is overkill; 2 should be sufficient (see the config sketch at the end of this
mail).
2. as you rightly mentioned, it might be the self-heal which is slowing
things down. Do you have directories with a LOT of files at the immediate
level? The self-heal is being heavily reworked to be more memory and CPU
efficient and will be completed very soon. If you do have a LOT of files in a
directory (not subdirs), then it would help to recreate the NS offline and
slip it in with the upgraded glusterfs. One half-efficient way:
on each server:
mkdir /partial-ns-tmp
(cd /data/export/dir ; find . -type d) | (cd /partial-ns-tmp ; xargs mkdir -p)
(cd /data/export/dir ; find . -type f) | (cd /partial-ns-tmp ; xargs touch)
now tar up /partial-ns-tmp on each server and extract the tarballs over each
other on the namespace server. I assume you do not have special FIFO or
device files; if you do, recreate them the same way as the directories :)
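putting those steps together for your particular layout (brick paths and
addresses taken from the configs you quoted below; adjust them if the live
systems differ), a rough, untested sketch of the whole offline rebuild could
look like this:

# on each of the four servers: rebuild a partial namespace from both local
# bricks (directories as real directories, files as empty placeholders)
mkdir /partial-ns-tmp
for brick in /data1 /data2; do
    (cd $brick ; find . -type d -print0) | (cd /partial-ns-tmp ; xargs -0 mkdir -p)
    (cd $brick ; find . -type f -print0) | (cd /partial-ns-tmp ; xargs -0 touch)
done
tar -C /partial-ns-tmp -czf /tmp/partial-ns.tar.gz .

# then, on each machine that will keep a namespace brick (/data-ns1 in your
# config), extract the four tarballs over one another so that the union of
# all bricks ends up in the namespace
for host in 10.10.1.95 10.10.1.96 10.10.1.97 10.10.1.98; do
    scp $host:/tmp/partial-ns.tar.gz /tmp/partial-ns-$host.tar.gz
    tar -C /data-ns1 -xzf /tmp/partial-ns-$host.tar.gz
done

the -print0 / xargs -0 pair is only there to survive filenames with spaces or
other odd characters; the plain find | xargs form above is fine otherwise.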
the updated self-heal should handle such cases much better (assuming your
problem is LOTS of files in the same dir and/or LOTS of such dirs).
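regarding point 1, the only change needed on the client side is to list just
two of the namespace clients under the namespace AFR instead of all four, for
example (reusing the volume names from your client config; which two servers
you keep is up to you):

volume afrns
type cluster/afr
subvolumes ns1 ns2
end-volume

the ns3/ns4 protocol/client volumes, and the nsfsbrick1 exports on the
corresponding servers, can then be dropped.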
avati
2008/1/8, Sascha Ottolski <ottolski at web.de>:
>
> Hi list,
>
> after some rather depressing unsuccessful attempts, I'm wondering if someone
> has a hint what we could do to accomplish the above task on a productive
> system: every time we have tried it so far, we had to roll back, since the
> load of the servers and the clients climbed so high that our application
> became unusable.
>
> Our understanding is that the introduction of the namespace is killing us,
> but we have not found a way to get around the problem.
>
> The setup: 4 servers, each has two bricks and a namespace brick; the bricks
> are on separate RAID arrays. The clients do an AFR so that servers 1 and 2
> mirror each other, as do 3 and 4. After that, the four resulting AFRs are
> unified (see config below). The setup is working so far, but not very
> stably (i.e. we see memory leaks on the client side). The upgraded version
> has the four namespaces AFR-ed as well. We have about 20 clients connected
> that only write, and do so rarely, and 7 clients that only but massively
> read (that is, Apache webservers serving the images). All machines are
> connected through GB Ethernet.
>
> Maybe the source of the problem is what we store on the cluster: that's
> about 12 million images, adding up to a size of ~300 GB, in a very, very
> nested directory structure. So, lots of relatively small files. And we are
> about to add another 15 million files of even smaller size; they consume
> only 50 GB in total, most of them only 1 or 2 KB in size.
>
> Now, if we start the new gluster with a new, empty namespace, it only takes
> minutes for the load on the servers to reach around 1.5, and on the reading
> clients to jump as high as 200(!). Obviously, no more images get delivered
> to the connected browsers. You can imagine that we did not even remotely
> think of adding the load of rebuilding the namespace by force, so all the
> load seems to be coming from self-heal.
>
> In an earlier attempt with 1.3.2, this picture didn't change much even after
> a forced rebuild of the namespace (which took about 24(!) hours). Also,
> using only one namespace brick and no AFR did help (but it became clear that
> the server with the namespace was much more loaded than the others).
>
> So far, we have not found a proper way to simulate the problems on a test
> system, which makes it even harder to find a solution :-(
>
> One idea that comes to mind: could we somehow prepare the namespace bricks
> on the old-version cluster, to reduce the need for the self-healing
> mechanism after the upgrade?
>
> Thanks for reading this much. I hope I've drawn the picture thoroughly;
> please let me know if anything is missing.
>
>
> Cheers, Sascha
>
>
> server config:
>
> volume fsbrick1
> type storage/posix
> option directory /data1
> end-volume
>
> volume fsbrick2
> type storage/posix
> option directory /data2
> end-volume
>
> volume nsfsbrick1
> type storage/posix
> option directory /data-ns1
> end-volume
>
> volume brick1
> type performance/io-threads
> option thread-count 8
> option queue-limit 1024
> subvolumes fsbrick1
> end-volume
>
> volume brick2
> type performance/io-threads
> option thread-count 8
> option queue-limit 1024
> subvolumes fsbrick2
> end-volume
>
> ### Add network serving capability to above bricks.
> volume server
> type protocol/server
> option transport-type tcp/server # For TCP/IP transport
> option listen-port 6996 # Default is 6996
> option client-volume-filename /etc/glusterfs/glusterfs-client.vol
> subvolumes brick1 brick2 nsfsbrick1
> option auth.ip.brick1.allow * # Allow access to "brick" volume
> option auth.ip.brick2.allow * # Allow access to "brick" volume
> option auth.ip.nsfsbrick1.allow * # Allow access to "brick" volume
> end-volume
>
> -----------------------------------------------------------------------
>
> client config
>
> volume fsc1
> type protocol/client
> option transport-type tcp/client
> option remote-host 10.10.1.95
> option remote-subvolume brick1
> end-volume
>
> volume fsc1r
> type protocol/client
> option transport-type tcp/client
> option remote-host 10.10.1.95
> option remote-subvolume brick2
> end-volume
>
> volume fsc2
> type protocol/client
> option transport-type tcp/client
> option remote-host 10.10.1.96
> option remote-subvolume brick1
> end-volume
>
> volume fsc2r
> type protocol/client
> option transport-type tcp/client
> option remote-host 10.10.1.96
> option remote-subvolume brick2
> end-volume
>
> volume fsc3
> type protocol/client
> option transport-type tcp/client
> option remote-host 10.10.1.97
> option remote-subvolume brick1
> end-volume
>
> volume fsc3r
> type protocol/client
> option transport-type tcp/client
> option remote-host 10.10.1.97
> option remote-subvolume brick2
> end-volume
>
> volume fsc4
> type protocol/client
> option transport-type tcp/client
> option remote-host 10.10.1.98
> option remote-subvolume brick1
> end-volume
>
> volume fsc4r
> type protocol/client
> option transport-type tcp/client
> option remote-host 10.10.1.98
> option remote-subvolume brick2
> end-volume
>
> volume afr1
> type cluster/afr
> subvolumes fsc1 fsc2r
> end-volume
>
> volume afr2
> type cluster/afr
> subvolumes fsc2 fsc1r
> end-volume
>
> volume afr3
> type cluster/afr
> subvolumes fsc3 fsc4r
> end-volume
>
> volume afr4
> type cluster/afr
> subvolumes fsc4 fsc3r
> end-volume
>
> volume ns1
> type protocol/client
> option transport-type tcp/client
> option remote-host 10.10.1.95
> option remote-subvolume nsfsbrick1
> end-volume
>
> volume ns2
> type protocol/client
> option transport-type tcp/client
> option remote-host 10.10.1.96
> option remote-subvolume nsfsbrick1
> end-volume
>
> volume ns3
> type protocol/client
> option transport-type tcp/client
> option remote-host 10.10.1.97
> option remote-subvolume nsfsbrick1
> end-volume
>
> volume ns4
> type protocol/client
> option transport-type tcp/client
> option remote-host 10.10.1.98
> option remote-subvolume nsfsbrick1
> end-volume
>
> volume afrns
> type cluster/afr
> subvolumes ns1 ns2 ns3 ns4
> end-volume
>
> volume bricks
> type cluster/unify
> subvolumes afr1 afr2 afr3 afr4
> option namespace afrns
> option scheduler alu
> option alu.limits.min-free-disk 5%
> option alu.limits.max-open-files 10000
> option alu.order disk-usage:read-usage:write-usage:open-files-usage:disk-speed-usage
> option alu.disk-usage.entry-threshold 2GB
> option alu.disk-usage.exit-threshold 60MB
> option alu.open-files-usage.entry-threshold 1024
> option alu.open-files-usage.exit-threshold 32
> end-volume
>
> volume readahead
> type performance/read-ahead
> option page-size 256KB
> option page-count 2
> subvolumes bricks
> end-volume
>
> volume write-behind
> type performance/write-behind
> option aggregate-size 1MB
> subvolumes readahead
> end-volume
>
> volume io-cache
> type performance/io-cache
> option page-size 128KB
> option cache-size 64MB
> subvolumes write-behind
> end-volume
>
>
>
> _______________________________________________
> Gluster-devel mailing list
> Gluster-devel at nongnu.org
> http://lists.nongnu.org/mailman/listinfo/gluster-devel
>
--
If I traveled to the end of the rainbow
As Dame Fortune did intend,
Murphy would be there to tell me
The pot's at the other end.