[Gluster-users] Optimizing write performance to a few large files in a small cluster

Tue Mar 11 09:37:53 UTC 2014

Alexander:

I have also experienced the stalls you are describing. This was in a two-brick setup running replicated volumes, used by a 20-node HPC cluster.

In my case this was solved by: 

* Replacing the FUSE client with NFS mounts
	* This was by far the biggest booster
* RAM disks for the scratch directories (not connected to gluster at all)
	* If you’re not sure where these directories are, run ‘gluster volume top <volume> write list-cnt 10’
* ‘tuned-adm profile; tuned-adm profile rhs-high-throughput’ on all storage bricks
* The following volume options
	* cluster.nufa: enable
	* performance.quick-read: on
	* performance.open-behind: on
* Mount option on clients
	* noatime
		* Use only where access time isn’t needed.
		* Major booster for small-file writes in my case, even with the FUSE client.
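
The steps above could be scripted roughly as follows. This is only a sketch under assumptions, not the exact commands from my setup: the volume name, server name, mount point, scratch path and tmpfs size are all hypothetical placeholders, and vers=3 is there because the built-in gluster NFS server speaks NFSv3. By default it only prints the commands (DRY_RUN=1) so you can review them before running anything.

```shell
#!/bin/sh
# Hypothetical placeholders: replace with values from your own cluster.
VOLUME="${VOLUME:-scratchvol}"
SERVER="${SERVER:-storage1}"
MOUNTPOINT="${MOUNTPOINT:-/mnt/gluster}"
SCRATCH_DIR="${SCRATCH_DIR:-/scratch}"
SCRATCH_SIZE="${SCRATCH_SIZE:-2g}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = print the commands only, 0 = execute them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# On one storage server: set the volume options listed above.
apply_volume_options() {
    run gluster volume set "$VOLUME" cluster.nufa enable
    run gluster volume set "$VOLUME" performance.quick-read on
    run gluster volume set "$VOLUME" performance.open-behind on
}

# On every storage server: switch to the throughput tuning profile.
apply_tuned_profile() {
    run tuned-adm profile rhs-high-throughput
}

# On every client: mount over NFS instead of FUSE, with noatime.
mount_over_nfs() {
    run mount -t nfs -o vers=3,noatime "$SERVER:/$VOLUME" "$MOUNTPOINT"
}

# On every compute node: RAM disk for scratch data, outside gluster.
mount_ram_scratch() {
    run mount -t tmpfs -o "size=$SCRATCH_SIZE" tmpfs "$SCRATCH_DIR"
}

apply_volume_options
apply_tuned_profile
mount_over_nfs
mount_ram_scratch
```

Run it once as-is to see what would be executed, then set DRY_RUN=0 and run only the relevant function on each class of machine. Keep in mind that anything written to the tmpfs scratch directory is lost on reboot and counts against the node's RAM.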

Hope this helps, 

Regards,
Robin

On 10 Mar 2014, at 19:06 pm, Alexander Valys <avalys at avalys.net> wrote:

> A quick performance question.
> 
> I have a small cluster of 4 machines, 64 cores in total.  I am running a scientific simulation on them, which writes at between 0.1 and 10 MB/s (total) to roughly 64 HDF5 files.  Each HDF5 file is written by only one process.  The writes are not continuous, but consist of writing roughly 1 MB of data to each file every few seconds.    
> 
> Writing to HDF5 involves a lot of reading the file metadata and random seeking within the file,  since we are actually writing to about 30 datasets inside each file.  I am hosting the output on a distributed gluster volume (one brick local to each machine) to provide a unified namespace for the (very rare) case when each process needs to read the other's files.  
> 
> I am seeing somewhat lower performance than I expected, i.e. a factor of approximately 4 less throughput than each node writing locally to the bare drives.  I expected the write-behind cache to buffer each write, but it seems that the writes are being quickly flushed across the network regardless of what write-behind cache size I use (32 MB currently), and the simulation stalls while waiting for the I/O operation to finish.  Anyone have any suggestions as to what to look at?  I am using gluster 3.4.2 on ubuntu 12.04.  I have flush-behind turned on, and have mounted the volume with direct-io-mode=disable, and have the cache size set to 256M.  
> 
> The nodes are connected via a dedicated gigabit ethernet network, carrying only gluster traffic (no simulation traffic).
> 
> (sorry if this message comes through twice, I sent it yesterday but was not subscribed)
