[Gluster-users] glusterfs for cloud storage
Wei Dong
wdong.pku at gmail.com
Thu Aug 20 16:15:48 UTC 2009
Hi All,
We are using glusterfs on our lab cluster for a shared storage to save a
large number of image files, about 30 million at the moment. We use
Hadoop for distributed computing, but we are reluctant to store small
files on Hadoop because of its low throughput on small files and its
non-standard filesystem interface (e.g. we would not be able to run
convert on each image to produce a thumbnail if the files were stored
in Hadoop). What we do now is to store a list of paths to all images
in Hadoop, and use Hadoop streaming to pipe the paths to a script,
which then reads the images from the glusterfs filesystem and does the
processing. This has been working for a while, as long as glusterfs
doesn't hang, but the problem is that we basically lose all data
locality. We have 66 nodes, so the chance that a needed file is on the
local disk is only 1/66, and the other 65/66 of file I/O has to go
over the network, which makes me very uncomfortable. I'm wondering if
there's a better way of making glusterfs and Hadoop work together to
take advantage of data locality.
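For concreteness, our current pipeline looks roughly like the sketch
below: a Hadoop streaming mapper that reads one image path per line
from stdin and shells out to ImageMagick's convert. The mount point,
output directory, and thumbnail size here are made up for illustration;
only the overall shape matches what we actually run.

```python
import os
import subprocess
import sys

GLUSTER_MOUNT = "/mnt/glusterfs"         # hypothetical glusterfs mount
THUMB_DIR = "/mnt/glusterfs/thumbs"      # hypothetical output directory

def thumbnail_command(image_path, thumb_dir=THUMB_DIR):
    """Build the `convert` command line for one image path."""
    name = os.path.basename(image_path)
    out = os.path.join(thumb_dir, name)
    return ["convert", image_path, "-resize", "128x128", out]

def run_mapper(lines):
    """Hadoop streaming mapper: one image path per input line.

    Every read goes through the glusterfs mount, so whether the file
    happens to be on the local disk is up to glusterfs's placement --
    this is exactly where we lose data locality.
    """
    for line in lines:
        path = line.strip()
        if not path:
            continue
        subprocess.call(thumbnail_command(path))

if __name__ == "__main__":
    run_mapper(sys.stdin)
```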
I know that there's a nufa translator which gives preference to the
local drive. This is good enough if the assignment of files to nodes
is fixed. But if we instead want to assign processing tasks to nodes
according to the location of the files, what interface should we use
to get the physical location of a file?
I appreciate all your suggestions.
- Wei Dong