[Gluster-users] Glusterfs performance with large directories
Arend-Jan Wijtzes
ajwytzes at wise-guys.nl
Wed Oct 15 12:19:35 UTC 2008
We at Wiseguys are looking into GlusterFS to run our Internet Archive.
The archive stores webpages collected by our spiders.
The test setup consists of three data machines, each exporting a volume
of about 3.7TB, and one namespace machine.
File layout is such that each host has its own directory; for example, the
GlusterFS website would be located in:
<fs_root>/db/org/g/www.glusterfs.org/
Each directory holds a small number of potentially large data files.
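A typical host directory therefore looks something like this (file names are
made up for illustration):

<fs_root>/db/nl/w/www.wise-guys.nl/
    pages-000001.arc.gz
    pages-000002.arc.gz
    pages.idx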
A similar setup on local disks (without GlusterFS) has proven its capabilities
over the years.
We use a distributed computing model: each node in the archive runs one
or more processes that update the archive. We use the nufa scheduler to favor
local files, and a distributed hashing algorithm to keep data from moving
between nodes (unless the configuration changes, of course).
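The assignment is roughly along these lines (a simplified shell sketch, not
our actual code; host and node names are illustrative):

host=www.wise-guys.nl
node_count=3
# hash the host name and map it to one of the data nodes;
# the mapping only changes when the node count changes
node=$(( 0x$(printf '%s' "$host" | md5sum | cut -c1-8) % node_count ))
echo "$host is handled by archive$node"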
I've included the GlusterFS configuration at the bottom of this e-mail.
Data access and throughput are pretty good (good enough), but calling stat()
on a directory can take extraordinarily long. Here, for example, is a listing
of the .nl top-level domain:
vagabond at spider2:~/archive/db/nl$ time ls
0/ 2/ 4/ 6/ 8/ a/ c/ e/ g/ i/ k/ m/ o/ q/ s/ u/ w/ y/
1/ 3/ 5/ 7/ 9/ b/ d/ f/ h/ j/ l/ n/ p/ r/ t/ v/ x/ z/
real 4m28.373s
user 0m0.004s
sys 0m0.000s
The same operation performed directly on the local filesystem of the namespace
node returns almost instantly (also for large directories):
time ls /local.mnt/md0/glfs-namespace/db/nl/a | wc -l
17506
real 0m0.043s
user 0m0.032s
sys 0m0.012s
A trace of the namespace gluster daemon shows that it is performing an
lstat() on all the subdirectories (nl/0/*, nl/1/*, etc.), information that
IMO is not needed at this point. In our case the total number of directories
on the filesystem runs into the many millions, so this behaviour hurts
performance.
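(For reference, the trace was taken with something like the following,
attached to the namespace glusterfsd; the pid is just a placeholder:)

strace -f -p <pid-of-namespace-glusterfsd> -e trace=stat,lstat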
Now for our questions:
* is this expected to scale to tens of millions of directories?
* is this behaviour a necessity for GlusterFS to operate correctly or is
it some form of performance optimisation? Is it tunable?
* what exactly is the sequence of events when handling a directory listing?
Is this request handled by the namespace node only?
* is there anything we can tune or change to speed up directory access?
(a rough example of what we mean is sketched right below)
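To illustrate that last point: we wondered whether something like the
stat-prefetch translator is the intended answer here, assuming it is still
available in these releases, e.g. on the client side:

volume prefetch
  type performance/stat-prefetch
  subvolumes unify
end-volume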
Thanks for your time,
Arend-Jan
**** Hardware config ****
data nodes
- 1 x Xeon quad core 2.5 GHz
- 4 x Barracuda ES.2 SATA 3.0-Gb/s 1-TB Hard Drive
Disks configured in RAID0, 128k chunks
Filesystem XFS
Network: gigabit LAN
namespace node
- 2 x Xeon quad core 2.5 GHz
- 4 x Cheetah® 15K.5 U320 SCSI Hard Drives
Disks configured in RAID1 (1 mirror, 1 spare)
Filesystem XFS
Network: gigabit LAN
GlusterFS version: 1.3.11 with FUSE fuse-2.7.3glfs10
GlusterFS version: 1.4-pre5 with FUSE fuse-2.7.3glfs10
**** GlusterFS data node config ****
volume brick-posix0
  type storage/posix
  option directory /local.mnt/md0/glfs-data
end-volume

volume brick-lock0
  type features/posix-locks
  subvolumes brick-posix0
end-volume

volume brick-fixed0
  type features/fixed-id
  option fixed-uid 2224
  option fixed-gid 224
  subvolumes brick-lock0
end-volume

volume brick-iothreads0
  type performance/io-threads
  option thread-count 4
  subvolumes brick-fixed0
end-volume

volume brick0
  type performance/read-ahead
  subvolumes brick-iothreads0
end-volume

volume server
  type protocol/server
  option transport-type tcp/server
  subvolumes brick0
  option auth.ip.brick0.allow 10.1.0.*
end-volume
**** GlusterFS namespace config ****
volume brick-posix
  type storage/posix
  option directory /local.mnt/md0/glfs-namespace
end-volume

volume brick-namespace
  type features/fixed-id
  option fixed-uid 2224
  option fixed-gid 224
  subvolumes brick-posix
end-volume

volume server
  type protocol/server
  option transport-type tcp/server
  subvolumes brick-namespace
  option auth.ip.brick-namespace.allow 10.1.0.*
end-volume
**** GlusterFS client config ****
volume brick-0-0
  type protocol/client
  option transport-type tcp/client
  option remote-host archive0
  option remote-subvolume brick0
end-volume

volume brick-1-0
  type protocol/client
  option transport-type tcp/client
  option remote-host archive1
  option remote-subvolume brick0
end-volume

volume brick-2-0
  type protocol/client
  option transport-type tcp/client
  option remote-host archive2
  option remote-subvolume brick0
end-volume

volume ns0
  type protocol/client
  option transport-type tcp/client
  option remote-host archivens0
  option remote-subvolume brick-namespace
end-volume

volume unify
  type cluster/unify
  option namespace ns0
  option scheduler nufa
  option nufa.local-volume-name brick-2-0  # depends on data node of course
  option nufa.limits.min-free-disk 10%
  subvolumes brick-0-0 brick-1-0 brick-2-0
end-volume
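(For completeness: the clients mount the above with something along these
lines; the exact paths are illustrative.)

glusterfs -f /etc/glusterfs/glusterfs-client.vol /home/vagabond/archive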
--
Arend-Jan Wijtzes -- Wiseguys -- www.wise-guys.nl