[Gluster-users] Glusterfs performance with large directories
Arend-Jan Wijtzes
ajwytzes at wise-guys.nl
Wed Oct 15 12:19:35 UTC 2008
We at Wiseguys are looking into GlusterFS to run our Internet Archive.
The archive stores webpages collected by our spiders.
The test setup consists of three data machines, each exporting a volume
of about 3.7TB, and one namespace machine.
File layout is such that each host has its own directory; for example, the
GlusterFS website would be located in:
<fs_root>/db/org/g/www.glusterfs.org/
Each directory holds a small number of potentially large data files.
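A typical host directory therefore looks something like this (file names are
made up for illustration):

<fs_root>/db/nl/w/www.wise-guys.nl/
    pages-000001.arc.gz
    pages-000002.arc.gz
    pages.idx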
A similar setup on local disks (without GlusterFS) has proven its capabilities
over the years.
We use a distributed computing model: each node in the archive runs one
or more processes that update the archive. We use the nufa scheduler to favor
local files, and a distributed hashing algorithm to keep data from moving
between nodes (unless the configuration changes, of course).
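The assignment is roughly along these lines (a simplified shell sketch, not
our actual code; host and node names are illustrative):

host=www.wise-guys.nl
node_count=3
# hash the host name and map it to one of the data nodes;
# the mapping only changes when the node count changes
node=$(( 0x$(printf '%s' "$host" | md5sum | cut -c1-8) % node_count ))
echo "$host is handled by archive$node"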
I've included the GlusterFS configuration at the bottom of this e-mail.
Data access and throughput are pretty good (good enough), but calling stat()
on a directory can take extraordinarily long. Here, for example, is a listing
of the .nl top-level domain:
vagabond at spider2:~/archive/db/nl$ time ls
0/ 2/ 4/ 6/ 8/ a/ c/ e/ g/ i/ k/ m/ o/ q/ s/ u/ w/ y/
1/ 3/ 5/ 7/ 9/ b/ d/ f/ h/ j/ l/ n/ p/ r/ t/ v/ x/ z/
real 4m28.373s
user 0m0.004s
sys 0m0.000s
The same operation performed directly on the local filesystem of the namespace
node returns almost instantly (also for large directories):
time ls /local.mnt/md0/glfs-namespace/db/nl/a | wc -l
17506
real 0m0.043s
user 0m0.032s
sys 0m0.012s
A trace of the namespace gluster daemon shows that it is performing an
lstat() on all the subdirectories (nl/0/*, nl/1/*, etc.), information that
IMO is not needed at this point. In our case the total number of directories
on the filesystem runs into the many millions, so this behaviour hurts
performance.
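(For reference, the trace was taken with something like the following,
attached to the namespace glusterfsd; the pid is just a placeholder:)

strace -f -p <pid-of-namespace-glusterfsd> -e trace=stat,lstat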
Now for our questions:
* is this expected to scale to tens of millions of directories?
* is this behaviour a necessity for GlusterFS to operate correctly or is
it some form of performance optimisation? Is it tunable?
* what exactly is the sequence of events when handling a directory listing?
Is this request handled by the namespace node only?
* is there anything we can tune or change to speed up directory access?
(a rough example of what we mean is sketched right below)
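To illustrate that last point: we wondered whether something like the
stat-prefetch translator is the intended answer here, assuming it is still
available in these releases, e.g. on the client side:

volume prefetch
  type performance/stat-prefetch
  subvolumes unify
end-volume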
Thanks for your time,
Arend-Jan
**** Hardware config ****
data nodes
- 1 x Xeon quad core 2.5 GHz
- 4 x Barracuda ES.2 SATA 3.0-Gb/s 1-TB Hard Drive
Disks configured in RAID0, 128k chunks
Filesystem XFS
Network: gigabit LAN
namespace node
- 2 x Xeon quad core 2.5 GHz
- 4 x Cheetah® 15K.5 U320 SCSI Hard Drives
Disks configured in RAID1 (1 mirror, 1 spare)
Filesystem XFS
Network: gigabit LAN
GlusterFS version: 1.3.11 with FUSE fuse-2.7.3glfs10
GlusterFS version: 1.4-pre5 with FUSE fuse-2.7.3glfs10
**** GlusterFS data node config ****
volume brick-posix0
  type storage/posix
  option directory /local.mnt/md0/glfs-data
end-volume

volume brick-lock0
  type features/posix-locks
  subvolumes brick-posix0
end-volume

volume brick-fixed0
  type features/fixed-id
  option fixed-uid 2224
  option fixed-gid 224
  subvolumes brick-lock0
end-volume

volume brick-iothreads0
  type performance/io-threads
  option thread-count 4
  subvolumes brick-fixed0
end-volume

volume brick0
  type performance/read-ahead
  subvolumes brick-iothreads0
end-volume

volume server
  type protocol/server
  option transport-type tcp/server
  subvolumes brick0
  option auth.ip.brick0.allow 10.1.0.*
end-volume
**** GlusterFS namespace config ****
volume brick-posix
  type storage/posix
  option directory /local.mnt/md0/glfs-namespace
end-volume

volume brick-namespace
  type features/fixed-id
  option fixed-uid 2224
  option fixed-gid 224
  subvolumes brick-posix
end-volume

volume server
  type protocol/server
  option transport-type tcp/server
  subvolumes brick-namespace
  option auth.ip.brick-namespace.allow 10.1.0.*
end-volume
**** GlusterFS client config ****
volume brick-0-0
  type protocol/client
  option transport-type tcp/client
  option remote-host archive0
  option remote-subvolume brick0
end-volume

volume brick-1-0
  type protocol/client
  option transport-type tcp/client
  option remote-host archive1
  option remote-subvolume brick0
end-volume

volume brick-2-0
  type protocol/client
  option transport-type tcp/client
  option remote-host archive2
  option remote-subvolume brick0
end-volume

volume ns0
  type protocol/client
  option transport-type tcp/client
  option remote-host archivens0
  option remote-subvolume brick-namespace
end-volume

volume unify
  type cluster/unify
  option namespace ns0
  option scheduler nufa
  option nufa.local-volume-name brick-2-0  # depends on data node of course
  option nufa.limits.min-free-disk 10%
  subvolumes brick-0-0 brick-1-0 brick-2-0
end-volume
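(For completeness: the clients mount the above with something along these
lines; the exact paths are illustrative.)

glusterfs -f /etc/glusterfs/glusterfs-client.vol /home/vagabond/archive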
--
Arend-Jan Wijtzes -- Wiseguys -- www.wise-guys.nl