[Gluster-devel] AFR

Mon Feb 19 04:33:55 UTC 2007

,----[ Brent A Nelson writes: ]
| I see that the AFR automatic replication module is now listed as
| complete and scheduled for the February 20th 1.4 release.  While we
| are eagerly awaiting it, is there any documentation somewhere about
| how it operates?
| 
| It sounds like this will provide an easily installed and configured,
| truly redundant (no single points of failure) high-performance
| distributed filesystem, without having to resort to extra layers
| (such as drbd).  In other words, filesystem nirvana. ;-)
`----
Yes, AFR is complete, but we want to add "selective AFR support"
before the release (for example, 4 copies of *.db files, 2 copies of
*.html files, and so on). Documentation for AFR will be available
before the release.
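
As a rough illustration of the selective idea only (this is not the AFR
code or its configuration syntax, which is not shown in this thread),
pattern-based replica counts could be looked up like this. The rule
table, the default of one copy and the function name are all
hypothetical.

```python
from fnmatch import fnmatch

# Hypothetical rule table; the patterns and counts mirror the example above.
REPLICA_RULES = [("*.db", 4), ("*.html", 2)]
DEFAULT_COPIES = 1  # assumption: files matching no rule keep a single copy


def copies_for(filename, rules=REPLICA_RULES, default=DEFAULT_COPIES):
    """Return the replica count of the first pattern that matches filename."""
    for pattern, copies in rules:
        if fnmatch(filename, pattern):
            return copies
    return default
```

Here the first matching rule wins, so the order of the table matters
if patterns overlap.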

We have delayed the release to the 28th, but have additionally
included striping translator support.

,----
| 1) When does replication take place? When a client is writing to a
| file, does it write to two (or more) servers at the same time, or
| does one server replicate the file to the other server after close,
| or...? If a replica server goes down, how does it catch up with
| changes since it was down? Do the clients know to use the good
| replica server while the other server is catching up?
`----
Replication happens on the fly. All writes are simultaneously written
to all the replicated copies. 
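
A minimal sketch of "write to every copy at once", assuming each
replica is just a local directory (in GlusterFS the replicas are
bricks reached through the translator stack, so this only models the
idea). The function name and the use of a thread pool are my own
choices for illustration.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def replicated_write(replica_dirs, name, data):
    """Write data to the same file name under every replica directory."""

    def write_one(directory):
        path = os.path.join(directory, name)
        with open(path, "wb") as f:
            f.write(data)
        return path

    # One worker per replica so the writes are issued in parallel.
    with ThreadPoolExecutor(max_workers=len(replica_dirs)) as pool:
        # list() waits for all writes and re-raises the first failure, if any.
        return list(pool.map(write_one, replica_dirs))
```

The call only returns once every copy has been written, so the caller
never sees a state where some replicas were skipped silently.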

,----
| 2) Assuming replicas are in sync, will clients load-balance reads
| across the multiple replicas for increased performance? What about
| writes?
`----
Reading simultaneously from multiple replicas doesn't really help
performance. If I read the first block (say 128KB) from the first
brick and the second block from the second brick, I lose the benefit
of the underlying kernel's read-ahead cache: the second block on the
second brick will be slower than it would have been on the first
brick, and the pattern repeats for every block. This strided read
pattern makes kernel I/O go crazy, and memory is consumed on all the
bricks. If I instead direct consecutive I/O to the same brick, I will
most likely read from cache. It makes more sense to distribute the
load on a per-file basis: when a second file is opened in a mirror,
the second brick will be used, and so on.
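
The per-file distribution described above can be sketched as follows.
This is only a model under my own assumptions: the class and method
names are hypothetical, and bricks are represented by plain strings.

```python
from itertools import cycle


class FileLevelBalancer:
    """Pick one brick per opened file and keep all its reads on that brick."""

    def __init__(self, bricks):
        self._next_brick = cycle(bricks)
        self._brick_for = {}  # path -> brick chosen at open time

    def open(self, path):
        """Assign the next brick in rotation to this file and remember it."""
        self._brick_for[path] = next(self._next_brick)
        return self._brick_for[path]

    def brick_for_read(self, path):
        """Every read of an open file goes to the brick chosen at open time."""
        return self._brick_for[path]
```

Consecutive reads of one file stay on one brick, so that brick's
read-ahead cache stays useful, while different files spread across
the mirror.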

GlusterFS does asynchronous background writes, so writing to multiple
bricks is a parallel operation. Write operations are very fast in
GlusterFS (faster than reads).

,----
| 3) How are locks handled between replicas?
`----
Locks are applied to all the replicated files simultaneously.
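
To give a feel for "one lock request, applied to every copy", here is
a small Unix-only sketch using POSIX advisory locks on local files. It
is not how AFR implements locking. Taking the locks one after another
in sorted path order is my own assumption, chosen so that two clients
cannot deadlock each other, and the function name is hypothetical.

```python
import fcntl
from contextlib import contextmanager


@contextmanager
def lock_all_replicas(paths):
    """Hold an exclusive advisory lock on every replica file at once."""
    handles = []
    try:
        # Fixed (sorted) order: an assumption made here to avoid deadlock.
        for path in sorted(paths):
            f = open(path, "r+b")
            handles.append(f)
            fcntl.lockf(f, fcntl.LOCK_EX)  # blocks until the lock is granted
        yield handles
    finally:
        # Closing a file releases the POSIX lock held through it.
        for f in reversed(handles):
            f.close()
```

The caller's critical section runs only while all replicas are
locked, and every lock is released even if acquiring a later one
fails.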

,----
| 4) GlusterFS looks to be extremely flexible with regards to its
| configuration, but I just want to be sure: if AFR is working with
| multiple nodes, each containing multiple disks as part of a
| filesystem, will we be able to guarantee that replicas will be
| stored on different nodes (i.e., so a node can fail and the data
| will still be fully available).
`----
AFR was initially scheduled for a later release, since our focus was
more on performance and flexibility. But after hearing the same
request from many users, we moved AFR up to the next release.

We are also hoping that the next 1.3 release will be a significant
improvement in performance and reliability, particularly with the
io-threads, ib-verbs transport and epoll additions.

,----
| A completely vague, general question:
| 
| I'd like to run standard departmental services across one or more
| redundant, distributed filesystems.  This would include home
| directories and data directories, as well as directories that would
| be shared amongst multiple servers for the express purpose of
| running redundant (and load-leveled, when possible) services (mail,
| web, load-leveled NFS export for any systems that can't mount the
| filesystem directly, hopefully load-leveled SAMBA, etc.).  Would
| GlusterFS (with AFR) be a good match?
`----
Yes, I hope so. To match different application I/O demands, we kept
GlusterFS as modular as possible. You can easily add only the
features you require and pick the I/O scheduler that is best suited
to your needs.

The GlusterFS I/O scheduler takes care of load-balancing. There are
different options available (adaptive-least-usage,
non-uniform-file-access, round-robin and random). It is also fairly
easy to implement a new scheduler module.
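
To show why a scheduler is a small piece of code, here is a sketch of
two of the policies named above behind one common method. The real
scheduler modules are not written in Python and their interface is
not shown in this thread, so the class names and the schedule()
method are hypothetical.

```python
import random
from itertools import cycle


class RoundRobinScheduler:
    """Hand out bricks in a fixed rotation, one per new file."""

    def __init__(self, bricks):
        self._bricks = cycle(bricks)

    def schedule(self, path):
        return next(self._bricks)


class RandomScheduler:
    """Pick a brick uniformly at random for each new file."""

    def __init__(self, bricks, seed=None):
        self._bricks = list(bricks)
        self._rng = random.Random(seed)  # seed only for repeatable runs

    def schedule(self, path):
        return self._rng.choice(self._bricks)
```

Because both classes expose the same schedule(path) call, the code
that creates files does not need to know which policy is in use.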

,----
| I've been working with Lustre+DRBD+Heartbeat, and it seems like it
| will work, but the complexity of it makes me nervous (it may be
| fragile, and prone to breakage with future updates, especially with
| its strong dependence still upon numerous kernel patches that differ
| by kernel release).  GlusterFS sounds much simpler (thanks to its
| modularity) and is about to get built-in replication...
| 
| Thanks,
| 
| Brent Nelson
| Director of Computing
| Dept. of Physics
| University of Florida
`----
Lustre is by far the best cluster file system. But the same reasons
you have mentioned led to the development of GlusterFS.

Developing a filesystem for even a single node takes years to
mature. When a filesystem spans servers, it becomes incredibly and
unnecessarily complex. With GlusterFS, we instead cluster multiple
file systems together. This is a radically different design from
Lustre. Real focus is needed on volume management, I/O scheduling /
load-balancing and additional features. For extensibility we borrowed
the translator design from the GNU Hurd kernel. It has become
extremely easy for us to add advanced features without compromising
performance or adding complexity.
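
A toy model of the stacking idea, for readers who have not seen
translators before: each layer exposes the same calls and forwards
them to the layer below, so a feature is added by inserting one more
layer. All class names here are hypothetical, and the in-memory dict
merely stands in for a real underlying file system.

```python
class Storage:
    """Bottom of the stack: keeps file contents in a dict."""

    def __init__(self):
        self.files = {}

    def write(self, path, data):
        self.files[path] = data

    def read(self, path):
        return self.files[path]


class Translator:
    """Base layer: forwards every call unchanged to the layer below."""

    def __init__(self, child):
        self.child = child

    def write(self, path, data):
        self.child.write(path, data)

    def read(self, path):
        return self.child.read(path)


class CountingTranslator(Translator):
    """Example feature layer: counts writes, then passes them down."""

    def __init__(self, child):
        super().__init__(child)
        self.writes = 0

    def write(self, path, data):
        self.writes += 1
        super().write(path, data)
```

For example, CountingTranslator(Translator(Storage())) behaves like
plain storage to its caller while the extra layer does its work in
between.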

-- 
Anand Babu 
GPG Key ID: 0x62E15A31
Blog [http://ab.freeshell.org]              
The GNU Operating System [http://www.gnu.org]