[Gluster-devel] DHT2 trip report

Venky Shankar vshankar at redhat.com
Wed Mar 23 05:35:20 UTC 2016


Hey folks,

This is a report of the discussion regarding DHT2 that took place with Jeff Darcy
and Shyam in Bangalore during February. Many of you have already been following (or
are at least partly aware of) what's planned for DHT2, but there's no consolidated document
that explains the design in detail; just pieces of information from various talks and
presentations. This report will be followed up by the design doc. So, hang in there for that.

[Short report of the discussion and what to expect in the design doc..]

DHT2 server component (MDS)
---------------------------
The DHT2 client would act as a forwarder/router to the server component. It's the server
component that would drive an operation. Operations may be compound in nature, with
local and/or remote parts. The "originator" server may forward (sub)operations to another
server in the cluster, depending on the type of the original operation.

The server component also takes care of serializing operations where required, to ensure
correctness and resilience of concurrent operations. For crash consistency, DHT2 would
first log the operation it's about to perform in a write-ahead log (WAL or journal) and
then apply the operation(s) to the store. WAL records are marked completed after the
operations are durable on the store. Pending operations are replayed on server restart.
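
To make the intended flow a bit more concrete, here's a rough Python sketch of the idea
(purely illustrative; none of the names below exist in the codebase, and the journal
record format is made up):

    # Hypothetical sketch of the WAL-driven operation flow on a DHT2 MDS.
    import json
    import os

    WAL_PATH = "/var/lib/dht2/journal"   # made-up journal location

    def apply_to_store(op):
        pass  # placeholder: apply the change to the local store

    def forward_to_server(subop):
        pass  # placeholder: send the sub-operation to the responsible MDS

    def wal_append(record):
        """Durably log the intent of an operation before touching the store."""
        with open(WAL_PATH, "a") as wal:
            wal.write(json.dumps(record) + "\n")
            wal.flush()
            os.fsync(wal.fileno())

    def perform(op):
        """Log first, then apply locally and/or forward, then mark complete."""
        wal_append({"txid": op["txid"], "op": op, "state": "pending"})

        apply_to_store(op)                 # local (sub)operation
        for sub in op.get("remote_subops", []):
            forward_to_server(sub)         # (sub)operation on another MDS

        # Only once the effects are durable on the store is the record
        # marked completed; incomplete records are replayed on restart.
        wal_append({"txid": op["txid"], "state": "completed"})

    def replay_pending():
        """On server restart, re-drive operations that never completed."""
        completed, pending = set(), {}
        with open(WAL_PATH) as wal:
            for line in wal:
                rec = json.loads(line)
                if rec["state"] == "completed":
                    completed.add(rec["txid"])
                else:
                    pending[rec["txid"]] = rec["op"]
        for txid, op in pending.items():
            if txid not in completed:
                perform(op)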

Cluster Map
-----------
This is a versioned "view" of the current state of a cluster. State here refers to the nodes
(or probably sub-volumes?) which form the cluster, along with their operational state (up,
down) and weight. The cluster map is used to distribute objects to a set of nodes.
Every entity in the cluster (clients and servers) keeps a copy of the cluster map and consults
it whenever required (e.g., during distribution). Cluster map versions are monotonically
increasing; an epoch number is best suited for such a versioning scheme. A master copy of
the map is maintained by GlusterD (in etcd).

Operations performed by clients (and servers) carry the epoch number of their cached
version of the cluster map. Servers use this information to validate the freshness
of the cluster map.
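
Roughly, the epoch check on the server side could look like this (field and function
names are made up for illustration, not actual DHT2 structures):

    # Illustrative sketch of epoch-based cluster-map freshness checking.
    from dataclasses import dataclass, field

    @dataclass
    class ClusterMap:
        epoch: int                                 # monotonically increasing version
        nodes: dict = field(default_factory=dict)  # node -> {"state": "up"/"down", "weight": N}

    class StaleClusterMap(Exception):
        pass

    def validate_epoch(server_map: ClusterMap, request_epoch: int):
        """Server-side check of the epoch carried by an incoming operation."""
        if request_epoch < server_map.epoch:
            # Client is operating on an old view; it must refresh its map
            # (e.g., from the master copy kept by GlusterD in etcd).
            raise StaleClusterMap(f"client epoch {request_epoch} < server epoch {server_map.epoch}")
        if request_epoch > server_map.epoch:
            # The server itself is behind and needs to refresh before serving.
            raise StaleClusterMap(f"server epoch {server_map.epoch} < client epoch {request_epoch}")

    # Example: a client carrying epoch 41 hitting a server at epoch 42 gets
    # rejected and re-fetches the map before retrying.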

Write Ahead Log
---------------
DHT2 would make use of a journal to ensure crash consistency. It's tempting to reuse
the journaling translator developed for NSR (FDL: Full Data Logging xlator), but
doing that would require FDL to handle special cases for DHT2. Furthermore, there
are plans to redesign quota to rely on journals (journaled quota). Also, designing
a system with such tight coupling makes it hard to switch to alternate implementations
(e.g., server-side AFR instead of NSR). Therefore, implementing the journal as a regular
file that's treated like any other file by all layers below would:

+ provide the DHT2 server component more control (discard, replay, etc.) over the journal

+ enable the MDS to use NSR or AFR (server side) without any modifications to the journaling
  part; NSR/AFR would treat the journal as any other file and keep it replicated and consistent

- restrict the ability to take advantage of placing the (DHT2) journal on faster storage (SSDs)
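
To illustrate the "more control" point: since the journal is just a regular file,
discard/compaction is nothing but plain file I/O, and whatever replication translator
sits below sees only ordinary writes and a rename. A made-up sketch (the journal layout
is an assumption):

    # Hypothetical journal discard/compaction using only regular file operations,
    # so the layers below (NSR/AFR, posix) need no journal-specific handling.
    import json
    import os

    def discard_completed(journal_path: str):
        """Rewrite the journal, keeping only records that are not yet completed."""
        completed, records = set(), []
        with open(journal_path) as j:
            for line in j:
                rec = json.loads(line)
                records.append(rec)
                if rec.get("state") == "completed":
                    completed.add(rec["txid"])

        tmp_path = journal_path + ".compact"
        with open(tmp_path, "w") as out:
            for rec in records:
                if rec["txid"] not in completed:
                    out.write(json.dumps(rec) + "\n")
            out.flush()
            os.fsync(out.fileno())

        # Atomic swap; just another rename as far as lower translators are concerned.
        os.replace(tmp_path, journal_path)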

Sharding
--------
Introduce the notion of block pointers for inodes. DHT2 would then distribute block pointers
rather than individual files/objects. This changes the translator API extensively. Try to
leverage the existing shard implementation and see if the concept of block pointers can be
used in place of treating each shard as a separate file. Treating each shard as a separate
file bloats the amount of tracking (on the MDS) needed for each file shard.
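
A rough sketch of the contrast (the structures below are invented for illustration): with
per-shard files the MDS tracks one entry per shard, whereas with block pointers a single
inode carries the placement of its blocks and only those pointers get distributed:

    # Invented structures to contrast per-shard files with inode block pointers.
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class ShardedFile:
        # Existing shard translator style: each shard is its own file/object,
        # so the MDS ends up tracking one entry per shard of a logical file.
        gfid: str
        shard_gfids: List[str] = field(default_factory=list)

    @dataclass
    class BlockPointer:
        offset: int      # logical offset of the block within the file
        length: int
        location: str    # data subvolume/node holding the block

    @dataclass
    class InodeWithBlocks:
        # DHT2 idea: one inode on the MDS carries block pointers; DHT2
        # distributes the blocks they point at, not one file per shard.
        gfid: str
        size: int
        blocks: List[BlockPointer] = field(default_factory=list)

    # Example: a 3-block file is one MDS entry with three pointers, instead
    # of three shard files each needing its own metadata entry.
    BLOCK = 4 * 1024 * 1024
    inode = InodeWithBlocks(
        gfid="0000-aaaa",
        size=3 * BLOCK,
        blocks=[BlockPointer(i * BLOCK, BLOCK, f"data-subvol-{i}") for i in range(3)],
    )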

Size on MDS
-----------
https://review.gerrithub.io/#/c/253517/

Leader Election
---------------
It's important for the server-side DHT2 component to know, in a replicated MDS setup, whether
a brick is acting as the leader. As of now this information is a part of NSR, but it needs to
be carved out as a separate translator on the server.

[More on this in the design document]
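
Purely as an illustration (the interface below is an assumption, and the real thing would
be a translator, not Python), the idea is that the server-side DHT2 component would consult
leader status before driving an operation in a replicated MDS group:

    # Assumed interface for querying leader status, illustrating why the
    # server-side DHT2 component needs this carved out of NSR.
    class LeaderStatus:
        """Placeholder for a separate leader-election/leader-status translator."""
        def __init__(self):
            self._leader = False

        def set_leader(self, is_leader: bool):
            # Updated by whatever consensus mechanism the replica group runs.
            self._leader = is_leader

        def is_leader(self) -> bool:
            return self._leader

    def maybe_drive_operation(status: LeaderStatus, op):
        """Only the leader of a replicated MDS group drives (and journals) ops."""
        if not status.is_leader():
            # Non-leaders don't originate the compound operation; they would
            # redirect to the leader or simply participate in replication.
            return "redirect-to-leader"
        return f"driving {op}"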

Server Graph
------------
DHT2 MDSes "interact" with each other (non-replicated MDC would typically have the client
translator loaded on the server to "talk" to other (N - 1) MDS nodes. When the MDC is
replicated (NSR for example) then:

1) 'N - 1' NSR client component(s) loaded on the server to talk to other replias (when
   a (sub)operation needs to be performed on non local/replica-group) node)

         where, N == number of distribute subvolumes

2) 'N' NSR client component(s) loaded on the client for high availability of distributed
   MDS.
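
A small worked example of the fan-out, assuming N = 4 distribute subvolumes (numbers only,
no volfile syntax implied):

    # Worked example of the client-component fan-out described above,
    # assuming N = 4 distribute (MDS) subvolumes.
    N = 4

    # On each MDS server: NSR client components to reach the *other* replica groups.
    nsr_clients_on_each_server = N - 1      # = 3

    # On each client: one NSR client component per distribute subvolume,
    # so the distributed MDS stays available to the client.
    nsr_clients_on_each_client = N          # = 4

    print(f"server side: {nsr_clients_on_each_server} NSR client xlators")
    print(f"client side: {nsr_clients_on_each_client} NSR client xlators")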

--

As usual, comments are more than welcome. If things are still unclear or you want to
read more, then hang in there for the design doc.

Thanks,

                                Venky

