[Gluster-devel] [RFC] Improved distribution

Edward Shishkin edward at redhat.com
Mon Apr 16 23:57:35 UTC 2012


Hello GlusterFS developers.

We have found that the current DHT translator is suboptimal: the number
of files moved during re-balancing (caused by small changes in the set
of bricks) can be significantly reduced (see Appendix).

To be precise, we can improve scalability: in the current DHT the amount
of re-balancing work scales as O(M) (M is the total number of files in
the compound volume), whereas after changing the hashing technique it
will scale as O(M/N) (N is the number of bricks).

In the document below we first consider simple tables (section 1) and
estimate the minimal amount of rebalance work for them. Then we extend
them with the techniques of virtual nodes (section 2) and replication
(section 3), and show that this doesn't worsen scalability.

Unfortunately it is impossible to perform these improvements without a
format change, so it would be a new translator, which won't understand
layouts created by the current DHT (and vice versa).

We will be happy to see any feedback on this, and if everything is
OK, to proceed with development in this direction (with implementation
details, etc.).


Thanks,
Edward.


                            Legend:


C - namespace;
R - 64-bit ring;
N - number of real bricks that compose the volume;
M - number of files in the compound volume;
S - number of virtual components of any real brick;
R - size of the preference set (replication level); the symbol R is
    overloaded, its meaning is clear from context;



          1. Simple DH tables based on consistent hashing



We consider a 64-bit ring R, i.e. a regular 2^64-gon with vertices
0, ..., 2^64 - 1. "Ring" means that for every vertex A there can be
found vertices B, C of R such that C < A < B. Here "<" and "<=" denote
relations between the respective angles (for every vertex Z we consider
the angle between O0 and OZ, where O is the center of the polygon).

Then we consider a namespace C and any mapping phi: C -> R such that for
every sequence {c1, c2, ...} of different names the sequence {phi(c1),
phi(c2), ...} behaves as a pseudo-random variable, which has uniform
distribution on R (see 0*).
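
For illustration only, a minimal Python sketch of one possible phi; it
truncates md5 to 64 bits (the measurements in the Appendix use an
md5-based phi), and the names phi/RING_SIZE are just illustrative:

    import hashlib

    RING_BITS = 64
    RING_SIZE = 1 << RING_BITS           # number of vertices of the ring R

    def phi(name):
        # Map a name from the namespace C to a vertex of the 64-bit ring R.
        # Any well-mixing 64-bit hash will do; here the first 8 bytes of
        # the md5 digest are taken.
        digest = hashlib.md5(name.encode('utf-8')).digest()
        return int.from_bytes(digest[:8], 'big') % RING_SIZE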


Suppose we have a volume composed of N bricks with unique names
B1, B2, ..., B_N. During system initialization we construct for this
compound volume N respective unique tokens phi(B1), ..., phi(B_N)
on the ring R, caching them, say, in an rb-tree to preserve ordering.

When creating a directory (mkdir(2)) we create respective directories
on all bricks B_1, B_2, ... B_N.

When creating a regular file (creat(2), etc.) with a short name F,
we create the respective file on only one brick B = B(F), which is
determined in the following way: phi(B) is the minimal token on the
ring such that phi(F) <= phi(B). This is where the notion of a ring
comes into play: if there is no such token in [phi(F), 2^64 - 1], then
we continue the search from vertex 0.

Lookup operation for any regular file F resolves to lookup(F) on the
respective brick B(F).

Deleting a regular file F resolves to deleting a file from the brick
B(F).

Looking up a brick (i.e. the calculation F -> B(F)) is a well-scalable
operation: O(log N) steps are required.
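
A minimal sketch of this lookup, using a sorted list of (token, brick)
pairs in place of the rb-tree mentioned above (phi() is the sketch from
the previous paragraph; the class and method names are illustrative):

    import bisect

    class Ring(object):
        # Sorted list of (token, brick) pairs standing in for the rb-tree.
        def __init__(self, bricks):
            self.tokens = sorted((phi(b), b) for b in bricks)

        def brick_of(self, fname):
            # B(F): the brick with the minimal token such that
            # phi(F) <= token; wrap around to vertex 0 if there is no
            # such token in [phi(F), 2^64 - 1].
            i = bisect.bisect_left(self.tokens, (phi(fname),))
            if i == len(self.tokens):
                i = 0
            return self.tokens[i][1]

    # e.g.: ring = Ring(['host:/root/exp%d' % i for i in range(9)])
    #       ring.brick_of('file00')  -> one of the 9 bricks

The cost of brick_of() is a single binary search, i.e. O(log N)
(or O(log(N*S)) with virtual nodes, see section 2).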

When adding a new brick X = B_(N+1) we add a new unique token phi(X)
to the ring R and move a subset of files from B_j to X, where B_j is
the brick with the smallest phi(B_j) such that phi(X) < phi(B_j).
Namely, every file W of B_j with phi(W) <= phi(X) should be moved to X.
That said, the number of files to be moved during re-balancing is not
larger than the number of files contained in one brick (B_j in our case).

When removing a brick Y = B_k, we first find on R the "closest" brick
B_s, which has the minimal phi(B_s) such that phi(Y) < phi(B_s), and
move all files from Y to B_s (no scans are required).
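
A sketch of the add-brick step under the same assumptions, continuing
the Ring/phi sketches above; list_files() and move_file() are
hypothetical helpers standing in for the real brick operations:

    import bisect

    def add_brick(ring, new_brick, list_files, move_file):
        # Insert the new token phi(X) and move to the new brick every
        # file W of the "next" brick B_j (the smallest token greater
        # than phi(X)) with phi(W) <= phi(X).  Only B_j is scanned.
        x = phi(new_brick)
        bisect.insort(ring.tokens, (x, new_brick))
        i = ring.tokens.index((x, new_brick))
        donor = ring.tokens[(i + 1) % len(ring.tokens)][1]   # B_j
        for f in list_files(donor):
            if ring.brick_of(f) == new_brick:                # now maps to X
                move_file(f, donor, new_brick)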


Such a hashing technique is called "consistent hashing associated with
the variable phi, which has uniform distribution". This is a
relatively new technique suggested by Karger et al. (1*). The main
advantage of this technique is that small changes in the set of bricks
result in a small amount of rebalancing work. Namely, adding/removing 1
brick results in moving only M/N files (2*) (M is the total number of
files in the compound volume). This is M/(M/N) = N times better than
with traditional hashing, where we need to move all M files (median
value). In particular, if we have 100 bricks, then with traditional
hashing we'll need to move 100 times more files than with the
consistent one.

To construct a uniform distribution phi we can use any well-mixing
64-bit hash, say an FNV hash, etc.

Comment 1. The technique of consistent hashing is used in Amazon's
Dynamo (4*).

Comment 2. There is a disadvantage: in this approach all files

/foo
/dir1/foo
/dir1/dir2/foo
...

will be accumulated on the same brick. However, it is possible to
"salt" short file names with the gfid (or another id) of the respective
directory before hashing, to avoid this accumulation and possible
attacks.
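
For example (a sketch only; how the parent directory's id is obtained
is outside the scope of this note, and phi_salted/parent_gfid are
illustrative names):

    def phi_salted(parent_gfid, name):
        # Mix the parent directory's id into the hash so that files with
        # the same short name in different directories map to different
        # points of the ring.
        return phi(parent_gfid + '/' + name)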



                       2. Virtual nodes



The theory above works well for a large number of bricks N. However,
when N is too small (2-3 bricks), the uniform distribution can result
in bad partitioning, so one brick will accumulate many more files than
the others, which is not good. The graceful solution of this problem
is the so-called technique of "virtual nodes": with every brick we set
S tokens on the ring, where S is the number of "virtual components" of
a brick. So, every brick is represented by S unique tokens on the ring
(S >= 1, S is a parameter of the cluster translator).

In the case of S > 1 the lookup-a-brick procedure above is unchanged:
the difference is that we search in a larger set of tokens (N*S), and
since log(N*S) == log(N) + log(S) == log(N) + const, this search also
scales as log(N), while with a larger number of tokens the
distribution of files becomes more balanced (in terms of the standard
deviation, see (3*) and the Appendix for details). In particular,
S=100 provides a deviation of ~10% of the mean.
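
A sketch of the same Ring with S virtual components per brick; the
"brick#i" suffix scheme here is just one illustrative way to derive S
tokens from one brick name:

    class VirtualRing(Ring):
        # Every real brick is represented by S tokens phi("<brick>#<i>").
        # brick_of() is inherited unchanged; it simply searches among
        # N*S tokens instead of N.
        def __init__(self, bricks, S):
            self.tokens = sorted((phi('%s#%d' % (b, i)), b)
                                 for b in bricks
                                 for i in range(S))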

Adding a new brick with S > 1 looks like the above, with the
difference that we steal files from S different old virtual bricks.
Note that 2 or more of those virtual bricks can represent the same
real brick. So adding a real brick with S virtual components requires
(M/N)*S scans; however, the median number of files to be moved during
re-balancing is the same (M/N) as in the case of S == 1.

Removing a brick with S > 1 virtual components mostly looks like the
case of S == 1: no scans are required. The difference is that we
distribute the files of the brick to be removed among S virtual bricks
(which correspond to <= S real bricks).
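
A sketch of removing a real brick with S >= 1 virtual components,
continuing the sketches above (same hypothetical list_files()/
move_file() helpers as in section 1):

    def remove_brick(ring, old_brick, list_files, move_file):
        # Drop all tokens of the removed brick and redistribute its files
        # by re-hashing them against the remaining tokens; no other brick
        # is scanned.
        ring.tokens = [(t, b) for (t, b) in ring.tokens if b != old_brick]
        for f in list_files(old_brick):
            move_file(f, old_brick, ring.brick_of(f))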



                    3. Replication and Recovery



To achieve high availability and durability we replicate files on
multiple bricks. In our case replication can be implemented as a set
of operations with the same ring R, so we don't create a separate
translator for replication.

We store every file on its so-called preference set of real bricks.
The size R of this set is a volume option. R is also called the
replication level (R = 1 means no replication: every file is stored
on a single brick only).

For every file F its preference set is defined as the set of "closest"
virtual bricks B_(k_1), ..., B_(k_R), which represent pairwise
different real bricks, such that B_(k_1) = B(F) and
phi(B_(k_1)) < phi(B_(k_2)) < ... < phi(B_(k_R)).

We don't create 2 replicas of the same file on the same real brick,
so R shouldn't be larger than N.
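
A sketch of computing the preference set by walking the ring clockwise
from B(F) and collecting the first R pairwise different real bricks
(again using the Ring/VirtualRing sketches above; the function name is
illustrative):

    import bisect

    def preference_set(ring, fname, R):
        # The R "closest" pairwise different real bricks of F, starting
        # from B(F); assumes R is not larger than the number of real bricks.
        chosen = []
        i = bisect.bisect_left(ring.tokens, (phi(fname),))
        for step in range(len(ring.tokens)):
            token, brick = ring.tokens[(i + step) % len(ring.tokens)]
            if brick not in chosen:
                chosen.append(brick)
                if len(chosen) == R:
                    break
        return chosen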

If we enable replication (R>1), regular file operations become more
complicated: every such operation is performed for all respective
files located on all bricks of the preference set.

Operations on a set of bricks also become more complicated, but
scalability doesn't suffer. When adding a new brick X = B_(N+1), we

0) add a unique token phi(X) to the ring R.

1) find R closest tokens B_(m_1), ..., B_(m_R), which represent
    different real bricks in the ring, so that B_(m_R) == X,
    and phi(B_(m_1)) < ... < phi(B_(m_R)).

2) find R+1 closest tokens B_(p_0), ..., B_(p_R), which represent
    different real bricks in the ring, so that B_(p_0) == X,
    and phi(B_(p_0)) < ... < phi(B_(p_R)).

3) The new brick X steals a portion of the files of B_(p_1), as
    described in section 1 above.

4) The brick B_(p_R) no longer belongs to the preference set of
    the stolen files, so we need to remove all the respective replicas
    from B_(p_R).

5) X now belongs to the preference sets of files stored on the bricks
    B_(m_1), ..., B_(m_(R-1)), hence we should create the respective
    replicas on X.

So adding a new brick with replication level R > 1 results in

a) moving a portion of the files of one brick (step (3) above);
b) replicating files located on R-1 bricks (step (5) above);
c) deleting replicas of a portion of the files of one brick (step (4)).

(a), (b), (c) above can be considered as re-balancing of (R+1)*(M/N) =
const*(M/N) files (when R == 1, (b) and (c) are absent, and we need
to re-balance M/N files, as was shown in section 1 above).
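
For a rough illustration (the numbers are arbitrary): with M =
1,000,000 files, N = 100 bricks and replication level R = 3, adding
one brick re-balances about (R+1)*(M/N) = 4 * 10,000 = 40,000 files,
whereas traditional hashing would re-shuffle on the order of all
1,000,000 files.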

Similarly, we can show that with replication level R, removing one
brick also leads to re-balancing of const*(M/N) files.


If in our configuration L < R bricks don't respond for some reason,
then all regular file operations are still defined; however, our system
is marked as "unhealthy" (in some papers this state is called "sloppy
quorum"), non-responding bricks are marked as "missed" on the ring, and
file operations are performed on the other available bricks of the
preference set. In such operations files on the available "non-primary"
bricks are marked as "not synced with the missed replicas".

In the state of sloppy quorum, operations like adding/removing a node
may already be undefined. For example, when adding a brick we may need
to steal files from a brick which doesn't respond.

So we need to return the system to a "consistent" state, in which all
operations are defined. This can be done in the following ways:

1) Make sure that all missed bricks are available again and perform
    L1-recovery. L1-recovery means syncing all marked files with the
    bricks that are available again, so that the resulting consistent
    system will have the same number N of bricks.
2) Add new empty bricks in place of the missed ones and perform
    L2-recovery. It means filling the new empty bricks with files from
    other bricks, so that the resulting consistent system will have the
    same number N of bricks.
3) Remove the missed bricks from the ring and perform L3-recovery, so
    that the resulting consistent system will have a smaller number of
    nodes (N-L).

Comment 1. For each missed brick we can specify a different type of
recovery.

Comment 2. When R == N, replication "clogs" the distribution: in this
case our system will work like a set of mirrors: every brick will
contain the same set of files.



                             APPENDIX



----------------------------------------------------------------------

In 3 distributed hash tables with different hashing techniques

. GlusterFS DHT translator (3.2.5)
. 64-bit ring with phi based on md5, R=1 (no replication), S=7
. 64-bit ring with phi based on md5, R=1 (no replication), S=20

we run the same scenario:

1) Create 100 files ("file00", "file01", ..., "file99") in a volume
    composed of 9 bricks:

    "host:/root/exp0",
    "host:/root/exp1",
    ...

    "host:/root/exp8".

2) Add one brick "host:/root/exp9";
3) re-balance;


I. GlusterFS DHT translator (3.2.5)

----------------------------------------------------------------------

before re-balancing:

exp0: file15  file22  file34  file35  file51  file6  file68  file78
       file8   file81  file89  file94  file95
exp1: file10  file28  file3   file4   file43  file66  file75
exp2: file40  file46  file47  file48  file50  file58  file86  file9
exp3: file12  file13  file32  file37  file54  file55  file7  file71
       file91
exp4: file31  file38  file41  file42  file53  file62  file63  file69
       file93  file97
exp5: file11  file16  file17  file24  file25  file27  file29  file44
       file56  file73  file74  file80  file87  file90
exp6: file0   file1   file2   file33  file36  file49  file57  file59
       file64  file77  file79  file84  file85  file88  file98
exp7: file21  file26  file39  file52  file61  file70  file72  file83
       file92  file99
exp8: file14  file20  file23  file30  file45  file5   file60  file65
       file67  file76  file82  file96

after re-balancing:

exp0: file11  file16  file17  file24  file25  file31  file44  file53
       file62  file69  file73  file80  file87  file93  file97
exp1: file0   file27  file29  file33  file36  file56  file57  file64
       file74  file77  file84  file88  file90  file98
exp2: file1   file2   file39  file49  file59  file72  file79  file85
       file92
exp3: file21  file26  file30  file45  file52  file60  file61  file65
       file70  file83  file99
exp4: file14  file20  file23  file5   file67  file76  file82  file96
exp5: file15  file22  file34  file35  file51  file6   file68  file78
       file8   file81  file89  file94  file95
exp6: file10  file28  file4   file43  file66  file75
exp7: file3   file40  file47  file58  file9
exp8: file12  file13  file32  file37  file46  file48  file50  file7
       file71  file86
exp9: file38  file41  file42  file54  file55  file63  file91


As a result, 98 files have been rebalanced (139 files scanned in total)



II.  64-bit ring with phi based on md5.
      Every brick has S=7 virtual components

----------------------------------------------------------------------

before re-balancing:

exp0: file02  file18  file22  file42  file48  file58  file62  file70
exp1: file01  file08  file15  file23  file33  file51  file64  file75
       file82  file85  file86  file87  file95
exp2: file00  file10  file11  file14  file25  file29  file40  file45
       file63  file81  file91  file96
exp3: file09  file16  file19  file21  file24  file28  file32  file35
       file36  file44  file47  file50  file52  file57  file67  file73
       file88  file98
exp4: file27  file49  file53  file55  file69  file97
exp5: file05  file20  file43
exp6: file34  file68  file72  file74  file79
exp7: file03  file04  file26  file39  file41  file54  file60  file71
       file77  file78  file83  file84  file89  file93  file94
exp8: file06  file07  file12  file17  file30  file31  file37  file38
       file46  file56  file59  file61  file65  file66  file76  file80
       file90  file92  file99

after re-balancing:
only the following files have been moved (to exp9):

exp0: file70
exp1: file82  file85
exp2: file00  file14  file25  file40  file45  file91  file96
exp3: file88

As a result, 11 files have been rebalanced (51 files scanned in total)



III.  64-bit ring with phi based on md5.
       Every brick has S=20 virtual components

-----------------------------------------------------------------------

before re-balancing:

exp0: file00  file04  file22  file30  file40  file42  file70  file96
exp1: file06  file08  file13  file15  file17  file19  file23  file32
       file33  file78  file81  file86  file95
exp2: file11  file14  file16  file24  file29  file57  file63  file67
       file73  file76
exp3: file02  file10  file12  file18  file21  file28  file35  file51
       file56  file59  file80  file87
exp4: file39  file41  file49  file53  file54  file62  file69  file77
       file83  file84  file91
exp5: file05  file20  file31  file43  file68  file74
exp6: file34  file37  file38  file46  file48  file58  file66  file71
       file75  file79  file88
exp7: file03  file07  file26  file47  file60  file72  file89  file93
       file94
exp8: file01  file09  file25  file27  file36  file44  file45  file50
       file52  file55  file59  file61  file64  file65  file82  file85
       file90  file92  file97  file98  file99

after re-balancing:
only the following files have been moved (to exp9):

exp0: file70
exp6: file88
exp7: file07  file93
exp8: file45  file82  file85

As a result, 7 files have been rebalanced (48 files scanned in total)


----------------------------------------------------------------------

                         Table with results

           Re-balance results and standard deviations (sd)
           for current gluster DHT translator (3.2.5), and
           for 64-bit rings with S=7 and S=20.


----------------------------------------------------------------------

                DHT (Gluster 3.2.5)  64-bit ring, S=7  64-bit ring, S=20
----------------------------------------------------------------------
files moved            98                 11                 7
files scanned         139                 51                48
sd before             2.8                5.4               3.7
sd after              3.2                5.3               3.2
----------------------------------------------------------------------



----------


(0*) http://en.wikipedia.org/wiki/Uniform_distribution_%28discrete%29
(1*) Karger, D.; Lehman, E.; Leighton, T.; Panigrahy, R.; Levine, M.;
     Lewin, D. (1997). "Consistent Hashing and Random Trees: Distributed
     Caching Protocols for Relieving Hot Spots on the World Wide Web"
(2*) M/N is a median value
(3*) http://weblogs.java.net/blog/tomwhite/archive/2007/11/consistent_hash.html
(4*) http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html



