[Gluster-devel] [RFC] Improved distribution

Anand Avati anand.avati at gmail.com
Tue Apr 17 08:29:36 UTC 2012


Hi Edward,
- It will be interesting to see the comparison with the rebalance
enhancements which have gone into 3.3 (to get a sense of urgency).
- Even though the proposed scheme is not compatible with existing DHT
(+AFR), I can see ways in which it can be backward compatible with existing
volumes if care is taken in implementation details.

Avati

On Mon, Apr 16, 2012 at 4:57 PM, Edward Shishkin <edward at redhat.com> wrote:

> Hello GlusterFS developers.
>
> We have found that the current DHT translator is suboptimal: the number
> of files moved during re-balancing (caused by small changes in the set
> of bricks) can be significantly reduced (see the Appendix).
>
> To be precise, we can improve scalability: in the current DHT the amount
> of re-balancing work scales as O(M) (M is total number of files in the
> compound volume), whereas after changing the hashing technique it will
> scale as O(M/N) (N is the number of bricks).
>
> In the document below we first consider simple tables (section 1) and
> estimate the minimal amount of rebalance work for them. Then we extend
> them with the techniques of virtual nodes (section 2) and replication
> (section 3), and show that this doesn't worsen scalability.
>
> Unfortunately it is impossible to perform these improvements without a
> format change, so this would be a new translator, which won't understand
> layouts created by the current DHT (and vice versa).
>
> We will be happy to see any feedback on this, and if everything is
> OK, to proceed with development in this direction (implementation
> details, etc).
>
>
> Thanks,
> Edward.
>
>
>                           Legend:
>
>
> C - namespace;
> R - 64-bit ring;
> N - number of real bricks that compose the volume;
> M - number of files in the compound volume;
> S - number of virtual components of any real brick;
> R - size of preference set (replication level);
>
>
>
>         1. Simple DH tables based on consistent hashing
>
>
>
> We consider a 64-bit ring R, i.e. a regular polygon with vertices
> 0, ..., 2^64 - 1. "Ring" means that for every vertex A there can be
> found vertices B, C of R such that C < A < B. Here "<" and "<=" denote
> relations between the respective angles (for any vertex Z we consider
> the angle between the rays O0 and OZ, where O is the center of the
> polygon).
>
> Then we consider a namespace C and any mapping phi: C -> R, such that
> for every sequence {c1, c2, ...} of different names {phi(c1), phi(c2),
> ...} is a pseudo-random variable which has a uniform distribution on R
> (see 0*).
>
>
> Suppose we have a volume composed of N bricks with unique names
> B_1, B_2, ..., B_N. During system initialization we construct for this
> compound volume N respective unique tokens phi(B_1), ..., phi(B_N)
> in the ring R, caching them, say, in an rb-tree to preserve ordering.
>
> When creating a directory (mkdir(2)) we create respective directories
> on all bricks B_1, B_2, ... B_N.
>
> When creating a regular file (creat(2), etc) with a short name F,
> we create a respective file on only one brick B = B(F), which is
> determined in the following way: phi(B) is the minimal token in the
> ring such that phi(F) <= phi(B). This is where the notion of a ring
> comes into play: if there is no such token in [phi(F), 2^64 - 1], then
> we continue the search from vertex 0.
>
> Lookup operation for any regular file F resolves to lookup(F) on the
> respective brick B(F).
>
> Deleting a regular file F resolves to deleting a file from the brick
> B(F).
>
> Looking up a brick (i.e. the calculation F -> B(F)) is a well-scalable
> operation: only log(N) actions are required.
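>
> For illustration only, here is a minimal sketch of this lookup in
> Python (the names are our own, this is not GlusterFS code; md5 is
> used as the well-mixing hash, as in the Appendix below):
>
>   import bisect
>   import hashlib
>
>   RING = 2 ** 64
>
>   def phi(name):
>       # uniformly distributed 64-bit hash of a name
>       return int(hashlib.md5(name.encode()).hexdigest(), 16) % RING
>
>   class Ring:
>       def __init__(self, bricks):
>           # token -> brick; tokens are kept sorted for binary search
>           self.token_to_brick = {phi(b): b for b in bricks}
>           self.tokens = sorted(self.token_to_brick)
>
>       def brick_of(self, fname):
>           # B(F): the brick with the minimal token >= phi(F),
>           # wrapping around to tokens[0] past the end of the ring
>           i = bisect.bisect_left(self.tokens, phi(fname))
>           return self.token_to_brick[self.tokens[i % len(self.tokens)]]
>
> The binary search over the sorted tokens is what gives the log(N) cost.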
>
> When adding a new brick X = B_(N+1) we add a new unique token phi(X)
> to the ring R and move a subset of files from B_j to X, where B_j is
> the brick with the smallest phi(B_j) such that phi(X) < phi(B_j).
> Namely, every file W of B_j with phi(W) <= phi(X) should be moved to X.
> Thus, the number of files to be moved during re-balancing is not
> larger than the number of files contained in one brick (B_j in our
> case).
>
> When removing a brick Y = B_k, we first find in R the "closest" brick
> B_s, which has the minimal phi(B_s) such that phi(Y) < phi(B_s), and
> move all files from Y to B_s (no scans are required).
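>
> Continuing the sketch above (again, only an illustration), adding a
> brick inserts its token and then moves only those files of the
> successor brick B_j whose hash now resolves to the new brick:
>
>   # sketch continued: a method of the Ring class above
>   def add_brick(self, new_brick, files):
>       # files: dict brick -> set of file names, a toy stand-in for
>       # the real bricks; returns the set of files that must move
>       successor = self.brick_of(new_brick)   # B_j before insertion
>       t = phi(new_brick)
>       self.token_to_brick[t] = new_brick
>       bisect.insort(self.tokens, t)
>       moved = {f for f in files[successor]
>                if self.brick_of(f) == new_brick}
>       files[successor] -= moved
>       files[new_brick] = moved
>       return moved
>
> Removing a brick is the reverse: drop its token and hand all of its
> files to the successor found by the same brick_of() search.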
>
>
> Such a hashing technique is called "consistent hashing associated with
> the variable phi, which has a uniform distribution". This is a
> relatively new technique suggested by Karger et al (1*). The main
> advantage of this technique is that small changes in the set of bricks
> result in a small amount of rebalancing work. Namely, adding/removing 1
> brick results in moving only M/N files (2*) (M is the total number of
> files in the compound volume). This is M/(M/N) = N times better than
> with traditional hashing, where we need to move all M files (median
> value). In particular, if we have 100 bricks, then with traditional
> hashing we'll need to move 100 times more files than with the
> consistent one.
>
> To construct a uniformly distributed phi we can use any well-mixing
> 64-bit hash, e.g. an FNV hash.
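>
> For instance, a 64-bit FNV-1a hash (the standard published constants,
> shown here only as an example of such a phi) is just an xor and a
> multiply per byte:
>
>   FNV64_OFFSET = 0xcbf29ce484222325
>   FNV64_PRIME  = 0x100000001b3
>
>   def fnv1a_64(data: bytes) -> int:
>       # standard 64-bit FNV-1a, result in [0, 2^64)
>       h = FNV64_OFFSET
>       for byte in data:
>           h ^= byte
>           h = (h * FNV64_PRIME) & 0xFFFFFFFFFFFFFFFF
>       return h
>
> (The Appendix below uses an md5-based phi instead; any hash with good
> mixing works.)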
>
> Comment 1. The technique of consistent hashing is used in Amazon's
> Dynamo (4*).
>
> Comment 2. There is a disadvantage: in this approach all files
>
> /foo
> /dir1/foo
> /dir1/dir2/foo
> ...
>
> will be accumulated on the same brick. However, it is possible to
> "salt" short file names with the gfid (or another id) of the respective
> directory before hashing, to avoid this accumulation (and possible
> attacks).
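>
> A possible form of such salting (just a sketch; the exact id and
> separator are not important, and fnv1a_64 is the example hash from
> the sketch above):
>
>   def phi_salted(parent_gfid: str, short_name: str) -> int:
>       # mix the parent directory id into the key before hashing, so
>       # /foo, /dir1/foo, /dir1/dir2/foo get independent ring positions
>       return fnv1a_64((parent_gfid + "/" + short_name).encode())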
>
>
>
>                      2. Virtual nodes
>
>
>
> The theory above works well for a large number of bricks N. However,
> when N is too small (2-3 bricks), the uniform distribution can result
> in bad partitioning, so that one brick accumulates many more files than
> the other ones, which is not good. A graceful solution to this problem
> is the so-called technique of "virtual nodes": for every brick we set S
> tokens on the ring, where S is the number of "virtual components" of a
> brick. So, every brick is represented by S unique tokens on the ring
> (S >= 1, S is a parameter of the cluster translator).
>
> In the case of S > 1 the lookup-a-brick procedure above is not changed:
> the difference is that we search in a larger set of tokens (N*S), and
> since log(N*S) == log(N) + log(S) == log(N) + const, this search also
> scales as log(N), while with a larger number of tokens the
> distribution of files becomes more balanced (in terms of the standard
> deviation, see (3*) and the Appendix for details). In particular, S=100
> provides a deviation of ~10% of the mean.
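>
> A sketch of how the ring could be populated with S tokens per real
> brick (the "#i" suffix is only one possible way to derive S distinct
> token names; fnv1a_64 is the example hash from the sketch above):
>
>   def make_tokens(bricks, S):
>       # place S tokens per real brick; the lookup stays exactly as
>       # before, only over N*S sorted tokens instead of N
>       token_to_brick = {}
>       for brick in bricks:
>           for i in range(S):
>               token = fnv1a_64(("%s#%d" % (brick, i)).encode())
>               token_to_brick[token] = brick
>       return sorted(token_to_brick), token_to_brick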
>
> Adding a new brick with S > 1 looks like the above, with the difference
> that we steal files from S different old virtual bricks. Note that
> 2 or more of those virtual bricks can represent the same real brick,
> though. So adding a real brick with S virtual components requires
> (M/N)*S scans; however, the median number of files to be moved during
> re-balancing is the same (M/N) as in the case S == 1.
>
> Removing a brick with S > 1 virtual components mostly looks like the
> case S == 1: no scans are required. The difference is that we
> distribute the files of the brick to be removed among S virtual bricks
> (which correspond to <= S real bricks).
>
>
>
>                   3. Replication and Recovery
>
>
>
> To achieve high availability and durability we replicate files on
> multiple bricks. In our case replication can be implemented as a set
> of operations with the same ring R, so we don't create a separate
> translator for replication.
>
> We store every file in its so-called preference set of real bricks.
> The size R of this set is a volume option. R is also called the
> replication level (R = 1 means no replication: every file is stored
> on only a single brick).
>
> For every file F its preference set is defined as the set of "closest"
> virtual bricks B_(k_1), ..., B_(k_R), which represent pairwise
> different real bricks, such that B_(k_1) = B(F) and
> phi(B_(k_1)) < phi(B_(k_2)) < ... < phi(B_(k_R)).
>
> We don't create 2 replicas of the same file on the same real brick,
> so R shouldn't be larger than N.
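>
> A sketch of computing the preference set: walk the ring clockwise from
> phi(F) and collect the first R distinct real bricks (phi here is the
> same hash used to place the tokens; bisect as in the earlier sketch):
>
>   def preference_set(tokens, token_to_brick, fname, R):
>       # first R pairwise different real bricks clockwise from phi(F)
>       start = bisect.bisect_left(tokens, phi(fname))
>       bricks = []
>       for k in range(len(tokens)):
>           b = token_to_brick[tokens[(start + k) % len(tokens)]]
>           if b not in bricks:
>               bricks.append(b)
>               if len(bricks) == R:
>                   break
>       return bricks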
>
> If we enable replication (R>1), regular file operations become more
> complicated: every such operation is performed for all respective
> files located on all bricks of the preference set.
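>
> As a sketch of such a fan-out (brick_create() here is a hypothetical
> helper standing in for the real per-brick operation):
>
>   def create_replicated(tokens, token_to_brick, fname, R):
>       # every regular file operation is repeated on each brick of the
>       # preference set; create is shown, unlink etc. are analogous
>       for brick in preference_set(tokens, token_to_brick, fname, R):
>           brick_create(brick, fname)   # hypothetical per-brick call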
>
> Operations on a set of bricks also become more complicated, but
> scalability doesn't suffer. When adding a new brick X = B_(N+1), we
>
> 0) add a unique token phi(X) to the ring R.
>
> 1) find the R closest tokens B_(m_1), ..., B_(m_R), which represent
>   different real bricks in the ring, such that B_(m_R) == X
>   and phi(B_(m_1)) < ... < phi(B_(m_R)).
>
> 2) find the R+1 closest tokens B_(p_0), ..., B_(p_R), which represent
>   different real bricks in the ring, such that B_(p_0) == X
>   and phi(B_(p_0)) < ... < phi(B_(p_R)).
>
> 3) The new brick X steals a portion of the files of B_(p_1), as
>   described in section 1 above.
>
> 4) The brick B_(p_R) no longer belongs to the preference set of the
>   stolen files, so we need to remove all the respective replicas
>   from B_(p_R).
>
> 5) X becomes a brick belonging to the preference sets of the files
>   stored on the bricks B_(m_1), ..., B_(m_(R-1)), hence we should
>   create the respective replicas on X.
>
> So adding a new brick with replication level R > 1 results in
>
> a) moving a portion of files of one brick (step (3) above);
> b) replication of files located on R-1 bricks (step (5) above);
> c) deleting replicas of a portion of files of one brick (step (4)).
>
> (a), (b), (c) above can be considered as re-balancing of (R+1)*(M/N) =
> const*(M/N) files (when R == 1, (b) and (c) are absent, and we need
> to re-balance M/N files, as was shown in section 1 above).
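>
> For example, just plugging numbers into this estimate: with
> M = 1,000,000 files, N = 100 bricks and R = 3, adding one brick
> touches about (R+1)*(M/N) = 4 * 10,000 = 40,000 files, compared with
> on the order of all M = 1,000,000 files under traditional hashing.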
>
> Similarly, we can show that with replication level R, removing one
> brick also leads to re-balancing of const*(M/N) files.
>
>
> If in our configuration L < R bricks don't respond for some reason,
> then all regular file operations are still defined; however, our system
> is marked as "unhealthy" (in some papers this state is called "sloppy
> quorum"), non-responding bricks are marked as "missed" in the ring, and
> file operations are performed on the other available bricks of the
> preference set. In such operations, files on the available "non-
> primary" bricks are marked as "not synced with the missed replicas".
>
> In the state of sloppy quorum, operations like adding/removing a node
> can already be undefined. For example, when adding a brick we may need
> to steal files from a brick which doesn't respond.
>
> So we need to return the system to a "consistent" state, in which all
> operations are defined. This can be done in the following ways:
>
> 1) Make sure that all missed bricks are available again and perform
>   L1-recovery. L1-recovery means syncing all marked files with the
>   newly available bricks, so that the resulting consistent system will
>   have the same number N of bricks.
> 2) Add new empty bricks instead of the missed ones and perform
>   L2-recovery. It means filling the new empty bricks with files from
>   other bricks, so that the resulting consistent system will have the
>   same number N of bricks.
> 3) Remove the missed bricks from the ring and perform L3-recovery, so
>   that the resulting consistent system will have a smaller number of
>   nodes (N-L).
>
> Comment 1. For any missed brick we can specify a different type of
> recovery.
>
> Comment 2. When R == N, replication "clogs" the distribution: in this
> case our system will work like a set of mirrors: every brick will
> contain the same set of files.
>
>
>
>                            APPENDIX
>
>
>
> ------------------------------------------------------------------------
>
> In 3 distributed hash tables with different hashing techniques
>
> . GlusterFS DHT translator (3.2.5)
> . 64-bit ring with phi based on md5, R=1 (no replication), S=7
> . 64-bit ring with phi based on md5, R=1 (no replication), S=20
>
> we run the same scenario:
>
> 1) Create 100 files ("file00", "file01", ..., "file99") in a volume
>   composed of 9 bricks:
>
>   "host:/root/exp0",
>   "host:/root/exp1",
>   ...
>
>   "host:/root/exp8".
>
> 2) Add one brick "host:/root/exp9";
> 3) re-balance;
>
>
> I. GlusterFS DHT translator (3.2.5)
>
> ------------------------------------------------------------------------
>
> before re-balancing:
>
> exp0: file15  file22  file34  file35  file51  file6  file68  file78
>      file8   file81  file89  file94  file95
> exp1: file10  file28  file3   file4   file43  file66  file75
> exp2: file40  file46  file47  file48  file50  file58  file86  file9
> exp3: file12  file13  file32  file37  file54  file55  file7  file71
>      file91
> exp4: file31  file38  file41  file42  file53  file62  file63  file69
>      file93  file97
> exp5: file11  file16  file17  file24  file25  file27  file29  file44
>      file56  file73  file74  file80  file87  file90
> exp6: file0   file1   file2   file33  file36  file49  file57  file59
>      file64  file77  file79  file84  file85  file88  file98
> exp7: file21  file26  file39  file52  file61  file70  file72  file83
>      file92  file99
> exp8: file14  file20  file23  file30  file45  file5   file60  file65
>      file67  file76  file82  file96
>
> after re-balancing:
>
> exp0: file11  file16  file17  file24  file25  file31  file44  file53
>      file62  file69  file73  file80  file87  file93  file97
> exp1: file0   file27  file29  file33  file36  file56  file57  file64
>      file74  file77  file84  file88  file90  file98
> exp2: file1   file2   file39  file49  file59  file72  file79  file85
>      file92
> exp3: file21  file26  file30  file45  file52  file60  file61  file65
>      file70  file83  file99
> exp4: file14  file20  file23  file5   file67  file76  file82  file96
> exp5: file15  file22  file34  file35  file51  file6   file68  file78
>      file8   file81  file89  file94  file95
> exp6: file10  file28  file4   file43  file66  file75
> exp7: file3   file40  file47  file58  file9
> exp8: file12  file13  file32  file37  file46  file48  file50  file7
>      file71  file86
> exp9: file38  file41  file42  file54  file55  file63  file91
>
>
> As a result 98 files have been rebalanced (139 files scanned in total)
>
>
>
> II.  64-bit ring with phi based on md5.
>     Every brick has S=7 virtual components
>
> ------------------------------------------------------------------------
>
> before re-balancing:
>
> exp0: file02  file18  file22  file42  file48  file58  file62  file70
> exp1: file01  file08  file15  file23  file33  file51  file64  file75
>      file82  file85  file86  file87  file95
> exp2: file00  file10  file11  file14  file25  file29  file40  file45
>      file63  file81  file91  file96
> exp3: file09  file16  file19  file21  file24  file28  file32  file35
>      file36  file44  file47  file50  file52  file57  file67  file73
>      file88  file98
> exp4: file27  file49  file53  file55  file69  file97
> exp5: file05  file20  file43
> exp6: file34  file68  file72  file74  file79
> exp7: file03  file04  file26  file39  file41  file54  file60  file71
>      file77  file78  file83  file84  file89  file93  file94
> exp8: file06  file07  file12  file17  file30  file31  file37  file38
>      file46  file56  file59  file61  file65  file66  file76  file80
>      file90  file92  file99
>
> after re-balancing:
> only the following files have been moved (to exp9):
>
> exp0: file70
> exp1: file82  file85
> exp2: file00  file14  file25  file40  file45  file91  file96
> exp3: file88
>
> As a result 11 files have been rebalanced (51 files scanned in total)
>
>
>
> III.  64-bit ring with phi based on md5.
>      Every brick has S=20 virtual components
>
> ------------------------------------------------------------------------
>
> before re-balancing:
>
> exp0: file00  file04  file22  file30  file40  file42  file70  file96
> exp1: file06  file08  file13  file15  file17  file19  file23  file32
>      file33  file78  file81  file86  file95
> exp2: file11  file14  file16  file24  file29  file57  file63  file67
>      file73  file76
> exp3: file02  file10  file12  file18  file21  file28  file35  file51
>      file56  file59  file80  file87
> exp4: file39  file41  file49  file53  file54  file62  file69  file77
>      file83  file84  file91
> exp5: file05  file20  file31  file43  file68  file74
> exp6: file34  file37  file38  file46  file48  file58  file66  file71
>      file75  file79  file88
> exp7: file03  file07  file26  file47  file60  file72  file89  file93
>      file94
> exp8: file01  file09  file25  file27  file36  file44  file45  file50
>      file52  file55  file59  file61  file64  file65  file82  file85
>      file90  file92  file97  file98  file99
>
> after re-balancing:
> only the following files have been moved (to exp9):
>
> exp0: file70
> exp6: file88
> exp7: file07  file93
> exp8: file45  file82  file85
>
> As a result 7 files have been rebalanced (48 files scanned in total)
>
>
> ------------------------------------------------------------------------
>
>                        Table with results
>
>          Re-balance results and standard deviations (sd)
>          for current gluster DHT translator (3.2.5), and
>          for 64-bit rings with S=7 and S=20.
>
>
> ------------------------------------------------------------------------
>
>               DHT(Gluster 3.2.5)  64-bit ring, S=7  64-bit ring, S=20
>
> ------------------------------------------------------------------------
>
> files moved            98                 11                 7
>
> files scanned         139                 51                48
>
> sd before             2.8                5.4               3.7
>
> sd after              3.2                5.3               3.2
>
> ------------------------------------------------------------------------
>
>
>
> ----------
>
>
> (0*) http://en.wikipedia.org/wiki/Uniform_distribution_%28discrete%29
> (1*) Karger, D.; Lehman, E.; Leighton, T.; Panigrahy, R.; Levine, M.;
>      Lewin, D. (1997). "Consistent Hashing and Random Trees:
>      Distributed Caching Protocols for Relieving Hot Spots on the
>      World Wide Web"
> (2*) M/N is a median value
> (3*) http://weblogs.java.net/blog/tomwhite/archive/2007/11/consistent_hash.html
> (4*) http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html
>
> _______________________________________________
> Gluster-devel mailing list
> Gluster-devel at nongnu.org
> https://lists.nongnu.org/mailman/listinfo/gluster-devel
>

