[Gluster-devel] [RFC] Improved distribution
Anand Avati
anand.avati at gmail.com
Tue Apr 17 08:29:36 UTC 2012
Hi Edward,
- It will be interesting to see the comparison with the rebalance
enhancements which have gone into 3.3 (to get a sense of urgency).
- Even though the proposed scheme is not compatible with existing DHT
(+AFR), I can see ways in which it can be backward compatible with existing
volumes if care is taken in implementation details.
Avati
On Mon, Apr 16, 2012 at 4:57 PM, Edward Shishkin <edward at redhat.com> wrote:
> Hello GlusterFS developers.
>
> We have found that the current DHT translator is suboptimal: the
> number of files moved during re-balancing (caused by small changes
> in the set of bricks) can be significantly reduced (see Appendix).
>
> To be precise, we can improve scalability: in the current DHT the
> amount of re-balancing work scales as O(M) (M is the total number of
> files in the compound volume), whereas after changing the hashing
> technique it will scale as O(M/N) (N is the number of bricks).
>
> In the document below we first consider simple tables (section 1)
> and estimate the minimal amount of rebalance work for them. Then we
> extend them with the techniques of virtual nodes (section 2) and
> replication (section 3), and show that this doesn't worsen
> scalability.
>
> Unfortunately it is impossible to perform these improvements without
> a format change, so it would be a new translator, which won't
> understand layouts created by the current DHT (and vice versa).
>
> We will be happy to see any feedback on this and, if everything is
> OK, to proceed with development in this direction (implementation
> details, etc).
>
>
> Thanks,
> Edward.
>
>
> Legend:
>
>
> C - namespace;
> R - 64-bit ring;
> N - number of real bricks that compose the volume;
> M - number of files in the compound volume;
> S - number of virtual components of any real brick;
> R - size of the preference set (replication level); the symbol R is
>     overloaded, but context always disambiguates;
>
>
>
> 1. Simple DH tables based on consistent hashing
>
>
>
> We consider a 64-bit ring R, i.e. a regular 2^64-gon with vertices
> 0, ..., 2^64 - 1. "Ring" means that for every vertex A there exist
> vertices B, C of R such that C < A < B. Here "<" and "<=" denote
> relations between the respective angles (for any vertex Z we
> consider the angle between the rays O0 and OZ, where O is the center
> of the polygon).
>
> Then we consider a namespace C and any mapping phi: C -> R such that
> for every sequence {c1, c2, ...} of different names, {phi(c1),
> phi(c2), ...} is a pseudo-random variable with uniform distribution
> on R (see 0*).
>
>
> Suppose we have a volume composed of N bricks with unique names
> B_1, B_2, ..., B_N. During system initialization we construct for
> this compound volume N respective unique tokens phi(B_1), ...,
> phi(B_N) in the ring R, caching them, say, in an rb-tree to preserve
> ordering.
>
> When creating a directory (mkdir(2)) we create respective directories
> on all bricks B_1, B_2, ... B_N.
>
> When creating a regular file (creat(2), etc.) with a short name F,
> we create a respective file on only one brick B = B(F), determined
> in the following way: phi(B) is the minimal token in the ring such
> that phi(F) <= phi(B). This is where the notion of ring works: if
> there is no such token in [phi(F), 2^64 - 1], then we continue the
> search from vertex 0.
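>
> Below is a minimal, self-contained sketch of this construction (in
> Python, used here purely for illustration): phi is modeled by
> truncating md5 to 64 bits, and a sorted list stands in for the
> rb-tree. All names (Ring, brick_of, ...) are illustrative, not taken
> from the GlusterFS code base:
>
>     import bisect
>     import hashlib
>
>     def phi(name):
>         # Map a name to a vertex of the 64-bit ring: take the first
>         # 8 bytes of md5 (any well-mixing 64-bit hash would do).
>         return int.from_bytes(hashlib.md5(name.encode()).digest()[:8],
>                               'big')
>
>     class Ring:
>         def __init__(self, bricks):
>             # (token, brick) pairs kept sorted; this plays the role
>             # of the rb-tree mentioned above.
>             self.tokens = sorted((phi(b), b) for b in bricks)
>
>         def brick_of(self, fname):
>             # Minimal token with phi(fname) <= token; wrap around to
>             # the first token if none exists -- this is exactly where
>             # the notion of ring works.
>             i = bisect.bisect_left(self.tokens, (phi(fname),))
>             return self.tokens[i % len(self.tokens)][1]
>
> For example, Ring(["exp%d" % i for i in range(9)]).brick_of("file42")
> returns the brick that should hold "file42".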
>
> Lookup operation for any regular file F resolves to lookup(F) on the
> respective brick B(F).
>
> Deleting a regular file F resolves to deleting a file from the brick
> B(F).
>
> Looking up a brick (i.e. the calculation F -> B(F)) is a
> well-scalable operation: log(N) steps are required.
>
> When adding a new brick X = B_(N+1) we place a new unique token
> phi(X) on the ring R and move a subset of files from B_j to X, where
> B_j is the brick with the smallest phi(B_j) such that
> phi(X) < phi(B_j). Namely, every file W of B_j with phi(W) <= phi(X)
> should be moved to X. That said, the number of files moved during
> re-balancing is not larger than the number of files contained in one
> brick (B_j in our case).
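>
> Continuing the Python sketch above, adding a brick can be expressed
> so that only the successor brick B_j is scanned (files_on is an
> illustrative brick -> set-of-files map, not a proposed interface):
>
>     def add_brick(ring, files_on, new_brick):
>         t = phi(new_brick)
>         # B_j: the brick whose token is the smallest one following
>         # the new token (with wrap-around).
>         i = bisect.bisect_left(ring.tokens, (t,))
>         succ = ring.tokens[i % len(ring.tokens)][1]
>         bisect.insort(ring.tokens, (t, new_brick))
>         # Only the files of B_j are scanned; a file moves iff the
>         # new brick is now its owner, i.e. phi(W) <= phi(X) on the
>         # stolen arc.
>         moved = {f for f in files_on[succ]
>                  if ring.brick_of(f) == new_brick}
>         files_on[succ] -= moved
>         files_on[new_brick] = moved
>         return moved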
>
> When removing a brick Y = B_k, we first find in R the "closest"
> brick B_s, i.e. the one with minimal phi(B_s) such that
> phi(Y) < phi(B_s), and move all files from Y to B_s (no scan is
> required).
>
>
> Such a hashing technique is called "consistent hashing associated
> with the variable phi, which has uniform distribution". This is a
> relatively new technique suggested by Karger et al (1*). The main
> advantage of this technique is that small changes in the set of
> bricks result in a small amount of rebalancing work. Namely,
> adding/removing 1 brick results in moving only M/N files (2*) (M is
> the total number of files in the compound volume). This is
> M/(M/N) = N times better than with traditional hashing, where we
> need to move all M files (median value). In particular, if we have
> 100 bricks, then with traditional hashing we'll need to move 100
> times more files than with the consistent one.
>
> To construct a uniform distribution phi we can take any well-mixing
> 64-bit hash, say FNV, etc.
>
> Comment 1. The technique of consistent hashing is used in Amazon's
> Dynamo (4*).
>
> Comment 2. There is a disadvantage: in this approach all files
>
> /foo
> /dir1/foo
> /dir1/dir2/foo
> ...
>
> will accumulate on the same brick. However, it is possible to "salt"
> the short file name with the gfid (or another id) of the respective
> directory before hashing, to avoid this accumulation and possible
> attacks.
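>
> One possible form of this salting (the gfid-based key below is just
> an assumption for illustration, not a fixed on-disk format):
>
>     def phi_salted(dir_gfid, name):
>         # Mix the parent directory's gfid into the key, so that
>         # /foo, /dir1/foo, /dir1/dir2/foo hash to independent points.
>         return phi(dir_gfid + "/" + name)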
>
>
>
> 2. Virtual nodes
>
>
>
> The theory above works well for a large number of bricks N. However,
> when N is too small (2-3 bricks), the uniform distribution can
> result in bad partitioning, so that one brick accumulates many more
> files than the others, which is not good. A graceful solution to
> this problem is the so-called technique of "virtual nodes": for
> every brick we place S tokens on the ring, where S is the number of
> "virtual components" of the brick. So every brick is represented by
> S unique tokens on the ring (S >= 1, S is a parameter of the cluster
> translator).
>
> In the case of S > 1 the lookup-a-brick procedure above is not
> changed; the difference is that we search in a larger set of tokens
> (N*S), and since log(N*S) == log(N) + log(S) == log(N) + const, this
> search still scales as log(N), while with a larger number of tokens
> the distribution of files becomes more balanced (in terms of the
> standard deviation; see (3*) and the Appendix for details). In
> particular, S=100 gives a deviation of ~10% of the mean.
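>
> A sketch of the virtual-node variant, continuing the Python
> illustration above (the "brick#k" token naming is our assumption);
> note that brick_of is literally the same search as before, only over
> N*S tokens:
>
>     class VirtualRing:
>         def __init__(self, bricks, S):
>             # S tokens per real brick; each token remembers which
>             # real brick it represents.
>             self.tokens = sorted((phi("%s#%d" % (b, k)), b)
>                                  for b in bricks for k in range(S))
>
>         def brick_of(self, fname):
>             i = bisect.bisect_left(self.tokens, (phi(fname),))
>             return self.tokens[i % len(self.tokens)][1]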
>
> Adding a new brick with S > 1 looks like the above, with the
> difference that we steal files from S different old virtual bricks.
> Note that 2 or more of those virtual bricks can represent the same
> real brick, though. So adding a real brick with S virtual components
> requires (M/N)*S scans; however, the median number of files moved
> during re-balancing is the same (M/N) as in the case of S == 1.
>
> Removing a brick with S > 1 virtual components mostly looks like the
> case of S == 1: no scan is required. The difference is that we
> distribute the files of the brick being removed among S virtual
> bricks (which correspond to <= S real bricks).
>
>
>
> 3. Replication and Recovery
>
>
>
> To achieve high availability and durability we replicate files on
> multiple bricks. In our case replication can be implemented as a set
> of operations on the same ring R, so we don't create a separate
> translator for replication.
>
> We store every file in its so-called preference set of real bricks.
> The size R of this set is a volume option. R is also called the
> replication level (R = 1 means no replication: every file is stored
> on only a single brick).
>
> For every file F its preference set is defined as the set of
> "closest" virtual bricks B_(k_1), ..., B_(k_R), which represent
> pairwise different real bricks, such that B_(k_1) = B(F) and
> phi(B_(k_1)) < phi(B_(k_2)) < ... < phi(B_(k_R)).
>
> We don't create 2 replicas of the same file on the same real brick,
> so R shouldn't be larger than N.
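>
> The preference set can be computed by walking the ring clockwise
> from phi(F) until R pairwise different real bricks have been seen
> (a sketch continuing the Python illustration above):
>
>     def preference_set(ring, fname, R):
>         n = len(ring.tokens)
>         i = bisect.bisect_left(ring.tokens, (phi(fname),))
>         result = []
>         for j in range(n):
>             b = ring.tokens[(i + j) % n][1]
>             if b not in result:
>                 result.append(b)
>             if len(result) == R:
>                 break
>         return result
>
> The walk terminates with R bricks precisely because R <= N.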
>
> If we enable replication (R>1), regular file operations become more
> complicated: every such operation is performed for all respective
> files located on all bricks of the preference set.
>
> Operations on a set of bricks also become more complicated, but
> scalability doesn't suffer. When adding a new brick X = B_(N+1), we
>
> 0) place a unique token phi(X) on the ring R.
>
> 1) find R closest tokens B_(m_1), ..., B_(m_R), which represent
> different real bricks in the ring, so that B_(m_R) == X,
> and phi(B_(m_1)) < ... < phi(B_(m_R)).
>
> 2) find R+1 closest tokens B_(p_0), ..., B_(p_R), which represent
> different real bricks in the ring, so that B_(p_0) == X,
> and phi(B_(p_0)) < ... < phi(B_(p_R)).
>
> 3) The new brick X steals a portion of the files of B_(p_1), as
> described in section 1 above.
>
> 4) The brick B_(p_R) no longer belongs to the preference set of
> the stolen files, so we need to remove all the respective replicas
> from B_(p_R).
>
> 5) X now belongs to the preference sets of the files stored on the
> bricks B_(m_1), ..., B_(m_(R-1)), hence we should create
> respective replicas on X.
>
> So adding a new brick with replication level R > 1 results in
>
> a) moving a portion of the files of one brick (step (3) above);
> b) replicating files located on R-1 bricks (step (5) above);
> c) deleting replicas of a portion of the files of one brick (step (4)).
>
> (a), (b), (c) above can be considered as re-balancing of
> (R+1)*(M/N) = const*(M/N) files (when R == 1, (b) and (c) are
> absent, and we need to re-balance M/N files, as shown in section 1
> above).
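>
> Steps (3), (4) and (5) can be stated as one uniform rule: a brick
> gains a replica of F iff it entered F's preference set, and drops
> one iff it left it. The naive sketch below diffs the preference sets
> over all files for clarity; a real implementation would only scan
> the arcs adjacent to the new token, giving the const*(M/N) bound:
>
>     def rebalance_plan(files, old_ring, new_ring, R):
>         plan = []
>         for f in files:
>             old = set(preference_set(old_ring, f, R))
>             new = set(preference_set(new_ring, f, R))
>             for b in sorted(new - old):
>                 plan.append(("create replica", f, b))
>             for b in sorted(old - new):
>                 plan.append(("drop replica", f, b))
>         return plan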
>
> Similarly we can show that with replication level R, removing one
> brick also leads to re-balancing of const*(M/N) files.
>
>
> If in our configuration L < R bricks don't respond for some reason,
> then all regular file operations are still defined; however, our
> system is marked as "unhealthy" (in some papers this state is called
> "sloppy quorum"), non-responding bricks are marked as "missed" in
> the ring, and file operations are performed on the other available
> bricks of the preference set. In such operations files on the
> available "non-primary" bricks are marked as "not synced with the
> missed replicas".
>
> In the state of sloppy quorum, operations like adding/removing a
> node may already be undefined. For example, when adding a brick we
> would need to steal files from a brick which doesn't respond.
>
> So we need to return the system to a "consistent" state, in which
> all operations are defined. This can be done in one of the following
> ways:
>
> 1) Make sure that all missed bricks are available again and perform
> L1-recovery. L1-recovery means syncing all marked files with the
> again-available bricks, so that the resulting consistent system
> has the same number N of bricks.
> 2) Add new empty bricks instead of the missed ones and perform
> L2-recovery. It means filling the new empty bricks with files from
> the other bricks, so that the resulting consistent system has the
> same number N of bricks.
> 3) Remove the missed bricks from the ring and perform L3-recovery,
> so that the resulting consistent system has a smaller number of
> nodes (N-L).
>
> Comment 1. For any missed brick we can specify a different type of
> recovery.
>
> Comment 2. When R == N, replication "clogs" the distribution: in
> this case our system works like mirrors: every brick contains the
> same set of files.
>
>
>
> APPENDIX
>
>
>
> ------------------------------------------------------------------------
>
> In 3 distributed hash tables with different hashing techniques
>
> . GlusterFS DHT translator (3.2.5)
> . 64-bit ring with phi based on md5, R=1 (no replication), S=7
> . 64-bit ring with phi based on md5, R=1 (no replication), S=20
>
> we run the same scenario:
>
> 1) Create 100 files ("file00", "file01", ..., "file99") in a volume
> composed of 9 bricks:
>
> "host:/root/exp0",
> "host:/root/exp1",
> ...
>
> "host:/root/exp8".
>
> 2) Add one brick "host:/root/exp9";
> 3) re-balance;
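>
> The ring-based runs (II and III below) can be reproduced with a few
> lines built on the sketches above; exact counts depend on the hash
> and token-naming details of the sketch, so the numbers will differ
> slightly from the tables below:
>
>     bricks = ["host:/root/exp%d" % i for i in range(9)]
>     files = ["file%02d" % i for i in range(100)]
>     for S in (7, 20):
>         before = VirtualRing(bricks, S)
>         after = VirtualRing(bricks + ["host:/root/exp9"], S)
>         moved = [f for f in files
>                  if before.brick_of(f) != after.brick_of(f)]
>         print("S=%d: %d files moved" % (S, len(moved)))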
>
>
> I. GlusterFS DHT translator (3.2.5)
>
> ------------------------------------------------------------------------
>
> before re-balancing:
>
> exp0: file15 file22 file34 file35 file51 file6 file68 file78
> file8 file81 file89 file94 file95
> exp1: file10 file28 file3 file4 file43 file66 file75
> exp2: file40 file46 file47 file48 file50 file58 file86 file9
> exp3: file12 file13 file32 file37 file54 file55 file7 file71
> file91
> exp4: file31 file38 file41 file42 file53 file62 file63 file69
> file93 file97
> exp5: file11 file16 file17 file24 file25 file27 file29 file44
> file56 file73 file74 file80 file87 file90
> exp6: file0 file1 file2 file33 file36 file49 file57 file59
> file64 file77 file79 file84 file85 file88 file98
> exp7: file21 file26 file39 file52 file61 file70 file72 file83
> file92 file99
> exp8: file14 file20 file23 file30 file45 file5 file60 file65
> file67 file76 file82 file96
>
> after re-balancing:
>
> exp0: file11 file16 file17 file24 file25 file31 file44 file53
> file62 file69 file73 file80 file87 file93 file97
> exp1: file0 file27 file29 file33 file36 file56 file57 file64
> file74 file77 file84 file88 file90 file98
> exp2: file1 file2 file39 file49 file59 file72 file79 file85
> file92
> exp3: file21 file26 file30 file45 file52 file60 file61 file65
> file70 file83 file99
> exp4: file14 file20 file23 file5 file67 file76 file82 file96
> exp5: file15 file22 file34 file35 file51 file6 file68 file78
> file8 file81 file89 file94 file95
> exp6: file10 file28 file4 file43 file66 file75
> exp7: file3 file40 file47 file58 file9
> exp8: file12 file13 file32 file37 file46 file48 file50 file7
> file71 file86
> exp9: file38 file41 file42 file54 file55 file63 file91
>
>
> As a result, 98 files have been rebalanced (139 files scanned in
> total).
>
>
>
> II. 64-bit ring with phi based on md5.
> Every brick has S=7 virtual components
>
> ------------------------------------------------------------------------
>
> before re-balancing:
>
> exp0: file02 file18 file22 file42 file48 file58 file62 file70
> exp1: file01 file08 file15 file23 file33 file51 file64 file75
> file82 file85 file86 file87 file95
> exp2: file00 file10 file11 file14 file25 file29 file40 file45
> file63 file81 file91 file96
> exp3: file09 file16 file19 file21 file24 file28 file32 file35
> file36 file44 file47 file50 file52 file57 file67 file73
> file88 file98
> exp4: file27 file49 file53 file55 file69 file97
> exp5: file05 file20 file43
> exp6: file34 file68 file72 file74 file79
> exp7: file03 file04 file26 file39 file41 file54 file60 file71
> file77 file78 file83 file84 file89 file93 file94
> exp8: file06 file07 file12 file17 file30 file31 file37 file38
> file46 file56 file59 file61 file65 file66 file76 file80
> file90 file92 file99
>
> after re-balancing:
> only the following files have been moved (to exp9):
>
> exp0: file70
> exp1: file82 file85
> exp2: file00 file14 file25 file40 file45 file91 file96
> exp3: file88
>
> As a result, 11 files have been rebalanced (51 files scanned in
> total).
>
>
>
> III. 64-bit ring with phi based on md5.
> Every brick has S=20 virtual components
>
> ------------------------------------------------------------------------
>
> before re-balancing:
>
> exp0: file00 file04 file22 file30 file40 file42 file70 file96
> exp1: file06 file08 file13 file15 file17 file19 file23 file32
> file33 file78 file81 file86 file95
> exp2: file11 file14 file16 file24 file29 file57 file63 file67
> file73 file76
> exp3: file02 file10 file12 file18 file21 file28 file35 file51
> file56 file59 file80 file87
> exp4: file39 file41 file49 file53 file54 file62 file69 file77
> file83 file84 file91
> exp5: file05 file20 file31 file43 file68 file74
> exp6: file34 file37 file38 file46 file48 file58 file66 file71
> file75 file79 file88
> exp7: file03 file07 file26 file47 file60 file72 file89 file93
> file94
> exp8: file01 file09 file25 file27 file36 file44 file45 file50
> file52 file55 file59 file61 file64 file65 file82 file85
> file90 file92 file97 file98 file99
>
> after re-balancing:
> only the following files have been moved (to exp9):
>
> exp0: file70
> exp6: file88
> exp7: file07 file93
> exp8: file45 file82 file85
>
> As a result, 7 files have been rebalanced (48 files scanned in
> total).
>
>
> ------------------------------------------------------------------------
>
> Table with results
>
> Re-balance results and standard deviations (sd)
> for the current gluster DHT translator (3.2.5), and
> for 64-bit rings with S=7 and S=20.
>
> ------------------------------------------------------------------------
>
>                  DHT (Gluster 3.2.5)  64-bit ring, S=7  64-bit ring, S=20
> ------------------------------------------------------------------------
> files moved              98                  11                  7
> files scanned           139                  51                 48
> sd before               2.8                 5.4                3.7
> sd after                3.2                 5.3                3.2
> ------------------------------------------------------------------------
>
>
>
> ----------
>
>
> (0*) http://en.wikipedia.org/wiki/Uniform_distribution_%28discrete%29
> (1*) Karger, D.; Lehman, E.; Leighton, T.; Panigrahy, R.; Levine, M.;
>      Lewin, D. (1997). "Consistent Hashing and Random Trees:
>      Distributed Caching Protocols for Relieving Hot Spots on the
>      World Wide Web"
> (2*) M/N is a median value
> (3*) http://weblogs.java.net/blog/tomwhite/archive/2007/11/consistent_hash.html
> (4*) http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html
>
> _______________________________________________
> Gluster-devel mailing list
> Gluster-devel at nongnu.org
> https://lists.nongnu.org/mailman/listinfo/gluster-devel
>