[Gluster-devel] [RFC ] dictionary optimizations

Xavier Hernandez xhernandez at datalab.es
Wed Sep 4 08:10:22 UTC 2013

Al 04/09/13 02:55, En/na Anand Avati ha escrit:
> On Tue, Sep 3, 2013 at 1:42 AM, Xavier Hernandez 
> <xhernandez at datalab.es <mailto:xhernandez at datalab.es>> wrote:
>     Al 03/09/13 09:33, En/na Anand Avati ha escrit:
>>     On Mon, Sep 2, 2013 at 7:24 AM, Xavier Hernandez
>>     <xhernandez at datalab.es <mailto:xhernandez at datalab.es>> wrote:
>>         Hi,
>>         dict_t structures are widely used in glusterfs. I've some
>>         ideas that could improve its performance.
>>         * On delete operations, return the current value if it exists.
>>         This is very useful when we want to get a value and remove it
>>         from the dictionary. This way it can be done accessing and
>>         locking the dict_t only once (and it is atomic).
>>     Makes sense.
>>         * On add operations, return the previous value if it existed.
>>         This avoids to use a lookup and a conditional add (and it is
>>         atomic).
>>     Do you mean dict_set()? If so, how do you propose we
>>     differentiate between "failure" and "previous value did not
>>     exist"? Do you propose setting the previous value into a pointer
>>     to pointer, and retain the return value as is today?
>     Yes, I'm thinking to something similar to dict_set() (by the way,
>     I would remove the dict_add() function).
> dict_add() is used in unserialization routines where dict_set() for a 
> big set of keys guaranteed not to repeat is very expensive 
> (unserializing would otherwise have a quadratic function as its 
> asymptote). What is the reason you intend to remove it?
Yes, but it is used only inside dict.c itself. It should not be 
published in the dict.h. This is only a cosmetic change but I think it 
would be better. Any other use of dict_add() by a translator will be 
incorrect unless it is made with much care, which is not desirable for 
future maintenance.

>     What you propose would be the simplest solution right now. However
>     I think it would be interesting to change the return value to an
>     error code (this would supply more detailed information in case of
>     failure and we could use EEXIST to know if the value already
>     existed. In fact I think it would be interesting to progressively
>     change the -1 return code of many functions by an error code). The
>     pointer to pointer argument could be NULL if the previous value is
>     not needed.
>     Of course this would change the function signature, breaking a lot
>     of existing code. Another possibility could be to create a
>     dict_replace() function, and possibly make it to fail if the value
>     didn't exist.
> It is best we do not change the meaning of existing APIs, and just add 
> new APIs instead. The new API can be:
> int dict_replace (dict_t *dict, const char *key, data_t *newval, 
> data_t **oldval);
> .. and leave dict_set() as is.
That would be good. I would allow that dict_replace() sets *oldval to 
NULL to indicate that the key did not exist, and maintain the 0/-1 
return code to indicate error (this will maintain homogeneity with other 
APIs, though I still think that an error code would be more useful in 
general, but this would be another topic).

>>         * Always return the data_pair_t structure instead of data_t
>>         or the data itself.
>>         This can be useful to avoid future lookups or other
>>         operations on the same element. Macros can be created to
>>         simplify writing code to access the actual value.
>>     The use case is not clear. A more concrete example will help..
>     Having a data_pair_t could help to navigate from an existing
>     element (getting next or previous. This is really interesting if
>     dict where implemented using a sorted structure like a trie since
>     it would allow to process a set of similar entries very fast, like
>     the trusted.afr.<brick> values for example) or removing or
>     replacing it without needing another lookup (a more detailed
>     analysis would be needed to see how to handle race conditions).
>     By the way, is really the dict_t structure used concurrently ? I
>     haven't analyzed all the code deeply, but it seems to me that
>     every dict_t is only accessed from a single place at once.
> There have been instances of dict_t getting used concurrently, when 
> used as xdata and in xattrop (by AFR). There have been bugs in the 
> past with concurrent dict access.
I missed that. Sorry.

>>         * Use a trie instead of a hash.
>>         A trie structure is a bit more complex than a hash, but only
>>         processes the key once and does not need to compute the hash.
>>         A test implementation I made with a trie shows a significant
>>         improvement in dictionary operations.
>>     There is already an implementation of trie in
>>     libglusterfs/src/trie.c. Though it does not compact (collapse)
>>     single-child nodes upwards into the parent. In any case, let's
>>     avoid having two implementations of tries.
>     I know. The current implementation wastes a lot of memory because
>     it uses an array of 256 pointers, and in some places it needs to
>     traverse the array. Not a b¡g deal, but if it is made many times
>     it could be noticeable. In my test I used a trie with 4 child
>     pointers (with collapsing single-child nodes) that runs a bit
>     faster than the 256 implementation and uses much less memory. I
>     tried with 2, 4, 16 and 256 childs per node, and 4 seems to be the
>     best (at least for dictionary structures) though there are very
>     little difference between 4 and 16 in terms of speed.
> The 256 child pointers give you constant time lookup for the next 
> level child with just an offset indirection. With smaller fan-out, do 
> you search through the list? Can you show an example of this? 
> Collapsing single child node upwards is badly needed though.
For the case of 4 childs I split each byte in 4 elements of 2 bits. This 
makes it very easy to access the next child. It only needs some basic 
logic operations. This is the current implementation for the lookup 
function. It shows how it is accessed:

  *    1: 1 bit per element, 8 elements per byte
  *    2: 2 bits per element, 4 elements per byte
  *    3: 4 bits per element, 2 elements per byte
  *    4: 8 bits per element, 1 element per byte

#define TRIE_DIMENSION      2
#define TRIE_ELEM_BITS      (1 << (TRIE_DIMENSION - 1))
#define TRIE_CHILDS         (1 << TRIE_ELEM_BITS)
#define TRIE_ELEM_MASK      (TRIE_CHILDS - 1)

#define sys_trie_value(_data, _offs) \
     (((_data) >> (((_offs) & TRIE_INDEX_MASK) * TRIE_ELEM_BITS)) & \

#define sys_trie_value_idx(_data, _offs) \
     ({ \
         off_t __tmp_offs = (_offs); \
         sys_trie_value(((_data)[(__tmp_offs) >> TRIE_INDEX_BITS]), \
                        __tmp_offs); \

typedef struct _sys_trie_node
     struct _sys_trie_node * childs[TRIE_CHILDS];
     struct _sys_trie_node * parent;
     void *                  data;
     uint32_t                count;
     uint32_t                length;
     uint8_t                 key[0];
} sys_trie_node_t;

typedef struct _sys_trie
     sys_trie_node_t root;
} sys_trie_t;

sys_trie_node_t * sys_trie_lookup(sys_trie_t * trie,
                                   const uint8_t * key,
                                   size_t length)
     sys_trie_node_t * node;
     size_t len;

     len = length << TRIE_INDEX_BITS;
     for (node = trie->root.childs[sys_trie_value(*key, 0)];
          (node != NULL) && (node->length < len);
          node = node->childs[sys_trie_value_idx(key, node->length)]);

     if ((node != NULL) && (node->length == len) && (node->data != NULL) &&
         (memcmp(node->key, key, length) == 0))
         return node;
     return NULL;

It's true that the collapsing feature is not really used for 
dictionaries, however I think it would be interesting to have it to use 
tries for other purposes that may require a long duration data structure 
that eventually adds and removes items.

>     I agree that it is not good to maintain two implementations of the
>     same thing. Maybe we could change the trie implementation. It
>     should be transparent.
> Yes, I believe the current API can accommodate such internal changes.
>>         * Implement dict_foreach() as a macro (similar to kernel's
>>         list_for_each()).
>>         This gives more control and avoids the need of helper functions.
>>     This makes sense too, but there are quite a few users of
>>     dict_foreach in the existing style. Moving them all over might be
>>     a pain.
>     Maybe we could create a differently named macro to implement this
>     feature and allow the developers to slowly change it. The old
>     implementation could be flagged as deprecated and use the new one
>     for new code. Old code will have enough time to change it until
>     eventually the old implementation is removed.
>     If we make important changes to the dict_t structure, we could
>     replace current functions by macros that use the new
>     implementation but simulates the old behavior.
> Sounds OK.
>>         Additionally, I think it's possible to redefine structures to
>>         reduce the number of allocations and pointers used for each
>>         element (actual data, data_t, data_pair_t and key).
>>     This is highly desirable. There was some effort from Amar in the
>>     past (http://review.gluster.org/3910) but it has been in need of
>>     attention for some time. It would be intersting to know if you
>>     were thinking along similar lines?
>     Yes, it is quite similar though I should analyze it more deeply. I
>     would also try to remove some unused/unneeded fields that are used
>     in very few places, add complexity and can be replaced easily,
>     like extra_free and extra_stdfree in dict_t for example.
> Thanks,
> Avati
> _______________________________________________
> Gluster-devel mailing list
> Gluster-devel at nongnu.org
> https://lists.nongnu.org/mailman/listinfo/gluster-devel

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://supercolony.gluster.org/pipermail/gluster-devel/attachments/20130904/857c5ee3/attachment-0001.html>

More information about the Gluster-devel mailing list