[Gluster-devel] [RFC] dictionary optimizations

Xavier Hernandez xhernandez at datalab.es
Wed Sep 4 08:27:24 UTC 2013


I would also like to note that each node can cover multiple key 
elements. The current implementation creates a node for each byte of 
the key; in my implementation a node is only created where two or more 
keys share a common prefix (i.e. at branch points). This reduces the 
number of nodes and the number of indirections.

Each node stores the full key from the beginning, so a lookup can be 
confirmed with a single memcmp().

In normal situations this implementation is very fast and very 
space-efficient.
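
To illustrate the splitting idea, here is a minimal sketch (it works on
whole bytes for simplicity, while the real code works on 2-bit
elements): an intermediate node is only created at the point where a
new key diverges from a key that is already stored.

    /* Length, in bytes, of the common prefix between the key stored
     * in a node and a new key being inserted. */
    static size_t common_prefix(const uint8_t *stored, size_t stored_len,
                                const uint8_t *key, size_t key_len)
    {
        size_t i, max;

        max = (stored_len < key_len) ? stored_len : key_len;
        for (i = 0; (i < max) && (stored[i] == key[i]); i++);

        return i;
    }

If the common prefix is shorter than the stored key, a single new node
covering just the shared prefix is created and both keys hang from it;
otherwise no extra node is needed. This keeps the number of nodes
proportional to the number of keys (branch points) instead of the total
key length.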

Greetings,

Xavi

On 04/09/13 10:10, Xavier Hernandez wrote:
> On 04/09/13 02:55, Anand Avati wrote:
>>
>> On Tue, Sep 3, 2013 at 1:42 AM, Xavier Hernandez 
>> <xhernandez at datalab.es> wrote:
>>
>>     On 03/09/13 09:33, Anand Avati wrote:
>>>     On Mon, Sep 2, 2013 at 7:24 AM, Xavier Hernandez
>>>     <xhernandez at datalab.es> wrote:
>>>
>>>         Hi,
>>>
>>>         dict_t structures are widely used in glusterfs. I've some
>>>         ideas that could improve its performance.
>>>
>>>         * On delete operations, return the current value if it exists.
>>>
>>>         This is very useful when we want to get a value and remove
>>>         it from the dictionary. This way it can be done by accessing
>>>         and locking the dict_t only once, and the operation is atomic.
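>>>
>>>         As a sketch (the name and exact signature are only
>>>         illustrative, not an existing API), the call could look like:
>>>
>>>             int dict_remove (dict_t *this, char *key, data_t **value);
>>>
>>>         returning 0 with the removed value stored in *value when the
>>>         key existed (value may be NULL if the caller does not need
>>>         it), and -1 otherwise.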
>>>
>>>
>>>     Makes sense.
>>>
>>>         * On add operations, return the previous value if it existed.
>>>
>>>         This avoids a lookup followed by a conditional add (and it
>>>         is atomic).
>>>
>>>
>>>     Do you mean dict_set()? If so, how do you propose we
>>>     differentiate between "failure" and "previous value did not
>>>     exist"? Do you propose setting the previous value into a pointer
>>>     to pointer, and retain the return value as is today?
>>     Yes, I'm thinking of something similar to dict_set() (by the way,
>>     I would remove the dict_add() function).
>>
>>
>> dict_add() is used in unserialization routines, where calling 
>> dict_set() for a big set of keys that are guaranteed not to repeat is 
>> very expensive (unserializing would otherwise have a quadratic 
>> asymptote). What is the reason you intend to remove it?
> Yes, but it is only used inside dict.c itself, so it should not be 
> exposed in dict.h. This is only a cosmetic change, but I think it 
> would be better. Any other use of dict_add() by a translator will be 
> incorrect unless done with great care, which is not desirable for 
> future maintenance.
>
>>     What you propose would be the simplest solution right now.
>>     However, I think it would be interesting to change the return
>>     value to an error code (this would supply more detailed
>>     information in case of failure, and we could use EEXIST to know
>>     whether the value already existed; in fact I think it would be
>>     interesting to progressively replace the -1 return code of many
>>     functions with an error code). The pointer-to-pointer argument
>>     could be NULL if the previous value is not needed.
>>
>>     Of course this would change the function signature, breaking a
>>     lot of existing code. Another possibility could be to create a
>>     dict_replace() function, and possibly make it fail if the value
>>     didn't exist.
>>
>>
>> It is best we do not change the meaning of existing APIs, and just 
>> add new APIs instead. The new API can be:
>>
>> int dict_replace (dict_t *dict, const char *key, data_t *newval, 
>> data_t **oldval);
>>
>> .. and leave dict_set() as is.
> That would be good. I would let dict_replace() set *oldval to NULL to 
> indicate that the key did not exist, and keep the 0/-1 return code to 
> indicate errors (this maintains homogeneity with the other APIs, 
> though I still think that an error code would be more useful in 
> general, but that is another topic).
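>
> A usage sketch under those assumptions (and assuming the caller gets a
> reference to the old value that it must release):
>
>     data_t *old = NULL;
>
>     if (dict_replace (dict, "somekey", newval, &old) == 0) {
>         if (old != NULL) {
>             /* the key already existed; drop the previous value */
>             data_unref (old);
>         } else {
>             /* the key did not exist before */
>         }
>     }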
>
>>
>>>         * Always return the data_pair_t structure instead of data_t
>>>         or the data itself.
>>>
>>>         This can be useful to avoid further lookups or other
>>>         operations on the same element. Macros can be created to
>>>         simplify the code that accesses the actual value.
>>>
>>>
>>>     The use case is not clear. A more concrete example will help..
>>>
>>     Having a data_pair_t could help to navigate from an existing
>>     element (getting the next or previous one; this is really
>>     interesting if dict were implemented using a sorted structure
>>     like a trie, since it would allow processing a set of similar
>>     entries very fast, like the trusted.afr.<brick> values for
>>     example), or to remove or replace it without needing another
>>     lookup (a more detailed analysis would be needed to see how to
>>     handle race conditions).
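>>
>>     For illustration only (assuming data_pair_t keeps its current key,
>>     value and next members), the helper macros could be as simple as:
>>
>>         #define dict_pair_key(_pair)   ((_pair)->key)
>>         #define dict_pair_value(_pair) ((_pair)->value)
>>         #define dict_pair_next(_pair)  ((_pair)->next)
>>
>>     so a caller can move to neighbouring entries or reuse the pair in
>>     a later call without going through the dictionary again.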
>>
>>     By the way, is the dict_t structure really used concurrently? I
>>     haven't analyzed all the code deeply, but it seems to me that
>>     every dict_t is only accessed from a single place at a time.
>>
>>
>> There have been instances of dict_t getting used concurrently, when 
>> used as xdata and in xattrop (by AFR). There have been bugs in the 
>> past with concurrent dict access.
> I missed that. Sorry.
>
>>>         * Use a trie instead of a hash.
>>>
>>>         A trie structure is a bit more complex than a hash table,
>>>         but it processes the key only once and does not need to
>>>         compute a hash. A test implementation I made with a trie
>>>         shows a significant improvement in dictionary operations.
>>>
>>>
>>>     There is already an implementation of trie in
>>>     libglusterfs/src/trie.c. Though it does not compact (collapse)
>>>     single-child nodes upwards into the parent. In any case, let's
>>>     avoid having two implementations of tries.
>>     I know. The current implementation wastes a lot of memory because
>>     it uses an array of 256 pointers, and in some places it needs to
>>     traverse the array. Not a big deal, but if it is done many times
>>     it could be noticeable. In my test I used a trie with 4 child
>>     pointers (with collapsing of single-child nodes) that runs a bit
>>     faster than the 256-pointer implementation and uses much less
>>     memory. I tried with 2, 4, 16 and 256 children per node, and 4
>>     seems to be the best (at least for dictionary structures), though
>>     there is very little difference between 4 and 16 in terms of
>>     speed.
>>
>>
>> The 256 child pointers give you constant-time lookup of the next 
>> level child with just an offset indirection. With smaller fan-out, do 
>> you search through the list? Can you show an example of this? 
>> Collapsing single-child nodes upwards is badly needed though.
> For the case of 4 children I split each byte into 4 elements of 2 
> bits each. This makes it very easy to access the next child; it only 
> needs some basic bit operations. This is the current implementation 
> of the lookup function, which shows how the key is accessed:
>
> /* TRIE_DIMENSION values:
>   *
>   *    1: 1 bit per element, 8 elements per byte
>   *    2: 2 bits per element, 4 elements per byte
>   *    3: 4 bits per element, 2 elements per byte
>   *    4: 8 bits per element, 1 element per byte
>   */
>
> #define TRIE_DIMENSION      2
> #define TRIE_INDEX_BITS     (4 - TRIE_DIMENSION)
> #define TRIE_ELEM_BITS      (1 << (TRIE_DIMENSION - 1))
> #define TRIE_CHILDS         (1 << TRIE_ELEM_BITS)
> #define TRIE_ELEMS_PER_BYTE (1 << TRIE_INDEX_BITS)
> #define TRIE_INDEX_MASK     (TRIE_ELEMS_PER_BYTE - 1)
> #define TRIE_ELEM_MASK      (TRIE_CHILDS - 1)
>
> #define sys_trie_value(_data, _offs) \
>      (((_data) >> (((_offs) & TRIE_INDEX_MASK) * TRIE_ELEM_BITS)) & \
>       TRIE_ELEM_MASK)
>
> #define sys_trie_value_idx(_data, _offs) \
>      ({ \
>          off_t __tmp_offs = (_offs); \
>          sys_trie_value(((_data)[(__tmp_offs) >> TRIE_INDEX_BITS]), \
>                         __tmp_offs); \
>      })
>
> typedef struct _sys_trie_node
> {
>      struct _sys_trie_node * childs[TRIE_CHILDS];
>      struct _sys_trie_node * parent;
>      void *                  data;
>      uint32_t                count;
>      uint32_t                length;
>      uint8_t                 key[0];
> } sys_trie_node_t;
>
> typedef struct _sys_trie
> {
>      sys_trie_node_t root;
> } sys_trie_t;
>
> sys_trie_node_t * sys_trie_lookup(sys_trie_t * trie,
>                                    const uint8_t * key,
>                                    size_t length)
> {
>      sys_trie_node_t * node;
>      size_t len;
>
>      len = length << TRIE_INDEX_BITS;
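>
>      /* Walk down the trie; collapsed nodes may cover several key
>       * elements, so the loop stops at the first node that is at least
>       * as deep as the key (lengths are measured in elements). */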
>      for (node = trie->root.childs[sys_trie_value(*key, 0)];
>           (node != NULL) && (node->length < len);
>           node = node->childs[sys_trie_value_idx(key, node->length)]);
>
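>      /* The node matches only if it sits exactly at the key's depth,
>       * actually holds data (purely intermediate nodes do not) and its
>       * stored key is identical to the requested one. */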
>      if ((node != NULL) && (node->length == len) && (node->data != NULL) &&
>          (memcmp(node->key, key, length) == 0))
>      {
>          return node;
>      }
>      return NULL;
> }
>
> It's true that the collapsing feature is not really needed for 
> dictionaries; however, I think it would be interesting to have it so 
> that tries can be used for other purposes that may require a 
> long-lived data structure that keeps adding and removing items.
>
>>     I agree that it is not good to maintain two implementations of
>>     the same thing. Maybe we could replace the current trie
>>     implementation; the change should be transparent.
>>
>>
>> Yes, I believe the current API can accommodate such internal changes.
>>
>>
>>
>>>         * Implement dict_foreach() as a macro (similar to kernel's
>>>         list_for_each()).
>>>
>>>         This gives more control and avoids the need for helper functions.
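>>>
>>>         A sketch of such a macro (assuming the dictionary keeps its
>>>         current members_list/next linkage) could be:
>>>
>>>             #define dict_for_each(_dict, _pair) \
>>>                 for ((_pair) = (_dict)->members_list; \
>>>                      (_pair) != NULL; \
>>>                      (_pair) = (_pair)->next)
>>>
>>>         which lets the caller break out of the loop or keep local
>>>         state directly, without writing a callback function.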
>>>
>>>
>>>     This makes sense too, but there are quite a few users of
>>>     dict_foreach in the existing style. Moving them all over might
>>>     be a pain.
>>     Maybe we could create a macro with a different name to implement
>>     this feature and let developers migrate to it gradually. The old
>>     implementation could be flagged as deprecated and new code would
>>     use the new one. Old code would have enough time to change until
>>     the old implementation is eventually removed.
>>
>>     If we make important changes to the dict_t structure, we could
>>     replace the current functions with macros that use the new
>>     implementation but simulate the old behaviour.
>>
>>
>> Sounds OK.
>>
>>
>>>         Additionally, I think it's possible to redefine structures
>>>         to reduce the number of allocations and pointers used for
>>>         each element (actual data, data_t, data_pair_t and key).
>>>
>>>
>>>     This is highly desirable. There was some effort from Amar in the
>>>     past (http://review.gluster.org/3910) but it has been in need of
>>>     attention for some time. It would be interesting to know if you
>>>     were thinking along similar lines?
>>>
>>     Yes, it is quite similar, though I should analyze it more deeply.
>>     I would also try to remove some unused/unneeded fields that are
>>     used in very few places, add complexity and can easily be
>>     replaced, like extra_free and extra_stdfree in dict_t for example.
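>>
>>     As a purely illustrative sketch of the kind of layout I have in
>>     mind (the structure and field names are made up), the pair, the
>>     key and the value header could share a single allocation:
>>
>>         typedef struct _dict_elem {
>>             struct _dict_elem *next;      /* linkage inside the dict */
>>             uint32_t           key_len;
>>             uint32_t           value_len;
>>             void              *value;     /* could also be inlined */
>>             char               key[0];    /* key stored in place   */
>>         } dict_elem_t;
>>
>>     reducing the current four allocations per entry (key, data,
>>     data_t and data_pair_t) to one or two.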
>>
>>
>> Thanks,
>> Avati
>>
>>
