[Gluster-devel] [RFC ] dictionary optimizations
Xavier Hernandez
xhernandez at datalab.es
Wed Sep 4 08:27:24 UTC 2013
I would also like to note that each node can store multiple elements.
Current implementation creates a node for each byte in the key. In my
implementation I only create a node if there is a prefix coincidence
between 2 or more keys. This reduces the number of nodes and the number
of indirections.
Each node stores the key from the beginning to allow very fast lookup by
a single memcmp().
In normal situations, this implementation is very fast and very
efficient in space.
Greetings,
Xavi
Al 04/09/13 10:10, En/na Xavier Hernandez ha escrit:
> Al 04/09/13 02:55, En/na Anand Avati ha escrit:
>>
>> On Tue, Sep 3, 2013 at 1:42 AM, Xavier Hernandez
>> <xhernandez at datalab.es <mailto:xhernandez at datalab.es>> wrote:
>>
>> Al 03/09/13 09:33, En/na Anand Avati ha escrit:
>>> On Mon, Sep 2, 2013 at 7:24 AM, Xavier Hernandez
>>> <xhernandez at datalab.es <mailto:xhernandez at datalab.es>> wrote:
>>>
>>> Hi,
>>>
>>> dict_t structures are widely used in glusterfs. I've some
>>> ideas that could improve its performance.
>>>
>>> * On delete operations, return the current value if it exists.
>>>
>>> This is very useful when we want to get a value and remove
>>> it from the dictionary. This way it can be done accessing
>>> and locking the dict_t only once (and it is atomic).
>>>
>>>
>>> Makes sense.
>>>
>>> * On add operations, return the previous value if it existed.
>>>
>>> This avoids to use a lookup and a conditional add (and it is
>>> atomic).
>>>
>>>
>>> Do you mean dict_set()? If so, how do you propose we
>>> differentiate between "failure" and "previous value did not
>>> exist"? Do you propose setting the previous value into a pointer
>>> to pointer, and retain the return value as is today?
>> Yes, I'm thinking to something similar to dict_set() (by the way,
>> I would remove the dict_add() function).
>>
>>
>> dict_add() is used in unserialization routines where dict_set() for a
>> big set of keys guaranteed not to repeat is very expensive
>> (unserializing would otherwise have a quadratic function as its
>> asymptote). What is the reason you intend to remove it?
> Yes, but it is used only inside dict.c itself. It should not be
> published in the dict.h. This is only a cosmetic change but I think it
> would be better. Any other use of dict_add() by a translator will be
> incorrect unless it is made with much care, which is not desirable for
> future maintenance.
>
>> What you propose would be the simplest solution right now.
>> However I think it would be interesting to change the return
>> value to an error code (this would supply more detailed
>> information in case of failure and we could use EEXIST to know if
>> the value already existed. In fact I think it would be
>> interesting to progressively change the -1 return code of many
>> functions by an error code). The pointer to pointer argument
>> could be NULL if the previous value is not needed.
>>
>> Of course this would change the function signature, breaking a
>> lot of existing code. Another possibility could be to create a
>> dict_replace() function, and possibly make it to fail if the
>> value didn't exist.
>>
>>
>> It is best we do not change the meaning of existing APIs, and just
>> add new APIs instead. The new API can be:
>>
>> int dict_replace (dict_t *dict, const char *key, data_t *newval,
>> data_t **oldval);
>>
>> .. and leave dict_set() as is.
> That would be good. I would allow that dict_replace() sets *oldval to
> NULL to indicate that the key did not exist, and maintain the 0/-1
> return code to indicate error (this will maintain homogeneity with
> other APIs, though I still think that an error code would be more
> useful in general, but this would be another topic).
>
>>
>>> * Always return the data_pair_t structure instead of data_t
>>> or the data itself.
>>>
>>> This can be useful to avoid future lookups or other
>>> operations on the same element. Macros can be created to
>>> simplify writing code to access the actual value.
>>>
>>>
>>> The use case is not clear. A more concrete example will help..
>>>
>> Having a data_pair_t could help to navigate from an existing
>> element (getting next or previous. This is really interesting if
>> dict where implemented using a sorted structure like a trie since
>> it would allow to process a set of similar entries very fast,
>> like the trusted.afr.<brick> values for example) or removing or
>> replacing it without needing another lookup (a more detailed
>> analysis would be needed to see how to handle race conditions).
>>
>> By the way, is really the dict_t structure used concurrently ? I
>> haven't analyzed all the code deeply, but it seems to me that
>> every dict_t is only accessed from a single place at once.
>>
>>
>> There have been instances of dict_t getting used concurrently, when
>> used as xdata and in xattrop (by AFR). There have been bugs in the
>> past with concurrent dict access.
> I missed that. Sorry.
>
>>> * Use a trie instead of a hash.
>>>
>>> A trie structure is a bit more complex than a hash, but only
>>> processes the key once and does not need to compute the
>>> hash. A test implementation I made with a trie shows a
>>> significant improvement in dictionary operations.
>>>
>>>
>>> There is already an implementation of trie in
>>> libglusterfs/src/trie.c. Though it does not compact (collapse)
>>> single-child nodes upwards into the parent. In any case, let's
>>> avoid having two implementations of tries.
>> I know. The current implementation wastes a lot of memory because
>> it uses an array of 256 pointers, and in some places it needs to
>> traverse the array. Not a b¡g deal, but if it is made many times
>> it could be noticeable. In my test I used a trie with 4 child
>> pointers (with collapsing single-child nodes) that runs a bit
>> faster than the 256 implementation and uses much less memory. I
>> tried with 2, 4, 16 and 256 childs per node, and 4 seems to be
>> the best (at least for dictionary structures) though there are
>> very little difference between 4 and 16 in terms of speed.
>>
>>
>> The 256 child pointers give you constant time lookup for the next
>> level child with just an offset indirection. With smaller fan-out, do
>> you search through the list? Can you show an example of this?
>> Collapsing single child node upwards is badly needed though.
> For the case of 4 childs I split each byte in 4 elements of 2 bits.
> This makes it very easy to access the next child. It only needs some
> basic logic operations. This is the current implementation for the
> lookup function. It shows how it is accessed:
>
> /* TRIE_DIMENSION values:
> *
> * 1: 1 bit per element, 8 elements per byte
> * 2: 2 bits per element, 4 elements per byte
> * 3: 4 bits per element, 2 elements per byte
> * 4: 8 bits per element, 1 element per byte
> */
>
> #define TRIE_DIMENSION 2
> #define TRIE_INDEX_BITS (4 - TRIE_DIMENSION)
> #define TRIE_ELEM_BITS (1 << (TRIE_DIMENSION - 1))
> #define TRIE_CHILDS (1 << TRIE_ELEM_BITS)
> #define TRIE_ELEMS_PER_BYTE (1 << TRIE_INDEX_BITS)
> #define TRIE_INDEX_MASK (TRIE_ELEMS_PER_BYTE - 1)
> #define TRIE_ELEM_MASK (TRIE_CHILDS - 1)
>
> #define sys_trie_value(_data, _offs) \
> (((_data) >> (((_offs) & TRIE_INDEX_MASK) * TRIE_ELEM_BITS)) & \
> TRIE_ELEM_MASK)
>
> #define sys_trie_value_idx(_data, _offs) \
> ({ \
> off_t __tmp_offs = (_offs); \
> sys_trie_value(((_data)[(__tmp_offs) >> TRIE_INDEX_BITS]), \
> __tmp_offs); \
> })
>
> typedef struct _sys_trie_node
> {
> struct _sys_trie_node * childs[TRIE_CHILDS];
> struct _sys_trie_node * parent;
> void * data;
> uint32_t count;
> uint32_t length;
> uint8_t key[0];
> } sys_trie_node_t;
>
> typedef struct _sys_trie
> {
> sys_trie_node_t root;
> } sys_trie_t;
>
> sys_trie_node_t * sys_trie_lookup(sys_trie_t * trie,
> const uint8_t * key,
> size_t length)
> {
> sys_trie_node_t * node;
> size_t len;
>
> len = length << TRIE_INDEX_BITS;
> for (node = trie->root.childs[sys_trie_value(*key, 0)];
> (node != NULL) && (node->length < len);
> node = node->childs[sys_trie_value_idx(key, node->length)]);
>
> if ((node != NULL) && (node->length == len) && (node->data != NULL) &&
> (memcmp(node->key, key, length) == 0))
> {
> return node;
> }
> return NULL;
> }
>
> It's true that the collapsing feature is not really used for
> dictionaries, however I think it would be interesting to have it to
> use tries for other purposes that may require a long duration data
> structure that eventually adds and removes items.
>
>> I agree that it is not good to maintain two implementations of
>> the same thing. Maybe we could change the trie implementation. It
>> should be transparent.
>>
>>
>> Yes, I believe the current API can accommodate such internal changes.
>>
>>
>>
>>> * Implement dict_foreach() as a macro (similar to kernel's
>>> list_for_each()).
>>>
>>> This gives more control and avoids the need of helper functions.
>>>
>>>
>>> This makes sense too, but there are quite a few users of
>>> dict_foreach in the existing style. Moving them all over might
>>> be a pain.
>> Maybe we could create a differently named macro to implement this
>> feature and allow the developers to slowly change it. The old
>> implementation could be flagged as deprecated and use the new one
>> for new code. Old code will have enough time to change it until
>> eventually the old implementation is removed.
>>
>> If we make important changes to the dict_t structure, we could
>> replace current functions by macros that use the new
>> implementation but simulates the old behavior.
>>
>>
>> Sounds OK.
>>
>>
>>> Additionally, I think it's possible to redefine structures
>>> to reduce the number of allocations and pointers used for
>>> each element (actual data, data_t, data_pair_t and key).
>>>
>>>
>>> This is highly desirable. There was some effort from Amar in the
>>> past (http://review.gluster.org/3910) but it has been in need of
>>> attention for some time. It would be intersting to know if you
>>> were thinking along similar lines?
>>>
>> Yes, it is quite similar though I should analyze it more deeply.
>> I would also try to remove some unused/unneeded fields that are
>> used in very few places, add complexity and can be replaced
>> easily, like extra_free and extra_stdfree in dict_t for example.
>>
>>
>> Thanks,
>> Avati
>>
>>
>> _______________________________________________
>> Gluster-devel mailing list
>> Gluster-devel at nongnu.org
>> https://lists.nongnu.org/mailman/listinfo/gluster-devel
>
>
>
> _______________________________________________
> Gluster-devel mailing list
> Gluster-devel at nongnu.org
> https://lists.nongnu.org/mailman/listinfo/gluster-devel
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://supercolony.gluster.org/pipermail/gluster-devel/attachments/20130904/f09b09cc/attachment-0001.html>
More information about the Gluster-devel
mailing list