[Gluster-devel] Compacting Database to Reduce Tier Migration Times

Wed Jul 27 14:55:09 UTC 2016

Hello,

I have been looking into reducing the time gluster spends on tier
migration by compacting the databases.

Gluster creates a SQLite database on each brick to collect metadata
for files the client(s) touch. This metadata is necessary for tier
migration. At regular intervals, gluster queries the database on each
brick to determine which files to move.

As the database file size increases, so does the time to query the
database. This is due to fragmentation [1]. Therefore, migration slows
down as time goes on.

Solution:

We are asking for feedback on which defragmentation method to use for
the SQLite database. We detail all the options below. Currently, we
are leaning towards using the incremental auto_vacuum option tuned to
removing all free pages. In our tests so far, this option saves us
about as much space as manually calling VACUUM, but is faster overall.

Link to current progress: http://review.gluster.org/15031

Vacuum Types:

- The "VACUUM" option will reorganize the database by inserting all
the data into a new database and copying the contents back into the
old one. This places all used pages from the same table next to each
other and leaves no free pages behind, so the file shrinks. During the
reorganization, nothing else can edit the database. Therefore, no
client can add new metadata to the database and tier updates will not
happen. At worst, this command will use twice the space of the
original database while defragmentation is underway.

- The "auto_vacuum" option comes in two flavors, "full" and
"incremental". A full auto_vacuum moves ALL free pages in the database
to the end of the file after every commit. To do this, sqlite keeps
some extra metadata in the file to track candidates for
defragmentation. However, full auto_vacuum does not eliminate all
fragmentation because there is no guarantee that data from the same
table will remain next to each other after the free pages are moved.
In fact, this can make fragmentation worse. However, if there are no
free pages to move, this option is a no-op (unlike the "VACUUM"
option, which always does a full copy of the database).

- "Incremental" auto_vacuum removes up to N free pages from the file,
where N is user-specified. While called an "auto_vacuum", this version
will only remove the free pages when invoked with a specific pragma,
"incremental_vacuum(N)". Just like full auto_vacuum, sqlite stores
extra metadata in the file to do this. As with full auto_vacuum, it
does not guarantee the elimination of fragmentation. Unlike full
auto_vacuum, which acts automatically at every commit, the free pages
are removed and the file shrinks only when the pragma is invoked.
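
To make the comparison above concrete, here is a small Python sketch
(not the gluster code, which is in C) that fills a scratch database,
deletes most of the rows, and then compacts it with either VACUUM or
incremental_vacuum. The table name, row sizes, and the compact_demo()
helper are made up for illustration; only the pragmas and the VACUUM
statement are real SQLite.

```python
import os
import sqlite3
import tempfile


def compact_demo(mode):
    """Fill a scratch database, delete most rows, then compact it.

    mode is "vacuum" or "incremental" (illustrative names only).
    Returns (free_pages_before, free_pages_after, bytes_before, bytes_after).
    """
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "scratch.db")
        # isolation_level=None: no implicit transactions, so VACUUM can run.
        conn = sqlite3.connect(path, isolation_level=None)
        # auto_vacuum must be chosen before the first table is created
        # (an existing database needs a VACUUM for the change to apply).
        setting = "INCREMENTAL" if mode == "incremental" else "NONE"
        conn.execute("PRAGMA auto_vacuum = " + setting)

        conn.execute("CREATE TABLE scratch (id INTEGER PRIMARY KEY, payload BLOB)")
        conn.execute("BEGIN")
        conn.executemany(
            "INSERT INTO scratch (payload) VALUES (?)",
            [(b"x" * 1024,) for _ in range(2000)],
        )
        conn.execute("COMMIT")
        # Deleting 90% of the rows leaves many free pages in the file.
        conn.execute("DELETE FROM scratch WHERE id % 10 != 0")

        free_before = conn.execute("PRAGMA freelist_count").fetchone()[0]
        bytes_before = os.path.getsize(path)

        if mode == "vacuum":
            conn.execute("VACUUM")
        else:
            # executescript() steps the pragma to completion; an argument
            # of 0 asks SQLite to release every page on the freelist.
            conn.executescript("PRAGMA incremental_vacuum(0);")

        free_after = conn.execute("PRAGMA freelist_count").fetchone()[0]
        bytes_after = os.path.getsize(path)
        conn.close()
        return free_before, free_after, bytes_before, bytes_after


for mode in ("vacuum", "incremental"):
    print(mode, compact_demo(mode))
```

One detail worth noting for anyone trying this from Python: the
incremental_vacuum pragma has to be stepped until it finishes, which
is why the sketch uses executescript() rather than a bare execute().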

Changes to Gluster:

We are adding an option to gluster that turns compaction of the
underlying database on or off.

gluster volume tier <volname> tier-compact <off|on>

At regular intervals, the tier daemon will send a compaction IPC to
the bricks, one at a time, and each brick will compact its database
according to the strategy. For SQLite, this will set the necessary
pragmas and call VACUUM or incremental_vacuum(N) on the database as
necessary.
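
As a rough illustration of what "set the pragmas and call VACUUM or
incremental_vacuum(N)" could look like, here is a Python sketch. The
compact() function and its strategy names are hypothetical and are not
taken from the patch under review. The one SQLite-specific point it
shows is that a database created without auto_vacuum only switches to
incremental mode after the pragma is followed by one full VACUUM.

```python
import os
import sqlite3
import tempfile

AUTO_VACUUM_INCREMENTAL = 2  # value reported by "PRAGMA auto_vacuum"


def compact(conn, strategy, pages=0):
    """Hypothetical compaction step; strategy is "vacuum" or "incremental".

    pages is the N passed to incremental_vacuum (0 means all free pages).
    """
    if strategy == "vacuum":
        conn.execute("VACUUM")
        return
    if strategy != "incremental":
        raise ValueError("unknown strategy: %r" % (strategy,))

    mode = conn.execute("PRAGMA auto_vacuum").fetchone()[0]
    if mode != AUTO_VACUUM_INCREMENTAL:
        # An existing database changes mode only when VACUUM rebuilds it.
        conn.execute("PRAGMA auto_vacuum = INCREMENTAL")
        conn.execute("VACUUM")
    # executescript() steps the pragma until it has finished.
    conn.executescript("PRAGMA incremental_vacuum(%d);" % int(pages))


# Usage on a scratch database (the table and data are made up).
path = os.path.join(tempfile.mkdtemp(), "scratch.db")
conn = sqlite3.connect(path, isolation_level=None)
conn.execute("CREATE TABLE scratch (payload BLOB)")
conn.execute("INSERT INTO scratch VALUES (zeroblob(100000))")
conn.execute("DELETE FROM scratch")
compact(conn, "incremental")
```

The one-time VACUUM is the expensive part, so in this sketch it is
paid only on the first compaction of a database; later calls go
straight to incremental_vacuum.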

[1]

SQLite divides its database file into fixed-size blocks called pages
(4K by default in recent SQLite releases; the size is configurable). A
page can either be free (unused), store data for a table, or store
metadata for SQLite to use.

As a system uses a database over time, free pages can appear between
two used pages for some table A in the database. Pages for some other
table B can also appear between those two pages for table A. In
general, whenever any data not from table A appears between two pages
for table A, we have database fragmentation. Fragmentation hurts
database read times. A database without fragmentation benefits from
sequential reads from disk, whereas in a fragmented one the unrelated
pages read in between may evict table A's data from the cache.
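
For anyone who wants to look at their own brick databases, the free
page count can be read with standard pragmas. The Python sketch below
uses a made-up helper name, free_page_stats(), and reports only free
pages. It does not detect pages of table B sitting between pages of
table A, which would need a page-level inspection that this sketch
does not attempt.

```python
import sqlite3


def free_page_stats(conn):
    """Return page size, total pages, free pages and the free-page ratio."""
    page_size = conn.execute("PRAGMA page_size").fetchone()[0]
    page_count = conn.execute("PRAGMA page_count").fetchone()[0]
    free_pages = conn.execute("PRAGMA freelist_count").fetchone()[0]
    ratio = free_pages / page_count if page_count else 0.0
    return {
        "page_size": page_size,
        "page_count": page_count,
        "free_pages": free_pages,
        "free_ratio": ratio,
    }


# Usage on a throwaway in-memory database (the table and data are made up).
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("PRAGMA auto_vacuum = NONE")
conn.execute("CREATE TABLE scratch (payload BLOB)")
conn.execute("INSERT INTO scratch VALUES (zeroblob(100000))")
conn.execute("DELETE FROM scratch")
stats = free_page_stats(conn)
print(stats)
```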

Regards,
Diogenes