[Gluster-devel] Fwd: [Bug 981456] New: RFE: Please create an "initial offline bulk load" tool for data, for GlusterFS

Justin Clift jclift at redhat.com
Thu Jul 4 19:04:35 UTC 2013


Hi all,

Created an RFE in BZ for an initial "bulk data load" tool for GlusterFS:

  https://bugzilla.redhat.com/show_bug.cgi?id=981456

I was thinking about it from the perspective of SQL databases, and how
they do bulk data loading.

Most of the leading databases have some mode where transactions and
triggers can be turned off, so data can be "bulk loaded" very quickly.

That can cut *days* of loading time down to hours or minutes.
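
As a rough illustration of why that matters, here is a sketch using
Python's sqlite3 module (the table and data are invented for the
example): the only difference between the two paths is whether we
commit once per row, or once for the whole load.

```python
import sqlite3

def load_rows(conn, rows, bulk):
    """Insert rows into table t, either with one commit per row
    (the slow, 'normal' path) or a single commit for the whole
    batch (bulk mode)."""
    cur = conn.cursor()
    if bulk:
        cur.executemany("INSERT INTO t VALUES (?)", rows)
        conn.commit()                      # one commit at the end
    else:
        for row in rows:
            cur.execute("INSERT INTO t VALUES (?)", row)
            conn.commit()                  # one commit per insert

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (v INTEGER)")
load_rows(conn, [(i,) for i in range(1000)], bulk=True)
```

On a real on-disk database the per-row-commit path forces durable
writes for every row, which is where the days-versus-minutes
difference comes from.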

After looking over the RFE, does anyone have good ideas how we can
achieve it? :)

Regards and best wishes,

Justin Clift


Begin forwarded message:
> From: John Walker <jowalker at redhat.com>
> Subject: Fwd: [Bug 981456] New: RFE: Please create an "initial offline bulk load" tool for data, for GlusterFS
> Date: 4 July 2013 7:57:38 PM GMT+01:00
> To: Justin clift <jclift at redhat.com>
> 
> Want to forward this to Gluster-devel for comment?
> 
> 
> -------- Original Message --------
> Subject: [Bug 981456] New: RFE: Please create an "initial offline bulk load" tool for data, for GlusterFS
> From: bugzilla at redhat.com
> To: gluster-bugs at redhat.com
> CC: 
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=981456
> 
>            Bug ID: 981456
>           Summary: RFE: Please create an "initial offline bulk load" tool
>                    for data, for GlusterFS
>           Product: GlusterFS
>           Version: pre-release
>         Component: core
>          Severity: medium
>          Priority: unspecified
>          Assignee: amarts at redhat.com
>          Reporter: jclift at redhat.com
>                CC: gluster-bugs at redhat.com
> 
> Description of problem:
> 
>  For new adopters of GlusterFS with a large existing data set, the
>  initial time to load their data into Gluster can take days.
> 
>  We should be able to improve this significantly by creating a
>  specialised "bulk data load" tool for Gluster.
> 
>  So far, people have been able to use rsync to copy data to the
>  individual bricks in order to achieve something similar.  But that
>  doesn't work with striped or distributed volumes, where each host only
>  has one part of the total data.
> 
>  This tool should support all Gluster volume types, including both
>  striped and distributed volumes, and set the extended attributes
>  correctly as it goes.
> 
>  To support striped and distributed volumes, it should send the
>  appropriate file data to each host, as the gluster* processes
>  would expect to find it.
> 
>  The tool may need to run while glusterd and glusterfs* are offline, so
>  no conflict occurs during operation.
> 
>  The thinking behind this RFE comes from awareness of similar tools for
>  SQL databases.  With a SQL database, if a person loads a large data set
>  using normal transaction processing (one transaction / commit per insert
>  statement, all triggers fired each time), the data load can take ages
>  (days, even).  So, most SQL databases have a bulk loading mode, which
>  disables the transaction features (eg. one commit at start and end,
>  triggers deferred until the end of bulk loading).  Each SQL database
>  project / vendor has its own way of doing it, but the high level
>  principle is the same.
> 
> 
> Version-Release number of selected component (if applicable):
> 
>  Upstream git master, as of Thu 4th July 2013.
> 
> 
> Actual results:
> 
>  Initial loading of data can take days.
> 
> 
> Expected results:
> 
>  Initial loading of data should not take significantly longer than
>  a direct rsync would.
> 
> 
> Additional info:
> 
>  We should save a significant amount of time this way, by cutting out the
>  stat() calls (and similar) that would otherwise occur between hosts during
>  normal Gluster operation.
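
One concrete detail such a tool would need to handle: Gluster
identifies every file by a GFID, a 16-byte UUID stored raw in the
trusted.gfid extended attribute on each brick.  A rough Python
sketch of stamping that onto files copied straight to a brick (the
helper names here are invented; writing trusted.* xattrs requires
root, which an offline loader would typically have):

```python
import os
import uuid

def make_gfid() -> bytes:
    """A GFID is a 16-byte UUID, stored raw (not as hex text)."""
    return uuid.uuid4().bytes

def stamp_gfid(brick_file: str) -> bytes:
    """Sketch: attach a fresh GFID to a file sitting on a brick.
    Replicas of the same file on other bricks must get the SAME
    GFID, so a real tool would generate it once per logical file."""
    gfid = make_gfid()
    os.setxattr(brick_file, "trusted.gfid", gfid)  # needs root
    return gfid
```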

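For placement on plain distributed volumes, the RFE's point about
sending the right file data to each host comes down to DHT: Gluster
hashes the file name and assigns each brick a slice of the 32-bit
hash space (recorded in the trusted.glusterfs.dht layout xattr), and
a loader would have to reproduce that mapping.  A toy version, using
CRC32 as a stand-in for Gluster's actual Davies-Meyer hash
(gf_dm_hash), so this only shows the shape of the idea:

```python
import zlib

def pick_brick(filename: str, bricks: list) -> str:
    """Toy placement: hash the file name into the 32-bit space and
    map it onto one brick's slice.  Real DHT uses gf_dm_hash and
    per-directory layout ranges; CRC32 here is only a stand-in."""
    h = zlib.crc32(filename.encode("utf-8")) & 0xFFFFFFFF
    slice_size = (2**32 + len(bricks) - 1) // len(bricks)
    return bricks[min(h // slice_size, len(bricks) - 1)]

bricks = ["host1:/export/brick", "host2:/export/brick", "host3:/export/brick"]
```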
--
Open Source and Standards @ Red Hat

twitter.com/realjustinclift




