[Gluster-devel] Reducing the size of the glusterdocs git repository

Prashanth Pai ppai at redhat.com
Wed May 18 09:35:00 UTC 2016


Hi all,

I tried out the BFG Repo-Cleaner tool on my GitHub fork of the glusterdocs project.

$ git count-objects -vH | grep size-pack
size-pack: 62.88 MiB

$ cd ..
$ java -jar bfg-1.12.12.jar --delete-files '*.{odp,pdf}' glusterdocs.git
$ cd glusterdocs.git
$ git reflog expire --expire=now --all && git gc --prune=now --aggressive

$ git count-objects -vH | grep size-pack
size-pack: 2.52 MiB

As seen above, the pack size was reduced from 62.88 MiB to 2.52 MiB,
a reduction of roughly 96%.
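
To confirm which files dominate the history before choosing a
--delete-files pattern, something like the following could be used. This is
only a sketch and was not part of the run above; the function name
largest_blobs is made up for illustration, and the sizes it reports are
uncompressed object sizes, not the packed sizes that size-pack reflects.

```shell
# Sketch (hypothetical helper): list the N largest blobs reachable from
# any ref, largest first. Run inside a git repository; N defaults to 10.
# Output format: "<uncompressed size in bytes> <path>".
largest_blobs() {
    git rev-list --objects --all |
        git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
        awk '$1 == "blob" { sub(/^blob /, ""); print }' |
        sort -n -r |
        head -n "${1:-10}"
}
```

Running "largest_blobs 20" in the clone before and after the cleanup should
show whether any large presentation files are still reachable.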

Caveat:
If we do go with this approach, git history is rewritten, so every
contributor will have to re-fork the "cleaned" repo. There are about
60 forks on GitHub now. Consequently, anyone sending a PR will
have to create a fresh clone.
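
For contributors who would rather keep an existing local clone than create
a fresh one, a hard reset onto the rewritten branch may be enough. The
following is a sketch under assumptions, not a tested procedure for
glusterdocs: it assumes the remote is named "origin", the branch is
"master", and the names resync_clone and backup-before-rewrite are made up
for illustration. A GitHub fork would still need to be re-created or
force-pushed separately.

```shell
# Sketch (hypothetical helper): move a local branch onto rewritten
# upstream history. Assumes the remote is named "origin"; the branch
# defaults to "master".
# WARNING: reset --hard discards uncommitted changes and moves the branch
# away from local commits, so the old tip is saved on a backup branch first.
resync_clone() {
    branch="${1:-master}"
    git fetch origin &&
        git checkout "$branch" &&
        git branch -f backup-before-rewrite &&
        git reset --hard "origin/$branch"
}
```

Any unmerged local commits remain on backup-before-rewrite and would have
to be re-applied on top of the new history, for example with
git cherry-pick.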

Is this "reset" worth it given the slight confusion and one-time
inconvenience to contributors? Thoughts?

Regards,
 -Prashanth Pai

----- Original Message -----
> From: "Amye Scavarda" <amye at redhat.com>
> To: "Nigel Babu" <nigelb at redhat.com>
> Cc: "Humble Chirammal" <hchiramm at redhat.com>, "Gluster Devel" <gluster-devel at gluster.org>
> Sent: Tuesday, May 17, 2016 6:14:23 PM
> Subject: Re: [Gluster-devel] Reducing the size of the glusterdocs git repository
> 
> 
> 
> On Tue, May 17, 2016 at 6:02 PM, Nigel Babu < nigelb at redhat.com > wrote:
> 
> 
> 
> We could potentially set up Travis CI to do builds that'll fail loudly if we
> commit something that throws a warning. I've tried out the possibility here:
> 
> https://travis-ci.org/nigelbabu/glusterdocs/jobs/130816121
> 
> I've purposefully made it fail. Success looks like this:
> 
> https://travis-ci.org/nigelbabu/glusterdocs/jobs/130815368
> 
> We can, in the future, add checks so that documentation has working links and
> there are no large files checked in. If there's interest, I'm happy to send a
> pull request for this.
> 
> I like this a lot.
> It's a way to make sure we're not putting in things that haven't been
> thoroughly checked.
> 
> PR welcome.
> 
> - amye
> 
> 
> 
> 
> On Tue, May 17, 2016 at 4:55 PM, Amye Scavarda < amye at redhat.com > wrote:
> 
> 
> 
> 
> On Tue, May 17, 2016 at 3:59 PM, Amye Scavarda < amye at redhat.com > wrote:
> 
> 
> 
> 
> 
> On Tue, May 17, 2016 at 3:56 PM, Niels de Vos < ndevos at redhat.com > wrote:
> 
> 
> On Tue, May 17, 2016 at 02:42:27PM +0530, Amye Scavarda wrote:
> > Hi all,
> > 
> > So we have a new slideshare.net account, GlusterCommunity (
> > http://www.slideshare.net/GlusterCommunity/ ) that connects with the
> > Gluster.org G+ community - and it'll even connect with the YouTube channel!
> > 
> > I've submitted a PR to the glusterdocs repo that will need some review: it
> > removes all of the presentations from the repo and links to slideshare. (
> > https://github.com/gluster/glusterdocs/pull/109 )
> 
> Cool, but note that the size of the repository does not decrease with
> that commit. The git repository will still contain all the presentations
> in the history/log. But not adding any more presentations is a good step
> already :-)
> 
> You are correct, but it will not make the current issue worse. It would help
> if I actually hit 'reply all'.
> 
> 
> > In no way does this mean that anyone needs to use Slideshare to host PDFs
> > of slides, you can use whatever you want. I chose slideshare because there
> > was an older Gluster account that had some Gluster.com presentations and it
> > links with YouTube.
> > 
> > Thoughts?
> 
> Looks good to me, but maybe you can address this comment in the GitHub
> pull request:
> https://github.com/gluster/glusterdocs/pull/109/files#r63498585
> 
> That's why I have you all to proofread.
> 
> 
> One thing I'm noticing, we don't have any sort of CI on Read The Docs. Let me
> see if there's not an easy way to fix that and have TravisCI tell us if
> we're about to merge something with a bunch of borked links.
> -- a
> 
> 
> 
> - amye
> 
> 
> Thanks,
> Niels
> 
> > - amye
> > 
> > 
> > 
> > On Thu, May 12, 2016 at 7:49 PM, Niels de Vos < ndevos at redhat.com > wrote:
> > 
> > > On Thu, May 12, 2016 at 03:55:23PM +0530, Kaushal M wrote:
> > > > On Thu, May 12, 2016 at 1:25 PM, Niels de Vos < ndevos at redhat.com >
> > > > wrote:
> > > > > On Thu, May 12, 2016 at 02:56:52AM -0400, Prashanth Pai wrote:
> > > > >> 
> > > > >> 
> > > > > >> > > Right now, even cloning the main docs branch is a huge pain due
> > > > > >> > > to the size of the repo.
> > > > > >> > > I think that branching will not solve this problem, and might
> > > > > >> > > make the problem worse.
> > > > > >> > 
> > > > > >> > Branching would not increase the size of the repository itself.
> > > > > >> > Only the size used on RTD will be bigger, as the HTML for different
> > > > > >> > branches will be generated (so the content is there twice). Cloning
> > > > > >> > the repository is not affected.
> > > > > >> > 
> > > > > >> > Deleting files (like the presentations) will also not remove them
> > > > > >> > from the git repository. It will stay possible to check out an older
> > > > > >> > version of the docs from the same repository; all of the history is
> > > > > >> > downloaded when the repository is cloned.
> > > > > >> > 
> > > > > >> > In order to reduce the size of the repository, you need to create a
> > > > > >> > new one, and import the changes without the big files. While
> > > > > >> > importing changes from another (the current) repository, it is
> > > > > >> > possible to modify the changes on the fly and prevent importing the
> > > > > >> > big files. This keeps the history and the credits for the
> > > > > >> > contributors.
> > > > >> 
> > > > >> This is an alternative solution:
> > > > >> https://rtyley.github.io/bfg-repo-cleaner/
> > > > > 
> > > > > > Right, I was thinking about git-filter-branch. In the end, I am
> > > > > > pretty sure that the old/original repository is not valid anymore. I
> > > > > > expect that 'git rebase' is used for the cleaning, and that will
> > > > > > change the commit-ids of patches that follow after a 'cleaned' patch.
> > > > > > 
> > > > > > My recommendation for a separate repository is only for preventing
> > > > > > inconsistencies between the upstream repository (after cleaning) and
> > > > > > the previously cloned/forked repositories that contributors have.
> > > > > > 
> > > > > >> > Where would you suggest the presentations (and other files?)
> > > > > >> > should get located?
> > > > > >> 
> > > > > >> Maybe an official Gluster community slideshare or speakerdeck
> > > > > >> account?
> > > > > > 
> > > > > > Possibly something like this. But we should have a plan for the
> > > > > > existing presentations too. And we have to accept that not everyone
> > > > > > presenting about a Gluster (related) topic will use 'our' SaaS
> > > > > > instance.
> > > > > > 
> > > > > >> Git LFS is also an option, but we don't really need versioning for
> > > > > >> presentation files. Git LFS will keep large files in a separate
> > > > > >> location and keep a "pointer" to those in the repo.
> > > > > > 
> > > > > > I'd prefer something like this. Most of my presentations are written
> > > > > > while I'm travelling, so a connected service is not really an option
> > > > > > for me in any case.
> > > > > 
> > > > > The docs repo should just have links to the presentations.
> > > > > They could be hosted on slideshare/speakerdeck, google drive or they
> > > > > could be hosted html5 presentations.
> > > > > If required we could just host the presentations on
> > > > > download.gluster.org.
> > > > > I've seen it being used to host resources for tutorials previously
> > > > > (like disk images),
> > > > > so hosting the actual presentations shouldn't be too hard.
> > > 
> > > I really do not care where they are hosted. We just can not demand the
> > > use of a SaaS for them. We can offer the option of course, but still
> > > allow presenters to use the tool of their preference.
> > > 
> > > Niels
> > > 
> > > _______________________________________________
> > > Gluster-devel mailing list
> > > Gluster-devel at gluster.org
> > > http://www.gluster.org/mailman/listinfo/gluster-devel
> > > 
> > 
> > 
> > 
> > --
> > Amye Scavarda | amye at redhat.com | Gluster Community Lead
> 
> 
> 

