[Gluster-devel] Follow: GSoC Proposal for a RESTful/JSON API and server for GlusterFS similar to WebHDFS

Wed Mar 19 02:31:59 UTC 2014

Hi Jay,

Thank you for the interest and response.

A few use cases:
1) Disco (a map-reduce framework written in Erlang and Python) is using
WebHDFS to add HDFS support.
2) WebHDFS is used to provide HDFS access for Python and Ruby.
3) Hadoop has a FileSystem plugin for WebHDFS -- used when you need to go
through a firewall or other situations where the regular HDFS network
protocol isn't feasible.
4) Spring (the web framework) supports accessing data in HDFS using WebHDFS.
5) Fluentd (data collection framework) supports HDFS using WebHDFS as a
plugin.
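
What these use cases share is that the client only needs an HTTP library.
As a rough sketch (the host name, port, and paths below are placeholders I
made up, and webhdfs_url is a hypothetical helper, not part of any existing
library), a WebHDFS-style request URL can be built like this:

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host, port, path, op, **params):
    """Build a WebHDFS-style URL: http://host:port/webhdfs/v1/<path>?op=<OP>&..."""
    # The operation goes in the query string, followed by any extra parameters.
    query = urlencode({"op": op, **params})
    # quote() percent-encodes the path but leaves the "/" separators alone.
    return "http://{}:{}/webhdfs/v1{}?{}".format(host, port, quote(path), query)


# Placeholder host, port, and paths, for illustration only.
print(webhdfs_url("namenode.example.com", 50070, "/user/rj/data.txt", "GETFILESTATUS"))
print(webhdfs_url("namenode.example.com", 50070, "/user/rj/data.txt", "OPEN",
                  offset=0, length=1024))
```

Issuing a GET on the resulting URL with any HTTP client is all a Python,
Ruby, or Erlang program has to do, which is why no per-language bindings
are needed.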

For your second question:
Just to clarify a few points.  WebHDFS is the server and API.  The main
part of my proposal, as it stands now, is to provide similar functionality
for Gluster.  Hadoop does provide a client for WebHDFS that allows WebHDFS
to be used as an alternative protocol for HDFS.

According to the docs, the WebHDFS API provides complete support for the
FileSystem API.  It's possible that the WebHDFS API could be seen as a
generic HCFS API and that my proposed Gluster RESTful interface API could
implement a compatibility mode for the WebHDFS API so that any client that
can use WebHDFS can use the Gluster RESTful API.  Examples in this case
would include the Hadoop WebHDFS client, Spring, and Fluentd WebHDFS plugin.
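
To make the compatibility-mode idea concrete, here is a minimal sketch of
how a server could answer a GETFILESTATUS request in the JSON shape that
WebHDFS clients expect.  It uses the local file system as a stand-in for
Gluster, and to_webhdfs_file_status is a hypothetical helper name.  Owner
and group are reported as numeric ids, and blockSize and replication are
set to 0, purely as simplifying assumptions:

```python
import json
import os
import stat
import tempfile


def to_webhdfs_file_status(path):
    """Return a WebHDFS-style FileStatus document for a local path."""
    st = os.stat(path)
    is_dir = stat.S_ISDIR(st.st_mode)
    return {
        "FileStatus": {
            "accessTime": int(st.st_atime * 1000),        # milliseconds since epoch
            "blockSize": 0,                               # assumption: not meaningful here
            "group": str(st.st_gid),                      # simplification: numeric id
            "length": 0 if is_dir else st.st_size,
            "modificationTime": int(st.st_mtime * 1000),  # milliseconds since epoch
            "owner": str(st.st_uid),                      # simplification: numeric id
            "pathSuffix": "",
            "permission": format(stat.S_IMODE(st.st_mode), "o"),  # octal string
            "replication": 0,                             # assumption: not meaningful here
            "type": "DIRECTORY" if is_dir else "FILE",
        }
    }


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        sample = os.path.join(tmp, "sample.txt")
        with open(sample, "wb") as f:
            f.write(b"hello")
        print(json.dumps(to_webhdfs_file_status(sample), indent=2))
```

A real implementation would fill these fields from Gluster metadata rather
than os.stat, but the response shape seen by the client would be the same.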

If a compatible API is implemented, the Hadoop WebHDFS client with the
Gluster RESTful server could be used in place of the GlusterFS-hadoop
plugin / FUSE client combination.

We would need to discuss whether the FileSystem API (as mirrored by the
WebHDFS API) is, by itself, sufficient for all users of Gluster or not.  If
it is, then we can just implement that and focus the proposal on providing
compatibility with WebHDFS clients.  If not, we can develop an API that
mirrors Gluster semantics and provide a compatibility mode for the WebHDFS
API.
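
One way to picture the second option is a native backend interface with a
thin WebHDFS-style dispatch layer on top.  The sketch below is only an
illustration under my own assumptions: LocalBackend, WEBHDFS_OPS, and
handle_webhdfs are hypothetical names, the backend is a dummy that serves a
local directory rather than a Gluster volume, and LISTSTATUS returns bare
names instead of full FileStatus entries:

```python
import os


class LocalBackend:
    """Dummy backend that serves files from a local directory."""

    def __init__(self, root):
        self.root = os.path.realpath(root)

    def _resolve(self, path):
        # Map an API path such as "/a.txt" to a path under root and
        # refuse anything that escapes the root directory.
        full = os.path.realpath(os.path.join(self.root, path.lstrip("/")))
        if full != self.root and not full.startswith(self.root + os.sep):
            raise PermissionError(path)
        return full

    def read(self, path):
        with open(self._resolve(path), "rb") as f:
            return f.read()

    def listdir(self, path):
        return sorted(os.listdir(self._resolve(path)))


# Compatibility layer: WebHDFS operation names mapped onto backend methods.
WEBHDFS_OPS = {
    "OPEN": lambda backend, path: backend.read(path),
    "LISTSTATUS": lambda backend, path: backend.listdir(path),
}


def handle_webhdfs(backend, path, op):
    """Dispatch a WebHDFS-style operation to the given backend."""
    handler = WEBHDFS_OPS.get(op.upper())
    if handler is None:
        raise ValueError("unsupported op: " + op)
    return handler(backend, path)
```

With this layering, a libgfapi-based backend exposing the same methods could
replace LocalBackend without touching the compatibility layer, and
Gluster-specific operations could be added to the native interface without
affecting WebHDFS clients.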

RJ

On Tue, Mar 18, 2014 at 10:06 PM, Jay Vyas <jayunit100 at gmail.com> wrote:

> I definitely like the idea.... Thanks for putting this together RJ.
>
> - what are the main use cases for webhdfs and how do people currently use
> it in the real world?
>
> - what portions of the FileSystem and FileContext contract does webhdfs
> cover, and can we morph its client to make it hcfs compatible and
> leverage our existing GlusterFS-hadoop plugin?
>
> I can help mentor it from the perspective of the java integration and API
> usability, and I'm sure we can track down some folks on the C/gluster
> side of things who are able to help me with the lower level details.
>
> On Mar 18, 2014, at 9:20 PM, RJ Nowling <rnowling at gmail.com> wrote:
>
> Hi all,
>
> I wanted to follow up.  I drafted a proposal for creating a RESTful/JSON
> API and server for GlusterFS similar to WebHDFS.  As the number of big data
> processing and storage systems explodes, integration is becoming more
> important.  A language and operating system agnostic RESTful/JSON API and
> server could be helpful for easing integration efforts.
>
> I've pasted the proposal below.  Is there any interest in the Gluster
> community?  Would anyone be willing to serve as a mentor?
>
> Thank you,
> RJ
>
> RESTful/JSON API and Server for GlusterFS
>
> Overview of proposal:
> The goal of the proposal is to create a RESTful/JSON API and server
> (similar to WebHDFS) for GlusterFS.
>
> Need it fulfills:
> Following on the popularity of Hadoop, a number of "big data" processing
> systems (e.g., Berkeley Data Analytics Stack, Storm, Stratosphere, Disco)
> are being created and adopted.  These systems are written in a wide range
> of languages such as Java, Scala, Python, and Erlang.
>
> These systems are rarely used in isolation. Maintaining separate
> distributed file systems and databases is laborious, costly, and wasteful.
> Migrating data between separate distributed file systems or databases is
> difficult, error prone, and limits easy access to data when it is needed.
> As a result, there is great interest in integration as exemplified by
> projects such as the Gluster plugin for Hadoop.
>
> Gluster's existing clients (FUSE, libgfapi) are limited to specific
> operating systems (Linux) and/or require bindings for each programming
> language of interest.  RESTful/JSON APIs and servers such as
> WebHDFS offer a more general solution that is independent of the client's
> operating system and programming language.  WebHDFS has proven popular and
> is being used by systems such as Disco to add HDFS support.  A RESTful/JSON
> interface and server could offer similar benefits for Gluster and has
> the potential to be just as popular as WebHDFS.
>
> Any relevant experience you have:
> I am familiar with WebHDFS and the Hadoop Gluster plugin. Through my Ph.D.
> research and TA'ing experience, I am familiar with distributed systems
> (e.g., WorkQueue), client-server systems, and RESTful/JSON APIs.  I have
> some experience with CherryPy, a Python web service framework, and using it
> to create RESTful/JSON servers. I am also familiar with the work in Disco
> to add HDFS support through WebHDFS.
>
> How you intend to implement your proposal:
> Aim 1: Design a RESTful/JSON interface that supports the semantics of
> Gluster.
> The ability to report data locality information will be important for
> other projects that use that information for scheduling workers and tasks.
>
> Aim 2: Create a RESTful/JSON server.
> I will use Python and its libraries such as CherryPy or Flask to develop a
> RESTful server. My preferred option will be to use Python bindings to
> libgfapi as a backend, but I will fall back to using the Gluster FUSE
> client if I run into problems.  A dummy backend that uses the local file
> system will be created for testing purposes. (It would be good to support
> multiple backends.)
>
> Aim 3: Create a RESTful/JSON Python library.
> I will create a Python library that uses the RESTful/JSON interface as a
> backend.
>
> Aim 4: Create Unit Tests and Benchmarks for Several Use Cases
> As part of my effort, I will write unit tests to ensure that the server
> and client library are implemented correctly.  As good performance will
> be important for adoption, I will also document several use cases and
> perform benchmarks to evaluate the performance of the RESTful/JSON server
> compared with the standard FUSE client.
>
> Aim 5: (Optional and time permitting) Work on integration with a big data
> system as a proof-of-concept
> Option 1: Integrate with Hadoop by mimicking the WebHDFS API so that the
> Hadoop WebHDFS client can transparently use the Gluster RESTful API as a
> backend
>
> Option 2: Integrate with Disco, an Erlang/Python MapReduce
> framework.  Support for HDFS is currently being added using the WebHDFS
> interface.  The WebHDFS work provides a good template for adding Gluster
> support.
>
> --
> em rnowling at gmail.com
> c 954.496.2314

-- 
em rnowling at gmail.com
c 954.496.2314