Pfs is a distributed file system built specifically for the Docker ecosystem. You deploy it with Docker, just like other applications in your stack. Furthermore, MapReduce jobs are specified as Docker containers, rather than .jars, letting you perform distributed computation using any tools you want.
- Fault-tolerant architecture built on CoreOS (implemented)
- Git-like distributed file system (implemented)
- Dockerized MapReduce (not implemented)
Pfs is not production ready yet; it is at alpha status. We'd love your help. :)
Pachyderm will eventually be a complete replacement for Hadoop, built on top of a modern toolchain instead of the JVM. Hadoop is a mature ecosystem, so there's a long way to go before pfs will fully match its feature set. However, thanks to innovative tools like btrfs, Docker, and CoreOS, we can build an order of magnitude more functionality with much less code.
Pfs is implemented as a distributed layer on top of btrfs, the same copy-on-write file system that powers Docker. Btrfs already offers git-like semantics on a single machine; pfs scales these out to an entire cluster. This allows features such as:
- Commit-based history: File systems are generally single-state entities. Pfs, on the other hand, provides a rich history of every previous state of your cluster. You can always revert to a prior commit in the event of a disaster.
- Branching: Thanks to btrfs's copy-on-write semantics, branching is ridiculously cheap in pfs. Each user can experiment freely in their own branch without impacting anyone else or the underlying data. Branches can easily be merged back into the main cluster.
- Cloning: Btrfs's send/receive functionality allows pfs to efficiently copy an entire cluster's worth of data while still maintaining its commit history.
The basic interface for MapReduce is a map function and a reduce function.
In Hadoop this is exposed as a Java interface. In Pachyderm, MapReduce jobs are
user-submitted Docker containers with http servers inside them. Rather than
calling a map method on a class, Pachyderm POSTs files to the /map route on
a webserver. This completely democratizes MapReduce by decoupling it from a
single platform, such as the JVM.
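Since Dockerized MapReduce is listed above as not implemented, the exact request and response format is not pinned down here. The following is only a sketch of the map half of a word count, written as a plain shell function that reads one file body on stdin, which is the kind of logic a job's /map handler could run on each POSTed file. The name `map_wordcount`, the tab-separated output, and the port in the commented curl line are assumptions made for illustration.

```shell
# Hypothetical map step: read one file's contents on stdin and emit
# "word<TAB>count" lines. Nothing here is part of pfs or Pachyderm.
map_wordcount() {
  tr -cs '[:alnum:]' '\n' |          # split the input into one word per line
    tr '[:upper:]' '[:lower:]' |     # normalise case
    sed '/^$/d' |                    # drop empty lines
    LC_ALL=C sort |                  # group identical words together
    uniq -c |                        # count each group
    awk '{ printf "%s\t%s\n", $2, $1 }'
}

# Exercising a job container by hand might look like this, assuming it
# listens on port 8080 (an assumption, not something pfs defines):
# curl -XPOST --data-binary @some_file http://localhost:8080/map
```

Any language with an HTTP server could wrap the same logic; shell is used here only because it needs no extra dependencies.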
Thanks to Docker, Pachyderm can seamlessly integrate external libraries. For example, suppose you want to perform computer vision on a large set of images. Creating this job is as simple as running npm install opencv inside a Docker container and creating a Node.js server that uses this library on its /map route.
The easiest way to try out pfs is to point curl at the live instance we have running here: 146.148.77.106. We'll try to keep it up and running throughout the day.
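As a rough sketch of what that could look like, the helper below builds request URLs for the live instance. It assumes the instance serves the same /pfs routes described in the usage part of this README; `pfs_url` and `PFS_HOST` are names made up for this example and are not part of pfs.

```shell
# Hypothetical helper, not part of pfs: build a URL for the /pfs route.
# PFS_HOST defaults to the live instance mentioned above.
PFS_HOST=${PFS_HOST:-146.148.77.106}

# usage: pfs_url <path> [key=value ...]
pfs_url() {
  path=$1
  shift
  query=""
  for kv in "$@"; do
    if [ -z "$query" ]; then
      query="?$kv"
    else
      query="$query&$kv"
    fi
  done
  printf 'http://%s/pfs/%s%s\n' "$PFS_HOST" "$path" "$query"
}

# Example calls (they need network access, so they are left commented out):
# curl "$(pfs_url path/to/file)"
# curl "$(pfs_url path/to/file commit=n)"
```

Wrapping the URL in double quotes keeps the shell from interpreting `?` and `&` in the query string.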
Pfs is designed to run on CoreOS. To start, you'll need a working CoreOS cluster. Currently, global containers, which pfs requires, are only available in the beta channel (CoreOS 444.5.0).
- Google Compute Engine (recommended)
- Amazon EC2
- Vagrant (requires setting up DNS)
SSH into one of your new CoreOS machines.
$ wget https://github.com/pachyderm-io/pfs/raw/master/deploy/static/1Node.tar.gz
$ tar -xvf 1Node.tar.gz
$ fleetctl start 1Node/*
The startup process takes a little while the first time you run it because each node has to pull a Docker image.
The easiest way to see what's going on in your cluster is to use list-units:
$ fleetctl list-units
If things are working correctly, you should see something like:
UNIT                          MACHINE                     ACTIVE  SUB
announce-master-0-1.service   3817102d.../10.240.199.203  active  running
announce-replica-0-1.service  3817102d.../10.240.199.203  active  running
master-0-1.service            3817102d.../10.240.199.203  active  running
replica-0-1.service           3817102d.../10.240.199.203  active  running
router.service                3817102d.../10.240.199.203  active  running
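To check this from a script rather than by eye, something like the following could work. It is a sketch that assumes the four-column layout shown above (UNIT, MACHINE, ACTIVE, SUB); `all_units_running` is a hypothetical helper name, and other fleetctl versions may print different columns.

```shell
# Hypothetical helper: read `fleetctl list-units` output on stdin and succeed
# only if every listed unit is "active" / "running".
# Assumes the last two columns are ACTIVE and SUB, as in the output above.
all_units_running() {
  awk '
    NR > 1 && NF >= 4 {              # skip the header row and blank lines
      units++
      if ($(NF - 1) != "active" || $NF != "running") bad++
    }
    END { exit (units == 0 || bad > 0) ? 1 : 0 }
  '
}

# Usage on a CoreOS machine:
# fleetctl list-units | all_units_running && echo "all pfs units are running"
```

The function also fails when no units are listed at all, so an empty cluster is not reported as healthy.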
Pfs exposes a git-like interface to the file system:
# Write a file to <branch>. Defaults to "master".
$ curl -XPOST 'localhost/pfs/path/to/file?branch=<branch>' -d @local_file
# Read a file from master.
$ curl localhost/pfs/path/to/file
# Read all files in a directory.
$ curl 'localhost/pfs/path/to/dir/*'
# Read a file from a previous commit.
$ curl 'localhost/pfs/path/to/file?commit=n'
# Delete a file from <branch>. Defaults to "master".
$ curl -XDELETE 'localhost/pfs/path/to/file?branch=<branch>'
# Commit dirty changes to <branch>. Defaults to "master".
$ curl 'localhost/commit?branch=<branch>'
# Create <branch> from <commit>.
$ curl 'localhost/branch?commit=<commit>&branch=<branch>'
We're two guys who love data and communities and both happen to be named Joe. We'd love to chat: joey@pachyderm.io and jdoliner@pachyderm.io.
Pfs's only dependency is Docker. You can build it with:
pfs$ docker build -t username/pfs .
Deploying what you build requires pushing the built container to the central Docker registry and changing the container name in the .service files from pachyderm/pfs to username/pfs.
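The rename step could be scripted along these lines. This is a sketch, not a tool that ships with pfs: `retag_units` is a hypothetical helper, and the example assumes the unit files are the ones unpacked into the 1Node directory from the deploy tarball.

```shell
# Hypothetical helper: point every .service file in a directory at your image.
# usage: retag_units <dir> <image>
# The image name must not contain "|" or "&", which are special to sed here.
retag_units() {
  dir=$1
  image=$2
  for unit in "$dir"/*.service; do
    [ -e "$unit" ] || continue   # no .service files matched
    # Write to a temp file and move it into place (portable across sed versions).
    sed "s|pachyderm/pfs|$image|g" "$unit" > "$unit.tmp" && mv "$unit.tmp" "$unit"
  done
}

# Example, after building the image as shown above:
# docker push username/pfs
# retag_units 1Node username/pfs
```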