Landon Fuller

Amazon S3 Backups (and a java S3 library)

10 Apr 2007, 19:21 PDT

Introduction

At Three Rings, we use Bacula as a disk-based backup system. We originally implemented off-site backups by rsync'ing snapshots to an out-of-state server, but as the size of our backups headed towards terabytes, we started looking for a solution that didn't require installing a remote multi-terabyte RAID.

Amazon's Simple Storage Service (S3) provides access to Amazon's "highly scalable, reliable, fast, inexpensive data storage infrastructure." In brief, S3 provides a practically unlimited amount of internet-accessible storage, accessed through a simple REST (or SOAP) API.

As part of Three Rings' next project, Whirled, I implemented a relatively complete Java library for S3's REST API. With the code on hand, it made sense to use both the library and S3 for our off-site backups.

Streaming Data to (and from) S3

To back up multiple terabytes of data remotely, I had three requirements:

The data must be streamed directly to S3. (i.e., no local copies)
S3 objects cannot be larger than 5 gigabytes, so the data must be broken up into smaller blocks.
Data structures must be future-proof; it must be possible to evolve the data format.

The result is S3Pipe, a small utility which implements piping of streams to and from S3:

landonf:s3lib> echo "hello, world" | s3pipe ... upload --stream helloworld

landonf:s3lib> s3pipe ... download --stream helloworld
hello, world

In other words, S3Pipe supports reading data from stdin and writing it to S3, as well as reading data from S3 and writing it to stdout. The implementation:

Breaks input data into 5 megabyte blocks. An input thread places each block into a queue, and an output thread uploads them to S3, preventing starvation of the writer thread. The size of the block queue is limited -- once filled, the input thread will not read data until space becomes available.
Stores streams as future-proof versioned data structures, with support for arbitrary metadata.
Validates block checksums on both upload and download.
Retries block uploads and downloads should a transient failure occur.
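
The upload half of the design above can be sketched roughly as follows. This is a minimal illustration, not S3Pipe's actual code: the BlockPipeSketch class, the BlockStore interface, and every method name are hypothetical, the store stands in for a real S3 client, and every IOException is naively treated as transient. It uses lambdas, so it needs a newer JDK than the 1.5 that s3lib targets.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicReference;

final class BlockPipeSketch {
    /** Hypothetical stand-in for an S3 client; this is not s3lib's API. */
    interface BlockStore {
        void put(int index, byte[] data, byte[] md5) throws IOException;
    }

    /** Sentinel queued by the reader to mark the end of the stream. */
    private static final byte[] EOF = new byte[0];

    static byte[] md5(byte[] data) {
        try {
            return MessageDigest.getInstance("MD5").digest(data);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Reads up to blockSize bytes; the result is shorter only at end of stream. */
    static byte[] readBlock(InputStream in, int blockSize) throws IOException {
        byte[] buf = new byte[blockSize];
        int filled = 0;
        while (filled < blockSize) {
            int n = in.read(buf, filled, blockSize - filled);
            if (n < 0) break;
            filled += n;
        }
        return filled == blockSize ? buf : Arrays.copyOf(buf, filled);
    }

    /** Sends one block with its checksum, retrying on IOException. */
    static void putWithRetry(BlockStore store, int index, byte[] block, int maxAttempts)
            throws IOException {
        if (maxAttempts < 1) throw new IllegalArgumentException("maxAttempts < 1");
        byte[] digest = md5(block); // passed along so the store can verify the block
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                store.put(index, block, digest);
                return;
            } catch (IOException e) {
                last = e; // assumed transient; a real client would classify and back off
            }
        }
        throw last;
    }

    /** Streams in to the store in blocks; returns the number of blocks uploaded. */
    static int upload(InputStream in, BlockStore store, int blockSize,
                      int queueDepth, int maxAttempts)
            throws IOException, InterruptedException {
        BlockingQueue<byte[]> queue = new ArrayBlockingQueue<>(queueDepth);
        AtomicReference<IOException> readError = new AtomicReference<>();
        Thread reader = new Thread(() -> {
            try {
                try {
                    byte[] block;
                    while ((block = readBlock(in, blockSize)).length > 0) {
                        queue.put(block); // waits here while the queue is full
                    }
                } catch (IOException e) {
                    readError.set(e);
                }
                queue.put(EOF);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        reader.setDaemon(true);
        reader.start();

        int count = 0;
        try {
            for (byte[] block = queue.take(); block != EOF; block = queue.take()) {
                putWithRetry(store, count, block, maxAttempts);
                count++;
            }
        } finally {
            reader.interrupt(); // unblock the reader if the upload failed early
        }
        if (readError.get() != null) throw readError.get();
        return count;
    }

    public static void main(String[] args) throws Exception {
        byte[] input = "hello, world".getBytes(StandardCharsets.US_ASCII);
        // Logs each block instead of uploading it; a real store would PUT to S3.
        BlockStore store = (index, data, md5) ->
            System.out.println("block " + index + ": " + data.length + " bytes");
        int blocks = upload(new ByteArrayInputStream(input), store, 5, 2, 3);
        System.out.println("uploaded " + blocks + " blocks");
    }
}
```

The bounded ArrayBlockingQueue is what caps memory use: with a queue depth of 2 and 5 megabyte blocks, only a few blocks are ever held in memory, however large the stream is.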

I hope that others will assist in incrementally improving the feature set: built-in public-key encryption & signing, simple HMAC support, even more robust transient/network failure recovery.

Download & Use

Both the S3 client library and S3Pipe are available under the BSD license, as a 1.0-beta1 release. This is new software, and I strongly recommend testing any backups -- that said, the code has nearly 100% unit test coverage, has been tested locally, and is in active production use. The primary reason for the beta designator is that I still haven't written documentation!

You can download the library source here (pgp signature). To build s3lib and s3pipe, you will need the Java JDK (1.5+) and Ant (1.7+, available via MacPorts). Inside the source directory, run:

ant dist

This will create dist/s3lib.jar and dist/s3pipe.jar. S3Pipe is a stand-alone jar -- it can be copied anywhere once built. To execute S3Pipe (and print command usage), run:

java -jar dist/s3pipe.jar

S3Pipe requires you to provide your AWS id and key in a separate properties file. See 'test.properties.example' for an example.

If you'd like javadocs for s3lib, run:

ant javadoc

However, be aware that the API is still subject to change.

S3Pipe in action: S3 Dump and Restore

All of our servers at Three Rings run FreeBSD, and so I decided to use dump(8), with its built-in support for UFS2 file system snapshots, to dump file system snapshots from our primary Bacula server to Amazon S3.

The dump command makes use of "-P", or "pipe command". When provided a -P option, dump(8) will execute the given sh command line and pipe the dump output to that command's stdin. The following command line is a big one: it dumps a snapshot of the /export file system, piping the output through gzip for compression, gpg for symmetric encryption, throttle to rate limit the output to 20 Mbps, and S3Pipe, to write the stream to S3.

/sbin/dump -a -P "gzip -c | gpg --symmetric --batch --passphrase-file key.txt | /usr/local/bin/throttle -m 20 \
| java -jar s3pipe.jar --keyfile s3secrets.properties --bucket example.backup.bucket --stream bacula-/export upload" -0Lu /export

Restoration is, of course, almost the same thing in reverse.

java -jar s3pipe.jar --keyfile s3secrets.properties --bucket example.backup.bucket --stream bacula-/export download |\
gpg --decrypt --batch --passphrase-file key.txt | gzip -cd | restore
