Welcome to End Point’s blog

Ongoing observations by End Point people

Riding the Elasticsearch River on a CouchDB: Part 1

As you may guessed from my perfect tan and rugged good looks, I am Phunk, your river guide. In this multi-part series, I will guide us through an exploration of Elasticsearch, its CouchDB/BigCouch River plugin, its source, the CouchDB document store, and the surrounding flora and fauna that are the Ruby on Rails based tools I created to help the DPLA project manage this ecosystem.

Before we get our feet wet, let's go through a quick safety briefing to discuss the terms I'll be using as your guide on this trip. Elasticsearch: A schema-less, JSON-based, distributed RESTful search engine. The River: An Elasticsearch plugin that automatically indexes changes in your upstream (heh) document store, in real-time. CouchDB: The fault-tolerant, distributed NoSQL database / document store. DPLA: The Digital Public Library of America open source project for which all this work was done.

Let's put on our flotation devices, don our metaphor helmets and cast off.

In an Elasticsearch + River + CouchDB architecture, all things flow from the CouchDB. For the DPLA project, we wanted to manage (create, update and delete) documents in our CouchDB document repository and have those changes automagically reflected in our Elasticsearch index. Luckily, CouchDB publishes a real-time stream (heh) of updates to its documents via its cleverly named "_changes" feed. Each change in that feed is published as a stand-alone JSON document. We'll look at that feed in more detail in a bit.

The River bridges (heh) the gap between CouchDB's _changes feed and Elasticsearch index. The plugin runs inside Elasticsearch, and makes a persistent TCP connection to CouchDB's _changes endpoint. When a new change is published to that endpoint, the River passes the relevant portions of that JSON up to Elasticsearch, which then makes the appropriate change to its index. Let's look at a simple timeline of what the River would see from the _changes feed during the creation of a new document in CouchDB, and then an update to that document:

A document is created in CouchDB, the _changes feed emits:


That same document is updated in CouchDB, the _changes feed emits:


It's tough to tell from this contrived example document, but the _changes feed actually includes the entire source document JSON for creates and updates. (I'll talk more about that in part 2.) From the above JSON examples, the River would pass the inner-most document containing the _id, _rev and my_field data up to Elasticsearch. Elasticsearch uses that JSON to update the corresponding document (keyed by _id) in its search index and voila, the document you updated in CouchDB is now updated in your Elasticsearch search index in real-time.

We have now gotten our feet wet with how a document flows from one end to the other in this architecture. In part 2, we'll dive deeper into the DevOps-heavy care, feeding, monitoring and testing of the River. We'll also look at some slick River tricks that can transform your documents before Elasticsearch gets them, and any other silly River puns I can come up with. I'll also be reading the entire thing in my best David Attenborough impression and posting it on SoundCloud.


Aaron Mildenstein said...

You should be using CouchDB rather than BigCouch.


Database-as-a-service company Cloudant has pledged to breathe life into NoSQL database Apache CouchDB, by pushing the code behind its highly-tuned version, BigCouch into the larger open source project.

The code merger, which has been in the works for some months, allows CouchDB users to reap the rewards from the distributed database, such as cluster management, higher concurrency access performance and compactor technology, from the broader project.

After four years of production, Cloudant has ceased BigCouch development in order to participate fully in the CouchDB project. Development on the Amazon Dynamo-inspired framework began in 2008 and was initially conceived as an easier way of distributing petabytes, after Cloudant had been using CouchDB for CERN’s Large Hadron Collider to manage data.

Brian Gadoury said...

Thanks for the comment, Aaron. The original work that inspired this multi-part article is actually from 2013 and 2014, hence the use of BigCouch.

While I'm enjoying the image of riding down a river on a big couch, it sounds like I should s/BigCouch/CouchDB/g to make the article more relevant. Thanks again!