Bristle Software NoSQL Tips

This page is offered as a service of Bristle Software, Inc. New tips are sent to an associated mailing list when they are posted here. Please send comments, corrections, any tips you'd like to contribute, or requests to be added to the mailing list, to tips@bristle.com.

Details of Tips:

NoSQL Databases
1. Intro to NoSQL Databases
  
  Original Version: 2/10/2012
  Last Updated: 2/10/2012
  Applies to: All NoSQL databases
  
  What are NoSQL databases? How do they differ from SQL databases? What are they good for?
  
  "NoSQL" is an umbrella term that is better interpreted as "Not Only SQL" than as "No SQL". A NoSQL database, sometimes also called a "NoSchema" database, is any data store that is not a traditional relational database (Oracle, DB2, MySQL, Postgres, Sybase, MS SQL Server, etc.) consisting of a well-defined set of tables, each with a well-defined set of columns. The category includes key-value stores, document stores, graph databases, XML databases, object stores, etc.
  
  They are generally useful for applications that:
  1. Require extremely fast writes
  2. Tolerate slightly slower queries
  3. Store extremely large amounts of data -- not just KB (kilobytes), MB (megabytes), or GB (gigabytes), but TB (terabytes), PB (petabytes), EB (exabytes), ZB (zettabytes), or YB (yottabytes), each of which is 1000 times the size of the previous
  4. Require very flexible data -- cannot tolerate a fixed schema of pre-specified tables, columns and relationships
  5. Require efficient support for very sparsely populated properties -- columns that would be null for most rows of a traditional SQL database
  6. Have a low need for transactions and rollbacks involving changes to more than one object
  7. Can characterize most objects as either primary independent objects with very little need for queries that "join" them, or as dependent objects that depend on a single primary object
  8. Have a low need for aggregations of data (GROUP BY, HAVING, etc.)
  Various types of NoSQL databases have been invented in recent years to support Web sites with massive numbers of users. Thus, we have Google's BigTable, Facebook's Cassandra (since open-sourced as an Apache project), Amazon's DynamoDB, as well as several that were created by smaller companies or community efforts to address the same types of needs: MongoDB, CouchDB, etc.
  
  Different NoSQL databases use different strategies to achieve their speed, scalability, and flexibility. Some common approaches are:
  1. Key-value pairs
    The typical "map" approach to a sparse table -- instead of allocating memory for all columns of all rows, use a map, indexed by a key that specifies the row and column ids, and store only those cells that have non-null values. Each object stored in the database, which might have been a single row of a relational database, or a join of rows from multiple tables, is stored as multiple map entries, one per column.
    1. Google BigTable
    2. Apache Cassandra (from Facebook)
    3. Amazon DynamoDB
  2. Document stores
    Store one or more collections of entire tree-structured XML or JSON documents, with support for indexing the trees by the values of various subtrees or leaves.
    1. MongoDB
    2. CouchDB
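  
  The key-value "map" approach described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the API or storage format of BigTable, Cassandra, or DynamoDB; the SparseTable class and its method names are made up for this example.

```python
# Illustrative sketch only: a sparse "table" stored as a map keyed by
# (row id, column name). SparseTable is a hypothetical name, not the API
# of any real database.

class SparseTable:
    def __init__(self):
        # (row_id, column) -> value; null cells are simply never stored
        self._cells = {}

    def put(self, row_id, column, value):
        key = (row_id, column)
        if value is None:
            self._cells.pop(key, None)  # storing None removes the cell
        else:
            self._cells[key] = value

    def get(self, row_id, column):
        # Missing cells read back as None, like a null column
        return self._cells.get((row_id, column))

    def row(self, row_id):
        # Reassemble one logical object from its per-column map entries
        return {col: val for (rid, col), val in self._cells.items() if rid == row_id}

    def cell_count(self):
        return len(self._cells)


table = SparseTable()
table.put("user:1", "name", "Ann")
table.put("user:1", "fax", "555-0100")
table.put("user:2", "name", "Bob")   # no fax entry is stored for user:2
```

  Here row() scans every cell, which is fine for a sketch. A real store would presumably keep the keys sorted or partitioned so that all entries for one row can be found without a full scan.
  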
  For more info, see the NoSQL row of my links page:
  http://bristle.com/~fred/#nosql_db
  
  Thanks to Jonathan Addelston for prompting me to write this tip!
  
  --Fred
MongoDB Tips
1. Intro to MongoDB
  
  Original Version: 5/31/2010
  Last Updated: 4/13/2012
  Applies to: MongoDB 1.8+
  
  Someone asked me recently about my experience with NoSQL databases, so I figured it was time to finally write up these notes about my use of MongoDB on a year-long project.
  
  My experience with MongoDB was all good.
  
  Free open source software.
  
  Very well supported by a very active mailing list, with questions being answered within minutes by the company (10gen) that produced the product, and sometimes extensive back-and-forth dialogs between them and various users as they drill down patiently into the newbie mistakes that the users make. Also, good on-line docs and printed O'Reilly books and local (Philly, DC, NY) MongoDB conferences. You can also buy a paid support contract, if you need it.
  
  Very easy to administer. No install required. Simply download and unzip. Run a MongoDB server on any Mac, Linux, Windows or Solaris box, whether a laptop, desktop or server, by simply typing:
       mongod
  at a command line. Run a client to access it by typing:
       mongo
  Drivers available for Java, JavaScript, C#, C++, and many others. Very easy replication and sharding. For example, to create a master:
       mongod --master
  and to create a slave:
       mongod --slave --source master_ip:master_port
  
  Sharding is just as easy, automatically slicing up the data so that different key ranges are stored on different servers. So are replica sets, in which the nodes automatically monitor each other, share data, notice when the primary vanishes, and elect a new primary from among the secondaries. If the old primary re-appears, it rejoins as a secondary.
  
  Very flexible. The entire database structure is one or more collections of objects with each object being a simple JSON document. No fixed schema. Each collection can contain a mixture of different structures of JSON documents. We stored people, companies, addresses, etc., all in the same collection. Each object in a collection can have its own unique set of fields if you like.
  
  In our case, we wanted more control over the structures of the documents. So we used JSON-Schema to define an explicit dynamic schema that we could easily enforce as needed, while still allowing our users to define their own fields of our objects or to define their own objects.
  
  Very fast, and scalable, and robust. The name "Mongo" is from "humongous". The CERN Large Hadron Collider throws data into MongoDB as fast as it can collect it from its near light-speed experiments. It is trivially easy to define indexes on the data, which happily ignore documents that have no such fields, and rapidly find those that do.
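  
  To make the indexing idea concrete, here is a rough Python sketch of an index that skips documents lacking the indexed field. It uses plain dicts and lists, not the MongoDB driver, and says nothing about how MongoDB builds its indexes internally. The function name build_sparse_index and the sample data are invented for the example.

```python
from collections import defaultdict

# Illustrative sketch only: plain Python, not MongoDB's API or internals.
# build_sparse_index is a hypothetical helper name.

def build_sparse_index(docs, field):
    """Map each value of `field` to the positions of the documents that
    have it. Documents without the field are left out of the index."""
    index = defaultdict(list)
    for position, doc in enumerate(docs):
        if field in doc:
            # Indexed values must be hashable (strings, numbers, etc.)
            index[doc[field]].append(position)
    return dict(index)


# One collection holding documents with different sets of fields
collection = [
    {"type": "person", "name": "Ann", "email": "ann@example.com"},
    {"type": "company", "name": "Acme"},                      # no email field
    {"type": "person", "name": "Bob", "email": "bob@example.com"},
]

email_index = build_sparse_index(collection, "email")

# Lookup goes through the index instead of scanning every document
matches = [collection[i] for i in email_index.get("bob@example.com", [])]
```

  The company document never appears in email_index, so it costs nothing at lookup time.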
  
  No joins though, so it is a totally different mindset. You tend to embed related subdocs in a parent doc rather than setting up foreign key relationships between multiple normalized tables. That turned out to be very easy and very natural. The JSON documents in the DB map very nicely to domain objects. No need for an ORM because there is no R (relational database) and no need for any M (mapping). There are only O (objects). Nice!
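  
  As a rough illustration of embedding versus normalizing, the Python sketch below takes two "tables" linked by a foreign key and folds them into a single document with embedded subdocs. The field names (person_id, kind, city) and the embed function are assumptions made up for the example, not part of our actual data model.

```python
import json

# Illustrative sketch only: hypothetical data and field names.

# Relational style: two "tables" linked by a foreign key (person_id)
normalized_people = [{"id": 1, "name": "Ann"}]
normalized_addresses = [
    {"person_id": 1, "kind": "home", "city": "Philadelphia"},
    {"person_id": 1, "kind": "work", "city": "New York"},
]


def embed(person, addresses):
    """Build one document with the person's addresses embedded as subdocs,
    so reading the person later needs no join."""
    doc = {"name": person["name"], "addresses": []}
    for address in addresses:
        if address["person_id"] == person["id"]:
            doc["addresses"].append(
                {"kind": address["kind"], "city": address["city"]}
            )
    return doc


person_doc = embed(normalized_people[0], normalized_addresses)

# The document is plain JSON-compatible data, so it converts to and from
# a JSON string with no mapping layer in between
round_tripped = json.loads(json.dumps(person_doc))
```

  The resulting person_doc is already the shape a domain object would have, which is the sense in which no mapping step is needed.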
  
  No transactions either, but inserts and updates of a single document are atomic, even for very complex documents, so once you've structured the data to not need joins, you don't miss transactions much either.
  
  For more details of our experience, see the video of a 45-minute talk by Mike Brocious, our tech lead:
  http://screencasts.chariotsolutions.com/webpage/how-mongo-db-helps-visibiz-tackle-social-crm
  Or just read the slides:
  http://www.slideshare.net/mikebrocious/mongodb-at-visibiz
  You can also download the audio as an MP3 and play it in the car:
  http://techcast.chariotsolutions.com/philly-ete-2011-podcast-3-how-mongo-db-helps-visibiz-tackle-social-crm
  Also, see the MongoDB row of my links page:
  http://bristle.com/~fred/#mongodb
  
  Thanks to Thor Collard for prompting me to finally write this up!
  
  --Fred
Cassandra Tips
1. Intro to Cassandra
  
  Original Version: 2/10/2012
  Last Updated: 2/10/2012
  Applies to: Cassandra
  
  I haven't used Cassandra yet, but I know a little about it.
  
  It was originally created by Facebook for their use, to handle massive amounts of data, and then open-sourced as an Apache project. It's one of the new NoSQL databases, like MongoDB, CouchDB, etc.
  
  See my links at:
  http://bristle.com/~fred/#nosql_db
  
  Especially, you might want to start with the brief summary at:
  http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis
  and the quick explanation from Facebook at:
  http://www.facebook.com/note.php?note_id=24413138919
  before going to the much more detailed comparison (100 pages) at:
  http://www.christof-strauch.de/nosqldbs.pdf
  and the official Apache docs at:
  http://cassandra.apache.org/
  
  --Fred
CouchDB Tips
1. Intro to CouchDB
  
  Original Version: 2/10/2012
  Last Updated: 2/10/2012
  Applies to: CouchDB
  
  I haven't used CouchDB yet. It's an open source Apache project, and one of the new NoSQL databases, like MongoDB, Cassandra, etc. See my links page for comparisons of it with other NoSQL databases:
  
  http://bristle.com/~fred/#nosql_db
  
  Especially, you might want to start with the brief summary at:
  http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis
  and the brief comparison (by the MongoDB team) of CouchDB with MongoDB and MySQL:
  http://www.mongodb.org/display/DOCS/MongoDB,+CouchDB,+MySQL+Compare+Grid
  http://www.mongodb.org/display/DOCS/Comparing+Mongo+DB+and+Couch+DB
  before going to the much more detailed comparison (100 pages) at:
  http://www.christof-strauch.de/nosqldbs.pdf
  and the official Apache docs at:
  http://couchdb.apache.org/
  
  --Fred
