Schema-Less Is (Usually) a Lie

People are fond of categorizing trendy database engines by what they're not; "schema-less" is just one example. Overly broad, negative definitions are not the most helpful thing in the world (my whiteboard is technically schema-less and NoSQL). "Schema-less" goes the extra mile -- not only is it overly broad, but it's normally not even true. MongoDB users deal with schemas, and our customers frequently run into problems that need to be fixed with schema changes. As a result, we tend to think about schema in two parts: the "data schema" and the "query schema".

The Query Schema

MongoDB-based applications have another flavor of schema, which we tend to call the "query schema". The query schema exists in both the application code and the DB engine. Indexes on collections in MongoDB are very much "schema" and are one of the most important concepts to master for an application running at scale. While a flexible data schema can work well, a query schema should be mostly static and ensure that queries match up to indexes as precisely as possible. Getting this portion of a schema wrong can result in all kinds of pain down the road, since indexes are expensive to build on large data sets and successful sharding needs a solid base to build on.
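One way to keep a query schema honest is to write it down next to the code and check it. The sketch below is a pure-Python illustration, not MongoDB's query planner: it assumes a hypothetical "orders" collection, made-up index names, and equality-only filters, and it treats a query as "matched" when its fields are exactly the leading keys of some compound index. Range predicates, sort order, and everything else the real planner weighs are deliberately ignored.

```python
from typing import Dict, List, Optional, Sequence

# Hypothetical index definitions for an imaginary "orders" collection.
# Each index is an ordered list of field names, as in a compound index.
INDEXES: Dict[str, List[str]] = {
    "customer_status": ["customer_id", "status", "created_at"],
    "sku": ["sku"],
}


def matching_index(query_fields: Sequence[str],
                   indexes: Dict[str, List[str]] = INDEXES) -> Optional[str]:
    """Return the name of the first index whose leading keys are exactly
    the queried fields, or None if no index prefix matches.

    Simplification: only equality filters are modeled here.
    """
    wanted = set(query_fields)
    if not wanted:
        return None
    for name, keys in indexes.items():
        prefix = keys[:len(wanted)]
        if len(prefix) == len(wanted) and set(prefix) == wanted:
            return name
    return None


def unindexed_queries(queries: Dict[str, Sequence[str]]) -> List[str]:
    """List the named query shapes that no index prefix covers."""
    return [name for name, fields in queries.items()
            if matching_index(fields) is None]
```

Run against the list of query shapes an application actually issues, a check like this flags the ones that would fall outside every index prefix (here, a query on "status" alone) before they show up as slow queries in production.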

This is, incidentally, one of the reasons 10gen recommends scaling MongoDB vertically as much as practical. The need to shard early (less than 100GB of data) is normally an indication of a screwed-up query schema, and a sign that sharding will probably be incredibly painful.

Is MongoDB really schema-less?

There are very few databases that are actually schema-less. The term arguably works for Solr/Lucene since fields to be indexed can be defined at the "document" level, but nearly every other data store has a schema defined somewhere. The differentiator between databases is almost always which bits of the schema live in application code vs. the database engine. Quality SQL databases have very strong schema support in the database engine. Dynamo-style DBs (Cassandra, Riak, etc.) mostly push data/query schema to the application level. MongoDB sits between the two and seems to have struck a balance that gives developers a lot of power.
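To make "schema in application code" concrete, here is a minimal sketch of a data schema that lives entirely in the application rather than in the engine. The "user" document shape, the field names, and the `schema_errors` helper are all hypothetical, chosen only for illustration.

```python
from typing import Any, Dict, List

# Hypothetical data schema for a "user" document: field name -> expected type.
USER_SCHEMA: Dict[str, type] = {"email": str, "age": int}


def schema_errors(doc: Dict[str, Any],
                  schema: Dict[str, type] = USER_SCHEMA) -> List[str]:
    """Return a list of problems; an empty list means the document conforms."""
    errors = []
    for field, expected in schema.items():
        if field not in doc:
            errors.append(f"missing field: {field}")
        elif not isinstance(doc[field], expected):
            errors.append(f"{field} should be {expected.__name__}")
    return errors
```

The database never sees this check, but the schema is no less real for it: every document the application writes has been through it.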