March 16th, 2015
Layering a Data Warehouse on MongoDB
Author: Evan Morton
Vistaprint’s Promotional Products site uses MongoDB for its schema flexibility and ease of use. Our analysts need to access some of the site’s data in order to run reports and to cross-reference data across services (for example, “Show me the trends in posted prices and how they affect the ordering of those products”). Having them access MongoDB directly is impractical, however, because:
- Analysts are familiar with SQL, not MongoDB.
- Complex SQL queries can be extremely difficult to implement in Mongo.
- Each Mongo database is a separate source of data, and may not be physically co-located.
- Most importantly, the format and semantics of the MongoDB documents (“document” is the NoSQL equivalent of row or record) contain implementation details that analysts should not be required to learn, and that we want to be able to change without affecting the analysts’ work. In other words, the warehouse needs a layer of abstraction.
The solution is to have the best of both worlds: to continue to use our MongoDB backing store for our services, but have free access to the data for ad-hoc querying in SQL, specifically Microsoft SQL Server. Here is how we made this happen.
There are 26 collections in Mongo, with a total of 399 fields, so a fully manual migration to SQL Server would be extremely time-consuming. But many field-specific changes would be necessary along the way, so a fully automated process would not work either.
We designed and implemented a semi-automated, two-layer solution.
Layer 1: Copy and Flatten MongoDB
We created an automatic translator to copy the MongoDB collections to corresponding SQL tables. It runs once per minute, checking the MongoDB oplog (a log of changes, similar to SQL’s transaction log) and updating the SQL tables accordingly. The software is so thoroughly automated that new fields in existing collections are handled automatically. New collections require only minutes of effort.
This works very well, but simply copying each collection to a SQL table does not handle a key difference between NoSQL and SQL Server databases: a NoSQL document field can be a list or a scalar, but a SQL Server column can hold only scalars. The solution is to “flatten” the NoSQL schema by creating “child tables.” As an example, suppose we’re working with a collection of customer orders in MongoDB. A document in the Order collection has some scalar fields, like BuyerName and City, and a list field, Items. Suppose Smith ordered 20 of one item, 3 of another, and 4 of a third, while DuPont ordered 10 of one item and 5 of another.
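The two order documents just described might look like this in MongoDB. (The quantities come from the example above; the City values and the Sku field are invented for illustration.)

```json
[
  {
    "BuyerName": "Smith",
    "City": "Boston",
    "Items": [
      { "Sku": "MUG-01", "Quantity": 20 },
      { "Sku": "PEN-07", "Quantity": 3 },
      { "Sku": "HAT-02", "Quantity": 4 }
    ]
  },
  {
    "BuyerName": "DuPont",
    "City": "Lyon",
    "Items": [
      { "Sku": "MUG-01", "Quantity": 10 },
      { "Sku": "PEN-07", "Quantity": 5 }
    ]
  }
]
```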
Our automatic translator turns the Order collection into two tables: raw_order for the non-list data in Order, and raw_item for the per-item data. (The “raw_” prefix will be explained below.) They are linked by columns called row_id and parent_row_id, both added by the automatic translator.
The 1-to-many relationship from raw_order to raw_item can be described with:
on raw_item.parent_row_id = raw_order.row_id
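So an analyst who wants each buyer alongside the items they ordered could write something like the following. (The Quantity column is illustrative; the actual column names simply mirror the MongoDB field names.)

```sql
SELECT o.BuyerName, o.City, i.Quantity
FROM raw_order AS o
JOIN raw_item AS i
  ON i.parent_row_id = o.row_id;
```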
The whole flattening concept works recursively on nested lists. So if, for example, the items within a customer order themselves had a list field, that child table would have a child table of its own. In the implementation as it now stands, one of the tables has four grandchildren.
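Joining through a nested level just repeats the same pattern. If each item had, say, a list of customizations, the hypothetical grandchild table would be reached by chaining the parent/child join:

```sql
SELECT o.BuyerName, c.*
FROM raw_order AS o
JOIN raw_item AS i
  ON i.parent_row_id = o.row_id
JOIN raw_customization AS c   -- hypothetical grandchild table, for illustration
  ON c.parent_row_id = i.row_id;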
The work on layer 1 (the automatic translator) was done by DBAs who had knowledge of MongoDB and SQL, but they did not need to know the meaning of the MongoDB documents.
Layer 2: User Views in SQL Server
The raw_* tables solve the first three issues listed at the beginning of this article, but not the fourth: a layer of abstraction. That is, if the raw_* tables were our end product, the analysts would still see the structure and semantics of the MongoDB collections, merely translated into SQL. We solved this problem with views, a SQL feature often described as virtual tables. We created views into the raw_* tables, and gave the warehouse users read access to those views, but not direct access to the raw_* tables.
The views hide many implementation specifics that the raw_* tables inherit from the MongoDB collections. Additionally, they rationalize naming, fix some imperfections of the underlying data, and contain some counts and sums to reduce the number of joins that users need to do.
The underlying tables were named raw_*, e.g., raw_order, because they are directly from MongoDB. Corresponding views are named without the prefix, e.g. order.
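A view over raw_order might look roughly like this sketch. The column renames, the ISNULL cleanup, and the item count stand in for the kinds of fixes described above; they are not the actual definitions. (The brackets around the view name are needed because ORDER is a reserved word in T-SQL.)

```sql
CREATE VIEW [order] AS
SELECT
    o.row_id           AS order_id,
    o.BuyerName        AS buyer_name,
    ISNULL(o.City, '') AS city,        -- patch an imperfection in the raw data
    (SELECT COUNT(*)
     FROM raw_item AS i
     WHERE i.parent_row_id = o.row_id) AS item_count  -- precomputed to save users a join
FROM raw_order AS o;
```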
The work on layer 2, the views, was done by the author, a software developer with knowledge of SQL and the domain, but needing little knowledge of MongoDB.
When you make data easier to access, expect side effects. An example involves a SQL-based monitor/alert system our company had long before this project. Once we had the MongoDB data in SQL, we set up monitors/alerts to inform people of some problem states that orders can get into, some mistakes and omissions in entering data, and some other bad conditions.
Some other side effects are less desirable. Sometimes you do not want to expose all of your raw data. For example, some information in MongoDB was proprietary to one of our business partners, and we had to add some security around some of our views. Another downside is that adding the user views means an extra step in making data available. However, we judged it better to have a curator in place who understands the data, rather than making the raw data available and having the analysts draw false conclusions.
The data warehouse has generally been a success. In the future, we would like to cover more cases, so we plan to import some data from service calls instead of dumping from the raw databases. In some cases, we have to do this since the business logic in the service needs to be executed to get the correct data. In addition, there is data external to our team’s systems that would be useful to have (such as currency tables), so we plan on importing some other data sources as well.