Monday, December 27, 2010

Know Your Data

Knowing your data might be the most important quality of the new database administrator (DBA). Just like a marketing person needs to know their customers, a CEO needs to know their market; a DBA needs to truly understand their data. This article is about how knowing your data becomes even more important with cloud databases like SQL Azure Database, and NoSQL style data stores like Cassandra and Windows Azure Storage.

SQL Server

Using a scale up approach for your on-premise SQL Server database, the SQL Server DBA needed:

  • A good working understanding behind the relationships of the data, i.e. how tables related to each other. This allowed the DBA to develop foreign key to primary key relationships, one of the mantras of a relational database.
  • Understanding how data was being inserted and updated into the table allowed the DBA to optimize their queries for locking, partitions for performance, and transaction logs for growth.
  • Understanding how data was queried from the database and how paging worked, allowed the database administrator to tune the database with covered indexes and denormalization of data for better performance.
  • How much data was being added over a period of time, how to scale up the database to grow with the data and how to warehouse that data.

SQL Azure

If you are designing your SQL Azure data store for growth, you need to design in a scale out approach. For more about scale out your SQL Azure database read: Scale Out Your SQL Azure Database. Scaling out divides your data across multiple databases, a concept that SQL Azure calls federation – federation is the concept of sharding in SQL Azure. For more about SQL Azure Federation read: Intro to SQL Azure Federations: SQL Azure Federations Overview.


Federating your data breaks transaction boundaries and pushes SQL Azure towards a NoSQL database structure. The reason I call it a step toward NoSQL is that SQL Azure doesn’t allow you to have transactions that span multiple databases. The parity of transactions to a single database is caused in part because the databases exist on different hardware. If the transaction needs to travel on the network there would be considerable performance degradation. Because SQL Azure is on commodity hardware, the maximum scalable size is 50 Gigabytes. If you need a database bigger than 50 Gigabytes, you need to federate your data across multiple databases and across transaction boundaries. You lose the ACID principles of the database – closer to NoSQL than to a relational database.

The DBA needs to understand more about data in order to properly partition or federate the data correctly. Specifically the DBA needs to know:

  • How to distribute the data into acceptable transaction boundaries.
  • How to find the natural dividing lines of the data so that you can partition it across multiple databases.

Transaction Boundaries

Every query on SQL Azure exists within a transaction context – a principle of ACID relational databases. A transaction is treated in a coherent and reliable way, independent of other transactions. Just like SQL Server, SQL Azure is capable of creating transaction contexts for any statement that you might want to create against the database. And it can rollback any transactions where a failure might prevent the whole transaction from completing.


Regardless of SQL Azure transaction capabilities, typically very few of the statements executed against the database need to have a transaction context. For example, when selecting data with a forward-only, read-only cursor that only row-locks – the transaction context is restricted to the row that is being read. However, what if that table is only updated when the application is offline? The table doesn’t require any locking for read – since there are never any writes. The database doesn’t know this, only the application – or more specifically the application developer.


The pretense of the NoSQL database is that the application developer knows every case where they need a transaction context and every case where they do not. Since true transaction needs are typically very low, the application developer programs all transactions into the client code – and doesn’t require a database that supports transactions. They might even program the rollback code.

Since the NoSQL developer knows the data (some might say better than the relational database developer), they can program their application to use multiple databases to store the data – in other words they know how to scale out using homegrown transactions on the client application.


So they can use commodity hardware in the cloud which is significantly more cost effective than on-premise SQL installations managing the same data volume.

Dividing Lines

If the application is highly dependent on transactions – the developer can keep all transaction to a single database by partitioning the data on transaction boundaries. For example, if the application is supporting software as a service (SaaS) that is used by multiple users in such a way that no user modifies the other user’s data – the user is the transaction boundary. All the data for a user is stored in the same database. This way all the data reads or write happen together in a single database and they don’t span databases. Each database holds a set of users that it can support, however no two databases hold data for the same users.


Knowing your data allows to you “see” the natural transaction boundaries and gives you an edge on scaling out. I foresee the role of the DBA trending away from hardware maintenance and towards database architecture.