Sunday, July 25, 2010

SQL Server and Rich-Sparse Data (Real Estate) – Part I

Over the last two years I have been involved with two clients that deal with real estate (different market segments, so no conflict of interest for me). I would describe real estate data as sparse, rich data. The typical scenario is that the data on one house consists of MLS data and County Assessor data. In SQL Server 2008 R2, the number of fields involved quickly exceeds the number of columns available in a regular SQL Server table (1,024 for a non-wide table), but stays below the limit for a wide table (30,000).  Wide tables were introduced in SQL Server 2008, so if you are forced to run on older versions of SQL Server your options change.


In the case of wide tables, a single Insert, Select, or Update statement is restricted to 4,096 columns (despite having 30,000 columns available), so touching every column can take up to eight statements (30,000 / 4,096 ≈ 7.3). A wide table has a column set, which is essentially an untyped XML column that combines all of the sparse columns [more info]. This means that you could use a plain XML column if you are running SQL Server 2005. There’s a bit of a dilemma here: using XML gives you the ability to update all 30,000 attributes by updating one XML column instead of having to split the operation into up to eight statements; the other side of the coin is that you would need to generate the computed columns etc. from the XML yourself.


For displaying data, I may need up to eight selects against a wide table, but I can get all 30K attributes in a single select with an XML column type.


Before diving into these issues, I should state the operational needs that I often encountered, namely:

  • A need to retrieve the original/earliest record in full
  • A need to list what changed between updates --- only what changed.
  • A need to query the current record.

For discussion purposes, we will assume that the data coming in is either XML or some format that can be converted to XML, with all of the properties expressible as attributes (i.e. a flattened structure – in some cases, that means doing cross-products of child data). The logic becomes:

  • Get the key from the Xml
  • If the key exists in the table, go to Update.
  • Insert:
    • Insert into [OriginalXml]
    • Insert into [CurrentXml]
    • Insert record with no attributes into [DeltaXml] with:
      • Received/Processed date
      • VersionNo = 0
      • End of Processing
  • Update:
    • Retrieve record from [CurrentXml]
    • Update [CurrentXml]
    • Walk all attributes in currentRecord and newRecord and retain only those attributes that have changed.  Use an XmlTextReader for performance.
      • Insert into [DeltaXml] the resulting Xml with:
        • VersionNo = Max(VersionNo) + 1 for the key
    • End of Processing.
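The diff step in the update path above can be sketched in Python (the post uses an XmlTextReader in .NET; Python's ElementTree stands in here, and the `<delta>` element name is my own assumption since the fragments below omit an element name):

```python
import xml.etree.ElementTree as ET

def compute_delta(current_xml: str, new_xml: str) -> str:
    """Compare the attributes of the current record and the incoming
    record and keep only those that changed (plus the key)."""
    current = ET.fromstring(current_xml).attrib
    new = ET.fromstring(new_xml).attrib
    changed = {name: value for name, value in new.items()
               if name == "key" or current.get(name) != value}
    # Serialize the surviving attributes as a hypothetical <delta> element.
    return ET.tostring(ET.Element("delta", changed), encoding="unicode")

current = '<rec key="3" AP="209000" BTH="4" />'
incoming = '<rec key="3" AP="199000" BTH="4" />'
print(compute_delta(current, incoming))
```

Only the changed asking price (plus the key) survives into the delta record, which is what keeps [DeltaXml] small.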

Now if a user wants to see the history of a property, it is easy to show: just do a select on [DeltaXml] and spit out the items, for example:

  1. Dec 11, 2009 <key="3" AP="209000" BTH="4" />
    1. Asking Price Changed: $209,000
    2. Bathrooms Changed: 4
  2. Dec 31, 2009 <key="3" BTH="3" />
    1. Bathrooms Changed: 3
  3. Jan 10, 2010 <key="3" AP="199000" />
    1. Asking Price Changed: $199,000

In the above example, the original bathroom count may have been 3; it was changed to 4 (clerical error?) and then changed back to 3.  You get performance in retrieving the data because you do not have to walk 30K fields. Additionally, you have the complete history available (update by update) by just applying each deltaRecord to the originalRecord, without sucking up massive storage requirements.
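Rendering the change list shown above from a delta record is a straightforward attribute walk. Here is a Python sketch; the code-to-label map and the `<delta>` element name are my own assumptions for illustration:

```python
import xml.etree.ElementTree as ET

# Hypothetical map from attribute codes to display labels.
LABELS = {"AP": "Asking Price Changed", "BTH": "Bathrooms Changed"}

def describe(delta_xml: str) -> list:
    """Turn one delta record into human-readable change lines."""
    lines = []
    for name, value in ET.fromstring(delta_xml).attrib.items():
        if name == "key":
            continue  # the key identifies the property; it is not a change
        if name == "AP":
            value = "${:,}".format(int(value))  # format the price as currency
        lines.append("{}: {}".format(LABELS[name], value))
    return lines

print(describe('<delta key="3" AP="209000" BTH="4" />'))
# ['Asking Price Changed: $209,000', 'Bathrooms Changed: 4']
```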

In theory, you could discard the original (saving space) and work backwards from the currentRecord, but in practice users want to see the originalRecord – so I would rather avoid the overhead and performance hit of doing a reconstruction. Second, having both [OriginalXml] and [CurrentXml] allows audits to be done:

  • [OriginalXml] +[DeltaXml] = [CurrentXml]
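This audit identity can be checked mechanically by replaying the deltas, in version order, onto the original record. A minimal Python sketch; the sample starting values are hypothetical:

```python
import xml.etree.ElementTree as ET

def audit(original_xml, delta_xmls, current_xml):
    """Verify that [OriginalXml] + [DeltaXml] = [CurrentXml] by applying
    each delta, in version order, to the original record."""
    state = dict(ET.fromstring(original_xml).attrib)
    for delta in delta_xmls:
        state.update(ET.fromstring(delta).attrib)  # overlay changed attributes
    return state == dict(ET.fromstring(current_xml).attrib)

original = '<rec key="3" AP="215000" BTH="3" />'   # hypothetical starting values
deltas = ['<delta key="3" AP="209000" BTH="4" />',
          '<delta key="3" BTH="3" />',
          '<delta key="3" AP="199000" />']
current = '<rec key="3" AP="199000" BTH="3" />'
print(audit(original, deltas, current))  # True
```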

In Part II I will look at the issues in searching records and performance differences between raw XML and wide tables.
