A Robust Pattern for importing diverse data into SQL Server

Last year I had a minor project importing a diverse collection of data into SQL Server. The typical file had thousands of records; on occasion, millions of records that were batch-sequence sensitive. A common C# pattern is to read one record at a time and insert it into SQL Server. This pattern is not robust: if the insert program dies unexpectedly, it may insert duplicate records on restart (you can start bolting on safeguards, but that is not a KISS solution). SQL utilities such as BCP tend to require a lot of fiddling to get the import right, and tend not to be very tolerant of data changes.


The problem was further complicated because there was a variety of uploading machines (on isolated domains) that were difficult to update with new code (they were security-controlled machines). In general, data was dropped into folders that were monitored by Windows services, which performed the upload.


The pattern that I ended up using was simple:

  • Convert each file to XML in the upload program.
  • Insert the complete XML into an upload table via a stored procedure.
    • The stored procedure checks for duplicates.
  • Delete the file if the insert succeeded.

Processing of the data can now be done from this table, in upload sequence (or periodically, using the file-date sequence if needed). Once uploaded, each batch of records can be examined and processed into temp tables that are applied in a single SQL transaction, minimizing blocking on the system.
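As a sketch, shredding a batch looks like this (assuming the f0, f1, ... attribute layout produced by the tab-delimited uploader shown further down; the column names are hypothetical):

```sql
-- Hypothetical sketch: shred the oldest uploaded batch into a temp table.
DECLARE @id int = (SELECT MIN(id) FROM UploadData)   -- oldest batch first

SELECT item.value('@f0', 'nvarchar(255)') AS Col0,   -- first tab field
       item.value('@f1', 'nvarchar(255)') AS Col1    -- second tab field
INTO   #Batch
FROM   UploadData u
CROSS APPLY u.[Xml].nodes('/data/item') AS t(item)   -- one row per <item>
WHERE  u.id = @id
```

The nodes() method turns each <item> element into a row, and value() pulls out the attributes, so the flat file is back in relational form without ever having left the database.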


The code below shows how a tab-delimited file is handled.


  • If a new type of data is dropped in the upload folder, it will still be inserted into the upload table (there is no need to update the uploading programs). The handling of the data is done in SQL Server: new data is simply collected until you are ready to process it.


// Required namespaces: System, System.Configuration, System.Data,
// System.Data.SqlClient, System.IO, System.Xml

public static SqlConnection GetConnection()
{
    var ret = new SqlConnection(ConfigurationManager.ConnectionStrings["ZuneAudit"].ConnectionString);
    ret.Open();
    return ret;
}

public static SqlCommand GetSpCmd(string spName)
{
    // Caller is responsible for disposing the command and its connection.
    return new SqlCommand(spName, GetConnection()) { CommandType = CommandType.StoredProcedure, CommandTimeout = 300 };
}

public static string UploadData(string xml)
{
    using (var sp = GetSpCmd("p_UploadData"))
    using (sp.Connection)   // dispose the connection opened in GetSpCmd
    {
        sp.Parameters.AddWithValue("xml", xml);
        return (string)sp.ExecuteScalar();
    }
}

public static void ImportTabDelimitedData(DirectoryInfo folder)
{
    char[] tabsep = { '\t' };
    Array.ForEach(folder.GetFiles("*.*"), file =>
    {
        var sw = new StringWriter();
        using (var reader = new StreamReader(file.FullName))
        using (var writer = new XmlTextWriter(sw))
        {
            // File-level metadata; the stored procedure uses it for duplicate detection.
            writer.WriteStartDocument(true);
            writer.WriteStartElement("data");
            writer.WriteAttributeString("file", file.Name);
            writer.WriteAttributeString("date", file.LastWriteTime.ToString("s"));
            writer.WriteAttributeString("length", file.Length.ToString());
            // One <item> element per line; tab fields become attributes f0, f1, ...
            string aline;
            while (null != (aline = reader.ReadLine()))
            {
                writer.WriteStartElement("item");
                var fields = aline.Split(tabsep, StringSplitOptions.None);
                var fno = 0;
                Array.ForEach(fields, field => writer.WriteAttributeString(string.Format("f{0}", fno++), field));
                writer.WriteEndElement();
            }
            writer.WriteEndElement();
            writer.WriteEndDocument();
        }
        // Only delete the source file once the insert is confirmed.
        if ("Success" == UploadData(sw.ToString()))
        {
            file.Delete();
        }
    });
}

The SQL table and stored procedure are also simple:

CREATE TABLE [dbo].[UploadData](
  [id] [int] IDENTITY(1,1) NOT NULL,
  [Xml] [xml] NOT NULL,
  [FileName] [nvarchar](255) NOT NULL,
  [Length] [int] NOT NULL,
  [Date] [datetime] NOT NULL,
 CONSTRAINT [PK_UploadData] PRIMARY KEY CLUSTERED 
(
  [id] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
) ON [PRIMARY]
GO
CREATE PROCEDURE p_UploadData
  @Xml as Xml
AS
BEGIN
  Declare @FileName nvarchar(255)
  Declare @Length int
  Declare @Date DateTime
  Select @FileName=@Xml.value('(data/@file)[1]','nvarchar(255)'),
    @Length=@Xml.value('(data/@length)[1]','int'),
    @Date=@Xml.value('(data/@date)[1]','datetime')
  If not Exists (Select 1 From UploadData
    where @FileName=FileName
    and @Date=Date
    and @Length=Length)
  BEGIN
    INSERT INTO [UploadData]
           ([Xml]
           ,[FileName]
           ,[Length]
           ,[Date])
     VALUES
           (@Xml
           ,@FileName
           ,@Length
           ,@Date)
    Select 'Success'
  END
  Else
    Select 'Duplicate'
END
GO
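A quick smoke test of the duplicate check, using a made-up two-row batch (the file name and values here are purely illustrative):

```sql
-- Hypothetical smoke test: first call inserts, second call should return 'Duplicate'
-- because the file name, date, and length all match an existing row.
EXEC p_UploadData N'<data file="sample.txt" date="2010-02-12T16:00:28" length="42">
  <item f0="a" f1="b"/><item f0="c" f1="d"/></data>'

EXEC p_UploadData N'<data file="sample.txt" date="2010-02-12T16:00:28" length="42">
  <item f0="a" f1="b"/><item f0="c" f1="d"/></data>'
```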

For processing the data, this pattern is nice because:

  • If the data format changes unexpectedly, breaking the processing code, you can simply stop processing all files of that type until the issue is fixed (if a file is not processed, skip to the next type), thus maintaining sequence.
    • All processing decisions are made in SQL Server despite the diverse data sources.
    • If batches carry serialization information, it can be used both to monitor the data and to order it appropriately.
  • Processing can be done in a SQL transaction, which is usually easier.
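A minimal sketch of that transactional processing step (the target table Album and its columns are hypothetical; the f0/f1 attributes come from the uploader):

```sql
-- Hypothetical: apply the oldest unprocessed batch inside one transaction,
-- so the system sees either the whole batch or none of it.
BEGIN TRANSACTION
  DECLARE @id int = (SELECT MIN(id) FROM UploadData)  -- preserves upload sequence

  INSERT INTO Album (Name, TrackCount)                -- hypothetical target table
  SELECT item.value('@f0', 'nvarchar(255)'),
         item.value('@f1', 'int')
  FROM   UploadData u
  CROSS APPLY u.[Xml].nodes('/data/item') AS t(item)
  WHERE  u.id = @id

  DELETE FROM UploadData WHERE id = @id               -- batch fully applied
COMMIT TRANSACTION
```

If anything fails mid-batch, rolling back leaves the row in UploadData untouched, so the batch can simply be retried.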

Seeing what has been uploaded is also easy:

Select [Xml].value('(data/@file)[1]','nvarchar(255)') as [FileName],
    [Xml].value('(data/@length)[1]','int') as [Length],
    [Xml].value('(data/@date)[1]','datetime') as [Date]
From UploadData

This produces output such as that shown below (note that one source file was 254 MB!):

FileName    Length        Date
album       91 863 093    2010-02-12 16:00:28.000
albumjoin   254 826 801   2010-02-12 16:01:36.000

In short, the pattern is simple, robust and flexible.
