Friday, February 5, 2016

Prototyping in MongoDB with the Aggregation Pipeline stage operator $sample

The World Map as a visual example

To show how random sampling works in a MongoDB query, this Node.js Express website displays a world map with random latitude/longitude points. Each refresh of the page produces a new set of random points. The returned documents are displayed below the map.



Once the website is up and working with data points, we will play with the query to see how the data points change in response.

The demonstration video is available on YouTube.


Setup steps for the website

Setup

This article assumes you have no MongoDB deployment, no website, and no data. It does assume you have an account on Compose. Each step is broken out and explained. If you already have one of these pieces, such as a MongoDB database with latitude/longitude data or a website that displays it, skip ahead to the next step.
  1. get website running, display map with no data
  2. set up the MongoDB+ SSL database
  3. get mock data including latitude and longitude
  4. insert the mock data into database
  5. update database data types
  6. verify world map displays data points

Play

When the website works and the world map displays data points, let's play with it to see how $sample impacts the results.
  1. understand the $sample operator
  2. change the row count
  3. change the aggregation pipeline order
  4. prototype with $sample

System architecture

The data import script is /insert.js. It reads a JSON file and inserts its contents into a MongoDB collection without any transformation.

The data update script is /update.js. It converts the latitude/longitude strings to numeric types and adds GeoJSON values.

The server is a Node.js Express website using the native MongoDB driver. The code uses the filesystem, url, and path libraries. This is a bare-bones Express website. The /server/server.js file is the web server, with /server/query.js as the database layer. The server runs at http://127.0.0.1:8080/map/. This address is routed to /public/highmap/world.highmap.html. The data query is made to http://127.0.0.1:8080/map/data/ from the client file /public/highmap/world.highmap.js.
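For orientation, a minimal sketch of that routing might look like the following. This is not the repository's actual server.js; query.get is an assumed helper name standing in for the database layer in /server/query.js.

// minimal routing sketch, assuming Express and an assumed query.get helper
var express = require('express');
var path = require('path');
var query = require('./query');   // /server/query.js, the database layer

var app = express();

// serve the client files in /public
app.use(express.static(path.join(__dirname, '../public')));

// http://127.0.0.1:8080/map/ -> /public/highmap/world.highmap.html
app.get('/map/', function (req, res) {
    res.sendFile(path.join(__dirname, '../public/highmap/world.highmap.html'));
});

// http://127.0.0.1:8080/map/data/ -> documents selected by the aggregation pipeline
app.get('/map/data/', function (req, res) {
    query.get(req.query, function (err, docs) {
        if (err) return res.status(500).send(err.message);
        res.json(docs);
    });
});

app.listen(8080);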

The client files are in the /public directory. The main web file is /highmap/world.highmap.html. It uses jQuery as the JavaScript framework and highmap as the mapping library, which plots the points on the world map. The size of the map is controlled by the /public/highmap/world.highmap.css stylesheet via the map id.

Step 1: The NodeJS Express Website

In order to get the website up and running, you need to clone this repository, make sure Node.js is installed, and install the dependency libraries found in the package.json file.

Todo: install dependencies

npm install

Once the dependencies are installed, you can start the web server.

Todo: start website

npm start

Todo: Request the website to see the world map. The map should display successfully with no data points.

http://127.0.0.1:8080/map/


Step 2: Setup the Compose MongoDB+ Deployment and Database

You can move on to the next section if you already have a MongoDB deployment with SSL to use and have the following items:
  • deployment public SSL key in the /server/clientcertificate.pem file
  • connection string for that deployment in /server/config.json
Todo: Create a new deployment on Compose for a MongoDB+ database with an SSL connection.




While still in the Compose back office, open the new deployment and copy the connection string.

Todo: Copy connection string

You will need the entire connection string in order to insert, update, and query the data. The connection string uses a user and password at the beginning and the database name at the end.




You also need to get the SSL public key from the Compose Deployment Overview page. You will need to log in with your Compose user password in order for the public key to show.



Todo: Save the entire SSL Public key to /server/clientcertificate.pem.

If you save it somewhere else, you need to change the mongodb.certificatefile setting in /server/config.json.

You will also need to create a user in the Deployment's database.




Todo: Create new database user and password. Once you create the user name and password, edit the connection string to include the user, password, and database name.

connection string format

mongodb://USER:PASSWORD@URL:PORT,URL2:PORT2/DATABASENAME?ssl=true

connection string example

mongodb://myname:mypassword@aws-us-east-1-portal.2.dblayer.com:10907,aws-us-east-1-portal.3.dblayer.com:10962/mydatabase?ssl=true


Todo: Change the mongodb.url setting in the /server/config.json file to this new connection string.

{
    "mongodb": {
        "data": "/data/mockdata.json",
        "url": "mongodb://DBUSER:DBPASSWORD@aws-us-east-1-portal.2.dblayer.com:10907,aws-us-east-1-portal.3.dblayer.com:10962/DATABASE?ssl=true",
        "collection": "mockdata",
        "certificatefile": "/clientcertificate.pem",
        "sample": {
            "on": true,
            "size": 5,
            "index": 1
        }
    }
}

Step 3: The Prototype Data

If you already have latitude and longitude data, or want to use the mock file included at /data/mockdata.json, you can skip this step.

Use Mockaroo to generate your data. This lets you get data, including latitude and longitude, quickly and easily. Make sure to include the latitude and longitude fields and export the data in JSON format.



Make sure you have at least 1000 records for a good show of randomness and save the file as mockdata.json in the data subdirectory.

Todo: Create mock data and save to /data/mockdata.json.
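For reference, a single record in the mock data might look something like this. The field names match the ones used later in the aggregation pipeline; your Mockaroo schema may differ, and note that latitude and longitude arrive as strings:

{
    "first_name": "Jane",
    "last_name": "Doe",
    "latitude": "47.6062",
    "longitude": "-122.3321"
}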


Step 4: Insert the Mock Data into the mockdata Collection

The insert.js script imports the /data/mockdata.json file into the mockdata collection of the MongoDB database.

Note: This script uses the native MongoDB driver and the filesystem node package. The Mongoose driver also supports SSL connections and the $sample operator. If you are using any other driver, you will need to verify that it supports both SSL and $sample.

The configuration is kept in the /server/config.json file. Make sure it is correct for your mongoDB url, user, password, database name, collection name and mock data file location. The configuration is read in and stored in the privateconfig variable of the insert.js script.

The mongos section of the config variable is for the SSL mongoDB connection. You shouldn't need to change any values.

insert.js


var MongoClient = require('mongodb').MongoClient,  
  fs = require('fs'),
  path = require('path');

var privateconfig = require(path.join(__dirname + '/config.json'));
var ca = [fs.readFileSync(path.join(__dirname + privateconfig.mongodb.certificatefile))];
var data = fs.readFileSync(path.join(__dirname + privateconfig.mongodb.data), 'utf8');
var json = JSON.parse(data);

MongoClient.connect(privateconfig.mongodb.url, {
    mongos: {
        ssl: true,
        sslValidate: true,
        sslCA: ca,
        poolSize: 1,
        reconnectTries: 1
    },
}, function (err, db) {
    if (err) {
        console.log(err);
    } else {
        console.log("connected");
        db.collection(privateconfig.mongodb.collection).insert(json, function (err, collection) {
            if (err) console.log((err));
            db.close();
            console.log('finished');
        }); 
    }  
});



Todo: Run the insert script.

node insert.js
If you create an SSL database but don't pass the certificate, you won't be able to connect to it. You will get a sockets closed error.

Once you run the script, make sure you can see the documents in the database's mockdata collection.
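If you would rather verify from the mongo shell than from the Compose data browser (connection options depend on your deployment), a quick spot check looks something like this:

db.mockdata.count()     // should roughly match the number of mock records you generated
db.mockdata.findOne()   // inspect one document; latitude and longitude are still strings at this point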


Step 5: Convert latitude and longitude from string to floats

The mock data's latitude and longitude are strings. Use the update.js file to convert the strings to floats and to create the GeoJSON values.

update.js
var MongoClient = require('mongodb').MongoClient,  
  fs = require('fs'),
  path = require('path');

var privateconfig = require(path.join(__dirname + '/config.json'));
var ca = [fs.readFileSync(path.join(__dirname + privateconfig.mongodb.certificatefile))];

MongoClient.connect(privateconfig.mongodb.url, {
    mongos: {
        ssl: true,
        sslValidate: true,
        sslCA: ca,
        poolSize: 1,
        reconnectTries: 1
    },
}, function (err, db) {
    if (err) console.log(err);
    if (db) console.log("connected");
       
    db.collection(privateconfig.mongodb.collection).find().each(function(err, doc) {
        if (doc){

            console.log(doc.latitude + "," + doc.longitude);

            var numericLat = parseFloat(doc.latitude);
            var numericLon = parseFloat(doc.longitude);

            doc.latitude = numericLat;
            doc.longitude = numericLon;
            doc.geojson = { location: { type: 'Point', coordinates: [numericLon, numericLat] } }; // GeoJSON points are [longitude, latitude]
            db.collection(privateconfig.mongodb.collection).save(doc);

        } else {
            console.log('finished');
            db.close();
        }
    });
});



Todo: Run the update script
node update.js
Once you run the script, make sure you can see the documents in the database's mockdata collection with the updated values.


Step 6: Verify world map displays points of latitude and longitude

Refresh the website several times. This should show different points each time. The variation of randomness should catch your eye. Is it widely random, or not as widely random as you would like?


Todo: Refresh several times
http://127.0.0.1:8080/map/?rows=5



The documentation for $sample warns that it may select the same document more than once within a single query. On this map, that would appear as fewer points than the number requested. Did you see that in your tests?

How $sample impacts the results

Now that the website works, let's play with it to see how $sample impacts the results.
  1. understand the $sample code in /server/query.js
  2. change the row count
  3. change the aggregation pipeline order
  4. prototype with $sample

Step 1: Understand the $sample operator in /server/query.js

The $sample operator controls random sampling of documents in the aggregation pipeline.
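The stage itself is tiny. With the sample size of 5 from the config.json shown earlier, the stage that gets added to the pipeline is just:

{ $sample: { size: 5 } }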
The pipeline used in this article is a series of array elements in the arrangeAggregationPipeline function in the /server/query.js file. The first array element is the $project section which controls what data to return.

arrangeAggregationPipeline()


  
  
var aggregationPipeItems = [
        { $project: 
            {
                last: "$last_name",
                first: "$first_name",
                lat: "$latitude",
                lon:  "$longitude",
                Location: ["$latitude", "$longitude"],
                _id: 0 
            }
        },
        { $sort: {'last': 1}} // sort by last name
    ];


The next step in the pipeline is the sorting of the data by last name. If the pipeline runs this way (without $sample), all documents are returned and sorted by last name.

The position of $sample in the pipeline is controlled by the pos value in the url. If pos isn't set, the position defaults to 1. Since the pipeline is a zero-based array, a value of 1 places $sample between $project and $sort, in the second position. If the code runs as supplied, the data set is randomized, documents are selected, and then the selected rows are sorted. This is meaningful in both senses: the data is random, and it is returned sorted.

Note: In order for random sampling to work, you must use it in connection with 'rows' in the query string.

We will play with the position in step 3.
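As a rough sketch of what that position logic looks like in code (this is a simplified stand-in for /server/query.js; the parameter handling is assumed from the description above, not copied from the repository):

// simplified sketch, not the exact /server/query.js
function arrangeAggregationPipeline(rows, pos) {
    var pipeline = [
        { $project: { last: "$last_name", first: "$first_name", lat: "$latitude", lon: "$longitude", Location: ["$latitude", "$longitude"], _id: 0 } },
        { $sort: { 'last': 1 } }
    ];
    if (rows) {
        var index = (pos === undefined) ? 1 : parseInt(pos, 10); // pos defaults to 1: between $project and $sort
        pipeline.splice(index, 0, { $sample: { size: parseInt(rows, 10) } });
    }
    return pipeline;
}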

Step 2: Change the row count

The count of rows is a parameter in the url to the server when the data is requested. Change the url to request 10 rows.

Todo: request 10 rows, with sorting applied after

http://127.0.0.1:8080/map/?rows=10



Step 3: Change the aggregation pipeline order

The aggregation pipeline order is a parameter in the url to the server. You can control it with the 'pos' name/value pair. The following url is the same as Step 2, but with the aggregation pipeline index explicitly set.

Todo: request 10 rows, with sorting applied after
http://127.0.0.1:8080/map/?rows=10&pos=1

Note: Only 0, 1, and 2 are valid values




The results below the map should be sorted.



If the $sample position is moved to the 0 position, still before the sort is applied, the browser shows the same result.

Todo: request 10 rows, with sorting applied after

http://127.0.0.1:8080/map/?rows=10&pos=0

However, if $sample is the last stage (pos=2), the entire set is sorted first and then the rows are selected. The results are no longer sorted.

Todo: request 10 rows, with sorting applied before

http://127.0.0.1:8080/map/?rows=10&pos=2



Note that while the documents are returned, they are not in sorted order.

If they happen to be in sorted order, it isn't because they were sorted; the random pick just came out that way by accident, not on purpose.

Step 4: Prototype with $sample

The MongoDB $sample operator is a great way to try out a visual design without needing all the data. At the early stage of the design, a quick visual can give you an idea of whether you are on the right path.

The map with data points works well for 5 or 10 points but what about 50 or 100?

Todo: request 500 rows

http://127.0.0.1:8080/map/?rows=500



The visual appeal, and much of the meaning of the data, is lost in the mess of the map. Change the size of the points on the map to compensate.

Todo: request 500 rows, with smaller points on the map using 'radius' name/value pair

http://127.0.0.1:8080/map/?rows=500&radius=2
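As a hedged sketch of the client side (the real /public/highmap/world.highmap.js may be organized differently, and mapChart here stands in for the Highmaps chart object), the radius value from the query string simply becomes the marker size of the point series:

// assumed sketch of how the client could forward the query string and apply the radius
var search = window.location.search;               // e.g. "?rows=500&radius=2"
var match = /[?&]radius=(\d+)/.exec(search);
var radius = match ? parseInt(match[1], 10) : 4;   // the default of 4 is an assumption

$.getJSON('/map/data/' + search, function (docs) {
    mapChart.addSeries({
        type: 'mappoint',                          // Highmaps point series plotted by lat/lon
        data: docs.map(function (d) { return { lat: d.lat, lon: d.lon }; }),
        marker: { radius: radius }
    });
});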


Summary

The $sample aggregation pipeline operator in MongoDB is a great way to build a prototype and test it with random data. Building the page so that the visual design is controlled by the query string works well for quick changes with immediate feedback.

Enjoy the new $sample operator. Leave comments about how you have used it or would use it.

Thursday, January 21, 2016

Frugal Cloud: The In-Memory versus SSD Paging File

Many people remember the days when you could use a USB memory stick to boost the performance of Windows. That memory caused me to ask the question: is there a potential cost saving, with little performance impact, in going sparse on physical memory and configuring a paging file?

For Windows folks:

For Linux, see the TechTalk Joe post. Note his comment that "I/O requests on instance storage does not incur a cost. Only EBS volumes have I/O request charges," so this is not recommended if you are running with EBS only.

This approach is particularly significant when you are "just over" one of the offering levels.


For some configurations, you will not get a CPU boost by using paging files. I know from recent experience that a commercial SaaS actually had high memory usage but very low CPU (3-5%, even during peak times!). Having 1/2 or even 1/4 of the CPUs would not peg the CPU. The question then becomes whether a paging file on an SSD drive would significantly drop performance (whether you can stripe across multiple SSDs on cloud instances for extra performance is an interesting question). This is a question that can only be determined experimentally.

  • How the paging file is configured and how the application actually uses memory is key. Often 80-90% of the usage hits only 10% of the memory (the Pareto rule). The result could be that the median (50th percentile) time is unchanged -- and time may increase only along the long tail of the response distribution (say, the top 3% may be longer).
These factors cannot be determined academically. They need to be determined experimentally.

If performance is acceptable, there is an immediate cost saving because when new instances are created due to load, they are cheaper instances.

The bottom line is always: experiment, stress, time, and compare cost. Between the pricing models, OS behaviors, and application behaviors, there is no safe rule of thumb!

Second rule: always define the SLA as the median (50th percentile) and never as an average. Web response times are long-tailed, which makes the average (mean) very volatile; the median is usually very stable. For example, with response times of 90, 95, 100, 105, and 110 ms, both the mean and the median are 100 ms; if that last response instead takes 4000 ms, the mean jumps to 878 ms while the median stays at 100 ms.

Sunday, January 17, 2016

Sharding Cloud Instances

Database sharding has been with us for many years. The concept of sharding cloud instances has not been discussed much, yet there is a significant financial incentive to do so.

Consider a component that provides address and/or postal code validation around the world. For the sake of illustration, let us consider 10 regions that each have the same volume of data.

Initial tests found that it took 25 GB to load all of them in memory. Working off the AWS EC2 price list, we find that an m4.2xlarge is needed to run it, at $0.479/hr. That gives us 8 CPUs.

If we run with 10 instances of 2.5 GB each instead, we end up with 10 t2.medium instances, each with 2 CPUs at $0.052/hr, or $0.52/hr total -- which on first impression is more expensive, except we have 20 CPUs instead of 8. We may get better performance. If one region is a hot spot (like US addresses), the sharded model might end up with 9 instances supporting one region each and perhaps 5 instances supporting the US. In the single-instance model, we might need 5 of the large instances.

In this case, we could end up with

  • Single Instance Model: 5 * $0.479 = $2.395/hr with 40 CPUs
  • Sharded Instances Model: (9 + 5) * $0.052 = $0.728/hr with 28 CPUs
We have moved from the sharded model being roughly 10% more expensive to the single-instance model costing more than three times as much.

Take Away

Flexibility in initial design to support independent cloud instances with low resource requirements as well as sharding may be a key cost control mechanism for cloud applications.

It is impossible to determine the optimal design and deployment strategy a priori. It needs to be determined by experiment. To do experiments cheaply, the components and architecture must be designed to support experimentation.

In some ways, cloud computing is like many phone plans -- you are forced to pay for a resource level and may not use all of the resources that you pay for. Yes, the plans have steps, but if you need 18 GB of memory you may also have to pay for 8 CPUs that will never run at more than 5% usage (i.e. a single CPU is sufficient). Designing for flexibility of cloud instances is essential for cost savings.

Saturday, January 16, 2016

A Financially Frugal Architectural Pattern for the Cloud

I have heard many companies complain about how expensive the cloud is becoming as they move from development to production systems. In theory, the savings from greatly reduced staffing of Site Reliability Engineers and reduced hardware costs should compensate -- the key word is should. In reality, this reduction never happens, because those people are needed to support other systems that will not be migrated for years.

There is actually another problem: the architecture is not designed for the pricing model.

In the last few years there have been many changes in the application environment, and I suspect many current architectures are locked into past system design patterns. To understand my proposal better, we need to look at the patterns thru the decades.

The starting point is the classic client server: many clients - one server - one database server (possibly many databases).
As application volume grew, we ended up with multiple servers to handle multiple clients, but retaining a single database.
Many variations arose, especially with databases - federated, sharded, etc. The next innovation was remote procedure calls, with many dialects such as SOAP, REST, AJAX, etc. The typical manifestation is shown below.
When the cloud came along, the above architecture was too often just moved off physical machines onto cloud machines without any further examination.

Often there will be minor changes: if a queue service was running on-site alongside the application server, it may be spun off to a separate cloud instance. Applications are often designed for the past model of everything on one machine. It is rare for an existing application moved to the cloud to be design-refactored significantly. I have also seen new cloud-based applications implemented in the classic single-machine pattern.

The Design Problem

The artifact architecture is an application consisting of dozens, often over 100, libraries (for example, C++ DLLs). It's a megalith rooted in the original design being for one PC.

Consider the following case: suppose that instead of running these 100 libraries on high-end cloud machines with, say, 20 instances, you run each library on its own lightweight machine. Some libraries may only need two or three lightweight machines to handle the load. Others may need 20 instances because they are computationally intense hot spots. If you are doing auto-scaling, the time to spin up a new instance is much less when instances are library-based -- because it is only one library.

For the sake of argument, suppose that each of the 100 libraries requires 0.4 GB to run. So to load all of them in one instance we are talking 40 GB (100 x 0.4).

Looking at the current AWS EC2 pricing, we could use 100 instances of the t2.nano and pay $0.0065 x 100 = $0.65/hour for all 100 instances, with 1 CPU each (100 CPUs total). Loading the 40 GB into one instance would require a c3.8xlarge at $1.68/hour: roughly three times the cost and only 32 cores instead of 100. Three times the cost for a third of the cores -- per core, the bill is roughly eight times what is needed.


What about scaling? With the megalith, you have to spin up a complete new instance. With the decomposition into library components, you only need to spin up new instances of the library that needs them. In other words, scaling up becomes significantly more expensive with the megalith model.

What is another way to describe this? Microservices

This is a constructed example, but it does illustrate that moving an application to the cloud may require appropriate redesign, with a heavy focus on building components that run independently on the cheapest instances. Each swarm of these component instances is load balanced, with very fast creation of new instances.

Having faster creation of instances actually saves more money, because the triggering condition can be set higher (and thus triggered less often, with fewer false positives). You want to create instances so they are there when the load builds to require them. The longer it takes to spin up an instance, the longer the lead time you need, which means the lower on the load-build curve you must set the trigger point.

There are additional savings for deployments, because you can deploy at the library level to specific machines instead of having to deploy a big image. Deploys are faster, and rollbacks are faster.

Amazon actually does this approach internally, with hundreds of services (each on their own physical or virtual machine) backing their web site. A new feature is rarely integrated into the "stack"; instead it is added as a service that can be turned off or on in production by setting appropriate cookies in the production environment. There is limited need for a sandbox environment because the new feature is not there for the public -- only for internal people who know how to turn it on.

What is the key rhetorical question to keep asking?

Why are we having most of the application on one instance instead of "divide and save money"?  This question should be constantly asked during design reviews.

In some ways, a design goal would be to design the application so it could run on a room full of Raspberry Pis.

This design approach does increase complexity -- just as multi-threading and/or async operations add complexity, but with significant payback. Designing libraries that minimize the number of inter-instance calls while also minimizing resource requirements is a design challenge that will likely require mathematical / operations research skills.

How to convert an existing application?

A few simple rules to get the little gray cells firing:
  • Identify methods that are static - those are ideal for mini-instances
  • Backtrack from these methods into the callers and build up clusters of objects that can function independently.
    • There may be refactoring because often designs go bad under pressure to deliver functionality
    • You want to minimize external (inter-component-instances) calls from each of these clusters
  • If the system is not dependent on dozens of component-instance deployments there may be a problem.
    • If changing the internal code of a method requires a full deployment, there is a problem
One of the anti-patterns for effective, frugal cloud design is actually object-orientated (as compared to cost-orientated) design. I programmed in Simula and worked in GPSS -- the "Adam and Eve" of object programming. All of the early literature was based on the single-CPU reality of computing at the time. I have often had to go in and totally refactor an academically correct object-orientated system design in order to get performance. Today, a refactor would also need to deliver lower costs.

The worst case of system code that I refactored for performance was implemented as an Entity Model in C++: a single call from a web front end went thru some 20 classes/instances in a beautiful conceptual model, with something like 45 separate calls to the database. My refactoring resulted in one class and a single stored procedure (whose result was cached for 5 minutes before rolling off or being marked stale).

I believe that similar design inefficiencies are common in cloud architecture.

When you owned the hardware, each machine increased the labor cost to create, license, update, and support it. You had considerable financial and human pressure to minimize the number of machines. When you move to the cloud with good script automation, having 3 instances or 3000 instances should be approximately the same amount of work. You now have financial pressure to shift to the model that minimizes cost -- and this will often be the model with many, many more machines.






Monday, January 4, 2016

A simple approach to getting all of the data out of Atlassian Jira

One of my current projects is getting data out of Jira into a DataMart to allow fast (and easy) analysis. A library such as TechTalk.JiraRestClient provides a basic foundation, but there is a nasty gotcha: Jira can be heavily customized, often with different projects having dozens of different and unique custom fields. So how can you do one size fits all?

You could go down the path of modifying the above code to enumerate all of the custom fields (and then have continuous work keeping them in sync), or try something like what I do below: exploit the fact that JSON and XML are interchangeable, and that XML in a SQL Server database can actually be really sweet to use.

Modifying JiraRestClient

The first step requires downloading the code from GitHub and modifying it.
In the JiraClient.cs method EnumerateIssuesByQueryInternal, add the following code.
               var issues = data.issues ?? Enumerable.Empty<Issue>();
               var xml = JsonConvert.DeserializeXmlNode(response.Content, "json");
               // Attach the XML form of each issue's JSON to the matching issue
               foreach (var issue in issues)
               {
                   var testNode = xml.SelectSingleNode(string.Format("//issues/key[text()='{0}']/..", issue.key));
                   if (testNode != null)
                   {
                       issue.xml = testNode.OuterXml;
                   }
               }

You will also need to modify the Issue class to include a string property, "xml". The result is an issue class containing all of the information from the REST response.

Moving Issues into a Data Table

Once you have the issue-by-issue REST JSON response converted to XML, we need to move it to our storage. My destination is SQL Server, and I will exploit a SQL table-valued parameter to keep the process simple and use set operations. In short, I move the enumeration of issues into a C# DataTable so I can pass the data to SQL Server.

                    var upload = new DataTable();
                    // defining columns omitted
                    var data = client.GetIssues(branch);
                    foreach (var issue in data)
                        try
                        {
                            var newRow = upload.NewRow();
                            newRow[Key] = issue.key;
                            newRow[Self] = issue.self;
                            // Other columns extracted
                            newRow[Xml] = issue.xml;
                            upload.Rows.Add(newRow);
                        }
                        catch (Exception exc)
                        {
                            Console.WriteLine(exc);
                        }

The upload code is also clean and simple:

          using (var cmd = new SqlCommand { CommandType = CommandType.StoredProcedure, CommandText = "Jira.Upload1" })
          using (cmd.Connection = MyDataConnection)
          {
              cmd.Parameters.AddWithValue("@Data", upload);
              cmd.ExecuteNonQuery();
          }

SQL Code

For many C# developers, SQL is an unknown country, so I will go into some detail. First, we need to define the table type in SQL; just match the DataTable in C# above (same column names in the same sequence is best).

    CREATE TYPE [Jira].[JiraUpload1Type] AS TABLE(
        [Key] [varchar](max) NULL,
        [Assignee] [varchar](max) NULL,
        [Description] [varchar](max) NULL,
        [Reporter] [varchar](max) NULL,
        [Status] [varchar](max) NULL,
        [Summary] [varchar](max) NULL,
        [OriginalEstimate] [varchar](max) NULL,
        [Labels] [varchar](max) NULL,
        [Self] [varchar](max) NULL,
        [XmlData] [Xml] NULL
    )

Note that I always use (max), which is pretty much how the C# DataTable sees each column. Any data conversion to decimals will be done by SQL itself.

Second, we create the stored procedure. We want to update existing records and insert missing records. The code is simple and clean

    CREATE PROC  [Jira].[Upload1] @Data [Jira].[JiraUpload1Type] READONLY
    AS 
    Update Jira.Issue SET
      [Assignee] = D.Assignee
     ,[Description] = D.Description
     ,[Reporter] = D.Reporter
     ,[Status] = D.Status
     ,[Summary] = D.Summary
     ,[OriginalEstimate] = D.OriginalEstimate
     ,[Labels] = D.Labels
     ,[XmlData] = D.XmlData
    From @Data D
    JOIN Jira.Issue S ON D.[Key]=S.[Key]

    INSERT INTO [Jira].[Issue]
           ([Key]
           ,[Assignee]
           ,[Description]
           ,[Reporter]
           ,[Status]
           ,[Summary]
           ,[OriginalEstimate]
           ,[Labels]
           ,[XmlData])
    SELECT D.[Key]
           ,D.[Assignee]
           ,D.[Description]
           ,D.[Reporter]
           ,D.[Status]
           ,D.[Summary]
           ,D.[OriginalEstimate]
           ,D.[Labels]
           ,D.[XmlData]
    From @Data D
    LEFT JOIN Jira.Issue S ON D.[Key]=S.[Key]
    WHERE S.[Key] Is Null

All of the JSON is now in XML and can be searched by XPath.

Upon executing the above, we see our table is populated as shown below. The far right column is XML. This is the SQL xml data type and contains the REST JSON converted to XML for each issue.

The next step is often to add computed columns using the SQL XML and an XPath. An example of a generic solution is below.

So what is the advantage?

No matter how many additional fields are added to Jira, you have 100% data capture here. There is no need to touch the Extract Transform Load (ETL) job. You can query (and index) the XML data in SQL Server, or just hand the XML back to whatever is calling it. While SQL Server 2016 supports JSON, XML is superior here because of the ability to run XPath queries against it as well as to index it.

In many implementations of JIRA, the number of fields can get unreal, as shown below.

With the same data table, you could create multiple views that contain computed columns showing precisely the data that you are interested in.

Example of Computed column definitions

[ProductionReleaseDate]  AS ([dbo].[GetCustomField]('customfield_10705',[XmlData])),
[EpicName]  AS ([dbo].[GetCustomField]('customfield_10009',[XmlData])),
[Sprint]  AS ([dbo].[GetCustomField]('customfield_10007',[XmlData])),

With this SQL function doing all of the work:

    CREATE FUNCTION [dbo].[GetCustomField]
    (
        @Name varchar(32),
        @Data Xml
    )
    RETURNS varchar(max)
    AS
    BEGIN
        DECLARE @ResultVar varchar(max)
        SELECT @ResultVar = c.value('customfieldvalues[1]','varchar(max)')
        FROM @Data.nodes('//customfield[@id]') AS t(c)
        WHERE c.value('@id','varchar(50)') = @Name
        RETURN @ResultVar
    END


The net result is clean, flexible code feeding into a database, with a very quick ability to extend.

You want to expose a new field? It's literally a one-liner to add it as a column to the SQL Server table or view. Consider creating custom views on top of the table as a clean, organized solution.

Monday, November 16, 2015

Effortlessly getting data out of a JSON REST response

One of the pains of using REST is getting the data from JSON into something usable. There is a very simple solution: take the JSON, pass it to a magic black box, and get a DataSet back that has foreign keys and other joys. That sounds very nice --

  • Make the JSON REST call and then
  • Query or filter data tables to do future processing. No need to define classes to deserialize the JSON into.....
The code is horrible and shown below...
using System.Data;
using System.Xml;
using Newtonsoft.Json;
namespace Avalara.AvaTax.JsonUtilities
{
    public static class Utility
    {
        public static DataSet ConvertJsonToDataSet(string jsonText, string rootElementName)
        {
            var xd1 = JsonConvert.DeserializeXmlNode(jsonText, rootElementName);
            var result = new DataSet();
            result.ReadXml(new XmlNodeReader(xd1));
            return result;
        }
    }
}

To put this into a fuller context, consider the code below, which does a GET to any REST JSON url and returns a DataSet.

public static DataSet GetDataSet(string url, string rootName = "myroot")
{
    var webRequest = (HttpWebRequest)WebRequest.Create(url);
    webRequest.Method = "GET";
    webRequest.ContentType = "application/json, *.*";
    webRequest.UserAgent = "Mozilla/5.0 (Windows NT 5.1; rv:28.0) Gecko/20100101 Firefox/28.0";
    webRequest.Headers.Add("AUTHORIZATION", Authorization);
    var webResponse = (HttpWebResponse)webRequest.GetResponse();
    if (webResponse.StatusCode != HttpStatusCode.OK) Console.WriteLine("{0}", webResponse.Headers);
    var json = String.Empty;
    using (StreamReader reader = new StreamReader(webResponse.GetResponseStream()))
    {
        json = reader.ReadToEnd();
        reader.Close();
    }
    // We must name the root element
    return DropUnusedTables(Utility.ConvertJsonToDataSet(json, rootName));
}

No longer do you need to deserialize to hand constructed classes to consume the data.

An example of the risk of not versioning REST

A while back I was contacted to solve a quasi-nasty issue. An existing REST implementation had been updated with an extra field added to the response. This worked fine for 99% of the consumers, but for one consumer it broke. This consumer wrote a package that they sell on to others, and their customers were screaming.

The reason it broke was that it was not coded to handle additional fields. Technically, REST is an architectural pattern without standards. Robustness in handling extra fields and data is what most developers would expect -- but that is not a requirement of REST. It is hopeful thinking.

If the REST API had been properly versioned, this issue would not have arisen. It did arise.

While this consumer could patch their code, getting the patch to all of their customers was problematic, hence there was a need for an immediate fix, somehow. Fortunately, their package allows the REST url to be specified, and that allowed a simple, quick solution: create a "relay website" on Azure that passes the data through from the customers and removes the extra field from the response. All of the data was in JSON, which reduced the scope of the issue.

The code was actually trivial (using Newtonsoft.Json.Linq). As you can see, it is easy to eliminate as many fields as desired by just adding case statements:


   public class CorgiController : ApiController
    {
        [HttpPost]
        public JObject Get()
        {
            var jresponse = RestEcho.EchoPost();
            foreach (JToken item in jresponse.Children())
                WalkChildrenAndRemove(item);
            return jresponse;
        }


        [HttpPost]
        public JObject Cancel()
        {
            return RestEcho.EchoPost();
        }

        private void WalkChildrenAndRemove(JToken jitem)
        {
            if (jitem is JProperty)
            {
                var prop = jitem as JProperty;
                switch (prop.Name)
                {
                    case "Foobar": jitem.Remove(); break;
                    default:
                        foreach (JToken item in jitem.Children())
                            WalkChildrenAndRemove(item);
                        break;
                }
            }
            else if (jitem is JArray)
            {
                var arr = (JArray)jitem;
                foreach (JToken item in arr)
                    WalkChildrenAndRemove(item);
            }
            else
            {
                foreach (JToken item in jitem.Children().ToArray())
                    WalkChildrenAndRemove(item);
            }
        }
    }
With the RestEcho class also being trivial:

  public static class RestEcho
    {
        public static JObject EchoPost()
        {
            var url = GetServer() + HttpContext.Current.Request.Path;
            var stream = new StreamReader(HttpContext.Current.Request.InputStream);
            var body = stream.ReadToEnd();
            var value = JObject.Parse(body);
            // Get the login and password sent
            HttpContext httpContext = HttpContext.Current;
            NameValueCollection headerList = httpContext.Request.Headers;
            var testHeader = headerList.Get("Authorization");
            if (testHeader == null || testHeader.Length < 7)
            {
                HttpContext.Current.Response.StatusCode = 401;
                HttpContext.Current.Response.StatusDescription = "Basic Authentication Is Required";
                HttpContext.Current.Response.Write("Failed to Authenticate");
                HttpContext.Current.Response.End();
            }
            // remove "BASIC " from field
            var authorizationField = headerList.Get("Authorization").Substring(6);
            HttpClient client = new HttpClient();
            client.DefaultRequestHeaders.Authorization = new AuthenticationHeaderValue("Basic", authorizationField);
            var response = client.PostAsJsonAsync(url, value).Result;
            var jresponse = new JObject();
            try
            {
                jresponse = response.Content.ReadAsAsync<JObject>().Result;
            }
            catch
            {

            }
            return jresponse;
        }
    }
This pattern can also be used to reduce a version X response down to a version 0 response, with likely less coding than alternative approaches -- after all, you just have to add case statements if the structure is the same.

This was tossed up onto Azure, with running costs well under $10/month.

Happy customer. Happy customer's customers.