Thursday, January 21, 2016

Frugal Cloud: The In-Memory versus SSD Paging File

Many people remember the days when you could use a USB memory stick (ReadyBoost) to boost the performance of Windows. That memory prompted a question: is there a potential cost saving, with little performance impact, in going sparse on physical memory and configuring a paging file?

For Windows folks:
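The same change can be scripted rather than clicked through the System dialog. Below is a rough C# sketch (not from this post) that uses WMI to turn off automatic page-file management and place a fixed-size paging file on the instance's local SSD; the D: drive letter and 16 GB size are placeholder assumptions, and it needs a reference to System.Management, administrator rights, and a reboot to take effect.

    using System;
    using System.Management;

    class PageFileOnSsd
    {
        static void Main()
        {
            // 1. Turn off "Automatically manage paging file size for all drives".
            using (var searcher = new ManagementObjectSearcher("SELECT * FROM Win32_ComputerSystem"))
            {
                foreach (ManagementObject cs in searcher.Get())
                {
                    cs["AutomaticManagedPagefile"] = false;
                    cs.Put();
                }
            }

            // 2. Point a fixed-size paging file at the SSD volume (drive letter assumed).
            using (var settingClass = new ManagementClass("Win32_PageFileSetting"))
            using (var pageFile = settingClass.CreateInstance())
            {
                pageFile["Name"] = @"D:\pagefile.sys";
                pageFile["InitialSize"] = (uint)16384;  // MB
                pageFile["MaximumSize"] = (uint)16384;  // MB
                pageFile.Put();
            }

            Console.WriteLine("Paging file configured; reboot for the change to take effect.");
        }
    }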

For Linux, see TechTalk Joe's post. Note his caveat: "I/O requests on instance storage does not incur a cost. Only EBS volumes have I/O request charges." So this approach is not recommended if you are running with EBS volumes only.

This approach is particularly significant when you are "just over" one of the offering levels and would otherwise be forced up to the next instance size.


For some configurations, you will not lose anything on the CPU side by dropping to a smaller instance and relying on the paging file. Recent experience with a commercial SaaS showed high memory usage but very low CPU (3-5%, even during peak times!). Having 1/2 or even 1/4 of the CPUs would still not peg the CPU. The question then becomes whether a paging file on an SSD drive would significantly drop performance (whether you can stripe across multiple SSDs on cloud instances for extra performance is an interesting question). This can only be determined experimentally.

  • How the paging file is configured and the actual usage of memory by the application are key. Often 80-90% of the usage hits only 10% of the memory (the Pareto rule). The result could be that the median (50th percentile) time is unchanged, and times increase only along the long tail of the response distribution (say, the top 3% may be slower).
These factors cannot be academically determined. They need to be determined experimentally.

If performance is acceptable, there is an immediate cost saving, and it compounds: when new instances are created due to load, they are also the cheaper instances.

Bottom line is always: experiment, stress, time, and compare costs. Between the pricing models, OS behaviors, and application behaviors, there is no safe rule of thumb!

Second rule: always define the SLA as the median (50th percentile) and never as an average. Web response times are long-tailed, which makes the average (mean) very volatile. The median is usually very, very stable.
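A toy illustration of that rule, with synthetic numbers rather than data from any real service: a handful of slow requests drags the mean far more than the median.

    using System;
    using System.Linq;

    class MedianVsMean
    {
        static double Median(double[] sorted) =>
            sorted.Length % 2 == 1
                ? sorted[sorted.Length / 2]
                : (sorted[sorted.Length / 2 - 1] + sorted[sorted.Length / 2]) / 2.0;

        static void Main()
        {
            // 1,000 "typical" responses around 100 ms.
            double[] typical = Enumerable.Range(0, 1000).Select(i => 90.0 + (i % 21)).ToArray();

            // The same traffic plus ten 30-second timeouts (the long tail).
            double[] withTail = typical.Concat(Enumerable.Repeat(30000.0, 10)).ToArray();

            Array.Sort(typical);
            Array.Sort(withTail);

            Console.WriteLine($"mean   : {typical.Average():F1} ms -> {withTail.Average():F1} ms");
            Console.WriteLine($"median : {Median(typical):F1} ms -> {Median(withTail):F1} ms");
            // Ten slow requests move the mean by roughly 300 ms; the median barely moves.
        }
    }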

Sunday, January 17, 2016

Sharding Cloud Instances

Database sharding has been with us for many years. The concept of cloud instance sharding has not been discussed much. There is a significant financial incentive to do so.

Consider a component that provides address and/or postal code validation around the world. For the sake of illustration, let us consider 10 regions that each have the same volume of data.

Initial tests found that it took 25 GB of data to load all of them into memory. Working off the AWS EC2 price list, we find that an m4.2xlarge is needed to run it, at $0.479/hr. This gives us 8 CPUs.

If we run with 10 shards of 2.5 GB each instead, we end up with 10 t2.medium instances, each with 2 CPUs, at $0.052/hr each, or $0.52/hr total -- which on first impression is more expensive, except we now have 20 CPUs instead of 8, so we may get better performance. If one of these shards is a hot spot (like US addresses), we may end up with 9 instances supporting one region each and perhaps 5 instances supporting the US. Under the single-instance model, the same load might require 5 of the large instances.

In this case, we could end up with

  • Single Instance Model: 5 * $0.479 = $2.395/hr with 40 CPU
  • Sharded Instances Model: (9 + 5) * $0.052 = $0.728/hr with 28 CPU
We have moved from the sharded model being roughly 9% more expensive to it costing less than a third of the single-instance model.
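Concretely, "sharding cloud instances" here just means routing each request to the small instance pool that owns its region. A hypothetical sketch of that routing layer; the region keys and endpoint URLs are made up for illustration and are not actual AWS resources.

    using System;
    using System.Collections.Generic;

    class ShardRouter
    {
        // Each region's data lives on its own small instance (or small pool), so the hot
        // US shard can scale out without touching the other nine regions.
        static readonly Dictionary<string, string> ShardBaseUrls = new Dictionary<string, string>
        {
            ["US"] = "http://validate-us.internal.example.com",   // e.g. a 5-instance pool
            ["CA"] = "http://validate-ca.internal.example.com",
            ["EU"] = "http://validate-eu.internal.example.com",
            // ... remaining regions ...
        };

        static string RouteValidationRequest(string region, string postalCode)
        {
            string baseUrl;
            if (!ShardBaseUrls.TryGetValue(region, out baseUrl))
                throw new ArgumentException("Unknown region: " + region);
            return baseUrl + "/validate?postalCode=" + Uri.EscapeDataString(postalCode);
        }

        static void Main()
        {
            Console.WriteLine(RouteValidationRequest("US", "98052"));
            Console.WriteLine(RouteValidationRequest("EU", "SW1A 1AA"));
        }
    }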

Take Away

Flexibility in the initial design to support independent cloud instances with low resource requirements, as well as sharding, may be a key cost-control mechanism for cloud applications.

It is impossible to determine the optimal design and deployment strategy a priori. It needs to be determined by experiment. Doing experiments cheaply means that the components and architecture must be designed to support experimentation.

In some ways, cloud computing is like many phone plans -- you are forced to pay for a resource level and may not use all of the resources that you pay for. Yes, the plans have steps, but if you need 18 GB of memory you may also have to pay for 8 CPUs that will never run above 5% CPU usage (i.e. a single CPU is sufficient). Designing to support flexibility of cloud instances is essential for cost savings.

Saturday, January 16, 2016

A Financially Frugal Architectural Pattern for the Cloud

I have heard many companies complain about how expensive the cloud is becoming as they move from development to production systems. In theory, the savings from greatly reduced staffing of Site Reliability Engineers and reduced hardware costs should compensate -- the key word is should. In reality, this reduction never happens, because those engineers are still needed to support other systems that will not be migrated for years.

There is actually another problem: the architecture is not designed for the pricing model.

In the last few years there have been many changes in the application environment, and I suspect many current architectures are locked into past system design patterns. To understand my proposal better, we need to look at the patterns through the decades.

The starting point is the classic client-server pattern: many clients, one server, one database server (possibly many databases).
As application volume grew, we ended up with multiple servers handling many clients but retaining a single database.
Many variations arose, especially with databases (federated, sharded, etc.). The next innovation was remote procedure calls, with many dialects such as SOAP, REST, and AJAX. The typical manifestation is shown below.
When the cloud came along, the above architecture was too often just moved off physical machines onto cloud machines without any further examination.

Often there will be minor changes: if a queue service was running on-site alongside the application server, it may be spun off to a separate cloud instance. Applications are often designed for the past model of everything on one machine. It is rare, when an existing application is moved to the cloud, for it to be design-refactored significantly. I have also seen new cloud-based applications implemented in the classic single-machine pattern.

The Design Problem

The artifact architecture of such an application consists of dozens, often over 100, libraries (for example, C++ DLLs). It is a megalith rooted in the original design being for one PC.

Consider the following case: suppose that instead of running these 100 libraries on high-end cloud machines (say 20 instances), you run each library on its own lightweight machine. Some libraries may only need two or three lightweight machines to handle the load. Others may need 20 instances because they are computationally intense hot spots. If you are doing auto-scaling, the time to spin up a new instance is much shorter when instances are library-based -- because only one library has to load.

For the sake of argument, suppose that each of the 100 libraries requires 0.4 GB to run. So to load all of them in one instance we are talking 40 GB (100 x 0.4).

Looking at the current AWS EC2 pricing, we could use 100 instances of the t2.nano at $0.0065 x 100 = $0.65/hour for all 100 instances, with 1 CPU each (100 CPUs total). Keeping the 40 GB on a single instance would require a c3.8xlarge at $1.68/hour: roughly 2.6 times the cost, and only 32 cores instead of 100. About 2.6 times the cost for a third of the cores works out to roughly 8 times the price per core -- the megalith's bill could be something like 8 times what is needed.


What about scaling? With the megalith, you have to spin up a complete new instance. With the decomposition into library components, you only need to spin up new instances of the library that needs them. In other words, scaling up becomes significantly more expensive with the megalith model.

What is another way to describe this? Microservices

This is a constructed example, but it does illustrate that moving an application to the cloud may require appropriate redesign, with a heavy focus on building components to run independently on the cheapest instances. Each swarm of these component-instances is load balanced, with very fast creation of new instances.

Having faster creation of instances actually saves more money, because the triggering condition can be set higher (and thus triggered less often, with fewer false positives). You want to create instances so they are ready when the load builds to require them. The longer it takes to load an instance, the longer the lead time you need, which means the lower on the load-growth curve you must set the trigger point.
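A back-of-the-envelope sketch of that relationship; the per-instance capacity, load-growth rate, and spin-up times below are invented purely to show how a slow spin-up pushes the trigger point down the load curve.

    using System;

    class ScaleTrigger
    {
        static void Main()
        {
            double capacityPerInstance = 500.0;  // requests/sec one instance can absorb (assumed)
            double loadGrowthRate      = 0.3;    // extra requests/sec arriving per second (assumed)

            // Trigger when the load projected at "now + spin-up time" would exceed capacity.
            foreach (double spinUpSeconds in new[] { 30.0, 120.0, 600.0 })
            {
                double triggerPoint = capacityPerInstance - loadGrowthRate * spinUpSeconds;
                Console.WriteLine($"Spin-up {spinUpSeconds,4:F0}s -> trigger at {triggerPoint:F0} req/s " +
                                  $"({triggerPoint / capacityPerInstance:P0} of capacity)");
            }
            // A 10-minute megalith spin-up forces the trigger down to 64% of capacity;
            // a 30-second single-library spin-up can wait until roughly 98%.
        }
    }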

There are additional savings on deployments, because you can deploy at the library level to specific machines instead of having to deploy a big image. Deploys are faster; rollbacks are faster.

Amazon actually does this internally, with hundreds of services (each on its own physical or virtual machine) backing the web site. A new feature is rarely integrated into the "stack"; instead it is added as a service that can be turned on or off in production by setting appropriate cookies. There is limited need for a sandbox environment because the new feature is not visible to the public -- only to internal people who know how to turn it on.
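I do not have Amazon's actual mechanism to show, but the cookie-gated pattern itself is easy to sketch; the cookie name "feature-new-checkout" below is hypothetical.

    using System;
    using System.Net;

    class FeatureFlag
    {
        // The new code path ships to production but only runs for requests carrying the opt-in cookie.
        static bool NewCheckoutEnabled(CookieCollection cookies)
        {
            Cookie flag = cookies["feature-new-checkout"];
            return flag != null && flag.Value == "on";
        }

        static void Main()
        {
            var publicUser     = new CookieCollection();
            var internalTester = new CookieCollection { new Cookie("feature-new-checkout", "on") };

            Console.WriteLine(NewCheckoutEnabled(publicUser));     // False -> existing behavior
            Console.WriteLine(NewCheckoutEnabled(internalTester)); // True  -> new service path
        }
    }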

What is the key rhetorical question to keep asking?

Why are we putting most of the application on one instance instead of "dividing and saving money"? This question should be asked constantly during design reviews.

In some ways, a design goal would be to design the application so it could run on a room full of Raspberry Pis.

This design approach does increase complexity -- just as multi-threading and/or async operations add complexity, but with significant payback. Designing libraries to minimize the number of inter-instance calls while also minimizing resource requirements is a design challenge that will likely require mathematical / operations research skills.

How to convert an existing application?

A few simple rules to get the little gray cells firing:
  • Identify methods that are static -- those are ideal for mini-instances (see the sketch after this list).
  • Backtrack from these methods into their callers and build up clusters of objects that can function independently.
    • There may be refactoring involved, because designs often go bad under pressure to deliver functionality.
    • You want to minimize external (inter-component-instance) calls from each of these clusters.
  • If the system does not end up depending on dozens of component-instance deployments, there may be a problem.
    • If changing the internal code of a method requires a full deployment, there is a problem.
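Here is the sketch promised above: a static validation method pulled out of the megalith and self-hosted as its own tiny service, small enough for the cheapest instance type. The port, route, and validation rule are illustrative assumptions, not code from any real system.

    using System;
    using System.Net;
    using System.Text;
    using System.Text.RegularExpressions;

    static class PostalCodeService
    {
        // The static method identified in step 1; it has no dependencies on the rest of the megalith.
        static bool IsValidUsZip(string postalCode) =>
            Regex.IsMatch(postalCode ?? "", @"^\d{5}(-\d{4})?$");

        static void Main()
        {
            var listener = new HttpListener();
            listener.Prefixes.Add("http://localhost:8080/validate/");
            listener.Start();
            Console.WriteLine("Postal-code mini-service listening on :8080");

            while (true)
            {
                HttpListenerContext ctx = listener.GetContext();
                string code = ctx.Request.QueryString["postalCode"];
                byte[] body = Encoding.UTF8.GetBytes(IsValidUsZip(code) ? "valid" : "invalid");
                ctx.Response.ContentType = "text/plain";
                ctx.Response.OutputStream.Write(body, 0, body.Length);
                ctx.Response.Close();
            }
        }
    }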
One of the anti-patterns for frugal cloud-based design is actually object-oriented (as compared to cost-oriented) design. I programmed in Simula and worked in GPSS -- the "Adam and Eve" of object programming. All of the early literature was based on the single-CPU reality of computing at the time. I have often had to go in and totally refactor an academically correct object-oriented system design in order to get performance. Today, a refactor would also need to target lower costs.

The worst case of system code that I refactored for performance was implemented as an entity model in C++: a single call from the web front end went through some 20 classes/instances in a beautiful conceptual model, with something like 45 separate calls to the database. My refactoring resulted in one class and a single stored procedure (whose result was cached for 5 minutes before rolling off or being marked stale).

I believe that similar design inefficiencies are common in cloud architecture.

When you owned the hardware, each machine increased the labor cost to create, license, update, and support it. You had considerable financial and human pressure to minimize machines. When you move to the cloud with good script automation, having 3 instances or 3,000 instances should be approximately the same work. You now have financial pressure to shift to whichever model minimizes costs -- and this will often be the one with many, many more machines.






Monday, January 4, 2016

A simple approach to getting all of the data out of Atlassian Jira

One of my current projects is getting data out of Jira into a DataMart to allow fast (and easy) analysis. A library such as TechTalk.JiraRestClient provides a basic foundation, but there is a nasty gotcha: Jira can be heavily customized, often with different projects having dozens of different and unique custom fields. So how can you do one-size-fits-all?

You could go down the path of modifying the above code to enumerate all of the custom fields (and then have continuous work keeping them in sync), or try something like what I do below: exploiting the fact that JSON and XML are interchangeable, and that XML in a SQL Server database can actually be really sweet to use.

Modifying JiraRestClient

The first step requires downloading the code from GitHub and modifying it.
In JiraClient.cs, in the method EnumerateIssuesByQueryInternal, add the following code:
    var issues = data.issues ?? Enumerable.Empty<Issue>();
    // Convert the raw JSON response to XML, then attach each issue's
    // XML fragment to the issue object so nothing in the response is lost.
    var xml = JsonConvert.DeserializeXmlNode(response.Content, "json");
    foreach (var issue in issues)
    {
        var testNode = xml.SelectSingleNode(string.Format("//issues/key[text()='{0}']/..", issue.key));
        if (testNode != null)
        {
            issue.xml = testNode.OuterXml;
        }
    }

You will also need to modify the Issue class to include a string property, "xml". The result is an Issue class containing all of the information from the REST response.
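The change is roughly the following (the existing members of the class are elided; only the added property matters, and its name just has to match what the JiraClient.cs code above assigns):

    public class Issue
    {
        // ... existing properties such as key and self ...

        // Added: this issue's slice of the REST response, as JSON converted to XML.
        public string xml { get; set; }
    }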

Moving Issues into a Data Table

Once you have the issue-by-issue REST JSON response converted to XML, we need to move it into our storage. My destination is SQL Server, and I will exploit a table-valued parameter to keep the process simple and use set operations. In short, I move the enumeration of issues into a C# DataTable so I may pass the data to SQL Server.
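Since the column definitions are omitted from the snippet below, here is a hypothetical sketch of what they might look like; the names and order are assumed to mirror the SQL table type defined further down, and Key, Self, and Xml are the string constants the loop uses as column indexers.

    // Assumed column setup for the upload DataTable (System.Data); the names and order
    // mirror the [Jira].[JiraUpload1Type] table type defined later in this post.
    const string Key = "Key";
    const string Self = "Self";
    const string Xml = "XmlData";

    static DataTable CreateUploadTable()
    {
        var table = new DataTable();
        table.Columns.Add(Key, typeof(string));
        table.Columns.Add("Assignee", typeof(string));
        table.Columns.Add("Description", typeof(string));
        table.Columns.Add("Reporter", typeof(string));
        table.Columns.Add("Status", typeof(string));
        table.Columns.Add("Summary", typeof(string));
        table.Columns.Add("OriginalEstimate", typeof(string));
        table.Columns.Add("Labels", typeof(string));
        table.Columns.Add(Self, typeof(string));
        table.Columns.Add(Xml, typeof(string));
        return table;
    }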

    var upload = new DataTable();
    // defining columns omitted
    var data = client.GetIssues(branch);
    foreach (var issue in data)
    {
        try
        {
            var newRow = upload.NewRow();
            newRow[Key] = issue.key;
            newRow[Self] = issue.self;
            // Other columns extracted
            newRow[Xml] = issue.xml;
            upload.Rows.Add(newRow);
        }
        catch (Exception exc)
        {
            Console.WriteLine(exc);
        }
    }

The upload code is also clean and simple:

    using (var cmd = new SqlCommand { CommandType = CommandType.StoredProcedure, CommandText = "Jira.Upload1" })
    using (cmd.Connection = MyDataConnection)
    {
        // The whole DataTable goes across as a single table-valued parameter.
        cmd.Parameters.AddWithValue("@Data", upload);
        cmd.ExecuteNonQuery();
    }

SQL Code

For many C# developers, SQL is an unknown country, so I will go into some detail. First, we need to define a table type in SQL that matches the DataTable in C# above (same column names in the same sequence is best):

    CREATE TYPE [Jira].[JiraUpload1Type] AS TABLE(
        [Key] [varchar](max) NULL,
        [Assignee] [varchar](max) NULL,
        [Description] [varchar](max) NULL,
        [Reporter] [varchar](max) NULL,
        [Status] [varchar](max) NULL,
        [Summary] [varchar](max) NULL,
        [OriginalEstimate] [varchar](max) NULL,
        [Labels] [varchar](max) NULL,
        [Self] [varchar](max) NULL,
        [XmlData] [Xml] NULL
    )

Note that I use (max) throughout, which is pretty much how the C# DataTable sees each column. Any conversion to decimals will be done by SQL itself.

Second, we create the stored procedure. We want to update existing records and insert missing records. The code is simple and clean:

    CREATE PROC [Jira].[Upload1] @Data [Jira].[JiraUpload1Type] READONLY
    AS
    -- Update issues that already exist...
    UPDATE S SET
      [Assignee] = D.Assignee
     ,[Description] = D.Description
     ,[Reporter] = D.Reporter
     ,[Status] = D.Status
     ,[Summary] = D.Summary
     ,[OriginalEstimate] = D.OriginalEstimate
     ,[Labels] = D.Labels
     ,[XmlData] = D.XmlData
    FROM Jira.Issue S
    JOIN @Data D ON D.[Key] = S.[Key]

    -- ...then insert the ones we have not seen before.
    INSERT INTO [Jira].[Issue]
           ([Key]
           ,[Assignee]
           ,[Description]
           ,[Reporter]
           ,[Status]
           ,[Summary]
           ,[OriginalEstimate]
           ,[Labels]
           ,[XmlData])
    SELECT  D.[Key]
           ,D.[Assignee]
           ,D.[Description]
           ,D.[Reporter]
           ,D.[Status]
           ,D.[Summary]
           ,D.[OriginalEstimate]
           ,D.[Labels]
           ,D.[XmlData]
    FROM @Data D
    LEFT JOIN Jira.Issue S ON D.[Key] = S.[Key]
    WHERE S.[Key] IS NULL

All of the JSON is now in XML and can be searched by XPath

Upon executing the above, we see our table is populated as shown below. The far-right column is XML. This is the SQL Xml data type and contains the REST JSON converted to XML for each issue.

The next step is often to add computed columns using the SQL XML and an XPath expression. An example of a generic solution is below.

So what is the advantage?

No matter how many additional fields are added to Jira, you have 100% data capture here. There is no need to touch the Extract Transform Load (ETL) job. You can create (and index) computed columns over the XML in SQL Server, or just hand back the XML to whatever is calling it. While SQL Server 2016 supports JSON, XML is superior here because of the ability to run XPath queries against it as well as create XML indexes.

In many implementations of Jira, the number of fields can get unreal, as shown below.

With the same data table, you could create multiple views that contain computed columns showing precisely the data that you are interested in.

Example of computed column definitions:

    [ProductionReleaseDate] AS ([dbo].[GetCustomField]('customfield_10705',[XmlData])),
    [EpicName] AS ([dbo].[GetCustomField]('customfield_10009',[XmlData])),
    [Sprint] AS ([dbo].[GetCustomField]('customfield_10007',[XmlData])),

With this SQL function doing all of the work:

    CREATE FUNCTION [dbo].[GetCustomField]
    (
        @Name varchar(32),
        @Data Xml
    )
    RETURNS varchar(max)
    AS
    BEGIN
        DECLARE @ResultVar varchar(max)
        -- Find the customfield element whose id attribute matches @Name
        -- and return the text of its first customfieldvalues element.
        SELECT @ResultVar = c.value('customfieldvalues[1]', 'varchar(max)')
        FROM @Data.nodes('//customfield[@id]') AS t(c)
        WHERE c.value('@id', 'varchar(50)') = @Name
        RETURN @ResultVar
    END


The net result is clean, flexible code feeding into a database that is very quick to extend.

You want to expose a new field? It's literally a one-liner to add it as a column to the SQL Server table or view. Consider creating custom views on top of the table as a clean, organized solution.