Monday, September 5, 2016

An interesting Interview Question: Fibonacci Sequence

“Write a function to calculate the nth Fibonacci number” is a common interview question, and often the solution is something like:

 

int Fib(int n)
{
   if (n < 2) return n;   // base cases: Fib(0) = 0, Fib(1) = 1
   return Fib(n-1) + Fib(n-2);
}

 

The next question is to ask, for n=100, how many calls will be made. The answer is not 100 but actually horrible! The stack itself never gets deeper than about 100 frames, but the total number of calls grows exponentially: on the order of 1.6^100, or about 10^21 calls.

Take the first call: we start one call on Fib(99) and one on Fib(98). There is nothing to allow Fib(99) to borrow the result of Fib(98). So one step is two calls to recurse, and each subsequent call above the base cases turns into two more. For example

  • 2 –> calls [Fib(1), Fib(0)]
  • 3 –> calls [ Fib(2)->[Fib(1), Fib(0)], Fib(1) ]
  • 4 –> calls [ Fib(3)->[ Fib(2)->[Fib(1), Fib(0)], Fib(1) ], Fib(2)->[Fib(1), Fib(0)] ]

Missing this issue is very often seen with by-rote developers (who are excellent for some tasks).
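To make the difference between "calls made" and "stack depth" concrete, here is a minimal sketch that instruments the naive recursion. The class `NaiveFibStats` and its members are hypothetical names for illustration, and it assumes the usual base cases Fib(0) = 0 and Fib(1) = 1.

```csharp
using System;

// Hypothetical helper (not from the original solution): counts every call
// made by the naive recursion and records the deepest stack depth reached.
public static class NaiveFibStats
{
    private static long _calls;
    private static int _maxDepth;

    private static long Fib(int n, int depth)
    {
        _calls++;
        if (depth > _maxDepth) _maxDepth = depth;
        if (n < 2) return n;                       // assumed base cases: Fib(0) = 0, Fib(1) = 1
        return Fib(n - 1, depth + 1) + Fib(n - 2, depth + 1);
    }

    // Returns the Fibonacci value, the total number of calls and the maximum depth.
    public static (long Value, long Calls, int MaxDepth) Measure(int n)
    {
        if (n < 0) throw new ArgumentOutOfRangeException(nameof(n));
        _calls = 0;
        _maxDepth = 0;
        long value = Fib(n, 1);
        return (value, _calls, _maxDepth);
    }
}
```

Calling `NaiveFibStats.Measure(20)` shows the pattern: the depth stays at n while the call count is already in the tens of thousands, and it keeps growing by a factor of roughly 1.6 for each increment of n.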

 

A better solution is to cache the values as each one is computed, effectively creating a lookup table. You are trading a little memory space for all of those repeated calls.
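A minimal sketch of the lookup-table idea, assuming a dictionary as the cache and `long` results (which overflow past Fib(92)). The class name `MemoFib` is hypothetical.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical memoized version: each value is computed once and then
// served from the lookup table on every later request.
public static class MemoFib
{
    private static readonly Dictionary<int, long> Cache = new Dictionary<int, long>
    {
        [0] = 0L,   // assumed base cases
        [1] = 1L
    };

    public static long Fib(int n)
    {
        if (n < 0) throw new ArgumentOutOfRangeException(nameof(n));
        if (Cache.TryGetValue(n, out long known)) return known;

        long value = Fib(n - 1) + Fib(n - 2);   // Fib(n - 2) is a cache hit by this point
        Cache[n] = value;
        return value;
    }
}
```

The recursion still goes n frames deep on the first call, so this removes the repeated work but not the stack usage.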

 

Placing constraints on memory and stack space may force the developer to do some actual thinking. A solution that conforms to this is shown below

 

    private static long Fibonacci(int n) {
        long a = 0L;   // F(k)
        long b = 1L;   // F(k+1)
        for (int i = 31; i >= 0; i--)  // 31 is arbitrary, see below
        {
            long d = a * (b * 2 - a);   // F(2k)
            long e = a * a + b * b;     // F(2k+1)
            a = d;
            b = e;
            if ((((uint)n >> i) & 1) != 0) {
                long c = a + b;
                a = b;
                b = c;
            }
        }
        return a;
    }

 

The output of the above shows what is happening and suggests that the “31” could be replaced by the log base 2 of n (the position of its highest set bit), which would likely improve efficiency.
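Here is a sketch of that suggestion, assuming the same doubling identities as above. The only change is that the loop starts at the highest set bit of n instead of at bit 31; the class name `FastFib` is hypothetical.

```csharp
using System;

// Hypothetical variant of the doubling loop that skips the leading zero bits.
public static class FastFib
{
    public static long Fib(int n)
    {
        if (n < 0) throw new ArgumentOutOfRangeException(nameof(n));

        // Index of the highest set bit of n, i.e. floor(log2(n)); -1 when n == 0.
        int top = -1;
        for (int m = n; m != 0; m >>= 1) top++;

        long a = 0L;   // F(k)
        long b = 1L;   // F(k+1)
        for (int i = top; i >= 0; i--)
        {
            long d = a * (2 * b - a);   // F(2k)
            long e = a * a + b * b;     // F(2k+1)
            a = d;
            b = e;
            if (((n >> i) & 1) != 0)
            {
                long c = a + b;         // F(2k+2)
                a = b;
                b = c;
            }
        }
        return a;   // results overflow a long past Fib(92)
    }
}
```

For n = 129 this runs 8 iterations rather than 32. The skipped iterations only ever doubled k = 0, so the result is unchanged.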

[Images: program output traces, including runs for n = 32, 65 and 129]

 

What is the difference in performance for the naive vs the latter?

I actually did not wait until the naive solution finished… I aborted at 4 minutes.

The new improved version took 85 ms. Against the 240,000 ms I waited before aborting, that is at least a 2,800-fold improvement.

Take Away

This question:

  1. Identifies whether a person knows what recursion is and can code it.
  2. Identifies whether they understand the consequences of recursion and how it will be executed (i.e. can think about what the code does).
    1. Most recursion questions are atomic (e.g. factorial) and not composite (recursion that is not simple).
  3. Shows whether they are able to analyze a simple mathematical issue and generate a performant solution.

Sunday, August 28, 2016

Apple Store Passbook UML Diagrams and Error Messages

While working on a recent project, a major stumbling block was a lack of clear documentation of what happened where. This was confirmed when I attempted to search for some of the messages returned to the Log REST endpoint by the iPhone. There were zero hits!

 


 

In terms of a Store Card, let us look at the apparent Sequence Diagram

 

[Image: Store Card sequence diagram]

 

Log Error Messages Seen and Likely Meaning

  • Passbook inactive or deleted, or someone changed the Auth Token
    • [2016-08-28 11:57:01 -0400] Unregister task (for device ceed8761e584e814ed4fe73cbb334ee9, pass type pass.com.reddwarfdogs.card.dev, serial number 85607BFE98D91A-765F7B05-D5E4-4B32-B16D-69C2038EF522; with web service url https://llc.reddwarfdogs.com/passbook) encountered error: Authentication failure
    • [2016-08-28 20:44:25 +0700] Register task (for device 19121d6b570b31a3fa56dbd45411c933, pass type pass.com.reddwarfdogs.card.dev, serial number 85607BFE98D91A-765F7B05-D5E4-4B32-B16D-69C2038EF522; with web service url https://llc.reddwarfdogs.com/passbook) encountered error: Authentication failure
    • [2016-08-24 10:04:38 +0800] Web service error for pass.com.reddwarfdogs.card.dev (https://llc.reddwarfdogs.com/passbook): Update requested for unregistered serial number 8C6772F099D51AA3-7A32F5FB-F7F8-4285-A2A2-79FC66DF942C
  • Bad Record Keeping in your application
    • [2016-08-23 19:58:35 -0700] Web service error for pass.com.reddwarfdogs.card.dev (https://llc.reddwarfdogs.com/passbook): Server ignored the 'if-modified-since' header (Tue, 23 Aug 2016 16:54:10 GMT) and returned the full unchanged pass data for serial number '8C6771F89ED51DAA-AAF3100E-C365-4CCD-8C95-ADC974F52894'.
    • [2016-08-23 16:49:38 -0700] Get pass task (pass type pass.com.reddwarfdogs.card.dev, serial number 8C6771F89ED31FAE-57ED753A-8464-408E-95EF-CEF75DBB30D6, if-modified-since Tue, 09 Aug 2016 21:57:32 GMT; with web service url https://llc.reddwarfdogs.com/passbook) encountered error: Received invalid pass data (The pass cannot be read because it isn’t valid.)
      • Cause: Corruption OR change of Certificate used to sign Passbook
    • [2016-08-23 13:56:44 -0700] Web service error for pass.com.reddwarfdogs.card.dev (https://llc.reddwarfdogs.com/passbook): Server requested update to serial number '8C6771F89ED41BAC-FFBF3B69-98F1-4F2A-A8B7-5AF457558EE7', but the pass was unchanged.
    • [2016-08-23 11:58:25 -0700] Web service error for pass.com.reddwarfdogs.card.dev (https://llc.reddwarfdogs.com/passbook): Device received spurious push. Request for passesUpdatedSince '20160823180851' returned no serial numbers. (Device = 2c04d18e5f8480f97bb9318b4065dba0)
    • [2016-08-08 10:23:57 -0700] Web service error for pass.com.reddwarfdogs.card.dev (https://llc.reddwarfdogs.com/v1/passbook): Response to 'What changed?' request included 1 serial numbers but the lastUpdated tag (20160808172351) remained the same.
      • Cause: Duplicate push notification sent to a device, or a logic error. If the tag is 1234, then the server logic should select passes with a tag > 1234 and NOT >= 1234
  • Apple gives little guidance on status codes and how the iPhone will react
    • [2016-08-23 15:46:33 +0700] Get serial #s task (for device 6f175696d73dec465c561f4d3ee2dfe7, pass type pass.com.reddwarfdogs.card.dev, last updated (null); with web service url https://llc.reddwarfdogs.com/passbook) encountered error: Unexpected response code 504
    • [2016-08-23 01:42:53 -0700] Get serial #s task (for device 2c04d18e5f8480f97bb9318b4065dba0, pass type pass.com.reddwarfdogs.card.dev, last updated 20160823083910; with web service url https://llc.reddwarfdogs.com/passbook) encountered error: Unexpected response code 408
    • [2016-08-08 18:53:00 +0800] Get serial #s task (for device 726996d0f44f44b19f157aa0824f64cf, pass type pass.com.reddwarfdogs.card.dev, last updated (null); with web service url https://llc.reddwarfdogs.com/passbook) encountered error: Unexpected response code 596

I suspect there are more messages – I have just not stumbled across them yet.
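The "lastUpdated tag remained the same" message above comes down to one comparison operator. As a minimal sketch, assuming each pass row carries a numeric update tag (the `PassRecord` type and `UpdatedPasses.Since` helper are hypothetical, not part of any library), the "what changed?" query could look like this:

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical shape of a stored pass: serial number plus the tag of its last change.
public sealed class PassRecord
{
    public string SerialNumber;
    public long UpdatedTag;      // e.g. 20160808172351
}

public static class UpdatedPasses
{
    // Returns the serial numbers changed strictly after the tag the device echoed back,
    // together with the tag to hand out for the next request.
    public static (List<string> SerialNumbers, long LastUpdated) Since(
        IEnumerable<PassRecord> registeredPasses, long? passesUpdatedSince)
    {
        long tag = passesUpdatedSince ?? long.MinValue;   // no tag yet: everything counts as changed

        // Strictly greater than: ">=" would return the same pass again with an unchanged tag.
        var changed = registeredPasses.Where(p => p.UpdatedTag > tag).ToList();

        long lastUpdated = changed.Count == 0 ? tag : changed.Max(p => p.UpdatedTag);
        return (changed.Select(p => p.SerialNumber).ToList(), lastUpdated);
    }
}
```

An empty list is the "nothing changed" case. As far as I can tell, the web service should answer that with a no-content response rather than an empty serial number list.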

Friday, August 26, 2016

Solving PushSharp.Apple Disconnect Issue

While doing a load test of a new Apple Passbook application, I suddenly saw some 200K transmission errors from my WebApi application. Searching the web, I found reports that a “high” rate of connect/disconnect to the Apple Push Notification Service (APNS) causes APNS to do a forced disconnect.

 

While Apple does have a limit (very very high) on the number of notifications before they will refuse connections for an hour, the limit for connect/disconnect is much lower. After playing around a bit, I found that if I persisted the connection via a static field, I no longer had this issue.

 

Below is a sample of the code.

  • Note: we disconnect and reconnect whenever an error happens (I have not seen an error yet) 

 

using Newtonsoft.Json.Linq;

using PushSharp.Apple;

using System;

using System.Collections.Generic;

using System.Security.Cryptography.X509Certificates;

using System.Text;

namespace RedDwarfDogs.Passbook.Engine.Notification

{

    public class AppleNotification : INotification

    {

        private readonly IPassbookSettings _passbookSettings;

        private readonly ILogger _logger;

        private static ApnsServiceBroker _apnsServiceBroker;

        private static object lockObject = new object();

        public AppleNotification(ILogger logger,IPassbookSettings passbookSettings)

        {

            _logger= Guard.EnsureArgumentIsNotNull(logger, "logger");

            _passbookSettings = Guard.EnsureArgumentIsNotNull(passbookSettings, "passbookSettings");

        }

        public void SendNotification(HashSet<string> deviceTokens)

        {

            if (deviceTokens == null || deviceTokens.Count == 0)

            {

                return;

            }

            try

            {

                _logger.Write("PassbookEngine_SendNotification_Apple");

                // Create a new broker if needed

                if (_apnsServiceBroker == null)

                {

                    X509Certificate2 cert = _passbookSettings.ApplePushCertificate;

                    if (cert == null)

                        throw new InvalidOperationException("pushThumbprint certificate is not installed or has invalid Thumbprint");

                      var config = new ApnsConfiguration(ApnsConfiguration.ApnsServerEnvironment.Production,

                        _passbookSettings.ApplePushCertificate, false);

                    _logger.Write("PassbookEngine_SendNotification_Apple_Connect");

                    _apnsServiceBroker = new ApnsServiceBroker(config);

                    // Wire up events

                    _apnsServiceBroker.OnNotificationFailed += (notification, aggregateEx) =>

                    {

                        aggregateEx.Handle(ex =>

                        {

                            _logger.Write("Apple Notification Failed", "Direct", ex);

                            _logger.Write("PassbookEngine_SendNotification_Apple_Error");

                            // See what kind of exception it was to further diagnose

                            if (ex is ApnsNotificationException)

                            {

                                var notificationException = (ApnsNotificationException)ex;

                                var apnsNotification = notificationException.Notification;

                                var statusCode = notificationException.ErrorStatusCode;

                            }

                            _logger.Write("SendNotification", "PushToken Rejected", ex);

                            // We reset to null to recreate / connect

                            Restart();

                            return true;

                        });

                    };

                    _apnsServiceBroker.OnNotificationSucceeded += (notification) =>

                    {

                    };

                    // Start the broker

                }

                var sentTokens = new StringBuilder();

                lock (lockObject)

                {

                    _apnsServiceBroker.Start();

                    foreach (var deviceToken in deviceTokens)

                    {

                        if (string.IsNullOrWhiteSpace(deviceToken) || deviceToken.Length < 32 || deviceToken.Length > 256 || deviceToken.Contains("-"))

                        {

                            // Invalid token: skip it to stay in Apple's good books.

                            // We use GUIDs (which contain "-") as fake push tokens. Do not send them to Apple.

                            // We do not want to get blacklisted.

                        }

                        else

                        {

                            // Queue a notification to send

                            var nofification = new ApnsNotification

                            {

                                DeviceToken = deviceToken,

                                Payload = JObject.Parse("{\"aps\":{\"badge\":7}}")

                            };

                            try

                            {

                                _apnsServiceBroker.QueueNotification(nofification);

                                sentTokens.AppendFormat("{0} ", deviceToken);

                            }

                            catch (System.InvalidOperationException)

                            {

                                // Assuming already in queue

                            }

                        }

                    }

                    try

                    {

                        //duplicate signals may occur

                        _apnsServiceBroker.Stop();

                    }

                    catch { }

                }

                var auditLog = new Log

                {

                    Message = sentTokens.ToString(),

                    RequestHttpMethod = "Post"

                };

                _logger.Write("Passbook", PassbookLogMessageCategory.SendNotification.ToString(),

                    "PassbookAudit", "Passbook", auditLog);

                return;

            }

            catch (Exception exc)

            {

                // We swallow notification exceptions - for example APNS is offline. Allow rest of processing to work.

                _logger.Write("SendNotification", "One or more notifications via Apple (APNS) failed", exc);

                Restart();

                _apnsServiceBroker = null; //force a reset

            }

        }

        private void Restart()

        {

            if (_apnsServiceBroker != null)

            {

                try

                {

                    //duplicate signals may occur

                    _apnsServiceBroker.Stop();

                }

                catch { }

                _logger.Write("PassbookEngine_SendNotification_Apple_Restart");

                _apnsServiceBroker = null;

            }

        }

    }

}

Sunday, August 7, 2016

Taking Apple PkPasses In-House–Working Notes

This year I had an explicit, yet vague, project assigned to me: Move our Apple PkPass from a third party provider to our own internal system. The working environment was the Microsoft Stack with C#, and a little googling found that the first 90% of the work could be done by NuGet, namely:

  • Install-Package dotnet-passbook
  • Install-Package PushSharp

Create a certificate file on the Apple developer site and we are done… easy project… not quite.

 

Unfortunately, both the in-house expertise and the 3rd party expertise involved in the original project had moved on. Welcome to reverse engineering black boxes.

 

The Joy of Certificates!

Going to http://www.apple.com/certificateauthority/ opens a can of worms. The existing instructions assumed you have a Mac, not Windows 10.

The existing instructions found on the web (https://tomasmcguinness.com/2012/06/28/generating-an-apple-ios-certificate-using-windows/) broke due to some change with Windows or Apple in April 2016 (apple forum, stack overflow). The solution was Unix on Windows via https://cygwin.com/install.html and going the Unix route to generate pfx files.

 

The second issue was that how we run our IIS servers and the default instructions for installing certificates for dotnet-passbook were not mutually compatible. The instructions said that the certs needed to be installed in the Intermediate Certification Authorities store. After a few panicked hours deploying to load hosts with problems, we discovered that we had to import to Personal to get dotnet-passbook to work.

The next issue we encountered was that of invisible characters coming along when we copied the thumbprint into our C# code. We implemented a thumbprint check that verified the length (40) and also walked the characters, ensuring that all were in range. After this, we verified that we could find the matching certificate. All of this was done on website load: if an error was thrown, the site would not load.

 

This saved us triage time on every new deployment:

  • We identify whether a thumbprint is ‘corrupt’
  • We verify that the expected certificate is there
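The two checks above can be sketched as follows. This is an illustration rather than our production code: the class and method names are hypothetical, and the store location (`LocalMachine`, Personal store) is an assumption that depends on how your servers are set up.

```csharp
using System;
using System.Linq;
using System.Security.Cryptography.X509Certificates;

// Hypothetical start-up check for a configured certificate thumbprint.
public static class ThumbprintCheck
{
    // A SHA-1 thumbprint is exactly 40 hexadecimal characters; anything else
    // (including invisible characters picked up by copy and paste) is 'corrupt'.
    public static bool IsWellFormed(string thumbprint)
    {
        return thumbprint != null
            && thumbprint.Length == 40
            && thumbprint.All(c => (c >= '0' && c <= '9')
                                || (c >= 'a' && c <= 'f')
                                || (c >= 'A' && c <= 'F'));
    }

    // Throws when the thumbprint is corrupt or the certificate is not installed.
    public static X509Certificate2 FindOrThrow(string thumbprint)
    {
        if (!IsWellFormed(thumbprint))
            throw new InvalidOperationException("Thumbprint is not 40 hexadecimal characters.");

        // Assumption: the certificate lives in the machine's Personal ("My") store.
        var store = new X509Store(StoreName.My, StoreLocation.LocalMachine);
        try
        {
            store.Open(OpenFlags.ReadOnly);
            var found = store.Certificates.Find(X509FindType.FindByThumbprint, thumbprint, false);
            if (found.Count == 0)
                throw new InvalidOperationException("No certificate with this thumbprint is installed.");
            return found[0];
        }
        finally
        {
            store.Close();
        }
    }
}
```

Calling `FindOrThrow` from the application start-up path gives the "site will not load" behaviour described above.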

The last issue impacts big shops: the certificate should be 100% owned by Dev Ops and never installed on a dev or test machine. This means that alternative certs are needed in those environments. Each cert will have a different thumbprint, hence lots of web.config transformations substituting in the correct thumbprint for the environment. The real-life production cert should be owned by Dev Ops (or security) with a very strong password that they and they alone know.

 

The Joys of Authentication Tokens

Security review for in-house required that the authentication tokens be stored as a one-way hash (SHA384 or higher) and be unique per PkPass. The existing design used Guids for serial numbers, and thus we used a Guid for the authentication token when the pass was first created. We can never recreate an existing PkPass from the database alone because we do not know the authentication token, just the hash. When a request comes in for the latest pass, we hash the authentication token sent in the Authorization header and compare it to the stored hash. If it matches, we keep the presented token in memory, insert it into the PkPass Json, then Zip and Sign the new PkPass. Security is happy.
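A minimal sketch of the hash-and-compare step, assuming SHA-384 over the UTF-8 bytes of the token and Base64 storage (both are my choices for illustration; the names are hypothetical). It also assumes the device presents the token as `Authorization: ApplePass <token>`.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// Hypothetical helper: only the hash of the authentication token is stored.
public static class AuthTokenHasher
{
    private const string Scheme = "ApplePass ";   // assumed Authorization header scheme

    public static string Hash(string token)
    {
        using (var sha = SHA384.Create())
        {
            return Convert.ToBase64String(sha.ComputeHash(Encoding.UTF8.GetBytes(token)));
        }
    }

    // True when the token in the Authorization header hashes to the stored value.
    public static bool Matches(string authorizationHeader, string storedHash)
    {
        if (authorizationHeader == null ||
            !authorizationHeader.StartsWith(Scheme, StringComparison.Ordinal))
        {
            return false;
        }
        string presented = Hash(authorizationHeader.Substring(Scheme.Length));
        return FixedTimeEquals(presented, storedHash);
    }

    // Compares without exiting early on the first differing character.
    private static bool FixedTimeEquals(string a, string b)
    {
        if (a == null || b == null || a.Length != b.Length) return false;
        int diff = 0;
        for (int i = 0; i < a.Length; i++) diff |= a[i] ^ b[i];
        return diff == 0;
    }
}
```

On a match, the token taken from the header is the only copy of the plain-text value available, which is why it has to be carried forward into the regenerated pass.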

 

Now when it comes to the 3rd party provider, we were fortunate that they stored the authentication tokens in plain text, so it was just a matter of hashing each token and saving the hash into our database. If they had hashed (as they should have), then we would need to replicate their hash method. If it was SHA-1 and SHA-2 was required by our security, then we would need to do some fancy footwork to migrate the hash, i.e.

  1. add a “SHA” column in our table,
  2. when a new request comes in examine the SHA value
  3. if it is “1” then use the authentication token presented and authenticated to create a SHA-2 hash and update the SHA column to “2”
  4. if it is “2” then authenticate appropriately.

This will allow us to track the uplift rate to SHA-2. At some point security would likely say “delete the SHA1 PkPass records”. This is easy because we have tracked them.
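The four steps above can be sketched like this. It is a hypothetical illustration (we did not have to build it): the `PassAuthRow` type stands in for the table row, SHA-384 stands in for "SHA-2", and hashes are stored as Base64.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// Hypothetical row: Sha is 1 while the stored hash is SHA-1, 2 once it has been uplifted.
public sealed class PassAuthRow
{
    public int Sha;
    public string TokenHash;
}

public static class HashUplift
{
    public static string HashWith(HashAlgorithm algorithm, string token)
    {
        using (algorithm)
        {
            return Convert.ToBase64String(algorithm.ComputeHash(Encoding.UTF8.GetBytes(token)));
        }
    }

    // Authenticates the presented token and, for SHA-1 rows, re-hashes it to SHA-2
    // while the plain-text token is still in hand.
    public static bool Authenticate(PassAuthRow row, string presentedToken)
    {
        if (row.Sha == 1)
        {
            if (HashWith(SHA1.Create(), presentedToken) != row.TokenHash) return false;
            row.TokenHash = HashWith(SHA384.Create(), presentedToken);
            row.Sha = 2;   // the caller is expected to persist the updated row
            return true;
        }
        return row.Sha == 2 && HashWith(SHA384.Create(), presentedToken) == row.TokenHash;
    }
}
```

Counting the rows where the column is still 1 gives the uplift rate at any point in time.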

 

Push Notifications

This went easy except for missing that a Push Certificate is NOT used for PKPass files. Yes, it is not used.  It is used for registered 3rd party developed Apple applications. The certificate used for connecting to the Apple Push Notification Service (APNS) is the certificate used to sign the PkPass files. There is no separate push notification certificate. Also, using PushSharp, you must set “validate certificate” to false, or an exception will be thrown.

 

The pushTokens are device identifiers and APNS does not provide feedback if the device still exists (one of my old phones exists, but is at the bottom of an outdoor privy in a national park…), is turned off, or is out of communication.  The author of PushSharp, Redth, has done an excellent description of the problem here. The logical way to keep the history in check is to track when each pass is last retrieved and then periodically delete the push notifications for devices where none of the associated passes have been retrieved in the last year.  You will have “dead” push tokens in some circumstances.

 

I have a pkPass, my iPhone got destroyed. I installed the pkPass on the new phone. The old iPhone push token will never be eliminated while I maintain my PkPass. Why? because we do not know which iPhone is getting updates!

 

Minor hiccup

The “get serial numbers since” API call had a gotcha dealing with the modified-since query parameter. The Apple documentation suggests that a date be used, and we originally coded it up assuming that this was an HTTP if-modified-since header. QAing on an iPhone clarified that it was a query parameter and not an HTTP header. We simply moved the same date there and encountered two issues:

  • We had a time-offset issue: our code was working off the database local time but treating it as universal time (which an HTTP header would be)
  • Our IIS security settings did not like seeing a “:” in a query parameter. We resolved this by using a “yyyyMMddHHmmss” format

The real gotcha, which was stated in the Apple documentation, was that this is an arbitrary token that is daisy chained from one call to the next. A date is a logical choice, but it is not required to be a date.

 

The value received in the last get serial numbers response is what is sent in the next get serial numbers request. Daisy chaining. The iPhone does nothing but echo it back.
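If you do use a date as the tag, a small helper keeps the two issues above from coming back. This is a sketch with hypothetical names; the only thing taken from the post is the “yyyyMMddHHmmss” format and the decision to work in UTC.

```csharp
using System;
using System.Globalization;

// Hypothetical helper for the daisy-chained update tag.
public static class UpdateTag
{
    private const string Format = "yyyyMMddHHmmss";   // no ':' so IIS request filtering is happy

    // Formats a UTC timestamp as a tag; refuses local or unspecified times
    // so a database local time cannot silently be treated as universal time.
    public static string FromUtc(DateTime utc)
    {
        if (utc.Kind != DateTimeKind.Utc)
            throw new ArgumentException("Expected a UTC timestamp.", nameof(utc));
        return utc.ToString(Format, CultureInfo.InvariantCulture);
    }

    // Parses a tag echoed back by the device into a UTC timestamp.
    public static bool TryParse(string tag, out DateTime utc)
    {
        return DateTime.TryParseExact(
            tag, Format, CultureInfo.InvariantCulture,
            DateTimeStyles.AssumeUniversal | DateTimeStyles.AdjustToUniversal,
            out utc);
    }
}
```

For example, 23 Aug 2016 18:08:51 UTC becomes the tag `20160823180851`.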

Avoiding a Migraine

The dotnet-passbook code puts into the Json, the pass type identifier name in the certificate regardless of what you passed in. This is good and wise and secure. It has an unfortunate side effect, the routing

devices/{deviceLibraryIdentifier}/registrations/{passTypeIdentifier} and passes/{passTypeIdentifier}/{serialNumber}

is determined by this pass type identifier. If you are running a site and passes come from passes/foobar/1234, but your certificate name is “JackShyte” then the Json in the pass returned would read JackShyte. When the iPhone gets a push notification, it would then construct the url for the update as passes/JackShyte/1234 … which will likely return a 404. The PkPass will never be updated unless you create additional routings!!

 

The solution that I took was to compare the {passTypeIdentifier} in the routing to the certificate. If they did not match, then 404 immediately and log an exception. While it is technically possible to “unwind” such a foul up, the path is not pretty.

 

Migration

The key for migration is a stepped approach

  1. Deploy your new solution and test it, correct any issues that you find in the production environment
  2. Deploy the application or mechanism for creating new PkPasses (this could be part of 1), so all new passes use the in-house system
  3. Update your data from the third party provider with authentication tokens (or their hash) and serial numbers. You want to do this after 2, because you want this list to be closed (no new passes created on the third party system)
  4. Have the 3rd party provider change the WebServiceUrl to the in-house solution. In theory, a Moved response to the in house system would also work (I have not tested this with an iPhone).
  5. Since the 3rd party wants to shut down in time, you must send out a push notification to every push token you have. You will likely want to throttle this if you have a large number of push tokens (in my case, 30 million) because every push token could result in a request for a new PkPass file.
    1. This may need to be repeated to ensure adequate coverage for devices off line or abroad without data plans
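The throttling in step 5 can be as simple as sending fixed-size batches with a pause in between. The sketch below is hypothetical: `sendBatch` is a callback standing in for whatever actually queues the notifications, and the batch size and pause would need tuning against how many PkPass requests your servers can absorb.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

// Hypothetical throttle: push tokens are sent in batches with a pause between batches.
public static class ThrottledPush
{
    // Returns the number of batches handed to sendBatch.
    public static int SendInBatches(
        IEnumerable<string> pushTokens,
        int batchSize,
        TimeSpan pauseBetweenBatches,
        Action<IReadOnlyList<string>> sendBatch)
    {
        if (batchSize <= 0) throw new ArgumentOutOfRangeException(nameof(batchSize));

        int batches = 0;
        var batch = new List<string>(batchSize);
        foreach (var token in pushTokens)
        {
            batch.Add(token);
            if (batch.Count == batchSize)
            {
                sendBatch(batch.ToArray());          // copy, so the callback can keep the list
                batches++;
                batch.Clear();
                Thread.Sleep(pauseBetweenBatches);   // give the web servers time to serve the passes
            }
        }
        if (batch.Count > 0)
        {
            sendBatch(batch.ToArray());              // final partial batch
            batches++;
        }
        return batches;
    }
}
```

Because the tokens are consumed as a stream, the full list of 30 million never has to be held in memory at once.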

Bottom Line

The original design worked, but there were a ton of details that had to be sorted out. I have omitted the nightmares that QA had trying to validate stuff, especially the migration portions.

Monday, June 6, 2016

One Migration Strategy to Microservices

The concept of microservices is nice, but if you have a complex existing system the path is neither obvious nor easy. I have seen Principal Architects throw up their hands and persuade the business that we need to build a new replacement system and that the old system is impossible to work with. This path tends to lead to overruns and often complete failures. I have recently seen that happen at a firm: “Just one year needed to deliver…” and three years later it was killed off because it had not been delivered. The failure rates of 80 to 90% typically reported in industry literature are very believable.

 

Over decades, I have seen many failures (usually from the sidelines). On the other hand, for a planned phased migration I have seen repeated success. Often success or failure seems to be determined by the agility of the management and technical leads coupled with the depth of analysis before the project starts. Unfortunately, deep analysis ends up with a waterfall-like specification that results in lock-step development and no agility around issues. Similarly, agile often results in superficial analysis (the time horizon for analysis is often just the end of the next sprint) with many components failing to fit together properly over time!

 

This post looks at a heritage system and how it can be converted to a microservices framework in an evolutionary manner. No total rewrite, just a phased migration ending with a system that is close to a classic pro-forma microservice system.

 

I made several runs at this problem, and what I describe below “feels good”, which to me usually means a high probability of success with demonstrable steps at regular intervals.

 

Example System

I am going to borrow a university system template from my days working for Blackboard.  We have teachers, non-teaching staff, students, classes, building, security access cards, payment cards, etc.  At one point, components were in Delphi, C#, Java, C++ etc with the databases in SQL Server and Oracle. Not only is data shared, but permissions often need to be consistent and appropriate.

 

I have tried a few running starts of microservicing  such a design, and at present, my best suggestion is this:

  • Do NOT extend the microservicing  down to the database – there is a more elegant way to proceed
  • Look at the scope of the microservices API very carefully – this is a narrow path that can explode into infinite microservices or a resurrection of legacy patterns

Elegant Microservice Database Access

Do not touch the database design at the start. You are just compounding the migration path needlessly at the start. Instead, for each microservice create a new database login that is named for the microservice and has (typically) CRUD permissions to:

  • A table
  • A subset of columns in a table
  • An updateable view
  • A subset of columns in an updateable view

We term this the Crud-Columns. There is a temptation to incorporate multiple Crud-Columns into one microservice. The problem is simple: what are the objective criteria to stop incorporating more Crud-Columns into this single microservice? If you go to one microservice for each Crud-Columns, then by counting the tables you have an estimate of the number of microservices that you will likely end up with… oh… that may be a lot! At this point, you may really want to consider automatic code generation of many microservices, similar to what you see with Entity Framework, except this would be called Microservices-Framework.

 

This microservice may also have read-only permissions to other tables. This read-only access to other tables may be transitory for the migration. Regardless of final resolution, these tables must be directly related to the CRUD columns, and used to determine CUD decisions. At some future time, the REST calls to these read-only tables may be redirected elsewhere (for example using a Moved directive to a reporting microservice).

 

Oh, I have introduced a new term, “reporting microservices”. This is a microservice with one or more Read APIs; multiple calls may be exposed depending on filtering, sorting or user permissions.

 

Microservices are not domain level APIs but at sub-domains or even sub-sub-domains. You should not be making small steps, instead, put on your seven-league boots!

American Trucking Industry 1952 Ad - Seven League Boots…

 

Tracking microservices

Consider creating a table where every database column is enumerated out and the microservice having CRUD over it is listed.

i.e.

  • Server.Database.Table.Schema.Column –> CRUD – >Microservice Name

 

The ideal (but likely impractical) goal is to have just one microservice per specified column. That is, a microservice may have many CUD columns, but a column will have only one CUD microservice (N columns :: 1 Microservice).

 

Similarly, a table with

  • Server.Database.Table.Schema.Column –> R– >Microservice

can be used as a heat map to refactor as the migration occurs. We want to reduce hot spots (i.e. the number of Read microservices per column).
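Both tracking tables can be queried with a few lines of LINQ. The sketch below assumes the rows have been loaded into a hypothetical `ColumnAccess` type, with "CUD" and "R" as the access markers; none of these names come from an existing framework.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical row of the tracking table: one row per (column, access, microservice).
public sealed class ColumnAccess
{
    public string Column;         // fully qualified column name
    public string Access;         // "CUD" or "R"
    public string Microservice;
}

public static class AccessMap
{
    // Columns that break the "only one CUD microservice per column" goal.
    public static List<string> ColumnsWithSeveralWriters(IEnumerable<ColumnAccess> rows)
    {
        return rows.Where(r => r.Access == "CUD")
                   .GroupBy(r => r.Column)
                   .Where(g => g.Select(r => r.Microservice).Distinct().Count() > 1)
                   .Select(g => g.Key)
                   .OrderBy(column => column)
                   .ToList();
    }

    // Heat map: number of distinct reading microservices per column, hottest first.
    public static List<KeyValuePair<string, int>> ReadHeatMap(IEnumerable<ColumnAccess> rows)
    {
        return rows.Where(r => r.Access == "R")
                   .GroupBy(r => r.Column)
                   .Select(g => new KeyValuePair<string, int>(
                       g.Key, g.Select(r => r.Microservice).Distinct().Count()))
                   .OrderByDescending(pair => pair.Value)
                   .ThenBy(pair => pair.Key)
                   .ToList();
    }
}
```

Re-running both queries after each migration step shows whether the hot spots are actually shrinking.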

 

Building Microservices from Database Logins

Defining the actions that a microservice login can do cascades into a finite set of possible APIs. We are avoiding trying to define a microservice and then get the database access to support it. We are effectively turning the usual process upside down.

 

Instead of the typical path of asking the client what it needs for an API (to keep its life simple), we are ensuring that there is a collection of APIs that satisfies its needs, although these may be complicated to call. What we need to return to the classical simplicity is intermediate APIs.

 

Intermediate APIs

Intermediate APIs are APIs that do not have explicit database CUD rights. They are intended to be helper APIs that talk to the database microservices above and present a simpler API to clients. They will call the above APIs to change the database. They may also be caching APIs and database reporting APIs.

 

A Walk Thru

Using the university model cited above, the first naïve step could be to create a

  • Teacher API
  • Student API
  • Class API

If you bring in column permissions you find that these can be decomposed further. The reason that there may be a single row in the database for each of the above comes from Relational Database Design Normalization theory.  Instead, we should try to decompose according to user permission sets. For example:

  • Teacher API
    • Teacher MetaData API i.e. name,
    • Teacher Address Info API
    • Teacher Salary Info API
    • Teacher HR API
    • Teacher Card Access API
  • Student API
    • Student MetaData API, i.e. name,
    • Student Address Info API
    • Student Tuition API
    • Student Awards API
    • Student Card Access API

Our wishful state is that if you are authorized for an API, there is no need to check for further permissions. As I said, wishful. If you apply this concept strictly then you will likely end up with an unmanageable number of APIs that would be counter productive. This would be the case for an enterprise class system. For less complex systems, like customer retail systems, the number of permissions sets may be greatly reduced.

 

With the Blackboard system (when I was working on it), we were enabling support for hundreds of thousands of permission sets that often contained hundreds of permissions each (i.e. each person had their own set, and each set contained permissions to access buildings, Uris, copying machines, etc).

 

An Intermediate API may be ClassAssignmentViewer. In this API, information from the Student Metadata API, Teacher Metadata API and other APIs is combined. Alternatively, it may read directly (read only) from the database.

 

Next Step

Once you have the microservices defined, you can start looking at segmenting the data store to match the microservices. When you leave a classic relational database, you may need to deal with issues such as referential integrity and foreign keys between microservices. If you have the microservice and the database login permissions pre-defined, then these issues are an order of magnitude simpler.

Bottom Line

The above is a sketch of what I discovered about migration process by trying several different approaches and seeing ongoing headaches, or, massive and risky refactoring.

 

With the above, you can start with a small scope and implement it. The existing system keeps functioning and you have created a parallel access point to the data. As functioning sets are completed, you can cut over to some microservices while the rest is running on the classic big API approach. You can eventually have the entire system up in parallel and then do a cut over to these microservice stubs. Over time, you may wish to decouple the data stores, but that can be done later. You need to isolate the CUD into microservices first to be able to do that step.

Saturday, May 28, 2016

Theory about Test Environments

Often in my career I have faced dealing with an arbitrary environment to test in. This environment preceded my arrival, and often was still there at my departure, with many developers having become fatalistic towards it. This is not good.

 

The Rhetorical Goal Recomposed

“We use our test environment to verify that our code changes will work as expected”

While this assures upper management, it lacks specifics to evaluate if the test environment is appropriate or complete. A more objective measurement would be:

  • The code changes perform as specified at the six-sigma level of certainty.

This then logically cascades into sub-measurements:

  • A1: The code changes perform as specified at the highest projected peak load for the next N years (typically 1-2) at the six-sigma level of certainty.
  • A2: The code changes perform as specified on a freshly created (perfect) environment at the six-sigma level of certainty.
  • A3: The code changes perform as specified on a copy of the production environment with random data at the six-sigma level of certainty.

The last one is actually the most critical because too often there is bad data from bad prior released code (which may have been rolled back – but the corrupted data remained!). There is a corollary:

  • C1: The code changes do not need to perform as specified when the environment has had its data corrupted by arbitrary code and data changes that have not made it to production. In other words, ignore a corrupted test environment.

 

Once thru is not enough!

Today’s systems are often multi-layered, with timeouts, blockage under load and other things making the outcome not a certainty but a random event. Above, I cited six sigma – this is a classic level sought in quality assurance of mechanical processes.

 

“A six sigma process is one in which 99.99966% of all opportunities to produce some feature of a part are statistically expected to be free of defects (3.4 defective features per million opportunities).”

 

To translate this into a single test context – the test must run 1,000,000 times and fail fewer than 4 times. Alternatively, 250,000 times with no failures.
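The run counts above follow from simple arithmetic on the 3.4-per-million figure. A minimal sketch of that arithmetic (the function and constant names are illustrative, not from any library):

```python
# Six sigma allows 3.4 defects per million opportunities (from the quote above).
SIX_SIGMA_DEFECT_RATE = 3.4 / 1_000_000


def expected_failures(runs: int, defect_rate: float = SIX_SIGMA_DEFECT_RATE) -> float:
    """Average number of failures expected over `runs` executions."""
    return runs * defect_rate


for runs in (1_000_000, 250_000):
    print(f"{runs:>9,} runs -> {expected_failures(runs):.2f} expected failures")
```

At 250,000 runs the expectation drops below one failure, which is why a clean run of that size is used as the shorthand.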

 

Load testing to reach six-sigma

Load testing will often result in 250,000 calls being made. In some cases, it may mean that the load test needs to run for 24 hours instead of 1 hour. There are some common problems with many load tests:

  • The load test does not run on a full copy of the production environment, which violates A3.
  • The same data is used time and again for the tests, so the random-data requirement of A3 is not met.
    • If you have a system that has been running for 5 years, then the data should be selected from user-created data with 1/5 from each year
    • If the system has had N releases, then the data should be selected from user-created data with 1/N from each release period
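The "equal share from each year or release" selection can be sketched as a stratified random sample. This is only an illustration: the user records, the `created_year` field and the function name are assumptions, not part of any real system described here.

```python
import random
from collections import defaultdict


def pick_stratified(users, period_of, per_period, rng=random):
    """Pick `per_period` random users from every period (year or release).

    `users` is any iterable of user records; `period_of` maps a user to the
    period in which it was created. Both are hypothetical stand-ins.
    """
    by_period = defaultdict(list)
    for user in users:
        by_period[period_of(user)].append(user)

    picked = []
    for period in sorted(by_period):
        candidates = by_period[period]
        # Never ask for more users than the period actually has.
        picked.extend(rng.sample(candidates, min(per_period, len(candidates))))
    return picked


# Example: 5 years of made-up users, 2 picked from each year.
users = [{"id": i, "created_year": 2012 + i % 5} for i in range(100)]
sample = pick_stratified(users, lambda u: u["created_year"], per_period=2)
```

Passing a release number instead of a year to `period_of` gives the 1/N-per-release variant.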

Proposal for a Conforming Pattern

Preliminary development (PD) is done on a virgin system each day. By virgin I mean that databases and other data stores are created from scripts and populated with perfect data. There may be super user data but no common user data.  This should be done by an automated process. I have seen this done in some firms and it has some real benefits:

  • Integration tests must create (instead of borrow) users
    • Integration tests are done immediately after build – the environment is confirmed before any developers arrive at work.
    • Images of this environment could be saved to allow faster restores.
  • Performance is good because the data store is small
  • A test environment is much smaller and can be easily (and cheaply) created on one or more cloud services or even VMs
  • Residue from bad code does not persist (often reducing triage time greatly) – when developers realize they have accidentally corrupted the data, they just blow away the environment and recreate it

After the virgin system is built, the developer’s “release folder scripts” are executed – for example, adding new tables, altering stored procedures, adding new data to system tables. Then the integration tests are executed again. Some tests may fail. A simple solution that I have seen is for these tests to call into the data store to get the version number, and to add an extension to NUnit that indicates whether a test applies before or after this version number. Tests that are expected to fail can then be excluded (and also identified as needing a new version to be written).
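The version-gating idea can be sketched as follows. This uses Python's standard `unittest` rather than NUnit, and `get_schema_version`, `applies_to` and the version numbers are hypothetical names and values chosen for illustration.

```python
import unittest


def get_schema_version() -> int:
    """Hypothetical stand-in: a real suite would query the data store."""
    return 457


def applies_to(min_version=None, max_version=None):
    """Skip the decorated test unless the schema version is in range.

    The version is read once, when the test class is defined.
    """
    version = get_schema_version()
    too_old = min_version is not None and version < min_version
    too_new = max_version is not None and version > max_version
    reason = f"schema version {version} outside [{min_version}, {max_version}]"
    return unittest.skipIf(too_old or too_new, reason)


class UserTableTests(unittest.TestCase):
    @applies_to(max_version=456)
    def test_legacy_layout(self):
        self.assertTrue(True)  # placeholder for a check valid up to version 456

    @applies_to(min_version=457)
    def test_new_layout(self):
        self.assertTrue(True)  # placeholder for the rewritten check
```

The skip reasons show up in the test report, which doubles as the list of tests that need a new version written.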

 

Integration development (ID) applies to the situation where there may be multiple teams working on changes that will go out in a single release. Often it is more efficient to keep the teams in complete isolation for preliminary development – if there are complexities and side-effects, then only one team suffers. A new environment is created, then each team’s “release folder scripts” are executed and the tests are run.

i.e. PD+PD+….+PD = ID

This keeps the number of moving code fragments controlled.

 

Scope of Testing in PD and ID

The A2 level is as far as we can go in this environment. We cannot do A1 or A3.

 

SmokeTest development (STD) means that an image of the production database is made available to the integration team so they can test the code changes using real data. Ideally, they should regress with users created during each release period so artifact issues can be identified. This may be significant testing, but it is not load testing because we do not push up to peak volumes.

Tests either create a new user (in the case of PD and ID) or search for a random user that was created in, say, release cycle 456 (in the case of STD). Of course, code like SELECT TOP 1 *… should not be used; rather, all matching users should be retrieved and one selected at random.

 

This gets us close to A3 if we do enough iterations.

 

Designing Unit Tests for Multiple Test Environments

Designing a UserFactory with a signature such as

UserFactory.GetUser(UserAttributes[] requiredAttributes)

can simplify the development of unit tests that can be used across multiple environments. This UserFactory reads a configuration file which may have  properties such as

  • CreateNewUser="true"
  • PickExistingUser="ByCreateDate"
  • PickExistingUser="ByReleaseDate"
  • PickExistingUser="ByCreateDateMostInactive"

In the first case, a user is created with the desired attributes.  In other cases, the attributes are used to filter the production data to get a list of candidates to randomly pick from.
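A minimal sketch of how such a factory could behave, written in Python rather than C#. The class layout, the dictionary-shaped users and the attribute names are assumptions for illustration; the ordering implied by values such as ByCreateDate is not implemented here, only the create-versus-pick split.

```python
import random


class UserFactory:
    """Hypothetical rendering of the UserFactory idea described above."""

    def __init__(self, config, existing_users=(), rng=None):
        self.config = config                  # e.g. parsed from a config file
        self.existing_users = list(existing_users)
        self.rng = rng or random.Random()

    def get_user(self, required_attributes):
        wanted = set(required_attributes)
        if self.config.get("CreateNewUser") == "true":
            # Fresh environments (PD, ID): build a user with these attributes.
            return {"id": None, "attributes": wanted, "created": True}

        # Production-copy environments (STD): filter, then pick at random
        # rather than always taking the first match.
        candidates = [u for u in self.existing_users if wanted <= u["attributes"]]
        if not candidates:
            raise LookupError(f"no existing user has attributes {sorted(wanted)}")
        return self.rng.choice(candidates)


factory = UserFactory({"CreateNewUser": "true"})
user = factory.get_user(["student"])
```

The same test code can then run unchanged in every environment; only the configuration decides whether a user is created or picked.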

 

In stress scenarios, when we want to test for side-effects due to concurrent operations by the same user, we could use the current second to select the same user for all tests starting in that second.

 

Developers Hiding Significant Errors – Unintentional

At one firm, we successfully established the following guidance:

  • Fatal: When the unexpected happens – for example, the error that was thrown was not mapped to a known error response (i.e. Unexpected Server Error should not be returned)
  • Error: When an error happens that should not happen, i.e. try/catch worked to recover the situation…. but…
  • Warning: When the error was caused by customer input. The input must be recorded in the log (less passwords). This typically indicates a defect in UI, training or child applications
  • Info: everything else, i.e. counts
  • Debug: whatever

We also implemented the ability to change the log4net settings on the fly – so we could, in production, get every message for a short period of time (massive logs)
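The guidance above can be sketched with Python's standard `logging` module standing in for log4net (its CRITICAL level plays the role of Fatal). The logger name, the function names and the `password` key are illustrative assumptions.

```python
import logging

logger = logging.getLogger("app")      # "app" is an illustrative logger name
logger.setLevel(logging.WARNING)       # normal production verbosity


def log_failure(exc, *, mapped, caused_by_input=False, user_input=None):
    """Log an exception at the level suggested by the guidance above."""
    if not mapped:
        # Unexpected: no known error response exists for this exception.
        logger.critical("unmapped error: %r", exc)
    elif caused_by_input:
        # Customer input caused it; record the input, minus any password.
        safe = {k: v for k, v in (user_input or {}).items() if k != "password"}
        logger.warning("bad input %r: %r", safe, exc)
    else:
        # Recovered by a try/except, but it should not have happened.
        logger.error("recovered error: %r", exc)


def set_verbosity(level):
    """Change the level at runtime, e.g. to logging.DEBUG for a short capture."""
    logger.setLevel(level)
```

Calling `set_verbosity(logging.DEBUG)` for a few minutes and then restoring `logging.WARNING` mirrors the short "get every message" window described above.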

Load Stress with Concurrency

Correct load testing is very challenging and requires significant design and statistics to do and validate the results.

 

One of the simplest implementations is to take a week-old copy of the database, capture all of the web request traffic from the last week, and play it back in a reduced time period. When new functionality only extends existing APIs, we are reasonably good – except that we need to make sure we reach the six-sigma level, i.e. were there at least 250,000 calls? This can be further complicated if the existing system has a 0.1% error rate. A 0.1% error rate means 250 errors are expected on average in 250,000 calls; unfortunately, this means that detecting a difference of 1 error in 250,000 calls is impossible from a single run (or even a dozen runs). Often the first stage is to drive error rates down to near zero on the existing code base. I have personally (over several months) driven a 50K/day exception logging rate down to fewer than 10. It can be done – it is just a lot of slow, systematic work (and fighting to get these "not business significant" bug fixes into production). IMHO, they are business significant: they reduce triage time, false leads and bug reports, and thus improve the customer experience with the application.

 

One of the issues is whether the 250,000 calls apply to the system as a whole or just to the method being added or modified. For true six-sigma, it needs to be the method modified – sorry! And if there are 250,000 different users (or other objects) to be tested, then random selection of test data is required.

 

I advocate the use of PNUnit (Parallel NUnit) on multiple machines with a slight twist. In the UserFactory.GetUser() described above, we randomly select the user, but for stress testing we could take the current time in seconds (a long), take it modulo the number of candidate users, and then execute the tests against that user. This approach intentionally creates a situation where concurrent activity will be generated, potentially creating blocks, deadlocks and inconsistencies.
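The seconds-modulo selection can be sketched in a few lines. The function name is illustrative, and the sketch assumes every test machine holds the same candidate list in the same order and has a reasonably synchronized clock.

```python
import time


def pick_concurrent_user(candidates, now=None):
    """Return the same candidate for every caller within the same second."""
    if not candidates:
        raise ValueError("no candidate users")
    # Whole seconds since the epoch; `now` can be injected for testing.
    second = int(time.time() if now is None else now)
    return candidates[second % len(candidates)]
```

Every test that starts within the same second lands on the same user, which is exactly the collision being provoked.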

 

There is a nasty problem with getting integration tests to mirror the production distribution of calls. Marking tests appropriately may help: the test runner can then select the tests to simulate the actual production call distribution and rates. Of course, this requires that data on the call rates and error rates from the production system is available.
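One way a runner could mirror the production mix is a weighted random draw over the marked tests. The test names and call counts below are made up for illustration; real weights would come from production call logs.

```python
import random


def build_run_plan(production_call_counts, total_calls, rng=None):
    """Draw `total_calls` test names in proportion to production call counts."""
    rng = rng or random.Random()
    names = list(production_call_counts)
    weights = [production_call_counts[name] for name in names]
    return rng.choices(names, weights=weights, k=total_calls)


# Made-up call counts, for illustration only.
plan = build_run_plan(
    {"GetUser": 900, "UpdateUser": 90, "DeleteUser": 10},
    total_calls=1000,
    rng=random.Random(42),
)
```

The resulting list is the order in which the runner would invoke the tests; a fixed seed keeps a given plan reproducible.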

 

Make sure that you are giving statistically correct reports!

 

The easy question to answer is “Does the new code make the error rate statistically worse?” Taking our example above of a 0.1% error rate, we had 250 errors expected. If we want 95% confidence, then we would need to see 325 errors to deem it worse. You must stop and think about this, because our stated goal was less than 1 error in 250,000 – and we ignore 75 more errors as not being significant!!! This is a very weak criterion. It also makes clear that driving down the background error rate is essential. You cannot get strong results with a high background error rate; you may only be able to demonstrate a 1-sigma defect rate.
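The exact cut-off depends on the statistical model chosen, and the text does not say how its figure was derived. As one hedged illustration, a one-sided normal approximation to the binomial (an assumption made here, not the author's stated method) gives a cut-off in the mid-270s, lower than the figure quoted above. The conclusion is the same either way: dozens of extra errors are statistically invisible at this background rate.

```python
from math import sqrt
from statistics import NormalDist


def worse_threshold(calls, baseline_rate, confidence=0.95):
    """Error count above which a run looks worse than the baseline.

    Uses the normal approximation to the binomial, one-sided.
    """
    mean = calls * baseline_rate
    sd = sqrt(calls * baseline_rate * (1 - baseline_rate))
    z = NormalDist().inv_cdf(confidence)
    return mean + z * sd


threshold = worse_threshold(250_000, 0.001)
print(f"expected 250 errors; 'worse' starts near {threshold:.0f}")
```

Re-running with a baseline rate of 0.00001 shows how much tighter the cut-off becomes once the background error rate is driven down.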

 

In short, you can rarely demonstrate a better sigma level than your current one unless you first fix the current code base to lower its defect rate.

Thursday, May 12, 2016

The sad state of evidence based development management patterns

I have been in the development game for many decades. I did my first programs using APL/360 and Fortran (WatFiv) at the University of Waterloo, and have seen and coded a lot of languages over the years (FORTH, COBOL, Asm, Pascal, B, C, C++, SAS, etc.).

 

My academic training was in Operations Research – that is, mathematical optimization of business processes. Today, I look at the development processes that I see and they are dominantly “fly by the seat of the pants”, “everybody is doing it” or “academic correctness”. I am not talking about waterfall or agile or scrum. I am not talking about architecture etc. Yet in some ways I am. Some processes assert Evidence Based Management, yet fail to deliver the evidence of better results. Some bloggers detail the problems with EBM. A few books attempt to summarize the little research that has occurred, such as "Making Software: What Really Works and Why we Believe It".

 

As an Operations Research person, I would define the optimization problem facing a development manager, director or lead in terms of the following factors:

  • Performance (which often comes at increased man hours to develop and operational costs)
  • Scalability (which often comes at increased man hours to develop and operational costs)
  • Cost to deliver
  • Accuracy of deliverable (Customer satisfaction)
  • Completeness of deliverable
  • Elapsed time to delivery (a shorter time often exponentially increases cost to deliver and defect rates)
  • Ongoing operational costs (a bad design may result in huge cloud computing costs)
  • Time for a new developer to become efficient across the entire product
  • Defect rate
    • Number of defects
    • ETA from reporting to fix
  • Developer resources
    • For development
    • For maintenance

All of these factors interact. As for evidence, there are no studies and I do not expect there to be any. Technology is changing too fast, there are huge differences between projects, and any study will be outdated before it is usable. There is, however, some evidence that we can work from.

Lines of Code across a system

Lines of code directly impacts several of the above.

  • Defect rate is a function of the number of lines of code, ranging from 200/100K to 1000/100K lines [source], scaled by developer skill level. Junior or new developers will have a higher defect rate.
  • Some classic measures are defined in the literature, for example cyclomatic complexity. Studies find a positive correlation between cyclomatic complexity and defects: functions and methods that have the highest complexity tend to also contain the most defects.
  • Time to deliver is often a function of the lines of code written.

There is a mistaken belief that lines of code is an immutable for a project. In the early 2000s I led a rewrite of a middle tier and backend tier (with the web front end being left as is). The original C++/SQL Server code base was 474,000 lines of code and was the result of 25 man years of coding. With a team of 6 new (to the application) developers sent over from India and 2 intense local developers, we recreated these tiers with 100% API compliance in just 25,000 lines of code in about 8 weeks. 25 man years –> 1 man year, and a roughly 20-fold decrease in code base. The last factor was a 20-fold increase in concurrent load.
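To make the link between code size and defects concrete, the defect-rate range cited earlier (200 to 1000 per 100K lines) can be applied to both code bases. This is back-of-the-envelope arithmetic only; the function name is illustrative and the rates are the cited range, not measurements from this project.

```python
def defect_range(lines_of_code, low_per_100k=200, high_per_100k=1000):
    """Expected defect range for a code base at the rates cited above."""
    return (lines_of_code * low_per_100k // 100_000,
            lines_of_code * high_per_100k // 100_000)


for label, loc in (("original", 474_000), ("rewrite", 25_000)):
    low, high = defect_range(loc)
    print(f"{label}: {loc:,} lines -> {low} to {high} expected defects")

print(f"size ratio: {474_000 / 25_000:.1f}x")
```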

 

On other projects I have seen massive copy and paste (with some minor changes) that resulted in code bloat. When a bug was discovered, it was often only fixed in some of the pastes. Martin Fowler describes Lines of Code as a measure of developer productivity as useless; the same applies to lines of code in a project. A change of programming language can result in a 10-fold drop (or increase) in lines of code. A change of developer can also result in a similar change, depending on skill sets.

 

Implementation Design

The use of Object-Relational Mapping (ORM) can often result in increased lines of code, more defects, steeper learning curves and greater challenges in addressing performance issues. A simple illustration is to move all addresses in Washington State from a master table to a child table. In SQL Server T-SQL it is a one-line statement; calling it from C# amounts to 4 lines of code. Using an ORM, this can quickly grow to 100-200 lines. ORMs came along because of a shortage of SQL developer skills. As with most things, they carry hidden costs that are omitted in the sales literature!
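A set-based move of this kind can be sketched with Python's built-in `sqlite3` standing in for SQL Server. The table and column names are hypothetical, and SQLite needs two statements inside one transaction where T-SQL can express it as a single statement (for example with an OUTPUT … INTO clause on DELETE).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE address (id INTEGER PRIMARY KEY, city TEXT, state TEXT);
    CREATE TABLE address_wa (id INTEGER PRIMARY KEY, city TEXT, state TEXT);
    INSERT INTO address (city, state) VALUES
        ('Seattle', 'WA'), ('Portland', 'OR'), ('Spokane', 'WA');
""")

# Set-based move: copy the matching rows, then delete them, in one transaction.
with conn:
    conn.execute("INSERT INTO address_wa SELECT * FROM address WHERE state = 'WA'")
    conn.execute("DELETE FROM address WHERE state = 'WA'")

moved = conn.execute("SELECT COUNT(*) FROM address_wa").fetchone()[0]
left = conn.execute("SELECT COUNT(*) FROM address").fetchone()[0]
```

No rows are loaded into application objects at any point, which is where the line-count difference against an ORM comes from.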

 

“Correct academic design” does not mean effective (i.e. low cost) development. One of the worst systems (for performance and maintenance) that I have seen was absolutely beautifully designed with a massive array of well-defined classes, which unfortunately ignored the database reality. Many calls of a single method cascaded through these classes and resulted in 12 – 60 individual SQL queries being executed against the database. Most of the methods could be converted to a wrapper on a single stored procedure with a major improvement in performance. The object hierarchy was flattened (or downsized!).

 

I extend the concept of cyclomatic complexity to the maximum stack depth in developer-written code. The greater the depth, the longer it takes to debug (because the developer has to walk through the stack) and, likely, to write. The learning curve goes up. I suggest a maximum depth of 7 (less than cyclomatic complexity), ideally 5. This number comes out of research on short-term memory (wikipedia). Going beyond seven significantly increases the effort that a developer needs to make to understand the stack. Having a deep hierarchy of objects looks nice academically, but it is counterproductive for efficient coding. Seven is a magic number to keep asking “Why do we have more than seven ….”

Developer Skill Sets

Many architects suffer from the delusion that all developers are as skilled as they are, i.e. IQs over 145. During my high school teaching years, I was assigned both gifted classes and challenged classes, and learned to present appropriately to both. In some cities (for example Stockholm, Sweden), 20% of the work force is in IT. This means that the IQ of the developers likely ranges from 100 upwards. When an application is released, the support developers will likely end up with an average IQ around 100. The question must be asked: how simple is the code to understand for future enhancements and maintenance?

 

If a firm has a policy of significant use of off-shore or contractor resources, there are  further challenges:

  • A high percentage of the paid time is in ramp-up mode
  • There is a high level of non-conformity to existing standards and practices.
    • Higher defect rate, greater time for existing staff to come up to speed on the code
  • Size of team and ratio of application-experienced versus new developers can greatly alter delivery schedules (see Brooks’s law)

Pseudo-coding different architectures rarely happens. It has some advantages: code up the most complex logic and then ask the question, “A bug happens and nothing comes back; what are the steps to isolate the issue with certainty?” The architecture with the fewest diagnostic steps may be the more efficient one.

 

Last, consider the availability, now and in the future, of developers with the appropriate skills. The industry is full of technology that was hot and promised the moon and then was disrupted by a new technology (think of Borland Delphi and Pascal!). I often compute a weighted value composed of years since launch, popularity at the moment and trend to refine choices (and in some cases to say No to a developer or architect who wants to play with the latest and greatest!). Some sites that help are DB-Engines Ranking and PYPL. After short listing, it’s a matter of coding up some complex examples in each and counting the lines of code needed.
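The weighted value can be sketched as below. The weights, the 10-year maturity cap, the input scales and the two candidates are all illustrative guesses made for this sketch; the text does not give the actual weighting used.

```python
def technology_score(years_since_launch, popularity, trend,
                     weights=(0.4, 0.4, 0.2), mature_at_years=10):
    """Weighted score in [0, 1]; all weights here are illustrative guesses.

    popularity: current share, scaled 0..1
    trend:      recent direction, scaled -1 (declining) .. +1 (growing)
    """
    maturity = min(years_since_launch / mature_at_years, 1.0)
    trend_01 = (trend + 1) / 2          # map -1..1 onto 0..1
    w_age, w_pop, w_trend = weights
    return w_age * maturity + w_pop * popularity + w_trend * trend_01


# Made-up candidates, not real rankings.
candidates = {
    "EstablishedTech": technology_score(15, 0.30, -0.1),
    "HotNewTech": technology_score(2, 0.05, 0.9),
}
shortlist = sorted(candidates, key=candidates.get, reverse=True)
```

With these made-up inputs the established option ranks first despite its slightly negative trend, which is the kind of result that supports saying No to the latest and greatest.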

Specification Completeness And Stability

On one side, I have worked with a few PMs that deliver wonderful specifications (200-500 pages) that had no change-orders between the first line of code being written and final delivery a year later. What was originally handed to developers was not changed. Work was done in sprints. The behavior and content of every web page was detailed. There was a clean and well-reviewed dictionary of terms and meanings. Needless to say, delivery was prompt, on schedule, etc.

 

On the other side, I have had minor change-requests which mutated constantly. The number of lines of code written over all of these changes was 20x the number of lines of code finally delivered.

Concurrent Development

Concurrent development means that two or more sets of changes are happening to the same code base. At one firm we had several GitHub forks: Master, Develop, Sprint, Epic and Saga. The titles indicate when the changes were expected to be propagated to master. It worked reasonably, but I often ended up spending two days resolving conflicts and fixing bugs that were introduced whenever I attempted to get forks in sync. Concurrent development increases overhead exponentially with the number of independent forks that are active. Almost everything in development has exponential cost with size; there is no economy of scale in development.

 

On the flip side, at Amazon using the microservices model, there was no interaction between feature requests. Each API was self-contained and would evolve independently. If an API needed another API changed, then the independent API would be changed, tested and released. The dependent API was then developed against the released independent API. There was no code-juggling act. Each API code base had a single line of development and was self-contained. Dependencies were on APIs, not on libraries and code bases.

 

Bottom Line

Controlling costs and improving delivery depends greatly on the preparation work IMHO -- namely:

  • Specification stability and completeness
  • Architecture / design being well crafted for the developer population
  • Minimum noise (i.e. no concurrent development, change orders, change of priorities)
  • Methodology (Scrum, Agile, Waterfall, Plan Driven) is of low significance IMHO – except for those selling it and ‘true believers’.

On the flip side, often the business will demand delivery schedules that add technical debt and significantly increase ongoing costs.

 

A common problem that I have seen is solving this multi-dimensional problem by looking at just one (and rarely two) dimensions and discovering the consequences of that decision downstream. I will continue to add additional dimensions as I recall them from past experience.