Thursday, October 20, 2016

Docker for Angular 2 devs

Docker is a Virtual Environment
Docker containers are great for adding new developers to existing projects, or for learning new technologies without polluting your existing developer machine/host. Docker allows you to put a fence around the environment while still using it.

Why Docker for Angular 2?
Docker is an easy way to get up and going on a new stack, environment, tool, or operating system without having to learn how to install and configure it. A collection of Docker images is available from Docker Hub, ranging from simple to incredibly complex -- saving you time and energy.

Angular 2 examples frequently include a Dockerfile in the repository, which makes getting the example up and running much quicker because you don't have to focus on package installation and configuration.

The base Angular 2 development stack uses Node, TypeScript, Typings, and a build system (such as SystemJs or Webpack). Instead of learning each stack element before/while learning Angular 2, just focus on Angular 2 itself -- by using a Dockerfile to bring up a working environment.

The repositories for Angular 2 projects have a package.json file at the root, which is common for NodeJs/NPM projects. The Docker build installs the packages listed in package.json as part of the build. When the container starts, it can also transpile the TypeScript code and start a static file web server -- if the package.json has a start script.

In order to get a new development environment up and a new project in the browser, you just need to build the image from the Dockerfile, then run it. Running these two commands at the terminal/cli saves you the time of finding and learning the Angular 2 stack, and then building and running the project.

The Angular 2 Quickstart
For this article, I use the Angular 2 Quickstart repository including the Dockerfile found in the repository.

I use a Macintosh laptop. If you are using a Windows-based computer/host, you may run into more or different issues than those covered in this article.

Docker via Terminal/Cli
I prefer the code-writing environment and web browser already installed and configured on my developer laptop/host. I configure the Docker container to share the project files that live on the host. My changes are reflected in the container – and I run the Angular 2 project in watch mode so the changes immediately force a recompile in the container.

Viewing the Angular Project in a Browser
Since the Angular 2 project is a website, I map the container's port to the host's port – so the running Angular project is reachable from a web browser on the host laptop at http://localhost:3000.

Install Docker
Before you install Docker, make sure you have a bit of space on the computer. Docker, like Vagrant and VirtualBox, uses a lot of space.

Go to Docker and install it. Start Docker up.

Check Docker
Open a terminal/cli and check that the install worked and Docker started by requesting the Docker version:


docker -v
>Docker version 1.12.1, build 6f9534c 

If you get a docker version as a response, you installed and started Docker correctly.

Images and Containers
Docker images are defined in the Dockerfile and represent the virtual environment to be built. A container is a running instance of an image. You can have many containers based on one image.

Each image is named and each container can also be named. You can use these names to indicate ownership (who created it), as well as base image (node), and purpose (xyzProject).

Pick a naming schema for your images and containers and stick with it.

I like to name my images with my github name and the general name such as dfberry/quickstart. I like to name the containers with as specific a name as possible such as ng2-quickstart.

The list of containers (running or stopped) shows both names, which can help you organize and find the container you want.

The Angular 2 Docker Image
The fastest way to get going with Docker for Angular 2 projects is to use the latest node as your base image -- which is also what the Angular 2 quickstart uses.

The image has the latest node, npm, and git. Docker hub hosts the base image and Node keeps it up to date.

Docker's philosophy is that the containers are meant to execute then terminate with the least privileges possible. In order to make a container work as a development container (i.e. stay up and running), I'll show some not-best-practice choices. This will allow you to get up and going quickly. When you understand the Docker flow, you can implement your own security.

The Docker Images
Docker provides no images on installation. I can see that using the command


docker images 

When I build the nodejs image, it will appear in the list with information about the image.



For now, the two most important columns are the REPOSITORY and IMAGE ID. The REPOSITORY field is the image name I used to build the image. My naming schema indicates my user account (dfberry) and the base image or purpose (node). This helps me find it in the image list.

The IMAGE ID is the unique id used to identify the image.

The Dockerfile
In order to create a docker image, you need a Dockerfile (notice the filename has no extension). This is the file the docker cli will assume you want to use. For this example, the Dockerfile is small. It has the following features:
  • creates a group
  • creates a user
  • creates a directory structure with appropriate permissions
  • copies over the package.json file from the host
  • installs the npm packages listed in the package.json
  • runs the package.json's "start" script – which should start the website

For now, make sure this is the only Dockerfile in the root of the project, or anywhere below the root.


# To build and run with Docker:
#
#  $ docker build -t ng-quickstart .
#  $ docker run -it --rm -p 3000:3000 -p 3001:3001 ng-quickstart
#
FROM node:latest

RUN mkdir -p /quickstart /home/nodejs && \
groupadd -r nodejs && \
useradd -r -g nodejs -d /home/nodejs -s /sbin/nologin nodejs && \
chown -R nodejs:nodejs /home/nodejs

WORKDIR /quickstart
COPY package.json typings.json /quickstart/
RUN npm install --unsafe-perm=true

COPY . /quickstart
RUN chown -R nodejs:nodejs /quickstart
USER nodejs

CMD npm start

The nodejs base image will install nodejs, npm and git. The image will just be used for building and hosting the Angular 2 project.

If you have scripts that do the bulk of your build/startup/run process, change the Dockerfile to copy that file to the container and execute it as part of the build.

Build the Image
Usage: docker build [OPTIONS] PATH | URL | -

In order to build the image, use the docker cli.


docker build -t <user>/<yourimagename> .
Example $: docker build -t dfberry/ng-quickstart .


If you don't want to annotate the user, just leave that off.


docker build -t <yourimagename> .
Example $: docker build -t ng-quickstart .

Note: the '.' at the end of the command is the location of the build context, the folder that contains the Dockerfile. I could have used a Github repository url instead of the local folder.

In the example above, the REPOSITORY name is 'ng-quickstart'. If you don't use the -t naming param, your image will have a name of <none>, which is annoying when unnamed images pile up on a team server.

The build will give you some feedback to let you know how it is going.


Sending build context to Docker daemon 3.072 kB 
Step 1 : FROM node:latest 

... 

Removing intermediate container 2cb50f334393 
Successfully built 1265b22b5b90

Since the build can return a lot of information, I didn't include the entire response.

The build of the quickstart takes less than a minute on my Mac.

The last line gives you the IMAGE ID. Remember to view all docker images after building to check it worked as expected.


docker images

Run the Container
Usage: docker run [OPTIONS] IMAGE [COMMAND] [ARG...] 

Now that the image is built, I want to run the image to see the website.

If you don't have an Angular 2/Typescript website, use the ng2 Quickstart.

Run switches
The run command has a lot of switches and configurations. I'll walk you through the choices for this container.

I want to name the container so that I remember the purpose. This is optional but helpful when you have a long list of containers.


--name ng2-quickstart 

I want to make sure the container's web ports are matched to my host machine's port so I can see the website as http://localhost:3000. Make sure the port isn't already in use on your host machine.


-p 3000:3000 

I want to map my HOST directory (/Users/dfberry/quickstart) to a directory in the container (/home/nodejs/quickstart) so I can edit on my laptop and the changes are reflected in the container. Note that the Dockerfile shown earlier creates /quickstart and /home/nodejs and uses /quickstart as its working directory; if your image serves the project from /quickstart, map the host directory to /quickstart instead.


-v /Users/dfberry/quickstart/:/home/nodejs/quickstart 

I want the terminal/cli to show the container's responses including transpile status and the file requests.


-it 


The full command is:


docker run -it -p 3000:3000 -v /Users/dfberry/quickstart:/home/nodejs/quickstart --name ng2-quickstart dfberry/ng-quickstart

Notice the image is named dfberry/ng-quickstart while the container is named ng2-quickstart.
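As a cross-check on how the switches fit together, here is a minimal Python sketch that assembles the same command from its parts. The function name docker_run_command and its parameters are made up for illustration; only the switches themselves (-it, -d, -p, -v, --name) come from this article, and -d is covered further down.

```python
import shlex

def docker_run_command(image, name, host_port, container_port,
                       host_dir, container_dir, detached=False):
    """Assemble the `docker run` argument list from the switches discussed above."""
    mode = "-d" if detached else "-it"          # detached vs. interactive terminal
    return [
        "docker", "run", mode,
        "-p", f"{host_port}:{container_port}",  # host port : container port
        "-v", f"{host_dir}:{container_dir}",    # host folder : container folder
        "--name", name,                         # container name
        image,                                  # image name goes last
    ]

args = docker_run_command(
    image="dfberry/ng-quickstart",
    name="ng2-quickstart",
    host_port=3000, container_port=3000,
    host_dir="/Users/dfberry/quickstart",
    container_dir="/home/nodejs/quickstart",
)
print(" ".join(shlex.quote(a) for a in args))
```

Keeping the pieces separate makes it harder to swap the host and container sides of -p and -v, which is an easy mistake to make when typing the command by hand.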

Run the "docker run" command at the terminal/cli.

The container should be up and the website should be transpiled and running.


At this point, you should be able to work on the website code on your host with your usual editing software and the changes will reflect in the container (re-transpile changes) and in the web browser.

List Docker Containers
In order to see all the containers, use


docker ps -a

If you only want to see the running containers, leave the -a off.



docker ps



At this point, the easy docker commands are done and you can work on the website and forget about Docker for a while. When you are ready to stop the container, stop Docker or get out of interactive mode, read on.

Interactive Mode (-it) versus Detached Mode (-d)
Interactive Mode means the terminal/cli shows what is happening to your website in the container. Detached mode means the terminal/cli doesn't show what is happening and the terminal/cli returns to your control for other commands on your host.

To move from interactive to detached mode, press control + p, then control + q.

This leaves the container up but you have no visual/textual feedback about how the website is working from the container. You can use the developer tools/F12 in the browser to get a sense, but won't be able to see http requests and transpiles.

You are either comfortable with that or not.

If you want the interactive mode and the website transpile/http request information, don't exit interactive mode. When you are finished, use control + c. This stops the container, but doesn't remove the image. The container itself is only removed automatically if it was started with --rm; otherwise remove it (docker rm ng2-quickstart) before using the same run command above, because the container name is already taken.

If you are more comfortable in detached mode, where the website gives transpiles and http request information via a different method such as logging to a file or cloud service, change the docker run command.

Instead of using -it as part of the "docker run" command, use -d for detached mode.

Exec Mode to run commands on container
Usage: docker exec [OPTIONS] CONTAINER COMMAND [ARG...]

When you want to connect to the container, you use the same -it for interactive mode but with "docker exec". The end command tells the docker container what environment to enter in the container -- such as the bash shell.


docker exec -it ng2-quickstart /bin/bash

You can log in as root if you need elevated privileges.


docker exec -it -u root ng2-quickstart /bin/bash

The terminal/cli should now show the prompt changed to indicate you are now on the container:


nodejs@faf83c87c12e:/quickstart$  

When you are done running the commands, use control + p + control + q to exit. The container is still running.

Sudo or Root
In this particular quickstart nodejs docker container, sudo has not been installed. Sudo may be your first choice, and you can install it. Or you could use "docker exec" with root. Either way has pros and cons.

Stopping and Starting the Container
When you are done with the container, you need to stop it. You can stop, and restart it as you need by container id or name.


docker stop ng2-quickstart 
docker stop 7449222ec26b 

docker start ng2-quickstart 
docker start 7449222ec26b 

Stopping the container may take some time – be patient. Mine takes up to 10 seconds on my Mac. When you restart the container, it is in detached mode. If you really want interactive mode, remove the container and use docker run again with -it.

Cleanup
Make sure to stop all containers when they are not needed. When you are done with a container, you can remove it


docker rm -fv ng2-quickstart

When you are done with an image, you can remove that as well


docker rmi ng-quickstart 

Stop Docker
Remember to stop Docker when you are done with the containers for the moment or day.

Monday, September 5, 2016

An interesting Interview Question: Fibonacci Sequence

"Write a function to calculate the nth number in the Fibonacci sequence" is a common interview question, and often the solution is something like

 

int Fib(int n)
{
   if (n < 2) return n;   // base cases: Fib(0) = 0, Fib(1) = 1
   return Fib(n-1) + Fib(n-2);
}

The next question is to ask, for n=100, how many calls will be pushed onto the stack in total. The answer is not 100 but actually horrible! It grows exponentially, by a factor of roughly 1.6 for each increment of n, with 2^100 as an upper bound. (The stack is never more than about 100 frames deep at any one moment; it is the total number of calls that explodes.)

Take the first call – it starts one call for Fib(99) and one for Fib(98). There is nothing to allow Fib(99) to borrow the result of Fib(98). So one step turns one call into two. Each subsequent call that is not a base case does the same. For example

  • 2 -> calls [Fib(1), Fib(0)]
  • 3 -> calls [Fib(2) -> [Fib(1), Fib(0)], Fib(1)]
  • 4 -> calls [Fib(3) -> [Fib(2) -> [Fib(1), Fib(0)], Fib(1)], Fib(2) -> [Fib(1), Fib(0)]]

Missing this issue is very often seen with by-rote developers (who are excellent for some tasks).

 

A better solution is to cache the values as each one is computed – effectively creating a lookup table. You are trading stack space for memory space.
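To make the difference concrete, here is a minimal Python sketch (not the code from the interview) that counts the calls made by the naive recursion and compares it with a cached version. It assumes the usual convention Fib(0) = 0 and Fib(1) = 1; the names fib_naive and fib_cached are made up for illustration.

```python
def fib_naive(n, counter):
    """Plain recursion; counter[0] tallies every call that is made."""
    counter[0] += 1
    if n < 2:
        return n
    return fib_naive(n - 1, counter) + fib_naive(n - 2, counter)

def fib_cached(n, table=None):
    """Same recursion, but each value is computed once and kept in a lookup table."""
    if table is None:
        table = {0: 0, 1: 1}
    if n not in table:
        table[n] = fib_cached(n - 1, table) + fib_cached(n - 2, table)
    return table[n]

calls = [0]
fib_naive(20, calls)
print(calls[0])          # total calls the naive version makes for n = 20
print(fib_cached(100))   # one table entry per value from 0 to 100
```

The cached version still recurses about n levels deep, so the lookup table costs memory proportional to n; that is the trade mentioned above.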

 

Placing constraints on memory and stack space may force the developer to do some actual thinking. A solution that conforms to this is shown below

 

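  private static long Fibonacci(int n) {
        long a = 0L;   // F(k), starting at k = 0
        long b = 1L;   // F(k+1)
        for (int i = 31; i >= 0; i--)  //31 is arbitrary, see below
        {
            // Doubling step: from (F(k), F(k+1)) to (F(2k), F(2k+1))
            long d = a * (b * 2 - a);
            long e = a * a + b * b;
            a = d;
            b = e;
            // If bit i of n is set, advance one more step to (F(2k+1), F(2k+2))
            if ((((uint)n >> i) & 1) != 0) {
                long c = a + b;
                a = b;
                b = c;
            }
        }
        return a;
    }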

 

The output of the above shows what is happening, and suggests that the "31" can likely be replaced by a value derived from the log base 2 of n (the position of its highest set bit) to improve efficiency.
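A minimal Python sketch of that idea follows. It uses the same doubling identities as the function above, but starts the loop at the highest set bit of n instead of at a fixed 31. It is an illustration rather than a port: Python integers do not overflow, which is not true of the long used above.

```python
def fib_fast_doubling(n):
    """Return F(n) by scanning the bits of n from the highest set bit down."""
    a, b = 0, 1                               # (F(k), F(k+1)) with k = 0
    for i in range(n.bit_length() - 1, -1, -1):
        d = a * (2 * b - a)                   # F(2k)
        e = a * a + b * b                     # F(2k+1)
        a, b = d, e
        if (n >> i) & 1:                      # bit set: step from 2k to 2k+1
            a, b = b, a + b
    return a

print(fib_fast_doubling(129))
```

For n = 129 the loop runs 8 times (129 needs 8 bits) rather than 32. The leading iterations of the fixed-31 version are wasted because a stays 0 and b stays 1 until the first set bit is reached.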

[Images: output of the loop, followed by the output for 32, 65, and 129]

 

What is the difference in performance for the naive vs the latter?

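I did not actually wait for the naive solution to finish… I aborted at 4 minutes.

[Image: timing output]

The new improved version took 85 ms. Measured against the 4 minutes (240,000 ms) at which I aborted, that is an improvement on the order of 3000-fold, and only a lower bound because the naive run never finished.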

Take Away

This question lets you:

  1. Identify whether a person knows what recursion is and can code it.
  2. Identify whether they understand the consequences of recursion and how it will be executed (i.e. think about what the code does).
    1. Most recursion questions are atomic (e.g. factorial) and not composite (recursion that is not simple).
  3. See whether they are able to analyze a simple mathematical issue and generate a performant solution.

Sunday, August 28, 2016

Apple Store Passbook UML Diagrams and Error Messages

While working on a recent project, a major stumbling block was a lack of clear documentation of what happened where. This was confirmed when I attempted to search for some of the messages the iPhone returned to the Log REST endpoints. There were zero hits!

 


 

In terms of a Store Card, let us look at the apparent Sequence Diagram

 

[Image: sequence diagram for a Store Card]

 

Log Errors Messages Seen and Likely Meaning

  • Passbook inactive or deleted, or someone changed the Auth Token
    • [2016-08-28 11:57:01 -0400] Unregister task (for device ceed8761e584e814ed4fe73cbb334ee9, pass type pass.com.reddwarfdogs.card.dev, serial number 85607BFE98D91A-765F7B05-D5E4-4B32-B16D-69C2038EF522; with web service url https://llc.reddwarfdogs.com/passbook) encountered error: Authentication failure
    • [2016-08-28 20:44:25 +0700] Register task (for device 19121d6b570b31a3fa56dbd45411c933, pass type pass.com.reddwarfdogs.card.dev, serial number 85607BFE98D91A-765F7B05-D5E4-4B32-B16D-69C2038EF522; with web service url https://llc.reddwarfdogs.com/passbook) encountered error: Authentication failure
    • [2016-08-24 10:04:38 +0800] Web service error for pass.com.reddwarfdogs.card.dev (https://llc.reddwarfdogs.com/passbook): Update requested for unregistered serial number 8C6772F099D51AA3-7A32F5FB-F7F8-4285-A2A2-79FC66DF942C
  • Bad Record Keeping in your application
    • [2016-08-23 19:58:35 -0700] Web service error for pass.com.reddwarfdogs.card.dev (https://llc.reddwarfdogs.com/passbook): Server ignored the 'if-modified-since' header (Tue, 23 Aug 2016 16:54:10 GMT) and returned the full unchanged pass data for serial number '8C6771F89ED51DAA-AAF3100E-C365-4CCD-8C95-ADC974F52894'.
    • [2016-08-23 16:49:38 -0700] Get pass task (pass type pass.com.reddwarfdogs.card.dev, serial number 8C6771F89ED31FAE-57ED753A-8464-408E-95EF-CEF75DBB30D6, if-modified-since Tue, 09 Aug 2016 21:57:32 GMT; with web service url https://llc.reddwarfdogs.com/passbook) encountered error: Received invalid pass data (The pass cannot be read because it isn’t valid.)
      • Cause: Corruption OR change of Certificate used to sign Passbook
    • [2016-08-23 13:56:44 -0700] Web service error for pass.com.reddwarfdogs.card.dev (https://llc.reddwarfdogs.com/passbook): Server requested update to serial number '8C6771F89ED41BAC-FFBF3B69-98F1-4F2A-A8B7-5AF457558EE7', but the pass was unchanged.
    • [2016-08-23 11:58:25 -0700] Web service error for pass.com.reddwarfdogs.card.dev (https://llc.reddwarfdogs.com/passbook): Device received spurious push. Request for passesUpdatedSince '20160823180851' returned no serial numbers. (Device = 2c04d18e5f8480f97bb9318b4065dba0)
    • [2016-08-08 10:23:57 -0700] Web service error for pass.com.reddwarfdogs.card.dev (https://llc.reddwarfdogs.com/v1/passbook): Response to 'What changed?' request included 1 serial numbers but the lastUpdated tag (20160808172351) remained the same.
      • Cause: Duplicate push notification sent to a device, or a logic error. If the tag is 1234, then the server logic should select passes with a tag > 1234 and NOT >= 1234
  • Apple gives little guidance on status codes and how the iPhone will react
    • [2016-08-23 15:46:33 +0700] Get serial #s task (for device 6f175696d73dec465c561f4d3ee2dfe7, pass type pass.com.reddwarfdogs.card.dev, last updated (null); with web service url https://llc.reddwarfdogs.com/passbook) encountered error: Unexpected response code 504
    • [2016-08-23 01:42:53 -0700] Get serial #s task (for device 2c04d18e5f8480f97bb9318b4065dba0, pass type pass.com.reddwarfdogs.card.dev, last updated 20160823083910; with web service url https://llc.reddwarfdogs.com/passbook) encountered error: Unexpected response code 408
    • [2016-08-08 18:53:00 +0800] Get serial #s task (for device 726996d0f44f44b19f157aa0824f64cf, pass type pass.com.reddwarfdogs.card.dev, last updated (null); with web service url https://llc.reddwarfdogs.com/passbook) encountered error: Unexpected response code 596

I suspect there are more messages – I have just not stumbled across them yet.
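The "lastUpdated tag remained the same" message above comes down to one comparison operator. Here is a minimal Python sketch of the server-side selection, assuming the tags are "yyyyMMddHHmmss" strings (which sort chronologically) and that the response carries lastUpdated and serialNumbers fields, as the wording of the log messages suggests. The function name and the in-memory dictionary are made up for illustration.

```python
def serials_updated_since(passes, previous_tag):
    """passes maps serial number -> last-updated tag (yyyyMMddHHmmss string)."""
    changed = [serial for serial, tag in passes.items()
               if previous_tag is None or tag > previous_tag]  # strictly newer, never >=
    if not changed:
        return None  # nothing new: answer with no serial numbers
    return {"lastUpdated": max(passes[s] for s in changed),
            "serialNumbers": sorted(changed)}

passes = {"A": "20160808172351", "B": "20160808180000"}
print(serials_updated_since(passes, "20160808172351"))
```

With >= instead of >, pass A would be returned again with the same tag the device just sent, which is the situation the error message describes.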

Friday, August 26, 2016

Solving PushSharp.Apple Disconnect Issue

While doing a load test of a new Apple Passbook application, I suddenly saw some 200K transmission errors from my WebApi application. Searching the web, I found reports that a “high” rate of connects/disconnects to the Apple Push Notification Service (APNS) causes APNS to force a disconnect.

 

While Apple does have a limit (very, very high) on the number of notifications before they will refuse connections for an hour, the limit for connect/disconnect is much lower. After playing around a bit, I found that if I persisted the connection via a static, I no longer had this issue.

 

Below is a sample of the code.

  • Note: we disconnect and reconnect whenever an error happens (I have not seen an error yet) 

 

using Newtonsoft.Json.Linq;
using PushSharp.Apple;
using System;
using System.Collections.Generic;
using System.Security.Cryptography.X509Certificates;
using System.Text;

namespace RedDwarfDogs.Passbook.Engine.Notification
{
    public class AppleNotification : INotification
    {
        private readonly IPassbookSettings _passbookSettings;
        private readonly ILogger _logger;

        // Static, so the broker (and its APNS connection) is shared between calls
        private static ApnsServiceBroker _apnsServiceBroker;
        private static object lockObject = new object();

        public AppleNotification(ILogger logger, IPassbookSettings passbookSettings)
        {
            _logger = Guard.EnsureArgumentIsNotNull(logger, "logger");
            _passbookSettings = Guard.EnsureArgumentIsNotNull(passbookSettings, "passbookSettings");
        }

        public void SendNotification(HashSet<string> deviceTokens)
        {
            if (deviceTokens == null || deviceTokens.Count == 0)
            {
                return;
            }
            try
            {
                _logger.Write("PassbookEngine_SendNotification_Apple");
                // Create a new broker if needed
                if (_apnsServiceBroker == null)
                {
                    X509Certificate2 cert = _passbookSettings.ApplePushCertificate;
                    if (cert == null)
                        throw new InvalidOperationException("pushThumbprint certificate is not installed or has invalid Thumbprint");
                    var config = new ApnsConfiguration(ApnsConfiguration.ApnsServerEnvironment.Production,
                        cert, false);
                    _logger.Write("PassbookEngine_SendNotification_Apple_Connect");
                    _apnsServiceBroker = new ApnsServiceBroker(config);
                    // Wire up events
                    _apnsServiceBroker.OnNotificationFailed += (notification, aggregateEx) =>
                    {
                        aggregateEx.Handle(ex =>
                        {
                            _logger.Write("Apple Notification Failed", "Direct", ex);
                            _logger.Write("PassbookEngine_SendNotification_Apple_Error");
                            // See what kind of exception it was to further diagnose
                            if (ex is ApnsNotificationException)
                            {
                                // Not used further here; kept as a starting point for diagnosis
                                var notificationException = (ApnsNotificationException)ex;
                                var apnsNotification = notificationException.Notification;
                                var statusCode = notificationException.ErrorStatusCode;
                            }
                            _logger.Write("SendNotification", "PushToken Rejected", ex);
                            // We reset to null to recreate / connect
                            Restart();
                            return true;
                        });
                    };
                    _apnsServiceBroker.OnNotificationSucceeded += (notification) =>
                    {
                    };
                }
                var sentTokens = new StringBuilder();
                lock (lockObject)
                {
                    // Start the broker
                    _apnsServiceBroker.Start();
                    foreach (var deviceToken in deviceTokens)
                    {
                        if (string.IsNullOrWhiteSpace(deviceToken) || deviceToken.Length < 32 || deviceToken.Length > 256 || deviceToken.Contains("-"))
                        {
                            // Invalid token, keep in Apple's good books.
                            // We use GUIDs (which contain "-") for faking push tokens. Do not send them to Apple.
                            // We do not want to get blacklisted.
                        }
                        else
                        {
                            // Queue a notification to send
                            var queuedNotification = new ApnsNotification
                            {
                                DeviceToken = deviceToken,
                                Payload = JObject.Parse("{\"aps\":{\"badge\":7}}")
                            };
                            try
                            {
                                _apnsServiceBroker.QueueNotification(queuedNotification);
                                sentTokens.AppendFormat("{0} ", deviceToken);
                            }
                            catch (System.InvalidOperationException)
                            {
                                // Assuming already in queue
                            }
                        }
                    }
                    try
                    {
                        // duplicate signals may occur
                        _apnsServiceBroker.Stop();
                    }
                    catch { }
                }
                var auditLog = new Log
                {
                    Message = sentTokens.ToString(),
                    RequestHttpMethod = "Post"
                };
                _logger.Write("Passbook", PassbookLogMessageCategory.SendNotification.ToString(),
                    "PassbookAudit", "Passbook", auditLog);
            }
            catch (Exception exc)
            {
                // We swallow notification exceptions - for example APNS is offline. Allow rest of processing to work.
                _logger.Write("SendNotification", "One or more notifications via Apple (APNS) failed", exc);
                Restart(); // stops the broker and resets it to null, forcing a reconnect on the next call
            }
        }

        private void Restart()
        {
            if (_apnsServiceBroker != null)
            {
                try
                {
                    // duplicate signals may occur
                    _apnsServiceBroker.Stop();
                }
                catch { }
                // Counted through the logger, like the other counters in this class
                _logger.Write("PassbookEngine_SendNotification_Apple_Restart");
                _apnsServiceBroker = null;
            }
        }
    }
}
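The device-token guard inside SendNotification is easier to reason about (and to unit test) when pulled out on its own. Here is a minimal Python sketch of the same rule, using the limits from the C# code above (32 to 256 characters, no "-"). The function names are made up for illustration.

```python
def is_plausible_device_token(token):
    """Mirror of the guard above: reject blanks, odd lengths and GUID-style fakes."""
    if token is None or not token.strip():
        return False
    if len(token) < 32 or len(token) > 256:
        return False
    return "-" not in token   # GUIDs contain '-', so fake tokens are filtered out

def tokens_to_send(device_tokens):
    """Keep only the tokens that are worth queueing for APNS."""
    return [t for t in device_tokens if is_plausible_device_token(t)]

candidates = ["ab" * 32, "765F7B05-D5E4-4B32-B16D-69C2038EF522", "", "short"]
print(tokens_to_send(candidates))
```

Only the first candidate survives. The GUID is long enough, but the "-" marks it as one of the fake tokens that should never reach Apple.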

Sunday, August 7, 2016

Taking Apple PkPasses In-House–Working Notes

This year I had an explicit, yet vague, project assigned to me: move our Apple PkPass from a third party provider to our own internal system. The working environment was the Microsoft stack with C#, and a little googling found that the first 90% of the work could be done by nuget, namely:

  • Install-Package dotnet-passbook
  • Install-Package PushSharp

Create a certificate file on the Apple developer site and we are done … easy project … not quite.

 

Unfortunately, both the in-house expertise and the 3rd party expertise involved in the original project had moved on. Welcome to reverse engineering black boxes.

 

The Joy of Certificates!

Going to http://www.apple.com/certificateauthority/ opened a can of worms. The existing instructions assumed you have a Mac, not Windows 10.

The existing instructions found on the web (https://tomasmcguinness.com/2012/06/28/generating-an-apple-ios-certificate-using-windows/) broke due to some change with Windows or Apple in April 2016 (apple forum, stack overflow). The solution was Unix on Windows via https://cygwin.com/install.html and going the Unix route to generate pfx files.

 

The second issue was that the way we run our IIS servers and the default instructions for installing the certificate for dotnet-passbook were not mutually compatible. The instructions said that the certs needed to be installed in the Intermediate Certification Authorities store – after a few panicked hours deploying to load hosts with problems, we discovered that we had to import to Personal to get dotnet-passbook to work.

The next issue we encountered was that of invisible characters coming along when we copied the thumbprint into our C# code. We implemented a thumbprint check that verified the length (40) and also walked the characters, ensuring that all were in range. After this, we verified that we could find the matching certificate. All of this was done on website load. If an error was thrown, the site would not load.

 

This saved us triage time on every new deployment:

  • We identify whether a thumbprint is ‘corrupt’
  • We verify that the expected certificate is there
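The thumbprint check described above is small enough to sketch. This Python version is an illustration, not our C# code. It assumes that "in range" means hexadecimal digits, and it uses a made-up thumbprint value plus a left-to-right mark (U+200E) as the invisible character.

```python
import string

HEX_DIGITS = set(string.hexdigits)

def is_valid_thumbprint(thumbprint):
    """A thumbprint passes only if it is exactly 40 characters, all hexadecimal."""
    return len(thumbprint) == 40 and all(ch in HEX_DIGITS for ch in thumbprint)

clean = "a9" * 20              # made-up value, 40 hex characters
pasted = "\u200e" + clean      # same value with an invisible character in front
print(is_valid_thumbprint(clean), is_valid_thumbprint(pasted))
```

The pasted value looks identical on screen but fails both the length test and the character test, which is exactly the case that used to cost us triage time.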

The last issue impacts big shops: the certificate should be 100% owned by Dev Ops and never installed on a dev or test machine. This means that alternative certs are needed in those environments. Each cert will have a different thumbprint – hence lots of web.config transformations substituting in the correct thumbprint for the environment. The real-life production cert should be owned by Dev Ops (or security) with a very strong password that they and they alone know.

 

The Joys of Authentication Tokens

Security review for in-house required that the authentication tokens be stored as a one-way hash (SHA384 or higher) and be unique per PkPass. The existing design used Guids for serial numbers, and thus we used a Guid for the authentication token when the pass was first created. We can never recreate an existing PkPass from our stored data because we do not know the authentication token, just the hash. When a request comes in for the latest pass, we hash the authentication token sent in the authentication header and compare it to the stored hash. We then keep the token in memory and insert it into the PkPass Json, then we Zip and Sign the new PkPass. Security is happy.
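A minimal Python sketch of that hash-and-compare flow is shown below. It is an illustration under stated assumptions, not our production code: the record is a plain dictionary, the token is a random Guid-style value, and the function names are invented.

```python
import hashlib
import hmac
import uuid

def hash_token(token):
    """One-way SHA-384 hash of an authentication token, as a hex string."""
    return hashlib.sha384(token.encode("utf-8")).hexdigest()

def create_pass_record():
    """Create a token for a new pass; only the hash is stored server side."""
    token = uuid.uuid4().hex
    return token, {"token_hash": hash_token(token)}

def is_authorized(record, presented_token):
    """Hash the token from the request header and compare it with the stored hash."""
    return hmac.compare_digest(record["token_hash"], hash_token(presented_token))

token, record = create_pass_record()
print(is_authorized(record, token), is_authorized(record, "not-the-token"))
```

The plain token exists on the server only while the pass is being created or while a request that presented it is being served, which is why an existing pass can only be rebuilt during such a request.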

 

Now when it comes to the 3rd party provider, we were fortunate that they stored the authentication tokens in plain text, so it was just a matter of hashing each token and saving the hash into our database. If they had hashed (as they should have), then we would have needed to replicate their hash method. If it was SHA1 and SHA-2 was required by our security, then we would have needed to do some fancy footwork to migrate the hash, i.e.

  1. add a “SHA” column in our table,
  2. when a new request comes in examine the SHA value
  3. if it is “1” then use the authentication token presented and authenticated to create a SHA-2 hash and update the SHA column to “2”
  4. if it is “2” then authenticate appropriately.

This will allow us to track the uplift rate to SHA-2. At some point security would likely say “delete the SHA1 PkPass records”. This is easy because we have tracked them.
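The four steps above can be sketched as follows. This is a hypothetical Python illustration: the record is a dictionary with a "sha" marker and a "token_hash" field, SHA-384 stands in for the SHA-2 family, and none of the names come from a real schema.

```python
import hashlib
import hmac

def _digest(algorithm, token):
    return hashlib.new(algorithm, token.encode("utf-8")).hexdigest()

def authenticate_and_uplift(record, presented_token):
    """Authenticate against the stored hash; upgrade SHA-1 rows on first successful use."""
    algorithm = "sha1" if record["sha"] == "1" else "sha384"
    if not hmac.compare_digest(record["token_hash"], _digest(algorithm, presented_token)):
        return False
    if record["sha"] == "1":
        # The plain token is only available right now, so this is the moment to re-hash it.
        record["token_hash"] = _digest("sha384", presented_token)
        record["sha"] = "2"
    return True

record = {"sha": "1", "token_hash": hashlib.sha1(b"tok").hexdigest()}
print(authenticate_and_uplift(record, "tok"), record["sha"])
```

Rows still marked "1" after a long period belong to passes whose devices never called back, which is what makes the eventual cleanup easy.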

 

Push Notifications

This went easily, except that we initially missed that a separate Push Certificate is NOT used for PkPass files. Yes, it is not used. It is used for registered 3rd party developed Apple applications. The certificate used for connecting to the Apple Push Notification Service (APNS) is the certificate used to sign the PkPass files. There is no separate push notification certificate. Also, using PushSharp, you must set “validate certificate” to false, or an exception will be thrown.

 

The pushTokens are device identifiers, and APNS does not provide feedback on whether the device still exists (one of my old phones exists, but is at the bottom of an outdoor privy in a national park…), is turned off, or is out of communication. The author of PushSharp, Redth, has written an excellent description of the problem. The logical way to keep the history in check is to track when each pass was last retrieved and then periodically delete the push tokens for devices where none of the associated passes have been retrieved in the last year. You will still have “dead” push tokens in some circumstances.

 

For example: I have a PkPass and my iPhone gets destroyed. I install the PkPass on the new phone. The old iPhone’s push token will never be eliminated while I maintain my PkPass. Why? Because we do not know which iPhone is getting the updates!
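
A minimal sketch of that clean-up rule, in Python for illustration. The two lookups (push token to registered serial numbers, serial number to last retrieval time) are hypothetical stand-ins for whatever tables you keep.

```python
from datetime import datetime, timedelta


def stale_push_tokens(registrations, pass_last_retrieved, now, max_age_days=365):
    """Return push tokens where every associated pass is older than the cutoff."""
    cutoff = now - timedelta(days=max_age_days)
    return sorted(
        token
        for token, serials in registrations.items()
        if all(pass_last_retrieved[serial] < cutoff for serial in serials)
    )


# Hypothetical data: two phones share serial-1, one phone has an abandoned pass.
registrations = {
    "new-phone": {"serial-1"},
    "destroyed-phone": {"serial-1"},
    "abandoned-phone": {"serial-2"},
}
pass_last_retrieved = {
    "serial-1": datetime(2016, 5, 1),
    "serial-2": datetime(2015, 1, 15),
}
stale = stale_push_tokens(registrations, pass_last_retrieved, datetime(2016, 6, 1))
```

Only "abandoned-phone" is reported. The destroyed phone survives the purge because retrieval is tracked per pass, not per device, and the new phone keeps serial-1 fresh.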

 

Minor hiccup

The “get serial numbers since” API call had a gotcha dealing with the modified-since query parameter. The Apple documentation suggests that a date be used, and we originally coded it up assuming that this was an HTTP If-Modified-Since header. QAing on an iPhone clarified that it was a query parameter and not an HTTP header. We simply moved the same date there and encountered two issues:

  • We had a time-offset issue: the value came from the database in local time, but our code deemed it to be universal time (which an HTTP header would be).
  • Our IIS security settings did not like seeing a “:” in a query parameter. We resolved this by using the “yyyyMMddHHmmss” format.

The real gotcha, which is stated in the Apple documentation, is that this value is an arbitrary token that is daisy-chained from one call to the next. A date is a logical choice, but it is not required to be a date.

 

The value received in the last get serial numbers response is what is sent in the next get serial numbers request. Daisy chaining. The iPhone does nothing but echo it back.
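
If you do use a date as the token, a sketch like the following avoids both issues: it always converts to UTC and it contains no “:”. This is Python for illustration; the function names are hypothetical, and "%Y%m%d%H%M%S" is Python’s spelling of “yyyyMMddHHmmss”.

```python
from datetime import datetime, timedelta, timezone

TAG_FORMAT = "%Y%m%d%H%M%S"  # Python equivalent of yyyyMMddHHmmss


def make_update_tag(moment: datetime) -> str:
    """Format a timezone-aware moment as a colon-free UTC tag."""
    return moment.astimezone(timezone.utc).strftime(TAG_FORMAT)


def parse_update_tag(tag: str) -> datetime:
    """Turn an echoed tag back into a timezone-aware UTC datetime."""
    return datetime.strptime(tag, TAG_FORMAT).replace(tzinfo=timezone.utc)


# A database value in local time (UTC-7 here, purely as an example).
local_time = datetime(2016, 6, 6, 9, 30, 0, tzinfo=timezone(timedelta(hours=-7)))
tag = make_update_tag(local_time)
```

Because the device only echoes the value back, the server is free to choose any format it can parse again later.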

Avoiding a Migraine

The dotnet-passbook code puts the pass type identifier from the certificate into the Json, regardless of what you passed in. This is good and wise and secure. It has an unfortunate side effect: the routing

devices/{deviceLibraryIdentifier}/registrations/{passTypeIdentifier} and passes/{passTypeIdentifier}/{serialNumber}

is determined by this pass type identifier. If you are running a site where passes come from passes/foobar/1234, but your certificate name is “JackShyte”, then the Json in the pass returned would read JackShyte. When the iPhone gets a push notification, it would then construct the url for the update as passes/JackShyte/1234 … which will likely return a 404. The PkPass will never be updated unless you create additional routings!!

 

The solution that I took was to compare the {passTypeIdentifier} in the routing to the certificate. If they did not match, then 404 immediately and log an exception. While it is technically possible to “unwind” such a foul up, the path is not pretty.
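
In outline, the guard is nothing more than the comparison below. It is shown in Python for illustration; the function name and logger name are assumptions, and in the real service the certificate’s identifier would be read from the signing certificate.

```python
import logging

logger = logging.getLogger("passes")  # hypothetical logger name


def status_for_route(route_pass_type_id: str, certificate_pass_type_id: str) -> int:
    """Return the HTTP status to use before doing any further work."""
    if route_pass_type_id != certificate_pass_type_id:
        # Fail fast: a pass built here would point the device at the wrong URL.
        logger.error(
            "route pass type %r does not match certificate %r",
            route_pass_type_id,
            certificate_pass_type_id,
        )
        return 404
    return 200
```

Failing at the first request makes the mismatch visible during testing, long before passes with the wrong identifier are in the wild.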

 

Migration

The key for migration is a stepped approach:

  1. Deploy your new solution and test it; correct any issues that you find in the production environment.
  2. Deploy the application or mechanism for creating new PkPasses (this could be part of 1), so that all new passes use the in-house system.
  3. Update your data from the third party provider with authentication tokens (or their hashes) and serial numbers. You want to do this after 2, because you want this list to be closed (no new passes created on the third party system).
  4. Have the 3rd party provider change the WebServiceUrl to the in-house solution. In theory, a Moved response to the in-house system would also work (I have not tested this with an iPhone).
  5. Since the 3rd party will want to shut down eventually, you must send out a push notification to every push token you have.  You will likely want to throttle this if you have a large number of push tokens (in my case, 30 million), because every push token could result in a request for a new PkPass file.
    1. This may need to be repeated to ensure adequate coverage for devices that are offline or abroad without data plans.
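
The throttling in step 5 can be as simple as sending in batches with a pause in between. The sketch below is Python for illustration only; the send callable stands in for whatever push library you use, and the batch size and pause are numbers you would tune against how many PkPass requests your servers can absorb.

```python
import time
from itertools import islice


def batches(tokens, batch_size):
    """Yield successive lists of at most batch_size tokens."""
    iterator = iter(tokens)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def send_throttled(tokens, send, batch_size, pause_seconds, sleep=time.sleep):
    """Send one notification per token, pausing after every batch."""
    sent = 0
    for batch in batches(tokens, batch_size):
        for token in batch:
            send(token)  # hypothetical hook into the push library
        sent += len(batch)
        sleep(pause_seconds)  # gives the pass endpoint time to drain
    return sent
```

Passing sleep in as a parameter keeps the function testable without real waiting, and the token source can be a database cursor rather than a list, so 30 million tokens never have to sit in memory.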

Bottom Line

The original design worked, but there were a ton of details that had to be sorted out. I have omitted the nightmares that QA had trying to validate stuff, especially the migration portions.

Monday, June 6, 2016

One Migration Strategy to Microservices

The concept of microservices is nice, but if you have a complex existing system the path is neither obvious nor easy. I have seen Principal Architects throw up their hands and persuade the business that the old system is impossible to work with and that a new replacement system must be built. This path tends to lead to overruns and often to complete failure. I have recently seen that happen at a firm: “Just one year needed to deliver…” and three years later it was killed off because it had not been delivered.  The failure rates of 80 to 90% typically reported in industry literature are very believable.

 

Over decades, I have seen many failures (usually from the sidelines).  On the other hand, for planned, phased migrations I have seen repeated success. Success or failure often seems to be determined by the agility of the management and technical leads, coupled with the depth of analysis before the project starts. Unfortunately, deep analysis tends to end up as a waterfall-like specification that results in lock-step development and no agility around issues. Similarly, agile often results in superficial analysis (the time horizon for analysis is often just the end of the next sprint), with many components failing to fit together properly over time!

 

This post looks at a heritage system and how it can be converted to a microservices framework in an evolutionary manner. No total rewrite, just a phased migration ending with a system that is close to a classic pro-forma microservice system.

 

I made several runs at this problem, and what I describe below “feels good” – which to me usually means a high probability of success with demonstrable steps at regular intervals.

 

Example System

I am going to borrow a university system template from my days working for Blackboard.  We have teachers, non-teaching staff, students, classes, buildings, security access cards, payment cards, etc.  At one point, components were in Delphi, C#, Java, C++, etc., with the databases in SQL Server and Oracle. Not only is data shared, but permissions often need to be consistent and appropriate.

 

I have tried a few running starts at microservicing such a design, and at present my best suggestion is this:

  • Do NOT extend the microservicing down to the database – there is a more elegant way to proceed
  • Look at the scope of the microservices API very carefully – this is a narrow path that can explode into infinite microservices or a resurrection of legacy patterns

Elegant Microservice Database Access

Do not touch the database design at the start; doing so just complicates the migration path needlessly. Instead, for each microservice create a new database login that is named for the microservice and has (typically) CRUD permissions to one of:

  • A table
  • A subset of columns in a table
  • An updateable view
  • A subset of columns in an updateable view

We term each of these a Crud-Columns. There is a temptation to incorporate multiple Crud-Columns into one microservice. The problem is simple: what are the objective criteria for when to stop incorporating more Crud-Columns into this single microservice? If you go to one microservice for each Crud-Columns, then by counting the tables you have an estimate of the number of microservices that you will likely end up with…  oh… that may be a lot! At this point, you may really want to consider automatic code generation of many microservices – similar to what you see with Entity Framework, except this would be called Microservices-Framework.

 

This microservice may also have read-only permissions to other tables.  This read-only access to other tables may be transitory for the migration. Regardless of the final resolution, these tables must be directly related to the CRUD columns and used to determine CUD decisions. At some future time, the REST calls to these read-only tables may be redirected elsewhere (for example, using a Moved directive to a reporting microservice).

 

Oh, I have introduced a new term: “reporting microservices”.  This is a microservice with one or more Read APIs – multiple calls may be exposed depending on filtering, sorting or user permissions.

 

Microservices are not domain-level APIs; they sit at the sub-domain or even sub-sub-domain level. You should not be making small steps; instead, put on your seven-league boots!

American Trucking Industry 1952 Ad - Seven League Boots…

 

Tracking microservices

Consider creating a table where every database column is enumerated out and the microservice having CRUD over it is listed.

i.e.

  • Server.Database.Table.Schema.Column –> CRUD –> Microservice Name

 

The ideal (but likely impractical) goal is to have just one CUD microservice per column. That is, a microservice may have many CUD columns, but a column will have only one CUD microservice (N columns :: 1 microservice).

 

Similarly, a table with

  • Server.Database.Table.Schema.Column –> R –> Microservice

can be used as a heat map to refactor as the migration occurs. We want to reduce hot spots (i.e. the number of Read microservices per column).
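
Both tracking tables boil down to the same lookup, as in this rough Python sketch. The column and microservice names are invented for the example; in practice the rows would come from the tracking table itself.

```python
from collections import defaultdict

# Hypothetical tracking rows: (column, access, microservice).
grants = [
    ("dbo.Teacher.Name", "CUD", "TeacherMetadata"),
    ("dbo.Teacher.Salary", "CUD", "TeacherSalary"),
    ("dbo.Teacher.Salary", "CUD", "TeacherHR"),
    ("dbo.Teacher.Name", "R", "TeacherSalary"),
    ("dbo.Teacher.Name", "R", "ClassAssignmentViewer"),
]


def services_per_column(rows, access):
    """Group microservice names by column for one kind of access."""
    result = defaultdict(set)
    for column, kind, service in rows:
        if kind == access:
            result[column].add(service)
    return dict(result)


# Columns that break the "one CUD microservice per column" goal.
cud_conflicts = {
    column: services
    for column, services in services_per_column(grants, "CUD").items()
    if len(services) > 1
}

# Heat map: number of reading microservices per column.
read_heat = {
    column: len(services)
    for column, services in services_per_column(grants, "R").items()
}
```

Here the Salary column shows up as a CUD conflict (two owners) and the Name column as the hottest read spot, which are exactly the two things to refactor towards zero and towards low counts respectively.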

 

Building Microservices from Database Logins

Defining the actions that a microservice login can do cascades into a finite set of possible APIs. We are avoiding trying to define a microservice and then get the database access to support it. We are effectively turning the usual process upside down.

 

Instead of the typical path of asking the client what it needs from an API (to keep its life simple), we are ensuring that there is a collection of APIs that satisfies its needs – although these may be complicated to call. What we need in order to return to the classical simplicity is intermediate APIs.

 

Intermediate APIs

Intermediate APIs are APIs that do not have explicit database CUD rights. They are intended to be helper APIs that talk to the database microservices above and present a simpler API to clients. They will call the above APIs to change the database. They may also be caching APIs and database reporting APIs.

 

A Walk Thru

Using the university model cited above, the first naïve step could be to create a

  • Teacher API
  • Student API
  • Class API

If you bring in column permissions you find that these can be decomposed further. The reason that there may be a single row in the database for each of the above comes from Relational Database Design Normalization theory.  Instead, we should try to decompose according to user permission sets. For example:

  • Teacher API
    • Teacher MetaData API, i.e. name
    • Teacher Address Info API
    • Teacher Salary Info API
    • Teacher HR API
    • Teacher Card Access API
  • Student API
    • Student MetaData API, i.e. name
    • Student Address Info API
    • Student Tuition API
    • Student Awards API
    • Student Card Access API

Our wishful state is that if you are authorized for an API, there is no need to check for further permissions. As I said, wishful. If you apply this concept strictly, then you will likely end up with an unmanageable number of APIs, which would be counterproductive. This would be the case for an enterprise class system. For less complex systems, like customer retail systems, the number of permission sets may be greatly reduced.

 

With the Blackboard system (when I was working on it), we were enabling support for hundreds of thousands of permission sets that often contained hundreds of permissions each (i.e. each person had their own set, and each set contained permissions to access buildings, Uris, copying machines, etc.).

 

An Intermediate API may be ClassAssignmentViewer. This API combines information from the Student Metadata API, the Teacher Metadata API and other APIs. Alternatively, it may read directly (read-only) from the database.

 

Next Step

Once you have the microservices defined, you can start looking at segmenting the data store to match the microservices. When you leave a classic relational database, you may need to deal with issues such as referential integrity and foreign keys between microservices. If you have the microservice and the database login permissions pre-defined, then these issues are an order of magnitude simpler.

Bottom Line

The above is a sketch of what I discovered about the migration process by trying several different approaches and seeing either ongoing headaches or massive and risky refactoring.

 

With the above, you can start with a small scope and implement it. The existing system keeps functioning and you have created a parallel access point to the data. As functioning sets are completed, you can cut over to some microservices while the rest is running on the classic big API approach.  You can eventually have the entire system up in parallel and then do a cut over to these microservices stubs. Over time, you may wish to decouple the data stores, but that can be done later. You need to isolate the CUD into microservices first to be able to do that step.

Saturday, May 28, 2016

Theory about Test Environments

Often in my career I have had to deal with an arbitrary environment to test in. This environment preceded my arrival, and often was still there at my departure, with many developers having become fatalistic towards it.  This is not good.

 

The Rhetorical Goal Recomposed

“We use our test environment to verify that our code changes will work as expected”

While this assures upper management, it lacks specifics to evaluate if the test environment is appropriate or complete. A more objective measurement would be:

  • The code changes perform as specified at the six-sigma level of certainty.

This then logically cascades into sub-measurements:

  • A1: The code changes perform as specified at the highest projected peak load for the next N years (typically 1-2) at the six-sigma level of certainty.
  • A2: The code changes perform as specified on a freshly created (perfect) environment at the six-sigma level of certainty.
  • A3: The code changes perform as specified on a copy of the production environment with random data at the six-sigma level of certainty.

The last one is actually the most critical because too often there is bad data from bad prior released code (which may have been rolled back – but the corrupted data remained!). There is a corollary:

  • C1: The code changes do not need to perform as specified when the environment has had its data corrupted by arbitrary code and data changes that have not made it to production. In other words, ignore a corrupted test environment.

 

Once thru is not enough!

Today’s systems are often multi-layered, with timeouts, blockage under load and other things making the outcome not a certainty but a random event. Above, I cited six sigma – this is a classic level sought in quality assurance of mechanical processes.

 

“A six sigma process is one in which 99.99966% of all opportunities to produce some feature of a part are statistically expected to be free of defects (3.4 defective features per million opportunities).”

 

To translate this into a single test context – the test must run 1,000,000 times and fail fewer than 4 times. Alternatively, 250,000 times with no failures.
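
The arithmetic behind those two run counts is simply the expected number of defects at 3.4 per million opportunities. A small Python sketch, with an illustrative function name:

```python
DEFECTS_PER_MILLION = 3.4  # the six-sigma figure quoted above


def expected_defects(runs: int) -> float:
    """Expected number of failures over a given number of test runs."""
    return runs * DEFECTS_PER_MILLION / 1_000_000


per_million_runs = expected_defects(1_000_000)  # about 3.4, so "fewer than 4"
per_quarter_million = expected_defects(250_000)  # about 0.85, so "no failures"
```

At 250,000 runs the expected count drops below one whole failure, which is why a single observed failure at that volume already misses the target.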

 

Load testing to reach six-sigma

Load testing will often result in 250,000 calls being made. In some cases, it may mean that the load test needs to run for 24 hours instead of 1 hour. There are some common problems with many load tests:

  • The load test does not run on a full copy of the production environment – this violates A3.
  • The same data is used time and again for the tests – this fails the random data requirement of A3.
    • If you have a system that has been running for 5 years, then the data should be selected from user-created data with 1/5 from each year
    • If the system has had N releases, then the data should be selected from user-created data with 1/N from each release period

Proposal for a Conforming Pattern

Preliminary development (PD) is done on a virgin system each day. By virgin I mean that databases and other data stores are created from scripts and populated with perfect data. There may be super user data but no common user data.  This should be done by an automated process. I have seen this done in some firms and it has some real benefits:

  • Integration tests must create (instead of borrow) users
    • Integration tests are done immediately after build – the environment is confirmed before any developers arrive at work.
    • Images of this environment could be saved to allow faster restores.
  • Performance is good because the data store is small
  • A test environment is much smaller and can be easily (and cheaply) created on one or more cloud services or even VMs
  • Residue from bad code does not persist (often reducing triage time greatly) – when a developer realizes they have accidentally jacked the data, they just blow away the environment and recreate it

After the virgin system is built, the developer’s “release folder scripts” are executed – for example, adding new tables, altering stored procedures, adding new data to system tables. Then the integration tests are executed again. Some tests may fail. A simple solution that I have seen is for these tests to call into the data store to get the version number, combined with an extension to NUnit that indicates whether a test applies before or after this version number. Tests that are expected to fail can then be excluded (and also identified as needing a new version to be written).
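
The version gating can be sketched as below. This is a Python illustration of the idea, not the NUnit extension itself; the test names, version numbers and the (name, first version, last version) layout are all made up for the example.

```python
def applies_to(schema_version, min_version=None, max_version=None):
    """True when a test is meant to run against this data store version."""
    if min_version is not None and schema_version < min_version:
        return False
    if max_version is not None and schema_version > max_version:
        return False
    return True


# Hypothetical tests: (name, first version it applies to, last version).
tests = [
    ("old_report_layout", None, 41),
    ("new_report_layout", 42, None),
    ("login", None, None),
]


def runnable(all_tests, schema_version):
    """Names of the tests that should run for the given version."""
    return [
        name
        for name, first, last in all_tests
        if applies_to(schema_version, first, last)
    ]
```

Running against version 41 selects the old layout test, and running after the release scripts (version 42) swaps in the new one, so an expected failure never shows up as noise.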

 

Integration development (ID) applies to the situation where there may be multiple teams working on stuff that will go out in a single release. Often it is more efficient to keep the teams in complete isolation for preliminary development – if there are complexities and side-effects, then only one team suffers. A new environment is created, then each team’s “release folder scripts” are executed and the tests are run.

i.e. PD+PD+….+PD = ID

This keeps the number of moving code fragments controlled.

 

Scope of Testing in PD and ID

A2 level is as far as we can go in this environment. We cannot do A1 or A3.

 

SmokeTest development (STD) means that an image of the production database is made available to the integration team so they can test the code changes using real data. Ideally, they should regress with users created during each release period so that artifact issues can be identified. This may be significant testing, but it is not load testing because we do not push up to peak volumes.

Tests either create a new user (in the case of PD and ID) or search for a random user that was created in, say, release cycle 456 (in the case of STD). Of course, code like SELECT TOP 1 *… should not be used; rather, all matching users should be retrieved and one selected at random.

 

This gets us close to A3, if we do enough iterations.

 

Designing Unit Tests for Multiple Test Environments

Designing a UserFactory with a signature such as

UserFactory.GetUser(UserAttributes[] requiredAttributes)

can simplify the development of unit tests that can be used across multiple environments. This UserFactory reads a configuration file which may have properties such as

  • CreateNewUser=”true”
  • PickExistingUser=”ByCreateDate”
  • PickExistingUser=”ByReleaseDate”
  • PickExistingUser=”ByCreateDateMostInactive”

In the first case, a user is created with the desired attributes.  In other cases, the attributes are used to filter the production data to get a list of candidates to randomly pick from.
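
A stripped-down sketch of such a factory, in Python for illustration. It only covers the split between creating a user and picking a random existing one; the different PickExistingUser strategies are left out, and the dict-based user records are an assumption of the example.

```python
import random


class UserFactory:
    """Hands tests a user that fits the current environment's rules."""

    def __init__(self, config, existing_users, rng=None):
        self.config = config
        self.existing_users = existing_users
        self.rng = rng or random.Random()

    def get_user(self, required_attributes):
        wanted = set(required_attributes)
        if self.config.get("CreateNewUser") == "true":
            # PD/ID style: build a fresh user with exactly these attributes.
            return {"id": "new-user", "attributes": wanted}
        # STD style: filter existing users, then pick one at random
        # (never just the first match).
        candidates = [
            user for user in self.existing_users if wanted <= user["attributes"]
        ]
        if not candidates:
            raise LookupError("no existing user has the required attributes")
        return self.rng.choice(candidates)


# Hypothetical production-like data.
users = [
    {"id": "u1", "attributes": {"student", "has_card"}},
    {"id": "u2", "attributes": {"teacher"}},
    {"id": "u3", "attributes": {"student"}},
]
```

The test itself only ever calls get_user(["student"]); whether that creates or picks a user is decided entirely by the configuration of the environment it runs in.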

 

In stressing scenarios when we want to test for side-effects due to concurrent operation by the same user, then we could use the current second to select the same user for all tests starting in the current second.

 

Developers Hiding Significant Errors – Unintentional

At one firm, we successfully established the following guidance:

  • Fatal: When the unexpected happens – for example, the error that was thrown was not mapped to a known error response (i.e. Unexpected Server Error should never be returned)
  • Error: When an error happens that should not happen, i.e. the try/catch worked to recover the situation… but…
  • Warning: When the error was caused by customer input. The input must be recorded in the log (minus passwords). This typically indicates a defect in the UI, training or child applications
  • Info: everything else, i.e. counts
  • Debug: whatever

We also implemented the ability to change the log4net settings on the fly – so we could, in production, get every message for a short period of time (massive logs).
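
We did this with log4net; the same two ideas (record customer input minus passwords, and raise verbosity at runtime) look roughly like this with Python’s standard logging module. The logger name and helper names are assumptions for the sketch.

```python
import logging

FATAL = logging.CRITICAL  # stand-in for the "Fatal" level in the guidance

logger = logging.getLogger("example.service")  # hypothetical logger name
logger.setLevel(logging.WARNING)  # normal production setting


def redact(fields: dict) -> dict:
    """Copy customer input, masking anything that looks like a password."""
    return {
        key: "***" if "password" in key.lower() else value
        for key, value in fields.items()
    }


def log_customer_input_error(fields: dict) -> None:
    """Warning level: the error was caused by customer input."""
    logger.warning("rejected input: %s", redact(fields))


def enable_full_logging() -> None:
    """Switch to 'every message' at runtime, without a restart."""
    logger.setLevel(logging.DEBUG)
```

Remember to switch the level back after the capture window; at Debug level a busy production system produces massive logs very quickly.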

Load Stress with Concurrency

Correct load testing is very challenging and requires significant design and statistics to do and validate the results.

 

One of the simplest implementations is to take a week-old copy of the database, capture all of the web request traffic from the last week, and play it back in a reduced time period. With new functionality extending existing APIs, we are then reasonably good – except we need to make sure that we reach the six-sigma level, i.e. were there at least 250,000 calls?  This can be further complicated if the existing system has a 0.1% error rate. A 0.1% error rate means 250 errors are expected on average; unfortunately this means that detecting a difference of 1 error in 250,000 calls is impossible from a single run (or even a dozen runs). Often the first stage is to drive error rates down to near zero on the existing code base. I have personally (over several months) driven a 50K/day exception logging rate down to less than 10. It can be done – it is just a lot of systematic, slow work (and fighting to get these “not business significant” bug fixes into production). IMHO, they are business significant: they reduce triage time, false leads and bug reports, and thus improve the customer experience with the application.

 

One of the issues is whether the 250,000 calls apply to the system as a whole – or just to the method being added or modified. For true six-sigma, it needs to be the method modified – sorry! And if there are 250,000 different users (or other objects) to be tested, then random selection of test data is required.

 

I advocate the use of PNUnit (Parallel NUnit) on multiple machines with a slight twist. With the UserFactory.GetUser() described above, we randomly select the user, but for stress testing we could take the current time in seconds (a long), compute it modulo the number of candidate users, and use the result to pick the user before executing the tests. This approach intentionally creates a situation where concurrent activity will be generated, potentially creating blocks, deadlocks and inconsistencies.
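
The “seconds modulo candidates” trick is a one-liner; here it is as a Python sketch with a hypothetical helper name. It assumes the test machines have reasonably synchronized clocks and see the candidate users in the same order.

```python
import time


def pick_by_second(candidates, now_seconds=None):
    """Pick the same candidate for every caller within the same second."""
    if now_seconds is None:
        now_seconds = int(time.time())
    return candidates[now_seconds % len(candidates)]


users = ["u1", "u2", "u3"]  # hypothetical candidate list, same order everywhere
```

Every test that starts within the same second, on any machine, lands on the same user, and the selection still rotates through all candidates over time.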

 

There is a nasty problem with making integration tests mirror the production distribution of calls. Marking tests appropriately may help; the test runner can then select the tests to simulate the actual production call distribution and rates. Of course, this means that there must be data on the call rates and error rates from the production system.
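
Once tests are marked with the call they exercise, a runner can draw them in production proportions. A rough Python sketch follows; the call names and shares are invented, and real numbers would come from production request logs.

```python
import random
from collections import Counter

# Hypothetical share of production traffic per marked test category.
production_call_share = {
    "GetPass": 0.70,
    "RegisterDevice": 0.25,
    "Unregister": 0.05,
}


def plan_calls(call_share, total_calls, rng):
    """Draw a test plan whose mix follows the production call distribution."""
    names = list(call_share)
    weights = [call_share[name] for name in names]
    return Counter(rng.choices(names, weights=weights, k=total_calls))


plan = plan_calls(production_call_share, 10_000, random.Random(0))
```

Seeding the random generator makes a run repeatable, which helps when a failure has to be reproduced with the same mix.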

 

Make sure that you are giving statistically correct reports!

 

The easy question to answer is “Does the new code make the error rate statistically worse?” Taking our example above of a 0.1% error rate, we had 250 errors expected. If we want to have 95% confidence, then we would need to see 325 errors to deem it to be worse. You must stop and think about this, because our stated goal was less than 1 error in 250,000 – and we are ignoring 75 more errors as not being significant!!! This is a very weak criterion. It also makes clear that driving down the background error rate is essential. You cannot get strong results with a high background error rate; you may only be able to demonstrate a 1 sigma defect rate.
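
For anyone who wants to compute their own cutoff, here is one way to do it in Python. It assumes independent calls (a binomial model), a normal approximation and a one-sided test; those are modelling assumptions of this sketch, not something taken from a particular tool.

```python
from math import sqrt
from statistics import NormalDist


def error_threshold(calls, baseline_rate, confidence=0.95):
    """Error count above which the new code looks worse than the baseline."""
    expected = calls * baseline_rate
    std_dev = sqrt(calls * baseline_rate * (1 - baseline_rate))
    z = NormalDist().inv_cdf(confidence)  # one-sided critical value
    return expected + z * std_dev


threshold = error_threshold(250_000, 0.001)
```

Under these assumptions the cutoff comes out in the mid 270s, so the 325 quoted above is on the cautious side. The conclusion does not change: even the tighter cutoff still waves through a couple of dozen extra errors, against a goal of less than 1 in 250,000.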

 

In short, you can rarely demonstrate a better sigma level than your current one unless you first fix the current code base to lower its background error rate.