Docker and Beanstalk: Welcome to the Gaps

At UnifyID we’re big fans of microservices à la Docker and Elastic Beanstalk. And for good reason. Containerization simplifies environment generation, and Beanstalk makes it easy to deploy and scale.

Both promise an easier life for developers, and both deliver, mostly. But as with all simple ideas, things get less, well, simple as the idea becomes more widely adopted, and then adapted into other tools and services with different goals.

Soon there are overlaps in functionality, and gaps in the knowledge base (the Internet) quickly follow. Let’s take an example.

When you first jump into Docker, it makes total sense. You have this utility docker and you write a Dockerfile that describes a system. You then tell docker to read this file and, magically, a full-blown programming environment is born. Bliss.

But what about running multiple containers? You’ll never be able to do it all with just a single service. Enter docker-compose, a great utility for handling just this. But suddenly, what was so clear before is now less clear:

  • Is the docker-compose.yml supposed to replace the Dockerfile? Complement it?
  • If they’re complementary, do options overlap? (Yes.)
  • If options overlap, which should go where?
  • How do the containers address each other given a specific service? Still localhost? (Not necessarily.)

Add in something like Elastic Beanstalk, its Dockerrun.aws.json file, doing eb local run, and things get even more fun to sort out.

In this post I want to highlight a few places where the answers weren’t so obvious when trying to implement a Flask service with MongoDB.

To start off, it’s a pretty straightforward setup. One container runs Flask and serves HTTP, and a second container serves MongoDB. Both are externally accessible. The MongoDB instance is password-protected, naturally, and in no way am I going to write my passwords down in a config file. They must come from the environment.

Use the Dockerfile just for provisioning

The project began its life with a single Dockerfile containing an ENTRYPOINT to start the app. This was fine while I was still in the early stages of development — I was still mocking out parts of external functionality, or not even handling it yet.

But then I needed the same setup to provide a development environment with actual external services running, and the ENTRYPOINT in the Dockerfile became problematic. And then I realized — you don’t need it in the Dockerfile, so ditch it. Let the Dockerfile do all the provisioning, and specify your entrypoint in one of the other ways. From the command line:

docker run --entrypoint make myserver run-tests

Or, from your docker-compose.yml, you can do it like this:

version: '2'
services:
  myserver:
    ...
    entrypoint: make dev-env

This handily solved the problem of having a single environment oriented to different needs, i.e. test runs and a live development environment.

Don’t be afraid of multiple Dockerfiles

The docker build command looks for a file named Dockerfile in the build context by default. But that’s just the default behavior, and it’s pretty common to have slightly different configs for different environments. For example, our dev and production environments are very similar, but dev has some extra stuff that we want to weed out for production.

You can easily specify the Dockerfile you want by using docker build -f Dockerfile.dev ..., or by simply using a symlink: ln -s Dockerfile.dev Dockerfile && docker build ...

If your docker-compose.yml specifies multiple containers, you may find yourself not only with multiple Dockerfiles for a given service, but with separate Dockerfiles for each service. To demonstrate, let’s say we have the following docker-compose.yml:

version: '2'
services:
  flask:
    build: .
    image: myserver:prod
    volumes:
      - .:/app
    links:
      - mongodb
    environment:
      - MONGO_USER=${MONGO_USER}
      - MONGO_PASS=${MONGO_PASS}
    ports:
      - '80:5000'
    entrypoint: make run-server
  mongodb:
    build: ./docker/mongo
    image: myserver:mongo
    environment:
      - MONGO_USER=${MONGO_USER}
      - MONGO_PASS=${MONGO_PASS}
    ports:
      - '27017:27017'
    volumes:
      - ./mongo-data:/data/mongo
    entrypoint: bash /tmp/init.sh

In the source tree for the above, we have Dockerfiles in the following locations:

Dockerfile.dev
Dockerfile.prod
docker/mongo/Dockerfile

The docker-compose command uses the build option to tell it where to find the Dockerfile for a given service. The top two files are for the Flask service, and the appropriate Dockerfile is chosen using the linking strategy mentioned above. The mongodb service uses its own Dockerfile kept in its own directory. The line

build: ./docker/mongo

tells docker where to look for it.

Dockerrun.aws.json, the same, but different

Enter Elastic Beanstalk and Dockerrun.aws.json. Now you have yet another file, and it pretty much duplicates docker-compose.yml — but of course with its own personality.

You use Dockerrun.aws.json v2 to deploy multiple containers to Elastic Beanstalk. Also, when you do eb local run, the file .elasticbeanstalk/docker-compose.yml is generated from it.

Here’s what the Dockerrun.aws.json counterpart of the above docker-compose.yml file looks like:

{
  "AWSEBDockerrunVersion": 2,
  "volumes": [
    {
      "name": "mongo-data",
      "host": {
        "sourcePath": "/var/app/mongo-data"
      }
    }
  ],
  "containerDefinitions": [
    {
      "name": "myserver",
      "image": "SOME-ECS-REPOSITORY.amazonaws.com/myserver:latest",
      "environment": [
          {
            "name": "MONGO_USER",
            "value": "changemeuser"
          },
          {
            "name": "MONGO_PASS",
            "value": "changemepass"
          },
          {
            "name": "MONGO_SERVER",
            "value": "mongo-server"
          }
      ],
      "portMappings": [
        {
          "hostPort": 80,
          "containerPort": 5000
        }
      ],
      "links": [
        "mongo-server"
      ],
      "command": [
        "make", "run-server-prod"
      ]
    },
    {
      "name": "mongo-server",
      "image": "SOME-ECS-REPOSITORY.amazonaws.com/mongo-server:latest",
      "environment": [
          {
            "name": "MONGO_USER",
            "value": "changemeuser"
          },
          {
            "name": "MONGO_PASS",
            "value": "changemepass"
          }
      ],
      "mountPoints": [
        {
          "sourceVolume": "mongo-data",
          "containerPath": "/data/mongo"
        }
      ],
      "portMappings": [
        {
          "hostPort": 27017,
          "containerPort": 27017
        }
      ],
      "command": [
        "/bin/bash", "/tmp/init.sh"
      ]
    }
  ]
}

Let’s highlight a few things. First, you’ll see that the image option is different, i.e.

      "image": "SOME-ECS-REPOSITORY.amazonaws.com/myserver:latest",

This is because we build our docker images and push them to a private repository on Amazon ECS. On deploy, Beanstalk looks for the one tagged latest, pulls, and launches.

Next, you may have noticed that in docker-compose.yml we have the entrypoint option to start the servers. However, in Dockerrun.aws.json we’re using "command".

There are some subtle differences between ENTRYPOINT and CMD in Docker generally, but here the issue was more basic: even though Dockerrun.aws.json has an "entryPoint" option, the server commands simply wouldn’t run with it. I had to switch to "command" before I could get eb local run to work. Shrug.

Another thing to notice is that in docker-compose.yml we’re getting variables from the host environment and setting them into the container environment:

    environment:
      - MONGO_USER=${MONGO_USER}
      - MONGO_PASS=${MONGO_PASS}

Very convenient. However, you can’t do this with Dockerrun.aws.json. You’ll have to rewrite the file with the appropriate values, then reset it. The next bit will demonstrate this.

We’re setting a local volume for MongoDB with the following block:

  "volumes": [
    {
      "name": "mongo-data",
      "host": {
        "sourcePath": "/var/app/mongo-data"
      }
    }
  ]

The above path is production-specific. This causes a problem with eb local run, mainly because of permissions on your host machine. If you set a relative path instead, e.g.

        "sourcePath": "mongo-data"

the volume is created under .elasticbeanstalk/mongo-data, and everything works fine. On a system with Bash, you can solve this pretty easily doing something along the following lines:

cp Dockerrun.aws.json Dockerrun.aws.json.BAK
sed -i '' "s/\/var\/app\///g" Dockerrun.aws.json
eb local run ; mv Dockerrun.aws.json.BAK Dockerrun.aws.json

We just delete the /var/app/ part, run the container locally, and then restore the file to how it’s supposed to be for deploys. The same rewrite trick is how we replace the placeholder credentials (changemeuser and changemepass) with real values from the environment on deploy.
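
If sed feels too opaque, the same idea can be sketched in Python. This is only an illustration of rewriting the placeholders from the environment, not our actual deploy script; the key names match the sample Dockerrun.aws.json above.

import json
import os

# Rough sketch: swap the changeme* placeholders in Dockerrun.aws.json for
# real values taken from the host environment (illustration only).
with open('Dockerrun.aws.json') as f:
    config = json.load(f)

for container in config['containerDefinitions']:
    for var in container.get('environment', []):
        if var['name'] in ('MONGO_USER', 'MONGO_PASS'):
            var['value'] = os.environ[var['name']]

with open('Dockerrun.aws.json', 'w') as f:
    json.dump(config, f, indent=2)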

Last, you’d think running eb local run, which is designed to simulate an Elastic Beanstalk environment locally via Docker, would execute pretty much the same as when you invoke with docker-compose up.

However, I discovered one frustrating gotcha. In our Flask configuration, we are addressing the MongoDB server with mongodb://mongodb (instead of mongodb://localhost) in order to make the connection work between containers.

This simply did not work in eb local run. Neither did using localhost. It turns out the solution is to use another environment variable, MONGO_SERVER. In our Flask config, we do the following, which defaults to mongodb://mongodb:

    'MONGO_SERVER': os.environ.get('MONGO_SERVER', 'mongodb'),

In Dockerrun.aws.json, we specify this value as

          {
            "name": "MONGO_SERVER",
            "value": "mongo-server"
          }

Why? Because the "name" of our container is mongo-server and eb generates an entry in /etc/hosts based on that.

So now everything works between docker-compose up, which uses mongodb://mongodb, and eb local run, which uses mongodb://mongo-server.
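
For completeness, here’s a rough sketch of how those variables can come together into the final connection URI on the Flask side. The build_mongo_uri helper and the MONGO_CONFIG name are illustrative, not our actual config; only the MONGO_* environment variables come from the setup above.

import os

MONGO_CONFIG = {
    'MONGO_USER': os.environ.get('MONGO_USER', ''),
    'MONGO_PASS': os.environ.get('MONGO_PASS', ''),
    # Defaults to the docker-compose service name; Dockerrun.aws.json
    # overrides it with "mongo-server" for eb local run and deploys.
    'MONGO_SERVER': os.environ.get('MONGO_SERVER', 'mongodb'),
}

def build_mongo_uri(cfg):
    """Build a mongodb:// URI from the user, password and host in cfg."""
    return 'mongodb://{0}:{1}@{2}:27017'.format(
        cfg['MONGO_USER'], cfg['MONGO_PASS'], cfg['MONGO_SERVER'])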

These are just a few of the things that might confound you when trying to do more than just the basics with Docker and Elastic Beanstalk. Both have a lot to offer, and you should definitely jump in if you haven’t already. Just watch out for the gaps!

How I stopped worrying and embraced Docker Microservices

Hello world,

If you are like us here at UnifyID, then you’re really passionate about programming, programming languages, and their runtimes. You will argue passionately about how Erlang has the best distributed systems model (2M TCP connections in one box), how Haskell has the best type system, and how all our ML back-end should be written in Lua (Torch). If you are like me and you start a company with other people, you will argue for hours, and nobody’s feelings are gonna be left intact.

That was the first problem we had in the design phase of our Machine Learning back-end. The second problem will become obvious when you get a short introduction to what we do at UnifyID:

We data-mine a lot of sensors on your phone, do some signal processing and encryption on the phone, then opportunistically send the data from everybody’s phones into our Deep Learning back-end, where the rest of the processing and the actual authentication take place.

This way, the processing load is shared between the mobile device and our Deep Learning back-end. Multiple GPU machines power our Deep Learning, running our proprietary Machine Learning algorithms across all of our users’ data.

These are expensive machines and we’re a startup with finite money, so here’s the second problem: scalability. We don’t want these machines sitting around when no jobs are scheduled, and we also don’t want them struggling when a traffic spike hits. This is a classic auto-scaling problem.

This post describes how we killed two birds:

  1. Many programming runtimes for DL.
  2. Many machines.

With one stone. By utilizing the sweeping force of Docker microservices! This has been the next big thing in distributed systems for a while; Twitter and Netflix use it heavily, and this talk is a great place to start. Since we verify against a lot of factors, like Facial Recognition, Gait Analysis, and Keystroke Analysis, it made sense to make them modular. We packaged each one in its own container, wrote a small HTTP server that satisfies the following REST API, and we were done!

POST /train
Body: { Files: [ <s3 file1>, <s3 file2>,...] }
Response: { jobId: <jobId> }

POST /input
Body: { Files: [ <s3 file1>, <s3 file2>,...] }
Response: { jobId: <jobId> }

POST /output
Body: { Files: [ <s3 file1>, <s3 file2>,...] }
Response: { outputVector: <output vector> }

GET /status?jobId=<jobId>
Response: { status: [running|done|error] }

This API can be useful because every Machine Learning algorithm has pretty much the same interface: training inputs, normal inputs, and outputs. It’s so useful that we decided to open-source our microservice wrapper for Torch v7/Lua and for Python. Hopefully more people can use it and we can all start forking and pushing entire machine learning services to Docker Hub.
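
To make the wrapper idea concrete, here is a minimal sketch of such an HTTP server in Python with Flask. The in-memory job store and the start_job placeholder are inventions for illustration; they stand in for whatever the real wrappers do with the S3 file lists.

import uuid

from flask import Flask, jsonify, request

app = Flask(__name__)
jobs = {}  # jobId -> status; a real wrapper would persist this somewhere

def start_job(files):
    """Register a new job for the given S3 file list (placeholder for the real work)."""
    job_id = str(uuid.uuid4())
    jobs[job_id] = 'running'
    # The real wrapper hands `files` to the ML runtime here.
    return job_id

@app.route('/train', methods=['POST'])
def train():
    return jsonify({'jobId': start_job(request.get_json()['Files'])})

@app.route('/input', methods=['POST'])
def input_files():
    return jsonify({'jobId': start_job(request.get_json()['Files'])})

@app.route('/output', methods=['POST'])
def output():
    # A real wrapper would run the model over the given files; this returns a stub.
    return jsonify({'outputVector': []})

@app.route('/status')
def status():
    return jsonify({'status': jobs.get(request.args.get('jobId'), 'error')})

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8080)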

But wait, there’s more! Now that we’ve containerized our ML code, the scalability problem has moved from a development problem to an infrastructure problem. To handle scaling each microservice according to its GPU and network usage, we rely on Amazon ECS. We looked into Kubernetes as a way to load-balance containers; however, its support for NVIDIA GPU-based load-balancing is not there yet (there’s an MR and some people who claim they made it work). Mesos was the other alternative, with NVIDIA support, but we just didn’t like all the Java.

In the end, this is what our ML infrastructure looks like.

Top-down approach to scalable ML microservices

Those EB trapezoids represent Amazon EB (Elastic Beanstalk), another Amazon service which can replicate machines (even GPU-heavy machines!) using custom-set rules. The inspiration for load-balancing our GPU cluster with ECS and EB came from this article from Amazon’s Personalization team.

For our database we use a mix of Amazon S3 and a traditional PostgreSQL database linked to each container and used as a local cache. This way, shared data becomes as easy as sharing S3 paths, while each container can modularly keep its own state in PostgreSQL.
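
As a rough illustration of that pattern (the bucket, table, and connection details here are made up), a container might pull shared data by S3 path and note its own state in its local PostgreSQL:

import boto3
import psycopg2

s3 = boto3.client('s3')
# Hypothetical local database; each container keeps its own state here.
db = psycopg2.connect(dbname='container_state', user='postgres')

def fetch_shared_file(bucket, key, local_path):
    """Download a shared artifact by S3 path and record it in the local cache table."""
    s3.download_file(bucket, key, local_path)
    with db, db.cursor() as cur:
        cur.execute(
            'INSERT INTO cached_files (s3_key, local_path) VALUES (%s, %s)',
            (key, local_path))
    return local_path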

So there you have it, both birds killed. Our ML people are happy since they can write in whatever runtime they want, as long as there is an HTTP server library for it. We don’t really worry about scalability, as all our services are small and nicely containerized. We’re ready to scale to as many as 100,000 users, and I doubt our microservices fleet would even flinch. We’ll be presenting our setup at the upcoming DockerCon 2017 (hopefully; we’re waiting for the CFP to open), and we’re looking to hire new ML and full-stack engineers. Come help us bring the vision of passwordless implicit authentication to everyone!

Distributed Testing with Selenium & Docker Containers (Part I)

As seen at TechCrunch Disrupt, the UnifyID Google Chrome extension plays a prominent role in the user’s first impression of the product. As such, it is critical that the experience be smooth, user-friendly, and most importantly, reliable. At minute 2:10 of the demo, the Amazon login fields are replaced with the UnifyID 1-click login. This requires DOM (document object model) manipulation on the web page, which isn’t challenging on a single website; however, when scaling your code across sites in multiple languages and formats, finding the right login form and the right place to insert our logic is challenging.

How to find the right elements in here and generalize well?

Parsing through the noise of variations in HTML structures, CSS naming conventions, and general web development is like the Wild West, but despite this, there are ways to gain sanity, namely in testing.

As with most UI testing, we utilize Selenium and its JavaScript bindings. For developers who have had some experience with Selenium, it becomes painfully obvious very early on that speed is not its forte. The testing environment must handle testing functionality across dozens, hundreds, and thousands of sites while maintaining the fast, iterative nature of development. If every test case took an hour to finish, our continuous integration would lose its magical properties (you know, moving fast without breaking things).

Surfing at speed through the codebase, knowing everything will be ok.

Our solution is to distribute testing across multiple computers, which requires the following:

  1. Must run multiple instances of the tests with the same code but on different websites.
  2. Must run all of the tests in parallel.
  3. Must collect all results.

While looking for ways to simplify my work, I discovered that Selenium has a neat feature called Selenium Grid, which allows you to control testing on a remote machine so that the local machine doesn’t have to run the tests itself. You just run the grid on another machine, expose the port, and connect to that port remotely using the settings in your chromedriver configuration.

This was amazing! It meant that we could have machines entirely dedicated to running a bunch of Selenium webdrivers (which consume a considerable amount of RAM) without killing our CI server. Plus, we could easily scale by just spawning more of these machines and more of these webdrivers.
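
For illustration, here’s what pointing a test at a remote grid looks like. This sketch uses Selenium’s Python bindings for brevity (our tests use the JavaScript bindings), and the grid hostname is made up:

from selenium import webdriver

# Hypothetical grid address; the hub listens on port 4444 by default.
GRID_URL = 'http://my-grid-host:4444/wd/hub'

# Ask the grid for a Chrome session instead of launching a local chromedriver.
driver = webdriver.Remote(
    command_executor=GRID_URL,
    desired_capabilities={'browserName': 'chrome'})
driver.get('https://www.example.com')
print(driver.title)
driver.quit()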

Finally! Scalability!

As a result, the design evolved to this:

  1. Set up a Selenium server to run a grid of browsers. Run Chrome.
    1. Package under a Docker image for easy deployment.
    2. Run under an Amazon AWS instance for easy scaling.
  2. The CI server has to be able to send requests to the Grid.
  3. The tests have to run in parallel.
    1. Create a Javascript library to make tests into chunks that can be sent over to the server.
  4. Finally, the results have to be merged together and sent back to the CI server.

For 1, I downloaded selenium-server-standalone-2.53.1.jar from http://selenium-release.storage.googleapis.com/index.html?path=2.53/ and saved it as selenium-server-standalone.jar.

Run hub with:

java -jar selenium-server-standalone.jar \
  -role hub

Run node with:

java -jar selenium-server-standalone.jar -role node \
  -hub http://localhost:4444/grid/register \
  -browser "browserName=chrome,version=52.0,chrome_binary=/usr/bin/google-chrome,platform=LINUX,maxInstances=5"

Now, to connect to the grid from our JavaScript tests, we point the webdriver at the hub’s URL (http://<hub-host>:4444/wd/hub) instead of launching a local browser.

Look out for my next post which will delve deeper into parts 2, 3, and 4 above.

Thanks for Hacking!