GenomeWizard: Autopsy of an Alexa Skill / by Aaron Samuel

Amazon Alexa Philosophy

Amazon, known for its flagship products, Amazon Web Services, a massively dominant retail operation, and being a favorite target of the President of the United States, broke into the voice services sector with its Echo and Dot product lines, beginning with the Echo in Q4 of 2014. As with most things Amazonian, it's a story of success. In the four years since that release, the Alexa Voice Service (AVS) has grown exponentially; it can be found embedded in smart televisions, mobile devices, and a plethora of vendor-specific devices that use AVS as the central processing engine of their voice interface. Welcome to Skynet, kids. I'll be your guide.

The primary selling point of AVS/Alexa, and the reason it succeeded, was that Amazon combined natural language processing, automated speech recognition, and its robust compute cloud into a software-as-a-service offering. This allowed vendors other than Amazon to integrate Alexa capabilities for a more modern, voice-controlled interface. The Alexa Skills Kit (ASK) then extended Amazon's reach even further, to software developers and individuals in general. A brief glance at the skills available in the Alexa application reveals hundreds (if not thousands) of entries from individuals, with applications ranging from daily briefings to games, alongside well-known companies like Domino's, which provides voice-based pizza ordering via its Alexa skill.

 

Skills, Skills, Skills

There is a Cortana Alexa skill; I'm pretty sure that's porno.

 

Preparing to develop ASK services involves having:

 

  • An AWS account (free tier may work for this; I haven't checked)
  • An Amazon Developer account
    • Any legal information pertaining to your business will be requested if you plan to monetize or advertise via Alexa (e.g. organizational and tax structure, EIN, etc.)
  • An installed and configured AWS CLI
  • An Alexa device (Echo, Dot, Show, etc.)
    • This is optional. Amazon provides an Alexa Simulator that works fine for development cycles, and a beta-tester option lets the author invite other users who own an Alexa device to test the skill and provide feedback.
  • A development environment
    • Launch an EC2 instance from an AMI; you can choose any Linux-based distribution you are comfortable with.
    • Install Docker on that instance and start the Docker service.

Launching an EC2 instance from the AWS Console is simple; just follow the prompts provided in the Launch Instance wizard.

 

EC2 Links, How To, etc.:

  • How To Create Key Pair Using AWS CLI
  • How To Launch Instances Using AWS CLI
  • Any T-* type instance should work fine; this will be your CI/CD system.
  • Make sure you assign an EIP to the instance for remote access.
  • Don't forget to create & safely store your PEM key (key pair). 
  • Connect to your instance and configure the development software.
    • Use SSH to connect to the system at its EIP address.
    • Install Docker.
    • We will use the open-source docker-lambda build environment, "lambci/lambda", to provide a consistent testing solution (see the quick sketch after this list).
      • lambci provides different Docker images for different Lambda runtimes; pull the image relevant to your project, e.g. "python2.7", "nodejs8.10", etc.
      • Using these images, you can develop code on the EC2 instance and test how well it will run as a Lambda inside containers that closely mirror the Lambda execution environment.
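
As a quick sanity check, a throwaway handler dropped onto the EC2 box can be exercised through the lambci/lambda image before you ever package it. The file name, image tag, and test event below are just examples:

# lambda_function.py -- throwaway handler used to exercise the docker-lambda image
def lambda_handler(event, context):
    # Echo the test event back so you can confirm the container actually ran your code.
    return {"message": "hello from docker-lambda", "event": event}

# From the directory containing this file, something along the lines of:
#   docker pull lambci/lambda:python2.7
#   docker run --rm -v "$PWD":/var/task lambci/lambda:python2.7 \
#       lambda_function.lambda_handler '{"test": "event"}'
# runs the handler inside a container that closely mirrors the Lambda execution environment.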

 


lambci/lambda

Makes testing your lambda code a breeze.

 

 

Visualizing Your Code

With a firm understanding of AVS/ASK/Alexa SaaS solutions in place, the next logical step is developing the actual Alexa skill. Think of programming an Alexa skill like any other application development process: the first step should always be brainstorming and drafting, writing pseudocode, and visually plotting the application's flow of logic. The modern internet provides several solutions for this task; some are paid, but many are open source. I use the web-based LucidChart for its ease of use and rich feature set (it's also cheap). You can also behave like a human and use pen and paper, if you must.

Genome Wizard

A simple flowchart diagramming the interface of the Genome Wizard Alexa skill

 

The Lambda

Your next set of steps will take place in the AWS Console (or CLI if you're bossy), and concerns creating and configuring your Lambda function. An AWS Lambda function is serverless code that executes on Amazon's compute cloud. Instead of configuring Auto Scaling groups or EC2 instances, you simply package your application as a ZIP and upload it to AWS, where it becomes available for consumption by other services, external or internal.

AWS Lambda supports several programming languages, the most interesting being Python, Node.js, and Go, which opens development up to a very wide group. Lambda is the best-practice endpoint for your ASK service, providing native integration with Alexa. Lambdas can be created in the AWS console UI or from the CLI; as a tip, the UI method provides a wizard that steps you through creating the Lambda and any dependencies needed for execution, development, and testing.

Creating a Lambda in 12 steps from the AWS Console.


 

You'll want to make sure your Lambdas have an IAM role that allows them to write to CloudWatch log groups, so that any error messages can be properly viewed and addressed. If you create your Lambda within the AWS graphical console, the option to create the IAM role for you is provided.

If you execute the steps via the AWS CLI, you'll need to create the IAM role first; the attached policy should look something like this:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "logs:CreateLogGroup",
            "Resource": "arn:aws:logs:us-east-1:ACCOUNT_ID:*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "logs:CreateLogStream",
                "logs:PutLogEvents"
            ],
            "Resource": [
                "arn:aws:logs:us-east-1:ACCOUNT_ID:log-group:/aws/lambda/*:*"
            ]
        }
    ]
}
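
If you'd rather script the role creation than click through the console, a rough boto3 equivalent of those CLI steps looks like this. The role and policy names are placeholders, and the policy document above is assumed to be saved locally as logs-policy.json:

import json
import boto3

iam = boto3.client("iam")

# Trust policy allowing the Lambda service to assume the role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "lambda.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

role = iam.create_role(
    RoleName="genome-wizard-lambda-role",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)

# Attach the CloudWatch Logs policy shown above.
with open("logs-policy.json") as f:
    iam.put_role_policy(
        RoleName="genome-wizard-lambda-role",
        PolicyName="genome-wizard-logs",
        PolicyDocument=f.read(),
    )

# The role ARN is what you hand to `aws lambda create-function --role ...`
print(role["Role"]["Arn"])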

 

The most basic Lambda may be triggered by an HTTP request to its endpoint and return a JSON-formatted "hello world" response. In my case, I used the Alexa Skills Kit to trigger the Lambda. In the end it's all the same: inbound information is processed by the Lambda, and an outbound response is returned to the caller.

The pivotal piece of the Lambda will be Python code that processes the inputs sent by Alexa and returns its outputs to the caller (Alexa), to be converted, in most cases, to speech. This is where your previously configured development environment becomes relevant. You want to develop on a machine similar to the ones that actually host Lambda (at the time of writing, this was relevant) with respect to CPU architecture, kernel version, pre-installed libraries, and other low-level details that directly or indirectly affect binary execution in userland, and therefore the Lambda's functioning. This means that if you are on a Mac, as I am, or a Windows machine, you SHOULD NOT develop and bundle the Lambda code locally. Libraries and dependencies bundled for those operating systems will not be compatible with the running Lambda environments (which are Linux-based). Your Lambda will often fail silently in these situations, and you'll spend much time face-palming.

The concept of Lambda is fairly simple in essence, though a full description is somewhat out of scope for this post. To summarize very briefly, your Lambda code can be as little as one function that returns "hello world". AWS Lambda expects your code to define a handler function, which you can name anything but which defaults to `lambda_handler`. This function is what Lambda executes, and its return value is what is returned to the caller. If `lambda_handler` returns "Hello World", your Lambda will return "Hello World" on execution.

 

Dissecting Genome Wizard Lambda (Python)

The lambda_handler function for GenomeWizard accomplishes all-of-the-things in 8 LOC.
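
The snippet itself boils down to a request-type dispatcher. A minimal sketch of that shape (the actual GenomeWizard source may differ in small details) looks like this:

def lambda_handler(event, context):
    # Debugging breadcrumb: which skill invoked us.
    print("applicationId=" + event["session"]["application"]["applicationId"])
    # Route each Alexa request type to its own handler (defined below).
    if event["request"]["type"] == "LaunchRequest":
        return on_launch(event["request"], event["session"])
    elif event["request"]["type"] == "IntentRequest":
        return on_intent(event["request"], event["session"])
    elif event["request"]["type"] == "SessionEndedRequest":
        return on_session_ended(event["request"], event["session"])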

 

The function arguments, event and context, are created and populated by the triggering service, and are objects you can use freely in your code. As you can see, each is an object carrying important information about the triggering request, which can then be used for further decision making. The event object, for example, changes shape depending on the calling service; in the case of an Alexa skill trigger, it contains a request type key whose value indicates how to respond: launch the service ("LaunchRequest"), operate the service ("IntentRequest"), or end the service ("SessionEndedRequest"). Think of these labels as a conversational state machine for your Alexa skill; in fact, that's exactly what this is, a giant event-based state machine. A closer look at the snippet above reveals the on_ functions it dispatches to. Let's look at those functions next.

For each event, we have defined a function to execute; we've also added some debugging statements.

The get_welcome_response and handle_session_end_request functions handle opening and closing sessions.
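
Sketched out, those handlers look roughly like this. The prompt wording is illustrative, and build_response / build_speechlet_response are the helpers covered under "Building Responses" further down:

def on_launch(launch_request, session):
    print("on_launch requestId=" + launch_request["requestId"])
    return get_welcome_response()

def on_session_ended(session_ended_request, session):
    print("on_session_ended requestId=" + session_ended_request["requestId"])
    return handle_session_end_request()

def get_welcome_response():
    # Opening prompt; keep the session open and reprompt if the user goes quiet.
    speech = "Welcome to Genome Wizard. Ask me to describe a gene."
    reprompt = "Try saying: describe the gene TP53."
    return build_response({}, build_speechlet_response(
        "Genome Wizard", speech, reprompt, should_end_session=False))

def handle_session_end_request():
    # Closing prompt; end the session cleanly.
    speech = "Thanks for using Genome Wizard. Goodbye."
    return build_response({}, build_speechlet_response(
        "Genome Wizard", speech, None, should_end_session=True))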

 

As you can see, on_launch dispatches to get_welcome_response, and on_session_ended dispatches to handle_session_end_request. on_intent dispatches to a number of functions based on any number of intents, so let's focus on the on_intent function for a moment.
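
A sketch of that dispatcher follows. query_entrez_summary is described next; the other two Entrez helpers are placeholder names standing in for the handlers behind LocateGene and ProteinsTranslatedFor:

def on_intent(intent_request, session):
    intent = intent_request["intent"]
    intent_name = intent["name"]

    # Custom intents defined in the interaction model; the "Gene" slot carries
    # the gene label Alexa extracted from the user's utterance.
    if intent_name == "DescribeGene":
        return query_entrez_summary(intent["slots"]["Gene"]["value"])
    elif intent_name == "LocateGene":
        return query_entrez_location(intent["slots"]["Gene"]["value"])
    elif intent_name == "ProteinsTranslatedFor":
        return query_entrez_proteins(intent["slots"]["Gene"]["value"])
    # Built-in Amazon intents.
    elif intent_name == "AMAZON.HelpIntent":
        return get_welcome_response()
    elif intent_name in ("AMAZON.CancelIntent", "AMAZON.StopIntent"):
        return handle_session_end_request()
    raise ValueError("Invalid intent: " + intent_name)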

 

In ASK, an intent is a "user aim"; intents set states, some of which are built in and some user defined. Built-in intents are prefixed with "AMAZON." (e.g. AMAZON.HelpIntent), whereas user-defined intents can be any compliant string of characters. AMAZON.HelpIntent is invoked when the user asks for help within a session, while AMAZON.CancelIntent is another built-in state that occurs when the user requests to end the session gracefully. User-defined intents are the meat and guts of your skill; they handle inputs and provide outputs back to their dispatcher (the calling function, in this case on_intent). They map to the intents defined in the developer.amazon.com console for your Alexa skill model. In the snippet above, you will see three user-defined intents, DescribeGene, LocateGene, and ProteinsTranslatedFor, and three separate functions dispatched to, one per intent. Let's take a look at one of these more specific functions.

query_entrez_summary, as the name implies, is a function that queries the Entrez Gene database for a user-provided gene_label parameter (converted to text by Amazon's automated speech recognition), validates the data, provides clean responses for different conditions, and prettifies the result in preparation for sending a narrated summary (spoken by Alexa) back to the requesting end user.
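
A rough sketch of that kind of function, here using Biopython's Entrez module (the real GenomeWizard implementation, validation, and phrasing differ; the response-builder helpers are shown under "Building Responses" below):

from Bio import Entrez

Entrez.email = "you@example.com"  # NCBI asks for a contact address

def query_entrez_summary(gene_label):
    # Search the Entrez Gene database for the spoken gene symbol, restricted to humans.
    search = Entrez.read(Entrez.esearch(
        db="gene", term="{0}[sym] AND Homo sapiens[orgn]".format(gene_label)))
    if not search["IdList"]:
        speech = "I couldn't find a gene called {0}. Please try again.".format(gene_label)
        return build_response({}, build_speechlet_response(
            "Genome Wizard", speech, "Try another gene label.", False))

    # Pull the summary record for the top hit and tidy it up for narration.
    summary = Entrez.read(Entrez.esummary(db="gene", id=search["IdList"][0]))
    doc = summary["DocumentSummarySet"]["DocumentSummary"][0]
    speech = "{0}: {1}".format(doc["Name"], doc["Summary"] or "no summary is available.")
    return build_response({}, build_speechlet_response(
        "Genome Wizard", speech, None, True))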

 

Think of it like this:

  • The application was launched with a LaunchRequest, a session was started, and the application waited for further input.
  • The application received a DescribeGene intent.
  • The intent was matched and dispatched to the query_entrez_summary() function, which accepts the gene name as its only parameter.
  • A query against the Entrez Gene database is run and validated, the output is formatted, and the response data is returned to the on_intent caller.
  • The on_intent caller returns the data to Alexa.
  • Alexa uses text-to-speech to narrate the response to the end user.
  • The skill waits for another intent request.

A response suitable for an Alexa skill is a specially formatted JSON document that is fairly easy to build programmatically. At a minimum, three keys are generally required within this document.

The outputSpeech key holds the data Alexa will recite to the end user. It can be of type PlainText or the more advanced and configurable SSML type, which introduces a tag-based markup language for controlling Alexa's voice parameters while reciting the response.

The card key is used by Alexa to determine what to render, and how, on devices with a display, such as Fire TV and the Echo Show. A Simple type card allows you to send plain text strings to the display interface; more complex options allow you to send image data or request user permissions that may be required by other intents during initialization of your ASK service. In my case, the Simple type was sufficient.

The reprompt key is used by Alexa when the user's voice request is not understood properly, providing a convenient route for handling user errors by re-prompting for data.


Building Responses

Building the JSON response programmatically is just a matter of writing a helper function or two for the task.
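
One common shape for those helpers, matching the three keys described above plus the outer envelope Alexa expects back from the Lambda (a sketch in the style of the standard ASK Python samples):

def build_speechlet_response(title, speech_text, reprompt_text, should_end_session):
    # outputSpeech, card, and reprompt, as discussed above.
    return {
        "outputSpeech": {"type": "PlainText", "text": speech_text},
        "card": {"type": "Simple", "title": title, "content": speech_text},
        "reprompt": {"outputSpeech": {"type": "PlainText", "text": reprompt_text}},
        "shouldEndSession": should_end_session,
    }

def build_response(session_attributes, speechlet_response):
    # Wrap the speechlet in the response envelope returned to Alexa.
    return {
        "version": "1.0",
        "sessionAttributes": session_attributes,
        "response": speechlet_response,
    }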

 

My particular ASK service, Genome Wizard, can also handle two other user-defined intents, LocateGene and ProteinsTranslatedFor. In a perfect world, user-defined intents should describe what they do. ProteinsTranslatedFor, for instance, takes the gene_label parameter, determines whether the gene translates to any functional proteins, and lists which proteins if so. LocateGene takes the gene_label parameter and determines the chromosome, arm, and position of the gene on that arm. As a side note, Genome Wizard currently supports only Homo sapiens.

 


BRCA2 Gene Diagram

Future releases will change the card type to a more advanced kind, allowing me to generate gene diagrams to send to display-friendly devices along with the gene data.

 

Interaction Model Development

Remarkably, I've said very little so far about the actual Alexa interaction model. That's because, per best practices, you haven't made one yet. Building the interaction model is easy if you followed the prescribed order of operations: it's simply a matter of filling in the data you expect your Lambda to handle. Much of this will sound familiar, as you've already laid the underlying framework for it in the Lambda described above.

Head over to the developer console, and create a new skill. Once created, navigate to the build tab of the developer console for your skill.

There are five key concepts to understand for interaction model development:

  1. Invocation
  2. Intents
  3. Utterances
  4. Slots
  5. Slot Types

I'll introduce them in terms of how they relate to one another.

Invocation is simply the phrase that is recognized as a launch request for your ASK service; in my case it is "GenomeWizard", which allows phrases like "Alexa, open genome wizard" to start a session with the service.


Invocation

User: "Alexa, open genomewizard"

Intents: We've already discussed the Lambda side of an intent; think of the interaction model side as the "Alexa conversational side". Here you'll see I have three custom intents defined, DescribeGene, LocateGene, and ProteinsTranslatedFor, which should look familiar.


Intents

Here we see our defined intents, with the ability to add more if the application gains new features.

Utterances and Slots: Each of your intents is recognized by a list of utterances, which literally means the things the user says to Alexa that will be recognized by the natural language processor and passed to the Lambda as text. So, want to take a guess what a slot is? That's right: it's a keyword within the utterance that is "important". Think of it like a streaming text editor, no different from sed searching for a string in text, extracting that string, and doing something else with it. Utterances and slots are, in a sense, a regular expression executed on voice input, extracting what your service needs from the statement.

Utterances and Slots

Here I have defined three utterances to support the DescribeGene intent. Each utterance uses the same slot as the pivotal variable for processing, "Gene".

User: "lookup the gene labeled TP53"

Slot Types: Similar to intents, there are Amazon-defined built-in slot types and user-defined slot types. A built-in slot may be of type "AMAZON.Address" and will do very well at yanking addresses out of utterances. In my case, I defined a custom "Gene" slot type, which is, as the name implies, any gene label. What we are really doing here is training a learning NLP model; as such, we only need to provide our custom slot type with examples of some gene names, and it should be able to generalize from that point. Otherwise, in a non-learning setup, you'd need a way to enumerate the 20,000 or so gene labels and track any changes that occur in them.
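
Pulled together, the JSON editor view of a model along these lines looks roughly like the following. The invocation name, slot-type name, and sample values here are illustrative:

{
    "interactionModel": {
        "languageModel": {
            "invocationName": "genome wizard",
            "intents": [
                {
                    "name": "DescribeGene",
                    "slots": [{ "name": "Gene", "type": "GENE_LABEL" }],
                    "samples": [
                        "describe the gene {Gene}",
                        "lookup the gene labeled {Gene}",
                        "tell me about {Gene}"
                    ]
                }
            ],
            "types": [
                {
                    "name": "GENE_LABEL",
                    "values": [
                        { "name": { "value": "TP53" } },
                        { "name": { "value": "BRCA2" } }
                    ]
                }
            ]
        }
    }
}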


Slot Types

A list of official gene labels, used to teach the skill what to look for in user requests. Only a few sample values are required to get the service learning.

 

Interfaces & Other Settings

Mostly beyond the scope of this write-up; however, within the interface settings you can enable different features, such as display support for devices with a screen, like the Echo Show or Fire TV. This allows you to customize the user experience dynamically based on the device being used, and lets your application specialize for specific devices.

Interface Settings

You'll also need to set up the endpoint, which involves fetching your skill ID from the console and having your AWS Lambda ARN available.


Endpoint Settings

Once you've iterated through the tabs and confirmed your primary settings, you simply Save & Build the interaction model; this can take up to 10 minutes depending on the size of the model.


Save & Build

Use the buttons in the upper left of the build tab to save or build, and watch for errors, which pop up in the lower right-hand portion of the UI.

 

Testing Model

As mentioned in the prerequisites section, it's convenient to have an Alexa-based testing device but not necessary; the Test tab, to the right of the Build tab, provides an Alexa simulator fit for testing your ASK service. The simulator lets you view device logs and request/response data, and accepts input via text (keyboard) or speech on systems with a microphone. It's a full-service test solution. You should smoke test the ASK service, confirming that launching, all intents, and closing a session function as expected. Your Lambda will have a CloudWatch log group named after itself containing logs of its executions; keeping an eye on the CloudWatch logs as well as the simulator provides a robust solution for issue resolution.


Alexa Simulator

You'll need to enable the test feature, which requires the model to have been built at least once; you won't be able to enable testing until that's done.

 

Deploying to Skill Store

To deploy your skill to the skill store, you'll need to complete the Distribution tab as well as the Certification tab within the developer console. Again, if monetizing, have your company data at the ready.


Distribution

Information about the application, a summary, and small and large icons will be requested here; this is primarily what will be displayed in the skill catalogs.


Certification

Amazon runs validation and functional tests on your skill here. If it passes, you can submit it for a more intense set of checks performed by Amazon. Amazon will work with you to resolve any outstanding issues; after that, the final step is pushing the service into the Amazon store catalogs.

Using the simulator to test GenomeWizard