In January of this year I thought it would be an interesting exercise to try to build an anonymous live chat webapp for my site using AWS serverless technologies.

It worked, up until the point where real people started using it. Then everything broke.

Let me back up a bit, and explain what exactly this thing was.

If you want to jump straight to the code, here's the Conclusion.

Serverless Crash Course

In the early 2000s, in order to host a web app you'd typically need to put it on a server running special software (like Apache, PHP, and MySQL). Some clever people quickly realized that there was money to be made in hosting servers for other people, so the concept of the VPS (Virtual Private Server) was born.

With a VPS, you'd pay a monthly fee to a VPS provider (like DigitalOcean or AWS EC2 today) and in turn they'd give you access to one of their virtual servers. It was virtual because it wasn't a physical machine in a data center - instead it ran on a real, powerful computer alongside dozens or hundreds of other virtual servers. As long as you paid the monthly rent, you could have as many VPS's as you wanted with whatever specs you wanted. Plus, you'd never have to deal with tangled wires, electric companies, or ISPs to make sure your servers stayed up.

At some point down the line someone realized that, hey, like 90% of these VPS's are just web servers, so why don't we just offer a web server service instead? This way, instead of individually managing a fleet of virtual servers, you could just focus on your website/webapp. You could use something like cPanel to manage the files, databases, and domains connected to the server. You could even install pre-built server software packages like WordPress with a few clicks. You'd still have to pay a monthly fee, but it would save you from the headache of dealing with low level sysadmin details like configuring database software or controlling firewalls. If you've ever paid a website to run a Minecraft server for you, this was basically the same deal.

Soon enough, there were Infrastructure-as-a-Service and Platform-as-a-Service startups popping up everywhere. Need a database but don't know how to manage it? Need to send email or text messages? User authentication? Storage? Analytics? All of these services offered to replace the software you'd need to build yourself with something that already existed and that was managed by professionals.

None other than Amazon exemplified the idea of "let someone else deal with it" with its Web Services division. AWS offered services for practically everything - at the time of this blog post there are 130+ services. It's the Build-a-Bear of web software. This is what the "cloud" is. As long as you've got the money, someone else will be willing to deal with it for dirt cheap thanks to economies of scale.

It should be noted that Microsoft and Google are also heavy competitors in the cloud space at this point in time, but AWS was the one to kick it off in 2006 with S3. I'll keep saying AWS because that's what I'm most familiar with, but keep in mind that its competitors have very similar services to those I'll mention by name.

In 2014, Amazon released AWS Lambda which was meant to be the "glue" in the AWS Build-a-Bear. See, a database or API by itself isn't useful unless it's used by something else. Lambda was that missing piece that would go on to link all of these services together.

With Lambda, you could write Lambda functions that would get called in response to events originating from other AWS services. For example, when you talk to a custom skill on Amazon Alexa, the Alexa service calls that skill's Lambda function with an event object containing information about that event (like a transcription of what you said, what time you said it, and which Alexa you said it from). From there your Lambda function could use the regular AWS APIs and any other libraries/frameworks you uploaded to process the event.
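For a concrete picture of what that looks like, here's a minimal Node.js handler sketch (a generic illustration on my part, not code from the chat):

// A Lambda handler is just an exported function that receives the event
// object and returns (or resolves to) a response.
exports.handler = async (event) => {
  console.log("received event:", JSON.stringify(event)); // shows up in CloudWatch Logs
  return { statusCode: 200, body: "processed" };
};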

Most AWS services charge against usage, and Lambda was no exception. As long as you weren't dealing with millions of requests a month, you could operate a Lambda function for pennies! Contrast this to the \$5-\$45 / month you'd need to pay for a single VPS.

Again, at some point someone realized it was possible to run a full-blown web app on AWS Lambda taking advantage of other AWS services, without ever having to pay for a server. Hence, serverless.

Spoiler alert: there's still a server involved. You just don't have to pay for it as much anymore. That monthly flat fee would turn into a scalable usage fee, so if your service didn't get used much it wouldn't cost more than a few dollars a month to operate.

Live Chat Crash Course

The live chat used three AWS services: API Gateway (in its newer WebSocket flavor) to hold connections open, Lambda to run the backend logic, and DynamoDB to store accounts and messages.

Setting up API Gateway and DynamoDB was easy enough through AWS's console. But that was to be expected; the Lambda was where the magic would happen.

And so, slowly, through brute force and sheer will, I built out the Lambda function over a few days. There were quite a lot of pitfalls that I'd like to talk about, but before I do that I want to describe the live chat "protocol" I ended up building so you can get a sense of what the function actually did.

Every Lambda instance needs a handler, which is the function AWS Lambda executes whenever a request comes in. Here's the abbreviated handler for my chat:

let handler = async event => {
  // ws lives in module scope, so a warm container can reuse the same client
  if (!ws) {
    ws = new AWS.ApiGatewayManagementApi(...);
  }
  let body = null;
  try {
    body = JSON.parse(event.body);
    if (!body.action || !body.uuid) throw new Error("missing required keys");
    let args = {
      event,
      body,
      uuid: body.uuid,
      connection: event.requestContext.connectionId,
      ip: event.requestContext.identity.sourceIp,
      requestID: event.requestContext.requestId
    };
    // the "andi" uuid is reserved for the admin; anything else is a regular user
    if (body.uuid == "andi") {
      if (!(body.auth == PASSWORD)) throw new Error("incorrect auth");
      return module.exports.adminHandler(args);
    } else {
      return module.exports.userHandler(args);
    }
  } catch (err) {
    return JSONError();
  }
};

So, in this protocol, every request has to be in the form of valid JSON and contain a uuid and action key. Depending on what the uuid is and if the auth key matches the authentication password, the request is either sent to the userHandler or the adminHandler (which are both ultimately just other functions).
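From the browser's side, making a request is just a matter of serializing that JSON over the WebSocket connection. Here's a rough client-side sketch (the endpoint is a placeholder, and the uuid is the same fake one used in the test events later):

// Hypothetical client-side sketch: open the API Gateway WebSocket endpoint
// and send a payload with the required uuid and action keys.
const socket = new WebSocket("wss://example.execute-api.us-east-1.amazonaws.com/dev");
socket.onopen = () => {
  socket.send(JSON.stringify({ uuid: "bababooey", action: "hello" }));
};
socket.onmessage = (msg) => console.log("reply:", msg.data);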

When you use the live chat, your request is sent to the userHandler for processing. When I use it, mine goes through adminHandler. The two are very similar, so I'll only cover the user part.

Here's an abbreviated version of userHandler:

let userHandler = async ({ event, body, uuid, connection, ip, requestID }) => {
  switch (body.action) {
    case "hello":
      // ...
      return JSONReply("lastConnected", lastConnected);
    case "register":
      // ...
      return JSONReply("welcome");
    case "ping":
      // ...
      return JSONReply("pong");
    case "list":
      return JSONReply("history", ...);
    case "send":
      // ...
      return sent ? JSONReply("sent") : JSONReply("sendError");
  }
};

adminHandler is not much different; it just has different cases to support different actions. Speaking of which, let's talk about what they do.

DynamoDB Item Structure

Here's a quote:

Show me your code and conceal your data structures, and I shall continue to be mystified. Show me your data structures, and I won’t usually need your code; it’ll be obvious. - Eric Raymond

Which itself is an updated version of:

Show me your flowcharts and conceal your tables, and I shall continue to be mystified. Show me your tables, and I won’t usually need your flowcharts; they’ll be obvious. - Fred Brooks

I have no idea if either of these are true. I just needed something to introduce this section.

For your reference, here's what a fake account looks like in my DynamoDB table:

{
  "uuid": "b445629b-4061-419a-b3a1-caf486ab71a1",
  "timestamp": 0,
  "nickname": "baba",
  "email": "booey@email.com",
  "ip": "127.0.0.1",
  "connection": "testConnection",
  "lastConnected": 0,
  "lastRequestsServed": {
    "wrapperName": "Set",
    "values": [
      "null"
    ],
    "type": "String"
  },
  "updates": 1
}

Note that for this table, the HASH key is uuid and the RANGE is timestamp. Having a timestamp of 0 means that this is an account type of item, whereas having a timestamp above 0 makes it a message type (in that case, the timestamp would correspond to what time that message was sent).

I expect that last paragraph to only make sense to you if you've ever used DynamoDB. If not, don't worry. It always sounds like gibberish until it doesn't.
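If it helps, here's a hedged sketch of what that key scheme buys you: pulling every message in a conversation is a single query on the uuid (HASH) plus a condition on the timestamp (RANGE). The table name is the one you'll see later in this post; the helper itself is my own illustration.

// Fetch every message item in one conversation: same uuid, timestamp above 0.
const AWS = require("aws-sdk");
const ddb = new AWS.DynamoDB.DocumentClient();

let getConversation = async (uuid) =>
  ddb
    .query({
      TableName: "DuroLiveChat",
      KeyConditionExpression: "#u = :uuid AND #t > :zero",
      ExpressionAttributeNames: { "#u": "uuid", "#t": "timestamp" },
      ExpressionAttributeValues: { ":uuid": uuid, ":zero": 0 },
    })
    .promise();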

And here's what a message looks like:

{
  "msg": "testing message",
  "type": "to",
  "uuid": "b445629b-4061-419a-b3a1-caf486ab71a1",
  "timestamp": 1559200025231
}

Finally, here's the special admin account:

{
  "uuid": "andi",
  "timestamp": 0
  "ip": "127.0.0.2",
  "lastConnected": 1559200023104,
  "unread": {
    "wrapperName": "Set",
    "values": [
      "e1b3b0b4-1cd2-4a77-a74a-6df7e0bd820c",
      "null"
    ],
    "type": "String"
  },
  "connection": "testConnection2",
  "conversations": {
    "e1b3b0b4-1cd2-4a77-a74a-6df7e0bd820c": "baba-1",
    "919e2811-e502-466c-ac47-966a62a4deeb": "baba-2",
    "c781c0c3-ac19-4550-96fe-c8a00268a006": "baba-3"
  }
}

Two unique keys here, conversations and unread. unread is a String Set that contains the UUIDs of any accounts I have unread messages from. Whenever a user sends me a message, their UUID gets added to that set. conversations offers a way to "match the name to the face", where the face is the UUID. This way I know the nickname of every account in the unread list without having to query each account individually. This saves time and prevents unnecessary queries to DynamoDB (\$\$\$).

I don't expect most of these items or keys to make sense to you right now, but hopefully they inform you of what the different actions are doing to the table.
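As an example of how these keys get touched, here's a plausible sketch of what an operation like markUUIDUnread boils down to - adding or removing a UUID from that unread string set. The real implementation in the repo may differ in its details.

// A sketch, not necessarily the real code: add or remove a user's uuid from
// the admin account's "unread" string set.
const AWS = require("aws-sdk");
const ddc = new AWS.DynamoDB.DocumentClient();

let markUUIDUnread = async (uuid, unread) =>
  ddc
    .update({
      TableName: "DuroLiveChat",
      Key: { uuid: "andi", timestamp: 0 },
      UpdateExpression: unread ? "ADD #unread :uuid" : "DELETE #unread :uuid",
      ExpressionAttributeNames: { "#unread": "unread" },
      ExpressionAttributeValues: { ":uuid": ddc.createSet([uuid]) },
    })
    .promise();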

userHandler

  case "send":
    if (!body.msg) return JSONError();
    let unique = await utils.isUniqueRequest(uuid, requestID);
    if (!unique) return JSONReply("duplicate");
    await utils.addMessageToConversation({
        from: uuid,
        to: "andi",
        msg: body.msg
    });
    await utils.markUUIDUnread(body.uuid, true);
    let sent = await utils.sendResponse({
        from: uuid,
        to: "andi",
        msg: body.msg,
        ws
    });
    return sent ? JSONReply("sent") : JSONReply("sendError");

And here are the steps it takes, in English:

  1. Throw out the request if there's no msg to send.
  2. Check if the request is unique to solve the Idempotency Pitfall.
  3. Insert the message as an item in the DynamoDB table.
  4. Mark the UUID unread - add it to the admin account's unread list so that the next time I log in I'll see that I have an unread message from this UUID.
  5. Try to send the message through the connection of the recipient with utils.sendResponse. In this case, this method looks up my account, picks out my connection ID, and attempts to send the new message through my WebSocket connection using the API Gateway Management API.

If all works out right, the new message gets added to the database as an item, updates my admin account so I know about it, and sends a reply to my live connection.

Boy, did it take a while to get it to all work out right.

Take Two

The first version of the live chat (I'll call it the legacy version from now on) was built using "test-driven development": I wrote code in the AWS inline code editor, saved it, and tried sending random events at it to see if it would work. I hit the Test Events Suck With JSON Pitfall here, which was incredibly annoying to figure out, but at least I only had to fix it once.

As you can imagine, this move-fast-and-break-things attitude is probably why things broke.

To be fair, I wouldn't have been able to succeed quite as well as I did the second time around if I hadn't had the legacy attempt to build on. I learned a ton the first time, and that experience allowed me to deliver a more professional solution that avoids all the pitfalls I discovered along the way.

I decided to do things properly the second time around. I built unit tests & integration tests, for the first time on a serious project that would be put in production.

It took so much longer! But as a reward, I'm much more confident in the codebase now. And for any new feature I need to support, I only need to add a few new tests to verify everything. In fact, as I solved the last few pitfalls I ran into, the test suite proved itself useful time and time again, because every once in a while a minor change I made would break everything. Being able to catch this before deploying the code to AWS was such a godsend.

Unit Testing

I had been trying to get into unit testing for a long time, but it only started clicking when I started writing tests for this existing codebase. I had previously read a bunch of tutorials and checked out a few demos, and they all sucked for one simple reason:

function add(a, b) {
  return a + b;
}

test("it adds five together correctly", () => {
  let sum = add(5, 5);
  expect(sum).toEqual(10);
});

...is NOT what real, production web app code looks like. It's the same problem with "Hello World" examples for new programming languages. Like, what is print("Hello world!") supposed to tell me? How do I talk to a database with this? How do I send an HTTP request? How do I save something to disk or read a configuration file?

Actually, more importantly, how do I NOT do all these things? Because when I'm unit testing my code, I don't actually want it to hit a database or website. That's the whole point of unit testing, I gather. You just fake everything and check to make sure the code is calling all the fake stuff correctly, so that in the real world it'll do the same thing with all the real stuff. But now you have to figure out how to turn your "real" code into "fake" code without ever actually changing the code.

But I continued in my testing journey, armed with Jest and dozens of Medium tutorials. Soon enough, my questions were answered with the almighty jest.fn().

I'm not going to go too in depth with Jest unit testing, but the idea is something like this:

// "db" stands in for whatever database-access module the app uses
function registerAccount(username, password) {
    if (db.accountExists(username)) return false;
    db.query(...);
    return true;
};

test("it registers a new account correctly", () => {
    db.accountExists = jest.fn(() => false);
    expect(registerAccount("test", "password")).toBeTruthy();
});

test("it fails to register an existing account", () => {
    db.accountExists = jest.fn(() => true);
    expect(registerAccount("test", "password")).toBeFalsy();
});

This way you could fake external functions so they would return the values you want to test your code against. This is also great for handling errors your code could run up against that wouldn't usually occur in normal practice. You can just fake the error to make sure your code deals with it correctly.
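Building on the contrived registerAccount example above, you can even make the fake throw to exercise an error path that would be hard to trigger for real:

// A contrived sketch: force db.query to throw so the error path gets exercised.
test("it surfaces database errors", () => {
  db.accountExists = jest.fn(() => false);
  db.query = jest.fn(() => {
    throw new Error("connection refused");
  });
  expect(() => registerAccount("test", "password")).toThrow("connection refused");
});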

Of course, it's important that those external functions do their job correctly too. For me, those external functions made up the utils suite.

JSONReply and JSONError were just helper methods for constructing responses, so I didn't bother testing them (I'm sure this will come back to bite me eventually).

The next two, dynamo and DynamoDocumentClient, were critical to solving the Fake Database Pitfall.

The remaining functions all manipulated the database in some way, so I built integration tests for them.

Integration Tests

The difference between a unit test and an integration test is that a unit test tests a unit of code (like a function), while an integration test tests a system (composed of units).

To build off the previous example, a unit test would check if the registerAccount() method returns the right value, while an integration test would run the method and then check the database to see if the new entry was created correctly. This way you're not just confident that "the method works", but you're also confident that "this part of the service works." Which is a fantastic thing to be confident about when delivering a service.

In my case, I was interested in testing all the utils functions to make sure they would create, read and update the correct items in DynamoDB when they were called. I would do this by having them send their queries to a local DynamoDB server, and then issuing read queries to the server from my tests to check to see if the resulting changes were the ones I intended.

After the initial setup, writing the tests wasn't all that bad. Here's one for markUUIDUnread:

it("adds the uuid to the unread list when unread is true", async () => {
  const testUUID = "bababooey-markUUIDUnread";
  let oldAndiItem = await utils.andiItem();
  await utils.markUUIDUnread(testUUID, true);
  let newAndiItem = await utils.andiItem();
  expect(newAndiItem).not.toEqual(oldAndiItem);
  expect(newAndiItem.unread.values).toEqual(expect.arrayContaining([testUUID]));
});

Reads more or less like English. Simple, concise, and reliable. Just like it should be.

Pitfalls

Okay, I've been mentioning these the whole article. Let's open this can of worms. Here's every pitfall I ran into building the live chat, in the order I ran into it.

Normal Databases Suck With Lambda Pitfall

This one I knew about already from previous experience, but figured it would be worth mentioning to further explain why I chose DynamoDB.

Okay, remember VPS's from the serverless crash course? Turns out Lambdas aren't that different from them. In fact, whenever you call a Lambda for the first time, Amazon essentially spawns a VPS (they call it a container) for that Lambda to run on. Once it finishes running, Amazon keeps that container around for the next few minutes in case another request comes in.

When a container exists for the Lambda, it's said to be "hot" because it can immediately resolve incoming requests. When there are no containers left (because there were no incoming requests for a long time), it's "cold."

You might be thinking that it takes a lot of time for Amazon to spin up a container when a function is cold. And indeed, it does take a lot more time to go from cold to hot, but that "startup cost" is only incurred when the function is cold. Depending on how hefty the function is, the startup cost might range from a few milliseconds to a few seconds. To get around this and keep a function perpetually hot, some people set up timers (like with AWS CloudWatch) that send a sample request to the Lambda every 5 minutes to ensure it's hot and ready to serve any real requests. This way, even if nobody has used the Lambda for months, it'll still be ready the millisecond someone tries. Of course, you still have to pay the usage cost for all those fake requests, so it may not always be a good idea.
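If you go the keep-warm route, the handler also has to recognize those pings and bail out early. A sketch (assuming the pings come from a CloudWatch scheduled rule, which sets the event's source to "aws.events"):

// Treat scheduled CloudWatch events as keep-warm pings and return before
// doing any real work.
exports.handler = async (event) => {
  if (event.source === "aws.events") {
    return { statusCode: 200, body: "staying warm" };
  }
  // a real request would get handled here
  return { statusCode: 200, body: "real work done" };
};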

So, to get to the point, say you don't want to use DynamoDB. Maybe you want to use AWS RDS (their other database service), which lets you use RDBMS systems you're used to like MySQL or PostgreSQL. In order to do this, you have to put your Lambda function in a VPC (Virtual Private Cloud) with your RDS server. There's no getting around this; you need your Lambda to be in a VPC to access RDS. And that wouldn't be a problem if it didn't mean three (three!) very bad things...

  1. In order to join the VPC, your Lambda container needs to be assigned an IP address through an ENI (Elastic Network Interface). So now, not only does it need to boot up, it also needs to resolve an IP address and connect to a private network before it can go hot and serve your request. Considering that Lambda bills in hundred-millisecond increments, this not only costs far more than regular functions but also takes up an incredible amount of time. For a real-world example, I've sometimes had Lambdas in VPCs take up to 15 seconds to respond to my requests (to be fair, sometimes it finished in 2-3 seconds too. Lambda is unpredictable). I'm sure this can get better, but I mean...how much better could it get? The fact is there's a lot of hefty setup that needs to take place to join a VPC.

  2. Once you've joined the VPC, you lose access to the Internet. Jeff really screwed you on this one. If you're lucky, you only need access to your database and have no 3rd party APIs you need to talk to. But, if you really need to, you can restore Internet access to your VPC by buying an AWS NAT Gateway for....\$0.045 / hour. Yes, per hour. Napkin math: $0.045 * 24 hours * 30 days = $32.4 / month. You're actually better off just paying for a server if you're not getting a ton of traffic already.

  3. Connection pooling isn't Lambda compatible. This one actually isn't that bad unless you intend for your service to get popular. The general premise of connection pooling is to set up only one connection to the database, and then reuse that connection for any queries that need executing. This way, instead of having to connect to the database every time you want to run a query, you can reuse the pooled connection (there's a sketch of the per-container version of this pattern right after this list). Well, because of how Lambda scales horizontally, you instead get a thousand Lambda containers with a thousand individual connections to your singular RDS server. This would not be a problem if each database didn't also have a limit on the maximum number of connections. I think the limit for Postgres by default is 100, and although you can change it, there's only so many connections one server can support before performance suffers. You can get around this, ironically, with another server - you can use something like pgBouncer to do the connection pooling for you, and have all your Lambdas hit your pgBouncer server instead. But at this point you've introduced a server into your serverless environment and everything stopped making sense a few hours ago anyways.
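To be clear about how that per-container reuse works, here's a generic sketch (not my code, and it assumes the pg driver and a DATABASE_URL environment variable): each warm container keeps one connection open in module scope and reuses it across invocations, but containers can't share connections, which is exactly how you end up with a thousand of them.

// The client lives in module scope, so a warm container reuses it; every new
// container, however, opens its own connection to the database.
const { Client } = require("pg");
let client = null;

exports.handler = async (event) => {
  if (!client) {
    client = new Client({ connectionString: process.env.DATABASE_URL });
    await client.connect();
  }
  const result = await client.query("SELECT now()");
  return { statusCode: 200, body: JSON.stringify(result.rows[0]) };
};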

So yeah, normal databases suck with Lambda. That's why I prefer to stick with Dynamo for ultra low cost projects. For something that I know will require SQL in the future? Probably best to stick with a server, and then break that down into smaller Lambdas once you have an existing codebase to inspect.

I should mention that there's a new RDS addition called Aurora Serverless which was billed as the solution to this entire pitfall. It's Postgres and MySQL compatible, and it only costs...\$0.041 an hour for the cheapest MySQL option. Now, \$29.52 / month isn't an insane amount of money to pay for something like this, but again, think about whether just using a server would be a better idea at this point. Serverless requires a lot of special consideration and practical experience before it really clicks, and if you think it's not any better than paying for a server, go for the server instead. Don't feel bad! Amazon's EC2 and Elastic Beanstalk teams will happily take your money if serverless isn't your cup of tea.

Test Events Suck With JSON Pitfall

Few things are more annoying than having some weird error that keeps popping up regardless of what you do to the code. In this case, I got to the point where the majority of my code was gone and I still kept getting a dumb JSON error!

One of the Lambda console's features is something called "Test Events" where you can build your own JSON event and have Lambda invoke your function with that as its event. For example, here's a test event to make sure the hello action works right:

{
  "requestContext": {
    "domainName": "58yojgvxyi.execute-api.us-east-1.amazonaws.com",
    "stage": "dev",
    "connectionId": "testConnection",
    "identity": {
      "sourceIp": "test.ip"
    }
  },
  "body": "{ "action": "hello", "uuid": "bababooey"}"
}

Maybe you've spotted the error already, but if not, here's the problem:

  "body": "{ "action": "hello", "uuid": "bababooey"}"

Turns out you need backslashes:

  "body": "{ \"action\": \"hello\", \"uuid\": \"bababooey\"}"

Or else Lambda will think you meant to set body to "{ \".
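One way to sidestep the escaping entirely (a suggestion on my part, not part of the original workflow) is to build the event in Node and let JSON.stringify add the backslashes for you:

// Build the inner body as an object; JSON.stringify handles the escaping.
const body = JSON.stringify({ action: "hello", uuid: "bababooey" });
const testEvent = {
  requestContext: {
    domainName: "58yojgvxyi.execute-api.us-east-1.amazonaws.com",
    stage: "dev",
    connectionId: "testConnection",
    identity: { sourceIp: "test.ip" }
  },
  body
};
console.log(JSON.stringify(testEvent, null, 2)); // paste this into the test event editor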

To be fair, this isn't a Lambda-centric problem, but it was something I struggled to fix for hours. So I'm putting it in here in the hopes that it saves someone else.

Jest Has Weird Setup Files Pitfall

When you configure jest in your package.json, there's a neat key called setupFiles you can set to a group of files that get run before any of your tests do. I set it to setup.js. Inside setup.js, you get access to the global object which all of your tests subsequently have access to. So I defined a global call method:

// "handler" here is the exported Lambda handler from base.js (path assumed)
const { handler } = require("../base");

let call = async json =>
  await handler({
    requestContext: {
      domainName: "58yojgvxyi.execute-api.us-east-1.amazonaws.com",
      stage: "dev",
      connectionId: "testConnection",
      identity: {
        sourceIp: "127.0.0.1"
      },
      requestId: "testRequestID"
    },
    body: JSON.stringify(json)
  });

global.lambda = {
  call,
  ....
};

Which could then be called in any test file:

const { call } = lambda;

let result = await call({
  uuid: "bababooey",
  action: "hello",
});

This allowed me to fake whatever Lambda event I wanted without having to duplicate code across multiple tests.

Once I did this, I also wanted to manually set process.env.PASSWORD to a fake password at the beginning of the jest test session, so that I could provide it to call to verify that the base handler forwarded my actions to the admin handler when I gave the right password. Seemed easy, right?

Well, it just did not work at all. I tried a few variations and eventually went to search for it on Google, and it turns out that only the globalSetup key can be used to set process.env across test files for some reason. So, I added the globalSetup key pointing towards envSetup.js:

/*
  - setup.js exists because global.* variables can't be declared in this file but they can be in there
  - on the other hand, process.env can't be changed in there but it can in here
  - jest is confusing sometimes
*/
module.exports = () => {
  process.env = Object.assign(process.env, {
    PASSWORD: "password",
  });
};

Maybe there's a better way to do this, but this was the best I could do from my limited experience with Jest.

Fake Database Pitfall

In order for me to build integration tests, I needed an actual database to query against. I could have used a separate DynamoDB table as the test table in this situation, but I would rather run DynamoDB locally to avoid the risk of having any crazy situation where I end up getting charged \$900 because a function being tested inserted a hundred thousand items or something.

First I needed a local DynamoDB server. For that, I installed LocalStack.

Next, I needed a way to make the code use my DynamoDB server instead of AWS's. I was able to do this with the utils.DynamoDocumentClient object:

/* utils.js */
const AWS = require("aws-sdk");
// Declared at module scope; by default this points at the real AWS DynamoDB
const DynamoDocumentClient = new AWS.DynamoDB.DocumentClient();

let dynamo = async (action, params) =>
  await module.exports.DynamoDocumentClient[action]({
    TableName: "DuroLiveChat",
    ...params
  }).promise();

module.exports = {
    ...,
    dynamo,
    DynamoDocumentClient
}

/* utils.test.js */
const AWS = require("aws-sdk");
const utils = require("../utils"); // path assumed; points at the utils module above

const DynamoParams = {
  apiVersion: "2012-10-08",
  region: "us-east-1",
  endpoint: "http://localhost:4569"
};

const DDB = new AWS.DynamoDB(DynamoParams);
const DDC = new AWS.DynamoDB.DocumentClient(DynamoParams);

beforeAll(async () => {
  utils.DynamoDocumentClient = DDC;
  // ...
});

The important part here is that utils.dynamo calls module.exports.DynamoDocumentClient instead of DynamoDocumentClient. That makes it susceptible to replacement in the integration testing file. If utils.dynamo called DynamoDocumentClient instead, Node would have used the one declared at the beginning of utils (pointing to the actual AWS DynamoDB) rather than the one with the custom endpoint.

Also, shoutout to dynamodb-admin for providing a perfectly simple yet capable web interface to interact with the DynamoDB tables running on LocalStack. The only change I had to make was to define an environment variable before running it so it used the right port to talk to Dynamo:

andi$ DYNAMO_ENDPOINT=http://localhost:4569 dynamodb-admin

Jest Runs Tests Concurrently Pitfall

I also kind of knew about this one from past experience, but I think it's important to mention because it really should influence the way you design tests, especially integration tests.

DO NOT assume your tests run in order! That means if you insert an item into the database in test1 and try to retrieve it in test2, you're doing it wrong. Yes, even if the tests are in the same file.

Try to keep every integration test self-contained. If possible, it should create, read, update, and delete every item within itself.

The problem with concurrency is that you never know which test will get run first, so sometimes your test suite passes and sometimes it might fail even without changing a line of code. I got bit by this and had to rewrite a bunch of tests as a result.
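To make that concrete, here's a rough sketch of a self-contained integration test. It reuses the DDC DocumentClient from the previous section, and the item itself is hypothetical:

// The test creates its own item, checks it, and cleans up after itself,
// so the order it runs in never matters.
it("creates and cleans up its own conversation item", async () => {
  const testUUID = "bababooey-selfcontained";
  await DDC.put({
    TableName: "DuroLiveChat",
    Item: { uuid: testUUID, timestamp: 0, nickname: "baba" }
  }).promise();
  const { Item } = await DDC.get({
    TableName: "DuroLiveChat",
    Key: { uuid: testUUID, timestamp: 0 }
  }).promise();
  expect(Item.nickname).toEqual("baba");
  await DDC.delete({
    TableName: "DuroLiveChat",
    Key: { uuid: testUUID, timestamp: 0 }
  }).promise();
});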

Lambda Uses Old AWS SDK Versions Pitfall

Oh, this one sucks. What's the latest AWS SDK version?

2.437.0

So when I run require("aws-sdk") inside my Lambda running Node.js 8.10, I should get version 2.437.0, right?

From AWS Lambda Runtimes:

Node.js Runtimes

Name          Identifier    Node.js Version   AWS SDK for JavaScript   Operating System
Node.js 10    nodejs10.x    10.15             2.437.0                  Amazon Linux 2
Node.js 8.10  nodejs8.10    8.10              2.290.0                  Amazon Linux

Sigh. This wouldn't be a problem if AWS API Gateway WebSockets wasn't a new feature when I first started building this, but it was, and so the ApiGatewayManagementApi module which is used to send messages to WebSocket connections didn't exist in the 2.290.0 version of the AWS SDK. In the legacy live chat, I included the files needed to make this work manually in a custom node_modules/aws-sdk folder. Knowing that the AWS SDK had updated since January, I figured the management API would come with it at this point. And well, it did, but not if you ran Node 8.10.

This turned out to be the easiest pitfall to solve. I just switched my runtime to Node.js 10.x. However, I was really lucky that my code was able to handle this major version upgrade without needing to change a single line.

Lambda Hates .Zip Files Pitfall

Lambda has three ways to get your code into it: typing it into the inline code editor, uploading a .zip file, or pointing it at a file stored in S3.

I opted to use the zip file this time around because I could make one easily. It took a few tries to finally get it in a format Lambda accepted.

Here's what ended up working for me:

zip duro.zip base.js utils.js;
aws lambda update-function-code --function-name duro --zip-file fileb://duro.zip;

Try unzipping your zip file before uploading it to Lambda if it doesn't work the first time. Make sure the source code files are inside the zip file and not inside a folder inside the zip file. Lambda won't be able to find your handler through a folder.

That means you want:

duro.zip
 |- base.js
 |- utils.js

And not:

duro.zip
 |- duro
   |- base.js
   |- utils.js

At this point I was really excited because I had finally solved all the pitfalls in front of me, deployed a working version of the live chat to AWS Lambda, and had over 40 individual tests covering the entire code base.

But then I had to learn a new word.

Idempotency Pitfall

Q: What happens if my function fails while processing an event?

For Amazon S3 bucket notifications and custom events, AWS Lambda will attempt execution of your function three times in the event of an error condition in your code or if you exceed a service or resource limit. - AWS Lambda FAQ

Here's the tricky part about this answer. You know that it's possible for your function to be executed up to 3 times, but only if it fails while processing an event.

I was operating off the premise that if I had enough tests to cover every reasonable event I would throw at the function, it would never fail while processing an event. And yes, I know that no amount of test suites can truly cover every scenario. But, still. It turned out, my code wasn't failing. Amazon's was.

I caught this particular issue because every once in a while, while testing the live chat frontends (user and admin), I would notice a message would get sent two or three times. Sometimes I'd refresh the page and the same message would have been saved to the database twice. This was really odd, and also random. I couldn't reliably trigger it. So instead I added logging to my function and spammed messages until I could get the error to happen again. And although it was random, it did happen often enough to be a serious problem. It could definitely happen to a user while they were using the application.

The problem turned out to be the ApiGatewayManagementApi. I would get the console logs from before its postToConnection method was called, but none after. The method would hang, and the Lambda function would get axed once it reached its timeout of 3 seconds. And once that happened, it would get called...again. And sometimes another time too. The problem is that before calling postToConnection, the function also added the message to the database. This would explain why there were both duplicate sends as well as duplicate inserts.

Now, to be fair, this may not be an actual issue with the ApiGatewayManagementApi. It might be designed to fail for whatever reason, and then end up working when Lambda calls it again post-retry. But the problem is my code wasn't idempotent, so I would end up with duplicates.

Idempotency in this context means that you should only change things the first time you process an event. If you are given the same exact event again, you shouldn't change anything.

Most of the utils methods were idempotent already, because they only read and returned items. They didn't add or update existing items. But for the ones that did, I needed a way to tell if the request was a duplicate so I could avoid executing them a second time.

This is where utils.isUniqueRequest() comes in. Every Lambda event also comes with a requestId, which is a unique identifier for that particular event. If we save the requestId into the database the first time the event is processed, we can check the database the second time the event comes around to see if we already processed it, and ignore it if that's the case.

// Lambda sometimes retries the same event if it didn't work the first time, so this is here to prevent duplicate messages from being sent
let isUniqueRequest = async (uuid, requestID) => {
  let Key = {
    uuid,
    timestamp: 0,
  };
  let convo = await dynamo("get", {
    Key,
    ProjectionExpression: "lastRequestsServed, updates",
  });
  if (!convo.Item) return true; // unknown uuid
  let lastRequestsServed = convo.Item.lastRequestsServed.values;
  let updates = convo.Item.updates;
  if (lastRequestsServed.includes(requestID)) return false;
  lastRequestsServed.push(requestID); // add new request
  if (lastRequestsServed.length >= 6) lastRequestsServed.shift(); // drop the oldest saved id if there are too many
  let updateParams = {
    Key,
    UpdateExpression:
      "SET lastRequestsServed = :newSet, updates = updates + :one",
    ConditionExpression: "updates = :updates",
    ExpressionAttributeValues: {
      ":newSet": module.exports.DynamoDocumentClient.createSet(
        lastRequestsServed
      ),
      ":updates": updates,
      ":one": 1,
    },
  };
  try {
    await dynamo("update", updateParams);
  } catch (err) {
    if (err.code != "ConditionalCheckFailedException") throw err;
    return false; // the conversation was updated by the time we tried to update it, so just invalidate everything
  }
  return true;
};

As you can see, we save the last five requestIds processed to a set in the account item, and return true/false depending on whether our requestId is in that set.

Technically we can have an event get retried after 5 others have already been processed, but the likelihood of that happening is so low it's not worth accounting for. Especially because Lambda immediately retries the function once it fails the first time; there's very little time delay between initial execution and retry.

What's updates for? Well, it turns out you can have 2+ Lambda functions access the same Dynamo item at the same time. If the lastRequestsServed set doesn't include their requestId, and they both have the same requestId, both functions will conclude that the request is unique, even though it's actually not. To prevent this, we increment the updates key and use a ConditionExpression to tell DynamoDB to reject the update if the current value of updates doesn't match the one we have. This way, only one Lambda will get to claim the request ID is unique (as it should). All others running concurrently will report that it is a duplicate. This technique is not unique - it's actually a duplicate of a common DynamoDB tactic called Optimistic Locking.

I'd also like to thank the following two articles for holding my hand through figuring this out:

Too Many Arguments Pitfall

This one is not Lambda related at all, just a JavaScript optimization.

I used to have methods that looked like this:

let createConversation = async (uuid, nickname, email, connection, ip) => { ... }

let sendResponse = async (from, to, msg, ws) => {...}

The problem is, when you actually call them, it's really hard to tell which position corresponds to which argument:

await utils.createConversation(
  "e1b3b0b4-1cd2-4a77-a74a-6df7e0bd820c",
  "bababooey",
  "baba@email.com",
  event.requestContext.connectionId,
  event.requestContext.identity.sourceIp
);

Instead, have the function accept a single object. It's much simpler to parse and also easier to extend when you want to add arguments in the future.

let createConversation = async ({ uuid, nickname, email, connection, ip }) => { ... }

This is some nice ES6+ syntactic sugar that destructures the object for you. The following is another way to do the same thing:

let createConversation = async (params) => {
  let uuid = params.uuid;
  let nickname = params.nickname;
  // ...
};

And you can call it like so:

await utils.createConversation({
  uuid: "e1b3b0b4-1cd2-4a77-a74a-6df7e0bd820c",
  nickname: "bababooey",
  email: "baba@email.com",
  // ...
});

Much easier to follow. Named parameters are always welcome.

Anyways, I wish I had followed this advice before I had to rewrite most of my utils functions, which ended up breaking all my tests, which I also had to rewrite somewhat.

Connections Die When They Are Killed Pitfall

Last pitfall before the end! I'll keep it short.

Here's how ApiGatewayManagementApi.postToConnection() works:

await ws
  .postToConnection({
    ConnectionId: connectionString,
    Data: JSON.stringify(obj),
  })
  .promise();

The ConnectionId is the most important part here. It refers to which WebSocket to send this data to, and if that WebSocket doesn't exist or isn't connected anymore, postToConnection() throws an error.

The legacy version of this sending function handled two scenarios:

  1. The user was sending a message to the admin, so it would get the ConnectionId from the admin account's connection key

  2. The admin was sending a message to the user, in which case they provided a connectionTo key which would be passed to postToConnection()

As you can imagine, this brittle system didn't really work when the user switched WiFi networks or refreshed the page.

The new system instead always takes the ConnectionId from the user's account, and makes sure it is updated frequently with the ping action I talked about before. This way, as long as the live chat page is open and running its client-side code, the system will stay aware of what its connection ID is. Of course, the above code is wrapped in a try/catch in case sending still fails, but at least then we'll know it's because the user isn't connected anymore and not because we passed the wrong ConnectionId.
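For reference, here's roughly what that ping action boils down to (a sketch in the style of the userHandler cases above; the real code in the repo may differ): it refreshes the stored connection ID and lastConnected timestamp on every ping.

  case "ping":
    // alias "connection" to stay clear of DynamoDB's reserved word list
    await utils.dynamo("update", {
      Key: { uuid, timestamp: 0 },
      UpdateExpression: "SET #conn = :connection, lastConnected = :now",
      ExpressionAttributeNames: { "#conn": "connection" },
      ExpressionAttributeValues: { ":connection": connection, ":now": Date.now() }
    });
    return JSONReply("pong");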

Conclusion

Congratulations on making it this far! I hope this article helped you figure out how you can use Serverless on a real, not-built-for-a-tutorial real-time service.

Here's the source code for everything: https://github.com/nexuist/duro-chat

Inside, you'll find a __tests__ folder and a legacy folder containing all my unit/integration tests and the legacy version of the live chat, respectively.

If you want to try out Jest on your own machine, make sure you install LocalStack and have it running beforehand or all the integration tests will fail.

If you find any better workarounds for the pitfalls, please don't hesitate to let me know using the live chat tab in the sidebar ;)



