How I answer the door with AWS Lambda and Twilio

Swizec Teller, December 8, 2019

Remember the doorbell Slack bot I built in 2016? Probably not, but the girl whose sanity it saved became one of my best friends. For 2 years she could focus on work instead of answering the phone every time Yup Inc got a visitor.

Then we moved and the bot died.

2 weeks ago I moved and the new apartment has a front gate buzzer, but no package concierge. For the first time in 5 years we're gonna have to answer the door! 😱

Millennials are killing the doorbell industry by texting "here" but that don't work for deliveries. Especially if you're not home.

The buzzer looks like this:

Delivery person finds your name, taps call and the box calls a pre-programmed phone number. You pick up the phone, talk to the person, and press 9 to let them in. They drop off the package behind a locked gate and people don't steal it. 👌

Now here's the thing: My phone is where phone calls go to die. And I don't want my girlfriend to be on the hook every time I order something from Giant Dildos Dot Com.

The box accepts 1 phone number.

So I sat down for 3 hours and banged out a serverless app that answers the door, transcribes the audio, sends us a text, waits for reply, and opens the door if you say YES. 🤙

Final integration test around the 3:00:35 mark

You can read the code on GitHub

Code contains my old Twilio Auth Token because lazy. Someone racked up $300 in fraudulent calls within minutes. That was dumb.

How you can answer the door with AWS Lambda and Twilio

👆 sketch of how it works. It's harder to draw than I thought. Here's a description of the process:

1. Delivery person makes call
2. Twilio picks up the phone
3. Twilio sends request to AWS Lambda "What should I say?"
4. Lambda responds with instructions
   4.1. Say "Welcome, what do you want? State your business after the beep and press any key"
   4.2. Record response
   4.3. Wait for 60 seconds
5. Twilio talks to callbox
6. Person says what they want and presses a key
7. Twilio sends recording to AWS Lambda "now what?"
8. Lambda responds with further instructions
   8.1. Say "Thanks, someone will let you in"
   8.2. Pause for 60 seconds
   8.3. If still in call, say "Sorry, nobody responded"
   8.4. Hang up
9. Twilio sends all that to callbox and pauses the call
10. In parallel, Twilio transcribes the recording
11. Twilio sends transcript to AWS Lambda "here transcript, now what?"
12. Lambda saves Call ID and callbox number in DB for later
13. Lambda tells Twilio to text Swiz
14. Twilio sends text
15. Swiz sees text and replies with YES
16. Twilio gets text response and sends to AWS Lambda "here, response. Now what?"
17. Lambda looks up original Call ID
   17.1. If no ongoing call, bail
   17.2. If more than 60 seconds since call, bail (it hung up)
18. Lambda checks if my text matches yes
19. Send voice call response to Twilio
   19.1. Say "Letting you in" or "Sorry, you can't come in"
   19.2. Dial 9
20. Twilio sends all that to callbox
21. Door unlocks, delivery gets delivered
22. Lambda sends text to Swizec saying "All good, person was let in"

Sounds complicated, right? Thanks to AWS Lambda and Serverless it's pretty easy 👉 Each step becomes a standalone JavaScript function. The sophistication comes from how they work together.

Like I mention in the Serverless Pros & Cons chapter of Serverless Handbook:

Serverless lets you trade function complexity for systems complexity. Individual pieces are easier to build & test, but the system becomes hairier.

You can see this in action during the livestream. Every few minutes we integration test the next piece of the puzzle. 🤘

Step 1: Picking up the phone

This is the first Lambda in our system. It answers the phone once Twilio converts the incoming call into an API POST request.

Twilio sends a POST request with various params, which we ignore since our response is always the same: A TwiML message constructed via Twilio's node library.

TwiML is Twilio's XML-based markup language for responding to voice calls and handling text messages.

response.say() turns into a <Say>Hello</Say> line and becomes a spoken computer voice. response.record() allows us to record the person's reply.

In this case we're giving a 60 second timeout, asking Twilio to transcribe, and telling it to send the recording to an acceptRecording endpoint. Twilio is smart enough to handle relative URLs so we don't have to worry about that.
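Here's a rough sketch of the TwiML that first Lambda could send back. To keep it dependency-free, it writes the XML by hand instead of going through Twilio's node library, so treat it as an illustration and not the code from the repo. The acceptTranscript endpoint name and the escapeXml helper are my assumptions; acceptRecording and the 60 second timeout come from the description above.

```javascript
// Hand-written TwiML for the first Lambda. The article builds this with
// Twilio's node library; plain strings are used here so the sketch runs
// without dependencies. "acceptTranscript" is a hypothetical endpoint name.

// Escape characters that are unsafe inside XML text or attribute values.
function escapeXml(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

function pickUpTwiml() {
  const greeting =
    "Welcome, what do you want? State your business after the beep and press any key";

  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    "<Response>",
    `  <Say>${escapeXml(greeting)}</Say>`,
    // timeout: seconds of silence before Twilio stops recording
    // action: relative URL Twilio POSTs to once the recording is finished
    // transcribeCallback: relative URL Twilio POSTs the transcript to later
    '  <Record timeout="60" transcribe="true" action="acceptRecording" transcribeCallback="acceptTranscript"/>',
    "</Response>",
  ].join("\n");
}

console.log(pickUpTwiml());
```

With Twilio's library the same document would come from response.say() and response.record() calls followed by toString().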

We use Twilio's dashboard to map a phone number to an API endpoint.

Step 2: Accept voice recording, ask to wait

After the person says what they want, they press a button. This tells Twilio to stop recording and talk to our next lambda: acceptRecording.

Same spiel as before 👉 we get a POST request and respond with some TwiML constructed with Twilio's node library. Let the person know someone's about to answer the door, wait 60 seconds, and if nothing happens deliver the bad news.

Btw, the sendTwiml function is a helper to avoid code duplication:

Status Code 200 means request succeeded, content type application/xml so Twilio API doesn't get confused, and twiml converted to a string as the body.
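A minimal sketch of what that helper and the acceptRecording lambda might look like, going by the description above. The TwiML is hand-written here so the example has no dependencies; the original builds it with Twilio's node library, and the exact response shape is my assumption about a typical API Gateway Lambda.

```javascript
// Wrap TwiML in the response shape API Gateway expects from a Lambda.
// Accepts a string or any object whose toString() returns XML, which is
// how the response objects from Twilio's library behave.
function sendTwiml(twiml) {
  return {
    statusCode: 200,
    headers: { "Content-Type": "application/xml" },
    body: twiml.toString(),
  };
}

// Second Lambda: thank the caller, hold the line for 60 seconds,
// then deliver the bad news and hang up if nobody replied in time.
async function acceptRecording(event) {
  const twiml = [
    '<?xml version="1.0" encoding="UTF-8"?>',
    "<Response>",
    "  <Say>Thanks, someone will let you in</Say>",
    '  <Pause length="60"/>',
    "  <Say>Sorry, nobody responded</Say>",
    "  <Hangup/>",
    "</Response>",
  ].join("\n");

  return sendTwiml(twiml);
}
```

The 60 second Pause is what keeps the call alive while the text message round trip happens. If a later lambda updates the call in that window, the remaining verbs never run.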

Step 3: Accept transcript, send text

Twilio's transcript API doesn't let you send TwiML into a phone call. That's why this is separate from the recording lambda, which just replies.

This time we do care about params Twilio sends with their request:

RecordingUrl is the audio file I can listen to
TranscriptionText is the machine transcription of the audio, usually good, sometimes hilariously wrong
CallSid is the original call ID, we'll need it to hook back into the call
Called is the phone number that was called, which helps us identify the callbox (future proofing, if I productize)

We use updateItem to save the (CallSid, Called) pair in DynamoDB. Our next lambda will use this to hook into the original call and to keep track of whether the call was handled yet. Great when multiple people reply YES to the same call.
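To make that concrete, here's a hedged sketch of the data handling in this lambda. Twilio POSTs form-encoded bodies, so the first step is parsing them. The table name, the attribute names, keying the record by the Called number, and the wording of the text message are all assumptions for illustration; the real updateItem helper lives in the repo and may look different.

```javascript
// Twilio POSTs application/x-www-form-urlencoded bodies.
// API Gateway may hand them to the Lambda base64-encoded.
function parseTwilioBody(event) {
  const raw = event.isBase64Encoded
    ? Buffer.from(event.body, "base64").toString("utf8")
    : event.body || "";
  return Object.fromEntries(new URLSearchParams(raw));
}

// Parameters in the shape DynamoDB's DocumentClient update() accepts.
// Table name, key, and attribute names are hypothetical.
function buildSaveCallParams(params, now = Date.now()) {
  return {
    TableName: "doorbell-calls",
    Key: { callbox: params.Called },
    UpdateExpression:
      "SET callSid = :sid, receivedAt = :at, handled = :handled",
    ExpressionAttributeValues: {
      ":sid": params.CallSid,
      ":at": now,
      ":handled": false,
    },
  };
}

// Text for the SMS: the transcript plus a link to the audio as a fallback
// for when the transcription is hilariously wrong.
function buildTextMessage(params) {
  return [
    `Someone is at the door: "${params.TranscriptionText}"`,
    params.RecordingUrl,
    "Reply YES to let them in.",
  ].join("\n");
}
```

Storing a timestamp and a handled flag next to the CallSid is what lets the next lambda tell a live call from a stale or already answered one.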

Step 4: Handle SMS reply

Handling the YES reply to that SMS gets tricky. A bunch of situations to consider: What if there's no call? What if they're late? What if someone else said YES already?

So we build the main handler method with a big conditional and call helper methods.

The handler takes the SMS Body and To phone number from Twilio's POST request and checks the DynamoDB database.

If there's a call and it hasn't been handled and it's not too late, we call closeDoor or openDoor based on the text Body. Otherwise we reply with a text saying there's no call, it's too late, or all is well.
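The big conditional can be sketched as a pure function that's easy to test on its own. The record shape (receivedAt, handled), the outcome names, and the exact YES matching rule are assumptions; the 60 second window comes from the flow described earlier.

```javascript
// How long the callbox stays on the line before the call hangs up.
const CALL_WINDOW_MS = 60 * 1000;

// Decide what to do with an SMS reply.
// call: the record saved for this number, or undefined if there is none
// body: the text of the SMS reply
// now:  current time in milliseconds, injectable for testing
function decideReply(call, body, now = Date.now()) {
  if (!call) return "NO_CALL";
  if (call.handled) return "ALREADY_HANDLED";
  if (now - call.receivedAt > CALL_WINDOW_MS) return "TOO_LATE";

  // Anything starting with "yes" (any case) opens the door.
  return /^\s*yes\b/i.test(body || "") ? "OPEN_DOOR" : "CLOSE_DOOR";
}
```

The handler would then map OPEN_DOOR and CLOSE_DOOR to openDoor and closeDoor, and every other outcome to an explanatory text reply.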

The openDoor and closeDoor functions are similar. They both call continueCall to talk to the callbox and send a text to let me know the deed is done.

Oh and they update the database to say call handled. Same updateItem method as before :)

Hooking into the waiting call to open the door looks like this:

Here's where storing that callSid becomes useful. We can update an ongoing call without being part of the original API loop 💪

We send Twilio some TwiML to let the person know they're being buzzed in and dial number 9 with some waits.
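A sketch of how hooking into the call could look, assuming a Twilio REST client gets passed in. Updating a live call through client.calls(callSid).update() with inline twiml is a real call in Twilio's node library, but whether the original continueCall uses inline TwiML or points the call at a URL is a guess on my part. The number of "w" waits before the 9 is also an assumption.

```javascript
// TwiML that tells the visitor they're being let in, then plays the
// DTMF tone for 9. Each "w" in digits is a half second wait.
function openDoorTwiml() {
  return [
    "<Response>",
    "  <Say>Letting you in</Say>",
    '  <Play digits="ww9"/>',
    "</Response>",
  ].join("\n");
}

// Replace the instructions of a call that is still in progress.
// client: a Twilio REST client, e.g. require("twilio")(accountSid, authToken)
// callSid: the CallSid saved when the transcript arrived
function continueCall(client, callSid, twiml) {
  return client.calls(callSid).update({ twiml });
}
```

Passing the client in as an argument keeps the function testable with a fake. In the Lambda you'd create the real client once from environment variables, never from credentials committed to the repo.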

That is all ✌️

And that completes the crazy flowchart from before. A sequence of small steps you can understand and test on their own. When combined they make magic.

Cheers,

~Swizec

PS: wanna learn more about using serverless? I'm making Serverless Handbook the best way to get started

