Remember the doorbell Slack bot I built in 2016? Probably not, but the girl whose sanity it saved became one of my best friends. For 2 years she could focus on work instead of answering the phone every time Yup Inc got a visitor.
Then we moved and the bot died.
2 weeks ago I moved and the new apartment has a front gate buzzer, but no package concierge. For the first time in 5 years we're gonna have to answer the door! 😱
Millennials are killing the doorbell industry by texting "here" but that don't work for deliveries. Especially if you're not home.
The buzzer looks like this:
Delivery person finds your name, taps call and the box calls a pre-programmed phone number. You pick up the phone, talk to the person, and press 9 to let them in. They drop off the package behind a locked gate and people don't steal it. 👌
Now here's the thing: My phone is where phone calls go to die. And I don't want my girlfriend to be on the hook every time I order something from Giant Dildos Dot Com.
The box accepts 1 phone number.
So I sat down for 3 hours and banged out a serverless app that answers the door, transcribes the audio, sends us a text, waits for reply, and opens the door if you say YES. 🤙
Final integration test around the 3:00:35 mark
You can read the code on GitHub
Code contains my old Twilio Auth Token because lazy. Someone racked up $300 in fraudulent calls within minutes. That was dumb.
👆 sketch of how it works. It's harder to draw than I thought. Here's a description of the process:
1. Delivery person makes call
2. Twilio picks up the phone
3. Twilio sends request to AWS Lambda: "What should I say?"
4. Lambda responds with instructions
   4.1. Say "Welcome, what do you want? State your business after the beep and press any key"
   4.2. Record response
   4.3. Wait for 60 seconds
5. Twilio talks to callbox
6. Person says what they want and presses a key
7. Twilio sends recording to AWS Lambda
8. Lambda responds with further instructions
   8.1. Say "Thanks, someone will let you in"
   8.2. Pause for 60 seconds
   8.3. If still in call, say "Sorry, nobody responded"
   8.4. Hang up
9. Twilio sends all that to callbox and pauses the call
10. In parallel, Twilio transcribes the recording
11. Twilio sends transcript to AWS Lambda: "here transcript, now what?"
12. Lambda saves Call ID and callbox number in DB for later
13. Lambda tells Twilio to text Swiz
14. Twilio sends text
15. Swiz sees text and replies with YES
16. Twilio gets text response and sends to AWS Lambda: "here, response. Now what?"
17. Lambda looks up original Call ID
   17.1. If no ongoing call, bail
   17.2. If more than 60 seconds since call, bail (it hung up)
18. Lambda checks if my text matches YES
19. Send voice call response to Twilio
   19.1. Say "Letting you in" or "Sorry, you can't come in"
   19.2. Dial 9
20. Twilio sends all that to callbox
21. Door unlocks, delivery gets delivered
22. Lambda sends text to Swizec saying "All good, person was let in"
Serverless lets you trade function complexity for systems complexity. Individual pieces are easier to build & test, but the system becomes hairier.
You can see this in action during the livestream. Every few minutes we integration test the next piece of the puzzle. 🤘
This is the first Lambda in our system. It answers the phone once Twilio converts the incoming call into an API POST request.
Twilio sends a POST request with various params, which we ignore since our response is always the same: a TwiML message constructed via Twilio's node library.
TwiML is Twilio's XML-based markup language for responding to voice calls and handling text messages.
response.say() turns into a <Say>Hello</Say> line and becomes a spoken computer voice. response.record() allows us to record the person's reply.
In this case we're giving a 60 second timeout, asking Twilio to transcribe, and telling it to send the recording to an acceptRecording endpoint. Twilio is smart enough to handle relative URLs so we don't have to worry about that.
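Here's a rough, dependency-free sketch of what that first lambda could return. It writes the TwiML by hand instead of going through Twilio's node library, so treat it as an illustration of the shape rather than the real code. The handleTranscript path and the answerCall name are my assumptions for this example.

```javascript
// Escape text for use inside XML element content.
const escapeXml = (text) =>
  String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");

// TwiML for answering the callbox: greet, then record the reply.
// Both endpoint paths are hypothetical, relative URLs.
function answerCallTwiml() {
  const greeting =
    "Welcome, what do you want? State your business after the beep and press any key";
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    "<Response>",
    `<Say>${escapeXml(greeting)}</Say>`,
    // timeout: seconds of silence before Twilio stops recording
    // action: where Twilio POSTs once the recording ends
    // transcribeCallback: where Twilio POSTs the finished transcript
    '<Record timeout="60" transcribe="true" action="acceptRecording" transcribeCallback="handleTranscript"/>',
    "</Response>",
  ].join("");
}

// Lambda handler (hypothetical name). The incoming params are ignored
// because the reply is always the same.
const answerCall = async () => ({
  statusCode: 200,
  headers: { "Content-Type": "application/xml" },
  body: answerCallTwiml(),
});
```

Record stops on any keypress by default, which is what makes "press any key" work without extra config.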
We use Twilio's dashboard to map a phone number to an API endpoint.
After the person says what they want, they press a button. This tells Twilio to stop recording and talk to our next lambda.
Same spiel as before 👉 we get a POST request and respond with some TwiML constructed with Twilio's node library. Let the person know someone's about to answer the door, wait 60 seconds, and if nothing happens deliver the bad news.
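A minimal sketch of the TwiML that lambda might send back, hand-written here so the example has no dependencies. The post builds it with Twilio's node library instead, and the function names are made up for illustration.

```javascript
// TwiML for the "hold on, someone is coming" reply.
// Say, wait a minute, and if the call is still alive deliver the bad news.
function acceptRecordingTwiml() {
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    "<Response>",
    "<Say>Thanks, someone will let you in</Say>",
    '<Pause length="60"/>', // length is in seconds
    "<Say>Sorry, nobody responded</Say>",
    "<Hangup/>",
    "</Response>",
  ].join("");
}

// Lambda handler (hypothetical name) returning the TwiML as XML.
const acceptRecording = async () => ({
  statusCode: 200,
  headers: { "Content-Type": "application/xml" },
  body: acceptRecordingTwiml(),
});
```

The 60 second pause is the window in which a YES text can hook back into the call.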
The sendTwiml function is a helper to avoid code duplication. It returns status code 200 to say the request succeeded, content type application/xml so the Twilio API doesn't get confused, and the twiml converted to a string as the body.
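A helper like that can be tiny. This is a sketch under the assumption that the lambdas sit behind API Gateway's proxy integration, which expects a statusCode, headers, and a string body.

```javascript
// Wrap a TwiML document in the response shape a Lambda returns to API Gateway.
// `twiml` can be a string or any object whose toString() returns XML.
function sendTwiml(twiml) {
  return {
    statusCode: 200,
    headers: { "Content-Type": "application/xml" },
    body: String(twiml),
  };
}
```

Usage would be something like `return sendTwiml(response)` at the end of each voice lambda.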
Twilio's transcript API doesn't let you send TwiML into a phone call. That's why this is separate from the recording lambda, which just replies.
This time we do care about params Twilio sends with their request:
- RecordingUrl is the audio file I can listen to
- TranscriptionText is the machine transcription of the audio, usually good, sometimes hilariously wrong
- CallSid is the original call ID, we'll need it to hook back into the call
- Called is the phone number that was called, which helps us identify the callbox (future proofing, if I productize)
We use updateItem to save the (CallSid, Called) pair in DynamoDB. Our next lambda will use this to hook into the original call and to keep track of whether the call was handled yet. Great when multiple people reply YES to the same call.
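To make that concrete, here's a sketch of the transcript lambda. Several things are assumptions for the sake of a runnable example: an in-memory Map stands in for DynamoDB, the record is keyed by the Called number, the field names (callSid, handled, createdAt) are invented, sendText is a hypothetical function you pass in, and the wording of the text is made up.

```javascript
// Twilio webhooks arrive as application/x-www-form-urlencoded bodies.
function parseTwilioBody(event) {
  const raw = event.isBase64Encoded
    ? Buffer.from(event.body || "", "base64").toString("utf8")
    : event.body || "";
  return Object.fromEntries(new URLSearchParams(raw));
}

// In-memory stand-in for the DynamoDB table (assumption, not the real helper).
const calls = new Map();
async function updateItem(phoneNumber, attributes) {
  const item = { ...(calls.get(phoneNumber) || {}), phoneNumber, ...attributes };
  calls.set(phoneNumber, item);
  return item;
}

// Save the call for later, then tell a human what the visitor said.
// `sendText` is a hypothetical async function that sends an SMS.
async function handleTranscript(event, sendText) {
  const { RecordingUrl, TranscriptionText, CallSid, Called } = parseTwilioBody(event);

  await updateItem(Called, {
    callSid: CallSid,
    handled: false,
    createdAt: Date.now(),
  });

  await sendText(
    `Someone's at the door: "${TranscriptionText}". Listen: ${RecordingUrl}. Reply YES to let them in.`
  );

  // Nothing to say back into the call from here, so an empty 200 is enough.
  return { statusCode: 200, body: "" };
}
```

Storing createdAt alongside the CallSid is what lets a later lambda tell whether a reply came in too late.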
Handling the YES reply to that SMS gets tricky. A bunch of situations to consider: What if there's no call? What if they're late? What if someone else said YES already?
So we build the main handler method with a big conditional and call helper methods.
It takes the SMS To phone number from Twilio's POST request and checks the DynamoDB database.
If there's a call and it hasn't been handled and it's not too late, we call openDoor or closeDoor based on the text Body. Otherwise we reply with a text saying there's no call, it's too late, or all is well.
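The big conditional boils down to something like this sketch. It's a pure function so you can test it without Twilio or DynamoDB. The record fields (handled, createdAt), the action names, and the exact 60 second cutoff are assumptions for illustration.

```javascript
const CALL_WINDOW_MS = 60 * 1000; // the call gives up after about 60 seconds

// Decide what to do with an SMS reply.
// `call` is the saved record for the number the SMS was sent to, or undefined.
function decideAction(call, body, now = Date.now()) {
  if (!call) return "no_call"; // nobody is at the door
  if (call.handled) return "already_handled"; // someone else replied first
  if (now - call.createdAt > CALL_WINDOW_MS) return "too_late"; // call hung up
  return body.trim().toUpperCase() === "YES" ? "open_door" : "close_door";
}
```

The handler then maps each result to either a door action or a text reply explaining why nothing happened.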
The openDoor and closeDoor functions are similar. They both call continueCall to talk to the callbox and send a text to let me know the deed is done.
Oh and they update the database to say call handled. Same updateItem method as before :)
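Since the two functions share so much, one way to sketch them is a common flow with different words. The helpers are passed in as `deps` so the example runs on its own; their signatures, the record fields, and the "was not let in" text are my assumptions, not the original code.

```javascript
// Shared flow: speak into the waiting call, mark it handled, confirm by text.
// `deps` holds hypothetical helpers: continueCall, updateItem, sendText.
async function finishCall(call, { say, dialDigits, confirmation }, deps) {
  await deps.continueCall(call.callSid, { say, dialDigits });
  await deps.updateItem(call.phoneNumber, { handled: true });
  await deps.sendText(confirmation);
}

const openDoor = (call, deps) =>
  finishCall(
    call,
    { say: "Letting you in", dialDigits: "9", confirmation: "All good, person was let in" },
    deps
  );

const closeDoor = (call, deps) =>
  finishCall(
    call,
    // No digits to dial when the answer is no. Confirmation wording is made up.
    { say: "Sorry, you can't come in", dialDigits: null, confirmation: "Person was not let in" },
    deps
  );
```

Marking the call handled right after talking to the callbox is what keeps a second YES from pressing 9 twice.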
The last piece is hooking into the waiting call to open the door.
Here's where storing that callSid becomes useful. We can update an ongoing call without being part of the original API loop 💪
We send Twilio some TwiML to let the person know they're being buzzed in and dial number 9 with some waits.
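A sketch of that hook, assuming `client` is an initialized Twilio REST client from the twilio node package. The TwiML is hand-written here to keep the example dependency-free, the function names are illustrative, and the number of waits before the 9 is a guess.

```javascript
// TwiML that announces the decision and, when letting someone in, presses 9.
// Each "w" in the digits attribute is a half-second pause before the tone.
// `message` is expected to be a fixed, XML-safe string.
function doorTwiml(message, digits) {
  const parts = [
    '<?xml version="1.0" encoding="UTF-8"?>',
    "<Response>",
    `<Say>${message}</Say>`,
  ];
  if (digits) parts.push(`<Play digits="${digits}"/>`);
  parts.push("</Response>");
  return parts.join("");
}

// Replace the TwiML of an in-progress call, identified by its CallSid.
// `client` would come from require("twilio")(accountSid, authToken).
async function continueCall(client, callSid, twiml) {
  return client.calls(callSid).update({ twiml });
}
```

Something like `continueCall(client, callSid, doorTwiml("Letting you in", "ww9"))` would then interrupt the 60 second pause, speak, and play the tone that opens the gate.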
And that completes the crazy flowchart from before. A sequence of small steps you can understand and test on their own. When combined they make magic.
PS: wanna learn more about using serverless? I'm making Serverless Handbook the best way to get started