AI telephone agent using Asterisk and Cloudflare Workers AI
The idea behind AI-driven telephone agents is rather simple:
- Accept an incoming call
- Wait for some instructions
- Process instructions
- Return result to caller
I decided to build a small telephone agent with as little code as possible. Let's go through each of the steps.
Accept an incoming call using Asterisk
Instead of signing up for a new landline, I realized that my existing Internet provider (Deutsche Telekom) "gifted" me three phone numbers that I have never used. Since they allow me to connect to the telephone network via Voice-over-IP (VoIP), this is a perfect fit for the project.
Asterisk - an open-source implementation of a Private Branch Exchange (PBX) - allows us to connect to a VoIP provider (using standard SIP protocols) and receive/initiate phone calls.
Configure Transport
I am using the pjsip module of Asterisk for communication with Deutsche Telekom. We start by configuring how packets should be transported.
[transport-udp]
type = transport
protocol = udp
bind = 10.0.0.1
local_net = 10.0.0.0/24
This defines that
- transport should happen over UDP
- the Asterisk server is running on 10.0.0.1
- the local network is 10.0.0.0/24
Configure Registration
Registration differs across telco providers, but I guess some sort of authentication is required by all of them: you obviously have to "own" a specific telephone number to receive calls for it and to make calls using it. DTAG uses a username & password combo.
[telekom]
type = auth
auth_type = userpass
username = <my phone number>
password = <my DTAG password>@tel.t-online.de
realm = tel.t-online.de
[telekom]
type = registration
transport = transport-udp
server_uri = sip:tel.t-online.de
client_uri = sip:+49<my phone number>@tel.t-online.de
outbound_auth = telekom
contact_user = <my phone number>
fatal_retry_interval = 20
forbidden_retry_interval = 20
retry_interval = 20
max_retries = 9999
expiration = 60
This
- tells pjsip we want to authenticate using a specific username & password
- references our transport section
- configures the server's URL (depends on telco provider)
- configures our client URL (depends on telco provider and your phone number)
- configures some timeouts and the number of retries in case something goes wrong
Configure Endpoint
Last but not least, we configure an endpoint. This is a kind of sink that receives phone calls from external and internal callers.
[telekom]
type=endpoint
context=incoming
disallow=all
allow=alaw,ulaw
aors=telekom
outbound_auth=telekom
from_user=<my phone number>
from_domain=tel.t-online.de
[telekom]
type = aor
contact = sip:tel.t-online.de
[telekom]
type=identify
endpoint=telekom
match=217.0.0.0/13
This says
- whenever there's a SIP INVITE from 217.0.0.0/13 (the address range of DTAG)
- assign it to the Telekom endpoint
- and only some specific codecs are allowed
- and whenever there's someone calling, process the Asterisk dialplan using the incoming section
Configure Dialplan
Now that we are registered with the provider and can receive calls on our phone number, the dialplan is what decides what happens with a call.
[incoming]
exten => <my phone number>,1,Verbose(1, "Incoming call with CallerID: ${CALLERID(all)}")
same => n,Answer()
exten => _.,1,Set(UUID=${SHELL(printf $(uuidgen))})
same => n,AudioSocket(${UUID},127.0.0.1:3454)
same => n,Hangup()
This is the incoming section we referenced in the endpoint. It
- logs some information about the caller's phone number
- answers the call
- generates a UUID using the uuidgen program
- opens a TCP connection to an "Audiosocket" server
The last point here is the most important bit. An Audiosocket server is a program that receives all audio samples over a TCP socket. The payload it receives is well-defined: 16-bit, 8 kHz, single-channel PCM.
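To give an idea of what arrives on that socket, here is a minimal sketch of reading a single Audiosocket frame in Rust. The framing (one type byte, a two-byte big-endian payload length, then the payload, with 0x10 marking audio and 0x00 a hangup) reflects my understanding of the protocol, so treat those constants as assumptions.

use std::io::Read;
use std::net::TcpStream;

// Read one Audiosocket frame: 1 byte kind, 2 bytes big-endian payload length,
// then the payload itself (for kind 0x10 the payload is raw PCM audio).
fn read_frame(stream: &mut TcpStream) -> std::io::Result<(u8, Vec<u8>)> {
    let mut header = [0u8; 3];
    stream.read_exact(&mut header)?;
    let len = u16::from_be_bytes([header[1], header[2]]) as usize;
    let mut payload = vec![0u8; len];
    stream.read_exact(&mut payload)?;
    Ok((header[0], payload))
}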
Wait for some instructions via a small Rust program
So, someone can call us, we accept the call and forward all audio samples to a TCP server. There are different strategies on how to extract instructions from a caller and make them processable. I decided to go with a minimalistic Voice Activity Detection (VAD) approach for this project.
We know that the Audiosocket implementation of Asterisk gives us PCM data. More specifically, it gives us 8000 samples per second (8 kHz), with each sample being 2 bytes (a signed 16-bit value). The implementation I am using here looks at the absolute energy of each sample and considers a sample as speech if that energy is above a certain threshold.
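As a minimal sketch of that per-sample check (assuming the PCM bytes arrive little-endian; the threshold itself is something you have to tune):

// A sample counts as speech if its absolute 16-bit value exceeds the threshold.
fn is_speech(sample_bytes: [u8; 2], threshold: f64) -> bool {
    let sample = i16::from_le_bytes(sample_bytes);
    (sample as f64).abs() > threshold
}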
A state machine
The bigger challenge here is to extract whole instructions or sentences from a stream of samples, i.e. knowing when the caller starts and stops giving an instruction. A sneeze shouldn't be detected as speech, and humans take breaths while giving an instruction.
I came up with a state machine that looks like this:
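Written out as a Rust enum (a sketch inferred from the transition functions below, with each variant carrying the milliseconds spent so far):

// The four VAD states; Wait tracks both the accumulated speech time and the
// silence time since the caller went quiet.
enum VadState {
    Silence,
    Consider(f64),
    Speech(f64),
    Wait(f64, f64),
}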
In case a sample's energy was above a certain threshold, the human might actually be speaking:
fn active(&mut self, consider: f64, msec: f64) {
    *self = match self {
        VadState::Silence => VadState::Consider(msec),
        VadState::Consider(s) => {
            if *s >= consider {
                VadState::Speech(*s + msec)
            } else {
                VadState::Consider(*s + msec)
            }
        }
        VadState::Speech(s) => VadState::Speech(*s + msec),
        VadState::Wait(t, s) => VadState::Speech(*t + *s + msec),
    }
}
If
- they were silent, consider the input as speech for up to consider milliseconds
- we are currently considering the stream as speech and have been doing so for more than consider milliseconds, actually mark it as speech
- it was speech, update the number of milliseconds the caller has been talking for
There is one more transition that only makes sense when we look at the opposite case. So let's say a sample's energy is below our fixed energy threshold:
fn silence(&mut self, msec: f64) {
    *self = match self {
        VadState::Silence => VadState::Silence,
        VadState::Consider(_) => VadState::Silence,
        VadState::Speech(s) => VadState::Wait(*s, msec),
        VadState::Wait(t, s) => VadState::Wait(*t, *s + msec),
    }
}
If
- they were silent, do nothing
- we considered it as speech, mark it as silent (in case it isn't clear yet, the Consider state is meant to detect the "sneeze case": if the caller goes Silent -> Sneeze -> Silent, we only "considered" the sneeze as speech but never actually transitioned into the Speech state)
- they were speaking, transition to a Wait state
The Wait state is important to cover the "human-took-a-breath" case. We allow the caller to go Silent -> Speak -> Silent -> Speak and consider the whole sequence as one uninterrupted instruction, even though they were silent for a little bit.
However, if the stream is in a Wait state for too long, we consider the instruction to be finished, and we'll transmit the captured audio data to the "processor" (see next paragraph).
Additionally - and to be precise - I am not looking at the energy of individual samples. I am actually looking at sequences of 20 samples (or 2.5 milliseconds) and using the maximum energy within those 20 samples to drive the state machine.
for chunk in self.chunks(20) {
    if chunk.energy() > ACTIVE_ENERGY {
        state.active(CONSIDER_MSEC, chunk.msec(self.rate));
    } else {
        state.silence(chunk.msec(self.rate));
    }
}
Process instructions using Cloudflare Workers AI
Now that we have extracted an instruction from a stream of audio samples, we can actually process it. There are many different approaches to this problem, but in a nutshell what we need to do is:
- Speech-To-Text
- Generate response
- Text-To-Speech
All of this can happen locally, or can be offloaded to some external service provider. I decided to play around with Cloudflare's Workers AI for this project.
@cf/openai/whisper-tiny-en
Takes a WAVE audio file and turns it into text. All we have to do here is turn our stream of PCM samples into a WAVE PCM sound file. Then we send this file to a Cloudflare Worker script.
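Making a WAVE file out of raw PCM only requires prepending a 44-byte header. A minimal sketch for our 8 kHz, 16-bit, mono samples (the function name is just illustrative):

// Wrap raw 8 kHz, 16-bit mono PCM in a standard RIFF/WAVE PCM header.
fn pcm_to_wav(pcm: &[u8]) -> Vec<u8> {
    let data_len = pcm.len() as u32;
    let mut wav = Vec::with_capacity(44 + pcm.len());
    wav.extend_from_slice(b"RIFF");
    wav.extend_from_slice(&(36 + data_len).to_le_bytes());
    wav.extend_from_slice(b"WAVE");
    wav.extend_from_slice(b"fmt ");
    wav.extend_from_slice(&16u32.to_le_bytes()); // fmt chunk size
    wav.extend_from_slice(&1u16.to_le_bytes()); // PCM
    wav.extend_from_slice(&1u16.to_le_bytes()); // mono
    wav.extend_from_slice(&8000u32.to_le_bytes()); // sample rate
    wav.extend_from_slice(&16000u32.to_le_bytes()); // byte rate (8000 * 2)
    wav.extend_from_slice(&2u16.to_le_bytes()); // block align
    wav.extend_from_slice(&16u16.to_le_bytes()); // bits per sample
    wav.extend_from_slice(b"data");
    wav.extend_from_slice(&data_len.to_le_bytes());
    wav.extend_from_slice(pcm);
    wav
}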
@cf/meta/llama-3.1-8b-instruct-fast
Within that script, we pipe the text to a generative text model to generate a response.
@cf/myshell-ai/melotts
I also don't want to deal with generating audio from that response myself, so I am using a third model that takes the response as text and provides me with some audio.
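From the Rust side, all of this boils down to a single HTTP request: POST the WAVE file to the Worker and get the generated reply back as audio. A minimal sketch using reqwest, where the Worker URL is a placeholder for wherever your script is deployed:

use reqwest::blocking::Client;

// Send the WAVE file to the Worker and receive the generated reply (an mp3).
// The Worker internally chains the three Workers AI models described above.
fn ask_agent(wav: Vec<u8>) -> reqwest::Result<Vec<u8>> {
    let response = Client::new()
        .post("https://my-agent.example.workers.dev") // placeholder URL
        .body(wav)
        .send()?
        .error_for_status()?;
    Ok(response.bytes()?.to_vec())
}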
Return result to caller
After receiving the response audio from the Cloudflare Workers AI endpoint, we have to transform it into an audio format that Asterisk's Audiosocket understands. Workers AI gives us an mp3, but we need an 8 kHz, 16-bit, single-channel PCM stream.
While there are other options, my simple approach here is to call into ffmpeg to do the heavy lifting for me:
let mut child = Command::new("ffmpeg")
    .arg("-i")
    .arg("pipe:0")
    .arg("-f")
    .arg("s16le")
    .arg("-ac")
    .arg("1")
    .arg("-acodec")
    .arg("pcm_s16le")
    .arg("-ar")
    .arg("8000")
    .arg("pipe:1")
    .stdin(Stdio::piped())
    .stderr(Stdio::piped())
    .stdout(Stdio::piped())
    .spawn()
    .unwrap();
This
- pipes the mp3 from Cloudflare into ffmpeg
- converts the data to raw PCM
- pipes the PCM audio data to my program
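Continuing the snippet above, the mp3 goes into ffmpeg's stdin and the raw PCM comes back out of stdout. A sketch of that plumbing, where mp3_bytes stands in for the response body from the Worker; writing from a separate thread avoids deadlocking on full pipes:

use std::io::{Read, Write};

let mut stdin = child.stdin.take().unwrap();
let mut stdout = child.stdout.take().unwrap();

// Write the mp3 on a separate thread; dropping stdin afterwards closes the
// pipe, which signals EOF to ffmpeg.
let writer = std::thread::spawn(move || {
    stdin.write_all(&mp3_bytes).unwrap();
});

// Collect the converted 8 kHz, 16-bit mono PCM from stdout.
let mut pcm = Vec::new();
stdout.read_to_end(&mut pcm).unwrap();
writer.join().unwrap();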
The resulting PCM audio samples can now be sent over the existing Audiosocket TCP connection, and the caller will receive the generated reply. It's important not to send the whole audio to Asterisk in one go, but rather to chunk it and send the chunks at the same pace as we are receiving samples from Asterisk.
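A minimal sketch of that pacing, assuming 20-millisecond frames (320 bytes of 8 kHz, 16-bit mono PCM) and the same frame header as on the receive path:

use std::io::Write;
use std::net::TcpStream;
use std::time::Duration;

// Send the reply back as 20 ms audio frames (kind 0x10), sleeping between
// frames so we roughly match the pace at which Asterisk plays them out.
fn send_paced(stream: &mut TcpStream, pcm: &[u8]) -> std::io::Result<()> {
    for frame in pcm.chunks(320) {
        let mut packet = Vec::with_capacity(3 + frame.len());
        packet.push(0x10);
        packet.extend_from_slice(&(frame.len() as u16).to_be_bytes());
        packet.extend_from_slice(frame);
        stream.write_all(&packet)?;
        std::thread::sleep(Duration::from_millis(20));
    }
    Ok(())
}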
The Audiosocket server as well as the Cloudflare Worker can be found on GitHub.