Implementing Natural Conversational Agents with Elixir

(seanmoriarity.com)

194 points | by ac_alejos 13 days ago

6 comments

  • andy_ppp 13 days ago
    I don't think many people know how amazing Elixir has become at machine learning. If you want to learn more I can't recommend Sean Moriarity's book Machine Learning in Elixir enough. Concepts are explained in extremely straightforward language and there are loads of examples!

    https://pragprog.com/titles/smelixir/machine-learning-in-eli...

    • jatins 13 days ago
      Do the Elixir ML ecosystem libs (Nx, Axon) provide some sort of interop with the Python ecosystem?

      For example, can I load or fine-tune a model pre-trained in PyTorch/JAX in Axon? Or does everything need to be written from the ground up in Elixir?

    • thibaut_barrere 13 days ago
      Yes, this is getting quite exciting. There is cross-pollination of concepts going on (e.g. https://www.youtube.com/watch?v=RABXu7zqnT0, which shows a port of Python's Instructor library to https://github.com/thmsmlr/instructor_ex, https://hexdocs.pm/scholar/Scholar.html, etc.).

      That, coupled with LiveView (and quite easy scaling in general), results in interesting opportunities.

      • andy_ppp 13 days ago
        The scaling story in Elixir is so nice. I just implemented eventual consistency for calculating average ratings at a Class/Instructor level, and it was 20 lines of code in a GenServer that can be tested and is super clear about how it works. I'm not even sure how you'd do something like this in JavaScript or Python, but it would probably involve extra infrastructure that is another moving piece to deploy and manage, and a potential source of failures. The same is true of rate limiting and something like Hammer (https://github.com/ExHammer/hammer).
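
        Roughly, the shape of that GenServer is something like this (a sketch written from scratch, not my actual code; the module and function names are made up):

            defmodule RatingAggregator do
              use GenServer

              # Client API
              def start_link(_opts), do: GenServer.start_link(__MODULE__, %{}, name: __MODULE__)
              def add_rating(class_id, rating), do: GenServer.cast(__MODULE__, {:add, class_id, rating})
              def average(class_id), do: GenServer.call(__MODULE__, {:avg, class_id})

              # Server callbacks; state maps class_id => {sum, count}
              @impl true
              def init(state), do: {:ok, state}

              @impl true
              def handle_cast({:add, class_id, rating}, state) do
                {:noreply, Map.update(state, class_id, {rating, 1}, fn {sum, n} -> {sum + rating, n + 1} end)}
              end

              @impl true
              def handle_call({:avg, class_id}, _from, state) do
                case Map.get(state, class_id) do
                  nil -> {:reply, nil, state}
                  {sum, n} -> {:reply, sum / n, state}
                end
              end
            end

        Because casts are asynchronous, writers never block on the aggregation, and readers see the latest value the process has absorbed so far, which is the eventual-consistency trade-off.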
    • jonvk 11 days ago
      I'm guessing you mean that you would recommend it. You say you can't.
    • enraged_camel 12 days ago
      It says the book is in beta. How complete/finished is it?
      • seanmor5 12 days ago
        Hey, I'm the author! All of the chapters are done, but there are still some minor updates as APIs change. It should be going to production soon.
  • TonyHaenn 12 days ago
    Nice writeup! Super interesting that we both took different paths, but ended up with similar latencies.

    I built a real-time conversation platform in Elixir. I used the Membrane framework to coordinate the STT, LLM, and TTS steps. I also ended up with latency in the ~1300 ms range.

    I found research saying the typical human response time in a conversation is 250 to 300 ms [0], so I think that should be the goal.

    For my solution, some of the things we did to get latency as low as possible:

    1. We stream the audio to the STT endpoint. If you're transcribing as the audio comes in, then all you care about is the tail latency (the time between when the audio ends and the final transcript arrives). That helped a bunch for us. Google is around 200 ms with this approach.

    2. GPT-3.5 still has a time to first token of ~350 to ~400 ms. I couldn't find a way around that. But you can stream those tokens to ElevenLabs and start getting audio sooner, which helps.

    3. ElevenLabs eats up most of the latency budget. Even with their turbo model, their latency is 600-800 ms according to my timings. Again, streaming the words in (not tokens) and calling flush seemed to help; see the sketch after this list.
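
    To make points 2 and 3 concrete, here's roughly the shape of streaming words into ElevenLabs' websocket stream-input endpoint using the WebSockex library (a sketch, not my production code; the URL shape and the "text"/"flush"/"audio" payload fields follow ElevenLabs' streaming docs as I understand them, but treat the exact shapes as assumptions to verify):

        defmodule TTSStream do
          use WebSockex

          @url "wss://api.elevenlabs.io/v1/text-to-speech/VOICE_ID/stream-input?model_id=eleven_turbo_v2"

          def start_link(parent) do
            WebSockex.start_link(@url, __MODULE__, %{parent: parent},
              extra_headers: [{"xi-api-key", System.fetch_env!("ELEVENLABS_API_KEY")}]
            )
          end

          # Forward each word as it arrives from the LLM stream.
          def send_text(pid, text) do
            WebSockex.send_frame(pid, {:text, Jason.encode!(%{text: text})})
          end

          # Force synthesis of whatever is buffered, e.g. at a sentence boundary.
          def flush(pid) do
            WebSockex.send_frame(pid, {:text, Jason.encode!(%{text: " ", flush: true})})
          end

          # Audio chunks come back base64-encoded inside JSON frames.
          @impl true
          def handle_frame({:text, msg}, %{parent: parent} = state) do
            with %{"audio" => audio} when is_binary(audio) <- Jason.decode!(msg) do
              send(parent, {:audio_chunk, Base.decode64!(audio)})
            end

            {:ok, state}
          end
        end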

    The key I found was to cover up the latency. We respond immediately with some filler audio. The trick was getting the LLM to be aware of the filler audio text and continue naturally from that point.
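
    One way to wire up that filler trick (a rough sketch; the AudioQueue helper is made up, the prompt wording is illustrative, and whether a model continues cleanly from a pre-filled assistant turn is worth testing):

        filler = "Hmm, let me think about that for a second."

        # Play pre-synthesized filler audio right away (hypothetical helper).
        AudioQueue.play_cached(filler)

        # Tell the model it already spoke the filler so it continues rather than repeats.
        messages = [
          %{role: "system",
            content: "You are a voice assistant. You have already spoken the last assistant message aloud as filler; continue naturally from it without repeating it."},
          %{role: "user", content: transcript},
          %{role: "assistant", content: filler}
        ]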

    [0] https://journalofcognition.org/articles/10.5334/joc.268#

    • nojs 12 days ago
      This matches my experience doing it with Elixir/OpenAI/ElevenLabs as well.

      Depending on the application it’s also possible to fire the whole thing off pre-emptively, and then use the early response unless later context explicitly invalidates it.
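
      In Elixir the speculative kickoff can be a plain Task (generate_reply/1 here stands in for whatever STT -> LLM -> TTS pipeline you have):

          # Start generating from the early transcript while the user may still be talking.
          task = Task.async(fn -> generate_reply(early_transcript) end)

          reply =
            if final_transcript == early_transcript do
              # Context didn't change; keep the head start.
              Task.await(task, 15_000)
            else
              # Later context invalidated the guess; discard and redo.
              Task.shutdown(task, :brutal_kill)
              generate_reply(final_transcript)
            end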

      Another cool trick to get around TTS latency is to maintain an audio cache keyed by semantic meaning, and get the LLM to choose from the cache. This saves on high TTS API costs too.
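
      A minimal version of that cache might look like this (all names hypothetical; Llm.choose/2 stands in for a constrained LLM call that must pick one of the given keys or return nil):

          defmodule CachedVoice do
            # Pre-synthesized clips keyed by semantic meaning.
            @cache %{
              "greeting" => "priv/audio/greeting.mp3",
              "confirm_appointment" => "priv/audio/confirm_appointment.mp3",
              "ask_to_repeat" => "priv/audio/ask_to_repeat.mp3"
            }

            def respond(transcript) do
              # Ask the LLM to select one of the cache keys for this turn.
              case Map.fetch(@cache, Llm.choose(transcript, Map.keys(@cache))) do
                {:ok, path} -> {:cached, File.read!(path)}
                :error -> {:tts, transcript}  # no semantic match; fall back to live TTS
              end
            end
          end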

      • Dowwie 12 days ago
        Appointment scheduling seems like an ideal consumer of cached audio responses, but how can segments be concatenated into a naturally sounding response?
    • theflyinghorse 12 days ago
      1.3s IMO is a fine time frame to start actually speaking. Humans, well, most of us anyway, don’t start speaking informative words right away. Instead we add in “umm”s, inhales, “mhm”s, “yeah…”s, and so on. I think your approach is a good one. I’m now wondering, for these filler sounds, do you contextualize them somehow? That is, make the filler feel more natural.
      • TonyHaenn 12 days ago
        Depends on what you're aiming for. For my use case, I'm aiming for the feeling of talking to another human. I built an iOS app for little kids to call Santa. Low latency was important. Now I'm working on a mock interview experience; same deal, needs to feel like the real thing.

        Re: contextualizing the filler. No, but it's a good idea :) This thread made me think there's a way to generate one on the fly based on the first part of what the person has said. The challenge, though, is that filler phrases seem to usually relate to what the person said last, not first.

    • abrookewood 12 days ago
      Slightly off-topic, but there isn't any way to tag other HN users, is there? I'm interested to see whether Sean could use any of your methods to improve his own approach.
  • birracerveza 13 days ago
    Excellent article.

    > Now, if you’re wondering if I spent $99 to save some milliseconds for a meaningless demo, the answer is absolutely yes I did.

    Godspeed, soldier.

    > I was very excited for this problem in particular because it’s literally the perfect application of Elixir and Phoenix. If you are building conversational agents, you should seriously consider giving Elixir a try. A large part of how quick this demo was to put together is because of how productive Elixir is.

    Back in the pre-GPT era I built a chatbot with LiveView; it is a fantastic fit for assistants.

    I might pick it up again, it was pretty fun.

  • recurser 13 days ago
    Great write-up! I'm really interested in this area but have minimal experience, and I learnt a lot from this.
  • jasonjmcghee 12 days ago
    Is ElevenLabs Turbo v2 faster than streaming OpenAI TTS?

    Also, have you checked out XTTSv2 over StyleTTS2?

  • meatyapp 12 days ago
    How does Elixir+Phoenix help for this sort of use case instead of just using Python or JavaScript? Thanks for any info!
    • ac_alejos 12 days ago
      TLDR: The problem domain (telecom) fits Elixir perfectly

      If you’re talking about scaling this, Elixir is built on the BEAM VM, which was originally created at Ericsson and is tailor-made for telecom systems.

      Its whole paradigm is built around the concept of Let It Fail, which is about achieving fault tolerance through process isolation and supervision.
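
      Concretely, that usually means something like one supervised process per conversation, so a crash in one call can't take down the rest (module names here are illustrative, not from the article):

          # At application start: a dynamic supervisor that owns all call pipelines.
          children = [
            {DynamicSupervisor, name: MyApp.CallSupervisor, strategy: :one_for_one}
          ]

          Supervisor.start_link(children, strategy: :one_for_one, name: MyApp.Supervisor)

          # Per incoming call: an isolated process running the STT -> LLM -> TTS loop.
          DynamicSupervisor.start_child(MyApp.CallSupervisor, {MyApp.CallPipeline, call_id: call_id})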

      So aside from the fact that Elixir+Phoenix is a productive framework that allowed the author to build this in a few days, it also means that it will scale very well with minimal code changes.

      For reference, one of the solutions you might use to distribute this in Python is Celery, which commonly runs on RabbitMQ, which is itself written in Erlang, the language whose VM Elixir runs on.