With the help of voice cloning and generative AI, the team built a service for morning messages that play on the Sonos speakers at Prototyp’s offices. But getting it just right turned out to be more challenging than expected.
For quite some time, Prototyp’s employees have enjoyed morning messages through Sonos speakers at the various offices. The pre-written announcements were run through a text-to-speech (TTS) model, which inevitably made them somewhat predictable and robotic.
The project began during an earlier Prototyp Week. At the 2024 AI-themed event, a new team set out to make the morning messages more dynamic and engaging, partly by using voice clones and partly by generating a unique text each morning from data such as the day’s weather and special events.
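As a rough illustration of generating a unique text from the day's data, the sketch below assembles a prompt for a text-generation model. The article does not describe the team's actual prompt or data sources, so the names `MorningContext` and `build_prompt`, the field layout, and the wording of the instructions are all assumptions.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class MorningContext:
    """Hypothetical container for the data that feeds the daily message."""
    day: date
    weather: str                                   # free-text summary, e.g. "light rain"
    events: list[str] = field(default_factory=list)


def build_prompt(ctx: MorningContext, office: str) -> str:
    """Turn the day's data into an instruction for a text-generation model."""
    lines = [
        f"Write a short, friendly morning message for the {office} office.",
        f"Date: {ctx.day.isoformat()}",
        f"Weather: {ctx.weather}",
    ]
    if ctx.events:
        lines.append("Special events today: " + "; ".join(ctx.events))
    else:
        lines.append("No special events today.")
    # Assumed constraint: keep the text short enough to be read aloud.
    lines.append("Keep it under 60 words and suitable for reading aloud.")
    return "\n".join(lines)
```

The resulting string would be passed to whichever generative model is in use, and the model's reply then becomes the input to the voice clone.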
The team began by looking into different approaches to voice cloning: using existing APIs or creating their own AI model. Ultimately, they chose the ElevenLabs Voice Cloning API.
They divided the group so everyone could work simultaneously. Some members focused on recording and providing ElevenLabs with quality audio samples for voice cloning. Others concentrated on integrating with the Sonos speakers, building on the existing codebase from the previous project.
There were multiple challenges. One was working with ElevenLabs to ensure that the voices sounded authentic: the cloned voices often came across as monotone and lost many of the unique characteristics of the recorded voice. To refine this, the team experimented extensively with the tool’s adjustable parameters for speech generation.
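One way to structure that kind of experimentation is a small parameter sweep, sketched below. The article does not say which parameters the team adjusted. The setting names `stability`, `similarity_boost` and `style`, and the endpoint shape, reflect my understanding of the ElevenLabs text-to-speech REST API and should be checked against its current documentation. The helper names are hypothetical, and the code only builds requests; sending them (and authentication) is left out.

```python
from itertools import product

# Assumed endpoint shape for the ElevenLabs text-to-speech REST API.
TTS_URL = "https://api.elevenlabs.io/v1/text-to-speech/{voice_id}"


def settings_grid(stabilities, similarities, styles):
    """Return every combination of the given voice settings as payload dicts."""
    return [
        {"stability": s, "similarity_boost": b, "style": y}
        for s, b, y in product(stabilities, similarities, styles)
    ]


def build_request(voice_id: str, text: str, settings: dict) -> tuple[str, dict]:
    """Build (url, JSON body) for one trial; the caller decides how to send it."""
    url = TTS_URL.format(voice_id=voice_id)
    body = {"text": text, "voice_settings": settings}
    return url, body
```

Generating the same sentence once per combination and listening to the results side by side makes it easier to hear which settings reduce the monotone quality for a given voice.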
Another challenge involved the Sonos speakers themselves. The Stockholm office has several speakers, and the previous setup sent the generated audio to each speaker individually, which caused some delay. Despite spending time on the problem, the team did not find a way to make the Sonos API send audio files to multiple groups of speakers simultaneously.
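To illustrate where a one-by-one delay comes from, the sketch below issues the play request to every speaker concurrently instead of in a loop. It is a generic pattern, not the team's code and not a Sonos API call: `play` is a hypothetical callable standing in for whatever sends the audio to a single speaker, and the article does not say whether the team tried this. Starting the requests together shortens the stagger between them but does not guarantee that playback is synchronized.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Optional


def play_on_all(
    speakers: list[str],
    audio_url: str,
    play: Callable[[str, str], None],
) -> dict[str, Optional[BaseException]]:
    """Call play(speaker, audio_url) for every speaker concurrently.

    Returns a mapping from speaker to the exception it raised, or None on success.
    """
    if not speakers:
        return {}
    with ThreadPoolExecutor(max_workers=len(speakers)) as pool:
        futures = {s: pool.submit(play, s, audio_url) for s in speakers}
    # Leaving the with-block waits for all submitted calls to finish.
    return {speaker: fut.exception() for speaker, fut in futures.items()}
```

Collecting the exceptions per speaker, rather than letting the first failure abort the run, means one offline speaker does not stop the message from playing elsewhere.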
A key insight is that voice cloning is hard, and sometimes even a bit uncanny. There was a significant difference in how well each voice was cloned, and by the end of the week, the results were still not perfect.
However, because the messages are now generated dynamically, employees prick up their ears every morning. There’s a sense of curiosity about what message will appear, and which voice will be used.
The team believes the project clearly improved on the initial service but is looking forward to further refining some of the voice clones.
A future idea is to integrate the generated text with the office calendars, so employees can be reminded of important meetings or fun events.