Five minutes with DeepZen

We spent five minutes with Kerem Sozugecer (CTO and co-Founder), Omer Gunes (NLP lead) and Spyridoula Papandreou (TTS language engineer) from DeepZen, OCFI’s newest resident.

The DeepZen team has used AI to create an ultra-realistic voice that can convert text of any length to audio. Traditional speech systems generate every word separately and then put the words together to form a sentence. Unlike a robotic voice, DeepZen’s technology synthesizes the human voice to replicate emotions and intonations. Check out their website, where you can hear ‘William’ reading The Reluctant Cannibals and Kafka’s Metamorphosis narrated by ‘Lauren’ – you simply won’t believe these are AI-produced voices. Amazing.

Not only is the technology incredible, but it also means that an audiobook can be produced in days rather than weeks. Kerem says, “A typical audiobook will cost around $5000 to produce; we are aiming to reduce that significantly. A ten-hour audiobook can be produced by us in just a few hours; the rest of the time is spent on editing to check for continuity, context and emotion. This will get quicker with time, as our algorithm improves.”

Currently the DeepZen team has just five voices – which can have different accents and speak in different languages – but soon they’ll also be able to simulate well-known voices, imitating the tone, pauses, tempo and expression from a short recording.

There are three main strands to the DeepZen business. The main focus is on audiobooks – currently an $8bn market worldwide and set to grow by 25% per year – of which DeepZen plans to take a large chunk in two ways. Kerem explains, “We are currently creating audiobooks for publishers and simply charging for production, but we are also co-publishing with Legend Press and Endeavour Media and are ‘in conversation’ with the big six publishing houses etc. At the moment, two million books are published annually but only 3 per cent are converted into audiobooks, so there’s a big gap in the market that DeepZen wants to fill”.

The second strand of their business is working with an agency to do short voiceovers for advertisers, gaming companies and animation. And the third strand is online training and education, which the team is currently developing. Their text-to-speech tools can add voice features to literacy apps, e-learning platforms and digital learning tools. “Oh, and maybe there’s a fourth strand,” says Kerem, “producing audio content for exhibitions in museums and galleries”.

DeepZen was started by Kerem and colleague Taylan Kamis (CEO) in 2017, but a team, including Omer and Spyridoula, quickly formed to develop the product. Their first office was close to Paddington Station in West London, but they wanted to be in Oxford as well, as that’s where the expertise is. The business is now 14-strong with language specialists, editors and software developers – all based in London and at OCFI.

So, what is DeepZen’s greatest challenge for 2020? “Scaling up”, says Kerem. “We’ve done a first round of seed funding and are now doing our second round. With this new round we aim to expand our R&D capabilities, bring in more editorial staff and work on new languages.”

We wish them the very best of luck with William, Lauren, Alice and Ellie, and look forward to listening to a DeepZen audiobook soon!

If you want to find out more about DeepZen, see here.