Text to Singing: AI to Generate Vocals

This is part of a series on Opportunities for AI in Music Production.

Problem / I can kind of sing, but not that well. I want to sing on my own songs, but then apply effects to make it sound “good” (in tune and on time, but also with different formants), or in the style of another singer.

Solution / An audio plugin that transforms an input vocal into a new voice of your own design. The Tacotron 2 algorithm is getting close, especially with rap…

I’m not quite sure how this works. You train a machine learning model on a famous singer’s vocal performances, including transcriptions, then generate new audio tracks from a new transcript and a reference vocal? I’m also not sure where auto-tune fits into this. Initially it probably remains a separate, prior step in the effects chain.
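To make the "in tune" part concrete, here is a minimal sketch of the core idea behind pitch correction: snap a detected frequency to the nearest equal-tempered semitone. This is only my assumption about the simplest possible version, not how any particular auto-tune product works. The function names are made up for illustration, and I assume A4 = 440 Hz as the reference tuning.

```python
import math

A4_HZ = 440.0   # assumed reference tuning
A4_MIDI = 69    # MIDI note number of A4


def nearest_semitone_hz(freq_hz: float) -> float:
    """Return the frequency of the equal-tempered semitone closest to freq_hz."""
    # Convert frequency to a fractional MIDI note number.
    midi = A4_MIDI + 12 * math.log2(freq_hz / A4_HZ)
    # Snap to the nearest whole note, then convert back to Hz.
    snapped = round(midi)
    return A4_HZ * 2 ** ((snapped - A4_MIDI) / 12)


def correction_cents(freq_hz: float) -> float:
    """Return how far (in cents) the pitch must shift to land on the nearest semitone."""
    target = nearest_semitone_hz(freq_hz)
    return 1200 * math.log2(target / freq_hz)


if __name__ == "__main__":
    sung = 450.0  # a slightly sharp A4
    print(nearest_semitone_hz(sung), correction_cents(sung))
```

A real plugin would also need pitch detection, a scale or key constraint, and smoothing so the correction doesn't sound robotic, none of which is shown here.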

One of the best tools available today for generating vocals is Vocaloid. It synthesizes singing from input MIDI and text. It sounds like a singing version of pre-Siri text to speech, but I’m not sure if it uses a TTS approach or just a huge library of audio samples. It represents the user experience I want though: play a melody on the piano, input words using the keyboard, then choose a virtual singer to sing it.
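The input for that kind of workflow could be sketched roughly like this: a list of MIDI notes, one lyric syllable per note, and a singer name. This is not Vocaloid's actual format or API. `SungNote`, `align_lyrics`, and the singer name are all hypothetical, and I'm assuming the simplest case of exactly one syllable per note.

```python
from dataclasses import dataclass


@dataclass
class SungNote:
    pitch: int        # MIDI note number (60 = middle C)
    start: float      # start time in beats
    duration: float   # length in beats
    syllable: str     # the lyric sung on this note


def align_lyrics(notes, syllables, singer):
    """Pair each (pitch, start, duration) note with one syllable.

    Assumes one syllable per note; a real tool would also handle
    melisma (one syllable held over several notes) and rests.
    """
    if len(notes) != len(syllables):
        raise ValueError("need exactly one syllable per note")
    sung = [SungNote(p, s, d, syl) for (p, s, d), syl in zip(notes, syllables)]
    return {"singer": singer, "notes": sung}


if __name__ == "__main__":
    melody = [(60, 0.0, 1.0), (62, 1.0, 1.0), (64, 2.0, 2.0)]
    request = align_lyrics(melody, ["hel", "lo", "world"], singer="my-virtual-singer")
    for note in request["notes"]:
        print(note)
```

A synthesis model would then take a structure like this and render audio in the chosen voice, which is the hard part this sketch leaves out.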

This Neutrino project is promising. It only generates Japanese singing, but it sounds very good!

The easiest way I’ve found so far to clone a voice and do text to speech (still talking, not singing) is this Python notebook. It gives interesting results with speech in a US accent, but it couldn’t match singing samples or UK garage rappers when I uploaded some samples. Voice cloning that worked on singers and rappers would in and of itself be an amazing tool. I’d like to feed it samples of singers and rappers that I like, and generate more phrases in that style.

Next: Mixing Music with AI