AI Sound: Shamanic Dances

March 1, 2024
Misha Voynov
Composer, engineer, producer, and lecturer

    AI models in the audio field are significantly less developed than language and diffusion models. Even MusicLM, which Google has not yet released to the public, by all indications generates rather unconvincing results.

    Music generators marketed as neural networks typically rely on loading extensive sample libraries and accessing them directly, meaning their generation does not actually happen in a latent space.

    There are several barriers to the industry's development. First, time must be treated as one of the vectors, which means training datasets require far more server space than, say, image datasets. To get anywhere, quality must often be sacrificed by shrinking the original data. Neural networks work with music by converting it into spectrograms, which lets them analyze sound information as an image (see the sketch below). This, too, leads to low-quality generation, because the network may not fully understand what it hears and needs annotations that only humans can provide (and again, the process is complicated because everything unfolds on a timeline). Unlike images, where an author's style can simply be specified, music generation requires specific data about harmonic content and sound sources.
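
    To make the spectrogram step concrete, here is a minimal sketch, assuming the librosa library (my choice; the article names no specific tool) and a hypothetical clip.wav:

```python
# Turn an audio clip into a mel spectrogram -- the 2-D "image"
# (frequency x time) that image-style neural networks consume.
import librosa
import numpy as np

# Load downsampled to 22.05 kHz mono: the kind of quality sacrifice
# the article mentions for keeping dataset size manageable.
y, sr = librosa.load("clip.wav", sr=22050, mono=True)

S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
S_db = librosa.power_to_db(S, ref=np.max)

# Time is the extra axis that images don't have.
print(S_db.shape)  # (128, number_of_time_frames)
```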

    Music

    For example, Soundraw combines pre-recorded fragments algorithmically. Users can select style, structure, duration, density, tempo, and instruments to create tracks for videos or songs; a subscription is required to use the "save" feature, though one can simply record the output directly from one's own sound card (see the sketch below).
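
    A minimal sketch of that "record it yourself" workaround, assuming the sounddevice and scipy libraries (neither is named in the article):

```python
# Hypothetical capture script: record 30 seconds and save as WAV.
# Capturing what the sound card plays back (rather than a microphone)
# usually means selecting a loopback device, e.g. via sd.default.device.
import sounddevice as sd
from scipy.io.wavfile import write

fs = 44100                                     # CD-quality sample rate
rec = sd.rec(int(30 * fs), samplerate=fs, channels=2)
sd.wait()                                      # block until recording finishes
write("capture.wav", fs, rec)
```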

    Suno AI (app.suno.ai) generates songs from text in specific styles, such as soul or R&B, and likewise draws on libraries.

    Output AI creates musical fragments; their Coproducer is very convenient for building tracks. Given natural-language prompts, it serves up samples and loops from Output AI and their flagship library, Arcade. (It's worth noting that Coproducer is more of a "good search engine" than a full-fledged AI generator.)

    Noise

    The initial application of AI in audio was removing noise from recordings. SpectraLayers (Steinberg) effectively separates voice from noise. Older algorithms would leave part of the voice behind in the noise; SpectraLayers now does this far more accurately. It can also extract a voice from music for remixes and split several voices onto separate tracks.
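
    For a scriptable equivalent, here is a hedged sketch using Deezer's open-source Spleeter; this is my substitution, since SpectraLayers itself is a commercial GUI tool:

```python
# Split a track into vocals and accompaniment with Spleeter.
from spleeter.separator import Separator

# "spleeter:2stems" yields vocals + accompaniment;
# "spleeter:5stems" also isolates drums, bass, and piano.
separator = Separator("spleeter:2stems")
separator.separate_to_file("song.mp3", "output/")
# Results land in output/song/vocals.wav and output/song/accompaniment.wav.
```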

    The author said meow

    If neural networks can differentiate instruments, noise, and voices, they can quickly find any borrowings. For instance, neural networks recently identified the samples used in Daft Punk's tracks. This opens up remarkable possibilities for monitoring copyright infringement and will likely demand new methods of regulation.
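
    The article doesn't say which tools the Daft Punk analysis used, but one common building block of such borrowing detection is acoustic fingerprinting. A hedged sketch with Chromaprint, via the pyacoustid bindings:

```python
# Compute a compact acoustic fingerprint of a recording; matching
# fingerprints across tracks is one way to surface borrowed material.
import acoustid

duration, fingerprint = acoustid.fingerprint_file("track.mp3")
print(f"{duration:.1f} s, fingerprint starts: {fingerprint[:32]}")

# With an AcoustID API key, acoustid.match() can look the file up
# against the MusicBrainz database to identify the recording.
```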

    Voice Cloning

    Another breakthrough is voice generation. Neural networks can copy the intonation and voice of famous musicians, actors, and movie characters by being trained on recordings of their voices, driven by Python scripts. This technology is called "voice cloning."

              Winnie the Pooh performs "Toxicity" by System of a Down*

* In the Soviet animated version of Winnie the Pooh, the protagonist's tone is markedly different from Disney's. Yevgeny Leonov's character is fast-talking and cunning, presenting himself as a pseudo-intellectual poet (and Piglet has a gun…).

         "Beautiful Faraway" covered by the late punk-rock icon Egor Letov*

* Egor Letov, the singer of Civil Defence (Grazhdanskaya Oborona), a band banned in the USSR for its anti-government ideas, was forcibly committed to an asylum in 1985. Here, voice cloning taps into a paradoxical reality, resurrecting the long-dead Letov to sing a song usually delivered in the purest soprano. The song, from the Soviet science-fiction romantic drama "Guest from the Future" (1985), depicts a child's dream of a wonderful and beautiful future, filled with the hope that everyone will live to see communism.

    The voice generation process requires significant GPU resources and currently can only be run locally on Windows.
    SO-VITS, built analogously to the Vocaloid program that lets animated characters sing texts, generates excellent voices. Applied in the film industry, this technology will allow live, intonationally rich speech to be created from a base of actors' voices, making the actors themselves largely unnecessary. The model trains on literally 500 words of material and has many settings that can be adjusted to suit it.
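
    SO-VITS keeps its training and inference scripts in its own repository; as an open-source stand-in for the cloning workflow described here, a minimal sketch with Coqui TTS's XTTS v2 model (my substitution, not the author's setup):

```python
# Few-shot voice cloning: synthesize text in the voice heard in a
# short reference recording (speaker.wav is a hypothetical file).
from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")
tts.tts_to_file(
    text="Hello from a cloned voice.",
    speaker_wav="speaker.wav",   # a few seconds of the target voice
    language="en",
    file_path="cloned.wav",
)
```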

    Of course, it's worth mentioning that hundreds of paid services offer to read texts in the voices of famous characters and personalities, such as Morgan Freeman, Trump, or Obama, but the most entertaining place to test new developments is probably the familiar "house of models," aka the home of transformers: Hugging Face.
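
    For example, a hedged sketch of test-driving a speech model straight from Hugging Face via the transformers pipeline (the suno/bark-small checkpoint is just one example, and a recent transformers version that ships the text-to-speech pipeline is assumed):

```python
# Pull a text-to-speech model from the Hugging Face Hub and run it.
from transformers import pipeline
import scipy.io.wavfile as wavfile

tts = pipeline("text-to-speech", model="suno/bark-small")
out = tts("The author said meow.")  # {"audio": ndarray, "sampling_rate": int}

wavfile.write("meow.wav", out["sampling_rate"], out["audio"].squeeze())
```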

    Multimodality

    Currently, assembling and testing multimodal solutions is great fun. These are mainly combinations of vocal and musical models, sometimes joined by diffusion models (a toy chain is sketched at the end of this section), with broad applications in art, though best known for animating mouth movements to match a speech recording. One of the leaders here is Yamaha's Vocaloid, but Dreamtonics achieves very realistic results.
    Arguably, successful interaction with audio models requires musical education, Python, and curiosity. But it seems the first two may soon become unnecessary: while some are training models, others can learn music from models.
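
    A toy sketch of such a chain, assuming pydub and two hypothetical input files (say, a cloned vocal and a generated music bed):

```python
# Lay a generated vocal over a generated music bed.
from pydub import AudioSegment

vocal = AudioSegment.from_wav("cloned_vocal.wav")    # e.g. a voice-model output
music = AudioSegment.from_wav("generated_bed.wav")   # e.g. a music-model output

# Duck the bed by 6 dB, then overlay the vocal starting at 2 seconds.
mix = (music - 6).overlay(vocal, position=2000)
mix.export("mix.wav", format="wav")
```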

    You are Music

    There are still no tools capable of creating music from scratch, and if you want to control speech, you'll have to voice the text yourself and assign it characteristics, because synthesizers don't understand context. Training your own model allows for more, but it too has limitations. To work successfully with audio, an understanding of the basic metaphors of music remains indispensable.

    What (now) doesn't work? Current limitations and the future development of neural networks in sound processing

    • Understanding context and intonations: although neural networks are good at recognizing speech, fully understanding the context and nuances of natural language remains a challenging task.

    • Recognizing and interpreting irony, sarcasm, and other subtle aspects of speech.

    • Emulating human emotions in synthesized speech: Creating a naturally sounding, emotionally expressive synthesized voice is still a challenge.

    • Real-time rendering: Complex types of sound processing, especially in real time, may require significant computational resources.

    • Automated creation of complex music: while neural networks can generate music, creating complex musical works comparable in quality to those created by experienced composers is not yet fully realized.

    Industries Where Changes Are Imminent

    • Animation: Using AI to create unique voices for characters based on descriptions of their personalities.

    • Education: Musical training courses where AI generates examples and exercises in various styles and at various difficulty levels.

    • Virtual DJ sets: Development of AI systems that can create and play music sets in real-time based on audience data.

    • Calls and conferences: Filtering background noise and improving speech clarity in real time during calls and video conferences.

    • Video games and cinema: Automatic creation of sound effects that can be generated and adapted to fit the game or film scenario.

    A Few More Links

    Soundraw.io
    A platform for automatic music generation that allows users to create unique tracks without specialized musical knowledge.
    Features: Intuitive interface, a wide selection of genres and styles, suitable for content creators of all levels.

    So-vits
    An open-source project for voice cloning that uses deep learning to create accurate copies of human voices.
    Features: The ability to create realistic voice duplicates, useful for digital assistants, voicing texts, and other audio projects.

    Heygen.com
    A platform for creating digital avatars and translating video speech, automatically generating expressive and convincing video presentations in many languages.
    Features: Ease of use, the ability to personalize avatars, suitable for business presentations, training courses, and social media.

    App.suno.ai 
    A service for creating songs from text, turning any words into musical compositions using AI.
    Features: Innovative approach to music composition, convenient for songwriters, poets, and creative individuals seeking new forms of expression.
