Preface & Introduction
Recently, AI singers have become hugely popular; Bilibili has even launched an official submission event for them.
The most commonly used tools at the moment are RVC, VITS, and DiffSinger.
RVC and VITS are voice changers that convert existing input audio, while DiffSinger works more like VOCALOID.
In theory, DiffSinger should produce better results than VITS.
However, VOCALOID-style software is hard to learn without a music background (I don't know how to use it either). I may write a tutorial on it later (no promises).
This tutorial is written for Windows; on Linux, some of the steps will need to be adapted.
First, let's take a look at the disclaimer.
This tutorial is provided for communication and learning purposes only; it must not be used commercially or to infringe on the legitimate rights and interests of others. Training on someone else's voice and publishing the results may violate the "Regulations on the Administration of Deep Synthesis of Internet Information Services" and other laws and regulations. When using this tutorial and the models it produces, please comply with applicable laws, including intellectual property law. Any legal disputes arising from their use are the user's own responsibility; the author of this site assumes no liability.
Tutorial Preparation
- An NVIDIA graphics card with more than 6 GB of VRAM, updated to the latest driver version.
- Download and install the UVR program (Download link).
- Download and unzip the RVC program (Download link).
- Download and unzip the Audio Slicer program (Download link).
Dataset Preparation
Training can use either sung or spoken audio, but do not mix the two, or the results will be poor.
Training on sung audio makes the model approximate the singer's original singing style, while training on spoken audio makes it follow the singing style of the input audio.
Here are some examples:
- Input audio
- Training with vocal audio
- Training with spoken audio
Do not use low-quality music downloads as training audio; they will noticeably degrade the results.
Open the Ultimate Vocal Remover you just installed to separate the accompaniment from the vocals.
The default model does not perform well, so download the Demucs v4 htdemucs_ft model by following the steps below.
(In the screenshot the model is already downloaded on my machine; make sure you actually select Demucs v4 htdemucs_ft and not a different model.)
After downloading, set the options marked with red lines, then click "Start" (this may take several hours if there are many files).
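As an aside for Linux users (or anyone who prefers the command line), the same htdemucs_ft model can be run directly through the demucs Python package instead of the UVR GUI. This assumes you install demucs yourself; it is not part of the RVC download:

```shell
# Install the demucs package, then separate a song into "vocals" and
# "no_vocals" stems using Demucs v4's htdemucs_ft model.
# Output lands under separated/htdemucs_ft/<track name>/.
pip install demucs
demucs -n htdemucs_ft --two-stems=vocals song.mp3
```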
Once processing finishes, it is recommended to listen through the output. If a clip contains noise, distortion, leftover accompaniment, or telephone-like quality, discard it or cut out the bad sections; otherwise the final model will suffer noticeably.
Although the RVC program can slice audio itself, Audio Slicer does a better job. Open slicer-gui.exe, set the parameters marked with red lines, and click "Start".
After slicing, delete any clips shorter than 5 seconds, then listen to the clips shorter than 10 seconds and delete any that are pure noise.
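If you have many clips, checking durations by hand is tedious. The sketch below is a minimal example using only Python's standard library (the function name, folder path, and thresholds are my own, not part of RVC or Audio Slicer): it deletes clips under 5 seconds and returns the 5-to-10-second ones so you can audition them by ear.

```python
import os
import wave

def wav_duration(path):
    """Return the duration of a WAV file in seconds."""
    with wave.open(path, "rb") as f:
        return f.getnframes() / f.getframerate()

def prune_short_clips(folder, min_keep=5.0, review_below=10.0):
    """Delete WAV clips shorter than min_keep seconds; return the names
    of clips shorter than review_below seconds for manual listening."""
    to_review = []
    for name in sorted(os.listdir(folder)):
        if not name.lower().endswith(".wav"):
            continue
        path = os.path.join(folder, name)
        dur = wav_duration(path)
        if dur < min_keep:
            os.remove(path)
        elif dur < review_below:
            to_review.append(name)
    return to_review
```

Anything the function returns should still be listened to manually before training.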
Start Training
The RVC program bundles its own Python runtime, so no environment setup is needed.
Simply run go-web.bat and wait a moment; a browser tab will open. If it doesn't, open http://127.0.0.1:7897/ manually and select "Training". You may adjust the settings explained below, but it is best to leave the unexplained ones at their defaults.
Setting explanations
Input experiment name: the name of the model. Do not reuse the name of an existing folder under logs.
Target sampling rate: defaults to 40k; you can change it to 48k.
Input training folder path: the path of the folder that Audio Slicer output earlier.
Total training epochs (total_epoch): the number of training epochs. More than 100 epochs generally gives a better model.
Save frequency (save_every_epoch): how often checkpoints are saved. If you raise total_epoch, raise this value too, or the checkpoints will eat a lot of disk space.
Batch size per GPU: leave the default if you have 6 GB of VRAM; with more you can raise it. In testing, a 12 GB RTX 3060 handles a batch size of 12 or 16.
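To see why save frequency matters for disk space: the number of snapshots kept is roughly total_epoch divided by save_every_epoch. The per-checkpoint size below is an illustrative assumption, not a measured RVC figure:

```python
total_epoch = 200        # total training epochs
save_every_epoch = 25    # save a checkpoint every 25 epochs
mb_per_checkpoint = 500  # assumed snapshot size, for illustration only

# Number of checkpoints kept on disk, and a rough total footprint.
checkpoints = total_epoch // save_every_epoch
disk_mb = checkpoints * mb_per_checkpoint
print(checkpoints, disk_mb)  # 8 checkpoints, about 4000 MB
```

Doubling save_every_epoch halves the number of snapshots, which is why it should scale with total_epoch.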
After setting is complete, click "One-Click Training" and wait (you can take a nap).
Running Inference
You need to use UVR again, configured as above, but this time do not select "vocals only", and remember to point the output at a new, empty folder.
The feature retrieval file lives in the logs/experiment_name folder; the file whose name starts with "added_" is the feature index. (If you don't plan to continue training, you can move the feature file into the weights folder and delete logs/experiment_name, which takes up a lot of disk space.)
Note! The trained model files are in the weights folder, not in the logs folder.
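That housekeeping step can be scripted. This is a hedged sketch: the helper name, experiment name, and folder arguments are my own (RVC ships no such function). It moves the added_*.index file into weights and leaves deleting the logs folder as an explicit, commented-out step so nothing is removed by accident:

```python
import glob
import os
import shutil

def collect_feature_index(experiment, logs_root="logs", weights_dir="weights"):
    """Move the added_*.index feature file for an experiment into the
    weights folder. Returns the new path, or None if no index was found."""
    logs_dir = os.path.join(logs_root, experiment)
    matches = glob.glob(os.path.join(logs_dir, "added_*.index"))
    if not matches:
        return None
    dest = os.path.join(weights_dir, os.path.basename(matches[0]))
    shutil.move(matches[0], dest)
    # Only once you are completely done training this model, reclaim the
    # disk space by uncommenting:
    # shutil.rmtree(logs_dir)
    return dest
```

Calling `collect_feature_index("my_voice")` (a hypothetical experiment name) would return the index file's new path, or None if nothing matched.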
The feature retrieval ratio works best between 0.6 and 0.9; outside that range the results degrade sharply. Songs far outside the singer's vocal range will also convert much worse. Experiment with the value rather than blindly keeping the default of 0.76.
Then, set it as shown in the image below and click "Convert".
After the conversion completes, download the result and mix the output audio with the accompaniment in software such as Adobe Audition.
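If you don't have a DAW handy, a crude mixdown can even be done in pure Python. This is only a rough sketch for 16-bit mono WAVs at matching sample rates (the function name and gain values are my own, and a real mix in Audition will sound better):

```python
import array
import wave

def mix_wavs(vocal_path, inst_path, out_path, vocal_gain=1.0, inst_gain=0.8):
    """Sum two 16-bit mono WAVs sample by sample, with hard clipping.
    Both inputs must share the same sample rate."""
    with wave.open(vocal_path, "rb") as a, wave.open(inst_path, "rb") as b:
        assert a.getframerate() == b.getframerate(), "sample rates differ"
        assert a.getsampwidth() == 2 and b.getsampwidth() == 2, "need 16-bit PCM"
        rate = a.getframerate()
        va = array.array("h", a.readframes(a.getnframes()))
        vb = array.array("h", b.readframes(b.getnframes()))
    # Pad to the longer file; missing samples contribute silence.
    mixed = array.array("h", bytes(2 * max(len(va), len(vb))))
    for i in range(len(mixed)):
        s = 0.0
        if i < len(va):
            s += vocal_gain * va[i]
        if i < len(vb):
            s += inst_gain * vb[i]
        mixed[i] = max(-32768, min(32767, int(s)))  # clip to 16-bit range
    with wave.open(out_path, "wb") as out:
        out.setnchannels(1)
        out.setsampwidth(2)
        out.setframerate(rate)
        out.writeframes(mixed.tobytes())
```

Lowering inst_gain slightly (as in the 0.8 default here) helps keep the summed signal from clipping.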
Conclusion
Tutorial ends here
If you have any questions, feel free to ask in the comments section 😆