Preface & Introduction
Recently, AI singers have become hugely popular; Bilibili has even launched an official submission event for them.
The most commonly used tools at the moment are RVC, VITS, and DiffSinger.
RVC and VITS are voice changers that convert existing input audio, while DiffSinger works more like VOCALOID.
In theory, DiffSinger should produce better results than VITS.
However, VOCALOID-style software is hard to learn without a music background (I don't know how to use it either). I may write a tutorial on it later (no promises).
This tutorial is written for Windows; on Linux, some of the steps will need to be adapted.
First, let's take a look at the disclaimer.
This tutorial is provided for communication and learning purposes only; it must not be used commercially or to infringe on the legitimate rights and interests of others. Training on someone else's voice and publishing the results may violate the "Regulations on the Administration of Deep Synthesis of Internet Information Services" and other laws and regulations. When using this tutorial and the models it produces, please comply with applicable laws, including intellectual property law. Any legal disputes arising from their use are the user's own responsibility; the author of this site assumes no liability.
Tutorial Preparation
- An NVIDIA graphics card with more than 6 GB of VRAM, updated to the latest driver version.
- Download and install the UVR program (Download link).
- Download and unzip the RVC program (Download link).
- Download and unzip the Audio Slicer program (Download link).
Dataset Preparation
Training can use either sung or spoken audio, but do not mix the two, or the results will be poor.
Training on sung audio makes the model approximate the singer's original singing style, while training on spoken audio makes it follow the singing style of the input audio.
Here are some examples:
- Input audio
- Training with vocal audio
- Training with spoken audio
Do not use low-quality music downloads as training audio; they will noticeably degrade the results.
Open the Ultimate Vocal Remover you just installed to separate the accompaniment from the vocals.
The default model does not perform well, so download the Demucs v4 htdemucs_ft model by following the steps below.
(In the screenshot the model is already downloaded on my machine; make sure you actually select Demucs v4 htdemucs_ft and not a different model.)
After downloading, set the options marked with red lines, then click "Start" (this may take several hours if there are many files).
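As an aside for Linux users (or anyone who prefers the command line), the same htdemucs_ft model can be run directly through the demucs Python package instead of the UVR GUI. This assumes you install demucs yourself; it is not part of the RVC download:

```shell
# Install the demucs package, then separate a song into "vocals" and
# "no_vocals" stems using Demucs v4's htdemucs_ft model.
# Output lands under separated/htdemucs_ft/<track name>/.
pip install demucs
demucs -n htdemucs_ft --two-stems=vocals song.mp3
```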
Once processing finishes, it is recommended to listen through the output. If a clip contains noise, distortion, leftover accompaniment, or telephone-like quality, discard it or cut out the bad sections; otherwise the final model will suffer noticeably.
Although the RVC program can slice audio itself, Audio Slicer does a better job. Open slicer-gui.exe, set the parameters marked with red lines, and click "Start".
After slicing, delete any clips shorter than 5 seconds, then listen to the clips shorter than 10 seconds and delete any that are pure noise.
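If you have many clips, checking durations by hand is tedious. The sketch below is a minimal example using only Python's standard library (the function name, folder path, and thresholds are my own, not part of RVC or Audio Slicer): it deletes clips under 5 seconds and returns the 5-to-10-second ones so you can audition them by ear.

```python
import os
import wave

def wav_duration(path):
    """Return the duration of a WAV file in seconds."""
    with wave.open(path, "rb") as f:
        return f.getnframes() / f.getframerate()

def prune_short_clips(folder, min_keep=5.0, review_below=10.0):
    """Delete WAV clips shorter than min_keep seconds; return the names
    of clips shorter than review_below seconds for manual listening."""
    to_review = []
    for name in sorted(os.listdir(folder)):
        if not name.lower().endswith(".wav"):
            continue
        path = os.path.join(folder, name)
        dur = wav_duration(path)
        if dur < min_keep:
            os.remove(path)
        elif dur < review_below:
            to_review.append(name)
    return to_review
```

Anything the function returns should still be listened to manually before training.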
Start Training
The RVC program bundles its own Python runtime, so no environment setup is needed.
Simply run go-web.bat and wait a moment; a browser tab will open. If it doesn't, open http://127.0.0.1:7897/ manually and select "Training". You may adjust the settings explained below, but it is best to leave the unexplained ones at their defaults.
Setting explanations
Input experiment name: the name of the model. Do not reuse the name of an existing folder under logs.
Target sampling rate: defaults to 40k; you can change it to 48k.
Input training folder path: the path of the folder that Audio Slicer output earlier.
Total training epochs (total_epoch): the number of training epochs. More than 100 epochs generally gives a better model.
Save frequency (save_every_epoch): how often checkpoints are saved. If you raise total_epoch, raise this value too, or the checkpoints will eat a lot of disk space.
Batch size per GPU: leave the default if you have 6 GB of VRAM; with more you can raise it. In testing, a 12 GB RTX 3060 handles a batch size of 12 or 16.
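To see why save frequency matters for disk space: the number of snapshots kept is roughly total_epoch divided by save_every_epoch. The per-checkpoint size below is an illustrative assumption, not a measured RVC figure:

```python
total_epoch = 200        # total training epochs
save_every_epoch = 25    # save a checkpoint every 25 epochs
mb_per_checkpoint = 500  # assumed snapshot size, for illustration only

# Number of checkpoints kept on disk, and a rough total footprint.
checkpoints = total_epoch // save_every_epoch
disk_mb = checkpoints * mb_per_checkpoint
print(checkpoints, disk_mb)  # 8 checkpoints, about 4000 MB
```

Doubling save_every_epoch halves the number of snapshots, which is why it should scale with total_epoch.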
After setting is complete, click "One-Click Training" and wait (you can take a nap).
Running Inference
You need to use UVR again, configured as above, but this time do not select "vocals only", and remember to point the output at a new, empty folder.
The feature retrieval file lives in the logs/experiment_name folder; the file whose name starts with "added_" is the feature index. (If you don't plan to continue training, you can move the feature file into the weights folder and delete logs/experiment_name, which takes up a lot of disk space.)
Note! The trained model files are in the weights folder, not in the logs folder.
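That housekeeping step can be scripted. This is a hedged sketch: the helper name, experiment name, and folder arguments are my own (RVC ships no such function). It moves the added_*.index file into weights and leaves deleting the logs folder as an explicit, commented-out step so nothing is removed by accident:

```python
import glob
import os
import shutil

def collect_feature_index(experiment, logs_root="logs", weights_dir="weights"):
    """Move the added_*.index feature file for an experiment into the
    weights folder. Returns the new path, or None if no index was found."""
    logs_dir = os.path.join(logs_root, experiment)
    matches = glob.glob(os.path.join(logs_dir, "added_*.index"))
    if not matches:
        return None
    dest = os.path.join(weights_dir, os.path.basename(matches[0]))
    shutil.move(matches[0], dest)
    # Only once you are completely done training this model, reclaim the
    # disk space by uncommenting:
    # shutil.rmtree(logs_dir)
    return dest
```

Calling `collect_feature_index("my_voice")` (a hypothetical experiment name) would return the index file's new path, or None if nothing matched.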
The feature retrieval ratio works best between 0.6 and 0.9; outside that range the results degrade sharply. Songs far outside the singer's vocal range will also convert much worse. Experiment with the value rather than blindly keeping the default of 0.76.
Then, set it as shown in the image below and click "Convert".
After the conversion completes, download the result and mix the output audio with the accompaniment in software such as Adobe Audition.
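If you don't have a DAW handy, a crude mixdown can even be done in pure Python. This is only a rough sketch for 16-bit mono WAVs at matching sample rates (the function name and gain values are my own, and a real mix in Audition will sound better):

```python
import array
import wave

def mix_wavs(vocal_path, inst_path, out_path, vocal_gain=1.0, inst_gain=0.8):
    """Sum two 16-bit mono WAVs sample by sample, with hard clipping.
    Both inputs must share the same sample rate."""
    with wave.open(vocal_path, "rb") as a, wave.open(inst_path, "rb") as b:
        assert a.getframerate() == b.getframerate(), "sample rates differ"
        assert a.getsampwidth() == 2 and b.getsampwidth() == 2, "need 16-bit PCM"
        rate = a.getframerate()
        va = array.array("h", a.readframes(a.getnframes()))
        vb = array.array("h", b.readframes(b.getnframes()))
    # Pad to the longer file; missing samples contribute silence.
    mixed = array.array("h", bytes(2 * max(len(va), len(vb))))
    for i in range(len(mixed)):
        s = 0.0
        if i < len(va):
            s += vocal_gain * va[i]
        if i < len(vb):
            s += inst_gain * vb[i]
        mixed[i] = max(-32768, min(32767, int(s)))  # clip to 16-bit range
    with wave.open(out_path, "wb") as out:
        out.setnchannels(1)
        out.setsampwidth(2)
        out.setframerate(rate)
        out.writeframes(mixed.tobytes())
```

Lowering inst_gain slightly (as in the 0.8 default here) helps keep the summed signal from clipping.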
Conclusion
Tutorial ends here
If you have any questions, feel free to ask in the comments section 😆