Deep Learning – Voice Conversion
TriAAN-VC: Triple Adaptive Attention Normalization for Any-to-Any Voice Conversion
Objective
Voice conversion (VC) is the task of transforming the voice of a source speaker into that of a target speaker while preserving the linguistic content of the source speech. Existing methods do not satisfy both requirements at once: their outputs trade off preserving the source content against matching the target speaker's characteristics. In this study, we propose Triple Adaptive Attention Normalization VC (TriAAN-VC), comprising an encoder-decoder and an attention-based adaptive normalization block, which resolves this trade-off.
Data
We use the VCTK dataset [1], an English multi-speaker corpus.
[1] C. Veaux, J. Yamagishi, K. MacDonald, et al., “Superseded - CSTR VCTK Corpus: English multi-speaker corpus for CSTR voice cloning toolkit,” 2016.
Related Work
AdaIN-VC [2] adopted adaptive instance normalization for the conversion process.
S2VC [3] used cross-attention for the conversion process.

[3] J. Lin, et al., “S2VC: A framework for any-to-any voice conversion with self-supervised pretrained representations,” arXiv preprint arXiv:2104.02901, 2021.
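Adaptive instance normalization, the operation AdaIN-VC builds on, can be illustrated with a small sketch. It normalizes each channel of the content feature over time and then rescales it with the mean and standard deviation of the corresponding channel of the speaker feature. The function names (channel_stats, adain), the list-of-channels layout, and the eps value are assumptions made for illustration; this is the generic operation, not the AdaIN-VC implementation.

```python
from statistics import fmean, pvariance


def channel_stats(feature, eps=1e-5):
    """Per-channel mean and standard deviation over the time axis."""
    means = [fmean(channel) for channel in feature]
    stds = [(pvariance(channel) + eps) ** 0.5 for channel in feature]
    return means, stds


def adain(content, style, eps=1e-5):
    """Normalize each content channel, then apply the style channel's statistics.

    Both inputs are lists of channels; each channel is a list of values over time.
    The two inputs may have different lengths in time.
    """
    c_mean, c_std = channel_stats(content, eps)
    s_mean, s_std = channel_stats(style, eps)
    return [
        [(x - cm) / cs * ss + sm for x in channel]
        for channel, cm, cs, sm, ss in zip(content, c_mean, c_std, s_mean, s_std)
    ]
```

After the call, each output channel keeps the temporal shape of the content feature but carries the mean and (approximately) the standard deviation of the speaker feature, which is why the operation suits transferring speaker characteristics.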
Proposed Method
Triple Adaptive Attention Normalization VC (TriAAN-VC) targets non-parallel any-to-any (A2A) VC. Built on an encoder-decoder structure, it disentangles content and speaker features. The TriAAN block extracts both detailed and global speaker representations from the disentangled features and uses adaptive normalization for conversion. For training, a siamese loss with time masking is applied to maximize preservation of the source content.
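The training idea of a siamese loss with time masking can be sketched as follows. The text above does not give the exact formulation, so this sketch assumes an L1 reconstruction loss applied twice: once to the output for the clean input and once to the output for a copy with a random span of frames zeroed out. The names time_mask, l1_loss, and siamese_loss, the model callable, and the max_width parameter are all hypothetical.

```python
import random


def time_mask(frames, max_width, rng):
    """Return a copy of `frames` with one random contiguous span of frames zeroed."""
    width = rng.randint(0, min(max_width, len(frames)))
    start = rng.randint(0, len(frames) - width)
    return [
        [0.0] * len(frame) if start <= t < start + width else list(frame)
        for t, frame in enumerate(frames)
    ]


def l1_loss(a, b):
    """Mean absolute difference between two equally shaped frame sequences."""
    total = sum(abs(x - y) for fa, fb in zip(a, b) for x, y in zip(fa, fb))
    count = sum(len(frame) for frame in a)
    return total / count


def siamese_loss(model, frames, target, max_width, rng):
    """Assumed form: loss on the clean input plus the same loss on a masked copy."""
    clean_out = model(frames)
    masked_out = model(time_mask(frames, max_width, rng))
    return l1_loss(clean_out, target) + l1_loss(masked_out, target)
```

Under this assumption, the second term pushes the model to produce the same target even when part of the input is missing in time, which encourages it to rely on robust content information.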

Results
TriAAN-VC achieved better word error rate (WER), character error rate (CER), and speaker verification (SV) scores than the existing methods, which suffer from the VC trade-off, regardless of the conversion scenario. This suggests that conversion methods using compact speaker features can retain both the source content and the target speaker characteristics at once.
The mean opinion score (MOS) results are consistent with the objective evaluation. TriAAN-VC showed a slight improvement over S2VC in similarity, coming close to the performance of the oracle. It also outperformed the previous methods in naturalness, suggesting that the proposed model can produce relatively unbiased results.
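For readers unfamiliar with the content metrics, WER and CER are both edit distances normalized by the reference length, computed over words and characters respectively. The sketch below shows the standard definition; the function names are hypothetical, and in a VC evaluation the hypothesis would typically be a speech recognizer's transcript of the converted speech, which is assumed here rather than stated in the text. This simple CER counts spaces as characters.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance: minimum substitutions, insertions, and deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over the reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance over reference length."""
    return edit_distance(reference, hypothesis) / len(reference)
```

Lower values are better for both metrics, and a value of 0 means the transcript of the converted speech matches the reference text exactly.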
