In an any-to-any voice conversion transformer, the linguistic features and the voice identity of an utterance are separated and used independently, so that any source content can be combined with any target voice. Bottleneck features (BNF) and speaker embeddings are the inputs, and the model predicts a mel-spectrogram. The synthesised speech is very clear, with good voice conversion.
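The input/output relationship described above can be sketched as follows. This is only a minimal illustration with random, untrained weights, not the actual model: the class name `TinyConversionModel`, all dimensions, the single self-attention layer, and the choice to concatenate the speaker embedding onto every frame are assumptions made for the example. Positional encoding and the vocoder that would turn the mel-spectrogram into audio are left out.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
N_FRAMES, BNF_DIM, SPK_DIM, MODEL_DIM, N_MELS = 50, 256, 64, 128, 80


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class TinyConversionModel:
    """Toy stand-in: one self-attention layer with random weights."""

    def __init__(self):
        scale = 0.02
        self.w_in = rng.normal(0, scale, (BNF_DIM + SPK_DIM, MODEL_DIM))
        self.w_q = rng.normal(0, scale, (MODEL_DIM, MODEL_DIM))
        self.w_k = rng.normal(0, scale, (MODEL_DIM, MODEL_DIM))
        self.w_v = rng.normal(0, scale, (MODEL_DIM, MODEL_DIM))
        self.w_out = rng.normal(0, scale, (MODEL_DIM, N_MELS))

    def __call__(self, bnf, spk):
        # Repeat the utterance-level speaker embedding on every frame
        # and join it with the frame-level linguistic features.
        spk_frames = np.tile(spk, (bnf.shape[0], 1))
        x = np.concatenate([bnf, spk_frames], axis=1) @ self.w_in

        # Single-head scaled dot-product self-attention over frames.
        q, k, v = x @ self.w_q, x @ self.w_k, x @ self.w_v
        attn = softmax(q @ k.T / np.sqrt(MODEL_DIM))
        x = x + attn @ v  # residual connection

        # Project each frame to mel-spectrogram bins.
        return x @ self.w_out


model = TinyConversionModel()
bnf = rng.normal(size=(N_FRAMES, BNF_DIM))  # content of one utterance
speaker_a = rng.normal(size=SPK_DIM)        # identity of target voice A
speaker_b = rng.normal(size=SPK_DIM)        # identity of target voice B

mel_a = model(bnf, speaker_a)
mel_b = model(bnf, speaker_b)
print(mel_a.shape)  # (50, 80)
```

Because the content features and the speaker embedding enter as separate inputs, the same `bnf` can be paired with any speaker embedding, which is what makes the any-to-any combination possible.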
voice conversion transformer
Voice conversion, Attention, Transformer, PPG, BNF, Speaker embeddings