{"id":2331,"date":"2025-06-25T16:35:33","date_gmt":"2025-06-25T07:35:33","guid":{"rendered":"https:\/\/aida.korea.ac.kr\/?page_id=2331"},"modified":"2025-06-25T16:38:38","modified_gmt":"2025-06-25T07:38:38","slug":"635-2","status":"publish","type":"page","link":"https:\/\/aida.korea.ac.kr\/?page_id=2331","title":{"rendered":""},"content":{"rendered":"\n<h1 class=\"wp-block-heading\">Deep Learning \u2013 Speaker Verification<\/h1>\n\n\n\n\n<hr class=\"wp-block-separator has-css-opacity is-style-wide\"\/>\n\n\n\n<h2 class=\"wp-block-heading\"><strong><strong>Universal Pooling Method of Multi-layer Features from Pretrained Models for Speaker Verification<\/strong><\/strong><\/h2>\n\n\n\n<p><strong>Objective<\/strong><\/p>\n\n\n\n<p>Recent advancements in automatic speaker verification (ASV) studies have been achieved by leveraging large-scale pre-trained networks. In this study, we analyze the approaches toward such a paradigm and underline the significance of interlayer information processing as a result. Accordingly, we present a novel backend model that comprises a layer\/frame-level network and two pooling stages, one along the layer axis and one along the frame axis.\n <\/p>\n\n\n\n<p><strong>Data<\/strong><\/p>\n\n\n\n<p>We use the CSTR VCTK corpus [1], LibriSpeech [2], and the VoxCeleb1 and VoxCeleb2 datasets [3, 4], so that the experiments cover both high- and low-resource setups and a range of recording environments. <\/p>\n\n\n\n<p class=\"has-small-font-size\">[1] J. Yamagishi et al., \u201cCSTR VCTK corpus: English multi-speaker corpus for CSTR voice cloning toolkit (version 0.92)\u201d, 2019. <br>\n[2] V. Panayotov et al., \u201cLibriSpeech: an ASR corpus based on public domain audio books\u201d, 2015. <br>\n[3] A. Nagrani et al., \u201cVoxCeleb: A large-scale speaker identification dataset\u201d, 2017. <br>\n[4] J. S. 
Chung et al., \u201cVoxCeleb2: Deep speaker recognition\u201d, 2018.\n<\/p>\n\n\n\n<p><strong>Related Work<\/strong><\/p>\n\n\n\n<p>ECAPA-TDNN [5] builds on the SE-Res2Net [6] topology for acoustic feature processing. WavLM [7] uses ECAPA-TDNN to process the pre-trained model output, which led to state-of-the-art verification performance on the VoxCeleb benchmark. Meanwhile, D3Net [8] proposes an architecture that avoids creating blind spots in densely connected dilated convolutions.<\/p>\n\n\n<figure class=\"wp-block-image aligncenter size-full\">\n    <img decoding=\"async\" src=\"https:\/\/aida.korea.ac.kr\/wp-content\/uploads\/2025\/02\/UniPool.png\" alt=\"\" class=\"wp-image-1727\"\/>\n    <figcaption class=\"wp-element-caption\">\n        [5] B. Desplanques et al., \u201cECAPA-TDNN: Emphasized channel attention, propagation and aggregation in TDNN based speaker verification\u201d, 2020. <br>\n        [6] S. Gao et al., \u201cRes2Net: A new multi-scale backbone architecture\u201d, 2019.<br>\n        [7] S. Chen et al., \u201cWavLM: Large-scale self-supervised pre-training for full stack speech processing\u201d, 2022.<br>\n        [8] N. Takahashi et al., \u201cDensely connected multi-dilated convolutional networks for dense prediction tasks\u201d, 2021.<br>\n        [9] J. S. 
Kim et al., \u201cUniversal pooling method of multi-layer features from pretrained models for speaker verification\u201d, 2024.<br>\n    <\/figcaption>\n<\/figure>\n\n\n<p><strong>Proposed Method<\/strong><\/p>\n\n\n\n<p>In this paper, we propose a backend module that extracts speaker embeddings from the pre-trained model output in three steps: a layer\/frame-level processing network, layer attentive pooling, and attentive statistics pooling. We adopt D3Net to process the multiple hidden states of the pre-trained model, creating more speaker-representative features from the relationships between adjacent layers and frames. Two attentive pooling stages follow, aggregating over the layer axis and the frame axis, respectively. The layer pooling comprises Squeeze-and-Excitation [10]-based significance scoring and max-pooling; for the frame axis, we follow the temporal aggregation strategy introduced in ECAPA-TDNN.<\/p>\n\n\n<figure class=\"wp-block-image aligncenter size-full\">\n    <img decoding=\"async\" src=\"https:\/\/aida.korea.ac.kr\/wp-content\/uploads\/2025\/02\/UniPool2.png\" alt=\"\" class=\"wp-image-1727\"\/>\n    <figcaption class=\"wp-element-caption\">\n        [9] J. S. Kim et al., \u201cUniversal pooling method of multi-layer features from pretrained models for speaker verification\u201d, 2024.<br>\n        [10] J. Hu et al., \u201cSqueeze-and-Excitation networks\u201d, 2018.<br>\n    <\/figcaption>\n<\/figure>\n\n\n<p> We conducted experiments on various data environments, leveraging popular pre-trained Transformer networks. 
We compare the proposed method with two approaches that fine-tune pre-trained models for speaker verification, whereas our method freezes the pre-trained weights and uses the pre-trained model as a feature extractor.<br>\nThe results show that the proposed backend outperforms the others despite training fewer parameters, demonstrating that the benefits of pre-training can be exploited further when data resources are limited and the speech comes from a natural environment such as VoxCeleb1. <br>\nThe fine-tuning methods, in contrast, were highly sensitive and difficult to apply to pre-trained models other than wav2vec 2.0. \n<\/p>\n\n\n<figure class=\"wp-block-image aligncenter size-full\">\n    <img decoding=\"async\" src=\"https:\/\/aida.korea.ac.kr\/wp-content\/uploads\/2025\/02\/UniPool3.png\" alt=\"\" class=\"wp-image-1727\"\/>\n    <figcaption class=\"wp-element-caption\">[9] J. S. Kim et al., \u201cUniversal pooling method of multi-layer features from pretrained models for speaker verification\u201d, 2024.<\/figcaption>\n<\/figure>","protected":false},"excerpt":{"rendered":"<p>Deep Learning \u2013 Speaker Verification Universal Pooling Method of Multi-layer Features from Pretrained Models for Speaker Verification Objective Recent advancements in automatic speaker verification (ASV) studies have been achieved by leveraging large-scale pre-trained networks. 
In this study, we analyze the approaches toward such a paradigm and underline the significance of interlayer information processing as a &hellip; <\/p>\n<p class=\"link-more\"><a href=\"https:\/\/aida.korea.ac.kr\/?page_id=2331\" class=\"more-link\">Read more<span class=\"screen-reader-text\"> &#8220;&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"parent":0,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"footnotes":""},"class_list":["post-2331","page","type-page","status-publish","hentry"],"_links":{"self":[{"href":"https:\/\/aida.korea.ac.kr\/index.php?rest_route=\/wp\/v2\/pages\/2331","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/aida.korea.ac.kr\/index.php?rest_route=\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/aida.korea.ac.kr\/index.php?rest_route=\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/aida.korea.ac.kr\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/aida.korea.ac.kr\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=2331"}],"version-history":[{"count":2,"href":"https:\/\/aida.korea.ac.kr\/index.php?rest_route=\/wp\/v2\/pages\/2331\/revisions"}],"predecessor-version":[{"id":2358,"href":"https:\/\/aida.korea.ac.kr\/index.php?rest_route=\/wp\/v2\/pages\/2331\/revisions\/2358"}],"wp:attachment":[{"href":"https:\/\/aida.korea.ac.kr\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=2331"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}