Hirofumi Inaguma

News😀

  • 09/2021: Four papers got accepted to ASRU2021.
        ”Fast-MD: Fast Multi-Decoder End-to-End Speech Translation with Non-Autoregressive Hidden Intermediates” (1st-author)
        ”A Comparative Study on Non-Autoregressive Modelings for Speech-to-Text Generation” (co-author)
        ”A Study of Transducer Based End-to-End ASR with ESPnet: Architecture, Auxiliary Loss and Decoding Strategies” (co-author)
        ”ASR Rescoring and Confidence Estimation with ELECTRA” (co-author)

  • 09/2021: New preprint on non-autoregressive end-to-end speech translation is available.
        ”Non-autoregressive End-to-end Speech Translation with Parallel Autoregressive Rescoring” [arxiv] (1st-author)

  • 07/2021: One paper on IWSLT2021 offline speech translation systems is available.
        ”ESPnet-ST IWSLT 2021 Offline Speech Translation System” [arxiv] (1st-author)

  • 06/2021: Two papers got accepted to INTERSPEECH2021.
        ”StableEmit: Selection Probability Discount for Reducing Emission Latency of Streaming Monotonic Attention ASR” [arxiv] (1st-author)
        ”VAD-free Streaming Hybrid CTC/Attention ASR for Unsegmented Recording” [arxiv] (1st-author)

  • 03/2021: One paper got accepted to NAACL-HLT2021.
        ”Source and Target Bidirectional Knowledge Distillation for End-to-end Speech Translation” [arxiv] (1st-author)

  • 03/2021: New preprint on low-latency streaming ASR is available.
        ”Alignment Knowledge Distillation for Online Streaming Attention-based Speech Recognition” [arxiv] (1st-author)

  • 01/2021: Three papers got accepted to ICASSP2021.
        ”Orthros: Non-autoregressive End-to-end Speech Translation with Dual-decoder” [arxiv] (1st-author)
        ”Improved Mask-CTC for Non-Autoregressive End-to-End ASR” [arxiv] (co-author)
        ”Recent Developments on ESPnet Toolkit Boosted by Conformer” [arxiv] (co-author)

  • 12/2020: Received the 14th IEEE Signal Processing Society (SPS) Japan Student Conference Paper Award [link].

  • 12/2020: Slides from the NLP friends talk are available. [slide]

  • 07/2020: Four papers got accepted to INTERSPEECH2020.
        ”CTC-synchronous Training for Monotonic Attention Model” [arxiv] [slide] (1st-author)
        ”Enhancing Monotonic Multihead Attention for Streaming ASR” [arxiv] [slide] (1st-author)
        ”Distilling the Knowledge of BERT for Sequence-to-Sequence ASR” (co-author)
        ”End-to-end speech-to-dialog-act recognition” [arxiv] (co-author)

About Me

    I am a final-year Ph.D. student at the Graduate School of Informatics, Kyoto University, Kyoto, Japan. I will receive my Ph.D. degree in September 2021 (the thesis has already been defended). After graduation, I will join Facebook AI Research as a postdoctoral researcher.
    My CV is available here.

    Email (office): inaguma [at] sap.ist.i.kyoto-u.ac.jp
    Email (private): hiro.mhbc [at] gmail.com
    Address (office): Research Building No.7 Room 407, Yoshida-honmachi, Sakyo-ku, Kyoto-shi, Kyoto, 606-8501, Japan

    Google Scholar | GitHub | LinkedIn | Twitter

    Research interests🤔

    Automatic speech recognition (ASR)
    • End-to-end speech recognition
    • Online streaming ASR
    • Multilinguality
    • Language modeling
    Speech translation
    • End-to-end speech translation
    • Multilinguality
    • Non-autoregressive sequence model

    Major publications🧐

    Speech translation | Streaming ASR

    Education🎓

    Ph.D. in Computer Science, Kyoto University, Kyoto, Japan (April 2018 - September 2021)
    • Department of Intelligence Science and Technology, Graduate School of Informatics
    • Supervisor: Prof. Tatsuya Kawahara
    M.E. in Computer Science, Kyoto University, Kyoto, Japan (April 2016 - March 2018)
    • Department of Intelligence Science and Technology, Graduate School of Informatics
    • Thesis title: Joint Social Signal Detection and Automatic Speech Recognition based on End-to-End Modeling and Multi-task Learning
    • Supervisor: Prof. Tatsuya Kawahara
    B.E. in Computer Science, Kyoto University, Kyoto, Japan (April 2012 - March 2016)
    • Supervisor: Prof. Tatsuya Kawahara

    Work experiences💻

    Microsoft Research, Redmond, WA, USA, Research Internship (July 2019 - October 2019)
    • Mentors: Yifan Gong, Jinyu Li, Yashesh Gaur, and Liang Lu
    Johns Hopkins University, Baltimore, MD, USA, Research Internship (July 2018 - September 2018)
    • Worked on end-to-end speech recognition and translation
    • Participated in the JSALT workshop (topic: multilingual end-to-end speech recognition)
    • Participated in IWSLT2018 end-to-end speech translation evaluation campaign
    • Mentor: Prof. Shinji Watanabe
    IBM Research AI, Tokyo, Japan, Research Internship (September 2017 - November 2017)
    • Worked on end-to-end ASR systems
    • Mentors: Gakuto Kurata and Takashi Fukuda

    Awards & Honors 🏆

    Awards
    • 14th IEEE Signal Processing Society (SPS) Japan Student Conference Paper Award, from IEEE Signal Processing Society (SPS) Tokyo Joint Chapter, December 2020. [link]
      Paper title: "Minimum Latency Training Strategies for Streaming Sequence-to-Sequence ASR"
    • Yamashita SIG Research Award, from Information Processing Society of Japan (IPSJ), March 2019. [link]
      Paper title: "An End-to-End Approach to Joint Social Signal Detection and Automatic Speech Recognition"
    • Yahoo! JAPAN award (best student paper), from SIG-SLP, June 2018. [link]
    • Full exemption from Repayment of Scholarship Loan for Students with Outstanding Results, from Japan Student Services Organization (JASSO), May 2018.
    • Student award, from the Acoustical Society of Japan (ASJ), March 2018. [link]
    • Student award, from the 79th National Convention of the Information Processing Society of Japan (IPSJ), March 2017.
    Fellowships
    • Microsoft Research Asia Ph.D. Fellowship (top 12 Ph.D. students in Asia), from Microsoft Research Asia (MSRA), October 2019. [link]
    • Research Fellowship for Young Scientists (DC1), from Japan Society for the Promotion of Science (JSPS), April 2018 - March 2021.

    Talks 📢

    • NLP Colloquium (NLP コロキウム, talk in Japanese), Aug. 2021. [link]
    • NLP friends, Dec. 2020. [link]

    Preprints

    • Non-autoregressive End-to-end Speech Translation with Parallel Autoregressive Rescoring
      Hirofumi Inaguma, Yosuke Higuchi, Kevin Duh, Tatsuya Kawahara, and Shinji Watanabe
      [arxiv]
      Under review at IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), 2021

    • Alignment Knowledge Distillation for Online Streaming Attention-based Speech Recognition
      Hirofumi Inaguma and Tatsuya Kawahara
      [arxiv]
      Under review at IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), 2021

    International conferences (refereed, first author)

      --- 2021 ---

    • Fast-MD: Fast Multi-Decoder End-to-End Speech Translation with Non-Autoregressive Hidden Intermediates
      Hirofumi Inaguma, Siddharth Dalmia, Brian Yan, and Shinji Watanabe
      ASRU 2021

    • ESPnet-ST IWSLT 2021 Offline Speech Translation System
      Hirofumi Inaguma*, Brian Yan*, Siddharth Dalmia, Pengcheng Guo, Jiatong Shi, Kevin Duh, and Shinji Watanabe
      [arxiv]
      IWSLT 2021 (system)

    • StableEmit: Selection Probability Discount for Reducing Emission Latency of Streaming Monotonic Attention ASR
      Hirofumi Inaguma and Tatsuya Kawahara
      [arxiv] [ISCA online archive]
      INTERSPEECH 2021 (acceptance rate: 963/1990=48.4%)

    • VAD-free Streaming Hybrid CTC/Attention ASR for Unsegmented Recording
      Hirofumi Inaguma and Tatsuya Kawahara
      [arxiv] [ISCA online archive]
      INTERSPEECH 2021

    • Source and Target Bidirectional Knowledge Distillation for End-to-end Speech Translation
      Hirofumi Inaguma, Tatsuya Kawahara, and Shinji Watanabe
      [arxiv] [ACL Anthology]
      NAACL-HLT 2021 (acceptance rate: 23%)

    • Orthros: Non-autoregressive End-to-end Speech Translation with Dual-decoder
      Hirofumi Inaguma, Yosuke Higuchi, Kevin Duh, Tatsuya Kawahara, and Shinji Watanabe
      [arxiv] [IEEE Xplore]
      ICASSP 2021 (acceptance rate: 1734/3610=48.0%)
      --- 2020 ---

    • CTC-synchronous Training for Monotonic Attention Model
      Hirofumi Inaguma, Masato Mimura, and Tatsuya Kawahara
      [arxiv] [slide] [ISCA online archive]
      INTERSPEECH 2020 (acceptance rate: 47.0%)

    • Enhancing Monotonic Multihead Attention for Streaming ASR
      Hirofumi Inaguma, Masato Mimura, and Tatsuya Kawahara
      [arxiv] [demo] [slide] [ISCA online archive]
      INTERSPEECH 2020

    • ESPnet-ST: All-in-One Speech Translation Toolkit
      Hirofumi Inaguma, Shun Kiyono, Kevin Duh, Shigeki Karita, Nelson Yalta, Tomoki Hayashi, and Shinji Watanabe
      [arxiv] [slide] [ACL Anthology] [slator]
      ACL 2020 System Demonstrations

    • Minimum Latency Training Strategies for Streaming Sequence-to-Sequence ASR
      Hirofumi Inaguma, Yashesh Gaur, Liang Lu, Jinyu Li, and Yifan Gong
      [arxiv] [slide] [IEEE Xplore]
      ICASSP 2020 (acceptance rate: 47%, Oral)
      --- 2019 ---

    • Multilingual End-to-End Speech Translation
      Hirofumi Inaguma, Kevin Duh, Tatsuya Kawahara, and Shinji Watanabe
      [arxiv] [pdf] [poster] [IEEE Xplore]
      ASRU 2019 (acceptance rate: 144/299=48.1%)

    • Transfer Learning of Language-Independent End-to-End ASR with Language Model Fusion
      Hirofumi Inaguma, Jaejin Cho, Murali Karthick Baskar, Tatsuya Kawahara, and Shinji Watanabe
      [arxiv] [pdf] [poster] [IEEE Xplore]
      ICASSP 2019 (acceptance rate: 1774/3815=46.5%)
      --- 2018 ---

    • Improving OOV Detection and Resolution with External Language Models in Acoustic-to-Word ASR
      Hirofumi Inaguma, Masato Mimura, Shinsuke Sakai, and Tatsuya Kawahara
      [arxiv] [pdf] [poster] [IEEE Xplore]
      SLT 2018 (acceptance rate: 150/257=58.3%)

    • The JHU/KyotoU Speech Translation System for IWSLT 2018
      Hirofumi Inaguma, Xuan Zhang, Zhiqi Wang, Adithya Renduchintala, Shinji Watanabe, and Kevin Duh
      [pdf]
      IWSLT 2018 (system)

    • An End-to-End Approach to Joint Social Signal Detection and Automatic Speech Recognition
      Hirofumi Inaguma, Masato Mimura, Koji Inoue, Kazuyoshi Yoshii, and Tatsuya Kawahara
      [pdf] [poster] [IEEE Xplore]
      ICASSP 2018 (acceptance rate: 1406/2830=49.7%)
      --- 2017 ---

    • Social Signal Detection in Spontaneous Dialogue Using Bidirectional LSTM-CTC
      Hirofumi Inaguma, Koji Inoue, Masato Mimura, and Tatsuya Kawahara
      [pdf] [poster] [ISCA online archive]
      INTERSPEECH 2017 (acceptance rate: 799/1582=52.0%)

      International conferences (refereed, co-author)

      --- 2021 ---

    • A Comparative Study on Non-Autoregressive Modelings for Speech-to-Text Generation
      Yosuke Higuchi, Nanxin Chen, Yuya Fujita, Hirofumi Inaguma, Tatsuya Komatsu, Jaesong Lee, Jumon Nozaki, Tianzi Wang, and Shinji Watanabe
      ASRU 2021

    • A Study of Transducer Based End-to-End ASR with ESPnet: Architecture, Auxiliary Loss and Decoding Strategies
      Florian Boyer, Yusuke Shinohara, Takaaki Ishii, Hirofumi Inaguma, and Shinji Watanabe
      ASRU 2021

    • ASR Rescoring and Confidence Estimation with ELECTRA
      Hayato Futami, Hirofumi Inaguma, Masato Mimura, Shinsuke Sakai, and Tatsuya Kawahara
      ASRU 2021

    • The 2020 ESPnet update: new features, broadened applications, performance improvements, and future plans
      Shinji Watanabe, Florian Boyer, Xuankai Chang, Pengcheng Guo, Tomoki Hayashi, Yosuke Higuchi, Takaaki Hori, Wen-Chin Huang, Hirofumi Inaguma, Naoyuki Kamo, Shigeki Karita, Chenda Li, Jing Shi, Aswin Shanmugam Subramanian, and Wangyou Zhang
      [arxiv]
      IEEE Data Science and Learning Workshop (DSLW) 2021

    • Improved Mask-CTC for Non-Autoregressive End-to-End ASR
      Yosuke Higuchi, Hirofumi Inaguma, Shinji Watanabe, Tetsuji Ogawa, and Tetsunori Kobayashi
      [arxiv] [IEEE Xplore]
      ICASSP 2021

    • Recent Developments on ESPnet Toolkit Boosted by Conformer
      Pengcheng Guo, Florian Boyer, Xuankai Chang, Tomoki Hayashi, Yosuke Higuchi, Hirofumi Inaguma, Naoyuki Kamo, Chenda Li, Daniel Garcia-Romero, Jiatong Shi, Jing Shi, Shinji Watanabe, Kun Wei, Wangyou Zhang, and Yuekai Zhang
      [arxiv] [IEEE Xplore]
      ICASSP 2021
      --- 2020 ---

    • Distilling the Knowledge of BERT for Sequence-to-Sequence ASR
      Hayato Futami, Hirofumi Inaguma, Sei Ueno, Masato Mimura, Shinsuke Sakai, and Tatsuya Kawahara
      [arxiv] [ISCA online archive]
      INTERSPEECH 2020

    • End-to-end speech-to-dialog-act recognition
      Trung V. Dang, Tianyu Zhao, Sei Ueno, Hirofumi Inaguma, and Tatsuya Kawahara
      [arxiv] [ISCA online archive]
      INTERSPEECH 2020
      --- 2019 ---

    • A Comparative Study on Transformer vs RNN in Speech Applications
      Shigeki Karita, Nanxin Chen, Tomoki Hayashi, Takaaki Hori, Hirofumi Inaguma, Ziyan Jiang, Masao Someki, Nelson Enrique Yalta Soplin, Ryuichi Yamamoto, Xiaofei Wang, Shinji Watanabe, Takenori Yoshimura, and Wangyou Zhang
      [arxiv] [IEEE Xplore]
      ASRU 2019

    • Language Model Integration Based on Memory Control for Sequence to Sequence Speech Recognition
      Jaejin Cho, Shinji Watanabe, Takaaki Hori, Murali Karthick Baskar, Hirofumi Inaguma, Jesus Villalba, and Najim Dehak
      [arxiv] [IEEE Xplore]
      ICASSP 2019
      --- 2018 ---

    • Leveraging Sequence-to-Sequence Speech Synthesis for Enhancing Acoustic-to-Word Speech Recognition
      Masato Mimura, Sei Ueno, Hirofumi Inaguma, Shinsuke Sakai, and Tatsuya Kawahara
      [pdf] [IEEE Xplore]
      SLT 2018

    • Acoustic-to-Word Attention-Based Model Complemented with Character-level CTC-Based Model
      Sei Ueno, Hirofumi Inaguma, Masato Mimura, and Tatsuya Kawahara
      [pdf] [IEEE Xplore]
      ICASSP 2018