Positive Semidefinite Tensor Factorization (PSDTF)


Abstract

We introduce a new class of tensor factorization, positive semidefinite tensor factorization (PSDTF), which decomposes a set of positive semidefinite (PSD) matrices (observations) into convex combinations of fewer PSD matrices (bases). PSDTF can be viewed as a mathematically natural extension of nonnegative matrix factorization (NMF), which decomposes a set of nonnegative vectors (observations) into convex combinations of fewer nonnegative vectors (bases).
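
To make the decomposition concrete, the following is a minimal NumPy sketch, assuming the model X_n ≈ sum_k h_kn V_k with nonnegative activations h_kn normalized so that each combination is convex. The names `random_psd`, `V`, `H`, `W`, and `X` are illustrative and are not taken from the released MATLAB code. The sketch builds observations from K PSD bases, checks that they remain PSD, and shows that when every basis is diagonal the diagonals follow the ordinary NMF model.

```python
import numpy as np

rng = np.random.default_rng(0)
M, K, N = 4, 2, 5  # matrix size, number of bases, number of observations (toy values)

def random_psd(m, rng):
    """Return a random m x m PSD matrix of the form A A^T (hypothetical helper)."""
    a = rng.standard_normal((m, m))
    return a @ a.T

# K PSD basis matrices, shape (K, M, M)
V = np.stack([random_psd(M, rng) for _ in range(K)])

# Nonnegative activations, shape (K, N); each column is normalized to sum to one
H = rng.random((K, N))
H /= H.sum(axis=0, keepdims=True)

# Observations X_n = sum_k H[k, n] * V[k], shape (N, M, M)
X = np.einsum("kn,kij->nij", H, V)

# A nonnegative combination of PSD matrices is PSD: smallest eigenvalue is >= 0
min_eig = min(np.linalg.eigvalsh(X[n]).min() for n in range(N))

# Diagonal special case: bases diag(W[:, k]) reproduce the NMF model W @ H
W = rng.random((M, K))
V_diag = np.stack([np.diag(W[:, k]) for k in range(K)])
X_diag = np.einsum("kn,kij->nij", H, V_diag)
diagonals = np.stack([np.diag(X_diag[n]) for n in range(N)]).T  # shape (M, N)
diag_matches_nmf = np.allclose(diagonals, W @ H)
```

In this toy setting `diag_matches_nmf` is true by construction, which is the sense in which NMF is the diagonal special case of PSDTF.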


References

MATLAB source code is available (2-clause BSD license; possibly compatible with Octave). [Code]

  1. Kazuyoshi Yoshii, Ryota Tomioka, Daichi Mochihashi, and Masataka Goto. Beyond NMF: Time-Domain Audio Source Separation without Phase Reconstruction. The 14th International Society for Music Information Retrieval Conference (ISMIR), pp. 369–374, November 2013. [PDF] [Code]
  2. Kazuyoshi Yoshii, Ryota Tomioka, Daichi Mochihashi, and Masataka Goto. Infinite Positive Semidefinite Tensor Factorization for Source Separation of Mixture Signals. The 30th International Conference on Machine Learning (ICML), pp. 576–584, June 2013. [PDF] [Supplementary] [Spotlight] [Poster] [Video]

Demonstration

We used three mixture audio signals, each synthesized from piano sounds (011PFNOM), electric guitar sounds (131EGLPM), or clarinet sounds (311CLNOM) recorded in the RWC Music Database: Musical Instrument Sound. Each mixture signal was made by concatenating seven 2.0-s isolated or mixture sounds (C4, E4, G4, C4+E4, C4+G4, E4+G4, and C4+E4+G4). The resulting 14.0-s signals were sampled at 16 kHz. The task was to separate each mixture signal into three source signals corresponding to C4, E4, and G4, respectively. Each signal was analyzed using a Gaussian window with a width of 512 samples (M=512) and a shifting interval of 160 samples, which gives N=1400 frames. The PSD matrices and their activations were estimated using the multiplicative update (MU) algorithm with K=3. For comparison, we used KL-NMF for amplitude-spectrogram decomposition and IS-NMF for power-spectrogram decomposition. We evaluated the quality of the separated signals in terms of source-to-distortion ratio (SDR), source-to-interferences ratio (SIR), and sources-to-artifacts ratio (SAR) using the BSS Eval toolbox.
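
The framing arithmetic above can be sketched as follows. This is a hedged illustration, not the released code: it assumes that each observed PSD matrix is the rank-1 outer product of a Gaussian-windowed frame with itself, that frames start at multiples of the shifting interval, and that the signal is zero-padded at the end. The window spread `sigma` and all function names are hypothetical, since the page does not state them. A small toy signal is used because 1400 matrices of size 512 x 512 would need several gigabytes.

```python
import numpy as np

def num_frames(duration_s, sample_rate, hop):
    """Number of frame positions spaced `hop` samples apart."""
    total_samples = int(round(duration_s * sample_rate))
    return total_samples // hop

def gaussian_window(width, sigma):
    """Gaussian window centered on the frame; sigma (in samples) is an assumed parameter."""
    n = np.arange(width)
    return np.exp(-0.5 * ((n - (width - 1) / 2.0) / sigma) ** 2)

def frame_covariances(x, width, hop, sigma):
    """Return windowed frames (N, M) and their rank-1 PSD matrices x_n x_n^T (N, M, M)."""
    window = gaussian_window(width, sigma)
    n_frames = len(x) // hop
    padded = np.concatenate([x, np.zeros(width)])  # zero-pad so the last frames are full length
    frames = np.stack(
        [window * padded[i * hop : i * hop + width] for i in range(n_frames)]
    )
    covs = np.einsum("ni,nj->nij", frames, frames)
    return frames, covs

# Frame counts for the settings quoted on this page (16 kHz, 160-sample shift)
n_demo = num_frames(14.0, 16000, 160)  # 224000 samples / 160 = 1400
n_midi = num_frames(8.4, 16000, 160)   # 134400 samples / 160 = 840

# Toy run with small sizes (width=8 and hop=4 stand in for M=512 and 160)
rng = np.random.default_rng(0)
x_toy = rng.standard_normal(40)
frames, covs = frame_covariances(x_toy, width=8, hop=4, sigma=2.0)
```

Each matrix in `covs` is symmetric, PSD, and of rank one, which is why the factorization has to explain many rank-deficient observations with a few full PSD bases.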

The experimental results showed that LD-PSDTF (PSDTF based on the log-determinant divergence) clearly outperformed both NMF variants in source separation. The average SDR, SIR, and SAR were 17.7 dB, 22.2 dB, and 19.7 dB for KL-NMF, 19.1 dB, 24.0 dB, and 21.0 dB for IS-NMF, and 23.0 dB, 27.7 dB, and 25.1 dB for LD-PSDTF. We found it effective to initialize LD-PSDTF with the basis vectors and activations obtained by IS-NMF, which reduces the computational cost and helps avoid poor local optima.
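
The link between LD-PSDTF and IS-NMF can be illustrated with a short sketch. The log-determinant (LD) divergence between positive definite matrices, D(X|Y) = tr(X Y^-1) - log det(X Y^-1) - M, reduces to the Itakura-Saito (IS) divergence when both matrices are diagonal. The function `circulant_basis` shows one plausible way to lift an IS-NMF basis vector (a power spectrum) to a time-domain PSD matrix for initialization; whether the released code does exactly this is an assumption, and all names below are hypothetical.

```python
import numpy as np

def ld_divergence(X, Y):
    """LD divergence D(X|Y) for symmetric positive definite X and Y."""
    M = X.shape[0]
    trace_term = np.trace(np.linalg.solve(Y, X))  # tr(Y^-1 X) equals tr(X Y^-1)
    logdet_x = np.linalg.slogdet(X)[1]
    logdet_y = np.linalg.slogdet(Y)[1]
    return trace_term - (logdet_x - logdet_y) - M

def is_divergence(x, y):
    """Itakura-Saito divergence between positive vectors, summed over elements."""
    r = x / y
    return np.sum(r - np.log(r) - 1.0)

def circulant_basis(power_spectrum):
    """PSD matrix F^H diag(p) F built from nonnegative DFT-bin powers p (assumed lifting)."""
    M = len(power_spectrum)
    F = np.fft.fft(np.eye(M)) / np.sqrt(M)  # unitary DFT matrix
    V = F.conj().T @ np.diag(power_spectrum) @ F
    return V.real  # imaginary part vanishes when p[m] == p[M - m]

rng = np.random.default_rng(0)

# Diagonal case: the LD divergence coincides with the IS divergence
x = rng.random(6) + 0.5
y = rng.random(6) + 0.5
ld_diag = ld_divergence(np.diag(x), np.diag(y))
is_val = is_divergence(x, y)

# Lift a symmetric toy power spectrum (8 bins) to a real circulant PSD matrix
half = rng.random(5) + 0.1                  # bins 0..4
p = np.concatenate([half, half[3:0:-1]])    # bins 5..7 mirror bins 3..1
V0 = circulant_basis(p)
eigs = np.linalg.eigvalsh(V0)
```

Here `ld_diag` and `is_val` agree up to floating-point error, and the eigenvalues of `V0` are the entries of `p`, so a nonnegative spectrum always yields a valid PSD starting point.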

Observed mixture signal (piano: 011PFNOM)

SDR / SIR / SAR (dB) of the separated source signals:

Method      Pitch C               Pitch E               Pitch G
Original    -                     -                     -
KL-NMF      17.4 / 20.6 / 19.8    16.6 / 21.1 / 18.5    19.9 / 23.6 / 22.3
IS-NMF      18.9 / 22.5 / 21.5    19.0 / 24.6 / 20.5    20.4 / 22.9 / 23.9
LD-PSDTF    21.5 / 25.2 / 24.0    23.8 / 28.7 / 25.5    21.5 / 23.0 / 27.1

Observed mixture signal (electric guitar: 131EGLPM)

SDR / SIR / SAR (dB) of the separated source signals:

Method      Pitch C               Pitch E               Pitch G
Original    -                     -                     -
KL-NMF      11.5 / 16.0 / 13.5    14.5 / 19.6 / 16.2    16.9 / 22.0 / 18.5
IS-NMF      11.4 / 17.2 / 12.8    15.4 / 21.0 / 16.8    16.2 / 20.8 / 18.0
LD-PSDTF    14.6 / 19.2 / 16.5    24.2 / 32.4 / 24.9    17.1 / 20.2 / 20.0

Observed mixture signal (clarinet: 311CLNOM)

SDR / SIR / SAR (dB) of the separated source signals:

Method      Pitch C               Pitch E               Pitch G
Original    -                     -                     -
KL-NMF      17.6 / 24.5 / 18.6    17.5 / 24.8 / 18.5    20.1 / 27.8 / 20.9
IS-NMF      20.8 / 25.7 / 23.1    23.0 / 32.6 / 23.6    27.2 / 31.6 / 29.1
LD-PSDTF    25.4 / 30.8 / 26.9    27.5 / 33.5 / 29.3    31.1 / 36.8 / 32.5


We also tested LD-PSDTF on an audio signal synthesized from MIDI. The total length was 8.4 s (N=840). LD-PSDTF again outperformed both NMF variants by a wide margin. The average SDR, SIR, and SAR were 16.7 dB, 21.1 dB, and 18.7 dB for KL-NMF, 18.9 dB, 24.1 dB, and 20.5 dB for IS-NMF, and 26.7 dB, 33.2 dB, and 27.8 dB for LD-PSDTF. LD-PSDTF works very well for audio signals that satisfy the assumption that the basis signals are stationary.

Observed mixture signal (MIDI piano)

SDR / SIR / SAR (dB) of the separated source signals:

Method      Pitch C               Pitch E               Pitch G
Original    -                     -                     -
KL-NMF      17.4 / 21.9 / 19.4    15.5 / 21.0 / 18.5    16.2 / 20.6 / 18.2
IS-NMF      18.3 / 23.9 / 19.7    20.5 / 26.2 / 21.9    17.9 / 22.4 / 19.8
LD-PSDTF    25.5 / 33.7 / 26.2    30.2 / 36.4 / 31.4    24.2 / 29.4 / 25.8

Frequency-domain amplitude-spectrogram decomposition by KL-NMF


Frequency-domain power-spectrogram decomposition by IS-NMF


Time-domain signal-covariance decomposition by LD-PSDTF