What is probabilistic latent semantic analysis? A plain explanation

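Probabilistic latent semantic analysis (PLSA), also known as probabilistic latent semantic indexing (PLSI), particularly in information retrieval, is a statistical technique for analyzing two-mode and co-occurrence data. Like latent semantic analysis (LSA), it derives a low-dimensional representation of the observed variables from their association with a small number of hidden variables.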

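Whereas conventional latent semantic analysis is rooted in linear algebra and reduces the dimensionality of the occurrence table, typically by singular value decomposition, PLSA uses a mixture decomposition based on a latent class model.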

Model

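Taking the co-occurrences $(w,d)$ of words and documents as the observations, PLSA models the probability of each co-occurrence as a mixture of conditionally independent multinomial distributions: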

$$P(w,d)=\sum_{c}P(c)\,P(d|c)\,P(w|c)=P(d)\sum_{c}P(c|d)\,P(w|c)$$

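Here $c$ denotes the latent "topic" to which a word occurrence belongs. The number of topics is a hyperparameter that must be fixed in advance; it is not estimated from the data.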

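The first expression is the symmetric formulation, in which the word $w$ and the document $d$ are both generated from the topic $c$, through $P(w|c)$ and $P(d|c)$. The second is the asymmetric formulation, in which a topic $c$ is first chosen for the document $d$ according to $P(c|d)$ and a word $w$ is then generated from that topic according to $P(w|c)$.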
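The two formulations describe the same joint distribution, because Bayes' rule gives $P(c)P(d|c)=P(d)P(c|d)$. The following is a minimal numerical sketch of that identity; the toy probabilities and all variable and function names are made up for illustration and do not come from any source.

```python
# Hypothetical toy parameters: 2 topics, 2 documents, 3 words.
p_c = [0.6, 0.4]                              # P(c)
p_d_c = [[0.7, 0.3], [0.2, 0.8]]              # p_d_c[c][d] = P(d | c)
p_w_c = [[0.5, 0.4, 0.1], [0.1, 0.2, 0.7]]    # p_w_c[c][w] = P(w | c)

n_topics, n_docs, n_words = len(p_c), len(p_d_c[0]), len(p_w_c[0])


def joint_symmetric(w, d):
    """P(w, d) = sum_c P(c) P(d | c) P(w | c)."""
    return sum(p_c[c] * p_d_c[c][d] * p_w_c[c][w] for c in range(n_topics))


# Convert to the asymmetric parameters with Bayes' rule.
p_d = [sum(p_c[c] * p_d_c[c][d] for c in range(n_topics)) for d in range(n_docs)]
p_c_d = [[p_c[c] * p_d_c[c][d] / p_d[d] for c in range(n_topics)]
         for d in range(n_docs)]              # p_c_d[d][c] = P(c | d)


def joint_asymmetric(w, d):
    """P(w, d) = P(d) sum_c P(c | d) P(w | c)."""
    return p_d[d] * sum(p_c_d[d][c] * p_w_c[c][w] for c in range(n_topics))


# Largest disagreement between the two formulations over all (w, d) pairs.
max_gap = max(abs(joint_symmetric(w, d) - joint_asymmetric(w, d))
              for w in range(n_words) for d in range(n_docs))
print(max_gap)
```

The printed gap should be zero up to floating-point rounding, whatever valid probabilities are substituted for the toy values.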

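The number of parameters in this model is $cd+wc$, where $c$, $d$, and $w$ here stand for the numbers of topics, documents, and distinct words, so it grows linearly with the number of documents. Because the document-specific parameters exist only for documents in the training corpus, PLSA is a generative model of those documents but not of new documents.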
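To make the growth with corpus size concrete, the count can be evaluated for a couple of corpus sizes. This is only an illustrative sketch; the function name and the example sizes are hypothetical.

```python
def plsa_parameter_count(n_topics, n_docs, n_words):
    """Parameter count c*d + w*c quoted in the text.

    n_topics * n_docs entries describe P(c | d), and
    n_words * n_topics entries describe P(w | c).
    """
    return n_topics * n_docs + n_words * n_topics


# Hypothetical corpus: 100 topics, a 50,000-word vocabulary.
small = plsa_parameter_count(n_topics=100, n_docs=10_000, n_words=50_000)
large = plsa_parameter_count(n_topics=100, n_docs=20_000, n_words=50_000)
print(small, large)
```

With these made-up sizes, doubling the number of documents raises the count from 6,000,000 to 7,000,000: the $wc$ term stays fixed while the $cd$ term doubles.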

The model parameters are learned with the EM algorithm.
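As a rough illustration of how EM applies here, the sketch below fits the asymmetric formulation to a small count matrix in pure Python. It is one straightforward reading of the standard updates, not the reference implementation: the E-step computes $P(c|d,w)\propto P(c|d)P(w|c)$, and the M-step renormalizes the expected counts $n(d,w)P(c|d,w)$. The function names, the random initialization, the iteration count, and the toy data are all assumptions made for this example.

```python
import math
import random


def normalize(row):
    """Scale non-negative numbers to sum to 1 (uniform if they are all zero)."""
    total = sum(row)
    if total == 0:
        return [1.0 / len(row)] * len(row)
    return [x / total for x in row]


def plsa_em(counts, n_topics, n_iter=50, seed=0):
    """Fit asymmetric PLSA to counts[d][w] = n(d, w) with plain EM.

    Returns (p_c_d, p_w_c) with p_c_d[d][c] = P(c | d), p_w_c[c][w] = P(w | c).
    """
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])
    # Random strictly positive initialization; every row is a distribution.
    p_c_d = [normalize([rng.random() + 0.1 for _ in range(n_topics)])
             for _ in range(n_docs)]
    p_w_c = [normalize([rng.random() + 0.1 for _ in range(n_words)])
             for _ in range(n_topics)]

    for _ in range(n_iter):
        new_c_d = [[0.0] * n_topics for _ in range(n_docs)]
        new_w_c = [[0.0] * n_words for _ in range(n_topics)]
        for d in range(n_docs):
            for w in range(n_words):
                if counts[d][w] == 0:
                    continue
                # E-step: posterior P(c | d, w) ∝ P(c | d) P(w | c).
                post = normalize([p_c_d[d][c] * p_w_c[c][w]
                                  for c in range(n_topics)])
                # Accumulate expected counts n(d, w) P(c | d, w).
                for c in range(n_topics):
                    weight = counts[d][w] * post[c]
                    new_c_d[d][c] += weight
                    new_w_c[c][w] += weight
        # M-step: renormalize the expected counts into distributions.
        p_c_d = [normalize(row) for row in new_c_d]
        p_w_c = [normalize(row) for row in new_w_c]
    return p_c_d, p_w_c


def log_likelihood(counts, p_c_d, p_w_c):
    """Sum of n(d, w) * log P(w | d); the constant P(d) term is omitted."""
    total = 0.0
    for d, row in enumerate(counts):
        for w, n in enumerate(row):
            if n > 0:
                p = sum(p_c_d[d][c] * p_w_c[c][w] for c in range(len(p_w_c)))
                total += n * math.log(p)
    return total


# Hypothetical toy corpus: documents 0-1 use words 0-1, documents 2-3 use words 2-3.
counts = [[4, 3, 0, 0], [5, 2, 0, 0], [0, 0, 3, 4], [0, 0, 2, 5]]
p_c_d, p_w_c = plsa_em(counts, n_topics=2)
print(log_likelihood(counts, p_c_d, p_w_c))
```

EM only guarantees that the likelihood does not decrease from one iteration to the next, so the result depends on the initialization; in practice one would typically try several seeds and monitor the log-likelihood for convergence rather than run a fixed number of iterations.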

Applications

PLSA can also be used, via Fisher kernels, to obtain discriminative document representations^[1].

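PLSA has been applied in a wide range of fields, including information retrieval, information filtering, natural language processing, machine learning, and bioinformatics^[2].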

However, the aspect model used in PLSA has been reported to suffer from overfitting^[3].

Extensions

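Hierarchical extensions:
- Asymmetric: MASHA (Multinomial ASymmetric Hierarchical Analysis)^[4]
- Symmetric: HPLSA (Hierarchical Probabilistic Latent Semantic Analysis)^[5]
Generative-model extensions:
- Latent Dirichlet allocation (LDA): places a Dirichlet prior on the per-document topic distribution, which overcomes PLSA's inability to generate new documents.
Higher-order data:
- PLSA can be extended to co-occurrences of three or more variables. Introducing additional conditional distributions yields a probabilistic model that corresponds to non-negative tensor factorization.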
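As a sketch of what such a higher-order extension could look like, the symmetric formulation can be given one more conditional distribution for a third observed variable. The symbol $z$ for the third mode is a hypothetical label introduced here, not notation from the source:

$$P(w,d,z)=\sum_{c}P(c)\,P(w|c)\,P(d|c)\,P(z|c)$$

Each topic $c$ then contributes a rank-one, non-negative term to the three-way table of co-occurrence probabilities, which is the sense in which the model parallels a non-negative tensor factorization.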

History

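PLSA is an example of a latent class model, and its theoretical relationship to non-negative matrix factorization has also been reported^[6]^[7]. The term "PLSA" was introduced by Thomas Hofmann in 1999^[8].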

Footnotes

[1] Thomas Hofmann, "Learning the Similarity of Documents: An Information-Geometric Approach to Document Retrieval and Categorization", Advances in Neural Information Processing Systems 12, pp. 914–920, MIT Press, 2000.
[2] Pinoli, Pietro; et al. (2013). "Enhanced probabilistic latent semantic analysis with weighting schemes to predict genomic annotations". Proceedings of IEEE BIBE 2013: The 13th IEEE International Conference on BioInformatics and BioEngineering. IEEE. pp. 1–4. doi:10.1109/BIBE.2013.6701702. ISBN 978-147993163-7.
[3] Blei, David M.; Andrew Y. Ng; Michael I. Jordan (2003). "Latent Dirichlet Allocation". Journal of Machine Learning Research 3: 993–1022. doi:10.1162/jmlr.2003.3.4-5.993.
[4] Alexei Vinokourov and Mark Girolami, "A Probabilistic Framework for the Hierarchic Organisation and Classification of Document Collections", in Information Processing and Management, 2002.
[5] Eric Gaussier, Cyril Goutte, Kris Popat and Francine Chen, "A Hierarchical Model for Clustering and Categorising Documents", in "Advances in Information Retrieval -- Proceedings of the 24th BCS-IRSG European Colloquium on IR Research (ECIR-02)", 2002.
[6] Chris Ding, Tao Li, Wei Peng (2006). "Nonnegative Matrix Factorization and Probabilistic Latent Semantic Indexing: Equivalence, Chi-Square Statistic, and a Hybrid Method". AAAI 2006.
[7] Chris Ding, Tao Li, Wei Peng (2008). "On the equivalence between Non-negative Matrix Factorization and Probabilistic Latent Semantic Indexing".
[8] Thomas Hofmann, "Probabilistic Latent Semantic Indexing", Proceedings of the Twenty-Second Annual International SIGIR Conference on Research and Development in Information Retrieval (SIGIR-99), 1999.
