What is a ConvNet? A plain-language explanation

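A convolutional neural network (abbreviated CNN or ConvNet) is a general term for neural networks that use convolution. CNNs are used in image recognition, video recognition, spoken language translation^[1], recommender systems^[2], natural language processing^[3], computer shogi^[4], computer Go^[4], and other areas.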

Mathematical notation

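The definition of a convolutional neural network is not strictly fixed, but for multi-class classification of two-dimensional images with (height, width, color) axes, the basic form is written as the following pseudocode^[5]. Many variations are derived from it. In the basic form, the loss function is cross-entropy and the parameters are trained by stochastic gradient descent. For the partial derivatives this requires, see automatic differentiation.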

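Repeat the following:
    Convolutional layer and activation function
    Max pooling
Flatten into a vector
Repeat the following:
    Fully connected layer and activation function
Softmax function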
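
The pseudocode above can be sketched as a forward pass in plain Python. This is a minimal illustration under stated assumptions, not the cited implementation: the function names, the nested-list (height, width, channel) layout, the ReLU activation, the 2×2 pooling window, the layer sizes, and the random weights are all hypothetical choices made for this example. The final fully connected layer is left without an activation so that the softmax receives unrestricted scores; that is a common convention and an assumption here, not something the pseudocode specifies.

```python
import math
import random

def conv2d(x, w, b):
    """Stride-1 convolution without padding (computed as cross-correlation).
    x: [H][W][C_in], w: [kH][kW][C_in][C_out], b: [C_out]."""
    kh, kw, c_in, c_out = len(w), len(w[0]), len(w[0][0]), len(b)
    h_out, w_out = len(x) - kh + 1, len(x[0]) - kw + 1
    out = [[[b[o] for o in range(c_out)] for _ in range(w_out)] for _ in range(h_out)]
    for i in range(h_out):
        for j in range(w_out):
            for o in range(c_out):
                out[i][j][o] += sum(
                    x[i + di][j + dj][c] * w[di][dj][c][o]
                    for di in range(kh) for dj in range(kw) for c in range(c_in))
    return out

def relu_map(x):
    """Element-wise ReLU on a [H][W][C] feature map."""
    return [[[max(0.0, v) for v in px] for px in row] for row in x]

def max_pool2d(x, size=2):
    """Non-overlapping max pooling over size x size windows, per channel."""
    h_out, w_out, c = len(x) // size, len(x[0]) // size, len(x[0][0])
    return [[[max(x[i * size + di][j * size + dj][ch]
                  for di in range(size) for dj in range(size))
              for ch in range(c)] for j in range(w_out)] for i in range(h_out)]

def flatten(x):
    """Flatten a [H][W][C] feature map into a vector."""
    return [v for row in x for px in row for v in px]

def dense(v, w, b):
    """Fully connected layer; w: [n_out][n_in], b: [n_out]."""
    return [sum(wi * vi for wi, vi in zip(row, v)) + bi for row, bi in zip(w, b)]

def softmax(z):
    """Numerically stable softmax (subtracts the maximum before exponentiating)."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def forward(x, conv_params, dense_params):
    # Repeat: convolutional layer + activation, then max pooling.
    for w, b in conv_params:
        x = max_pool2d(relu_map(conv2d(x, w, b)))
    v = flatten(x)
    # Repeat: fully connected layer + activation (assumption: none on the last layer).
    for k, (w, b) in enumerate(dense_params):
        v = dense(v, w, b)
        if k < len(dense_params) - 1:
            v = [max(0.0, t) for t in v]
    return softmax(v)

def rand_tensor(rng, *shape):
    """Nested lists of the given shape filled with small random values."""
    if len(shape) == 1:
        return [rng.uniform(-0.5, 0.5) for _ in range(shape[0])]
    return [rand_tensor(rng, *shape[1:]) for _ in range(shape[0])]

# Hypothetical sizes: 8x8 grayscale image, one 3x3 conv (1 -> 2 channels), 3 classes.
rng = random.Random(0)
image = rand_tensor(rng, 8, 8, 1)
conv_params = [(rand_tensor(rng, 3, 3, 1, 2), rand_tensor(rng, 2))]
# After conv (6x6x2) and 2x2 pooling (3x3x2), the flattened vector has 18 entries.
dense_params = [(rand_tensor(rng, 4, 18), rand_tensor(rng, 4)),
                (rand_tensor(rng, 3, 4), rand_tensor(rng, 3))]
probs = forward(image, conv_params, dense_params)
print(probs)  # three class probabilities that sum to 1
```

Training would add the cross-entropy loss and gradient updates on top of this forward pass; in practice a framework with automatic differentiation handles that part.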

Convolutional layer

In the basic form, the input is $\mathbf {X} (H_{\text{in}},W_{\text{in}},C_{\text{in}})$

Recursive computation of the CNN receptive field

$RF_{l-1}=\left(RF_{l}-1\right)s_{l}+k_{l}$

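Here $k_{l}$ and $s_{l}$ denote the kernel size and stride of layer $l$. Starting from $RF_{N}=1$ as the initial condition and applying this recurrence down to the input-layer receptive field $RF_{0}$ therefore gives the receptive field.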
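
The recurrence can be checked with a few lines of Python. This is a small sketch: the function name `receptive_field` and the three-layer example stack are hypothetical, and each layer is assumed to be described only by its kernel size $k_{l}$ and stride $s_{l}$ (no dilation).

```python
def receptive_field(layers):
    """Receptive field RF_0 on the input for a stack of layers.

    layers: list of (kernel_size, stride) pairs ordered from the first
    layer (l = 1) to the last layer (l = N).
    """
    rf = 1  # initial condition RF_N = 1
    # Walk from the last layer back to the input: RF_{l-1} = (RF_l - 1) * s_l + k_l
    for k, s in reversed(layers):
        rf = (rf - 1) * s + k
    return rf

# Hypothetical stack: 3x3 conv (stride 1), 2x2 pooling (stride 2), 3x3 conv (stride 1).
layers = [(3, 1), (2, 2), (3, 1)]
rf0 = receptive_field(layers)
print(rf0)  # 8
```

Stepping through by hand: $RF_{3}=1$, $RF_{2}=(1-1)\cdot 1+3=3$, $RF_{1}=(3-1)\cdot 2+2=6$, $RF_{0}=(6-1)\cdot 1+3=8$, so one output unit of this stack sees an 8×8 region of the input.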

History

Using convolution as an image-processing filter dates back to the earliest days of computer image processing. Edge detection and Gaussian blur are among the many examples.

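Convolutional neural networks originate in the neocognitron^[32]^[33]^[34], which Kunihiko Fukushima proposed with inspiration from the animal visual cortex^[31]. The neocognitron used convolution in a neural network.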

Well-known models include the following.

1979 - Neocognitron
1989 - LeNet
2012 - AlexNet. The University of Toronto team SuperVision won the ImageNet Large Scale Visual Recognition Challenge 2012 (ILSVRC 2012)^[35], which set off the deep learning boom.
2014 - VGG. The University of Oxford team VGG won the localization task of ILSVRC 2014 and placed second in classification.^[36]
2015 - ResNet. The Microsoft team MSRA won ILSVRC 2015.^[37]

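Before the deep learning boom that began in 2012, mainstream image recognition converted images (pixel data) into features through carefully designed preprocessing (such as SIFT from 1999 and SURF from 2006) and then trained on those features. For example, ISI, the runner-up to AlexNet at ILSVRC 2012, used SIFT among other features^[35]. Convolutional neural networks can take pixels directly as input, and were therefore regarded as not depending on expert knowledge for feature design^[38]. Today, neural networks other than convolutional ones (for example Vision Transformer (ViT) and the MLP-based gMLP) also perform image processing from pixel input^[39]^[40]. Convolution itself therefore cannot be called the key technique that makes feature design unnecessary^[citation needed].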

Footnotes

Notes

Sources

^ “K-Pop Hit Song Recorded in 6 Languages Using Deep Learning” (in English). K-Pop Hit Song Recorded in 6 Languages Using Deep Learning (2023-08-02). Retrieved 2024-02-20.
^ van den Oord, Aaron; Dieleman, Sander; Schrauwen, Benjamin (2013-01-01). Burges, C. J. C.. ed. Deep content-based music recommendation. Curran Associates, Inc.. pp. 2643–2651
^ Collobert, Ronan; Weston, Jason (2008-01-01). “A Unified Architecture for Natural Language Processing: Deep Neural Networks with Multitask Learning”. Proceedings of the 25th International Conference on Machine Learning. ICML '08 (New York, NY, USA: ACM): 160–167. doi:10.1145/1390156.1390177. ISBN 978-1-60558-205-4.
^ ^a ^b “誰もdlshogiには敵わなくなって将棋AIの世界が終わってしまった件 | やねうら王公式サイト” (in Japanese). yaneuraou.yaneu.com. Retrieved 2024-07-15.
^ “examples/mnist/main.py at main · pytorch/examples”. Retrieved 2024-07-15.
^ “Conv2d — PyTorch 2.3 documentation”. pytorch.org. Retrieved 2024-07-15.
^ "convolved with its own set of filters" PyTorch 1.10 Conv1D
^ “MaxPool2d — PyTorch 2.3 documentation”. pytorch.org. Retrieved 2024-07-15.
^ ^a ^b ^c ^d Fukushima, Kunihiko (2019). “ネオコグニトロンと畳み込みニューラルネットワーク” (in Japanese). 医用画像情報学会雑誌 (Japan Science and Technology Agency) 36 (2): 17-24. doi:10.11318/mii.36.17.
^ Zhang, Wei (1988). “Shift-invariant pattern recognition neural network and its optical architecture”. Proceedings of annual conference of the Japan Society of Applied Physics.
^ Zhang, Wei (1990). “Parallel distributed processing model with local space-invariant interconnections and its optical architecture”. Applied Optics 29 (32).
^ Fukushima, Kunihiko (1980-04-01). “Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position”. Biological Cybernetics 36 (4): 193–202. doi:10.1007/BF00344251. ISSN 1432-0770. https://doi.org/10.1007/BF00344251.
^ Azulay, Aharon; Weiss, Yair (2019). “Why do deep convolutional networks generalize so poorly to small image transformations?”. arXiv. https://arxiv.org/abs/1805.12177.
^ Chaman, Anadi; Dokmanić, Ivan (2021). “Truly shift-invariant convolutional neural networks”. arXiv. https://arxiv.org/abs/2011.14214.
^ "a 1×1 convolution called a pointwise convolution." Andrew (2017) MobileNets Arxiv
^ "Depthwise convolution with one filter per input channel (input depth)" Andrew G. Howard. (2017). MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. Arxiv
^ "In a group conv layer ..., input and output channels are divided into C groups, and convolutions are separately performed within each group." Saining (2017). Aggregated Residual Transformations for Deep Neural Networks. Arxiv
^ "groups controls the connections between inputs and outputs. ... At groups=1, all inputs are convolved to all outputs ... At groups= in_channels, each input channel is convolved with its own set of filters" PyTorch nn.Conv2d
^ "depthwise separable convolutions which is a form of factorized convolutions which factorize a standard convolution into a depthwise convolution and a 1×1 convolution called a pointwise convolution." Howard, et al. (2017). MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications.
^ “14.10. Transposed Convolution — Dive into Deep Learning 1.0.3 documentation”. d2l.ai. Retrieved 2024-07-23.
^ “ConvTranspose2d — PyTorch 2.3 documentation”. pytorch.org. Retrieved 2024-07-23.
^ “AvgPool2d — PyTorch 2.4 documentation”. pytorch.org. Retrieved 2024-07-30.
^ “Global Average Pooling Explained”. Retrieved 2024-07-30.
^ “Spatial Pyramid Pooling Explained”. Retrieved 2024-07-30.
^ “空間ピラミッドプーリング層 (SPP-net, Spatial Pyramid Pooling) とその応用例や発展型 | CVMLエキスパートガイド” (in Japanese). CVMLエキスパートガイド. Retrieved 2024-07-30. “空間ピラミッドプーリング” (spatial pyramid pooling)
^ "we propose a recurrent CNN (RCNN) for object recognition by incorporating recurrent connections into each convolutional layer" p.3367 and "This work shows that it is possible to boost the performance of CNN by incorporating more facts of the brain. " p.3374 of Liang, et al. (2015). Recurrent Convolutional Neural Network for Object Recognition.
^ "we propose a recurrent CNN (RCNN) for object recognition by incorporating recurrent connections into each convolutional layer" Liang, et al. (2015). Recurrent Convolutional Neural Network for Object Recognition.
^ "The finite number of past data points R used for prediction represents the size of the receptive field." (translated from Japanese) Matsumoto. (2019). WaveNetによる言語情報を含まない感情音声合成方式の検討. 情報処理学会研究報告.
^ "Effective Receptive Field (ERF): is the area of the original image that can possibly influence the activation of a neuron. ... ERF and RF are sometimes used interchangeably" Le. (2017). What are the Receptive, Effective Receptive, and Projective Fields of Neurons in Convolutional Neural Networks?. Arxiv.
^ "layer k ... R_k be the ERF ... f_k represent the filter size ... the final top-down equation: $R_{k,j}=\left(R_{k,j+1}-1\right)s_{j+1}+f_{j+1}$ "
^ Matsugu, Masakazu; Katsuhiko Mori; Yusuke Mitari; Yuji Kaneda (2003). “Subject independent facial expression recognition with robust face detection using a convolutional neural network”. Neural Networks 16 (5): 555–559. doi:10.1016/S0893-6080(03)00115-1. Retrieved 2013-11-17.
^ Fukushima, K. (2007). “Neocognitron”. Scholarpedia 2 (1): 1717. doi:10.4249/scholarpedia.1717.
^ Fukushima, Kunihiko (1980). “Neocognitron: A Self-organizing Neural Network Model for a Mechanism of Pattern Recognition Unaffected by Shift in Position”. Biological Cybernetics 36 (4): 193–202. doi:10.1007/BF00344251. PMID 7370364. Retrieved 2013-11-16.
^ LeCun, Yann. “LeNet-5, convolutional neural networks”. Retrieved 2013-11-16.
^ ^a ^b “ImageNet Large Scale Visual Recognition Competition 2012 (ILSVRC2012)”. image-net.org. Retrieved 2024-07-16.
^ “ILSVRC2014 Results”. image-net.org. Retrieved 2024-07-16.
^ “ILSVRC2015 Results”. image-net.org. Retrieved 2024-07-16.
^ Fujiyoshi 2019, pp. 293-294.
^ Dosovitskiy, et al. (2021). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. ICLR 2021.
^ Liu, et al. (2021). Pay Attention to MLPs. NeurIPS 2021.

References

Fujiyoshi, Hironobu (2019-04). “リレー解説　機械学習の可能性《第1回》機械学習の進展による画像認識技術の変遷” (in Japanese). 計測と制御 (計測自動制御学会) 58 (4): 291-297. doi:10.11499/sicejl.58.291. ISSN 1883-8170.
