Abstract
Convolutional networks are at the core of state-of-the-art computer vision solutions for a wide variety of tasks. Since 2014, very deep convolutional networks have become mainstream, yielding substantial gains on various benchmarks. Although increased model size and computational cost tend to translate to immediate quality gains for most tasks (as long as enough labeled data is provided for training), computational efficiency and low parameter count are still enabling factors for various use cases such as mobile vision and big-data scenarios. Here we explore ways to scale up networks so that the added computation is used as efficiently as possible, through suitably factorized convolutions and aggressive regularization. We benchmark our methods on the ILSVRC 2012 classification challenge validation set and demonstrate substantial gains over the state of the art: 21.2% top-1 and 5.6% top-5 error for single-frame evaluation, using a network with a computational cost of 5 billion multiply-adds per inference and fewer than 25 million parameters. With an ensemble of 4 models and multi-crop evaluation, we report 3.5% top-5 error and 17.3% top-1 error.
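The top-1 and top-5 error figures quoted in the abstract can be illustrated with a small sketch. It assumes the usual reading of top-k error: an example counts as an error when its true label is not among the k highest-scoring classes. The function name `top_k_error` and the toy scores are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
def top_k_error(scores, labels, k):
    """Fraction of examples whose true label is NOT among the k top-scoring classes.

    scores: list of per-class score lists, one list per example.
    labels: list of integer class indices, one per example.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not labels:
        raise ValueError("need at least one example")

    misses = 0
    for row, label in zip(scores, labels):
        # Class indices sorted from highest score to lowest.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label not in ranked[:k]:
            misses += 1
    return misses / len(labels)


# Toy data (hypothetical): 4 examples, 3 classes.
scores = [
    [0.1, 0.7, 0.2],  # ranking: 1, 2, 0
    [0.5, 0.3, 0.2],  # ranking: 0, 1, 2
    [0.2, 0.3, 0.5],  # ranking: 2, 1, 0
    [0.6, 0.3, 0.1],  # ranking: 0, 1, 2
]
labels = [1, 1, 0, 2]

top1 = top_k_error(scores, labels, k=1)
top2 = top_k_error(scores, labels, k=2)
print(f"top-1 error: {top1:.2f}, top-2 error: {top2:.2f}")
```

With k equal to the number of classes the error is always zero, which is why top-5 error is never higher than top-1 error on the same predictions.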
1. Introduction
Since Krizhevsky et al. [9] won the 2012 ImageNet competition [16], their network "AlexNet" has been successfully applied to a wide variety of computer vision tasks, including object detection [5], segmentation [12], human pose estimation [22], video classification [8], object tracking [23], and superresolution [3].
These successes spurred a new line of research focused on finding higher-performing CNNs. Starting in 2014, the quality of network architectures improved significantly through the use of deeper and wider networks. VGGNet [18] and GoogLeNet [20] achieved similarly high performance in the 2014 ILSVRC [16] classification challenge. One interesting observation is that gains in classification performance tend to transfer to significant quality gains in a wide variety of application domains. This means that architectural improvements in deep convolutional architectures can be used to improve performance on most other computer vision tasks that increasingly rely on high-quality learned visual features. In addition, improvements in network quality opened up new application domains for convolutional networks in cases where AlexNet features could not compete with hand-engineered solutions (e.g. proposal generation in detection [4]).
Although VGGNet [18] has the appeal of architectural simplicity, this comes at a high cost: evaluating the network requires a lot of computation. By contrast, the Inception architecture of GoogLeNet [20] was designed to perform well even under strict constraints on memory and computational budget. For example, GoogLeNet used only 5 million parameters, a 12x reduction relative to AlexNet, which used 60 million parameters. Furthermore, VGGNet used about 3x more parameters than AlexNet.
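The parameter comparison above is simple arithmetic, sketched below under stated assumptions. The counts are the rounded figures from the text, with AlexNet taken as roughly 60 million parameters (the figure consistent with the stated 12x reduction). "About 3x more" is read here as roughly three times AlexNet's count, and the memory estimate assumes 4-byte float32 weights. Both readings are assumptions made for illustration.

```python
# Rounded parameter counts as quoted in the text.
ALEXNET_PARAMS = 60_000_000
GOOGLENET_PARAMS = 5_000_000

# Assumption: "about 3x more than AlexNet" is read as roughly 3 times AlexNet.
VGGNET_PARAMS_APPROX = 3 * ALEXNET_PARAMS

# Assumption: weights stored as 32-bit floats (4 bytes each).
BYTES_PER_PARAM = 4


def reduction_factor(baseline_params, model_params):
    """How many times fewer parameters the model has than the baseline."""
    return baseline_params / model_params


def weights_megabytes(params):
    """Approximate size of the weights in megabytes (10**6 bytes)."""
    return params * BYTES_PER_PARAM / 1e6


alexnet_vs_googlenet = reduction_factor(ALEXNET_PARAMS, GOOGLENET_PARAMS)
vggnet_vs_googlenet = reduction_factor(VGGNET_PARAMS_APPROX, GOOGLENET_PARAMS)

print(f"AlexNet / GoogLeNet: {alexnet_vs_googlenet:.0f}x")
print(f"VGGNet (approx.) / GoogLeNet: {vggnet_vs_googlenet:.0f}x")
for name, params in [
    ("GoogLeNet", GOOGLENET_PARAMS),
    ("AlexNet", ALEXNET_PARAMS),
    ("VGGNet (approx.)", VGGNET_PARAMS_APPROX),
]:
    print(f"{name}: ~{weights_megabytes(params):.0f} MB of float32 weights")
```

Under these assumptions GoogLeNet's weights occupy about 20 MB against about 240 MB for AlexNet, which gives a rough sense of why parameter count matters in memory-constrained settings.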
The computational cost of Inception is also much lower than that of VGGNet or its higher-performing successors [6]. This has made it feasible to use Inception networks in big-data scenarios, where huge amounts of data must be processed at reasonable cost, and in settings where memory or computational capacity is inherently limited, such as mobile vision.
2. General Design Principles
3. Factorizing Convolutions with Large Filter Size
3.1. Factorization into smaller convolutions
3.2. Spatial Factorization into Asymmetric Convolutions
4. Utility of Auxiliary Classifiers
5. Efficient Grid Size Reduction
6. Inception-v2
References
[1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
[2] W. Chen, J. T. Wilson, S. Tyree, K. Q. Weinberger, and Y. Chen. Compressing neural networks with the hashing trick. In Proceedings of The 32nd International Conference on Machine Learning, 2015.
[3] C. Dong, C. C. Loy, K. He, and X. Tang. Learning a deep convolutional network for image super-resolution. In Computer Vision - ECCV 2014, pages 184-199. Springer, 2014.
[4] D. Erhan, C. Szegedy, A. Toshev, and D. Anguelov. Scalable object detection using deep neural networks. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 2155-2162. IEEE, 2014.
[5] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
[6] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. arXiv preprint arXiv:1502.01852, 2015.
[7] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of The 32nd International Conference on Machine Learning, pages 448-456, 2015.
[8] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 1725-1732. IEEE, 2014.
[9] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097-1105, 2012.
[10] A. Lavin. Fast algorithms for convolutional neural networks. arXiv preprint arXiv:1509.09308, 2015.
[11] C.-Y. Lee, S. Xie, P. Gallagher, Z. Zhang, and Z. Tu. Deeply-supervised nets. arXiv preprint arXiv:1409.5185, 2014.
[12] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431-3440, 2015.
[13] Y. Movshovitz-Attias, Q. Yu, M. C. Stumpe, V. Shet, S. Arnoud, and L. Yatziv. Ontological supervision for fine grained classification of street view storefronts. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1693-1702, 2015.
[14] R. Pascanu, T. Mikolov, and Y. Bengio. On the difficulty of training recurrent neural networks. arXiv preprint arXiv:1211.5063, 2012.
[15] D. C. Psichogios and L. H. Ungar. Svd-net: an algorithm that automatically selects network structure. IEEE transactions on neural networks/a publication of the IEEE Neural Networks Council, 5(3):513-515, 1993.
[16] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. 2014.
[17] F. Schroff, D. Kalenichenko, and J. Philbin. Facenet: A unified embedding for face recognition and clustering. arXiv preprint arXiv:1503.03832, 2015.
[18] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[19] I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the importance of initialization and momentum in deep learning. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), volume 28, pages 1139-1147. JMLR Workshop and Conference Proceedings, May 2013.
[20] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1-9, 2015.
[21] T. Tieleman and G. Hinton. Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4, 2012. Accessed: 2015-11-05.
[22] A. Toshev and C. Szegedy. Deeppose: Human pose estimation via deep neural networks. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 1653-1660. IEEE, 2014.
[23] N. Wang and D.-Y. Yeung. Learning a deep compact image representation for visual tracking. In Advances in Neural Information Processing Systems, pages 809-817, 2013.