progress:: 1/8.5
fill:🟩
transition:🟨
empty:◻️
prefix:[
suffix:]
length:10

Abstract

 Convolutional networks are at the core of state-of-the-art computer vision solutions for a wide variety of tasks. Since 2014, very deep convolutional networks have become mainstream, yielding substantial gains on various benchmarks. Although increased model size and computational cost tend to translate to immediate quality gains for most tasks (as long as enough labeled data is provided for training), computational efficiency and low parameter count are still enabling factors for various use cases such as mobile vision and big-data scenarios. Here we explore ways to scale up networks so that the added computation is used as efficiently as possible, through suitably factorized convolutions and aggressive regularization. We benchmark our methods on the ILSVRC 2012 classification challenge validation set and demonstrate substantial gains over the state of the art: 21.2% top-1 error and 5.6% top-5 error for single-frame evaluation, using a network with a computational cost of 5 billion multiply-adds per inference and fewer than 25 million parameters. With an ensemble of 4 models and multi-crop evaluation, we report 3.5% top-5 error and 17.3% top-1 error.
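The abstract names factorized convolutions and multiply-adds per inference without spelling out either. As a rough illustration only, the sketch below counts multiply-adds for one well-known factorization: replacing a single 5x5 convolution with two stacked 3x3 convolutions that cover the same 5x5 receptive field. The layer shape (64 channels, 35x35 output), the helper name `conv_multiply_adds`, and the assumption that padding preserves the spatial size are all hypothetical choices made for this example, not values taken from the text.

```python
def conv_multiply_adds(kernel_h: int, kernel_w: int,
                       in_channels: int, out_channels: int,
                       out_h: int, out_w: int) -> int:
    """Multiply-adds for one dense convolution layer (bias ignored).

    Each output position of each output channel needs
    kernel_h * kernel_w * in_channels multiply-adds.
    """
    return kernel_h * kernel_w * in_channels * out_channels * out_h * out_w


# Hypothetical layer shape; padding is assumed to keep the output at H x W.
C, H, W = 64, 35, 35

single_5x5 = conv_multiply_adds(5, 5, C, C, H, W)
# Two stacked 3x3 layers with the same channel count see a 5x5 receptive field.
stacked_3x3 = 2 * conv_multiply_adds(3, 3, C, C, H, W)

ratio = stacked_3x3 / single_5x5
print(f"single 5x5 : {single_5x5:,} multiply-adds")
print(f"two 3x3    : {stacked_3x3:,} multiply-adds")
print(f"cost ratio : {ratio:.2f}")
```

The ratio depends only on the kernel areas (18 versus 25 weights per input/output channel pair), so it stays the same for any channel count or grid size as long as both variants use the same ones.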

1. Introduction

 Since the winning entry by Krizhevsky et al. [9] in the 2012 ImageNet competition [16], their network "AlexNet" has been successfully applied to a wide variety of computer vision tasks, including object detection [5], segmentation [12], human pose estimation [22], video classification [8], object tracking [23], and superresolution [3].

 These successes spurred a new line of research focused on finding higher-performing convolutional neural networks. Starting in 2014, the quality of network architectures improved significantly through the use of deeper and wider networks. VGGNet [18] and GoogLeNet [20] achieved similarly high performance in the 2014 ILSVRC [16] classification challenge. One interesting observation is that gains in classification performance tend to transfer to significant quality gains in a wide variety of application domains. This means that architectural improvements in deep convolutional architectures can be used to improve performance on most other computer vision tasks that increasingly rely on high-quality learned visual features. In addition, improvements in network quality opened up new application domains for convolutional networks in cases where AlexNet features could not compete with hand-engineered solutions (e.g. proposal generation in detection [4]).

 VGGNet [18] has the appeal of architectural simplicity, but this comes at a high cost: evaluating the network requires a lot of computation. In contrast, the Inception architecture of GoogLeNet [20] was designed to perform well even under strict constraints on memory and computational budget. For example, GoogLeNet used only 5 million parameters, a 12x reduction relative to AlexNet, which used 60 million parameters. Furthermore, VGGNet used about 3x more parameters than AlexNet.
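The parameter comparison above is simple arithmetic, and a quick check makes the ratios concrete. The sketch below takes GoogLeNet at roughly 5 million parameters and AlexNet at roughly 60 million, which is the figure the stated 12x reduction implies. The VGGNet value is only an estimate derived from the "about 3x AlexNet" statement, and the function name `reduction_factor` is a hypothetical helper for this example.

```python
# Approximate parameter counts as quoted in the text (not measured here).
GOOGLENET_PARAMS = 5_000_000
ALEXNET_PARAMS = 60_000_000


def reduction_factor(baseline: int, compact: int) -> float:
    """Return how many times fewer parameters `compact` has than `baseline`."""
    return baseline / compact


alexnet_vs_googlenet = reduction_factor(ALEXNET_PARAMS, GOOGLENET_PARAMS)

# VGGNet is described only as "about 3x" AlexNet, so this is a rough estimate.
vggnet_params_estimate = 3 * ALEXNET_PARAMS
vggnet_vs_googlenet = reduction_factor(vggnet_params_estimate, GOOGLENET_PARAMS)

print(f"AlexNet / GoogLeNet       : {alexnet_vs_googlenet:.0f}x")
print(f"VGGNet (est.) / GoogLeNet : {vggnet_vs_googlenet:.0f}x")
```

Under these rounded figures, the estimated VGGNet count is about 36 times that of GoogLeNet, which gives a sense of why the parameter budget matters for memory-constrained settings.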

 The computational cost of Inception is also much lower than that of VGGNet or its higher-performing successors [6]. This has made it feasible to use Inception networks in big-data scenarios, where huge amounts of data must be processed at reasonable cost, and in settings where memory or computational capacity is inherently limited, such as mobile vision.

2. General Design Principles

3. Factorizing Convolutions with Large Filter Size

3.1. Factorization into smaller convolutions

3.2. Spatial Factorization into Asymmetric Convolutions

4. Utility of Auxiliary Classifiers

5. Efficient Grid Size Reduction

6. Inception-v2

References

[1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.

[2] W. Chen, J. T. Wilson, S. Tyree, K. Q. Weinberger, and Y. Chen. Compressing neural networks with the hashing trick. In Proceedings of The 32nd International Conference on Machine Learning, 2015.

[3] C. Dong, C. C. Loy, K. He, and X. Tang. Learning a deep convolutional network for image super-resolution. In Computer Vision–ECCV 2014, pages 184–199. Springer, 2014.

[4] D. Erhan, C. Szegedy, A. Toshev, and D. Anguelov. Scalable object detection using deep neural networks. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 2155–2162. IEEE, 2014.

[5] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.

[6] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. arXiv preprint arXiv:1502.01852, 2015.

[7] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of The 32nd International Conference on Machine Learning, pages 448–456, 2015.

[8] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 1725–1732. IEEE, 2014.

[9] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.

[10] A. Lavin. Fast algorithms for convolutional neural networks. arXiv preprint arXiv:1509.09308, 2015.

[11] C.-Y. Lee, S. Xie, P. Gallagher, Z. Zhang, and Z. Tu. Deeply-supervised nets. arXiv preprint arXiv:1409.5185, 2014.

[12] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015.

[13] Y. Movshovitz-Attias, Q. Yu, M. C. Stumpe, V. Shet, S. Arnoud, and L. Yatziv. Ontological supervision for fine grained classification of street view storefronts. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1693–1702, 2015.

[14] R. Pascanu, T. Mikolov, and Y. Bengio. On the difficulty of training recurrent neural networks. arXiv preprint arXiv:1211.5063, 2012.

[15] D. C. Psichogios and L. H. Ungar. Svd-net: an algorithm that automatically selects network structure. IEEE transactions on neural networks/a publication of the IEEE Neural Networks Council, 5(3):513–515, 1993.

[16] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. 2014.

[17] F. Schroff, D. Kalenichenko, and J. Philbin. Facenet: A unified embedding for face recognition and clustering. arXiv preprint arXiv:1503.03832, 2015.

[18] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[19] I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the importance of initialization and momentum in deep learning. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), volume 28, pages 1139–1147. JMLR Workshop and Conference Proceedings, May 2013.

[20] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015.

[21] T. Tieleman and G. Hinton. Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4, 2012. Accessed: 2015-11-05.

[22] A. Toshev and C. Szegedy. Deeppose: Human pose estimation via deep neural networks. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 1653–1660. IEEE, 2014.

[23] N. Wang and D.-Y. Yeung. Learning a deep compact image representation for visual tracking. In Advances in Neural Information Processing Systems, pages 809–817, 2013.