
Abstract

โ€ƒ๋‹จ์–ด์˜ ๋ณต์žกํ•œ ํŠน์„ฑ(e.g, ๋ฌธ๋ฒ• ์˜๋ฏธ)์™€ ์ด๋“ค์ด ์–ธ์–ด์  ๋งฅ๋ฝ์— ๋”ฐ๋ผ ์–ด๋–ป๊ฒŒ ์‚ฌ์šฉ๋˜๋Š” ์ง€(i.e. ๋‹ค์˜์–ด)๋ฅผ ๋ชจ๋‘ ๋ชจ๋ธ๋งํ•˜๋Š” ์ƒˆ๋กœ์šด ์œ ํ˜•์˜ ๊นŠ์€ ๋ฌธ๋งฅํ™”๋œ ๋‹จ์–ด ํ‘œํ˜„์„ ์†Œ๊ฐœํ•ฉ๋‹ˆ๋‹ค. ์šฐ๋ฆฌ์˜ ๋‹จ์–ด ๋ฒกํ„ฐ๋Š” ๋Œ€๊ทœ๋ชจ ๋ง๋ญ‰์น˜๋ฅผ ํ•™์Šต์‹œํ‚จ bidirectional Language Model(์ดํ•˜ biLM)์˜ ๋‚ด๋ถ€ ์ƒํƒœ์— ๋Œ€ํ•ด ํ•™์Šต๋œ ํ•จ์ˆ˜์ž…๋‹ˆ๋‹ค. ์šฐ๋ฆฌ๋Š” ์ด๋Ÿฌํ•œ ํ‘œํ˜„ ๋ฐฉ์‹์ด ๋‹ค๋ฅธ ๊ธฐ์กด์˜ ๋ชจ๋ธ๋“ค์— ๋„์ž…๋  ์ˆ˜ ์žˆ์œผ๋ฉฐ question answering, textual entailment, sentiment analysis๋ฅผ ํฌํ•จํ•˜๋Š” 6๊ฐ€์ง€ ๊นŒ๋‹ค๋กœ์šด NLP ๊ณผ์ œ์—์„œ ์ตœ์‹  ๊ธฐ์ˆ ๋“ค์„ ํ˜„์ €ํ•˜๊ฒŒ ๊ฐœ์„ ํ•  ์ˆ˜ ์žˆ์Œ์„ ๋ณด์ž…๋‹ˆ๋‹ค. ๋˜ํ•œ ์šฐ๋ฆฌ๋Š” ์‚ฌ์ „ ํ•™์Šต ๋ชจ๋ธ์˜ ๋‚ด๋ถ€ ๋ ˆ์ด์–ด๋ฅผ ๋…ธ์ถœ์‹œํ‚ค๋Š” ๊ฒƒ์ด ๋‹ค์šด์ŠคํŠธ๋ฆผ ๋ชจ๋ธ์ด ๋‹ค์–‘ํ•œ ์œ ํ˜•์˜ ์ค€๊ฐ๋… ์‹ ํ˜ธ๋ฅผ ํ˜ผํ•ฉํ•  ์ˆ˜ ์žˆ๋„๋ก ํ•˜๋Š” ๋ฐ ์ค‘์š”ํ•˜๋‹ค๋Š” ๋ถ„์„์„ ์ œ์‹œํ•ฉ๋‹ˆ๋‹ค.

1. Introduction

โ€ƒ์‚ฌ์ „ ํ•™์Šต๋œ ๋‹จ์–ด ํ‘œํ˜„์€ ๋งŽ์€ ์ž์—ฐ์–ด๋ฅผ ์ดํ•ดํ•˜๋Š” ๋ชจ๋ธ์—์„œ ํ•ต์‹ฌ์ ์ธ ์š”์†Œ์ž…๋‹ˆ๋‹ค. ์–ด์จŒ๋“ , ๊ณ ํ’ˆ์งˆ์˜ ํ‘œํ˜„์„ ํ•™์Šตํ•˜๋Š” ๊ฒƒ์€ ์–ด๋ ค์šธ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๊ทธ๋“ค์€ ์ด์ƒ์ ์œผ๋กœ ๋‹จ์–ด์˜ ๋ณต์žกํ•œ ํŠน์„ฑ(e.g., ๋ฌธ๋ฒ•๊ณผ ์˜๋ฏธ)๊ณผ ์ด๋“ค์ด ์–ธ์–ด์  ๋งฅ๋ฝ์— ๋”ฐ๋ผ ์–ด๋–ป๊ฒŒ ์‚ฌ์šฉ๋˜๋Š”์ง€(i.e., ๋‹ค์˜์–ด๋ฅผ ๋ชจ๋ธ๋งํ•˜๋Š” ๊ฒƒ)๋ฅผ ๋ชจ๋‘ ๋ชจ๋ธ๋งํ•  ์ˆ˜ ์žˆ์–ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. ์ด ๋…ผ๋ฌธ์—์„œ, ์šฐ๋ฆฌ๋Š” ๋‘ ๋ฌธ์ œ๋ฅผ ์ง์ ‘์ ์œผ๋กœ ํ•ด๊ฒฐํ•˜๊ณ , ๊ธฐ์กด ๋ชจ๋ธ์— ์‰ฝ๊ฒŒ ํ•ฉ์ณ์งˆ ์ˆ˜ ์žˆ๊ณ , ์ œ์‹œ๋œ ๋ชจ๋“  ์–ด๋ ค์šด ์–ธ์–ด ์ดํ•ด ๋ฌธ์ œ๋“ค์—์„œ ์ตœ์‹  ๊ธฐ์ˆ ์˜ ์ƒ๋‹นํ•œ ๊ฐœ์„ ์„ ์ด๋ฃฌ ์ƒˆ๋กœ์šด ํ˜•ํƒœ์˜ deepย  contextualized word representation์„ ์†Œ๊ฐœํ•ฉ๋‹ˆ๋‹ค.

โ€ƒ์šฐ๋ฆฌ์˜ representation์€ ๊ฐ ํ† ํฐ์— ์ „์ฒด ์ž…๋ ฅ ๋ฌธ์žฅ์— ๋Œ€ํ•œ ํ•จ์ˆ˜ ํ‘œํ˜„์ด ํ• ๋‹น๋œ๋‹ค๋Š” ์ ์—์„œ ๊ธฐ์กด ๋‹จ์–ด ์œ ํ˜• ์ž„๋ฒ ๋”ฉ๊ณผ ์ฐจ์ด๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค. ์šฐ๋ฆฌ๋Š” ๋Œ€๊ทœ๋ชจ ๋ง๋ญ‰์น˜๋ฅผ ๋ชฉ์ ์œผ๋กœ ํ•™์Šต๋œ ์ง์ง€์–ด์ง„ ์–ธ์–ด ๋ชจ๋ธ๊ณผ ํ•จ๊ป˜ ํ•™์Šต๋œ ์–‘๋ฐฉํ–ฅ LSTM์œผ๋กœ ๋ถ€ํ„ฐ ์–ป์€ ๋ฒกํ„ฐ๋ฅผ ์‚ฌ์šฉํ•˜์˜€์Šต๋‹ˆ๋‹ค. ์ด๋Ÿฌํ•œ ์ด์œ ๋กœ, ์šฐ๋ฆฌ๋Š” ๊ทธ๋“ค์„ ELMo(Embeddings from Language Models) representations๋ผ๊ณ  ๋ถ€๋ฆ…๋‹ˆ๋‹ค. ๋ฌธ๋งฅํ™”๋œ ๋‹จ์–ด ๋ฒกํ„ฐ๋ฅผ ํ•™์Šต์‹œํ‚ค๋Š” ์ด์ „์˜ ๋ฐฉ๋ฒ•๋“ค๊ณผ ๋‹ฌ๋ฆฌ, ELMo representations๋Š” ์–‘๋ฐฉํ–ฅ LSTM์˜ ๋ชจ๋“  ๋‚ด๋ถ€ ๋ ˆ์ด์–ด๋“ค์˜ ํ•จ์ˆ˜๋ผ๋Š” ์ ์—์„œ ๊นŠ์Šต๋‹ˆ๋‹ค. ๋” ๊ตฌ์ฒด์ ์œผ๋กœ, ์šฐ๋ฆฌ๋Š” ๊ฐ ์ž…๋ ฅ ๋‹จ์–ด ์œ„์— ์Œ“์ธ ๋ฒกํ„ฐ์˜ ์„ ํ˜• ์กฐํ•ฉ์„ ํ•™์Šต์‹œ์ผฐ๊ณ , ์ด๋Š” LSTM์˜ ์ตœ์ƒ์œ„ ๋ ˆ์ด์–ด๋งŒ์„ ์‚ฌ์šฉํ•  ๋•Œ ๋ณด๋‹ค ์„ฑ๋Šฅ์„ ํ˜„์ €ํ•˜๊ฒŒ ๊ฐœ์„ ์‹œ์ผฐ์Šต๋‹ˆ๋‹ค.

ย โ€ƒ์ด๋Ÿฌํ•œ ๋ฐฉ๋ฒ•์œผ๋กœ ๋‚ด๋ถ€ ์ƒํƒœ๋“ค์„ ํ•ฉ์น˜๋Š” ๊ฒƒ์€ ๋งค์šฐ ํ’๋ถ€ํ•œ ๋‹จ์–ด ํ‘œํ˜„์ด ๊ฐ€๋Šฅํ•˜๋„๋ก ํ•ฉ๋‹ˆ๋‹ค. ๊ณ ์œ ํ•œ ํ‰๊ฐ€๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ, ์šฐ๋ฆฌ๋Š” ์ƒ์œ„ LSTM์˜ ์ƒํƒœ๊ฐ€ ๋ฌธ๋งฅ์˜ ์˜์กด์„ฑ ์ธก๋ฉด์„ ํฌ์ฐฉํ•˜๋Š” ๋ฐ˜๋ฉด ํ•˜์œ„ LSTM์˜ ์ƒํƒœ๋Š” ๋ฌธ๋ฒ•์  ์ธก๋ฉด์„ ํฌ์ฐฉํ•˜๋Š” ๊ฒƒ์„ ๋ณด์˜€์Šต๋‹ˆ๋‹ค. ๋™์‹œ์— ๋ชจ๋“  ์ด๋Ÿฌํ•œ ์‹ ํ˜ธ๋“ค์„ ๋…ธ์ถœ์‹œํ‚ค๋Š” ๊ฒƒ์€ ํ•™์Šต๋œ ๋ชจ๋ธ์ด ๊ฐ ์ตœ์ข… ์ž‘์—…์— ๊ฐ€์žฅ ์œ ์šฉํ•œ ์ค€๊ฐ๋… ์œ ํ˜•์„ ์„ ํƒํ•  ์ˆ˜ ์žˆ์œผ๋ฏ€๋กœ ๋งค์šฐ ์œ ์ตํ•ฉ๋‹ˆ๋‹ค.

Extensive experiments demonstrate that ELMo representations work extremely well in practice. We first show that they can be easily added to existing models for six diverse and challenging language understanding problems, including question answering, textual entailment, and sentiment analysis. The addition of ELMo representations alone significantly improves the state of the art in every case, including relative error reductions of up to 20%. For tasks where direct comparisons are possible, ELMo outperforms CoVe, a contextualized representation computed with a neural machine translation encoder. Finally, an analysis of both ELMo and CoVe reveals that deep representations outperform those derived from just the top layer of an LSTM. Our trained models and code are publicly available, and we expect that ELMo will provide similar gains for many other NLP problems.

โ€ƒ๋ผ๋ฒจ๋ง ๋˜์ง€ ์•Š์€ ๋Œ€๊ทœ๋ชจ ํ…์ŠคํŠธ์—์„œ ๋ฌธ๋ฒ•์ , ์˜๋ฏธ์  ์ •๋ณด๋ฅผ ์ถ”์ถœํ•ด๋‚ด๋Š” ๋Šฅ๋ ฅ ๋•๋ถ„์—, ์‚ฌ์ „ ํ•™์Šต๋œ ๋‹จ์–ด ๋ฒกํ„ฐ๋Š” question answering, textual entailment, semantic role labeling ๋“ฑ์„ ํฌํ•จํ•˜๋Š” ๋Œ€๋ถ€๋ถ„์˜ ์ตœ์‹  NLP ์•„ํ‚คํ…์ฒ˜์—์„œ ์ผ๋ฐ˜์ ์ธ ๊ตฌ์„ฑ ์š”์†Œ๊ฐ€ ๋˜์—ˆ์Šต๋‹ˆ๋‹ค. ์–ด์จŒ๋“ , ๋‹จ์–ด ๋ฒกํ„ฐ ํ•™์Šต์„ ์œ„ํ•œ ์ด๋Ÿฌํ•œ ์ ‘๊ทผ๋“ค์€ ๊ฐ ๋‹จ์–ด๋งˆ๋‹ค ๋ฌธ๋งฅ์— ๋น„์˜์กด์ ์ธ ํ•˜๋‚˜์˜ representation๋งŒ์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.

โ€ƒ์ด์ „์— ์ œ์‹œ๋œ ๋ฐฉ๋ฒ•๋“ค์€ ํ•˜์œ„ ๋‹จ์–ด ์ •๋ณด๋ฅผ ํ’๋ถ€ํ•˜๊ฒŒ ํ•˜๊ฑฐ๋‚˜ ๊ฐ ๋‹จ์–ด์˜ ์˜๋ฏธ์— ๋Œ€ํ•ด ๋ณ„๋„์˜ ๋ฒกํ„ฐ๋ฅผ ํ•™์Šตํ•˜์—ฌ ๊ธฐ์กด ๋‹จ์–ด ๋ฒกํ„ฐ๊ฐ€ ๊ฐ€์ง€๋Š” ๋ช‡๋ช‡ ๋ฌธ์ œ๋“ค์„ ๊ทน๋ณตํ•˜์˜€์Šต๋‹ˆ๋‹ค. ์šฐ๋ฆฌ์˜ ์ ‘๊ทผ ๋ฐฉ๋ฒ•์€ ๋ฌธ์ž ์ปจ๋ณผ๋ฃจ์…˜์„ ํ†ตํ•ด ๋ถ€๋ถ„ ๋ฌธ์ž์— ๋Œ€ํ•œ ์ด์ ์„ ๊ฐ€์ง€๊ณ , ์‚ฌ์ „ ์ •์˜๋œ ์˜๋ฏธ๋ฅผ ์˜ˆ์ธกํ•˜๊ธฐ ์œ„ํ•ด ๋ช…์‹œ์ ์ธ ํ›ˆ๋ จ ์—†์ด ๋‹ค์˜์–ด ์ •๋ณด๋ฅผ ์›ํ™œํ•˜๊ฒŒ ๋‹ค์šด์ŠคํŠธ๋ฆผ์œผ๋กœ ํ†ตํ•ฉํ•ฉ๋‹ˆ๋‹ค.

โ€ƒ๋˜ํ•œ ๋‹ค๋ฅธ ์ตœ๊ทผ์˜ ์—ฐ๊ตฌ๋Š” ๊ตฌ๋ฌธ์— ์˜์กด์ ์ธ representation์„ ํ•™์Šต์‹œํ‚ค๋Š” ๊ฒƒ์„ ๋ชฉํ‘œ๋กœ ํ•ฉ๋‹ˆ๋‹ค. context2vec์€ ๋Œ€์ƒ ์ฃผ์œ„์˜ ๋ฌธ๋งฅ์„ ์ธ์ฝ”๋”ฉํ•˜๊ธฐ ์œ„ํ•ด ์–‘๋ฐฉํ–ฅ LSTM์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค. contextual embedding์„ ํ•™์Šต์‹œํ‚ค๊ธฐ ์œ„ํ•œ ๋‹ค๋ฅธ ์ ‘๊ทผ ๋ฐฉ๋ฒ•๋“ค์€ representation์— pivot word ์ž์ฒด๋ฅผ ํฌํ•จํ•˜๊ณ  ์ง€๋„ ํ•™์Šต ์‹ ๊ฒฝ๋ง ๊ธฐ๊ณ„ ๋ฒˆ์—ญ(CoVe) ๋˜๋Š” ๋น„์ง€๋„ ํ•™์Šต ์–ธ์–ด ๋ชจ๋ธ์˜ ์ธ์ฝ”๋”๋กœ ๊ณ„์‚ฐ๋ฉ๋‹ˆ๋‹ค. ๊ธฐ๊ณ„ ๋ฒˆ์—ญ ๋ฐฉ์‹์ด ๋ณ‘๋ ฌ ๋ง๋ญ‰์น˜์˜ ํฌ๊ธฐ์— ์ œํ•œ์„ ๋ฐ›์ง€๋งŒ ์ด๋Ÿฌํ•œ ๋ฐฉ์‹๋“ค์€ ๋Œ€๊ทœ๋ชจ ๋ฐ์ดํ„ฐ์…‹์—์„œ ์ด์ ์„ ๊ฐ€์ง‘๋‹ˆ๋‹ค. ์ด ๋…ผ๋ฌธ์—์„œ, ์šฐ๋ฆฌ๋Š” ํ’๋ถ€ํ•œ ๋‹จ์ผ ์–ธ์–ด ๋ฐ์ดํ„ฐ์— ์ ‘๊ทผํ•จ์œผ๋กœ์จ ์–ป๋Š” ์ตœ๋Œ€ํ•œ์˜ ์ด์ ์„ ์–ป๊ณ , ์•ฝ 3์ฒœ๋งŒ ๊ฐœ์˜ ๋ฌธ์žฅ์ด ํฌํ•จ๋œ ๋ง๋ญ‰์น˜๋กœ biLM์„ ํ•™์Šต์‹œํ‚ต๋‹ˆ๋‹ค. ๋˜ํ•œ ์šฐ๋ฆฌ๋Š” deep contextual representation์— ๋Œ€ํ•œ ์ด๋Ÿฌํ•œ ์ ‘๊ทผ ๋ฐฉ๋ฒ•๋“ค์„ ๋„“์€ ๋ฒ”์œ„์˜ ๋‹ค์–‘ํ•œ NLP ๊ณผ์ œ์— ๋Œ€ํ•ด ์ž˜ ์ž‘๋™ํ•˜๋Š” ๊ฒƒ์„ ๋ณด์—ฌ ์ผ๋ฐ˜ํ™” ์‹œ์ผฐ์Šต๋‹ˆ๋‹ค.

โ€ƒ์ด์ „์˜ ์—ฐ๊ตฌ๋“ค์€ ์–‘๋ฐฉํ–ฅ RNN์˜ ๋ ˆ์ด์–ด๋“ค์ด ๋‹ค๋ฅธ ์ข…๋ฅ˜์˜ ์ •๋ณด๋ฅผ ์ธ์ฝ”๋”ฉํ•˜๋Š” ๊ฒƒ์„ ๋ณด์˜€์Šต๋‹ˆ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด, deep LSTM์˜ ํ•˜์œ„ ๋ ˆ์ด์–ด์— multi-task ๊ตฌ๋ฌธ ๊ฐ๋…(e.g., ํ’ˆ์‚ฌ ํƒœ๊น…)์„ ๋„์ž…ํ•˜๋Š” ๊ฒƒ์€ ์ข…์†์„ฑ ๊ตฌ๋ฌธ ๋ถ„์„์ด๋‚˜ CCG super tagging๊ณผ ๊ฐ™์€ ๋†’์€ ๋ ˆ๋ฒจ์˜ task์—์„œ ์ „์ฒด์ ์ธ ์„ฑ๋Šฅ์„ ๊ฐœ์„ ์‹œํ‚ฌ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. RNN ๊ธฐ๋ฐ˜ ์ธ์ฝ”๋”-๋””์ฝ”๋” ๊ธฐ๊ณ„ ๋ฒˆ์—ญ ์‹œ์Šคํ…œ์—์„œ, Belinkov et al. (2017)์€ 2-layer LSTM ์ธ์ฝ”๋”์˜ ์ฒซ ๋ฒˆ์งธ ๋ ˆ์ด์–ด์—์„œ ํ•™์Šต๋œ representation์ด ๋‘ ๋ฒˆ์งธ ๋ ˆ์ด์–ด์— ๋น„ํ•ด ํ’ˆ์‚ฌ ํƒœ๊ทธ๋ฅผ ์˜ˆ์ธกํ•˜๋Š” ๋ฐ์— ๋” ๋‚ซ๋‹ค๋Š” ๊ฒƒ์„ ๋ณด์˜€์Šต๋‹ˆ๋‹ค. ๋งˆ์ง€๋ง‰์œผ๋กœ, ๋‹จ์–ด ๋ฌธ๋งฅ ์ธ์ฝ”๋”ฉ์„ ์œ„ํ•œ LSTM์˜ ์ตœ์ƒ์œ„ ๋ ˆ์ด์–ด๋Š” ๋‹จ์–ด์˜ ์˜๋ฏธ์— ๋Œ€ํ•œ representation์„ ํ•™์Šตํ•˜๋Š” ๊ฒƒ์ด ์ฆ๋ช…๋˜์—ˆ์Šต๋‹ˆ๋‹ค. ์šฐ๋ฆฌ๋Š” ELMo representations์˜ ์ˆ˜์ •๋œ ์–ธ์–ด ๋ชจ๋ธ ๋ชฉ์ ์— ์˜ํ•ด ์œ ์‚ฌํ•œ ์‹ ํ˜ธ๋“ค์ด ์œ ๋„๋จ์„ ๋ณด์˜€๊ณ , ์ด๋Š” ๋‹ค์–‘ํ•œ ์œ ํ˜•์˜ ์ค€๊ฐ๋…์„ ํ˜ผํ•ฉํ•˜๋Š” ๋‹ค์šด์ŠคํŠธ๋ฆผ ์ž‘์—…์„ ํ•œ ๋ชจ๋ธ๋“ค์„ ํ•™์Šตํ•˜๋Š” ๊ฒƒ์— ๋งค์šฐ ์œ ์ตํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

Dai and Le (2015) and Ramachandran et al. (2017) pre-train encoder-decoder pairs using language models and sequence autoencoders, then fine tune them with task-specific supervision. In contrast, after pre-training the biLM on unlabeled data, we fix its weights and add extra task-specific model capacity, allowing us to leverage large, rich, and universal biLM representations in cases where the downstream training data size dictates a smaller supervised model.

3. ELMo: Embeddings from Language Models

โ€ƒ๊ฐ€์žฅ ๋„๋ฆฌ ์‚ฌ์šฉ๋˜๋Š” ๋‹จ์–ด ์ž„๋ฒ ๋”ฉ๋“ค๊ณผ ๋‹ฌ๋ฆฌ, ELMo word representation์€ ์ด ์„น์…˜์— ์„ค๋ช…๋œ ๋Œ€๋กœ ์ „์ฒด ์ž…๋ ฅ ๋ฌธ์žฅ์— ๋Œ€ํ•œ ํ•จ์ˆ˜์ž…๋‹ˆ๋‹ค . ๊ทธ๋“ค์€ ๋ฌธ์ž ์ปจ๋ณผ๋ฃจ์…˜์„ ํ†ตํ•ด (Sec. 3.1), ๋‚ด๋ถ€ ๋„คํŠธ์›Œํฌ ์ƒํƒœ์— ๋Œ€ํ•œ ์„ ํ˜• ํ•จ์ˆ˜๋กœ์จ 2-layer biLM์˜ ์ตœ์ƒ์œ„ layer์—์„œ ๊ณ„์‚ฐ๋ฉ๋‹ˆ๋‹ค (Sec. 3.2). ์ด ์„ค์ •์„ ํ†ตํ•ด ์šฐ๋ฆฌ๋Š” biLM์ด ํฐ ๊ทœ๋ชจ์—์„œ ์‚ฌ์ „์— ํ•™์Šต๋˜์–ด ์žˆ์„ ๋•Œ (Sec. 3.4) ์ค€์ง€๋„ ํ•™์Šต์„ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ๊ณ , ๋„“์€ ๋ฒ”์œ„์˜ ๊ธฐ์กด์˜ neural NLP ์•„ํ‚คํ…์ฒ˜์— ํ†ตํ•ฉ์‹œํ‚ฌ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค (Sec. 3.3).

3.1 Bidirectional language models

โ€ƒ N๊ฐœ์˜ token ์ด ์ฃผ์–ด์ง€๋ฉด, ์ •๋ฐฉํ–ฅ ์–ธ์–ด ๋ชจ๋ธ์€ ์ด ์ฃผ์–ด์กŒ์„ ๋•Œ ์˜ ํ™•๋ฅ ์„ ๋ชจ๋ธ๋งํ•˜์—ฌ ์‹œํ€€์Šค์˜ ํ™•๋ฅ ์„ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค.

์ตœ๊ทผ์˜ ์ตœ์‹  ์‹ ๊ฒฝ๋ง ์–ธ์–ด ๋ชจ๋ธ์€ ํ† ํฐ ์ž„๋ฒ ๋”ฉ ๋˜๋Š” ๋ฌธ์ž CNN์— ๋Œ€ํ•œ ํ†ตํ•ด ๋ฌธ๋งฅ์— ๋น„์˜์กด์ ์ธ token representation ์„ ๊ณ„์‚ฐํ•˜๊ณ  ์ •๋ฐฉํ–ฅ LSTM์˜ ๊ฐœ์˜ layer๋ฅผ ํ†ตํ•ด ์ „๋‹ฌํ•ฉ๋‹ˆ๋‹ค. ๊ฐ ์œ„์น˜ ์—์„œ, ๊ฐ๊ฐ์˜ LSTM layer๋Š” ์—์„œ ๋ฌธ๋งฅ์— ์˜์กด์ ์ธ representation ์„ ์ถœ๋ ฅํ•ฉ๋‹ˆ๋‹ค. LSTM์˜ ์ตœ์ƒ์œ„ ๋ ˆ์ด์–ด์˜ ์ถœ๋ ฅ์€ Softmax layer๋ฅผ ํ†ตํ•ด ๋‹ค์Œ token ์„ ์˜ˆ์ธกํ•˜๊ธฐ ์œ„ํ•ด ์‚ฌ์šฉ๋ฉ๋‹ˆ๋‹ค.

A backward language model is similar to a forward language model, except that it runs over the sequence in reverse, predicting the previous token given the future context:

$$p(t_1, t_2, \dots, t_N) = \prod_{k=1}^{N} p(t_k \mid t_{k+1}, t_{k+2}, \dots, t_N)$$

It can be implemented analogously to a forward language model, with each backward LSTM layer $j$ in an $L$-layer deep model producing a representation $\overleftarrow{h}_{k,j}^{LM}$ of $t_k$ given the following context $(t_{k+1}, \dots, t_N)$.

โ€ƒbiLM์€ ์ •๋ฐฉํ–ฅ๊ณผ ์—ญ๋ฐฉํ–ฅ ์–ธ์–ด ๋ชจ๋ธ์„ ๊ฒฐํ•ฉํ•ฉ๋‹ˆ๋‹ค. ์šฐ๋ฆฌ์˜ ๊ณต์‹์€ ์ •๋ฐฉํ–ฅ๊ณผ ์—ญ๋ฐฉํ–ฅ์˜ log likelihood๋ฅผ ๋ชจ๋‘ ์ตœ๋Œ€ํ™”ํ•ฉ๋‹ˆ๋‹ค.

์šฐ๋ฆฌ๋Š” ๊ฐ ๋ฐฉํ–ฅ์˜ LSTM์ด ๊ฐ€์ง€๋Š” parameter๋ฅผ ์œ ์ง€ํ•˜๋ฉด์„œ token representation()๊ณผ Softmax layer()์— ํ•„์š”ํ•œ parameter๋ฅผ ์ •๋ฐฉํ–ฅ๊ณผ ์—ญ๋ฐฉํ–ฅ์— ์—ฐ๊ฒฐํ•˜์˜€์Šต๋‹ˆ๋‹ค. ์™„์ „ํžˆ ๋…๋ฆฝ๋œ parameter๋ฅผ ์‚ฌ์šฉํ•˜๋Š” ๋Œ€์‹ ์— ์ผ๋ถ€ ๊ฐ€์ค‘์น˜๋ฅผ ๊ณต์œ ํ•œ ๋‹ค๋Š” ์ ์„ ์ œ์™ธํ•˜๋ฉด ์ „์ฒด์ ์œผ๋กœ ์ด ๊ณต์‹์€ Peters et al. (2017)์˜ ์ ‘๊ทผ ๋ฐฉ๋ฒ•๊ณผ ์œ ์‚ฌํ•ฉ๋‹ˆ๋‹ค. ๋‹ค์Œ ์„น์…˜์—์„œ, biLM์˜ ์„ ํ˜• ์กฐํ•ฉ์ธ word representation์„ ํ•™์Šตํ•˜๊ธฐ ์œ„ํ•œ ์ƒˆ๋กœ์šด ๋ฐฉ๋ฒ•์„ ๋„์ž…ํ•˜๋ฉฐ ์ด์ „์˜ ์—ฐ๊ตฌ๋“ค๋กœ ๋ถ€ํ„ฐ ์‹œ์ž‘ํ•ฉ๋‹ˆ๋‹ค.

3.2 ELMo

ELMo is a task-specific combination of the intermediate layer representations in the biLM. For each token $t_k$, an $L$-layer biLM computes a set of $2L + 1$ representations:

$$R_k = \{\, x_k^{LM}, \overrightarrow{h}_{k,j}^{LM}, \overleftarrow{h}_{k,j}^{LM} \mid j = 1, \dots, L \,\} = \{\, h_{k,j}^{LM} \mid j = 0, \dots, L \,\}$$

where $h_{k,0}^{LM}$ is the token layer and $h_{k,j}^{LM} = [\overrightarrow{h}_{k,j}^{LM}; \overleftarrow{h}_{k,j}^{LM}]$ for each biLSTM layer.

For inclusion in a downstream model, ELMo collapses all layers in $R_k$ into a single vector, $ELMo_k = E(R_k; \Theta_e)$. In the simplest case, ELMo just selects the top layer, $E(R_k) = h_{k,L}^{LM}$, as in TagLM (Peters et al., 2017) and CoVe (McCann et al., 2017). More generally, we compute a task-specific weighting of all biLM layers:

$$ELMo_k^{task} = E(R_k; \Theta^{task}) = \gamma^{task} \sum_{j=0}^{L} s_j^{task}\, h_{k,j}^{LM} \qquad (1)$$

In (1), $s^{task}$ are softmax-normalized weights and the scalar parameter $\gamma^{task}$ allows the task model to scale the entire ELMo vector. $\gamma$ is of practical importance to aid the optimization process (see the supplemental material for details). Considering that the activations of each biLM layer have a different distribution, in some cases it also helped to apply layer normalization (Ba et al., 2016) to each biLM layer before weighting.
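As a minimal illustration of Eq. (1), the sketch below computes the task-specific mix of biLM layers in NumPy; `s_raw` and `gamma` stand in for the learned parameters $s^{task}$ and $\gamma^{task}$, and the array shapes are assumptions for the example.

```python
import numpy as np

def elmo_vector(layer_reps, s_raw, gamma):
    """Collapse biLM layers into one ELMo vector per token (Eq. 1).

    layer_reps: (L + 1, n_tokens, dim) array holding h_{k,0..L}^{LM}
                for every position k.
    s_raw:      (L + 1,) unnormalized layer weights (learned per task).
    gamma:      scalar rescaling the whole ELMo vector (learned per task).
    """
    s = np.exp(s_raw - s_raw.max())
    s = s / s.sum()                       # softmax-normalized weights s_j
    # weighted sum over layers -> (n_tokens, dim)
    return gamma * np.einsum("j,jkd->kd", s, layer_reps)

# toy usage: a 2-layer biLM (L = 2, so 3 representations) over 5 tokens
reps = np.random.randn(3, 5, 1024)
print(elmo_vector(reps, s_raw=np.zeros(3), gamma=1.0).shape)  # (5, 1024)
```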

3.3 Using biLMs for supervised NLP tasks

โ€ƒ์‚ฌ์ „ ํ•™์Šต๋œ biLM๊ณผ target NLP task๋ฅผ ์œ„ํ•œ ์ง€๋„ ํ•™์Šต ์•„ํ‚คํ…์ฒ˜๊ฐ€ ์ฃผ์–ด์กŒ์„ ๋•Œ, task model์„ ๊ฐœ์„ ํ•˜๊ธฐ ์œ„ํ•ด biLM์„ ์‚ฌ์šฉํ•˜๋Š” ๊ฒƒ์€ ๊ฐ„๋‹จํ•ฉ๋‹ˆ๋‹ค. ์šฐ๋ฆฌ๋Š” ๊ฐ„๋‹จํ•˜๊ฒŒ biLM์„ ์ž‘๋™์‹œํ‚ค๊ณ , ๊ฐ ๋‹จ์–ด์— ๋Œ€ํ•œ ๋ชจ๋“  ๋ ˆ์ด์–ด์˜ representation์„ ๊ธฐ๋กํ•˜์˜€์Šต๋‹ˆ๋‹ค. ๊ทธ๋ฆฌ๊ณ  ๋‚˜์„œ, ์šฐ๋ฆฌ๋Š” ํ›„์ˆ  ๋˜์–ด ์žˆ๋“ฏ์ด ๋งˆ์ง€๋ง‰ task model์ด ์ด๋Ÿฌํ•œ representation์˜ ์„ ํ˜• ์กฐํ•ฉ์„ ํ•™์Šตํ•˜๋„๋ก ํ•˜์˜€์Šต๋‹ˆ๋‹ค.

First consider the lowest layers of the supervised model without the biLM. Most supervised NLP models share a common architecture at the lowest layers, allowing us to add ELMo in a consistent, unified fashion. Given a sequence of tokens $(t_1, \dots, t_N)$, it is standard to form a context-independent token representation $x_k$ for each token position using pre-trained word embeddings and, optionally, character-based representations. The model then forms a context-sensitive representation $h_k$, typically using bidirectional RNNs, CNNs, or feed-forward networks.

To add ELMo to the supervised model, we first freeze the weights of the biLM, then concatenate the ELMo vector $ELMo_k^{task}$ with $x_k$ and pass the ELMo-enhanced representation $[x_k; ELMo_k^{task}]$ into the task RNN. For some tasks (e.g., SNLI, SQuAD), we observed further improvements by also introducing ELMo at the output of the task RNN, adding a new set of output weights and replacing $h_k$ with $[h_k; ELMo_k^{task}]$. As the remainder of the supervised model is unchanged, these additions can happen within the context of more complex neural models. For example, see the SNLI experiments in Sec. 4, where a bi-attention layer follows the biLSTM, or the coreference resolution experiments, where a clustering model is layered on top of the biLSTM.
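A sketch of the two insertion points described above, reusing `elmo_vector` from the previous snippet; the shapes and the `task_rnn` placeholder are illustrative assumptions, not part of any specific task architecture.

```python
import numpy as np

# illustrative shapes: 5 tokens, a 2-layer biLM (3 representations), 1024-dim
# biLM states, 300-dim word embeddings; task_rnn stands in for the real encoder
reps = np.random.randn(3, 5, 1024)     # frozen biLM layer representations
x = np.random.randn(5, 300)            # context-independent token reps x_k
task_rnn = lambda inp: np.random.randn(5, 512)  # placeholder context encoder

# input side: concatenate a task-weighted ELMo mix with x_k
elmo_in = elmo_vector(reps, s_raw=np.zeros(3), gamma=1.0)
x_enhanced = np.concatenate([x, elmo_in], axis=-1)     # [x_k; ELMo_k^task]

h = task_rnn(x_enhanced)

# output side (e.g., SNLI, SQuAD): a second, independently weighted mix
# concatenated with the task RNN outputs h_k
elmo_out = elmo_vector(reps, s_raw=np.zeros(3), gamma=1.0)
h_enhanced = np.concatenate([h, elmo_out], axis=-1)    # [h_k; ELMo_k^task]
```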

โ€ƒ์ตœ์ข…์ ์œผ๋กœ, ์šฐ๋ฆฌ๋Š” ELMo์— ์ ๋‹นํ•œ ์–‘์˜ ๋“œ๋กญ์•„์›ƒ์„ ์ถ”๊ฐ€ํ•˜๋Š” ๊ฒƒ๊ณผ (Srivastava et al., 2014) ๋ช‡๋ช‡ ์ƒํ™ฉ์—์„œ ELMo์˜ ๊ฐ€์ค‘์น˜์— ์„ ๋”ํ•˜์—ฌ ๊ทœ์ œํ•˜๋Š” ๊ฒƒ์ด ์œ ์ตํ•˜๋‹ค๋Š” ๊ฒƒ์„ ์•Œ์•„๋‚ด์—ˆ์Šต๋‹ˆ๋‹ค. ์ด๋Š” ๋ชจ๋“  biLM ๋ ˆ์ด์–ด์˜ ํ‰๊ท ์— ๊ฐ€๊น๊ฒŒ ์œ„์น˜ํ•˜๋„๋ก ํ•˜๊ธฐ ์œ„ํ•ด ELMo์˜ ๊ฐ€์ค‘์น˜์— bias์˜ ์œ ๋„๋ฅผ ๊ฐ•์ œํ•ฉ๋‹ˆ๋‹ค.

3.4 Pre-trained bidirectional language model architecture

โ€ƒ์ด ๋…ผ๋ฌธ์—์„œ ์‚ฌ์šฉ๋œ ์‚ฌ์ „ ํ•™์Šต๋œ biLM์€ Jozefowicz et al. (2016) ๋ฐ Kim et al. (2015)๊ณผ ์œ ์‚ฌํ•˜์ง€๋งŒ, ์–‘๋ฐฉํ–ฅ์˜ ํ•ฉ๋™ ํ›ˆ๋ จ์„ ์œ„ํ•ด ์ˆ˜์ •๋˜์—ˆ๊ณ  LSTM ๋ ˆ์ด์–ด ์‚ฌ์ด์— residual connection์„ ์ถ”๊ฐ€ํ•˜์˜€์Šต๋‹ˆ๋‹ค. ์šฐ๋ฆฌ๋Š” ์ด ์—ฐ๊ตฌ์—์„œ Peters et al. (2017)์ด ์ •๋ฐฉํ–ฅ LM๊ณผ ๋Œ€๊ทœ๋ชจ ํ•™์Šต์—์„œ biLM์„ ์‚ฌ์šฉํ•˜๋Š” ๊ฒƒ์˜ ์ค‘์š”์„ฑ์„ ๊ฐ•์กฐํ•˜์˜€๋“ฏ์ด, ๋Œ€๊ทœ๋ชจ biLM์— ์ง‘์ค‘ํ•˜์˜€์Šต๋‹ˆ๋‹ค.

โ€ƒ๋ฌธ์ž ๊ธฐ๋ฐ˜ ์ž…๋ ฅ representation์„ ์œ ์ง€ํ•  ๋•Œ, ์ „์ฒด์ ์ธ ๋ชจ๋ธ์˜ ๋ณต์žก์„ฑ๊ณผ ๋ชจ๋ธ์˜ ํฌ๊ธฐ, ๋‹ค์šด์ŠคํŠธ๋ฆผ task๋ฅผ ์œ„ํ•ด ์š”๊ตฌ๋˜๋Š” ๊ณ„์‚ฐ์˜ ๊ท ํ˜•์„ ๋งž์ถ”๊ธฐ ์œ„ํ•ด ์šฐ๋ฆฌ๋Š” Jozefowicz et al. (2016)์˜ CNN-BIG-LSTM์—์„œ ๋ชจ๋“  ์ž„๋ฒ ๋”ฉ ๋ฐ ํžˆ๋“  ๋ ˆ์ด์–ด์˜ ์ฐจ์›์„ ์ ˆ๋ฐ˜์œผ๋กœ ์ค„์˜€์Šต๋‹ˆ๋‹ค. ์ตœ์ข… ๋ชจ๋ธ์€ 4096์˜ unit๊ณผ 512๊ฐœ์˜ ์ฐจ์› ํˆฌ์˜ ๊ทธ๋ฆฌ๊ณ  ์ฒซ ๋ฒˆ์งธ์™€ ๋‘ ๋ฒˆ์งธ ๋ ˆ์ด์–ด ์‚ฌ์ด์— residual connection์„ ๊ฐ€์ง„ biLSTM์ž…๋‹ˆ๋‹ค. context insensitive type representation์€ 2048 ๋ฌธ์ž n-๊ทธ๋žจ ์ปจ๋ณผ๋ฃจ์…˜ ํ•„ํ„ฐ์™€ ๋‘ ๊ฐœ์˜ highway layer (Srivastava et al., 2015) ๊ทธ๋ฆฌ๊ณ  512 representation์œผ๋กœ์˜ ์„ ํ˜• ์‚ฌ์˜์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค. ๊ทธ ๊ฒฐ๊ณผ๋กœ, biLM

References

[1] Jimmy Ba, Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer normalization. CoRR abs/1607.06450.
[2] Yonatan Belinkov, Nadir Durrani, Fahim Dalvi, Hassan Sajjad, and James R. Glass. 2017. What do neural machine translation models learn about morphology? In ACL.
[3] Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. TACL 5:135–146.
[4] Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In EMNLP.
[5] Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. 2014. One billion word benchmark for measuring progress in statistical language modeling. In INTERSPEECH.
[6] Qian Chen, Xiao-Dan Zhu, Zhen-Hua Ling, Si Wei, Hui Jiang, and Diana Inkpen. 2017. Enhanced LSTM for natural language inference. In ACL.
[7] Jason Chiu and Eric Nichols. 2016. Named entity recognition with bidirectional LSTM-CNNs. In TACL.
[8] Kyunghyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the properties of neural machine translation: Encoder-decoder approaches. In SSST@EMNLP.
[9] Christopher Clark and Matthew Gardner. 2017. Simple and effective multi-paragraph reading comprehension. CoRR abs/1710.10723.
[10] Kevin Clark and Christopher D. Manning. 2016. Deep reinforcement learning for mention-ranking coreference models. In EMNLP.
[11] Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel P. Kuksa. 2011. Natural language processing (almost) from scratch. In JMLR.
[12] Andrew M. Dai and Quoc V. Le. 2015. Semi-supervised sequence learning. In NIPS.
[13] Greg Durrett and Dan Klein. 2013. Easy victories and uphill battles in coreference resolution. In EMNLP.
[14] Yarin Gal and Zoubin Ghahramani. 2016. A theoretically grounded application of dropout in recurrent neural networks. In NIPS.
[15] Yichen Gong, Heng Luo, and Jian Zhang. 2018. Natural language inference over interaction space. In ICLR.
[16] Kazuma Hashimoto, Caiming Xiong, Yoshimasa Tsuruoka, and Richard Socher. 2017. A joint many-task model: Growing a neural network for multiple NLP tasks. In EMNLP.
[17] Luheng He, Kenton Lee, Mike Lewis, and Luke S. Zettlemoyer. 2017. Deep semantic role labeling: What works and what's next. In ACL.
[18] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9.
[19] Ignacio Iacobacci, Mohammad Taher Pilehvar, and Roberto Navigli. 2016. Embeddings for word sense disambiguation: An evaluation study. In ACL.
[20] Rafal Józefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. 2016. Exploring the limits of language modeling. CoRR abs/1602.02410.
[21] Rafal Józefowicz, Wojciech Zaremba, and Ilya Sutskever. 2015. An empirical exploration of recurrent network architectures. In ICML.
[22] Yoon Kim, Yacine Jernite, David Sontag, and Alexander M. Rush. 2015. Character-aware neural language models. In AAAI.
[23] Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In ICLR.
[24] Ankit Kumar, Ozan Irsoy, Peter Ondruska, Mohit Iyyer, Ishaan Gulrajani, James Bradbury, Victor Zhong, Romain Paulus, and Richard Socher. 2016. Ask me anything: Dynamic memory networks for natural language processing. In ICML.
[25] John D. Lafferty, Andrew McCallum, and Fernando Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML.
[26] Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recognition. In NAACL-HLT.
[27] Kenton Lee, Luheng He, Mike Lewis, and Luke S. Zettlemoyer. 2017. End-to-end neural coreference resolution. In EMNLP.
[28] Wang Ling, Chris Dyer, Alan W. Black, Isabel Trancoso, Ramon Fermandez, Silvio Amir, Luís Marujo, and Tiago Luís. 2015. Finding function in form: Compositional character models for open vocabulary word representation. In EMNLP.
[29] Xiaodong Liu, Yelong Shen, Kevin Duh, and Jianfeng Gao. 2017. Stochastic answer networks for machine reading comprehension. arXiv preprint arXiv:1712.03556.
[30] Xuezhe Ma and Eduard H. Hovy. 2016. End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In ACL.
[31] Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics 19:313–330.
[32] Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. 2017. Learned in translation: Contextualized word vectors. In NIPS.
[33] Oren Melamud, Jacob Goldberger, and Ido Dagan. 2016. context2vec: Learning generic context embedding with bidirectional LSTM. In CoNLL.
[34] Gábor Melis, Chris Dyer, and Phil Blunsom. 2017. On the state of the art of evaluation in neural language models. CoRR abs/1707.05589.
[35] Stephen Merity, Nitish Shirish Keskar, and Richard Socher. 2017. Regularizing and optimizing LSTM language models. CoRR abs/1708.02182.
[36] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In NIPS.
[37] George A. Miller, Martin Chodorow, Shari Landes, Claudia Leacock, and Robert G. Thomas. 1994. Using a semantic concordance for sense identification. In HLT.
[38] Tsendsuren Munkhdalai and Hong Yu. 2017. Neural tree indexers for text understanding. In EACL.
[39] Arvind Neelakantan, Jeevan Shankar, Alexandre Passos, and Andrew McCallum. 2014. Efficient non-parametric estimation of multiple embeddings per word in vector space. In EMNLP.
[40] Martha Palmer, Paul Kingsbury, and Daniel Gildea. 2005. The proposition bank: An annotated corpus of semantic roles. Computational Linguistics 31:71–106.
[41] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In EMNLP.
[42] Matthew E. Peters, Waleed Ammar, Chandra Bhagavatula, and Russell Power. 2017. Semi-supervised sequence tagging with bidirectional language models. In ACL.
[43] Sameer Pradhan, Alessandro Moschitti, Nianwen Xue, Hwee Tou Ng, Anders Björkelund, Olga Uryupina, Yuchen Zhang, and Zhi Zhong. 2013. Towards robust linguistic analysis using OntoNotes. In CoNLL.
[44] Sameer Pradhan, Alessandro Moschitti, Nianwen Xue, Olga Uryupina, and Yuchen Zhang. 2012. CoNLL-2012 shared task: Modeling multilingual unrestricted coreference in OntoNotes. In EMNLP-CoNLL Shared Task.
[45] Alessandro Raganato, Claudio Delli Bovi, and Roberto Navigli. 2017a. Neural sequence learning models for word sense disambiguation. In EMNLP.
[46] Alessandro Raganato, Jose Camacho-Collados, and Roberto Navigli. 2017b. Word sense disambiguation: A unified evaluation framework and empirical comparison. In EACL.
[47] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In EMNLP.
[48] Prajit Ramachandran, Peter Liu, and Quoc Le. 2017. Improving sequence to sequence learning with unlabeled data. In EMNLP.
[49] Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In CoNLL.
[50] Min Joon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. 2017. Bidirectional attention flow for machine comprehension. In ICLR.
[51] Richard Socher, Alex Perelygin, Jean Y. Wu, Jason Chuang, Christopher D. Manning, Andrew Y. Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In EMNLP.
[52] Anders Søgaard and Yoav Goldberg. 2016. Deep multi-task learning with low level tasks supervised at lower layers. In ACL.
[53] Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15:1929–1958.
[54] Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. 2015. Training very deep networks. In NIPS.
[55] Joseph P. Turian, Lev-Arie Ratinov, and Yoshua Bengio. 2010. Word representations: A simple and general method for semi-supervised learning. In ACL.
[56] Wenhui Wang, Nan Yang, Furu Wei, Baobao Chang, and Ming Zhou. 2017. Gated self-matching networks for reading comprehension and question answering. In ACL.
[57] John Wieting, Mohit Bansal, Kevin Gimpel, and Karen Livescu. 2016. Charagram: Embedding words and sentences via character n-grams. In EMNLP.
[58] Sam Wiseman, Alexander M. Rush, and Stuart M. Shieber. 2016. Learning global features for coreference resolution. In HLT-NAACL.
[59] Matthew D. Zeiler. 2012. Adadelta: An adaptive learning rate method. CoRR abs/1212.5701.
[60] Jie Zhou and Wei Xu. 2015. End-to-end learning of semantic role labeling using recurrent neural networks. In ACL.
[61] Peng Zhou, Zhenyu Qi, Suncong Zheng, Jiaming Xu, Hongyun Bao, and Bo Xu. 2016. Text classification improved by integrating bidirectional LSTM with two-dimensional max pooling. In COLING.