progress:: 4/9.5
fill:๐ŸŸฉ
transition:๐ŸŸจ
empty:โ—ป๏ธ
prefix:[
suffix:]
length:10

Abstract

โ€ƒ์ž์—ฐ์–ด ์ƒ์„ฑ ๋ชจ๋ธ์€ ์ด์ „์˜ ๋ฌธ๋งฅ์„ ๊ธฐ๋ฐ˜์œผ๋กœ ๋‹จ์–ด๋ฅผ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค. ๊ธฐ์กด์˜ ๋ฐฉ์‹์€ ๋ชจ๋ธ์˜ ์˜ˆ์ธก์— ๋Œ€ํ•œ ์„ค๋ช…์œผ๋กœ ์ž…๋ ฅ ๊ธฐ์—ฌ๋„๋ฅผ ์ œ๊ณตํ•˜์ง€๋งŒ, ์•ž์„  ๋‹จ์–ด๊ฐ€ ์–ด๋–ป๊ฒŒ ๋ ˆ์ด์–ด๋ฅผ ๊ฑฐ์ณ ๋ชจ๋ธ์— ์˜ํ–ฅ์„ ๋ผ์น˜๋Š”์ง€ ์•„์ง ๋ถˆ๋ถ„๋ช…ํ•ฉ๋‹ˆ๋‹ค. ์ด ์—ฐ๊ตฌ์—์„œ, ์šฐ๋ฆฌ๋Š” Transformer์˜ ์„ค๋ช… ๊ฐ€๋Šฅ์„ฑ์— ๋Œ€ํ•œ ์ตœ๊ทผ์˜ ๋ฐœ์ „์„ ํ™œ์šฉํ•˜๊ณ  ์–ธ์–ด ์ƒ์„ฑ ๋ชจ๋ธ์„ ๋ถ„์„ํ•˜๋Š” ์ ˆ์ฐจ๋ฅผ ์ œ์‹œํ•ฉ๋‹ˆ๋‹ค. ๋Œ€์กฐ์ ์ธ ์˜ˆ์‹œ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ, ์šฐ๋ฆฌ์˜ ์„ค๋ช…์ด ์–ธ์–ด ํ˜„์ƒ์˜ ์ฆ๊ฑฐ^[๋ฌธ๋ฒ•์ ์œผ๋กœ ๋‹ค์Œ ๋‹จ์–ด์— ๋ฌด์—‡์ด ์™€์•ผ ํ•œ๋‹ค๊ณ  ์ถ”์ •ํ•  ์ˆ˜ ์žˆ๋„๋ก ํ•˜๋Š” ์ฃผ๋œ ๋‹จ์„œ๋“ค์„ ์–ธ์–ด ํ˜„์ƒ์˜ ์ฆ๊ฑฐ๋ผ๊ณ  ํ‘œํ˜„ํ•˜๋Š” ๋“ฏํ•˜๋‹ค. ์˜ˆ์‹œ๋Š” Table 2 ์ฐธ์กฐ.]์™€ ์–ด๋–ป๊ฒŒ ์ •๋ ฌ๋˜๋Š”์ง€ ๋น„๊ตํ•˜๊ณ , ์šฐ๋ฆฌ์˜ ๋ฐฉ๋ฒ•์ด ๊ทธ๋ž˜๋””์–ธํŠธ ๊ธฐ๋ฐ˜ ๋ฐ ์„ญ๋™ ๊ธฐ๋ฐ˜์˜ ๊ธฐ์ค€๋ณด๋‹ค ์ผ๊ด€๋˜๊ฒŒ ๋” ์ž˜ ์ •๋ ฌ๋จ์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค. ๊ทธ ๋‹ค์Œ, Transformer ๋‚ด๋ถ€์˜ MLPs์˜ ์—ญํ• ์„ ์กฐ์‚ฌํ•˜๊ณ , ์ด๋“ค์ด ๋ฌธ๋ฒ•์ ์œผ๋กœ ํ—ˆ์šฉ๋˜๋Š” ๋‹จ์–ด๋ฅผ ์˜ˆ์ธกํ•˜๋Š” ๋ฐ ๋„์›€์ด ๋˜๋Š” ํŠน์ง•์„ ํ•™์Šตํ•œ๋‹ค๋Š” ๊ฒƒ์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค. ๋งˆ์ง€๋ง‰์œผ๋กœ, ์šฐ๋ฆฌ์˜ ๋ฐฉ๋ฒ•์„ ์‹ ๊ฒฝ ๊ธฐ๊ณ„ ๋ฒˆ์—ญ ๋ชจ๋ธ์— ์ ์šฉํ•˜์—ฌ, ์ด๋“ค์ด ์˜ˆ์ธก์„ ๊ตฌ์ถ•ํ•˜๋Š” ๋ฐ ์ธ๊ฐ„๊ณผ ์œ ์‚ฌํ•œ source-target alignment^[์ผ๋ฐ˜์ ์œผ๋กœ alignment๋Š” ๋‹จ์–ด ๋˜๋Š” ๋ฌธ์žฅ ๊ฐ„์˜ ๋Œ€์‘์„ ๋‚˜ํƒ€๋‚ด๋Š” ๋“ฏ ํ•จ. source vector๋ฅผ ์•Œ๋งž์€ target vector์— ๋Œ€์‘ ์‹œํ‚จ๋‹ค๋Š” ์ ์—์„œ ์ •๋ ฌ์ด๋ผ๊ณ  ํ‘œํ˜„ํ•˜๋Š” ๋“ฏ]์„ ์ƒ์„ฑํ•œ๋‹ค๋Š” ๊ฒƒ์„ ์ž…์ฆํ•ฉ๋‹ˆ๋‹ค.

1 Introduction

โ€ƒ์–ธ์–ด ๋ชจ๋ธ๋“ค, ํŠนํžˆ Transformer ๊ธฐ๋ฐ˜์˜ ์–ธ์–ด ๋ชจ๋ธ๋“ค (Brown et al., 2020; Zhang et al., 2022a)์€ ์ตœ๊ทผ ์ž์—ฐ์–ด ์ฒ˜๋ฆฌ ๋ถ„์•ผ์— ํ˜๋ช…์„ ์ผ์œผ์ผฐ์Šต๋‹ˆ๋‹ค. ๊ทธ๋Ÿผ์—๋„ ๋ถˆ๊ตฌํ•˜๊ณ , ์ด ๋ชจ๋ธ๋“ค์ด ์–ด๋–ป๊ฒŒ ์ธ๊ฐ„์˜ ์–ธ์–ด์™€ ์œ ์‚ฌํ•œ ์–ธ์–ด๋ฅผ ์ƒ์„ฑํ•˜๋Š” ์ง€์— ๋Œ€ํ•œ ์ดํ•ด์—๋Š” ์—ฌ์ „ํžˆ ๊ฐ„๊ทน์ด ์žˆ์Šต๋‹ˆ๋‹ค. ์ด๋Š” ํŠน์ • ์ƒํ™ฉ์—์„œ ๋ชจ๋ธ์˜ ์‹คํŒจ ์›์ธ์„ ๊ฒฐ์ •ํ•  ์ˆ˜ ์—†๋‹ค๋Š” ๊ฒƒ์„ ์˜๋ฏธํ•˜๋ฉฐ, ์ด๋กœ ์ธํ•ด halluination์„ ํฌํ•จํ•˜๊ฑฐ๋‚˜ ์œ ํ•ดํ•œ ๋‚ด์šฉ์„ ์ƒ์„ฑํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

โ€ƒNLP ๋ชจ๋ธ ์˜ˆ์ธก์˜ ์„ค๋ช… ๊ฐ€๋Šฅ์„ฑ์— ๋Œ€ํ•œ ์•ž์„  ์—ฐ๊ตฌ๋“ค ์ค‘ ๋Œ€๋‹ค์ˆ˜๋Š” ์ผ๋ฐ˜์ ์œผ๋กœ ์ž‘์€ ์ถœ๋ ฅ ์ฐจ์›์„ ๊ฐ€์ง€๋Š” ํ…์ŠคํŠธ ๋ถ„๋ฅ˜๋‚˜ ์ž์—ฐ์–ด ์ถ”๋ก ๊ณผ ๊ฐ™์€ ๋‹ค์šด์ŠคํŠธ๋ฆผ ์ž‘์—…์„ ์ค‘์‹ฌ์œผ๋กœ ์ด๋ฃจ์–ด์กŒ์Šต๋‹ˆ๋‹ค (Atanasova et al., 2020; Bastings et al., 2022; Zaman and Belinkov, 2022). ์ด ์—ฐ๊ตฌ ๋ถ„์•ผ์—๋Š” attention mechanism ๋ถ„์„์— ์ค‘์ ์„ ๋‘” ๋งŽ์€ ์—ฐ๊ตฌ(Jain and Wallace, 2019; Serrano and Smith, 2019; Pruthi et al., 2020)์™€ ์ž…๋ ฅ ๊ธฐ์—ฌ๋„ ์ ์ˆ˜๋ฅผ ์–ป๊ธฐ ์œ„ํ•ด ๊ทธ๋ž˜๋””์–ธํŠธ ๊ธฐ๋ฐ˜ ๋ฐฉ๋ฒ•(Li et al., 2016a; Sundararajan et al., 2017)์„ ์ ์šฉํ•˜๋Š” ์—ฐ๊ตฌ๋„ ํฌํ•จ๋ฉ๋‹ˆ๋‹ค. Table 1: โ€ƒ์ตœ๊ทผ ๋“ค์–ด, ์—ฌ๋Ÿฌ ์—ฐ๊ตฌ๋“ค์€ ์–ธ์–ด ๋ชจ๋ธ๋ง ์ž‘์—…์—์„œ Transformer์˜ ํ•ด์„ ๊ฐ€๋Šฅ์„ฑ์— ๋Œ€ํ•ด ๋‹ค๋ฃจ๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค (Vaswani et al., 2017). Elhage et al. (2021)์€ Figure 1์— ์„ค๋ช…๋œ Transformer๋ฅผ ๋‹ค์–‘ํ•œ ์š”์†Œ(MLPs, attention headsโ€ฆ)๊ฐ€ residual stream์˜ ํ•˜์œ„ ๊ณต๊ฐ„์„ ์ฝ๊ณ  ์“ฐ๋Š” residual stream์˜ ๊ด€์ ^[residual connection์œผ๋กœ ์—ฐ๊ฒฐ๋œ stream, ํ๋ฆ„์— ์ •๋ณด๋ฅผ ์ถ”๊ฐ€ํ•ด ๋‚˜๊ฐ€๋Š” ๊ฒƒ์„ write into the residual stream์ด๋ผ๊ณ  ํ‘œํ˜„ํ•œ ๊ฑฐ ๊ฐ™๋‹ค. ์ฆ‰, ์ด ๋…ผ๋ฌธ์—์„œ๋Š” attention๊ณผ MLP ๋“ฑ์„ residual stream์˜ ์ •๋ณด๋ฅผ ์ฝ๊ณ  ์ˆ˜์ •ํ•ด ๋‚˜๊ฐ€๋Š” ์—ญํ• ์ด๋ผ๊ณ  ํ•ด์„ํ•œ๋‹ค (Figure 1 ์ฐธ์กฐ).]์—์„œ ์—ฐ๊ตฌํ•˜์˜€์Šต๋‹ˆ๋‹ค. ์ด ์ ‘๊ทผ๋ฒ•์€ attention heads๊ฐ€ ๋งฅ๋ฝ์„ ํƒ์ƒ‰ํ•˜์—ฌ ๋™์ผํ•œ ํ† ํฐ์˜ ์ด์ „ ๋ฐ˜๋ณต์„ ์ฐพ๊ณ  ๋‹ค์Œ ํ† ํฐ์„ ๋ณต์‚ฌํ•˜๋Š” induction heads (Olsson et al., 2022)๋‚˜ Indirect Object Identification (IOI) ํ•ด๊ฒฐ์— ํŠนํ™”๋œ heads(Wang et al., 2023) ๊ฐ™์ด ์–ธ์–ด ๋ชจ๋ธ๋“ค์˜ ํŠน์ • ํ–‰๋™์„ ์„ค๋ช…ํ•˜๋Š” ๋ฐ์— ๋„์›€์ด ๋˜์—ˆ์Šต๋‹ˆ๋‹ค. ๋น„์Šทํ•˜๊ฒŒ, Transformer ๋‚ด๋ถ€์˜ MLPs ๋˜ํ•œ residual stream์— ์“ฐ๋Š” ์š”์†Œ๋กœ ์—ฐ๊ตฌ๋˜์–ด ์™”์Šต๋‹ˆ๋‹ค. Geva et al. (2022)์€ MLP ๋ธ”๋ก์ด value๋ฅผ residual์— ์ถ”๊ฐ€ํ•˜๋Š” key-value meory ์ฒ˜๋Ÿผ ๋™์ž‘ํ•˜์—ฌ ์œ ์‚ฌํ•œ ์˜๋ฏธ๋ฅผ ๊ฐ€์ง„ ๋‹จ์–ด๊ฐ€ ์˜ˆ์ธก๋˜๋„๋ก ํ•  ์ˆ˜ ์žˆ์Œ์„ ๊ด€์ธกํ•˜์˜€์Šต๋‹ˆ๋‹ค.

โ€ƒ๋” ๋‚˜์•„๊ฐ€, attention heads, output weigh matrix ๊ทธ๋ฆฌ๊ณ  layer normalization์œผ๋กœ ๊ตฌ์„ฑ๋œ transformer์˜ attention mechnism์€ ํ•ด์„ ๊ฐ€๋Šฅํ•œ ์ž‘์—…์œผ๋กœ ๋ถ„ํ•ด ๊ฐ€๋Šฅํ•˜๊ณ  (Kobayashi et al., 2020, 2021), ์‹ ๋ขฐ์„ฑ์ด ๋งค์šฐ ๋†’๋‹ค๊ณ  ์ฆ๋ช…๋œ ๋ ˆ์ด์–ด๋ณ„ ์ •๋ณด๋ฅผ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค (Ferrando et al., 2022b,a).

โ€ƒ์ด ์—ฐ๊ตฌ์—์„œ ์šฐ๋ฆฌ๋Š” Transformers language generators์˜ ์˜ˆ์ธก์„ ์„ค๋ช…ํ•˜๊ธฐ ์œ„ํ•ด attention ๋ถ„ํ•ด์™€ ํ•จ๊ป˜ residual stream analysis์˜ ๊ด€์ ์„ ์‚ฌ์šฉํ•˜๋Š” ๊ฒƒ์„ ์ œ์•ˆํ•ฉ๋‹ˆ๋‹ค. ์šฐ๋ฆฌ์˜ ์ ‘๊ทผ ๋ฐฉ์‹์€ ๊ฐ ๋ ˆ์ด์–ด์—์„œ ๊ฐ๊ฐ์˜ token representation์— ์˜ํ•ด ๋”ํ•ด์ง€๊ฑฐ๋‚˜ ๋นผ์ง„ logit์˜ ์–‘์„ ์ธก์ •ํ•ฉ๋‹ˆ๋‹ค. ๊ทธ๋Ÿฐ ๋‹ค์Œ ๋ ˆ์ด์–ด๋ฅผ ๊ฑฐ์ณ ์ง‘๊ณ„ํ•จ์œผ๋กœ์จ ๋ชจ๋ธ ์ž…๋ ฅ์œผ๋กœ logit ๊ธฐ์—ฌ๋„๋ฅผ ์ถ”์ ํ•ฉ๋‹ˆ๋‹ค (Logit explanation). ์ถ”๊ฐ€์ ์œผ๋กœ, ALTI(Ferrando et al., 2022b)๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ์ค‘๊ฐ„ ๋ ˆ์ด์–ด์—์„œ ์ •๋ณด์˜ ํ˜ผํ•ฉ์„ ๊ณ ๋ คํ•ฉ๋‹ˆ๋‹ค (ALTI-Logit explanation).

โ€ƒ์ œ์•ˆ๋œ ํ•ด์„ ๊ฐ€๋Šฅ์„ฑ์— ๋Œ€ํ•œ ๋ฐฉ์‹์„ ํ‰๊ฐ€ํ•˜๊ธฐ ์œ„ํ•ด, ์šฐ๋ฆฌ๋Š” ์ตœ๊ทผ์— ์†Œ๊ฐœ๋œ constrastive explanation framework(Yin and Neubig, 2022)๋ฅผ ๋”ฐ๋ฅด๋ฉฐ ์ด๋Š” ๋ชจ๋ธ์ด ์ด๋ฏธ ๋ช‡๋ช‡ ์–ธ์–ด์  phenomena evidence์— ์˜ํ•ด ์„ค๋ช…๋œ foil token ๋Œ€์‹  ํŠน์ • token์„ ์˜ˆ์ธกํ•œ ์ด์œ ๋ฅผ ์„ค๋ช…ํ•˜๋Š” ๊ฒƒ์„ ๋ชฉํ‘œ๋กœ ํ•ฉ๋‹ˆ๋‹ค. ๊ทธ๋ฆฌ๊ณ  ๋‚˜์„œ ์šฐ๋ฆฌ๋Š” MLPs์˜ ์—ญํ• ์„ ๋ถ„์„ํ•˜๊ณ  ๊ทธ๋“ค์ด ๋ฌธ๋ฒ•์„ ๋”ฐ๋ฅด๋Š” prediction์„ ์„ ํƒํ•˜๋Š” ๋ฐ์— ๋„์›€์„ ์ค€๋‹ค๋Š” ๊ฒƒ์„ ๋ณด์˜€์Šต๋‹ˆ๋‹ค. ๋งˆ์ง€๋ง‰์œผ๋กœ, ์šฐ๋ฆฌ๋Š” NMT ๋ชจ๋ธ๋“ค์ด ๋ฒˆ์—ญ๋ฌธ์„ ๋งŒ๋“ค๊ธฐ ์œ„ํ•ด ์‚ฌ๋žŒ๊ณผ ์œ ์‚ฌํ•œ source-target alignment๋ฅผ ์ƒ์„ฑํ•œ๋‹ค๋Š” ๊ฒƒ์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.1

2 Approach

2.1 Residual Stream

Figure 1: The Transformer language model represented as modules writing into the residual stream.

โ€ƒ์–ธ์–ด ์ƒ์„ฑ์ด timestep ๋ฅผ ๋”ฐ๋ผ ์ฃผ์–ด์งˆ ๋•Œ, ๋งˆ์ง€๋ง‰ ๋ ˆ์ด์–ด์˜ ์ถœ๋ ฅ2 ์€ ๋‹ค์Œ token ์˜ˆ์ธก์˜ logit์„ ๊ตฌํ•˜๊ธฐ ์œ„ํ•ด unembedding matrix ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ token embedding space๋กœ ์‚ฌ์˜๋ฉ๋‹ˆ๋‹ค. ๊ทธ๋ฆฌ๊ณ  ๋‚˜์„œ, ์–ดํœ˜์— ๋Œ€ํ•œ ํ™•๋ฅ  ๋ถ„ํฌ๋ฅผ ์–ป๊ธฐ ์œ„ํ•ด softmax ํ•จ์ˆ˜๋ฅผ ์ ์šฉํ•ฉ๋‹ˆ๋‹ค:

โ€ƒTransformer์˜ residual connection์€ ๊ฐ ๋ธ”๋ก ์ดํ›„์— ์—…๋ฐ์ดํŠธ๋˜๋Š” ์ •๋ณด์˜ stream์œผ๋กœ ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค (nostalgebraist, 2020; Elhage et al., 2021; Mickus et al., 2022). ๋ ˆ์ด์–ด ์—์„œ ์œ„์น˜์˜ residual stream์— โ€œ์“ฐ๋Š”โ€ MLP์™€ self-attention ๋ธ”๋ก์„ , ์ด๋ผ๊ณ  ํ•˜๊ฒ ์Šต๋‹ˆ๋‹ค (Figure 1). residual stream์˜ ์ตœ์ข… ์ƒํƒœ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์ด ๋‚˜ํƒ€๋‚ผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

ํŠน์ •ํ•œ next token prediction์˜ ์ตœ์ข… logit ๋Š” residual stream์˜ ์ตœ์ข… ์ƒํƒœ์™€ ์˜ ๋ฒˆ์งธ ์—ด3์„ ๊ณฑํ•˜์—ฌ ๊ณ„์‚ฐํ•  ์ˆ˜ ์žˆ๋‹ค:

์„ ํ˜•์„ฑ์— ์˜ํ•ด:

Figure 2: The output of the self-attention block updates the logit of $w$ at each layer (left). The logit update can be decomposed over each input token (right).

2.2 Multi-head Attention as a Sum of Vectors

โ€ƒKobayashi et al. (2021)์˜ Post-LN self attention ๋ธ”๋ก ๋ถ„ํ•ด์— ์˜๊ฐ์„ ๋ฐ›์•„, ์šฐ๋ฆฌ๋Š” ํ˜„์žฌ์˜ LMs์—์„œ ํ”ํžˆ ๋ณผ ์ˆ˜ ์žˆ๋Š” Pre-LN ์„ค์ •์— ์œ ์‚ฌํ•œ ์ ‘๊ทผ ๋ฐฉ๋ฒ•์„ ์ ์šฉํ•˜์˜€์Šต๋‹ˆ๋‹ค (์ „์ฒด ์œ ๋„ ๊ณผ์ •์€ ๋ถ€๋ก A ์ฐธ์กฐ). ๊ฐ ์ƒ์„ฑ ๋‹จ๊ณ„ ์—์„œ self-attention ๋ธ”๋ก์˜ ์ถœ๋ ฅ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์ด ๋‚˜ํƒ€๋‚ผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

์ด ๊ฐ ๋ ˆ์ด์–ด์˜ input token representation (๋˜๋Š” residual stream) ์— ์ ์šฉ๋œ Affine transformation์ด๋ผ๊ณ  ํ•˜๋ฉด:

๋Š” value๋ฅผ ์ด๋ฃจ๋Š” ํ–‰๋ ฌ, attention ์ถœ๋ ฅ ํ–‰๋ ฌ (head ๋‹น), ๊ทธ๋ฆฌ๊ณ  ์ด์— ๋Œ€์‘๋˜๋Š” bias๋Š” ์ด๋‹ค. ์ด๋•Œ ๋Š” attention weight ํ–‰๋ ฌ, ๋Š” bias์—์„œ ์œ ๋ž˜ํ•œ remaining terms ๊ทธ๋ฆฌ๊ณ  ๋Š” centering, normalizing ๊ทธ๋ฆฌ๊ณ  layer normalization์˜ scaling ์—ฐ์‚ฐ์„ ํ†ตํ•ฉํ•œ ๊ฒƒ์ด๋‹ค (๋ถ€๋ก A ์ฐธ์กฐ).

2.3 Layer-wise Contributions to the Logits

์‹ (4)์™€ ์‹ (5)๋ฅผ ํ™œ์šฉํ•˜์—ฌ ๋‹ค์Œ์„ ์–ป์„ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค4 :

โ€ƒ๊ฐ self-attention์— ๋Œ€ํ•œ logit์˜ ๋ณ€ํ™”๋Ÿ‰ ์€ ๊ฐ๊ฐ์˜ โ€‹์— ๋Œ€ํ•œ ๊ฐœ๋ณ„ ์—…๋ฐ์ดํŠธ๋กœ ํ™•์žฅ๋  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค(๊ทธ๋ฆผ 2 ์ฐธ์กฐ). ๊ทธ๋Ÿฌ๋ฏ€๋กœ, output token ์— ๋Œ€ํ•œ ๊ฐ ๋ ˆ์ด์–ด์˜ input token representation์˜ ๊ธฐ์—ฌ๋Š” ์˜ logit์„ ๋ณ€ํ™” ์‹œํ‚ค๋Š” ๊ฒƒ์œผ๋กœ ์ •์˜๋ฉ๋‹ˆ๋‹ค:

์ด์™€ ๋น„์Šทํ•˜๊ฒŒ, logit์˜ ๋ณ€ํ™”๋Š” ์‹ (6)์˜ affine transformation์— unembedding ํ–‰๋ ฌ์„ ๊ณฑํ•˜์—ฌ head level์—์„œ ๊ณ„์‚ฐ๋  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

2.4 Tracking Logit Updates to the Input Tokens

โ€ƒ๊ฐ๊ฐ์˜ residual stream์ด ๋ ˆ์ด์–ด ์ „๋ฐ˜์— ๊ฑธ์ณ token์˜ identity๋ฅผ ์œ ์ง€ํ•œ๋‹ค๊ณ  ๊ฐ€์ •ํ•˜๋ฉด, input token ์— ์˜ํ•ด ์ƒ์„ฑ๋œ ์— ๋Œ€ํ•œ ์ „์ฒด logit ๋ณ€ํ™”๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์ด ๊ณ„์‚ฐ๋œ๋‹ค.

์ด๋Š” ์ „์ฒด ๋ ˆ์ด์–ด์—์„œ s๋ฒˆ์งธ ํ† ํฐ์˜ intermediate representation์— ์˜ํ•œ logit ๋ณ€ํ™”์˜ ํ•ฉ ์ž…๋‹ˆ๋‹ค. ์ด์ œ๋ถ€ํ„ฐ, ์šฐ๋ฆฌ๋Š” ์ด๋ฅผ explanation์ด๋ผ๊ณ  ํ•˜๊ฒ ์Šต๋‹ˆ๋‹ค.

โ€ƒํ•˜์ง€๋งŒ, ์ค‘๊ฐ„ ๋ ˆ์ด์–ด์—์„œ ๊ฐ๊ฐ์˜ residual stream์€ ํ˜ผํ•ฉ๋œ input token๋“ค์„ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค. ๊ทธ๋Ÿฌ๋ฏ€๋กœ, ์€ ๋ชจ๋ธ์˜ input token s=j์— ์˜ํ•œ ๊ฒƒ์ด๋ผ๊ณ  ์ง์ ‘์ ์œผ๋กœ ํ•ด์„ํ•  ์ˆ˜ ์—†์Šต๋‹ˆ๋‹ค. ์šฐ๋ฆฌ๋Š” residual stream์— ๋ฌธ๋งฅ ์ •๋ณด๊ฐ€ ํ˜ผํ•ฉ๋˜๋Š” ๊ฒƒ์„ ์ธก์ •ํ•˜์—ฌ ๋ชจ๋ธ์˜ input์— ๋Œ€ํ•œ logit ๋ณ€ํ™”๋ฅผ ์ถ”์ ํ•˜๋Š” ๊ฒƒ์„ ์ œํ•œํ•ฉ๋‹ˆ๋‹ค. ์ด๋ฅผ ์œ„ํ•ด ์šฐ๋ฆฌ๋Š” ALTI (Ferrando et al., 2022b)๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค. ALTI์™€ rollout method๋ฅผ ์‚ฌ์šฉํ•˜๋Š” ๋‹ค๋ฅธ ๋ฐฉ๋ฒ•(Abnar and Zuidema, 2020; Mohebbi et al., 2023)์€ token representation์ด ์ด์ „ ๋ ˆ์ด์–ด์˜ representation์„ ์„ ํ˜• ๊ฒฐํ•ฉํ•˜์—ฌ ํ˜•์„ฑ๋œ๋‹ค๊ณ  ๊ฐ€์ •ํ•ฉ๋‹ˆ๋‹ค. ์ฆ‰, ์ด๊ณ  ์ด๋•Œ ์ž…๋‹ˆ๋‹ค. ์€ ์— ๋Œ€ํ•œ ์˜ ๊ธฐ์—ฌ๋„๋ฅผ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค. ๋ ˆ์ด์–ด๋ณ„ ๊ณ„์ˆ˜ ํ–‰๋ ฌ์„ ๊ณฑํ•จ์œผ๋กœ์จ ์„ ์–ป์„ ์ˆ˜ ์žˆ๊ณ  ์ด๋ฅผ ํ†ตํ•ด ๊ฐ ์ค‘ ๋ ˆ์ด์–ด์˜ representation์„ input token์˜ ์„ ํ˜• ๊ฒฐํ•ฉ์œผ๋กœ ํ‘œํ˜„ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค .

โ€ƒ์˜ ์—ด s๋Š” ๋ ˆ์ด์–ด ์— ์ž…๋ ฅ๋˜๋Š” ๊ฐ token representation์— ์ธ์ฝ”๋”ฉ๋œ s๋ฒˆ์งธ input token์˜ ๊ธฐ์—ฌ๋„ ๋น„์œจ์„ ํฌํ•จํ•ฉ๋‹ˆ๋‹ค. ์šฐ๋ฆฌ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์€ ์‹์„ ํ†ตํ•ด input token (Figure 3, ์˜ค๋ฅธ)์— ์˜ํ•œ next predicition token ์˜ logit ๋ณ€ํ™”๋ฅผ ์–ป์„ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

โ€ƒ๋” ์ž์„ธํ•œ ์„ค๋ช…์€ ๋ถ€๋ก B์— ์žˆ์Šต๋‹ˆ๋‹ค. prediction token ์— ๋Œ€ํ•œ ๋ฒˆ์งธ input token์˜ ์ตœ์ข…์ ์ธ ๊ธฐ์—ฌ๋Š” ๊ฐ ๋ ˆ์ด์–ด logit ๋ณ€ํ™”์˜ ํ•ฉ์œผ๋กœ ๊ตฌํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

์šฐ๋ฆฌ๋Š” ์ด ๋ฐฉ๋ฒ•์„ explanation์ด๋ผ๊ณ  ํ•˜๊ฒ ์Šต๋‹ˆ๋‹ค. ์šฐ๋ฆฌ๊ฐ€ ๋ฌธ๋งฅ์ ์ธ ์ •๋ณด๊ฐ€ ํ˜ผํ•ฉ๋˜๋Š” ๊ฒƒ์„ ๊ณ ๋ คํ•˜์ง€ ์•Š๋Š”๋‹ค๋ฉด, ์ด ๋‹จ์œ„ ํ–‰๋ ฌ์ด ๋˜์–ด explanation์ด ๋œ๋‹ค๋Š” ๊ฒƒ์„ ๊ธฐ์–ตํ•ด๋‘์„ธ์š” (์‹ (9)).

2.5 Contrastive Explanations

โ€ƒconstrastive explanation (Yin and Neubig, 2022)์€ ๋˜๋‹ค๋ฅธ foil token ๋Œ€์‹ ์— ์™œ target token ๋ฅผ ์„ ํƒํ•˜์˜€๋Š”์ง€์— ์ง‘์ค‘ํ•ฉ๋‹ˆ๋‹ค. ์šฐ๋ฆฌ๋Š” ์ด๋Ÿฌํ•œ ๊ฒฐ์ •์„ ๊ฐ token์ด ์™€ ๊ฐ„์˜ final logit difference()์— ์–ผ๋งˆ๋‚˜ ๊ธฐ์—ฌํ–ˆ๋Š”์ง€๋ฅผ ํ†ตํ•ด ์„ค๋ช…ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์‹ (9)์™€ ์‹(11)์— ๋”ฐ๋ผ, input token์˜ Constrastive Logit๊ณผ Constrastive ALTI-Logit5 saliency score๋ฅผ ๊ทธ๋“ค์˜ logit ์ฐจ์— ๋Œ€ํ•œ ๋ณ€ํ™”๋กœ ์ •์˜ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

3 Experimental Setup

โ€ƒ์šฐ๋ฆฌ๊ฐ€ ์ œ์•ˆํ•œ ๋ฐฉ๋ฒ•์˜ ์„ฑ๋Šฅ์„ constrastive explanation์„ ํ†ตํ•ด ํ‰๊ฐ€ํ•˜์˜€์Šต๋‹ˆ๋‹ค. Yin and Neubig (2022)์— ๋”ฐ๋ผ, ๋ฌธ๋ฒ•์ ์œผ๋กœ ๋งž๋Š” ์•ฝ๊ฐ„ ๋ณ€ํ˜•๋œ ๋ฌธ์žฅ๋“ค์ด ์ง์ง€์–ด์ง„ BLiMP dataset (Warstadt et al., 2020)์˜ ์ผ๋ถ€๋ฅผ ์‚ฌ์šฉํ•˜์˜€์Šต๋‹ˆ๋‹ค. 11๊ฐœ์˜ subset์€ 5๊ฐœ์˜ ์–ธ์–ด์  ํ˜„์ƒ์„ ๋”ฐ๋ฆ…๋‹ˆ๋‹ค: anaphor agreement, arguent structure, determiner-noun agreement, NPI licensing ๊ทธ๋ฆฌ๊ณ  subject-verb agreement.

| Phenomenon | ID | Example (Acceptable/Unacceptable) |
| --- | --- | --- |
| Anaphor Agreement | aga | Karla could listen to **herself/himself**. |
| Anaphor Agreement | ana | Eva approached **herself/themselves**. |
| Argument Structure | asp | Gerald is hated by the **teachers/pie**. |
| Determiner-Noun Agreement | dna | Eva has scared these **children/child**. |
| Determiner-Noun Agreement | dnai | Tammy was observing that **man/men**. |
| Determiner-Noun Agreement | dnaa | The driver sees that unlucky **person/people**. |
| Determiner-Noun Agreement | dnaai | Phillip liked that smooth **horse/horses**. |
| NPI Licensing | npi | Even Danielle **also/ever** leaves. |
| Subject-Verb Agreement | darn | The grandfathers of Diana **drink/drinks**. |
| Subject-Verb Agreement | ipsv | Many people **have/has** hidden away. |
| Subject-Verb Agreement | rpsv | Most associations **buy/buys** those libraries. |

Table 2: Examples of the BLiMP phenomena used by Yin and Neubig (2022, Table 8), with the acceptable/unacceptable word pairs marked in bold. The evidence words explaining each linguistic phenomenon are extracted by rules.

โ€ƒ๊ฐ ์–ธ์–ด์  ํ˜„์ƒ์— ๋Œ€ํ•ด, ์šฐ๋ฆฌ๋Š” spaCy (Honnibal and Montani, 2017)์„ ์‚ฌ์šฉํ•˜์˜€๊ณ  (previous tokens์—์„œ) ๋ฌธ๋ฒ•์  ์ˆ˜์šฉ์„ฑ์„ ๋’ท๋ฐ›์นจํ•˜๋Š” ์ฆ๊ฑฐ๋ฅผ ์ฐพ๊ธฐ ์œ„ํ•ด Yin and Neubig (2022)์˜ ๊ทœ์น™์„ ๋”ฐ๋ž์Šต๋‹ˆ๋‹ค (Table 2). anaphor agreement๋ฅผ ์œ„ํ•ด, target token๊ณผ ์ƒํ˜ธ ์—ฐ๊ด€๋œ ๋ชจ๋“  context token์„ ์–ป์—ˆ์Šต๋‹ˆ๋‹ค. Determiner-noun agreement์˜ ์ฆ๊ฑฐ๋Š” ๋Œ€์ƒ์ด ๋˜๋Š” ๋ช…์‚ฌ์˜ determiner(ํ•œ์ •์ž)๋กœ ๋ถ€ํ„ฐ ์ฐพ์„ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. NPI licensing์—์„œ, โ€œevenโ€์ด๋ผ๋Š” ๋‹จ์–ด๋Š” acceptableํ•œ ๋Œ€์ƒ์—์„œ ๋‚˜ํƒ€๋‚  ์ˆ˜ ์žˆ์ง€๋งŒ, unacceptableํ•œ ๋‹จ์–ด์—์„œ๋Š” ๋‚˜ํƒ€๋‚  ์ˆ˜ ์—†๋‹ค. ๋งˆ์ง€๋ง‰์œผ๋กœ, subject-verb agreement ํ˜„์ƒ์—์„œ, ๋™์ƒ์˜ ํ˜•ํƒœ๋Š” ๋Œ€์ƒ์ด ๋˜๋Š” ๋ช…์‚ฌ์™€ ์ˆ˜์ ์œผ๋กœ ์ผ์น˜ํ•ด์•ผ ํ•˜๋ฉฐ, ์ด๋Š” ์ฆ๊ฑฐ๋กœ์„œ ์‚ฌ์šฉ๋ฉ๋‹ˆ๋‹ค. ์šฐ๋ฆฌ๋Š” Yin and Neubig (2022)์™€ ๋‹ฌ๋ฆฌ, ipsv์™€ rpsv subset์— ํฌํ•จ๋œ ๋ฌธ์žฅ์˜ ๋Œ€๋ถ€๋ถ„์ด โ€˜์ •๋Ÿ‰์‚ฌ+์ฃผ์–ด์˜ ์ค‘์‹ฌ์–ด+๋™์‚ฌโ€™๋กœ ์ด๋ฃจ์–ด์ ธ ์žˆ๊ณ , ์ •๋Ÿ‰์‚ฌ์™€ ์ฃผ์–ด์˜ ์ค‘์‹ฌ์–ด ๋‘˜ ๋‹ค agreement๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด ์‚ฌ์šฉ๋  ์ˆ˜ ์žˆ๊ธฐ ๋•Œ๋ฌธ์— ์ œ์™ธํ–ˆ์Šต๋‹ˆ๋‹ค.

โ€ƒ์šฐ๋ฆฌ๋Š” ๋ถ„์„์— SVA (subject-verb agreement) (Linzen et al., 2016)๊ณผ Indirect Object Identification (IOI) (Wang et al. 2023, Fahamu, 2022) dataset์„ ์ถ”๊ฐ€ํ•˜์˜€์Šต๋‹ˆ๋‹ค. SVA dataset์—๋Š” ์ฃผ์–ด์™€ ๋‹ค๋ฅธ ์ˆ˜์˜ ๋ช…์‚ฌ๊ฐ€ ํฌํ•จ๋˜์–ด ์žˆ์–ด saliency method๋ฅผ ํ‰๊ฐ€ํ•˜๋Š” ๋ฐ์— ์ ํ•ฉํ•ฉ๋‹ˆ๋‹ค. Indirect object identification (IOI)๋Š” โ€˜After Lee and Evelyn went to the lakeโ€™์™€ ๊ฐ™์€ ์ดˆ๊ธฐ ์ข…์†์ ˆ์„ ๊ฐ€์ง„ ๋ฌธ์žฅ๋“ค์—์„œ ๋‚˜ํƒ€๋‚˜๋Š” ํŠน์ง•์ด๋ฉฐ, ์ด์–ด์ง€๋Š” ์ฃผ์ ˆ์€ โ€˜Lee gave a grape to Evelynโ€™๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค. ๊ฐ„์ ‘ ๋ชฉ์ ์–ด โ€œEvelynโ€๊ณผ ์ฃผ์–ด โ€œLeeโ€๋Š” ์ดˆ๊ธฐ ์ ˆ์—์„œ ๋ฐœ๊ฒฌ๋ฉ๋‹ˆ๋‹ค. IOI dataset์˜ ๋ชจ๋“  ์˜ˆ์‹œ์—์„œ, ์ฃผ์ ˆ์€ ๋‹ค์‹œ ์ฃผ์–ด๋ฅผ ์ฐธ์กฐํ•˜์—ฌ ๊ฐ„์ ‘ ๋ชฉ์ ์–ด์— ๊ฐ์ฒด๋ฅผ ์ „๋‹ฌํ•ฉ๋‹ˆ๋‹ค. IOI task์˜ ๋ชฉํ‘œ๋Š” ๋ฌธ์žฅ์˜ ๋งˆ์ง€๋ง‰ ๋‹จ์–ด๊ฐ€ IO์ธ์ง€ ์˜ˆ์ธกํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค. IOI์˜ ์˜ˆ์—์„œ, IO๋ฅผ ์˜ˆ์ธกํ•˜๋Š” ๊ทœ์น™์€ IO ์ž์‹ ์ด ์ฒซ ์ ˆ์— ์žˆ์–ด์•ผ ํ•œ๋‹ค๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

โ€ƒ์šฐ๋ฆฌ๋Š” HuggingFace library (Wolf et al., 2020)๋ฅผ ํ†ตํ•ด (Yin and Neubig, 2022) ์—์„œ์™€ ๊ฐ™์ด GPT-2 XL (1.5B) model (Radford et al., 2019)์„ ์‚ฌ์šฉํ•˜์˜€๊ณ , GPT-2 Small (124M)์™€ GPT-2 Large models (774M), OPT 125M (Zhang et al., 2022b) ๊ทธ๋ฆฌ๊ณ  BLOOMโ€™s 560M and 1.1B variants (Workshop et al., 2022)๊ณผ ๊ฐ™์€ ๋‹ค๋ฅธ autoregressive Transformer language models ๋˜ํ•œ ์‚ฌ์šฉํ•˜์˜€์Šต๋‹ˆ๋‹ค.

Alignment Metrics　Following Yin and Neubig (2022), we define the evidence as a binary vector with one dimension per previous token, whose entries are all zero except at the positions of the tokens belonging to the evidence; that is, the tokens the prediction depends on are extracted by rules. The explanation is likewise a real-valued vector over the previous tokens. To evaluate the alignment between explanations and evidence, we use MRR (Mean Reciprocal Rank). Sorting the tokens in descending order of saliency, MRR averages the reciprocal rank of the highest-ranked evidence token over the dataset.
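The MRR computation can be sketched as follows (a minimal implementation, assuming per-token saliency scores and a binary evidence vector as described above):

```python
def reciprocal_rank(saliency, evidence):
    """Reciprocal rank of the highest-ranked evidence token.

    saliency: one score per previous token; evidence: binary flags (1 = evidence).
    """
    order = sorted(range(len(saliency)), key=lambda i: saliency[i], reverse=True)
    for rank, idx in enumerate(order, start=1):
        if evidence[idx]:
            return 1.0 / rank
    return 0.0

def mrr(all_saliency, all_evidence):
    """Average reciprocal rank over a set of examples."""
    rrs = [reciprocal_rank(s, e) for s, e in zip(all_saliency, all_evidence)]
    return sum(rrs) / len(rrs)

# Evidence ranked 1st in one example and 2nd in another -> MRR = (1 + 1/2) / 2.
print(mrr([[0.9, 0.1, 0.2], [0.9, 0.1, 0.5]],
          [[1, 0, 0], [0, 0, 1]]))  # → 0.75
```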

References

Footnotes

  1. ์ด ๋…ผ๋ฌธ์„ ๋”ฐ๋ฅด๋Š” ์ฝ”๋“œ๋Š” https://github.com/mt-upc/logit-explanations ์— ์žˆ์Šต๋‹ˆ๋‹ค. โ†ฉ

  2. ์šฐ๋ฆฌ๋Š” ์ด๋ฅผ ์—ด ๋ฒกํ„ฐ๋กœ ๋‚˜ํƒ€๋‚ด๋Š” ๊ฒƒ์„ ์„ ํ˜ธํ•ฉ๋‹ˆ๋‹ค. โ†ฉ

  3. ์šฐ๋ฆฌ๊ฐ€ ํ–‰๋ ฌ ์˜ j๋ฒˆ์งธ ํ–‰์„ ๋Œ€์‹ ์— ๋ผ๊ณ  ํ‘œ๊ธฐํ•˜๋Š” ๊ฒƒ์„ ์„ ํ˜ธํ•œ๋‹ค๋Š” ์ ์„ ์•Œ์•„๋‘์„ธ์š”. โ†ฉ

  4. bias๋Š” ๊ณต๊ฐ„์„ ์ ˆ์•ฝํ•˜๊ธฐ ์œ„ํ•ด ํ‘œ๊ธฐํ•˜์ง€ ์•Š์•˜์Šต๋‹ˆ๋‹ค. โ†ฉ

  5. ์ด ๋…ผ๋ฌธ ์ „์ฒด์—์„œ ์šฐ๋ฆฌ๋Š” Logit๊ณผ ALTI-Logit์„ ๋Œ€์กฐ์  ๋ณ€ํ™”๋ฅผ ๋น„๊ตํ•˜๊ธฐ ์œ„ํ•ด ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค. โ†ฉ