Doc2Vec (PV-DM)

Sentences combine into paragraphs, and paragraphs make up a document.

There are many representation learning methods for paragraphs and documents, and Doc2Vec is one of them.

Doc2Vec extends Word2Vec to embed whole documents, and its best-known variant is the model called PV-DM.

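As the architecture figure shows, the model predicts the next word by maximizing the average log probability, just as Word2Vec does, and additionally learns a vector for each paragraph id (the paragraph representation). That extra vector is the paragraph embedding.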

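In the figure above, the embedding vector for paragraph_id is updated every time a context from that paragraph (document, doc) is used in training. Because this vector acts as a memory that persists across all contexts of the paragraph, the model is called the Distributed Memory version of Paragraph Vector (PV-DM).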
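The prediction step and the average log probability described above can be sketched in plain Python. This is only a minimal illustration under assumptions: the toy vocabulary, the dimension, and the names `paragraph_vecs`, `word_vecs`, `output_rows`, `hidden`, and `average_log_prob` are all hypothetical, the vectors are combined by averaging (concatenation is another option), and a full softmax is used instead of the faster approximations real implementations rely on.

```python
import math
import random

random.seed(0)

DIM = 4                                   # embedding size (illustrative)
VOCAB = ["the", "cat", "sat", "on", "mat"]
NUM_PARAGRAPHS = 2

def rand_vec(n):
    return [random.uniform(-0.5, 0.5) for _ in range(n)]

# Hypothetical parameter tables: one vector per paragraph id, one per word,
# and one softmax output row per vocabulary word.
paragraph_vecs = [rand_vec(DIM) for _ in range(NUM_PARAGRAPHS)]
word_vecs = {w: rand_vec(DIM) for w in VOCAB}
output_rows = {w: rand_vec(DIM) for w in VOCAB}

def hidden(paragraph_id, context_words):
    """Average the paragraph vector with the context word vectors."""
    vecs = [paragraph_vecs[paragraph_id]] + [word_vecs[w] for w in context_words]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def log_prob(paragraph_id, context_words, target):
    """log p(target | context words, paragraph) under a full softmax."""
    h = hidden(paragraph_id, context_words)
    scores = {w: sum(a * b for a, b in zip(output_rows[w], h)) for w in VOCAB}
    m = max(scores.values())              # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scores.values()))
    return scores[target] - log_z

def average_log_prob(paragraph_id, tokens, window=2):
    """Average log probability of each word given the `window` words before it."""
    terms = [
        log_prob(paragraph_id, tokens[t - window:t], tokens[t])
        for t in range(window, len(tokens))
    ]
    return sum(terms) / len(terms)

# The quantity that training would try to maximize for paragraph 0.
objective = average_log_prob(0, ["the", "cat", "sat", "on", "the", "mat"])
```

Every context drawn from the same paragraph looks up the same row of `paragraph_vecs`, which is why that row keeps receiving updates for as long as the paragraph is being trained.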

During training the model learns not only the relationships between words but also the paragraph vector. If the id is assigned per document instead of per paragraph, the same procedure yields a document embedding.

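After training, the prediction stage freezes the weight parameters (the word vectors and the softmax weights) and runs the same procedure on the new text, so that only a new paragraph vector is fitted. The vector that best fits the text is the inferred paragraph vector.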
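The inference step can be sketched the same way. The code below is an assumption-laden illustration, not a reference implementation: the randomly initialized `word_vecs` and `output_rows` merely stand in for trained parameters, the names and hyperparameters (`infer_paragraph_vector`, `steps`, `lr`) are hypothetical, and plain gradient ascent on a full softmax replaces whatever optimizer a real library uses. The point is only that the frozen tables are never modified and the paragraph vector is the single thing being fitted.

```python
import math
import random

random.seed(0)

DIM = 4
VOCAB = ["the", "cat", "sat", "on", "mat"]

def rand_vec(n):
    return [random.uniform(-0.5, 0.5) for _ in range(n)]

# Stand-ins for already-trained parameters; nothing below writes to them.
word_vecs = {w: rand_vec(DIM) for w in VOCAB}
output_rows = {w: rand_vec(DIM) for w in VOCAB}

def softmax_probs(h):
    scores = [sum(a * b for a, b in zip(output_rows[w], h)) for w in VOCAB]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def hidden(par_vec, context):
    """Average the paragraph vector with the frozen context word vectors."""
    vecs = [par_vec] + [word_vecs[w] for w in context]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def avg_log_prob(par_vec, tokens, window=2):
    terms = []
    for t in range(window, len(tokens)):
        probs = softmax_probs(hidden(par_vec, tokens[t - window:t]))
        terms.append(math.log(probs[VOCAB.index(tokens[t])]))
    return sum(terms) / len(terms)

def infer_paragraph_vector(tokens, init_vec, window=2, steps=100, lr=0.5):
    """Gradient ascent on the average log probability w.r.t. the paragraph vector only."""
    par_vec = list(init_vec)
    n_pos = len(tokens) - window
    for _ in range(steps):
        grad = [0.0] * DIM
        for t in range(window, len(tokens)):
            probs = softmax_probs(hidden(par_vec, tokens[t - window:t]))
            target = VOCAB.index(tokens[t])
            # d log p / d h = output_row[target] - sum_w p(w) * output_row[w]
            for j, w in enumerate(VOCAB):
                coeff = (1.0 if j == target else 0.0) - probs[j]
                for d in range(DIM):
                    grad[d] += coeff * output_rows[w][d]
        # Averaging makes h depend on par_vec with factor 1 / (window + 1).
        scale = lr / ((window + 1) * n_pos)
        par_vec = [p + scale * g for p, g in zip(par_vec, grad)]
    return par_vec

new_doc = ["the", "cat", "sat", "on", "the", "mat"]
start = rand_vec(DIM)
inferred = infer_paragraph_vector(new_doc, start)
before = avg_log_prob(start, new_doc)
after = avg_log_prob(inferred, new_doc)
```

Comparing `before` and `after` shows how much better the fitted vector explains the unseen text than its random starting point, while the shared parameters stay exactly as training left them.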


jjw
