Conversation Dialog Corpus

Example-based dialogue systems often require natural conversation templates as examples for response generation. However, most conversation corpora in previous work were created by hand and do not portray actual conversations between two people well. One way to overcome this problem is to record and transcribe real human-to-human conversations, but this work is tedious and time-consuming. In this work, we instead utilize conversation scripts from television and movies: we extract conversations from scripts available on the web and apply several types of filtering. To ensure that each conversation takes place between two speakers, we introduce a unit of conversation called a tri-turn (a trigram conversation turn), which allows us to filter out conversations with more than two speakers. The resulting corpus contains 86,719 query-response pairs that represent conversation turns between two speakers talking to each other.
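The tri-turn filter can be illustrated with a short sketch. The description above does not spell out the exact rule, so the code assumes that a tri-turn is three consecutive turns in which the first and third turns share a speaker and the second turn belongs to a different speaker (X, Y, X), and that each adjacent pair of turns inside a tri-turn yields one query-response pair. The function name `tri_turn_pairs`, the `(speaker, utterance)` representation, and the sample scene are hypothetical and are not taken from the corpus or its tools.

```python
from typing import List, Tuple

Turn = Tuple[str, str]  # (speaker, utterance); an assumed representation


def tri_turn_pairs(turns: List[Turn]) -> List[Tuple[str, str]]:
    """Return query-response pairs covered by at least one tri-turn.

    Assumption: a tri-turn is three consecutive turns spoken by X, Y, X
    with X != Y. Turns are addressed by position so that overlapping
    tri-turns do not emit the same adjacent pair twice.
    """
    pair_starts = set()
    for i in range(len(turns) - 2):
        first, second, third = turns[i], turns[i + 1], turns[i + 2]
        if first[0] == third[0] and first[0] != second[0]:
            # Both adjacent pairs inside the tri-turn are kept.
            pair_starts.update((i, i + 1))
    return [(turns[i][1], turns[i + 1][1]) for i in sorted(pair_starts)]


# Made-up scene: a third speaker joins after the first three turns.
scene = [
    ("ANNA", "Did you see the game?"),
    ("BEN", "I missed it."),
    ("ANNA", "It was great."),
    ("CARL", "What game?"),
    ("BEN", "Never mind."),
]
pairs = tri_turn_pairs(scene)
```

In this scene only the first three turns form a tri-turn, so `pairs` holds two query-response pairs and the turns involving the third speaker are dropped.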

Research Papers

Lasguido Nio, Sakriani Sakti, Graham Neubig, Tomoki Toda, Satoshi Nakamura.
Utilizing Human-to-Human Conversation Examples for a Multi Domain Chat-oriented Dialog System.
IEICE Transactions on Information and Systems. June 2014.

Lasguido Nio, Sakriani Sakti, Graham Neubig, Tomoki Toda, Satoshi Nakamura.
Conversation Dialog Corpora from Television and Movie Scripts.
The Oriental Chapter of International Committee for the Co-ordination and Standardization of Speech Databases and Assessment Techniques (OCOCOSDA). September 2014.

How to download

This corpus is available for academic, non-commercial use only. If you are interested in using the corpus, please fill out the user agreement and send a signed and scanned copy to Sakriani Sakti (ssakti at is.naist.jp). Note that we only accept user agreements signed by a research laboratory professor or the head/leader of a research team. Upon receipt of a valid agreement, we will provide a link to download the corpus. The size of the corpus is 13.4 MB.