Nikkaji Parallel Corpus

Masao Utiyama
Jun Kawai
Mon May 7 12:29:16 JST 2018

Download

About

The Nikkaji Parallel Corpus is released under a Creative Commons Attribution 4.0 International (CC BY 4.0) License.

The contents are:

wc -l dev.* test.* train.*
     5607 dev.en
     5607 dev.id
     5607 dev.ja
     5705 test.en
     5705 test.id
     5705 test.ja
  2818784 train.en
  2818784 train.id
  2818784 train.ja

We extracted only those chemical compound names that have a one-to-one correspondence between Japanese and English.

{train,dev,test}.en contain the English chemical compound names
{train,dev,test}.ja contain the Japanese chemical compound names
{train,dev,test}.id contain the Nikkaji IDs
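
Since the three files of each split have the same number of lines, they can be read side by side. The following is a minimal sketch, assuming that line i of each file refers to the same compound and that the files are UTF-8 encoded; neither assumption is stated above, and the function name read_parallel is only illustrative.

```python
from pathlib import Path


def read_parallel(directory, split):
    """Yield (nikkaji_id, japanese, english) triples for one split.

    Assumes line i of {split}.id, {split}.ja and {split}.en describe the
    same compound, and that the files are UTF-8 encoded (both assumptions).
    """
    base = Path(directory)
    columns = []
    for ext in ("id", "ja", "en"):
        with open(base / f"{split}.{ext}", encoding="utf-8") as f:
            # Strip only the line terminator; spaces inside names are kept.
            columns.append([line.rstrip("\n") for line in f])
    # The three files must be the same length to be line-aligned.
    if len({len(col) for col in columns}) != 1:
        raise ValueError(f"{split}: files have different line counts")
    yield from zip(*columns)
```

For example, list(read_parallel(".", "dev")) would return one triple per line of the dev files, provided they sit in the current directory.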

Why this parallel corpus is interesting

Translations of chemical compound names must be exact: even a slight mistake in a translation yields a different chemical compound.

Under this condition, BLEU scores are not very useful, because a translation/transliteration system must reproduce each compound name exactly.
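
One simple alternative to BLEU under this requirement is exact-match accuracy, the fraction of outputs that equal the reference string. The sketch below is an illustration only, not an official scoring script for this corpus; the function name and the choice to ignore leading and trailing whitespace are assumptions.

```python
def exact_match_accuracy(hypotheses, references):
    """Return the fraction of hypotheses identical to their reference.

    Leading and trailing whitespace is ignored (an assumption); any other
    difference, even a single character, counts as an error.
    """
    if len(hypotheses) != len(references):
        raise ValueError("hypotheses and references differ in length")
    if not references:
        return 0.0
    correct = sum(h.strip() == r.strip() for h, r in zip(hypotheses, references))
    return correct / len(references)


# "methanal" (formaldehyde) differs from "methanol" by one letter,
# yet it names a different compound, so it earns no credit.
print(exact_match_accuracy(["ethanol", "methanal"], ["ethanol", "methanol"]))  # 0.5
```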

Since the accuracy requirement differs from that of standard machine translation settings, this parallel corpus may provide interesting case studies for the machine translation community.

In addition, industry has a real need for a good machine translation system for chemical compounds, so the results of this research are important to practitioners.

Where is this corpus from?

This parallel corpus was built from the Nikkaji DB, a database of chemical compounds that is available as NBDC_NikkajiRDF_main.tar.gz at https://dbarchive.biosciencedbc.jp/jp/nikkaji/download.html. The copyright notice of NBDC_NikkajiRDF is "NBDC NikkajiRDF (C) 国立研究開発法人 科学技術振興機構 licensed under CC 表示2.1日本" (that is, copyright Japan Science and Technology Agency, licensed under CC Attribution 2.1 Japan).