This tutorial is adapted from a collection of other tutorials in both English and Japanese. It assumes you have Python 3 installed. Links to the original source pages are provided throughout.



Installing Mecab


On Windows

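In Windows you have several options. One is to use apt-get to install Mecab. Note that apt-get is a Debian/Ubuntu tool, so on Windows it is only available inside a Linux environment such as the Windows Subsystem for Linux: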

sudo apt-get install libmecab-dev
sudo apt-get install mecab mecab-ipadic-utf8
pip3 install mecab-python3

https://pypi.python.org/pypi/mecab-python3/0.7

If you get the following error:

Command "python setup.py egg_info" failed with error code 1 in /private/var/folders/bg/7wyn3chj2m3bhmxrglv6m0wr0000gn/T/pip-install-_wqpyjn5/mecab-python3/
You are using pip version 10.0.1, however version 18.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.


Upgrade pip and setuptools, then run the install again:

pip3 install --upgrade pip setuptools
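
To see which versions of pip and setuptools are installed before or after an upgrade, something like the following sketch could be used. It assumes Python 3.8 or later (for `importlib.metadata`), and the helper name `installed_version` is made up for this illustration.

```python
from importlib import metadata  # part of the standard library in Python 3.8+


def installed_version(package):
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Print the versions of the packages that matter for this installation step.
for name in ("pip", "setuptools", "mecab-python3"):
    print(name, installed_version(name))
```

A value of `None` for mecab-python3 simply means the wrapper has not been installed yet.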

You can also use the .exe (executable) file for a user-friendly installation.

Notes:

  • You will get mojibake (e.g. “æ–‡å—化㠒”) after you first install Mecab. To fix this, you must change the region settings/locale to Japan.
  • You can use Mecab from the command prompt. Navigate to the directory with mecab and type “MeCab.Ink”, or “MeCab.Ink -h” for help.
  • If you’re still getting mojibake, then you may not have the correct dictionaries loaded. You can check this by typing “MeCab.Ink -D”, which will give you info on the dictionary.
    • The charset has to be “SHIFT-JIS”.
    • You can change it by going into the Windows menu, navigating to Mecab, and recompiling the software with SHIFT-JIS or another charset.
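
Mojibake appears when bytes written in one character encoding are read back using a different one, which is why the dictionary charset and the system locale have to agree. The sketch below only illustrates that mechanism in plain Python; it does not touch MeCab, and the sample word and variable names are chosen for this example.

```python
# A minimal illustration of how mojibake arises from an encoding mismatch.
text = "文字化け"

# Encode as UTF-8, then (wrongly) read the bytes as Latin-1.
utf8_bytes = text.encode("utf-8")
garbled = utf8_bytes.decode("latin-1")

# Because Latin-1 maps every byte to a character, the damage is reversible here.
repaired = garbled.encode("latin-1").decode("utf-8")

# The same text has a different byte sequence in Shift-JIS.
sjis_bytes = text.encode("shift_jis")

print(repr(garbled))                     # unreadable accented characters
print(repaired)                          # the original text again
print(len(utf8_bytes), len(sjis_bytes))  # byte lengths differ between encodings
```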

There is a helpful YouTube tutorial for Mecab that shows how to use wakati (word segmentation) and chasen (part-of-speech tagging) output.
- https://www.youtube.com/watch?v=1wqwWji4u0E&feature=g-upl




On Mac


First, download the source file for MeCab from:
http://taku910.github.io/mecab/#download

In terminal, go to the directory where the mecab tar file is. Do the following:

$ tar xvfz mecab-0.996.tar.gz  #to extract the files (untar it)
$ cd mecab-0.996               #go to the directory with the configure files
$ ./configure --enable-utf8-only --prefix=/usr/local/mecab
$ make
$ sudo make install


Note: move the downloaded source file somewhere outside Dropbox before building. If you don’t, you may get an error like the one below:

test -z "/usr/local/lib" || ./install-sh -c -d "/usr/local/lib"
 /bin/sh ./libtool   --mode=install /usr/bin/install -c   libcrfpp.la '/usr/local/lib'
libtool: install: /usr/bin/install -c .libs/libcrfpp.0.dylib /usr/local/lib/libcrfpp.0.dylib
libtool: install: (cd /usr/local/lib && { ln -s -f libcrfpp.0.dylib libcrfpp.dylib || { rm -f libcrfpp.dylib && ln -s libcrfpp.0.dylib libcrfpp.dylib; }; })
libtool: install: /usr/bin/install -c .libs/libcrfpp.lai /usr/local/lib/libcrfpp.la
libtool: install: /usr/bin/install -c .libs/libcrfpp.a /usr/local/lib/libcrfpp.a
libtool: install: chmod 644 /usr/local/lib/libcrfpp.a
libtool: install: ranlib /usr/local/lib/libcrfpp.a
/bin/sh: /Users/auroratsai/Dropbox: No such file or directory
make[1]: *** [install-libLTLIBRARIES] Error 127
make: *** [install-am] Error 2


Installing the IPA dictionary (IPA辞書のインストール)


1) Download the IPA dictionary file (IPA辞書のソースをダウンロード)
http://taku910.github.io/mecab/#download

2) In terminal, go to the directory where the downloaded tar file is located.

$ tar xvfz mecab-ipadic-2.7.0-20070801.tar.gz   #untar the file
$ cd mecab-ipadic-2.7.0-20070801
$ ./configure --prefix=/usr/local/mecab --with-mecab-config=/usr/local/mecab/bin/mecab-config --with-charset=utf8
$ make
$ sudo make install

Source: https://qiita.com/taroc/items/b9afd914432da08dafc8

If you get an error while installing the IPA dictionary, this page provides suggestions for fixing it: https://cryptogun.blogspot.com/2017/06/mecab-and-ipadic-installation.html



Install the “easy” way with Homebrew (macOS) or LinuxBrew (Debian/Ubuntu)


This is the method I used, since I encountered a number of errors with my installation. Molly DesJardin provides the following documentation.

1) Paste into terminal:

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"


or get LinuxBrew

#prepare environment for download
$ sudo apt-get update
$ sudo apt-get upgrade -y
$ sudo apt-get install -y build-essential make cmake scons curl git \
                               ruby autoconf automake autoconf-archive \
                               gettext libtool flex bison \
                               libbz2-dev libcurl4-openssl-dev \
                               libexpat-dev libncurses-dev

#clone linuxBrew
$ git clone https://github.com/Homebrew/linuxbrew.git ~/.linuxbrew

# Until LinuxBrew is fixed, the following is required.
# See: https://github.com/Homebrew/linuxbrew/issues/47
$ export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig:/usr/local/lib64/pkgconfig:/usr/lib64/pkgconfig:/usr/lib/pkgconfig:/usr/lib/x86_64-linux-gnu/pkgconfig:/usr/lib64/pkgconfig:/usr/share/pkgconfig:$PKG_CONFIG_PATH

## Setup linux brew and update environmental variables
$ export LINUXBREWHOME=$HOME/.linuxbrew
$ export PATH=$LINUXBREWHOME/bin:$PATH
$ export MANPATH=$LINUXBREWHOME/man:$MANPATH
$ export PKG_CONFIG_PATH=$LINUXBREWHOME/lib64/pkgconfig:$LINUXBREWHOME/lib/pkgconfig:$PKG_CONFIG_PATH
$ export LD_LIBRARY_PATH=$LINUXBREWHOME/lib64:$LINUXBREWHOME/lib:$LD_LIBRARY_PATH

#test installation
$ which brew
#(path of installation displays)
$ echo $PKG_CONFIG_PATH
#(path of config displays)

2) Then install mecab

$ brew install mecab
$ brew install mecab-ipadic


3) Optional: Change Your MeCab dictionary
You can find all of the NINJAL dictionaries for unidic here:
http://chamame.ninjal.ac.jp/chamame_unidic_download.html

You can’t use the -Ochasen option with unidic; it has to be -Owakati. See the source pages linked below for sample code.

Open the file /usr/local/etc/mecabrc and change the line

dicdir =  /usr/local/lib/mecab/dic/ipadic

to

dicdir =  /usr/local/lib/mecab/dic/unidic

Make sure you move your preferred dictionary to the unidic directory. Just copy everything that you downloaded from the NINJAL site into it, with the directory structure intact. Your unidic will now be your default dictionary, and you can use it in MeCab Python (remember, with -Owakati as your parser option, not -Ochasen). It also works in rmecab, but as I’m not an R programmer, I won’t cover that here.
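
The dicdir change can also be scripted. The following is a minimal sketch that rewrites the dicdir line in the text of a mecabrc file; the function name `set_dicdir` is hypothetical, and the sketch works on a string so that nothing on disk is modified until you choose to write the result back yourself.

```python
import re


def set_dicdir(rc_text, new_dicdir):
    """Return rc_text with every active 'dicdir = ...' line pointing at new_dicdir."""
    # Match lines that start with 'dicdir =' and keep the original spacing.
    # Commented-out lines (starting with ';') are not matched.
    pattern = re.compile(r"^([ \t]*dicdir[ \t]*=[ \t]*).*$", flags=re.MULTILINE)
    updated, count = pattern.subn(lambda m: m.group(1) + new_dicdir, rc_text)
    if count == 0:
        raise ValueError("no dicdir line found in the configuration text")
    return updated


# Example configuration text, modeled on the lines shown in this tutorial.
sample = "; example configuration\ndicdir =  /usr/local/lib/mecab/dic/ipadic\n"
print(set_dicdir(sample, "/usr/local/lib/mecab/dic/unidic"))
```

Keeping a backup copy of the original mecabrc before overwriting it makes it easy to switch back to ipadic.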

Adapted from: http://mollydesjardin.com/guides/mecabinstructions.html
https://github.com/buruzaemon/natto-py/wiki/Requirements


4) Check your installation

Check that it works by typing mecab into your terminal and then entering some Japanese text. MeCab should print one line per morpheme, followed by EOS.
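
The same check can be run from Python without the wrapper by calling the command-line program directly. This is a rough sketch that assumes the executable is named `mecab` and is on your PATH; the helper name `run_mecab` is made up for this example.

```python
import shutil
import subprocess


def run_mecab(text, executable="mecab"):
    """Send text to the mecab command-line tool and return its output.

    Returns None when the executable cannot be found on PATH.
    """
    path = shutil.which(executable)
    if path is None:
        return None
    # Passing encoding="utf-8" makes subprocess exchange text rather than bytes.
    completed = subprocess.run(
        [path], input=text, capture_output=True, encoding="utf-8"
    )
    return completed.stdout


output = run_mecab("すもももももももものうち")
print(output if output is not None else "mecab was not found on PATH")
```

If the function returns None, the install location is probably missing from PATH. Empty output usually points to a dictionary problem instead.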




Use Python/Mecab in R


If you are interested in running Python chunks in R Markdown, the reticulate package provides easy interoperability between the two.

Warning: Passing objects between R and Python chunks (the pieces of code in an R Markdown document) interactively is only supported as of the RStudio v1.2 preview release. With earlier versions it only works when you knit the document, not when you run the code chunk by chunk.

Documentation:
https://cran.r-project.org/web/packages/reticulate/vignettes/r_markdown.html



1) Set your default python in R
It is helpful to make sure your default python version is set up in your .Rprofile. You can edit your .Rprofile by typing the following into your Console:

file.edit(file.path("~", ".Rprofile"))

Then paste the following into it:

library(reticulate)
Sys.setenv(RETICULATE_PYTHON = "/usr/local/bin/python3")  #Put this in your .Rprofile


2) Load reticulate and Python3 in your R setup chunk

Python chunks work similarly to R chunks within R Markdown, providing text or graphical outputs. The two languages have full access to each other’s objects if you convert them in your code, including NumPy arrays and Pandas data frames.
By default, reticulate uses the version of Python found on your PATH (i.e. Sys.which("python")). If you want to use a different version, add one of the use_python() family of functions to your R Markdown setup chunk. If you want to use Python 3, make sure you indicate that in your setup.

library(reticulate)
knitr::opts_chunk$set(echo = TRUE)

Sys.which("python3")
##                  python3 
## "/usr/local/bin/python3"
# use_python(python = "/usr/local/bin/python3", required = T) or
use_python(python = Sys.which("python3"), required = T)

# py_discover_config()  # discover which Python will be used without actually loading Python
py_config()
## python:         /usr/local/bin/python3
## libpython:      /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/config-3.7m-darwin/libpython3.7.dylib
## pythonhome:     /Library/Frameworks/Python.framework/Versions/3.7:/Library/Frameworks/Python.framework/Versions/3.7
## version:        3.7.1 (v3.7.1:260ec2c36a, Oct 20 2018, 03:13:28)  [Clang 6.0 (clang-600.0.57)]
## numpy:          /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/numpy
## numpy_version:  1.15.4
## 
## NOTE: Python version was forced by use_python function

If you want to import Python packages such as MeCab into R, you can do that with reticulate's import() function:

mecab <- import("MeCab")  
mecab$VERSION
## [1] "0.996"
# py_help(mecab)



See what paths for Python are being referenced

import sys
for p in sys.path:
  print(p)
## /Library/Frameworks/Python.framework/Versions/3.7/bin
## /Library/Frameworks/Python.framework/Versions/3.7/lib/python37.zip
## /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7
## /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/lib-dynload
## /Users/auroratsai/Library/Python/3.7/lib/python/site-packages
## /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages
## /Library/Frameworks/R.framework/Versions/3.5/Resources/library/reticulate/python
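
If the MeCab import fails inside a Python chunk, it helps to confirm which interpreter is running and whether that interpreter can see the module at all. A small sketch follows; `is_importable` is a hypothetical helper name.

```python
import importlib.util
import sys


def is_importable(module_name):
    """Return True if a top-level module can be found by this interpreter."""
    return importlib.util.find_spec(module_name) is not None


# The interpreter path should match the one you configured for reticulate.
print("Interpreter:", sys.executable)
print("MeCab importable:", is_importable("MeCab"))
```

If the interpreter path is not the Python 3 you expected, revisit the RETICULATE_PYTHON setting or the use_python() call in your setup chunk.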



Now Python chunks should work in R Markdown. Keep reading for examples of what Python chunks look like.

—————-

Install the mecab python wrapper


This is pretty straightforward.
1) In Terminal:

pip3 install mecab-python3
# optional: print the expected path of the mecab-ipadic-neologd dictionary (only relevant if you have installed it)
echo `mecab-config --dicdir`"/mecab-ipadic-neologd"



2) Test out Mecab in Python

# -*- coding: utf-8 -*-
import sys  # needed for sys.argv below
import MeCab
# pass any command-line arguments straight through to MeCab
t = MeCab.Tagger(" ".join(sys.argv))
ex = """
見た人たちがみんな携帯に話していること。
誰でも連絡が取らない、仕事や他のことから離れて、心を留守になること。
誰でも連絡が取れない、自分の時間があること。"""
ex2 = """
仕事や他のことから離れて、自分のメンータルを世の中から留守になること。
鎖が長くなるなら、他の人から離れて自分のだけのことになる。もし鎖が短くしばりつけるなら、自分の空間が失い、己をなくしてしまうかもしれない。
こういう気持ちがよくある。例えば、なんか気分がモヤモヤしている時は、自分だでの時間と空間が欲しく、他の人からの連絡が取りたくなくなる。
自分の部屋を閉じて、布団の中で好きな本を読む。携帯もパソコンも電源をつかない。
鎖がある。例えば、授業を受けて、クラスメートたちと連絡すること。こういうことは鎖のように、私と他の人をつないでいる。
私は愛を求めている。みんなは自分だけの時間が欲しいので、管理官のようにいつでもみんな在りかを分かるより、一緒に過ごしたい時間はみんなで、自分で過ごしたい時間は自分で過ごすことにする。
中国では、人間関係のことを「飲み物」の比喩をしている。いい友達の関係は水のようにちょっと離れて、薄いけど、人にとって不可欠だ。別の原因で、友達に見えるような関係は油のようにいつも濃くて、美味しく見えるけど、毎日するとすぐ飽きてしまう。
もちろん必要だ。誰でも自分しか知らないことがある。だから、こんなことのためにも、自分だけの時は必要だ。人の考えが一切構わず、やりたいことだけをやる。
"""
#using the Mecab rc tagger
tagger = MeCab.Tagger('mecabrc')
mecab_result = tagger.parse(ex)
print(mecab_result)
## 見    動詞,自立,*,*,一段,連用形,見る,ミ,ミ
## た    助動詞,*,*,*,特殊・タ,基本形,た,タ,タ
## 人    名詞,一般,*,*,*,*,人,ヒト,ヒト
## たち   名詞,接尾,一般,*,*,*,たち,タチ,タチ
## が    助詞,格助詞,一般,*,*,*,が,ガ,ガ
## みんな  名詞,代名詞,一般,*,*,*,みんな,ミンナ,ミンナ
## 携帯   名詞,サ変接続,*,*,*,*,携帯,ケイタイ,ケイタイ
## に    助詞,格助詞,一般,*,*,*,に,ニ,ニ
## 話し   動詞,自立,*,*,五段・サ行,連用形,話す,ハナシ,ハナシ
## て    助詞,接続助詞,*,*,*,*,て,テ,テ
## いる   動詞,非自立,*,*,一段,基本形,いる,イル,イル
## こと   名詞,非自立,一般,*,*,*,こと,コト,コト
## 。    記号,句点,*,*,*,*,。,。,。
## 誰    名詞,代名詞,一般,*,*,*,誰,ダレ,ダレ
## でも   助詞,副助詞,*,*,*,*,でも,デモ,デモ
## 連絡   名詞,サ変接続,*,*,*,*,連絡,レンラク,レンラク
## が    助詞,格助詞,一般,*,*,*,が,ガ,ガ
## 取ら   動詞,自立,*,*,五段・ラ行,未然形,取る,トラ,トラ
## ない   助動詞,*,*,*,特殊・ナイ,基本形,ない,ナイ,ナイ
## 、    記号,読点,*,*,*,*,、,、,、
## 仕事   名詞,サ変接続,*,*,*,*,仕事,シゴト,シゴト
## や    助詞,並立助詞,*,*,*,*,や,ヤ,ヤ
## 他    名詞,一般,*,*,*,*,他,タ,タ
## の    助詞,連体化,*,*,*,*,の,ノ,ノ
## こと   名詞,非自立,一般,*,*,*,こと,コト,コト
## から   助詞,格助詞,一般,*,*,*,から,カラ,カラ
## 離れ   動詞,自立,*,*,一段,連用形,離れる,ハナレ,ハナレ
## て    助詞,接続助詞,*,*,*,*,て,テ,テ
## 、    記号,読点,*,*,*,*,、,、,、
## 心    名詞,一般,*,*,*,*,心,ココロ,ココロ
## を    助詞,格助詞,一般,*,*,*,を,ヲ,ヲ
## 留守   名詞,サ変接続,*,*,*,*,留守,ルス,ルス
## に    助詞,格助詞,一般,*,*,*,に,ニ,ニ
## なる   動詞,自立,*,*,五段・ラ行,基本形,なる,ナル,ナル
## こと   名詞,非自立,一般,*,*,*,こと,コト,コト
## 。    記号,句点,*,*,*,*,。,。,。
## 誰    名詞,代名詞,一般,*,*,*,誰,ダレ,ダレ
## でも   助詞,副助詞,*,*,*,*,でも,デモ,デモ
## 連絡   名詞,サ変接続,*,*,*,*,連絡,レンラク,レンラク
## が    助詞,格助詞,一般,*,*,*,が,ガ,ガ
## 取れ   動詞,自立,*,*,一段,未然形,取れる,トレ,トレ
## ない   助動詞,*,*,*,特殊・ナイ,基本形,ない,ナイ,ナイ
## 、    記号,読点,*,*,*,*,、,、,、
## 自分   名詞,一般,*,*,*,*,自分,ジブン,ジブン
## の    助詞,連体化,*,*,*,*,の,ノ,ノ
## 時間   名詞,副詞可能,*,*,*,*,時間,ジカン,ジカン
## が    助詞,格助詞,一般,*,*,*,が,ガ,ガ
## ある   動詞,自立,*,*,五段・ラ行,基本形,ある,アル,アル
## こと   名詞,非自立,一般,*,*,*,こと,コト,コト
## 。    記号,句点,*,*,*,*,。,。,。
## EOS



Yay, it works! You now can segment Japanese into morphemes for natural language processing with part-of-speech tags.
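
For further processing it is usually more convenient to work with structured tokens than with the printed text. The sketch below parses output in the format shown above into (surface, features) pairs. It assumes the surface form and the feature list are separated by a tab, as in MeCab's default output, and that the features follow the ipadic layout where the first field is the part of speech and the seventh is the base form. The function name `parse_mecab_output` is made up for this example.

```python
def parse_mecab_output(raw):
    """Turn MeCab's default text output into a list of (surface, features) tuples."""
    tokens = []
    for line in raw.splitlines():
        # Skip blank lines and the end-of-sentence marker.
        if not line or line == "EOS":
            continue
        surface, _, feature_str = line.partition("\t")
        tokens.append((surface, feature_str.split(",")))
    return tokens


# Two lines copied from the output above, with the tab separator restored.
sample = (
    "見\t動詞,自立,*,*,一段,連用形,見る,ミ,ミ\n"
    "た\t助動詞,*,*,*,特殊・タ,基本形,た,タ,タ\n"
    "EOS\n"
)

for surface, features in parse_mecab_output(sample):
    # features[0] is the part of speech and features[6] the base form (ipadic layout).
    print(surface, features[0], features[6])
```

With the wrapper installed, the string returned by tagger.parse(ex) can be passed to this function in place of the sample.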