入門自然言語処理 (Natural Language Processing with Python), Chapter 1

February 13, 2011 - 入門自然言語処理

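I bought 入門自然言語処理 back in November and had left it sitting untouched, but today I finally started reading it.

Today I covered Chapter 1. It has two main themes:

Simple text processing with Python and nltk
A bird's-eye view of natural language processing

I'll leave the actual content to the book; here I'll write up my answers to the exercises.

To keep things brief, the problem statements are worded differently from the book.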

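A problem I'm not happy with

There was just one problem that didn't sit right with me: "17. Use text9.index() to extract a sentence containing 'sunset' from text9."

Googling for answers turned up only this one: 1-15. bで始まる単語抽出 - 入門自然言語処理

Sure enough,

>>> text9.index('sunset')
629

comes back. But the word 'sunset' actually occurs many times, so there is more than one sentence containing it. Checking looks like this:

>>> [t for t in text9 if t == 'sunset']
['sunset', 'sunset', 'sunset', 'sunset', 'sunset', 'sunset', 'sunset', 'sunset', 'sunset', 'sunset', 'sunset', 'sunset', 'sunset', 'sunset']

So shouldn't I be extracting all of those sentences? Setting aside how to extract every one of them, this time I only wrote an answer that extracts a single sentence.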

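Exercises

Setup

As long as nltk and matplotlib can be imported, the following line is all you need.

from nltk.book import *

Below are my answers to all the problems. (These are my own answers; there is no guarantee that they are correct.)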

1. Use the Python interpreter as a calculator and try entering expressions such as 12 / (4 + 1)

>>> 12 / (4 + 1)
2.3999999999999999
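As a side note, the float result above depends on true division being in effect. Here is a minimal sketch of the difference, written so that it behaves the same on Python 2 and Python 3. The variable names are made up for illustration.

```python
# On Python 3, / is always true division. On Python 2, an int / int
# expression floors unless one operand is a float (as here) or
# "from __future__ import division" is in effect.
true_div = 12.0 / (4 + 1)   # keeps the fractional part
floor_div = 12 // (4 + 1)   # floor division drops it
no_parens = 12.0 / 4 + 1    # without parentheses, the division happens first

print(true_div)
print(floor_div)
print(no_parens)
```

The last line shows why the parentheses matter: without them the expression is (12 / 4) + 1, not 12 / 5.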

2. With a 26-letter alphabet, there are 26 to the power of 10 (i.e. 26 ** 10) possible 10-letter strings. How many are there for 100-letter strings?

>>> 26 ** 100
3142930641582938830174357788501626427282669988762475256374173175398995908420104023465432599069702289330964075081611719197835869803511992549376L

3. What happens when you run ['Monty', 'Python'] * 20 or 3 * sent1?

>>> ['Monty', 'Python'] * 3
['Monty', 'Python', 'Monty', 'Python', 'Monty', 'Python']

4. How many words does text2 contain? And how many once duplicates are removed?

>>> # number of words
>>> len(text2)
141576
>>> # number of words with duplicates removed
>>> len(set(text2))
6833

5. Between humor and romance fiction, which has the greater lexical diversity?

Humor: diversity = 6.9
Romance fiction: diversity = 8.3

→ Romance fiction is higher.
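For reference, here is roughly how a score like these could be computed. This is only a sketch: I am assuming the score is the number of tokens divided by the number of distinct tokens, the helper name tokens_per_type and the toy token list are made up, and the book's table may define the score differently.

```python
def tokens_per_type(tokens):
    """Average number of times each distinct token is used (hypothetical helper)."""
    return len(tokens) / float(len(set(tokens)))

toy = ['the', 'cat', 'sat', 'on', 'the', 'mat', 'the', 'end']
score = tokens_per_type(toy)  # 8 tokens / 6 distinct tokens
print(round(score, 2))
```

Under this assumed definition a higher score means words are reused more often, so it is worth checking which direction the book's table uses before reading off the answer.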

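6. Draw a dispersion plot for Elinor, Marianne, Edward, and Willoughby in Sense and Sensibility.

Even the names are given in Japanese in the book, so I didn't know the original spellings... so the first step was to look them up.

>>> [t for t in set(text2) if t.startswith('El')]
['Elinor', 'Eliza']

I searched for all four names this way, and then plotted them.

>>> text2.dispersion_plot(['Elinor', 'Marianne', 'Edward', 'Willoughby'])

* The plot window appears at this point.

Differences in the roles of men and women

The women appear more frequently.

Can you identify the couples?

No. I have no idea how I'm supposed to...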

7. Find the collocations in text5.

>>> text5.collocations()
Building collocations list
wanna chat; PART JOIN; MODE #14-19teens; JOIN PART; PART PART;
cute.-ass MP3; MP3 player; JOIN JOIN; times .. .; ACTION watches; guys
wanna; song lasts; last night; ACTION sits; -…)…- S.M.R.; Lime
Player; Player 12%; dont know; lez gurls; long time

8. The purpose of len(set(text4))

It counts the distinct words (the vocabulary size) of text4.

9. Lists and strings

(a) Define a variable and try the two ways of displaying it

>>> my_string = 'My String'
>>> my_string
'My String'
>>> print my_string
My String

(b) Concatenation and multiplication

>>> my_string + my_string
'My StringMy String'
>>> my_string * 3
'My StringMy StringMy String'

10. Define a variable my_sent that holds a list of words

(a) Use ' '.join(my_sent) to convert the list into a string.

>>> ' '.join(my_string.split(' '))
'My String'

(b) Use split() to turn the string back into a list

>>> my_sent = ['My', 'Sent']
>>> ' '.join(my_sent)
'My Sent'
>>> ' '.join('hoge moge'.split(' '))
'hoge moge'

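11. List concatenation

>>> phrase1 = ['hoge']
>>> phrase1 += ['moge']
>>> phrase2 = ['foo']
>>> phrase2 += ['bar']
>>> phrase1 + phrase2
['hoge', 'moge', 'foo', 'bar']

What is the difference between len(phrase1 + phrase2) and len(phrase1) + len(phrase2)?

The first is the length of the concatenated list; the second is the sum of the two individual lengths. Both give the same value (4 here).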

12. Which of these is more relevant to NLP?

"Monty Python"[6:12]
["Monty", "Python"][1]

13. What does sent1[2][2] represent?

The third character of the third element (indices start at 0).

14. Getting the indices of an element

It's a bit clumsy, though.

>>> i = 0
>>> for t in sent3:
...   if t == 'the':
...     print i
...   i += 1
...
1
5
8
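A tidier way to do the same thing is enumerate(), which hands back the index along with each element. Here is a sketch using a hand-typed copy of sent3 so that it runs without nltk; the name sent3_copy is made up.

```python
sent3_copy = ['In', 'the', 'beginning', 'God', 'created',
              'the', 'heaven', 'and', 'the', 'earth', '.']

# enumerate() yields (index, token) pairs, so no manual counter is needed
the_positions = [i for i, t in enumerate(sent3_copy) if t == 'the']
print(the_positions)
```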

15. Getting the words that start with b

>>> sorted(set([t for t in text5 if t.startswith('b')]))
['b', 'b-day', 'b/c', 'b4', ...

16. About range()

>>> range(10)
[0, 1, 2, ..., 9]
>>> range(10, 20)
[10, 11, 12, ..., 19]
>>> range(10, 20, 2)
[10, 12, ..., 18]
>>> range(20, 10, -2)
[20, 18, ..., 12]
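The transcript above is from Python 2, where range() returns a list. On Python 3 it returns a lazy range object instead, so wrapping it in list() is needed to see the values. A small sketch that works on both (the variable names are made up):

```python
# list() is required on Python 3, where range() is lazy; it is harmless on Python 2
first_ten = list(range(10))           # starts at 0, stops before 10
teens = list(range(10, 20))           # stops before 20
evens = list(range(10, 20, 2))        # step of 2
countdown = list(range(20, 10, -2))   # negative step counts down, stops before 10

print(first_ten)
print(teens)
print(evens)
print(countdown)
```

The point to remember is that the stop value is never included, whichever direction the range runs.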

17. Use text9.index() to extract a sentence containing 'sunset' from text9

I'm not sure this is what was intended, but here is my answer for now.

dot_pre = 0    # position of the last '.' before 'sunset'
dot_aft = 0    # position of the first '.' after 'sunset'
found = False  # whether 'sunset' has been seen yet
for i, t in enumerate(text9):
  if t == 'sunset':
    found = True
  if t == '.':
    if found:
      dot_aft = i
      break
    else:
      dot_pre = i

Build the sentence containing 'sunset'.

' '.join(text9[dot_pre+1:dot_aft+1])

'CHAPTER I THE TWO POETS OF SAFFRON PARK THE suburb of Saffron Park lay on the sunset side of London , as red and ragged as a cloud of sunset .'
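As for the question of multiple sentences, one way to pull out every sentence containing a word is to split the token list at '.' and keep the pieces that contain it. This is only a sketch under my own assumptions: sentences_containing is a made-up helper, the token list is a toy stand-in for text9, and treating '.' as the only sentence boundary is crude (it would break on abbreviations such as initials).

```python
def sentences_containing(tokens, word, boundary='.'):
    """Return every boundary-delimited run of tokens that contains word."""
    found = []
    current = []
    for t in tokens:
        current.append(t)
        if t == boundary:
            if word in current:
                found.append(' '.join(current))
            current = []
    # keep a trailing sentence that has no closing boundary
    if word in current:
        found.append(' '.join(current))
    return found

toy_tokens = ['The', 'sunset', 'was', 'red', '.',
              'He', 'walked', 'home', '.',
              'Another', 'sunset', 'came', '.']
for s in sentences_containing(toy_tokens, 'sunset'):
    print(s)
```

With nltk loaded, passing text9 in place of toy_tokens should work the same way, since the loop only needs to iterate over tokens.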

18. Compute the vocabulary contained in sent1 through sent8

len(set(sent1 + sent2 + sent3 + sent4 + sent5 + sent6 + sent7 + sent8))

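19. What is the difference between the following two lines?

sorted(set([w.lower() for w in text1]))
sorted([w.lower() for w in set(text1)])

sorted(set([w.lower() for w in text1]))

Builds a list of lowercased words, then removes duplicates and sorts.

sorted([w.lower() for w in set(text1)])

Takes the unique words, lowercases them, and sorts. This result is the larger of the two, because it still contains duplicates: words that differed only in case, such as 'The' and 'the', both end up as 'the'.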
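A toy example makes the difference visible. The token list here is made up; with text1 the same effect shows up on a larger scale.

```python
tokens = ['The', 'the', 'Whale', 'whale', 'the', 'sea']

# lowercase first, then deduplicate: every word appears once
lower_then_unique = sorted(set([w.lower() for w in tokens]))

# deduplicate first, then lowercase: case variants collapse into repeats
unique_then_lower = sorted([w.lower() for w in set(tokens)])

print(lower_then_unique)
print(unique_then_lower)
```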

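20. What is the difference between w.isupper() and w.islower()?

isupper() returns True if the letters in w are all uppercase; islower() returns True if they are all lowercase.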
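Note that the two are not opposites: a token can fail both tests. A quick sketch with made-up tokens:

```python
samples = ['NLTK', 'python', 'Monty', '2011', 'b4']

for w in samples:
    print(w, w.isupper(), w.islower())

# mixed-case tokens and tokens with no letters at all are neither upper nor lower
neither = [w for w in samples if not w.isupper() and not w.islower()]
print(neither)
```

This is why filtering with "not w.islower()" is not the same as filtering with "w.isupper()".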

21. Write a slice expression that extracts the last two words.

>>> text2[-2:]

22. Get the four-letter words, in order of decreasing frequency.

>>> FreqDist([t for t in text4 if len(t)==4]).keys()
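One caveat: relying on keys() coming back in frequency order depends on the nltk version (as far as I know, newer releases do not sort keys() and offer most_common() instead). Here is a version-independent sketch using collections.Counter from the standard library, on a made-up token list.

```python
from collections import Counter

toy_tokens = ['that', 'with', 'that', 'from', 'with', 'that', 'a', 'people']

# count only the four-letter tokens, then list them most frequent first
counts = Counter(t for t in toy_tokens if len(t) == 4)
by_frequency = [word for word, n in counts.most_common()]
print(by_frequency)
```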

23. Print the uppercase words, one per line

>>> for w in [t for t in text6 if t.isupper()]:
...   print w
...

24. Lists of words that match a condition

a: ending in ize

[t for t in text6 if t.endswith('ize')]

b: containing z

[t for t in text6 if 'z' in t]

c: containing pt

[t for t in text6 if 'pt' in t]

d: initial capital followed by lowercase (i.e. titlecase)

[t for t in text6 if t.isalpha() and t == t.title()]

25. listed = ['she', 'sells', 'sea', 'shells', 'by', 'the', 'sea', 'shore']

Words beginning with sh

[t for t in listed if t.startswith('sh')]

Words longer than four characters

[t for t in listed if len(t) > 4]

26. About sum([len(w) for w in text1])

What does it do?

It adds up the lengths of all the words in the list.

Can it be used to compute the average word length?

>>> sum([len(w) for w in text1])/len(text1)
3.8304111280236488

27. Define vocab_size(text), which returns the vocabulary size

>>> def vocab_size(text):
...   return len(set(text))
...
>>> vocab_size(text1)
19317

28. Define percent(word, text)

>>> def percent(word, text):
...   return 100.0 * text.count(word) / len(text)
...
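Here is a quick check on a toy list, with the function redefined so that the snippet runs on its own. Using count() and scaling by 100 are my own choices, and the toy sentence is made up.

```python
def percent(word, text):
    """How often word occurs in text, as a percentage of all tokens."""
    return 100.0 * text.count(word) / len(text)

toy = ['the', 'cat', 'and', 'the', 'hat']
print(percent('the', toy))  # 'the' is 2 of the 5 tokens
print(percent('dog', toy))  # a word that never occurs
```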

29. About set(sent3) < set(text1)

Try running it

True

What happens when you run it with a different text?

>>> set(sent3) < set(text3)
True

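What practical applications might it have?

Comparing the vocabularies of documents, for example checking whether every word of one document also appears in another.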
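To spell out what the comparison does: < on sets tests for a proper subset, meaning every word on the left also appears on the right and the right has at least one word more. A sketch with made-up vocabularies:

```python
small_vocab = set(['in', 'the', 'beginning'])
large_vocab = set(['in', 'the', 'beginning', 'god', 'created'])

is_proper_subset = small_vocab < large_vocab   # every word is covered, and large has extras
self_proper = small_vocab < small_vocab        # a set is not a proper subset of itself
self_subset = small_vocab <= small_vocab       # but it is a (non-proper) subset of itself

# words that one vocabulary has and the other lacks
missing = large_vocab - small_vocab

print(is_proper_subset, self_proper, self_subset)
print(sorted(missing))
```

The set difference on the last lines is handy when the comparison comes back False and you want to know which words caused it.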
