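The sentences need to be segmented into words.
In R, the libraries I used are tmcn and jiebaR.
jiebaR is straightforward:
first create a segmenter with cutter = worker(),
then use cutter to segment sentences.
We can also use new_user_word to add new words to the dictionary.
When the segmenter is created with cutter = worker("tag"), it not only segments the words
but also returns the part-of-speech tag of each word,
so you can pick out just the words with the part of speech you want.
An example follows: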
library(jiebaR)
cutter = worker()
## Segment a sentence into words
segment('中文切詞彙', cutter) ## [1] "中文" "切" "詞彙"
cutter['中文切詞彙'] ## [1] "中文" "切" "詞彙"
## Segment the sentence again, this time with part-of-speech tags attached
cutter = worker("tag")
cutter['中文切詞彙'] ##     nz      v      n
                     ## "中文"   "切" "詞彙"
## Add a word: the text was originally split into "中文" "切";
## after manually adding "中文切", it is segmented as "中文切"
new_user_word(cutter, '中文切', "n") ## "n" is the part-of-speech tag assigned to the new word
cutter['中文切詞彙'] ## "中文切" "詞彙"
## Extract the nouns; reference:
## https://groups.google.com/forum/#!topic/jiebar/nBfIizyVEUw
res = cutter["測試抓出名詞"]
get_noun = function(x){
  stopifnot(inherits(x, "character"))
  ## Use the names of the argument x (the tags), not the global variable res
  index = names(x) %in% c("n","nr","nr1","nr2","nrj","nrf","ns","nsf","nt","nz","nl","ng")
  x[index]
}
## Show all words
res
## Show only the nouns
get_noun(res)
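The noun filter depends only on the names of the tagged vector, so the idea can be tried without jiebaR installed. Below is a minimal sketch in base R, assuming a hand-written named character vector that mimics the shape of a worker("tag") result. The vector `tagged`, its tags, and the helper `keep_tags` are made up for illustration and are not real jiebaR output or part of the jiebaR API.

```r
# Hypothetical tagged result: names are part-of-speech tags, values are words.
# Written by hand to mimic the shape of a worker("tag") result.
tagged <- c(n = "詞彙", v = "切", nz = "中文", d = "很")

# Noun-like tags, the same list used in get_noun.
noun_tags <- c("n", "nr", "nr1", "nr2", "nrj", "nrf",
               "ns", "nsf", "nt", "nz", "nl", "ng")

# Illustrative helper: keep only the words whose tag is in `tags`.
keep_tags <- function(x, tags) {
  stopifnot(is.character(x), !is.null(names(x)))
  x[names(x) %in% tags]
}

nouns <- keep_tags(tagged, noun_tags)  # words tagged n or nz
verbs <- keep_tags(tagged, "v")        # words tagged v
print(nouns)
print(verbs)
```

Passing the tag list as an argument lets the same helper pull out verbs or any other part of speech, instead of hard-coding the noun tags.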