OT: detecting proper nouns
Background: I'm trying to extract likely sentence templates from a large
corpus (*). My basic plan is to change numbers to [NUMBER] and proper
nouns to [PROPER_NOUN] (I might later split the proper nouns into
[NAME], [PLACE], [OTHER], etc.).
Numbers are easy; I'm now considering ways to handle proper nouns.
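For the numbers, something like this is what I mean (an untested Python
sketch; the example sentence is invented, and kanji numerals like 三百
would need extra handling):

    import re

    def mask_numbers(sentence):
        # Replace a run of ASCII or full-width digits, optionally
        # continued by commas or decimal points, with [NUMBER].
        return re.sub('[0-9０-９][0-9０-９,，.．]*', '[NUMBER]', sentence)

    print(mask_numbers('終値は１７，２３４円（前日比１２３円高）'))
    # --> 終値は[NUMBER]円（前日比[NUMBER]円高）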
My first thought was chasen, but its output is rather fine-grained. Is
there an alternative where, e.g., verb endings are kept with the verb,
successive verbs are merged [1], and runs of nouns are merged [2]?
I think I could write a post-parser for chasen to do this myself, by
detecting the main POS type (名詞 "noun", 動詞 "verb", etc.) and grouping
words while it doesn't change, plus a few extra rules to handle things
like a conjunctive particle (助詞-接続助詞) coming straight after a verb.
But does anyone here know of an existing project that has already done
something similar, or of an alternative to chasen that would suit my
purposes better?
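Roughly what I have in mind, as an untested Python sketch. It assumes
chasen is on the PATH with a UTF-8 dictionary (older installs default to
EUC-JP) and emits its usual tab-separated six-field lines with EOS
between sentences:

    import subprocess

    def chunk_chasen(text):
        # Run chasen and merge consecutive tokens whose top-level POS
        # (the part before the first '-') stays the same; also glue a
        # conjunctive particle (助詞-接続助詞) onto a preceding verb,
        # as in [1] below.  Returns (surface, main_pos, proper) triples.
        out = subprocess.run(['chasen'], input=text,
                             capture_output=True, encoding='utf-8').stdout
        chunks = []
        prev = None
        for line in out.splitlines():
            if line == 'EOS':           # sentence boundary: stop grouping
                prev = None
                continue
            fields = line.split('\t')
            if len(fields) < 4:
                continue
            surface, pos = fields[0], fields[3]
            main = pos.split('-')[0]    # 名詞, 動詞, 助詞, ...
            proper = '固有名詞' in pos  # token is (part of) a proper noun
            if pos == '助詞-接続助詞' and prev == '動詞':
                s, m, p = chunks[-1]
                chunks[-1] = (s + surface, m, p)  # e.g. 強まっ + て
                continue                          # stay in "verb" state
            if chunks and main == prev and main in ('名詞', '動詞'):
                s, m, p = chunks[-1]
                chunks[-1] = (s + surface, m, p or proper)
            else:
                chunks.append((surface, main, proper))
            prev = main
        return chunks

With the tokens in [1] that gives 強まって売り as a single verb chunk,
and with [2] it gives 東京株式市場 as a single noun chunk.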
(I also thought of using the list of Wikipedia article titles as a list
of proper nouns. But it contains common nouns too, and anyway I suspect
matching against it would get messy.)
Any suggestions welcome, thanks,
Darren
*: As background to the background, this "large corpus" is actually
sentences for which my experimental MT system has no translation, or a
low-confidence translation, and at this stage I am trying to get a feel
for how I can most efficiently increase its coverage.
[1]: E.g. (columns: surface, reading, base form, POS, conjugation type,
conjugation form)
強まっ ツヨマッ 強まる 動詞-自立 五段・ラ行 連用タ接続
て テ て 助詞-接続助詞
売り ウリ 売る 動詞-自立 五段・ラ行 連用形
--> 強まって売り (roughly "strengthening, [and] selling")
[2]: E.g.
東京 トウキョウ 東京 名詞-固有名詞-地域-一般
株式 カブシキ 株式 名詞-一般
市場 シジョウ 市場 名詞-一般
--> 東京株式市場 ("the Tokyo stock market")
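Continuing the untested sketch from above: once the runs are merged,
the tagging itself is trivial, though it shows a question I still need
to settle, namely that the whole run 東京株式市場 gets swallowed into
[PROPER_NOUN] when arguably only 東京 is the proper noun:

    def templatize(chunks):
        # Replace any chunk that contained a 固有名詞 token with the
        # placeholder; everything else passes through unchanged.
        return ''.join('[PROPER_NOUN]' if proper else surface
                       for surface, main, proper in chunks)

    print(templatize([('東京株式市場', '名詞', True)]))  # chunk from [2]
    # --> [PROPER_NOUN]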
--
Darren Cook, Software Researcher/Developer
http://dcook.org/mlsn/ (English-Japanese-German-Chinese-Arabic
open source dictionary/semantic network)
http://dcook.org/work/ (About me and my work)
http://dcook.org/blogs.html (My blogs and articles)