
"P" Markers revisited



G'day,

There is an interesting article in the International Journal of
Lexicography on an equivalent of P-markers:


 An Investigation into the Star-rated Words in English-Japanese
 Learner's Dictionaries
 Kiyomi Chujo (chujo@cit.nihon-u.ac.jp) and
 Shuji Hasegawa (shase@alto.ocn.ne.jp)

  Most English-Japanese learner's dictionaries indicate the
importance and vocabulary level of specific entries by attaching one,
two, or three stars to each word. Using one monolingual (COBUILD) and
four bilingual (Genius, Lexis, Wisdom, and Progressive) learner's
dictionaries, the researchers compared the star-rated words with (a)
junior/senior high school (JSH) vocabulary to determine denotation
validity, (b) high frequency words in the British National Corpus to
assess similarity to present-day English, and (c) other materials such
as university exams, TOEIC tests, magazines and news broadcasts.
Findings show minimal consistency in the selection of star-rated words
between the examined dictionaries, and although generally very useful,
a large percentage of the JSH level vocabulary found in the
dictionaries might not be taught in junior and senior high school
textbooks in Japan.

     ________________________________
    1 Progressive was first published in 1980 and its JSH level
star-rated words were assumed to be selected in accordance with the
Course of Study issued by Japan's Ministry of Education in 1969
and 1970, in which 4,700 words at most were recommended for teaching
at JSH schools. The Course of Study was next revised in 1977 and 1978
and the recommended number of the JSH words was reduced to 2,950 words
at most. Although Progressive was revised in 1987, 1998, and 2003,
considering the large number of Progressive's JSH level star-rated
words, the dictionary might have its own policy on these star-rated
words selection.

 2 One learner normally uses only one series of JSH textbooks. In
this study the most widely used textbooks in Japanese schools from the
7th to the 12th grade were used as a representative of the JSH texts.
In our 1994 study (Chujo et al. 1994) we looked at five series of
junior high school textbooks (15 books) and 37 series of senior high
school textbooks (95 books) and found insignificant differences in the
number of known words in the targeted texts among each textbook
series.

 3 Diametrically opposite to this observation, we can see the text
coverage of each dictionary's JSH level star-rated entries over the
JSH textbook vocabulary. The star-rated words of Genius cover 96.6%,
Lexis covers 95.4%, Wisdom covers 96.5%, Progressive covers 97.6%, and
COBUILD covers 92.0% of the words used in the JSH textbooks. The
percentages demonstrate that the dictionaries' JSH level star-rated
words sufficiently, though not completely, cover the JSH textbook
vocabulary. This information might be useful for dictionary compilers.

The full paper is linked from the author's web-site:
http://ijl.oxfordjournals.org/cgi/reprint/ecl008?ijkey=zBR9EiNOs7EOjGJ&keytype=ref
It appears that deciding on the importance of vocabulary is a difficult task.

I can think of three ways of trying to improve the current ratings,
but as I don't use them myself, I am not so motivated.

(a) we could do a corpus count (or web corpus count) and mark the most
frequent words.  The catch is that it is very hard to get a fully
representative corpus: Amano and Kondo, for example, found that in over
ten years of newspaper text the word 唐揚げ never appeared, although it
is a very familiar word.

(b) we could compare the vocabulary to the familiarity ratings in
日本語語彙特性, and mark various bands of familiarity, although the IP issues
could be murky.
@Book{Goitokusei,
 author =	 "Shigeaki Amano and Tadahisa Kondo",
 title =	 "Nihongo-no Goi-Tokusei (Lexical properties of
                 Japanese)",
 publisher =	 "Sanseido",
 year =	 1999
}

(c) we should keep the current markers and correct individual ones
that appear to be wrong, preferably backing up our intuitions with
some kind of evidence (such as GPB "Ghits per billion documents"
http://itre.cis.upenn.edu/~myl/languagelog/archives/000953.html).
There is currently a way to add P-markers (spec1 and spec2: I don't
know the difference), but no way of removing them...  I am not sure if
it is worth adding another flag just to show this.
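
For the evidence in (c), a GPB figure is just the raw hit count scaled
to a nominal index of one billion documents.  A minimal sketch, assuming
you have a hit count and an estimate of the search engine's index size
(the function name and the numbers below are made up for illustration):

```python
def ghits_per_billion(hits: int, index_size: int) -> float:
    """Scale a raw hit count to hits per billion indexed documents.

    Both arguments are assumptions supplied by the caller: `hits` is the
    count the search engine reports, `index_size` is an estimate of how
    many documents it has indexed.
    """
    if index_size <= 0:
        raise ValueError("index_size must be positive")
    # Multiply first so that round numbers come out exact.
    return hits * 1_000_000_000 / index_size


# Hypothetical figures: 4,000 hits against an 8-billion-document index.
example_gpb = ghits_per_billion(4_000, 8_000_000_000)
```

Comparing GPB for a marked and an unmarked word would give a rough,
index-size-independent sanity check on an individual marker.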

I think IP problems rule out (b).  If anyone has a big corpus then it
would be nice to try (a), but for the moment we should stick with (c).
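
If someone does have a corpus, (a) could be sketched roughly as follows.
This assumes the corpus has already been segmented into words by some
morphological analyser (not shown), and the function name, the cut-off
and the toy data are all hypothetical:

```python
from collections import Counter


def mark_frequent(tokens, entries, top_n):
    """Return the top_n dictionary entries by corpus frequency.

    tokens  : iterable of already-segmented corpus words (assumed input)
    entries : set of dictionary headwords eligible for a marker
    top_n   : how many of the most frequent entries to mark
    """
    # Count only tokens that are dictionary entries.
    counts = Counter(t for t in tokens if t in entries)
    # Sort by descending count, then by the word itself so ties are stable.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {word for word, _ in ranked[:top_n]}


# Toy newspaper-like corpus: a familiar word such as 唐揚げ never occurs,
# so it can never be marked, whatever the cut-off.
toy_tokens = ["政府", "経済", "政府", "首相", "経済", "政府"]
toy_entries = {"政府", "経済", "首相", "唐揚げ"}
marked = mark_frequent(toy_tokens, toy_entries, 2)
```

The toy data shows the weakness mentioned under (a): a word that is
missing from the corpus gets no marker, however familiar it is.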

--
Francis Bond <http://www2.nict.go.jp/x/x161/en/member/bond/>
NICT Computational Linguistics Group