
Re: Search-only senses



I like this approach. LLMs seem like they could make quick work of this, but they are too expensive and slow to run client-side. Moving that work to the database generation layer, and then adding human review, seems like a good idea.

Chris

On Fri Dec 8, 2023, 08:46 AM GMT, Adam Nohejl <mailto:adam@***********> wrote:
> Hi everyone,
> I also feel that it's better to think about this as the two problems outlined by Kim, even though there is potentially a lot of overlap, and definitely a lot of room for fuzzy search. Let me add a few points:
> - Why adding English verb glosses to n,vs entries is a good idea:
>   - All other dictionaries I know do that (at least if the verb usage is common), people expect that.
>   - Many of these words are used as verbs in most of their occurrences, and noun (gerund) glosses often feel a little forced (e.g. 就職: finding employment; getting a job).
>   - Adding the verb glosses would make "n,vs" entries consistent with vs-only entries, which already have them.
>   - In addition to improving E->J search, verb glosses would be a clear and conspicuous way of stating that the word can be used as a verb (compared to the POS tags, which are not very conspicuous in most apps and I assume most users don't bother to read them).
>   - Not all applications are going to do elaborate client-side processing of search queries. High quality on-the-fly processing (beyond lemmatization/stemming/rules) would either require a large language model or a small on-device fine-tuned model.
> - What might be a good way to start adding the glosses:
>   - Use a corpus to determine which vs+n entries are frequently used as verbs (e.g. have high N+する frequency, ignoring other occurrences), so that we know which entries need the verb senses the most.
>   - Optionally, use a "computational method" to annotate them with provisional verb glosses to be reviewed by humans.
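
A rough sketch of the corpus step, assuming the corpus has already been segmented into a flat list of surface tokens. The function name `count_suru_uses` and the toy token list are made up for illustration; a real pipeline would need a morphological analyzer and would also have to match inflected forms such as した or して.

```python
from collections import Counter

def count_suru_uses(tokens, candidates):
    """Count how often each candidate headword is directly followed by する.

    Only N+する occurrences are counted; all other occurrences are ignored.
    `tokens` is a hypothetical pre-segmented corpus (list of surface forms).
    """
    counts = Counter()
    for current, following in zip(tokens, tokens[1:]):
        if current in candidates and following == "する":
            counts[current] += 1
    # Most frequent verb uses first, so these entries can be handled first.
    return counts.most_common()

# Toy data, not real corpus counts.
toy_tokens = ["就職", "する", "就職", "する", "就職", "活動", "料理", "を", "料理", "する"]
ranking = count_suru_uses(toy_tokens, {"就職", "料理"})
```

The resulting ranking is only a way to decide which entries to look at first; it says nothing about what the verb glosses should be.
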
> As for the "computational method": With very little money it would be easy to do using a commercial LLM. I tried ChatGPT on a dozen examples and it did a pretty good (i.e. educated human-grade) job, given that I used only the English glosses. Adding the Japanese words may (or may not) improve it. GPT-3.5 now costs $0.001/$0.002 per 1K input/output tokens, so potentially we could get provisional glosses for a thousand entries for a few dollars. The glosses would still require human review, but this would save a ton of work. It seems that there are only 13,969 "vs" entries (I guess most of them "n,vs"), correct?
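
The cost estimate can be checked with a few lines of arithmetic. The prices and the entry count come from the message; the per-entry token sizes (300 prompt tokens, 50 completion tokens) are assumptions picked for illustration, not measurements.

```python
INPUT_PRICE_PER_1K = 0.001   # USD per 1K input tokens, as quoted in the thread
OUTPUT_PRICE_PER_1K = 0.002  # USD per 1K output tokens, as quoted in the thread

def estimate_cost(entries, input_tokens_per_entry, output_tokens_per_entry):
    """Estimated USD cost of generating provisional glosses for `entries` entries."""
    input_cost = entries * input_tokens_per_entry / 1000 * INPUT_PRICE_PER_1K
    output_cost = entries * output_tokens_per_entry / 1000 * OUTPUT_PRICE_PER_1K
    return input_cost + output_cost

# Assumed sizes: 300 prompt tokens and 50 completion tokens per entry.
per_thousand = estimate_cost(1_000, 300, 50)
all_vs_entries = estimate_cost(13_969, 300, 50)
```

Under these assumptions a thousand entries come to about $0.40 and all 13,969 "vs" entries to about $5.59, so even with much longer prompts the total stays in the range of a few dollars.
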
> --
> Adam Nohejl
> On 8 Dec 2023, at 16:24, Kim Ahlström wrote:
>> I wonder if there are two separate problems to solve here, with some overlap: having glosses that explain the full range of uses of a headword, and handling user queries of a form that’s not in the dictionary data.
>>
>> I think my original idea of search-only senses only solves the latter, and I agree that computational tools can go a long way here. As Jim mentioned, it can be hard to get this working well though. For example, I think it would be hard for a lemmatizer to find 結婚 when searching for “marry”, since the noun entry is “marriage” and they are separate lemmas. Ironically some stemmers do a better job here, but I’ve generally avoided them since the output is not always natural language.
>>
>> I think there are several upsides to adding verb forms to n,vs entries. It would make common Japanese words findable using common English words. It would benefit all clients using JMdict, not just the systems that implement linguistic smarts. It would also clarify word usage. Someone searching for “to marry” and finding 結婚/“marriage” would not necessarily know that this is the most common way, or a way at all, to write “to marry”. This could be especially hard for non-native English speakers.
>>
>> I quite like Jim’s idea of delineating verb forms with something like {vf}, since it would allow clients to format the entry as they prefer - ~する like GG5, or maybe an English explanation like “as a verb:”, without requiring changes to the XML schema. It would still require some language smarts to turn “to land” queries into “land”, but would be simple enough that clients could do it brute force without a separate stemmer/lemmatizer.
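
A minimal sketch of what the brute-force client side could look like. It assumes the verb forms are serialized in the same gloss string after a literal `{vf}` marker, which is only one possible reading of Jim's suggestion; the helper names `split_vf`, `normalize_query` and `lookup` are hypothetical.

```python
def split_vf(gloss_text):
    """Split a gloss string into (noun_glosses, verb_glosses) at the {vf} marker."""
    noun_part, _, verb_part = gloss_text.partition("{vf}")

    def parse(part):
        return [g.strip() for g in part.split(";") if g.strip()]

    return parse(noun_part), parse(verb_part)

def normalize_query(query):
    """Drop a leading infinitive 'to ' so that 'to land' becomes 'land'."""
    q = query.strip().lower()
    return q[3:] if q.startswith("to ") else q

def lookup(query, entries):
    """Return headwords with a noun or verb gloss that exactly matches the query."""
    q = normalize_query(query)
    hits = []
    for headword, gloss_text in entries.items():
        nouns, verbs = split_vf(gloss_text)
        if q in nouns or q in verbs:
            hits.append(headword)
    return hits

# Glosses taken from the examples in this thread; the serialization is assumed.
entries = {
    "着陸": "landing; alighting; touch down {vf} land; alight; set down",
    "結婚": "marriage",
}
```

With this toy data `lookup("to land", entries)` finds 着陸 through the verb forms, while `lookup("to marry", entries)` finds nothing because 結婚 has no verb forms yet. A client that wants to display the verb forms as ~する or "as a verb:" can format the second list returned by `split_vf` however it likes.
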
>>
>> Since adding the verb forms would be quite an undertaking, maybe a combined approach could be used: a one-time computational process to add verb forms as hidden glosses. These could then bit by bit be looked over by editors, starting with more common words, turning them into well-written visible glosses. Yes, I’m aware that I’m asking for a lot from the editorial group here 😅 But the more I think about this approach the less I like it. I shudder a bit at the thought of having machine-made text inside JMdict, even if it would be hidden data.
>>
>> Cheers
>> Kim
>>
>>> On Dec 7, 2023, at 17:24, Chris Vasselli <clindsay@gmail.com> wrote:
>>>
>>> Personally my instinct is that this should be handled by clients computationally, but I could be convinced otherwise!
>>>
>>> It seems like part of a broader problem of matching user queries that use words in a different form from how they're written in the dictionary. For example, someone might still search for "made a landing" instead of "make a landing", even if "make a landing" were added to the dictionary. So you still need to deal with transforming user queries in some way. In my iOS app I use the "porter" tokenizer of sqlite3 for this for English, and Apple's built-in natural language lemma support for non-English languages.
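
For illustration, here is a small sketch of stemmed full-text search with SQLite's "porter" tokenizer, using Python's `sqlite3` module. It assumes an SQLite build with FTS5 enabled, and the table layout is invented for the example; it is not a description of how the app itself is built.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# FTS5 table whose gloss column is tokenized and stemmed by the porter tokenizer.
conn.execute(
    "CREATE VIRTUAL TABLE glosses USING fts5(headword UNINDEXED, gloss, tokenize='porter')"
)
conn.execute(
    "INSERT INTO glosses VALUES (?, ?)",
    ("着陸", "landing; alighting; touch down"),
)

def search(query):
    """Return headwords whose glosses match every term of the stemmed query."""
    rows = conn.execute(
        "SELECT headword FROM glosses WHERE glosses MATCH ?", (query,)
    )
    return [row[0] for row in rows]
```

In this toy setup `search("land")` finds 着陸 because `landing` and `land` share a stem, while `search("to land")` returns nothing: every query term has to match, and `to` does not occur in the gloss. So stemming on its own still leaves the query-rewriting part of the problem open.
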
>>>
>>> Granted, it's not perfect, and I just checked and my app also fails to find 着陸 for "to land". But I feel like trying to solve this with manual additions to JMdict will only solve one small part of a larger problem that kind of inherently needs a computational solution.
>>>
>>> Just my initial thought though, curious to hear what others think.
>>>
>>> Chris
>>>
>>> On Thu Dec 7, 2023, 06:31 AM GMT, Jim Breen <mailto:jimbreen@*********> wrote:
>>>> Thanks, Kim, for raising this.
>>>>
>>>> Support for E->J lookups has always been a thing of interest, and we
>>>> often included glosses that can assist. That said, it's recognized
>>>> that the practice of not including verb or adjective glosses for
>>>> (n,vs) and (n,adj-*) entries can make such lookups difficult. You
>>>> won't easily find 料理 by looking up "to cook". About 20 years ago I
>>>> did some experimenting within WWWJDIC with taking a search key such as
>>>> "to XXXX" and converting it to possible targets such as "XXXXing". It
>>>> was noisy and only partially successful, and eventually I gave up.
>>>> (Ironically it would have worked with 着陸.)
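
The kind of "to XXXX" to "XXXXing" conversion described here can be sketched as follows. This is an illustration of the general idea with a few naive spelling rules, not the code that was used in WWWJDIC; the function name `ing_candidates` is hypothetical.

```python
def ing_candidates(query):
    """Turn a 'to XXXX' search key into possible 'XXXXing' search targets."""
    q = query.strip().lower()
    if not q.startswith("to "):
        return [q]
    verb = q[3:]
    candidates = [verb + "ing"]                     # land -> landing
    if verb.endswith("e") and not verb.endswith("ee"):
        candidates.append(verb[:-1] + "ing")        # make -> making
    if (
        len(verb) >= 3
        and verb[-1] not in "aeiouwxy"
        and verb[-2] in "aeiou"
        and verb[-3] not in "aeiou"
    ):
        candidates.append(verb + verb[-1] + "ing")  # run -> running
    return candidates
```

`ing_candidates("to land")` yields `landing`, which matches the existing 着陸 gloss. The noise is easy to see as well: "to run" produces both `runing` and `running`, and "to marry" only produces `marrying`, which never reaches a gloss like "marriage".
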
>>>>
>>>> Certainly adding verb glosses, either as new senses or within the
>>>> existing senses would help, but it would be a major task - about
>>>> 13,000 entries are of the "n,vs" variety. I hadn't even thought about
>>>> "hidden glosses", but it's an interesting concept. Rough versions
>>>> could be created automatically, but I think human involvement would be
>>>> needed to get any reliability, and if work is going to be needed the
>>>> results may as well be visible.
>>>>
>>>> If you look at the 着陸 entry in GG5, it has:
>>>> (a) landing; alighting; 〔接地〕 a touchdown.
>>>> ~する land; make a landing; alight; 〔接地〕 touch [put, set] down.
>>>>
>>>> You could envisage the current JMdict glosses ("landing; alighting;
>>>> touch down") being extended with something like:
>>>> "{vf} land; alight; set down". That would allow dictionary systems to
>>>> respond to keys such as "to land". A sense extension of this form
>>>> would not upset the sense numbering.
>>>>
>>>> Anyway, food for thought, and thanks for raising it. I'll be
>>>> interested to see what the community thinks.
>>>>
>>>> Jim
>>>>
>>>> On Wed, 6 Dec 2023 at 18:34, Kim Ahlström <kim.ahlstrom@gmail.com> wrote:
>>>>>
>>>>> Hi folks,
>>>>>
>>>>> A Jisho.org user recently emailed me to ask why he could not find 着陸 when searching for "to land". Since it's tagged as a noun and suru verb the definition is written in the noun form (landing; alighting; touch down).
>>>>>
>>>>> The editorial policy specifically states that these entries should not include verb glosses, but allows it for entries where the verb sense cannot be easily derived from the noun sense, and for vs entries that are not also n.
>>>>>
>>>>> Is the intent here that verb senses could be derived computationally by dictionary software for vs+n entries to make them findable as verbs? A computational approach seems within the realm of possibility, but a human-curated approach would be more accurate.
>>>>>
>>>>> Since we now have search-only readings, could we introduce search-only senses or glosses to make finding these vs+n entries easier when searching in English using verb forms?
>>>>>
>>>>> Cheers
>>>>> Kim
>>>>>
>>>>> --
>>>>> You received this message because you are subscribed to the Google Groups "EDICT-JMdict" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send an email to edict-jmdict+unsubscribe@googlegroups.com.
>>>>> To view this discussion on the web visit https://groups.google.com/d/msgid/edict-jmdict/02d86fe4-7412-4352-91da-f3125a55dc3an%40googlegroups.com.
>>>>
>>>>
>>>>
>>>> --
>>>> Jim Breen
>>>> Adjunct Snr Research Fellow, Japanese Studies Centre, Monash University
>>>> http://www.jimbreen.org/
>>>> http://nihongo.monash.edu/
>>>>
>>>
>>

-- 
You received this message because you are subscribed to the Google Groups "EDICT-JMdict" group.
To unsubscribe from this group and stop receiving emails from it, send an email to edict-jmdict+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/edict-jmdict/a1f3e1f4-657e-44a2-9c6a-208cd0ec05bc%40mail.shortwave.com.