
Re: [edict-jmdict] Website worriers



I have this problem a lot at www.sljfaq.org/cgi/. One person, who had got fed up with being blocked and had offered to pay me to convert his file of mostly alphabetic company names, recently sent me the code he was using to access the site. It was actually scraping the HTML version and then turning it into JSON, even though the default output of the site is JSON. I couldn't believe it. Also, most of the work was just turning a, b, c, etc. into katakana, so all he really needed to do to finish most of his job was write a script covering the 26 letters of the alphabet. Yet he thought he needed to send each of these 300,000 company names to my site one by one.
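
For illustration, the kind of local script meant here might look roughly like the sketch below. The letter readings in LETTER_KANA are an assumption (the common Japanese names of the letters), not necessarily what the site itself outputs, and the function name spell_in_katakana is hypothetical.

```python
from typing import Optional

# Assumed letter-name readings; a real conversion table may differ
# (for example, some sources give "ズィー" rather than "ゼット" for Z).
LETTER_KANA = {
    "A": "エー", "B": "ビー", "C": "シー", "D": "ディー", "E": "イー",
    "F": "エフ", "G": "ジー", "H": "エイチ", "I": "アイ", "J": "ジェー",
    "K": "ケー", "L": "エル", "M": "エム", "N": "エヌ", "O": "オー",
    "P": "ピー", "Q": "キュー", "R": "アール", "S": "エス", "T": "ティー",
    "U": "ユー", "V": "ブイ", "W": "ダブリュー", "X": "エックス",
    "Y": "ワイ", "Z": "ゼット",
}


def spell_in_katakana(name: str) -> Optional[str]:
    """Spell a name letter by letter in katakana.

    Returns None if the name contains anything other than the 26
    letters and spaces, so the caller can handle it some other way.
    """
    parts = []
    for ch in name.upper():
        if ch == " ":
            continue  # spaces between words are simply dropped
        kana = LETTER_KANA.get(ch)
        if kana is None:
            return None  # digit, punctuation, or non-Latin character
        parts.append(kana)
    return "".join(parts)
```

With something like this, only the names that come back as None would need any remote lookup at all.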

On Thu, Nov 7, 2019, 10:21 AM Jim Breen jimbreen@********* [edict-jmdict] <edict-jmdict@***************> wrote:
One of the fun things about running a busy website is that you
have to watch out for traffic you don't really want. Since there is
a data charge from the hosting company, I watch out for users
who do silly things like trying to download the whole of JMdict
via wwwjdic, one entry at a time. Most of the semi-professional
harvester sites obey the robots.txt go-away rules, but there are
still the rogues.
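
As a rough sketch of what obeying those rules involves on the client side, a harvester can check each URL against the site's robots.txt before fetching it. The rules in EXAMPLE_ROBOTS_TXT, the user-agent string, and the example.org URLs below are all made up for illustration; they are not the actual rules of any site mentioned here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
"""


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "http://www.example.org/index.html",
        "http://www.example.org/cgi-bin/lookup?q=1",
    ):
        print(url, allowed(EXAMPLE_ROBOTS_TXT, "example-harvester", url))
```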

One user who is annoying me at present is firing requests at the
edform.py script, e.g.
http://www.edrdg.org/jmdictdb/cgi-bin/edform.py?svc=jmdict&sid=&q=1582000
(This is the one that loads up an entry for edit.)

When I noticed them they were sending in 20-30k of these a day.
There is no identifying information in the request, and the IP address
never shows up in the DNS data. I am now blocking them with a
kernel filter, but after a couple of hours they switch to another IP
address and resume. The current culprit is at 85.203.22.34 and
has sent in about 2,000 in the last hour. The log shows an odd client
identifier ending in "Gecko/20041107 Firefox/x.x". That same
pattern is on all the requests I've been blocking, and AFAICT no other
user has it. Anyway, another block is going in.
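
One way to spot the next address quickly is to count requests per IP for that client identifier. The sketch below assumes the server writes the Apache "combined" log format, which may not match the real configuration; the sample lines use documentation-range addresses and are invented, not taken from the actual log.

```python
import re
from collections import Counter

# Assumed layout: Apache "combined" log format.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]*\] "[^"]*" \d{3} \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"$'
)

# The trailing part of the client identifier described above.
SUSPECT_AGENT_SUFFIX = "Gecko/20041107 Firefox/x.x"


def count_suspect_requests(lines):
    """Count requests per IP whose user agent ends with the suspect suffix."""
    counts = Counter()
    for line in lines:
        m = LOG_LINE.match(line.strip())
        if m and m.group("agent").endswith(SUSPECT_AGENT_SUFFIX):
            counts[m.group("ip")] += 1
    return counts


# Invented sample lines for illustration only.
SAMPLE_LINES = [
    '192.0.2.10 - - [07/Nov/2019:10:00:01 +1100] "GET /jmdictdb/cgi-bin/'
    'edform.py?svc=jmdict&sid=&q=1582000 HTTP/1.1" 200 5120 "-" '
    '"Mozilla/5.0 Gecko/20041107 Firefox/x.x"',
    '192.0.2.10 - - [07/Nov/2019:10:00:02 +1100] "GET /jmdictdb/cgi-bin/'
    'edform.py?svc=jmdict&sid=&q=1582010 HTTP/1.1" 200 5044 "-" '
    '"Mozilla/5.0 Gecko/20041107 Firefox/x.x"',
    '192.0.2.20 - - [07/Nov/2019:10:00:03 +1100] "GET /index.html HTTP/1.1" '
    '200 1024 "-" "Mozilla/5.0 (X11; Linux x86_64) Firefox/70.0"',
]

if __name__ == "__main__":
    for ip, n in count_suspect_requests(SAMPLE_LINES).most_common():
        print(ip, n)
```

Matching on the identifier rather than the address means the count keeps working after the sender moves to a new IP.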

Jim


--
Jim Breen
Adjunct Snr Research Fellow, Japanese Studies Centre, Monash University
http://www.jimbreen.org/
http://nihongo.monash.edu/


------------------------------------
Posted by: Jim Breen <jimbreen@*********>
------------------------------------

