
Website worriers



One of the fun things about running a busy website is that you
have to watch out for traffic you don't really want. Since there is
a data charge from the hosting company, I watch out for users
who do silly things like trying to download the whole of JMdict
via wwwjdic, one entry at a time. Most of the semi-professional
harvester sites obey the robots.txt go-away rules, but there are
still the rogues.

One user who is annoying me at present is firing requests at the
edform.py script, e.g.
http://www.edrdg.org/jmdictdb/cgi-bin/edform.py?svc=jmdict&sid=&q=1582000
(This is the one that loads up an entry for edit.)

When I noticed them they were sending 20-30k of these requests a day.
There is no identifying information in the requests, and the IP addresses
never show up in the DNS data. I am now blocking them with a
kernel filter, but after a couple of hours they switch to another IP
address and resume. The current culprit is at 85.203.22.34 and
has sent about 2,000 requests in the last hour. The log shows an odd client
identifier ending in "Gecko/20041107 Firefox/x.x". That same
pattern is on all the requests I've been blocking, and AFAICT no other
user has it. Anyway, another block is going in.
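
For anyone wanting to spot the same pattern in their own logs, here is a
rough sketch of how the offending addresses could be tallied. It assumes
the server writes the common Apache "combined" log format, which may not
match the real setup; the function name count_suspect_ips and its
parameters are made up for illustration. Only the script name and the
client-identifier suffix come from the description above.

```python
import re
from collections import Counter

# Assumed Apache "combined" log layout:
# ip ident user [time] "METHOD path PROTO" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(\S+) \S+ \S+ \[[^\]]*\] "\S+ (\S+)[^"]*" \d{3} \S+ "[^"]*" "([^"]*)"'
)

def count_suspect_ips(lines,
                      ua_suffix="Gecko/20041107 Firefox/x.x",
                      path_fragment="edform.py"):
    """Count requests per IP whose path contains path_fragment and whose
    client identifier ends with ua_suffix. Unparseable lines are skipped."""
    counts = Counter()
    for line in lines:
        m = LOG_LINE.match(line)
        if not m:
            continue
        ip, path, agent = m.groups()
        if path_fragment in path and agent.endswith(ua_suffix):
            counts[ip] += 1
    return counts
```

Feeding it an open log file, e.g. count_suspect_ips(open(path)) with
whatever path the server logs to, and then calling most_common() on the
result lists the busiest matching addresses first, which makes it easy
to see when the requester has moved to a new IP.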

Jim


-- 
Jim Breen
Adjunct Snr Research Fellow, Japanese Studies Centre, Monash University
http://www.jimbreen.org/
http://nihongo.monash.edu/