
RE: [edict-jmdict] DTD format



Glenn Maynard wrote:
The <?xml?> tag in jmdict.dtd needs encoding="utf-8" to parse with Expat
and libxml2 (through Python), or they both give obscure errors (Python's
Expat gives "error in processing external entity reference" and with
some extra digging to pull out the error message it helpfully discards,
"text declaration not well-formed".  libxml2 just as usefully says:
"Space needed here", referring to expecting a space after version="1.0".)

Did you mean to say "kanjidic" above?  Looking at month-old
copies of kanjidic2 and JMdict, I see an explicit encoding
declaration in JMdict and no declaration in kanjidic2.
(I don't have a copy of kanjidic handy, and am guessing
it's the same as kanjidic2.)
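
A quick way to check what a given copy actually declares is to
look at the first bytes of the file.  This is only a rough sketch:
declared_encoding() is a hypothetical helper (not part of any
JMdict or kanjidic2 tooling), the sample byte strings are made up
for illustration, and it assumes an ASCII-compatible encoding
(it will not recognize UTF-16 files).

```python
import re

# Hypothetical helper for illustration only.
# Matches an XML declaration at the very start of the data and
# captures the value of its encoding pseudo-attribute, if any.
_DECL_RE = re.compile(
    rb'^<\?xml\s[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']')

def declared_encoding(head):
    """Return the encoding named in the XML declaration, or None."""
    if head.startswith(b'\xef\xbb\xbf'):    # skip a UTF-8 byte order mark
        head = head[3:]
    m = _DECL_RE.match(head)
    return m.group(1).decode('ascii') if m else None

def declared_encoding_of_file(path):
    """Same check, applied to the first 200 bytes of a file."""
    with open(path, 'rb') as f:
        return declared_encoding(f.read(200))

# Made-up sample headers, not copied from the real files.
with_decl = b'<?xml version="1.0" encoding="UTF-8"?>\n<root/>'
without_decl = b'<?xml version="1.0"?>\n<root/>'
print(declared_encoding(with_decl))
print(declared_encoding(without_decl))
```

Running declared_encoding_of_file() on your own jmdict and
kanjidic2 copies would show whether they match what I see here.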

However, my understanding is that the encoding declaration
is optional, with UTF-8 assumed when it is absent:

From section 2.8 of the XML 1.0 spec:

  [23] XMLDecl ::= '<?xml' VersionInfo EncodingDecl? SDDecl? S? '?>'

The spec also says in sec 4.3.3:

"In the absence of information provided by an external transport
protocol (e.g. HTTP or MIME), it is a fatal error for an entity
including an encoding declaration to be presented to the XML
processor in an encoding other than that named in the declaration,
or for an entity which begins with neither a Byte Order Mark nor
an encoding declaration to use an encoding other than UTF-8."
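
To illustrate that rule, here is a small sketch using the
standard-library ElementTree parser (Expat underneath).  The
sample documents and the parses() helper are invented for this
example; it only covers document entities fed to the parser as
bytes, with no declared encoding in either case.

```python
import xml.etree.ElementTree as ET

def parses(data):
    """Return True if the parser accepts the given bytes."""
    try:
        ET.fromstring(data)
        return True
    except ET.ParseError:
        return False

text = '日本語'
doc = '<?xml version="1.0"?><a>' + text + '</a>'

# No encoding declaration, bytes are UTF-8: accepted, text round-trips.
utf8_bytes = doc.encode('utf-8')
assert ET.fromstring(utf8_bytes).text == text

# No XML declaration at all, bytes are UTF-8: also accepted.
assert parses(('<a>' + text + '</a>').encode('utf-8'))

# No encoding declaration, but bytes are Shift_JIS: the parser
# assumes UTF-8, hits an invalid byte sequence, and rejects it.
sjis_bytes = doc.encode('shift_jis')
assert not parses(sjis_bytes)
```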

The following script works for me with both jmdict and
kanjidic2 inputs, with either ElementTree (which uses Expat
under the covers) or lxml.  (Change the value of 'outp_encoding'
in main() as needed, and run with the jmdict or kanjidic2 input
filename as the argument.)

~~~~~~~~
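import sys
import xml.etree.ElementTree as ElementTree
#import lxml.etree as ElementTree

def main():
    outp_encoding = 'sjis'       # change to suit your terminal
    out = sys.stdout.buffer      # write encoded bytes directly

    # Open in binary mode so the parser does the decoding itself.
    with open(sys.argv[1], 'rb') as inp_file:
        # Request 'start' events too: the first one is the root element.
        etiter = iter(ElementTree.iterparse(inp_file,
                                            events=('start', 'end')))
        event, root = next(etiter)

        for event, elem in etiter:
            if event == 'end' and elem.tag in ('entry', 'character'):
                a = elem.findtext('ent_seq')    # present in jmdict entries
                if a: out.write(a.encode(outp_encoding) + b'\n')

                b = elem.findtext('literal')    # present in kanjidic2 entries
                if b: out.write(b.encode(outp_encoding) + b'\n')

                # Drop finished entries so memory use stays bounded.
                root.clear()

if __name__ == '__main__':
    main()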
~~~~~~~~

Are you sure you're not using a modified copy of jmdict?