Important Notice: Our web hosting provider recently started charging us for additional visits, which was unexpected. In response, we're seeking donations. Depending on the situation, we may explore different monetization options for our Community and Expert Contributors. It's crucial to provide more returns for their expertise and offer more Expert Validated Answers or AI Validated Answers. Learn more about our hosting issue here.

Why can lxml parse my XML from unicode strings?

April 26, 2017lxml parse strings Unicode XML

0

10 Posted

Why can lxml parse my XML from unicode strings?

1 Answer

0

Posted

lxml can read Python unicode strings and even tries to support them if libxml2 does not. However, if the unicode string declares an XML encoding internally (), parsing is bound to fail, as this encoding is most likely not the real encoding used in Python unicode. The same is true for HTML unicode strings that contain charset meta tags, although the problems may be more subtle here. The libxml2 HTML parser may not be able to parse the meta tags in broken HTML and may end up ignoring them, so even if parsing succeeds, later handling may still fail with character encoding errors. Note that Python uses different encodings for unicode on different platforms, so even specifying the real internal unicode encoding is not portable between Python interpreters. Don’t do it. Python unicode strings with XML data or HTML data that carry encoding information are broken. lxml will not parse them. You must provide parsable data in a valid encoding.