Decoding UTF-8 content in Python

Question

I am trying to scrape a web page whose charset is declared like this:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

When I fetch the page source with Python requests, I get content like this:

&#2453;&#2469;&#2494;&#2527; &#2476;&#2482;&#2503;- &#2478;&#2494;&#2459;&#2503; &#2477;&#2494;&#2468;&#2503; &#2476;&#2494;&#2457;&#2494;&#2482;&#2495;&#2404;</p> <p>&#2453;&#2476;&#2495; &#2440;&#2486;&#2509;&#2476;&#2480; &#2455;&#2497;&#2474;&#2509;&#2468; &#2438;&#2480;&#2503;&#2453; &#2471;&#2494;&#2474; &#2447;&#2455;&#2495;&#2527;&#2503; &#2476;&#2482;&#2503;&#2472;, '&#2477;&#2494;&#2468;-&#2478;&#2494;&#2459; &#2454;&#2503;&#2527;&#2503; &#2476;&#2494;&#2433;&#2458;&#2503; &#2476;&#2494;&#2457;&#2509;&#2455;&#2494;&#2482;&#2495; &#2488;&#2453;&#2482;/ &#2471;&#2494;&#2472;&#2503; &#2477;&#2480;&#2494; &#2477;

How can I get the original content from these strings in Python?

Answer 1

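Those are HTML entities (numeric character references) that encode Unicode code points; the text is not really using UTF-8 at all, and the page could be encoded as ASCII without losing anything. Use an HTML parser such as BeautifulSoup. It will handle this kind of content for you: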

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('''\
... <html><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
... </head><body>
... &#2453;&#2469;&#2494;&#2527; &#2476;&#2482;&#2503;- &#2478;&#2494;&#2459;&#2503; &#2477;&#2494;&#2468;&#2503; &#2476;&#2494;&#2457;&#2494;&#2482;&#2495;&#2404;</p> <p>&#2453;&#2476;&#2495; &#2440;&#2486;&#2509;&#2476;&#2480; &#2455;&#2497;&#2474;&#2509;&#2468; &#2438;&#2480;&#2503;&#2453; &#2471;&#2494;&#2474; &#2447;&#2455;&#2495;&#2527;&#2503; &#2476;&#2482;&#2503;&#2472;, '&#2477;&#2494;&#2468;-&#2478;&#2494;&#2459; &#2454;&#2503;&#2527;&#2503; &#2476;&#2494;&#2433;&#2458;&#2503; &#2476;&#2494;&#2457;&#2509;&#2455;&#2494;&#2482;&#2495; &#2488;&#2453;&#2482;/ &#2471;&#2494;&#2472;&#2503; &#2477;&#2480;&#2494; &#2477;
... </body></html>''', 'lxml')
>>> soup
<html><head><meta content="text/html; charset=unicode-escape" http-equiv="Content-Type"/>\n</head><body>\n\u0995\u09a5\u09be\u09df \u09ac\u09b2\u09c7- \u09ae\u09be\u099b\u09c7 \u09ad\u09be\u09a4\u09c7 \u09ac\u09be\u0999\u09be\u09b2\u09bf\u0964 <p>\u0995\u09ac\u09bf \u0988\u09b6\u09cd\u09ac\u09b0 \u0997\u09c1\u09aa\u09cd\u09a4 \u0986\u09b0\u09c7\u0995 \u09a7\u09be\u09aa \u098f\u0997\u09bf\u09df\u09c7 \u09ac\u09b2\u09c7\u09a8, '\u09ad\u09be\u09a4-\u09ae\u09be\u099b \u0996\u09c7\u09df\u09c7 \u09ac\u09be\u0981\u099a\u09c7 \u09ac\u09be\u0999\u09cd\u0997\u09be\u09b2\u09bf \u09b8\u0995\u09b2/ \u09a7\u09be\u09a8\u09c7 \u09ad\u09b0\u09be \u09ad\n</p></body></html>
>>> soup.get_text()
u"\n\n\u0995\u09a5\u09be\u09df \u09ac\u09b2\u09c7- \u09ae\u09be\u099b\u09c7 \u09ad\u09be\u09a4\u09c7 \u09ac\u09be\u0999\u09be\u09b2\u09bf\u0964 \u0995\u09ac\u09bf \u0988\u09b6\u09cd\u09ac\u09b0 \u0997\u09c1\u09aa\u09cd\u09a4 \u0986\u09b0\u09c7\u0995 \u09a7\u09be\u09aa \u098f\u0997\u09bf\u09df\u09c7 \u09ac\u09b2\u09c7\u09a8, '\u09ad\u09be\u09a4-\u09ae\u09be\u099b \u0996\u09c7\u09df\u09c7 \u09ac\u09be\u0981\u099a\u09c7 \u09ac\u09be\u0999\u09cd\u0997\u09be\u09b2\u09bf \u09b8\u0995\u09b2/ \u09a7\u09be\u09a8\u09c7 \u09ad\u09b0\u09be \u09ad\n"
>>> print soup.get_text()


কথায় বলে- মাছে ভাতে বাঙালি। কবি ঈশ্বর গুপ্ত আরেক ধাপ এগিয়ে বলেন, 'ভাত-মাছ খেয়ে বাঁচে বাঙ্গালি সকল/ ধানে ভরা ভ
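If you only need to decode the entities in a string and do not need to parse the markup, the standard library may be enough. The following is a minimal sketch, assuming Python 3.4 or later, where `html.unescape` is available; the variable names `raw` and `decoded` are illustrative only, and the input is the first two words of the sample above. Unlike `get_text()`, this does not strip tags such as `<p>`.

```python
import html

# First two words from the page source in the question,
# written as numeric character references.
raw = "&#2453;&#2469;&#2494;&#2527; &#2476;&#2482;&#2503;"

# html.unescape turns each &#NNNN; reference into the Unicode
# character with that decimal code point.
decoded = html.unescape(raw)

print(decoded)
```

The result is an ordinary `str`; you can encode it as UTF-8 yourself when writing it to a file or sending it over the network.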
