HTML from a webpage does not display foreign language characters correctly

Question

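Apologies if the title is misleading.

I am trying to find out the language of a given song by querying a lyrics website and then checking the language of the lyrics with CLD2. However, for some songs (such as the example below) the foreign-language characters are not encoded correctly, which makes CLD2 throw this error: input contains invalid UTF-8 around byte 2121 (of 32761)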

import requests
import re
from bs4 import BeautifulSoup
import cld2

def checklang(lyrics):
    # cld2.detect is the call that fails on the badly decoded lyrics
    isReliable, textBytesFound, details = cld2.detect(lyrics)
    language = re.search("ENGLISH", str(details))

    if language is None:
        print("foreign lang")

    if len(re.findall("Unknown", str(details))) < 2:
        print("foreign lang")

    if language is not None:
        print("english")

response = requests.get("https://www.azlyrics.com/lyrics/blackpink/ddududdudu.html")

soup = BeautifulSoup(response.text, 'html.parser')
counter = 0
for item in soup.select("div"):
    counter += 1
    # the lyrics sit in the 21st <div> of the page
    if counter == 21:
        lyrics = item.get_text()
        checklang(lyrics)
        print("Lyrics found!")
        break

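It is also worth mentioning that this is not limited to non-Latin characters; it sometimes happens with apostrophes or other punctuation as well.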

Can anyone shed some light on why this happens, or on what I can do to work around it?

Answer 1

Requests should make educated guesses about the encoding of the response based on the HTTP headers.

Unfortunately, in the given example response.encoding reports ISO-8859-1, while response.content contains <meta charset="utf-8">.
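To see what that mismatch does to the text, here is a minimal offline sketch. It assumes the server sends UTF-8 bytes and that requests decodes them as ISO-8859-1 (its fallback for text responses whose Content-Type header names no charset). The sample string is made up for illustration and is not taken from the page.

```python
# Illustrative sample (an assumption, not the real lyrics):
# Korean text plus a curly apostrophe (U+2019).
original = "뚜두뚜두 don\u2019t"

# What the server sends over the wire: UTF-8 bytes.
raw = original.encode("utf-8")

# Roughly what response.text yields while response.encoding is ISO-8859-1:
# every single byte becomes its own character, so multi-byte sequences
# turn into runs of unrelated Latin-1 characters.
wrong = raw.decode("iso-8859-1")

# Roughly what response.text yields once response.encoding = 'utf-8'.
right = raw.decode("utf-8")

print(repr(wrong))
print(repr(right))

# As long as nothing was dropped, the damage can be undone by
# reversing the wrong decoding step.
repaired = wrong.encode("iso-8859-1").decode("utf-8")
print(repaired == original)
```

The curly apostrophe takes three bytes in UTF-8, which would explain why punctuation can be affected as well as Korean characters.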

Here is my solution, based on the Response Content paragraph in the requests documentation.

import requests
import re
from bs4 import BeautifulSoup
# import cld2
import pycld2 as cld2

def checklang(lyrics):
    isReliable, textBytesFound, details = cld2.detect(lyrics)
    # print each detected language entry
    for detail in details:
        print(detail)

response = requests.get('https://www.azlyrics.com/lyrics/blackpink/ddududdudu.html')

print(response.encoding)
response.encoding = 'utf-8'                         ### key change ###

soup = BeautifulSoup(response.text, 'html.parser')
counter = 0
for item in soup.select("div"):
    counter+=1
    if counter == 21:
        lyrics = item.get_text()
        checklang(lyrics)
        print("Lyrics found!")
        break

Output: \SO630066.py

ISO-8859-1
('ENGLISH', 'en', 74, 833.0)
('Korean', 'ko', 20, 3575.0)
('Unknown', 'un', 0, 0.0)
Lyrics found!
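Hard-coding 'utf-8' works for this page. If other pages might declare a different charset, one option is to decode the raw bytes according to the page's own <meta> declaration. The sketch below is only an illustration under that assumption: decode_html and META_CHARSET are hypothetical names, not part of requests or BeautifulSoup, and the regular expression is a simplification rather than a full HTML parser.

```python
import re

# Hypothetical helper pattern: finds charset=... inside a <meta> tag.
META_CHARSET = re.compile(rb"<meta[^>]+charset=[\"']?([\w-]+)", re.IGNORECASE)

def decode_html(raw, fallback="utf-8"):
    """Decode HTML bytes using the charset declared in a <meta> tag, if any."""
    # Only look near the top of the document, where <meta> tags normally sit.
    match = META_CHARSET.search(raw[:2048])
    encoding = match.group(1).decode("ascii") if match else fallback
    try:
        return raw.decode(encoding)
    except (LookupError, UnicodeDecodeError):
        # Unknown or incorrect declaration: decode leniently instead of failing.
        return raw.decode(fallback, errors="replace")

# Stand-in for response.content (made-up page, for illustration only).
page = '<html><head><meta charset="utf-8"></head><body>뚜두 don\u2019t</body></html>'.encode("utf-8")
print(decode_html(page))
```

With requests this would be called as decode_html(response.content). BeautifulSoup also accepts bytes directly, so passing response.content instead of response.text lets it attempt its own encoding detection, and response.apparent_encoding is another guess that requests can provide.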

Tags: python, encoding, http, utf-8, python-requests