Random 3-4 character strings in an HTTP response in Python

Question

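I'm trying to make a request using the socket module in Python. It successfully sends the request, gets a response, and decodes it. When I look at the HTML document, everything is correct except that there are random 3-4 character strings scattered through it. I think my code is correct, but I'm not 100% sure. Here is my code: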

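import select
import socket
import ssl

from bs4 import BeautifulSoup

# `newline` was used but not defined in the snippet; "\r\n" (the HTTP line terminator) is assumed
newline = "\r\n"

def recive_data(get, timeout):
  # Wait up to `timeout` seconds for the socket to become readable
  ready = select.select([get], [], [], timeout)
  if ready[0]:
    return get.recv(4096)
  return b""

def get_file(website, port, file, https=False):
  data = []

  if https:
    get = ssl.create_default_context().wrap_socket(socket.socket(socket.AF_INET, socket.SOCK_STREAM), server_hostname=website)
  else:
    get = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
  get.connect((website, port))
  get.sendall(f"GET {file} HTTP/1.1\r\nHost: {website}:{port}\r\n\r\n".encode())
  while True:
    new_data = recive_data(get, 5)
    if not new_data:
      break  # nothing arrived within the timeout, or the peer closed the connection
    data.append(new_data)

  # Decode once after joining, so a multi-byte character split across two recv() calls cannot fail
  data = b"".join(data).decode()
  # Everything before the first blank line is the header; the rest is the (still chunked) body
  header = data[0:data.find(newline+newline)]
  data = data[data.find(newline+newline):data.rfind(f"{newline}0{newline}{newline}")]

  data = BeautifulSoup(data, 'html.parser').prettify()

  get.close()
  return (header, data)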

If I request https://whosebug.com, it outputs:

30d
<!DOCTYPE html>
<html class="html__responsive html__unpinned-leftnav">
 <head>
  <title>
   Stack Overflow - Where Developers Learn, Share, &amp; Build Careers
  </title>
  <link href="https://cdn.sstatic.net/Sites/Whosebug/Img/favicon.ico?v=ec617d715196" rel="shortcut icon"/>
  <link href="https://cdn.sstatic.net/Sites/Whosebug/Img/apple-touch-icon.png?v=c78bd457575a" rel="apple-touch-icon"/>
  <link href="https://cdn.sstatic.net/Sites/Whosebug/Img/apple-touch-icon.png?v=c78bd457575a" rel="image_src"/>
  <link href="/opensearch.xml" rel="search" title="Stack Overflow" type="application/opensearchdescription+xml"/>
  <meta content="Stack Overflow is the largest, most trusted online communi
20d0
ty for developers to learn, share their programming knowledge, and build their careers." name="description"/>
  <meta content="width=device-width, height=device-height, initial-scale=1.0, minimum-scale=1.0" name="viewport"/>
  <meta content="website" property="og:type">

and so on. Some websites have more of these strings than others, and I can't figure out why. Any help is greatly appreciated!

Answer 1

The last line of the response header gives you a clue:

HTTP/1.1 200 OK
Connection: keep-alive
cache-control: private
...
transfer-encoding: chunked

The transfer-encoding header means that what follows the header is not pure HTML. From the spec:

   The chunked encoding modifies the body of a message in order to
   transfer it as a series of chunks, each with its own size indicator,
   followed by an OPTIONAL trailer containing entity-header fields
...
   The chunk-size field is a string of hex digits indicating the size of
   the chunk. The chunked encoding is ended by any chunk whose size is
   zero, followed by the trailer, which is terminated by an empty line.

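In other words, what you are seeing is a hexadecimal number giving the number of bytes in the next chunk. There may be more than one chunk. You need to check for that HTTP header and, if it is present, find all the chunks and join them together before parsing the page as HTML.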
