Why do I get these strange characters with Scrapy instead of HTML?

Question

This is just a hobby project for me. I am trying to fetch the HTML of the booking.com start page with Scrapy.

    def start_requests(self):
        print('step 1')
        start_url = 'https://www.booking.com'
        yield scrapy.Request(url=start_url, headers=self.headers, callback=self.step2)

    def step2(self, response):
        print('step 2')
        print(response.status)
        print(response.headers)
        print(response.text)

I found something strange in response.text. Here is part of the response:

O��xa�X��_\O^'IM�l�F��6(]1�r��LB>�O�g�#p.�:x�}8Rh��ӓ�Q��2h��ƺU�s�&��0{��l]Y&��F9�@�WCR��7�**)JE-�-��&�� )ԼS��y��z�R�@�J��1��N��60��&'�lK�E�R.Ҙɧ�e��S��ϵ��C�(��6$�&��L2��{��B^�@��~~['� ��T2�|"|��X�L 5˔-�خ� AJ�8��X�@5`�y*��:��O��⎻��␊��R��71┴�A"≠�Eٹ��[�9B��6,��#�[=13=]%(L�2'°��≤≥�&�Ď�Lȋ7� <��*p�ABU�ālK�=��iݐ�'�b>I�'�J��o7��e�| �≥�4��Vď�L�0��◆�xՒPef��&l��d{X�h��#�� q$�d�$��?�:�M��&jb{��0��@� ��S�_��4ztlS��4�2^��5^�7'� QFUH:��7▒��│ �┘�.��ݔ��M�␋�ȵ��A⎽┼:�Z�:��F��├�D�-߯8*��ǠH*��ؔ│�J�C�oe2|��}xo�&��"K��j�y�<�%Z�;!M��t ۩~�R�cy2�>D�h�p��3�4��x�y1��T\��IY��F�(�E��ì� �[

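This looks like bogus data, and the response is different every time. When I make a similar request through Postman, everything works: I get a 302 response with a link, and the site opens normally. I think Booking detects that my code is a scraper, but I don't understand how. The IP address is the same as with Postman, and Postman also ignores JavaScript, so I don't know what is going on. Please help. Thanks!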

Answer 1

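What you are getting is the raw compressed response. That is odd, because Scrapy normally handles the HTTP session and compressed data by itself, thanks to CookiesMiddleware and HttpCompressionMiddleware, both of which are enabled by default. Are you hard-coding Accept-Encoding in your code?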
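If the spider's headers do contain an Accept-Encoding entry (for example one copied from a browser that lists an encoding Scrapy cannot decode in your environment), one likely fix is to drop that entry and let Scrapy negotiate the encoding itself. A minimal sketch, assuming `self.headers` is a plain dict; `without_accept_encoding` and the sample header values are hypothetical and not part of Scrapy:

```python
def without_accept_encoding(headers):
    """Return a copy of headers with any Accept-Encoding entry removed.

    Header names are compared case-insensitively, as HTTP requires.
    """
    return {
        name: value
        for name, value in headers.items()
        if name.lower() != "accept-encoding"
    }


# Hypothetical headers, as they might look when copied from a browser.
browser_headers = {
    "User-Agent": "Mozilla/5.0",
    "Accept": "text/html",
    "Accept-Encoding": "gzip, deflate, br",
}

clean_headers = without_accept_encoding(browser_headers)
print(clean_headers)
```

In the spider from the question, this would mean passing `headers=without_accept_encoding(self.headers)` to `scrapy.Request`.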

# The response sets Content-Encoding to gzip, because the client advertises support for it.
# (2>&1 must come before 1>/dev/null, otherwise the headers never reach grep.)
curl -v -H 'Accept-Encoding: gzip' https://www.booking.com 2>&1 1>/dev/null | grep -i content-encoding

# This prints binary data, because curl does not decompress unless asked to.
curl -H 'Accept-Encoding: gzip' https://www.booking.com
# --compressed makes curl request a compressed response and decompress it.
curl --compressed https://www.booking.com

# No Content-Encoding in the response if the client does not ask for gzip.
curl -v https://www.booking.com 2>&1 1>/dev/null | grep -i content-encoding
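The same effect can be reproduced offline in Python. The sketch below uses only the standard library and a made-up HTML string (an assumption for illustration, not the real booking.com page): decoding gzip bytes as text gives garbage similar to the output in the question, while decompressing first restores the HTML. `looks_gzipped` is a hypothetical helper that checks for the gzip magic bytes.

```python
import gzip

# Stand-in for a page body; the real response would come from the server.
html = "<html><body>Hello</body></html>"
raw_body = gzip.compress(html.encode("utf-8"))


def looks_gzipped(body: bytes) -> bool:
    """Return True if the body starts with the gzip magic bytes 0x1f 0x8b."""
    return body[:2] == b"\x1f\x8b"


# Treating compressed bytes as text yields unreadable characters.
garbled = raw_body.decode("utf-8", errors="replace")

# Decompressing before decoding recovers the original markup.
restored = gzip.decompress(raw_body).decode("utf-8")

print(looks_gzipped(raw_body))
print(repr(garbled[:20]))
print(restored)
```

If `response.body` in the spider does not start with these magic bytes, the server may have used a different encoding such as Brotli, which needs its own decoder.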

python

web-crawler

scrapy