Unicode space is automatically escaped and no longer recognized by strip()

Question

TL;DR: Scrapy escapes the Unicode space code \u0020, so strip() no longer recognizes it.

I am trying to scrape some web links with Scrapy like this:

class MySpider(scrapy.Spider): 

    name = 'testSpider'
    start_urls = [<someStartUrls>]

    def parse(self, response): 
        for entry in response:
            yield {<someComplicatedXPath>.xpath('a/@href').get()}

Some of these links have odd formatting artifacts. For example, they may look like <a href="linkUrl\u0020"> Link Text </a> or <a href="\u0020linkUrl2"> Link Text </a>, i.e. they contain Unicode spaces. These spaces persist in my output:

linkUrl\u0020
\u0020linkUrl2

To remove at least the leading and trailing spaces, I wrapped a "cleaning" function around the XPath output:

    <...>
    def parse(self, response): 
        for entry in response:
            yield {cleanStr(<someComplicatedXPath>.xpath('a/@href').get())} 

def cleanStr(webString): # a bit simplified 
    return webString.strip()

That had no effect. The reason became clear when I looked at the representation of the string:

def cleanStr(webString): # a bit simplified 
    print(webString)       ##### this prints "linkUrl\u0020"  #####
    print(repr(webString)) ##### this prints 'linkUrl\\u0020' #####
    return webString.strip()

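So strip() receives the string with an escaped backslash and no longer recognizes the Unicode code. I assume this escaping happens during the execution of get(), but I am not sure.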
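
The difference can be reproduced without Scrapy. This is a minimal sketch, assuming the scraped value contains the six literal characters \u0020 rather than an actual space; the variable names are only illustrative:

```python
# A normal literal: the escape sequence becomes one real space (U+0020).
real_space = 'linkUrl\u0020'
# A raw literal: the backslash, 'u' and four digits stay as six characters.
literal_escape = r'linkUrl\u0020'

print(len(real_space))      # 8
print(len(literal_escape))  # 13

# strip() removes the real trailing space...
print(repr(real_space.strip()))
# ...but leaves the literal escape alone, because the string ends in '0',
# which is not whitespace.
print(repr(literal_escape.strip()))
```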

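Although this former Unicode space could be replaced by brute force, that is surely not the right approach. What is the best way to robustly handle these spaces inside HTML links?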

Answer 1

If you only have this one character, you can simply use replace() with '\\u0020' or, using the raw prefix, r'\u0020'.

text = r'linkUrl\u0020'
print(text)
text = text.replace(r'\u0020', ' ')
print(text)

Result:

linkUrl\u0020
linkUrl
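
Combined with strip(), the replacement could be folded into the cleaning function from the question. This is only a sketch under assumptions: clean_str is a hypothetical variant of cleanStr, and the None check is there because get() returns None when the XPath matches nothing.

```python
def clean_str(web_string):
    # get() returns None when nothing matched, so pass that through unchanged.
    if web_string is None:
        return None
    # Turn each literal \u0020 sequence into a real space, then trim both ends.
    return web_string.replace(r'\u0020', ' ').strip()


print(repr(clean_str(r'linkUrl\u0020')))
print(repr(clean_str(r'\u0020linkUrl2')))
```

Note that this also turns any \u0020 in the middle of a link into a space; whether that is desirable depends on the links being scraped.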

If you have other \u characters as well, you can use .encode().decode('unicode_escape').

text = r'linkUrl\u0020\u0041\u0042\u0043'
print(text)
text = text.encode().decode('unicode_escape')
print(text)

Result:

linkUrl\u0020\u0041\u0042\u0043
linkUrl ABC

Documentation: 7.2.4. Python Specific Encodings
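
One caveat: unicode_escape treats the encoded bytes as Latin-1, so a link that also contains real non-ASCII characters may come out mangled. A possible alternative is to decode only the literal \uXXXX sequences with a regular expression. This is a sketch, not a definitive implementation; decode_u_escapes is a hypothetical helper name, and it does not combine surrogate pairs.

```python
import re

# Matches a literal backslash, 'u', and exactly four hex digits.
_UNICODE_ESCAPE = re.compile(r'\\u([0-9a-fA-F]{4})')


def decode_u_escapes(text):
    # Replace each literal \uXXXX with the character it names;
    # everything else in the string is left untouched.
    return _UNICODE_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)


text = r'café\u0020\u0041'
print(repr(text.encode().decode('unicode_escape')))  # the 'é' is mangled here
print(repr(decode_u_escapes(text)))                  # the 'é' survives here
print(repr(decode_u_escapes(r'\u0020linkUrl2').strip()))
```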

python

unicode

escaping

scrapy

web-scraping