How to remove \r, \n, \t from unicode strings in Python 2.7
I have some scraped data that is full of annoying escape characters:
{"website": "http://www.zebrawebworks.com/zebra/bluetavern/day.cfm?&year=2018&month=7&day=10", "headliner": ["\"Roda Vibe\" with the Tallahassee Choro Society"], "data": [" \r\n ", "\r\n\t\r\n\r\n\t", "\r\n\t\r\n\t\r\n\t", "\r\n\t", "\r\n\t", "\r\n\t", "8:00 PM", "\r\n\t\r\n\tFEE: \u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 ", "\r\n\tEvery 2nd & 4th Tuesday of the month, the Choro Society returns to Blue Tavern with that subtly infectious Brazilian rhythm and beautiful melodies that will stay with you for days. The perfect antidote to Taylor Swift. for musicians; tips appreciated. ", "\r\n\t", "\r\n\t\r\n\t", "\r\n\t", "\r\n\t", "\r\n\t\r\n\t\r\n\r\n\t\r\n\t", "\r\n\t\r\n\t\t", "\r\n", "\r\n", "\r\n", "\r\n"]},
I'm trying to write a function to remove these characters, but neither of my two strategies works:
# strategy 1
escapes = ''.join([chr(char) for char in range(1, 32)])
table = {ord(char): None for char in escapes}
for item in concert['data']:
    item = item.translate(table)

# strategy 2
for item in concert['data']:
    for char in item:
        char = char.replace("\r", "").replace("\t", "").replace("\n", "")
Why is my data still full of escape characters after I've tried two different ways of removing them?
Consider the following:
lst = ["aaa", "abc", "def"]
for x in lst:
    x = x.replace("a", "z")
print(lst)  # ['aaa', 'abc', 'def']
The list appears unchanged, and it is. Reassigning the variable used in the for loop (x) takes effect inside the loop, but the change is never propagated back to lst.
Instead, do this:
for i, x in enumerate(lst):
    lst[i] = x.replace("a", "z")
print(lst)  # ['zzz', 'zbc', 'def']
Or:
for i in range(len(lst)):
    lst[i] = lst[i].replace("a", "z")
print(lst)  # ['zzz', 'zbc', 'def']
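Applied to the data in the question, the same idea might look like the sketch below. The concert dict here is a shortened, hypothetical stand-in for the scraped record, and the translation table only covers \r, \n and \t rather than every control character. The u"" literals are an assumption that the strings are unicode (as they would be after loading JSON in Python 2.7), which is what lets translate() accept a dict.

```python
# Hypothetical, shortened sample shaped like the scraped record in the question.
concert = {
    u"data": [u" \r\n ", u"\r\n\t", u"8:00 PM", u"\r\n\t\r\n\tFEE: \u00a0\u00a0 "],
}

# Map the code points of \r, \n and \t to None so translate() deletes them.
table = {ord(ch): None for ch in u"\r\n\t"}

# Build a new list of cleaned strings and assign it back to the dict key,
# instead of rebinding the loop variable.
concert[u"data"] = [item.translate(table) for item in concert[u"data"]]

print(concert[u"data"])
```

Ordinary spaces and the non-breaking spaces (\u00a0) are left alone here; they would need their own entries in the table if you want them removed too.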
Edit
Because you are using assignment (x = ...), you have to assign the result back to the original list, using something like lst[i] = ....
For immutable types (which include strings), this really is your only option. x.replace("a","z") does not change x; it returns a new string with the specified replacements.
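A small sketch of that point, using made-up variable names (original, cleaned): the string that replace() is called on stays exactly as it was, and the result is a separate object that you have to store somewhere.

```python
original = u"\r\n\t8:00 PM"

# replace() returns a new string; it never modifies the one it is called on.
cleaned = original.replace(u"\r", u"").replace(u"\n", u"").replace(u"\t", u"")

print(repr(original))  # still contains \r, \n and \t
print(repr(cleaned))   # only the visible text is left
```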
With mutable types (e.g. lists), you can modify the object bound to the loop variable in place, that is, the x in for x in lst:.
So something like the following will see the changes to x propagated to lst.
lst = [[1], [2], [3]]
for x in lst:
    x.append('added')  # Example of in-place modification
print(lst)  # [[1, 'added'], [2, 'added'], [3, 'added']]
This works because x.append() (unlike str.replace()) does change the x object.
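The outer list is itself mutable, so one more option is to replace its contents in place with slice assignment. This is a sketch rather than part of the approach above; data and alias are hypothetical names, and the sample strings are shortened from the question.

```python
data = [u"\r\n\t", u"8:00 PM", u"\r\n\tEvery 2nd & 4th Tuesday"]
alias = data  # a second reference to the same list object

# Slice assignment replaces the contents of the existing list object,
# so every reference to that list sees the cleaned strings.
data[:] = [
    s.replace(u"\r", u"").replace(u"\n", u"").replace(u"\t", u"")
    for s in data
]

print(alias)
```

A plain data = [...] would instead rebind the name data to a new list and leave alias pointing at the old, uncleaned one.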