推断空间:忽略数字和特殊字符
Infer Spaces: Ignore Numbers and Special Characters
我试图忽略所有不是字母的内容(撇号等),然后继续。该数字应位于结果中的相应位置。这是来自 this accepted answer, and the word list is here。
字符串是"thereare7deadlysins"
下面的代码输出 "there are 7 d e a d l y s i n s"
我正在尝试获取 "there are 7 deadly sins"
我试过了(如下),但我收到了 IndexError: 'string index out of range'
# Backtrack to recover the minimal-cost string.
out = []
i = len(s)
while i>0:
if isinstance(s[i], int):
continue
c,k = best_match(i)
assert c == cost[i]
out.append(s[i-k:i])
i -= k
整个事情是:
from math import log
import string
# Build a cost dictionary, assuming Zipf's law and cost = -math.log(probability).
words = open("/Users/.../Desktop/wordlist.txt").read().split()
wordcost = dict((k, log((i+1)*log(len(words)))) for i,k in enumerate(words))
maxword = max(len(x) for x in words)
table = string.maketrans("","")
l = "".join("thereare7deadlysins".split()).lower()
def infer_spaces(s):
"""Uses dynamic programming to infer the location of spaces in a string
without spaces."""
# Find the best match for the i first characters, assuming cost has
# been built for the i-1 first characters.
# Returns a pair (match_cost, match_length).
def best_match(i):
candidates = enumerate(reversed(cost[max(0, i-maxword):i]))
return min((c + wordcost.get(s[i-k-1:i], 9e999), k+1) for k,c in candidates)
# Build the cost array.
cost = [0]
for i in range(1,len(s)+1):
c,k = best_match(i)
cost.append(c)
# Backtrack to recover the minimal-cost string.
out = []
i = len(s)
while i>0:
c,k = best_match(i)
assert c == cost[i]
out.append(s[i-k:i])
i -= k
return " ".join(reversed(out))
def test_trans(s):
return s.translate(table, string.punctuation)
s = test_trans(l)
print(infer_spaces(s))
编辑:根据接受的答案,以下解决了我的问题:
1. 从单词列表中删除单个字母(a、e、i 除外)
2.在wordcost下方添加了以下内容。
nums = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
for n in nums:
wordcost[n] = log(2)
将 wordcost 更改为(下方)的建议未产生最佳结果。
wordcost = dict( (k, (i+1)*log(1+len(k))) for i,k in enumerate(words) )
示例:
字符串:"Recall8importantscreeningquestions"
原词成本:"recall 8 important screening questions"
建议的字词成本:"re call 8 important s c re e n in g question s"
请注意,单词列表包含所有 26 个单独的字母作为单词。
只需进行以下修改,您的算法将正确推断输入字符串 "therearesevendeadlysins" 的空格(即“7”更改为 "seven"):
- 从单词列表中删除单字母单词(可能 "a" 和 "i" 除外。)
- 如@Pm 2Ring 所述,更改 wordcost
的定义
至:
wordcost = wordcost = dict( (k, (i+1)*log(1+len(k))) for i,k in enumerate(words) )
所以有一些关于非字母的东西正在搞砸你的算法。由于您已经删除了标点符号,也许您应该将一串非字母视为一个单词。
例如,如果您添加:
wordcost["7"] = log(2)
(除了上述更改 1 和 2 之外)您的算法适用于原始测试字符串。
i = len(s) -1
避免 IndexError:'string index out of range'
和
if s[i].isdigit():
是您要找的测试。
我试图忽略所有不是字母的内容(撇号等),然后继续。该数字应位于结果中的相应位置。这是来自 this accepted answer, and the word list is here。
字符串是"thereare7deadlysins"
下面的代码输出 "there are 7 d e a d l y s i n s"
我正在尝试获取 "there are 7 deadly sins"
我试过了(如下),但我收到了 IndexError: 'string index out of range'
# Backtrack to recover the minimal-cost string.
out = []
i = len(s)
while i>0:
if isinstance(s[i], int):
continue
c,k = best_match(i)
assert c == cost[i]
out.append(s[i-k:i])
i -= k
整个事情是:
from math import log
import string
# Build a cost dictionary, assuming Zipf's law and cost = -math.log(probability).
words = open("/Users/.../Desktop/wordlist.txt").read().split()
wordcost = dict((k, log((i+1)*log(len(words)))) for i,k in enumerate(words))
maxword = max(len(x) for x in words)
table = string.maketrans("","")
l = "".join("thereare7deadlysins".split()).lower()
def infer_spaces(s):
"""Uses dynamic programming to infer the location of spaces in a string
without spaces."""
# Find the best match for the i first characters, assuming cost has
# been built for the i-1 first characters.
# Returns a pair (match_cost, match_length).
def best_match(i):
candidates = enumerate(reversed(cost[max(0, i-maxword):i]))
return min((c + wordcost.get(s[i-k-1:i], 9e999), k+1) for k,c in candidates)
# Build the cost array.
cost = [0]
for i in range(1,len(s)+1):
c,k = best_match(i)
cost.append(c)
# Backtrack to recover the minimal-cost string.
out = []
i = len(s)
while i>0:
c,k = best_match(i)
assert c == cost[i]
out.append(s[i-k:i])
i -= k
return " ".join(reversed(out))
def test_trans(s):
return s.translate(table, string.punctuation)
s = test_trans(l)
print(infer_spaces(s))
编辑:根据接受的答案,以下解决了我的问题:
1. 从单词列表中删除单个字母(a、e、i 除外)
2.在wordcost下方添加了以下内容。
nums = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
for n in nums:
wordcost[n] = log(2)
将 wordcost 更改为(下方)的建议未产生最佳结果。
wordcost = dict( (k, (i+1)*log(1+len(k))) for i,k in enumerate(words) )
示例:
字符串:"Recall8importantscreeningquestions"
原词成本:"recall 8 important screening questions"
建议的字词成本:"re call 8 important s c re e n in g question s"
请注意,单词列表包含所有 26 个单独的字母作为单词。
只需进行以下修改,您的算法将正确推断输入字符串 "therearesevendeadlysins" 的空格(即“7”更改为 "seven"):
- 从单词列表中删除单字母单词(可能 "a" 和 "i" 除外。)
- 如@Pm 2Ring 所述,更改 wordcost 的定义
至:
wordcost = wordcost = dict( (k, (i+1)*log(1+len(k))) for i,k in enumerate(words) )
所以有一些关于非字母的东西正在搞砸你的算法。由于您已经删除了标点符号,也许您应该将一串非字母视为一个单词。
例如,如果您添加:
wordcost["7"] = log(2)
(除了上述更改 1 和 2 之外)您的算法适用于原始测试字符串。
i = len(s) -1
避免 IndexError:'string index out of range' 和
if s[i].isdigit():
是您要找的测试。