How to find the union of two strings and maintain the order
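I have two strings and I would like to find their union while maintaining the order. My goal is to run OCR on an image in several different ways, which gives different results, and then consolidate all of those results into a single result that has the most content.
This is at least what I am after: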
#example1
string1 = "This is a test trees are green roses are red"
string2 = "This iS a TEST trees 12.48.1952 anthony gonzalez"
finalstring = "this is a test trees are green roses are red 12.48.1952 anthony gonzalez"
#example2
string2 = "This is a test trees are green roses are red"
string1 = "This iS a TEST trees 12.48.1952 anthony gonzalez"
finalstring = "this is a test trees are green roses are red 12.48.1952 anthony gonzalez"
#example3
string1 = "telephone conversation in some place big image on screen"
string2 = "roses are red telephone conversation in some place big image on screen"
finalstring = "roses are red telephone conversation in some place big image on screen"
#or the following - both are fine in this scenario.
finalstring = "telephone conversation in some place big image on screen roses are red "
This is what I tried:
>>> string1 = "This is a test trees are green roses are red"
>>> string2 = "This iS a TEST trees 12.48.1952 anthony gonzalez"
>>> list1 = string1.split(" ")
>>> list2 = string2.split(" ")
>>> " ".join(list(set(list1) | set(list2))).lower()
'a gonzalez this is trees anthony roses green are test 12.48.1952 test is red'
" ".join(x if i >= len(string2.split()) or x == string2.lower().split()[i] else " ".join((x, string2.split()[i])) for i, x in enumerate(string1.lower().split()))
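You can use a generator expression with join, as in the line above, to do what you want. It sets i to the index of each word in string1 and x to that word (lowercased). It then checks whether x equals the word at the same index in string2; if it does not, the word at index i of string2 is added after x, so that both words end up in the final string.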
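To make the positional comparison easier to follow, the one-liner can be unrolled into an explicit loop. This is only a sketch: the name merge_by_position is hypothetical, and unlike the one-liner it also lowercases the words taken from the second string.

```python
def merge_by_position(first, second):
    """Keep every word of `first`; when the word at the same index of
    `second` differs, add it right after (hypothetical helper)."""
    words1 = first.lower().split()
    words2 = second.lower().split()
    out = []
    for i, x in enumerate(words1):
        out.append(x)
        # Only compare while `second` still has a word at this index.
        if i < len(words2) and words2[i] != x:
            out.append(words2[i])
    return " ".join(out)


string1 = "This is a test trees are green roses are red"
string2 = "This iS a TEST trees 12.48.1952 anthony gonzalez"
print(merge_by_position(string1, string2))
# this is a test trees are 12.48.1952 green anthony roses gonzalez are red
```

As the output shows, this approach interleaves the differing words instead of appending them, and any words of the second string beyond the length of the first are ignored.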
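Don't use a set for this. As you must have noticed, only one "are" made it into the final result, because set() keeps only unique objects (and it does not preserve order either).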
string1 = "This is a test trees are green roses are red"
string2 = "This iS a TEST trees 12.48.1952 anthony gonzalez"
str_lst = string1.split()
for s, t in zip(string1.split(), string2.split()):
    if s.lower() == t.lower():
        continue
    else:
        str_lst.append(t)
string = " ".join(s.lower() for s in str_lst)
#this is a test trees are green roses are red 12.48.1952 anthony gonzalez
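One thing to keep in mind is that zip stops at the shorter input, so if string2 has more words than string1, its trailing words are dropped (this happens with example 2 from the question). A possible variation, assuming that is not wanted, uses itertools.zip_longest; the function name merge_append is made up for illustration.

```python
from itertools import zip_longest


def merge_append(first, second):
    """Keep all words of `first`, then append every word of `second` that
    differs (case-insensitively) from the word at the same index."""
    words1 = first.split()
    words2 = second.split()
    merged = list(words1)
    for s, t in zip_longest(words1, words2):
        if t is None:
            # `second` has run out of words; nothing left to append.
            continue
        if s is None or s.lower() != t.lower():
            merged.append(t)
    return " ".join(w.lower() for w in merged)


# Example 2 from the question: the second string is the longer one.
print(merge_append("This iS a TEST trees 12.48.1952 anthony gonzalez",
                   "This is a test trees are green roses are red"))
# this is a test trees 12.48.1952 anthony gonzalez are green roses are red
```

This is still a purely positional comparison, so it will not recognise a shared run of words that is shifted, as in example 3.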
You can use difflib.SequenceMatcher for this:
import difflib
def merge(l, r):
    m = difflib.SequenceMatcher(None, l, r)
    for o, i1, i2, j1, j2 in m.get_opcodes():
        if o == 'equal':
            yield l[i1:i2]
        elif o == 'delete':
            yield l[i1:i2]
        elif o == 'insert':
            yield r[j1:j2]
        elif o == 'replace':
            yield l[i1:i2]
            yield r[j1:j2]
Use it like this:
>>> string1 = 'This is a test trees are green roses are red'
>>> string2 = 'This iS a TEST trees 12.48.1952 anthony gonzalez'
>>> merged = merge(string1.lower().split(), string2.lower().split())
>>> ' '.join(' '.join(x) for x in merged)
'this is a test trees are green roses are red 12.48.1952 anthony gonzalez'
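Since the question mentions combining the output of several OCR passes, the same idea can be folded over more than two strings. The following is a rough sketch rather than part of the original answer: merge_words and merge_all are hypothetical names, and merge_words returns a flat list of words instead of yielding slices so that its result can be fed straight back in.

```python
import difflib
from functools import reduce


def merge_words(left, right):
    """Order-preserving union of two word lists via SequenceMatcher opcodes."""
    matcher = difflib.SequenceMatcher(None, left, right)
    out = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # 'equal', 'delete' and 'replace' all contribute the left-hand words.
        if tag in ("equal", "delete", "replace"):
            out.extend(left[i1:i2])
        # 'insert' and 'replace' contribute the right-hand words.
        if tag in ("insert", "replace"):
            out.extend(right[j1:j2])
    return out


def merge_all(texts):
    """Merge any non-empty sequence of strings, left to right."""
    word_lists = [t.lower().split() for t in texts]
    return " ".join(reduce(merge_words, word_lists))


ocr_results = [
    "This is a test trees are green roses are red",
    "This iS a TEST trees 12.48.1952 anthony gonzalez",
]
print(merge_all(ocr_results))
# this is a test trees are green roses are red 12.48.1952 anthony gonzalez
```

Because the merge is applied left to right, the order of the inputs can affect where differing segments land in the result; reduce also raises a TypeError on an empty list, so at least one string is assumed.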
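If you want to perform the merge at the character level, just change the call to operate on the strings directly (instead of on lists of words):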
>>> merged = merge(string1.lower(), string2.lower())
>>> ''.join(merged)
'this is a test trees 12.48.1952 arenthony gronzaleen roses are redz'
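This solution properly maintains the order of the individual parts of the strings. So if both strings end with a common part but have a different segment before that ending, both of those different segments will still appear before the common ending in the result. For example, merging A B D and A C D gives you A B C D.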
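As a result, you can recover each original string, in the correct order, just by removing parts of the result string. If you remove C from that example result, you get the first string back; if you remove B instead, you get the second string back. This also works with more complex merges.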
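The property described above can be checked with a small subsequence test. This is a sketch for illustration: merge repeats the generator from this answer so the snippet runs on its own, and is_subsequence is a hypothetical helper.

```python
import difflib


def merge(l, r):
    # Same generator as in the answer, repeated here for self-containment.
    m = difflib.SequenceMatcher(None, l, r)
    for o, i1, i2, j1, j2 in m.get_opcodes():
        if o in ('equal', 'delete'):
            yield l[i1:i2]
        elif o == 'insert':
            yield r[j1:j2]
        elif o == 'replace':
            yield l[i1:i2]
            yield r[j1:j2]


def is_subsequence(part, whole):
    """True if `part` can be obtained from `whole` by deleting items only."""
    it = iter(whole)
    # `item in it` advances the iterator, so matches must occur in order.
    return all(item in it for item in part)


first, second = "A B D".split(), "A C D".split()
merged = [word for chunk in merge(first, second) for word in chunk]
print(merged)                          # ['A', 'B', 'C', 'D']
print(is_subsequence(first, merged))   # True
print(is_subsequence(second, merged))  # True
```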