Algorithm for multiple string replacement by index
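I'm having trouble coming up with a good algorithm for replacing certain entities in a text. Here are the details:
I have a text that I need to format as html. The formatting information is in a python list of entity dictionaries. Say the original text was this (note the formatting):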
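Lorem Ipsum is simply dummy text of the printing and typesetting industry. (Here "Lorem Ipsum" is bold, "dummy" is italic, and "printing" is a link to google.com.)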
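The text I will receive is this (without formatting):
Lorem Ipsum is simply dummy text of the printing and typesetting industry.
And a list of entities like this: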
entities = [{"entity_text":"Lorem Ipsum", "type": "bold", "offset": 0, "length":"11"}, {"entity_text":"dummy", "type": "italic", "offset": 22, "length":"5"},{"entity_text":"printing", "type": "text_link", "offset": 41, "length":"8", "url": "google.com"}]
My algorithm should translate the given unformatted text and entities into this html:
<b>Lorem Ipsum</b> is simply <i>dummy</i> text of the <a href="google.com">printing</a> and typesetting industry
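so that it can be rendered as the original message.
I've tried string replacement, but it messes up the offsets (the position of each entity counted from the start of the text). Keep in mind that the same word may appear many times in the text without formatting, so I have to find exactly the occurrence that should be formatted. Can anyone help? I'm writing the code in python, but you can describe the algorithm in any language.
EDIT
Sorry, I forgot to post the code I've tried. Here it is: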
def format_html(text, entities):
    # Attempt from the question: the offsets refer to the unformatted text,
    # so they no longer line up once a tag has been inserted.
    for entity in entities:
        try:
            entity_text = entity['entity_text']
            position = text.find(entity_text, entity['offset'])
            if position == entity['offset']:
                before = text[:position]
                after = text[min(position+entity['length'], len(text)-1):]
                if entity['type'] == 'text_link':
                    text_link = '<a href="{}">{}</a>'.format(entity['url'], entity_text)
                    text = before + text_link + after
                elif entity['type'] == 'code':
                    code = '<code>{}</code>'.format(entity_text)
                    text = before + code + after
                elif entity['type'] == 'bold':
                    bold_text = '<b>{}</b>'.format(entity_text)
                    text = before + bold_text + after
                elif entity['type'] == 'italic':
                    italic_text = '<i>{}</i>'.format(entity_text)
                    text = before + italic_text + after
                elif entity['type'] == 'pre':
                    pre_code = '<pre>{}</pre>'.format(entity_text)
                    text = before + pre_code + after
        except:
            pass
    return text
Did you perhaps mean something like this?
text = ""
for entry in entities:
    line = ""
    # Relies on 'entity_text' coming before 'type' in each dict.
    for key, value in entry.items():
        if key == 'entity_text':
            line += value
        elif key == 'type' and value == 'bold':
            line = "<b> {} </b>".format(line)
        elif key == 'type' and value == 'italic':
            line = "<i> {} </i>".format(line)
        elif key == 'type' and value == 'text_link':
            line = '<a href="google.com">{}</a>'.format(line)
    text += line
text
which evaluates to
'<b> Lorem Ipsum </b><i> dummy </i><a href="google.com">printing</a>'
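Well, this is how I solved it. Every time I modify the text, I shift the offsets of the later entities by the length of the extra string added to the text (the tags). This is expensive in computation time, since every replacement loops over all entities again, but it's the only option I've seen.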
def format_html(text, entities):
    # Note: this updates the 'offset' values of the entity dicts in place.
    for entity in entities:
        try:
            modified = None
            entity_text = entity['entity_text']
            offset = entity['offset']
            length = int(entity['length'])  # lengths are strings in the sample data
            position = text.find(entity_text, offset)
            if position == offset:
                before = text[:position]
                after = text[position + length:]
                if entity['type'] == 'text_link':
                    text_link = '<a href="{}">{}</a>'.format(entity['url'], entity_text)
                    text = before + text_link + after
                    modified = 15 + len(entity['url'])  # len('<a href="">') + len('</a>')
                elif entity['type'] == 'code':
                    code = '<code>{}</code>'.format(entity_text)
                    text = before + code + after
                    modified = 13
                elif entity['type'] == 'bold':
                    bold_text = '<b>{}</b>'.format(entity_text)
                    text = before + bold_text + after
                    modified = 7
                elif entity['type'] == 'italic':
                    italic_text = '<i>{}</i>'.format(entity_text)
                    text = before + italic_text + after
                    modified = 7
                elif entity['type'] == 'pre':
                    pre_code = '<pre>{}</pre>'.format(entity_text)
                    text = before + pre_code + after
                    modified = 11
                if modified:
                    # Shift every later entity by the number of characters just inserted.
                    for other in entities:
                        if other['offset'] > offset:
                            other['offset'] += modified
        except (KeyError, ValueError):
            pass  # skip malformed entities
    return text
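For comparison, here is a minimal sketch of a different technique from the one above, not the poster's method: build the output in one left-to-right pass over the original text, so no offset ever has to be adjusted. It assumes the entities do not overlap and that offsets are plain Python string indices. The names `format_html_single_pass` and `TAGS` are made up for this illustration, and the `html.escape` calls are an addition the thread does not ask for. The question's sample lists offset 41 for "printing", but in the string as written here it starts at index 40, so the sketch uses 40.

```python
import html

# Hypothetical tag table; the tag names mirror the ones used in the answer above.
TAGS = {"bold": "b", "italic": "i", "code": "code", "pre": "pre"}


def format_html_single_pass(text, entities):
    """Build the HTML left to right so original offsets never need adjusting."""
    parts = []
    cursor = 0  # index into the original, unformatted text
    for entity in sorted(entities, key=lambda e: e["offset"]):
        start = entity["offset"]
        end = start + int(entity["length"])  # lengths may arrive as strings
        if start < cursor or end > len(text):
            continue  # assumption: skip overlapping or out-of-range entities
        parts.append(html.escape(text[cursor:start]))  # plain text before the entity
        inner = html.escape(text[start:end])
        if entity["type"] == "text_link":
            url = html.escape(entity["url"], quote=True)
            parts.append('<a href="{}">{}</a>'.format(url, inner))
        elif entity["type"] in TAGS:
            tag = TAGS[entity["type"]]
            parts.append("<{0}>{1}</{0}>".format(tag, inner))
        else:
            parts.append(inner)  # unknown type: keep the text unformatted
        cursor = end
    parts.append(html.escape(text[cursor:]))  # trailing plain text
    return "".join(parts)


text = "Lorem Ipsum is simply dummy text of the printing and typesetting industry"
entities = [
    {"type": "bold", "offset": 0, "length": "11"},
    {"type": "italic", "offset": 22, "length": "5"},
    # offset 40 is where "printing" starts in this string (assumption, see above)
    {"type": "text_link", "offset": 40, "length": "8", "url": "google.com"},
]
print(format_html_single_pass(text, entities))
```

Because every slice is taken from the untouched input, the work is one sort plus one pass, and the entity dicts are never mutated. Nested or overlapping entities would need a different strategy, which this sketch deliberately leaves out.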