Update list of tuples in Python
I have a dataframe in which each row is a list of tuples, for example:
[('This', 'DET'), ('is', 'VERB'), ('an', 'DET'), ('example', 'NOUN'), ('text', 'NOUN'), ('that', 'DET'), ('I', 'PRON'), ('use', 'VERB'), ('in', 'ADP'), ('order', 'NOUN'), ('to', 'PART'), ('get', 'VERB'), ('an', 'DET'), ('answer', 'NOUN')]
Then, in each row, I tag the words of some tuples with <IN>word</IN> or <TA>word</TA>. For example:
updated_word : <IN>example</IN>
updated_word : <TA>answer</TA>
I would like to update each row of the dataframe so that it contains the updated version of my tuples, something like:
[('This', 'DET'), ('is', 'VERB'), ('an', 'DET'), ('<IN>example</IN>', 'NOUN'), ('text', 'NOUN'), ('that', 'DET'), ('I', 'PRON'), ('use', 'VERB'), ('in', 'ADP'), ('order', 'NOUN'), ('to', 'PART'), ('get', 'VERB'), ('an', 'DET'), ('<TA>answer</TA>', 'NOUN')]
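I managed to update each tuple separately, but I can't find a way to attach them back to the dataframe row and update each row's list of tuples. Can anyone help me?
The code is as follows: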
cols = list(df.columns)[4:]
for idx, row in df.iterrows():
    doc = nlp(row['title'])
    pos_tags = [(token.text, token.pos_) for token in doc if not token.pos_ == "PUNCT"]
    for position, tuple_ in enumerate(pos_tags, start=1):
        word = tuple_[0]
        spacy_pos_tag = tuple_[1]
        word = re.sub(r'[^\w\s]', '', word)
        for col in cols:
            if position in row[col]:
                word = f'<{col.upper()}>{word}</{col.upper()}>'
            else:
                word = word
        new_text.append(' '.join(word))
        tuple_ = (word, spacy_pos_tag)
        pos_tags[position] = tuple_
df['title'] = pos_tags
print(df.title)
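Update
I used @Peter White's suggestion to get the list of tuples, but when I try to assign each pos_tags list of tuples to each row of the dataframe column, I still get an error. The error message is: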
raise ValueError(
ValueError: Length of values (23) does not match length of index (500)
How about putting pos_tags[position] = tuple_ at the end? Also, remove start=1 from enumerate:
cols = list(df.columns)[4:]
for idx, row in df.iterrows():
    doc = nlp(row['title'])
    pos_tags = [(token.text, token.pos_) for token in doc if not token.pos_ == "PUNCT"]
    for position, tuple_ in enumerate(pos_tags):
        word = tuple_[0]
        spacy_pos_tag = tuple_[1]
        word = re.sub(r'[^\w\s]', '', word)
        for col in cols:
            if position in row[col]:
                word = f'<{col.upper()}>{word}</{col.upper()}>'
            else:
                word = word
        new_text.append(' '.join(word))
        tuple_ = (word, spacy_pos_tag)
        print(tuple_)
        pos_tags[position] = tuple_
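The ValueError in the update most likely comes from assigning one row's pos_tags (23 tuples) to the whole column (500 rows). Below is a minimal sketch of one way around that, not a definitive fix. It is written without spaCy or pandas so it runs on its own. The helper name tag_tokens, the rows stand-in data, and the 'in' / 'ta' column names are assumptions made for illustration, and the positions are assumed to be 0-based.

```python
import re


def tag_tokens(pos_tags, positions_by_col):
    """Return a new list of (word, pos) tuples with selected words tagged.

    pos_tags: list of (word, pos) tuples for one title.
    positions_by_col: dict mapping a column name (e.g. 'in', 'ta') to the
        0-based token positions that should be wrapped in <COL>...</COL>.
    """
    updated = []
    for position, (word, pos) in enumerate(pos_tags):
        word = re.sub(r'[^\w\s]', '', word)  # strip punctuation characters
        for col, positions in positions_by_col.items():
            if position in positions:
                word = f'<{col.upper()}>{word}</{col.upper()}>'
        # Append to a fresh list instead of writing back by index.
        updated.append((word, pos))
    return updated


# Hypothetical stand-in for two dataframe rows.
rows = [
    {'pos_tags': [('an', 'DET'), ('example', 'NOUN'), ('text', 'NOUN')],
     'in': [1], 'ta': []},
    {'pos_tags': [('an', 'DET'), ('answer', 'NOUN')],
     'in': [], 'ta': [1]},
]

# Build exactly one list of tuples per row, so the result has as many
# entries as there are rows.
tagged_rows = [
    tag_tokens(row['pos_tags'], {'in': row['in'], 'ta': row['ta']})
    for row in rows
]

print(tagged_rows[0])
print(tagged_rows[1])
```

With the real dataframe, the same idea would be to append each row's updated pos_tags to a list inside the iterrows loop and assign it once after the loop, e.g. df['title'] = tagged_rows, since that list then has one entry per row. If the position columns hold 1-based positions, enumerate(pos_tags, start=1) can be kept for the comparison, because the sketch appends to a new list rather than indexing pos_tags by position.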