List slice for each row in a column of a DataFrame
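I have a column that contains strings. I want to transform this column so that I end up with only the first n words of each string.
I know I need to split each string and then slice the resulting list to keep the first n words. I can then use join to put them back together. However, I am running into trouble doing this.
I expected the following to work: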
import pandas as pd
data = [[1, "A complete sentence must have, at minimum, three things: a subject, verb, and an object. The subject is typically a noun or a pronoun."], [2, "And, if there's a subject, there's bound to be a verb because all verbs need a "], [3, "subject. Finally, the object of a sentence is the thing that's being acted upon by the subject."], [4, "So, you might say, Claire walks her dog. In this complete "]]
df = pd.DataFrame(data, columns = ['id', 'text'])
df['first_three'] = df['text'].str.split()[:3]
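But the [:3] here slices the Series returned by str.split(), so it keeps the first three rows instead of the first three words of each row. On assignment the fourth row has no matching value and becomes NaN.
So the result looks like this: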
first_three
['A', 'complete', 'sentence', 'must', 'have,', 'at', 'minimum,', 'three', 'things:', 'a', 'subject,', 'verb,', 'and', 'an', 'object.', 'The', 'subject', 'is', 'typically', 'a', 'noun', 'or', 'a', 'pronoun.']
['And,', 'if', "there's", 'a', 'subject,', "there's", 'bound', 'to', 'be', 'a', 'verb', 'because', 'all', 'verbs', 'need', 'a']
['subject.', 'Finally,', 'the', 'object', 'of', 'a', 'sentence', 'is', 'the', 'thing', "that's", 'being', 'acted', 'upon', 'by', 'the', 'subject.']
NaN
I would like the first_three column to look like this:
first_three
[A, complete, sentence]
[And, if, there's]
[subject, Finally, the]
[So, you, might]
That way I can join them and move on.
I know this must be an easy fix, but I can't seem to find the solution.
Any input is much appreciated.
You can use the apply function to take the required number of elements from each list.
df['first_three'] = df['text'].str.split().apply(lambda x : x[:3])
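As a possible alternative to the lambda, the .str accessor can also slice the list in each row, and .str.join can glue the kept words back into one string, which is the step the question mentions doing afterwards. The sketch below uses a shortened two-row sample; the variable n and the column name first_n are illustrative assumptions, not part of the original post.

```python
import pandas as pd

# Shortened sample data (assumption: only the start of each sentence is needed here)
df = pd.DataFrame({
    "id": [1, 2],
    "text": ["A complete sentence must have", "And, if there's a subject"],
})

n = 3  # number of leading words to keep (hypothetical parameter)

# Split each row into a list of words, slice each list row-wise, then rejoin
words = df["text"].str.split()
df["first_n"] = words.str[:n].str.join(" ")

print(df["first_n"].tolist())
```

The key difference from the attempt in the question is where the slice is applied: words[:n] slices the Series (rows), while words.str[:n] slices the list inside each row.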
If you also want to do some text cleaning, you can do this:
df['first_three'] = df['text'].str.replace(",", " ")
df['first_three'] = df['first_three'].apply(lambda x : x.split()[:3])
Output
first_three
[A, complete, sentence]
[And, if, there's]
[subject., Finally, the]
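Replacing only commas leaves other punctuation in place, which is why the third row above still shows subject. with its period. A minimal sketch of a broader cleanup, assuming you want to drop every punctuation character except apostrophes, is to use a regular expression in str.replace; the pattern below is an illustrative choice, not the answer's original method.

```python
import pandas as pd

data = [
    [1, "A complete sentence must have, at minimum, three things"],
    [2, "And, if there's a subject, there's bound to be a verb"],
    [3, "subject. Finally, the object of a sentence"],
    [4, "So, you might say, Claire walks her dog."],
]
df = pd.DataFrame(data, columns=["id", "text"])

# Replace anything that is not a word character, whitespace, or apostrophe
# with a space, then split on whitespace and keep the first three words per row
cleaned = df["text"].str.replace(r"[^\w\s']", " ", regex=True)
df["first_three"] = cleaned.str.split().str[:3]

print(df["first_three"].tolist())
```

Passing regex=True explicitly avoids depending on the default, which has changed between pandas versions.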