Add list to existing csv
Thanks for taking a look.
I am trying to add a list of computed values (means) as a new column to an existing CSV file.
Here is my MWE:
import csv
import re
import pandas as pd
import oseti
import numpy as np
# handle csv data
df = pd.read_csv('filepath/text.csv')
analyzer = oseti.Analyzer()
dtype_before = type(df["text"])
text_list = df["text"].tolist()
# create df for sentiment analysis
list_sa = (np.mean(list(map(analyzer.analyze,text_list))).tolist())
df_sa = pd.DataFrame (list_sa, columns = ['sa_mean'])
print (df_sa)
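This part works, although I get the following warning:
Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
Despite the warning, the values print as expected (I'm new to this, so I wanted to make sure the output looks the way I intend).
The printed result looks like this: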
sa_mean
0 0.000000
1 0.000000
2 0.000000
3 -0.018519
4 0.037037
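However, instead of printing, I want to add the values as a new column to the CSV I originally loaded ('filepath/text.csv'), and I'm not sure how to go about it (does it need to be a DataFrame or a Series?).
I tried this in place of the final print line: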
df["new_column"] = df_sa
df.to_csv("text.csv", index=False)
However, I get what looks like an error. The CSV file is still created, but I would like to understand whether something is wrong:
Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
I'm not sure why this happens or how to fix it.
Thanks in advance!
Edit:
print(list_sa) looks like this:
[0.0, 0.0, 0.0, -0.018518518518518517, 0.037037037037037035, 0.037037037037037035, 0.0, 0.0, 0.0, 0.0, 0.0, -0.037037037037037035, 0.0, 0.037037037037037035, 0.0, 0.037037037037037035, 0.0, 0.0, 0.0, -0.037037037037037035, -0.012345679012345678, -0.037037037037037035, 0.0, 0.0, -0.037037037037037035, -0.037037037037037035, 0.0, 0.0, 0.0, -0.037037037037037035, -0.037037037037037035, 0.037037037037037035, 0.0, 0.0, 0.0, -0.037037037037037035, 0.0, 0.0, 0.0, -0.037037037037037035, -0.037037037037037035, 0.037037037037037035, 0.0, 0.0, -0.037037037037037035, -0.037037037037037035, 0.0, 0.037037037037037035, -0.037037037037037035, -0.037037037037037035, -0.037037037037037035, 0.037037037037037035, 0.037037037037037035, -0.037037037037037035, 0.037037037037037035, 0.037037037037037035, 0.0, 0.037037037037037035, -0.037037037037037035, 0.037037037037037035, 0.0, 0.0, -0.037037037037037035, 0.037037037037037035, 0.0, 0.037037037037037035, -0.037037037037037035, 0.0, 0.0, -0.037037037037037035, 0.0, 0.037037037037037035, 0.0, 0.0, -0.037037037037037035, -0.024691358024691357]
Use a list comprehension with np.mean and assign the result directly to a new column; df_sa is not needed here:
df = pd.read_csv('filepath/text.csv')
analyzer = oseti.Analyzer()
df['new_column'] = [np.mean(analyzer.analyze(x)) for x in df['text']]
Or use apply with a lambda function:
df['new_column'] = df['text'].apply(lambda x: np.mean(analyzer.analyze(x)))
df.to_csv("text.csv", index=False)
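For readers without oseti installed, the same pattern can be tried end to end with a stand-in analyzer. This is only a sketch under assumptions: `fake_analyze` is a hypothetical placeholder that returns one score per sentence (which is what `analyzer.analyze` appears to return, judging by the ragged lists in the question), and the CSV is read from and written to in-memory strings rather than real files.

```python
import io

import numpy as np
import pandas as pd


def fake_analyze(text):
    """Hypothetical stand-in for analyzer.analyze: one score per sentence."""
    sentences = [s for s in text.split(".") if s.strip()]
    # +1 if the sentence contains "good", -1 if it contains "bad", else 0.
    return [("good" in s) - ("bad" in s) for s in sentences]


# Three rows with 3, 1 and 2 sentences, so the per-row score lists are ragged.
csv_text = "text\ngood. bad. good.\nbad.\nplain. good.\n"
df = pd.read_csv(io.StringIO(csv_text))

# One mean per row, so the new column has exactly as many entries as df has rows.
df["new_column"] = df["text"].apply(lambda x: np.mean(fake_analyze(x)))

# Write the extended table back out (to a string here instead of a file).
out = io.StringIO()
df.to_csv(out, index=False)
print(out.getvalue())
```

Because the mean is taken inside the per-row function, NumPy never has to build an array from lists of different lengths.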
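Can you tell which statement produces the warning? You may have to run the code line by line, or add prints between the statements if you are running it as a script.
I suspect it is
np.mean(list(map(analyzer.analyze, text_list)))
The warning means that you (or something your code calls) are trying to create an array from lists of different lengths. For example: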
In [245]: alist = [[1,2,3],[4,5],[6]]
In [246]: alist
Out[246]: [[1, 2, 3], [4, 5], [6]]
In [247]: np.array(alist)
<ipython-input-247-7512d762195a>:1: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
np.array(alist)
Out[247]: array([list([1, 2, 3]), list([4, 5]), list([6])], dtype=object)
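The result is a 1-D array with object dtype. NumPy cannot build a 2-D array from a list like this.
Taking the mean of that list produces the same warning, because np.mean first has to create an array: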
In [248]: np.mean(alist)
/usr/local/lib/python3.8/dist-packages/numpy/core/_methods.py:163: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
arr = asanyarray(a)
Out[248]:
array([0.33333333, 0.66666667, 1. , 1.33333333, 1.66666667,
2. ])
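A warning does not give a traceback the way an error does, but it does show the operation that raised it. The mean is also wrong: the sublists are 'flattened' (concatenated into [1, 2, 3, 4, 5, 6]), but the divisor is 3, the number of sublists!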
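The wrong numbers above can be reproduced by hand, which shows where the divisor 3 comes from. This is a sketch of what appears to happen for an object array of lists (adding Python lists concatenates them), not NumPy's actual code path:

```python
import numpy as np

alist = [[1, 2, 3], [4, 5], [6]]

# Summing an object array of lists uses Python's list "+", i.e. concatenation.
total = []
for sub in alist:
    total = total + sub  # ends up as [1, 2, 3, 4, 5, 6]

# The "mean" then divides by the number of sublists (3), not by each length.
wrong = np.array(total) / len(alist)
print(wrong)
```

The six values printed match the array in Out[248], which is why that result has six entries instead of three.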
As jezrael suggested, we can get the mean of each sublist with:
In [249]: [np.mean(x) for x in alist]
Out[249]: [2.0, 4.5, 6.0]
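To find the offending statement without stepping through the script line by line, warnings can be promoted to exceptions so that a full traceback is produced. A minimal sketch using the standard `warnings` module; the `except` clause also catches `ValueError` because, to my knowledge, NumPy 1.24 and later raise an error for ragged input instead of warning:

```python
import traceback
import warnings

import numpy as np

alist = [[1, 2, 3], [4, 5], [6]]

caught = None
with warnings.catch_warnings():
    # Turn every warning into an exception inside this block.
    warnings.simplefilter("error")
    try:
        np.mean(alist)
    except (Warning, ValueError) as exc:
        caught = exc
        # The traceback points at the line that triggered the problem.
        traceback.print_exc()

# Computing one mean per sublist avoids building a ragged array at all.
means = [float(np.mean(x)) for x in alist]
print(means)
```

The same promotion can be applied to a whole script from the command line with `python -W error script.py`.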