How to delete some rows with comments from a CSV file to load data to DataFrame?

Question

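I have a fairly large CSV data file (around 80 MB). When I open it in MS Excel, I see that it contains 100 columns and many rows of data. However, the first row is not the column names but a web link. In addition, the last two rows are comments. Now I want to load this data into a pandas DataFrame: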

import pandas as pd
df = pd.read_csv('myfile.csv')

Then I want to read a column called Duration (I can see that it exists in the CSV file) and remove the word years from its values:

Duration = map(lambda x: float(x.rstrip('years')), df['Duration'])

It gives me this error:

AttributeError: 'float' object has no attribute 'rstrip'

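If I open the file in MS Excel and delete the first row (the web link) and the last two rows (the comments), then the code works!

So how can I clean this CSV file automatically in Python (to extract only the columns with values)?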

Update: When I write print df.head(), it outputs:

have mixed types. Specify dtype option on import or set low_memory=False.

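Do I need to specify the type for all 100 columns? What if I don't know the types a priori?

Update: I cannot attach the file, but as an example you can look at this one. Download the file 2015-2016.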

Answer 1

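To skip the first line, you can use the skiprows option of read_csv. If the last two lines are not too tricky (i.e. they don't cause a parsing error), you can use .iloc to ignore them. Finally, a vectorized version of rstrip is available through the str attribute of the Duration column, assuming it contains strings.

See the following code for an example: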

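import pandas as pd
from io import StringIO  # On Python 2 use: from StringIO import StringIO
# Sample file: a link on the first line and two comment lines at the end.
content = StringIO("""http://www.example.com
col1,col2,Duration
1,11,5 years
2,22,4 years
3,33,2 years
# Some comments in the
# last two lines here.
""")
# Skip the link line, then drop the last two (comment) rows.
df = pd.read_csv(content, skiprows=1).iloc[:-2]
# rstrip removes any trailing characters from the given set, here " years".
df['Duration'] = df.Duration.str.rstrip(' years').astype(float)
print(df)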

Output:

  col1  col2 Duration
0    1    11       5 
1    2    22       4 
2    3    33       2
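
For context on the AttributeError in the question, here is a minimal sketch of what the Duration column looks like when the trailing comment lines are left in. It reuses the made-up sample data from above; the variable name raw is just for illustration. The comment lines contain no commas, so pandas fills their missing fields with NaN, and NaN is a float, which has no rstrip method.

```python
import pandas as pd
from io import StringIO

# Same illustrative sample as above; the data is made up for this sketch.
content = StringIO("""http://www.example.com
col1,col2,Duration
1,11,5 years
2,22,4 years
3,33,2 years
# Some comments in the
# last two lines here.
""")

raw = pd.read_csv(content, skiprows=1)

# The two comment lines are parsed as rows whose Duration field is missing.
print(raw['Duration'].isna().tolist())

# The missing value is a float NaN, so calling x.rstrip(...) on it fails.
print(type(raw['Duration'].iloc[-1]))
```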

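If reading speed is not a concern, you can also skip the last two lines with the skipfooter=2 option of read_csv (spelled skip_footer in older pandas versions). This makes read_csv use the Python parser engine instead of the faster C engine.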
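
A minimal sketch of the footer-skipping variant, again on the made-up sample data. Passing engine='python' explicitly is my addition here; it is meant to avoid the warning pandas emits when it has to fall back from the C engine.

```python
import pandas as pd
from io import StringIO

# Illustrative sample: a link line, a header, three rows, two comment lines.
content = StringIO("""http://www.example.com
col1,col2,Duration
1,11,5 years
2,22,4 years
3,33,2 years
# Some comments in the
# last two lines here.
""")

# Skip the first line and the last two lines while parsing.
df = pd.read_csv(content, skiprows=1, skipfooter=2, engine='python')

# Strip the trailing " years" characters and convert to float.
df['Duration'] = df['Duration'].str.rstrip(' years').astype(float)
print(df.dtypes)
```

Because the comment lines never reach the DataFrame, col1 and col2 can be inferred as integer columns rather than being widened to hold the footer text.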

Answer 2

There are a couple of parameters in pd.read_csv() that you should use:

df = pd.read_csv('myfile.csv', skiprows=1, skipfooter=2, engine='python')

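I looked at the link you provided in the comments and tried importing the file. I saw two columns with mixed data types (id and desc), so I explicitly set the dtype for these two columns. Also, by observation, the footer contains 'Total', so I excluded any lines starting with the letter T. Other than the header, valid rows should start with an integer for the id column. If other footer lines are introduced that do not start with T, reading will raise an error.

If you first download and unzip the zip file, you can proceed as follows: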

file_loc = ...  # Specify location where you saved the unzipped file.
df = pd.read_csv(file_loc, skiprows=1, skip_blank_lines=True, 
                 dtype={'id': int, 'desc': str}, comment='T')
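
If the footer text is not known in advance, one alternative sketch (not part of the answer above) is to read every column as a string and then keep only the rows whose id parses as a number. Reading with dtype=str also sidesteps the mixed-types warning without listing a type per column, and it avoids relying on comment='T', which stops parsing a line at the first T it finds. The sample data and footer lines below are invented; only the id, desc and emp_length column names come from this answer.

```python
import pandas as pd
from io import StringIO

# Invented sample: a link line, a header, data rows, and two footer lines.
content = StringIO("""http://www.example.com
id,desc,emp_length
101,first loan,10+ years
102,,1 year
103,Third loan,< 1 year
Total amount funded: 12345
Some other footer line
""")

# dtype=str reads every column as text, so no per-column types are needed.
raw = pd.read_csv(content, skiprows=1, dtype=str)

# Footer lines have a non-numeric id, which is coerced to NaN and filtered out.
ids = pd.to_numeric(raw['id'], errors='coerce')
df = raw[ids.notna()].copy()
df['id'] = ids[ids.notna()].astype(int)
print(df)
```

The other columns stay as strings after this step and can be converted one at a time once their contents have been inspected.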

This strips year or years from the emp_length column, although you are still left with text categories.

df['emp_length'] = df.emp_length.str.replace(r'\s*years?', '', regex=True)
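
If a numeric column is wanted rather than text categories, a possible follow-up is to pull the digits out with a regular expression. This is only a sketch: the sample values below ('10+ years', '< 1 year', 'n/a') are assumed for illustration and may not match the real file, and the emp_length_num column name is hypothetical. Note that mapping '< 1 year' to 1 loses information.

```python
import pandas as pd

# Assumed example values for emp_length; the real file may differ.
df = pd.DataFrame(
    {'emp_length': ['10+ years', '3 years', '1 year', '< 1 year', 'n/a']}
)

# Capture the first run of digits; values without digits become NaN.
years = df['emp_length'].str.extract(r'(\d+)', expand=False)

# Convert the captured digit strings to numbers (NaN stays NaN).
df['emp_length_num'] = pd.to_numeric(years)
print(df)
```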

python

csv

bigdata

dataframe

pandas