Automating the plotting of more than 100 .txt files using pandas, NaN problems

Question

Good afternoon,

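I am trying to import more than 100 separate .txt files containing data that I want to plot. I would like to automate this process, since repeating the same steps for every individual file is very tedious.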

I have read up on how to read multiple .txt files and found a nice explanation. However, when I follow the example, all my data gets imported as NaNs. I read up some more and found a more reliable way of importing .txt files, namely pd.read_fwf(), as can be seen here.

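While I can now at least see my data, I have no idea how to plot it, since everything ends up in a single column with the values separated by \t, e.g.

0 Time (s)\tLoad (kN)\tMachine Extension (mm)\tExtension from Preload

1 0.000000\t\t\t

2 0.152645\t0.000059312\t.....

...and so on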

I have tried different separators in both pd.read_csv() and pd.read_fwf(), including ' ', '\t' and '-s+', but none of them worked.

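Of course this is a problem, because I now cannot plot my data. Speaking of which, I am also not sure how to plot the data once it is in the DataFrame. I want to plot the data of each .txt file separately on the same scatter plot.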

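I am very new to Stack Overflow, so please excuse me if the format of the question does not follow the conventions. I have attached my code below, but unfortunately I cannot attach my .txt files. Each .txt file contains about a thousand lines of data. I have attached a picture of the general format of all the files: General format of the .txt files.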

import numpy as np
import pandas as pd
from matplotlib import pyplot as pp
import os
import glob

# change the working directory
os.chdir(r"C:\Users\Philip de Bruin\Desktop\Universiteit van Pretoria\Nagraads\sterktetoetse_basislyn\trektoetse\speel")

# get the file names
leggername = [i for i in glob.glob("*.txt")]

# put everything in a dataframe
df = [pd.read_fwf(legger) for legger in leggername]
df

Edit: The DataFrame output I now get is:

[     Time (s)\tLoad (kN)\tMachine Extension (mm)\tExtension
0
1     0.000000\t\t\t
2
3     0.152645\t0.000059312\t-...
4
...
997   76.0173\t0.037706\t0.005...
998
999   76.1699\t0.037709\t\t
1000
1001

      from  Preload  (mm)

0      NaN      NaN   NaN
1      NaN      NaN   NaN
2      NaN      NaN   NaN
3      NaN      NaN   NaN
4      NaN      NaN   NaN
...
997    NaN      NaN   NaN
998    NaN      NaN   NaN
999    NaN      NaN   NaN
1000   NaN      NaN   NaN
1001   NaN      NaN   NaN

[1002 rows x 4 columns],      Time (s)\tLoad (kN)\tMachine Extension (mm)\tExtension
0
1     0.000000\t\t\t
2
3     0.128151\t0.000043125\t-...
4
...
997   63.8191\t0.034977\t-0.00...
998
999   63.9473\t0.034974\t\t
1000
1001

      from  Preload  (mm)

0      NaN      NaN   NaN
1      NaN      NaN   NaN
2      NaN      NaN   NaN
3      NaN      NaN   NaN
4      NaN      NaN   NaN
...
997    NaN      NaN   NaN
998    NaN      NaN   NaN
999    NaN      NaN   NaN
1000   NaN      NaN   NaN
1001   NaN      NaN   NaN

[1002 rows x 4 columns],      Time (s)\tLoad (kN)\tMachine Extension (mm)\tExtension
0
1     0.000000\t\t\t
2
3     0.174403\t0.000061553\t0...
4
...
997   86.8529\t0.036093\t-0.00...
998
999   87.0273\t\t-0.0059160\t-...
1000
1001

      from  Preload  (mm)

0      NaN      NaN   NaN
1      NaN      NaN   NaN
2      NaN      NaN   NaN
3      NaN      NaN   NaN
4      NaN      NaN   NaN
...
997    NaN      NaN   NaN
998    NaN      NaN   NaN
999    NaN      NaN   NaN
1000   NaN      NaN   NaN
1001   NaN      NaN   NaN
...and so on

Answer 1

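The basic gist is to skip the first data row (which contains only a single value), read each file with pd.read_csv using tabs as the separator, and then stack the resulting DataFrames together.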
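As a minimal sketch of that idea on correctly decoded text, the snippet below parses an in-memory sample instead of a real file. The sample values are made up for illustration, and the full name of the last column is an assumption pieced together from the output shown in the question.

```python
import io

import pandas as pd

# Hypothetical sample mimicking the layout described in the question:
# a header row, a row holding a single value, then tab-separated data.
sample = (
    "Time (s)\tLoad (kN)\tMachine Extension (mm)\tExtension from Preload (mm)\n"
    "0.000000\t\t\t\n"
    "0.152645\t0.000059312\t-0.000010\t-0.000010\n"
    "0.305290\t0.000118000\t-0.000020\t-0.000020\n"
)

# skiprows=[1] drops line 1 of the file (0-based), i.e. the single-value row;
# line 0 is still used as the header.
df = pd.read_csv(io.StringIO(sample), sep="\t", skiprows=[1])
print(df.dtypes)
```

Because the single-value row is skipped, every column can be parsed as a numeric type, which is what plotting needs.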

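However, there is a trickier problem: the data files turn out to be UTF-16 encoded (the binary data shows NUL characters at the even positions), but without a byte-order mark (BOM) to indicate this. As a result, you cannot specify the encoding in read_csv, but have to read each file manually as binary, decode it to a string using UTF-16, and then feed that string to read_csv. Since read_csv requires a filename or an IO stream, the text data first needs to be put into a StringIO object (or you can save the corrected data to disk first and then read the corrected file, which may not be a bad idea).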
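To check whether a file is affected, it can help to look at its first raw bytes. The sketch below is a rough heuristic, not a general encoding detector: `has_utf16_bom` and `looks_like_utf16_le` are hypothetical helper names, and the sample bytes are generated in memory rather than taken from the real files. It also assumes little-endian byte order, which is what a NUL byte after each ASCII character suggests.

```python
import codecs
import io

import pandas as pd


def has_utf16_bom(data: bytes) -> bool:
    """Return True if the data starts with a UTF-16 byte-order mark."""
    return data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))


def looks_like_utf16_le(data: bytes, probe: int = 64) -> bool:
    """Heuristic: in little-endian UTF-16, ASCII text has a NUL as every second byte."""
    high_bytes = data[:probe][1::2]
    return len(high_bytes) > 0 and all(b == 0 for b in high_bytes)


# Made-up sample, encoded without a BOM (the 'utf-16-le' codec never writes one)
raw = "Time (s)\tLoad (kN)\n0.0\t\n0.1\t0.5\n".encode("utf-16-le")

print(has_utf16_bom(raw), looks_like_utf16_le(raw))

# Naming the byte order explicitly means no BOM is needed for decoding
text = raw.decode("utf-16-le")
df = pd.read_csv(io.StringIO(text), sep="\t", skiprows=[1])
print(df)
```

If the assumption about byte order holds for the real files, passing an explicit `encoding='utf-16-le'` to read_csv might also work, but that has not been verified against the original data.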

import pandas as pd
import os
import glob
import io

# change the working directory
os.chdir(r"C:\Users\Philip de Bruin\Desktop\Universiteit van Pretoria\Nagraads\sterktetoetse_basislyn\trektoetse\speel")

dfs = []
for filename in glob.glob("*.txt"):
    with open(filename, 'rb') as fp:
        data = fp.read()  # a single file should fit in memory just fine
    # Decode the UTF-16 data that is missing a BOM
    string = data.decode('UTF-16')
    # And put it into a stream, for ease-of-use with `read_csv`
    stream = io.StringIO(string) 

    # Read the data from the, now properly decoded, stream
    # Skip the single-value row, and use tabs as separators
    df = pd.read_csv(stream, sep='\t', skiprows=[1])

    # To keep track of the individual files, add an "origin" column
    # with its value set to the corresponding filename
    df['origin'] = filename
    dfs.append(df)

# Concatenate all dataframes (default is to stack the rows);
# ignore_index=True avoids duplicate row labels across files
df = pd.concat(dfs, ignore_index=True)


# For a quick and dirty plot, you can enjoy the power of Seaborn
import seaborn as sns
# Use appropriate (full) column names, and use the 'origin' 
# column for the hue and symbol
sns.scatterplot(data=df, x='Time (s)', y='Machine Extension (mm)', hue='origin', style='origin')

Seaborn's scatterplot documentation.
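With more than 100 files, one legend entry per file can get crowded. As an alternative sketch using plain matplotlib (which the question already imports), the combined frame can be grouped by the `origin` column and each group drawn on the same axes. The small frame below is a made-up stand-in for the concatenated data; the file names and values are hypothetical.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; remove this line to get a window

import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for the concatenated DataFrame with its 'origin' column
df = pd.DataFrame({
    "Time (s)": [0.0, 0.1, 0.2, 0.0, 0.1, 0.2],
    "Machine Extension (mm)": [0.0, 0.01, 0.02, 0.0, 0.02, 0.04],
    "origin": ["a.txt"] * 3 + ["b.txt"] * 3,
})

fig, ax = plt.subplots()
# One scatter series per source file, all on the same axes
for origin, group in df.groupby("origin"):
    ax.scatter(group["Time (s)"], group["Machine Extension (mm)"], s=10, label=origin)

ax.set_xlabel("Time (s)")
ax.set_ylabel("Machine Extension (mm)")
ax.legend(title="origin")
```

Calling `plt.show()` or `fig.savefig(...)` afterwards displays or stores the figure. For a large number of files, the `ax.legend(...)` call can simply be left out.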

python

matplotlib

pandas