I would like to use a feature set (vector) for my data in Python for my machine learning algorithm. How can I do it?

Question

I have data in the following form:

   Class           Feature set list
   classlabel1 -    [size,time]      example:[6780.3,350.00]
   classlabel2 -    [size,time]
   classlabel3 -    [size,time]
   classlabel4 -    [size,time]

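How do I save this data in an Excel sheet, and how can I train a model using this feature set? I am currently working with an SVM classifier.

I have tried saving the feature set lists in a DataFrame and writing that DataFrame to a CSV file, but size and time end up split into two different columns.

The DataFrame is saved in the CSV file like this: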

col 0    col1        col2
62309   396.5099154  label1

I would like to train and test on the feature vector [size,time] combined. Is this possible, and is it the right approach? If so, how can I do it?

Answer 1

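Since size and time are different features, you should keep them in two separate columns so that your model can assign a separate weight to each of them, i.e.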

# data.csv
size      time      label
6780.3    350.00    classLabel1
...
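To make the point about separate weights concrete, here is a minimal sketch. It assumes a linear-kernel SVC and a few made-up [size, time] rows (the numbers and the 0/1 labels are hypothetical, not taken from the question). Each row of X is one [size, time] feature vector, so keeping the two values in separate columns still means the model trains on the pair together.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: one row per sample, columns are [size, time].
X = np.array([
    [100.0, 150.0],
    [240.0, 180.0],
    [200.0, 250.0],
    [300.0, 400.0],
])
y = np.array([0, 0, 1, 1])

# A linear kernel exposes the learned weights through coef_.
clf = SVC(kernel="linear")
clf.fit(X, y)

# One weight per feature column: the first for size, the second for time.
weights = clf.coef_
print(weights.shape)
```

With two classes and two feature columns, coef_ holds a single row with two entries, one per feature.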

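If you want to convert the data you have into the format above, you can use pandas.read_excel and use ast to convert the string representation of each list into a Python list object.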

import pandas as pd
import ast

df = pd.read_excel("data.xlsx")
# Parse each "[size, time]" string into a Python list (once per row)
size_time = [ast.literal_eval(x) for x in df["Feature set list"]]

size = [x[0] for x in size_time]
time = [x[1] for x in size_time]
label = df["Class"]

new_df = pd.DataFrame({"size":size, "time":time, "label":label})
# This will result in the DataFrame below.
#   size  time        label
# 6780.3 350.0  classlabel1

# Save DataFrame to csv (index=False avoids writing an extra index column)
new_df.to_csv("data_fix.csv", index=False)

# Use it
x = new_df.drop("label", axis=1)
y = new_df.label

# Further data preparation, such as split the dataset
# into train and test set, etc.
...
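If you want to try the conversion without an Excel file on disk, the following sketch runs the same idea on an in-memory DataFrame. The two rows are a hypothetical stand-in for the spreadsheet, and it assumes the "Feature set list" cells are stored as strings such as "[6780.3, 350.00]".

```python
import ast
import pandas as pd

# Hypothetical stand-in for the contents of the spreadsheet.
raw = pd.DataFrame({
    "Class": ["classlabel1", "classlabel2"],
    "Feature set list": ["[6780.3, 350.00]", "[62309, 396.5099154]"],
})

# Parse each string into a Python list, once per row.
parsed = raw["Feature set list"].apply(ast.literal_eval)

# Expand the parsed lists into two numeric columns and attach the label.
features = pd.DataFrame(parsed.tolist(), columns=["size", "time"], index=raw.index)
tidy = features.assign(label=raw["Class"])
print(tidy)
```

ast.literal_eval only evaluates Python literals, so it is a safer choice than eval for cell contents you did not write yourself.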

Hope this helps.

Answer 2

First, to answer your question:

I would like to train and test on the feature vector [size,time] combined. Is it possible and is this a right way? If it is possible, how can I do it?

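Combining the two into one value is not the right approach. They are on different scales (assuming they are what their names suggest), and merging them would lose the information each one provides. For any supervised ML algorithm they are two completely independent features, so I suggest treating them as separate features rather than merging them into one.

Now to the next part: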
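As a small illustration of the information loss, assume "combining" means collapsing size and time into a single number, for example their sum. That is only one possible reading, and the values below are made up. Two clearly different samples then become indistinguishable:

```python
import numpy as np

# Hypothetical samples in [size, time] order.
a = np.array([100.0, 150.0])
b = np.array([150.0, 100.0])

# Collapsed into one number, the two samples look identical.
same_when_merged = a.sum() == b.sum()

# Kept as separate features, they are still distinguishable.
same_as_vectors = np.array_equal(a, b)

print(same_when_merged, same_as_vectors)
```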

How do I save this data in excel sheet and how can I train the model using this feature set? Currently I am working on SVM classifier.

Storing the data: in my opinion you can store the data in whatever format you like, but I prefer CSV because it is convenient and data files load faster.

sample_data.csv

size,time,class_label
100,150,label1
200,250,label2
240,180,label1
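A file in this layout can be produced directly from a DataFrame. The sketch below reuses the three sample rows above and writes to an in-memory string instead of a real file, which is a choice made here so the example has no side effects. Passing a path such as "sample_data.csv" to to_csv would write the file instead.

```python
import io
import pandas as pd

# The three sample rows, one column per feature plus the label.
df = pd.DataFrame({
    "size": [100, 200, 240],
    "time": [150, 250, 180],
    "class_label": ["label1", "label2", "label1"],
})

# With no path given, to_csv returns the CSV text; index=False omits the row index.
csv_text = df.to_csv(index=False)
print(csv_text)

# Reading the text back gives the same columns and values.
loaded = pd.read_csv(io.StringIO(csv_text))
```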

Below is code that reads the data from the CSV and trains an SVM:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

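# Load the data. Malformed rows raise an error by default; the
# error_bad_lines / warn_bad_lines arguments were removed in pandas 2.0.
data = pd.read_csv("sample_data.csv")

# Split into the target (Y) and the feature matrix (X).
# The column name must match the CSV header, here "class_label".
Y = data.class_label.values
X = data.drop("class_label", axis=1).values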

# Encode the class labels as integers
label_encoded_Y = preprocessing.LabelEncoder().fit_transform(Y)

# Split into training and test sets (80% / 20%)
x_train, x_test, y_train, y_test = train_test_split(
    X, label_encoded_Y, train_size=0.8, test_size=0.2)

# Now use whichever training algorithm you want
clf = SVC(gamma='auto')
clf.fit(x_train, y_train)

# Predict on the test set and measure accuracy
y_pred = clf.predict(x_test)
print(accuracy_score(y_test, y_pred))
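Because size and time are on different scales, it may help to standardize the features before fitting the SVM. This is an addition to the answer, not something it prescribes. The sketch below assumes synthetic data (two made-up clusters generated with NumPy) and uses scikit-learn's make_pipeline with StandardScaler so the scaling is learned from the training split only.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic, hypothetical data: two clusters with different size/time ranges.
rng = np.random.default_rng(0)
size = np.concatenate([rng.normal(5000, 500, 50), rng.normal(60000, 5000, 50)])
time = np.concatenate([rng.normal(300, 30, 50), rng.normal(400, 30, 50)])
X = np.column_stack([size, time])
y = np.array(["label1"] * 50 + ["label2"] * 50)

# Stratify so both classes appear in the training and test sets.
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# The scaler is fit on the training data, then applied before the SVM.
model = make_pipeline(StandardScaler(), SVC())
model.fit(x_train, y_train)

acc = accuracy_score(y_test, model.predict(x_test))
print(acc)
```

SVC accepts string class labels directly, so the LabelEncoder step is optional in this variant.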
