Error after deleting NA values twice, first with the pandas library, then with R
First, I deleted the NA values with the following Python code:
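import pandas as pd

a = pd.read_csv("true.csv", low_memory=False)
# print(a)
b = pd.read_csv("false.csv", low_memory=False)
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement.
merged = pd.concat([a, b], ignore_index=False)
# axis=1 drops every column that contains at least one NA value (not rows).
merged = merged.dropna(axis=1)
merged.to_csv("out.csv", index=False)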
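To make the effect of the `dropna(axis=1)` call concrete, here is a minimal sketch on two tiny frames. The frames and the column names `x1` and `x2` are hypothetical stand-ins for true.csv and false.csv, not the real data. It only illustrates that `axis=1` removes whole columns, while `axis=0` would remove rows instead.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for true.csv and false.csv (not the real data).
a = pd.DataFrame({"Activity": [1, 1], "x1": [0.5, np.nan], "x2": [3.0, 4.0]})
b = pd.DataFrame({"Activity": [0, 0], "x1": [0.1, 0.2], "x2": [5.0, 6.0]})

merged = pd.concat([a, b], ignore_index=True)

# axis=1: drop every column that has at least one NaN; all rows are kept.
by_column = merged.dropna(axis=1)
# axis=0: drop every row that has at least one NaN; all columns are kept.
by_row = merged.dropna(axis=0)

print(by_column.shape)  # (4, 2)
print(by_row.shape)     # (3, 3)
```

With `axis=1`, a single missing cell is enough to lose the entire feature column, so the number of columns in out.csv may be much smaller than in the inputs.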
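After that, I used Rattle and found that 2 columns were categorical, but I only want numeric data. So I removed those columns with the following code: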
cat("\nSTART\n")
startTime = proc.time()[3]
startTime
#--------------------------------------------------------------
# Step 1: Include Library
#--------------------------------------------------------------
cat("\nStep 1: Library Inclusion")
library(randomForest)
library(FSelector)
#--------------------------------------------------------------
# Step 2: Variable Declaration
#--------------------------------------------------------------
cat("\nStep 2: Variable Declaration")
modelName <- "randomForest"
modelName
InputDataFileName="out.csv"
InputDataFileName
training = 70 # Defining Training Percentage; Testing = 100 - Training
#--------------------------------------------------------------
# Step 3: Data Loading
#--------------------------------------------------------------
cat("\nStep 3: Data Loading")
dataset <- read.csv(InputDataFileName) # Read the datafile
dataset <- dataset[sample(nrow(dataset)),] # Shuffle the data row wise.
#result <- cfs(Features ~ ., dataset)
head(dataset) # Show Top 6 records
nrow(dataset) # Show number of records
names(dataset) # Show fields names or columns names
#--------------------------------------------------------------
# Step 4: Count total number of observations/rows.
#--------------------------------------------------------------
cat("\nStep 4: Counting dataset")
totalDataset <- nrow(dataset)
totalDataset
nums <- sapply(dataset, is.numeric)
dataset<-dataset[ ,nums]
#--------------------------------------------------------------
# Step 5: Choose Target variable
#--------------------------------------------------------------
cat("\nStep 5: Choose Target Variable")
target <- names(dataset)[1] # i.e. RMSD
target
#data(dataset)
result <- cfs(Activity ~ ., dataset)
In the code above, the last line uses FSelector for feature selection.
After executing the last line, I get the following error:
Error in if (sd(vec1) == 0 || sd(vec2) == 0) return(0) :
missing value where TRUE/FALSE needed
out.csv
https://drive.google.com/open?id=0B3UWvP6zFBQnN3JiamloOWl3T28
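In R, `if` raises "missing value where TRUE/FALSE needed" when its condition evaluates to NA, and `sd()` returns NA or NaN when a vector contains missing or infinite values. As a rough diagnostic, the sketch below uses pandas to list numeric columns that contain NaN or infinite values, or that are constant. The helper name `suspect_columns` and the demo frame are hypothetical. This only narrows down candidates and does not establish the cause of the error in this particular dataset.

```python
import numpy as np
import pandas as pd


def suspect_columns(df: pd.DataFrame) -> dict:
    """List numeric columns with NaN values, infinite values, or a single distinct value."""
    numeric = df.select_dtypes(include="number")
    has_nan = [c for c in numeric.columns if numeric[c].isna().any()]
    has_inf = [c for c in numeric.columns if np.isinf(numeric[c]).any()]
    # A column with at most one distinct non-missing value has zero spread.
    constant = [c for c in numeric.columns if numeric[c].nunique(dropna=True) <= 1]
    return {"nan": has_nan, "inf": has_inf, "constant": constant}


# Hypothetical demo frame; for the real file, use pd.read_csv("out.csv") instead.
demo = pd.DataFrame({
    "Activity": [1, 0, 1, 0],
    "ok": [0.1, 0.2, 0.3, 0.4],
    "with_nan": [1.0, np.nan, 3.0, 4.0],
    "with_inf": [1.0, np.inf, 3.0, 4.0],
    "flat": [7.0, 7.0, 7.0, 7.0],
    "label": ["a", "b", "c", "d"],  # non-numeric, ignored by the check
})

report = suspect_columns(demo)
print(report)
```

If all three lists come back empty for out.csv, the data itself is probably not the source of the NA, and the way the target column is typed in R is worth checking next.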
Before the last line
(result <- cfs(Activity ~ ., dataset))
add
dataset$Activity = factor(dataset$Activity)
It takes some time to run, because the dataset is very large.