Python csv: cell-by-cell iteration to find a greater value and delete the row
What I am trying to do is similar to this MySQL DELETE statement:
DELETE FROM ABCD WHERE val_2001>val_2000*1.5 OR val_2001>val_1999*POW(1.5,2);
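where val_2001, val_2000 and val_1999 are all column names. So the query performs these 3 operations:
1. Comparing col-b with col-a
2. ORing that with a comparison of col-b against col-1999 (a constant)
3. Deleting the whole row from the table if the condition is satisfied.
I want to write this in Python rather than MySQL, because the data is a csv and I want to avoid uploading it to a database.
My current code is as follows: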
df = pd.read_csv("singleDataFile.csv")
for values in xrange(2000,2016):
    val2 = values+1
    df['val_'+str(val2)] = df['val_'+str(val2)].where((df['val_'+str(val2)]>df['val_'+str(values)]*1.5) | (df['val_'+str(val2)]<df['val_'+str(values)]*0.75))
print(df)
I also tried an alternative approach:
df = pd.read_csv("singleDataFile.csv")
cols = [ 'val_{}'.format(c) for c in range(2000, 2018)]
df = pd.DataFrame(df, columns = cols)
df[(df.shift(axis = 1) > df * 1.5) | (df.shift(axis = 1) < df * 0.75)] = 'NULL'
In both cases it ends with an error:
TypeError: can't multiply sequence by non-int of type 'float'
Moreover, neither approach even attempts to delete the whole row. How can this be done?
CSV table snippet:
val_2000 val_2001 val_2002 val_2003
100 112.058663384525 119.070787312921 117.033250060214
100 118.300395256917 124.655238202362 128.723125524235
100 109.333236619151 116.785836024946 117.390803371386
100 120.954175930764 126.099776250454 124.491022271481
100 107.776153227575 105.560100052722 108.07490649383
100 151.596517146962 306.608812920781 124.610273175528
Note: there are some columns before val_2000, such as an index column and some name columns, which should also be left out of the iteration.
It seems you need any to check for at least one True per row, then invert with ~ and filter by boolean indexing:
# convert all values to float
df = df.astype(float)
# if some values are not numeric (e.g. stray strings), coerce them to NaN instead
#df = df.apply(pd.to_numeric, errors='coerce')
print((df.shift(axis=1) > df * 1.5) | (df.shift(axis=1) < df * 0.75))
val_2000 val_2001 val_2002 val_2003
0 False False False False
1 False False False False
2 False False False False
3 False False False False
4 False False False False
5 False True True True
print(~((df.shift(axis=1) > df * 1.5) | (df.shift(axis=1) < df * 0.75)).any(axis=1))
0 True
1 True
2 True
3 True
4 True
5 False
dtype: bool
df = df[~((df.shift(axis=1) > df * 1.5) | (df.shift(axis=1) < df * 0.75)).any(axis=1)]
print(df)
val_2000 val_2001 val_2002 val_2003
0 100 112.058663 119.070787 117.033250
1 100 118.300395 124.655238 128.723126
2 100 109.333237 116.785836 117.390803
3 100 120.954176 126.099776 124.491022
4 100 107.776153 105.560100 108.074906
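The TypeError in the question most likely comes from text reaching the arithmetic: multiplying a Python string by 1.5 raises exactly that message, and the name columns before val_2000 (or numbers read in as strings) would trigger it. A minimal sketch of one way to handle both that and the note about the leading columns is below. The toy frame, the `id` and `name` columns, and the regex are assumptions for illustration, not taken from the real file.

```python
import pandas as pd

# Hypothetical stand-in for singleDataFile.csv: two leading text columns,
# and year values that arrive as strings.
df = pd.DataFrame({
    "id": ["r0", "r1", "r2"],
    "name": ["x", "y", "z"],
    "val_2000": ["100", "100", "100"],
    "val_2001": ["112.06", "151.60", "109.33"],
    "val_2002": ["119.07", "306.61", "116.79"],
})

# Work only on the year columns so text columns never reach the arithmetic;
# anything that cannot be parsed as a number becomes NaN.
vals = (
    df.filter(regex=r"^val_\d{4}$")
      .apply(pd.to_numeric, errors="coerce")
      .astype(float)
)

# Compare each year with the previous one (values shifted one column right).
prev = vals.shift(axis=1)
bad = ((prev > vals * 1.5) | (prev < vals * 0.75)).any(axis=1)

# Drop the flagged rows from the full frame; id and name are preserved.
result = df[~bad]
print(result)
```

Because the mask is built from `vals` but applied to `df`, the non-numeric columns survive the filtering untouched.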
IIUC you need:
const = ['val_'+ str(x) for x in range(1995,2000)]
print (const)
['val_1995', 'val_1996', 'val_1997', 'val_1998', 'val_1999']
for x in const:
    df[x] = 1
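Once val_1999 exists as a column (for instance a constant one, as set above), the DELETE statement from the question could be translated almost literally. This is only a sketch on made-up numbers: the values, including the 100 used for val_1999, are assumptions chosen so that each branch of the OR fires at least once.

```python
import pandas as pd

# Hypothetical toy data; only the column names come from the question.
df = pd.DataFrame({
    "val_1999": [100.0, 100.0, 100.0],
    "val_2000": [100.0, 100.0, 160.0],
    "val_2001": [112.0, 151.0, 230.0],
})

# WHERE val_2001 > val_2000*1.5 OR val_2001 > val_1999*POW(1.5, 2)
to_delete = (df["val_2001"] > df["val_2000"] * 1.5) | (
    df["val_2001"] > df["val_1999"] * 1.5 ** 2
)

# DELETE: keep only the rows that do not match the condition.
kept = df[~to_delete]
print(kept)
```

The second row is caught by the first comparison and the third row only by the second one, so a single row remains.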