csv import - how to cleverly check that the column names are "correct"?
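I am trying to model an electrical grid from data that was written by hand in csv format.
For example, I have a column that should be called 'DEPART 1'.
Instead, I often find 'Départ 1', 'DEP1', 'depart 1', ' DEPART 1 ', or many other possibilities...
Right now, I am importing it with: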
import_net_data = pd.read_excel(path_file, sheet_name=None)
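I would like to recognize the columns whose names are close to the "official name" (maybe by ignoring spaces, capitalization, ...).
Is there a proper way to:
- replace any of those incorrect strings with the correct one (without listing all the possibilities)
- check that each of those column names appears only once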
You need fuzzy string matching here. For Python, one option is the thefuzz package, which scores how similar two strings are based on Levenshtein distance.
For example:
from thefuzz import fuzz
st = 'DEPART 1'
strs = ['Départ 1', 'DEP1', 'depart 1', ' DEPART 1 ']
for s in strs:
    # fuzz.ratio returns a 0-100 similarity score (higher means closer), not a raw distance
    score = fuzz.ratio(st.lower(), s.lower())
    print(st, s, '|', 'similarity ratio:', score, 'is the same:', score > 60)
Output:
DEPART 1 Départ 1 | similarity ratio: 88 is the same: True
DEPART 1 DEP1 | similarity ratio: 67 is the same: True
DEPART 1 depart 1 | similarity ratio: 100 is the same: True
DEPART 1  DEPART 1  | similarity ratio: 89 is the same: True
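The question also mentions ignoring spaces and capitalization. As a minimal sketch of that idea, the names can be normalized (accents stripped, case folded, whitespace collapsed) before they are scored. To stay dependency-free, this example uses difflib.SequenceMatcher from the standard library rather than thefuzz, so scores may differ slightly from fuzz.ratio. The helper names normalize and similarity are made up for this illustration.

```python
import unicodedata
from difflib import SequenceMatcher


def normalize(name):
    """Strip accents, lower-case, and collapse runs of whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    return " ".join(without_accents.lower().split())


def similarity(a, b):
    """Return a 0-100 similarity score between the normalized names."""
    ratio = SequenceMatcher(None, normalize(a), normalize(b)).ratio()
    return round(100 * ratio)


for candidate in ["Départ 1", "DEP1", "depart 1", " DEPART 1 "]:
    print(repr(candidate), similarity("DEPART 1", candidate))
```

With this normalization, variants that differ only by accents, case, or surrounding spaces become identical to the official name, so only real abbreviations such as 'DEP1' still depend on the threshold.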
More information: https://www.datacamp.com/community/tutorials/fuzzy-string-python
With it you can achieve both of your goals.
"Replace any of the incorrect strings":
import pandas as pd
from thefuzz import fuzz
st = 'DEPART 1'
df = pd.DataFrame(columns=['DEPART 1','DEP1','depart 1','depart 1','not even close'])
print(df)
cols = []
for column in df.columns:
    # Rename any column that scores above the threshold against the official name
    if fuzz.ratio(st.lower(), column.lower()) > 60:
        cols.append(st)
    else:
        cols.append(column)
df.columns = cols
print(df)
Output (only the Columns lines are shown):
Columns: [DEPART 1, DEP1, depart 1, depart 1, not even close]
Columns: [DEPART 1, DEPART 1, DEPART 1, DEPART 1, not even close]
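A real sheet usually has several official names, and a fixed threshold against a single name can relabel unrelated columns. The sketch below picks, for each column, the closest entry from a list of official names and leaves the column untouched when nothing is close enough. It uses difflib.get_close_matches from the standard library in place of thefuzz. The list OFFICIAL_NAMES (including 'DEPART 2'), the 0.6 cutoff, and the helper names are assumptions for illustration.

```python
import unicodedata
from difflib import get_close_matches

OFFICIAL_NAMES = ["DEPART 1", "DEPART 2"]  # hypothetical list of official names


def normalize(name):
    """Strip accents, lower-case, and collapse runs of whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    return " ".join(without_accents.lower().split())


def canonicalize(columns, official_names=OFFICIAL_NAMES, cutoff=0.6):
    """Map each column to its closest official name, or keep it unchanged."""
    by_key = {normalize(official): official for official in official_names}
    result = []
    for column in columns:
        # n=1 keeps only the best candidate; an empty list means nothing reached the cutoff
        match = get_close_matches(normalize(column), list(by_key), n=1, cutoff=cutoff)
        result.append(by_key[match[0]] if match else column)
    return result


columns = ["Départ 1", "DEP1", "depart 1", " DEPART 1 ", "not even close"]
print(canonicalize(columns))
```

The returned list can be assigned back with df.columns = canonicalize(df.columns). Official names that differ by a single character, such as 'DEPART 1' and 'DEPART 2', score high against each other, so ambiguous matches are worth reviewing by hand.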
"Check the number of occurrences of column names":
import pandas as pd
import collections
df = pd.DataFrame(columns=['DEPART 1','DEP1','depart 1','depart 1','not even close'])
print(collections.Counter(df.columns))
Output:
Counter({'depart 1': 2, 'DEPART 1': 1, 'DEP1': 1, 'not even close': 1})
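Counting the raw names only finds exact duplicates. To check that each official name appears once, the count has to be made after the variants have been mapped to their official name. The sketch below assumes two parallel lists, the raw column names and the names they were mapped to (for instance the cols list built in the renaming example). find_collisions is a hypothetical helper, not part of pandas.

```python
from collections import defaultdict


def find_collisions(raw_names, canonical_names):
    """Return {canonical name: [raw names]} for names produced more than once."""
    groups = defaultdict(list)
    for raw, canonical in zip(raw_names, canonical_names):
        groups[canonical].append(raw)
    return {name: raws for name, raws in groups.items() if len(raws) > 1}


raw = ["DEPART 1", "DEP1", "depart 1", "depart 1", "not even close"]
renamed = ["DEPART 1", "DEPART 1", "DEPART 1", "DEPART 1", "not even close"]
print(find_collisions(raw, renamed))
```

An empty result means every official name appears at most once. Otherwise the dictionary shows which raw columns collapsed onto the same name, so they can be merged or corrected in the source file.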
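I suggest using a regular expression to identify a pattern shared by these column names and replacing the matches with the official name.
You can use the re library to do so. Combine it with the regex101 website to find a regex that fits all your cases.
Here is a small code example that solves this particular case: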
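import re
official_name = "depart 1"
column_names = [
    "Départ 1",
    "DEP1",
    "depart 1",
    " DEPART 1 ",
    " depart 1"]
# Raw string so \s and \D reach the regex engine unchanged.
# Inside [...] a ^ that is not the first character is a literal caret, so it is dropped here.
regex = r"\s*[dD][eEéÉ][pP]\D*\s*1\s*"
for name in column_names:
    print(name)
    # fullmatch requires the whole name to fit the pattern, not just a part of it
    result = re.fullmatch(regex, name)
    if result:
        print("Replace with {0}".format(official_name))
    else:
        print("Could not find the regex pattern")
It outputs this: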
Départ 1
Replace with depart 1
DEP1
Replace with depart 1
depart 1
Replace with depart 1
DEPART 1
Replace with depart 1
depart 1
Replace with depart 1
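If the columns follow a numbered pattern, a single case-insensitive regex with a capture group can cover every number instead of one pattern per column. This is a sketch under the assumption that every variant starts with 'dep' or 'dép' and ends with the number. The 'DEPART <n>' naming scheme for other numbers and the helper to_official are illustrative, not taken from the question.

```python
import re

# d, then e or é, then p, any non-digit filler, then the number (captured).
PATTERN = re.compile(r"\s*d[eé]p\D*(\d+)\s*", re.IGNORECASE)


def to_official(name):
    """Return 'DEPART <n>' if the name matches the pattern, else the name unchanged."""
    match = PATTERN.fullmatch(name)
    return f"DEPART {int(match.group(1))}" if match else name


for name in ["Départ 1", "DEP1", " DEPART 1 ", "dép 12", "not even close"]:
    print(repr(name), "->", repr(to_official(name)))
```

A regex is stricter than fuzzy matching: it never renames a column that does not fit the pattern, but it also misses misspellings such as 'DAPART 1', so the two approaches can complement each other.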