csv import - how to cleverly check that column names are "correct"?

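I'm trying to model an electrical network from data that was written by hand in CSV format. For example, I have a column that should be called 'DEPART 1'. I often find 'Départ 1', 'DEP1', 'depart 1', ' DEPART 1 ', or many other variants...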

Right now, I'm importing it like this:

import_net_data = pd.read_excel(path_file, sheet_name=None)

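I would like to be able to recognize columns whose names are close to the "official name" (perhaps by ignoring spaces, capitalization, ...).

Is there a proper way to do this?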

You need fuzzy string matching here. For Python, one option is the thefuzz package, which uses Levenshtein distance to measure how different two strings are.

For example:

from thefuzz import fuzz


st = 'DEPART 1'
strs = ['Départ 1', 'DEP1', 'depart 1', ' DEPART 1 ']

for s in strs:
    # fuzz.ratio returns a similarity score from 0 to 100 (100 = identical),
    # not a raw edit distance
    score = fuzz.ratio(st.lower(), s.lower())
    print(st, s, '|', 'similarity: ', score, 'is the same: ', score > 60)

Output:

DEPART 1 Départ 1 | similarity:  88   is the same:  True
DEPART 1 DEP1     | similarity:  67   is the same:  True
DEPART 1 depart 1 | similarity:  100  is the same:  True
DEPART 1 DEPART 1 | similarity:  89   is the same:  True
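
Several of the variants above differ from the official name only by accents, case, or spaces, so normalizing the names before scoring can turn them into exact matches and leave the fuzzy threshold for real abbreviations such as 'DEP1'. Here is a minimal sketch using only the standard library; the `normalize` helper is a hypothetical name, not part of thefuzz or pandas:

```python
import re
import unicodedata


def normalize(name: str) -> str:
    """Return a comparison key with accents, case, and whitespace removed."""
    # Decompose accented characters (é -> e + combining accent), then drop the accents.
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Remove every whitespace character and ignore case.
    return re.sub(r"\s+", "", without_accents).casefold()


official = "DEPART 1"
for candidate in ["Départ 1", "DEP1", "depart 1", " DEPART 1 "]:
    # True when the candidate equals the official name once normalized
    print(repr(candidate), normalize(candidate) == normalize(official))
```

With this key, only 'DEP1' still needs a fuzzy comparison; the other three variants compare equal to 'DEPART 1'.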

More information: https://www.datacamp.com/community/tutorials/fuzzy-string-python

With this you can achieve your goal.

"Replace any incorrect strings":

import pandas as pd
from thefuzz import fuzz

st = 'DEPART 1'

df = pd.DataFrame(columns=['DEPART 1','DEP1','depart 1','depart 1','not even close'])
print(df)

cols = []
for column in df.columns:
    if fuzz.ratio(st.lower(), column.lower()) > 60:
        cols.append(st)
    else:
        cols.append(column)

df.columns = cols

print(df)

Output:

Columns: [DEPART 1, DEP1, depart 1, depart 1, not even close]
Columns: [DEPART 1, DEPART 1, DEPART 1, DEPART 1, not even close]
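
A real file will probably have several official names, so each column has to be matched against all of them, keeping the best one. The sketch below shows one way to build such a rename mapping. To stay self-contained it swaps thefuzz for `difflib.SequenceMatcher` from the standard library, whose ratio is a comparable but not identical score. The names `OFFICIAL_NAMES`, `similarity`, and `build_rename_map`, as well as 'DEPART 2' and 'TENSION', are assumptions made up for illustration:

```python
from difflib import SequenceMatcher

# Hypothetical list of official column names.
OFFICIAL_NAMES = ["DEPART 1", "DEPART 2", "TENSION"]


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 100], ignoring surrounding spaces."""
    return 100 * SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def build_rename_map(columns, official_names=OFFICIAL_NAMES, threshold=60):
    """Map each column to its closest official name when the score exceeds threshold."""
    mapping = {}
    for column in columns:
        best = max(official_names, key=lambda name: similarity(column, name))
        if similarity(column, best) > threshold:
            mapping[column] = best
    return mapping


mapping = build_rename_map(["Départ 1", "DEP1", "depart 2 ", "not even close"])
print(mapping)
```

The mapping can then be applied with `df.rename(columns=mapping)`. Since `pd.read_excel(path_file, sheet_name=None)` returns a dict of DataFrames, this would be done once per sheet. Two columns that map to the same official name will end up as duplicate column names, so it is worth checking for that afterwards.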

"Check the number of occurrences of each column name":

import pandas as pd
import collections

df = pd.DataFrame(columns=['DEPART 1','DEP1','depart 1','depart 1','not even close'])

print(collections.Counter(df.columns))

Output:

Counter({'depart 1': 2, 'DEPART 1': 1, 'DEP1': 1, 'not even close': 1})

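I suggest you use regular expressions to identify a suitable pattern among these column names and replace them with the official name.

You can use the re library to do so. Combine it with the regex101 website to find the regex that best fits all your cases.

Here is a small code example that handles this particular case: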

import re

official_name = "depart 1"

column_names = [
    "Départ 1",
    "DEP1",
    "depart 1",
    " DEPART 1 ",
    " depart      1"]
    
# Inside [...], '^' is a literal caret unless it comes first, so list the letters directly.
regex = r"\s*[dD][eEéÉ][pP]\D*\s*1\s*"

for name in column_names:
    print(name)
    result = re.search(regex, name)
    if result:
        print("Replace with {0}".format(official_name))
    else:
        print("Could not find the regex pattern")

It outputs this:

Départ 1
Replace with depart 1
DEP1
Replace with depart 1
depart 1
Replace with depart 1
 DEPART 1 
Replace with depart 1
 depart      1
Replace with depart 1
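
Note that `re.search` accepts a match anywhere in the string, so the pattern above would also accept a name such as 'DEPART 11'. If that matters, a stricter variant is to require the whole name to match with `fullmatch` and keep one pattern per official name. The following is only a sketch: the `PATTERNS` table, the `official_name` helper, and the 'DEPART 2' entry are hypothetical.

```python
import re

# Hypothetical table: official name -> pattern accepting its known variants.
PATTERNS = {
    "DEPART 1": re.compile(r"\s*d[eé]p(?:art)?\s*1\s*", re.IGNORECASE),
    "DEPART 2": re.compile(r"\s*d[eé]p(?:art)?\s*2\s*", re.IGNORECASE),
}


def official_name(column: str):
    """Return the official name whose pattern matches the whole column name, else None."""
    for name, pattern in PATTERNS.items():
        if pattern.fullmatch(column):
            return name
    return None


for column in ["Départ 1", "DEP1", " depart      1", "DEPART 11", "TENSION"]:
    print(repr(column), "->", official_name(column))
```

Compared with fuzzy matching, this is stricter and more predictable, but every new spelling found in the data needs its pattern extended by hand.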