Parse CSV data to get a count on rows with duplicate values

Question

I have a CSV file (data.csv):

data
cn=Clark Kent,ou=users,ou=news,ou=employee,dc=company,dc=com
cn=Peter Parker,ou=News,ou=news,ou=employee,dc=company,dc=com
cn=Mary Jane,ou=News_HQ,ou=news,ou=employee,dc=company,dc=com
cn=Oliver Twist,ou=users,ou=news,ou=employee,dc=company,dc=com
cn=Mary Poppins,ou=Ice Cream,ou=ice cream,dc=company,dc=com
cn=David Tenant,ou=userMger,ou=ice cream,ou=employee,dc=company,dc=com
cn=Pepper Jack,ou=users,ou=store,ou=employee,dc=company,dc=com
cn=Eren Jaeger,ou=Store,ou=store,ou=employee,dc=company,dc=com
cn=Monty Python,ou=users,ou=store,dc=company,dc=com
cn=John Smith,ou=userMger,ou=store,ou=employee,dc=company,dc=com
cn=Anne Potts,ou=Sprinkles_HQ,ou=sprinkles,dc=company,dc=com
cn=Harry Styles,OU=Sprinkles,ou=sprinkles,ou=employee,dc=company,dc=com
cn=James Bond,ou=Sprinkles_HQ,ou=employee,dc=company,dc=com
cn=Harry Potter,ou=users,ou=sprinkles,ou=employee,dc=company,dc=com

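I need to parse the data to the point where I can count how many rows have the same name in an ou. So, for example, if there is Sprinkles_HQ, Sprinkles, or sprinkles, they should all count as the same name. If a single row has both Sprinkles_HQ and sprinkles (two of the same name), that row should still count as one (not two).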

The output I want is similar to this:

News, 4
Ice Cream, 2
Store, 4
Sprinkles, 4

The first step I took was to read my CSV file and convert it to a DataFrame. I did this with pandas:

import pandas as pd

#open file
file = open(directory)

#read csv and the column I want
df = pd.read_csv(file, usecols=['data'])
#make into a dataframe
rowData = pd.DataFrame(df)

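Then, to make the data easier to parse, I split each row into comma-separated values and converted those values into a list of lists (each row is a list). After that I removed any None values. Next, I need to move all the data that starts with 'OU=' into its own list, and if any entry has 'users', 'userMger', or 'employee', I remove those values from the list. This is my code so far: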

#splits the dataframe into comma separate values
lines =rowData['data'].str.split(",", expand=True)

#makes dataframe into a list of lists
a = lines.values.tolist()

#make my list of lists into a single list
employeeList = []
for i in range(len(a)):
    for j in range(len(a[0])):
        #there are some None values once converted to a list
        if a[i][j] != None: 
           employeeList.append(a[i][j])

#list for storing only OUs
ouList = []

#moving the items to the ouList that are only OUs
for i in range(len(employeeList)):
    if employeeList[i].startswith('OU='):
        ouList.append(employeeList[i])

#need to iterate in reverse as I am removing items from the list
#here I remove the other items
for i in reversed(range(len(ouList))):
     if ouList[i].endswith('users') or ouList[i].endswith('userMger') or ouList[i].endswith('employee'):
        ouList.remove(ouList[i])
        
#my list now only contains specific OUs        
print(ouList)

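I believe I am on the right track, but my code does not yet remove the duplicates within each row's list, such as Sprinkles_HQ, Sprinkles, or sprinkles. I need to find a way to remove the duplicates before I create the employeeList list, and append the results to a new list. That would make counting easier.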

I have researched how to remove duplicates in a list of lists. I tried something like this:

new_list = []
for elem in a:
    if a not in new_list:
        new_list.append(elem)

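But this doesn't take into account words that start the same way. I tried using startswith and .lower(), since the casing varies, but that hasn't worked for me either: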

new_list=[]
for i in range(len(a)):
    for j in range(len(a[0])):
        if a[i][j].lower().startswith(a[i][j].lower()) not in new_list:
           new_list.append(a[i][j])
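One likely reason the second attempt does not behave as intended: a string always starts with itself, so `x.lower().startswith(x.lower())` evaluates to `True` for every string, and the condition ends up testing whether the boolean `True` is in `new_list`. A minimal sketch of that behaviour, using a hypothetical sample value rather than the real data:

```python
# Hypothetical sample value, for illustration only.
value = "ou=Sprinkles_HQ"

# A string always starts with itself, so this is True for every value.
result = value.lower().startswith(value.lower())
print(result)

# The membership test therefore asks whether the boolean True is in the list,
# not whether a similar name has already been stored.
new_list = ["ou=sprinkles"]
print(result not in new_list)
```

Because `a` still contains `None` entries at that point, calling `.lower()` on them would also raise an `AttributeError`.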

Any suggestions would be greatly appreciated.

Answer 1

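The solution I came up with is in segments. My first problem was the casing: I needed everything to be lowercase. So after appending the items to employeeList, I added the code below. Note that it indexes employeeList[i][j], so it treats employeeList as a list of lists with one inner list per row.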

for i in range(len(employeeList)):
    for j in range(len(employeeList[i])):
        employeeList[i][j] = employeeList[i][j].lower()

This made everything in my employeeList lowercase.
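The same lowercasing step can probably be written as a nested list comprehension. This is only a sketch, and the two-row `employeeList` below is a hypothetical stand-in for the real data:

```python
# Hypothetical two-row sample standing in for employeeList (a list of lists).
employeeList = [
    ["cn=Harry Styles", "OU=Sprinkles", "ou=sprinkles"],
    ["cn=Mary Jane", "ou=News_HQ", "ou=news"],
]

# Build a new list of lists with every string lowercased.
employeeList = [[item.lower() for item in row] for row in employeeList]
print(employeeList)
```

Unlike the index-based loop, this builds a new list instead of modifying the existing one in place.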

Once I had that solved, I needed to change ouList from a single flat list to a list of lists, so that for each row only the items starting with ou= end up in ouList.

#list for storing only OUs
ouList = []

#moving the items to the ouList that are only OUs
for i in range(len(employeeList)):
    ouList.append([])
    for j in range(len(employeeList[i])):
        if employeeList[i][j].startswith('ou='):
           ouList[i].append(employeeList[i][j])

Then I needed to remove all the items that end in users, userMger, or employee. I iterated in reverse and used .endswith(), which did this without any errors.

#need to iterate in reverse as I am removing items from the list
for i in reversed(range(len(ouList))):
     for j in reversed(range(len(ouList[i]))):
        if (ouList[i][j].endswith('users')
        or ouList[i][j].endswith('usermger')
        or ouList[i][j].endswith('employee')):
           ouList[i].remove(ouList[i][j])
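The two steps above (keeping only `ou=` items, then dropping the generic ones) could plausibly be combined into one comprehension, which avoids removing items while iterating. This is a sketch under the assumption that the rows are already lowercased; the sample rows and the name `GENERIC` are illustrative only:

```python
# Hypothetical lowercased rows, for illustration only.
employeeList = [
    ["cn=clark kent", "ou=users", "ou=news", "ou=employee", "dc=company", "dc=com"],
    ["cn=john smith", "ou=usermger", "ou=store", "ou=employee", "dc=company", "dc=com"],
]

# str.endswith accepts a tuple of suffixes, so one call covers all three.
GENERIC = ("users", "usermger", "employee")

# Keep only the ou= items that do not end in one of the generic suffixes.
ouList = [
    [item for item in row if item.startswith("ou=") and not item.endswith(GENERIC)]
    for row in employeeList
]
print(ouList)
```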

Then, to strip ou= and the other unnecessary strings, I used re (regular expressions). I appended the new values to another list called ouListStrip:

import re

#stripping ou= and other strings
ouListStrip = []
for i in range(len(ouList)):
    ouListStrip.append([])
    for j in range(len(ouList[i])):
        ou = re.sub("ou=|_hq", "", ouList[i][j])
        ouListStrip[i].append(ou)

This list outputs:

[['news'], ['news', 'news'], ['news', 'news'], ['news'], ['ice cream', 'ice cream'], ['ice cream'], ['store'], ['store', 'store'], ['store'], ['store'], ['sprinkles', 'sprinkles'], ['sprinkles', 'sprinkles'], ['sprinkles'], ['sprinkles']]
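The pattern `"ou=|_hq"` removes those substrings wherever they occur in a value. If that is a concern, one possible variation is to anchor the pattern so it only strips a leading `ou=` and a trailing `_hq`. The sample values here are hypothetical:

```python
import re

# Hypothetical lowercased values, for illustration only.
samples = ["ou=sprinkles_hq", "ou=news", "ou=ice cream"]

# ^ou= matches only a leading "ou=", and _hq$ only a trailing "_hq".
pattern = re.compile(r"^ou=|_hq$")

stripped = [pattern.sub("", s) for s in samples]
print(stripped)
```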

Now that I have a list of lists, I can remove the duplicates within each sublist. I did this by using not in and appending the results as a list of lists.

no_repeats = []
for i in range(len(ouListStrip)):
     no_repeats.append([])
     for j in range(len(ouListStrip[i])):
       if ouListStrip[i][j] not in no_repeats[i]:
          no_repeats[i].append(ouListStrip[i][j])

no_repeats outputs:

[['news'], ['news'], ['news'], ['news'], ['ice cream'], ['ice cream'], ['store'], ['store'], ['store'], ['store'], ['sprinkles'], ['sprinkles'], ['sprinkles'], ['sprinkles']]
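As an alternative sketch of the same de-duplication, `dict.fromkeys` keeps the first occurrence of each value in insertion order. The `ouListStrip` sample below is a shortened, hypothetical stand-in:

```python
# Hypothetical sample standing in for ouListStrip.
ouListStrip = [["news", "news"], ["ice cream"], ["sprinkles", "sprinkles"]]

# dict.fromkeys keeps the first occurrence of each key, in insertion order.
no_repeats = [list(dict.fromkeys(row)) for row in ouListStrip]
print(no_repeats)
```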

Finally, I merged my list of lists into a single list:

allOUs = []
for i in range(len(no_repeats)):
    for j in range(len(no_repeats[i])):
        allOUs.append(no_repeats[i][j])

allOUs outputs:

['news', 'news', 'news', 'news', 'ice cream', 'ice cream', 'store', 'store', 'store', 'store', 'sprinkles', 'sprinkles', 'sprinkles', 'sprinkles']
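Flattening a list of lists can also be done with `itertools.chain.from_iterable` from the standard library. A small sketch with a shortened, hypothetical `no_repeats`:

```python
from itertools import chain

# Hypothetical sample standing in for no_repeats.
no_repeats = [["news"], ["news"], ["ice cream"], ["store"]]

# chain.from_iterable yields the items of each sublist in order.
allOUs = list(chain.from_iterable(no_repeats))
print(allOUs)
```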

Then I put this list into a dictionary and counted the items in it using .count():

dict_of_counts = {item:allOUs.count(item) for item in allOUs}

Output:

{'news': 4, 'ice cream': 2, 'store': 4, 'sprinkles': 4}
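The dictionary comprehension calls `.count()` once for every element of the list, so it rescans the list repeatedly. `collections.Counter` from the standard library produces the same kind of tally in a single pass. A sketch with a shortened, hypothetical `allOUs`:

```python
from collections import Counter

# Hypothetical sample standing in for allOUs.
allOUs = ["news", "news", "ice cream", "store", "store", "store"]

# Counter tallies each distinct item in one pass over the list.
counts = Counter(allOUs)
print(dict(counts))
```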

To make it look visually similar to what I wanted:

for key, value in dict_of_counts.items():
    print(key,',',value)

Output:

news , 4
ice cream , 2
store , 4
sprinkles , 4
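For comparison, the steps above can be condensed into one pass over the rows. This is only a sketch under stated assumptions, not the method used above: it works on a few rows from the question inlined as strings, the helper `row_names` and the names `ROWS` and `GENERIC` are hypothetical, and it drops an OU only when its whole name is `users`, `usermger`, or `employee` (the loops above use `.endswith()` instead). Collecting each row's names in a set is what makes a row count once per name.

```python
import re
from collections import Counter

# A few rows from the question's sample data, inlined so the sketch runs on its own.
ROWS = [
    "cn=Clark Kent,ou=users,ou=news,ou=employee,dc=company,dc=com",
    "cn=Mary Jane,ou=News_HQ,ou=news,ou=employee,dc=company,dc=com",
    "cn=Mary Poppins,ou=Ice Cream,ou=ice cream,dc=company,dc=com",
    "cn=Harry Styles,OU=Sprinkles,ou=sprinkles,ou=employee,dc=company,dc=com",
    "cn=James Bond,ou=Sprinkles_HQ,ou=employee,dc=company,dc=com",
]

# Assumption: an OU is generic only if its whole name is one of these.
GENERIC = {"users", "usermger", "employee"}


def row_names(row):
    """Return the set of normalized OU names in one row (hypothetical helper)."""
    names = set()
    for part in row.lower().split(","):
        if part.startswith("ou="):
            # Strip a leading "ou=" and a trailing "_hq".
            name = re.sub(r"^ou=|_hq$", "", part)
            if name not in GENERIC:
                names.add(name)
    return names


counts = Counter()
for row in ROWS:
    # Updating with a set adds at most one per name for this row.
    counts.update(row_names(row))

for name, count in counts.items():
    # str.title() capitalizes each word; the f-string avoids the space before the comma.
    print(f"{name.title()}, {count}")
```

With these five rows, `counts` holds 2 for `news`, 1 for `ice cream`, and 2 for `sprinkles`.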


python

csv

sorting

dataframe

pandas