Unable to compare 2 Python sets that contain strings
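I created 2 Python sets from 2 different CSV files, each of which contains some strings.
I am trying to match the 2 sets so that I get back their intersection (the strings common to both sets should be returned).
This is what my code looks like: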
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import string
import nltk
# using a context manager to open and read the file
# converted the text file into a csv file at the source using Notepad++
with open(r'skills.csv', 'r', encoding="utf-8-sig") as f:
    myskills = f.readlines()
# converting all the strings in the list to lowercase
list_of_myskills = map(lambda x: x.lower(), myskills)
set_of_myskills = set(list_of_myskills)
#print(type(nodup_filtered_content))
print(set_of_myskills)
# open and read line by line from the text file
with open(r'list_of_skills.csv', 'r') as f2:
    # using readlines() instead of read(), because it reads line by line
    # (each line as a string object in the Python list)
    contents_f2 = f2.readlines()
# converting all the strings in the list to lowercase
list_of_skills = map(lambda x: x.lower(), contents_f2)
# converting into sets
set_of_skills = set(list_of_skills)
print(set_of_skills)
This is the function I am using:
def set_compare(set1, set2):
    if set1 & set2:
        print('The matching skills are:', set1 & set2)
    else:
        print("No matching skills")
After I run the code:
set_compare(set_of_skills,set_of_myskills)
Output:
No matching skills
The contents of 'skills.csv' are:
{'critical thinking,identify user needs,business intelligence,business analysis,teamwork,database,data visualization,data analysis,relational database,mysql,oracle sql,design,entity-relationship,develop ,use-cases ,scenarios,project development ,user requirement,design,sequence diagram,state diagram,identifying,uml diagrams,html5,css3,php,clean,analyze,plot,data,python,pandas,numpy,matplotlib,ipython notebook,spyder,anaconda,jupyterlab,data analysis,data visualization,tableau,database,surveys,prototyping,logical data models,data models,requirement elicitation.,leadreship,mysq,team,prioratization,analyze,articulate,'}
The contents of 'list_of_skills.csv' are:
{'assign passwords and maintain database access,agile development,agile project methodology,amazon web services (aws),analytics,analytical,analyze and recommend database improvements,analyze impact of database changes to the business,audit database access and requests,apis,application and server monitoring tools,applications,application development,attention to detail,architecture,big data,business analytics,business intelligence,business process modeling,cloud applications,cloud based visualizations,cloud hosting services,cloud maintenance tasks,cloud management tools,cloud platforms,cloud scalability,cloud services,cloud systems administration,code,coding,computer,communication,configure database software,configuration,configuration management,content strategy,content management,continually review processes for improvement ,continuous deployment,continuous integration,critical thinking,customer support,database,data analysis,data analytics,data imports,data imports,data intelligence,data mining,data modeling,data science,data strategy,data storage,data visualization tools,data visualizations,database administration,deploying applications in a cloud environment,deployment automation tools,deployment of cloud services,design,desktop support,design,design and build database management system,design principles,design prototypes,design specifications,design tools,develop and secure network structures,develop and test methods to synchronize data ,developer,development,documentation,emerging technologies,file systems,flexibility,front end design,google analytics,hardware,help desk,identify user needs ,implement backup and recovery plan ,implementation,information architecture,information design,information systems,interaction design,interaction flows,"install, maintain, and merge databases ",installation,integrated technologies,integrating security protocols with cloud design,internet,it optimization,it security,it soft skills,it solutions,it support,languages,logical 
thinking,leadership,linux,management,messaging,methodology,metrics,microsoft office,migrating existing workloads into cloud systems,mobile applications,motivation,networks,network operations,networking,open source technology integration,operating systems,operations,optimize queries on live data,optimizing user experiences,optimizing website performance,organization,presentation,programming,problem solving,process flows,product design,product development,prototyping methods,product development,product management,product support,product training,project management,repairs,reporting,research emerging technology,responsive design,review existing solutions,search engine optimization (seo),security,self motivated,self starting,servers,software,software development,software engineering,software quality assurance (qa),solid project management capabilities ,solid understanding of company’s data needs ,storage,strong technical and interpersonal communication ,support,systems software,tablets,team building,team oriented,teamwork,technology,tech skills,technical support,technical writing,testing,time management,tools,touch input navigation,training,troubleshooting,troubleshooting break-fix scenarios,user research,user testing,usability,user-centered design,user experience,user flows,user interface,user interaction diagrams,user research,user testing,ui / ux,utilizing cloud automation tools,virtualization,visual design,web analytics,web applications,web development,web design,web technologies,wireframes,work independently,'}
Although I can see matching keywords in both files, I don't understand why I get no matching output.
I am not receiving any error either.
Comparing two sets of strings does not compare substrings of those strings. Your program is essentially doing
foo = {'ABC', 'DEF', 'GHI'}
bar = {'AB', 'CD', 'DE', 'FG', 'HI'}
foo.intersection(bar) # returns set()
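Just because strings in different sets share characters doesn't mean the sets have an intersection. The string 'ABC' is in the first set but not the second, the string 'AB' is in the second but not the first, and so on.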
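To make this concrete, here is a minimal sketch showing that & only matches whole elements, and that a set built from readlines() on a one-line file holds exactly one element. The skill strings are shortened, made-up samples rather than the real files.

```python
# Each set holds ONE long string, as happens when a one-line file
# is read with readlines() and the resulting list is turned into a set.
mine = {"critical thinking,teamwork,database\n"}
theirs = {"agile development,critical thinking,database\n"}

print(len(mine), len(theirs))   # 1 1
print(mine & theirs)            # set()

# Splitting on commas turns each skill into its own element.
mine_split = {s.strip() for line in mine for s in line.split(",")}
theirs_split = {s.strip() for line in theirs for s in line.split(",")}
common = mine_split & theirs_split
print(sorted(common))           # ['critical thinking', 'database']
```

Printing len() of each set is a quick way to check whether the elements are individual skills or whole lines.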
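It's a little unclear exactly what kind of intersection between the two CSVs you are after. Do you want to find individual cells that appear in both? Do they also have to match by column? If you provide more information about the expected output, I can edit this answer to be more helpful.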
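[Edit]
Based on your comment, it looks like what you want is to split those huge strings on commas so that the elements of the sets are individual cells. Currently, each set has only one element, and that element is just one giant string with lots of skills in it. If you replace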
list_of_myskills = map(lambda x: x.lower(), myskills)
with
list_of_myskills = [y.strip().lower() for x in myskills for y in x.split(',')]
and replace the other similar line accordingly, you will probably get closer to what you expect.
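Putting that suggestion together, a sketch of the whole flow might look like the following. The helper name parse_skills and the inline sample lines are hypothetical stand-ins for the two files. It also drops the empty string that a trailing comma would otherwise add to the set.

```python
def parse_skills(lines):
    """Split each line on commas and return a set of cleaned, lowercased skills."""
    skills = set()
    for line in lines:
        for cell in line.split(","):
            cell = cell.strip().lower()
            if cell:  # skip empty cells, e.g. from a trailing comma
                skills.add(cell)
    return skills


def set_compare(set1, set2):
    """Print and return the skills common to both sets."""
    common = set1 & set2
    if common:
        print("The matching skills are:", sorted(common))
    else:
        print("No matching skills")
    return common


# Hypothetical sample data; with real files, pass f.readlines() instead.
my_lines = ["Critical Thinking,teamwork,MySQL,design,\n"]
all_lines = ["agile development,critical thinking,design,teamwork ,\n"]

matches = set_compare(parse_skills(my_lines), parse_skills(all_lines))
```

Note that a plain split(',') also cuts through quoted cells that contain commas, so it is only a reasonable fit when no cell is quoted that way.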
This works: change the .csv files so that each one contains the skill phrases separated by ",", with one line per file.
import pandas as pd
myskills = pd.read_csv("skills.csv",header=None)
set_of_my_skills = set(myskills.iloc[0,])
list_of_skills = pd.read_csv("list_of_skills.csv",header=None)
set_of_skills = set(list_of_skills.iloc[0,])
print(set_of_my_skills & set_of_skills)
{'business intelligence', 'design', 'critical thinking', 'data analysis', 'database', 'teamwork'}
skills.csv : critical thinking,identify user needs,business intelligence,business analysis,teamwork,database,data visualization,data analysis,relational database,mysql,oracle sql,design,entity-relationship,develop ,use-cases ,scenarios,project development ,user requirement,design,sequence diagram,state diagram,identifying,uml diagrams,html5,css3,php,clean,analyze,plot,data,python,pandas,numpy,matplotlib,ipython notebook,spyder,anaconda,jupyterlab,data analysis,data visualization,tableau,database,surveys,prototyping,logical data models,data models,requirement elicitation.,leadreship,mysq,team,prioratization,analyze,articulate
list_of_skills.csv: assign passwords and maintain database access,agile development,agile project methodology,amazon web services (aws),analytics,analytical,analyze and recommend database improvements,analyze impact of database changes to the business,audit database access and requests,apis,application and server monitoring tools,applications,application development,attention to detail,architecture,big data,business analytics,business intelligence,business process modeling,cloud applications,cloud based visualizations,cloud hosting services,cloud maintenance tasks,cloud management tools,cloud platforms,cloud scalability,cloud services,cloud systems administration,code,coding,computer,communication,configure database software,configuration,configuration management,content strategy,content management,continually review processes for improvement ,continuous deployment,continuous integration,critical thinking,customer support,database,data analysis,data analytics,data imports,data imports,data intelligence,data mining,data modeling,data science,data strategy,data storage,data visualization tools,data visualizations,database administration,deploying applications in a cloud environment,deployment automation tools,deployment of cloud services,design,desktop support,design,design and build database management system,design principles,design prototypes,design specifications,design tools,develop and secure network structures,develop and test methods to synchronize data ,developer,development,documentation,emerging technologies,file systems,flexibility,front end design,google analytics,hardware,help desk,identify user needs ,implement backup and recovery plan ,implementation,information architecture,information design,information systems,interaction design,interaction flows,"install, maintain, and merge databases ",installation,integrated technologies,integrating security protocols with cloud design,internet,it optimization,it security,it soft skills,it solutions,it 
support,languages,logical thinking,leadership,linux,management,messaging,methodology,metrics,microsoft office,migrating existing workloads into cloud systems,mobile applications,motivation,networks,network operations,networking,open source technology integration,operating systems,operations,optimize queries on live data,optimizing user experiences,optimizing website performance,organization,presentation,programming,problem solving,process flows,product design,product development,prototyping methods,product development,product management,product support,product training,project management,repairs,reporting,research emerging technology,responsive design,review existing solutions,search engine optimization (seo),security,self motivated,self starting,servers,software,software development,software engineering,software quality assurance (qa),solid project management capabilities ,solid understanding of company’s data needs ,storage,strong technical and interpersonal communication ,support,systems software,tablets,team building,team oriented,teamwork,technology,tech skills,technical support,technical writing,testing,time management,tools,touch input navigation,training,troubleshooting,troubleshooting break-fix scenarios,user research,user testing,usability,user-centered design,user experience,user flows,user interface,user interaction diagrams,user research,user testing,ui / ux,utilizing cloud automation tools,virtualization,visual design,web analytics,web applications,web development,web design,web technologies,wireframes,work independently
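One thing to watch for with the pandas version: it compares cells exactly as they appear in the file, so 'identify user needs ' (with a trailing space in list_of_skills.csv) does not match 'identify user needs' in skills.csv, which appears to be why it is missing from the output above. A sketch using the standard-library csv module that strips whitespace, and that keeps quoted cells such as "install, maintain, and merge databases " in one piece, could look like this. The function name read_skill_set is an assumption, and io.StringIO stands in for the real files.

```python
import csv
import io


def read_skill_set(file_obj):
    """Read comma-separated skills from an open file into a normalized set."""
    skills = set()
    for row in csv.reader(file_obj):
        for cell in row:
            cell = cell.strip().lower()
            if cell:  # ignore empty cells
                skills.add(cell)
    return skills


# In-memory stand-ins for the files; with real files use
# open("skills.csv", newline="", encoding="utf-8-sig") and so on.
mine = io.StringIO("critical thinking,identify user needs,design\n")
theirs = io.StringIO(
    'identify user needs ,"install, maintain, and merge databases ",design\n'
)

my_skills = read_skill_set(mine)
their_skills = read_skill_set(theirs)
common = my_skills & their_skills
print(sorted(common))  # ['design', 'identify user needs']
```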