Pandas: drop rows based on filtered items (not satisfied with my solution)
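I am cleaning a list of domain names.
I want to remove the rows that fit certain criteria. I have managed to identify the first criterion, and the second one will be easy to do.
However, I cannot get the rows removed. I tried several approaches; the best one so far is below.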
    from wordsegment import segment
    import pandas as pd

    def assignname():
        dfr = pd.read_csv('data.net.date.csv')
        for domainwtld in dfr.domain:
            dprice = dfr.price
            # Strip the TLD before segmenting
            domainwotld = domainwtld.replace(".net", "")
            # segment is imported directly, so it is called without the module prefix
            seperate = segment(domainwotld)
            # Length of the shortest word in the segmentation
            dlnt = (min(seperate, key=len))
            slnt = len(dlnt)
            if slnt <= 1:
                baddomains = domainwtld
                # Attempt to remove the bad domain from dfr
                a = dfr.loc[dfr['domain'] < (baddomains)]
                print (a)
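When I run this code, the output is all of "dfr" printed with only the first item in "baddomains" removed. It keeps doing this until the loop finishes.
How can I filter the original CSV file based on the bad domains?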
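    from wordsegment import load, segment
    import pandas as pd

    load()  # wordsegment 1.0+ needs its corpus loaded before segment() is called

    url = 'http://download1474.mediafire.com/3ndc8vevwtng/sa4ifz8rixe7m8u/data.net.date+%285%29.csv'
    dfr = pd.read_csv(url)
    # Strip the literal ".net"; regex=False stops "." from matching any character
    dfr['domain'] = dfr.domain.str.replace(".net", "", regex=False)
    # Split each domain into its component words
    dfr['words'] = dfr.domain.apply(segment)
    # Good domains: the shortest segmented word has more than one character
    good_domains = dfr[dfr.words.apply(lambda words: len(min(words, key=len))) > 1]
    # Bad domains: every row whose domain is not among the good ones
    bad_domains = dfr[~dfr.domain.isin(good_domains.domain)]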
    >>> bad_domains
            domain  price           words
    2        keeng    700       [keen, g]
    14       ymall    777       [y, mall]
    22       idisc    850       [i, disc]
    26      borsen    877      [borse, n]
    38    cellacom    895  [cell, a, com]
    51     iwealth    999     [i, wealth]
    96     iplayer   1500     [i, player]
    116  mcommerce   2000   [m, commerce]
    118      apico   2052       [a, pico]
    134     epharm   2500      [e, pharm]
    139     ionica   2579      [ionic, a]
    153    kasiino   2999   [kasi, in, o]
    155    alpadia   3000   [al, padi, a]
    158   similans   3152    [similan, s]
    163    ifuture   3499     [i, future]

    >>> bad_domains.domain.tolist()
    ['keeng',
     'ymall',
     'idisc',
     'borsen',
     'cellacom',
     'iwealth',
     'iplayer',
     'mcommerce',
     'apico',
     'epharm',
     'ionica',
     'kasiino',
     'alpadia',
     'similans',
     'ifuture']
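
To get from the list of bad domains to a cleaned frame, one option is to build a single boolean mask and index with its negation instead of filtering inside a loop. The sketch below is only an illustration under stated assumptions: the `words` column is hard-coded so it runs without wordsegment, the rows `bluesky.net` and `redcar.net` are made-up examples of good domains, and the names `is_bad`, `cleaned`, `dropped` and `lexicographic` are hypothetical. It also shows why the comparison in the question does not remove a row: `<` on a string column is a lexicographic test.

```python
import pandas as pd

# Hypothetical stand-in data: two rows taken from the output above plus two
# made-up "good" rows. The segmentation results are hard-coded so this
# sketch does not depend on wordsegment.
dfr = pd.DataFrame({
    "domain": ["keeng.net", "bluesky.net", "ymall.net", "redcar.net"],
    "price": [700, 750, 777, 800],
    "words": [["keen", "g"], ["blue", "sky"], ["y", "mall"], ["red", "car"]],
})

# The comparison from the question is a lexicographic string test, not a
# removal: it keeps only the domains that sort before "keeng.net".
lexicographic = dfr.loc[dfr["domain"] < "keeng.net"]

# One boolean mask for the whole frame instead of a loop.
# True where the shortest segmented word has at most one character.
is_bad = dfr["words"].apply(lambda words: min(len(w) for w in words) <= 1)

# Keep only the rows that are not flagged; the original index is preserved.
cleaned = dfr.loc[~is_bad]

# Alternative with the same result: drop the flagged rows by index label.
dropped = dfr.drop(index=dfr[is_bad].index)

print(cleaned[["domain", "price"]])
```

Because the mask is computed once for all rows, nothing is overwritten from one iteration to the next, which is what happened to `a` in the loop. If the result should go back to disk, `cleaned.drop(columns="words").to_csv(...)` with `index=False` would write it without the helper column; the output file name is left to the reader.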