Getting list of keywords from JSON

I have run into a problem, and I don't understand why my output is printed the way it is.

Here is my code. I'm new to programming, so please excuse the poor formatting. It opens a text file that contains a list of keywords:

import urllib2
import json

f1 = open('CatList.text')
lines = f1.readlines()

for  line in lines:

    url ='https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle='+line+'&cmlimit=100'

    print(url)

    json_obj = urllib2.urlopen(url)
    data = json.load(json_obj)

    #to write the result
    f2 = open('SubList.text', 'w')

    f2.write(url)

    for item in data['query']:

            for i in data['query']['categorymembers']:


                f2.write((i['title']).encode('utf8')+"\n")

I get this error:

Traceback (most recent call last):
  File "Test2.py", line 16, in <module>
    json_obj = urllib2.urlopen(url)
  File "/usr/lib/python2.7/urllib2.py", line 127, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib/python2.7/urllib2.py", line 402, in open
    req = meth(req)
  File "/usr/lib/python2.7/urllib2.py", line 1113, in do_request_
    raise URLError('no host given')
urllib2.URLError: <urlopen error no host given>

I'm not sure what this error means, so I tried the following to print the URLs:

import urllib2
import json

f1 = open('CatList.text')
f2 = open('SubList.text', 'w')
lines = f1.readlines()

for  line in lines:

    url ='https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle='+line+'&cmlimit=100'

    print(url)
    f2.write(url+'\n')

The output looks strange (partial output shown):

https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Branches of geography
&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Geography by place
&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Geography awards and competitions
&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Geography conferences
&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Geography education
&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Environmental studies
&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Exploration
&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Geocodes
&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Geographers
&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Geographical zones
&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Geopolitical corridors
&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:History of geography
&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Land systems
&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Landscape
&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Geography-related lists
&cmlimit=100

Note that each URL is split across two lines:

https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Geography-related lists
&cmlimit=100 

instead of

  https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Geography-related lists&cmlimit=100 

My first question: how do I fix this?

Second, is this what causes the error?

My CatList.text is as follows:

Category:Branches of geography
Category:Geography by place
Category:Geography awards and competitions
Category:Geography conferences
Category:Geography education
Category:Environmental studies
Category:Exploration
Category:Geocodes
Category:Geographers
Category:Geographical zones
Category:Geopolitical corridors
Category:History of geography
Category:Land systems
Category:Landscape
Category:Geography-related lists
Category:Lists of countries by geography
Category:Navigation
Category:Geography organizations
Category:Places
Category:Geographical regions
Category:Surveying
Category:Geographical technology
Category:Geography terminology
Category:Works about geography
Category:Geographic images
Category:Geography stubs

Sorry for the long post. Thank you very much for your help.

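A line break is normally represented by '\n'. In the same way, every line in a file ends with a hidden '\n' character.

So lines = f1.readlines() returns each line with its trailing '\n' still attached. That is the problem: the newline ends up in the middle of your URL.

To avoid this, read the file with f1.read().splitlines() instead.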
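The difference can be sketched as follows. This is a minimal illustration, assuming an in-memory io.StringIO object as a stand-in for CatList.text so that the snippet runs on its own; the two sample lines are taken from the question.

```python
import io

# Stand-in for the contents of CatList.text (two lines from the question).
sample = u"Category:Geocodes\nCategory:Geographers\n"

# readlines() keeps the trailing newline on every line.
with_newlines = io.StringIO(sample).readlines()

# read().splitlines() splits on line boundaries and drops the terminators.
without_newlines = io.StringIO(sample).read().splitlines()

print(with_newlines)
print(without_newlines)
```

With a real file, the same calls apply to the object returned by open('CatList.text').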

Change the following line:

url ='https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle='+line+'&cmlimit=100'

to

url ='https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle='+line.strip()+'&cmlimit=100'

Your line contains a newline character (\n). Calling .strip() removes whitespace, including that newline, from both ends of the string.
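As a rough sketch of the same idea, the URL can be assembled in a small helper. The name build_url and the constant API_ENDPOINT are hypothetical, introduced only for this example. The helper also goes one step beyond what the answer above says: it uses urlencode to percent-encode the category title, on the assumption that spaces and colons in the query string are better escaped than sent raw. The import falls back to the Python 2 location of urlencode, since the question uses urllib2.

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2

# Hypothetical constant holding the API endpoint used in the question.
API_ENDPOINT = 'https://en.wikipedia.org/w/api.php'


def build_url(line):
    """Hypothetical helper: build the query URL for one line of CatList.text."""
    params = [
        ('action', 'query'),
        ('format', 'json'),
        ('list', 'categorymembers'),
        # strip() removes the trailing newline left by readlines().
        ('cmtitle', line.strip()),
        ('cmlimit', '100'),
    ]
    # urlencode escapes characters such as spaces and colons in the values.
    return API_ENDPOINT + '?' + urlencode(params)


url = build_url('Category:Geography-related lists\n')
print(url)
```

The printed URL is a single line ending in &cmlimit=100, with no newline before the last parameter.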