Creating pandas dataframe from API call
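I'm building an API to retrieve Census data, but I'm having trouble formatting the output. My question really comes down to one of two things:
1) How can I improve my API call so the output is cleaner (ideally a dataframe)?
or
2) How can I manipulate the list I currently get so that it ends up in a pandas dataframe?
Here's what I have so far: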
import requests
import pandas as pd
import numpy as np
mytoken = "numbersandletters"
# this is my API key, so unfortunately I can't provide it
def state_data(token, variables, year = 2010, state = "*", survey = "sf1"):
    state = [str(i) for i in state]
    # make sure the input for state (integers) are strings
    variables = ",".join(variables) # squish all the variables into one string
    year = str(year)
    combine = ["http://api.census.gov/data/", year, "/", survey, "?key=", token, "&get=", variables, "&for=state:"]
    # make a list of all the components to construct a URL
    incomplete_url = "".join(combine) # the URL without the state tacked on to the end
    complete_url = map(lambda i: incomplete_url + i, state) # now the state is tacked on to the end; one URL per state or for "*"
    r = []
    r = map(lambda i: requests.get(i), complete_url)
    # make an API call to each complete_url
    data = map(lambda i: i.json(), r)
    print r
    print data
    print type(data)
    df = pd.DataFrame(data)
    print df
An example call to the function and its output are shown below.
state_data(token = mytoken, state = [47, 48, 49, 50], variables = ["P0010001", "P0010001"])
which results in:
[<Response [200]>, <Response [200]>, <Response [200]>, <Response [200]>]
[[[u'P0010001', u'P0010001', u'state'], [u'6346105', u'6346105', u'47']],
[[u'P0010001', u'P0010001', u'state'], [u'25145561', u'25145561', u'48']],
[[u'P0010001', u'P0010001', u'state'], [u'2763885', u'2763885', u'49']],
[[u'P0010001', u'P0010001', u'state'], [u'625741', u'625741', u'50']]]
<type 'list'>
                             0                         1
0  [P0010001, P0010001, state]    [6346105, 6346105, 47]
1  [P0010001, P0010001, state]  [25145561, 25145561, 48]
2  [P0010001, P0010001, state]    [2763885, 2763885, 49]
3  [P0010001, P0010001, state]      [625741, 625741, 50]
whereas the desired result is:
   P0010001  P0010001 state
0   6346105   6346105    47
1  25145561  25145561    48
2   2763885   2763885    49
3    625741    625741    50
FWIW, the analogous code in R is below. I'm translating a library I wrote in R into Python:
state.data = function(token, state = "*", variables, year = 2010, survey = "sf1"){
  state = as.character(state)
  variables = paste(variables, collapse = ",")
  year = as.character(year)
  my.url = matrix(paste("http://api.census.gov/data/", year, "/", survey, "?key=", token,
                        "&get=",variables, "&for=state:", state, sep = ""), ncol = 1)
  process.url = apply(my.url, 1, function(x) process.api.data(fromJSON(file=url(x))))
  rbind.dat = data.frame(rbindlist(process.url))
  rbind.dat = rbind.dat[, c(tail(seq_len(ncol(rbind.dat)), 1), seq_len(ncol(rbind.dat) - 1))]
  rbind.dat
}
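So you have duplicated fields, which doesn't make much sense; your result will only show one of the duplicated fields.
However, all you need to do is pass a list/iterable of dict objects to the pd.DataFrame constructor and you will get the result: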
vals = [[[...]]] # the data you provided in your example
df = pd.DataFrame(dict(zip(*v)) for v in vals)
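To make the one-liner above concrete, here is a minimal sketch that fills in vals with the four responses printed in the question (hard-coded here rather than fetched, so no API key is needed). The name records is only illustrative. It also shows why the duplicated P0010001 field ends up as a single column: dict keys are unique.

```python
import pandas as pd

# Sample responses copied from the question's printed output;
# each API response is a [header, row] pair.
vals = [
    [["P0010001", "P0010001", "state"], ["6346105", "6346105", "47"]],
    [["P0010001", "P0010001", "state"], ["25145561", "25145561", "48"]],
    [["P0010001", "P0010001", "state"], ["2763885", "2763885", "49"]],
    [["P0010001", "P0010001", "state"], ["625741", "625741", "50"]],
]

# zip(*v) pairs each header name with the value in the same position.
# dict() keeps one entry per unique key, so the repeated "P0010001"
# field collapses into a single column.
records = [dict(zip(*v)) for v in vals]

df = pd.DataFrame(records)
print(df)
```

The resulting frame has one row per state and two columns, P0010001 and state. All values are still strings at this point.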
Suppose this is your data:
data = [["P0010001","PCO0020019","state"], ["4779736","1204","01"], ["710231","53","02"], ["6392017","799","04"], ["2915918","924","05"], ["37253956","6244","06"], ["5029196","955","08"], ["3574097","1266","09"], ["897934","266","10"], ["601723","170","11"], ["18801310","4372","12"], ["9687653","1629","13"], ["1360301","251","15"], ["1567582","320","16"], ["12830632","3713","17"]]
Then this works:
df = pd.DataFrame(data[1:], columns=data[0])
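So you need to figure out how to get your data into that form. All I did was pass a list of lists (data[1:]) as the rows and a list (data[0]) as the column names.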
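For the per-state responses in the question, one way to get into that form is to take the header from the first response and collect the remaining rows from every response. This is only a sketch: it assumes every response carries the same header, and the names header and rows are illustrative. The data is hard-coded from the question's output rather than fetched.

```python
import pandas as pd

# One [header, row] pair per state, as printed in the question.
data = [
    [["P0010001", "P0010001", "state"], ["6346105", "6346105", "47"]],
    [["P0010001", "P0010001", "state"], ["25145561", "25145561", "48"]],
    [["P0010001", "P0010001", "state"], ["2763885", "2763885", "49"]],
    [["P0010001", "P0010001", "state"], ["625741", "625741", "50"]],
]

# Assumption: all responses share the header of the first one.
header = data[0][0]

# Skip the header in each response and keep every remaining row.
rows = [row for response in data for row in response[1:]]

df = pd.DataFrame(rows, columns=header)
print(df)
```

Unlike the dict-based approach, this keeps the duplicated P0010001 column, which matches the desired output in the question. The values are still strings; pd.to_numeric can convert them if numbers are needed.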