How to accept GDPR cookies when fetching JSON?

I am trying to fetch a Forbes list in R with the following query:

billionaires <- jsonlite::fromJSON("https://www.forbes.com/forbesapi/person/billionaires/2020/position/true.json")

I get this strange error:

Error in parse_con(txt, bigint_as_char) : 
  lexical error: invalid char in json text.
                                       <!doctype html> <html lang="en"
                     (right here) ------^

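This fails because the server returns an HTML page asking for GDPR consent instead of JSON (I am in Spain). Fetching the same URL with httr shows the HTML body:

library(httr)

req <- GET("https://www.forbes.com/forbesapi/person/billionaires/2020/position/true.json")
stop_for_status(req)
# Read the body as text to inspect what the server actually returned
json <- content(req, "text", encoding = "UTF-8")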

"\n<!doctype html>\n<html lang="en">\n\t\n\t\t<meta http-equiv="Content-Language" content="en_US">\n\n\t\t<script type="text/javascript">\n\t\t\t(function () {\n\t\t\t\tfunction isValidUrl(toURL) {\n\t\t\t\t\t// Regex taken from welcome ad.\n\t\t\t\t\treturn (toURL || '').match(/^(?:https?:?\/\/)?(?:[^.(){}\\/])?\.?forbes\.com(?:\/|\?|$)/i);\n\t\t\t\t}\n\n\t\t\t\tfunction getUrlParameter(name) {\n\t\t\t\t\tname = name.replace(/[\[]/, '\[').replace(/[\]]/, '\]');\n\t\t\t\t\tvar regex = new RegExp('[\?&]' + name + '=([^&#])');\n\t\t\t\t\tvar results = regex.exec(location.search);\n\t\t\t\t\treturn results === null ? '' : decodeURIComponent(results[1].replace(/\+/g, ' '));\n\t\t\t\t};\n\n\t\t\t\tfunction consentIsSet(message) {\n\t\t\t\t\tconsole.log(message);\n\t\t\t\t\tvar result = JSON.parse(message.data);\n\t\t\t\t\tif(result.message == "submit_preferences"){\n\t\t\t\t\t\tvar toURL = getUrlParameter("toURL");\n\t\t\t\t\t\tif(!isValidUrl(toURL)){\n\t\t\t\t\t\t\ttoURL = "https://www.forbes.com/";\n\t\t\t\t\t\t}\n\t\t\t\t\t\tlocation.href=toURL;\n\t\t\t\t\t}\n\t\t\t\t}\n\n\t\t\t\tvar apiObject = {\n\t\t\t\t\tPrivacyManagerAPI:\n\t\t\t\t\t{\n\t\t\t\t\t\taction: "getConsent",\n\t\t\t\t\t\ttimestamp: new Date().getTime(),\n\t\t\t\t\t\tself: "forbes.com"\n\t\t\t\t\t}\n\t\t\t\t};\n\t\t\t\tvar json = JSON.stringify(apiObject);\n\t\t\t\twindow.top.postMessage(json,"");\n\t\t\t\twindow.addEventListener("message", consentIsSet, false);\n\t\t\t})();\n\t\t\n\n\t\t\n\t\t\t(function () {\n\t\t\t\tvar makeStub = function () {\n\t\t\t\t\tvar TCF_LOCATOR_NAME = '__tcfapiLocator';\n\t\t\t\t\tvar TCF_LOCATOR_ID = '__tcfapiTrustarc';\n\t\t\t\t\tvar win = window;\n\t\t\t\t\tvar queue = [];\n\t\t\t\t\tvar cmpFrame;\n\t\t\t\t\tfunction addFrame() {\n\t\t\t\t\t\tvar doc = win.document;\n\t\t\t\t\t\tvar otherCMP = !!(win.frames[TCF_LOCATOR_NAME]);\n\t\t\t\t\t\tif (!otherCMP) {\n\t\t\t\t\t\t\tif (doc.body) {\n\t\t\t\t\t\t\t\tvar iframe = 
doc.createElement('iframe');\n\t\t\t\t\t\t\t\tiframe.name = TCF_LOCATOR_NAME;\n\t\t\t\t\t\t\t\tiframe.style.display = 'none';\n\t\t\t\t\t\t\t\tiframe.id = TCF_LOCATOR_ID;\n\t\t\t\t\t\t\t\tiframe.src = 'https://trustarc.mgr.consensu.org/asset/cmpcookie.v2.html';\n\t\t\t\t\t\t\t\tdoc.body.appendChild(iframe);\n\t\t\t\t\t\t\t} else {\n\t\t\t\t\t\t\t\tsetTimeout(addFrame, 5);\n\t\t\t\t\t\t\t}\n\t\t\t\t\t\t}\n\t\t\t\t\t\treturn !otherCMP;\n\t\t\t\t\t}\n\t\t\t\t\tfunction tcfAPIHandler() {\n\t\t\t\t\t\tvar args = arguments;\n\t\t\t\t\t\tvar gdprApplies;\n\t\t\t\t\t\tif (!args.length) {\n\t\t\t\t\t\t\t/**\n\t\t\t\t\t\t\t shortcut to get the queue when the full CMP\n\t\t\t\t\t\t\t* implementation loads; it can call tcfapiHandler()\n\t\t\t\t\t\t\t* with no arguments to get the queued arguments\n\t\t\t\t\t\t\t*/\n\t\t\t\t\t\t\treturn queue;\n\t\t\t\t\t\t} else if (args[0] === 'setGdprApplies') {\n\t\t\t\t\t\t\t/\n\t\t\t\t\t\t\t* shortcut to set gdprApplies if the publisher\n\t\t\t\t\t\t\t* knows that they apply GDPR rules to all\n\t\t\t\t\t\t\t* traffic (see the section on "What does the\n\t\t\t\t\t\t\t* gdprApplies value mean" for more\n\t\t\t\t\t\t\t*/\n\t\t\t\t\t\t\tif (args.length > 3 && parseInt(args[1], 10) === 2 && typeof args[3] === 'boolean') {\n\t\t\t\t\t\t\t\tgdprApplies = args[3];\n\t\t\t\t\t\t\t\tif (typeof args[2] === 'function') {\n\t\t\t\t\t\t\t\t\targs[2]('set', true);\n\t\t\t\t\t\t\t\t}\n\t\t\t\t\t\t\t}\n\t\t\t\t\t\t} else if (args[0] === 'ping') {\n\t\t\t\t\t\t\t/\n\t\t\t\t\t\t\t* Only supported method; give PingReturn\n\t\t\t\t\t\t\t* object as response\n\t\t\t\t\t\t\t*/\n\t\t\t\t\t\t\tvar retr = {\n\t\t\t\t\t\t\t\tgdprApplies: gdprApplies,\n\t\t\t\t\t\t\t\tcmpLoaded: false,\n\t\t\t\t\t\t\t\tcmpStatus: 'stubCMP',\n\t\t\t\t\t\t\t\tapiVersion: '2.0'\n\t\t\t\t\t\t\t};\n\t\t\t\t\t\t\tif (typeof args[2] === 'function') {\n\t\t\t\t\t\t\t\targs[2](retr, true);\n\t\t\t\t\t\t\t}\n\t\t\t\t\t\t} else {\n\t\t\t\t\t\t\t/\n\t\t\t\t\t\t\t* some other method, just queue 
it for the\n\t\t\t\t\t\t\t* full CMP implementation to deal with\n\t\t\t\t\t\t\t*/\n\t\t\t\t\t\t\tqueue.push(args);\n\t\t\t\t\t\t}\n\t\t\t\t\t}\n\t\t\t\t\tfunction postMessageEventHandler(event) {\n\t\t\t\t\t\tvar msgIsString = typeof event.data === 'string';\n\t\t\t\t\t\tvar json = {};\n\t\t\t\t\t\ttry {\n\t\t\t\t\t\t\t/\n\t\t\t\t\t\t\t* Try to parse the data from the event. This is important\n\t\t\t\t\t\t\t* to have in a try/catch because often messages may come\n\t\t\t\t\t\t\t* through that are not JSON\n\t\t\t\t\t\t\t*/\n\t\t\t\t\t\t\tif (msgIsString) {\n\t\t\t\t\t\t\t\tjson = JSON.parse(event.data);\n\t\t\t\t\t\t\t} else {\n\t\t\t\t\t\t\t\tjson = event.data;\n\t\t\t\t\t\t\t}\n\t\t\t\t\t\t} catch (ignore) {}\n\t\t\t\t\t\tvar payload = json.__tcfapiCall;\n\t\t\t\t\t\tif (payload) {\n\t\t\t\t\t\t\twindow.__tcfapi(\n\t\t\t\t\t\t\t\tpayload.command,\n\t\t\t\t\t\t\t\tpayload.version,\n\t\t\t\t\t\t\t\tfunction (retValue, success) {\n\t\t\t\t\t\t\t\t\tvar returnMsg = {\n\t\t\t\t\t\t\t\t\t\t__tcfapiReturn: {\n\t\t\t\t\t\t\t\t\t\t\treturnValue: retValue,\n\t\t\t\t\t\t\t\t\t\t\tsuccess: success,\n\t\t\t\t\t\t\t\t\t\t\tcallId: payload.callId,\n\t\t\t\t\t\t\t\t\t\t},\n\t\t\t\t\t\t\t\t\t};\n\t\t\t\t\t\t\t\t\tif (msgIsString) {\n\t\t\t\t\t\t\t\t\t\treturnMsg = JSON.stringify(returnMsg);\n\t\t\t\t\t\t\t\t\t}\n\t\t\t\t\t\t\t\t\tevent.source.postMessage(returnMsg, '');\n\t\t\t\t\t\t\t\t},\n\t\t\t\t\t\t\t\tpayload.parameter\n\t\t\t\t\t\t\t);\n\t\t\t\t\t\t}\n\t\t\t\t\t}\n\t\t\t\t\t/**\n\t\t\t\t\t Iterate up to the top window checking for an already-created\n\t\t\t\t\t* "__tcfapilLocator" frame on every level. 
If one exists already then we are\n\t\t\t\t\t* not the master CMP and will not queue commands.\n\t\t\t\t\t*/\n\t\t\t\t\twhile (win) {\n\t\t\t\t\t\ttry {\n\t\t\t\t\t\t\tif (win.frames[TCF_LOCATOR_NAME]) {\n\t\t\t\t\t\t\t\tcmpFrame = win;\n\t\t\t\t\t\t\t\tbreak;\n\t\t\t\t\t\t\t}\n\t\t\t\t\t\t} catch (ignore) {}\n\t\t\t\t\t\t// if we're at the top and no cmpFrame\n\t\t\t\t\t\tif (win === window.top) {\n\t\t\t\t\t\t\tbreak;\n\t\t\t\t\t\t}\n\t\t\t\t\t\t// Move up\n\t\t\t\t\t\twin = win.parent;\n\t\t\t\t\t}\n\t\t\t\t\tif (!cmpFrame) {\n\t\t\t\t\t\t// we have recur'd up the windows and have found no __tcfapiLocator frame\n\t\t\t\t\t\taddFrame();\n\t\t\t\t\t\twin.__tcfapi = tcfAPIHandler;\n\t\t\t\t\t\twin.addEventListener('message', postMessageEventHandler, false);\n\t\t\t\t\t}\n\t\t\t\t};\n\t\t\t\tmakeStub();\n\t\t\t}());\n\t\t\n\t\n\t\n\t\n\t\t<script async="async" type="text/javascript" crossorigin src='//consent.trustarc.com/notice?domain=forbes_iab2.com&c=teconsent&js=nj&noticeType=bb&gtm=1'>\n\t\n\t\n\t\n\n"

How should I handle this?

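It looks like you only need to send the notice_gdpr_prefs cookie. Its original value is 0,1,2::implied,eu;, but the endpoint returns data even when the value is empty. It seems to check only that the cookie is present: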

library(httr)
library(jsonlite)

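# Send the consent cookie with an empty value; only its presence seems to matter
result <- GET("https://www.forbes.com/forbesapi/person/billionaires/2020/position/true.json",
  add_headers(cookie = "notice_gdpr_prefs=")
)
# Parse the response body as JSON
billionaires <- fromJSON(content(result, "text", encoding = "UTF-8"))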

print(billionaires)
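
If the site changes its consent check, the request above could start returning the HTML page again and fail with the same lexical error. The following is a minimal sketch of a guard, not part of the original answer: `is_json_text` and `fetch_json` are hypothetical helper names, and the sketch assumes the consent page is never valid JSON. It uses `jsonlite::validate()` to check the body before parsing.

```r
library(httr)
library(jsonlite)

# Hypothetical helper: TRUE only if `txt` is syntactically valid JSON.
is_json_text <- function(txt) {
  isTRUE(jsonlite::validate(txt))
}

# Hypothetical helper: GET a URL with the consent cookie and stop with a
# readable message if the body is not JSON (for example, a consent page).
fetch_json <- function(url, cookie = "notice_gdpr_prefs=") {
  resp <- GET(url, add_headers(cookie = cookie))
  stop_for_status(resp)
  txt <- content(resp, "text", encoding = "UTF-8")
  if (!is_json_text(txt)) {
    stop("Expected JSON but got: ", substr(txt, 1, 60), call. = FALSE)
  }
  fromJSON(txt)
}
```

Calling `fetch_json()` with the URL from above should behave like the snippet in the answer, but it reports the first characters of the unexpected body instead of a parser error.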

kaggle link