Unexpected keyword argument 'type' for bigquery

Question

I am trying to follow this example: http://ajkannan.github.io/gcloud-python/latest/bigquery-usage.html

But when I try to create the table:

import os
import subprocess
import sys
from gcloud.bigquery import SchemaField
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = "toto.json"
os.environ['GCLOUD_PROJECT'] = 'titi'
from gcloud import pubsub

client = pubsub.Client('titi')

# Imports the Google Cloud client library
from google.cloud import bigquery

# Instantiates a client
bigquery_client = bigquery.Client()

# The name for the new dataset
dataset_name = 'tata'

dataset = bigquery_client.dataset(dataset_name)
table = dataset.table(name='aspire_page')

table.schema = [
     SchemaField(name= 'id', type= 'int', mode= 'nullable'),
     SchemaField(name= 'zip', type= 'string', mode= 'nullable'),
     SchemaField(name= 'html', type= 'string', mode= 'nullable'),
      SchemaField(name= 'url', type= 'string', mode= 'nullable'),
      SchemaField(name= 'categorie', type= 'string', mode= 'nullable'),
     SchemaField(name= 'date', type= 'string', mode= 'nullable'),
     SchemaField(name='name', type= 'string', mode= 'nullable'),

]


table.create()

I get:

TypeError                                 Traceback (most recent call last)
<ipython-input-10-30edba459053> in <module>()
     23 
     24 table.schema = [
---> 25      SchemaField(name= 'id', type= 'int', mode= 'nullable'),
     26      SchemaField(name= 'zip', type= 'string', mode= 'nullable'),
     27      SchemaField(name= 'html', type= 'string', mode= 'nullable'),

TypeError: __init__() got an unexpected keyword argument 'type'

I also don't understand why SchemaField needs a type in order to be initialized...

If anyone has an idea

Thanks and regards

Edit:

Even @andre622's suggestion does not work:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-1-f177aa490fbb> in <module>()
     29   SchemaField('categorie', 'STRING', mode= 'nullable'),
     30  SchemaField('date', 'STRING', mode= 'nullable'),
---> 31  SchemaField('name', 'STRING', mode= 'nullable'),
     32 ]
     33 

/usr/local/lib/python3.5/dist-packages/google/cloud/bigquery/table.py in schema(self, value)
    113         """
    114         if not all(isinstance(field, SchemaField) for field in value):
--> 115             raise ValueError('Schema items must be fields')
    116         self._schema = tuple(value)
    117 

ValueError: Schema items must be fields

Even with Nick's suggestion:

import os
import subprocess
import sys
from gcloud.bigquery import SchemaField
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = "toto.json"
os.environ['GCLOUD_PROJECT'] = 'titi'
from gcloud import pubsub

client = pubsub.Client('titi')

# Imports the Google Cloud client library
from google.cloud import bigquery

# Instantiates a client
bigquery_client = bigquery.Client()

# The name for the new dataset
dataset_name = 'choual'

# Prepares the new dataset
dataset = bigquery_client.dataset(dataset_name)
table = dataset.table(name='aspire_page')

table.schema = [
     SchemaField('id','INTEGER'),
     SchemaField('zip', 'STRING'),
     SchemaField('html', 'STRING'),
     SchemaField('url', 'STRING'),
     SchemaField('categorie', 'STRING'),
     SchemaField('date', 'STRING'),
     SchemaField('name', 'STRING')
]


table.create()

I get this error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-3-191573ca7711> in <module>()
     29      SchemaField('categorie', 'STRING'),
     30      SchemaField('date', 'STRING'),
---> 31      SchemaField('name', 'STRING')
     32 ]
     33 

/usr/local/lib/python3.5/dist-packages/google/cloud/bigquery/table.py in schema(self, value)
    113         """
    114         if not all(isinstance(field, SchemaField) for field in value):
--> 115             raise ValueError('Schema items must be fields')
    116         self._schema = tuple(value)
    117 

ValueError: Schema items must be fields

Answer 1

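You don't need keyword names for the first two arguments passed to SchemaField; pass the field name and the type positionally. Also, the data types should be written the way BigQuery expects them (for example 'INTEGER' and 'STRING' rather than 'int' and 'string'). Your schema should be defined as: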

table.schema = [
    SchemaField('id', 'INTEGER', mode='NULLABLE'),
    SchemaField('zip', 'STRING', mode='NULLABLE'),
    SchemaField('html', 'STRING', mode='NULLABLE'),
    SchemaField('url', 'STRING', mode='NULLABLE'),
    SchemaField('categorie', 'STRING', mode='NULLABLE'),
    SchemaField('date', 'STRING', mode='NULLABLE'),
    SchemaField('name', 'STRING', mode='NULLABLE'),
]

Answer 2

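Judging from the GitHub source, SchemaField does not take type, it takes field_type. That is what caused your error before @andre622's suggestion:

(Please note that the following code is not mine. All of it belongs to Google Inc. under the Apache 2 license.)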

"""Describe a single field within a table schema.
:type name: str
:param name: the name of the field.
:type field_type: str
:param field_type: the type of the field (one of 'STRING', 'INTEGER',
                       'FLOAT', 'BOOLEAN', 'TIMESTAMP' or 'RECORD').
:type mode: str
:param mode: the type of the field (one of 'NULLABLE', 'REQUIRED',
                 or 'REPEATED').
:type description: str
:param description: optional description for the field.
:type fields: list of :class:`SchemaField`, or None
:param fields: subfields (requires ``field_type`` of 'RECORD').
"""
def __init__(self, name, field_type, mode='NULLABLE', description=None,
             fields=None):
    self.name = name
    self.field_type = field_type
    self.mode = mode
    self.description = description
    self.fields = fields

Since you are using the default mode, you should be able to use:

table.schema = [
     SchemaField('id','INTEGER'),
     SchemaField('zip', 'STRING'),
     SchemaField('html', 'STRING'),
     SchemaField('url', 'STRING'),
     SchemaField('categorie', 'STRING'),
     SchemaField('date', 'STRING'),
     SchemaField('name', 'STRING')
]
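
To see why the keyword name matters, here is a minimal sketch. The class below is a hypothetical stand-in that copies only the constructor signature quoted above; it is not the real library class, and the variable names are made up for illustration.

```python
# Hypothetical stand-in: mirrors only the __init__ signature quoted above.
class SchemaField:
    def __init__(self, name, field_type, mode='NULLABLE', description=None,
                 fields=None):
        self.name = name
        self.field_type = field_type
        self.mode = mode
        self.description = description
        self.fields = fields


# Positional arguments and the correct keyword name are both accepted.
ok_positional = SchemaField('id', 'INTEGER')
ok_keyword = SchemaField(name='id', field_type='INTEGER')

# There is no parameter called 'type', so Python rejects the call.
try:
    SchemaField(name='id', type='int', mode='nullable')
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)
```

The exact wording of the message varies slightly between Python versions, but it names 'type' as the unexpected keyword argument, as in the traceback in the question.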

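As for why it needs a type: how else would it know what kind of data you want to store in that field? In a DBMS, the type allows space to be allocated correctly for each field, because a row will then need at most a specific number of bytes. That makes random access possible: knowing where the first row starts and how large each row is, you can compute where any row lives.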
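
The fixed-size-row idea can be sketched with Python's struct module. This only illustrates the general argument made above; it is not a description of how BigQuery stores data, and the row layout (a 64-bit integer followed by a 16-byte name) is an assumption chosen for the example.

```python
import struct

# Hypothetical fixed-width row: little-endian 64-bit id, then a 16-byte name.
ROW = struct.Struct('<q16s')

rows = [(1, b'alpha'), (2, b'beta'), (3, b'gamma')]

# Pack all rows back to back; shorter names are padded with null bytes.
buffer = b''.join(ROW.pack(row_id, name) for row_id, name in rows)


def read_row(data, index, start=0):
    """Jump straight to row `index` using only the start offset and row size."""
    offset = start + index * ROW.size
    row_id, raw_name = ROW.unpack_from(data, offset)
    return row_id, raw_name.rstrip(b'\x00')


third = read_row(buffer, 2)
print(ROW.size, third)
```

Because every field has a declared type and therefore a known width, the offset of row n is simply start + n * row size, with no need to scan the preceding rows.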

Edit:

Can you try:

table = dataset.table('aspire_page', [
         SchemaField('id','INTEGER'),
         SchemaField('zip', 'STRING'),
         SchemaField('html', 'STRING'),
         SchemaField('url', 'STRING'),
         SchemaField('categorie', 'STRING'),
         SchemaField('date', 'STRING'),
         SchemaField('name', 'STRING')
    ])

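Also try using bigquery.SchemaField instead of SchemaField. Your script imports SchemaField from gcloud.bigquery but creates the client and table from google.cloud.bigquery, so you may be running into a name clash: the table checks for its own package's SchemaField, not the one you imported.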
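
A minimal sketch of how such a clash could produce "Schema items must be fields". The two classes and the validate_schema helper are hypothetical stand-ins; whether the two installed packages really define distinct classes is an assumption, although it is consistent with the isinstance check shown in the traceback.

```python
def make_schema_field_class():
    """Build a fresh class named SchemaField; each call returns a distinct class."""
    class SchemaField:
        def __init__(self, name, field_type, mode='NULLABLE'):
            self.name = name
            self.field_type = field_type
            self.mode = mode
    return SchemaField


# Stand-ins for gcloud.bigquery.SchemaField and google.cloud.bigquery.SchemaField.
OldSchemaField = make_schema_field_class()
NewSchemaField = make_schema_field_class()


def validate_schema(value, expected_class):
    # Same kind of check as the schema setter shown in the traceback.
    if not all(isinstance(field, expected_class) for field in value):
        raise ValueError('Schema items must be fields')
    return tuple(value)


# Fields built from the "old" class fail the check made against the "new" class.
try:
    validate_schema([OldSchemaField('id', 'INTEGER')], NewSchemaField)
    clash_error = None
except ValueError as exc:
    clash_error = str(exc)

# Fields built from the matching class pass.
schema = validate_schema([NewSchemaField('id', 'INTEGER')], NewSchemaField)
print(clash_error, len(schema))
```

Both classes have the same name and the same constructor, yet isinstance treats them as unrelated types. Importing SchemaField from the same package as the client (for example via bigquery.SchemaField) avoids the mismatch.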

python

google-bigquery

gcloud