Modeling a dictionary as a queryable data object in Python

Question

I have a simple book catalog dictionary, structured as follows:

{ 
  'key':
  {
  'title': str,
  'authors': [ {
                 'firstname': str,
                 'lastname': str 
               }
             ],
  'tags': [ str ],
  'blob': str
  }
}

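Each book is a string key in the dictionary. A book contains a single title and possibly several authors (usually just one). An author consists of two strings, firstname and lastname. A book can also carry many tags, such as novel, literature, art, 1900s, and so on. Each book also has a blob field containing additional data (often the book itself). I want to be able to search for a given entry (or a set of entries) based on its data, for example by author or by tag.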

My main workflow would be:

Given a query, return all blob fields associated with each entry.

My question is how to model this, and which libraries or formats to use, given the following constraints:

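Minimize the number of data objects (preferably a single data object, to simplify queries).
Keep the number of columns small (creating a new column for each possible tag is probably insane and would result in a very sparse dataset).
Do not duplicate the blob field (since it might be large).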

My first idea was to create multiple rows, one per author. For example:

{ '123': { 'title': 'A sample book',
           'authors': [ {'firstname': 'John', 'lastname': 'Smith'},
                        {'firstname': 'Foos', 'lastname': 'M. Bar'} ],
           'tags': [ 'tag1', 'tag2', 'tag3' ],
           'blob': '.....'
         }
}

would initially become two entries:

idx	key	title	authors_firstname	authors_lastname	tags	blob
0	123	A sample book	John	Smith	['tag1', 'tag2', 'tag3']	...
1	123	A sample book	Foos	M. Bar	['tag1', 'tag2', 'tag3']	...

But this still duplicates the blob, and I still need to figure out how to handle an unknown number of tags (as the database grows).

Answer 1

You can use TinyDB to accomplish what you want.

First, convert your dictionary to a database:

from tinydb import TinyDB, Query
from tinydb.table import Document

data = [{'123': {'title': 'A sample book',
                 'authors': [{'firstname': 'John', 'lastname': 'Smith'},
                             {'firstname': 'Foos', 'lastname': 'M. Bar'}],
                 'tags': ['tag1', 'tag2', 'tag3'],
                 'blob': 'blob1'}},
        {'456': {'title': 'Another book',
                 'authors': [{'firstname': 'Paul', 'lastname': 'Roben'}],
                 'tags': ['tag1', 'tag3', 'tag4'],
                 'blob': 'blob2'}}]

db = TinyDB('catalog.json')
for record in data:
    # Each record maps a single key to its book; reuse that key as the document ID.
    for key, book in record.items():
        db.insert(Document(book, doc_id=int(key)))

You can now run queries:

Book = Query()
Author = Query()

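# Books with at least one author whose last name is 'Smith'
rows = db.search(Book.authors.any(Author.lastname == 'Smith'))
# Books tagged with both 'tag1' and 'tag4'
rows = db.search(Book.tags.all(['tag1', 'tag4']))
# Every book in the catalog
rows = db.all()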
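
To make the meaning of any and all in these queries concrete, here is a rough plain-Python sketch of the same two lookups over the original dictionary. The names catalog, by_author_lastname and by_all_tags are made up for illustration and are not part of TinyDB.

```python
# Sample catalog in the question's format (keys are strings).
catalog = {
    '123': {'title': 'A sample book',
            'authors': [{'firstname': 'John', 'lastname': 'Smith'},
                        {'firstname': 'Foos', 'lastname': 'M. Bar'}],
            'tags': ['tag1', 'tag2', 'tag3'],
            'blob': 'blob1'},
    '456': {'title': 'Another book',
            'authors': [{'firstname': 'Paul', 'lastname': 'Roben'}],
            'tags': ['tag1', 'tag3', 'tag4'],
            'blob': 'blob2'},
}


def by_author_lastname(catalog, lastname):
    """Keys of books where at least one author has the given last name."""
    return [key for key, book in catalog.items()
            if any(author['lastname'] == lastname for author in book['authors'])]


def by_all_tags(catalog, tags):
    """Keys of books that carry every one of the given tags."""
    return [key for key, book in catalog.items()
            if all(tag in book['tags'] for tag in tags)]


print(by_author_lastname(catalog, 'Smith'))     # ['123']
print(by_all_tags(catalog, ['tag1', 'tag4']))   # ['456']
```

This scans every book on each lookup, which is fine for a small catalog. It is only meant to illustrate the matching rules, not to replace the database.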

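For your main workflow (given a query, return all blob fields associated with each entry), collect the blob of every returned row. Here it is done for all entries: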

blobs = {row.doc_id: row['blob'] for row in db.all()}

>>> blobs
{123: 'blob1', 456: 'blob2'}

Tags: python, database-design, dictionary