Create Network from dictionary of Text and Numerical data - to train GNN

Question

I have been using the FUNSD dataset to predict sequence labeling in unstructured documents, following the paper "LayoutLM: Pre-training of Text and Layout for Document Image Understanding". After cleaning the data and moving it from a dict to a dataframe, the dataset is laid out as follows:

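The column id is the unique identifier of each word group inside a document; the word group itself is shown in the column text (like nodes).
The column label identifies whether the word group is classified as a 'question' or an 'answer'.
The column linking denotes which word groups are 'linked' (like edges), connecting corresponding 'questions' to 'answers'.
The column 'box' holds the location coordinates of the word group (x,y top left; x,y bottom right), relative to the top-left corner (0, 0).
The column 'words' holds each individual word inside the word group, together with its location (box).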

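My goal is to train a classifier that uses a Graph Neural Network to identify which words inside the column 'words' are linked together. The first step is to transform my current dataset into a network. My questions are as follows: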

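Is there a way to split each row of the column 'words' into two columns [box_word, text_word], with one row per word, while replicating the other columns, which stay the same: [id, label, text, box], so that the final dataframe has these columns: [box, text, label, box_word, text_word]?
I can tokenize the columns 'text' and text_word, one-hot encode the column label, and split the multi-number columns box and box_word into individual columns, but how do I split up or rearrange the column 'linking' to define the edges of my network graph?
Am I taking the right route in using the dataframe to generate a network and then using it to train a GNN?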

Any help or tips are appreciated.

Answer 1

Edit: handle multiple entries in the column words.

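Your questions 1 and 2 are answered in the code. They are actually quite simple (assuming the data format is correctly represented by what is shown in the screenshot). Digest:

Q1: apply the split function on the column and unpack the result with .tolist() so that separate columns can be created.

Q2: Use a list comprehension to unpack the extra list layer and keep only the non-empty edges.

Q3: Yes and no. Yes, because pandas is good at organizing data with heterogeneous types. For example, lists, dicts, ints and floats can sit in different columns. Several I/O functions, such as pd.read_csv() or pd.read_json(), are also very handy.

However, data access carries overhead, and that overhead is especially high when iterating over rows (records). Therefore, the transformed data that feeds directly into your model is usually converted into a numpy.array or a more efficient format. Such a format conversion task is the sole responsibility of the data scientist.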
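As a rough illustration of that last conversion step, the sketch below stacks a list-valued box column into a dense NumPy array that a model could consume. The toy dataframe and the name box_features are assumptions made for this example, not part of the original dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one row per word group, 'box' holds [x0, y0, x1, y1].
df = pd.DataFrame({
    "id": [0, 4, 6],
    "box": [[1, 2, 3, 4], [7, 7, 7, 7], [4, 4, 4, 4]],
})

# Stack the per-row lists into a dense (n_rows, 4) float array.
# This only works when every box has the same length.
box_features = np.array(df["box"].tolist(), dtype=np.float32)

print(box_features.shape)  # (3, 4)
```

Doing the conversion once, up front, avoids paying the row-iteration overhead of the dataframe inside a training loop.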

Code and output

I made up a sample dataset myself. Irrelevant columns are ignored (as I am not obliged to reproduce them, and should not).

import networkx as nx
import pandas as pd

# data
df = pd.DataFrame(
    data={
        "words": [
            [{"box": [1, 2, 3, 4], "text": "TO:"}, {"box": [7, 7, 7, 7], "text": "777"}],
            [{"box": [1, 2, 3, 4], "text": "TO:"}],
            [{"text": "TO:", "box": [1, 2, 3, 4]}, {"box": [4, 4, 4, 4], "text": "444"}],
            [{"text": "TO:", "box": [1, 2, 3, 4]}],
        ],
        "linking": [
            [[0, 4]],
            [],
            [[4, 6]],
            [[6, 0]],
        ]
    }
)


# Q1. split
def split(el):
    ls_box = []
    ls_text = []
    for dic in el:
        ls_box.append(dic["box"])
        ls_text.append(dic["text"])
    return ls_box, ls_text

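# Straightforward, but assigning ragged nested lists this way triggered a
# deprecation warning and may fail outright on newer NumPy versions,
# so it is left commented out:
# df[["box_word", "text_word"]] = df["words"].apply(split).tolist()
# To avoid that, transpose the list of tuples and assign each column separately.
ls_tup = df["words"].apply(split).tolist()  # len: 4x2
ls_tup_tr = list(map(list, zip(*ls_tup)))  # len: 2x4
df["box_word"] = ls_tup_tr[0]
df["text_word"] = ls_tup_tr[1]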

# Q2. construct graph
ls_edges = [item[0] for item in df["linking"].values if len(item) > 0]
print(ls_edges)  # [[0, 4], [4, 6], [6, 0]]

g = nx.Graph()
g.add_edges_from(ls_edges)
list(g.nodes)  # [0, 4, 6]
list(g.edges)  # [(0, 4), (0, 6), (4, 6)]
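The comprehension above keeps only the first pair of each linking entry, which is enough for the sample data. If a word group can carry several links, a variant along the following lines might be closer to what is needed. It is only a sketch on a hypothetical toy frame: the id and label columns are taken from the question, and storing label as a node attribute is an assumption about how the graph will be used later.

```python
import networkx as nx
import pandas as pd

# Hypothetical toy data: 'linking' may hold zero, one, or several [src, dst] pairs.
df = pd.DataFrame({
    "id": [0, 4, 6],
    "label": ["question", "answer", "answer"],
    "linking": [[[0, 4], [0, 6]], [[0, 4]], [[0, 6]]],
})

# Flatten every pair from every row, not only the first pair of each row.
edges = [tuple(pair) for links in df["linking"] for pair in links]

g = nx.Graph()
# Register every word group as a node first, so unlinked groups are kept too.
for row in df.itertuples(index=False):
    g.add_node(int(row.id), label=row.label)
# An undirected Graph stores a repeated edge only once.
g.add_edges_from(edges)

print(sorted(g.nodes))  # [0, 4, 6]
print(sorted(g.edges))  # [(0, 4), (0, 6)]
```

Because each link appears in both endpoints' linking lists in this toy data, the undirected graph takes care of the duplicates without extra code.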

Q1 output

# trim the first column for printing
df_show = df.copy()
df_show["words"] = df_show["words"].apply(lambda s: str(s)[:10])
df_show

Out[51]: 
        words   linking                      box_word   text_word
0  [{'box': [  [[0, 4]]  [[1, 2, 3, 4], [7, 7, 7, 7]]  [TO:, 777]
1  [{'box': [        []                [[1, 2, 3, 4]]       [TO:]
2  [{'text':   [[4, 6]]  [[1, 2, 3, 4], [4, 4, 4, 4]]  [TO:, 444]
3  [{'text':   [[6, 0]]                [[1, 2, 3, 4]]       [TO:]
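The output above still has one row per word group, with lists in box_word and text_word. If one row per word is wanted, as the question describes, DataFrame.explode could be used instead. The following is a minimal sketch on a hypothetical toy frame, assuming every word group contains at least one word; the name long_df is made up for the example.

```python
import pandas as pd

# Hypothetical toy data with the same 'words' layout as the sample above.
df = pd.DataFrame({
    "id": [0, 4],
    "label": ["question", "answer"],
    "words": [
        [{"box": [1, 2, 3, 4], "text": "TO:"}, {"box": [7, 7, 7, 7], "text": "777"}],
        [{"box": [4, 4, 4, 4], "text": "444"}],
    ],
})

# One row per word; the other columns are repeated for each word.
long_df = df.explode("words", ignore_index=True)

# Pull the two keys of each word dict into their own columns.
long_df["box_word"] = long_df["words"].map(lambda w: w["box"])
long_df["text_word"] = long_df["words"].map(lambda w: w["text"])
long_df = long_df.drop(columns="words")

print(long_df["text_word"].tolist())  # ['TO:', '777', '444']
print(long_df["id"].tolist())         # [0, 0, 4]
```

An empty words list would explode to a missing value and break the two map calls, so such rows would need to be filtered out or handled first.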
