Extract edges and communities from a list of nodes
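I have a dataset with more than 50k nodes, and I am trying to extract possible edges and communities from it. I tried graph tools such as gephi, cytoscape, socnet and nodexl to visualize and identify the edges and communities, but the node list is too large for those tools. So I am trying to write a script that determines the edges and communities. The other columns are the connection start datetime and end datetime, along with the GPS location.
Input:
Id,start time,end time,gps1,gps2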
0022d9064bc,1073260801,1073260803,819251,440006
00022d9064bc,1073260803,1073260810,819213,439954
00904b4557d3,1073260803,1073261920,817526,439458
00022de73863,1073260804,1073265410,817558,439525
00904b14b494,1073260804,1073262625,817558,439525
00904b14b494,1073260804,1073265163,817558,439525
00904b14b494,1073260804,1073263786,817558,439525
00022d1406df,1073260807,1073260809,820428,438735
00022d1406df,1073260807,1073260878,820428,438735
00022d623dfe,1073260810,1073276346,819251,440006
00022d7317d7,1073260810,1073276155,819251,440006
00022d9064bc,1073260810,1073272525,819251,440006
00022d9064bc,1073260810,1073260999,819251,440006
00022d9064bc,1073260810,1073260857,819251,440006
0030650c9eda,1073260811,1073260813,820356,439224
00022d0e0cec,1073260813,1073262843,820187,439271
00022d176cf3,1073260813,1073260962,817721,439564
000c30d8d2e8,1073260813,1073260902,817721,439564
00904b243bc4,1073260813,1073260962,817721,439564
00904b2fc34d,1073260813,1073260962,817721,439564
00904b52b839,1073260813,1073260962,817721,439564
00904b9a5a51,1073260813,1073260962,817721,439564
00904ba8b682,1073260813,1073260962,817721,439564
00022d3be9cd,1073260815,1073261114,819269,439403
00022d80381f,1073260815,1073261114,819269,439403
00022dc1b09c,1073260815,1073261114,819269,439403
00022d36a6df,1073260817,1073260836,820761,438607
00022d36a6df,1073260817,1073260845,820761,438607
003065d2d8b6,1073260817,1073267560,817735,439757
00904b0c7856,1073260817,1073265149,817735,439757
00022de73863,1073260825,1073260879,817558,439525
00904b14b494,1073260825,1073260879,817558,439525
00904b312d9e,1073260825,1073260879,817558,439525
00022d15b1c7,1073260826,1073260966,820353,439280
00022dcbe817,1073260826,1073260966,820353,439280
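I am trying to build an undirected weighted/unweighted graph.
Use Pandas to get the data into a list of pairwise nodes, where each row represents an edge according to your edge criteria. Then move it into a networkx object for graph analysis.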
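The conditions for two nodes to share an edge are:
- Same location. I assume this means the same gps1 and gps2.
- "Near same start and end time". This one is a bit ambiguous. For the purposes of this answer, I have reduced this criterion to "start time in the same 5-second interval". If you want to apply additional time conditions to the edges, it should not be too hard to extend the groupby approach I take here.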
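Everything that follows expects the data in a DataFrame named df with the columns ID, start, end, gps1 and gps2. A minimal sketch of that loading step, assuming the input is comma-separated with a single header row (the inline sample and the column renaming are my assumptions, not part of the question). Reading ID as a string is deliberate: the IDs have leading zeros, and a value made up only of digits and the letter "e" could otherwise be parsed as a number.

```python
import io

import pandas as pd

# A few rows from the question, inlined so the sketch runs without a file.
# In practice you would pass a file path to read_csv instead of `raw`.
raw = io.StringIO(
    "Id,start time,end time,gps1,gps2\n"
    "0022d9064bc,1073260801,1073260803,819251,440006\n"
    "00022d9064bc,1073260803,1073260810,819213,439954\n"
    "00904b4557d3,1073260803,1073261920,817526,439458\n"
)

df = pd.read_csv(
    raw,
    header=0,                                      # skip the original header row
    names=["ID", "start", "end", "gps1", "gps2"],  # names used in this answer
    dtype={"ID": str},                             # keep leading zeros intact
)
print(df)
```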
Since we will be manipulating the data based on timestamps, convert start and end to datetime dtype:
df.start = pd.to_datetime(df.start, unit="s")
df.end = pd.to_datetime(df.end, unit="s")
df.start.describe()
count 35
unique 11
top 2004-01-05 00:00:13
freq 8
first 2004-01-05 00:00:01
last 2004-01-05 00:00:26
Name: start, dtype: object
df.head()
ID start end gps1 gps2
0 0022d9064bc 2004-01-05 00:00:01 2004-01-05 00:00:03 819251 440006
1 00022d9064bc 2004-01-05 00:00:03 2004-01-05 00:00:10 819213 439954
2 00904b4557d3 2004-01-05 00:00:03 2004-01-05 00:18:40 817526 439458
3 00022de73863 2004-01-05 00:00:04 2004-01-05 01:16:50 817558 439525
4 00904b14b494 2004-01-05 00:00:04 2004-01-05 00:30:25 817558 439525
The sample observations occur within a few seconds of each other, so we set the grouping frequency to just a few seconds:
near = "5s"
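To make the 5-second grouping concrete, here is a small sketch of how pd.Grouper assigns start times to bins when closed="right" and label="right" are used, as in this answer's groupby. The three-row demo frame is made up for illustration.

```python
import pandas as pd

# Three made-up observations starting at 00:00:04, 00:00:05 and 00:00:06.
demo = pd.DataFrame({
    "ID": ["a", "b", "c"],
    "start": pd.to_datetime([1073260804, 1073260805, 1073260806], unit="s"),
})

# closed="right": each bin is (left, right], so 00:00:05 belongs to the
# bin ending at 00:00:05. label="right": the bin is named after its right edge.
counts = demo.groupby(
    pd.Grouper(key="start", freq="5s", closed="right", label="right")
).size()

labels = [ts.strftime("%H:%M:%S") for ts in counts.index]
print(dict(zip(labels, counts.tolist())))
```

Here 00:00:04 and 00:00:05 fall into the bin labelled 00:00:05, while 00:00:06 falls into the bin labelled 00:00:10. Two nodes whose start times straddle a bin boundary therefore do not share an edge, even if they are only a second apart.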
Now use groupby to find connected nodes by location and start time:
edges = (df.groupby(["gps1",
                     "gps2",
                     pd.Grouper(key="start",
                                freq=near,
                                closed="right",
                                label="right")],
                    as_index=False)
           .agg({"ID": ",".join,
                 "start": "min",
                 "end": "max"})
           .reset_index()
           .rename(columns={"index": "edge",
                            "start": "start_min",
                            "end": "end_max"})
        )
edges.ID = edges.ID.str.split(",")
edges.head()
Output:
edge gps1 gps2 ID \
0 0 817526 439458 [00904b4557d3]
1 1 817558 439525 [00022de73863, 00904b14b494, 00904b14b494, 009...
2 2 817558 439525 [00022de73863, 00904b14b494, 00904b312d9e]
3 3 817721 439564 [00022d176cf3, 000c30d8d2e8, 00904b243bc4, 009...
4 4 817735 439757 [003065d2d8b6, 00904b0c7856]
start_min end_max
0 2004-01-05 00:00:03 2004-01-05 00:18:40
1 2004-01-05 00:00:04 2004-01-05 01:16:50
2 2004-01-05 00:00:25 2004-01-05 00:01:19
3 2004-01-05 00:00:13 2004-01-05 00:02:42
4 2004-01-05 00:00:17 2004-01-05 01:52:40
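Each row now represents a unique edge category, and ID is the list of all nodes that share that edge. Getting this list into a new node-pair structure is a bit tricky; I resorted to some old-fashioned nested for loops. There is probably some Pandas-fu that would make this more efficient.
Note: for singleton nodes, I assign None as the second member of the pair. If you do not want to keep track of singletons, just leave out the if not len(combos): ... logic.
from itertools import combinations

pairs = []
for e in edges.edge.values:
    # nodes sharing this edge category, plus the attributes of that category
    nodes = edges.loc[edges.edge==e, "ID"].values[0]
    attrs = edges.loc[edges.edge==e, ["gps1","gps2","start_min","end_max"]]
    combos = list(combinations(nodes, 2))
    if not len(combos):
        # singleton: keep the node, paired with None
        pair = [e, nodes[0], None]
        pair.extend(attrs.values[0])
        pairs.append(pair)
    else:
        # one row per pair of nodes in this category
        for combo in combos:
            pair = [e, combo[0], combo[1]]
            pair.extend(attrs.values[0])
            pairs.append(pair)
cols = ["edge","nodeA","nodeB","gps1","gps2","start_min","end_max"]
pairs_df = pd.DataFrame(pairs, columns=cols)
pairs_df.head()
Output: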
edge nodeA nodeB gps1 gps2 start_min \
0 0 00904b4557d3 None 817526 439458 2004-01-05 00:00:03
1 1 00022de73863 00904b14b494 817558 439525 2004-01-05 00:00:04
2 1 00022de73863 00904b14b494 817558 439525 2004-01-05 00:00:04
3 1 00022de73863 00904b14b494 817558 439525 2004-01-05 00:00:04
4 1 00904b14b494 00904b14b494 817558 439525 2004-01-05 00:00:04
end_max
0 2004-01-05 00:18:40
1 2004-01-05 01:16:50
2 2004-01-05 01:16:50
3 2004-01-05 01:16:50
4 2004-01-05 01:16:50
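The nested loop above can also be written more compactly. This sketch builds the pairs with a single comprehension and attaches the edge attributes with a merge. Unlike the loop, it deduplicates the IDs within each edge category first, so repeated IDs no longer produce duplicate pairs or self-pairs such as row 4 above. That deduplication is my assumption about what is wanted, and the tiny edges frame is made up to keep the example self-contained.

```python
from itertools import combinations

import pandas as pd

# Made-up stand-in for the `edges` frame: one singleton category and one
# category in which the same ID appears twice.
edges = pd.DataFrame({
    "edge": [0, 1],
    "ID": [["x"], ["b", "c", "b"]],
    "gps1": [817526, 817558],
    "gps2": [439458, 439525],
})

# One (edge, nodeA, nodeB) row per unique pair; singletons get (node, None).
rows = [
    (edge, a, b)
    for edge, ids in zip(edges["edge"], edges["ID"])
    for a, b in (list(combinations(sorted(set(ids)), 2)) or [(ids[0], None)])
]

# Attach the per-category attributes by joining back on the edge number.
pairs_df = (
    pd.DataFrame(rows, columns=["edge", "nodeA", "nodeB"])
      .merge(edges.drop(columns="ID"), on="edge")
)
print(pairs_df)
```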
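Now the data can be put into a networkx object:
import networkx as nx
# networkx 2.x renamed from_pandas_dataframe to from_pandas_edgelist.
# Newer networkx versions may reject None as a node, in which case the
# singleton rows need to be dropped first.
g = nx.from_pandas_edgelist(pairs_df, "nodeA", "nodeB", edge_attr=True)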
# access edge attributes by node pairing:
test_A = "00022de73863"
test_B = "00904b14b494"
g[test_A][test_B]["start_min"]
# output:
Timestamp('2004-01-05 00:00:25')
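Since the question asks about a weighted graph, one possible weighting is the number of edge categories in which two nodes co-occur. The sketch below is only one interpretation, not something the question specifies. It uses a made-up pairs_df, treats (a, b) and (b, a) as the same undirected pair, drops self-pairs, and adds singleton nodes as isolated nodes rather than linking them to None.

```python
import networkx as nx
import pandas as pd

# Made-up pair list: a-b co-occur twice (once in each order), b-c once,
# and d is a singleton with no partner.
pairs_df = pd.DataFrame({
    "nodeA": ["a", "b", "b", "d"],
    "nodeB": ["b", "a", "c", None],
})

# Keep real pairs only and drop self-pairs.
valid = pairs_df.dropna(subset=["nodeA", "nodeB"])
valid = valid[valid["nodeA"] != valid["nodeB"]]

# Sort each pair so that (b, a) and (a, b) count as the same undirected edge.
ordered = pd.DataFrame(
    [sorted(p) for p in zip(valid["nodeA"], valid["nodeB"])],
    columns=["nodeA", "nodeB"],
)

# Weight = number of times the pair co-occurs.
weights = ordered.groupby(["nodeA", "nodeB"]).size().reset_index(name="weight")

g = nx.from_pandas_edgelist(weights, "nodeA", "nodeB", edge_attr="weight")

# Singletons become isolated nodes.
g.add_nodes_from(pairs_df.loc[pairs_df["nodeB"].isna(), "nodeA"])

print(g["a"]["b"]["weight"], sorted(g.nodes))
```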
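For community detection, there are several options. Consider the networkx community algorithms, as well as the community module, which builds on native networkx functionality.
I read your question as mainly concerned with getting the data into a format suitable for network analysis. Since this answer is already long enough, I will leave it to you to pursue a community detection strategy; several methods are available in the modules I mention here.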
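As a starting point only, here is a sketch of one of the networkx community algorithms, greedy_modularity_communities, applied to a small made-up graph of two triangles joined by a single edge. The choice of algorithm is an assumption on my part; other methods may suit your data better.

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Made-up graph: two tightly connected triangles linked by one bridge (c-d).
g = nx.Graph()
g.add_edges_from([
    ("a", "b"), ("b", "c"), ("a", "c"),   # first triangle
    ("d", "e"), ("e", "f"), ("d", "f"),   # second triangle
    ("c", "d"),                           # bridge between them
])

# Each community is returned as a frozenset of node names.
communities = greedy_modularity_communities(g)
for i, members in enumerate(communities):
    print(i, sorted(members))
```

On this toy graph the two triangles come out as separate communities. With the graph built from your own pairs, the same call applies unchanged.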