Calculate rowSums in Chapel for a matrix
Continuing my Chapel adventures...
I have a matrix A.
var idx = {1..n};
var adom = {idx, idx};
var A: [adom] int;
//populate A;
var rowsums: [idx] int;
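What is the most efficient way to populate rowsums?
The "most efficient" solution is hard to define. However, here is a way to compute rowsums
that is both parallel and elegant: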
config const n = 8; // n must be declared; the question's bare n fails to compile with: tio.chpl:1: error: 'n' undeclared (first use this function)
const indices = 1..n;
const adom = {indices, indices};
var A: [adom] int;
// Populate A
[(i,j) in adom] A[i, j] = i*j;
var rowsums: [indices] int;
forall i in indices {
rowsums[i] = + reduce(A[i, ..]);
}
writeln(rowsums);
This takes advantage of + reductions over array slices of A.
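For the `A[i, j] = i*j` fill used in this example, each row sum has a closed form, which gives an easy way to check the printed output:

```latex
\mathrm{rowsums}[i] \;=\; \sum_{j=1}^{n} A[i,j] \;=\; \sum_{j=1}^{n} i\,j \;=\; i\,\frac{n(n+1)}{2}
```

With the default `n = 8` this is `36*i`, so the program should print `36 72 108 144 180 216 252 288`.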
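Note that both the forall
and the + reduce
introduce parallelism into the program above. If indices
is small enough, it may be more efficient to use just a for
loop, avoiding the cost of spawning tasks.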
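A minimal sketch of that task-free variant, assuming the same fill of `A` as above. Both loops are plain `for` loops, so neither `forall` nor `+ reduce` runs and no extra tasks are created; whether this actually beats the parallel version at a given size is something to measure rather than assume.

```chapel
config const n = 8;
const indices = 1..n;
const adom = {indices, indices};
var A: [adom] int;

// Populate A serially too, so this program never leaves the main task
for (i, j) in adom do
  A[i, j] = i*j;

var rowsums: [indices] int;

// Serial row sums: nested for loops instead of forall + reduce
for i in indices {
  var sum = 0;
  for j in indices do
    sum += A[i, j];
  rowsums[i] = sum;
}

writeln(rowsums);   // with n = 8: 36 72 108 144 180 216 252 288
```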
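Some hints to make the code actually run, live, in both SEQ
and PAR
modes:
Implementation details aside, the assumption stated above by @bencray, that the presumed overhead costs of a PAR
setup could favour purely serial processing in a SEQ
setup, was not confirmed experimentally. It should also be noted that, for obvious reasons, distributed mode was not tested on the live <TiO>-IDE;
a distributed implementation at such a tiny scale would be more of a contradiction in terms than a scientifically meaningful experiment to run.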
Facts matter:
rowsums[] processing of A
is slower in SEQ
mode even at the smallest tested scale, 2x2
(8973 [us]), than in PAR
mode at 256x256
(8692 [us]).
Well done, Chapel team: the alignment that gets the most out of compact silicon resources is really cool!
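For the record, the exact run-time performance figures are in the self-documenting tables below; or don't hesitate to visit the live IDE run (referenced above) and experiment yourself.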
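Readers may also notice external noise in the small-scale experiments: O/S- and hosted-IDE-related processes compete for resources and affect the run-time performance of the <SECTION-UNDER-TEST>
through adverse CPU / Lx-cache / memIO / process / et al. conflicts, which in effect rules out using these measurements for any generalised interpretation.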
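One common way to damp this kind of noise is to repeat the timed section several times and report the minimum (the run least disturbed by other processes) next to the mean. A minimal sketch, assuming a Chapel release that provides `Time.stopwatch` (newer releases replaced the `Timer` class used in the benchmark code with it); `n` and `trials` are illustrative config constants, not values from the original runs:

```chapel
use Time;

config const n = 256;      // A is n x n (illustrative size)
config const trials = 5;   // number of repeated measurements (illustrative)

const indices = 1..n;
const adom = {indices, indices};
var A: [adom] int;
[(i, j) in adom] A[i, j] = i*j;
var rowsums: [indices] int;

var times: [1..trials] real;   // elapsed seconds of each trial
var sw: stopwatch;

for t in 1..trials {
  sw.clear();                  // reset accumulated time before each trial
  sw.start();
  forall i in indices do
    rowsums[i] = + reduce(A[i, ..]);
  sw.stop();
  times[t] = sw.elapsed();     // stopwatch reports seconds as a real
}

const best = min reduce times;
const mean = (+ reduce times) / trials;
writeln("min  ", best * 1e6, " [us] over ", trials, " trials");
writeln("mean ", mean * 1e6, " [us]");
```

The minimum is usually the more stable figure on a shared host, while a large gap between minimum and mean is itself a sign of interference.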
Hope everyone enjoys Chapel's lovely [TIME]
results,
demonstrated here on ever-growing [EXPSPACE]-scale
computing landscapes:
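/* ---------------------------------------SETUP-SECTION-UNDER-TEST--*/ use Time;
// Timer and Time.TimeUnits are the older Time API used for these TiO runs; newer Chapel releases provide stopwatch instead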
/* ---------------------------------------SETUP-SECTION-UNDER-TEST--*/ var aStopWATCH_SEQ: Timer;
/* ---------------------------------------SETUP-SECTION-UNDER-TEST--*/ var aStopWATCH_PAR: Timer;
//nst max_idx = 123456; // seems to be too fat for <TiO>-IDE to allocate <TiO>-- /wrappers/chapel: line 6: 24467 Killed
const max_idx = 4096;
//nst max_idx = 8192; // seems to be too long for <TiO>-IDE to let it run [SEQ] part <TiO>-- The request exceeded the 60 second time limit and was terminated
//nst max_idx = 16384; // seems to be too long for <TiO>-IDE to let it run [PAR] part too <TiO>-- /wrappers/chapel: line 6: 12043 Killed
const indices = 1..max_idx;
const adom = {indices, indices};
var A: [adom] int;
[(i,j) in adom] A[i, j] = i*j; // Populate A[,]
var rowsums: [indices] int;
/* ---------------------------------------------SECTION-UNDER-TEST--*/ aStopWATCH_SEQ.start();
for i in indices { // SECTION-UNDER-TEST--
rowsums[i] = + reduce(A[i, ..]); // SECTION-UNDER-TEST-- NB: + reduce is itself parallel, so this loop is not fully serial
} // SECTION-UNDER-TEST--
/* ---------------------------------------------SECTION-UNDER-TEST--*/ aStopWATCH_SEQ.stop();
/* Recorded [SEQ] timings; "N elements" means max_idx = N, i.e. an N x N matrix A:
<SECTION-UNDER-TEST> took 8973 [us] to run in [SEQ] mode for 2 elements on <TiO>-IDE
<SECTION-UNDER-TEST> took 28611 [us] to run in [SEQ] mode for 4 elements on <TiO>-IDE
<SECTION-UNDER-TEST> took 58824 [us] to run in [SEQ] mode for 8 elements on <TiO>-IDE
<SECTION-UNDER-TEST> took 486786 [us] to run in [SEQ] mode for 64 elements on <TiO>-IDE
<SECTION-UNDER-TEST> took 1019990 [us] to run in [SEQ] mode for 128 elements on <TiO>-IDE
<SECTION-UNDER-TEST> took 2010680 [us] to run in [SEQ] mode for 256 elements on <TiO>-IDE
<SECTION-UNDER-TEST> took 4154970 [us] to run in [SEQ] mode for 512 elements on <TiO>-IDE
<SECTION-UNDER-TEST> took 8260960 [us] to run in [SEQ] mode for 1024 elements on <TiO>-IDE
<SECTION-UNDER-TEST> took 15853000 [us] to run in [SEQ] mode for 2048 elements on <TiO>-IDE
<SECTION-UNDER-TEST> took 33126800 [us] to run in [SEQ] mode for 4096 elements on <TiO>-IDE
<SECTION-UNDER-TEST> took n/a [us] to run in [SEQ] mode for 8192 elements on <TiO>-IDE
============================================ */
/* ---------------------------------------------SECTION-UNDER-TEST--*/ aStopWATCH_PAR.start();
forall i in indices { // SECTION-UNDER-TEST--
rowsums[i] = + reduce(A[i, ..]); // SECTION-UNDER-TEST--
} // SECTION-UNDER-TEST--
/* ---------------------------------------------SECTION-UNDER-TEST--*/ aStopWATCH_PAR.stop();
/* Recorded [PAR] timings; "N elements" means max_idx = N, i.e. an N x N matrix A:
<SECTION-UNDER-TEST> took 12131 [us] to run in [PAR] mode for 2 elements on <TiO>-IDE
<SECTION-UNDER-TEST> took 8095 [us] to run in [PAR] mode for 4 elements on <TiO>-IDE
<SECTION-UNDER-TEST> took 8023 [us] to run in [PAR] mode for 8 elements on <TiO>-IDE
<SECTION-UNDER-TEST> took 8156 [us] to run in [PAR] mode for 64 elements on <TiO>-IDE
<SECTION-UNDER-TEST> took 7990 [us] to run in [PAR] mode for 128 elements on <TiO>-IDE
<SECTION-UNDER-TEST> took 8692 [us] to run in [PAR] mode for 256 elements on <TiO>-IDE
<SECTION-UNDER-TEST> took 15134 [us] to run in [PAR] mode for 512 elements on <TiO>-IDE
<SECTION-UNDER-TEST> took 16926 [us] to run in [PAR] mode for 1024 elements on <TiO>-IDE
<SECTION-UNDER-TEST> took 30671 [us] to run in [PAR] mode for 2048 elements on <TiO>-IDE
<SECTION-UNDER-TEST> took 105323 [us] to run in [PAR] mode for 4096 elements on <TiO>-IDE
<SECTION-UNDER-TEST> took 292232 [us] to run in [PAR] mode for 8192 elements on <TiO>-IDE
============================================ */
writeln( rowsums,
"\n <SECTION-UNDER-TEST> took ", aStopWATCH_SEQ.elapsed( Time.TimeUnits.microseconds ), " [us] to run in [SEQ] mode for ", max_idx, " elements on <TiO>-IDE",
"\n <SECTION-UNDER-TEST> took ", aStopWATCH_PAR.elapsed( Time.TimeUnits.microseconds ), " [us] to run in [PAR] mode for ", max_idx, " elements on <TiO>-IDE"
);
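A caveat that may be worth testing: the [SEQ] loop above still calls `+ reduce`, which, as noted earlier in the thread, introduces parallelism of its own, so that loop may be paying for task creation once per row rather than running serially. A minimal sketch that times a truly serial nested loop alongside the two variants above, again assuming `Time.stopwatch` is available; `n` is an illustrative size and no results are claimed here:

```chapel
use Time;

config const n = 1024;   // A is n x n (illustrative size)

const indices = 1..n;
const adom = {indices, indices};
var A: [adom] int;
[(i, j) in adom] A[i, j] = i*j;

var rsPar, rsSeqReduce, rsSerial: [indices] int;
var sw: stopwatch;

// 1) forall over rows, + reduce within each row (parallel at both levels)
sw.start();
forall i in indices do rsPar[i] = + reduce(A[i, ..]);
sw.stop();
const tPar = sw.elapsed();

// 2) for over rows, + reduce within each row (the [SEQ] loop above)
sw.clear();
sw.start();
for i in indices do rsSeqReduce[i] = + reduce(A[i, ..]);
sw.stop();
const tSeqReduce = sw.elapsed();

// 3) for over rows, for within each row (no tasks created)
sw.clear();
sw.start();
for i in indices {
  var sum = 0;
  for j in indices do sum += A[i, j];
  rsSerial[i] = sum;
}
sw.stop();
const tSerial = sw.elapsed();

writeln("forall + reduce : ", tPar * 1e6, " [us]");
writeln("for    + reduce : ", tSeqReduce * 1e6, " [us]");
writeln("for    + for    : ", tSerial * 1e6, " [us]");
```

Run order can matter too: variants timed later find A already warm in cache, so swapping the order (or repeating each variant, as in the min-of-trials approach) helps separate the loop structure from cache effects.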
This is why Chapel is so great.
Thanks for developing and improving such a great computing tool for HPC.