6502 的快速签名 16 位除以 7

Question

我正在为 6502 cpu 开发一个汇编语言程序，我发现我需要一个尽可能快的除以七例程，特别是一个可能需要 16-位分红.

找到的例程我很熟悉here，但是把那里找到的除七例程概括起来比较复杂，粗略地考察一下一般算法（使用整数除法）

x/7 ~= (x + x/8 + x/64 ... )/8

表示要处理 16 位范围，可能需要超过 100 个周期才能完成，因为 6502 的单个累加器寄存器和 6502 上各个内存位移相对较慢。

我认为查找 table 可能会有帮助，但在 6502 上，我当然只能查找 256 字节或更少的 table。为此，我们可以假设存在两个 256 字节的查找 tables，xdiv7 和 xmod7，当使用一个无符号的单字节值作为 table 的索引时，它们可以快速获得一个字节分别除以 7 或模 7 的结果。但是，我不确定如何利用这些来查找完整 16 位范围的值。

与此同时，我还需要一个模 7 算法，尽管理想情况下，可以用除法计算出的任何解决方案也会产生 mod7 结果。如果需要额外的预计算 tables，我可以添加这些，只要所有 tables 的总内存需求不超过大约 3k。

虽然我最终需要一个带符号的除法算法，但一个无符号的除法算法就足够了，因为我可以根据需要将其推广到带符号的例程。

如有任何帮助，我们将不胜感激。

Answer 1

注意：正如 @Damien_The_Unbeliever 在评论中指出的那样，upperHigh 和 lowerLow table 是相同的。所以他们可以合并成一个table。但是，这种优化会使代码更难阅读，解释也更难编写，因此合并 table 留作 reader.

的练习。

下面的代码显示了如何在将 16 位无符号值除以 7 时生成商和余数。解释代码的最简单方法（IMO）是一个例子，所以让我们考虑除以 0xa732 by 7. 预期结果为：

quotient = 0x17e2
remainder = 4

我们首先将输入视为两个 8 位值，upper 字节和 lower 字节。 upper字节是0xa7，lower字节是0x32。

我们从 upper 字节计算商和余数：

0xa700 / 7 = 0x17db
0xa700 % 7 = 3

所以我们需要三个 tables:

upperHigh存放商的高字节：upperHigh[0xa7] = 0x17
upperLow存放商的低字节：upperLow[0xa7] = 0xdb
upperRem 存储余数：upperRem[0xa7] = 3

然后我们从 lower 字节计算商和余数：

0x32 / 7 = 0x07
0x32 % 7 = 1

所以我们需要两个 tables:

lowerLow存放商的低字节：lowerLow[0x32] = 0x07
lowerRem 存储余数：lowerRem[0x32] = 1

现在我们需要assemble最终答案。余数是两个余数之和。由于每个余数都在 [0,6] 范围内，因此总和在 [0,12] 范围内。所以我们可以使用两个13字节的查找将和转换为最后的余数和进位。

商的低字节是进位与 lowerLow 和 upperLow table 中的值之和。请注意，总和可能会产生进位到高字节。

商的高字节是进位与upperHightable.

的值之和

因此，要完成示例：

remainder = 1 + 3 = 4              // simple add (no carry in)
lowResult = 0x07 + 0xdb = 0xe2     // add with carry from remainder
highResult = 0x17                  // add with carry from lowResult

实现它的汇编代码包括 7 个 table 查找、一个无进位加法指令和两个有进位加法指令。

#include <stdio.h>
#include <stdint.h>

uint8_t upperHigh[256];  // index:(upper 8 bits of the number)  value:(high 8 bits of the quotient)
uint8_t upperLow[256];   // index:(upper 8 bits of the number)  value:(low  8 bits of the quotient)
uint8_t upperRem[256];   // index:(upper 8 bits of the number)  value:(remainder when dividing the upper bits by 7)
uint8_t lowerLow[256];   // index:(lower 8 bits of the number)  value:(low  8 bits of the quotient)
uint8_t lowerRem[256];   // index:(lower 8 bits of the number)  value:(remainder when dividing the lower bits by 7)
uint8_t carryRem[13]    = { 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1 };
uint8_t combinedRem[13] = { 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5 };

void populateLookupTables(void)
{
    for (uint16_t i = 0; i < 256; i++)
    {
        uint16_t upper = i << 8;
        upperHigh[i] = (upper / 7) >> 8;
        upperLow[i] = (upper / 7) & 0xff;
        upperRem[i] = upper % 7;

        uint16_t lower = i;
        lowerLow[i] = lower / 7;
        lowerRem[i] = lower % 7;
    }
}

void divideBy7(uint8_t upperValue, uint8_t lowerValue, uint8_t *highResult, uint8_t *lowResult, uint8_t *remainder)
{
    uint8_t temp = upperRem[upperValue] + lowerRem[lowerValue];
    *remainder = combinedRem[temp];
    *lowResult = upperLow[upperValue] + lowerLow[lowerValue] + carryRem[temp];
    uint8_t carry = (upperLow[upperValue] + lowerLow[lowerValue] + carryRem[temp]) >> 8;  // Note this is just the carry flag from the 'lowResult' calcaluation
    *highResult = upperHigh[upperValue] + carry;
}

int main(void)
{
    populateLookupTables();

    uint16_t n = 0;
    while (1)
    {
        uint8_t upper = n >> 8;
        uint8_t lower = n & 0xff;

        uint16_t quotient1  = n / 7;
        uint16_t remainder1 = n % 7;

        uint8_t high, low, rem;
        divideBy7(upper, lower, &high, &low, &rem);
        uint16_t quotient2 = (high << 8) | low;
        uint16_t remainder2 = rem;

        printf("n=%u q1=%u r1=%u q2=%u r2=%u", n, quotient1, remainder1, quotient2, remainder2);
        if (quotient1 != quotient2 || remainder1 != remainder2)
            printf(" **** failed ****");
        printf("\n");

        n++;
        if (n == 0)
            break;
    }
}

Answer 2

在Unsigned Integer Division Routines 8 位除以 7:

;Divide by 7 (From December '84 Apple Assembly Line)
;15 bytes, 27 cycles
  sta  temp
  lsr
  lsr
  lsr
  adc  temp
  ror
  lsr
  lsr
  adc  temp
  ror
  lsr
  lsr

大约 100 个周期的估计非常准确：104 个周期到最后一个 ror，总共 106 个周期不包括 rts，整个函数 112 个周期。
注意：在为 C64 组装并使用 VICE 仿真器为 C64 后，我发现算法失败，例如 65535 给出 9343，正确答案是 9362。

   ; for 16 bit division  by 7
   ; input:
  ;   register A is low byte
  ;   register X is high byte
  ; output 
  ;   register A is low byte
  ;   register X is high byte
  ;
  ; memory on page zero
  ; temp     is on page zero, 2 bytes
  ; aHigh    is on page zero, 1 byte
  --
  sta temp
  stx temp+1
  stx aHigh
  --
  lsr aHigh
  ror a
  lsr aHigh
  ror a
  lsr aHigh
  ror a
  ---
  adc temp
  tax
  lda aHigh
  adc temp+1
  sta aHigh
  txa
  --
  ror aHigh
  ror a
  lsr aHigh
  ror a
  lsr aHigh
  ror a
  --
  adc temp
  tax
  lda aHigh
  adc temp+1
  sta aHigh
  txa
  --
  ror aHigh
  ror a
  lsr aHigh
  ror a
  lsr aHigh
  ror a     -- 104 cycles
  ;-------
  ldx aHigh  ; -- 106
  rts        ; -- 112 cycles

Answer 3

另一种方法是将除法转换为乘法。

为了计算乘数，我们有兴趣取倒数。我们基本上是这样做的：

d = n*(1/7)

为了使事情更准确，我们乘以 2 的方便次方。2^16 效果很好：

d = floor(n*floor(65536/7)/65536)

乘数为：floor(65536/7) 即9362。结果的大小将是：

ceiling(log2(65535*9362)) = 30 bits (4 bytes rounded up)

然后我们可以丢弃低两个字节除以 65536，或者只使用高 2 个字节作为最终结果。

为了计算出我们需要的实际旋转和加法，我们检查因子 9362 的二进制表示：

10010010010010

注意位模式的重复。因此，一个有效的方案是计算：

((n*9*256/4 + n*9)*8 + n)*2 = 9362*n

计算n*9只需要ceiling(log2(65535*9)) = 20位（3字节）。

在伪汇编中是：

LDA number          ; lo byte
STA multiply_nine
LDA number+1        ; high byte
STA multiply_nine+1
LDA #0              
STA multiply_nine+2 ; 3 byte result

ASL multiply_nine   ; multiply by 2
ROL multiply_nine+1
ROL mulitply_nine+2
ASL multiply_nine   ; multiply by 2 (4)
ROL multiply_nine+1
ROL mulitply_nine+2
ASL multiply_nine   ; multiply by 2 (8)
ROL multiply_nine+1
ROL mulitply_nine+2

CLC                 ; not really needed as result is only 20 bits, carry always zero
LDA multiply_nine
ADC number
STA multiply_nine
LDA multiply_nine+1
ADC number+1
STA multiply_nine+1
LDA multiply_nine+2
ADC #0
STA multiply_nine+2 ; n*9

我把剩下的练习留给 OP。请注意，不必乘以 256，因为这只是一个完整的字节向上移动。

6502 的快速签名 16 位除以 7

Fast signed 16-bit divide by 7 for 6502

assembly

6502

division

integer-division

micro-optimization